The evolution of enterprise software engineering has been marked by a series of "less" shifts — from client-server to web and mobile ("client-less"), data center to cloud ("data-center-less"), and app server to serverless. These transitions have simplified aspects of software engineering, including deployment and operation, allowing users to focus less on the underlying systems and more on the application itself. This trend of radical simplification now leads us to the next significant shift in enterprise software engineering: moving from platforms to a "platformless" approach.

The Challenges of Platform-Based Approaches

In recent years, the rise of enterprise software delivery platforms, often built on Kubernetes and other cluster management systems, has transformed the way organizations deploy and manage applications. They enable rapid, scalable application deployment and the ability to incrementally roll out and roll back updates. This agility in improving application function and performance is vital for business success. However, these platforms have introduced new complexity and challenges, including the need for large, highly skilled platform engineering teams and intricate links between various systems like DevOps pipelines, deployment management, monitoring systems, SecOps, and site reliability engineering (SRE). Additionally, platform engineering places a predominant emphasis on the delivery of software rather than the entire software engineering lifecycle.

The Need for a New Paradigm: Platformless

To overcome these challenges, there's a clear need for a paradigm shift. We need to move the focus away from building, maintaining, and managing platforms to a more straightforward "platformless" approach. This does not imply the elimination of platforms, but rather the creation of a boundary that makes the platform invisible to the user. In this new paradigm, the focus shifts from managing platforms to developing, building, and deploying applications with seamless integration and monitoring — but without the intricacies of platform management.

Defining Platformless

Platformless is an innovative concept combining four technology domains: API-First, Cloud-Native Middleware, Platform Engineering, and Developer Experience (DX). This blend allows for a holistic approach to enterprise software engineering, covering the entire lifecycle and delivering a truly platformless experience.

API-First: This approach emphasizes making all functionalities available as APIs, events, and data products, ensuring easy discovery and consumption. The focus here is on designing, governing, and managing APIs to ensure consistency and ease of use across the enterprise. In a platformless environment, this API-First approach is further enhanced as all network-exposed capabilities become APIs by default, streamlining governance and management, and shifting the enterprise's focus to leveraging these APIs as a comprehensive software development kit (SDK) for the business.

Cloud-Native Middleware: This component involves building and operating systems in a scalable, secure, resilient multi-cloud environment. It encompasses domain-driven design, cell-based architecture, service meshes, integrated authentication and authorization, and zero-trust architecture. Platformless architecture integrates all these components, simplifying the challenges of building and managing cloud-native infrastructure and allowing enterprises to focus more on delivering value.
Platform Engineering: This involves creating toolchains and processes for easy, self-service software building, delivery, and operation. Internal Developer Platforms (IDPs) born from this discipline support various roles in software delivery, including developers, testers, and operations teams. In a platformless context, these platforms become central to facilitating the software engineering process, allowing each party to concentrate solely on their areas of responsibility and expertise.

Developer Experience (DX): As the heart of platformless, DX focuses on making the development environment seamless and intuitive. It includes integrated development environments, command-line interfaces, well-designed web experiences, and comprehensive documentation. DX directly impacts the productivity and creativity of developers, driving better software quality, quicker market delivery, and, overall, a happier and more innovative development team.

Streamlining Enterprise Software Development and Delivery With a Platformless Approach

In enterprise software engineering, the shift to platformless significantly simplifies the development and management of large enterprise application systems that deliver digital experiences. As businesses evolve, they require an ecosystem of interconnected software products, ranging from user-facing apps to autonomous network programs. Platformless facilitates this by enabling the seamless integration of diverse digital assets across various business domains. It streamlines the creation of modular, secure, and reusable architectures, while also enhancing delivery through rapid deployment, continuous integration, and efficient management. This approach allows enterprises to focus on innovation and value delivery, free from the complexities of traditional platform-based systems.

For example, in a platformless environment, a developer can integrate a company's systems of record with multiple web, mobile, and IoT applications; discover APIs; use languages or tools of their choice; and deploy application components such as APIs, integrations, and microservices in a zero-trust environment — all without managing the underlying platform. Ultimately, this leads to improved efficiency and a greater focus on problem-solving for better business results.

The journey from software delivery platforms to a platformless approach represents a major leap in the evolution of enterprise application development and delivery. While retaining the benefits of scalability and rapid deployment, platformless simplifies and enhances the development experience, focusing on the applications rather than the platform. This shift not only streamlines the development process but also promises to deliver superior applications to customers — ultimately driving business innovation and growth.
In the first two parts of our series "Demystifying Event Storming," we embarked on a journey through the world of Event Storming, an innovative approach to understanding complex business domains and software systems. We started by exploring the fundamentals of Event Storming, understanding its collaborative nature and how it differs from traditional approaches. In Part 2, we delved deeper into process modeling, looking at how Event Storming helps in mapping out complex business processes and interactions.

Now, in Part 3, we will focus on the design-level aspect of Event Storming. This stage is crucial for delving into the technical aspects of system architecture and design. Here, we'll explore how to identify aggregates – a key component in domain-driven design – and how they contribute to creating robust and scalable systems. This part aims to provide practical insights into refining system design and ensuring that it aligns seamlessly with business needs and objectives. Stay tuned as we continue to unravel the layers of Event Storming, providing you with the tools and knowledge to effectively apply this technique in your projects.

Understanding the Visual Model

The Event Storming visual model, as discussed in the previous articles, depicts a dynamic and interactive model for system design, highlighting the flow from real-world actions to system outputs and policies. Commands are depicted as decisive actions that trigger system operations, while events are the outcomes or results of those actions within the system. Policies serve as the guidelines or business rules that dictate how events are handled, ensuring system behavior aligns with business objectives. The read model represents the information structure affected by events, influencing future system interactions. Sketches and user inputs provide context and detail, enhancing the understanding of the system's workings. Lastly, hotspots are identified as critical areas needing scrutiny or improvement, often sparking in-depth discussions and problem-solving during an Event Storming session. This comprehensive model underpins Event Storming's utility as a collaborative tool, enabling stakeholders to collectively navigate and design complex software architectures.

Abstraction Levels

Event Storming is used to design a set of software artifacts that enforce domain logic and business consistency. - Alberto Brandolini

Event Storming is a powerful technique used to map the intricacies of a system at varying levels of abstraction. This collaborative method enables teams to visualize and understand the flow of events, actions, and policies within a domain.

Big Picture Level

At the Big Picture Level of Event Storming, the primary goal is to establish an overarching view of the system. This stage serves as the foundation for the entire process. Participants collaborate to identify major domains or subdomains within the system, often referred to as "big picture" contexts. These contexts represent high-level functional areas or components that play essential roles in the system's operation. The purpose of this level is to provide stakeholders with a holistic understanding of the system's structure and architecture.
Sticky notes and a large canvas are used to visually represent these contexts, with each context being named and briefly described. This visualization offers clarity on the overall system landscape and helps align stakeholders on the core domains and areas of focus.

During this stage, participants also focus on identifying and documenting potential conflicts within the system. Conflicts may arise due to overlapping responsibilities, resource allocation, or conflicting objectives among different domains. Recognizing these conflicts early allows teams to address them proactively, minimizing challenges during the later stages of design and development.

In addition to conflicts, participants at the Big Picture Level work to define the system's goals. These goals serve as the guiding principles that drive the system's design and functionality. Clear and well-defined goals help ensure that the subsequent design decisions align with the system's intended purpose and objectives. Blockers, which are obstacles or constraints that can impede the system's progress, are another key consideration at this level. Identifying blockers early in the process enables teams to devise strategies to overcome them effectively, ensuring smoother system implementation. Conceptual boundaries define the scope and context of each domain or subdomain. Understanding these boundaries is essential for ensuring that the system operates seamlessly within its defined constraints.

The Big Picture serves as a starting point for addressing these elements, allowing stakeholders to gain insights into the broader challenges and opportunities within the system. This comprehensive view not only aids in understanding the system's structure but also lays the groundwork for addressing these elements in subsequent levels of abstraction during Event Storming.

Process Level

The Process Level of Event Storming delves deeper into the specific business processes or workflows within each identified context or domain. Participants collaborate to define the sequence of events and actions that occur during these processes. The primary goal is to visualize and understand the flow of actions and events that drive the system's behavior. This level helps uncover dependencies, triggers, and outcomes within processes, providing a comprehensive view of how the system operates in response to various inputs and events.

Sticky notes are extensively used to represent events and commands within processes, and the flow is mapped on the canvas. This visual representation clearly shows how events and actions connect to achieve specific objectives, offering insights into process workflows.

At the Process Level, it's essential to identify the value proposition, which outlines the core benefits that the system or process delivers to its users or stakeholders. Understanding the value proposition helps participants align their efforts with the overall objectives and ensures that the designed processes contribute to delivering value. Policies represent the rules, guidelines, and business logic that govern how events are handled within the system. They define the behavior and decision-making criteria for various scenarios. Recognizing policies during Event Storming ensures that participants consider the regulatory and compliance aspects that impact the processes. Personas are fictional characters or user profiles that represent different types of system users.
These personas help in empathizing with the end users and understanding their needs, goals, and interactions with the system. Incorporating personas into the Process Level enables participants to design processes that cater to specific user requirements. Individual goals refer to the objectives and intentions of various actors or participants within the system. Identifying individual goals helps in mapping out the motivations and expected outcomes of different stakeholders. It ensures that the processes align with the diverse goals of the involved parties.

Design Level

At the Design Level of Event Storming, the focus shifts to the internal behavior of individual components or aggregates within the system. Participants work together to model the commands, events, and policies that govern the behavior of these components. This level allows for a more granular exploration of system behavior, enabling participants to define the contracts and interactions between different parts of the system. Sticky notes continue to be utilized to represent commands, events, and policies at this level. These notes provide a detailed view of the internal workings of components, illustrating how they respond to commands, emit events, and enforce policies. The Design Level is crucial for defining the behavior and logic within each component, ensuring that the system functions as intended and aligns with business objectives.

Identifying Aggregates

Event Storming intricately intertwines with the principles and vocabulary of Domain-Driven Design (DDD) to model and elucidate technical concepts. Here we reach a crucial and often challenging aspect of DDD – understanding and identifying aggregates. Aggregates, despite being a fundamental part of DDD, are commonly one of the least understood concepts among engineers. This lack of clarity can lead to significant pitfalls in both system design and implementation.

Aggregates are more than just collections of objects; they represent carefully crafted boundaries around a set of entities and value objects. These boundaries are crucial for maintaining data integrity and encapsulating business rules. However, engineers often struggle with understanding the optimal size and scope of an aggregate, leading either to overly large aggregates that become bottlenecks or to too many small aggregates that make the system unnecessarily complex. I recommend reading my separate article dedicated to understanding aggregates in DDD, which lays the foundation for the concepts we'll explore here.

In the intricate process of Design Level Event Storming, especially when identifying and defining aggregates for a complex system like a campervan rental service, the foremost step is to ensure the involvement of the right mix of people, and to adjust that mix as the session evolves. This team should ideally be a blend of domain experts, who bring in-depth knowledge of the campervan rental business, and technical specialists, such as software developers and architects. Their combined insights are crucial in ensuring that the identified aggregates align with both business realities and technical feasibility. Additionally, including individuals with a focus on user experience is invaluable, particularly for aspects of the system that directly interact with customers. Once this diverse and knowledgeable team is assembled, a pivotal initial step is to revisit and reflect upon the insights gained from the Process Level.
This stage is crucial as it provides a rich tapestry of information about the business workflows, key events, commands, and the intricate policies that were identified and explored previously. It's at this juncture that a deep understanding of how the business operates comes to the forefront, offering a nuanced perspective that is essential for the next phase of aggregate identification and design.

In Event Storming, the flow often goes from a command (an action initiated) to a domain event (a significant change or result in the system). However, there's usually an underlying business rule that dictates how and why this transition from command to event happens. This is where blank yellow sticky notes come in. The blank yellow sticky note serves as a placeholder for the business rule that connects the command to the domain event. It represents the decision-making logic or criteria that must be satisfied for the event to occur as a result of the command. When a command and its corresponding domain event are identified, a blank yellow sticky note is placed between them. This signifies that there is a business rule at play, influencing the transition from the command to the event. The blank state of the sticky note invites team members, especially domain experts, to discuss and identify the specific rule or logic. This is a collaborative process where different perspectives help in accurately defining the rule. Through discussion, the team arrives at a clear understanding of the business rules.

Participants are asked to fill in these business rules on the yellow sticky notes with comprehensive details about their execution. This involves several key aspects:

- Preconditions: What must be true before the rule is executed? For instance, before the Rent Campervan command can succeed, a precondition might be that the selected campervan must be available for the chosen dates.
- Postconditions: What becomes true after the rule is executed? Following the campervan rental, a postcondition would be that the campervan's status changes to "rented" for the specified period.
- Invariants: What remains true throughout the execution of the rule? An invariant could be that a customer's account must be in good standing throughout the rental process.
- Additional information: Any other clarifications or details that help in understanding what the business rule does.

Some business rules might be straightforward, but others could lead to extensive discussions. This interaction is a crucial part of the knowledge-sharing process. It allows domain experts to clarify complex business logic and developers to understand how these rules translate into system functionality. These discussions are invaluable for ensuring that everyone has a clear and shared understanding of how the business operates and how the system should support these operations.

This process goes hand in hand with an in-depth analysis of the various events and commands that emerged in earlier stages. We noticed a distinct pattern: a cluster of activities and decisions consistently revolved around the management of campervans. The technique involves physically moving these similar business rules on top of one another on the board where the Event Storming session is visualized. This action is more than just an organizational step; it's a method to highlight and analyze the interrelations and potential redundancies among the rules. This consolidation helps in clearly seeing how different rules interact with the same set of data or conditions.
It can reveal dependencies or conflicts between rules that might not have been evident when they were considered in isolation. By grouping similar rules, you simplify the overall complexity of the system. It becomes easier to understand and manage the business logic when it's seen through grouped, related rules rather than as a multitude of individual ones. This process can also uncover opportunities to refine or merge rules, leading to more streamlined and efficient business processes.

Moreover, a closer look at the operational challenges and data cohesion associated with campervans solidified our thinking. We realized that managing the various aspects related to campervans under a unified system would streamline operations, reducing complexity and enhancing service efficiency. The disparate pieces of information - maintenance schedules, booking calendars, location tracking - all pointed towards the need for an integrated approach.

The decision to establish an aggregate was a culmination of these observations and discussions. It was a decision driven not just by operational logic but by the natural convergence of business activities related to campervans. By forming this aggregate, we envisioned a system where all aspects of a campervan's lifecycle were managed cohesively, ensuring seamless operations and an enhanced customer experience. This approach also brought into focus the need for enforcing consistency across the campervan fleet. By designing an aggregate to encapsulate all aspects related to each vehicle, we ensured that any changes - be it a rental status update or a maintenance check - were consistently reflected across the entire system. This consistency is crucial for maintaining the integrity and reliability of our service. A campervan, for instance, should not be available for booking if it's scheduled for maintenance. Similarly, the location information of each campervan needs to be accurate and up to date to ensure efficient fleet management.

Imagine a scenario where a customer books a campervan for a journey from Munich to Paris. Within the aggregate, several pieces of information about each campervan are tracked and managed, including its current location, availability status, and maintenance schedule. When the customer selects a specific campervan for their dates, the aggregate immediately updates the vehicle's status to "rented" for that period. This update is critical to ensure that the same campervan isn't available to other customers for the same dates, preventing double bookings. Simultaneously, let's say this chosen campervan is due for maintenance soon after the customer's proposed return date. The system, adhering to the rules within the aggregate, flags this campervan for maintenance, ensuring that it does not get rented again before the maintenance is completed.

Invariants, or rules that must always hold true, became a cornerstone in the design of the aggregate. These invariants enforce critical business rules and ensure the validity of the system at all times. For example, an invariant in our system ensures that a campervan cannot be simultaneously booked and marked as under maintenance. Such invariants are essential for maintaining data integrity and providing a consistent, reliable service to our customers.

Let's consider a real-life scenario to illustrate this: A family eagerly plans a summer road trip from Munich to the picturesque landscapes of Paris. They find the perfect campervan on our website and proceed to rent it for their adventure.
Unseen to them, but crucial for their journey, is the role of our invariant at play. As soon as they select their dates and campervan, the system springs into action. It checks the aggregate, specifically probing for two critical conditions: is the chosen campervan already rented for these dates, and is it due for maintenance? This is where the invariant exerts its influence. It steadfastly ensures that this campervan is neither engaged in another journey nor scheduled for a maintenance check during the requested time. This rule is inflexible, a cornerstone of our commitment to reliability.

These invariants, embedded within our aggregate, are more than just lines of code or business policies. They are a promise – a promise of adventure without the unexpected, of journeys that create memories, not mishaps. By ensuring that each campervan is adequately prepped and available for every booking, these rules not only keep our operations smooth but also cement our reputation as a reliable and customer-centric business.

In our exploration of the campervan rental business through Event Storming, we've identified a multitude of individual events, commands, and policies. However, these elements gain true significance only when they are clustered together as a cohesive unit. This clustering is what forms the essence of the aggregate. It's a conceptual realization that the isolated pieces - rentals, maintenance schedules, customer interactions - are interdependent and collectively form the core of our service. Without this unification, each element would merely exist in a vacuum, lacking context and impact.

The heart of this aggregate, its root, is the campervan itself. The campervan is not just a physical entity but a nexus of various business processes and customer experiences. We selected the campervan as the aggregate root because it is the central element around which all other activities revolve. Whether it's booking a campervan, preparing it for a customer, or scheduling its maintenance, every action directly relates to the individual campervan. This choice reflects our understanding that the campervan is the linchpin of our business model, directly influencing customer satisfaction and operational efficiency.

Alberto Brandolini emphasized the "blank naming" strategy: in the initial phase, rather than rushing to assign specific names or predefined functions to aggregates, the team is encouraged to recognize them in a more open-ended manner.

The Naming of Aggregates

This step, strategically placed at the end of the session, is more than just a labeling exercise; it is the final act of distilling and synthesizing the insights gained throughout the event. Early in the session, it might be tempting to assign names to these aggregates. However, this urge is resisted, as premature naming can lead to misconceptions. Names given too early might not fully capture the essence of what an aggregate represents, as they are based on an initial, incomplete understanding of the system. Therefore, the practice of waiting until the end to name these aggregates is not just a procedural step; it is a deliberate choice to ensure accuracy and relevance in naming.

Thus, the Campervan Aggregate, with the campervan as its root, becomes a powerful tool in our system architecture, encapsulating the complexity of our operations into a manageable and coherent structure.
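To make this more concrete, here is a minimal Kotlin sketch of how a campervan aggregate root along these lines might enforce the invariants described above (no overlapping rentals, and no rentals that overlap scheduled maintenance). The types and method names (Campervan, Period, rent, scheduleMaintenance) are illustrative assumptions for this article, not code from the actual rental system.

```kotlin
import java.time.LocalDate

// Value object describing a date range; illustrative only.
data class Period(val from: LocalDate, val to: LocalDate) {
    fun overlaps(other: Period): Boolean =
        !from.isAfter(other.to) && !other.from.isAfter(to)
}

// Domain event emitted when a rental is accepted.
data class CampervanRented(val campervanId: String, val period: Period)

// Aggregate root: all changes to bookings and maintenance go through it,
// so the invariants below can be checked in one place.
class Campervan(val id: String) {
    private val bookings = mutableListOf<Period>()
    private val maintenanceWindows = mutableListOf<Period>()

    // Business rule behind the "Rent Campervan" command.
    // Precondition: the campervan is free and not due for maintenance in that period.
    // Postcondition: the period is recorded as booked (status effectively "rented").
    fun rent(period: Period): CampervanRented {
        check(bookings.none { it.overlaps(period) }) {
            "Campervan $id is already rented during $period"
        }
        check(maintenanceWindows.none { it.overlaps(period) }) {
            "Campervan $id is scheduled for maintenance during $period"
        }
        bookings += period
        return CampervanRented(id, period)
    }

    // Invariant: a campervan can never be booked and under maintenance at the same time.
    fun scheduleMaintenance(period: Period) {
        check(bookings.none { it.overlaps(period) }) {
            "Campervan $id is rented during $period; maintenance must be rescheduled"
        }
        maintenanceWindows += period
    }
}
```

A booking for the Munich-to-Paris trip would then go through rent(...), and any violation of the invariants surfaces as a rejected command rather than as inconsistent fleet data.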
Conclusion

As we conclude Part 3 of "Demystifying Event Storming: Design Level, Identifying Aggregates," we have navigated the intricate process of identifying aggregates, a pivotal aspect of Domain-Driven Design. This journey through the Design Level has illuminated the profound utility of Event Storming in mapping complex systems, highlighting the importance of a collaborative approach and the strategic use of the blank naming strategy. The emergence of the Campervan Aggregate in our rental business model is a testament to the effectiveness of this methodology. It underscores how well-defined aggregates can streamline system design, ensuring alignment with business objectives. The decision to name these aggregates at the end of the session, based on deep insights and understanding, has been crucial in accurately reflecting their roles within the system.

Looking ahead, our series will continue to explore the depths of Event Storming. In the next installment, we will delve into identifying bounded contexts, a key concept in Domain-Driven Design that further refines our understanding of complex systems. This next phase will focus on how bounded contexts help in delineating clear boundaries within the system, facilitating better organization and more efficient communication across different parts of the system.
In the dynamic world of online services, the concept of site reliability engineering (SRE) has risen as a pivotal discipline, ensuring that large-scale systems maintain their performance and reliability. Bridging the gap between development and operations, SRE is a set of principles and practices that aims to create scalable and highly reliable software systems.

Site Reliability Engineering in Today's World

Site reliability engineering is an engineering discipline devoted to maintaining and improving the reliability, durability, and performance of large-scale web services. Originating from the complex operational challenges faced by large internet companies, SRE incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goal is to create automated solutions for operational aspects such as on-call monitoring, performance tuning, incident response, and capacity planning.

Further Reading: Top Open Source Projects for SREs.

What Does a Site Reliability Engineer Do?

A site reliability engineer operates at the intersection of software engineering and systems engineering. Once cloud modernization began, it became a natural evolutionary role for many database administrators with deeper system administration skills. The role of the SRE encompasses:

- Developing software and writing code for service scalability and reliability
- Ensuring uptime, maintaining services, and minimizing downtime
- Incident management, including handling system outages and conducting post-mortems
- Optimizing on-call duties, balancing responsibilities with proactive engineering
- Capacity planning, which includes predicting future needs and scaling resources accordingly

Site Reliability Engineering Principles

The core principles of Site Reliability Engineering (SRE) form the foundation upon which its practices and culture are built. One of the key tenets is automation. SRE prioritizes automating repetitive and manual tasks, which not only minimizes the risk of human error but also liberates engineers to focus on more strategic, high-value work. Automation in SRE extends beyond simple task execution; it encompasses the creation of self-healing systems that automatically recover from failures, predictive analytics for capacity planning, and dynamic provisioning of resources. This principle seeks to create a system where operational work is managed efficiently, leaving SRE professionals to concentrate on enhancements and innovations that drive the business forward.

Measurement is another cornerstone of SRE. In the spirit of the adage, "You can't improve what you can't measure," SRE implements rigorous quantification of reliability and performance. This includes defining clear service level objectives (SLOs) and service level indicators (SLIs) that provide a detailed view of a system's health and user experience. By consistently measuring these metrics, SREs make data-driven decisions that align technical performance with business goals.

Shared ownership is integral to SRE as well. It dissolves the traditional barriers between development and operations, encouraging both teams to take collective responsibility for the software they build and maintain. This collaboration ensures a more holistic approach to problem-solving, with developers gaining more insight into operational issues and operations teams getting involved earlier in the development process.

Lastly, a blameless culture is crucial to the SRE ethos.
By treating failures as opportunities for improvement rather than reasons for punishment, teams are encouraged to share information openly without fear. This approach leads to a more resilient organization, as it promotes a DevOps culture of transparency and continuous learning. When incidents occur, blameless postmortems are conducted, focusing on what happened and how to prevent it in the future rather than who caused it. This principle not only enhances the team's ability to respond to incidents but also contributes to a positive and productive work environment. Together, these principles guide SRE teams in creating and maintaining reliable, efficient, and continuously improving systems.

The Benefits of Site Reliability Engineering

Site Reliability Engineering (SRE) not only improves system reliability and uptime but also bridges the gap between development and operations, leading to more efficient and resilient software delivery. By adopting SRE principles, organizations can achieve a balance between innovation and stability, ensuring that their services are both cutting-edge and dependable for their users.

Benefits:
- Improved reliability: Ensures systems are dependable and trustworthy
- Efficiency: Automation reduces manual labor and speeds up processes
- Scalability: Provides an essential framework for systems to grow without a decrease in performance
- Innovation: Frees up engineering time for feature development

Drawbacks:
- Complexity: Can be difficult to implement in established systems without proper expertise
- Resource intensive: Initially requires significant investment in training and tooling
- Balancing act: Striking the right balance between new features and reliability can be challenging

Site Reliability Engineering vs. DevOps

Site Reliability Engineering (SRE) and DevOps are two methodologies that, while converging towards the aim of streamlining software development and enhancing system reliability, adopt distinct pathways to realize these goals. DevOps is primarily focused on melding the development and operations disciplines to accelerate the software development lifecycle. This is achieved through the practices of continuous integration and continuous delivery (CI/CD), which ensure that code changes are automatically built, tested, and prepared for a release to production. The heart of DevOps lies in its cultural underpinnings: breaking down silos, fostering cross-functional team collaboration, and promoting a shared responsibility for the software's performance and health.

Learn the Difference: DevOps vs. SRE vs. Platform Engineer vs. Cloud Engineer.

SRE, in contrast, takes a more structured approach to reliability, providing concrete strategies and a framework to maintain robust systems at scale. It applies a blend of software engineering principles to operational problems, which is why an SRE team's work often includes writing code for system automation, crafting error budgets, and establishing service level objectives (SLOs). While it encapsulates the collaborative spirit of DevOps, SRE specifically zones in on ensuring system reliability and stability, especially in large-scale operations. It operationalizes DevOps by adding a set of specific practices that are oriented towards proactive problem prevention and quick problem resolution, ensuring that the system not only works well under normal conditions but also maintains performance during unexpected surges or failures.
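As a rough, self-contained illustration of the SLIs, SLOs, and error budgets mentioned above, here is a minimal Kotlin sketch that computes an availability SLI from request counts and compares it against an SLO target. The types (SliWindow, Slo), the 99.9% target, and the release-freeze policy are illustrative assumptions, not a standard formula or any particular tool's API.

```kotlin
// Availability SLI: fraction of successful requests over a measurement window.
data class SliWindow(val totalRequests: Long, val failedRequests: Long) {
    val availability: Double
        get() = if (totalRequests == 0L) 1.0
                else (totalRequests - failedRequests).toDouble() / totalRequests
}

// SLO: target availability over the window, e.g. 99.9%.
data class Slo(val targetAvailability: Double)

// Error budget: how many failures the SLO still permits in this window.
fun remainingErrorBudget(window: SliWindow, slo: Slo): Long {
    val allowedFailures = ((1.0 - slo.targetAvailability) * window.totalRequests).toLong()
    return allowedFailures - window.failedRequests
}

fun main() {
    val window = SliWindow(totalRequests = 1_000_000, failedRequests = 700)
    val slo = Slo(targetAvailability = 0.999) // 99.9%

    println("SLI availability: ${window.availability}")                                 // 0.9993
    println("Remaining error budget: ${remainingErrorBudget(window, slo)} requests")    // 300

    // A common policy: when the budget is exhausted, pause risky rollouts
    // and prioritize reliability work until the SLI recovers.
    if (remainingErrorBudget(window, slo) < 0) {
        println("Error budget exhausted: freeze risky releases")
    }
}
```

The point of the sketch is the decision rule, not the arithmetic: once the measured SLI eats through the budget implied by the SLO, the data, rather than opinion, settles the feature-versus-reliability trade-off.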
Monitoring, Observability, and SRE

Monitoring and observability form the foundational pillars of site reliability engineering. Monitoring is the systematic process of gathering, processing, and interpreting data to gain a comprehensive view of a system's current health. This involves the utilization of various metrics and logs to track the performance and behavior of the system's components. The primary goal of monitoring is to detect anomalies and performance deviations that may indicate underlying issues, allowing for timely interventions.

Observability, on the other hand, extends beyond the scope of monitoring by providing insights into the system's internal workings through its external outputs. It focuses on the ability to infer the internal state of the system based on data like logs, metrics, and traces, without needing to add new code or additional instrumentation. SRE teams leverage observability to understand complex system behaviors, which enables them to preemptively identify potential issues and address them proactively. By integrating these practices, SRE ensures that the system not only remains reliable but also meets the set business objectives, thereby delivering a seamless user experience.

Conclusion

Site reliability engineering is essential for businesses that depend on providing reliable online services. With its blend of software engineering and systems management, SRE helps to ensure that systems are not just functional but also resilient, scalable, and efficient. As organizations increasingly rely on complex systems to conduct their operations, the principles and practices of SRE will become ever more integral to their success.

In crafting this analysis, we've touched on the multifaceted role of SRE in modern web services, its core principles, and the tangible benefits it brings to the table. Understanding the distinction between SRE and DevOps clarifies its unique position in the technology landscape, highlighting how essential the discipline is in achieving and maintaining high standards of reliability and performance in today's digital world.
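As a small footnote to the monitoring and observability discussion above, the Kotlin sketch below contrasts an aggregate metric (request counts and worst-case latency) with a per-request structured log line carrying a trace ID, which is the kind of external output one can use to infer internal state. The RequestMetrics and observed names are invented for this example and do not represent any specific observability tool.

```kotlin
import java.time.Duration
import java.time.Instant
import java.util.UUID

// Aggregate view of system health: totals, errors, and worst-case latency.
class RequestMetrics {
    var total = 0L
        private set
    var errors = 0L
        private set
    var maxLatency: Duration = Duration.ZERO
        private set

    fun record(latency: Duration, success: Boolean) {
        total++
        if (!success) errors++
        if (latency > maxLatency) maxLatency = latency
    }
}

// Wraps an operation: updates the metric and emits one structured log line
// per request so individual behaviors can be reconstructed later.
fun <T> observed(metrics: RequestMetrics, operation: String, block: () -> T): T {
    val traceId = UUID.randomUUID() // correlates the log line with traces elsewhere
    val start = Instant.now()
    return try {
        val result = block()
        val latency = Duration.between(start, Instant.now())
        metrics.record(latency, success = true)
        println("""{"traceId":"$traceId","op":"$operation","latencyMs":${latency.toMillis()},"outcome":"ok"}""")
        result
    } catch (e: Exception) {
        metrics.record(Duration.between(start, Instant.now()), success = false)
        println("""{"traceId":"$traceId","op":"$operation","outcome":"error","error":"${e.message}"}""")
        throw e
    }
}

fun main() {
    val metrics = RequestMetrics()
    observed(metrics, "getBooking") { Thread.sleep(25) } // simulated request
    println("total=${metrics.total} errors=${metrics.errors} maxLatencyMs=${metrics.maxLatency.toMillis()}")
}
```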
In a digital world where every company wants its products to have a cutting edge and a faster go-to-market, most companies ask their teams to follow the Agile Scrum methodology; however, we have observed that many teams follow the Scrum ceremonies in name only. Among all Scrum ceremonies, the sprint retrospective is the most important and most talked about, yet it receives the least attention. Too often, Scrum Masters keep running the same canned retrospective format: What went well? What didn't go well? What is to improve? Let us analyze the problems teams face with this, their impact, and recommendations to overcome them.

Problems and Impact of a Routine-Format Sprint Retrospective

- Running the same single format sprint after sprint makes teams lose interest: team members stop attending the ceremony, keep silent, or don't participate.
- Action items that come out of retrospectives are often not followed up during the sprint.
- The status of action items is not discussed in the next sprint retrospective.
- The team starts losing faith in the ceremony when it sees action items from previous sprints still open and accumulating.

This leads to missing key feedback and actions sprint after sprint and hampers the team's improvement. Even after 20-30 sprints, teams keep making the same mistakes, and the team never matures.

Recommendations for an Efficient Sprint Retrospective

We think visually. Try the following fun-filled, visual retrospective techniques:

- Speed car retrospective
- Speed boat retrospective
- Build and reflect
- Mad, Sad, Glad
- 4 Ls retrospective
- One-word retrospective
- Horizontal line retrospective
- Continue, stop, and start-improve
- What went well? What didn't go well? What is to improve?

Always record, publish, and track action items. Ensure leadership does not join sprint retrospectives; their presence can make the team uncomfortable sharing honest feedback. Start every sprint retrospective by discussing the status of action items from the previous sprint; this gives the team confidence that its feedback is being heard and addressed.

Now let us discuss these visual, fun-filled sprint retrospective techniques in detail:

1. Speed Car Retrospective

In this retrospective, the car represents the team, the engine depicts the team's strengths, the parachute represents the impediments that slow down the car, the abyss shows the danger the team foresees ahead, and the bridge indicates the team's suggestions on how to cross the abyss without falling into it.

2. Speed Boat Retrospective

In this retrospective, the boat represents the team, and the anchors represent the problems that stop the boat from moving or slow it down. The team then turns these anchors into gusts of wind: the suggestions the team thinks will help the boat move forward.

3. Build and Reflect

Bring a Lego set and divide the team into multiple small groups, then ask each group to build two structures: one representing how the sprint went and one representing how it should have gone. Then ask each group to talk about their structures and their suggestions for the sprint.

4. Mad, Sad, Glad

This technique discusses what made the team mad, sad, and glad during the sprint and how to move items from the mad and sad columns to the glad column.

5. Four Ls: Liked, Learned, Lacked, and Longed For

This technique talks about four Ls:
what the team liked, learned, lacked, and longed for during the sprint; each item is then discussed with the team.

6. One-Word Retrospective

Sometimes, to keep the retrospective very simple, ask the team to describe the sprint experience in one word, then ask why they describe the sprint with that particular word and what can be improved.

7. Horizontal Line Retrospective

Another simple technique is to draw a horizontal line and put items the team feels are "wins" above the line and items the team feels are "failures" during the sprint below it.

8. Continue, Stop, Start-Improve

This technique captures feedback in three categories: "Continue" covers what the team feels it did great and should continue, "Stop" covers activities the team wants to stop, and "Start-Improve" covers activities the team suggests starting or improving.

9. What Went Well? What Didn't Go Well? And What Is to Improve?

This is the best-known and most practiced retrospective technique: note down points in the three categories mentioned.

Keep reshuffling these retrospective techniques to keep the team enthusiastic about participating and sharing feedback in a fun-filled and constructive environment. Remember, feedback is a gift and should always be taken constructively to improve the overall team's performance. Go, Agile team!!
Many of today's hottest jobs didn't exist at the turn of the millennium. Social media managers, data scientists, and growth hackers were unheard of at the time. Another relatively new job role in demand is that of the site reliability engineer, or SRE. The profession is so new that 64% of SRE teams are reportedly less than three years old. But despite being new, the role adds a lot of value to an organization.

SRE vs. DevOps

Site reliability engineering is the merging of development and operations into one. Most people tend to mix up SRE and DevOps. The two intertwine, but DevOps serves as the principle and SRE as the practice. Any company looking to implement site reliability engineering in its organization might want to start by following these seven tips to build and maintain an SRE team.

1. Start Small and Internally

There is a high chance that your company needs an SRE team but doesn't need a whole department right away. The site reliability function's role is to ensure that an online service remains reliable through alert creation, incident investigation, root-cause remediation, and incident postmortems. The average tech-based company faces a few bugs every so often. In the past, operations and development teams would come together to fix those issues in software or a service. An SRE approach merges those two into one. If you're just starting to build your SRE team, you can begin by putting together some people from your operations and technical departments and giving them the sole responsibility of maintaining a service's reliability.

2. Get the Right People

When you're ready to scale, the time may come when you'll need additional help for your site reliability engineering team. SRE professionals are in hot demand nowadays; there are more than 1,300 site reliability engineering jobs on Indeed. The key to finding the right people for your SRE team is to know what you're looking for. Here are a few qualifications to look for in a site reliability engineer:

- Problem-solving and troubleshooting skills: Much of the SRE team's work has to do with addressing incidents and issues in software. Most of the time, these problems involve systems or applications that they didn't create themselves, so the ability to quickly debug even without in-depth knowledge of a system is a must-have skill.
- A knack for automation: Toil can often become a big problem in many tech-based services. The right site reliability engineer will look for ways to automate away the toil, reducing manual work to a minimum so that staff only deal with high-priority items.
- Constant learning: As systems evolve, so will problems. Good SREs will have to keep brushing up their knowledge of systems, code, and processes that change with time.
- Teamwork: Addressing incidents will rarely be a one-person job, so SREs need to work well with teams. Collaboration and communication are definitely skills to look out for.
- Bird's-eye-view perspective: When addressing bugs, it can be easy to get caught up with the wrong things when you're stuck in the middle of them. Good SREs need the ability to see the bigger picture and find solutions in larger contexts. A successful site reliability engineer will find the root cause and create an overarching solution.

3. Define Your SLOs

An SRE team will most likely succeed with service level objectives in place. Service level objectives, or SLOs, are the key performance metrics for a site.
SLOs can vary depending on the kind of service a business offers. Generally, any user-facing serving system will set availability, latency, and throughput as indicators, while storage-based systems will often place more emphasis on latency, availability, and durability. Setting up SLOs also involves choosing the values a company would like to maintain for each indicator. The numbers in your SLOs are the minimum thresholds that the system should hold to. When setting an SLO, don't base it purely on current performance, as this might commit you to unrealistic targets. Keep your objectives simple and avoid absolutes. The fewer SLOs you have in place, the better, so only measure the indicators that matter most to you.

4. Set Up Holistic Systems to Handle Incident Management

Incident management is one of the most important aspects of site reliability engineering. In a survey by Catchpoint, 49% of respondents said that they had worked on an incident in the last week or so. When handling incidents, a system needs to be in place to keep the debugging and maintenance process as smooth as possible. One of the most important aspects of an incident management system is keeping track of on-call responsibilities. SRE team responsibilities can get extremely exhausting without an effective means to control the flow of on-call incidents. Using the right incident management tool can help resolve incidents with more clarity and structure.

5. Accept Failure as Part of the Norm

Most people don't like experiencing failure, but if your company wants to maintain a healthy and productive SRE team, one of the themes each member must get used to is accepting failure as part of the profession. Perfection is rarely ever achieved in any system, especially in the early development stages. Many SRE teams make the mistake of setting the bar too high right away, putting up unrealistic SLO definitions and targets. The best operational practice has always been to shoot for a minimum viable product and then slowly increase the parameters once the team and the company as a whole build up confidence.

6. Perform Incident Postmortems to Learn from Failures and Mistakes

There's an old saying: "Dead men tell no tales." That isn't the case with system incidents. There is much to learn from incidents even after the problems have been resolved. That's why it's a great practice to perform incident postmortems so that SRE teams can learn from their mistakes. A proper SRE approach takes into account the best practices for postmortems. When performing post-incident analysis, there is a set of parameters that site reliability crews must analyze. First, they should look into the cause and triggers of the failure: What caused the system to fail? Second, the team should pinpoint as many of the effects as they can find: What did the system failure affect? For example, a payment gateway error might have caused a discrepancy in payments made or collected, which can become a headache if left unaddressed for even a few days. Lastly, a successful postmortem will look into possible solutions and recommendations in case a similar error occurs in the future.

7. Maintain a Simple Incident Management System

An SRE team structure isn't enough to create a productive team. There also needs to be a project and incident management system in place. There are various services and IT management software use cases available to SRE teams today.
Some of the factors that team managers need to consider are ease of use, communication barriers, available integrations, and collaboration capabilities.

Setting Your SRE Team Up for Success

An SRE team can be likened to an aircraft maintenance crew fixing a plane while it's 50,000 feet in the air. Setting your SRE team up for success is crucial, as they will ensure that your company's service is available to your clients. While errors and bugs are inevitable in any software as a service, they can be kept to a minimum, making outages and errors a rare occurrence. But for that to happen, you'll need a solid SRE team in place, proactively finding ways to avoid errors and ready to spring into action when duty calls.
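To make the postmortem parameters from tip 6 a bit more tangible, here is a small, hypothetical Kotlin sketch of a blameless postmortem record whose fields mirror the cause/trigger, effects, and recommendations discussed above. The field names and the sample incident are invented for illustration, not a standard template.

```kotlin
import java.time.Instant

// A minimal, blameless postmortem record: what caused/triggered the failure,
// what it affected, and how to prevent or mitigate a recurrence.
data class Postmortem(
    val incidentId: String,
    val startedAt: Instant,
    val resolvedAt: Instant,
    val causeAndTriggers: String,      // What caused the system to fail?
    val effects: List<String>,         // What did the failure affect?
    val recommendations: List<String>, // How do we handle a similar error next time?
)

fun main() {
    val report = Postmortem(
        incidentId = "INC-2024-042", // hypothetical identifier
        startedAt = Instant.parse("2024-03-01T10:15:00Z"),
        resolvedAt = Instant.parse("2024-03-01T11:05:00Z"),
        causeAndTriggers = "Payment gateway timeouts triggered by an expired API certificate",
        effects = listOf(
            "Checkout errors for roughly 50 minutes",
            "Discrepancies in payments made and collected",
        ),
        recommendations = listOf(
            "Alert on certificate expiry well in advance",
            "Add a retry path that fails over to a secondary gateway",
        ),
    )
    println("Postmortem ${report.incidentId}: ${report.causeAndTriggers}")
}
```

Capturing incidents in a consistent structure like this keeps the focus on causes, effects, and follow-up actions rather than on who was at fault.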
This is a continuation of the Project Hygiene series about best software project practices that started with this article.

Background

"It works until it doesn't" is a phrase that sounds like a truism at first glance but can hold a lot of insight in software development. Take, for instance, the very software that gets produced. There is no shortage of jokes and memes about how the "prettiness" of what the end-user sees when running a software application is a mere façade that hides a nightmare of kludges, "temporary" fixes that have become permanent, and other less-than-ideal practices. These get bundled up into a program that works just as far as the developers have planned out; a use case that falls outside of what the application has been designed for could cause the entire rickety code base to fall apart. When a catastrophe of this kind does occur, a post-mortem is usually conducted to find out just how things went so wrong. Maybe it was some black-swan moment that simply never could've been predicted (and would be unlikely to occur again in the future), but it's just as possible that there was some issue within the project that never got treated until it was too late.

Code Smells...

Sections of code that may indicate that there are deeper issues within the code base are called "code smells" because, like the milk carton in the fridge that's starting to give off a bad odor, they should provoke a "there's something off about this" reaction in a veteran developer. Sometimes, these are relatively benign items, like this Java code snippet:

```java
var aValue = foo.isFizz() ? foo.isFazz() ? true : false : false;
```

This single line contains two different code smells:

- Multiple ternary operators are being used within the same statement: This makes the code hard to reason about and needlessly increases the cognitive load of the code base.
- Hard-coded boolean values are being returned from the ternary statement, which is itself already a boolean construction: This is unnecessary redundancy and suggests that there's a fundamental misunderstanding of what the ternary statement is to be used for. In other words, the developers do not understand the tools that they are using.

Both of these points can be addressed by eliminating the use of ternary statements altogether:

```java
var aValue = foo.isFizz() && foo.isFazz();
```

Some code smells, however, might indicate an issue that would require a more thorough evaluation and rewrite. For example, take this constructor for a Kotlin class:

```kotlin
class FooService(
    private val fieldOne: SomeServiceOne,
    private val fieldTwo: SomeServiceTwo,
    private val fieldThree: SomeServiceThree,
    private val fieldFour: SomeServiceFour,
    private val fieldFive: SomeServiceFive,
    private val fieldSix: SomeServiceSix,
    private val fieldSeven: SomeServiceSeven,
    private val fieldEight: SomeServiceEight,
    private val fieldNine: SomeServiceNine,
    private val fieldTen: SomeServiceTen,
) {
    // ...
}
```

Possessing a constructor that takes in ten arguments for ten different class members is an indicator that the FooService class might be conducting too much work within the application; i.e., the so-called "God Object" anti-pattern. Unfortunately, there's no quick fix this time around. The code architecture would need to be re-evaluated to determine whether some of the functionality within FooService could be transferred to another class within the system, or whether FooService needs to be split up into multiple classes that conduct different parts of the workflow on their own.
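For the second option, splitting FooService up, one purely hypothetical direction is to regroup the ten collaborators by responsibility so that no single class needs all of them. The class names below are invented for illustration and reuse the same placeholder SomeService* types as the snippet above.

```kotlin
// Placeholder collaborator types, standing in for the ones in the example above.
class SomeServiceOne; class SomeServiceTwo; class SomeServiceThree; class SomeServiceFour; class SomeServiceFive
class SomeServiceSix; class SomeServiceSeven; class SomeServiceEight; class SomeServiceNine; class SomeServiceTen

// Hypothetical split of the overgrown FooService: each class owns one cohesive
// slice of the workflow and only the collaborators it actually needs.
class FooIngestionService(
    private val validator: SomeServiceOne,
    private val parser: SomeServiceTwo,
    private val deduplicator: SomeServiceThree,
) {
    // Accepts and validates incoming Foo data.
}

class FooEnrichmentService(
    private val pricing: SomeServiceFour,
    private val geoLookup: SomeServiceFive,
    private val audit: SomeServiceSix,
) {
    // Adds derived information to a Foo before it is persisted.
}

class FooPublishingService(
    private val repository: SomeServiceSeven,
    private val eventBus: SomeServiceEight,
    private val notifier: SomeServiceNine,
    private val metrics: SomeServiceTen,
) {
    // Persists Foo changes and notifies downstream consumers.
}

// A thin facade can keep the old entry point stable while delegating
// to the focused services above during an incremental refactoring.
class FooService(
    private val ingestion: FooIngestionService,
    private val enrichment: FooEnrichmentService,
    private val publishing: FooPublishingService,
)
```

Each smaller constructor is now easy to reason about, and the facade preserves the original entry point while the rest of the code base migrates at its own pace.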
…And Their "Project" Equivalent

The same concept can be elevated to the level of the entire project around the code being developed: there can exist items within the project that signal that the project team is conducting practices that could lead to issues down the road. At a quick glance, all may appear to be fine in the project - no fires are popping up, the application is working as desired, and so on - but push just a bit, and the problems quickly reveal themselves, as the examples outlined below demonstrate.

Victims of Goodhart's Law

To reduce the likelihood of software quality issues causing problems for a software code base and its corresponding applications, the industry has developed tools to monitor the quality of the code that gets written, like code linters, test coverage analysis, and more. Without a doubt, these are excellent tools; their effectiveness, however, depends on exactly how they're being used. Software development departments have leveraged the reporting mechanisms of these code quality tools to produce metrics that function as gates for whether to let a development task proceed to the next stage of the software development lifecycle. For example, if a code coverage analysis service like SonarQube reports that the code within a given pull request's branch only has testing for 75% of the code base, then the development team may be prohibited from integrating the code until the test coverage ratio improves. Note the specific wording there: whether the ratio improves, not whether more test cases have been added - the difference frequently comes back to haunt the project's quality.

For those unfamiliar with Goodhart's Law, it can be briefly summed up as, "When a measure becomes a target, it ceases to be a good measure." What this means in software development is that teams run the risk of developing their code in exact accordance with the metrics that have been imposed upon the team and/or project. Take the aforementioned case of a hypothetical project needing to improve its test coverage ratio. Working in the spirit of the metric would compel a developer to add more test cases to the existing code so that more parts of the code base are covered, but with Goodhart's Law in effect, one would only need to improve the ratio however possible, and that could entail:

- Modifying the code coverage tool's configuration so that it excludes whole swathes of the code base that do not have test coverage. This may have legitimate use cases - for example, excluding testing a boilerplate web request mechanism because the only necessary tests are for the generated client API that accompanies it - but it can easily be abused to essentially silence the code quality monitor.
- Generating classes that are ultimately untouched in actual project usage and whose only purpose is to be tested so that the code base has indeed "improved" its code coverage ratio. This has no defensible purpose, but tools like SonarQube will not know that they're being fooled.

Furthermore, there can be issues with the quality of the code quality verification itself. Code coverage signifies that code is being reached in the tests for the code base - nothing more, nothing less.
Here's a hypothetical test case for a web application (in Kotlin):

Kotlin

@Test
fun testGetFoo() {
    val fooObject: FooDTO = generateRandomFoo()
    `when`(fooService.getFoo(1L)).thenReturn(fooObject)
    mockMvc.perform(
        MockMvcRequestBuilders.get("/foo/{fooId}", 1L)
    ).andExpect(status().isOk())
}

This test code is minimally useful for actually verifying the behavior of the controller - the only piece of behavior being verified here is the HTTP code that the controller endpoint produces - but code coverage tools will nonetheless mark the code within the controller for this endpoint as "covered." A team that produces this type of testing is not actually checking the quality of its code - it is merely satisfying the requirements imposed on it in the most convenient and quickest way possible - and is ultimately leaving itself open to issues with the code in the future due to not effectively validating its code.

Too Much Of A Good Thing

Manual execution in software testing is fraught with issues: input errors, eating up the limited time that a developer could be using for other tasks, the developer simply forgetting to conduct the testing, and so on. Just like with code quality, the software development world has produced tools (i.e., CI/CD services like Jenkins or CircleCI) that allow tests to be executed automatically in a controlled environment, either at the behest of the developer or (ideally) entirely autonomously upon the occurrence of a configured event like the creation of a pull request within the code base. This brings enormous convenience to the developer as well as improves the ability to identify potential code quality issues within the project, but its availability can turn into a double-edged sword. It can be easy for the project team to develop an over-dependence on the service and run any and all tests only via the service and never on the developers' local environments.

In one Kotlin-based Spring Boot project that I had just joined, launching the tests for the code base on my machine would always fail due to certain tests not passing, yet:

I had pulled the main code branch, hadn't modified the code, and had followed all build instructions.
The code base was passing all tests on the CI/CD server.
No other coworkers were complaining about the same issue.

Almost all the other coworkers on the project were using Macs, whereas I had been assigned a Windows machine as my work laptop. Another coworker had a Windows machine as well, yet even they weren't complaining about the tests' constant failures. Upon asking this coworker how they were able to get these tests to pass, I received an unsettling answer: they never ran the tests locally, instead letting the CI/CD server do all the testing for them whenever they had to change code. As bad as the answer was in terms of the quality of the project's development - refusing to run tests locally meant a longer turnaround time between writing code and checking it, ultimately reducing programmer productivity - it at least gave me a lead: check the difference between Windows and non-Windows environments with regard to the failing tests. Ultimately, the issue boiled down to the system clock: Instant.now() calls on *nix platforms like Linux and Mac were conducted with microsecond-level precision, whereas calls to Instant.now() on Windows were being conducted with nanosecond-level precision.
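As the next paragraph explains, the fix was to force every timestamp in the tests down to a common precision. A minimal sketch of that idea, assuming java.time is in use (as the Instant calls above suggest) and using a plain main function purely for illustration:

Kotlin

import java.time.Instant
import java.time.temporal.ChronoUnit

fun main() {
    // Depending on the platform's clock, now() may carry extra sub-microsecond
    // digits, so comparisons against timestamps captured at a coarser precision
    // can fail on one operating system and pass on another.
    val now = Instant.now()

    // Truncating to a single agreed-upon unit makes the value deterministic
    // across platforms and safe to compare in tests.
    val normalized = now.truncatedTo(ChronoUnit.MICROS)

    println(now)        // precision varies by platform
    println(normalized) // never finer than microseconds
}

Pinning the precision in one place (for example, a shared test helper) avoids sprinkling truncation calls across every assertion.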
Time comparisons in the failing tests were being conducted based on hard-coded values; since the developers were almost all using *nix-based Mac environments, these values were based on the lesser time precision, so when the tests ran in an environment with more precise time values, they failed. After forcing microsecond-based precision for all time values within the tests - and updating the project’s documentation to indicate as much for future developers - the tests passed, and now both my Windows-based colleague and I could run all tests locally. Ignoring Instability In software terminology, a test is deemed “unstable” if it has an inconsistent result: it might pass just as much as it might fail if one executes it, despite no changes having been done to the test or the affected code in between executions. A multitude of causes could be the fault of this condition: insufficient thread safety within the code, for example, could lead to a race condition wherein result A or result B could be produced depending on which thread is executed by the machine. What’s important is how the project development team opts to treat the issue. The “ideal” solution, of course, would be to investigate the unstable test(s) and determine what is causing the instability. However, there exist other “solutions” to test instability as well: Disable the unstable tests. Like code coverage exclusions, this does have utility under the right circumstances; e.g., when the instability is caused by an external factor that the development team has no ability to affect. Have the CI/CD service re-execute the code base’s tests until all tests eventually pass. Even this has valid applications - the CI/CD service might be subject to freak occurrences that cause testing one-off failures, for example - although it’s far likelier that the team just wants to get past the “all tests pass” gate of the software development lifecycle as quickly as possible. This second point was an issue that I came across in the same Kotlin-based Spring Boot project as above. While it was possible to execute all tests for the code base and have them pass, it was almost as likely to have different tests fail as well. Just as before, the convenience of the CI/CD testing automation meant that all that one needed to do upon receiving the failing test run notification was to hit the “relaunch process” button and then go back to one’s work on other tasks while the CI/CD service re-ran the build and testing process. These seemingly random test failures were occurring with enough frequency that it was evident that some issue was plaguing the testing code, yet the team’s lack of confronting the instability issue head-on and instead relying on what was, essentially, a roll of the dice with the CI/CD service was resulting in the software development lifecycle being prolonged unnecessarily and the team’s productivity being reduced. After conducting an investigation of the different tests that were randomly failing, I ultimately discovered that the automated database transaction rollback mechanism (via the @Transactional annotation for the tests) was not working as the developers had expected. This was due to two issues: Some code required that another database transaction be opened (via the @Transactional(propagation = Propagation.REQUIRES_NEW) annotation), and this new transaction fell outside of the reach of the “automated rollback” test transaction. 
Database transactions run within Kotlin coroutines were not being included within the test's transaction mechanism, as those coroutines were being executed outside of the "automated rollback" test transaction's thread.

As a result, some data artifacts were being left in the database tables after certain tests; this "dirty" database was subsequently causing failures in other tests. Re-running the tests in CI/CD meant that these problems were being ignored in favor of simply brute-forcing an "all tests pass" outcome; in addition, the over-reliance on CI/CD for running the code base's tests in general meant that there was no database that the team's developers could investigate, as the CI/CD service would erase the test database layer after executing the testing job.

A Proposal

So, how do we go about discovering and rectifying such issues? Options might appear to be limited at first glance. Automated tools like SonarQube or CI/CD don't have a "No, not like that!" setting that detects when their own usefulness has been deliberately blunted by the development team's practices. Even if breakthroughs in artificial intelligence were to produce some sort of meta-analysis capability, the ability for a team to configure exceptions would still need to exist. If a team has to fight against its tools too much, it'll look for others that are more accommodating. Plus, Goodhart's Law will still reign supreme, and one should never underestimate a team's ability to work around statically imposed guidelines to implement what it deems necessary.

Spontaneous discovery and fixing of project smells within the project team - that is, without being prompted by the investigation that follows some catastrophe - is unlikely to occur. The development team's main goal is going to be providing value to the company via the code that they develop; fixing project smells like unstable tests does not have the same demonstrable value as deploying new functionality to production. Besides, it's possible that the team - being so immersed in its environment and day-to-day practices - is simply unaware that there's any issue at all. The fish, as they say, is the last to discover water!

A better approach would be to have an outside look at the project and how it's being developed from time to time. A disinterested point of view with the dedicated task of finding project smells will have a better chance of rooting these issues out than people within the project, who have the constraint of needing to actively develop the project as their principal work assignment. Take, say, a group of experienced developers from across the entire product development department and form a sort of "task force" that inspects the different projects within the department every three to six months. This team would examine, for example:

The quality of the code coverage and whether the tests can be improved
Whether tests - and the application itself! - can be run on all platforms that the project is supposed to support; i.e., both on the developer's machine as well as in dedicated testing and production environments
How frequently unstable test results occur and how these unstable tests are resolved

Upon conducting this audit of the project, the review team would present its findings to the project development team and make suggestions for how to improve any issues that the review team has found.
In addition, the review team would ideally conduct a follow-up with the team to understand the context of how such project smells came about. Such causes might be: A team’s lack of awareness of better practices, both project-wise and within the code that they are developing. Morale within the team is low enough that it only conducts the bare minimum to get the assigned functionality out the door. Company requirements are overburdening the team such that it can *only* conduct the bare minimum to get the assigned functionality out the door. This follow-up would be vital to help determine what changes can be made to both the team’s and the department’s practices in order to reduce the likelihood of such project smells recurring, as the three underlying causes listed above - along with other potential causes - would produce significantly different recommendations for improvement compared to one another. A Caveat As with any review process - such as code reviews or security audits - there must never be blame placed on one or more people in the team for any problematic practices that have been uncovered. The objective of the project audit is to identify what can be improved within the project, not to “name and shame” developers and project teams for supposedly “lazy” practices or other project smells. Some sense of trepidation about one’s project being audited would already be present - nobody wants to receive news that they’ve been working in a less-than-ideal way, after all - but making the process an event to be dreaded would make developer morale plummet. In addition, it’s entirely possible that the underlying reasons for the project smells existing were ultimately outside of the development team’s hands. A team that’s severely under-staffed, for example, might be fully occupied as it is frantically achieving its base objectives; anything like well-written tests or ensuring testability on a developer’s machine would be a luxury. Conclusion Preventative care can be a hard sell, as its effectiveness is measured by the lack of negative events occurring at some point in the future and cannot be effectively predicted. If a company were to be presented with this proposal to create an auditing task force to detect and handle project smells, it might object that its limited resources are better spent on continuing the development of new functionality. The Return On Investment for a brand-new widget, after all, is much easier to quantify than working towards preventing abstract (to the non-development departments in the company, at least) issues that may or may not cause problems for the company sometime down the line. Furthermore, if everything’s “working," why waste time changing it? To repeat the phrase from the beginning of this article, “It works until it doesn’t”, and that “doesn’t” could range from one team having to fix an additional set of bugs in one development sprint to a multi-day downtime that costs the company considerable revenue (or much worse!). The news is replete with stories about companies that neglected issues within their software development department until those issues blew up in their faces. While some companies that have avoided such an event so far have simply been lucky that events have played out in their favor, it would be a far safer bet to be proactive in hunting down any project smells within a company and avoid one’s fifteen minutes of infamy.
In the complex world of service reliability, the human element remains crucial despite the focus on digital metrics. Culture, communication, and collaboration are essential for organizations to deliver reliable services. In this article, I am going to dissect the integral role of human factors in ensuring service reliability and demonstrate the symbiotic relationship between technology and the individuals behind it.

Reliability-Focused Culture

First of all, let's define what a reliability-focused culture is. Here are the key aspects and features that help build a culture of reliability and constant improvement across the organization.

A culture that prioritizes reliability lies at the heart of any reliable service. It's a shared belief that reliability is not an option but a fundamental requirement. This cultural ethos is not an individual trait but a collective mindset implemented at every level of the company.

Accountability should be fostered across teams in order to build a reliability-focused culture. When every team member sees themselves as a custodian of service reliability, it creates a powerful force that helps prevent errors and resolve issues rapidly. This proactive approach, rooted in culture, becomes a shield against potential disruptions. Meta's renowned mantra, "Nothing at Meta is someone else's problem," encapsulates it perfectly.

Continuous learning and adaptation are what help an organization embrace a culture of reliability. Teams are encouraged to analyze incidents, share insights, and implement improvements. This ensures that the company evolves and keeps a competitive advantage by staying ahead of potential reliability challenges and outages. The 2021 Facebook outage is a poignant, albeit painful, example of how much incident management processes and a cultural emphasis on learning and adaptation matter.

Now that we have figured out the main features of a reliability-centered and communication-driven culture, let us focus on the aspects that help build effective team organization and set up processes to achieve the best results.

Examples of Human-Centric Reliability Models

Here are some examples of how a collaborative approach to reliability is implemented in major tech companies:

Google's Site Reliability Engineering

Site Reliability Engineering is the set of engineering practices Google uses to run reliable production systems and keep its vast infrastructure dependable. Google's culture emphasizes automation, learning from incidents, and shared responsibility. It is one of the major factors behind the high level of reliability of Google's services.

Amazon's Two-Pizza Teams

Amazon is committed to small, agile teams. This structure is known as two-pizza teams — meaning each team is small enough to be fed by two pizzas. This approach fosters effective communication and collaboration. These teams consist of employees from different disciplines who work together to ensure the reliability of the services they own.

Spotify's Squad Model

Spotify's engineering culture revolves around "squads." These are small cross-functional teams that have full ownership of services throughout the whole development process. The squad model ensures that reliability is considered and accounted for from the early development phase through to operations. This approach has shown an improvement in overall service dependability.

Implementing a Human-Centric Reliability Model

At first glance, the ways the approach is implemented in these companies seem very different; nevertheless, the underlying steps are much the same.
There are some key points that any company needs to address in order to successfully switch to a collaborative approach to reliability. Here are the steps to follow if you want to improve the reliability of the service in your organization. Break Down Silos Isolated departments are a thing of the past. Collaborative approaches that appear instead recognize that reliability is a collective responsibility. For example, DevOps brings together development and operations teams. This helps create a unified mindset of these teams towards service reliability and converge the expertise from different domains, building a more robust reliability strategy. Establish Cross-Functional Incident Response Reliability challenges are rarely confined to a single domain. Collaboration across functions is essential for a comprehensive incident response. For instance, in the event of an incident, developers, operations, and customer support must work together seamlessly to identify and address the issue in the most efficient way. Set Shared Objectives To Align Teams Towards Shared Reliability Goals When developers understand how their code affects operations and operations understand the intricacies of development, it leads to more reliable services. Shared objectives lift the boundaries between the teams, creating a unified process of response to potential reliability issues. Work on Effective Communication Communication is the glue that holds these teams together. In complex technological ecosystems, different teams need to effectively collaborate to sustain service reliability. The goal is to build a web of well-interconnected teams, from developers and operations to customer support. Transparent communication and sharing knowledge about changes, updates, and potential challenges are crucial. The information flow should be seamless to enable a holistic understanding of the service throughout the company and reinforce trust among the teams. When everyone is aware of what is going on, they can anticipate and prepare, reducing the risks of miscommunication or taking the wrong steps. Teams must have clear channels for immediate communication to coordinate efforts and share crucial information. If an incident occurs, the speed and accuracy of communication determine how swiftly and effectively the issue is resolved. Challenges and Strategies To Overcome Them Organizational changes never come easy, and shifting a work paradigm requires a lot of effort from all parties involved. I am going to share some tips on how to overcome the most common challenges and point out the areas that require the most attention. Overcoming Resistance To Change Sometimes, new ideas and changes face resistance from the teams, which usually comes from the fact that the current approach already provides a decent level of reliability. Shifting towards a reliability-focused culture requires effective leadership, communication, and showcasing the benefits of the new approach. Investing in Training and Development Building effective communication and collaboration requires time and effort. Successful integration of a human-centered approach to reliability takes a significant investment in training programs. These programs should mainly focus on soft skills, such as communication, teamwork, and adaptability. Measuring and Iterating It is important to measure and iterate on collaboration effectiveness. Establish feedback loops and conduct regular retrospectives to identify areas of improvement and refine collaborative processes. 
Conclusion Besides the technical aspects, the key to smooth operations is the people. A workplace where everyone is committed to making things work, communicating effectively, and collaborating during challenging times sets the foundation for dependable services. I have experienced many service reliability challenges and witnessed first-hand how human touch can make all the difference. In today's world, service reliability is not just about flashy tech. It is also about everyday commitment, conversations, and teamwork. By focusing on these aspects, you can ensure that the service is rock-solid.
What Is Trunk-Based Development?

To create high-quality software, we must be able to trace any changes and, if necessary, roll them back. In trunk-based development, developers frequently merge minor updates into a shared repository, often referred to as the core or trunk (usually the main or master branch). Within trunk-based development, developers create short-lived branches with only a few commits. This approach helps ensure a smooth flow of production releases, even as the team size and codebase complexity increase.

Main branch usage - Engineers actively collaborate on the main/master branch, integrating their changes frequently
Short-lived feature branches - The goal is to complete work on these branches quickly and merge them back into the main/master branch
Frequent integration - Engineers perform multiple integrations daily
Reduced branching complexity - Maintain simple branching structures and naming conventions
Early detection of issues - Integrations aid in identifying issues and bugs during the development phase
Continuous Delivery/Deployment - Changes are always in a deployable state
Feature toggles - Feature flags are used to hide incomplete or work-in-progress features (see the short sketch further below)

Trunk-Based Development (image)

Benefits of Trunk-Based Development

Here are some benefits of trunk-based development:

Allows continuous code integration
Reduces the risk of introducing bugs
Makes it easy to fix and deploy code quickly
Allows asynchronous code reviews
Enables comprehensive automated testing

Proposed Approach for a Smooth Transition

Transitioning to a trunk-based Git branching model requires careful planning and consideration. Here's a comprehensive solution addressing various aspects of the process:

Current State Analysis

Conduct a thorough analysis of the current version control and branching strategy. Identify pain points, bottlenecks, and areas that hinder collaboration and integration.

Transition Plan

Develop a phased transition plan to minimize disruptions and get it approved by the Product team. Clearly communicate the plan to the development and QA teams. Define milestones and success criteria for each phase.

Trunk-Based Development Model

Establish a single integration branch (e.g., "main", "master", or "trunk"). Allow features to be developed and tested on a feature branch first without affecting the user experience. Define clear guidelines for pull requests (PRs) to maintain code quality. Encourage peer reviews and collaboration during the code review process. Develop a robust GitHub Actions pipeline to automate the build, test, and deployment processes. Add GitHub Actions to trigger automated workflows upon code changes.

Automated Testing

If automated tests are not currently in use, begin creating a test automation framework (choose it wisely), as it will serve as a backbone in the long run. Assuming there is currently no Test Case Management (TCM) tool like TestRail, test cases will be written either in Confluence or in Excel. Strengthen the automated testing suite to cover Smoke, Integration, Confidence, and Regression tests. Integrate automated tests into the GitHub workflow for rapid feedback. Create and schedule a nightly confidence test job that acts as a health check of the app and runs every night according to a specified schedule. The results will be posted daily on a Slack/Teams channel.
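Feature toggles, mentioned above, are what make it safe to merge unfinished work into the trunk. The following is a minimal, hand-rolled sketch of the idea; a real project would more likely reach for a dedicated flag library or service, and the flag name and environment variable here are purely illustrative:

Kotlin

class FeatureFlags(private val enabled: Set<String>) {
    fun isEnabled(flag: String): Boolean = flag in enabled
}

fun renderCheckout(flags: FeatureFlags): String =
    if (flags.isEnabled("new-checkout-flow")) {
        "new checkout UI"    // merged into trunk, but hidden until the flag is on
    } else {
        "legacy checkout UI" // the default behaviour stays untouched
    }

fun main() {
    // Flags typically come from configuration per environment, so the same
    // trunk build can expose a feature in staging while hiding it in production.
    val enabledFlags = System.getenv("ENABLED_FLAGS")?.split(",")?.toSet() ?: emptySet()
    println(renderCheckout(FeatureFlags(enabledFlags)))
}

Because the flag check is ordinary code, switching a half-finished feature off is a configuration change rather than a revert, which is part of what keeps the trunk releasable at all times.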
Monitoring and Rollback Procedures

The QA team should follow the Agile process, where for each new feature the test plan and automated tests are prepared before deployment. Dev and QA must go hand in hand. Implement monitoring tools to detect issues early in the development process. Establish rollback procedures to quickly revert changes in case of unexpected problems.

Documentation and Training

Ensure each member of the engineering team is well-versed in the GitHub/release workflow, from branch creation to production release. Develop comprehensive documentation detailing the new branching model and associated best practices. Conduct training sessions for the development and QA teams to facilitate adaptation to these changes.

Communication and Collaboration Plan

Clearly communicate the benefits of the trunk-based model to the entire organization; conduct regular sessions throughout the initial year. Foster a culture of collaboration among Devs, QA, Product, and Stakeholders, encouraging shared responsibility. To enhance collaboration between Dev and QA, consider sprint planning, identifying dependencies, regular syncs, collaborative automation, learning sessions, and holding regular retrospectives together.

Key Challenges When Adopting Trunk-Based Development

Testing: Trunk-based development requires a robust testing process to ensure that code changes do not break existing functionality. Inadequate automated test coverage may lead to unstable builds.
Code review: With all developers working on the same codebase, it can be challenging to review all changes and ensure that they meet the necessary standards. Frequent integration might cause conflicts and integration issues.
Automation: Automation is important to ensure that the testing and deployment process is efficient and error-free. In the absence of a rollback plan, teams may struggle to address issues promptly.
Discipline: Trunk-based development requires a high level of discipline among team members to ensure proper adherence to the development process. Developers might fear breaking the build due to continuous integration.
Collaboration: Coordinating parallel development on the main branch can be challenging.

Release Flow: Step by Step

Branching and Commit

Devs open a short-lived feature branch off the trunk or master for any changes, improvements, or features to be added to the codebase. Follow a generic format for naming: <work_type>-<dev_name>-<issue-tracker-number>-<short-description>. Follow the naming convention for feature branches, like:

feature-shivam-SCT-456-user-authentication
release-v2.0.1
bugfix-shivam-SCT-789-fix-header-styling

While working on a branch, devs can test changes live by conducting direct testing on the local environment or on review instances. When a commit is pushed to the feature branch, GitHub Actions for the following will trigger automatically: unit tests, PR title validation, static code analysis (SAST), SonarQube checks, security checks (Trivy vulnerability scan), etc.
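To keep the naming convention above from drifting, a CI step can also validate branch names mechanically. A small illustrative sketch; the regex and failure behaviour are assumptions for the example, not part of the original workflow:

Kotlin

// Accepts e.g. feature-shivam-SCT-456-user-authentication, bugfix-..., or release-vX.Y.Z.
val branchPattern = Regex(
    """(feature|bugfix|hotfix)-[a-z]+-[A-Z]+-\d+-[a-z0-9-]+|release-v\d+\.\d+\.\d+"""
)

fun validateBranchName(name: String) {
    require(branchPattern.matches(name)) {
        "Branch '$name' does not follow <work_type>-<dev_name>-<issue-tracker-number>-<short-description>"
    }
}

fun main() {
    validateBranchName("feature-shivam-SCT-456-user-authentication") // passes
    validateBranchName("release-v2.0.1")                             // passes
    validateBranchName("my-random-branch")                           // throws IllegalArgumentException, failing the CI step
}

Running such a check alongside the GitHub Actions listed above gives immediate feedback before reviewers ever look at the pull request.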
Pull Request and Review

Open a pull request; if the PR isn't ready yet, make sure to add WIP, and also add the configured labels to the PR to categorize it. Add a CODEOWNERS file to the .github folder of the project; this will automatically assign (pre-configured) reviewers to your PR. Add pull_request_template.md to the .github folder; this will provide a predefined PR template for every pull request. As soon as a PR is opened, a notification will be sent on Teams/Slack to inform reviewers about the PR. When a PR is raised, a smoke test will automatically trigger on the locally deployed app instance (with the latest changes from the PR). After test completion, the test report will be sent to developers via email, and notifications will be sent via Slack/Teams. Test reports and artifacts will be available to download on demand. Reviewers will review the PR, leave comments (if any), and then Approve/Request Changes on the PR (if further changes are needed). *If any failure is critical or major, the team will fix it in the test phase.

Merge and Build Integration

Once all GitHub Actions have passed, the developer/reviewer can merge the pull request into the trunk (master). Immediately after the PR is merged into the trunk (master), an automated build job will trigger to build the app with the latest changes in the integration environment. If the build job is successful, regression tests will automatically trigger in the integration environment, and notifications will be sent on Teams/Slack. If tests fail, the QA team will examine the failures. If any issues genuinely break functionality, they will either roll back the commit or add a hotfix. After resolving such issues, the team can proceed with promoting the changes to the staging environment.

Create Release Branch and Tag

At this point, cut the release branch from the trunk, such as release-v2.0.1. Add a rule: if the branch name starts with release, trigger a GitHub Action to build the app instance for staging. If the build job is successful, regression tests will automatically trigger in the staging environment, and notifications will be sent on Teams/Slack. Then create a tag on the release branch, e.g., git tag -a -m "Releasing version 2.0.1" release-v2.0.1. Add a rule: if a protected tag matching a specific pattern is added, deploy the app to production. Send the release notes to the Teams/Slack channel, notifying everyone about the successful production deployment. QA will perform sanity testing (manual + automation) after the prod deployment. Upon promotion to production, any issues take precedence over ongoing work for developers, QA, and the design team in general.

* At any point, if the product or design teams want to conduct quick QA while changes are in testing, they can and should do so. This applies only to product/UI features and changes. They can also do the same during test reviews. The shorter the feedback loop, the better.

Exceptions

If any steps are not followed in either of the promotion cases (i.e., test → staging or staging → production), it must be clearly communicated why they were skipped, and this should only occur under necessary conditions. After promotion to staging, if the team discovers any blocker or critical UI issues, the dev team will address them following the same process as described earlier. The only exception is for non-critical issues/UI bugs, where the product team will decide whether or not to proceed with promotion to production. Exceptions can also occur for smaller fixes such as copy changes, CSS fixes, or config updates, when it's more important to roll out fast and test/QA in later iterations. Since iterations on this platform are fast and convenient, the workflow should evolve with that in mind and stay that way.

Conclusion

The trunk-based Git model serves as a valuable tool in the software development landscape, particularly for teams seeking a more straightforward, collaborative, and continuous integration-focused approach. As with any methodology, its effectiveness largely depends on the specific needs, goals, and dynamics of the development team and project at hand.
Murphy's Law ("Anything that can go wrong will go wrong and at the worst possible time.") is a well-known adage, especially in engineering circles. However, its implications are often misunderstood, especially by the general public. It's not just about the universe conspiring against our systems; it's about recognizing and preparing for potential failures. Many view Murphy's Law as a blend of magic and reality. As Site Reliability Engineers (SREs), we often ponder its true nature. Is it merely a psychological bias where we emphasize failures and overlook our unnoticed successes? Psychology has identified several related biases, including Confirmation and Selection biases. The human brain tends to focus more on improbable failures than successes. Moreover, our grasp of probabilities is often flawed – the Law of Truly Large Numbers suggests that coincidences are, ironically, quite common. However, in any complex system, a multitude of possible states exist, many of which can lead to failure. While safety measures make a transition from a functioning state to a failure state less likely, over time, it's more probable for a system to fail than not. The real lesson from Murphy's Law isn't just about the omnipresence of misfortune in engineering but also how we respond to it: through redundancies, high availability systems, quality processes, testing, retries, observability, and logging. Murphy's Law makes our job more challenging and interesting! Today, however, I'd like to discuss a complementary or reciprocal aspect of Murphy's Law that I've often observed while working on large systems: Complementary Observations to Murphy's Law The Worst Possible Time Complement Often overlooked, this aspect highlights the 'magic' of Murphy's Law. Complex systems do fail, but not so frequently that we forget them. In our experience, a significant number of failures (about one-third) occur at the worst possible times, such as during important demos. For instance, over the past two months, we had a couple of important demos. In the first demo, the web application failed due to a session expiration issue, which rarely occurs. In the second, a regression embedded in a merge request caused a crash right during the demo. These were the only significant demos we had in that period, and both encountered failures. This phenomenon is often referred to as the 'Demo Effect.' The Conjunction of Events Complement The combination of events leading to a breakdown can be truly astonishing. For example, I once inadvertently caused a major breakdown in a large application responsible for sending electronic payrolls to 5 million people, coinciding with its production release day. The day before, I conducted additional benchmarks (using JMeter) on the email sending system within the development environment. Our development servers, like others in the organization, were configured to route emails through a production relay, which then sent them to the final server in the cloud. Several days prior, I had set the development server to use a mock server since my benchmark simulated email traffic peaks of several hundred thousand emails per hour. However, the day after my benchmarking, when I was off work, my boss called to inquire if I had made any special changes to email sending, as the entire system was jammed at the final mail server. 
Here’s what had happened: An automated Infrastructure as Code (IAC) tool overwrote my development server configuration, causing it to send emails to the actual relay instead of the mock server; The relay, recognized by the cloud provider, had its IP address changed a few days earlier; The whitelist on the cloud side hadn't been updated, and a throttling system blocked the final server; The operations team responsible for this configuration was unavailable to address the issue. The Squadron Complement Problems often cluster, complicating resolution efforts. These range from simultaneous issues exacerbating a situation to misleading issues that divert us from the real problem. I can categorize these issues into two types: 1. The Simple Additional Issue: This typically occurs at the worst possible moment, such as during another breakdown, adding more work, or slowing down repairs. For instance, in a current project I'm involved with, due to legacy reasons, certain specific characters inputted into one application can cause another application to crash, necessitating data cleanup. This issue arises roughly once every 3 or 4 months, often triggered by user instructions. Notably, several instances of this issue have coincided with much more severe system breakdowns. 2. The Deceitful Additional Issue: These issues, when combined with others, significantly complicate post-mortem analysis and can mislead the investigation. A recent example was an application bug in a Spring batch job that remained obscured due to a connection issue with the state-storing database caused by intermittent firewall outages. The Camouflage Complement Using ITIL's problem/incidents framework, we often find incidents that appear similar but have different causes. We apply the ITIL framework's problem/incident dichotomy to classify issues where a problem can generate one or more incidents. When an incident occurs, it's crucial to conduct a thorough analysis by carefully examining logs to figure out if this is only a new incident of a known problem or an entire new problem. Often, we identify incidents that appear similar to others, possibly occurring on the same day, exhibiting comparable effects but stemming from different causes. This is particularly true when incorrect error-catching practices are in place, such as using overly broad catch(Exception) statements in Java, which can either trap too many exceptions or, worse, obscure the root cause. The Over-Accident Complement Like chain reactions in traffic accidents, one incident in IT can lead to others, sometimes with more severe consequences. I can recall at least three recent examples illustrating our challenges: 1. Maintenance Page Caching Issue: Following a system failure, we activated a maintenance page, redirecting all API and frontend calls to this page. Unfortunately, this page lacked proper cache configuration. Consequently, when a few users made XHR calls precisely at the time the maintenance page was set up, it was cached in their browsers for the entire session. Even after maintenance ended and the web application frontend resumed normal operation, the API calls continued to retrieve the HTML maintenance page instead of the expected JSON response due to this browser caching. 2. Debug Verbosity Issue: To debug data sent by external clients, we store payloads into a database. To maintain a reasonable database size, we limited the stored payload sizes. 
However, during an issue with a partner organization, we temporarily increased the payload size limit for analysis purposes. This change was inadvertently overlooked, leading to an enormous database growth and nearly causing a complete application crash due to disk space saturation. 3. API Gateway Timeout Handling: Our API gateway was configured to replay POST calls that ended in timeouts due to network or system issues. This setup inadvertently led to catastrophic duplicate transactions. The gateway reissued requests that timed out, not realizing these transactions were still processing and would eventually complete successfully. This resulted in a conflict between robustness and data integrity requirements. The Heisenbug Complement A 'heisenbug' is a type of software bug that seems to alter or vanish when one attempts to study it. This term humorously references the Heisenberg Uncertainty Principle in quantum mechanics, which posits that the more precisely a particle's position is determined, the less precisely its momentum can be known, and vice versa. Heisenbugs commonly arise from race conditions under high loads or other factors that render the bug's behavior unpredictable and difficult to replicate in different conditions or when using debugging tools. Their elusive nature makes them particularly challenging to fix, as the process of debugging or introducing diagnostic code can change the execution environment, causing the bug to disappear. I've encountered such issues in various scenarios. For instance, while using a profiler, I observed it inadvertently slowing down threads to such an extent that it hid the race conditions. On another occasion, I demonstrated to a perplexed developer how simple it was to reproduce a race condition on non-thread-safe resources with just two or three threads running simultaneously. However, he was unable to replicate it in a single-threaded environment. The UFO Issue Complement A significant number of issues are neither fixed nor fully understood. I'm not referring to bugs that are understood but deemed too costly to fix in light of their severity or frequency. Rather, I'm talking about those perplexing issues whose occurrence is extremely rare, sometimes happening only once. Occasionally, we (partially) humorously attribute such cases to Single Event Errors caused by cosmic particles. For example, in our current application that generates and sends PDFs to end-users through various components, we encountered a peculiar issue a few months ago. A user reported, with a screenshot as evidence, a PDF where most characters appeared as gibberish symbols instead of letters. Despite thorough investigations, we were stumped and ultimately had to abandon our efforts to resolve it due to a complete lack of clues. The Non-Existing Issue Complement One particularly challenging type of issue arises when it seems like something is wrong, but in reality, there is no actual bug. These non-existent bugs are the most difficult to resolve! The misconception of a problem can come from various factors, including looking in the wrong place (such as the incorrect environment or server), misinterpreting functional requirements, or receiving incorrect inputs from end-users or partner organizations. For example, we recently had to address an issue where our system rejected an uploaded image. The partner organization assured us that the image should be accepted, claiming it was in PNG format. 
However, upon closer examination (that took us several staff-days), we discovered that our system's rejection was justified: the file was not actually a PNG. The False Hope Complement I often find Murphy's Law to be quite cruel. You spend many hours working on an issue, and everything seems to indicate that it is resolved, with the problem no longer reproducible. However, once the solution is deployed in production, the problem reoccurs. This is especially common with issues related to heavy loads or concurrency. The Anti-Murphy's Reciprocal In every organization I've worked for, I've noticed a peculiar phenomenon, which I'd call 'Anti-Murphy's Law.' Initially, during the maintenance phase of building an application, Murphy’s Law seems to apply. However, after several more years, a contrary phenomenon emerges: even subpar software appears not only immune to Murphy's Law but also more robust than expected. Many legacy applications run glitch-free for years, often with less observation and fewer robustness features, yet they still function effectively. The better the design of an application, the quicker it reaches this state, but even poorly designed ones eventually get there. I have only some leads to explain this strange phenomenon: Over time, users become familiar with the software's weaknesses and learn to avoid them by not using certain features, waiting longer, or using the software during specific hours. Legacy applications are often so difficult to update that they experience very few regressions. Such applications rarely have their technical environment (like the OS or database) altered to avoid complications. Eventually, everything that could go wrong has already occurred and been either fixed or worked around: it's as if Murphy's Law has given up. However, don't misunderstand me: I'm not advocating for the retention of such applications. Despite appearing immune to issues, they are challenging to update and increasingly fail to meet end-user requirements over time. Concurrently, they become more vulnerable to security risks. Conclusion Rather than adopting a pessimistic view of Murphy's Law, we should be thankful for it. It drives engineers to enhance their craft, compelling them to devise a multitude of solutions to counteract potential issues. These solutions include robustness, high availability, fail-over systems, redundancy, replays, integrity checking systems, anti-fragility, backups and restores, observability, and comprehensive logging. In conclusion, addressing a final query: can Murphy's Law turn against itself? A recent incident with a partner organization sheds light on this. They mistakenly sent us data and relied on a misconfiguration in their own API Gateway to prevent this erroneous transmission. However, by sheer coincidence, the API Gateway had been corrected in the meantime, thwarting their reliance on this error. Thus, the answer appears to be a resounding NO.
How We Used to Handle Security

A few years ago, I was working on a completely new project for a Fortune 500 corporation, trying to bring a brand new cloud-based web service to life simultaneously in 4 different countries in the EMEA region, which would later serve millions of users. It took me and my team two months to handle everything: cloud infrastructure as code, state-of-the-art CI/CD workflows, containerized microservices in multiple environments, frontend distributed to a CDN, and tests passing in the staging environment. We were so prepared that we could go live immediately with just one extra click of a button. And we still had a whole month before the planned release date.

I know, things looked pretty good for us; until they didn't: because it was precisely at that moment that a "security guy" stepped in out of nowhere and cost us two whole weeks. Of course, the security guy. I knew vaguely that they were from the same organization but maybe a different operational unit. I also had no idea that they were involved in this project before they showed up. But I could make a good guess about what they would do: writing security reports and conducting security reviews, of course. What else could it be? After finishing those reports and reviews, I was optimistic: "We still have plenty of time," I told myself. But it wasn't long before that thought was proven wrong by another unexpected development: an external QA team jumped in and started security tests. Oh, and to make matters worse, the security tests were manual. It was two crazy weeks of fixing and testing, rinsing and repeating. The launch was delayed, and even months after the Big Bang release, the whole team was still miserable: busy on-call, fixing issues, etc. Later, I would continue to see many other projects like this one, and I am sure you have similar experiences as well.

This is actually how we used to do security. Everything is smooth until it isn't, because we traditionally tend to handle security at the end of the development lifecycle, which adds cost and time to fix the discovered security issues and causes delays. Over the years, software development has evolved to become agile and automated, but how we handle security hasn't changed much: security isn't tackled until the last minute. I keep asking myself: what could've been done differently?

Understanding DevSecOps and Security as Code

DevSecOps: Shift Security to the Left of the SDLC

Based on the experience of the project above (and many other projects), we can easily conclude why the traditional way of handling security doesn't always work:

Although security is an essential aspect of software, we put that aspect at the end of the software development lifecycle (SDLC). When we only start handling this critical aspect at the very end, it's likely to cause delays because it might require extra unexpected changes and even rework.
Since we tend to do security only once (at least we hope so), at the end, we usually don't bother automating our security tests.

To make security great again, we make two intuitive proposals for the above problems:

Shift left: why does security work happen at the end of the project, risking delays and rework? To change this, we want to integrate security work into every stage of the SDLC: shifting from the end (right side) to the left (beginning of the SDLC, i.e., planning, developing, etc.), so that we can discover potential issues earlier, when it's a lot easier and takes much less effort to fix or even rework.
Automation: why do we do security work manually, which is time-consuming, error-prone, and hard to repeat? Automation comes to the rescue. Instead of manually defining policies and security test cases, we take a code-based approach that can be automated and repeated easily. We combine the "shift left" part and the "automation" part, and bam, we get DevSecOps: a practice of integrating security at every stage, throughout the SDLC to accelerate delivery via automation, collaboration, fast feedback, and incremental, iterative improvements. What Is Security as Code (SaC)? Shifting security to the left is more of a change in the mindset; what's more important is the automation, because it's the driving force and the key to achieving a better security model: without proper automation, it's difficult, if not impossible at all, to add security checks and tests at every stage of the SDLC without introducing unnecessary costs or delays. And this is the idea of Security as Code: Security as Code (SaC) is the practice of building and integrating security into tools and workflows by identifying places where security checks, tests, and gates may be included. For those tests and checks to run automatically on every code commit, we should define security policies, tests, and scans as code in the pipelines. Hence, security "as code". From the definition, we can see that Security as Code is part of DevSecOps, or rather, it's how we can achieve DevSecOps. The key differences between Security as Code/DevSecOps and the traditional way of handling security are shifting left and automation: we try to define security in the early stages of SDLC and tackle it in every stage automatically. The Importance and Benefits of Security as Code (SaC) The biggest benefit of Security as Code, of course, is that it's accelerating the SDLC. How so? I'll talk about it from three different standpoints: First of all, efficiency-boosting: security requirements are defined early at the beginning of a project when shifting left, which means there won't be major rework in the late stage of the project with clearly defined requirements in the first place, and there won't be a dedicated security stage before the release. With automated tests, developers can make sure every single incremental code commit is secure. Secondly, codified security allows repeatability, reusability, and consistency. Development velocity is increased by shorter release cycles without manual security tests; security components can be reused at other places and even in other projects; changes to security requirements can be adopted comprehensively in a "change once, apply everywhere" manner without repeated and error-prone manual labor. Last but not least, it saves a lot of time, resources, and even money because, with automated checks, potential vulnerabilities in the development and deployment process will be caught early on in the SDLC when remediating the issues has a much smaller footprint, cost- and labor-wise. To learn more about how adding security into DevSecOps accelerates the SDLC in every stage read this blog here. Key Components of Security as Code The components of Security as Code for application development are automated security tests, automated security scans, automated security policies, and IaC security. Automated security tests: automate complex and time-consuming manual tests (and even penetration tests) via automation tools and custom scripts, making sure they can be reused across different environments and projects. 
Automated security scans: we can integrate security scans into CI/CD pipelines so that they can be triggered automatically and reused across different environments and projects. We can do all kinds of scans and analyses here, for example, static code scans, dynamic analyses, and scans against known vulnerabilities.

Automated security policies: we can define different policies as code using a variety of tools and integrate the policy checks with our pipelines. For example, we can define access control in RBAC policies for different tools; we can enforce policies in microservices, Kubernetes, and even CI/CD pipelines (for example, with the Open Policy Agent). To know more about Policy as Code and Open Policy Agent, read this blog here.

IaC security: nowadays, we often define our infrastructure (especially cloud-based) as code (IaC) and deploy it automatically. We can use IaC to ensure the same security configs and best practices are applied across all environments, and we can use Security as Code measures to make sure the infrastructure code itself is secure. To do so, we integrate security tests and checks within the IaC pipeline, as with the ggshield security scanner for your Terraform code.

Best Practices for Security as Code

With the critical components of Security as Code sorted out, let's move on to a few best practices to follow.

Security-First Mindset for Security as Code/DevSecOps

First of all, since Security as Code and DevSecOps are all about shifting left - which is not only a change in how and when we do things but, more importantly, a change of mindset - the very first best practice for Security as Code and DevSecOps is to build (or rather, transition into) a security-first mindset. At Amazon, there is a famous saying: "Security is job zero." Why do we say that? Because security is so important that if you only start dealing with it at the end, there will be consequences. Similar to writing tests, trying to fix issues or even rework components because of security issues found at the end of a project's development lifecycle can be orders of magnitude harder than doing so when the code is still fresh, the risk has just been introduced, and no other components rely on it yet. Because of its importance and close relationship with other moving parts, we want to shift security to the left, and the way to achieve that is by transitioning into a security-first mindset. If you want to know more about DevSecOps and why "adding" security into your SDLC doesn't slow things down but rather speeds them up, refer to this blog detailing exactly that.

Code Reviews + Automated Scanning

Security as Code is all about automation, so it makes sense to start writing those automated tests as early as possible so that they can be used at the very beginning of the SDLC, acting as checks and gates and accelerating the development process. For example, we can automate SAST/DAST (Static Application Security Testing and Dynamic Application Security Testing) with well-known tools (for example, SonarQube, Synopsys, etc.) in our CI/CD pipelines so that everything runs automatically on new commits. One thing worth pointing out is that SAST + DAST isn't enough: while static and dynamic application tests are the cornerstones of security, there are blind spots. For example, one hard-coded secret in the code is more than enough to compromise the entire system.
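Hard-coded credentials are exactly the kind of blind spot that dedicated secret scanning catches. Purely as an illustration of the idea (real scanners such as the ggshield tool mentioned above ship far more sophisticated detectors), a deliberately naive sketch of a check that a pipeline could run on every commit might look like this:

Kotlin

import java.io.File
import kotlin.system.exitProcess

// Illustrative patterns only; real secret scanners use hundreds of detectors.
val suspiciousPatterns = listOf(
    Regex("""(?i)aws_secret_access_key\s*=\s*\S+"""),
    Regex("""(?i)password\s*=\s*["'][^"']+["']"""),
    Regex("""-----BEGIN (RSA|EC) PRIVATE KEY-----"""),
)

fun scan(root: File): List<String> {
    val findings = mutableListOf<String>()
    root.walkTopDown()
        .filter { it.isFile && it.extension in setOf("kt", "java", "yml", "properties") }
        .forEach { file ->
            file.readLines().forEachIndexed { index, line ->
                if (suspiciousPatterns.any { it.containsMatchIn(line) }) {
                    findings += "${file.path}:${index + 1} looks like a hard-coded secret"
                }
            }
        }
    return findings
}

fun main() {
    val findings = scan(File("."))
    findings.forEach(::println)
    // A non-zero exit code lets a CI job fail the build when findings exist.
    if (findings.isNotEmpty()) exitProcess(1)
}

The value is not in the regexes themselves but in running the check automatically on every commit, which is exactly the "shift left plus automation" combination described earlier.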
Two approaches are recommended as complements to the automated security tests. First of all, regular code reviews help. It's always nice to have another person's input on a code change, because the four-eyes principle can be useful. For more tips on conducting secure code reviews, read this blog here. However, code reviews can only help to a certain extent because, first of all, humans still tend to miss mistakes, and second of all, during code reviews we mainly focus on the diffs rather than what's already in the code base. As a complement to code reviews, having security scanning and detection in place also helps.

Continuous Monitoring, Feedback Loops, Knowledge Sharing

Having automated security policies as checks can only help so much if the policies themselves aren't of high quality or, worse, if the results can't reach the team. Thus, creating a feedback loop to continuously deliver the results to the developers and monitor the checks is critical, too. It's even better if the monitoring creates logs automatically and displays the results in a dashboard, making sure that no security risk goes unnoticed, no sensitive data or secret is shared, and developers can find breaches early so that they can remediate issues early. Knowledge sharing and continuous learning can also be helpful, allowing developers to pick up best practices during the coding process.

Security as Code: What You Should Keep in Mind

Besides the best practices above, there are a few other considerations and challenges when putting Security as Code into action:

First of all, we need to balance speed and security when implementing Security as Code/DevSecOps. Yes, I know, we mentioned earlier how security is job zero and how doing DevSecOps actually speeds things up, but implementing Security as Code in the early stages of the SDLC still costs some time upfront, and this is one of the balances we need to consider carefully, to which, unfortunately, there is no one-size-fits-all answer. In general, for big and long-lasting projects, paying those upfront costs will most likely be very beneficial in the long run, but it could be worth a second thought if it's only a one-sprint or even one-day minor task with relatively clear changes and a confined attack surface.

Secondly, to effectively adopt Security as Code, the skills and gaps in the team need to be identified, and continuous knowledge sharing and learning are required. This can be challenging in the DevOps/DevSecOps/Cloud world, where things evolve fast, with new tools and even new methodologies emerging from time to time. Under these circumstances, it's critical to keep up with the pace, identify what could potentially push engineering productivity and security to another level, figure out what needs to be learned (because we've only got limited time and can't learn everything), and learn those highly prioritized skills quickly.

Last but not least, keep a close eye on newly discovered security threats and changes regarding security regulations, which are also evolving with time.

Conclusions and Next Steps

Security as Code isn't just a catchphrase; it requires continuous effort to make the best of it: a change of mindset, continuous learning, new skills, new tooling, automation, and collaboration.
As a recap, here's a list of the key components:

Automated security tests
Automated security scans
Automated security policies
IaC security

And here's a list of the important aspects to keep in mind when adopting it proactively:

Security-first mindset change, shift left
Automated testing/scanning combined with regular code reviews
Continuous feedback and learning

Finally, let's end this article with an FAQ list on Security as Code and DevSecOps.

F.A.Q.

What is Security as Code (SaC)?
Security as Code (SaC) is the practice of building and integrating security into tools and workflows by defining security policies, tests, and scans as code. It identifies places where security checks, tests, and gates may be included in pipelines without adding extra overhead.

What is the main purpose of Security as Code?
The main purpose of Security as Code is to boost the SDLC by increasing efficiency and saving time and resources while minimizing vulnerabilities and risks. This approach integrates security measures into the development process from the start, rather than adding them at the end.

What is the relationship between Security as Code and DevSecOps?
DevSecOps is achieved by shifting left and automation; Security as Code handles the automation part. Security as Code is the key to DevSecOps.

What is the difference between DevSecOps and secure coding?
DevSecOps focuses on automated security tests and checks, whereas secure coding is the practice of developing computer software in a way that guards against the accidental introduction of security vulnerabilities. Defects, bugs, and logic flaws are consistently the primary cause of commonly exploited software vulnerabilities.

What is Infrastructure as Code (IaC) security?
IaC security uses the Security as Code approach to secure infrastructure code. For example, consistent cloud security policies can be embedded into the infrastructure code itself and the pipelines to reduce security risks.
Stefan Wolpers, Agile Coach, Berlin Product People GmbH
Daniel Stori, Software Development Manager, AWS
Alireza Rahmani Khalili, Officially Certified Senior Software Engineer, Domain Driven Design Practitioner, Worksome