Unit testing has become a standard part of development. Many tools can be utilized for it in many different ways. This article demonstrates a couple of hints, or, let's say, best practices that work well for me.

In This Article, You Will Learn

- How to write clean and readable unit tests with JUnit and AssertJ
- How to avoid false positive tests in some cases
- What to avoid when writing unit tests

Don't Overuse NPE Checks

We all tend to avoid NullPointerException as much as possible in the main code because it can lead to ugly consequences. In test code, however, avoiding an NPE is not the main concern. Our goal is to verify the behavior of a tested component in a clean, readable, and reliable way.

Bad Practice

Many times in the past, I've used the isNotNull assertion even when it wasn't needed, like in the example below:

```java
@Test
public void getMessage() {
    assertThat(service).isNotNull();
    assertThat(service.getMessage()).isEqualTo("Hello world!");
}
```

This test produces errors like this:

```
java.lang.AssertionError:
Expecting actual not to be null
	at com.github.aha.poc.junit.spring.StandardSpringTest.test(StandardSpringTest.java:19)
```

Good Practice

Even though the additional isNotNull assertion is not really harmful, it should be avoided for the following reasons:

- It doesn't add any additional value. It's just more code to read and maintain.
- The test fails anyway when service is null, and we see the real root cause of the failure. The test still fulfills its purpose.
- The produced error message is even better with the plain AssertJ assertion.

See the modified test assertion below.

```java
@Test
public void getMessage() {
    assertThat(service.getMessage()).isEqualTo("Hello world!");
}
```

The modified test produces an error like this:

```
java.lang.NullPointerException: Cannot invoke "com.github.aha.poc.junit.spring.HelloService.getMessage()" because "this.service" is null
	at com.github.aha.poc.junit.spring.StandardSpringTest.test(StandardSpringTest.java:19)
```

Note: The example can be found in SimpleSpringTest.

Assert Values and Not the Result

From time to time, we write a correct test, but in a "bad" way. The test works exactly as intended and verifies our component, but a failure doesn't provide enough information. Therefore, our goal is to assert the value and not the comparison result.

Bad Practice

Let's see a couple of such bad tests:

```java
// #1
assertThat(argument.contains("o")).isTrue();

// #2
var result = "Welcome to JDK 10";
assertThat(result instanceof String).isTrue();

// #3
assertThat("".isBlank()).isTrue();

// #4
Optional<Method> testMethod = testInfo.getTestMethod();
assertThat(testMethod.isPresent()).isTrue();
```

Some errors from the tests above are shown below.

```
#1
Expecting value to be true but was false
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
	at com.github.aha.poc.junit5.params.SimpleParamTests.stringTest(SimpleParamTests.java:23)

#3
Expecting value to be true but was false
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
	at com.github.aha.poc.junit5.ConditionalTests.checkJdk11Feature(ConditionalTests.java:50)
```

Good Practice

The solution is quite easy with AssertJ and its fluent API.
All the cases mentioned above can be easily rewritten as:

```java
// #1
assertThat(argument).contains("o");

// #2
assertThat(result).isInstanceOf(String.class);

// #3
assertThat("").isBlank();

// #4
assertThat(testMethod).isPresent();
```

The very same errors as mentioned before provide more value now.

```
#1
Expecting actual:
  "Hello"
to contain:
  "f"
	at com.github.aha.poc.junit5.params.SimpleParamTests.stringTest(SimpleParamTests.java:23)

#3
Expecting blank but was: "a"
	at com.github.aha.poc.junit5.ConditionalTests.checkJdk11Feature(ConditionalTests.java:50)
```

Note: The example can be found in SimpleParamTests.

Group Related Assertions Together

Assertion chaining and the related code indentation help a lot with test clarity and readability.

Bad Practice

As we write a test, we can end up with a correct but less readable test. Let's imagine a test where we want to find countries and do these checks:

- Count the found countries.
- Assert the first entry with several values.

Such a test can look like this example:

```java
@Test
void listCountries() {
    List<Country> result = ...;

    assertThat(result).hasSize(5);
    var country = result.get(0);
    assertThat(country.getName()).isEqualTo("Spain");
    assertThat(country.getCities().stream().map(City::getName)).contains("Barcelona");
}
```

Good Practice

Even though the previous test is correct, we can improve the readability a lot by grouping the related assertions together. The goal here is to assert result once and chain as many assertions as needed. See the modified version below.

```java
@Test
void listCountries() {
    List<Country> result = ...;

    assertThat(result)
        .hasSize(5)
        .first()
        .satisfies(c -> {
            assertThat(c.getName()).isEqualTo("Spain");
            assertThat(c.getCities().stream().map(City::getName)).contains("Barcelona");
        });
}
```

Note: The example can be found in CountryRepositoryOtherTests.

Prevent False Positive Successful Tests

When an assertion method with a ThrowingConsumer argument is used, the consumer has to contain an assertThat as well. Otherwise, the test passes all the time, even when the comparison fails, which means the test is wrong. The test fails only when the consumer throws a RuntimeException or an AssertionError. I guess it's clear, but it's easy to forget about it and write a wrong test. It happens to me from time to time.

Bad Practice

Let's imagine we have a couple of country codes and we want to verify that every code satisfies some condition. In our dummy case, we want to assert that every country code contains the character "a". As you can see, it's nonsense: the codes are uppercase, but we aren't applying case insensitivity in the assertion.

```java
@Test
void assertValues() throws Exception {
    var countryCodes = List.of("CZ", "AT", "CA");
    assertThat( countryCodes )
        .hasSize(3)
        .allSatisfy(countryCode -> countryCode.contains("a"));
}
```

Surprisingly, our test passes.

Good Practice

As mentioned at the beginning of this section, the test can be corrected easily by adding assertThat inside the consumer. The correct test should look like this:

```java
@Test
void assertValues() throws Exception {
    var countryCodes = List.of("CZ", "AT", "CA");
    assertThat( countryCodes )
        .hasSize(3)
        .allSatisfy(countryCode -> assertThat( countryCode ).containsIgnoringCase("a"));
}
```

Now the test fails as expected, with the correct error message.
```
java.lang.AssertionError:
Expecting all elements of:
  ["CZ", "AT", "CA"]
to satisfy given requirements, but these elements did not:

"CZ"
error:
Expecting actual:
  "CZ"
to contain:
  "a"
 (ignoring case)
	at com.github.aha.sat.core.clr.AppleTest.assertValues(AppleTest.java:45)
```

Chain Assertions

The last hint is not really a practice, but rather a recommendation: the AssertJ fluent API should be utilized in order to create more readable tests.

Non-Chaining Assertions

Let's consider a listLogs test whose purpose is to verify the logging of a component. The goal here is to check:

- the number of collected logs, and
- the existence of a DEBUG and an INFO log message.

```java
@Test
void listLogs() throws Exception {
    ListAppender<ILoggingEvent> logAppender = ...;

    assertThat( logAppender.list ).hasSize(2);
    assertThat( logAppender.list ).anySatisfy(logEntry -> {
        assertThat( logEntry.getLevel() ).isEqualTo(DEBUG);
        assertThat( logEntry.getFormattedMessage() ).startsWith("Initializing Apple");
    });
    assertThat( logAppender.list ).anySatisfy(logEntry -> {
        assertThat( logEntry.getLevel() ).isEqualTo(INFO);
        assertThat( logEntry.getFormattedMessage() ).isEqualTo("Here's Apple runner" );
    });
}
```

Chaining Assertions

With the mentioned fluent API and chaining, we can change the test this way:

```java
@Test
void listLogs() throws Exception {
    ListAppender<ILoggingEvent> logAppender = ...;

    assertThat( logAppender.list )
        .hasSize(2)
        .anySatisfy(logEntry -> {
            assertThat( logEntry.getLevel() ).isEqualTo(DEBUG);
            assertThat( logEntry.getFormattedMessage() ).startsWith("Initializing Apple");
        })
        .anySatisfy(logEntry -> {
            assertThat( logEntry.getLevel() ).isEqualTo(INFO);
            assertThat( logEntry.getFormattedMessage() ).isEqualTo("Here's Apple runner" );
        });
}
```

Note: The example can be found in AppleTest.

Summary and Source Code

The AssertJ framework provides a lot of help with its fluent API. This article presented several tips and hints for producing clearer and more reliable tests. Please be aware that most of these recommendations are subjective; they depend on personal preference and code style. The source code used in this article can be found in my repositories:

- spring-advanced-training
- junit-poc
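To tie the hints together, here is a small, self-contained sketch of what a test applying them might look like. The GreetingService class and its greetings() method are hypothetical stand-ins, not taken from the repositories above; the imports assume JUnit 5 and AssertJ.

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.util.List;
import org.junit.jupiter.api.Test;

class GreetingServiceTest {

    // Hypothetical component under test.
    static class GreetingService {
        List<String> greetings() {
            return List.of("Hello world!", "Hola mundo!", "Hallo Welt!");
        }
    }

    private final GreetingService service = new GreetingService();

    @Test
    void greetingsAreWellFormed() {
        // No redundant isNotNull(): a null service fails with a clear NPE anyway.
        // Assert values (not boolean results) and chain the related assertions together.
        assertThat(service.greetings())
            .hasSize(3)
            .allSatisfy(greeting ->
                // assertThat inside allSatisfy prevents a false-positive pass.
                assertThat(greeting).endsWith("!").containsIgnoringCase("h"));
    }
}
```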
We don't usually think of Git as a debugging tool. Surprisingly, Git shines not just as a version control system, but also as a potent debugging ally when dealing with the tricky matter of regressions.

The Essence of Debugging With Git

Before we tap into the advanced aspects of git bisect, it's essential to understand its foundational premise. Git is known for tracking changes and managing code history, but the git bisect tool is a hidden gem for regression detection. Regressions are distinct from generic bugs. They signify a backward step in functionality: something that once worked flawlessly now fails. Pinpointing the exact change causing a regression can be akin to finding a needle in a haystack, particularly in extensive codebases with long commit histories.

Traditionally, developers would employ a manual, binary search strategy: checking out different versions, testing them, and narrowing down the search scope. This method, while effective, is painstakingly slow and error-prone. Git bisect automates this search, transforming what used to be a marathon into a swift sprint.

Setting the Stage for Debugging

Imagine you're working on a project, and recent reports indicate a newly introduced bug affecting the functionality of a feature that previously worked flawlessly. You suspect a regression but are unsure which commit introduced the issue among the hundreds made since the last stable version.

Initiating Bisect Mode

To start, you'll enter bisect mode in your terminal within the project's Git repository:

```shell
git bisect start
```

This command signals Git to prepare for the bisect process.

Marking the Known Good Revision

Next, you identify a commit where the feature functioned correctly, often a commit tagged with a release number or dated before the issue was reported. Mark this commit as "good":

```shell
git bisect good a1b2c3d
```

Here, a1b2c3d represents the hash of the known good commit.

Marking the Known Bad Revision

Similarly, you mark the current version or a specific commit where the bug is present as "bad":

```shell
git bisect bad z9y8x7w
```

z9y8x7w is the hash of the bad commit, typically the latest commit in the repository where the issue is observed.

Bisecting To Find the Culprit

Upon marking the good and bad commits, Git automatically jumps to a commit roughly in the middle of the two and waits for you to test this revision. After testing (manually or with a script), you inform Git of the result:

- If the issue is present: git bisect bad
- If the issue is not present: git bisect good

Git then continues to narrow down the range, selecting a new commit to test based on your feedback.

Expected Output

After several iterations, Git will isolate the problematic commit, displaying a message similar to:

```
Bisecting: 0 revisions left to test after this (roughly 3 steps)
[abcdef1234567890] Commit message of the problematic commit
```

Reset and Analysis

Once the offending commit is identified, you conclude the bisect session to return your repository to its initial state:

```shell
git bisect reset
```

Notice that bisect isn't linear. Bisect doesn't scan through the revisions in a sequential manner. Based on the good and bad markers, Git automatically selects a commit approximately in the middle of the range for testing. This is where the non-linear, binary search pattern starts, as Git divides the search space in half instead of examining each commit sequentially. This means fewer revisions get scanned and the process is faster.
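For reference, the full manual session described above condenses to a handful of commands. This is only a sketch; the hashes are the same placeholders used in the text.

```shell
git bisect start
git bisect good a1b2c3d   # last known good commit
git bisect bad z9y8x7w    # commit where the bug is observed
# Git now checks out a commit near the middle of the range; test it, then report:
git bisect good           # or: git bisect bad
# ...repeat until Git names the offending commit, then restore the repository:
git bisect reset
```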
Advanced Usage and Tips

The magic of git bisect lies in its ability to automate the binary search algorithm within your repository, systematically halving the search space until the rogue commit is identified. Git bisect offers a powerful avenue for debugging, especially for identifying regressions in a complex codebase. To elevate your use of this tool, consider delving into more advanced techniques and strategies. These tips not only enhance your debugging efficiency but also provide practical solutions to common challenges encountered during the bisecting process.

Script Automation for Precision and Efficiency

Automating the bisect process with a script is a game-changer, significantly reducing manual effort and minimizing the risk of human error. This script should ideally perform a quick test that directly targets the regression, returning an exit code based on the test's outcome.

Example

Imagine you're debugging a regression where a web application's login feature breaks. You could write a script that attempts to log in using a test account and checks if the login succeeds. The script might look something like this in a simplified form:

```shell
#!/bin/bash
# Attempt to log in and check for success
if curl -s http://yourapplication/login -d "username=test&password=test" | grep -q "Welcome"; then
  exit 0 # Login succeeded, mark this commit as good
else
  exit 1 # Login failed, mark this commit as bad
fi
```

By passing this script to git bisect run, Git automatically executes it at each step of the bisect process, effectively automating the regression hunt.

Handling Flaky Tests With Strategy

Flaky tests, which sometimes pass and sometimes fail under the same conditions, can complicate the bisecting process. To mitigate this, your automation script can include logic to rerun tests a certain number of times or to apply more sophisticated checks to differentiate between a true regression and a flaky failure.

Example

Suppose you have a test that's known to be flaky. You could adjust your script to run the test multiple times, considering the commit "bad" only if the test fails consistently:

```shell
#!/bin/bash
# Run the flaky test three times
success_count=0
for i in {1..3}; do
  if ./run_flaky_test.sh; then
    ((success_count++))
  fi
done

# If the test succeeds twice or more, consider it a pass
if [ "$success_count" -ge 2 ]; then
  exit 0
else
  exit 1
fi
```

This approach reduces the chances that a flaky test will lead to incorrect bisect results.

Skipping Commits With Care

Sometimes, you'll encounter commits that cannot be tested due to reasons like broken builds or incomplete features. git bisect skip is invaluable here, allowing you to bypass these commits. However, use this command judiciously to ensure it doesn't obscure the true source of the regression.

Example

If you know that commits related to database migrations temporarily break the application, you can skip testing those commits. During the bisect session, when Git lands on a commit you wish to skip, you would manually issue:

```shell
git bisect skip
```

This tells Git to exclude the current commit from the search and adjust its calculations accordingly. It's essential to only skip commits when absolutely necessary, as skipping too many can interfere with the accuracy of the bisect process. These advanced strategies enhance the utility of git bisect in your debugging toolkit.
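As a sketch of how the pieces above fit together, the whole hunt can be handed to git bisect run. The script name check_login.sh is a placeholder for a test script like the ones shown, and v2.3.0 is a placeholder tag; Git treats exit code 0 as "good", 1 through 127 as "bad", and the special value 125 as a request to skip an untestable commit.

```shell
git bisect start
git bisect bad HEAD            # current broken revision
git bisect good v2.3.0         # last known good release
git bisect run ./check_login.sh
git bisect reset
```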
By automating the regression testing process, handling flaky tests intelligently, and knowing when to skip untestable commits, you can make the most out of git bisect for efficient and accurate debugging. Remember, the goal is not just to find the commit where the regression was introduced but to do so in the most time-efficient manner possible. With these tips and examples, you're well-equipped to tackle even the most elusive regressions in your projects.

Unraveling a Regression Mystery

In the past, we used git bisect while working on a large-scale web application. After a routine update, users began reporting a critical feature failure: the application's payment gateway stopped processing transactions correctly, leading to a significant business impact. We knew the feature worked in the last release but had no idea which of the hundreds of recent commits introduced the bug. Manually testing each commit was out of the question due to time constraints and the complexity of the setup required for each test. Enter git bisect.

The team started by identifying a "good" commit where the payment gateway functioned correctly and a "bad" commit where the issue was observed. We then crafted a simple test script that would simulate a transaction and check if it succeeded. By running git bisect start, followed by marking the known good and bad commits, and executing the script with git bisect run, we set off an automated process that identified the faulty commit. Git efficiently navigated through the commits, automatically running the test script at each step. In a matter of minutes, git bisect pinpointed the culprit: a seemingly innocuous change to the transaction logging mechanism that inadvertently broke the payment processing logic. Armed with this knowledge, we reverted the problematic change, restoring the payment gateway's functionality and averting further business disruption. This experience not only resolved the immediate issue but also transformed our approach to debugging, making git bisect a go-to tool in our arsenal.

Final Word

The story of the payment gateway regression is just one example of how git bisect can be a lifesaver in the complex world of software development. By automating the tedious process of regression hunting, git bisect not only saves precious time but also brings a high degree of precision to the debugging process. As developers continue to navigate the challenges of maintaining and improving complex codebases, tools like git bisect underscore the importance of leveraging technology to work smarter, not harder. Whether you're dealing with a mysterious regression or simply want to refine your debugging strategies, git bisect offers a powerful, yet underappreciated, solution to swiftly and accurately identify the source of regressions. Remember, the next time you're faced with a regression, git bisect might just be the debugging partner you need to uncover the truth hidden within your commit history.
Have you ever felt like building a castle out of sand, only to have the tide of unexpected software bugs wash it all away? In everyday software development work, unforeseen issues can spell disaster. But what if we could predict the likelihood of these problems before they arise? Enter the realm of probability, our secret weapon for building robust and reliable software.

Probability plays a crucial role in software testing, helping us understand the likelihood of certain events, like encountering specific paths within the code, and assessing the effectiveness of test coverage. This article starts from scratch. We define probability theoretically and practically. We'll then dive into conditional probability and Bayes' theorem, giving basic formulas, examples, and applications to software testing and beyond.

Laying the Foundation: Defining Probability

We begin with the fundamental question: what exactly is probability? In the realm of software testing, it represents the likelihood of a particular event occurring, such as executing a specific sequence of statements within our code. Imagine a coin toss: the probability of landing heads is 1/2 (assuming a fair coin). Similarly, we can assign probabilities to events in software, but the complexities inherent in code demand a more robust approach than counting "heads" and "tails."

Beyond Laplace's Marble Bag: A Set-Theoretic Approach

While the classic definition by Laplace, which compares favorable outcomes to total possibilities, works for simple scenarios, it becomes cumbersome for intricate software systems. Instead, we leverage the power of set theory and propositional logic to build a more versatile framework. Imagine the set of all possible events in our code as a vast universe. Each event, like encountering a specific path in our code, is represented by a subset within this universe. We then formulate propositions (statements about these events) to understand their characteristics. The key lies in the truth set of a proposition: the collection of events within the universe where the proposition holds true.

Probability Takes Shape: From Truth Sets to Calculations

Now comes the magic of probability. The probability of a proposition being true, denoted as Pr(p), is simply the size (cardinality) of its truth set divided by the size of the entire universe. This aligns with Laplace's intuition but with a more rigorous foundation. Think about checking if a month has 30 days. In the universe of all months (U = {Jan, Feb, ..., Dec}), the proposition "p(m): m is a 30-day month" has a truth set T(p(m)) = {Apr, Jun, Sep, Nov}. Therefore, Pr(p(m)) = 4/12, providing a precise measure of the likelihood of encountering a 30-day month.

The Universe Matters: Choosing Wisely

Selecting the appropriate universe for our calculations is crucial. Imagine finding the probability of a February in a year (Pr(February)): simply 1/12. But what about the probability of a month with 29 days? Here, the universe needs to consider leap years, influencing the truth set and, ultimately, the probability. This highlights the importance of choosing the right "playing field" for our probability calculations and avoiding "universe shifts" that can lead to misleading results. Imagine we're testing an e-commerce application and only consider the universe of "typical" transactions during peak season (e.g., holidays). We calculate the probability of encountering a payment gateway error to be low.
However, we haven't considered the universe of "all possible transactions," which might include high-value orders, international payments, or unexpected surges due to flash sales. These scenarios could have a higher chance of triggering payment gateway issues, leading to underestimated risks and potential outages during crucial business periods.

Essential Tools in Our Probability Arsenal

Beyond the basic framework, there are some key facts that govern the behavior of probabilities within a specific universe:

- Pr(not p) = 1 - Pr(p): The probability of an event not happening is simply 1 minus the probability of it happening.
- Pr(p and q) = Pr(p) * Pr(q) (assuming independence): If events p and q are independent (meaning they don't influence each other), the probability of both happening is the product of their individual probabilities.
- Pr(p or q) = Pr(p) + Pr(q) - Pr(p and q): The probability of either p or q happening, or both, is the sum of their individual probabilities minus the probability of both happening together.

These principles, combined with our understanding of set theory and propositional logic, can empower us to confidently manipulate probability expressions within the context of software testing.

Conditional Probability

While probability helps us estimate the likelihood of encountering specific events and optimize testing strategies, conditional probability takes this a step further by considering the influence of one event on the probability of another. This concept offers valuable insights in various software testing scenarios.

Understanding the "Given"

Conditional probability focuses on the probability of event B happening given that event A has already occurred. We represent it as P(B | A). This "given" condition acts like a filter, narrowing down the possibilities for event B based on the knowledge that event A has already happened.

Basic Formulas for Conditional Probability

Here are some key formulas and their relevance to software testing.

1. Unveiling the Definition (Set Membership)

P(B | A) = P(A ∩ B) / P(A)

Imagine events A and B as sets representing specific scenarios in our software (e.g., A = invalid login attempt, B = system error). The intersection (∩) signifies "both happening simultaneously." This translates to the probability of event B occurring given event A, represented by P(B | A), being equal to the ratio of the elements in the intersection (A ∩ B) to the elements in set A alone. In general, P(A ∩ B) might represent encountering a specific bug under certain conditions (A), and P(A) could represent the overall probability of encountering that bug.

Example: Analyzing login errors, we calculate P(error | invalid login) = P({invalid login ∩ system error}) / P({invalid login}). This reveals the likelihood of encountering a system error specifically when an invalid login attempt occurs.

2. Relationship With Marginal Probabilities (Set Union and Complement)

P(B) = P(B | A) * P(A) + P(B | ~A) * P(~A)

This formula relates the unconditional probability of event B (P(B)) to its conditional probabilities given A and its opposite (~A), along with the marginal probabilities of A and its opposite. It highlights how considering conditions (A or ~A) can alter the overall probability of B.

Example: Imagine testing a payment processing system. We estimate P(payment failure) = P(failure | network issue) * P(network issue) + P(failure | normal network) * P(normal network).
This allows us to analyze the combined probability of payment failure considering both network issues and normal operation scenarios.

3. Total Probability (Unveiling Overlap, Complement, and Difference)

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

This formula, though not directly related to conditional probability, is crucial for understanding set relationships in software testing. It ensures that considering both events A and B, along with their overlap (A ∩ B), doesn't lead to overcounting possibilities. The union (∪) signifies "either A or B or both."

Example: Imagine you're testing a feature that allows users to upload files, and you're interested in calculating the probability of encountering specific scenarios during testing. The events are:

- A: User uploads a valid file type (e.g., PDF, DOCX)
- B: User uploads a file larger than 10MB

You want to ensure you cover both valid and invalid file uploads, considering both size and type.

- P(A ∪ B): The probability of encountering either a valid file type, a file exceeding 10MB, or both.
- P(A): The probability of encountering a valid file type, regardless of size.
- P(B): The probability of encountering a file larger than 10MB, regardless of type.
- P(A ∩ B): The probability of encountering a file that is both valid and larger than 10MB (the overlap).

4. Independence

P(B | A) = P(B) when A and B are independent, meaning that knowing event A doesn't change the probability of event B. Note that independence is not the same as the events being disjoint: if A ∩ B = Ø (the empty set), the events can never occur together, and P(B | A) is 0 rather than P(B). True independence is often not the case in complex software systems, but it helps simplify calculations when events genuinely don't influence each other.

Example: Imagine testing two independent modules. Assuming no interaction, P(error in module 1 | error in module 2) = P(error in module 1), as knowing an error occurred in module 2 doesn't influence the probability of an error in module 1.

Application to Risk Assessment

Suppose a component relies on an external service. We can calculate the probability of the component failing given the external service is unavailable. This conditional probability helps assess the overall system risk and prioritize testing efforts towards scenarios with higher potential impact.

Application to Test Case Prioritization

Consider complex systems with numerous possible error states. We can estimate the conditional probability of encountering specific errors given certain user inputs or system configurations. This allows testers to prioritize test cases based on the likelihood of triggering critical errors, optimizing testing efficiency.

Application to Performance Testing

Performance bottlenecks often manifest under specific loads. We can use conditional probability to estimate the likelihood of performance degradation given concurrent users or specific data sizes. This targeted testing approach helps pinpoint performance issues that occur under realistic usage conditions.

Beyond the Examples

These are just a few examples. Conditional probability has wider applications in areas like:

- Mutation testing: Estimating the probability of a test case revealing a mutation given its specific coverage criteria.
- Statistical testing: Analyzing hypothesis testing results and p-values in the context of specific assumptions and data sets.
- Machine learning testing: Evaluating the conditional probability of model predictions being wrong under specific input conditions.

Remember:

- Choosing the right "given" conditions is crucial for meaningful results.
- Conditional probability requires understanding dependencies between events in our software system.
- Combining conditional probability with other testing techniques (e.g., combinatorial testing) can further enhance testing effectiveness.

Bayes' Theorem

The definition of conditional probability provides the foundation for understanding the relationship between events. Bayes' theorem builds upon this foundation by allowing us to incorporate additional information to refine our understanding in a dynamic way. It allows us to dynamically update our beliefs about the likelihood of events (e.g., bugs, crashes) based on new evidence (e.g., test results, user reports). This dynamic capability unlocks numerous applications for our testing approach.

Demystifying Bayes' Theorem: Beyond the Formula

Imagine we suspect a specific functionality (event B) might harbor a bug. Based on our current understanding and past experiences (prior probability), we assign a certain likelihood to this event. Now, we conduct a series of tests (evidence A) designed to uncover the bug. Bayes' theorem empowers us to leverage the results of these tests to refine our belief about the bug's existence (posterior probability). It essentially asks: "Given that I observed evidence A (test results), how does it affect the probability of event B (bug) being true?"

While the formula, P(B | A) = [ P(A | B) * P(B) ] / P(A), captures the essence of the calculation, a deeper understanding lies in the interplay of its components:

- P(B | A): Posterior probability. This represents the updated probability of event B (bug) given evidence A (test results). This is what we ultimately seek to determine.
- P(A | B): Likelihood. This signifies the probability of observing evidence A (test results) if event B (bug) is actually true. In simpler terms, it reflects how effective our tests are in detecting the bug.
- P(B): Prior probability. This represents our initial belief about the likelihood of event B (bug) occurring, based on our prior knowledge and experience with similar functionalities.
- P(A): Total probability of evidence A. This encompasses the probability of observing evidence A (test results) regardless of whether event B (bug) is present or not. It accounts for the possibility of the test results occurring even if there's no bug.

Visualizing the Power of Bayes' Theorem

Imagine a scenario where we suspect a memory leak (event B) in a specific code change (A). Based on past experiences, we might assign a prior probability of 0.1 (10%) to this event. Now, we conduct tests (evidence A) that are known to be 80% effective in detecting such leaks (P(A | B) = 0.8) but occasionally yield a positive result even in the absence of a leak (a false-positive rate of P(A | ~B) = 0.05). The total probability of a positive test result is then:

P(A) = P(A | B) * P(B) + P(A | ~B) * P(~B) = 0.8 * 0.1 + 0.05 * 0.9 = 0.125

Applying Bayes' theorem with these values:

P(B | A) = [0.8 * 0.1] / 0.125 = 0.64

This translates to a posterior probability of 64% for the memory leak existing, given the observed test results. This significant increase from the initial 10% prior probability highlights the power of Bayes' theorem in updating beliefs based on new evidence.

Application to Test Effectiveness Analysis

Bayes' theorem can be a useful tool for analyzing the effectiveness of individual test cases and optimizing our testing resources. Let's delve deeper into this application:

1. Gathering Data

- Identify known bugs (B): Compile a list of bugs that have been identified and fixed in our system.
- Track test case execution: Record which test cases (A) were executed for each bug and whether they successfully detected the bug.

2. Calculating Likelihood

For each test case-bug pair (A, B), calculate the likelihood (P(A | B)). This represents the probability of the test case (A) detecting the bug (B) if the bug is actually present. We can estimate this likelihood by analyzing historical data on how often each test case successfully identified the specific bug or similar bugs in the past.

3. Estimating Prior Probability

Assign a prior probability (P(B)) to each bug (B). This represents our initial belief about the likelihood of the bug existing in the system before any new evidence is considered. It can be based on factors like the bug's severity, the code complexity of the affected area, or historical data on similar bug occurrences.

4. Applying Bayes' Theorem

For each test case, use the calculated likelihood (P(A | B)), the prior probability of the bug (P(B)), and the total probability of observing the test result (P(A)) to estimate the posterior probability (P(B | A)). This posterior probability represents the updated probability of the bug existing given the result of that specific test case.

5. Interpreting Results and Taking Action

- High posterior probability: If the posterior probability is high, it suggests the test case is effective in detecting the bug. Consider keeping this test case in the suite.
- Low posterior probability: If the posterior probability is low, it indicates the test case is unlikely to detect the bug. We might consider refactoring the test case to improve its ability to detect the bug, or removing it if it consistently yields low posterior probabilities for various bugs, since it might be redundant or ineffective.

Example

Imagine we have a test case (A) that has successfully detected a specific bug (B) in 70% of past occurrences. For illustrative purposes, we assign a sample prior probability of 20% to the bug existing in a new code change. Applying Bayes' theorem:

P(B | A) = [0.7 * 0.2] / P(A)

Since P(A) depends on various factors and might not be readily available, it is often ignored for comparative analysis between different test cases. There are three main reasons for this. The first is normalization: P(A) represents the overall probability of observing a specific test result, regardless of whether the bug is present or not, and this value can be influenced by factors beyond the specific test case being evaluated (e.g., overall test suite design, system complexity). The second reason is the focus on relative performance. When comparing the effectiveness of different test cases in identifying the same bug, the relative change in the posterior probability (P(B | A)) is crucial; this change signifies how much each test case increases our belief in the bug's presence compared to the prior probability (P(B)). The third reason is simplification. Ignoring P(A) simplifies the calculation and allows us to focus on the relative impact of each test case on the posterior probability. As long as all test cases are subjected to the same denominator (P(A)), their relative effectiveness can be compared based solely on their posterior probabilities.

By calculating the posterior probability for multiple test cases targeting the same bug, we can:

- Identify the most effective test cases with the highest posterior probabilities.
- Focus our testing efforts on these high-performing tests, optimizing resource allocation and maximizing bug detection capabilities.

Remember:

- The accuracy of this analysis relies on the quality and completeness of our data. Continuously update our data as we encounter new bugs and test results.
- Bayes' theorem provides valuable insights, but it shouldn't be the sole factor in test case selection. Consider other factors like test coverage and risk assessment for a holistic approach.

Wrapping Up

Probability is a powerful tool for our testing activities. This article starts with probability basics, continues with conditional probability, and finishes with Bayes' theorem. This exploration of probability provides a solid foundation to gain deeper insights into software behavior, optimize testing efforts, and ultimately contribute to building more reliable and robust software. Software testing is about predicting, preventing, and mitigating software risks. The journey of software testing is a continuous pursuit of knowledge and optimization, and probability remains our faithful companion on this exciting path. Remember, it's not just about the formulas: it's about how we apply them to better understand our software.
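To make the Bayes update described above concrete, here is a small, self-contained sketch in Java (the class and method names are mine, not from the article). It reproduces the memory-leak example: a 10% prior, an 80% detection rate, and a 5% false-positive rate yield a posterior of roughly 64%.

```java
// Minimal sketch of the Bayes update described above.
// prior = P(B), sensitivity = P(A | B), falsePositiveRate = P(A | ~B).
public final class BayesUpdate {

    static double posterior(double prior, double sensitivity, double falsePositiveRate) {
        // Total probability of the evidence: P(A) = P(A|B)*P(B) + P(A|~B)*P(~B)
        double evidence = sensitivity * prior + falsePositiveRate * (1 - prior);
        // Bayes' theorem: P(B|A) = P(A|B)*P(B) / P(A)
        return sensitivity * prior / evidence;
    }

    public static void main(String[] args) {
        // Memory-leak example: prior 0.1, sensitivity 0.8, false-positive rate 0.05.
        System.out.printf("Posterior probability of the leak: %.2f%n", posterior(0.1, 0.8, 0.05));
    }
}
```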
The cost of services is on everybody's mind right now, with interest rates rising, economic growth slowing, and organizational budgets increasingly feeling the pinch. But I hear a special edge in people's voices when it comes to their observability bill, and I don't think it's just about the cost of goods sold. I think it's because people are beginning to correctly intuit that the value they get out of their tooling has become radically decoupled from the price they are paying. In the happiest cases, the price you pay for your tools is "merely" rising at a rate several times faster than the value you get out of them. But that's actually the best-case scenario. For an alarming number of people, the value they get actually decreases as their bill goes up.

Observability 1.0 and the Cost Multiplier Effect

Are you familiar with this chestnut? "Observability has three pillars: metrics, logs, and traces." This isn't exactly true, but it's definitely true of a particular generation of tools; one might even say it's definitionally true of a particular generation of tools. Let's call it "observability 1.0."

From an evolutionary perspective, you can see how we got here. Everybody has logs... so we spin up a service for log aggregation. But logs are expensive, and everybody wants dashboards... so we buy a metrics tool. Software engineers want to instrument their applications... so we buy an APM tool. We start unbundling the monolith into microservices, and pretty soon, we can't understand anything without traces... so we buy a tracing tool. The front-end engineers point out that they need sessions and browser data... so we buy a RUM tool. On and on it goes.

Logs, metrics, traces, APM, RUM. You're now paying to store telemetry five different ways, in five different places, for every single request. And a 5x multiplier is on the modest side of the spectrum, given how many companies pay for multiple overlapping tools in the same category. You may also be collecting:

- Profiling data
- Product analytics
- Business intelligence data
- Database monitoring/query profiling data
- Mobile app telemetry
- Behavioral analytics
- Crash reporting
- Language-specific profiling data
- Stack traces
- CloudWatch or hosting provider metrics

...and so on. So, how many times are you paying to store data about your user requests? What's your multiplier? (If you have one consolidated vendor bill, this may require looking at your itemized bill.)

There are many types of tools, each gathering slightly different data for a slightly different use case, but underneath the hood, there are really only three basic data types: metrics, unstructured logs, and structured logs. Each of these has its own distinctive trade-offs when it comes to how much they cost and how much value you can get out of them.

Metrics

Metrics are the great-granddaddy of telemetry formats: tiny, fast, and cheap. A "metric" consists of a single number, often with tags appended. All of the context of the request gets discarded at write time; each individual metric is emitted separately. This means you can never correlate one metric with another from the same request, or select all the metrics for a given request ID, user, or app ID, or ask arbitrary new questions about your metrics data. Metrics-based tools include vendors like Datadog and open-source projects like Prometheus. RUM tools are built on top of metrics to understand browser user sessions; APM tools are built on top of metrics to understand application performance.
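As a concrete illustration of how little survives the write path, here is a minimal sketch of emitting custom metrics, using Micrometer as one example of a metrics library (the library choice and all names are my assumptions, not the author's). Only a number and a few low-cardinality tags are recorded; the request ID, the user, and the rest of the context are gone.

```java
// Sketch: a custom counter and timer for a hypothetical checkout endpoint.
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;

public final class CheckoutMetrics {

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // One increment per request: no request ID, no user ID, no high-cardinality fields.
        registry.counter("checkout.requests", "status", "ok", "region", "eu-west-1")
                .increment();

        // Durations are folded into a pre-aggregated distribution, not kept as raw events.
        registry.timer("checkout.latency", "region", "eu-west-1")
                .record(Duration.ofMillis(183));
    }
}
```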
When you set up a metrics tool, it generally comes prepopulated with a bunch of basic metrics, but the useful ones are typically the custom metrics you emit from your application. Your metrics bill is usually dominated by the cost of these custom metrics. At a minimum, your bill goes up linearly with the number of custom metrics you create. This is unfortunate because to restrain your bill from unbounded growth, you have to regularly audit your metrics, do your best to guess which ones are going to be useful in the future, and prune any you think you can afford to go without. Even in the hands of experts, these tools require significant oversight.

Linear cost growth is the goal, but it's rarely achieved. The cost of each metric varies wildly depending on how you construct it, what the values are, how often it gets hit, etc. I've seen a single custom metric cost $30k per month. You probably have dozens of custom metrics per service, and it's almost impossible to tell how much each of them costs you. Metrics bills tend to be incredibly opaque (possibly by design).

Nobody can understand their software or their systems with a metrics tool alone because the metric is extremely limited in what it can do. No context, no cardinality, no strings... only basic static dashboards. For richer data, we must turn to logs.

Unstructured Logs

You can understand much more about your code with logs than you can with metrics. Logs are typically emitted multiple times throughout the execution of the request, with one or a small number of nouns per log line plus the request ID. Unstructured logs are still the default, although this is slowly changing. The cost of unstructured logs is driven by a few things:

- Write amplification: If you want to capture lots of rich context about the request, you need to emit a lot of log lines. If you are printing out just 10 log lines per request, per service, and you have half a dozen services, that's 60 log events for every request.
- Noisiness: It's extremely easy to accidentally blow up your log footprint yet add no value, e.g., by putting a print statement inside a loop instead of outside the loop. Here, the usefulness of the data goes down as the bill shoots up.
- Constraints on physical resources: Due to the write amplification of log lines per request, it's often physically impossible to log everything you want to log for all requests or all users; it would saturate your NIC or disk. Therefore, people tend to use blunt instruments to blindly slash the log volume: log levels, consistent hashes, and dumb sample rates.

When you emit multiple log lines per request, you end up duplicating a lot of raw data; sometimes, over half the bits are consumed by request ID, process ID, and timestamp. This can be quite meaningful from a cost perspective.

All of these factors can be annoying. But the worst thing about unstructured logs is that the only thing you can do to query them is a full-text search. The more data you have, the slower it becomes to search that data, and there's not much you can do about it. Searching your logs over any meaningful length of time can take minutes or even hours, which means experimenting and looking around for unknown unknowns is prohibitively time-consuming. You have to know what to look for in order to find it. Once again, as your logging bill goes up, the value goes down.

Structured Logs

Structured logs are gaining adoption across the industry, especially as OpenTelemetry picks up steam.
The nice thing about structured logs is that you can actually do things with the data other than slow, dumb string searches. If you've structured your data properly, you can perform calculations! Compute percentiles! Generate heatmaps! Tools built on structured logs are so clearly the future. But just taking your existing logs and adding structure isn't quite good enough. If all you do is stuff your existing log lines into key/value pairs, the problems of amplification, noisiness, and physical constraints remain unchanged; you can just search more efficiently and do some math with your data.

There are a number of things you can and should do to your structured logs in order to use them more effectively and efficiently. In order of achievability:

- Instrument your code using the principles of canonical logs, which collect all the vital characteristics of a request into one wide, dense event. It is difficult to overstate the value of doing this, for reasons of usefulness and usability as well as cost control.
- Add trace IDs and span IDs so you can trace your code using the same events instead of having to use an entirely separate tool.
- Feed your data into a columnar storage engine so you don't have to predefine a schema or indexes, deciding in advance which dimensions you will be able to search or compute on.
- Use a storage engine that supports high cardinality with an explorable interface.

If you go far enough down this path of enriching your structured events, instrumenting your code with the right data, and displaying it in real time, you will reach an entirely different set of capabilities, with a cost model so distinct it can only be described as "observability 2.0." More on that in a second.

Ballooning Costs Are Baked Into Observability 1.0

To recap, high costs are baked into the observability 1.0 model. Every pillar has a price. You have to collect and store your data, and pay to store it, again and again and again for every single use case. Depending on how many tools you use, your observability bill may be growing at a rate 3x faster than your traffic is growing, or 5x, or 10x, or even more.

It gets worse. As your costs go up, the value you get out of your tools goes down. Your logs get slower and slower to search. You have to know what you're searching for in order to find it. You have to use a blunt-force sampling technique to keep the log volume from blowing up. Any time you want to be able to ask a new question, you first have to commit new code and deploy it. You have to guess which custom metrics you'll need and which fields to index in advance. As the volume goes up, your ability to find a needle in the haystack (any unknown-unknowns) goes down commensurately.

And nothing connects any of these tools. You cannot correlate a spike in your metrics dashboard with the same requests in your logs, nor can you trace one of the errors. It's impossible. If your APM and metrics tools report different error rates, you have no way of resolving this confusion. The only thing connecting any of these tools is the intuition and straight-up guesses made by your most senior engineers. This means that the cognitive costs are immense, and your bus factor risks are very real. The most important connective data in your system, connecting metrics with logs and logs with traces, exists only in the heads of a few people. At the same time, the engineering overhead required to manage all these tools (and their bills) rises inexorably.
With metrics, an engineer needs to spend time auditing your metrics, tracking people down to fix poorly constructed metrics, and reaping those that are too expensive or don't get used. With logs, an engineer needs to spend time monitoring the log volume, watching for spammy or duplicate log lines, pruning or consolidating them, and choosing and maintaining indexes. But all this time spent wrangling observability 1.0 data types isn't even the costliest part. The most expensive part is the unseen cost inflicted on your engineering organization as development slows down and tech debt piles up due to low visibility and, thus, low confidence.

Is there an alternative? Yes.

The Cost Model of Observability 2.0 Is Very Different

Observability 2.0 has no three pillars; it has a single source of truth. Observability 2.0 tools are built on top of arbitrarily wide structured log events, also known as spans. From these wide, context-rich structured log events, you can derive the other data types (metrics, logs, or traces). Since there is only one data source, you can correlate and cross-correlate to your heart's content. You can switch fluidly back and forth between slicing and dicing, breaking down or grouping by events, and viewing them as a trace waterfall. You don't have to worry about cardinality or key space limitations. You also effectively get infinite custom metrics, since you can append as many as you want to the same events. Not only does your cost not go up linearly as you add more custom metrics, but your telemetry just gets richer and more valuable the more key-value pairs you add! Nor are you limited to numbers; you can add any and all types of data, including valuable high-cardinality fields like "App Id" or "Full Name."

Observability 2.0 has its own amplification factor to consider. As you instrument your code with more spans per request, the number of events you have to send (and pay for) goes up. However, you have some very powerful tools for dealing with this: you can perform dynamic head-based sampling or even tail-based sampling, where you decide whether or not to keep the event after it's finished, allowing you to capture 100% of slow requests and other outliers.

Engineering Time Is Your Most Precious Resource

But the biggest difference between observability 1.0 and 2.0 won't show up on any invoice. The difference shows up in your engineering team's ability to move quickly and with confidence. Modern software engineering is all about hooking up fast feedback loops. Observability 2.0 tooling is what unlocks the kind of fine-grained, exploratory experience you need in order to accelerate those feedback loops. Where observability 1.0 is about MTTR, MTTD, reliability, and operating software, observability 2.0 is what underpins the entire software development lifecycle, setting the bar for how swiftly you can build and ship software, find problems, and iterate on them. Observability 2.0 is about being in conversation with your code, understanding each user's experience, and building the right things.

Observability 2.0 isn't exactly cheap either, although it is often less expensive. But the key difference between o11y 1.0 and o11y 2.0 has never been that either is cheap; it's that with observability 2.0, when your bill goes up, the value you derive from your telemetry goes up too. You pay more money, and you get more out of your tools, as you should.
Note: Earlier, I said, “Nothing connects any of these tools.” If you are using a single unified vendor for your metrics, logging, APM, RUM, and tracing tools, this is not strictly true. Vendors like New Relic or Datadog now let you define certain links between your traces and metrics, which allows you to correlate between data types in a few limited, predefined ways. This is better than nothing! But it’s very different from the kind of fluid, open-ended correlation capabilities that we describe with o11y 2.0. With o11y 2.0, you can slice and dice, break down, and group by your complex data sets, then grab a trace that matches any specific set of criteria at any level of granularity. With o11y 1.0, you can define a metric up front, then grab a random exemplar of that metric, and that’s it. All the limitations of metrics still apply; you can’t correlate any metric with any other metric from that request, app, user, etc, and you certainly can’t trace arbitrary criteria. But it’s not nothing.
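To make the "wide structured event" idea from the Structured Logs section more tangible, here is a hand-rolled sketch in Java. It is not any particular vendor's API (the field names and values are invented for illustration); it simply shows one dense, high-cardinality event per request, carrying the trace and span IDs that let a single data source stand in for metrics, logs, and traces.

```java
// Sketch: one canonical, wide event per request instead of separate metrics/logs/traces.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

public final class CanonicalLogEvent {

    public static void main(String[] args) {
        Map<String, Object> event = new LinkedHashMap<>();

        // Correlation fields: the same event can be viewed as a trace span later.
        event.put("trace_id", UUID.randomUUID().toString());
        event.put("span_id", UUID.randomUUID().toString());

        // Request context, including high-cardinality fields that metrics tools cannot handle.
        event.put("service", "checkout");
        event.put("endpoint", "POST /payments");
        event.put("user_id", "user-48213");
        event.put("app_id", "mobile-ios-7.4.2");

        // Numeric fields can be aggregated on read: percentiles, heatmaps, breakdowns.
        event.put("duration_ms", 183);
        event.put("db_calls", 4);
        event.put("status_code", 502);
        event.put("error", "payment gateway timeout");

        // Emit the whole request context as a single structured line; in practice this
        // would be shipped to a columnar store (e.g., via OpenTelemetry) rather than stdout.
        System.out.println(event);
    }
}
```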
I started research for an article on how to add a honeytrap to a GitHub repo. The idea behind a honeypot weakness is that a hacker will follow through on it and make his/her presence known in the process. My plan was to place a GitHub personal access token in an Ansible vault protected by a weak password. Should an attacker crack the password and use the token to clone the private repository, a webhook would have been triggered, mailing a notification that the honeypot repo had been cloned and the password cracked. Unfortunately, GitHub seems not to allow webhooks to be triggered on cloning, as is the case for some of its higher-level actions.

This set me thinking that platforms as standalone systems are not designed with Dev(Sec)Ops integration in mind. DevOps engineers have to bite the bullet and always find ways to secure pipelines end to end. I, therefore, instead decided to investigate how to prevent code theft using tokens or private keys gained by nefarious means.

Prevention Is Better Than Detection

It is not best practice to keep secret material on hard drives in the belief that root-only access is sufficient security. Any system administrator or hacker who is elevated to root can view the secret in the open. Secrets should, rather, be kept inside Hardware Security Modules (HSMs) or, at the very least, a secret manager. Furthermore, tokens and private keys should never be passed in as command-line arguments since they might be written to a log file.

A way to solve this problem is to make use of a super-secret master key to initiate proceedings and finalize using short-lived lesser keys. This is similar to the problem of sharing the first key in applied cryptography. Once the first key has been agreed upon, successive transactions can be secured using session keys. It goes without saying that the first key has to be stored in a Hardware Security Module, and all operations against it have to happen inside an HSM. I decided to try out something similar when Ansible clones private Git repositories. Although I will illustrate at the hand of GitHub, I am pretty sure something similar can be set up for other Git platforms as well.

First Key

GitHub personal access tokens can be used to perform a wide range of actions on your GitHub account and its repositories. They authenticate and authorize from both the command line and the GitHub API. A token can clearly serve as the first key. Personal access tokens are created by clicking your avatar in the top right and selecting Settings. A left nav panel should appear, from where you select Developer settings. The menu for personal access tokens will display, where you can create the token. I created a classic token and gave it the following scopes/permissions: repo, admin:public_key, user, and admin:gpg_key.

Take care to store the token in a reputable secret manager from where it can be copied and pasted when the Ansible play asks for it at the start. This secret manager should clear the copy buffer after a few seconds to prevent attacks utilizing attention diversion.

```yaml
vars_prompt:
  - name: github_token
    prompt: "Enter your github personal access token?"
    private: true
```

Establishing the Session

GitHub deployment keys give access to private repositories. They can be created by an API call or from the repo's top menu by clicking on Settings. With the personal access token as the first key, a deployment key can finish the operation as the session key.
Specifically, Ansible authenticates itself using the token, creates the deployment key, authorizes the clone, and deletes the key immediately afterward. The code from my previous post relied on adding Git URLs that contain the tokens to the Ansible vault. This has now been improved to use temporary keys, as envisioned in this post. An Ansible role provided by Asif Mahmud has been amended for this purpose, as can be seen in the usual GitHub repo. The critical snippets are:

- name: Add SSH public key to GitHub account
  ansible.builtin.uri:
    url: "https://api.{{ git_server_fqdn }}/repos/{{ github_account_id }}/{{ repo }}/keys"
    validate_certs: yes
    method: POST
    force_basic_auth: true
    body:
      title: "{{ key_title }}"
      key: "{{ key_content.stdout }}"
      read_only: true
    body_format: json
    headers:
      Accept: application/vnd.github+json
      X-GitHub-Api-Version: 2022-11-28
      Authorization: "Bearer {{ github_access_token }}"
    status_code:
      - 201
      - 422
  register: create_result

The GitHub API is used to add the deploy key to the private repository. Note the use of the access token, typed in at the start of the play, to authenticate and authorize the request.

- name: Clone the repository
  shell: |
    GIT_SSH_COMMAND="ssh -i {{ key_path }} -v -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" {{ git_executable }} clone git@{{ git_server_fqdn }}:{{ github_account_id }}/{{ repo }}.git {{ clone_dest }}

- name: Switch branch
  shell: "{{ git_executable }} checkout {{ branch }}"
  args:
    chdir: "{{ clone_dest }}"

The repo is cloned, followed by a switch to the required branch.

- name: Delete SSH public key
  ansible.builtin.uri:
    url: "https://api.{{ git_server_fqdn }}/repos/{{ github_account_id }}/{{ repo }}/keys/{{ create_result.json.id }}"
    validate_certs: yes
    method: DELETE
    force_basic_auth: true
    headers:
      Accept: application/vnd.github+json
      X-GitHub-Api-Version: 2022-11-28
      Authorization: "Bearer {{ github_access_token }}"
    status_code:
      - 204

Deletion of the deployment key happens directly after the clone and switch, again via the API.

Conclusion

The short life of the deployment key enhances the security of the DevOps pipeline tremendously. Only the token has to be kept secure at all times, as is the case for any first key. Ideally, you should integrate Ansible with a compatible HSM platform. I thank Asif Mahmud for the code that was amended to illustrate the concept of using temporary session keys when cloning private Git repositories.
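As an aside, and purely as a hedged sketch for readers who want to try the same pattern outside Ansible, the short-lived deploy key lifecycle can be reproduced with a few lines of Python against the same GitHub REST endpoints used above. The owner, repository, and key path below are placeholders, and the personal access token is assumed to arrive via an environment variable populated from your secret manager.

import os
import subprocess
import requests

OWNER, REPO = "your-account", "your-private-repo"   # placeholders
KEY_PATH = "/tmp/ephemeral_deploy_key"              # throwaway keypair location
API = f"https://api.github.com/repos/{OWNER}/{REPO}/keys"
HEADERS = {
    "Accept": "application/vnd.github+json",
    "X-GitHub-Api-Version": "2022-11-28",
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",  # the first key
}

# Generate a throwaway keypair and register the public half as a read-only deploy key.
subprocess.run(["ssh-keygen", "-t", "ed25519", "-N", "", "-f", KEY_PATH], check=True)
with open(KEY_PATH + ".pub") as f:
    public_key = f.read().strip()
resp = requests.post(API, headers=HEADERS,
                     json={"title": "ephemeral clone key", "key": public_key, "read_only": True})
resp.raise_for_status()
key_id = resp.json()["id"]

try:
    # Clone over SSH using only the short-lived session key.
    env = dict(os.environ, GIT_SSH_COMMAND=f"ssh -i {KEY_PATH} -o StrictHostKeyChecking=no")
    subprocess.run(["git", "clone", f"git@github.com:{OWNER}/{REPO}.git"], env=env, check=True)
finally:
    # Delete the deploy key immediately after the clone, mirroring the Ansible role.
    requests.delete(f"{API}/{key_id}", headers=HEADERS).raise_for_status()
    os.remove(KEY_PATH)
    os.remove(KEY_PATH + ".pub")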
DevOps proposes Continuous Integration and Continuous Delivery (CI/CD) solutions for software project management. In CI/CD, the process of software development and operations falls into a cyclical feedback loop, which promotes not only innovation and improvements but also makes the product quickly adapt to the changing needs of the market. So, it becomes easy to cater to the needs of the customer and garner their satisfaction. The development team that adopts the culture becomes agile, flexible, and adaptive (called the Agile development team) in building an incremental quality product that focuses on continuous improvement and innovation. One of the key areas of CI/CD is to address changes. The evolution of software also has its effect on the database as well. Database change management primarily focuses on this aspect and can be a real hurdle in collaboration with DevOps practices which is advocating automation for CI/CD pipelines. Automating database change management enables the development team to stay agile by keeping database schema up to date as part of the delivery and deployment process. It helps to keep track of changes critical for debugging production problems. The purpose of this article is to highlight how database change management is an important part of implementing Continuous Delivery and recommends some processes that help streamline application code and database changes into a single delivery pipeline. Continuous Integration One of the core principles of an Agile development process is Continuous Integration. Continuous Integration emphasizes making sure that code developed by multiple members of the team is always integrated. It avoids the “integration hell” that used to be so common during the days when developers worked in their silos and waited until everyone was done with their pieces of work before attempting to integrate them. Continuous Integration involves independent build machines, automated builds, and automated tests. It promotes test-driven development and the practice of making frequent atomic commits to the baseline or master branch or trunk of the version control system. Figure 1: A typical Continuous Integration process The diagram above illustrates a typical Continuous Integration process. As soon as a developer checks in code to the source control system, it will trigger a build job configured in the Continuous Integration (CI) server. The CI Job will check out code from the version control system, execute a build, run a suite of tests, and deploy the generated artifacts (e.g., a JAR file) to an artifact repository. There may be timed CI jobs to deploy the code to the development environment, push details out to a static analysis tool, run system tests on the deployed code, or any automated process that the team feels is useful to ensure that the health of the codebase is always maintained. It is the responsibility of the Agile team to make sure that if there is any failure in any of the above-mentioned automated processes, it is promptly addressed and no further commits are made to the codebase until the automated build is fixed. Continuous Delivery Continuous Delivery takes the concept of Continuous Integration a couple of steps further. In addition to making sure that different modules of a software system are always integrated, it also makes sure that the code is always deployable (to production). 
This means that in addition to having an automated build and a completely automated test suite, there should be an automated process of delivering the software to production. Using the automated process, it should be possible to deploy software on short notice, typically within minutes, with the click of a button. Continuous Delivery is one of the core principles of DevOps and offers many benefits including predictable deploys, reduced risk while introducing new features, shorter feedback cycles with the customer, and overall higher quality of software. Figure 2: A typical Continuous Delivery process The above diagram shows a typical Continuous Delivery process. Please note that the above-illustrated Continuous Delivery process assumes that a Continuous Integration process is already in place. The above diagram shows 2 environments: e.g., User Acceptance Test (UAT) and production. However, different organizations may have multiple staging environments (Quality Assurance or QA, load testing, pre-production, etc.) before the software makes it to the production environment. However, it is the same codebase, and more precisely, the same version of the codebase that gets deployed to different environments. Deployment to all staging environments and the production environment are performed through the same automated process. There are many tools available to manage configurations (as code) and make sure that deploys are automatic (usually self-service), controlled, repeatable, reliable, auditable, and reversible (can be rolled back). It is beyond the scope of this article to go over those DevOps tools, but the point here is to stress the fact that there must be an automated process to release software to production on demand. Database Change Management Is the Bottleneck Agile practices are pretty much mainstream nowadays when it comes to developing application code. However, we don’t see as much adoption of agile principles and continuous integration in the area of database development. Almost all enterprise applications have a database involved and thus project deliverables would involve some database-related work in addition to application code development. Therefore, slowness in the process of delivering database-related work - for example, a schema change - slows down the delivery of an entire release. In this article, we would assume the database to be a relational database management system. The processes would be very different if the database involved is a non-relational database like a columnar database, document database, or a database storing data in key-value pairs or graphs. Let me illustrate this scenario with a real example: here is this team that practices Agile software development methodologies. They follow a particular type of Agile called Scrum, and they have a 2-week Sprint. One of the stories in the current sprint is the inclusion of a new field in the document that they interchange with a downstream system. The development team estimated that the story is worth only 1 point when it comes to the development of the code. It only involves minor changes in the data access layer to save the additional field and retrieve it later when a business event occurs and causes the application to send out a document to a downstream system. However, it requires the addition of a new column to an existing table. 
Had there been no database changes involved, the story could have been easily completed in the current sprint, but since there is a database change involved, the development team doesn’t think it is doable in this sprint. Why? Because a schema change request needs to be sent to the Database Administrators (DBA). The DBAs will take some time to prioritize this change request and rank this against other change requests that they received from other application development teams. Once the DBAs make the changes in the development database, they will let the developers know and wait for their feedback before they promote the changes to the QA environment and other staging environments, if applicable. Developers will test changes in their code against the new schema. Finally, the development team will closely coordinate with the DBAs while scheduling delivery of application changes and database changes to production. Figure 3: Manual or semi-automated process in delivering database changes Please note in the diagram above that the process is not triggered by a developer checking in code and constitutes a handoff between two teams. Even if the deployment process on the database side is automated, it is not integrated with the delivery pipeline of application code. The changes in the application code are directly dependent on the database changes, and they together constitute a release that delivers a useful feature to the customer. Without one change, the other change is not only useless but could potentially cause regression. However, the lifecycle of both of these changes is completely independent of each other. The fact that the database and codebase changes follow independent life cycles and the fact that there are handoffs and manual checkpoints involved, the Continuous Delivery process, in this example, is broken. Recommendations To Fix CI/CD for Database Changes In the following sections, we will explain how this can be fixed and how database-related work including data modeling and schema changes, etc., can be brought under the ambit of the Continuous Delivery process. DBAs Should Be a Part of the Cross-Functional Agile Team Many organizations have their DBAs split into broadly two different types of roles based on whether they help to build a database for application development teams or maintain production databases. The primary responsibility of a production DBA is to ensure the availability of production databases. They monitor the database, take care of upgrades and patches, allocate storage, perform backup and recovery, etc. A development DBA, on the other hand, works closely with the application development team and helps them come up with data model design, converts a logical data model into a physical database schema, estimates storage requirements, etc. To bring database work and application development work into one single delivery pipeline, it is almost necessary that the development DBA be a part of the development team. Full-stack developers in the development team with good knowledge of the database may also wear the hat of a development DBA. Database as Code It is not feasible to have database changes and application code integrated into a single delivery pipeline unless database changes are treated the same way as application code. This necessitates scripting every change in the database and having them version-controlled. It should then be possible to stand up a new instance of the database automatically from the scripts on demand. 
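As a minimal sketch of what standing up a new database instance from version-controlled scripts can look like, assuming a hypothetical db/ directory of numbered SQL files kept in the application repository, the following applies them in order to a throwaway SQLite database; a real project would target its actual RDBMS from its build tool.

import sqlite3
from pathlib import Path

def build_database_from_scripts(db_file: str, scripts_dir: str = "db") -> None:
    """Recreate a database instance from version-controlled SQL scripts."""
    Path(db_file).unlink(missing_ok=True)          # drop: start from a clean instance
    conn = sqlite3.connect(db_file)
    try:
        # Apply scripts in a deterministic order, e.g. 001_tables.sql, 002_constraints.sql,
        # 003_reference_data.sql (these file names are hypothetical).
        for script in sorted(Path(scripts_dir).glob("*.sql")):
            conn.executescript(script.read_text())
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    build_database_from_scripts("local_sandbox.db")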
If we had to capture database objects as code, we would first need to classify them and evaluate each of those types to see if and how they need to be captured as script (code). Following is a broad classification:

Database Structure
This is basically the definition of how stored data will be structured in the database and is also known as a schema. It includes table definitions, views, constraints, indexes, and types. The data dictionary may also be considered a part of the database structure.

Stored Code
These are very similar to application code, except that they are stored in the database and are executed by the database engine. They include stored procedures, functions, packages, triggers, etc.

Reference Data
These are usually a set of permissible values that are referenced from other tables that store business data. Ideally, tables representing reference data have very few records and don’t change much over the life cycle of an application. They may change when some business process changes but don’t usually change during the normal course of business.

Application Data or Business Data
These are the data that the application records during the normal course of business. The main purpose of any database system is to store these data. The other three types of database objects exist only to support these data.

Out of the above four types of database objects, the first three can and should be captured as scripts and stored in a version control system.

Type | Example | Scripted (stored like code)?
Database Structure | Schema objects like tables, views, constraints, indexes, etc. | Yes
Stored Code | Triggers, procedures, functions, packages, etc. | Yes
Reference Data | Codes, lookup tables, static data, etc. | Yes
Business/Application Data | Data generated from day-to-day business operations | No
Table 1: What types of database objects can and cannot be scripted

As shown in the table above, business data or application data is the only type that won’t be scripted or stored as code. All rollbacks, revisions, archival, etc., are handled by the database itself; however, there is one exception. When a schema change forces data migration - say, for example, populating a new column or moving data from a base table to a normalized table - that migration script should be treated as code and should follow the same life cycle as the schema change.

Let's take an example of a very simple data model to illustrate how scripts may be stored as code. This model is so simple and so often used in examples that it may be considered the “Hello, World!” of data modeling.

Figure 4: Example model with tables containing business data and ones containing reference data

In the model above, a customer may be associated with zero or more addresses, like a billing address, shipping address, etc. The table AddressType stores the different types of addresses, like billing, shipping, residential, and work. The data stored in AddressType can be considered reference data, as it is not supposed to grow during day-to-day business operations. The other tables, on the other hand, contain business data; as the business finds more and more customers, those tables will continue to grow.

Example Scripts (Tables, Constraints, Reference Data): an illustrative reconstruction follows at the end of this section. We won’t get into any more detail or cover each type of database object. The purpose of the examples is to illustrate that all database objects, except for business data, can and should be captured in SQL scripts.
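As a minimal sketch of what those example scripts could look like for the Figure 4 model (all table and column names here are assumptions, not the article's actual DDL), the following Python snippet embeds the table, constraint, and reference data scripts as SQL strings and applies them to an in-memory SQLite database so it can be run as-is:

import sqlite3

TABLES = """
CREATE TABLE AddressType (
    address_type_id   INTEGER PRIMARY KEY,
    description       TEXT NOT NULL
);
CREATE TABLE Customer (
    customer_id       INTEGER PRIMARY KEY,
    first_name        TEXT NOT NULL,
    last_name         TEXT NOT NULL
);
CREATE TABLE CustomerAddress (
    customer_id       INTEGER NOT NULL,
    address_type_id   INTEGER NOT NULL,
    street            TEXT,
    city              TEXT,
    -- Constraints: foreign keys tie addresses to customers and address types
    FOREIGN KEY (customer_id)     REFERENCES Customer (customer_id),
    FOREIGN KEY (address_type_id) REFERENCES AddressType (address_type_id)
);
"""

REFERENCE_DATA = """
INSERT INTO AddressType (address_type_id, description) VALUES
    (1, 'Billing'), (2, 'Shipping'), (3, 'Residential'), (4, 'Work');
"""

conn = sqlite3.connect(":memory:")      # any target database would do
conn.executescript(TABLES)              # database structure and constraints
conn.executescript(REFERENCE_DATA)      # reference data, versioned like code
print(conn.execute("SELECT COUNT(*) FROM AddressType").fetchone())  # (4,)
conn.close()

In a real project, each of these strings would live in its own .sql file under version control, so the same scripts can build a local sandbox, the CI schema, and the migration baseline.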
Version Control Database Artifacts in the Same Repository as Application Code Keeping the database artifacts in the same repository of the version control system as the application code offers a lot of advantages. They can be tagged and released together since, in most cases, a change in database schema also involves a change in application code, and they together constitute a release. Having them together also reduces the possibility of application code and the database getting out of sync. Another advantage is just plain readability. It is easier for a new team member to come up to speed if everything related to a project is in a single place. Figure 5: Example structure of a Java Maven project containing database code The above screenshot shows how database scripts can be stored alongside application code. Our example is a Java application, structured as a Maven project. The concept is however agnostic of what technology is used to build the application. Even if it was a Ruby or a .NET application, we would store the database objects as scripts alongside application code to let CI/CD automation tools find them in one place and perform necessary operations on them like building the schema from scratch or generating migration scripts for a production deployment. Integrate Database Artifacts Into the Build Scripts It is important to include database scripts in the build process to ensure that database changes go hand in hand with application code in the same delivery pipeline. Database artifacts are usually SQL scripts of some form and all major build tools support executing SQL scripts either natively or via plugins. We won’t get into any specific build technology but will list down the tasks that the build would need to perform. Here we are talking about builds in local environments or CI servers. We will talk about builds in staging environments and production at a later stage. The typical tasks involved are: Drop Schema Create Schema Create Database Structure (or schema objects): They include tables, constraints, indexes, sequences, and synonyms. Deploy stored code, like procedures, functions, packages, etc. Load reference data Load TEST data If the build tool in question supports build phases, this will typically be in the phase before integration tests. This ensures that the database will be in a stable state with a known set of data loaded. There should be sufficient integration tests that will cause the build to fail if the application code goes out of sync with the data model. This ensures that the database is always integrated with the application code: the first step in achieving a Continuous Delivery model involving database change management. Figure 6: Screenshot of code snippet showing a Maven build for running database scripts The above screenshot illustrates the usage of a Maven plugin to run SQL scripts. It drops the schema, recreates it, and runs all the DDL scripts to create tables, constraints, indexes, sequences, and synonyms. Then it deploys all the stored code into the database and finally loads all reference data and test data. Refactor Data Model as Needed Agile methodology encourages evolutionary design over upfront design; however, many organizations that claim to be Agile shops, actually perform an upfront design when it comes to data modeling. There is a perception that schema changes are difficult to implement later in the game, and thus it is important to get it right the first time. 
If the recommendations made in the previous sections are made, like having an integrated team with developers and DBAs, scripting database changes, and version controlling them alongside application code, it won’t be difficult to automate all schema changes. Once the deployment and rollback of database changes are fully automated and there is a suite of automated tests in place, it should be easy to mitigate risks in refactoring schema. Avoid Shared Database Having a database schema shared by more than one application is a bad idea, but they still exist. There is even a mention of a “Shared Database” as an integration pattern in a famous book on enterprise integration patterns, Enterprise Integration Patterns by Gregor Holpe and Bobby Woolf. Any effort to bring application code and database changes under the same delivery pipeline won’t work unless the database truly belongs to the application and is not shared by other applications. However, this is not the only reason why a shared database should be avoided. "Shared Database" also causes tight coupling between applications and a multitude of other problems. Dedicated Schema for Every Committer and CI Server Developers should be able to work on their own sandboxes without the fear of breaking anything in a common environment like the development database instance; similarly, there should be a dedicated sandbox for the CI server as well. This follows the pattern of how application code is developed. A developer makes changes and runs the build locally, and if the build succeeds and all the tests pass, (s)he commits the changes. The sandbox could be either an independent database instance, typically installed locally on the developer’s machine, or it could be a different schema in a shared database instance. Figure 7: Developers make changes in their local environment and commit frequently As shown in the above diagram, each developer has their own copy of the schema. When a full build is performed, in addition to building the application, it also builds the database schema from scratch. It drops the schema, recreates it, and executes DDL scripts to load all schema objects like tables, views, sequences, constraints, and indexes. It creates objects representing stored code, like functions, procedures, packages, and triggers. Finally, it loads all the reference data and test data. Automated tests ensure that the application code and database object are always in sync. It must be noted that data model changes are less frequent than application code, so the build script should have the option to skip the database build for the sake of build performance. The CI build job should also be set up to have its own sandbox of the database. The build script performs a full build that includes building the application as well as building the database schema from scratch. It runs a suite of automated tests to ensure that the application itself and the database that it interacts with, are in sync. Figure 8: Revised CI process with integration of database build with build of application code Please note that the similarity of the process described in the above diagram with the one described in Figure 1. The build machine or the CI server contains a build job that is triggered by any commit to the repository. The build that it performs includes both the application build and the database build. The database scripts are now always integrated, just like application code. 
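To make the sandbox idea concrete, here is a minimal, hypothetical sketch: each committer (or CI job) rebuilds its own database file from the version-controlled scripts, and an environment variable lets the build skip the database rebuild when the data model has not changed. The names, paths, and variables are illustrative only, not a prescribed convention.

import getpass
import os
import sqlite3
from pathlib import Path

def rebuild_sandbox(scripts_dir: str = "db") -> str:
    """Rebuild a per-developer (or per-CI-job) database sandbox from scratch."""
    owner = os.environ.get("CI_JOB_ID") or getpass.getuser()   # dedicated sandbox per owner
    db_file = f"sandbox_{owner}.db"

    if os.environ.get("SKIP_DB_BUILD") == "1":                 # optional skip for build speed
        print(f"Skipping database build, reusing {db_file}")
        return db_file

    Path(db_file).unlink(missing_ok=True)                      # drop schema
    conn = sqlite3.connect(db_file)                            # recreate schema
    try:
        # DDL, stored code, reference data, and test data applied in order
        for script in sorted(Path(scripts_dir).glob("*.sql")):
            conn.executescript(script.read_text())
        conn.commit()
    finally:
        conn.close()
    return db_file

if __name__ == "__main__":
    print("Built:", rebuild_sandbox())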
Dealing With Migrations The process described above would build the database schema objects, stored code, reference data, and test data from scratch. This is all good for continuous integration and local environments. This process won’t work for the production database and even QA or UAT environments. The real purpose of any database is storing business data, and every other database object exists only to support business data. Dropping schema and recreating it from scripts is not an option for a database currently running business transactions. In this case, there is a need for scripting deltas, i.e., the changes that will transition the database structure from a known state (a particular release of software) to a desired state. The transition will also include any data migration. Schema changes may lead to a requirement to migrate data as well. For example, as a result of normalization, data from one table may need to be migrated to one or more child tables. In such cases, a script that transforms data from the parent table to the children should also be a part of the migration scripts. Schema changes may be scripted and maintained in the source code repository so that they are part of the build. These scripts may be hand-coded during active development, but there are tools available to automate that process as well. One such tool is Flyway, which can generate migration scripts for the transition of one state of schema to another state. Figure 9: Automation of schema migrations and rollback In the above picture, the left-hand side shows the current state of the database which is in sync with the application release 1.0.1 (the previous release). The right-hand side shows the desired state of the database in the next release. We have the state on the left-hand side captured and tagged in the version control system. The right-hand side is also captured in the version control system as the baseline, master branch, or trunk. The difference between the right-hand side and the left-hand side is what needs to be applied to the database in the staging environments and the production environment. The differences may be manually tracked and scripted, which is laborious and error-prone. The above diagram illustrates that tools like Flyway can automate the creation of such differences in the form of migration scripts. The automated process will create the following: Migration script (to transition the database from the prior release to the new release) Rollback script (to transition the database back to the previous release). The generated scripts will be tagged and stored with other deploy artifacts. This automation process may be integrated with the Continuous Delivery process to ensure repeatable, reliable, and reversible (ability to rollback) database changes. Continuous Delivery With Database Changes Incorporated Into It Let us now put the pieces together. There is a Continuous Integration process already in place that rebuilds the database along with the application code. We have a process in place that generates migration scripts for the database. These generated migration scripts are a part of the deployment artifacts. The DevOps tools will use these released artifacts to build any of the staging environments or the production environment. The deployment artifacts will also contain rollback scripts to support self-service rollback. 
If anything goes wrong, the previous version of the application may then be redeployed and the database rollback script shall be run to transition the database schema to the previous state that is in sync with the previous release of the application code. Figure 10: Continuous Delivery incorporating database changes The above diagram depicts a Continuous Delivery process that has database change management incorporated into it. This assumes that a Continuous Integration process is already there in place. When a UAT (or any other staging environment like TEST, QA, etc.) deployment is initiated, the automated processes take care of creating a tag in the source control repository, building application deployable artifacts from the tagged codebase, generating database migration scripts, assembling the artifacts and deploying. The deployment process includes the deployment of the application as well as applying migration scripts to the database. The same artifacts will be used to deploy the application to the production environment, following the approval process. A rollback would involve redeploying the previous release of the application and running the database rollback script. Tools Available in the Market The previous sections primarily describe how to achieve CI/CD in a project that involves database changes by following some processes but don’t particularly take into consideration any tools that help in achieving them. The above recommendations are independent of any particular tool. A homegrown solution can be developed using common automation tools like Maven or Gradle for build automation, Jenkins or TravisCI for Continuous Integration, and Chef or Puppet for configuration management; however, there are many tools available in the marketplace, that specifically deal with Database DevOps. Those tools may also be taken advantage of. Some examples are: Datical Redgate Liquibase Flyway Conclusion Continuous Integration and Continuous Delivery processes offer tremendous benefits to organizations, like accelerated time to market, reliable releases, and overall higher-quality software. Database change management is traditionally cautious and slow. In many cases, database changes involve manual processes and often cause a bottleneck to the Continuous Delivery process. The processes and best practices mentioned in this article, along with available tools in the market, should hopefully eliminate this bottleneck and help to bring database changes into the same delivery pipeline as application code.
Software development, like constructing any intricate masterpiece, requires a strong foundation. This foundation isn't just made of lines of code, but also of solid logic. Just as architects rely on the laws of physics, software developers use the principles of logic. This article showcases the fundamentals of four powerful pillars of logic, each offering unique capabilities to shape and empower creations of quality. Imagine these pillars as bridges connecting different aspects of quality in our code. Propositional logic, the simplest among them, lays the groundwork with clear-cut true and false statements, like the building blocks of your structure. Then comes predicate logic, a more expressive cousin, allowing us to define complex relationships and variables, adding intricate details and dynamic behaviors. But software doesn't exist in a vacuum — temporal logic steps in, enabling us to reason about the flow of time in our code, ensuring actions happen in the right sequence and at the right moments. Finally, fuzzy logic acknowledges the nuances of the real world, letting us deal with concepts that aren't always black and white, adding adaptability and responsiveness to our code. I will explore the basic strengths and weaknesses of each pillar giving quick examples in Python. Propositional Logic: The Building Blocks of Truth A proposition is an unambiguous sentence that is either true or false. Propositions serve as the fundamental units of evaluation of truth. They are essentially statements that can be definitively classified as either true or false, offering the groundwork for clear and unambiguous reasoning. They are the basis for constructing sound arguments and logical conclusions. Key Characteristics of Propositions Clarity: The meaning of a proposition should be unequivocal, leaving no room for interpretation or subjective opinions. For example, "The sky is blue" is a proposition, while "This movie is fantastic" is not, as it expresses personal preference. Truth value: Every proposition can be conclusively determined to be either true or false. "The sun is a star" is demonstrably true, while "Unicorns exist" is definitively false. Specificity: Propositions avoid vague or ambiguous language that could lead to confusion. "It's going to rain tomorrow" is less precise than "The current weather forecast predicts a 90% chance of precipitation tomorrow." Examples of Propositions The number of planets in our solar system is eight. (True) All dogs are mammals. (True) This object is made of wood. (Either true or false, depending on the actual object) Pizza is the best food ever. (Expresses an opinion, not a factual statement, and therefore not a proposition) It's crucial to understand that propositions operate within the realm of factual statements, not opinions or subjective impressions. Statements like "This music is beautiful" or "That painting is captivating" express individual preferences, not verifiable truths. By grasping the essence of propositions, we equip ourselves with a valuable tool for clear thinking and logical analysis, essential for various endeavors, from scientific exploration to quality coding and everyday life. Propositional logic has operations, expressions, and identities that are very similar (in fact, they are isomorphic) to set theory. Imagine logic as a LEGO set, where propositions are the individual bricks. Each brick represents a simple, declarative statement that can be either true or false. 
We express these statements using variables like p and q, and combine them with logical operators like AND (∧), OR (∨), NOT (¬), IF-THEN (→), and IF-AND-ONLY-IF (↔). Think of operators as the connectors that snap the bricks together, building more complex logical structures.

Strengths
Simplicity: Easy to understand and implement, making it a great starting point for logic applications. After all, simplicity is a cornerstone of quality.
Efficiency: Offers a concise way to represent simple conditions and decision-making in code.
Versatility: Applicable to various situations where basic truth value evaluations are needed.

Limitations
Limited expressiveness: Cannot represent relationships between objects or quantifiers like "for all" and "there exists." Higher-order logic can address this limitation.
Focus on Boolean values: Only deals with true or false, not more nuanced conditions or variables.

Python Examples

Checking if a user is logged in and has admin privileges:

logged_in = True
admin = False

if logged_in and admin:
    print("Welcome, Administrator!")
else:
    print("Please log in or request admin privileges.")

Validating user input for age:

age = int(input("Enter your age: "))
if age >= 18:
    print("You are eligible to proceed.")
else:
    print("Sorry, you must be 18 or older.")

Predicate Logic: Beyond True and False

While propositional logic deals with individual blocks, predicate logic introduces variables and functions, allowing you to create more dynamic and expressive structures. Imagine these as advanced LEGO pieces that can represent objects, properties, and relationships. The core concept here is a predicate, which acts like a function that evaluates to true or false based on specific conditions.

Strengths
Expressive power: Can represent complex relationships between objects and express conditions beyond simple true/false.
Flexibility: Allows using variables within predicates, making them adaptable to various situations.
Foundations for more advanced logic: Forms the basis for powerful techniques like formal verification.

Limitations
Increased complexity: Requires a deeper understanding of logic and can be more challenging to implement.
Computational cost: Evaluating complex predicates can be computationally expensive compared to simpler propositions.

Python Examples

Checking if a number is even or odd:

def is_even(number):
    return number % 2 == 0

num = int(input("Enter a number: "))
if is_even(num):
    print(f"{num} is even.")
else:
    print(f"{num} is odd.")

Validating email format:

import re

def is_valid_email(email):
    regex = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
    return re.match(regex, email) is not None

email = input("Enter your email address: ")
if is_valid_email(email):
    print("Valid email address.")
else:
    print("Invalid email format.")

Combining Forces: An Example

Imagine an online store where a user needs to be logged in, have a valid email address, and have placed an order before they can write a review. Here's how we can combine propositional and predicate logic:

def can_write_review(user):
    # Propositional logic for basic conditions
    logged_in = user.is_logged_in()
    has_email = user.has_valid_email()
    placed_order = user.has_placed_order()

    # Predicate logic to check email format
    def is_valid_email_format(email):
        # ... (implement email validation logic using regex)
        ...

    # All conditions must hold; the predicate is applied to the user's email address
    return (logged_in
            and has_email
            and is_valid_email_format(user.email)
            and placed_order)
In this example, we use both:
Propositional logic checks the overall conditions of logged_in, has_email, and placed_order using AND operations.
Predicate logic is embedded in the email check, where we define a separate function is_valid_email_format (implementation not shown) to validate the email format using a more complex condition (potentially using regular expressions).

This demonstrates how the two logics can work together to express intricate rules and decision-making in code.

The Third Pillar: Temporal Logic

While propositional and predicate logic focus on truth values at specific points in time, temporal logic allows us to reason about the behavior of our code over time, ensuring proper sequencing and timing. Imagine adding arrow blocks to our LEGO set, connecting actions and states across different time points. Temporal logic provides operators like:
Eventually (◇): Something will eventually happen.
Always (□): Something will always happen or be true.
Until (U): Something will happen before another thing happens.

Strengths
Expressive power: Allows reasoning about the behavior of systems over time, ensuring proper sequencing and timing.
Verification: Can be used to formally verify properties of temporal systems, guaranteeing desired behavior.
Flexibility: Various operators like eventually, always, and until offer rich expressiveness.

Weaknesses
Complexity: Requires a deeper understanding of logic and can be challenging to implement.
Computational cost: Verifying complex temporal properties can be computationally expensive.
Abstraction: Requires careful mapping between temporal logic statements and actual code implementation.

Traffic Light Control System

Imagine a traffic light system with two perpendicular roads (North-South and East-West). We want to ensure:
Safety: No cars from both directions ever cross at the same time.
Liveness: Each direction eventually gets a green light (doesn't wait forever).

Logic Breakdown
Propositional logic: north_red = True and east_red = True represent both lights being red (the initial state); north_green = not east_green ensures only one light is green at a time.
Predicate logic: has_waited_enough(direction) checks whether a direction has waited for a minimum time while red.
Temporal logic: ◇(north_green U east_green): eventually, either the north or the east light will be green. □(eventually north_green ∧ eventually east_green): both directions will eventually get a green light.

Python Example

import time

north_red, east_red = True, True
north_green, east_green = False, False
north_wait_time, east_wait_time = 0, 0
MIN_WAIT = 5  # Adjust minimum wait time as needed

def has_waited_enough(direction):
    # Predicate: has this direction been waiting (red) long enough?
    if direction == "north":
        return north_wait_time >= MIN_WAIT
    return east_wait_time >= MIN_WAIT

while True:
    # Handle pedestrian button presses or other external events here...

    # Switch lights based on logic; only one direction is ever green (safety)
    if north_red and has_waited_enough("north"):
        north_red, north_green = False, True
        east_red, east_green = True, False
        north_wait_time = 0
    elif east_red and has_waited_enough("east"):
        east_red, east_green = False, True
        north_red, north_green = True, False
        east_wait_time = 0

    # Update wait times for whichever direction is currently red (liveness)
    if north_red:
        north_wait_time += 1
    if east_red:
        east_wait_time += 1

    # Display light states
    print("North:", "Red" if north_red else "Green")
    print("East:", "Red" if east_red else "Green")
    time.sleep(1)  # Simulate time passing

This example incorporates:
Propositional logic for basic state changes and ensuring only one light is green.
Predicate logic to dynamically determine when a direction has waited long enough. Temporal logic to guarantee both directions eventually get a green light. This is a simplified example. Real-world implementations might involve additional factors and complexities. By combining these logic types, we can create more robust and dynamic systems that exhibit both safety and liveness properties. Fuzzy Logic: The Shades of Grey The fourth pillar in our logic toolbox is Fuzzy Logic. Unlike the crisp true/false of propositional logic and the structured relationships of predicate logic, fuzzy logic deals with the shades of grey. It allows us to represent and reason about concepts that are inherently imprecise or subjective, using degrees of truth between 0 (completely false) and 1 (completely true). Strengths Real-world applicability: Handles imprecise or subjective concepts effectively, reflecting human decision-making. Flexibility: Can adapt to changing conditions and provide nuanced outputs based on degrees of truth. Robustness: Less sensitive to minor changes in input data compared to crisp logic. Weaknesses Interpretation: Defining fuzzy sets and membership functions can be subjective and require domain expertise. Computational cost: Implementing fuzzy inference and reasoning can be computationally intensive. Verification: Verifying and debugging fuzzy systems can be challenging due to their non-deterministic nature. Real-World Example Consider a thermostat controlling your home's temperature. Instead of just "on" or "off," fuzzy logic allows you to define "cold," "comfortable," and "hot" as fuzzy sets with gradual transitions between them. This enables the thermostat to respond more naturally to temperature changes, adjusting heating/cooling intensity based on the degree of "hot" or "cold" it detects. Bringing Them All Together: Traffic Light With Fuzzy Logic Now, let's revisit our traffic light control system and add a layer of fuzzy logic. Problem In our previous example, the wait time for each direction was fixed. But what if traffic volume varies? We want to prioritize the direction with more waiting cars. Solution Propositional logic: Maintain the core safety rule: north_red ∧ east_red. Predicate logic: Use has_waiting_cars(direction) to count cars in each direction. Temporal logic: Ensure fairness: ◇(north_green U east_green). Fuzzy logic: Define fuzzy sets for "high," "medium," and "low" traffic based on car count. Use these to dynamically adjust wait times. 
At a very basic level, our Python code could look like: Python import time from skfuzzy import control as ctrl # Propositional logic variables north_red = True east_red = True # Predicate logic function def has_waiting_cars(direction): # Simulate car count (replace with actual sensor data) if direction == "north": return random.randint(0, 10) > 0 # Adjust threshold as needed else: return random.randint(0, 10) > 0 # Temporal logic fairness rule fairness_satisfied = False # Fuzzy logic variables traffic_level = ctrl.Antecedent(np.arange(0, 11), 'traffic_level') wait_time_adjust = ctrl.Consequent(np.arange(-5, 6), 'wait_time_adjust') # Fuzzy membership functions for traffic level low_traffic = ctrl.fuzzy.trapmf(traffic_level, 0, 3, 5, 7) medium_traffic = ctrl.fuzzy.trapmf(traffic_level, 3, 5, 7, 9) high_traffic = ctrl.fuzzy.trapmf(traffic_level, 7, 9, 11, 11) # Fuzzy rules for wait time adjustment rule1 = ctrl.Rule(low_traffic, wait_time_adjust, 3) rule2 = ctrl.Rule(medium_traffic, wait_time_adjust, 0) rule3 = ctrl.Rule(high_traffic, wait_time_adjust, -3) # Control system and simulation wait_ctrl = ctrl.ControlSystem([rule1, rule2, rule3]) wait_sim = ctrl.ControlSystemSimulation(wait_ctrl) while True: # Update logic states # Propositional logic: safety rule north_red = not east_red # Ensure only one light is green at a time # Predicate logic: check waiting cars north_cars = has_waiting_cars("north") east_cars = has_waiting_cars("east") # Temporal logic: fairness rule if not fairness_satisfied: # Initial green light assignment (randomly choose a direction) if fairness_satisfied is False: if random.random() < 0.5: north_red = False else: east_red = False # Ensure both directions eventually get a green light if north_red and east_red: if north_cars >= east_cars: north_red = False else: east_red = False elif north_red or east_red: # At least one green light active fairness_satisfied = True # Fuzzy logic: calculate wait time adjustment if north_red: traffic_sim.input['traffic_level'] = north_cars else: traffic_sim.input['traffic_level'] = east_cars traffic_sim.compute() adjusted_wait_time = ctrl.control_output(traffic_sim, wait_time_adjust, defuzzifier=ctrl.Defuzzifier(method='centroid')) # Update wait times based on adjusted value and fairness considerations if north_red: north_wait_time += adjusted_wait_time else: north_wait_time = 0 # Reset wait time when light turns green if east_red: east_wait_time += adjusted_wait_time else: east_wait_time = 0 # Simulate light duration (replace with actual control mechanisms) time.sleep(1) # Display light states and wait times print("North:", "Red" if north_red else "Green") print("East:", "Red" if east_red else "Green") print("North wait time:", north_wait_time) print("East wait time:", east_wait_time) print("---") There are various Python libraries like fuzzywuzzy and scikit-fuzzy that can help to implement fuzzy logic functionalities. Choose one that suits your project and explore its documentation for specific usage details. Remember, this is a simplified example, and the actual implementation will depend on your specific requirements and chosen fuzzy logic approach. This basic example is written for the sole purpose of demonstrating the core concepts. The code is by no means optimal, and it can be further refined in many ways for efficiency, fairness, error handling, and realism, among others. Explanation We define fuzzy sets for traffic_level and wait_time_adjust using trapezoidal membership functions. 
Adjust the ranges (0-11 for traffic level, -5-5 for wait time) based on your desired behavior. We define three fuzzy rules that map the combined degrees of truth for each traffic level to a wait time adjustment. You can add or modify these rules for more complex behavior. We use the scikit-fuzzy library to create a control system and simulation, passing the traffic_level as input. The simulation outputs a fuzzy set for wait_time_adjust. We defuzzify this set using the centroid method to get a crisp wait time value. Wrapping Up This article highlights four types of logic as a foundation for quality code. Each line of code represents a statement, a decision, a relationship — essentially, a logical step in the overall flow. Understanding and applying different logical frameworks, from the simple truths of propositional logic to the temporal constraints of temporal logic, empowers developers to build systems that are not only functional but also efficient, adaptable, and elegant. Propositional Logic This fundamental building block lays the groundwork by representing basic truths and falsehoods (e.g., "user is logged in" or "file exists"). Conditional statements and operators allow for simple decision-making within the code, ensuring proper flow and error handling. Predicate Logic Expanding on propositions, it introduces variables and relationships, enabling dynamic representation of complex entities and scenarios. For instance, functions in object-oriented programming can be viewed as predicates operating on specific objects and data. This expressive power can enhance code modularity and reusability. Temporal Logic With the flow of time being crucial in software, temporal logic ensures proper sequencing and timing. It allows us to express constraints like "before accessing data, validation must occur" or "the system must respond within 10 milliseconds." This temporal reasoning leads to code that adheres to timing requirements and can avoid race conditions. Fuzzy Logic Not every situation is black and white. Fuzzy logic embraces the shades of grey by dealing with imprecise or subjective concepts. A recommendation system can analyze user preferences or item features with degrees of relevance, leading to more nuanced and personalized recommendations. This adaptability enhances user experience and handles real-world complexities. Each type of logic plays a role in constructing well-designed software. Propositional logic forms the bedrock, predicate logic adds structure, temporal logic ensures timing, and fuzzy logic handles nuances. Their combined power leads to more reliable, efficient, and adaptable code, contributing to the foundation of high-quality software.
Retrieval Augmented Generation (RAG) is becoming a popular paradigm for bridging the knowledge gap between pre-trained large language models and other data sources. For developer productivity, several code copilots help with code completion. Code search is an age-old problem that can be rethought in the age of RAG. Imagine you are trying to contribute to a new code base (a GitHub repository) for a beginner task. Knowing which file to change and where to make the change can be time-consuming. We've all been there. You're enthusiastic about contributing to a new GitHub repository but overwhelmed. Which file do you modify? Where do you start? For newcomers, the maze of a new codebase can be truly daunting.

Retrieval Augmented Generation for Code Search

The technical solution consists of two parts:
1. Build a vector index, generating an embedding for every file (e.g., .py, .java).
2. Query the vector index and leverage the code interpreter to provide instructions by calling GPT-x.

Building the Vector Index

Once you have a local copy of the GitHub repo, much like a crawler building a web search index: traverse every file matching a pattern (*.py, *.sh, *.java), read the content and generate an embedding using OpenAI's Ada embedding or a Sentence-BERT embedding (or both), and build a vector store using Annoy. Anecdotally, building multiple vector stores based on different embeddings, instead of choosing a single embedding, improves the quality of retrieval; however, there is a cost to maintaining multiple indices.

1. Prepare Your requirements.txt To Install the Necessary Python Packages

pip install -r requirements.txt

annoy==1.17.3
langchain==0.0.279
sentence-transformers==2.2.2
openai==0.28.0
open-interpreter==0.1.6

2. Walk Through Every File

import os

### Traverse through every file in the directory
def get_files(path):
    files = []
    for r, d, f in os.walk(path):
        for file in f:
            if file.endswith((".py", ".sh", ".java")):
                files.append(os.path.join(r, file))
    return files

3. Get OpenAI Ada Embeddings

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key="<Insert your key>")

# We are getting embeddings for the contents of the file
def get_file_embeddings(path):
    try:
        text = get_file_contents(path)
        ret = embeddings.embed_query(text)
        return ret
    except Exception:
        return None

def get_file_contents(path):
    with open(path, 'r') as f:
        return f.read()

files = get_files(LOCAL_REPO_GITHUB_PATH)  # path to your local clone of the repo
embeddings_dict = {}
s = set()
for file in files:
    e = get_file_embeddings(file)
    if e is None:
        print("Error in generating an embedding for the contents of file: ")
        print(file)
        s.add(file)
    else:
        embeddings_dict[file] = e

4. Generate the Annoy Index

In Annoy, the metric can be "angular," "euclidean," "manhattan," "hamming," or "dot."

from annoy import AnnoyIndex

annoy_index_t = AnnoyIndex(1536, 'angular')  # 1536 is the Ada embedding dimension
index_map = {}
i = 0
for file in embeddings_dict:
    annoy_index_t.add_item(i, embeddings_dict[file])
    index_map[i] = file
    i += 1

annoy_index_t.build(len(files))  # the argument is the number of trees
name = "CodeBase" + "_ada.ann"
annoy_index_t.save(name)

### Maintains a forward map of id -> file name
with open('index_map' + "CodeBase" + '.txt', 'w') as f:
    for idx, path in index_map.items():
        f.write(f'{idx}\t{path}\n')

The size of the index is roughly proportional to the number of files in the local repository. The sizes of the Annoy indices generated for a few popular GitHub repositories are shown below.
Repository | File Count (approximate, as it keeps growing) | Index Size
Langchain | 1,983+ | 60 MB
Llama Index | 779 | 14 MB
Apache Solr | 5,000+ | 328 MB
Local GPT | 8 | 165 KB

Generate Response With Open Interpreter (Calls GPT-4)

Once the index is built, a simple command-line Python script can be implemented to ask questions about your codebase right from the terminal. We can leverage Open Interpreter. One reason to use Open Interpreter instead of calling GPT-4 or other LLMs directly is that Open Interpreter can make changes to your files and run commands; it handles the interaction with GPT-4.

import sys
from annoy import AnnoyIndex
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key="Your OPEN AI KEY")

query = sys.argv[1]        ### Your question
depth = int(sys.argv[2])   ### Number of documents to retrieve from vector search
name = sys.argv[3]         ### Name of your index

### Get top K files based on nearest neighbor search
def query_top_files(query, top_n=4):
    # Load annoy index and index map (load_index_map and get_embeddings_for_text
    # are helper functions defined elsewhere in the script)
    t = AnnoyIndex(EMBEDDING_DIM, 'angular')  # EMBEDDING_DIM matches the index, 1536 for Ada
    t.load(name + '_ada.ann')
    index_map = load_index_map()
    # Get embeddings for the query
    query_embedding = get_embeddings_for_text(query)
    # Search in the Annoy index
    indices, distances = t.get_nns_by_vector(query_embedding, top_n, include_distances=True)
    # Fetch file paths for these indices (the forward index helps)
    files = [(index_map[idx], dist) for idx, dist in zip(indices, distances)]
    return files

### Use Open Interpreter to make the call to GPT-4
import interpreter

results = query_top_files(query, depth)
file_content = ""
s = set()
print("Files you might want to read:")
for path, dist in results:
    content = get_file_contents(path)
    file_content += "Path : "
    file_content += path
    if path not in s:
        print(path)
        s.add(path)
    file_content += "\n"
    file_content += content

print("open interpreter's recommendation")
message = ("Take a deep breath. I have a task to complete. Please help with the task below "
           "and answer my question. Task : READ THE FILE content below and their paths and answer "
           + query + "\n" + file_content)
interpreter.chat(message)
print("interpreter's recommendation done. (Risk: LLMs are known to hallucinate)")

Anecdotal Results

Langchain
Question: Where should I make changes to add a new summarization prompt?
The recommended files to change are: refine_prompts.py, stuff_prompt.py, map_reduce_prompt.py, and entity_summarization.py. All of these files are indeed related to the summarization prompt in langchain.

Local GPT
Question: Which files should I change, and how do I add support for the new model Falcon 80B?
Open Interpreter identifies the files to be changed and gives specific step-by-step instructions for adding the Falcon 80B model to the list of models in constants.py and adding support in the user interface of localGPT_UI.py. For specific prompt templates, it recommends modifying the method get_prompt_template in prompt_template_utils.py. The complete code can be found here.

Conclusion

A simple RAG solution like this helps with:
Accelerated onboarding: New contributors can quickly get up to speed with the codebase, reducing onboarding time.
Reduced errors: With specific guidance, newcomers are less likely to make mistakes or introduce bugs.
Increased engagement: A supportive tool can encourage more contributions from the community, especially from those hesitant due to unfamiliarity with the codebase.
Continuous learning: Even for experienced developers, the tool can be a means to discover and learn about lesser-known parts of the codebase.
DevOps encompasses a set of practices and principles that blend development and operations to deliver high-quality software products efficiently and effectively by fostering a culture of open communication between software developers and IT professionals. Code reviews play a critical role in achieving success in a DevOps approach mainly because they enhance the quality of code, promote collaboration among team members, and encourage the sharing of knowledge within the team. However, integrating code reviews into your DevOps practices requires careful planning and consideration. This article presents a discussion on the strategies you should adopt for implementing code reviews successfully into your DevOps practice. What Is a Code Review? Code review is defined as a process used to evaluate the source code in an application with the purpose of identifying any bugs or flaws, within it. Typically, code reviews are conducted by developers in the team other than the person who wrote the code. To ensure the success of your code review process, you should define clear goals and standards, foster communication and collaboration, use a code review checklist, review small chunks of code at a time, embrace a positive code review culture, and embrace automation and include automated tools in your code review workflow. The next section talks about each of these in detail. Implementing Code Review Into a DevOps Practice The key principles of DevOps include collaboration, automation, CI/CD, Infrastructure as Code (IaC), adherence to Agile and Lean principles, and continuous monitoring. There are several strategies you can adopt to implement code review into your DevOps practice successfully: Define Clear Goals and Code Review Guidelines Before implementing code reviews, it's crucial to establish objectives and establish guidelines to ensure that the code review process is both efficient and effective. This helps maintain quality as far as coding standards are concerned and sets a benchmark for the reviewer's expectations. Identifying bugs, enforcing practices, maintaining and enforcing coding standards, and facilitating knowledge sharing among team members should be among these goals. Develop code review guidelines that encompass criteria for reviewing code including aspects like code style, performance optimization, security measures, readability enhancements, and maintainability considerations. Leverage Automated Code Review Tools Leverage automated code review tools that help in automated checks for code quality. To ensure proper code reviews, it's essential to choose the tools that align with your DevOps principles. There are options including basic pull request functionalities, in version control systems such as GitLab, GitHub, and Bitbucket. You can also make use of platforms like Crucible, Gerrit, and Phabricator which are specifically designed to help with conducting code reviews. When making your selection, consider factors like user-friendliness, integration capabilities with development tools support, code comments, discussion boards, and the ability to track the progress of the code review process. Related: Gitlab vs Jenkins, CI/CD tools compared. Define a Code Review Workflow Establish a clear workflow for your code reviews to streamline the process and avoid confusion. It would help if you defined when code reviews should occur, such as before merging changes, during feature development, or before deploying the software to the production environment. 
Specify the duration allowed for code reviews, outlining deadlines for reviewers to provide feedback. Ensure that the feedback loop is closed: the developers who wrote the code address the review comments, and the reviewers validate the changes made.

Review Small and Digestible Units of Code

A single code review cycle should not involve a large amount of code. Instead, split changes into smaller, manageable chunks for review. This helps reviewers focus their attention on specific features or elements and offer constructive suggestions. Reviewers are also less likely to overlook critical issues when reviewing smaller chunks of code, resulting in a more thorough and detailed review.

Establish Clear Roles and Responsibilities

Typically, a code review team comprises the developers, the reviewers, the lead reviewer or moderator, and the project manager or team lead. A developer initiates the code review process by submitting a piece of code for review. A team of code reviewers examines the code and may request improvements or clarifications before approving it. The lead reviewer or moderator is responsible for ensuring that the code review process is thorough and efficient. The project manager or team lead ensures that code reviews are completed within the agreed time frame and that the code is aligned with the broader project goals.

Embrace Positive Feedback

Constructive criticism is essential to the success of a code review process. Encouraging constructive feedback makes it easier to improve code quality. Developers responsible for writing the code should actively seek feedback, while reviewers should offer suggestions and ideas. Acknowledge the hard work, knowledge exchange, and improvements that result from fruitful code reviews.

Conduct Regular Training

An effective code review process should incorporate a training program that creates learning opportunities for team members. Conducting regular training sessions and setting clear goals for code review are essential to the success of a code review process. Regular training sessions enhance the knowledge and capabilities of team members, enabling them to improve their skills. Investing in training helps team members reach their potential, which benefits the entire team.

Capture Metrics

To assess the efficiency of your code review process and pinpoint areas that require improvement, it is crucial to monitor metrics. You should set a few tangible goals before starting your code review process and then capture metrics (CPU consumption, memory consumption, I/O bottlenecks, code coverage, etc.) accordingly. Your code review process will be more successful if you use the right tools to capture the desired metrics and measure their success. A small sketch of computing two simple review metrics appears at the end of this article.

Conclusion

Although the key intent of a code review process is to identify bugs or areas of improvement in the code, there is a lot more to gain from a successful code review. An effective code review process ensures consistency in design and implementation, optimizes code for better performance and scalability, helps teams collaborate and share knowledge, and improves overall code quality.
That said, for a code review process to succeed, it is imperative that code reviews are received in a positive spirit and that review comments help the team learn and enhance their knowledge and skills.
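To make the "Capture Metrics" strategy above a little more concrete, here is a minimal, hypothetical sketch of how a team might compute two simple review metrics, turnaround time and change size, from exported pull request records. The record structure, field names, and values are illustrative only and are not the output of any particular tool:

Python

```
from datetime import datetime
from statistics import mean

# Hypothetical export of pull request records; field names and values are illustrative only.
pull_requests = [
    {"id": 101, "opened": "2024-03-01T09:00", "approved": "2024-03-01T15:30", "lines_changed": 120},
    {"id": 102, "opened": "2024-03-02T10:00", "approved": "2024-03-04T11:00", "lines_changed": 860},
    {"id": 103, "opened": "2024-03-03T08:15", "approved": "2024-03-03T12:45", "lines_changed": 45},
]

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M"


def turnaround_hours(pr):
    """Hours between a review being requested and the change being approved."""
    opened = datetime.strptime(pr["opened"], TIMESTAMP_FORMAT)
    approved = datetime.strptime(pr["approved"], TIMESTAMP_FORMAT)
    return (approved - opened).total_seconds() / 3600


avg_turnaround = mean(turnaround_hours(pr) for pr in pull_requests)
avg_size = mean(pr["lines_changed"] for pr in pull_requests)

print(f"Average review turnaround: {avg_turnaround:.1f} hours")
print(f"Average change size: {avg_size:.0f} lines")

# Large changes that take a long time to review are candidates for splitting
# into smaller, more digestible units, as discussed above.
flagged = [pr["id"] for pr in pull_requests if pr["lines_changed"] > 500]
print(f"Pull requests worth splitting next time: {flagged}")
```

Tracking even simple numbers like these over time makes it easier to see whether guidelines such as reviewing small, digestible units of code are actually being followed.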
Are you looking at your organization's efforts to enter or expand into the cloud-native landscape and feeling a bit daunted by the vast expanse of information surrounding cloud-native observability? When you're moving so fast with agile practices across your DevOps, SRE, and platform engineering teams, it's no wonder this can seem a bit confusing. Unfortunately, the choices being made have a great impact on your business, your budgets, and the ultimate success of your cloud-native initiatives, and hasty decisions made upfront quickly lead to big headaches down the road.

In the previous article, we looked at the problem of underestimating cardinality in our cloud-native observability solutions. Now it's time to move on to another common mistake organizations make: ignoring our existing landscape. By sharing common pitfalls in this series, the hope is that we can learn from them. This article could also have been titled, "Underestimating Our Existing Landscape." When we start planning to integrate our application landscape into our observability solution, we often end up with large discrepancies between planning and outcomes.

They Can't Hurt Me

The truth is we have a lot of applications out there in our architecture. The strange thing is that, during the decision-making process around cloud-native observability and the scoping of solutions, they are often forgotten. Well, not necessarily forgotten, but certainly underestimated. The cost they bring lies in the hidden story around instrumentation. Auto-instrumentation suggests it's quick and easy, but it often does not deliver exactly the insights we need. On top of that, auto-instrumentation generates extra data from metrics and tracing activities that we are often not that interested in. Manual instrumentation is the real cost of getting the exact insights and data we want from our application landscape. This is what often results in unexpected or incorrectly scoped work (a.k.a. costs) as we change, test, and deploy new versions of existing applications.

We want to stay with open source and open standards in our architecture, so we are going to end up with the cloud-native standards found within the Cloud Native Computing Foundation. With that in mind, we can take a closer look at two technologies for our cloud-native observability solution: one for metrics and one for traces.

Instrumenting Metrics

Widely adopted and accepted standards for metrics can be found in the Prometheus project, including time-series storage, communication protocols to scrape (pull) data from targets, and PromQL, the query language for visualizing the data. Below you see an outline of the architecture used by Prometheus to collect metrics data. There are client libraries, exporters, and communication standards to detect services across various cloud-native technologies. They make it look like very low effort to start collecting meaningful data in the form of standardized metrics from your applications, devices, and services.

The reality is that we need to look much more closely at scoping the effort required to instrument our applications. Below you see an example of what is necessary to (either automatically or manually) instrument a Java application; the process is the same for either method. While some of the data can be gathered automatically, that's just generic Java information about your applications and services. Manual instrumentation is the cost you can't forget, where you need to make code changes and redeploy.
While it's nice to discuss manual instrumentation in the abstract sense, nothing beats getting hands-on with a real coding example. To that end, we can dive into what it takes to both auto- and manually instrument a simple Java application in this workshop lab. Below you see a small example of the code you will apply to your example application in one of the workshop exercises to populate a gauge metric (alongside a counter):

Java

```
// Start a thread and apply values to the metrics.
// (counter, gauge, and rand() are defined earlier in the workshop code.)
Thread bgThread = new Thread(() -> {
    while (true) {
        try {
            counter.labelValues("ok").inc();
            counter.labelValues("ok").inc();
            counter.labelValues("error").inc();
            gauge.labelValues("value").set(rand(-5, 10));
            TimeUnit.SECONDS.sleep(1);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
});
bgThread.start();
```

Be sure to explore the free online workshop and get hands-on experience with what instrumenting your Java applications entails.

Instrumenting Traces

In the case of tracing, a widely adopted and accepted standard is the OpenTelemetry (OTel) project, which is used to instrument and collect telemetry data through a push mechanism to an agent installed on the host. Below you see an outline of the architecture used by OTel to collect telemetry data.

Whether we choose automatic or manual instrumentation, we face the same issues discussed above: instrumenting our applications and services always carries some cost, and we can't forget that when scoping our observability solutions. The telemetry data is pushed to an agent, known as the OTel Collector, which is installed on the application's host platform and communicates using a widely accepted open standard, the OpenTelemetry Protocol (OTLP). Note that OTel does not have a backend component of its own; instead, it leverages other technologies for the backend, and the collector forwards all processed telemetry data to the configured backend.

Again, it's nice to discuss manual instrumentation in the abstract sense, but nothing beats getting hands-on with a real coding example. To that end, we can dive into what it takes to programmatically instrument a simple application using OTel in this workshop lab. Below you see a small example of the code that you will apply to your example application in one of the workshop exercises to collect OTel telemetry data and, later in the workshop, view it in the Jaeger UI:

Python

```
...
# (TracerProvider, BatchSpanProcessor, ConsoleSpanExporter, FlaskInstrumentor,
#  and Flask are imported in the lines elided here.)
from opentelemetry.trace import get_tracer_provider, set_tracer_provider

# Register a tracer provider and export finished spans to the console
set_tracer_provider(TracerProvider())
get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)

# Auto-instrument the Flask app so incoming requests produce spans
instrumentor = FlaskInstrumentor()
app = Flask(__name__)
instrumentor.instrument_app(app)
...
```

Be sure to explore the free online workshop and get hands-on experience with how much effort it takes to instrument your applications using OTel.

The road to cloud-native success has many pitfalls. Understanding how to avoid the pillars and focusing instead on solutions for the phases of observability will save much wasted time and energy.

Coming Up Next

Another pitfall organizations struggle with in cloud-native observability is the protocol jungle. In the next article in this series, I'll share why this is a pitfall and how we can avoid it wreaking havoc on our cloud-native observability efforts.