Monitoring and Observability Resources for Engineers

DZone's Featured Monitoring and Observability Resources

Achieving High Availability in CI/CD With Observability

By Lipsa Das

CORE

Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, The Modern DevOps Lifecycle: Shifting CI/CD and Application Architectures. Forbes estimates that cloud budgets will break all previous records as businesses will spend over $1 trillion on cloud computing infrastructure in 2024. Since most application releases depend on cloud infrastructure, having good continuous integration and continuous delivery (CI/CD) pipelines and end-to-end observability becomes essential for ensuring highly available systems. By integrating observability tools in CI/CD pipelines, organizations can increase deployment frequency, minimize risks, and build highly available systems. Complementing these practices is site reliability engineering (SRE), a discipline ensuring system reliability, performance, and scalability. This article will help you understand the key concepts of observability and how to integrate observability in CI/CD for creating highly available systems. Observability and High Availability in SRE Observability refers to offering real-time insights into application performance, whereas high availability means ensuring systems remain operational by minimizing downtime. Understanding how the system behaves, performs, and responds to various conditions is central to achieving high availability. Observability equips SRE teams with the necessary tools to gain insights into a system's performance. Figure 1. Observability in the DevOps workflow Components of Observability Observability involves three essential components: Metrics – measurable data on various aspects of system performance and user experience Logs – detailed event information for post-incident reviews Traces – end-to-end visibility in complex architectures to help you understand requests across services Together, they comprehensively picture the system's behavior, performance, and interactions. This observability data can then be analyzed by SRE teams to make data-driven decisions and swiftly resolve issues to make their system highly available. The Role of Observability in High Availability Businesses have to ensure that their development and SRE teams are skilled at predicting and resolving system failures, unexpected traffic spikes, network issues, and software bugs to provide a smooth experience to their users. Observability is vital in assessing high availability by continuously monitoring specific metrics that are crucial for system health, such as latency, error rates, throughput, saturation, and more, therefore providing a real-time health check. Deviations from normal behavior trigger alerts, allowing SRE teams to proactively address potential issues before they impact availability. How Observability Helps SRE Teams Each observability component contributes unique insights into different facets of system performance. These components empower SRE teams to proactively monitor, diagnose, and optimize system behavior. Some use cases of metrics, logs, and traces for SRE teams are post-incident reviews, identification of system weaknesses, capacity planning, and performance optimization. Post-Incident Reviews Observability tools allow SRE teams to look at past data to analyze and understand system behavior during incidents, anomalies, or outages. Detailed logs, metrics, and traces provide a timeline of events that help identify the root causes of issues. Identification of System Weaknesses Observability data aids in pinpointing system weaknesses by providing insights into how the system behaves under various conditions. By analyzing metrics, logs, and traces, SRE teams can identify patterns or anomalies that may indicate vulnerabilities, performance bottlenecks, or areas prone to failures. Capacity Planning and Performance Optimization By collecting and analyzing metrics related to resource utilization, response times, and system throughput, SRE teams can make informed decisions about capacity requirements. This proactive approach ensures that systems are adequately scaled to handle expected workloads and their performance is optimized to meet user demands. In short, resources can be easily scaled down during non-peak hours or scaled up when demands surge. SRE Best Practices for Reliability At its core, SRE practices aim to create scalable and highly reliable software systems using two key principles that guide SRE teams: SRE golden signals and service-level objectives (SLOs). Understanding SRE Golden Signals The SRE golden signals are a set of critical metrics that provide a holistic view of a system's health and performance. The four primary golden signals are: Latency – Time taken for a system to respond to a request. High latency negatively impacts user experience. Traffic – Volume of requests a system is handling. Monitoring helps anticipate and respond to changing demands. Errors – Elevated error rates can indicate software bugs, infrastructure problems, or other issues that may impact reliability. Saturation – Utilization of system resources such as CPU, memory, or disk. It helps identify potential bottlenecks and ensures the system has sufficient resources to handle the load. Setting Effective SLOs SLOs define the target levels of reliability or performance that a service aims to achieve. They are typically expressed as a percentage over a specific time period. SRE teams use SLOs to set clear expectations for a system’s behavior, availability, and reliability. They continuously monitor the SRE golden signals to assess whether the system meets its SLOs. If the system falls below the defined SLOs, it triggers a reassessment of the service's architecture, capacity, or other aspects to improve availability. Businesses can use observability tools to set up alerts based on predetermined thresholds for key metrics. Defining Mitigation Strategies Automating repetitive tasks, such as configuration management, deployments, and scaling, reduces the risk of human error and improves system reliability. Introducing redundancy in critical components ensures that a failure in one area doesn't lead to a system-wide outage. This could involve redundant servers, data centers, or even cloud providers. Additionally, implementing rollback mechanisms for deployments allows SRE teams to quickly revert to a stable state in the event of issues introduced by new releases. CI/CD Pipelines for Zero Downtime Achieving zero downtime through effective CI/CD pipelines enables services to provide users with continuous access to the latest release. Let’s look at some of the key strategies employed to ensure zero downtime. Strategies for Designing Pipelines to Ensure Zero Downtime Some strategies for minimizing disruptions and maximizing user experience include blue-green deployments, canary releases, and feature toggles. Let’s look at them in more detail. Figure 2. Strategies for designing pipelines to ensure zero downtime Blue-Green Deployments Blue-green deployments involve maintaining two identical environments (blue and green), where only one actively serves production traffic at a time. When deploying updates, traffic is seamlessly switched from the current (blue) environment to the new (green) one. This approach ensures minimal downtime as the transition is instantaneous, allowing quick rollback in case issues arise. Canary Releases Canary releases involve deploying updates to a small subset of users before rolling them out to everyone. This gradual and controlled approach allows teams to monitor for potential issues in a real-world environment with reduced impact. The deployment is released to a wider audience if the canary group experiences no significant issues. Feature Toggles Feature toggles, or feature flags, enable developers to control the visibility of new features in production independently of other features. By toggling features on or off, teams can release code to production but activate or deactivate specific functionalities dynamically without deploying new code. This approach provides flexibility, allowing features to be gradually rolled out or rolled back without redeploying the entire application. Best Practices in CI/CD for Ensuring High Availability Successfully implementing CI/CD pipelines for high availability often requires a good deal of consideration and lots of trial and error. While there are many implementations, adhering to best practices can help you avoid common problems and improve your pipeline faster. Some industry best practices you can implement in your CI/CD pipeline to ensure zero downtime are automated testing, artifact versioning, and Infrastructure as Code (IaC). Automated Testing You can use comprehensive test suites — including unit tests, integration tests, and end-to-end tests — to identify potential issues early in the development process. Automated testing during integration provides confidence in the reliability of code changes, reducing the likelihood of introducing critical bugs during deployments. Artifact Versioning By assigning unique versions to artifacts, such as compiled binaries or deployable packages, teams can systematically track changes over time. This practice enables precise identification of specific code iterations, thus simplifying debugging, troubleshooting, and rollback processes. Versioning artifacts ensures traceability and facilitates rollback to previous versions in the case of issues during deployment. Infrastructure as Code Utilize Infrastructure as Code to define and manage infrastructure configurations, using tools such as OpenTofu, Ansible, Pulumi, Terraform, etc. IaC ensures consistency between development, testing, and production environments, reducing the risk of deployment-related issues. Integrating Observability Into CI/CD Pipelines Observing key metrics such as build success rates, deployment durations, and resource utilization during CI/CD provides visibility into the health and efficiency of the CI/CD pipeline. Observability can be implemented during continuous integration (CI) and continuous deployment (CD) as well as post-deployment. Observability in Continuous Integration Observability tools capture key metrics during the CI process, such as build success rates, test coverage, and code quality. These metrics provide immediate feedback on the health of the codebase. Logging enables the recording of events and activities during the CI process. Logs help developers and CI/CD administrators troubleshoot issues and understand the execution flow. Tracing tools provide insights into the execution path of CI tasks, allowing teams to identify bottlenecks or areas for optimization. Observability in Continuous Deployment Observability platforms monitor the CD pipeline in real time, tracking deployment success rates, deployment durations, and resource utilization. Observability tools integrate with deployment tools to capture data before, during, and after deployment. Alerts based on predefined thresholds or anomalies in CD metrics notify teams of potential issues, enabling quick intervention and minimizing the risk of deploying faulty code. Post-Deployment Observability Application performance monitoring tools provide insights into the performance of deployed applications, including response times, error rates, and transaction traces. This information is crucial for identifying and resolving issues introduced during and after deployment. Observability platforms with error-tracking capabilities help pinpoint and prioritize software bugs or issues arising from the deployed code. Aggregating logs from post-deployment environments allows for a comprehensive view of system behavior and facilitates troubleshooting and debugging. Conclusion The symbiotic relationship between observability and high availability is integral to meeting the demands of agile, user-centric development environments. With real-time monitoring, alerting, and post-deployment insights, observability plays a major role in achieving and maintaining high availability. Cloud providers are now leveraging drag-and-drop interfaces and natural language tools to eliminate the need for advanced technical skills for deployment and management of cloud infrastructure. Hence, it is easier than ever to create highly available systems by combining the powers of CI/CD and observability. Resources: Continuous Integration Patterns and Anti-Patterns by Nicolas Giron and Hicham Bouissoumer, DZone Refcard Continuous Delivery Patterns and Anti-Patterns by Nicolas Giron and Hicham Bouissoumer, DZone Refcard "The 10 Biggest Cloud Computing Trends In 2024 Everyone Must Be Ready For Now" by Bernard Marr, Forbes This is an excerpt from DZone's 2024 Trend Report,The Modern DevOps Lifecycle: Shifting CI/CD and Application Architectures.For more: Read the Report More

Mastering Prometheus: Unlocking Actionable Insights and Enhanced Monitoring in Kubernetes Environments

By Rajesh Gheware

In the dynamic world of cloud-native technologies, monitoring and observability have become indispensable. Kubernetes, the de-facto orchestration platform, offers scalability and agility. However, managing its health and performance efficiently necessitates a robust monitoring solution. Prometheus, a powerful open-source monitoring system, emerges as a perfect fit for this role, especially when integrated with Kubernetes. This guide outlines a strategic approach to deploying Prometheus in a Kubernetes cluster, leveraging helm for installation, setting up an ingress nginx controller with metrics scraping enabled, and configuring Prometheus alerts to monitor and act upon specific incidents, such as detecting ingress URLs that return 500 errors. Prometheus Prometheus excels at providing actionable insights into the health and performance of applications and infrastructure. By collecting and analyzing metrics in real-time, it enables teams to proactively identify and resolve issues before they impact users. For instance, Prometheus can be configured to monitor system resources like CPU, memory usage, and response times, alerting teams to anomalies or thresholds breaches through its powerful alerting rules engine, Alertmanager. Utilizing PromQL, Prometheus's query language, teams can dive deep into their metrics, uncovering patterns and trends that guide optimization efforts. For example, tracking the rate of HTTP errors or response times can highlight inefficiencies or stability issues within an application, prompting immediate action. Additionally, by integrating Prometheus with visualization tools like Grafana, teams can create dashboards that offer at-a-glance insights into system health, facilitating quick decision-making. Through these capabilities, Prometheus not only monitors systems but also empowers teams with the data-driven insights needed to enhance performance and reliability. Prerequisites Docker and KIND: A Kubernetes cluster set-up utility (Kubernetes IN Docker.) Helm, a package manager for Kubernetes, installed. Basic understanding of Kubernetes and Prometheus concepts. 1. Setting Up Your Kubernetes Cluster With Kind Kind allows you to run Kubernetes clusters in Docker containers. It's an excellent tool for development and testing. Ensure you have Docker and Kind installed on your machine. To create a new cluster: kind create cluster --name prometheus-demo Verify your cluster is up and running: kubectl cluster-info --context kind-prometheus-demo 2. Installing Prometheus Using Helm Helm simplifies the deployment and management of applications on Kubernetes. We'll use it to install Prometheus: Add the Prometheus community Helm chart repository: helm repo add prometheus-community https://prometheus-community.github.io/helm-charts helm repo update Install Prometheus: helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace helm upgrade prometheus prometheus-community/kube-prometheus-stack \ --namespace monitoring \ --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \ --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false This command deploys Prometheus along with Alertmanager, Grafana, and several Kubernetes exporters to gather metrics. Also, customize your installation to scan for service monitors in all the namespaces. 3. Setting Up Ingress Nginx Controller and Enabling Metrics Scraping Ingress controllers play a crucial role in managing access to services in a Kubernetes environment. We'll install the Nginx Ingress Controller using Helm and enable Prometheus metrics scraping: Add the ingress-nginx repository: helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx helm repo update Install the ingress-nginx chart: helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \ --namespace ingress-nginx --create-namespace \ --set controller.metrics.enabled=true \ --set controller.metrics.serviceMonitor.enabled=true \ --set controller.metrics.serviceMonitor.additionalLabels.release="prometheus" This command installs the Nginx Ingress Controller and enables Prometheus to scrape metrics from it, essential for monitoring the performance and health of your ingress resources. 4. Monitoring and Alerting for Ingress URLs Returning 500 Errors Prometheus's real power shines in its ability to not only monitor your stack but also provide actionable insights through alerting. Let's configure an alert to detect when ingress URLs return 500 errors. Define an alert rule in Prometheus: Create a new file called custom-alerts.yaml and define an alert rule to monitor for 500 errors: apiVersion: monitoring.coreos.com/v1 kind: PrometheusRule metadata: name: ingress-500-errors namespace: monitoring labels: prometheus: kube-prometheus spec: groups: - name: http-errors rules: - alert: HighHTTPErrorRate expr: | sum (rate(nginx_ingress_controller_requests{status=~"5.."}[1m])) > 0.1 OR absent(sum (rate(nginx_ingress_controller_requests{status=~"5.."}[1m]))) for: 1m labels: severity: critical annotations: summary: High HTTP Error Rate description: "This alert fires when the rate of HTTP 500 responses from the Ingress exceeds 0.1 per second over the last 5 minutes." Apply the alert rule to Prometheus: You'll need to configure Prometheus to load this alert rule. If you're using the Helm chart, you can customize the values.yaml file or create a ConfigMap to include your custom alert rules. Verify the alert is working: Trigger a condition that causes a 500 error and observe Prometheus firing the alert. For example, launch the following application: kubectl create deploy hello --image brainupgrade/hello:1.0 kubectl expose deploy hello --port 80 --target-port 8080 kubectl create ingress hello --rule="hello.internal.brainupgrade.in/=hello:80" --class nginx Access the application using the below command: curl -H "Host: hello.internal.brainupgrade.in" 172.18.0.3:31080 Wherein: 172.18.0.3 is the IP of the KIND cluster node. 31080 is the node port of the ingress controller service. This could be different in your case. Bring down the hello service pods using the following command: kubectl scale --replicas 0 deploy hello You can view active alerts in the Prometheus UI (localhost:9999) by running the following command. kubectl port-forward -n monitoring svc/prometheus-operated 9999:9090 And you will see the alert being fired. See the following snapshot: Error alert on Prometheus UI. You can also configure Alertmanager to send notifications through various channels (email, Slack, etc.). Conclusion Integrating Prometheus with Kubernetes via Helm provides a powerful, flexible monitoring solution that's vital for maintaining the health and performance of your cloud-native applications. By setting up ingress monitoring and configuring alerts for specific error conditions, you can ensure your infrastructure not only remains operational but also proactively managed. Remember, the key to effective monitoring is not just collecting metrics but deriving actionable insights that lead to improved reliability and performance. More

O11y Guide, Cloud-Native Observability Pitfalls: Ignoring Existing Landscape

By Eric D. Schabell

CORE

The Cost Crisis in Observability Tooling

By Charity Majors

Logging and Monitoring in Microsoft Azure

By Aditya Bhuyan

Revolutionizing Observability: How AI-Driven Observability Unlocks a New Era of Efficiency

Observability is the ability to measure the state of a service or software system with the help of tools such as logs, metrics, and traces. It is a crucial aspect of distributed systems, as it allows stakeholders such as Software Engineers, Site Reliability Engineers, and Product Managers to troubleshoot issues with their service, monitor performance, and gain insights into the software system's behavior. It also helps to bring visibility into important Product decisions such as monitoring the adoption rate of a new feature, analyzing user feedback, and identifying and fixing any performance issues to ensure a stable and delightful customer experience. In this article, we will discuss the importance of observability in distributed systems, the different tools used for monitoring, and the future of observability and Generative AI. Importance of Observability in Distributed Systems Distributed systems are a type of software architecture that involves multiple services and servers working together to achieve a common goal. Some examples of distributed applications include: Streaming services: Streaming services like Netflix and Spotify use distributed systems to handle large volumes of data and ensure smooth playback for users. Rideshare applications: Rideshare applications like Uber and Lyft rely on distributed systems to match drivers with passengers, track vehicle locations, and process payments. Distributed systems have several advantages, such as: Availability: If one server or pod on the network goes down, another can be spun up and pick up the work, thus ensuring high availability. Scalability: Distributed systems can scale out to accommodate increased load by adding more servers, making it easier to scale quickly, handle more users, or process more data. Maintainability: Distributed systems are more maintainable than centralized systems, as individual servers can be updated or replaced without affecting the overall system. However, distributed systems also come with disadvantages, such as increased complexity of management and the need for a deep understanding of the system's components. Observability helps to address these challenges. Troubleshooting Observability allows Engineers to diagnose issues in distributed systems more effectively by providing insightful information on system performance and behavior. Let’s take an example: when users of a video streaming service experience unexpected buffering, observability tools can help engineers quickly identify if the cause is a server overload, a network bottleneck, or a bad deployment, enabling a swift resolution to keep binge-watchers happily streaming. Preventive Measures By identifying potential problems before they occur, observability helps to prevent failures and improve system reliability. For example, if our video streaming service's metrics show a spike in CPU usage, engineers can identify the cause as a memory leak in a specific microservice. By addressing this issue proactively, they can prevent the service from crashing and ensure a smooth streaming experience for users. Business Insights Observability patterns for distributed systems provide valuable information for business decision-making. In the case of our video streaming service, observability tools can reveal user engagement patterns, such as peak viewing times, which can inform server scaling strategies to handle high traffic during new episode releases, thereby enhancing user satisfaction and reducing churn. The Three Pillars of Observability Logs, metrics, and traces are often known as the three pillars of observability. These powerful tools, if understood well, can unlock the ability to build better systems. 1. Logs Event logs are immutable, timestamped records of discrete events that happened over time. They provide information on system activity and timestamps. Let’s go back to our example of a video streaming service. Every time a user watches a video, an event log is created. This log contains details like the user ID, video ID, playback start time, timestamp of the event, and any errors encountered during streaming. If there are errors observed during video playback, engineers can look at these logs to understand what happened during that specific viewing session. 2. Metrics Metrics are quantitative data points that measure various aspects of system performance and product usage. Metrics such as CPU usage, memory usage, and network bandwidth of the servers delivering the video content are constantly monitored. Alerts can be configured on metric thresholds. If there's a sudden spike in page load latency, an alert would go off indicating there’s a problem that needs to be addressed to prevent a downgraded customer experience. 3. Traces Traces provide a detailed view of the path that a request takes through a distributed system. For a video streaming service, a trace could show the journey of a user's request from the moment they log in to the platform and hit play to the point where the video begins streaming. This trace would include all the microservices involved, such as authentication, content delivery, and data storage. If there's a delay in video start time, tracing can help pinpoint exactly where in the process the delay is occurring. Some popular examples of observability tools include DataDog, New Relic, and Splunk and open-source alternatives such as Prometheus and Grafana, which offer robust capabilities. Additionally, several tech companies build internal observability platforms by leveraging the flexibility and power of open-source tools like Prometheus and Grafana. Future of Observability and Generative AI As we look towards the future of observability in distributed systems, the applications of artificial intelligence (AI), and specifically generative AI, introduce innovative solutions that potentially simplify the lives of engineers, helping them focus on critical problems. Automated Pattern Recognition Generative AI shines in analyzing vast datasets and automatically recognizing abnormal patterns within them. This capability could save on-call engineers a lot of time as it can quickly identify issues, allowing them to focus on resolving problems rather than searching for the needle in the haystack. Cognitive Incident Response AI-powered systems can offer cognitive incident response by understanding the context of errors and suggesting diagnosis for the error based on past incidents. This capability allows for more intelligent alerting, alerting teams only for new and critical incidents and letting the observability tool take care of known issues. Enhanced Observability With AI Chatbot Picture a scenario where engineers on your team can simply ask for the data they need in everyday language, and AI-powered observability tools do the heavy lifting. These tools can sift through logs, metrics, and traces to deliver the answers you're looking for. For example, with Coralogix's Query Assistant, users can ask questions like "What metrics are available for each Redis instance?" and the system will not only understand the query but also present the information in an easy-to-digest dashboard or visualization. This level of interaction simplifies the debugging process for both engineers and those less familiar with complex query languages, making data exploration easier. Given the rapid advancements in the field of Artificial Intelligence and its integration into Observability tools, I’m super excited for what’s to come in the future. The future of observability, enriched by AI, promises not only a single source of truth for complex systems but also a smarter and more intuitive way for Engineers and other stakeholders to engage with data, driving better business outcomes and enabling a focus on creativity and critical incidents over routine tasks.

By Deepita Pai

Effective Log Data Analysis With Amazon CloudWatch: Harnessing Machine Learning

In today's cloud computing world, all types of logging data are extremely valuable. Logs can include a wide variety of data, including system events, transaction data, user activities, web browser logs, errors, and performance metrics. Managing logs efficiently is extremely important for organizations, but dealing with large volumes of data makes it challenging to detect anomalies and unusual patterns or predict potential issues before they become critical. Efficient log management strategies, such as implementing structured logging, using log aggregation tools, and applying machine learning for log analysis, are crucial for handling this data effectively. One of the latest advancements in effectively analyzing a large amount of logging data is Machine Learning (ML) powered analytics provided by Amazon CloudWatch. It is a brand new capability of CloudWatch. This innovative service is transforming the way organizations handle their log data. It offers a faster, more insightful, and automated log data analysis. This article specifically explores utilizing the machine learning-powered analytics of CloudWatch to overcome the challenges of effectively identifying hidden issues within the log data. Before deep diving into some of these features, let's have a quick refresher about Amazon CloudWatch. What Is Amazon CloudWatch? It is an AWS-native monitoring and observability service that offers a whole suite of capabilities: Monitoring: Tracks performance and operational health. Data collection: Gathers logs, metrics, and events, providing a comprehensive view of AWS resources. Unified operational view: Provides insights into applications running on AWS and on-premises servers. Challenges With Logs Data Analysis Volume of Data There's too much log data. In this modern era, applications emit a tremendous amount of log events. Log data can grow so rapidly that developers often find it difficult to identify issues within it; it is like finding a needle in a haystack. Change Identification Another common problem we have often seen is the fundamental problem of log analysis that goes back as long as logs have been around, identifying what has changed in your logs. Proactive Detection Proactive detection is another common challenge. It's great if you can utilize logs to dive in when an application's having an issue, find the root cause of that application issue, and fix it. But how do you know when those issues are occurring? How do you proactively detect them? Of course, you can implement metrics, alarms, etc., for the issues you know about. But there's always the problem of unknowns. So, we're often instrumenting observability and monitoring for past issues. Now, let's dive deep into the machine learning capabilities from CloudWatch that will help you overcome the challenges we have just discussed. Machine Learning Capabilities From CloudWatch Pattern Analysis Imagine you are troubleshooting a real-time distributed application accessed by millions of customers globally and generating a significant amount of application logs. Analyzing tens of thousands of log events manually is challenging, and it can take forever to find the root cause. That is where the new AWS CloudWatch machine learning-based capability can quickly help by grouping log events into patterns within the Logs Insight page of CloudWatch. It is much easier to identify through a limited number of patterns and quickly filter the ones that might be interesting or relevant based on the issue you are trying to troubleshoot. It also allows you to expand the specific pattern to look for the relevant events along with related patterns that might be pertinent. In simple words, Pattern Analysis is the automated grouping and categorization of your log events. Comparison Analysis How can we elevate pattern analysis to the next level? Now that we've seen how pattern analysis works let's see how we can extend this feature to perform comparison analysis. "Comparison Analysis" aims to solve the second challenge of identifying the log changes. Comparison analysis lets you effectively profile your logs using patterns from one time period and then compare them to the patterns extracted for another period and analyze the differences. This will help us answer this fundamental question of what changed to my logs. You can quickly compare your logs while your application's having an issue to a known healthy period. Any changes between two time periods are a strong indicator of the possible root cause of your problem. CloudWatch Logs Anomaly Detection Anomaly detection, in simple terms, is the process of identifying unusual patterns or behaviors in the logs that do not conform to expected norms. To use this feature, we need to first select the LogGroup for the application and enable CloudWatch Logs anomaly detection for it. At that point, CloudWatch will train a machine-learning model on the expected patterns and the volume of each pattern associated with your application. CloudWatch will take five minutes to train the model using logs from your application, and the feature will become active and automatically start servicing these anomalies any time they occur. So things like a brand new error message occurring that wasn't there before, a sudden spike in the volume, or if there's a spike in HTTP 400s are some examples that will result in an anomaly being generated for that. Generate Logs Insight Queries Using Generative AI With this capability, you can give natural language commands to filter log events, and CloudWatch can generate queries using Generative AI. If you are unfamiliar with CloudWatch query language or are from a non-technical background, you can easily use this feature to generate queries and filter logs. It's an iterative process; you need to learn precisely what you want from the first query. So you can update and iterate the query based on the results you see. Let's look at a couple of examples: Natural Language Prompt: "Check API Response Times" Auto-generated query by CloudWatch: In this query: fields @timestamp, @message selects the timestamp and message fields from your logs. | parse @message "Response Time: *" as responseTime parses the @message field to extract the value following the text "Response Time: " and labels it as responseTime. | stats avg(responseTime) calculates the average of the extracted responseTime values. Natural Language Prompt: "Please provide the duration of the ten invocations with the highest latency." Auto-generated query by CloudWatch In this query: fields @timestamp, @message, latency selects the @timestamp, @message, and latency fields from the logs. | stats max(latency) as maxLatency by @message computes the maximum latency value for each unique message. | sort maxLatency desc sorts the results in descending order based on the maximum latency, showing the highest values at the top. | limit 10 restricts the output to the top 10 results with the highest latency values. We can execute these queries in the CloudWatch “Logs Insights” query box to filter the log events from the application logs. These queries extract specific information from the logs, such as identifying errors, monitoring performance metrics, or tracking user activities. The query syntax might vary based on the particular log format and the information you seek. Conclusion CloudWatch's machine learning features offer a robust solution for managing the complexities of log data. These tools make log analysis more efficient and insightful, from automating pattern analysis to enabling anomaly detection. The addition of generative AI for query generation further democratizes access to these powerful insights.

By Rajat Gupta

O11y Guide, Cloud-Native Observability Pitfalls: Controlling Costs

Are you looking at your organization's efforts to enter or expand into the cloud-native landscape and feeling a bit daunted by the vast expanse of information surrounding cloud-native observability? When you're moving so fast with Agile practices across your DevOps, SREs, and platform engineering teams, it's no wonder this can seem a bit confusing. Unfortunately, the choices being made have a great impact on both your business, your budgets, and the ultimate success of your cloud-native initiatives that hasty decisions upfront lead to big headaches very quickly down the road. In the previous introduction, we looked at the problem facing everyone with cloud-native observability. It was the first article in this series. In this article, you'll find the first pitfall discussion that's another common mistake organizations make. By sharing common pitfalls in this series, the hope is that we can learn from them. After laying the groundwork in the previous article, it's time to tackle the first pitfall, where we need to look at how to control the costs and the broken cost models we encounter with cloud-native observability. O11y Costs Broken One of the biggest topics of the last year has been how broken the cost models are for cloud-native observability. I previously wrote about why cloud-native observability needs phases, detailing how the second generation of observability tooling suffers from this broken model. "The second generation consisted of application performance monitoring (APM) with the infrastructure using virtual machines and later cloud platforms. These second-generation monitoring tools have been unable to keep up with the data volume and massive scale that cloud-native architectures..." They store all of our cloud-native observability data and charge for this, and as our business finds success, scaling data volumes means expensive observability tooling, degraded visualization performance, and slow data queries (rules, alerts, dashboards, etc.). Organizations would not care how much data is being stored or what it costs if they had better outcomes, happier customers, higher levels of availability, faster remediation of issues, and, above all, more revenue. Unfortunately, as pointed out on TheNewStack, "It's remarkable how common this situation is, where an organization is paying more for their observability data than they do for their production infrastructure." The issue quickly resolves itself around the answer to the question, "Do we need to store all our observability data?" The quick and dirty answer is, of course, not! There has been almost no incentive for any tooling vendors to provide insights into the data we are ingesting for what is actually being used and what is not. It turns out that when you do take a good look at the data coming in and are able to filter all your data at ingestion for what is not touched by any user, not ad-hoc queried, not part of any dashboard, not part of any rule, and not used for any alerts. It turns out to make quite a difference in data costs. In the example above, we designed a dashboard for a service status overview, initially while ingesting over 280K data points. With the ability to inspect and clarify that a lot of these data points were not used in the organization, the same ingestion flow was reduced to just 390 single data points being stored. The cost reduction here depends on your vendor pricing, but with the effect shown here, it's obviously going to be a dramatic cost control tool. It's important to understand that we need to ingest what we can collect, but we really only want to store what we are actually going to use for queries, rules, alerts, and visualizations. Below is an architectural view of how we are assisted by having control plane functionality and tooling between our data ingestion and data storage. Any data we are not storing can later be passed through to storage should a future project require it. Finally, without standards and ownership of the cost-controlling processes in an organization, there is little hope of controlling costs. To this end, the FinOps role has become critical to many organizations, and the entire field started a community in 2019 known as the FinOps Foundation. It's very important that cloud-native observability vendors join these efforts moving forward, and this should be a point of interest when evaluating new tooling. Today, 90% of the Fortune 50 now have FinOps teams. The road to cloud-native success has many pitfalls, and understanding how to avoid the pillars and focusing instead on solutions for the phases of observability will save much wasted time and energy. Coming Up Next Another pitfall is when organizations focus on The Pillars in their observability solutions. In the next article in this series, I'll share why this is a pitfall and how we can avoid it wreaking havoc on our cloud-native observability efforts.

By Eric D. Schabell

CORE

DTrace Revisited: Advanced Debugging Techniques

When we think of debugging, we think of breakpoints in IDEs, stepping over, inspecting variables, etc. However, there are instances where stepping outside the conventional confines of an IDE becomes essential to track and resolve complex issues. This is where tools like DTrace come into play, offering a more nuanced and powerful approach to debugging than traditional methods. This blog post delves into the intricacies of DTrace, an innovative tool that has reshaped the landscape of debugging and system analysis. DTrace Overview DTrace was first introduced by Sun Microsystems in 2004, DTrace quickly garnered attention for its groundbreaking approach to dynamic system tracing. Originally developed for Solaris, it has since been ported to various platforms, including MacOS, Windows, and Linux. DTrace stands out as a dynamic tracing framework that enables deep inspection of live systems – from operating systems to running applications. Its capacity to provide real-time insights into system and application behavior without significant performance degradation marks it as a revolutionary tool in the domain of system diagnostics and debugging. Understanding DTrace’s Capabilities DTrace, short for Dynamic Tracing, is a comprehensive toolkit for real-time system monitoring and debugging, offering an array of capabilities that span across different levels of system operation. Its versatility lies in its ability to provide insights into both high-level system performance and detailed process-level activities. System Monitoring and Analysis At its core, DTrace excels in monitoring various system-level operations. It can trace system calls, file system activities, and network operations. This enables developers and system administrators to observe the interactions between the operating system and the applications running on it. For instance, DTrace can identify which files a process accesses, monitor network requests, and even trace system calls to provide a detailed view of what's happening within the system. Process and Performance Analysis Beyond system-level monitoring, DTrace is particularly adept at dissecting individual processes. It can provide detailed information about process execution, including CPU and memory usage, helping to pinpoint performance bottlenecks or memory leaks. This granular level of detail is invaluable for performance tuning and debugging complex software issues. Customizability and Flexibility One of the most powerful aspects of DTrace is its customizability. With a scripting language based on C syntax, DTrace allows the creation of customized scripts to probe specific aspects of system behavior. This flexibility means that it can be adapted to a wide range of debugging scenarios, making it a versatile tool in a developer’s arsenal. Real-World Applications In practical terms, DTrace can be used to diagnose elusive performance issues, track down resource leaks, or understand complex interactions between different system components. For example, it can be used to determine the cause of a slow file operation, analyze the reasons behind a process crash, or understand the system impact of a new software deployment. Performance and Compatibility of DTrace A standout feature of DTrace is its ability to operate with remarkable efficiency. Despite its deep system integration, DTrace is designed to have minimal impact on overall system performance. This efficiency makes it a feasible tool for use in live production environments, where maintaining system stability and performance is crucial. Its non-intrusive nature allows developers and system administrators to conduct thorough debugging and performance analysis without the worry of significantly slowing down or disrupting the normal operation of the system. Cross-Platform Compatibility Originally developed for Solaris, DTrace has evolved into a cross-platform tool, with adaptations available for MacOS, Windows, and various Linux distributions. Each platform presents its own set of features and limitations. For instance, while DTrace is a native component in Solaris and MacOS, its implementation in Linux often requires a specialized build due to kernel support and licensing considerations. Compatibility Challenges on MacOS On MacOS, DTrace's functionality intersects with System Integrity Protection (SIP), a security feature designed to prevent potentially harmful actions. To utilize DTrace effectively, users may need to disable SIP, which should be done with caution. This process involves booting into recovery mode and executing specific commands, a step that highlights the need for a careful approach when working with such powerful system-level tools. We can disable SIP using the command: csrutil disable We can optionally use a more refined approach of enabling SIP without dtrace using the following command: csrutil enable --without dtrace Be extra careful when issuing these commands and when working on machines where dtrace is enabled. Back up your data properly! Customizability and Flexibility of DTrace A key feature that sets DTrace apart in the realm of system monitoring tools is its highly customizable nature. DTrace employs a scripting language that bears similarity to C syntax, offering users the ability to craft detailed and specific diagnostic scripts. This scripting capability allows for the creation of custom probes that can be fine-tuned to target particular aspects of system behavior, providing precise and relevant data. Adaptability to Various Scenarios The flexibility of DTrace's scripting language means it can adapt to a multitude of debugging scenarios. Whether it's tracking down memory leaks, analyzing CPU usage, or monitoring I/O operations, DTrace can be configured to provide insights tailored to the specific needs of the task. This adaptability makes it an invaluable tool for both developers and system administrators who require a dynamic approach to problem-solving. Examples of Customizable Probes Users can define probes to monitor specific system events, track the behavior of certain processes, or gather data on system resource usage. This level of customization ensures that DTrace can be an effective tool in a variety of contexts, from routine maintenance to complex troubleshooting tasks. The following is a simple "Hello, world!" dtrace probe: sudo dtrace -qn 'syscall::write:entry, syscall::sendto:entry /pid == $target/ { printf("(%d) %s %s", pid, probefunc, copyinstr(arg1)); }' -p 9999 The kernel is instrumented with hooks that match various callbacks. dtrace connects to these hooks and can perform interesting tasks when these hooks are triggered. They have a naming convention, specifically provider:module:function:name. In this case, the provider is a system call in both cases. We have no module so we can leave that part blank between the colon (:) symbols. We grab a write operation and sendto entries. When an application writes or tries to send a packet, the following code event will trigger. These things happen frequently, which is why we restrict the process ID to the specific target with pid == $target. This means the code will only trigger for the PID passed to us in the command line. The rest of the code should be simple for anyone with basic C experience: it's a printf that would list the processes and the data passed. Real-World Applications of DTrace DTrace's diverse capabilities extend far beyond theoretical use, playing a pivotal role in resolving real-world system complexities. Its ability to provide deep insights into system operations makes it an indispensable tool in a variety of practical applications. To get a sense of how DTrace can be used, we can use the man -k dtrace command whose output on my Mac is below: bitesize.d(1m) - analyse disk I/O size by process. Uses DTrace cpuwalk.d(1m) - Measure which CPUs a process runs on. Uses DTrace creatbyproc.d(1m) - snoop creat()s by process name. Uses DTrace dappprof(1m) - profile user and lib function usage. Uses DTrace dapptrace(1m) - trace user and library function usage. Uses DTrace dispqlen.d(1m) - dispatcher queue length by CPU. Uses DTrace dtrace(1) - dynamic tracing compiler and tracing utility dtruss(1m) - process syscall details. Uses DTrace errinfo(1m) - print errno for syscall fails. Uses DTrace execsnoop(1m) - snoop new process execution. Uses DTrace fddist(1m) - file descriptor usage distributions. Uses DTrace filebyproc.d(1m) - snoop opens by process name. Uses DTrace hotspot.d(1m) - print disk event by location. Uses DTrace iofile.d(1m) - I/O wait time by file and process. Uses DTrace iofileb.d(1m) - I/O bytes by file and process. Uses DTrace iopattern(1m) - print disk I/O pattern. Uses DTrace iopending(1m) - plot number of pending disk events. Uses DTrace iosnoop(1m) - snoop I/O events as they occur. Uses DTrace iotop(1m) - display top disk I/O events by process. Uses DTrace kill.d(1m) - snoop process signals as they occur. Uses DTrace lastwords(1m) - print syscalls before exit. Uses DTrace loads.d(1m) - print load averages. Uses DTrace newproc.d(1m) - snoop new processes. Uses DTrace opensnoop(1m) - snoop file opens as they occur. Uses DTrace pathopens.d(1m) - full pathnames opened ok count. Uses DTrace perldtrace(1) - Perl's support for DTrace pidpersec.d(1m) - print new PIDs per sec. Uses DTrace plockstat(1) - front-end to DTrace to print statistics about POSIX mutexes and read/write locks priclass.d(1m) - priority distribution by scheduling class. Uses DTrace pridist.d(1m) - process priority distribution. Uses DTrace procsystime(1m) - analyse system call times. Uses DTrace rwbypid.d(1m) - read/write calls by PID. Uses DTrace rwbytype.d(1m) - read/write bytes by vnode type. Uses DTrace rwsnoop(1m) - snoop read/write events. Uses DTrace sampleproc(1m) - sample processes on the CPUs. Uses DTrace seeksize.d(1m) - print disk event seek report. Uses DTrace setuids.d(1m) - snoop setuid calls as they occur. Uses DTrace sigdist.d(1m) - signal distribution by process. Uses DTrace syscallbypid.d(1m) - syscalls by process ID. Uses DTrace syscallbyproc.d(1m) - syscalls by process name. Uses DTrace syscallbysysc.d(1m) - syscalls by syscall. Uses DTrace topsyscall(1m) - top syscalls by syscall name. Uses DTrace topsysproc(1m) - top syscalls by process name. Uses DTrace Tcl_CommandTraceInfo(3tcl), Tcl_TraceCommand(3tcl), Tcl_UntraceCommand(3tcl) - monitor renames and deletes of a command bitesize.d(1m) - analyse disk I/O size by process. Uses DTrace cpuwalk.d(1m) - Measure which CPUs a process runs on. Uses DTrace creatbyproc.d(1m) - snoop creat()s by process name. Uses DTrace dappprof(1m) - profile user and lib function usage. Uses DTrace dapptrace(1m) - trace user and library function usage. Uses DTrace dispqlen.d(1m) - dispatcher queue length by CPU. Uses DTrace dtrace(1) - dynamic tracing compiler and tracing utility dtruss(1m) - process syscall details. Uses DTrace errinfo(1m) - print errno for syscall fails. Uses DTrace execsnoop(1m) - snoop new process execution. Uses DTrace fddist(1m) - file descriptor usage distributions. Uses DTrace filebyproc.d(1m) - snoop opens by process name. Uses DTrace hotspot.d(1m) - print disk event by location. Uses DTrace iofile.d(1m) - I/O wait time by file and process. Uses DTrace iofileb.d(1m) - I/O bytes by file and process. Uses DTrace iopattern(1m) - print disk I/O pattern. Uses DTrace iopending(1m) - plot number of pending disk events. Uses DTrace iosnoop(1m) - snoop I/O events as they occur. Uses DTrace iotop(1m) - display top disk I/O events by process. Uses DTrace kill.d(1m) - snoop process signals as they occur. Uses DTrace lastwords(1m) - print syscalls before exit. Uses DTrace loads.d(1m) - print load averages. Uses DTrace newproc.d(1m) - snoop new processes. Uses DTrace opensnoop(1m) - snoop file opens as they occur. Uses DTrace pathopens.d(1m) - full pathnames opened ok count. Uses DTrace perldtrace(1) - Perl's support for DTrace pidpersec.d(1m) - print new PIDs per sec. Uses DTrace plockstat(1) - front-end to DTrace to print statistics about POSIX mutexes and read/write locks priclass.d(1m) - priority distribution by scheduling class. Uses DTrace pridist.d(1m) - process priority distribution. Uses DTrace procsystime(1m) - analyse system call times. Uses DTrace rwbypid.d(1m) - read/write calls by PID. Uses DTrace rwbytype.d(1m) - read/write bytes by vnode type. Uses DTrace rwsnoop(1m) - snoop read/write events. Uses DTrace sampleproc(1m) - sample processes on the CPUs. Uses DTrace seeksize.d(1m) - print disk event seek report. Uses DTrace setuids.d(1m) - snoop setuid calls as they occur. Uses DTrace sigdist.d(1m) - signal distribution by process. Uses DTrace syscallbypid.d(1m) - syscalls by process ID. Uses DTrace syscallbyproc.d(1m) - syscalls by process name. Uses DTrace syscallbysysc.d(1m) - syscalls by syscall. Uses DTrace topsyscall(1m) - top syscalls by syscall name. Uses DTrace topsysproc(1m) - top syscalls by process name. Uses DTrace There's a lot here; we don't need to read everything. The point is that when you run into a problem you can just search through this list and find a tool dedicated to debugging that problem. Let’s say you're facing elevated disk write issues that are causing the performance of your application to degrade. . . But is it your app at fault or some other app? rwbypid.d can help you with that: it can generate a list of processes and the number of calls they have for read/write based on the process ID as seen in the following screenshot: We can use this information to better understand IO issues in our code or even in 3rd party applications/libraries. iosnoop is another tool that helps us track IO operations but with more details: In diagnosing elusive system issues, DTrace shines by enabling detailed observation of system calls, file operations, and network activities. For instance, it can be used to uncover the root cause of unexpected system behaviors or to trace the origin of security breaches, offering a level of detail that is often unattainable with other debugging tools. Performance optimization is the main area where DTrace demonstrates its strengths. It allows administrators and developers to pinpoint performance bottlenecks, whether they lie in application code, system calls, or hardware interactions. By providing real-time data on resource usage, DTrace helps in fine-tuning systems for optimal performance. Final Words In conclusion, DTrace stands as a powerful and versatile tool in the realm of system monitoring and debugging. We've explored its broad capabilities, from in-depth system analysis to individual process tracing, and its remarkable performance efficiency that allows for its use in live environments. Its cross-platform compatibility, coupled with the challenges and solutions specific to MacOS, highlights its widespread applicability. The customizability through scripting provides unmatched flexibility, adapting to a myriad of diagnostic needs. Real-world applications of DTrace in diagnosing system issues and optimizing performance underscore its practical value. DTrace's comprehensive toolkit offers an unparalleled window into the inner workings of systems, making it an invaluable asset for system administrators and developers alike. Whether it's for routine troubleshooting or complex performance tuning, DTrace provides insights and solutions that are essential in the modern computing landscape.

By Shai Almog

CORE

Improving Upon My OpenTelemetry Tracing Demo

Last year, I wrote a post on OpenTelemetry Tracing to understand more about the subject. I also created a demo around it, which featured the following components: The Apache APISIX API Gateway A Kotlin/Spring Boot service A Python/Flask service And a Rust/Axum service I've recently improved the demo to deepen my understanding and want to share my learning. Using a Regular Database In the initial demo, I didn't bother with a regular database. Instead: The Kotlin service used the embedded Java H2 database The Python service used the embedded SQLite The Rust service used hard-coded data in a hash map I replaced all of them with a regular PostgreSQL database, with a dedicated schema for each. The OpenTelemetry agent added a new span when connecting to the database on the JVM and in Python. For the JVM, it's automatic when one uses the Java agent. One needs to install the relevant package in Python — see next section. OpenTelemetry Integrations in Python Libraries Python requires you to explicitly add the package that instruments a specific library for OpenTelemetry. For example, the demo uses Flask; hence, we should add the Flask integration package. However, it can become a pretty tedious process. Yet, once you've installed opentelemetry-distro, you can "sniff" installed packages and install the relevant integration. Shell pip install opentelemetry-distro opentelemetry-bootstrap -a install For the demo, it installs the following: Plain Text opentelemetry_instrumentation-0.41b0.dist-info opentelemetry_instrumentation_aws_lambda-0.41b0.dist-info opentelemetry_instrumentation_dbapi-0.41b0.dist-info opentelemetry_instrumentation_flask-0.41b0.dist-info opentelemetry_instrumentation_grpc-0.41b0.dist-info opentelemetry_instrumentation_jinja2-0.41b0.dist-info opentelemetry_instrumentation_logging-0.41b0.dist-info opentelemetry_instrumentation_requests-0.41b0.dist-info opentelemetry_instrumentation_sqlalchemy-0.41b0.dist-info opentelemetry_instrumentation_sqlite3-0.41b0.dist-info opentelemetry_instrumentation_urllib-0.41b0.dist-info opentelemetry_instrumentation_urllib3-0.41b0.dist-info opentelemetry_instrumentation_wsgi-0.41b0.dist-info The above setup adds a new automated trace for connections. Gunicorn on Flask Every time I started the Flask service, it showed a warning in red that it shouldn't be used in production. While it's unrelated to OpenTelemetry, and though nobody complained, I was not too fond of it. For this reason, I added a "real" HTTP server. I chose Gunicorn, for no other reason than because my knowledge of the Python ecosystem is still shallow. The server is a runtime concern. We only need to change the Dockerfile slightly: Dockerfile RUN pip install gunicorn ENTRYPOINT ["opentelemetry-instrument", "gunicorn", "-b", "0.0.0.0", "-w", "4", "app:app"] The -b option refers to binding; you can attach to a specific IP. Since I'm running Docker, I don't know the IP, so I bind to any. The -w option specifies the number of workers Finally, the app:app argument sets the module and the application, separated by a colon Gunicorn usage doesn't impact OpenTelemetry integrations. Heredocs for the Win You may benefit from this if you write a lot of Dockerfile. Every Docker layer has a storage cost. Hence, inside a Dockerfile, one tends to avoid unnecessary layers. For example, the two following snippets yield the same results. Dockerfile RUN pip install pip-tools RUN pip-compile RUN pip install -r requirements.txt RUN pip install gunicorn RUN opentelemetry-bootstrap -a install RUN pip install pip-tools \ && pip-compile \ && pip install -r requirements.txt \ && pip install gunicorn \ && opentelemetry-bootstrap -a install The first snippet creates five layers, while the second is only one; however, the first is more readable than the second. With heredocs, we can access a more readable syntax that creates a single layer: Dockerfile RUN <<EOF pip install pip-tools pip-compile pip install -r requirements.txt pip install gunicorn opentelemetry-bootstrap -a install EOF Heredocs are a great way to have more readable and more optimized Dockerfiles. Try them! Explicit API Call on the JVM In the initial demo, I showed two approaches: The first uses auto-instrumentation, which requires no additional action The second uses manual instrumentation with Spring annotations I wanted to demo an explicit call with the API in the improved version. The use-case is analytics and uses a message queue: I get the trace data from the HTTP call and create a message with such data so the subscriber can use it as a parent. First, we need to add the OpenTelemetry API dependency to the project. We inherit the version from the Spring Boot Starter parent POM: XML <dependency> <groupId>io.opentelemetry</groupId> <artifactId>opentelemetry-api</artifactId> </dependency> At this point, we can access the API. OpenTelemetry offers a static method to get an instance: Kotlin val otel = GlobalOpenTelemetry.get() At runtime, the agent will work its magic to return the instance. Here's a simplified class diagram focused on tracing: In turn, the flow goes something like this: Kotlin val otel = GlobalOpenTelemetry.get() //1 val tracer = otel.tracerBuilder("ch.frankel.catalog").build() //2 val span = tracer.spanBuilder("AnalyticsFilter.filter") //3 .setParent(Context.current()) //4 .startSpan() //5 // Do something here span.end() //6 Get the underlying OpenTelemetry Get the tracer builder and "build" the tracer Get the span builder Add the span to the whole chain Start the span End the span; after this step, send the data to the OpenTelemetry endpoint configured Adding a Message Queue When I did the talk based on the post, attendees frequently asked whether OpenTelemetry would work with messages such as MQ or Kafka. While I thought it was the case in theory, I wanted to make sure of it: I added a message queue in the demo under the pretense of analytics. The Kotlin service will publish a message to an MQTT topic on each request. A NodeJS service will subscribe to the topic. Attaching OpenTelemetry Data to the Message So far, OpenTelemetry automatically reads the context to find out the trace ID and the parent span ID. Whatever the approach, auto-instrumentation or manual, annotations-based or explicit, the library takes care of it. I didn't find any existing similar automation for messaging; we need to code our way in. The gist of OpenTelemetry is the traceparent HTTP header. We need to read it and send it along with the message. First, let's add MQTT API to the project. XML <dependency> <groupId>org.eclipse.paho</groupId> <artifactId>org.eclipse.paho.mqttv5.client</artifactId> <version>1.2.5</version> </dependency> Interestingly enough, the API doesn't allow access to the traceparent directly. However, we can reconstruct it via the SpanContext class. I'm using MQTT v5 for my message broker. Note that the v5 allows for metadata attached to the message; when using v3, the message itself needs to wrap them. JavaScript val spanContext = span.spanContext //1 val message = MqttMessage().apply { properties = MqttProperties().apply { val traceparent = "00-${spanContext.traceId}-${spanContext.spanId}-${spanContext.traceFlags}" //2 userProperties = listOf(UserProperty("traceparent", traceparent)) //3 } qos = options.qos isRetained = options.retained val hostAddress = req.remoteAddress().map { it.address.hostAddress }.getOrNull() payload = Json.encodeToString(Payload(req.path(), hostAddress)).toByteArray() //4 } val client = MqttClient(mqtt.serverUri, mqtt.clientId) //5 client.publish(mqtt.options, message) //6 Get the span context Construct the traceparent from the span context, according to the W3C Trace Context specification Set the message metadata Set the message body Create the client Publish the message Getting OpenTelemetry Data From the Message The subscriber is a new component based on NodeJS. First, we configure the app to use the OpenTelemetry trace exporter: JavaScript const sdk = new NodeSDK({ resource: new Resource({[SemanticResourceAttributes.SERVICE_NAME]: 'analytics'}), traceExporter: new OTLPTraceExporter({ url: `${collectorUri}/v1/traces` }) }) sdk.start() The next step is to read the metadata, recreate the context from the traceparent, and create a span. JavaScript client.on('message', (aTopic, payload, packet) => { if (aTopic === topic) { console.log('Received new message') const data = JSON.parse(payload.toString()) const userProperties = {} if (packet.properties['userProperties']) { //1 const props = packet.properties['userProperties'] for (const key of Object.keys(props)) { userProperties[key] = props[key] } } const activeContext = propagation.extract(context.active(), userProperties) //2 const tracer = trace.getTracer('analytics') const span = tracer.startSpan( //3 'Read message', {attributes: {path: data['path'], clientIp: data['clientIp']}, activeContext, ) span.end() //4 } }) Read the metadata Recreate the context from the traceparent Create the span End the span For the record, I tried to migrate to TypeScript, but when I did, I didn't receive the message. Help or hints are very welcome! Apache APISIX for Messaging Though it's not common knowledge, Apache APISIX can proxy HTTP calls as well as UDP and TCP messages. It only offers a few plugins at the moment, but it will add more in the future. An OpenTelemetry one will surely be part of it. In the meantime, let's prepare for it. The first step is to configure Apache APISIX to allow both HTTP and TCP: YAML apisix: proxy_mode: http&stream #1 stream_proxy: tcp: - addr: 9100 #2 tls: false Configure APISIX for both modes Set the TCP port The next step is to configure TCP routing: YAML upstreams: - id: 4 nodes: "mosquitto:1883": 1 #1 stream_routes: #2 - id: 1 upstream_id: 4 plugins: mqtt-proxy: #3 protocol_name: MQTT protocol_level: 5 #4 Define the MQTT queue as the upstream Define the "streaming" route. APISIX defines everything that's not HTTP as streaming Use the MQTT proxy. Note APISIX offers a Kafka-based one Address the MQTT version. For version above 3, it should be 5 Finally, we can replace the MQTT URLs in the Docker Compose file with APISIX URLs. Conclusion I've described several items I added to improve my OpenTelemetry demo in this post. While most are indeed related to OpenTelemetry, some of them aren't. I may add another component in another different stack, a front-end. The complete source code for this post can be found on GitHub.

By Nicolas Fränkel

CORE

Making Waves: Dynatrace Perform 2024 Ushers in New Era of Observability

Dynatrace welcomed thousands of in-person and virtual attendees to its annual Perform conference in Las Vegas this week. The overarching theme was “Make Waves,” – conveying both the tectonic shifts happening across industries and opportunities for organizations to drive transformational impact. True to the cutting-edge nature of the company, Dynatrace had several major announcements that will allow enterprises to tackle some of today’s most pressing challenges around cloud complexity, AI adoption, security threats, and sustainability commitments. Let’s dive into the key developments. Reducing the IT Carbon Footprint With climate change accelerating, reducing carbon emissions has become a business imperative. However, IT infrastructures are extremely complex, making it difficult for enterprises to quantify and optimize their footprints at scale. Dynatrace Carbon Impact is purpose-built to address this challenge. It translates highly granular observability data like compute utilization metrics into accurate sustainability impacts per data center, cloud provider regions, host clusters, and even individual workloads. Teams can instantly identify “hot spots” representing the highest energy waste and emissions for focused efficiency gains. For example, Carbon Impact may reveal an overload of duplicate microservices, dragging down utilization rates across critical application resources. It also suggests precise optimization actions based on cloud architectures and dependencies, like eliminating grossly underutilized instances. Moreover, its continuous monitoring provides oversight into sustainability KPIs overtime after taking measures like rightsizing initiatives or green coding enhancements. According to Dynatrace customer Lloyds Banking Group, which aims to cut operational carbon 75% by 2030, these capabilities create “the visibility and impact across IT ecosystems needed to optimize infrastructure efficiency.” As businesses pursue environmental goals amidst cloud scale and complexity, Carbon Impact makes observability the key enabler to reaching those targets. Making Observability Work for AI Artificial intelligence holds tremendous promise, but as the adoption of complex technologies like large language models and generative AI accelerates, new observability challenges arise. These modern AI workloads can behave unexpectedly, carry proprietary IP within models, hampering visibility, and operate as black boxes unable to trace failures. Their on-demand consumption models also make resource usage hard to predict and control. Dynatrace AI Observability is purpose-built to overcome these hurdles. It instruments the entire AI stack, including infrastructure like GPU clusters, ML pipelines, model governance systems, and AI apps. This full-stack observability combined with explanatory models from Davis AI delivers precise insights into the provenance and behavior of AI systems. Teams can pinpoint the root causes of model degradation plus quantify accuracies. For large language models like GPT, in particular, Dynatrace traces query patterns and token consumption to prevent overages. As models iteratively learn from new data, they monitor for harmful drift. This governance ensures models operate reliably and cost-effectively at the enterprise scale. In an environment demanding responsible and secure AI rollouts across industries, observability is no longer optional. Dynatrace equips businesses to drive generative AI and ML innovation with confidence. Driving Analytics and Automation at Scale Modern cloud-native environments generate massive data streams that are difficult for enterprises to smoothly manage, let alone extract value from. Constrained bandwidth and storage compound the issue, while ad hoc observability pipelines and data quality defects create headaches for practitioners. Dynatrace OpenPipeline elegantly solves these challenges. It offers a single, high-powered route to funnel all observability, security, and business telemetry pouring from dynamic cloud workloads into value-driving analytics and automation platforms like Dynatrace. Leveraging patent-pending accelerated processing algorithms combined with instant query abilities, OpenPipeline can evaluate staggering data volumes in flight up to 5-10 times faster than alternatives to unlock real-time analytics use cases previously unachievable. No need for clumsy sampling approximations. It also enriches telemetry with full topology context for precise answers while allowing teams to seamlessly filter, route, and transform data on ingest based on specific analytics or compliance needs. OpenPipeline even helps reduce duplicate streams by up to 30% to minimize bandwidth demands and required data warehouse storage capacity. For developers, SRE, and data engineering teams struggling to build custom pipelines handling massive, myriad data sources across today's heterogeneous enterprise stacks, OpenPipeline brings simplicity and performance, allowing more focus on extracting insights. Ensuring Analytics and Automation Quality Making decisions or triggering critical workflows based on bad data can spell disaster for organizations. But maintaining flawless data quality gets exponentially harder as cloud scale and complexity mushroom. Luckily for Dynatrace platform users, Data Observability helps eliminate these worries. It leverages Davis AI and other Dynatrace modules to automatically track key telemetry health metrics on ingest, including freshness, volume patterns, distribution outliers, and even schema changes. Any anomalies threatening downstream analytics and automation fidelity trigger alerts for investigation, made easy by lineage tracking to pinpoint root sources even across interconnected data pipelines. Teams save countless hours and no longer need to manually piece together where data defects originated. But beyond reactive governance, Dynatrace Data Observability also proactively optimizes analytics by continually assessing the relevance and utilization of data feeds. Teams can confidently retire unused streams wasting resources or identify new sources to incorporate for better insights and models. For developers building custom data integrations and architects managing business-critical analytics, worry-free data means more efficient delivery of value and innovation for the business. Data Observability grants the ease of mind that both historical and real-time data fueling crucial automation is fully trustworthy. The Path to Software Perfection Across the board, Dynatrace Perform 2024 indicated how AI and automation will reshape performance engineering. Founder and CTO Bernd Greifeneder summarized it perfectly: “We built Dynatrace to help customers automate because that is how you get to software perfection. These advances give teams the answers and governance to prevent problems automatically versus manual fixes.” Dynatrace Perform attendees are clearly excited for observability’s next paradigm shift.

By Tom Smith

CORE

Introducing Kubecost 2.0

Today, we’re proudly announcing the launch of Kubecost 2.0. It’s available for free to all users and can be accessed in seconds. Our most radical release yet—we’re shipping more than a dozen major new features and an entirely new API backend. Let’s delve into key features and enhancements that make Kubecost 2.0 the best Kubernetes cost management solution available. Here’s an overview of all the great new functionality you can find in Kubecost 2.0: Network Monitoring Visualizes All Traffic Costs Kubecost’s Network Monitoring provides full visibility into Kubernetes and cloud network costs. By monitoring the cost of pods, namespaces, clusters, and cloud services, you can quickly determine where in your infrastructure is driving spend in near real-time. Interacting with this feature, you can discover more about the sources of your inbound and outbound traffic costs, drag and drop icons, or hone in on specific traffic routes. This functionality is especially helpful for larger organizations or teams hoping to learn more about their complex network costs. Learn more in our Network Monitoring doc. Collections Combine Kubernetes and Cloud Costs The new Collections page lets you create custom spend categories comprised of both Kubernetes and cloud costs while removing any overlapping or duplicate costs. This is especially helpful for teams with complex and multi-faceted cost sources that don’t wish to relabel their costs in the cloud or Kubecost. Additionally, aggregating and filtering ensure you only see the costs you want to see and nothing else. Read more in our Collections doc. Kubecost Actions Kubecost Actions provides users with automated workflows to optimize Kubernetes costs. It’s available today with three core actions: dynamic request sizing, cluster turndown, and namespace turndown. We’ve made it easier to create your schedules and get the most out of our offered savings functionality. Learn more in our Actions doc. Forecast Spend With Machine Learning New machine learning-based forecasting models leverage historical Kubernetes and cloud data to provide accurate predictions, allowing teams to anticipate cost fluctuations and allocate resources efficiently. You can access forecasting through Kubecost’s major monitoring dashboards, Allocations, Assets, and the Cloud Cost Explorer by selecting from any future date ranges. You will then see projected future costs along with your realized spending. Learn about forecasting here. Anomaly Detection Anomaly Detection takes cost forecasting a step beyond by allowing you to detect when actual spend deviates from spend predicted by Kubecost. You can quickly identify unexpected spending across key areas and address overages quickly where appropriate. Allowing users to ensure their cloud or Kubernetes spending does not exceed expectations significantly. Read more in our Anomaly Detection doc. 100X Performance Improvement at Scale Kubecost 2.0 introduces a major upgrade with a new API backend, delivering a massive 100x performance improvement at scale, coupled with a 3x enhancement in resource efficiency. This means teams can now experience significantly faster and more responsive interactions with both Kubecost APIs and UI, especially when dealing with large-scale Kubernetes environments. The ability to query 3+ years of historical data provides engineering and FinOps teams with a comprehensive view of resource utilization trends, enabling more informed decision-making and long-term trend analysis. Installing Kubecost You can upgrade to Kubecost 2.0 in seconds, and it’s free to install. Get started with the following helm command to begin visualizing your Kubernetes costs and identifying optimizations: Shell helm install kubecost cost-analyzer \ --repo https://kubecost.github.io/cost-analyzer/ \ --namespace kubecost --create-namespace \ --set kubecostToken="YnJldHRAa3ViZWNvc3QuY29txm343yadf98" Next Steps This is only a preview of the key features now available with Kubecost 2.0. Check out our full release notes full release notes to read about all the great features available in self-managed Kubecost. Other notable features of this release include real-time cost learning, team access management, monitoring shared GPUs, and more. Want to see Kubecost 2.0 in action? Join our Kubecost 2.0 webinar Kubecost 2.0 webinar on Thursday, February 15th at 1 PM ET (10 AM PT), where we will be doing a deep-dive on the new functionality to show you how you can empower your team with granular, actionable insights for efficient Kubernetes operations.

By Jennifer Staretorp

Introduction to Grafana, Prometheus, and Zabbix

What Is Grafana? Grafana is an open-source tool to visualize the metrics and logs from different data sources. It can query those metrics, send alerts, and can be actively used for monitoring and observability, making it a popular tool for gaining insights. The metrics can be stored in various DBs, and Grafana supports most of them, like Prometheus, Zabbix, Graphite, MySQL, PostgreSQL, Elasticsearch, etc. If the data sources are not available then customized plugins can be developed to integrate these data sources. Grafana is used widely these days to monitor and visualize the metrics for 100s or 1000s of servers, Kubernetes Platforms, Virtual Machines, Big Data Platforms, etc. The key feature of Grafana is its ability to share these metrics in visual forms by creating dashboards so that the teams can collaborate for data analysis and provide support in real-time. Various platforms that are supported by Grafana today: Relational Databases Cloud Services like Google Cloud Monitoring, Amazon Cloud Watch, Azure Cloud Time Series Databases like InfluxDB for memory and CPU Usage graphs Other Data Sources like Elasticsearch and Graphite What Is Prometheus? It’s an open-source data source that is used for infrastructure monitoring and observability. It stores the time-series data, which is collected from various sources like applications developed in various programming languages, virtual machines, databases, servers, Kubernetes clusters, etc. To query these metrics, it uses a query language called PromQL that can be used to explore these metrics for various times and intervals and ensure insights into the health of the systems mentioned above. To create dashboards, send alerts, and ensure observability, tools like Grafana are used. What Is Zabbix? Zabbix is used for comprehensive monitoring which ensures the reliability and efficiency of the IT infrastructure like network, servers, and applications. It has three components: Zabbix Server, Zabbix Agent, and Frontend. Zabbix Server is used for gathering the data. The Zabbix agent collects and sends the data to the Zabbix Server. Frontend is a Web Interface for configuration. Comparison Between Zabbix and Prometheus Features/Aspects Prometheus Zabbix Primary Use Case Collection of metrics of the servers and services. Comprehensive monitoring of network, servers, and applications. Data Collection Collection of metrics in numbers and can be viewed via HTTP Endpoints. Agents (Zabbix Agent) for collecting performance data, SNMP, IPMI, and JMX support. Supports agentless monitoring for certain scenarios. Logging Cannot collect or analyze logs. Centralized logging analysis is not possible with Zabbix. You can monitor log files, but no analysis possibilities are available. Data Visualization Data visualization is possible through numbers and graphs. Graphical representation of monitored data through charts, graphs, maps, and dashboards. Application Metrics Application metrics can be collected if Prometheus is integrated with the web application. No Application-related metrics/dashboards/Alerts are available at this moment. Service Metrics Prometheus can collect metrics from web applications. Zabbix can monitor services like haproxy, databases MySQL, PostgreSQL, HTTP services, etc., and require configurations/integrations. Custom Metrics Custom metrics can be collected through the exporters. The Anomaly Detection feature is available from version 6.0. Alerting System Alerting is not possible via Prometheus. Robust alerting system with customizable triggers, actions, escalations, and notification channels.Alerts can trigger E-mail Notifications, Slack, Pagerduty, Jira, etc. Service Availability Not possible. Zabbix has a built-in feature to generate Service Availability Reports. Planned and unplanned outage windows require manual input to the script, which generates a service availability report. Retention Policy Prometheus can retain the metrics for 1 day to multiple days. Zabbix persists all the data in its own database. Security Features Prometheus and most exporters support TLS. Including authentication of clients via TLS client certificates. Details on configuring Prometheus are here. The Go projects share the same TLS library based on the Go crypto/tls library. Robust user roles and permissions, granular control over user access, and secure communication between components. Kubernetes Compatibility Prometheus can be fully integrated with Kubernetes clusters. It takes full advantage of exporters to collect the metrics and show them on the UI. CompatibilityZabbix is compatible with K8s and can be used to monitor various aspects of the K8s environment.No Logging using K8s Log ForwarderApplication logs cannot be forwarded to the Zabbix server.Container Monitoring with Zabbix The Zabbix Server can be configured to collect data from Zabbix Agents deployed on K8s nodes and applications.ScalabilityZabbix also scales horizontally, making it suitable for large K8s deployments with a growing number of nodes and applications. Adding a Data Source in Grafana A data source is the location from where the metrics are being sourced. These metrics are integrated into Grafana for visualization and other purposes. Prerequisites You need to have an ‘administrator’ role in Grafana to make these changes. You need to have connection details of the data source, like database name, login details, URL and port number of the database, and other relevant information. The below steps can be used to add a custom source: First, navigate to the sidebar and open the context menu. Then click on Configurations and then Data Sources. To create a data source, add it first. Click on the Connections from the menu and create a data source. Now, select the type of data source where the metrics are being sourced from. If it’s a custom data source, then click on custom data source. Provide the connection details that are required and were collected in pre-requisites. Save and Test the connectivity and ensure there are no errors. Once the data source is saved, you can explore the metrics and also create the dashboards. Demo on How To Integrate Grafana and Prometheus to Monitor the Metrics of a Server Assumptions Operating System: Centos 7 Linux Virtual Machine Internet is available to access the packages Root Access to the VM Single Machine setup: Grafana, Prometheus, and Node Exporter are installed on a single VM Node Exporter Install Grafana Disable Selinux and Firewall Plain Text systemctl stop firewalld systemctl disable firewalld Add yum.repo for Grafana Plain Text [grafana] name=grafana baseurl=https://packages.grafana.com/oss/rpm repo_gpgcheck=1 enabled=1 gpgcheck=1 gpgkey=https://packages.grafana.com/gpg.key sslverify=1 sslcacert=/etc/pki/tls/certs/ca-bundle.crt Plain Text [root@vm-grafana ~]# yum install grafana [root@vm-grafana ~]# vim /etc/sysconfig/grafana-server Edit the Grafana configuration file to add the port and IP where Grafana is installed. [root@vm-grafana ~]# vim /etc/grafana/grafana.ini Uncomment # The http port to use http_port = 3000 # The public facing domain name used to access grafana from a browser domain = 127.0.0.1 Restart Grafana Server Service and Check Logs Log file location. [root@vm-grafana ~]# tail -f /var/log/grafana/grafana.log [root@vm-grafana ~]# systemctl restart grafana-server [root@vm-grafana ~]# systemctl status grafana-server Connect to the Web UI Grafana will connect to Port 3000. Image 1: Grafana Web UI Install Prometheus and Node-Exporter Download Package for Prometheus Location. Plain Text yum install wget wget https://github.com/prometheus/prometheus/releases/download/v2.49.1/prometheus-2.49.1.linux-amd64.tar.gz Saving to: ‘prometheus-2.49.1.linux-amd64.tar.gz’ Installation useradd --no-create-home --shell /bin/false prometheus mkdir /etc/Prometheus mkdir /var/lib/Prometheus chown prometheus:prometheus /etc/prometheus chown prometheus:prometheus /var/lib/prometheus Extract the package. Plain Text [root@vm-grafana ~]# tar zxvf prometheus-2.49.1.linux-amd64.tar.gz mv prometheus-2.49.1.linux-amd64 prometheuspackage cp prometheuspackage/prometheus /usr/local/bin/ cp prometheuspackage/promtool /usr/local/bin/ chown prometheus:prometheus /usr/local/bin/prometheus chown prometheus:prometheus /usr/local/bin/promtool cp -r prometheuspackage/consoles /etc/prometheus cp -r prometheuspackage/console_libraries /etc/Prometheus chown -R prometheus:prometheus /etc/prometheus/consoles chown -R prometheus:prometheus /etc/prometheus/console_libraries chown prometheus:prometheus /etc/prometheus/prometheus.yml Edit configuration file. Plain Text global: scrape_interval: 10s scrape_configs: - job_name: 'prometheus_master' scrape_interval: 5s static_configs: - targets: ['localhost:9090'] Create a Linux Service file for Prometheus. Plain Text vim /etc/systemd/system/prometheus.service [Unit] Description=Prometheus Wants=network-online.target After=network-online.target [Service] User=prometheus Group=prometheus Type=simple ExecStart=/usr/local/bin/prometheus \ --config.file /etc/prometheus/prometheus.yml \ --storage.tsdb.path /var/lib/prometheus/ \ --web.console.templates=/etc/prometheus/consoles \ --web.console.libraries=/etc/prometheus/console_libraries [Install] WantedBy=multi-user.target Start the service. Plain Text systemctl daemon-reload systemctl start prometheus systemctl status prometheus Access Prometheus Web UI. Image 2: Prometheus Web UI Image 3: Prometheus Web UI Monitor Linux Server Using Prometheus and Integration With node_exporter Download the Setup Plain Text wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz [root@vm-grafana ~]# tar zxvf node_exporter-1.7.0.linux-amd64.tar.gz [root@vm-grafana ~]# ls -ld node_exporter-1.7.0.linux-amd64 drwxr-xr-x. 2 prometheus prometheus 56 Nov 13 00:03 node_exporter-1.7.0.linux-amd64 Setup Instructions Plain Text useradd -rs /bin/false nodeusr mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/ Create a Service File for the Node Exporter Plain Text vim /etc/systemd/system/node_exporter.service [Unit] Description=Node Exporter After=network.target [Service] User=nodeusr Group=nodeusr Type=simple ExecStart=/usr/local/bin/node_exporter [Install] WantedBy=multi-user.target Reload the System Daemon and Start Node Exporter Service Plain Text systemctl daemon-reload systemctl restart node_exporter systemctl enable node_exporter View the metrics browsing node exporter URL. Image 4: node_exporter Web UI Integrate node_exporter With Prometheus Login to the Prometheus server and modify the prometheus.yml configuration file. Add the following configurations under the scrape config. Plain Text vim /etc/prometheus/prometheus.yml - job_name: 'node_exporter_centos' scrape_interval: 5s static_configs: - targets: ['TARGET_SERVER_IP:9100'] Restart Prometheus service. Plain Text systemctl restart prometheus Login to Prometheus Server Web Interface, and Check Targets Follow this link. Image 5: node_exporter in Prometheus Image 6: node_exporter in Prometheus You can click the graph and query any server metrics and click execute to show output. It will show the console output. Image 7: Metrics from the VM Image 8: Building Graph from the Query Metrics Add Prometheus as DataSource in Grafana Click on Add a new data source and add Prometheus as the data source by entering the Prometheus URL. Image 9: Add a new data source Import a pre-built dashboard from Grafana using this link and ID. Dashboard ID: 10180 Click on Dashboard and go to the imported dashboard. You should now be able to see all the metrics of the server. Image 10: Grafana Dashboard Next Steps Users can explore more on setting up alerts, adding Role Based Access Control, importing metrics from a remote server, and Grafana Administration as the next steps.

By Babul Bansal

OpenTelemetry vs. Prometheus: Which One’s Right for You?

OpenTelemetry and Prometheus are both open-source, but they can have a significant difference in how your cloud application functions. While OpenTelemetry is ideal for cloud-native applications and focuses on monitoring and improving application performance, Prometheus prioritizes reliability and accuracy. So, which one is the ideal option for your observability needs? The answer to this question is not as straightforward as you might expect. Both OpenTelemetry and Prometheus have their own strengths and weaknesses, catering to different needs and priorities. If you are confused about which option to go for, this blog post aims to be your guiding light through the intricacies of OpenTelemetry Vs Prometheus. We will unravel their architectures, dissect ease of use, delve into pricing considerations, and weigh the advantages and disadvantages. What Is OpenTelemetry? To comprehend the decision on whether to go with OpenTelemetry vs. Prometheus, we must first understand each option. Let's begin by decoding OpenTelemetry. OpenTelemetry is an open-source observability framework designed to provide comprehensive insights into the performance and behavior of software applications. Developed as a merger of OpenTracing and OpenCensus, OpenTelemetry is now a Cloud Native Computing Foundation (CNCF) project, enjoying widespread adoption within the developer community. OTel Architecture The OpenTelemetry architecture reflects this multi-dimensional vision. It comprises crucial components like: The API It acts as a universal translator, enabling applications to "speak" the language of telemetry, regardless of language or framework. These APIs provide a standardized way to generate traces, metrics, and logs, ensuring consistency and interoperability across tools. It also offers a flexible foundation for instrumenting code and capturing telemetry data in a structured format. The SDKs Language-specific libraries (available for Java, Python, JavaScript, Go, and more) that implement the OpenTelemetry API. They provide convenient tools to instrument code, generate telemetry data, and send it to the collector. SDKs help simplify the process of capturing telemetry data, making it easier for developers to integrate observability into their applications. The Collector The Otel collector is the heart of the OpenTelemetry architecture and is responsible for receiving, processing, and exporting telemetry data to various backends. It can be deployed as an agent on each host or as a centralized service. OpenTelemery offers a range of configurations and exporters for seamless integration with popular observability tools like Prometheus, Jaeger, Middleware, Datadog, Grafana, and more. Exporters Exporters are crucial in OpenTelemetry for transmitting collected telemetry data to external systems. The platform supports a variety of exporters, ensuring compatibility with popular observability platforms and backends. Context Propagation OpenTelemetry incorporates context propagation mechanisms to link distributed traces seamlessly. This ensures that a trace initiated in one part of your system can be followed through various interconnected services. Benefits of OpenTelemetry This modular design offers unmatched flexibility. You can choose the SDKs that suit your languages and environments and seamlessly integrate them with your existing observability tools. Moreover, OpenTelemetry boasts vendor-agnosticism, meaning you're not locked into a specific platform. It's your data, your freedom. However, this complexity comes with some trade-offs. OpenTelemetry is still evolving, and its ecosystem is less mature than Prometheus. Getting started might require more effort, and the Instrumentation overhead can be slightly higher. It's a trade-off between a richer picture and immediate usability. So, is OpenTelemetry suitable for you? If you seek the power of complete observability, the flexibility to adapt, and the freedom to choose, then OpenTelemetry might be your ideal partner. But be prepared to invest the time and effort to leverage its full potential. What Is Prometheus? Now, let’s understand the tool that provides a complete range of observability solutions. Prometheus, an open-source monitoring and alerting toolkit, was conceived at SoundCloud in 2012 and later donated to the Cloud Native Computing Foundation (CNCF). Praised for its simplicity and reliability, Prometheus has become a cornerstone for organizations seeking a robust solution to monitor their applications and infrastructure. Its focus is laser-sharp: collecting time-series data that paints a quantitative picture of your system's health and performance. This includes its pull-based model, where your exporter pushes metrics to Prometheus on its terms and minimizes operational overhead. The PromQL query language lets you slice and dice your metrics with surgical precision, creating insightful graphs and alerts. Key Components of the Prometheus Architecture To appreciate the nuances of Prometheus, it's essential to comprehend the underlying architecture that propels its monitoring capabilities. Prometheus server: At the core of Prometheus is its server, which is responsible for scraping and storing time-series data through HTTP pull requests. Data model: Prometheus embraces a multi-dimensional data model, utilizing key-value pairs for labels to identify time-series data uniquely. PromQL: A powerful query language, PromQL, enables users to retrieve and analyze time-series data collected by Prometheus. Alerting rules: Prometheus incorporates a robust alerting system, allowing users to define rules based on queries and thresholds. Exporters: Similar to OpenTelemetry, Prometheus leverages APIs to gather metrics from various sources, ensuring flexibility in monitoring diverse components. So, when is Prometheus the perfect fit? If your primary concern is monitoring key metrics across your system, and you value operational simplicity and robust tools, then Prometheus won't disappoint. It's ideal for situations where you need clear, quantitative insights without the complexities of multi-dimensional data collection. OpenTelemetry vs. Prometheus Now that we have understood both platforms let us make a head-to-head comparison of OpenTelemetry Metrics and Prometheus to understand their strengths and weaknesses. Ease of Use Criteria OpenTelemetry Prometheus Instrumentation It offers libraries for multiple languages, making it accessible to diverse ecosystems. Requires exporters for instrumentation, which may be perceived as an additional step. Configuration Features auto-instrumentation for common frameworks, simplifying setup. Configuration can be manual, necessitating a deeper understanding of settings. Learning Curve Users familiar with OpenTracing or OpenCensus may find the transition smoother. PromQL and Prometheus-specific concepts may pose a learning curve for some users. Use Case Criteria OpenTelemetry Prometheus Application Types Well-suited for complex, distributed microservices architectures. It is ideal for monitoring containerized environments and providing real-time insights. Data Types Captures both traces and metrics, offering comprehensive observability. Primarily focused on time-series metrics but has some support for event-based monitoring. Ecosystem Integration Widespread adoption and compatibility with various observability platforms. Strong integration with Kubernetes and native support for exporters and service discovery. Pricing Criteria OpenTelemetry Prometheus Licensing OpenTelemetry is open source with an Apache 2.0 license, offering flexibility. Prometheus follows the open-source model with an Apache 2.0 license, providing freedom of use. Operational Costs Costs may vary based on the chosen backend and hosting options. Typically, operational costs are associated with storage and scalability requirements. Advantages OpenTelemetry Comprehensive observability with both traces and metrics. Wide language support and ecosystem integration. Active community support and continuous development. Vendor-agnostic, flexible, richer data context, future-proof. Prometheus Efficient real-time monitoring with a powerful query language (PromQL). Strong support for containerized environments. Robust alerting capabilities. Proven stability, efficient data collection, familiar tools, and integrations. Disadvantages OpenTelemetry Higher instrumentation overhead, less mature ecosystem. Some users may experience a learning curve. Exporter configuration can be complex. Prometheus Limited data scope (no traces or logs), potential vendor lock-in for specific integrations. Configuration may seem manual and intricate for beginners. Conclusion The ultimate choice hinges on your needs. Weigh your needs, assess your resources, and listen to your system's requirements. Does it call for a multifaceted architecture or a focused, metric-driven solution? The answer will lead you to your ideal observability platform. OpenTelemetry offers a unified observability solution, while Prometheus excels in specialized scenarios. But remember, this is not a competition but a collaboration. You can integrate both OpenTelemetry and Prometheus to combine their strengths. Start by using OpenTelemetry to capture your system's observability data, and let Prometheus translate it into actionable insights through its metric-powered lens.

By Savan Kharod

Cloud Native London Meetup: 3 Pitfalls Everyone Should Avoid With Cloud Native Observability

Recently, I was back at the Cloud Native London meetup, having been given the opportunity to present due to a speaker canceling at the last minute. This group has 7,000+ members and is, "...the official Cloud Native Computing Foundation (CNCF) Meetup group dedicated to building a strong, open, diverse developer community around the Cloud Native platform and technologies in London." You can also find them on their own Slack channel, so feel free to drop in for a quick chat if you like. There were over 85 attendees who braved the cold London evening to join us for pizza, drinks, and a bit of fun with my session having a special design this time around. I went out on a limb and tried something I'd never seen before - a sort of choose-your-own-adventure presentation. Below I've included a description of how I think it went, the feedback I got, and where you can find both the slides and recording online if you missed it. About the Presentation Here are the schedule details for the day: Check out the three fantastic speakers we've got lined up for you on Wednesday 10 January: 18:00 Pizza and drinks 18:30 Welcome 18:45 Quickwit: Cloud-Native Logging and Distributed Tracing (Francois Massot, Quickwit) 19:15 - 3 Pitfalls Everyone Should Avoid with Cloud Native Observability (Eric D. Schabell, Chronosphere) 19:45 Break 20:00 Transcending microservices hell for Supergraph Nirvana (Tom Harding, Hasura) 20:30 Wrap up See you there! The agenda for the January Cloud Native London Meetup is now up. If you're not able to join us, don't forget to update your RSVP before 10am on Wednesday! Or alternatively, join us via the YouTube stream without signing up. As I mentioned, my talk is a new idea I've been working on for the last year. I want to share insights into the mistakes and pitfalls that I'm seeing customers and practitioners make repeatedly on their cloud-native observability journey. Not only were there new ideas for content, but I wanted to try something a bit more daring this time around and tried to engage the audience with a bit of choose-your-own-adventure in which they were choosing which pitfall would be covered next. I started with a generic introduction, then gave them the following six choices: Ignoring costs in the application landscape Focusing on The Pillars Sneaky sprawling tooling mess Controlling costs Losing your way in the protocol jungles Underestimating cardinality For this Cloud Native London session, we ended up going in this order: pitfalls #6, #3, and #4. This meant the session recording posted online from the event contained the following content: Introduction to cloud-native and cloud-native observability problems (framing the topic) Pitfall #1 - Underestimating cardinality Pitfall #2 - Sneaky sprawling tooling mess Pitfall #3 - Controlling costs It went pretty smoothly and I was excited to get a lot of feedback from attendees who enjoyed the content, the takes on cloud-native observability pitfalls, and they loved the engaging style of choosing your own adventure! If you get the chance to see this talk next time I present it, there's a good chance it will contain completely different content. Video, Slides, and Abstract Session Video Recording Session Slides 3 Pitfalls Everyone Should Avoid with Cloud Native Observability from Eric D. Schabell Abstract Are you looking at your organization's efforts to enter or expand into the cloud-native landscape and feeling a bit daunted by the vast expanse of information surrounding cloud-native observability? When you're moving so fast with agile practices across your DevOps, SREs, and platform engineering teams, it's no wonder this can seem a bit confusing. Unfortunately, the choices being made have a great impact on your business, your budgets, and the ultimate success of your cloud-native initiatives. That hasty decision up front leads to big headaches very quickly down the road. In this talk, I'll introduce the problem facing everyone with cloud-native observability followed by 3 common mistakes that I'm seeing organizations make and how you can avoid them! Coming Up I am scheduled to return in May to present again and look forward to seeing everyone in London in the spring!

By Eric D. Schabell

CORE

Monitoring and Observability

DZone's Featured Monitoring and Observability Resources

Top Monitoring and Observability Experts

The Latest Monitoring and Observability Topics