Conf42 DevOps 2023 - Online

Observability of Microservices using Open Source Solutions

Abstract

Observability is the ability to measure the internal states of a system by examining its outputs. Observability is about bringing visibility into a system - turning the lights on, to see and understand the state of each component of the system, with context to aid with debugging and performance tuning. In essence, it’s a method for learning about what you don’t know from what you do know. Today’s cloud-native applications must be developed with observability at the core of their design. Observability is no longer a ‘neat’ practice that only superstar engineering teams incorporate but a clear day 1 developer concern. Finding tooling that helps you bridge the correlation between logs, metrics and traces into a perfect concoction is crucial, yet building an efficient observability system might seem like a daunting and expensive task to many teams only recently venturing into the world of reliability. However, setting up a Day 1 observability structure can be much simpler than anticipated with the right insights and tooling, and can even be achieved solely with Open Source Solutions. In this talk, I’ll aim to acquaint listeners with how observability is being achieved in organisations of different scales and how they can establish an efficient microservice observability system using just open source tools.

Summary

  • Shubham talks about observability of microservices using open source tools. It's the ability to measure the internal states of the system by just examining its outputs. It gives engineers a proactive approach to optimizing the system. Why now? Software complexity is increasing at an exponential rate.
  • Kind is a tool for running local Kubernetes clusters using Docker container nodes. If you have Go 1.17+ and Docker installed, all you have to do is run a single command. I'll be sharing all links, snippets and resources I mentioned in the talk in a gist.
  • In this tutorial, we will deploy three different databases and a visualization tool. The observability control plane will automatically instrument our applications. There are two ways to select which applications Odigos should instrument. We'll use the simpler way, which is opting out and instrumenting everything.
  • With Odigos we can now see and correlate metrics, traces and logs in order to dive deeply into how our application behaves. We can also generate traces, metrics and logs from an application within minutes. Most importantly, we have all the data we need to quickly detect and fix production issues on target applications.
  • An appropriately observable system is definitely a must for quick resolution. But make sure that you don't lose the war against time due to human errors or miscommunication. Consider whether your organization could benefit from incident response tooling like Zenduty.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, welcome to my talk about observability of microservices using open source tools. Hope you're having a good time at the conference so far and that you take away something valuable from my talk. So, a quick introduction to break the ice: hi, I'm Shubham and I'm a Developer Relations engineer at Zenduty. It's an advanced incident management and response orchestration platform. I'm an expert in making mistakes, learning from them, and advocating for best practices for setting up DevOps, SRE and production engineering teams. So, the burning question: what is this talk about? Firstly, we'll have a quick brief on observability, something I'm sure the brilliant speakers before me must have done a good job covering, but we'll quickly glance over it, and over why we need observability from day one. Then we're going to be setting up an observability system using open source tools in record time. And finally, we'll talk about what comes after you have observability set up. So again, what is observability? It's the ability to measure the internal states of a system by just examining its outputs. Observability gives engineers a rather proactive approach to optimizing the system. It provides a connected, real time view of all the operational data in your software system, as well as the flexibility to ask questions on the fly about your applications and get the answers that you need. And that is the key part: not having to know ahead of time what you wanted to ask. It's also about understanding the service level required to meet customer needs, and then instrumenting the necessary components to ensure that the desired level of performance is achieved. It is not about collecting a colossal amount of data just to have it all the time, but rather about tracking the right metrics, monitoring the key performance indicators relevant to your software, and adapting the system to meet customer expectations.
So why do I keep hearing about it all the time now? The word has been thrown around everywhere, and every fourth post on LinkedIn has observability as its core concept. It's a fair question. Observability has been around for a considerably long time now, and the buzz around the word began to catch on only very recently, about five to six years ago. Why now? Well, simply put: software complexity is increasing at an exponential rate, and products are innovating in crazy, unpredictable directions. On the infrastructure side, microservices, polyglot persistence and containers are enabling the decomposition of monoliths into agile, complex, ever evolving systems. Meanwhile, on the product and platform side, creative solutions are empowering users to do new and exciting things. However, this creates challenges for developers trying to build stable and reliable components. As recently as five years ago, most systems were much simpler. You'd have a classic LAMP stack: one big database, an app tier, a web layer, a caching layer, some basic load balancing. You could predict most failures, craft a few dashboards that addressed nearly every performance RCA you might need over the course of time, and your team did not spend a lot of time chasing known unknowns; it was relatively easy to find out what was really going on behind the scenes. But now, with a platform or a microservices architecture, or millions of unique users or applications connected, you have a long, fat tail of unique questions to answer all the time.
There are many more potential combinations of things going wrong, and sometimes they connect in a manner that is hard to decipher. Now, why do we need day one observability? Why are we solving a problem before it even exists? Well, when environments are as complex as they are today, simply monitoring for known problems doesn't address the growing number of new issues that arise. These new issues, the unknown unknowns that we talked about earlier, mean that without an observable system you don't know what is causing a problem, and you don't have a standard starting point to find out. Without deep observability, it is natural to make assumptions about production system behavior, including what we think may be potential performance bottlenecks or failure scenarios. When failures do occur, we are often in the dark as to why they have occurred, and about both the impact and the potential fixes. This leads to wasted time and effort throughout the organization, jumping from one theory to another, one change to another, without really understanding how any of this is going to impact the system. If customers are impacted, the cost of the guesswork is exponentially high, and it can escalate just as quickly at any stage in production. While Kubernetes can help recover from some failures, and we've seen it be a lifesaver for a lot of organisations, there are many scenarios where the system can run suboptimally or fail continuously. And even when service availability is maintained, performance bottlenecks can result in premature auto scaling, resulting in the excessive use of costly cloud computing resources, to say the very least. Indeed, there are cases where the first sign of failure is an astronomically high cloud computing bill.
So now, without any further stalling, let's move on to the meat of it: building a complete open source solution for extracting and shipping traces, metrics and logs, with correlation between them. A few prerequisites for this demo. First up we have kind. Kind is a tool for running local Kubernetes clusters using Docker container nodes, and it was primarily designed for testing Kubernetes itself. If you have Go 1.17+ and Docker installed, all you have to do is run a single command to install and run kind locally (sketched right after this paragraph). I'll be sharing all links, snippets and resources I mention in the talk in a gist, and you can find the link for the same down below. Next up we have kubectl, the Kubernetes command line tool, which allows us to run commands against Kubernetes clusters. And finally we have Helm, which helps us manage Kubernetes applications via Helm charts; these let us define, install and upgrade even the most complex Kubernetes applications.
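As a minimal sketch of that kind setup, assuming Go 1.17+ and Docker are installed (the cluster name below is a placeholder, and the pinned version is only an example; the exact commands from the talk are in the linked gist):

    # Install kind via Go (requires Go 1.17+ and Docker)
    go install sigs.k8s.io/kind@v0.17.0

    # Create a local cluster and verify that kubectl can reach it
    kind create cluster --name o11y-demo
    kubectl cluster-info --context kind-o11y-demo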
So let's move on to the observability backend. As there is currently no single database that can store logs, traces and metrics all together, we will deploy three different databases and a visualization tool. We have Grafana for dashboards and visualization of the data. We have Prometheus, which records real time metrics in a time series database and allows for high dimensionality, with flexible queries and real time alerting. We'll use Loki, which is a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. And finally we'll use Tempo, which is an easy to use and highly scalable distributed tracing backend. Tempo is cost efficient, requires only object storage to operate, and naturally it offers very deep integrations with Grafana, Prometheus and Loki. Now we need the observability control plane, which will automatically instrument our applications, and for that we'll be using Odigos. Odigos is an open source observability control plane that helps us in two ways. Firstly, automatic instrumentation: Odigos automatically instruments your applications and produces distributed traces and metrics without any code changes. Secondly, collector management: Odigos automatically deploys and scales collectors according to application traffic, so we don't need to spend a lot of time deploying and configuring collectors. For this demonstration, our target application will be a microservices based application written in Java and Python. We'll be using a fork of the Bank of Anthos example application, which was made by Google, and you can find the link in the resources section. You can deploy the target application by simply running its deployment manifests, and you should be good to go. To install the observability backend, we'll execute a Helm chart which deploys Tempo, the traces database, Prometheus, the metrics database, and Loki, the logs database, as well as a pre-configured Grafana instance with those databases as the data sources. Once our test application is up and running and our observability databases have been set up and are ready to receive data, we'll install Odigos as the control plane to collect and transfer logs, metrics and traces from our applications to the observability databases. To install Odigos via the Helm chart, you just need to execute a couple of commands, and after all the pods in the odigos-system namespace are running, we open the Odigos UI by running a port-forward command. Once that's done, all you need to do is navigate to localhost:3000 to see your Odigos setup, ready to go.
Now, there are two ways to select which applications Odigos should instrument. The first one is opting out, which will instrument everything, including every new application that will be deployed in the future; you can still exclude any application that you do not want to be instrumented. The other way is opting in, where you select only the few applications that you want instrumented. For the purposes of this tutorial, we'll use the simpler way, which is opting out and instrumenting everything. The next step is to tell Odigos how to reach the three databases that we just deployed, and to do that we'll add the following three destinations inside Odigos: Loki, Prometheus and Tempo. You just need to enter the database URLs as mentioned over here; these are also present in the resources section, so feel free to use them from there. All we need to do after this is wait a few seconds for Odigos to finish deploying the required collectors and instrumenting the target applications. And now, finally, the last step is to explore the observability data in Grafana. We can now see and correlate metrics to traces to logs in order to dive deeply into how our application behaves. To do that, we'll need to port-forward the Grafana instance (all of these steps are sketched after this paragraph). Then we navigate to the local Grafana instance, enter admin as the default username, and for the password you enter the output of a kubectl secret command. Please note that there may be a percentage symbol at the end of the string; make sure that you remove it while entering the password.
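To make the sequence concrete, here is a hedged sketch of the steps described above. The application repo path, chart repositories, release names and local ports are assumptions for illustration; the exact commands are in the talk's gist, and the talk itself used a single pre-configured backend chart rather than the individual upstream charts shown here.

    # 1. Deploy the target application (a fork of Google's Bank of Anthos; repo path is a
    #    placeholder, and the upstream app also needs its JWT secret -- see its README)
    git clone https://github.com/<your-fork>/bank-of-anthos
    kubectl apply -f bank-of-anthos/kubernetes-manifests/

    # 2. Install the observability backend: Loki (logs), Tempo (traces), Prometheus (metrics), Grafana
    helm repo add grafana https://grafana.github.io/helm-charts
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm install loki grafana/loki
    helm install tempo grafana/tempo
    helm install prometheus prometheus-community/prometheus
    helm install grafana grafana/grafana

    # 3. Install Odigos and open its UI (repo URL and service name are assumptions; check the Odigos docs)
    helm repo add odigos https://keyval-dev.github.io/odigos-charts
    helm install odigos odigos/odigos -n odigos-system --create-namespace
    kubectl port-forward -n odigos-system svc/odigos-ui 3000:3000   # then open http://localhost:3000

    # 4. Expose Grafana and fetch its admin password; the secret name follows the Helm release name,
    #    and the trailing '%' the talk mentions is typically just the shell marking a missing
    #    newline, not part of the password
    kubectl port-forward svc/grafana 3001:80   # http://localhost:3001 (3000 is taken by the Odigos UI)
    kubectl get secret grafana -o jsonpath='{.data.admin-password}' | base64 --decode

In the Odigos UI you would then pick the opt-out instrumentation mode and add Loki, Prometheus and Tempo as destinations, pointing each at the in-cluster service URL of the corresponding Helm release.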
And now, time to see the power of data. So just log on to your Grafana instance and we'll start by viewing a service graph of the microservices application. To do that, go to Explore over here, make sure Tempo is selected on top, choose the service graph and just run the query. And there you go: you have a basic node graph available to you, with a single user service application. Now let's see some metrics. We'll click on the user service and choose request rate, and there you go, a graph of the metrics we want is presented to us. There are three kinds of metrics that Odigos supports. Firstly, metrics related to the running of the application, that is, number of HTTP requests, latency, DB connections. Secondly, metrics related to the language runtime, that is, threads, heap, all of that. And thirdly, metrics related to the host environment, which is CPU, memory and disk usage. So now let's look at some traces. This time we'll click on the user service and select histogram, and in order to correlate metrics to traces we'll use a feature called exemplars. To show exemplars, we'll need to click on options over here and just enable exemplars. Tiny diamonds will now be visible to you. You can select any of these and click on Query with Tempo, and there you go, a trace like this should be presented to you. You can see exactly how much time each part of the entire request took, and digging into one of the sections will show additional information such as database queries, HTTP headers and more. Now, if you want to drill down further, we'll go into logs, which is through Loki, and you can simply query the relevant logs as you need them. To do that, we'll first choose our namespace, use this trace ID as an identifier, and just run the query. And there you go, you have all the data relevant to this particular trace and you can map what it did end to end. And that is your observability framework for you, guys. We've learned how easy it can be to extract and ship logs, traces and metrics using only open source solutions. In addition, we were able to generate traces, metrics and logs from an application within minutes. We also have the ability to correlate between different signals: we correlated metrics to traces and traces to logs. And most importantly, we have all the data we need to quickly detect and fix production issues on target applications.
So great, we will now be able to detect issues and have all the data to diagnose and fix them. But how do we approach incident management here for quicker resolution? An appropriately observable system is definitely a must for quick resolution, but make sure that you don't lose the war against time during firefights due to human errors or miscommunication. Assess whether your organization could benefit from incident response tooling like Zenduty, to keep responders on the same page, harness context-rich data from your observability plane, and enable your team to bounce back from issues as fast as they can. And that's all my time. Thanks a lot for tuning in, and I hope you take something valuable away from this session. The resources link is right here. Again, feel free to reach out to me on Twitter or LinkedIn in case you have any questions. And have a great day.
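For reference, the trace-to-logs step in the walkthrough boils down to a LogQL query of roughly this shape, run in Grafana's Explore view with Loki selected; the namespace label and trace ID below are placeholders for the values picked in the demo.

    # LogQL: restrict to the application's namespace, then filter for the trace ID copied from Tempo
    {namespace="bank-of-anthos"} |= "2f6a3b9c8d1e4f70"

Assuming the instrumentation injects the trace ID into the request-scoped log lines, every line this returns belongs to the same end-to-end request, which is what makes the final correlation possible.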
...

Shubham Srivastava

Leading Developer Relations @ Zenduty



