Conf42 Observability 2024 - Online

Observability, what it is & what it can be!

Abstract

Observability is the market buzzword, and every organization using monitoring tools wants an observability stack in place. Setting up the tools, however, is only one part of the story; using them right is still a mystery to many, and demystifying that is the primary goal of this talk.

Summary

  • Conf42 Observability 2024, the online conference about observability. Swapnil explains the difference between monitoring, the current state of observability, and what observability can be. Why do we need observability when monitoring already gives us the data we need?
  • Observability is defined as the ability to determine a system's internal state by examining its outputs. The first thing people want to do with this data is visualization. Challenges include the increasing complexity of software systems and the cost of observability.
  • All the new applications and infrastructure that we add need to be designed for observability. This has led to faster issue resolution because we are able to correlate the data. You can also do predictive maintenance of your systems and infrastructure based on the data you consume.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, good morning, good afternoon, good evening, wherever you are. Welcome to Conf42 Observability 2024, the online conference about observability. My name is Swapnil, and I am going to talk a bit about observability today. I would like to thank the organizers of this conference for including this session, and specifically for including it in the monitoring section, because that is the primary motivation of this session: to understand the difference between monitoring, the current state of observability, and what observability can be.

So what is monitoring? We have been doing monitoring over the years for our applications and infrastructure. That can mean getting the heartbeat or the logs of a system, but it was limited to that. Then why do we need observability when we have monitoring, when we already get the data that we need? Before we dive further into this session, let's have a look at what observability is by definition, and then we will see its different aspects.

Observability, by definition, is the ability to determine a system's internal state by examining its outputs. So what are these outputs? In technical terms, these outputs are called signals, and they are classified primarily into four types: metrics, events, logs, and traces. This data can be captured using different agents, which produce the data and provide a mechanism for collecting it. The collected data is stored in a time series or similar database, where you can see the value of these signals at different time intervals and then make decisions based on that.

Once these signals are aggregated into the database at the time series level, the first thing that people want to do with the data is visualization. You might have seen Grafana, right? There you have different ways to visualize the data by creating graphs, panels, or other visualizations. In addition to this, an important part of observability is using this data to alert the user about any abnormal behavior of the system, based on thresholds or based on some events. This is the primary form of observability that we see in most places (a minimal sketch of this pattern appears at the end of this passage).

But this approach has a number of challenges because of the amount of data that we get. And it is not only the amount of data, but the quality of the data and how effectively we can use it. So let's have a look at some of the challenges related to observability. The first challenge is the increasing complexity of software systems. What happened initially was that people took their existing monitoring setup, made a few changes, and thought they were ready for observability. But they were not. Why? Because the observability stack needs to keep up with the increasing complexity of your systems. In today's distributed systems, new applications and infrastructure components are added every day. The observability stack needs to understand the new components that are being added, what data it is receiving about those components, and how to provide the same visualization and alerting mechanisms to the end user despite the growing complexity. That complexity needs to be understood by the system.
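As a rough illustration of that classic monitoring pattern, here is a minimal Python sketch, entirely my own and not from the talk: a toy time-series store plus a static-threshold alert check. The names (Sample, TimeSeriesStore, check_threshold) are hypothetical.

    import time
    from dataclasses import dataclass, field

    @dataclass
    class Sample:
        # One time-series point: metric name, value, and capture time.
        name: str
        value: float
        timestamp: float = field(default_factory=time.time)

    class TimeSeriesStore:
        # Toy stand-in for a real time-series database.
        def __init__(self) -> None:
            self.points: list[Sample] = []

        def ingest(self, sample: Sample) -> None:
            self.points.append(sample)

        def latest(self, name: str) -> Sample | None:
            matches = [p for p in self.points if p.name == name]
            return matches[-1] if matches else None

    def check_threshold(store: TimeSeriesStore, name: str, limit: float) -> None:
        # Static-threshold alerting: the pattern most monitoring setups stop at.
        latest = store.latest(name)
        if latest is not None and latest.value > limit:
            print(f"ALERT: {name}={latest.value} exceeds limit {limit}")

    store = TimeSeriesStore()
    store.ingest(Sample("cpu_usage_percent", 91.5))
    check_threshold(store, "cpu_usage_percent", 80.0)

The fixed limit is exactly what the rest of the talk pushes beyond: it knows nothing about context, correlation, or trends.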
And we will see why this complexity matters. With cloud native technologies, a number of components are added every day. They can be installed with the click of a button or with some automation using DevOps, and they immediately start creating metrics and logs and sending them to the monitoring or observability system that you have deployed. So you receive a lot of data which, if not understood by the system, is of no use, right?

In addition to just receiving the data, the expectation of the end user is end-to-end visibility into the complete stack: not only errors, but every activity on the system needs to be tracked. You should be able to track each event from the user to the backend system, or from the backend system to the user: how the data flows, where it goes, and which points of the system it touches. You need to know that.

One of the primary drivers for observability beyond monitoring is cost. The cost of downtime can be very high, for both the end user and the product teams, if the downtime is not understood. That was the primary motivation for having an observability system beyond monitoring: something that can give additional information, insights, and actions for the data that we consume from the monitoring systems. In addition to that, in recent years we have seen that the cost of the observability platform itself can be very high. If users do not know what data they are ingesting, how much they are ingesting, and how they are processing it to create dashboards and alerts, then running the observability platform can be very costly and can burn a big hole in your pocket. So that is another challenge people are facing, and it feeds the question: why do we need an observability platform at all? We are fine with what we have with monitoring, and we will have people who can look at this information and take action.

An additional driver that we see recently is security. Security has always been key to all enterprise and product deployments, but with the adoption of cloud native technologies and all the data that now sits in the cloud, security brings another aspect to it. You need to know where your data is, how it is flowing, whether it is flowing correctly, and whether it is security compliant. You can get that information from the logs or metrics, but you also need a mechanism to get alerted for the scenarios that affect security. For example: my data in transit is secured with TLS, so when is my cert going to expire, and who is going to renew it once it does? That is just one use case (a small sketch of such a check appears after this passage), and there are many similar ones.

And in addition to security, there are compliance activities. Many enterprise and financial systems require you to be compliant with certain things, and this involves a number of steps. Each step is equally important, and any one of them can bring down the compliance effort. So you need the ability to take the compliance requirements into observability, see what data you are ingesting, and make decisions or alerts based on that. The current systems are not completely capable of doing that, so you need dedicated personnel to look into this, make the changes, and go beyond that. These are the current challenges, and to overcome them, what we are proposing is Observability 2.0.
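To make that certificate use case concrete, here is a small sketch of my own (not from the talk) that checks how long a server's TLS certificate has left and raises an alert while there is still time to renew it:

    import socket
    import ssl
    import time

    def days_until_cert_expiry(host: str, port: int = 443) -> float:
        # Connect over TLS and read the peer certificate's notAfter field.
        context = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=5) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        # ssl.cert_time_to_seconds parses the certificate's GMT date string.
        expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
        return (expires_at - time.time()) / 86400

    # Alert well before expiry so someone has time to renew the certificate.
    remaining = days_until_cert_expiry("example.com")
    if remaining < 30:
        print(f"ALERT: TLS certificate expires in {remaining:.0f} days")

In an observability platform this check would run as a scheduled probe feeding the alerting pipeline rather than as a standalone script.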
So what is it? We have seen that we are getting different signals from different agents, and each agent sends its data in a different format, for the data, the metadata, and everything else it sends to the system. If the format is different, it becomes very difficult to manage the data and then present it in the same user interface. For example, Grafana has done a very good job of aggregating data from different data sources: you can configure different sources, consume the data, and show it on a visualization panel. But if you want to combine data from different sources into a single panel or a single UI frame, it becomes difficult and sometimes unmanageable for the user, primarily because the information you get from each source is different.

What we need instead is a unified database: whatever data is sent from any agent (we are calling this a poly agent setup) is consumed and stored in one consistent format inside the database. That unified database adds a different kind of value to the observability platform, in terms of setting the context of the information received, correlating it, and enriching it.

So what is context? For every log line that I am receiving, I need to know where it is coming from and what the application is. If you are deploying on Kubernetes: what is the cluster name, what is the pod, what is the namespace? If you are sending it from a host: what is the host IP address or the hostname? What is the cloud? If you are using AWS, what is the region where the host is situated? All of this is the context of that particular log line, and it needs to be stored so that you can correlate it with additional services. Based on this data, you can see, for example, which services are failing in a particular AWS zone; if I want to see that, I need to have this data stored. And sometimes the raw data does not carry this information: the application just sends its data to the monitoring or observability platform without bothering to include all the details about itself. Then it becomes the job of the agents to enrich the data, or the job of the observability platform to scrape that additional information and enrich the data, so that you have everything with full context when you correlate or query it (see the enrichment sketch below).

What this enables is dynamic and intelligent alerting. What is dynamic and intelligent alerting? Based on the correlated information, you can set an alert that looks at the data and then makes a decision, rather than relying on some predefined threshold behind the scenes. The threshold can be dynamic, or it can be based on some dynamic activity. For example, a pod's utilization has climbed 20% above its usual level in a matter of five minutes; that creates an alert, because the system is tracking the current utilization of that pod. The same goes for hosts. That adds intelligence into the system, which can effectively change the threshold at any point in time (a second sketch of this idea also follows below). And all this data can be fed into machine learning algorithms to find patterns and create actions out of them; we will see some examples of these actions in the next slides.

This is the primary motivation behind Observability 2.0 and how we are trying to build it into a platform. The most important part in all this is that we should not end up in the same place we started while trying to solve the problem. So we are making sure that all the data we consume, utilize, and put into the product uses open standards, so that any new or existing system built on those open standards can use the data if you ever want to migrate away from the current system. That should be the case for observability.
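Here is a rough sketch of what such enrichment could look like. The attribute names loosely follow OpenTelemetry resource-attribute conventions, but the helper itself is hypothetical, not Kloudfuse's actual pipeline; it assumes the common Kubernetes pattern of injecting pod metadata through environment variables:

    import os

    def enrich(raw_log: dict) -> dict:
        # Attach deployment context to a raw log record so it can later be
        # correlated with other services by cluster, namespace, pod, or region.
        context = {
            "k8s.cluster.name": os.getenv("CLUSTER_NAME", "unknown"),
            "k8s.namespace.name": os.getenv("POD_NAMESPACE", "unknown"),
            "k8s.pod.name": os.getenv("POD_NAME", "unknown"),
            "cloud.provider": "aws",
            "cloud.region": os.getenv("AWS_REGION", "unknown"),
        }
        return {**raw_log, **context}

    record = enrich({"message": "payment failed", "level": "error"})
    print(record)  # one unified, queryable record with full context

With every record carrying the same context fields, a question like "errors grouped by cloud region" becomes a simple filter instead of a cross-referencing exercise.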
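And here is a minimal sketch of the dynamic-threshold idea, again my own illustration: assuming utilization samples arrive once a minute, it alerts when the value climbs more than 20 points within a five-minute window, with no fixed limit involved:

    from collections import deque

    class DynamicAlert:
        # Keep a five-minute sliding window of one-per-minute samples and
        # alert on a sudden jump relative to the start of the window.
        def __init__(self, window: int = 5, jump: float = 20.0) -> None:
            self.samples = deque(maxlen=window)
            self.jump = jump

        def observe(self, utilization: float) -> None:
            self.samples.append(utilization)
            if len(self.samples) == self.samples.maxlen:
                delta = self.samples[-1] - self.samples[0]
                if delta > self.jump:
                    print(f"ALERT: utilization jumped {delta:.1f} points "
                          f"in {self.samples.maxlen} minutes")

    alert = DynamicAlert()
    for value in [40, 42, 41, 55, 65]:  # a sudden ramp in the last two minutes
        alert.observe(value)

A production system would compare against a learned baseline rather than the window's first sample, but the shape is the same: the threshold moves with the data.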
So how can we achieve this? It is a combined effort. It is not enough that the operations teams deploy the stack and the development teams start using it; to use Observability 2.0, the development practices need to change as well. This is what we call observability-driven development. All the new applications and all the new infrastructure that we add need to be designed for observability: they need to send the required signals, with the correct data, to the platform, so that the platform can help you find issues easily (see the instrumentation sketch after this passage). We will leverage the existing monitoring data, but at the same time we expect collaboration: the operations teams set up the infrastructure correctly, and the development teams add the right set of instrumentation that sends the required data to the observability platform. And as always, this is continuous improvement of the system. You might not get everything in the first go, but you will evolve as you go, adding information both on the instrumentation side and from the infrastructure components.

This has led to faster issue resolution for us, because we are able to correlate the data and come to a conclusion: this is the application that is getting the errors, and only in this AWS region, on this node, in this particular pod, because of the correlation we could establish. This helps system performance as well as the resilience of the system, and it helps with collaboration activities too.

If we do this, it opens up the automation part, where, as we saw, you can run machine learning and AI on the observability platform. So what is that? With the data that we have, we enhance root cause analysis for any errors or failures that we see in the observed system: we get a lot of additional detail and correlation to the other services, so that we can track the entire failure from one service to another. This also helps with anomaly-based alerting: you find the services that are showing unusual patterns and then alert based on that. You can include algorithms in your observability system that analyze this data and produce output the alerting system can use. You can use this to optimize system performance, tuning the system regularly based on the data you see in the dashboards. And, extending the alerting part we have already seen, you can create intelligent alerts on changes or on forecasts of averages.

There is an interesting and very important use case that we have achieved with this: predictive maintenance of your systems and infrastructure, based on the data that you consume and see. Most of us deploy applications on Kubernetes, and if you are using stateful sets, you must have a volume where you store the state data. It is very possible that the volume will not be sufficient for the amount of data you ingest over a period of time, so if the data is about to exceed the volume, you need to increase its size, manually or sometimes with automation. That is not something the cloud takes care of for you today. So what we have done is apply a forecast algorithm to the volume-size data we receive, and we forecast the timeframe in which the volume will be filled: what will the volume utilization be in the next five days, the next three days? Based on that, you can do predictive maintenance: go ahead and increase the volume size beforehand so that you do not run into errors. This kind of predictive maintenance becomes possible in an observability system when you have data that is not only correct but correlated, and that can be fed to the ML (a simplified forecast sketch also appears below).
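As one concrete way to practice the observability-driven development mentioned above, here is a minimal OpenTelemetry sketch in Python. The service and attribute names are made up for illustration, and the console exporter stands in for an OTLP exporter that would ship spans to a real platform:

    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import (
        BatchSpanProcessor,
        ConsoleSpanExporter,
    )

    # The application declares who it is, so every signal carries context.
    resource = Resource.create({
        "service.name": "checkout-service",  # hypothetical service name
        "deployment.environment": "production",
    })
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("process-order") as span:
        # Business attributes make the span correlatable with logs and metrics.
        span.set_attribute("order.id", "ord-1234")
        # ... handle the order here ...

The point is that the context discussed earlier is attached at the source by the developer, instead of being reconstructed later by the platform.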
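And here is a heavily simplified version of the volume forecast just described: a least-squares trend line over recent daily usage samples, standing in for whatever forecasting model a real platform would use:

    def days_until_full(daily_used_gb: list[float], capacity_gb: float) -> float:
        # Fit a straight line through daily usage samples and extrapolate
        # to the day the volume reaches capacity.
        n = len(daily_used_gb)
        xs = range(n)
        mean_x = sum(xs) / n
        mean_y = sum(daily_used_gb) / n
        slope = (sum((x - mean_x) * (y - mean_y)
                     for x, y in zip(xs, daily_used_gb))
                 / sum((x - mean_x) ** 2 for x in xs))
        if slope <= 0:
            return float("inf")  # usage flat or shrinking: no fill date
        intercept = mean_y - slope * mean_x
        # Days from the most recent sample until the fitted line hits capacity.
        return (capacity_gb - intercept) / slope - (n - 1)

    # Seven days of usage on a 100 GB volume, growing roughly 3 GB per day.
    usage = [70, 73, 75, 79, 82, 84, 88]
    print(f"Volume full in ~{days_until_full(usage, 100):.1f} days")

If the answer comes back as four days, the alert fires now, while there is still time to grow the volume, which is exactly the shift from reactive to predictive maintenance.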
That also helps you with capacity planning. Similar to the use case I just mentioned, you can increase not only the volumes but also the number of nodes, the number of replicas of the pods, and the number of hosts that are needed, beyond the autoscaling offered by the cloud provider or the orchestration platform. You can define the dynamic thresholds that you have already seen: the threshold is updated automatically using a change-detection algorithm, so based on the current utilization of the system, it will watch for, say, the next 20% change over the next five days, and when that happens it will throw an alert: this is not behaving correctly, please have a look at it. In addition to this, we have alerting capabilities that can create incident responses for the ops teams for critical alerts, automatically sending the alert to tools like PagerDuty or creating Jira tickets so they can have a look. And beyond capacity planning, you can do proactive workforce management as well.

So this is the set of things that are intended to be in an observability platform. The current observability platforms that we see have a few of these pieces, but not all of them. And the most important thing that we are looking at in Observability 2.0 is open standards: you should be able to work with open standards like Prometheus and OTel for all your observability operations.

That is mostly what I had for this session. Just a quick word about the organization that I represent: we are a small observability startup, Kloudfuse. We have a product, and you can have a look at it from our download page, or you can even play with it on the playground that you can see in the links. And yeah, that's pretty much all I had from my side.
...

Swapnil Kulkarni

Customer Success Engineer @ Kloudfuse
