Conf42 Site Reliability Engineering 2021 - Online

Improve observability and operational performance for your container workloads

Abstract

In this session we will see how to improve the observability of container workloads, focusing on the three pillars of metrics, logging and tracing. On the operational performance side, we will discuss how to detect behaviours that deviate from normal operating patterns.

Summary

  • In my session I'll be talking about improving observability when you're running your workloads on AWS. We will also touch upon some of the SRE practices like SLIs, SLOs and SLAs. You can enable your DevOps for reliability with Chaos Native.
  • As we develop more and more applications in a distributed ecosystem, observability has become a very important aspect of all these applications. Observability is the correlation of all the different signals your applications emit. There are many different ways of improving your system's observability.
  • Logs can be split into four categories: application logs, system logs, audit logs and infrastructure logs. A consistent logging format will ease the process of extracting information from the logs. Too much logging can also result in performance issues.
  • The third signal is tracing. Distributed tracing has become increasingly important when you are running your system in a distributed environment. It gives you information on the lag or latency you are seeing at different levels of your application.
  • Amazon CloudWatch takes care of your metrics and your logs, and AWS X-Ray provides the tracing capability. X-Ray also gives you a service map and helps developers analyze and debug their production workloads, especially in a distributed environment.
  • Amazon CloudWatch is responsible for bringing all these metrics together. If a metric goes beyond a threshold you have set, you can have a CloudWatch alarm go off. You can also have auto scaling invoked. As a consumer you can plot graphs, which we will look at in the next slide.
  • Now let's look at Logs Insights. This is a query executed on CloudWatch Logs itself. It helps you parse parts of a particular log format and filter the logs. After that you can show the distribution of the log events over time.
  • With the logs in place, let's talk about the metrics. These metrics are grouped by namespace and then by the various dimensions associated with them. Next we have graphed metrics, which are closely related to the metrics from the earlier slide.
  • There are basically two types of alarms: basic metric alarms and composite alarms. Alarms work together with the metrics you are collecting to send out an SNS notification or even perform automated remediation.
  • Amazon CloudWatch can take care of your logging as well as your metric needs. Then we have a look at AWS X-Ray, which helps you determine how your service is operating and gives you a one-shot view of how different services are integrated.
  • The importance of service level indicators, objectives and agreements. Make sure these are incorporated in your SRE practice. Not meeting a particular SLA, or meeting it, determines the success or failure of a service. Some guidance you can follow.
  • Make sure you implement the SLIs, SLOs and SLAs for your application in consultation with the stakeholders. If you're running your services on AWS, try to leverage the native tooling that is already available. Hop over to the AWS observability workshop and get hands-on experience of all the features.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with Chaos Native. Create your free account at Chaos Native Litmus Cloud.

Hello and welcome to the SRE conference. In my session I'll be talking about improving observability when you're running your workloads on AWS. As part of this session, we will talk about the best practices you can follow when your workloads are running on AWS by leveraging the native AWS services, and we will also touch upon some of the SRE practices like SLIs, SLOs and SLAs, and the best practices you can follow when you are setting up these KPIs and SLAs for your workloads. With that being said, let's have a look at the agenda we will be following today. We'll start with the introduction, where we talk about observability and the key signals you'll be looking out for whenever you're implementing or improving observability for your workloads. Then we'll see the different features available in AWS as part of Amazon CloudWatch and AWS X-Ray to understand the state and behavior of your application. Finally, we will touch upon the importance of SLIs, SLOs and SLAs, and then we'll close off the session by summarizing what we have learned over the next 30 minutes.

So let's start with the first part: what is observability? As we develop more and more applications in a distributed ecosystem, observability has become a very important aspect of all these applications, especially when it comes to defining it. There are many definitions you will find across different websites and resources. Put in the simplest form, observability helps you understand what is happening in your system. As a counterpoint, you may ask: well, I have logs, or I have metrics, or I have tracing; I have implemented one of these three aspects in my environment; doesn't that give me observability? Look at observability as the correlation of all these different signals that you implement when you are running applications in a distributed environment. Some of the questions you want to ask yourself are: is my system up or is it down? Is it fast or slow, based on the experience of the end users? For any KPIs and SLAs I am establishing, how would I know that I am meeting them? These are some of the questions observability can help you answer, especially when you are running your applications at scale and in the cloud. You cannot afford to be blind to all the different aspects of running applications in a distributed environment. You need to be able to answer a wide range of operational and business related questions. You should be able to spot problems, ideally before they disrupt your operations, and respond quickly to any issues raised by a customer. To achieve all this insight, you need your systems to be extremely observable. Now, there are many different ways of improving your system's observability. To begin with, we will be focusing on the three primary signals: logs, metrics and traces. Most of you would be implementing at least one of these in your system, or maybe all three of them if you have a very mature observability practice in your organization or in your project.
There is also a white paper from the CNCF which describes observability. It's still a work in progress, but you can have a look at it on their GitHub repo to learn more about how the Cloud Native Computing Foundation is thinking about observability and what kind of recommendations they are giving. Now, one question you might ask is: is observability new for software systems? I would say no, because the three signals I spoke about in the previous slide, logs, metrics and traces, have always been used. Logs have always been used for most of your debugging and for identifying the root cause of issues. Take anything that went wrong in your application even before you were running it in the cloud: there was always a log sitting somewhere on your virtual machine that could be used to identify the root cause of whatever went wrong. Then you have metrics. Metrics have always been used when your applications have been running for a long time, or when you are trying to understand infrastructure related issues: say something went wrong, the CPU is too high, the memory is too high, or during performance testing you notice that certain aspects of the application are not behaving the way they were intended to. You would be relying on metrics. And finally, traces. Traces help you understand how a request traverses from point A to point B and which downstream services it touches along the way. So all in all, these signals, together or individually, have always been used by software development teams during development, testing, operations, or even maintenance of their systems. Observability is not leveraging concepts you are not aware of; rather, it is correlating all these different signals you have always had about the application and giving you a consolidated picture of its state.

So let's go a little more in depth into each of these. What are logs? The way I look at it, logs can be split into four categories: application logs, system logs, audit logs and infrastructure logs. You would see the same segregation in the CNCF paper as well. Application logs are the normal logs you see as part of your Java or Python application; whatever your log appenders write out becomes your application logs. What about system logs? These come from your operating systems. Say you have AMIs running and an instance is throwing errors; those would be the logs you capture here. Audit logs record, for example, what action has been performed by which user, or the entire trail of a specific business outcome carried out by a user; CloudTrail, if you have enabled it in AWS, is again audit logging. And finally, infrastructure logs: whenever you are building out your infrastructure, maybe by using the cloud native services or tools like Chef, Terraform or CloudFormation, the logs generated there are infrastructure logs. Once you have segregated your logs into these four categories, you can look at the logging levels you are using. This applies predominantly on the application side, where you have trace, debug, info, warn and error as the different log levels.
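To make the log-level point concrete (and the consistent, structured format discussed next), here is a minimal sketch in Python. The service name, environment variable and messages are invented for the example; the talk itself mentions Log4j and Logback for Java, and any structured-logging library would do the same job.

```python
# Minimal sketch: JSON-formatted application logs written to stdout, with the
# log level taken from an environment variable so it can be tuned per
# environment (e.g. DEBUG in dev, ERROR in production).
import json
import logging
import os
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render every record in one consistent, machine-parsable shape."""
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)           # write to stdout, not a file
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("orders-service")          # hypothetical service name
logger.addHandler(handler)
logger.setLevel(os.environ.get("LOG_LEVEL", "INFO"))  # level driven by the environment

logger.info("order 1234 created")
logger.error("payment service timed out")
```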
Depending on the environment you are using, you should keep changing these log levels. Try to have only the error logs in the production environment and use the rest of the logging levels in the lower environments. Too much logging can also result in performance issues because of all the file I/O happening behind the scenes. One more point to be careful about: avoid printing sensitive information in the logs. Each business will have its own definition of what counts as sensitive, but you should have mechanisms in place in your project or organization that look at the data being printed into the logs and determine whether it is sensitive or not. The next point is defining a consistent logging pattern. Quite often it's difficult to stick to a schema for a long duration when you are writing logs, so use a logging framework such as Log4j or Logback and define a consistent logging format. The advantage is that it helps you extract information from these logs during analytics. You may have a centralized logging location where all these logs are read by some team trying to pull data out of them or run analytics on them; a consistent logging format will ease that entire process of extracting information from the logs. And finally, when you are running container based workloads, it is preferred to write your logs to standard output, or stdout. When you write to stdout, all these logs can be aggregated: say you're running your application in a Kubernetes cluster, you can have a DaemonSet running behind the scenes, possibly using Fluent Bit, to aggregate all these logs and send them to the destination of your choice, which can be CloudWatch Logs in the case of AWS, or an Elasticsearch instance running in the cloud or on premises, depending on your choice of architecture.

With that information on logs, let's have a look at metrics. Metrics are often described as the performance of your system. When we say metrics, it can be the CPU information you are collecting: say one of my CPUs goes beyond a 70% threshold; that's a metric telling me my VM has too much load on it, or that an out-of-memory condition is about to happen. So whatever threshold you have set, say 70 or 80%, and it's being exceeded, that's a metric for you. You can aggregate different types of data as part of a metric: it can be a numeric representation, it can be a point-in-time observation of a system, and it can have different cardinality associated with it. Metrics essentially serve two purposes: real-time monitoring and alerting, and trend analysis for long-term planning. As the name suggests, real-time monitoring and alerting will immediately tell you if something is wrong with your application. If you have the right set of alarms and notifications in place, you would immediately know that a specific application is not showing the right metrics, or maybe it's returning too many 404 errors, or too many 500 errors. Those are again metrics.
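As one concrete illustration, counts like those 404s or 500s can be published to CloudWatch as custom metrics. This is a hedged sketch using boto3; the namespace, metric name and dimension values are made up for the example, not anything prescribed in the talk.

```python
# Sketch: publish a custom metric (e.g. a count of HTTP 5xx responses) into a
# custom CloudWatch namespace. Names here are illustrative placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_5xx_count(count: int, service: str = "orders-service") -> None:
    cloudwatch.put_metric_data(
        Namespace="MyApp/HTTP",   # custom namespace, grouped separately in the console
        MetricData=[
            {
                "MetricName": "5xxCount",
                "Dimensions": [{"Name": "ServiceName", "Value": service}],
                "Value": count,
                "Unit": "Count",
            }
        ],
    )

publish_5xx_count(3)
```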
Metrics give you an account of what is happening with your application, and the trend analysis over time helps you do the right sizing. For example, if you are running a new application and you know it's just getting started, you would probably use a lower configuration for the VM, say two virtual CPUs and four GB of RAM. Over time, as your application grows, you may want to increase the CPU or scale out. That kind of long-term planning can be done based on the trend of incoming metrics and the observations from historical data. So metrics help you guide the future of what needs to be done for your application and also give a snapshot of what is currently happening with it. The third signal we spoke about in the previous slide is tracing. Distributed tracing has become increasingly important when you are running your system in a distributed environment. For example, take a request initiated by your end user or customer going from point A to point B: with the right tracing in place, you can see its effect across all the downstream services. Quite often, when you're running a microservices architecture, it's not just one microservice; it's a collection, with three or four microservices interacting with each other behind the scenes. So if you're going from microservice A to D, tracing shows you what is happening as the request goes from A to B, B to C, and C to D. It gives you information on the lag or latency you are seeing at different levels of your application, and that way it gives you a holistic picture of the journey of a request.

Okay, moving on. Now that we have established what observability means and what the different signals are, let's discuss the ways you can understand the state and behavior of the application. For this purpose, I will be focusing on two main services in AWS. There are different observability services available, both open source and native to AWS; for this presentation I'll be focusing on Amazon CloudWatch and AWS X-Ray. Amazon CloudWatch will take care of your metrics and your logs, and AWS X-Ray provides the tracing capability. So let's consider a use case: say I want to monitor a microservices architecture, and this architecture consists of different distributed parts which need to be monitored. For the logging aspects you can make use of Amazon CloudWatch, and it can also be used for collecting the metrics, setting up alarms and reacting to certain changes happening in your AWS environment. And with many different microservices working together, you want to know the chain of invocation for these microservices. That's where the idea of X-Ray comes in: it uses correlation IDs, which are unique identifiers attached to all the requests and messages related to a specific event chain. Let's take an example with services A, B and C: if I do a GET operation on service A, behind the scenes it may fetch data from services B and C, consolidate it and give it back to me.
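To make that chain of invocation visible, each service has to be instrumented so the trace ID is propagated on its outbound calls. Below is a minimal, hedged sketch using the AWS X-Ray SDK for Python; the segment name, service URLs and function are invented for the example, it assumes the X-Ray daemon is reachable, and in a real web service the SDK's framework middleware would normally create the segments for you.

```python
# Sketch: instrument "service A" so its downstream calls to services B and C
# are recorded as subsegments of one trace. Names and URLs are hypothetical;
# patch_all() wires supported clients (requests, boto3, ...) into X-Ray so the
# trace header is propagated automatically on outgoing calls.
import requests
from aws_xray_sdk.core import xray_recorder, patch_all

patch_all()  # auto-instrument supported libraries

@xray_recorder.capture("consolidate")            # recorded as a subsegment
def consolidate():
    b = requests.get("http://service-b.internal/data").json()
    c = requests.get("http://service-c.internal/data").json()
    return {**b, **c}

# Outside Lambda or a web framework, open and close the segment manually.
xray_recorder.begin_segment("service-a")
try:
    result = consolidate()
finally:
    xray_recorder.end_segment()
```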
X-Ray will then help relate all these different invocations using the correlation ID, and you get a consolidated view of how your application is behaving, both the inter-service communication and the overall response being provided back to the user. So, as I mentioned earlier, there are two services we'll be focusing on in this presentation: Amazon CloudWatch and X-Ray. CloudWatch is a monitoring and management service that provides data and insights into AWS. With the help of CloudWatch, you can get all your logs consolidated into CloudWatch Logs. You can use Container Insights to get metrics for containers running on your ECS or EKS orchestrators. You can use ServiceLens to see how the different services relate to each other. You can make use of metrics and alarms: the metrics will give you data such as the number of 400 or 500 errors you are seeing on an Application Load Balancer, and alarms let you set thresholds. For example, if I have a CPU going beyond 70%, I can have an alarm go off telling me that something is consistently pushing CPU beyond that threshold. You can also have anomaly detection, and that's where the earlier point about metrics helping with long-term trend analytics comes into the picture: if at any point there is an anomaly in your overall operation, the metrics will help you catch it. Talking about AWS X-Ray, that's basically for tracing and analytics. It also gives you a service map, and it helps developers analyze and debug their production workloads, especially in a distributed environment. It can tell you how your application and the underlying services are performing, and you can use X-Ray to analyze applications in production as well as in development; whatever environments you have, X-Ray can do all of that in each of them.

Now let's look at the different components of Amazon CloudWatch. I gave a brief overview of CloudWatch in the previous slide; let's take an example with one of the resources we have here, say an ALB, or maybe some custom data you need to send. On the left you send this custom data into CloudWatch, and CloudWatch is responsible for putting all these metrics together. You can see an example here of CPU percentage, hours per week, CPU percentage again, and so on. All these metrics are put together by CloudWatch, and at any point in time, if a metric exceeds a specific threshold you have set, a CloudWatch alarm can go off. That alarm can integrate with an SNS email notification informing your team that something is wrong with the application and they should go and have a look. If it's not a notification, you can also have auto scaling invoked. So let's take an example: I have one virtual CPU and the threshold says CPU utilization has gone beyond 70%; I would like auto scaling to kick in at this point.
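The alarm just described can also be created programmatically. Below is a hedged boto3 sketch: the alarm name, instance ID and SNS topic ARN are placeholders, and the same alarm could target an Auto Scaling policy instead of a topic.

```python
# Sketch: a CloudWatch alarm that fires when average EC2 CPU utilization stays
# above 70% for three consecutive 5-minute periods, notifying an SNS topic.
# All identifiers below are placeholders, not real resources.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-orders-service",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,                     # evaluate in 5-minute buckets
    EvaluationPeriods=3,            # three periods over the threshold...
    DatapointsToAlarm=3,            # ...before the alarm actually fires
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],
)
```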
That's where Amazon CloudWatch alarms come in: I can have the alarm trigger auto scaling so that my instance count goes from a single EC2 instance to a fleet of three or four, and it can scale in again once the threshold comes back to normal. And as a consumer, you can look at all these statistics and plot the graphs, which we will look at in the next slide.

Now let's look at Logs Insights. I mentioned that the logs get pushed to CloudWatch Logs, and once they are available there you can execute a query. The query runs on CloudWatch Logs itself and helps you parse certain parts of a particular log format and filter, say, all the logs with a logging level of error. This is really helpful when you have all these logs consolidated into log groups within CloudWatch. After that, as you can see in this screenshot, you can show the distribution of the log events over time; even custom log events can be seen here. So Logs Insights helps you look at what happened, say from 01:00 a.m. all the way to 02:00 a.m., consolidate that information, show the distribution of the events, and find specific events you might otherwise have missed if the logs were being inspected manually. That's the advantage of Logs Insights. The sample query here fetches the timestamp and message fields and orders by timestamp in descending order (a programmatic version of the same query is sketched a little further down).

With the logs in place, let's talk about the metrics. We saw in the earlier example that metrics are exposed for different AWS services. These metrics are grouped by namespace and then by the various dimensions associated with them. For example, in this screenshot you can see custom namespaces like Container Insights, the Prometheus-related metrics, or ECS Container Insights; all of these are custom metrics added into custom namespaces. Then you have the AWS namespaces like API Gateway, Application Load Balancer, DynamoDB and EBS; these are the default namespaces from the different AWS services used in your account. This way the metrics section helps you find all the related metrics for a specific service, or for the custom events consolidated by your team. Next we have graphed metrics, which are closely related to the metrics you saw in the earlier slide. They basically let you run statistics on your metrics: average, minimum, maximum, sum and so on. And if you look at the right side, at the red circle, you can also add an anomaly detection band, which says this is how my application normally behaves, and at certain points it helps you identify that something is an anomaly: possibly something went wrong, or maybe you got a very high spike of incoming traffic at that point in time. So it helps you spot such anomalies in your regular flow of traffic and, obviously, in your metrics. We have been talking about alarms for quite a while; there are basically two types: metric alarms and composite alarms. We saw that there are metrics.
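Circling back to the Logs Insights query described above, here is a hedged sketch of running that same query with boto3 before we go deeper into alarms. The log group name and time window are placeholders.

```python
# Sketch: run the sample Logs Insights query (timestamp + message, newest
# first) against one log group for the last hour. Log group name is a placeholder.
import time
import boto3

logs = boto3.client("logs")

query_id = logs.start_query(
    logGroupName="/my-app/orders-service",
    startTime=int(time.time()) - 3600,   # last hour
    endTime=int(time.time()),
    queryString="fields @timestamp, @message | sort @timestamp desc | limit 20",
)["queryId"]

# Poll until the query finishes, then print the matching events.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```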
To continue with alarms: if I am using just one metric to create an alarm, that's a metric alarm. A composite alarm includes a rule expression and takes into account the states of the other alarms it references. Then there is the threshold of the alarm: for example, when you set up an EC2 auto scaling event, you might say it needs to scale out whenever CPU utilization is 90% and above, and scale in whenever it's around 50%. That's the threshold you set for an alarm. You can see here that the blue line on the graph is the threshold you have set, and the red one is the value, the statistic you are measuring, which determines whether your alarm is in the OK state or the alarm state. Here it's "after three periods over the threshold": you are saying you will not consider the metric to be in alarm unless it breaches the threshold three times within the evaluation period. That's what you're seeing here, where the value has gone over the threshold three times, and that's when the alarm goes off. Or you can configure it so that the moment the value goes over once, the alarm is considered to be in action. Depending on your business case and use case, you can pick and choose whether the value needs to be above the threshold for consecutive periods or just one period. And that's how alarms work together with the metrics you are collecting to give you this experience of sending out an SNS notification or even performing automated remediation. Or, finally, you can also have auto scaling, which is itself a kind of automated remediation.

In this example I wanted to show automated remediation leveraging AWS Lambda; we have a blog about this, about incident management and remediation in the cloud. Here you're monitoring a microservice API that sits behind an Application Load Balancer. The traffic can't reach the microservice, so it times out. You could have an alarm that is triggered to send an SNS notification to a topic to which a Lambda function is subscribed. The moment the alarm kicks off, the SNS topic sends the notification to Lambda, and once the notification is received, the Lambda knows there is something wrong with the security group. So what it does is go back and fix that security group, possibly editing the inbound or outbound rules, and that way traffic is allowed through again. That's one way of doing automated remediation, as sketched below. You can also make use of ChatOps, where you integrate with something like Slack and post a message to your Slack channel so someone is made aware that something has gone wrong. So there are multiple ways in which you can combine Lambda and CloudWatch alarms to have this automated remediation in place. If you have a look at the EKS workshop published by AWS, you will see there is a specific chapter that talks about service mesh integration and how you can use Container Insights. The next image is from Container Insights, which you can set up; as you can see, on the left panel there is a section called Container Insights.
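Here is the remediation sketch referred to above: an SNS-triggered Lambda handler that re-opens an ingress rule on a security group. It is a hedged illustration, not the blog's actual code; the security group ID, port and CIDR range are placeholders, and a real remediation would first inspect the alarm payload before changing anything.

```python
# Sketch: SNS-triggered Lambda that restores an ingress rule on a security
# group so traffic can reach the microservice again. All identifiers are
# placeholders; real code would validate the alarm payload first.
import json
import boto3

ec2 = boto3.client("ec2")

SECURITY_GROUP_ID = "sg-0123456789abcdef0"   # placeholder

def handler(event, context):
    # The CloudWatch alarm arrives wrapped in an SNS message.
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    print("Remediating after alarm:", alarm.get("AlarmName"))

    # Re-authorize HTTPS ingress from the load balancer's CIDR range.
    ec2.authorize_security_group_ingress(
        GroupId=SECURITY_GROUP_ID,
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [{"CidrIp": "10.0.0.0/16", "Description": "ALB subnets"}],
        }],
    )
    return {"status": "ingress restored"}
```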
In that Container Insights view you can select whichever EKS cluster you have, the AWS managed Kubernetes cluster, and get different views of CPU utilization, memory utilization, network, number of nodes, disk and cluster failures. That is the advantage of Container Insights: it gives you a one-shot view of the different clusters you have, and it also gives you a view of different services and resources if you're using ECS. So you have all the metrics needed for your container workloads together in CloudWatch. Now, CloudWatch isn't just for container workloads; you can also use it for EC2 instances, your normal virtual machines. You will have to export the logs generated there by using the CloudWatch agent, but you can have those logs come into CloudWatch Logs and do the exact same operations we have been talking about in the earlier slides, even with EC2.

Next, let's have a look at anomaly detection. When you enable anomaly detection for a particular metric, it applies statistical and machine learning models, algorithms that are already in place. You can see from the graph that the grayed-out area is the anomaly detection band around the metric that has been set up. In this particular case the band expression indicates that anomaly detection has been enabled for the metric with the ID m1, with a band width of two standard deviations, which is the default; the moment the value moves outside that band, an anomaly is detected. So when you are viewing a graph of the metric data, the expected values are overlaid on top of the graph: I know this is where my normal execution sits, and the moment you see these red areas, some anomaly has been detected. That pattern can be used for long-term trend analytics, or for long-term planning of what the usage for each of your services is going to be and what percentage of it will be used. Depending on whichever metric you're using, in this case CPU utilization, you can determine the band within which it normally operates, and it helps you detect any anomalies that come up in your operation. The advantage of this sort of setup is that the moment something deviates from the normal baseline, your ops team is made aware of it, and you are not caught off guard when such systems are running at scale in a distributed environment. This is an example of something I mentioned earlier, where you can consolidate all your logs, potentially by running the CloudWatch agent, and ship the metrics and logs to CloudWatch Logs using a Fluent Bit sidecar pattern. This pattern is very common in the container space, and Container Insights also collects performance logs using something called the embedded metric format. These performance log events follow a structured JSON schema, and they enable you to send high-cardinality data which can be ingested and stored at scale using Amazon CloudWatch. So, to summarize here: we have seen that Amazon CloudWatch can take care of your logging as well as your metric needs when it comes to observability.
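To give a feel for what an embedded metric format event looks like, here is a hedged sketch that emits one EMF-style record to stdout. The namespace, dimension and values are invented, and in practice you would usually use one of the aws-embedded-metrics client libraries rather than building the JSON by hand.

```python
# Sketch: one log event in CloudWatch's embedded metric format (EMF). When
# this JSON lands in CloudWatch Logs, the declared metric is extracted
# automatically. Namespace, dimensions and values are illustrative only.
import json
import time

emf_event = {
    "_aws": {
        "Timestamp": int(time.time() * 1000),      # epoch milliseconds
        "CloudWatchMetrics": [{
            "Namespace": "MyApp/Orders",
            "Dimensions": [["ServiceName"]],
            "Metrics": [{"Name": "ProcessingLatency", "Unit": "Milliseconds"}],
        }],
    },
    "ServiceName": "orders-service",               # dimension value
    "ProcessingLatency": 142,                      # metric value
    "orderId": "o-98231",                          # extra high-cardinality field
}

print(json.dumps(emf_event))   # written to stdout, shipped by Fluent Bit / the agent
```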
And now we have a look at AWS X-Ray. Going back to the first point I mentioned, it's basically about correlation IDs. This is an example where you can see that the Scorekeep service is, behind the scenes, getting data from DynamoDB: it is keeping the session, updating the item, and ultimately putting the data in there. This step-by-step breakdown, along with the time each stage takes, is what X-Ray gives you. The trace ID, which is the correlation ID, is added to the HTTP request in a specific header named X-Amzn-Trace-Id, and you can have this integrated with Amazon API Gateway, the Application Load Balancer and the like. In this section we have a service map: X-Ray uses all the data from your application to generate it. Anything shown in green is up and running, and anything in red means something is wrong with that particular service. This sort of consolidated view, say for a client or user invoking something, helps you determine how your service is operating and gives you a one-shot view of how different services are integrated, especially in a large distributed system. So that covers the topics around Amazon CloudWatch and AWS X-Ray: we touched upon the different aspects of metrics, how you can put the graphs together, how you can use the logging side of CloudWatch to consolidate all your logging, and ultimately how to use X-Ray for the tracing bit as well.

Now I would like to move a little bit outside of the AWS services we have been talking about for the last 15 to 20 minutes and talk about the importance of service level indicators, objectives and agreements. Quite often, when you are defining your SRE practice, it is important to have these three definitions well sorted out. So what is a service level indicator? An SLI is basically a number: a quantitative measure of some aspect of the service being provided. The objective is the target you set for that indicator, and the agreement is the commitment you make around it. If I provide a service, say a REST API, and I'm saying it is going to be up 99% of the time or so, then I know that is the measure I have to meet whenever the service is running for a long duration, and that's the reliability part of it. Every time you set up SLAs, SLOs or SLIs, make sure they are incorporated into your SRE practice, because ultimately meeting or not meeting a particular SLA determines the success or failure of a service, and most often SLAs come with a financial penalty or some kind of rebate associated with them. Here is some guidance you can follow, especially when you are setting up the SRE practice and the product development; these are things you will also find in different resources online. Do have a look at Google's SRE book, which talks about best practices and how you can implement the SRE practice in your organization. Also have a look at the resiliency guidance and the Well-Architected Framework provided by AWS, which covers the five pillars you need to address to ensure your architecture is well architected: operational excellence, security, reliability, performance efficiency and cost optimization.
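Before going through that guidance, it helps to see what an availability target translates into in concrete numbers, since error budgets come up in a moment. A small sketch, with the 99.9% figure purely as an example:

```python
# Sketch: turn an availability target into allowed downtime per year and an
# error budget for a given request volume. All numbers are examples.
HOURS_PER_YEAR = 24 * 365   # ~8,760 hours

def allowed_downtime_hours(availability: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    return (1 - availability) * HOURS_PER_YEAR

def error_budget(availability: float, total_requests: int) -> int:
    """How many requests are allowed to fail under the target."""
    return int((1 - availability) * total_requests)

print(allowed_downtime_hours(0.999))     # ~8.8 hours/year at 99.9%
print(allowed_downtime_hours(0.9995))    # ~4.4 hours/year at 99.95%
print(error_budget(0.999, 10_000_000))   # 10,000 failed requests allowed
```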
The first thing: you must not use each and every metric in whatever tracking or monitoring system you have. It's tempting to capture everything your system emits, but that is not really good in the long term, because you end up not differentiating what you want to know about the system from what you don't. The second: have as few SLOs as possible and get agreement from all the stakeholders, for clarity on how they are measured; even the conditions under which those SLOs are valid have to be clarified with the stakeholders. And finally, make sure you have an error budget, which provides the objective metric that determines how unreliable a service is allowed to be. Let's take an example: if you are saying your service is going to have 99.9% availability, that is approximately eight and three-quarter hours of downtime in a year (roughly four and a half hours if you go to 99.95%). Are you ready for that kind of setup, and for that kind of SLA for your application? That is something you have to discuss with your stakeholders, because the more available you want your application to be, the higher the cost associated with it tends to be. So you need to take care of the DR for your application, how the application is going to behave if it does not hit 99.9 or 99.95% availability, how the run team is going to make sure it has the right resources, like the runbook and other details about the application, how the handover is going to happen from the application team that has been building the service to the run team, what kind of DevOps methodologies you are following, and whether the team has the autonomy to build and deploy patches or fixes for such an application. All those things have to be taken into account when you're defining these KPIs for your application. Okay, so with that being said, let's summarize. What did we learn here? We spoke about logs, metrics and traces, and how all of them correlate to give you a better experience in the overall observability of your system. Make sure you implement the SLIs, SLOs and SLAs for your application in consultation with the stakeholders, and evaluate how these behave under different circumstances. If you're running your services on AWS, try to leverage the native tooling that is already available: Amazon CloudWatch for logging, alarms and dashboards, and AWS X-Ray for tracing. And for a deeper dive into what is being offered as part of observability from AWS, hop over to the AWS observability workshop and get hands-on experience of all the features which are there. It's approximately a three to four hour, self-paced lab; you can run it in an AWS environment with all the different templates that are already available. So with that being said, thank you so much for your time and I hope you have a good day.
...

Suraj Muraleedharan

Senior DevOps Consultant @ AWS

Suraj Muraleedharan's LinkedIn account


