Conf42 Observability 2024 - Online

Observability for Modern Event Driven Applications

Abstract

Adopting event-driven architecture helps you build agile, modern applications. However, end-to-end observability of the loosely coupled components is critical for achieving operational excellence. Join this session to learn how to build observable event-driven applications using AWS native observability tools.

Summary

  • Urmila Raju is a senior solutions architect with Amazon Web Services. Today's session focuses on observability for modern event-driven applications. EDA offers a lot of good things, but it is hard to get right.
  • Event-driven architecture is primarily based on domain-driven design and the event storming methodology. You start from your business process, identify the events in each business domain, and then design your technical architecture based on that. End-to-end observability is key for a successful EDA.
  • The next service to look at is Amazon SQS, a pull-based event broker; EventBridge is a push-based broker. You can configure your dead-letter queue and the number of retries based on your business SLAs. These metrics are organized in CloudWatch through namespaces and dimensions.
  • The business logic in an event-driven architecture usually lives in a Lambda function. In that case you can instrument your code to create custom metrics and send them to CloudWatch, something most applications will need for better performance.
  • Lambda Insights is an out-of-the-box feature in CloudWatch. Use cases where it comes in handy include identifying high-cost functions, identifying memory leaks, and identifying performance changes. There are various mechanisms by which each of these areas can be fine-tuned.
  • CloudWatch logs can be collected from various services, such as API Gateway and Lambda. You can also derive custom metrics from your logs, which is useful for tracking business SLAs.
  • The next step is to derive insights, similar to Lambda Insights. A newly announced feature is AI-powered natural language query generation. Another way to alert is through CloudWatch anomaly detection. Then it's time to move to the last two stages: advanced and proactive observability.
  • Pattern analysis in Logs Insights is a recently announced feature. It can be very useful in identifying the various patterns of logs in your application. If there is any out-of-the-ordinary pattern, it automatically highlights it to you.
  • AWS X-Ray provides an end-to-end view of requests flowing through an application. You can enable it on Lambda and on API Gateway. When the traces from these applications are reported back, X-Ray creates a service map.
  • CloudWatch ServiceLens is a single pane of glass that helps you drill into your observability telemetry data. The best practices covered apply not just to EDA but to any application.
  • Next is tool selection: choose the right tool for the job. Observability needs to be an internal process that everyone agrees on. As your application grows, your maturity model grows. That is the eight-step process for observability best practices.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, welcome to Conf42 Observability 2024, and thank you for taking the time to join my session. I want to start today's session with a story. Let's say there is a bank, XXX, and it has released a newsletter stating: dear customers, we are happy to announce that you can now open a savings account through your mobile banking. Place the request with a few clicks on your mobile app and get your account operational in 2 hours. And let's say the bank built this solution using a very modern architecture: event-driven architecture, a cloud-native architecture. A few days after the product launch, a customer calls the customer service representative and states: I placed a request yesterday for a savings account on the mobile banking app, but my account is not operational yet. The customer service representative logs a ticket with the mobile banking team. The mobile banking team looks at the backend systems and can see that the request was successfully placed, so she forwards the ticket to the core banking team. The core banking team looks at their system and says: I have not received any account opening request. So what happened to the account opening request? Or should I say, what happened to the account opening event? The answer to that question is the premise of this session. Welcome to my session on observability for modern event-driven applications. I'm Urmila Raju, a senior solutions architect with Amazon Web Services. Let's get started and dive into the session. I want to do some basic level setting on what event-driven architecture is. Please note the text in bold and underlined: it is an architectural style for building loosely coupled systems, and these loosely coupled systems talk to each other by emitting and responding to events. So what is an event? An event is a change in state, or an update, emitted by a producer. A producer can be any component within your application, and consumers are the components interested in that event. Producers and consumers talk to each other through an event broker, and that is how the high-level architecture of an event-driven system comes about. So why do customers move towards event-driven applications? Because it offers a lot of good things, and a few of the highlights are: speed and agility, because the systems are loosely coupled, so each team can build its own component independently and get it to deployment. The next is resiliency: the blast radius of failure is reduced because of the loose coupling between the systems, and each system fails independently, so there is no single point of failure. The next is scalability: you are able to minimize waiting time because of the asynchronous and parallel processing that event-driven architecture brings. And lastly, and most importantly, it enables you to work backwards from your business requirements and your business process workflow. This is an architectural style that brings your business and technology stakeholders together in building technology applications. These good things will help you meet your business objectives in a very effective way, but it's important to understand that EDA is hard to get right. There are various factors, and the key ones are highlighted here, starting with eventual consistency.
What we mean by that is, due to the loosely coupled systems and the asynchronous nature, components are not consistent at the same time, so your business process must be able to cope with such delays. The next is end-to-end performance: if there is a performance bottleneck in one of the components, it is going to impact the end-to-end performance of your application. And third is meeting business SLAs. When we talked about the pros, we said EDA helps you work backwards from your business process, which means you need to meet your business SLAs in this type of architecture. In a business process workflow, some steps might be real time, some near real time, and some batch, so you must design in such a way that the SLAs of each step are met properly to get the EDA right. So how do we do this right? There are many architectural design decisions you need to make, and one of the key ones is observability: observing your system so that when an event is flowing between systems, as in the story we started with, you know where the event is and at what time, and if an event fails, there are proactive mechanisms built into your architecture to recognize that and take appropriate action. That's where observability plays a role. So what is observability? It's a measure of how well we can understand a system from the work it does. It is about collecting the right amount of data, gaining insights from it, and taking proactive actions to make your application work better. I want to demonstrate this with an example use case, which relates back to our story of opening a savings account for an existing customer through a mobile banking app. Here is a high-level business process workflow. It can be any complex workflow, but I've put a very oversimplified version here because we are only going to use it to see how observability fits into the business process workflow. Let's say the customer logs into the mobile app, selects a savings product, and checks the product eligibility, and then the request to open the savings account is placed. Once it is placed, it goes to core banking to get the account opened. And after the account is opened there could be post-account-opening steps, like a monthly interest schedule getting updated, or a mobile push notification sent to tell the customer that the account is opened and operational, or a welcome email, or a survey sent to learn how the account opening journey has been. If you look at the orange boxes, these steps, or events, are all written in the past tense, because events are usually named in the past tense, and the gray boxes around them are the business domains from which these events originate. Events can flow between domains. Event-driven architecture is primarily based on domain-driven design and the event storming methodology. I'm not going to dive into those concepts, but a good way of designing an EDA is this: you start from your business process, identify the events in each business domain, and then design your technical architecture based on that. So let's say this is our business process workflow.
Accompanying this, there would be business SLAs, because we talked about SLAs previously. For this example use case, you could have SLAs like the ones I've highlighted here: open the account 24/7, that is, the customer can log into the mobile banking app anytime and place the request; the product eligibility check is done in real time, and it's always important to define what real time means, so here we define it as 100 milliseconds for the product eligibility check to complete and the account opening request to be placed; the account should be operational within 2 hours of the request being placed; and other SLAs defined in a similar way. The point to note is that this business process workflow contains some synchronous real-time steps and some asynchronous near-real-time or batch steps. And why it is hard to get right is that a lot of things can go wrong here. For example: what happens if the product eligibility check takes more than 100 milliseconds, which is the SLA? What happens if the account opening service is down? What happens if the mobile push notification fails? That's why we say end-to-end observability is key for a successful EDA. You need visibility into each component of your end-to-end architecture, and you should be able to do real-time troubleshooting on the errors and issues that occur. As you can see in this picture, observability has both operational benefits and, as you move towards the right, business benefits: it helps with the overall customer experience and with meeting your business objectives and outcomes. In traditional monitoring, you monitor across all of your layers, from the storage and network layers up to the business layer. But when you build event-driven architecture, especially on AWS, you build it with many serverless services: SQS, SNS, Amazon EventBridge, API Gateway, Lambda. These are the services you typically use, and because they are serverless, those lower layers of monitoring are offloaded to AWS, so you can focus on your business application and data observability alone. Now, going back to our example: so far we have talked only about the business process and the events, so I have put some services behind them to show what an architecture for such a solution could look like. We are not going to rationalize the design choices of why I've used EventBridge here or why I've used SQS, because in an event-driven architecture there are no right or wrong answers; it is primarily based on what your requirements are, using the right architectural style, and then choosing the appropriate services for it. In this example, the top part is a synchronous process where you do the product eligibility check through a synchronous API call, which goes to the product DB and returns the response. Once that is done, an event to open the account is placed into EventBridge, that event is put into an SQS queue, and the core banking platform, which I assume here is an on-premises data center system, listens to that queue: whenever there is a request, it picks it up and does the necessary processing to open the account.
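To make the producer side concrete, here is a minimal sketch of publishing such an event with boto3. The bus name, source, and detail fields are illustrative assumptions, not part of the talk's architecture.

```python
import json

import boto3

events = boto3.client("events")

def publish_account_opening_requested(customer_id: str, product_id: str) -> None:
    """Publish an 'account opening requested' event onto a custom event bus."""
    events.put_events(
        Entries=[
            {
                "Source": "mobile-banking.account-opening",  # illustrative source
                "DetailType": "AccountOpeningRequested",     # event named in past tense
                "EventBusName": "savings-account-bus",       # hypothetical bus name
                "Detail": json.dumps(
                    {"customerId": customer_id, "productId": product_id}
                ),
            }
        ]
    )
```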
Once the account is opened, an event is placed back into EventBridge, and from there that event is of interest to various systems: the service that updates the monthly interest schedule, the customer communications service that sends the mobile push notification and the email, and the marketing team that sends the account opening survey. This is just to show you how the flow is made up of synchronous and asynchronous steps, and each step has its own SLAs, so you need to monitor each of these components and achieve observability as per your business SLAs. So how do we do that? Let's start by looking at an observability maturity model. A maturity model helps customers evaluate where they are, so that they know where they want to be and how to get there; as they expand their workloads, their observability is expected to mature. We start from the foundational level, foundational monitoring, which relates to collecting telemetry data. What do we mean by telemetry data? There are three types, and they are the three pillars of observability: metrics, logs, and traces. Metrics are time series data calculated or measured at various time intervals, such as an API request rate or error rate, or the duration of a Lambda function. Logs are timestamped records of discrete events that happen within your system or the components of your application, such as a failure event, an error event, or a state transition. And then we have traces: a trace represents a single user journey across multiple components of your application. Traces are especially useful in microservices and API-based architectures, where you can see how an API request is routed through various systems and the response comes back, tracing the entire request. These three form the pillars of observability, and when you do observability with AWS native tools, Amazon CloudWatch covers logs and metrics, and AWS X-Ray covers traces. Let's look at each observability pillar, and as we cover each one, I'll go back to the example application design and relate what kind of observability we can apply to it, to give you some context. We'll start with viewing standard metrics. CloudWatch has built-in metrics: whenever you integrate AWS services with CloudWatch, a set of metrics for each service is automatically logged into CloudWatch. The serverless services highlighted here are the ones used in the example architecture we just discussed, but you can do the same with other AWS services as well. These built-in metrics alone should be able to meet something like 70% of your observability needs. Let's see some examples of key metrics. For Lambda, the invocation metric helps assess the amount of traffic and any failures, and the performance metrics cover memory utilization and execution duration, which also relate to the cost of the function. Then we have concurrency: the concurrency metric helps assess the number of parallel invocations, track the performance of the application against existing concurrency limits, and decide whether you need to increase the limits or set up provisioned concurrency as per the needs of your application.
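To make this concrete, a minimal sketch of pulling one of these standard Lambda metrics with boto3 might look like this; the function name is an illustrative assumption.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hourly invocation counts for one function over the last 24 hours,
# read straight from the built-in AWS/Lambda namespace.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Invocations",
    Dimensions=[{"Name": "FunctionName", "Value": "check-product-eligibility"}],
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
    EndTime=datetime.now(timezone.utc),
    Period=3600,
    Statistics=["Sum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].isoformat(), int(point["Sum"]))
```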
Similarly, for API Gateway I've highlighted a set of key metrics. API Gateway is the front door for your microservices in a modern application, so keeping track of API call counts, latency, and errors can be very helpful in measuring your business objectives. And when we talk about limits: although these are serverless services and scale inherently, you need to be cautious about limits to avoid throttling. In some cases throttling can be good, too; for example, you can throttle the number of requests to API Gateway to avoid security attacks, or set per-client limits when multiple clients access your API Gateway. So it's important to analyze the limits your application operates with, whether they are the right limits, or whether you need an increase to make your application perform better. The next service to look at is Amazon SQS. SQS is a pull-based event broker, meaning the consumer has to come and pull the messages; until then, the messages or events stay in the queue. So a metric like the approximate age of the oldest message is worth monitoring: if that age keeps increasing beyond a particular threshold, the consumer is not keeping up with the volume of messages in the queue, and that is something to track and investigate. The next is Amazon EventBridge, which we also had in our design, and I have highlighted some key metrics here, like dead-letter queue invocations. I mentioned SQS is a pull-based broker, whereas Amazon EventBridge is a push-based broker, which means the responsibility for retries and error handling lies with the broker itself. Say EventBridge is trying to send an event to a target system and the target is unavailable: it will retry as many times as configured in EventBridge, and if it still cannot reach the target, it writes the event to a dead-letter queue. Again, it writes to the dead-letter queue only if you have configured one. So if something arrives in that queue, it is a failed event, and you should ask: what is the business impact of a failed event? You can configure your dead-letter queue and the number of retries based on your business SLAs. Even in our example, you could have a scenario where the savings account opening request has come in, but for some reason it was never picked up by the account opening service, perhaps because EventBridge never managed to place the event onto the queue at all. If that is the case, you can have that event written to a dead-letter queue, keep track of this metric, and take appropriate action on the back of it (a sketch of wiring this up follows below).
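A minimal sketch of that configuration with boto3, assuming hypothetical rule, queue, and dead-letter queue names; the retry numbers are placeholders to be aligned with your own business SLA.

```python
import boto3

events = boto3.client("events")

# Attach the SQS target with an explicit retry policy and a dead-letter queue.
# Rule name, bus name, and ARNs are all illustrative placeholders.
events.put_targets(
    Rule="account-opening-requested",
    EventBusName="savings-account-bus",
    Targets=[
        {
            "Id": "account-opening-queue",
            "Arn": "arn:aws:sqs:eu-west-1:123456789012:account-opening-queue",
            "RetryPolicy": {
                "MaximumRetryAttempts": 10,        # align with your business SLA
                "MaximumEventAgeInSeconds": 3600,  # stop retrying after one hour
            },
            "DeadLetterConfig": {
                "Arn": "arn:aws:sqs:eu-west-1:123456789012:account-opening-dlq"
            },
        }
    ],
)
```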
Those were examples of standard metrics, and they are written into and organized in CloudWatch through namespaces and dimensions. Consider a namespace like a box or container for your scope; the scope can be your application, so in this case I've made the savings account opening application the namespace, and within it you can have dimensions. The dimensions can be the service name: for Lambda, the check-product-eligibility service is the dimension within which you track the metrics; similarly, for SQS the queue name can be a dimension, and for EventBridge the event bus can be a dimension within which you track the metrics. Those are the standard metrics, but you can have custom dimensions and custom metrics as well, which is what we are going to see next. I mentioned that around 70% of your needs will be met with standard metrics, but the built-in metrics alone may not be enough. There are scenarios where you need to measure application performance against your business goals, like revenue, signups, or page views. Those are not things you can track at the infrastructure level; they need to be tracked from your application's business logic. And what do we write the business logic in? In an event-driven architecture it is usually a Lambda function, or it can be a container service within which you run some business logic, and you want the tracking to be done from that business logic. In that case you can instrument your code to create custom metrics and send them to CloudWatch. This is something most applications will need, and we will see how it is done in the coming slides. Now, going back to our maturity model: we have covered foundational monitoring, and next we move towards telemetry analysis and insights. We have the data; how do we derive insights? Start with the Lambda service, which is the key business logic service in an event-driven architecture. How do you collect insights from it? There is an out-of-the-box feature in CloudWatch called Lambda Insights. When you enable it, you can monitor, troubleshoot, and optimize the performance of your AWS Lambda functions. Some use cases where this comes in handy are identifying high-cost functions, identifying memory leaks, identifying performance changes whenever a new version of a Lambda function is deployed, and understanding the latency drivers in a function. Latency drivers are actually a key concept, because a Lambda execution time splits into several parts: cold start time, bootstrapping time, and the actual execution time. Cold start is the time AWS takes to provision a Lambda instance, and bootstrapping time is the time to load your dependencies and libraries; then you have the actual execution time. It's important to split the whole execution time into cold start, bootstrapping, and execution to see where any bottleneck is, and there are various mechanisms by which each of these areas can be fine-tuned. Lambda Insights appears as a dashboard within CloudWatch; these are automatic dashboards that CloudWatch creates, and there are two main types. One is the multi-function dashboard, which provides an aggregated view across multiple Lambda functions: you can see the list of Lambda functions in your account, how much cold start each has, the memory utilization, and so on. The other is the single-function dashboard, which helps you view a single Lambda function and identify root causes for any issues. It is a very useful feature, so I recommend looking at it. Even in our architecture, let's go back to our example and see where it can be useful.
We had the product eligibility service as a Lambda function. The whole duration of its execution directly impacts the business SLA, which is 100 milliseconds for the product eligibility check, so if the duration of the Lambda itself exceeds that, it is something to look at. Okay, that was metrics, plus a quick overview of Lambda Insights. Now let's see what you can do with the next pillar of observability, which is structured and centralized logging. CloudWatch logs can be collected from various services; the key services I have highlighted are API Gateway and Lambda. For API Gateway there are two levels of logging: logging errors and logging information. Maybe in your lower environments you want both, while in your higher, stable environments you only want to track errors; it's up to the customer's requirements. You can also create custom metrics based on your logs, that is, filter a set of logs on some criteria, for example a count of 400 and 403 errors, and then create a custom metric from that count. That is a metric filter you can create and add to your custom dashboards. Next is Lambda logging, and this is where logging helps you write custom metrics: if you remember, we just talked about custom metrics, and this is the way you do it. There are two ways, either the PutMetricData API or the embedded metric format. PutMetricData is a synchronous API call, meaning you write the metrics during the execution of the Lambda, which adds unnecessary overhead to your Lambda execution time. The recommended approach is to do it asynchronously, and the great example is the embedded metric format: you compose a custom message with whatever metric information you want to record, write it as a structured log line to CloudWatch Logs, and the metrics are extracted asynchronously, outside the execution of the Lambda function. In this way you bring your custom metrics into CloudWatch through logs (a minimal sketch follows below). To give an example of where CloudWatch logging relates to business SLAs: we said API Gateway has error and information logs, and if you remember, there was a business SLA for the ability to open an account 24/7. If errors are happening at the product eligibility check, that SLA is impacted, because the customer will not be able to place the account opening request while the eligibility check is failing, so that is something to avoid or remediate immediately.
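Here is a minimal sketch of the embedded metric format, assuming a hypothetical eligibility-check handler. The function only writes one structured log line; CloudWatch extracts the metric from it outside the function's execution path.

```python
import json
import time

def handler(event, context):
    # ... business logic for the eligibility check would run here ...

    emf_payload = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": "SavingsAccountOpening",  # illustrative namespace
                    "Dimensions": [["Service"]],
                    "Metrics": [{"Name": "EligibilityChecks", "Unit": "Count"}],
                }
            ],
        },
        "Service": "check-product-eligibility",  # dimension value
        "EligibilityChecks": 1,                  # metric value
    }
    # Printing the line sends it to CloudWatch Logs, where the metric is
    # extracted asynchronously, adding no work to the invocation itself.
    print(json.dumps(emf_payload))
    return {"statusCode": 200}
```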
So now let's say we have all of the logs; the next step is to derive insights, similar to Lambda Insights, which was an inherent feature. If you want to do similar querying over the rest of your metric and log data, there is CloudWatch Logs Insights, which has a specific query syntax. For example, to get the top 100 most expensive executions, you select the fields, sort by the billed duration in descending order, and limit to 100 so you get the top 100 records ranked by cost. Another example is to get the last 100 error messages: again you select the fields, put a filter condition on the log level being ERR, sort by timestamp in descending order, and limit to 100 rows. As you can see, writing these queries and learning the syntax is a bit of a learning curve, so to help customers get started with this query format, a new feature has been announced called AI-powered natural language query generation. This is still in preview, as you'll note, not generally available, but it is a great capability and very helpful: you type your query in natural language, similar to what we had in the previous slide, so if you type "get the last 100 error messages in the API Gateway CloudWatch log group", it generates the query for you, and then you can fine-tune it further and get your results. So that is one way of getting AI-powered insights from CloudWatch. That was the intermediate stage, analysis and insights. Now it's time to move to the last two stages, which are advanced and proactive observability. How can you proactively find and handle errors, do some anomaly detection, and take appropriate actions? The first area is creating alerts. One way to do alerts is through CloudWatch alarms: you choose a specific metric and set a threshold on it, and whenever that threshold is breached, the alarm is raised, meaning a notification is sent to a target. The notification can go through SNS, through an email to the appropriate operations team, or through an integration with an incident management system. In our example, an alert could fire when the age of the message in the queue grows beyond the business SLA; we already looked at that metric. If that happens, no one is picking up the account opening request, which is a concern: the account needs to be opened before the 2-hour business SLA is reached, so if no one is picking up the request you need to alert someone and get it corrected (a sketch of such an alarm follows at the end of this section). The next way of alerting is CloudWatch anomaly detection. If you enable this feature, CloudWatch keeps track of your metrics and their patterns. As you see in this graph, there are times or durations of the day or week with peaks and times with fewer requests, and if there is a change in the regular pattern, it raises an alarm. You may then need a human-in-the-loop step to check whether it is really an alarm situation or an increase in traffic that caused the anomaly. In our example, what could an anomaly detection scenario look like? Say from eight to five you see a peak in account openings and less traffic after that, and then one day, in the middle of the night, there are hundreds and hundreds of account opening requests. That is an anomaly and something to look at; it could be a security vulnerability pattern, someone trying to attack the site with multiple requests, in which case you might have to look at putting a WAF in place or preventing distributed denial of service against your application. In those cases, anomaly detection will be very useful.
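Going back to the queue-age alarm, here is a minimal boto3 sketch. The queue name and SNS topic are hypothetical, and the 30-minute threshold is an illustrative choice that sits well inside the 2-hour SLA.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the oldest account opening request has waited more than
# 30 minutes, leaving time to react before the 2-hour SLA is breached.
cloudwatch.put_metric_alarm(
    AlarmName="account-opening-queue-message-age",
    Namespace="AWS/SQS",
    MetricName="ApproximateAgeOfOldestMessage",
    Dimensions=[{"Name": "QueueName", "Value": "account-opening-queue"}],
    Statistic="Maximum",
    Period=300,                 # evaluate five-minute windows
    EvaluationPeriods=1,
    Threshold=1800,             # seconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],  # hypothetical topic
)
```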
Another related feature is log patterns. When would you need to look at patterns within logs? There are real challenges with log analysis, some of which are highlighted here: there is too much data, because you are continuously getting logs from the various components of a system, and if there is any change in the system, the amount and the type of logs you produce will change as well. So how do you proactively detect unusual changes in your logs within that huge volume? You want a mechanism that does pattern matching, saying: these are the various types of log lines you have, this is the usual pattern of your application, and if a new pattern is recognized, it proactively lets you know, so you can check whether it is something of concern or simply due to a change in your application. An example: this is an API request logged in API Gateway, and the pattern is an information message, followed by a timestamp, the text "API request received", and a customer ID. That is recorded as a pattern. Pattern analysis in Logs Insights is a recently announced feature for capturing these patterns. Whenever there is a new application and you have enabled logging in its components, this feature can be very useful in identifying the various patterns of logs in your application, and if there is an out-of-the-ordinary pattern, it automatically highlights it to you, so you can be aware of it and judge whether it is an anomaly or a genuinely new pattern that has emerged in your application. So we have covered logs and metrics, and now we come to tracing. Tracing, as I mentioned, is done through AWS X-Ray, and this service gives you an end-to-end view of requests flowing through an application. You can enable it on Lambda and on API Gateway, and there are a few other services X-Ray integrates with to bring them into your trace, for example EventBridge. There are limitations, but you can still do tracing through EventBridge if you instrument your code on the producer side. If you instrument the producer and then send the event, an X-Ray trace header, similar to what you see for API Gateway, is sent from the producer into EventBridge; EventBridge passes that header on to the target, and the target can continue the trace. In this way you bring these applications into your trace (a minimal sketch of this producer-side instrumentation follows below). What X-Ray creates when all of these traces are reported back is a service map, which is nothing but the flow of the event or request through the various applications. Taking our example, you can trace the real-time flow where the customer request is sent to API Gateway, then to Lambda, and then to a DynamoDB table; and if you remember, we had a business SLA of completing this request in 100 milliseconds.
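A minimal sketch of that producer-side instrumentation with the X-Ray SDK for Python. It assumes active tracing is enabled on the function, and reuses the illustrative bus and event names from earlier.

```python
import json

import boto3
from aws_xray_sdk.core import patch_all, xray_recorder

# Patching the AWS SDK makes every boto3 call, including put_events,
# carry the X-Ray trace header downstream to EventBridge.
patch_all()

events = boto3.client("events")

def handler(event, context):
    # With active tracing enabled, Lambda opens the trace segment; this
    # subsegment records how long the publish step itself takes.
    with xray_recorder.in_subsegment("publish-account-opening-requested"):
        events.put_events(
            Entries=[
                {
                    "Source": "mobile-banking.account-opening",
                    "DetailType": "AccountOpeningRequested",
                    "EventBusName": "savings-account-bus",
                    "Detail": json.dumps({"customerId": event.get("customerId")}),
                }
            ]
        )
```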
So you can use X-Ray to see what the end-to-end processing time is, and also the split of time in each service: how much latency is in API Gateway, how much in Lambda, and how much in DynamoDB. That will give you an idea of where fine-tuning, if needed at all, has to be done. For X-Ray, you can enable it on the Lambda console or through the Amazon API Gateway console, or via infrastructure as code, for example AWS SAM. So that's a very quick overview of tracing with AWS X-Ray. And the last bit I wanted to highlight for Lambda is Lambda Powertools. We are not going to dive deep into it, and I have added a resource link at the end of the slides to learn more, but at a high level it is a developer toolkit, a fairly opinionated library created by AWS, that helps you implement observability best practices: you can do logs, metrics, and traces with very minimal code. That's the main idea of it, and it is very useful not just for observability but for many other serverless best practices; if you assess serverless workloads against the serverless lens of the Well-Architected Framework, these are the various areas in which Powertools assists you. So it's worth mentioning, which is why I've highlighted it (a minimal sketch of a handler using it follows below).
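As a minimal sketch of the Powertools idea, here is a hypothetical handler that gets structured logs, a custom metric via the embedded metric format, and X-Ray tracing from three decorators; the service and namespace names are illustrative.

```python
from aws_lambda_powertools import Logger, Metrics, Tracer
from aws_lambda_powertools.metrics import MetricUnit

logger = Logger(service="check-product-eligibility")
tracer = Tracer(service="check-product-eligibility")
metrics = Metrics(namespace="SavingsAccountOpening", service="check-product-eligibility")

@logger.inject_lambda_context        # structured JSON logs with invocation context
@tracer.capture_lambda_handler       # X-Ray segment for the handler
@metrics.log_metrics                 # flushes metrics as embedded metric format
def handler(event, context):
    metrics.add_metric(name="EligibilityChecks", unit=MetricUnit.Count, value=1)
    logger.info("Eligibility check completed")
    return {"statusCode": 200}
```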
Now that we've seen all of the pieces, imagine there is an observability team, or all the application teams involved in building this end-to-end application. If something goes wrong and you want to troubleshoot, it's good to have everything in a single place, so you don't have to go to multiple places to find out where the issue was: a single pane of glass in which you can see your logs, metrics, alarms, dashboards, and traces. That's why CloudWatch has the ServiceLens feature, a single pane of glass that helps you drill into all of the observability telemetry data we have discussed so far. Please take a look at it; it is going to be very useful. And lastly, I want to finish with some best practices for observing event-driven applications. These best practices apply not just to EDA but to any application where you want to get started with observability. It is an eight-step process, and we are going to whiz through it; you can look at the resources for more detail. First, observe what matters. As we discussed, a huge amount of logs, metrics, and traces can be generated by each of your components, so focus on what matters to your business and what matters to your customers. That's why we keep going back to the business SLAs and work backwards from them to decide what data needs to be observed. Second, measure your objectives against those SLAs so that you know what good looks like. We cannot just look at the happy path and declare success; we need to be able to say, this is the metric, I have achieved it, and that is why my application is performing at its best. Third, identify the sources from which this telemetry data has to be taken. Fourth, plan ahead: this is not reactive monitoring, it's proactive observability, and that's important to keep in mind. The next step is the alerting strategy. We discussed the various types of alerts that can be created, but define the criteria, because some alerts can be just warnings while others are critical and need immediate action, and define the appropriate actions for each alert. The next is dashboards. Now that we have all of the data, you can create nice graphs and charts within CloudWatch, but have a strategy for what data goes into each dashboard and who is going to look at it. You can create very high-level dashboards, like customer experience, or how your application performed last week versus this week, for CxO and head-of levels, and very low-level dashboards going into the nitty-gritty details of your application, maybe for your platform engineering team. The next is tool selection: choose the right tool for the job. In this session we talked about AWS native tools for observability, but many of our customers who build EDA on AWS still use third-party or open-source tools like OpenTelemetry, which is the industry standard and supported by various vendor applications like Grafana and Prometheus. Those can also be good choices for your application; it all depends on what your need is, and then you pick the right features for it. Then, bringing it all together: observability needs to be an internal process that everyone agrees on, and it has to be part of operational readiness, as in, these are the observability capabilities that must be in place for the application to go live. That kind of shared mindset and cultural change has to be there in your organization to mature your observability framework. And finally, iterate, because this is not a one-off process. As your application grows, your maturity model grows, and when new features are introduced into your application, your observability changes as well, so this has to be an iterative process, reviewed routinely. So that is my eight-step process for the best practices of observability. Use those best practices, overcome the challenges you have in EDA, and get your EDA right, because EDA is great and it is going to help you deliver business outcomes very effectively. I added some further reading to learn more about serverless observability and observability for modern applications, and also a link to the Lambda Powertools we talked about. So thank you so much for your time. I'm Urmila Raju; please feel free to connect with me on LinkedIn, and I'm happy to take your questions as well. Thank you.
...

Urmila Raju

Senior Solutions Architect @ AWS

Urmila Raju's LinkedIn account


