Transcript
This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE? A developer?
A quality engineer who wants to tackle the challenge of
improving reliability in your DevOps? You can enable your
DevOps for reliability with Chaos Native.
Create your free account at Chaos Native Litmus Cloud.
Hi everyone, thanks for joining this session today. My name is Ozioma Uzoegwu
and I'm a solutions architect in AWS. In my day
job I work with our SMB customers in the UK and I'm also part
of our front end and mobile specialist team. In this session we'll
be talking about observability in serverless applications.
Here's a quick view of the agenda of what we're going to go through. I'll start by
defining what a serverless application is. Then we start
looking at really what is observability and then I'll cover some
of the AWS services that you could use for observability, then wrap
up with some of the open source tools and some of the useful links and
resources you'll find useful as well. When we talk about serverless,
we mean the architecture of an event-driven application.
It usually consists of an event source which
generates an event, and this can be either identifying
events from changes in the data state or changes in a resource state,
or it can also be changes in a request endpoint,
for example a rest API. And then what an event basically does
is to trigger a lambda function. So a lambda function
is a small single purpose functionality that
can be programmed with any of the six programming languages supported
by lambda. Or you can also bring your own custom runtime using the runtime
API, and then the lambda function basically performs an
action. It can be either based on your business logic,
retrieving data from a data store, storing data in a data store, or just returning
items to the UI, or potentially even
calling an external HTTP endpoint.
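As a rough sketch of that flow (illustrative only: the "name" field and the response shape are invented for this example, and a real event's shape depends on the event source), a small single-purpose Lambda handler in Python could look like this:

```python
import json

def handler(event, context):
    # The event source (for example, an API Gateway REST request) delivers
    # an event; this function runs one small piece of business logic.
    # "name" is a made-up field for this sketch, not a fixed Lambda schema.
    body = json.loads(event.get("body") or "{}")
    name = body.get("name", "world")

    # Perform the action: here, just return a response to the caller.
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}"}),
    }
```

The same handler could just as easily write to a data store or call an external endpoint; the point is that it stays small and single-purpose.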
If you think about your traditional application stack,
maybe you have some workloads running on-prem. You typically have
a number of layers, right from the networking and storage, to
the server hardware, to the operating system, to the virtualization
software, right up to your application, your data and then your
business logic.
And remember, you need to monitor all
these various components. They are all your responsibility to manage,
maintain and keep up to date. What serverless does for
you is to really remove that undifferentiated heavy lifting that
comes with managing all these layers of the stack.
So we take care of the responsibility of quite a
number of layers within the typical application stack.
And as a customer you focus only on your application
code and your data and the business logic as well. Let's look
at what an example of a serverless application looks like.
You typically will have a front end,
and we have a service on the platform called AWS amplify console
that you can use to host static content. And by static content I
mean your HTML, your CSS and also your Javascript.
We also have another service on the platform called Amazon Cognito, which you
can use for your authentication. And then from the backend perspective,
to serve your APIs, we have a number of services,
and this is really where serverless comes into the picture.
So we've got API Gateway, which is a scalable API
management service that you could use to deploy your REST
or WebSocket APIs. And we've got the Lambda function
which basically responds to events that can be
triggered by your API gateway, which is your API request,
and then an Amazon DynamoDB which is a NoSQL database
that can store your data from the API.
It can even get a bit more complex. So you can
also have other serverless services on the platform, for example
Step Functions. In the example on the slide, you could see this is
a simple serverless feedback application, whereby a
user can submit feedback and then it goes through a number of activities
to process that feedback. Starting with sentiment analysis, where
it kind of looks at the feedback to say, is it a positive or negative
feedback? Then it stores the feedback in a DynamoDB
database, and then you can send a notification to
the feedback owners to say you've received the feedback, say to
Amazon Chime, for example. Okay,
so it can really get complex. And the key message here
is that there are a number of components and services that are involved
here. You could see the lambda functions, you could see the API
gateway. And the key aim of a
serverless web developer is to be able to kind of understand what is
going on between these services, the latency of the transactions,
where there might be potential bottlenecks or failures, or be a lot
more proactive in identifying where there might be issues
and how to resolve those issues. So let's move on to really understand what
is observability. And for me to explain this, I like
to use an analogy that has been used by my colleague Nathan Peck
in AWS as well. So think about this magic box.
You just joined a new company, and on your first
day, during your onboarding, you are told that you're going to be responsible
for this big magic box. The magic box works basically by taking
in a green circle. The green circle goes in, and ten milliseconds
later, it spits out a purple pentagon. And that's how it works.
There's a caveat here that the folks that developed this magic box have now
left the business. They didn't deliver the documentation,
and it's now left for you to manage this magic box 24/7,
and also make sure it's running 365
days a year. Now you crack
on with your job, and on the fifth day
of your job you just notice something strange. You put in a green circle
and 2 seconds later you get your purple pentagon.
This is far longer than the ten milliseconds it's meant to
take to produce the purple pentagon. And you wonder: what might be wrong? What's going
on? And then another day you put in a green circle,
and ten milliseconds later, you get a blue hexagon.
And again you are wondering what's going wrong here? Why is
it happening? And it might just be one of those days; that's how
the system behaved, and it will correct itself. And then
another day you put in a green circle, and the system catches
fire. And this is where it becomes very bad,
because your customers are no longer able to feed in their green circles.
They start looking at your competitors, looking at who can process
this better than you can. And that's really where
it begins to hurt, and where observability
can really help. So why
did you experience all the things you experienced, and why weren't you
able to resolve them? I think there are a
couple of reasons that come to mind. The first one is that you didn't
have any observability, so you didn't know anything that was happening in
the box. But some of the questions that you might be asking are really: what
is in that box? Why does it behave the way it does? When
its behavior changes, why did it change? And what must be done
to make this behavior a lot more consistent? Because you want consistency,
so that you can keep processing those green circles. There are also
other business stats that you can look at: what is the usage,
how many customers are expected to be using this box, and what's the
impact in terms of scalability? And also, what's the business impact
if green circles are not processed?
What does it mean from a business perspective? If I only process ten green
circles against 20, what does that mean in terms of business impact?
And these are really the things
you need in order to fully understand your whole
system and to make sure that you have the right observability
in place. So now, what
is observability? For me, if there's a single thing I'd appreciate
you taking away from this session, it's this: good observability
allows you to answer questions you did not know you needed
to ask. It is proactive, not just reactive.
But when a problem happens, you can basically assess the data in
your system and be able to understand why that problem
occurred. So let's look at the three pillars of observability tooling.
So the first one is the metrics. And metrics are basically
defined as the numeric data that you can measure at
various time intervals. Then you've got the logs, which are
basically timestamped records of discrete events
happening within your application. And finally you have traces, which
is basically tracing of the HTTP request that
really goes through various components within your application. And these are kind
of the three key pillars when you talk about observability.
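To make these three pillars a bit more concrete, here is a toy sketch (every field name below is invented for illustration) of what one metric point, one log record and one trace might look like as plain data:

```python
import time

now_ms = int(time.time() * 1000)  # current time in epoch milliseconds

# Metric: numeric data measured at a point in time.
metric = {"name": "request_latency_ms", "timestamp": now_ms, "value": 12.5}

# Log: a timestamped record of a discrete event in the application.
log_record = {
    "timestamp": now_ms,
    "level": "ERROR",
    "message": "payment declined",
    "order_id": "o-123",
}

# Trace: timed spans showing one request crossing several components.
trace = {
    "trace_id": "t-abc123",
    "spans": [
        {"service": "api-gateway", "duration_ms": 2},
        {"service": "lambda", "duration_ms": 9},
        {"service": "dynamodb", "duration_ms": 1},
    ],
}
```

Metrics tell you that something changed, logs tell you what happened, and traces tell you where in the request path it happened.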
Now, if you have a problem within your system, let's look
at the typical troubleshooting workflow
when you have observability tooling in
place. The first thing you mostly do is ask
a question; this is really what observability helps you to achieve. You can ask
a question to say why is my system behaving this way? Or you might receive
an alarm or a notification about an issue. And the next thing
you do is to be able to kind of use what we call a service
map to look at what might be potentially causing
that issue. Or how can this question I have be answered?
And then you've got the traces which basically looks at the
various touch points of your
request as it goes to the various services. And you can use trace
maps to be able to start identifying the potential reasons
for those issues or to answer the questions you have. And then you can
move over to kind of look at using trace analysis,
to kind of analyze the traces, to kind of have a deeper look of
what might be causing it. And finally, based on that correlation you
have maybe between your traces and metrics, you can then look at your logs
and delve a bit deeper to be able to identify the root cause. And that's
kind of the typical flow of how you kind of troubleshoot when you have
observability tooling in place. What we'll do next
is look at the AWS services that can help you through this workflow
and ensure you have observability
in place in your application. Now, we have two key
AWS services that help you to implement observability.
So the first one is Amazon CloudWatch, which is a
service that could help you to kind of ingest logs,
create metrics and alarms within your
application. We've also got AWS X-Ray, which is a distributed
tracing service that you could use to instrument tracing
in your application. It also gives you a platform to
perform analytics on your traces, and also to view a service
map to see the different components
that your request went through as it was being fulfilled.
Let's delve a bit deeper into Amazon CloudWatch. So, a
couple of stats here for you: Amazon CloudWatch processes over 1 quadrillion
metric observations each month, and also
it processes 3.9 trillion events each month.
And this is the service we use to monitor our entire
infrastructure of AWS and also Amazon.com, which kind of
gives you a feel of the scale of this service and its suitability to kind
of serve majority of the use cases. Finally,
it also has 100 petabytes of logs ingested every
month, which is quite massive when it comes to scale.
Let's then go back to the backend of your serverless application, which typically
contains your API gateway, your lambda and your Amazon Dynamodb,
and kind of talk through how you can implement observability
for these services using some of the two key services we've just
talked about on the platform. So the first one
is the built-in metrics. So we've got a
number of metrics for the AWS Lambda service and also the Amazon API
Gateway service. For Lambda, for example, we give you built-in metrics
around the invocation errors you have in your Lambda function, where there might
be potential throttling, the duration of your Lambda functions,
and potentially the concurrent executions of your Lambda functions
as well. With API Gateway, we have a range of built-in metrics.
For the REST APIs, the HTTP APIs, and also the WebSocket
APIs, you can start looking at things like latency and also
potential 4xx and 5xx errors you have as well.
Then for Amazon DynamoDB, we also have a number of
built-in metrics: things like the throttled
events, and the number of capacity units you have
on the service, both those consumed and those still
available for you to use. These are the metrics you can
start ingesting to understand a bit more about your serverless
application. We also give you a nice dashboard on CloudWatch
to visualize these metrics. So what I've got here
is a per-service metrics dashboard, where you can look at your Lambda functions
in terms of the invocations and also the duration of those
Lambda functions. We also provide a cross-service metrics
dashboard. This is really for when you have
an application that uses a number of different serverless services, like API
Gateway and Step Functions; you'll be able to use this cross-service
metrics dashboard to visualize
what's going on within your application. We know that
the built-in metrics are not enough. There are cases where you
will need your own custom metrics, and this might be, for example,
to look at your business and customer metrics. For example, you want to
monitor the revenue generated by this product,
the sign ups, the daily sign ups you're having, the page views
you're having within your web application. Or you can also start looking
at some of the operational metrics as well. If you think about the CI CD
pipeline, how long it takes you to recover from failure,
the number of calls or pages that you're having, or the
time to resolve an issue, these are some of the metrics that you want to
track that we don't currently support as built-in metrics today.
You can also look at some of the cold starts you have on Lambda.
And potentially, if you want to look at other dimensions, add some dimensions to your
metrics. Things like user id, the category or item. These are
some of the scenarios where you might need to build your own custom metrics.
You can create custom metrics for your application using
CloudWatch, using the built-in capabilities
of the AWS SDK to call the CloudWatch PutMetricData
API. For this API call,
you're charged per metric and per put call. On
the right, I've got an example of how this works.
So you just basically call the PutMetricData API,
and what it will do is take the metrics that you've
defined in your code and the value you've set and push that synchronously
to CloudWatch. We've also got the
embedded metric format, which I will cover a little bit more shortly on
a different way to do this. So you can also visualize your
custom metrics on CloudWatch. You could see this is a metric that
tracks uploads to your
system, and you'll be able to visualize the line graph of that
metric, or you can also view it as numbers. We've also
got what we call the CloudWatch Metrics Explorer,
which lets you drill down into your metrics based on the properties
and tags of those metrics as well.
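As a sketch of what that PutMetricData call looks like from code: the metric name "TracksUploaded", the "MyApp" namespace and the dimension below are all assumptions for this example. Building the request payload is plain Python, and the actual boto3 call (which needs AWS credentials) is left commented out:

```python
from datetime import datetime, timezone

def build_track_upload_metric(count):
    """Build a PutMetricData-style payload for a hypothetical
    'TracksUploaded' custom metric (all names are illustrative)."""
    return {
        "Namespace": "MyApp",  # assumed custom namespace, not an AWS default
        "MetricData": [{
            "MetricName": "TracksUploaded",
            "Dimensions": [{"Name": "Service", "Value": "upload-api"}],
            "Timestamp": datetime.now(timezone.utc),
            "Value": count,
            "Unit": "Count",
        }],
    }

payload = build_track_upload_metric(3)
# With boto3 and credentials in place, this pushes the metric synchronously:
# import boto3
# boto3.client("cloudwatch").put_metric_data(**payload)
```

Remember the pricing point above: you're charged per metric and per put call, so batching several data points into one MetricData list is usually worthwhile.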
So let's look at logging. Logging is one of
the key pillars of observability, and we have a number of built-in logging
mechanisms for customers across the various services. For API Gateway,
we support two levels of logging, ERROR and INFO, and you can set
this globally at the stage level, or you can override it on a per-method
basis. For HTTP APIs and also WebSocket APIs,
we allow customers to configure their logging using
logging variables as well. We also provide
capabilities for customers to enable logging within their Lambda
function. You can do this through the language-specific
equivalent of console.log in your application.
Or you can also use the PutMetricData API we discussed
in the last slide. Or you can use the embedded metric
format, which I'll be covering shortly, to create what we call structured JSON
logs in CloudWatch. You can then export those into
Amazon OpenSearch, which is the new name for Amazon Elasticsearch, or Amazon
S3, and then do your visualization using tools like Kibana or Athena and
QuickSight as well. Now let's look
at CloudWatch embedded metric format. So if you think
about this: when you log within your application code,
for example within Lambda, your log basically comes
out as text within a log file. What you then
need to do is process that log, take that
log line, understand what it's all about, and then be able to
potentially create metrics or alarms off it. What CloudWatch embedded
metric format helps you to do is to take away that undifferentiated
heavy lifting by basically allowing you to
embed custom metrics within your log file.
CloudWatch is then able to process that, extract the metrics,
and give you a visualization for those metrics.
You can enable this using the PutLogEvents
API call, and we support this with a number of open
source client libraries, in Node, in Python or
in Java. Let's look at an example of CloudWatch
embedded metric format. On the right, I've got an example of the
structure of the Cloudwatch embedded metric format. So you could
see the details about the Lambda function, and you
can also see the namespace and dimensions that
help to organize the CloudWatch metrics. And then you see the
metric details, which in this case are price and quantity,
which can be passed in from the event payload as well.
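As an illustrative sketch, a document like the one on the slide can be produced by printing JSON in the embedded metric format's `_aws`/`CloudWatchMetrics` structure. The namespace, dimension and metric names below are assumptions for this example; the open source client libraries mentioned above can build this structure for you:

```python
import json
import time

def emf_log_line(price, quantity):
    """Build a CloudWatch embedded metric format (EMF) document that
    embeds 'Price' and 'Quantity' metrics in an ordinary log line.
    Namespace and dimension values here are illustrative."""
    doc = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",              # assumed namespace
                "Dimensions": [["Service"]],
                "Metrics": [
                    {"Name": "Price", "Unit": "None"},
                    {"Name": "Quantity", "Unit": "Count"},
                ],
            }],
        },
        "Service": "checkout",  # the dimension value (illustrative)
        "Price": price,
        "Quantity": quantity,
    }
    return json.dumps(doc)

# In a Lambda function, printing this line to stdout is enough: CloudWatch
# picks the document up from the log stream and extracts the metrics.
print(emf_log_line(19.99, 2))
```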
This will be sent into Cloudwatch and
the metrics will be extracted and you'll be able to kind of visualize
these metrics within your various dashboards. Let's look
at Amazon CloudWatch Logs Insights. So when you've generated your logs,
the next thing is to really start deriving some insights from those
logs. And that's really what Amazon CloudWatch Logs Insights
does for you. It allows you to interactively search and
analyze your log data within Amazon CloudWatch Logs.
So for example, here I've got the logs
from a Lambda function, and you can filter the logs
by log level, say for ERROR. You can save your queries,
and you can query up to 20 log groups at
a given time. And you do this using a flexible
purpose-built query language we've built for CloudWatch Logs Insights.
You can also go a little bit more complex, looking at potentially
the top hundred most expensive executions
of your Lambda function. You do this via the billed duration.
So on the left I've shown the purpose-built
query that you could use for this, and then you could list
out the hundred most expensive invocations based
on the billed duration of the Lambda function. You can even go
further than that to start looking at things around performance. So for example, if you want
to look at the performance of your Lambda function, which is a key
insight to have
when you're talking about observability for your serverless application,
you can look at the performance by duration. So based
on the duration of the Lambda function, it can start giving you some feel for
the performance over a five-minute window, looking at the average,
the maximum, the minimum, and also the p90 values
for the duration of the Lambda function.
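The two queries just described might look roughly like this. The `@billedDuration`, `@duration` and `@type` fields are standard fields that Logs Insights discovers in Lambda's REPORT log lines, while the log group name is a placeholder; the boto3 `start_query` call that would actually run them is left commented out:

```python
# Hypothetical log group name for illustration:
LOG_GROUP = "/aws/lambda/my-function"

# Top 100 most expensive invocations, sorted by billed duration:
expensive_query = """
fields @timestamp, @billedDuration
| sort @billedDuration desc
| limit 100
"""

# Duration statistics in five-minute windows (average, max, min, p90),
# computed from Lambda's REPORT log lines:
performance_query = """
filter @type = "REPORT"
| stats avg(@duration), max(@duration), min(@duration), pct(@duration, 90) by bin(5m)
"""

# With boto3 and credentials in place, a query could be started like this:
# import boto3, time
# logs = boto3.client("logs")
# resp = logs.start_query(
#     logGroupName=LOG_GROUP,
#     startTime=int(time.time()) - 3600,
#     endTime=int(time.time()),
#     queryString=performance_query,
# )
```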
And then when you have your logs done and you have your metrics,
the next thing is to create alarms, to be able to alert
you when your metrics go outside
of a threshold, or when you identify an anomaly within your system.
And to do that, it's quite simple. Within CloudWatch, you select your
metric and define the statistic for that metric, say
the sum over a five-minute period.
You select the threshold type; in this case we're going
for a static threshold type and we're looking at anything lower
than five. And then you specify the
notification mechanism for when an alarm occurs, which in this case can be an
SNS notification to an email address.
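Sketched as boto3-style parameters (the alarm name, namespace, metric name and SNS topic ARN are all placeholders for this example), that static-threshold alarm could look like:

```python
def build_alarm_config(topic_arn):
    """Parameters for a static-threshold CloudWatch alarm: alert when the
    five-minute sum of a metric drops below 5 (names are illustrative)."""
    return {
        "AlarmName": "LowTracksUploaded",
        "Namespace": "MyApp",               # assumed custom namespace
        "MetricName": "TracksUploaded",
        "Statistic": "Sum",
        "Period": 300,                      # five-minute period, in seconds
        "EvaluationPeriods": 1,
        "Threshold": 5,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [topic_arn],        # e.g. an SNS topic that emails you
    }

config = build_alarm_config("arn:aws:sns:eu-west-1:123456789012:alerts")
# With boto3 and credentials in place:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**config)
```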
Something else we have within CloudWatch is called CloudWatch
anomaly detection. Think about some types of metrics
you might have where there is potentially some pattern on the
metrics, some discernible pattern on the metrics. What Cloudwatch can
do is to use machine learning to really understand that pattern and be
able to kind of alert when there is an anomaly detected,
something outside of the normal for your metrics. And it
does it for you using a built in machine learning model, and it will be
able to kind of alert you using the various alerting mechanisms within Cloudwatch.
Let's look at AWS X-Ray. AWS X-Ray provides
distributed tracing to help you have an end-to-end
view of requests flowing through an application. For
the Lambda service, you can instrument incoming requests for all supported
languages, and you can enable this within your
Lambda function by either ticking the checkbox within the settings of
the Lambda function, or using any of the infrastructure as
code tools of your choice, if that's the means you use
to deploy your Lambda function. For API Gateway,
what API Gateway does when it comes to tracing is to insert a tracing
header into HTTP calls, as well as report tracing
data back to the X-Ray service. And again, you can enable this
within API Gateway via the console or via infrastructure as
code. And on the right I've shown what a service map
could look like, which shows the tracing of the
request going through various services for your serverless application.
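As a small sketch of the infrastructure-as-code route for Lambda (the function name is a placeholder), switching on active tracing is a one-field configuration change via the UpdateFunctionConfiguration API:

```python
def tracing_update(function_name):
    """Parameters to switch a Lambda function to active X-Ray tracing,
    matching the shape of Lambda's UpdateFunctionConfiguration API."""
    return {
        "FunctionName": function_name,
        "TracingConfig": {"Mode": "Active"},  # vs. the default "PassThrough"
    }

params = tracing_update("my-function")  # placeholder function name
# With boto3 and credentials in place:
# import boto3
# boto3.client("lambda").update_function_configuration(**params)
```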
So on the screen I've got a tracing example. So this is looking at a
particular trace; this is an example of uploading data onto
Amazon S3. And you can see it shows the various
activities that happen as part of that transaction, and the latency
and duration each of them took. So you can see the
initialization of the Lambda function and also the upload,
the PutObject API call to Amazon S3, which unfortunately
returned a 404. But that's kind of the level of information you'll
be seeing from the trace, from this transaction.
We've also got the X ray analytics, which you can use to
kind of perform deep analytics on the X ray trace data.
So on the screen you could see a heat
map of retrieved traces, and you can also
kind of filter some of the traces based on a given
time range to be able to compare kind of the traces
returned within those two time range and then to kind of start spotting
potential issues within your application. You can also look at
divergence within a particular parameter within your
trace, for example HTTP status code, or if you've added
additional custom parameter within your traces, for example username.
You can start doing some analysis to compare
different users and what differences you are seeing from the traces
between those two users as well. Let's look
at CloudWatch ServiceLens. CloudWatch ServiceLens is
really the service that ties all this together. It provides
a single pane of glass where you can visualize your Cloudwatch metrics
and logs in addition to all the traces from AWS
x ray. It really gives you a complete view of your
application and its dependencies and you'll be able to kind of drill down to
that next level of detail that you need to be able to kind of troubleshoot
or identify where an issue might be going on
within your system. I think it's better to kind
of see a little demo of how service lens works and what you
can do with service lens. You can see all the services within the
service map. It'll be tiny, but we can filter through,
say a particular stage within an API gateway.
Or you can also filter by what we call the x ray group,
which brings out kind of all the services that are involved with
that particular x ray group. You see the trace summary
across the various services. We can select, for example,
a lambda function to be able to see the latency
of the lambda function, the number of requests per minute, and also the
faults per minute. You can drill down for that particular lambda
function where you'll be able to start seeing things like the latency,
number of requests and also the faults as well. You'll also
be able to drill down to the Lambda logs. You can also view the
metrics in the dashboard, and also view the traces. I think traces is
where it begins to get interesting, because for the trace within the
lambda function, you have filters that you could select
to be able to filter the trace. You can filter
and also see a very high level view of the traces. Let's focus on
the user agent. We want to see the users from Mozilla Firefox
and also running Windows as the operating system.
So you want to see the users accessing your application from there. Here we
have five traces. We just filter by the p95
to p99 traces, and then we'll
be able to see that particular trace for
that percentile, and then we can drill down within that trace.
You'll be able to see what the transaction looks like,
the request, the services that the request went through, so it started from an
API gateway, shows you the latency and the duration and
the response codes from API gateway, and then it moves to a lambda
function and then transacts with Dynamodb to
store data. You'll also be able to see the logs from the
Lambda service. In fact, you can see the logs from both API Gateway
and Lambda, which you can analyze using CloudWatch Logs Insights.
So far we have looked at the native AWS services that you could
use for implementing observability within your application.
Now let's go back to that troubleshooting workflow and see
how these services fit into each of the stages of this workflow.
Now, in the notification stage, you can use Amazon CloudWatch
alarms to notify you if there is any kind of incident within
your application or any metric that breaches a threshold.
And then you can also use ServiceLens,
with its service map capability, to identify potential
points of interest where you might want to deep dive. And then
when it comes to traces, you can use X-Ray to
view traces and view maps, and see the request as it
goes through various services within the platform,
and then you can start your analysis correlating some of the traces with
the metrics using x ray analytics to kind of dive a bit deep
into each of the traces. And if you need more information and
more context to that particular trace, you can then use
log insights to kind of query your cloud watch logs to be able to
gain more information about that particular incident.
Now let's look at AWS open source observability services.
We have a number of services on the platform for observability,
some open source services. So for example, we've got the AWS Distro
for OpenTelemetry, which you could use for collection.
We've also got the Amazon managed service for Prometheus.
So Prometheus is a very popular open source project
for collecting metrics for your container
workloads, or potentially as well for your serverless application.
We've packaged that as a managed service, making sure that as a customer
you don't need to worry about the underlying physical infrastructure that runs your Prometheus
server. We've also got the Amazon OpenSearch Service,
which is the new name for the Amazon elasticsearch service,
and you could use that for your logs and traces, to ingest your
logs and traces. And then finally Amazon managed
service for Grafana. Again, Grafana is another popular open
source project to help you to kind of visualize
your metrics, and we've packaged that as well as a managed service,
enabling customers to run Grafana
without worrying about the underlying physical infrastructure.
Let's delve a bit deeper into the AWS Distro for OpenTelemetry.
Before I do, I want to talk a little bit more about
OpenTelemetry itself. What is it all about? A recent survey
identified that 50% of companies use at least five observability
tools, and within that 50% of
companies, 30% use more than ten
observability tools. Think about the developers who work
in these companies. They have to use different
sdks and agents to be able to implement observability within
their application. And this kind of reduces developer velocity and also
increases the learning curve they need to go through to be able to do this.
Also, when it comes to resource consumption,
running multiple observability agents and collectors
increases your resource consumption, and can potentially increase
your compute cost as well. In many cases,
these observability tools do not integrate with each other
in an easy way, so there needs to be some manual correlation
of the data you are seeing from one tool with the data you are seeing
from another tool. And manual correlation is in some ways prone
to error. That is really the problem that the OpenTelemetry
project is looking to solve. So OpenTelemetry
is an open source project. It's basically an
observability framework for your cloud native software.
It comes with a collection of tools, APIs and SDKs,
and it basically allows you to instrument,
generate, collect and export telemetry data
for analysis, in order to really understand your software's
performance and behavior. And by telemetry data,
we're talking about metrics, logs and traces, which are the core pillars
of observability. Let's then look at the AWS Distro
for OpenTelemetry. It's basically a secure, production-ready,
open source distribution of OpenTelemetry
supported by AWS. It's an upstream-first distro
of OpenTelemetry, which means that AWS contributes
to the upstream first and then builds out the downstream implementation
in the AWS Distro for OpenTelemetry.
It is certified by AWS for security and predictability,
and backed by AWS support. And what we've also done with
this is to kind of make it easy for customers to integrate
open telemetry in their lambda function via one click deploys.
We've also bundled the OpenTelemetry collector
as a Lambda layer. So if you want to integrate OpenTelemetry
into your Lambda function using the AWS Distro, you can
easily do that via the Lambda layer, so you don't need to change or
modify your Lambda function. You can also export
the data that is collected by the AWS Distro for
OpenTelemetry to a number of solutions, for example to CloudWatch,
to X-Ray, to Amazon Managed Service for Prometheus, and also to the
OpenSearch Service and other partner solutions as well.
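Concretely, wiring the collector layer onto an existing Python Lambda function amounts to attaching the layer and setting a wrapper environment variable. In this sketch the layer ARN is a placeholder (real ADOT layer ARNs are region- and runtime-specific), and the `/opt/otel-instrument` wrapper path is what the ADOT Lambda layers document; treat both as assumptions to verify against the ADOT docs:

```python
# Placeholder ARN: real ADOT layer ARNs are region- and runtime-specific.
ADOT_LAYER_ARN = "arn:aws:lambda:eu-west-1:123456789012:layer:adot-python:1"

def adot_update(function_name):
    """Parameters to attach the ADOT collector layer to a Lambda function
    and enable auto-instrumentation via the layer's wrapper script."""
    return {
        "FunctionName": function_name,
        "Layers": [ADOT_LAYER_ARN],
        "Environment": {
            "Variables": {
                # Wrapper script shipped inside the ADOT layer:
                "AWS_LAMBDA_EXEC_WRAPPER": "/opt/otel-instrument",
            }
        },
    }

params = adot_update("my-function")  # placeholder function name
# With boto3 and credentials in place:
# import boto3
# boto3.client("lambda").update_function_configuration(**params)
```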
So to end, I'm sharing a couple of resources that will
be useful. For example, the AWS Distro for
OpenTelemetry has a GitHub page where you can have a look at
that open source project. Another tool I
didn't talk about in this session is called Lambda Powertools, which you can
also use to implement some observability within your serverless application. Have a
look at that. Also, we've built the
AWS Lambda Operator Guide, which is an opinionated guide to
some of the key concepts of operating Lambda
within your serverless application. Things around monitoring
are a key area within that guide. Have a look at it as well.
Thank you so much for joining the session. I really appreciate
the time you've taken to listen in. Again, thank you to Conf42
for inviting me to speak at this session, and I
wish you a great rest of the conference. Thank you.