Transcript
This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE? A developer?
A quality engineer who wants to tackle the challenge of
improving reliability in your DevOps? You can enable your
DevOps for reliability with Chaos Native.
Create your free account at Chaos Native Litmus Cloud.
Hi everyone, thanks for joining this session today. My name is Ozioma Uzoegwu
and I'm a solutions architect in AWS. In my day
job I work with our SMB customers in the UK and I'm also part
of our front end and mobile specialist team. In this session we'll
be talking about observability in serverless applications.
Here's a quick view of the agenda of what we're going to go through. I'll start by
defining what a serverless application is. Then we start
looking at really what is observability and then I'll cover some
of the AWS services that you could use for observability, then wrap
up with some of the open source tools and some of the useful links and
resources you'll find useful as well. When we talk about serverless,
we mean the architecture of an event-driven application.
It usually consists of an event source which
generates an event, and this can be either identifying
events from changes in the data state or changes in a resource state,
or it can also be changes in a request endpoint,
for example a rest API. And then what an event basically does
is to trigger a lambda function. So a lambda function
is a small single purpose functionality that
can be programmed with any of the six programming languages supported
by lambda. Or you can also bring your own custom runtime using the runtime
API, and then the lambda function basically performs an
action. It can be either based on your business logic,
retrieving data from a data store, storing data in a data store, or just returning
items to the UI, or potentially even
calling an external HTTP endpoint.
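As a rough sketch of that flow (illustrative only: the "name" field and the response shape are invented for this example, and a real event's shape depends on the event source), a small single-purpose Lambda handler in Python could look like this:

```python
import json

def handler(event, context):
    # The event source (for example, an API Gateway REST request) delivers
    # an event; this function runs one small piece of business logic.
    # "name" is a made-up field for this sketch, not a fixed Lambda schema.
    body = json.loads(event.get("body") or "{}")
    name = body.get("name", "world")

    # Perform the action: here, just return a response to the caller.
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}"}),
    }
```

The same handler could just as easily write to a data store or call an external endpoint; the point is that it stays small and single-purpose.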
If you think about your traditional application stack,
maybe you have some workloads running on-prem. You typically have
a number of layers, right from the networking and storage, to
the server hardware, to the operating system, to the virtualization
software, right up to your application, your data and then your
business logic.
And remember, you need to monitor all
these various components. They are all your responsibility to manage,
maintain and keep up to date. What serverless does for
you is to really remove that undifferentiated heavy lifting that
comes with managing all these layers of the stack.
So we take care of the responsibility of quite a
number of layers within the typical application stack.
And as a customer you focus only on your application
code and your data and the business logic as well. Let's look
at what an example of a serverless application looks like.
You typically will have a front end,
and we have a service on the platform called AWS amplify console
that you can use to host static content. And by static content I
mean your HTML, your CSS and also your Javascript.
We also have another service on the platform called Amazon Cognito, which you
can use for your authentication. And then from the backend perspective,
to serve your APIs, we have a number of services,
and this is really where serverless comes into the picture.
So we've got API Gateway, which is a scalable API
management service that you could use to deploy your REST
or WebSocket APIs. And we've got the Lambda function
which basically responds to events that can be
triggered by your API gateway, which is your API request,
and then an Amazon DynamoDB which is a NoSQL database
that can store your data from the API.
It can even get a bit more complex. So you can
also have other serverless services on the platform, for example
Step Functions. In the example on the slide, you could see this is
a simple serverless feedback application, whereby a
user can submit feedback and then it goes through a number of activities
to process that feedback. Starting with sentiment analysis, where
it kind of looks at the feedback to say, is it a positive or negative
feedback? Then it stores the feedback in a DynamoDB
database, and then you can send a notification to
the feedback owners to say you've received the feedback, say to
Amazon Chime, for example. Okay,
so it can really get complex. And the key message here
is that there are a number of components and services that are involved
here. You could see the lambda functions, you could see the API
gateway. And the key aim of a
serverless web developer is to be able to kind of understand what is
going on between these services, the latency of the transactions,
where there might be potential bottlenecks or failures, or be a lot
more proactive in identifying where there might be issues
and how to resolve those issues. So let's move on to really understand what
is observability. And for me to explain this, I like
to use an analogy that has been used by my colleague Nathan Peck
in AWS as well. So think about this magic box.
You just joined a new company, and on your first
day, during your onboarding, you are told that you're going to be responsible
for this big magic box. The magic box works basically by taking
in a green circle. The green circle goes in, and ten milliseconds
later, it spits out a purple pentagon. And that's how it works.
There's a caveat here that the folks that developed this magic box have now
left the business. They didn't deliver the documentation,
and it's now left for you to manage this magic box 24/7,
and also make sure it's running 365
days a year. Now you crack
on with your job, and on the fifth day
of your job you just notice something strange. You put in a green circle
and 2 seconds later you get your purple pentagon.
This is far longer than the ten milliseconds it's meant to
take to produce the purple pentagon. And you wonder: what might be wrong? What's going
on? And then another day you put in a green circle,
and ten milliseconds later, you get a blue hexagon.
And again you are wondering what's going wrong here? Why is
it happening? And it might just be one of those days; that's how
the system behaved, and it will correct itself. And then
another day you put in a green circle, and the system catches
fire. And this is where it becomes very bad,
because your customers are no longer able to feed in their green circles.
They start looking at your competitors, looking at who can process
this better than you can. And that's really where
it begins to hurt, and where observability
can really help. So why
did you experience all the things you experienced, and why weren't you
able to resolve them? I think there are a
couple of reasons that come to mind. The first one is that you didn't
have any observability, so you didn't know anything that was happening in
the box. But some of the questions that you might be asking are really: what
is in that box? Why does it behave the way it does? When
its behavior changes, why did it change? And what must be done
to make this behavior a lot more consistent? Because you want consistency,
so that you can keep processing those green circles. There are also
other business stats that you can look at: what is the usage,
how many customers are expected to be using this box, and what's the
impact in terms of scalability? And also, what's the business impact
if green circles are not processed?
What does it mean from a business perspective? If I only process ten green
circles against 20, what does that mean in terms of business impact?
And these are really the things
you need in order to fully understand your whole
system and to make sure that you have the right observability
in place. So now, what
is observability? For me, if there's a single thing I'd appreciate
you taking away from this session, it's this: good observability
allows you to answer questions you did not know you needed
to ask. It is proactive, not just reactive.
But when a problem happens, you can basically assess the data in
your system and be able to understand why that problem
occurred. So let's look at the three pillars of observability tooling.
So the first one is the metrics. And metrics are basically
defined as the numeric data that you can measure at
various time intervals. Then you've got the logs, which are
basically timestamped records of discrete events
happening within your application. And finally you have traces, which
is basically tracing of the HTTP request that
really goes through various components within your application. And these are kind
of the three key pillars when you talk about observability.
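To make these three pillars a bit more concrete, here is a toy sketch (every field name below is invented for illustration) of what one metric point, one log record and one trace might look like as plain data:

```python
import time

now_ms = int(time.time() * 1000)  # current time in epoch milliseconds

# Metric: numeric data measured at a point in time.
metric = {"name": "request_latency_ms", "timestamp": now_ms, "value": 12.5}

# Log: a timestamped record of a discrete event in the application.
log_record = {
    "timestamp": now_ms,
    "level": "ERROR",
    "message": "payment declined",
    "order_id": "o-123",
}

# Trace: timed spans showing one request crossing several components.
trace = {
    "trace_id": "t-abc123",
    "spans": [
        {"service": "api-gateway", "duration_ms": 2},
        {"service": "lambda", "duration_ms": 9},
        {"service": "dynamodb", "duration_ms": 1},
    ],
}
```

Metrics tell you that something changed, logs tell you what happened, and traces tell you where in the request path it happened.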
Now, if you have a problem within your system, let's look
at the typical troubleshooting workflow
when you have observability tooling in
place. The first thing you mostly do is ask
a question; this is really what observability helps you to achieve. You can ask
a question to say why is my system behaving this way? Or you might receive
an alarm or a notification about an issue. And the next thing
you do is to be able to kind of use what we call a service
map to look at what might be potentially causing
that issue. Or how can this question I have be answered?
And then you've got the traces which basically looks at the
various touch points of your
request as it goes to the various services. And you can use trace
maps to be able to start identifying the potential reasons
for those issues or to answer the questions you have. And then you can
move over to kind of look at using trace analysis,
to kind of analyze the traces, to kind of have a deeper look of
what might be causing it. And finally, based on that correlation you
have maybe between your traces and metrics, you can then look at your logs
and delve a bit deeper to be able to identify the root cause. And that's
kind of the typical flow of how you kind of troubleshoot when you have
observability tooling in place. What we'll do next
is look at the AWS services that can help you through this workflow
and ensure you have observability
in place in your application. Now, we have two key
AWS services that help you to implement observability.
So the first one is Amazon CloudWatch, which is a
service that could help you to kind of ingest logs,
create metrics and alarms within your
application. We've also got AWS X-Ray, which is a distributed
tracing service that you could use to instrument tracing
in your application. It also gives you a platform to
perform analytics on your traces, and also to view a service
map to see the different components
that your request went through as it was being fulfilled.
Let's delve a bit deeper into Amazon CloudWatch. So, a
couple of stats here for you: Amazon CloudWatch processes over 1 quadrillion
metric observations each month, and also
it processes 3.9 trillion events each month.
And this is the service we use to monitor our entire
infrastructure of AWS and also Amazon.com, which kind of
gives you a feel of the scale of this service and its suitability to kind
of serve majority of the use cases. Finally,
it also has 100 petabytes of logs ingested every
month, which is quite massive when it comes to scale.
Let's then go back to the backend of your serverless application, which typically
contains your API gateway, your lambda and your Amazon Dynamodb,
and kind of talk through how you can implement observability
for these services using some of the two key services we've just
talked about on the platform. So the first one
is the built-in metrics. So we've got a
number of metrics for the AWS Lambda service and also the Amazon API
Gateway service. For Lambda, for example, we give you built-in metrics
around the invocation errors you have in your Lambda function, where there might
be potential throttling, the duration of your Lambda functions,
and potentially the concurrent executions of your Lambda functions
as well. With API Gateway, we have a range of built-in metrics.
For the REST APIs, the HTTP APIs, and also the WebSocket
APIs, you can start looking at things like latency and also
potential 4xx and 5xx errors you have as well.
Then for Amazon DynamoDB, we also have a number of
built-in metrics: things like the throttled
events, and the number of capacity units you have
on the service, both those consumed and those still
available for you to use. These are the metrics you can
start ingesting to understand a bit more about your serverless
application. We also give you a nice dashboard on CloudWatch
to visualize these metrics. So what I've got here
is a per-service metrics dashboard, where you can look at your Lambda functions
in terms of the invocations and also the duration of those
Lambda functions. We also provide a cross-service metrics
dashboard. This is really for when you have
an application that uses a number of different serverless services, like API
Gateway and Step Functions; you'll be able to use this cross-service
metrics dashboard to visualize
what's going on within your application. We know that
the built-in metrics are not enough. There are cases where you
will need your own custom metrics, and this might be, for example,
to look at your business and customer metrics. For example, you want to
monitor the revenue generated by this product,
the sign ups, the daily sign ups you're having, the page views
you're having within your web application. Or you can also start looking
at some of the operational metrics as well. If you think about the CI CD
pipeline, how long it takes you to recover from failure,
the number of calls or pages that you're having, or the
time to resolve an issue, these are some of the metrics that you want to
track that we don't currently support as built-in metrics today.
You can also look at some of the cold starts you have on Lambda.
And potentially, if you want to look at other dimensions, add some dimensions to your
metrics. Things like user id, the category or item. These are
some of the scenarios where you might need to build your own custom metrics.
You can create custom metrics for your application using
CloudWatch, using the built-in capabilities
of the AWS SDK to call the CloudWatch PutMetricData
API. For this API call,
you're charged per metric and per put call. On
the right, I've got an example of how this works.
So you just basically call the PutMetricData API,
and what it will do is take the metrics that you've
defined in your code and the value you've set and push that synchronously
to CloudWatch. We've also got the
embedded metric format, which I will cover a little bit more shortly on
a different way to do this. So you can also visualize your
custom metrics on CloudWatch. You could see this is a metric that
tracks uploads to your
system, and you'll be able to visualize the line graph of that
metric, or you can also view it as numbers. We've also
got what we call the CloudWatch Metrics Explorer,
which lets you drill down into your metrics based on the properties
and tags of those metrics as well.
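As a sketch of what that PutMetricData call looks like from code: the metric name "TracksUploaded", the "MyApp" namespace and the dimension below are all assumptions for this example. Building the request payload is plain Python, and the actual boto3 call (which needs AWS credentials) is left commented out:

```python
from datetime import datetime, timezone

def build_track_upload_metric(count):
    """Build a PutMetricData-style payload for a hypothetical
    'TracksUploaded' custom metric (all names are illustrative)."""
    return {
        "Namespace": "MyApp",  # assumed custom namespace, not an AWS default
        "MetricData": [{
            "MetricName": "TracksUploaded",
            "Dimensions": [{"Name": "Service", "Value": "upload-api"}],
            "Timestamp": datetime.now(timezone.utc),
            "Value": count,
            "Unit": "Count",
        }],
    }

payload = build_track_upload_metric(3)
# With boto3 and credentials in place, this pushes the metric synchronously:
# import boto3
# boto3.client("cloudwatch").put_metric_data(**payload)
```

Remember the pricing point above: you're charged per metric and per put call, so batching several data points into one MetricData list is usually worthwhile.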
So let's look at logging. Logging is one of
the key pillars of observability, and we have a number of built-in logging
mechanisms for customers across the various services. For API Gateway,
we support two levels of logging, ERROR and INFO, and you can set
this globally at the stage level, or you can override it on a per-method
basis. For HTTP APIs and also WebSocket APIs,
we allow customers to configure their logging using
logging variables as well. We also provide
capabilities for customers to enable logging within their Lambda
function. You can do this through the language-specific
equivalent of console.log in your application.
Or you can also use the PutMetricData API we discussed
in the last slide. Or you can use the embedded metric
format, which I'll be covering shortly, to create what we call structured JSON
logs in CloudWatch. You can then export those into
Amazon OpenSearch, which is the new name for Amazon Elasticsearch, or Amazon
S3, and then do your visualization using tools like Kibana or Athena and
QuickSight as well. Now let's look
at CloudWatch embedded metric format. So if you think
about this: when you log within your application code,
for example within Lambda, your log basically comes
out as text within a log file. What you then
need to do is process that log, take that
log line, understand what it's all about, and then be able to
potentially create metrics or alarms off it. What CloudWatch embedded
metric format helps you to do is to take away that undifferentiated
heavy lifting by basically allowing you to
embed custom metrics within your log file.
CloudWatch is then able to process that, extract the metrics,
and give you a visualization for those metrics.
You can enable this using the PutLogEvents
API call, and we support this with a number of open
source client libraries, in Node, in Python or
in Java. Let's look at an example of CloudWatch
embedded metric format. On the right, I've got an example of the
structure of the Cloudwatch embedded metric format. So you could
see the details about the Lambda function, and you
can also see the namespace and dimensions that
help to organize the CloudWatch metrics. And then you see the
metric details, which in this case are price and quantity,
which can be passed in from the event payload as well.
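As an illustrative sketch, a document like the one on the slide can be produced by printing JSON in the embedded metric format's `_aws`/`CloudWatchMetrics` structure. The namespace, dimension and metric names below are assumptions for this example; the open source client libraries mentioned above can build this structure for you:

```python
import json
import time

def emf_log_line(price, quantity):
    """Build a CloudWatch embedded metric format (EMF) document that
    embeds 'Price' and 'Quantity' metrics in an ordinary log line.
    Namespace and dimension values here are illustrative."""
    doc = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",              # assumed namespace
                "Dimensions": [["Service"]],
                "Metrics": [
                    {"Name": "Price", "Unit": "None"},
                    {"Name": "Quantity", "Unit": "Count"},
                ],
            }],
        },
        "Service": "checkout",  # the dimension value (illustrative)
        "Price": price,
        "Quantity": quantity,
    }
    return json.dumps(doc)

# In a Lambda function, printing this line to stdout is enough: CloudWatch
# picks the document up from the log stream and extracts the metrics.
print(emf_log_line(19.99, 2))
```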
This will be sent into Cloudwatch and
the metrics will be extracted and you'll be able to kind of visualize
these metrics within your various dashboards. Let's look
at Amazon CloudWatch Logs Insights. So when you've generated your logs,
the next thing is to really start deriving some insights from those
logs. And that's really what Amazon CloudWatch Logs Insights
does for you. It allows you to interactively search and
analyze your log data within Amazon CloudWatch Logs.
So for example, here I've got the logs
from a Lambda function, and you can filter the logs
by log level, say for ERROR. You can save your queries,
and you can query up to 20 log groups at
a given time. And you do this using a flexible
purpose-built query language we've built for CloudWatch Logs Insights.
You can also go a little bit more complex, looking at potentially
the top hundred most expensive executions
of your Lambda function. You do this via the billed duration.
So on the left I've shown the purpose-built
query that you could use for this, and then you could list
out the hundred most expensive invocations based
on the billed duration of the Lambda function. You can even go
further than that to start looking at things around performance. So for example, if you want
to look at the performance of your Lambda function, which is a key
insight to have
when you're talking about observability for your serverless application,
you can look at the performance by duration. So based
on the duration of the Lambda function, it can start giving you some feel for
the performance over a five-minute window, looking at the average,
the maximum, the minimum, and also the p90 values
for the duration of the Lambda function.
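The two queries just described might look roughly like this. The `@billedDuration`, `@duration` and `@type` fields are standard fields that Logs Insights discovers in Lambda's REPORT log lines, while the log group name is a placeholder; the boto3 `start_query` call that would actually run them is left commented out:

```python
# Hypothetical log group name for illustration:
LOG_GROUP = "/aws/lambda/my-function"

# Top 100 most expensive invocations, sorted by billed duration:
expensive_query = """
fields @timestamp, @billedDuration
| sort @billedDuration desc
| limit 100
"""

# Duration statistics in five-minute windows (average, max, min, p90),
# computed from Lambda's REPORT log lines:
performance_query = """
filter @type = "REPORT"
| stats avg(@duration), max(@duration), min(@duration), pct(@duration, 90) by bin(5m)
"""

# With boto3 and credentials in place, a query could be started like this:
# import boto3, time
# logs = boto3.client("logs")
# resp = logs.start_query(
#     logGroupName=LOG_GROUP,
#     startTime=int(time.time()) - 3600,
#     endTime=int(time.time()),
#     queryString=performance_query,
# )
```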
And then when you have your logs done and you have your metrics,
the next thing is to create alarms, to be able to alert
you when your metrics go outside
of a threshold, or when you identify an anomaly within your system.
And to do that, it's quite simple. Within CloudWatch, you select your
metric and define the statistic for that metric, say
the sum over a five-minute period.
You select the threshold type; in this case we're going
for a static threshold type and we're looking at anything lower
than five. And then you specify the
notification mechanism for when an alarm occurs, which in this case can be an
SNS notification to an email address.
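Sketched as boto3-style parameters (the alarm name, namespace, metric name and SNS topic ARN are all placeholders for this example), that static-threshold alarm could look like:

```python
def build_alarm_config(topic_arn):
    """Parameters for a static-threshold CloudWatch alarm: alert when the
    five-minute sum of a metric drops below 5 (names are illustrative)."""
    return {
        "AlarmName": "LowTracksUploaded",
        "Namespace": "MyApp",               # assumed custom namespace
        "MetricName": "TracksUploaded",
        "Statistic": "Sum",
        "Period": 300,                      # five-minute period, in seconds
        "EvaluationPeriods": 1,
        "Threshold": 5,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [topic_arn],        # e.g. an SNS topic that emails you
    }

config = build_alarm_config("arn:aws:sns:eu-west-1:123456789012:alerts")
# With boto3 and credentials in place:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**config)
```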
Something else we have within CloudWatch is called CloudWatch
anomaly detection. Think about some types of metrics
you might have where there is potentially some pattern on the
metrics, some discernible pattern on the metrics. What Cloudwatch can
do is to use machine learning to really understand that pattern and be
able to kind of alert when there is an anomaly detected,
something outside of the normal for your metrics. And it
does it for you using a built in machine learning model, and it will be
able to kind of alert you using the various alerting mechanisms within Cloudwatch.
Let's look at AWS X-Ray. AWS X-Ray provides
distributed tracing to help you have an end-to-end
view of requests flowing through an application. For
the Lambda service, you can instrument incoming requests for all supported
languages, and you can enable this within your
Lambda function by either ticking the checkbox within the settings of
the Lambda function, or using any of the infrastructure as
code tools of your choice, if that's the means you use
to deploy your Lambda function. For API Gateway,
what API Gateway does when it comes to tracing is to insert a tracing
header into HTTP calls, as well as report tracing
data back to the X-Ray service. And again, you can enable this
within API Gateway via the console or via infrastructure as
code. And on the right I've shown what a service map
could look like, which shows the tracing of the
request going through various services for your serverless application.
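As a small sketch of the infrastructure-as-code route for Lambda (the function name is a placeholder), switching on active tracing is a one-field configuration change via the UpdateFunctionConfiguration API:

```python
def tracing_update(function_name):
    """Parameters to switch a Lambda function to active X-Ray tracing,
    matching the shape of Lambda's UpdateFunctionConfiguration API."""
    return {
        "FunctionName": function_name,
        "TracingConfig": {"Mode": "Active"},  # vs. the default "PassThrough"
    }

params = tracing_update("my-function")  # placeholder function name
# With boto3 and credentials in place:
# import boto3
# boto3.client("lambda").update_function_configuration(**params)
```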
So on the screen I've got a tracing example. So this is looking at a
particular trace; this is an example of uploading data onto
Amazon S3. And you can see it shows the various
activities that happen as part of that transaction, and the latency
and duration each of them took. So you can see the
initialization of the Lambda function and also the upload,
the PutObject API call to Amazon S3, which unfortunately
returned a 404. But that's kind of the level of information you'll
be seeing from the trace, from this transaction.
We've also got the X ray analytics, which you can use to
kind of perform deep analytics on the X ray trace data.
So on the screen you could see a heat
map of retrieved traces, and you can also
kind of filter some of the traces based on a given
time range to be able to compare kind of the traces
returned within those two time range and then to kind of start spotting
potential issues within your application. You can also look at
divergence within a particular parameter within your
trace, for example HTTP status code, or if you've added
additional custom parameter within your traces, for example username.
You can start doing some analysis to compare
different users and what differences you are seeing from the traces
between those two users as well. Let's look
at CloudWatch ServiceLens. CloudWatch ServiceLens is
really the service that ties all this together. It provides
a single pane of glass where you can visualize your Cloudwatch metrics
and logs in addition to all the traces from AWS
x ray. It really gives you a complete view of your
application and its dependencies and you'll be able to kind of drill down to
that next level of detail that you need to be able to kind of troubleshoot
or identify where an issue might be going on
within your system. I think it's better to kind
of see a little demo of how service lens works and what you
can do with service lens. You can see all the services within the
service map. It'll be tiny, but we can filter through,
say a particular stage within an API gateway.
Or you can also filter by what we call the x ray group,
which brings out kind of all the services that are involved with
that particular x ray group. You see the trace summary
across the various services. We can select, for example,
a lambda function to be able to see the latency
of the lambda function, the number of requests per minute, and also the
faults per minute. You can drill down for that particular lambda
function where you'll be able to start seeing things like the latency,
number of requests and also the faults as well. You'll also
be able to drill down to the Lambda logs. You can also view the
metrics in the dashboard, and also view the traces. I think traces is
where it begins to get interesting, because for the trace within the
lambda function, you have filters that you could select
to be able to filter the trace. You can filter
and also see a very high level view of the traces. Let's focus on
the user agent. We want to see the users from Mozilla Firefox
and also running Windows as the operating system.
So you want to see the users accessing your application from there. Here we
have five traces. We just filter by the p95
to p99 traces, and then we'll
be able to see that particular trace for
that percentile, and then we can drill down within that trace.
You'll be able to see what the transaction looks like,
the request, the services that the request went through, so it started from an
API gateway, shows you the latency and the duration and
the response codes from API gateway, and then it moves to a lambda
function and then transacts with Dynamodb to
store data. You'll also be able to see the logs from the
Lambda service. In fact, you can see the logs from both API Gateway
and Lambda, which you can analyze using CloudWatch Logs Insights.
So far we have looked at the native AWS services that you could
use for implementing observability within your application.
Now let's go back to that troubleshooting workflow and see
how these services fit into each of the stages of this workflow.
Now, in the notification stage, you can use Amazon CloudWatch
alarms to notify you if there is any kind of incident within
your application or any metric that breaches a threshold.
And then you can also use ServiceLens,
with its service map capability, to identify potential
points of interest where you might want to deep dive. And then
when it comes to traces, you can use X-Ray to
view traces and view maps, and see the request as it
goes through various services within the platform,
and then you can start your analysis correlating some of the traces with
the metrics using x ray analytics to kind of dive a bit deep
into each of the traces. And if you need more information and
more context to that particular trace, you can then use
log insights to kind of query your cloud watch logs to be able to
gain more information about that particular incident.
Now let's look at AWS open source observability services.
We have a number of services on the platform for observability,
some open source services. So for example, we've got the AWS Distro
for OpenTelemetry, which you could use for collection.
We've also got the Amazon managed service for Prometheus.
So Prometheus is a very popular open source project
for collecting metrics for your container
workloads, or potentially as well for your serverless application.
We've packaged that as a managed service, making sure that as a customer
you don't need to worry about the underlying physical infrastructure that runs your Prometheus
server. We've also got the Amazon OpenSearch Service,
which is the new name for the Amazon elasticsearch service,
and you could use that for your logs and traces, to ingest your
logs and traces. And then finally Amazon managed
service for Grafana. Again, Grafana is another popular open
source project to help you to kind of visualize
your metrics, and we've packaged that as well as a managed service,
enabling customers to run Grafana
without worrying about the underlying physical infrastructure.
Let's delve a bit deeper into the AWS Distro for OpenTelemetry.
Before I do, I want to talk a little bit more about
OpenTelemetry itself. What is it all about? A recent survey
identified that 50% of companies use at least five observability
tools, and within that 50% of
companies, 30% use more than ten
observability tools. Think about the developers who work
in these companies. They have to use different
sdks and agents to be able to implement observability within
their application. And this kind of reduces developer velocity and also
increases the learning curve they need to go through to be able to do this.
Also, when it comes to resource consumption,
running multiple observability agents and collectors
increases your resource consumption, and can potentially increase
your compute cost as well. In many cases,
these observability tools do not integrate with each other
in an easy way, so there needs to be some manual correlation
of the data you are seeing from one tool with the data you are seeing
from another tool. And manual correlation is in some ways prone
to error. That is really the problem that the OpenTelemetry
project is looking to solve. So OpenTelemetry
is an open source project. It's basically an
observability framework for your cloud native software.
It comes with a collection of tools, APIs and SDKs,
and it basically allows you to instrument,
generate, collect and export telemetry data
for analysis, in order to really understand your software's
performance and behavior. And by telemetry data,
we're talking about metrics, logs and traces, which are the core pillars
of observability. Let's then look at the AWS Distro
for OpenTelemetry. It's basically a secure, production-ready,
open source distribution of OpenTelemetry
supported by AWS. It's an upstream-first distro
of OpenTelemetry, which means that AWS contributes
to the upstream first and then builds out the downstream implementation
in the AWS Distro for OpenTelemetry.
It is certified by AWS for security and predictability,
and backed by AWS support. And what we've also done with
this is to kind of make it easy for customers to integrate
open telemetry in their lambda function via one click deploys.
We've also bundled the OpenTelemetry collector
as a Lambda layer. So if you want to integrate OpenTelemetry
into your Lambda function using the AWS Distro, you can
easily do that via the Lambda layer, so you don't need to change or
modify your Lambda function. You can also export
the data that is collected by the AWS Distro for
OpenTelemetry to a number of solutions, for example to CloudWatch,
to X-Ray, to Amazon Managed Service for Prometheus, and also to the
OpenSearch Service and other partner solutions as well.
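Concretely, wiring the collector layer onto an existing Python Lambda function amounts to attaching the layer and setting a wrapper environment variable. In this sketch the layer ARN is a placeholder (real ADOT layer ARNs are region- and runtime-specific), and the `/opt/otel-instrument` wrapper path is what the ADOT Lambda layers document; treat both as assumptions to verify against the ADOT docs:

```python
# Placeholder ARN: real ADOT layer ARNs are region- and runtime-specific.
ADOT_LAYER_ARN = "arn:aws:lambda:eu-west-1:123456789012:layer:adot-python:1"

def adot_update(function_name):
    """Parameters to attach the ADOT collector layer to a Lambda function
    and enable auto-instrumentation via the layer's wrapper script."""
    return {
        "FunctionName": function_name,
        "Layers": [ADOT_LAYER_ARN],
        "Environment": {
            "Variables": {
                # Wrapper script shipped inside the ADOT layer:
                "AWS_LAMBDA_EXEC_WRAPPER": "/opt/otel-instrument",
            }
        },
    }

params = adot_update("my-function")  # placeholder function name
# With boto3 and credentials in place:
# import boto3
# boto3.client("lambda").update_function_configuration(**params)
```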
So to end, I'm sharing a couple of resources that will
be useful. For example, the AWS Distro for
OpenTelemetry has a GitHub page where you can have a look at
that open source project. Another tool I
didn't talk about in this session is called Lambda Powertools, which you can
also use to implement some observability within your serverless application. Have a
look at that. Also, we've built the
AWS Lambda Operator Guide, which is an opinionated guide to
some of the key concepts of operating Lambda
within your serverless application. Things around monitoring
are a key area within that guide. Have a look at it as well.
Thank you so much for joining the session. I really appreciate
the time you've taken to listen in. Again, thank you to Conf42
for inviting me to speak at this session, and I
wish you a great rest of the conference. Thank you.