Transcript
Hello everyone, welcome to Conf42 Observability 2024, and thank you for taking the time to join my session.
I want to start today's session with a story.
Let's say there is a bank XXX, and it has released a newsletter stating: dear customers, we are happy to announce that you can now open a savings account through your mobile banking. Place the request with a few clicks on your mobile app and get your account operational in 2 hours.
And let's say that this solution was built by the bank using a very modern architecture: an event-driven, cloud-native architecture. A few days after this product launch, a customer calls the customer service representative and says, I placed a request yesterday for a savings account on the mobile banking app, but my account is not operational yet. The customer service representative logs a ticket to the mobile banking team. The mobile banking team takes a look at the backend systems and can see that the request was successfully placed, so she forwards the ticket to the core banking team. Now the core banking team looks at their system and says, I have not received any account opening request. So what happened to the account opening request? Or should I say, what happened to the account opening event?
The answer to this question is the premise of this session. So welcome to my session on observability for modern event-driven applications. I'm Urmila Raju, a senior solutions architect with Amazon Web Services. So let's get started and dive into this session.
I want to do some basic level setting on what event-driven architecture is. Please note the text in bold and underlined: it says that this is an architectural style for building loosely coupled systems. And these loosely coupled systems talk to each other by emitting and responding to events. So what is an event? An event is a change in state or an update emitted by a producer. A producer can be any component within your application, and consumers are interested in this event. So producers and consumers, which are two components of your application, talk to each other through an event broker. That's how the high-level architecture of an event-driven system comes about.
So why do customers move towards event-driven applications? Because it offers a lot of good things, and a few of the highlights are: speed and agility, because the systems are loosely coupled, so each team can build their own component independently and get it to deployment. The next is resiliency: the blast radius of failure is reduced because of the loose coupling between the systems, and each system fails independently, so there is no single point of failure. The next is scalability: you are able to minimize any waiting time because of the asynchronous and parallel processing that we bring in with event-driven architecture. And last, but the most important one, it enables you to work backwards from your business requirements and your business process workflow. So this is an architectural style which brings your business and technology stakeholders together in building technology applications. These good things will help you to meet your business objectives in a very effective way.
But it's important to understand that EDA is hard to get right. There are various factors for that, and the key ones are highlighted here, starting with eventual consistency. What we mean by that is, due to loosely coupled systems and the asynchronous nature, components are not consistent at the same time, so your business process must be able to cope with that kind of delay. The next is end-to-end performance. If there is a performance bottleneck in one of the components, it is going to impact the end-to-end performance of your application. And third is meeting business SLAs. When we talked about the pros, we said that it helps you to work backwards from your business process. That means you need to meet your business SLAs in this type of architecture. In this business process workflow, some steps might be real time, some might be near real time, and some might be batch. So you must design in such a way that the SLAs of each step are met properly to get the EDA right.
So how do we do this right? There are many architectural design decisions you need to make to get it right, and one of the key things is observability: observing your system so that, as in the story we started with, when an event is flowing between systems, you know where the event is and at what time. And if an event has failed, there must be a proactive mechanism built into your architecture to recognize that and take appropriate action. That's where observability plays a role.
So what is observability? It's a measure of how well we can understand a system from the work it does. It is about collecting the right amount of data, gaining insights from it, and taking proactive actions to make your application work better. I want to demonstrate it with an example use case which relates back to our story of opening a savings account for an existing customer through a mobile banking app. So I want to show you a high-level business process workflow. It can be any complex workflow, but I have put a very oversimplified version here, because we are going to use it just to see how observability fits into the business process workflow.
So let's say the customer logs into the mobile app, selects a savings product, and checks the product eligibility, and then the request to open the savings account is placed. Once it is placed, it goes to core banking to get the account opened. And after the account is opened there could be post-account-opening steps, like a monthly interest schedule getting updated, or you send a mobile push notification to convey to the customer that the account is opened and operational, and you may send a welcome email, or you may send a survey to know how the account opening journey has been. If you look at the orange boxes, these steps, or events, are all written in the past tense, because events are usually written in the past tense, and the gray boxes around them are the business domains from which these events originate. And events can flow between domains. Event-driven architecture is primarily based on domain-driven design and the event storming methodology. I'm not going to dive into those concepts, but a good way of designing an EDA is this: you start from your business process, identify the events in each of the business domains, and then design your technical architecture based on that.
So let's say this is our business process workflow. Accompanying it, there could be business SLAs, because we talked about SLAs previously, and for this example use case you can have SLAs like the ones I've highlighted here. For example, open the account 24/7, that is, the customer can log into the mobile banking app anytime and place the request. And the product eligibility check is done in real time. It's always important to define what real time means; here we are defining real time as a hundred milliseconds for the product eligibility check to be done and the account opening request to be placed. We are also saying that the account should be operational within 2 hours of the request being placed, and other SLAs in a similar way. The point to note here is that in this business process workflow there are some synchronous real-time steps and also some asynchronous near-real-time or batch steps.
And why it is hard to get right is that there are a lot of things that can go wrong here. Examples are: what happens if the product eligibility check takes more than 100 milliseconds, which is the SLA? What happens if the account opening service is down? And what happens if the mobile push notification fails? That's why we say that end-to-end observability is key for a successful EDA. You need to have visibility into each of the components in your end-to-end architecture, and you should be able to do real-time troubleshooting on the errors and issues that are occurring. And as you can see in this picture, observability has both operational benefits and also business benefits as you move towards the right. In terms of business benefit, it is going to help you with your overall customer experience and also with meeting your business objectives and business outcomes.
Usually, in traditional monitoring, you need to monitor across all of your layers, right from your storage or network layer up to your business layer. But when you do event-driven architecture, and especially when you do it on AWS, you do it with many of the serverless services, like Amazon SQS, Amazon SNS, Amazon EventBridge, Amazon API Gateway and AWS Lambda. These are the services that you usually use, and when they are serverless, those lower layers of monitoring are offloaded to AWS, so you can focus on your business, application and data observability alone.
Now, going back to our example: so far we have been talking only about the business process and events, so I have put some services behind it to show what an architecture for such a solution could look like. We are not going to rationalize our design choices of why I've used EventBridge here or why I've used SQS, because in an event-driven architecture there are no right or wrong answers; it's primarily based on what your requirements are, using the right architectural style and then choosing the appropriate services for it. In this example, the top bit is a synchronous process where you do the product eligibility check through a synchronous API call, which goes to the product DB and returns the response. Once that is done, an event to open the account is placed into EventBridge, and this event is put into an SQS queue, and you have a core banking platform, which here I assume is an on-premises data center system, listening to this queue. So whenever there is a request, it picks it up from there and does the necessary processing to open the account. And once the account is opened, an event is placed back into EventBridge, and from there that event is of interest to various systems, like the monthly interest schedule updating service, or customer comms, which sends the mobile push notification and email, or the marketing team, who are going to send the account opening survey. This is just to show you how it is made up of synchronous and asynchronous steps, and each step has got its own SLAs. So you need to monitor each of these components and achieve observability as per your business SLAs.
So how do we do that? Let's start by looking at an observability maturity model. A maturity model helps customers evaluate where they are, so that they know where they want to be and how to get there. As they expand their workloads, their observability is expected to mature. We will start from the foundational level, which is foundational monitoring, and it relates to collecting telemetry data. What do we mean by telemetry data? We have three types, and those are the three pillars of observability: metrics, logs and traces. Metrics are time series data, calculated or measured at various time intervals. They can be things like the API request rate or error rate, or the duration of a Lambda function, etcetera. Logs are timestamped records of discrete events, that is, events that happen within your system or components of your application, such as a failure event, an error event or a state transformation. Those are examples of logs. And then we have traces. A trace represents a single user journey across multiple components in your application. It is usually very useful in the case of microservices and API-based architectures to see how an API request is routed through various systems and the response comes back; you can trace that entire request. These three form the pillars of observability, and when you do this observability using AWS native tools, Amazon CloudWatch helps you with logs and metrics, and AWS X-Ray helps you with traces. So let's go and look at each one of the observability pillars, and as and when we see each pillar, I'm going to go back to the example application design that we had and relate what kind of observability we can do for that application, to provide you some context.
We'll start with viewing standard metrics. CloudWatch has built-in metrics: whenever you integrate AWS services with CloudWatch, there is automatically a set of metrics for each service which gets logged into CloudWatch. The ones I have highlighted here are the serverless services, because they have been used in the example architecture that we just spoke about, but we can do the same with the other AWS services as well. These built-in metrics should be able to meet around 70% of your observability needs on their own. So let's see some examples of what those key metrics can be. For Lambda, the invocation metric can be helpful to assess the amount of traffic and failures, if any, and performance metrics can be on memory utilization and duration of execution; these relate to the cost of the function too. And then we have concurrency. The concurrency metric helps to assess the number of parallel invocations, so it can help to track the performance of the application against existing concurrency limits and see if you need to increase the limits or secure provisioned concurrency as per the needs of your application.
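To make this concrete, here is a minimal sketch (not from the talk) of pulling one of these built-in Lambda metrics with boto3; the function name and region are hypothetical placeholders.

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # hypothetical region

# Maximum concurrent executions of a single function over the last hour,
# read from the built-in AWS/Lambda namespace (no instrumentation needed).
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="ConcurrentExecutions",
    Dimensions=[{"Name": "FunctionName", "Value": "check-product-eligibility"}],  # hypothetical name
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Maximum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```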
Similarly, for API Gateway I've highlighted a set of key metrics. API Gateway is the gateway for your microservices in a modern application, so keeping track of things like API call counts, latency and errors can be very helpful in measuring your business objectives. And when we talk about limits: although these are serverless services and scale inherently, you need to be cautious about limits to avoid throttling. But in some cases throttling can be good too. For example, you can throttle the number of requests to API Gateway to avoid security attacks, and also set client-level limits when there are multiple clients accessing your API Gateway. So it's important to analyze the limits your application is operating with, whether they are the right limits, or whether you need an increase to make your application perform better.
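As an illustration of that client-level limiting, here is a hedged sketch using boto3 to create an API Gateway usage plan with throttle and quota settings; the API ID, stage, key ID and numbers are hypothetical and would come from your own SLAs.

```python
import boto3

apigw = boto3.client("apigateway")

# A usage plan caps how fast (throttle) and how much (quota) a given
# client, identified by an API key, can call the API.
plan = apigw.create_usage_plan(
    name="mobile-banking-clients",                         # hypothetical plan name
    throttle={"rateLimit": 100.0, "burstLimit": 200},      # steady-state and burst request limits
    quota={"limit": 100000, "period": "DAY"},              # total requests allowed per day
    apiStages=[{"apiId": "a1b2c3d4e5", "stage": "prod"}],  # hypothetical API and stage
)

# Attach an existing API key to the plan so its calls are counted against it.
apigw.create_usage_plan_key(
    usagePlanId=plan["id"],
    keyId="k6l7m8n9o0",   # hypothetical API key id
    keyType="API_KEY",
)
```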
The next service that we want to look at is Amazon SQS. SQS is a pull-based event broker. What we mean by that is that the consumer has to come and pull the messages; until then, the messages or events remain in the queue. So a metric like the approximate age of the oldest message is useful: if you're monitoring that and the age is increasing beyond a particular threshold, it means the consumer is not keeping up with the volume of messages in the queue. That is something to keep track of so you can identify the issue.
The next is Amazon EventBridge. We had this service in our design as well, and I have highlighted some key metrics here, like the dead letter queue invocations. I mentioned that SQS is a pull-based broker, whereas Amazon EventBridge is a push-based broker. That means the responsibility for doing retries and error handling lies with the broker itself. So let's say EventBridge is trying to send an event to a target system and the target system is unavailable. It will retry as per the number of retries configured in EventBridge, and if even then it is not able to reach the target, it is going to write the information, or event, into a dead letter queue.
Again, it will write into the dead letter queue only if you have configured it. If something is arriving in that queue, it means it is a failed event. So what is the business impact of a failed event? You can configure your dead letter queue and the number of retries based on your business SLAs. Even in our example, you can have a scenario where the savings account opening request has come in, but for some reason it has not been picked up by the account opening service; the reason could be that EventBridge didn't manage to place the event into the queue at all. If that is the case, then you can write that event into a dead letter queue, keep track of this metric and take appropriate actions off the back of it.
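As a rough sketch of what that configuration might look like (the rule, bus, queue ARNs and retry numbers here are hypothetical, chosen to echo the 2-hour SLA), you can attach a retry policy and a dead letter queue to an EventBridge target with boto3:

```python
import boto3

events = boto3.client("events")

# Attach the SQS queue that core banking consumes as a target of the rule,
# with a retry policy and a dead letter queue for events that still fail.
events.put_targets(
    Rule="account-opening-requested",          # hypothetical rule name
    EventBusName="savings-account-bus",        # hypothetical event bus
    Targets=[{
        "Id": "core-banking-queue",
        "Arn": "arn:aws:sqs:eu-west-1:123456789012:account-opening-requests",  # hypothetical ARN
        "RetryPolicy": {
            "MaximumRetryAttempts": 10,
            "MaximumEventAgeInSeconds": 2 * 60 * 60,  # stop retrying once the 2-hour business SLA is gone
        },
        "DeadLetterConfig": {
            "Arn": "arn:aws:sqs:eu-west-1:123456789012:account-opening-dlq",   # hypothetical DLQ ARN
        },
    }],
)
```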
Those were examples of standard metrics. Now, how are these metrics organized within CloudWatch? Through namespaces and dimensions. Consider a namespace like a box or a container for your scope, and the scope can be your application. In this case I have put the savings account opening application as the namespace, and within that you can have dimensions. The dimensions can be the service name: for Lambda, it can be the check product eligibility service as the dimension within which you are tracking the metrics. Similarly, for SQS the queue name can be a dimension, and for EventBridge the event bus can be a dimension within which you track the metrics. These are for the standard metrics, but you can do custom dimensions and custom metrics as well, which is what we are going to see next.
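To make the namespace and dimension idea concrete, here is a minimal, hypothetical sketch that publishes one custom metric into a "SavingsAccountOpening" namespace with a service dimension (the asynchronous route the talk recommends for Lambda, the Embedded Metric Format, comes up a little later):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# One data point in a custom namespace; the namespace, metric and dimension
# names are hypothetical and mirror the savings account example.
cloudwatch.put_metric_data(
    Namespace="SavingsAccountOpening",
    MetricData=[{
        "MetricName": "EligibilityChecksCompleted",
        "Dimensions": [{"Name": "Service", "Value": "CheckProductEligibility"}],
        "Value": 1,
        "Unit": "Count",
    }],
)
```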
I mentioned that around 70% of your needs are going to be met with the standard metrics, but the built-in metrics alone may not be enough. There could be scenarios where you need to measure application performance against your business goals, like revenue, signups or page views. Those are things you need to track from your application's business logic. And what do we use to write the business logic in an event-driven architecture? Usually it's a Lambda function, or it can be a container service within which you're doing some business logic, and you want the tracking to be done from within that business logic. In that case you can instrument your code to create custom metrics and then send them to CloudWatch. This is something that most applications will need to perform better, and we will see how it is done in the coming slides.
Now, going back to our maturity model: we have talked about foundational monitoring, and next we are moving towards telemetry analysis and insights. We have got the data; how do we get insights from it? Primarily from the Lambda service, which is the key business logic service in an event-driven architecture. How do you collect insights from that? There is an out-of-the-box feature in CloudWatch called Lambda Insights. When you enable it, you will be able to monitor, troubleshoot and optimize the performance of your AWS Lambda functions. Some of the use cases where this might come in handy are identifying high-cost functions, identifying memory leaks, identifying any performance changes whenever a new version of a Lambda function is deployed, and also understanding the latency drivers in a function. Latency drivers are actually a very key concept, because within a Lambda execution time there are various splits, like the cold start time, a bootstrapping time, and then the actual execution time. Cold start is the time that AWS takes to provision a Lambda instance, bootstrapping time is the time to get your dependencies and libraries loaded, and then you have the actual execution time. So it's important to split the whole execution time to see how much is cold start, how much is bootstrapping and how much is execution, to see if there is any bottleneck. And there are various mechanisms by which each of these areas can be fine-tuned.
Lambda Insights appears as a dashboard within CloudWatch; these are automatic dashboards which CloudWatch creates. There are two main types. One is the multi-function dashboard, which provides an aggregated view across multiple Lambda functions: you can see the list of Lambda functions in your account, how much cold start there is, how much memory is utilized, and so on. It looks something like this. The next is the single-function dashboard, which helps you view a single Lambda function and identify root causes for any issues. This is a very useful feature, so I recommend looking at it. Even in our architecture, if we go back to our example and see where it can be useful: we had the product eligibility service as a Lambda function. The whole duration of its execution is going to directly impact our business SLA, which is 100 milliseconds for the product eligibility check. So if the duration of the Lambda itself is more than that, then that is something to be looked at.
Okay, so that was about metrics, plus a quick overview of Lambda Insights. Now let's see what else you can do with the next pillar of observability, which is structured and centralized logging. CloudWatch logs can be collected from various services; the key services that I have highlighted are API Gateway and Lambda. For API Gateway there are two levels of logging: error logging and info logging. Maybe in your lower environments you want to do both, but in your higher, stable environments you just want to track errors; it's up to the customer requirements. You can also create custom metrics based on your logs. That is, you can filter a set of logs based on criteria, for example a count of 400 or 403 errors, and then you create a custom metric capturing that count. That is a metric filter you can create and add to your custom dashboards.
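A hedged sketch of such a metric filter with boto3; the log group, filter pattern and metric names are hypothetical, and the pattern assumes the access log line contains the status code as plain text:

```python
import boto3

logs = boto3.client("logs")

# Turn every 403 in the API Gateway access logs into a data point of a
# custom metric, which can then be graphed or alarmed on.
logs.put_metric_filter(
    logGroupName="/aws/apigateway/savings-account-api/access",  # hypothetical log group
    filterName="count-403-errors",
    filterPattern='"403"',  # simple term match on the status code in the log line
    metricTransformations=[{
        "metricName": "Forbidden403Count",
        "metricNamespace": "SavingsAccountOpening",
        "metricValue": "1",      # each matching log event counts as 1
        "defaultValue": 0.0,
    }],
)
```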
And next is Lambda logging. This is where Lambda logging is going to help you write custom metrics; if you remember, we just talked about custom metrics, and this is the way you do it. There are two ways: either through the PutMetricData API or through the Embedded Metric Format. PutMetricData is a synchronous API call, which means you are making the call during the execution time of the Lambda, and that adds unnecessary overhead to your Lambda execution time. So the recommended approach is to do it asynchronously, and the great example is to do it via the Embedded Metric Format, which writes the metrics as log entries, so the metric extraction happens asynchronously, outside the execution of the Lambda function. What you do is create a custom message describing what information you want to write into your logs and put it into CloudWatch. In this way, you are bringing your custom metrics in through CloudWatch logs.
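To show roughly what that looks like in practice, here is a minimal, hand-rolled sketch of the Embedded Metric Format from a Lambda handler; the namespace, dimension and metric names are hypothetical, and in real code you would more likely use a library such as Lambda Powertools (covered later) to emit this for you.

```python
import json
import time

def lambda_handler(event, context):
    # ... business logic for the account opening request would go here ...

    # Print one EMF-formatted log line; CloudWatch extracts the metric from
    # the log stream asynchronously, so no API call happens during execution.
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "SavingsAccountOpening",              # hypothetical namespace
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": "AccountOpeningRequested", "Unit": "Count"}],
            }],
        },
        "Service": "PlaceAccountOpeningRequest",                   # dimension value
        "AccountOpeningRequested": 1,                              # metric value
        "customerId": event.get("customerId", "unknown"),          # extra context, not a metric
    }))
    return {"statusCode": 202}
```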
To give an example of where CloudWatch logging can be useful and how it relates to business SLAs: we said that for API Gateway we have error and info logs. If there is an error event, remember there was a business SLA about the ability to open an account 24/7; if errors are happening at the product eligibility check, they are impacting that business SLA, because the customer would not be able to place the request if there is an error during the product eligibility check. That is something to be avoided or remediated immediately.
Now we have got all of the logs, let's say, and the next step is to derive insights, similar to Lambda Insights, which was an inherent feature. If you want to do similar querying of the logs and get insights from the other metrics and log data that you have, you have CloudWatch Logs Insights, and there is a specific query syntax to be used for this. Here, if you want the top hundred most expensive executions, you select the fields, sort by the billed duration in descending order and limit to 100, so that you get the top hundred records; then you get information something like this. Another example is to get the last 100 error messages: again you select the fields, put a filter condition on the log level being ERROR, sort by timestamp in descending order and limit to 100 rows.
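For reference, the two queries described above might look roughly like this in Logs Insights syntax, run here through boto3; the log group name and exact field names are assumptions based on a typical Lambda log group.

```python
import time
import boto3

logs = boto3.client("logs")

# Query 1: top 100 most expensive Lambda executions by billed duration.
top_expensive = "fields @timestamp, @requestId, @billedDuration | sort @billedDuration desc | limit 100"

# Query 2: the last 100 error messages (assumes structured logs with a "level" field).
last_errors = 'fields @timestamp, @message | filter level = "ERROR" | sort @timestamp desc | limit 100'

def run_query(query_string, log_group="/aws/lambda/check-product-eligibility"):  # hypothetical log group
    start = logs.start_query(
        logGroupName=log_group,
        startTime=int(time.time()) - 3600,   # last hour
        endTime=int(time.time()),
        queryString=query_string,
    )
    # Poll until the query finishes, then return its rows.
    while True:
        result = logs.get_query_results(queryId=start["queryId"])
        if result["status"] in ("Complete", "Failed", "Cancelled"):
            return result["results"]
        time.sleep(1)

for row in run_query(top_expensive):
    print(row)
```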
But as you can see, writing these queries and learning the syntax is a bit of a learning curve. So to help customers get started with this query format, there is a new feature that has been announced called AI-powered natural language query generation. This is still in preview; as you will note, it's not generally available yet, but it is a great capability and is going to be very helpful: you can type in your query in natural language, similar to what we had in the previous slide. That is, if you type in "get the last hundred error messages in the API Gateway CloudWatch log group", it is going to create the query for you, and then maybe you can fine-tune it further and get your results. So this is one way of getting AI-powered insights from CloudWatch. That was the intermediate level of monitoring: analysis and insights.
Now it's time to move to the last two stages, which are advanced and proactive observability. How can you do that? How can you proactively find and handle errors, do some anomaly detection and take appropriate actions? The first area is creating alerts. One way to do alerts is through CloudWatch alarms. The way you create an alarm is that you choose a specific metric and set a threshold on that metric, and whenever that threshold is breached, the alarm is raised. By alarm, what we mean is that a notification is sent to a target. The notification can go through SNS, through an email to the appropriate operations team, or it can be an integration with an incident management system.
In our example, an alert could be raised when the age of the oldest message in the queue grows beyond the business SLA; we already looked at that metric. If that is the case, it means no one is picking up the account opening request, and that is something to be concerned about: before the business SLA of 2 hours is reached, you need to get the account opened. So if no one is picking up the request, you need to monitor it, alert someone and get it corrected.
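A hedged sketch of such an alarm with boto3, alarming well before the 2-hour SLA; the queue name, threshold and SNS topic are hypothetical:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Raise an alarm if the oldest account opening request has been sitting in the
# queue for more than 30 minutes, leaving time to react before the 2-hour SLA.
cloudwatch.put_metric_alarm(
    AlarmName="account-opening-queue-backlog",
    Namespace="AWS/SQS",
    MetricName="ApproximateAgeOfOldestMessage",
    Dimensions=[{"Name": "QueueName", "Value": "account-opening-requests"}],  # hypothetical queue
    Statistic="Maximum",
    Period=300,                      # evaluate in 5-minute windows
    EvaluationPeriods=2,             # breach must persist for 2 consecutive periods
    Threshold=1800,                  # 30 minutes, in seconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],           # hypothetical SNS topic
    TreatMissingData="notBreaching",
)
```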
The next way of doing alarms is through CloudWatch anomaly detection. If you enable this feature, CloudWatch is going to keep track of your metrics and their patterns. As you can see in this graph, it learns at what times or durations of your day or week there are peaks and when there are fewer requests, and if there is a change from the regular pattern, it is going to raise an alarm. Then maybe you need a human-in-the-loop step here to see whether it is really an alarm situation or just an increase in traffic which has caused the anomaly. In our example, what could be an anomaly detection scenario? Say on API requests: let's say from eight to five you see a peak in account openings and after that it is less, and then on some day, in the middle of the night, there are hundreds and hundreds of account opening requests. That is an anomaly and something to be looked at. It could be a security vulnerability pattern, where someone is trying to attack the site with multiple requests, and you might have to look at putting a WAF in place or preventing distributed denial of service for your application. In those cases, anomaly detection will be very useful.
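For illustration, here is a hedged sketch of an anomaly detection alarm on API Gateway request count using CloudWatch metric math; the API name, band width and SNS topic are hypothetical:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the request count goes above the upper bound of the band that
# CloudWatch's anomaly detection model predicts for this metric.
cloudwatch.put_metric_alarm(
    AlarmName="api-request-count-anomaly",
    ComparisonOperator="GreaterThanUpperThreshold",
    EvaluationPeriods=3,
    ThresholdMetricId="band",
    Metrics=[
        {
            "Id": "requests",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApiGateway",
                    "MetricName": "Count",
                    "Dimensions": [{"Name": "ApiName", "Value": "savings-account-api"}],  # hypothetical API
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": True,
        },
        {
            "Id": "band",
            "Expression": "ANOMALY_DETECTION_BAND(requests, 2)",  # band of 2 standard deviations
            "Label": "expected request count",
            "ReturnData": True,
        },
    ],
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],  # hypothetical SNS topic
)
```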
Another related feature is log patterns. When would you need to look at patterns within logs? There are some big challenges with log analysis, and a few of them are highlighted here: there is too much data, because you are continuously getting logs from various components of a system, and if there is any change in the system, the amount of logs or the type of logs that you are creating will also change. So how do you proactively detect any unusual changes in your logs out of the huge amount of logs that you have? It helps to have a mechanism that does pattern matching, that says these are the various types of logs I have and this is the usual pattern of your application, and if any new pattern is recognized, it lets you know proactively so that you can go and check whether it is something of concern or something due to a change in your application. An example could be an API request log line in API Gateway: the pattern here is an info message followed by a timestamp, saying API request received, followed by a customer id. So this is captured as a pattern. Pattern analysis in Logs Insights is a feature that has been announced recently to surface these patterns. Whenever there is a new application and you have enabled logging in the components, this feature can be very useful in identifying the various patterns of logs in your application, and if there is any out-of-the-ordinary pattern, it is going to automatically highlight it to you, so that you can be aware of it and see whether it is an anomaly or a real new pattern that has emerged in your application.
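As a rough example, pattern analysis can also be invoked from the Logs Insights query language with the pattern command; a hypothetical query against the API Gateway log group might look like this, reusing the run_query helper sketched earlier:

```python
# Group recent API Gateway log lines into recurring patterns, e.g.
# "<*> INFO API request received for customer <*>", with counts per pattern.
pattern_query = "fields @message | pattern @message"

for row in run_query(pattern_query, log_group="/aws/apigateway/savings-account-api/access"):
    print(row)
```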
We have covered logs and metrics, and now we are moving on to the tracing part. Tracing, as I mentioned, is done through AWS X-Ray. This service can give you an end-to-end view of requests flowing through an application. You can do it with Lambda and you can also do it with API Gateway, and there are a few other services with which X-Ray integrates to bring them into your traces, for example EventBridge. There are limitations, but you can still do tracing with EventBridge if you instrument your code on the producer side. If you instrument the producer and then send the event with the X-Ray header, similar to what you see for API Gateway, the tracing header is sent from the producer into EventBridge, EventBridge can pass that header on to the target, and the target can continue the tracing. In this way you can bring these applications into your trace. And what X-Ray creates, when all of these traces from these applications are reported back to it, is a service map. The service map is nothing but the flow of the event, or the request, through the various applications. If we take our example, you can do a trace on the real-time flow where the customer request is sent to API Gateway, then to Lambda and then to a DynamoDB table. If you remember, we had a business SLA of doing this request in a hundred milliseconds, so you can use X-Ray to see what the end-to-end processing time is, and also the split of time in each service: how much latency is in API Gateway, how much is in Lambda and how much is in DynamoDB. That will give you an idea of where fine-tuning, if needed at all, has to be done. For X-Ray, you can enable it on the Lambda console or the Amazon API Gateway console, or you can do it via infrastructure as code, such as AWS SAM.
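A hedged sketch of that producer-side propagation idea: inside a traced Lambda function, the current trace header is available in the _X_AMZN_TRACE_ID environment variable, and it can be passed along when the event is put onto EventBridge (the bus, source and detail-type names are hypothetical):

```python
import json
import os
import boto3

events = boto3.client("events")

def lambda_handler(event, context):
    # When active tracing is enabled on the function, Lambda exposes the
    # current X-Ray trace header through this environment variable.
    trace_header = os.environ.get("_X_AMZN_TRACE_ID", "")

    entry = {
        "EventBusName": "savings-account-bus",                    # hypothetical bus
        "Source": "mobile-banking.account-opening",               # hypothetical source
        "DetailType": "AccountOpeningRequested",
        "Detail": json.dumps({"customerId": event.get("customerId")}),
    }
    if trace_header:
        entry["TraceHeader"] = trace_header  # lets downstream consumers continue the same trace

    events.put_events(Entries=[entry])
    return {"statusCode": 202}
```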
So that's a very quick overview of tracing with AWS X-Ray. And the last bit I wanted to highlight, which can be very handy for Lambda, is Lambda Powertools. We're not going to dive deep into it; I have added a resource link at the end of the slides to learn more about it. But at a high level, it is a developer toolkit, a very opinionated library created by AWS, which helps you implement observability best practices. That is, you can do logs, metrics and traces with very minimal code. That's the main idea of it. It's very useful not just for observability, but for many other serverless best practices. If you map it against the Serverless Lens of the Well-Architected Framework, these are the various areas in which Powertools is going to assist you. So it's worth mentioning, and that's why I've highlighted it.
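As a small hedged example of the "minimal code" point, here is roughly how a handler instrumented with Powertools for AWS Lambda (Python) could emit structured logs, an EMF custom metric and an X-Ray trace annotation; the service and metric names are hypothetical:

```python
from aws_lambda_powertools import Logger, Metrics, Tracer
from aws_lambda_powertools.metrics import MetricUnit

logger = Logger(service="account-opening")
tracer = Tracer(service="account-opening")
metrics = Metrics(namespace="SavingsAccountOpening", service="account-opening")

@logger.inject_lambda_context          # structured JSON logs with Lambda context fields
@tracer.capture_lambda_handler         # X-Ray segment for this invocation
@metrics.log_metrics                   # flushes metrics as EMF at the end of the handler
def lambda_handler(event, context):
    customer_id = event.get("customerId", "unknown")
    tracer.put_annotation(key="customerId", value=customer_id)  # searchable in X-Ray
    logger.info("Account opening request received", customer_id=customer_id)
    metrics.add_metric(name="AccountOpeningRequested", unit=MetricUnit.Count, value=1)
    return {"statusCode": 202}
```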
Now that we've seen all of the bits and pieces of the various things you can do, imagine there is an observability team, or all the application teams involved in building this end-to-end application. If something goes wrong and you want to troubleshoot, it's good to have everything in a single place, so you need not go to multiple places to find out where the issue was. You need a single pane of glass in which you can see your logs, metrics, alarms, dashboards and traces. That's why we have this feature in CloudWatch called CloudWatch ServiceLens. It is a single pane of glass which is going to help you drill into all of the observability telemetry data that we have discussed so far. So please take a look at it; it is going to be very useful.
And lastly, I want to finish off with some best practices for observing event-driven applications. These best practices are not just for EDA, but for any application where you want to get started with observability. It is an eight-step process, and we are going to just whiz through it; you can look into the resources for more details. The first thing is to observe what matters, because, as we discussed, there could be a huge amount of logs, metrics and traces generated from each of your components. So focus on what matters to your business and what matters to your customers. That's why we keep going back to the business SLAs and work backwards from them to see what data needs to be observed. And you need to measure your objectives against those SLAs so that you know what good looks like, because we cannot just keep looking at the happy path; we need to be able to say, this is the metric, I have achieved it, and that's why my application is performing at its best. And identify the sources from which this telemetry data has to be taken.
Then plan ahead: this is not reactive monitoring, it's proactive observability, so that is important to keep in mind. The next is an alerting strategy. We discussed the various types of alerts that can be created, but define the criteria, because some alerts can be just warnings and some can be critical, where immediate action is needed. So define the criteria and also the appropriate actions for each of the alerts. The next is dashboards. Now we have all of the data, and you can create nice graphs and charts within CloudWatch, but have a strategy for what data goes into each dashboard and who is going to look at it. You can create very high-level dashboards, like customer experience, or how your application was performing last week versus this week, those kinds of things for CxO and head-of levels, and very low-level dashboards going into the nitty-gritty details of your application, maybe for your platform engineering team.
The next is tool selection: choose the right tool for the job. In this session we talked about AWS native tools for observability, but many of our customers who build EDA on AWS still use third-party tools or open source tools like OpenTelemetry, which is the industry standard and is supported by various vendor applications like Grafana and Prometheus. Those can also be good choices for your application. It all depends on what your need is, and then you pick the right features for it. The next is bringing it all together: observability needs to be an internal process that everyone agrees on, and it has to be part of operational readiness, as in, these are the necessary observability things to be in place for the application to go live. That kind of mindset and cultural change has to be there in your organization to mature your observability framework.
And finally, iterate, because this is not a one-off process. As your application grows, your maturity is going to grow, and when new features are introduced into your application, your observability also changes. So this has to be an iterative process, reviewed routinely. That is my eight-step process for observability best practices. Use those best practices, overcome the challenges that you have in EDA, and get your EDA right, because EDA is great and it is going to help you deliver business outcomes very effectively.
I have added some further reading to learn more about serverless observability and observability for modern applications, and also a link to the Lambda Powertools that we talked about. Thank you so much for your time. I'm Urmila Raju. Please feel free to connect with me on LinkedIn, and I'm happy to take your questions as well. Thank you.