Transcript
This transcript was autogenerated. To make changes, submit a PR.
A recent Catchpoint SRE survey identified
that more than 64%
of organizations believe they should
start monitoring their endpoints effectively,
even if those endpoints lie outside their physical control.
And 44% of the organizations believe learning
from failures is important and that they should invest in
identifying more ways of
preventing these kinds of disruptions.
And interestingly,
24% of organizations had a recent breach,
which means they had a contractual breach during the last twelve months,
and the majority, around 66%
of these organizations, are already using
two to five monitoring or observability tools.
What this literally means is that monitoring
or observing the key endpoints
is close to every organization's heart,
mind and operations. They
understand that
failures are normal, they are trying to learn from them as well, and
they are trying to figure out the ways they
can prevent these incidents.
But unfortunately, close to one fourth of
them still have contractual breaches
happening at regular intervals.
And this is on top of them
bringing in multiple observability tools to help
their cause. So something is not
working, and I'm pretty sure you got the numbers
here.
Hi everyone, my name is Indika Wimalasuriya. Welcome to Observability
2024. As part of Observability
2024, I'm going to walk you through why I truly
believe AWS is your ally, or best friend,
when it comes to implementing observability.
Observability, as you know, is about understanding
internal system behaviors so that we can proactively
eliminate some of these issues or at least
identify some of these disruptions in advance, and so
improve our mean time to detection and mean time
to resolution. While implementing observability,
there are widely documented anti-patterns
you will come across, and there are some lesser-known, hidden
anti-patterns as well. As part of my presentation
I will discuss these anti-patterns and how leveraging
AWS will help you in your journey.
As part of my presentation I will discuss the importance
of observability: why observability is important for organizations.
There are no two ways about that. Then
we will understand some of the observability anti-patterns,
we will look at the capabilities and services
AWS offers, and we will deep dive
into some of these services
and how we can leverage them in our fight
to ensure that we
don't succumb to anti-patterns. We
will also discuss some of the implementation guidelines
and best practices.
Moving on, a quick intro about myself. My name is Indika
Wimalasuriya. I'm based out of Colombo, Sri Lanka.
I live with my daughter and wife. I
have close to 18 years of experience
working in the industry. My expertise is
predominantly in site reliability engineering,
observability, AIOps, DevOps and generative
AI. I'm currently working
at Virtusa, where I oversee technical
delivery, solutioning, architecture and
capability development. One significant
part of my current job, and something I enjoy outside it as well,
is being a trainer. I'm very passionate about sharing
knowledge, empowering others and building the community.
I'm a very passionate technical blogger
as well. You can find me at dev.to.
I'm also an AWS Community Builder under cloud operations and
an ambassador at DevOps Institute.
So as I said, AWS and observability
are very close to my heart. I have been involved in implementing
a lot of observability solutions for Fortune
500 companies, and over
that period of time I have come to understand these anti-patterns,
some of the best practices, and how we can leverage AWS
to overcome them and expedite our journey.
So let's discuss the complexities
we have in the current world. Current systems,
as you already know, have moved away from the monolith
and now we have microservices. Obviously
microservices bring in a lot of complexity, which resulted
in us moving away from monitoring, which is about
checking for things that are already predefined.
In this day and age, our systems have a lot of
unknowns, and we have to figure
out these unknowns to have better control. Observability
provides us a better solution.
Observability tries to understand the internal state, and
that results in an enormous amount of data
as well. And as you might already know,
almost all our systems have moved out of on-premises; we are
now in the cloud. So microservices,
observability and cloud have resulted in tons
of data and have introduced
multiple failure points. There is so much complexity
in our systems that, as a result, they are a little difficult
to manage. The unknowns stay hidden,
then appear at the most
unexpected times and cause disruptions,
which lead to bad user experiences, impact on
our revenue and a lot of other things.
So observability is very important in
this complex distributed system architecture.
Why? Because we want to be on top of performance.
Distributed systems are very good,
but monitoring them, observing them,
identifying their bottlenecks and improving their
performance is key. Just because you have a distributed system
does not necessarily mean you will get optimal
performance. And as we
discussed earlier, distributed systems being complex,
there are a lot of hidden unknowns.
These unknowns can appear at any given time and
bite you. Detecting these issues quickly,
identifying them and resolving them is far more important.
And more than that, what organizations try to do is
eliminate them, if that's possible at all, because there's
nothing better than fixing something
before it impacts you.
What all of this means is that we want more reliable
systems. Reliable systems mean reliable
services and, at the end of the day, happy customers.
We also have to understand that,
just like anything else, these systems will
crash, they will have bugs, and they will go
through cycles of bad times. When
this happens, we want a comprehensive
framework or mechanism which allows us
to debug and troubleshoot these things
and fix them promptly.
So overall, observability is important
for distributed systems to make sure
that our systems are doing what they are supposed
to be doing, that we have the ability to
understand and manage these systems,
and that we can eliminate some
of the disruptions by identifying them early.
Looking at all the observability pillars,
like metrics, logs, traces and events,
gives you the proactiveness you always want,
so that you can be on top of your game
in managing distributed systems.
That is why observability is important from
a distributed-systems angle. Moving to the cloud,
even though it has fixed a lot of our problems,
people have started to understand that
cloud is not the single solution
for all your problems, especially
availability and reliability. Cloud providers
give you a platform where you can deploy your
systems, your services and data, and you
are accountable on your side for
making sure that you manage it properly.
So when it comes to cloud,
unless you are using managed services or some of
the cloud scalability solutions,
scalability is still your problem, and
security is obviously one of your problems as well.
To manage
risk, we also look at polyclouds,
that is, deploying things in multiple clouds. Polycloud and
multi-cloud bring in further complexity
and a totally different dimension to this problem.
We also want to ensure that we reduce cost.
So when you want to achieve all of these things, observability is
paramount. So again, for the cloud, observability is
important: it gives us the ability
to develop and maintain optimal
systems in the cloud.
And finally, microservices.
For almost all complex systems, we are moving
from monolith to microservices. Microservices
have so much good in them and
so many capabilities which we are harnessing to provide
better customer experiences.
But what you have to understand is that microservices themselves
bring in a lot of complexity, like dependencies and
troubleshooting of performance bottlenecks.
That complexity increases the time
taken to isolate root causes, and leads to
scalability problems and debugging-related
problems. All of these things mean that
observability is key here. So as I've been
saying for the last few minutes, whether in distributed
systems, cloud or microservices,
observability is key. Observability is a
friend in enabling reliable
systems.
So if you look at observability, Gartner published
the observability hype cycle. It's about
what triggered observability, what were
the things we were interested in at the peak, how
the hype cycle went, and what
the takeaways are. As you can see, APM,
which is application performance management, is something the industry
has been leveraging widely.
It's about traces. We have the logs, which are
predominantly the traditional part of observability.
And then we have the metrics, which we very happily
use to configure our alerts, because metrics
are numbers, and numbers are good at
measuring things. And the traces help us
with troubleshooting: they allow us to go through,
dig in and understand exactly what's happening.
Over the years we found that monitoring sometimes focuses on
infrastructure. But infrastructure is one part of the
big picture. It's actually the code which is doing the bulk of the work.
We have made a lot of good improvements to infrastructure,
so now the focus is back on the code,
and APM plays a major role here.
So when it comes to observability, observability is all about the
logs, which are about creating an
audit trail; the metrics, which give you
the ability to configure alerts and alarms and measure things;
and the traces, which are about digging into your code and
understanding the bottlenecks where the code is not performing.
All of these things
allow you to create dashboards. We have the synthetic
monitors. Over the years we have developed some other capabilities
like real user monitoring, which is about monitoring
and observing what our end users are doing at the front end.
And of course this is all built on top of our infrastructure
monitoring, network monitoring, security and
cost optimization. So in a nutshell,
observability is looking at your entire system holistically
and trying to understand things before they go
wrong.
So now that you have an understanding of observability, why observability
is important and what you are trying to achieve,
let's look at AWS. AWS
over the years has been bringing in a lot
of capabilities in the observability area.
It started with CloudWatch.
CloudWatch is integrated with almost all AWS services,
so that you have the ability to ship all your logs
there and integrate all the metrics. From there
you can create dashboards and alerts. Then
AWS introduced things like AWS X-Ray,
which enables
traces. You also have the option of going with OpenTelemetry.
AWS introduced things like real user monitoring
to monitor the front end. And
recently they introduced things like Application Signals.
So with all of these things, AWS has a collection of very
powerful services.
Either you can go with AWS native services, or if you
are more of an
open source kind of person, you can use AWS managed Grafana
and, on top of that, things like OpenSearch or managed
Prometheus, and Jaeger, Zipkin and others to enable your
traces as well.
So AWS is able to
provide these capabilities for both
kinds of worlds, whether you are an AWS native
person or an open source person. All
of these capabilities will enable you to
develop a great observability
platform for your systems. The idea is how
you can use them and
get that benefit. When you are doing
that, understanding some of these anti-patterns
is very important, because by
knowing them you know how best to use these services,
which in some instances will automatically
help you and ensure that you don't fall into these anti-patterns.
So moving on, let's discuss some of these anti-patterns.
I'll go through them by the pillars of observability,
starting with logs. One of the challenges we have
is that sometimes you have too many logs.
It's a very difficult problem,
though a good problem to have. When you have excessive logging,
and when the logs are sometimes
unstructured,
what it does is generate a lot of noise, and it's
harder to extract the details.
This is a place where AWS has done a lot.
CloudWatch integrates with logging, so
you can ship the logs to CloudWatch, but not only that:
AWS has come up with a lot of new capabilities like log
anomaly detection and natural
language search. These capabilities help you even
in situations where you have excessive logs with no
structure that are a bit hard to troubleshoot.
You are able to use these CloudWatch capabilities
to overcome some of these anti-patterns.
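As an illustration of the unstructured-logging anti-pattern above, here is a minimal sketch of structured (JSON) logging using only Python's standard library. It is not from the talk: the field names, the `context` attribute and the `checkout` logger name are assumptions chosen for the example. The idea is simply that one JSON object per line is easier for a log backend to parse and query than free-form text.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"context": {...}}
        # to attach structured fields such as an order or request ID.
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            payload.update(context)
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "checkout") -> logging.Logger:
    """Create a logger that writes JSON lines to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info(
        "payment authorized",
        extra={"context": {"order_id": "o-123", "latency_ms": 87}},
    )
```

Keeping the set of fields small and consistent across services is what makes the logs searchable later; the formatter itself is the easy part.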
And then when you look at metrics,
while metrics are good, there are
a lot of anti-patterns as well. Sometimes we come
up with a lot of unclear, misaligned metrics,
which finally end up in our service level objectives.
Sometimes we come up with bad sampling when
we collect the metrics. And sometimes
there are so many metrics that it's very difficult
to understand how to
pick the right metric for your needs.
What this does is give you a
false sense of comfort that you have a lot,
when it might not actually correlate with the end user experience.
Sometimes bad sampling means
you might not have the data when you need it, and
unnecessarily having numerous metrics might
lead to unnecessary complexity.
With AWS, by using CloudWatch,
especially CloudWatch metrics, you
can focus
very easily on the metrics that are available,
and go very quickly to understanding what
makes sense, instead of spending a lot of time
trying to enable metrics and
only later trying to figure out what is required
and what might not add value.
Usually when you plug CloudWatch into your services,
the metrics will start appearing, and then very quickly you
can go through them and understand what each metric
is doing and which have more relevance to you. You can probably
do some customizations when it comes to the data. This
helps you ensure
that you pick the right things,
the things which add value for you.
You can also use
some of the
AWS default metrics, which will help
you in situations where
cardinality is a problem.
So in a nutshell, CloudWatch metrics is a
beautiful thing which will make
your metrics journey in observability smooth.
It might already
help you ensure
that you don't fall into some of these anti-patterns.
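For the custom metrics and cardinality points above, one option is the CloudWatch Embedded Metric Format (EMF), where a JSON log line carries metric definitions under an `_aws` key. The talk does not mention EMF, so treat this as a hedged sketch: the namespace `DemoApp`, the `Service` dimension and the `Latency` metric are hypothetical names, and the function only builds the record, it does not send anything to AWS.

```python
import json
import time


def emf_record(namespace, dimensions, metrics, timestamp_ms=None):
    """Build one log record in CloudWatch Embedded Metric Format.

    dimensions: dict of dimension name -> value. Keep these low-cardinality
                (service, region), not per-request values such as user IDs,
                because each distinct combination becomes its own metric.
    metrics:    dict of metric name -> (value, unit).
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set containing all dimension names.
                    "Dimensions": [sorted(dimensions)],
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (_, unit) in metrics.items()
                    ],
                }
            ],
        }
    }
    # Dimension and metric values live at the top level of the record.
    record.update(dimensions)
    record.update({name: value for name, (value, _) in metrics.items()})
    return record


if __name__ == "__main__":
    line = emf_record(
        "DemoApp",
        {"Service": "checkout"},
        {"Latency": (120, "Milliseconds")},
    )
    print(json.dumps(line))
```

High-cardinality details such as an order ID can still be added as extra top-level fields for searching, as long as they are not listed under `Dimensions`.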
So when it comes to the traces, traces also have
quite a few anti-patterns.
Sometimes we don't give traces the priority,
sometimes there's a lack of trace ID consistency, and sometimes
the instrumentation is not enough.
By leveraging AWS X-Ray or using OpenTelemetry smartly,
you are able to put your traces up
front. And the beauty is that not only for services
like
microservices, but even for Lambdas and
other things, you have the option of enabling
traces using AWS
X-Ray really quickly. That will ensure
you have more
traces, you have the consistency,
and you get the power of distributed tracing.
AWS does this seamlessly, without
you having to do a lot of
hard work.
And when it comes to traces, there are a few more things,
like continuing the context. Context is very important when you are
trying to connect from the front end to the back end,
looking at the traces and the visualizations, and
connecting with real user monitoring. AWS
X-Ray's capabilities
allow you to significantly reduce
the manual effort of spending
time to fix some of these problems. X-Ray
does it in some instances
magically: if you are using AWS Lambda,
enabling X-Ray is probably one click.
So X-Ray as a service is powerful,
and it allows you to mitigate
some of the challenges you will come across when
you want to enable traces.
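To make "trace ID consistency" and "continuing the context" concrete, here is a small sketch of propagating an X-Ray style trace header between services by hand. In practice the X-Ray SDK or OpenTelemetry does this for you; the helper names below are hypothetical and exist only for illustration. The header name `X-Amzn-Trace-Id` and the `Root=...;Parent=...;Sampled=...` layout follow the X-Ray tracing header format.

```python
import os
import time


def new_trace_id(now=None):
    """Return an ID shaped like 1-<8 hex epoch seconds>-<24 hex random>."""
    epoch = int(now if now is not None else time.time())
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def new_segment_id():
    """Return a 16-hex-digit ID for this service's own segment."""
    return os.urandom(8).hex()


def build_trace_header(trace_id, parent_id, sampled=True):
    """Build the value of the X-Amzn-Trace-Id header."""
    return f"Root={trace_id};Parent={parent_id};Sampled={1 if sampled else 0}"


def parse_trace_header(value):
    """Parse 'Root=...;Parent=...;Sampled=1' into a dict of strings."""
    fields = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:
            fields[key] = val
    return fields


def outgoing_headers(incoming_headers):
    """Continue the caller's trace if one exists, otherwise start a new one."""
    incoming = parse_trace_header(incoming_headers.get("X-Amzn-Trace-Id", ""))
    trace_id = incoming.get("Root") or new_trace_id()
    sampled = incoming.get("Sampled", "1") == "1"
    # This service's segment becomes the parent of the downstream call,
    # while the Root trace ID stays the same across every hop.
    header = build_trace_header(trace_id, new_segment_id(), sampled)
    return {"X-Amzn-Trace-Id": header}
```

The point of the sketch is the invariant: every hop reuses the same `Root` value, so the front end, the services and the functions behind them all show up in one trace.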
And finally, when you look at the
end-to-end big picture, there are some
observability problems. We have alert overloading.
This is called alert fatigue, which is when
you have a flood of alerts and you are not able to understand why
or isolate what has caused it.
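One common way to reduce the alert flood described above is to suppress repeats of the same alert within a time window. The sketch below is a generic illustration, not an AWS API and not something shown in the talk; the class name, the five-minute window and the `service:alarm` key format are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class AlertSuppressor:
    """Forward the first alert per key, then mute repeats for a window."""

    window_seconds: float = 300.0
    _last_sent: dict = field(default_factory=dict)
    _suppressed: dict = field(default_factory=dict)

    def should_notify(self, key: str, now: float) -> bool:
        """Return True if an alert with this key should be sent at time now."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            # Same alert fired again inside the window: count it, stay quiet.
            self._suppressed[key] = self._suppressed.get(key, 0) + 1
            return False
        self._last_sent[key] = now
        return True

    def suppressed_count(self, key: str) -> int:
        """How many repeats of this alert were muted so far."""
        return self._suppressed.get(key, 0)


if __name__ == "__main__":
    suppressor = AlertSuppressor(window_seconds=300)
    for t in (0, 60, 120, 301):
        if suppressor.should_notify("checkout:HighLatency", now=t):
            print(f"t={t}s: notify on-call")
```

Choosing the key is the real design decision: keying on service plus symptom groups related alarms together, while keying on every host or instance brings the flood back.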
AWS provides a lot of services like CloudWatch and
SNS, which you can
configure smartly
to get through some of these
alert overloading situations. In a
lot of places you will have seen that
observability is distributed. But with AWS
X-Ray, CloudWatch and other things,
you are able to bring a unified view into your
observability framework, where you can see
things from 30,000 feet
above when required, and then drill down whenever
you need. One other thing is that
people sometimes ignore the non-functional requirements when it comes to observability,
which in my mind are key.
AWS services allow you to
get some of these NFR metrics and other
data. For example, you might be
using AWS Lambda or RDS managed services.
All those things will help you in overcoming
some of these challenges as well. They might not necessarily
be related to observability, but going
with these services will definitely help you achieve
your observability objectives.
And one other anti-pattern is that most
of our applications are not isolated.
We have upstreams, we have downstreams, we have a lot of interdependencies.
That creates a lot of blind spots in our infrastructure,
and managing these things is also a little tedious.
But you can use AWS
Systems Manager when you are doing a lot of updates, patching and other
things. You can use AWS CloudFormation to
bring in consistency with infrastructure-as-code solutions; that
will also enable you to build some of this
observability into your system in an automated fashion.
And obviously there are some environmental inconsistencies that
you can address by using AWS services like Elastic
Beanstalk or CodePipeline.
So we have
gone through the observability pillars, the metrics, logs
and traces, and also the big picture of observability,
what the anti-patterns are, and how
AWS services like CloudWatch, CloudWatch metrics,
X-Ray and the dashboards,
among others, ensure
that by the nature of using these services you are
able to mitigate some of these anti-patterns.
And one big anti-pattern I have
seen in observability implementations
is not having a plan.
We probably jump in and try to do
things. But sometimes what's important is
having a plan. And having a plan means
having an observability maturity model, where you have an
understanding of what your end game is. What you want is
a plan where you can take your observability
from reactive to proactive, then from proactive to
predictive, and then from predictive to building your system to
have autonomous capabilities. I am
not suggesting that you take this and stick
to it as is, but I am suggesting that you create a
maturity model which
suits you, because observability is not a destination,
it's a journey. Then you are able to take your observability
from reactive to proactive, and from predictive to autonomous.
So when you are working with your logs, metrics and traces, you are
able to see that it's not just about enabling logs,
it's about getting value out of logs. It's about bringing
AI into logs so that you can cut down some of the manual
touch points. It's about making your system autonomous.
And when it comes to infrastructure, networking and security,
again, the same thing applies. It's about
pushing things from reactive to proactive, to predictive
and then to autonomous.
And one of the other important things in observability
is understanding how to
measure your progress. What you can
do is
measure your progress by looking at things
like mean time to detection and mean time to resolution. That
means: are you detecting things quickly, are you able to provide
solutions quickly, what is the
interval between your failures, are
you improving your system reliability, how are your customers
feeling about your systems, and are you able to
increase your developer velocity while achieving your service
objectives? This will help you understand
whether you are aligned with
business goals, because at the end of the day this is all about business.
You have to add value to your business.
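The progress measures mentioned above (mean time to detection, mean time to resolution and the interval between failures) can be computed from incident timestamps. This is a minimal sketch under stated assumptions: the `Incident` record is hypothetical, times are epoch seconds, resolution time is measured from the start of the failure, and the interval between failures is measured from one incident's resolution to the next incident's start. Teams define these measures differently, so adjust to your own definitions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    started: float   # when the failure began
    detected: float  # when monitoring or a person noticed it
    resolved: float  # when service was restored


def mttd(incidents):
    """Mean time to detection: average of (detected - started)."""
    return mean(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time to resolution: average of (resolved - started)."""
    return mean(i.resolved - i.started for i in incidents)


def mtbf(incidents):
    """Mean healthy interval between one resolution and the next failure."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [
        nxt.started - prev.resolved
        for prev, nxt in zip(ordered, ordered[1:])
    ]
    return mean(gaps) if gaps else None


if __name__ == "__main__":
    history = [
        Incident(started=0, detected=60, resolved=600),
        Incident(started=4200, detected=4230, resolved=4500),
    ]
    print("MTTD seconds:", mttd(history))
    print("MTTR seconds:", mttr(history))
    print("MTBF seconds:", mtbf(history))
```

Tracking these per quarter, rather than as a single number, is what shows whether the observability work is actually moving the system from reactive toward proactive.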
Unless you are adding value to your business,
your business partners will not see the
value of observability.
The anti-patterns you can also turn around into best practices, their opposites:
ensuring that there's a standard in logging, a better
way of managing logging and a better
way of using traces;
focusing on instrumentation,
automated responses and a
continuous drive to achieve performance optimizations;
and things like going with some of the AWS
managed services. These are very good because they are managed
and are all integrated with CloudWatch, and you
are able to use SNS and the other
alerting capabilities and
the dashboards. So there's a unified way
of doing that. That's the whole point when it comes to AWS.
AWS has bundled these things so that there is
one place to go, and the unification and
simplification will provide you far more value
and help you in your journey from reactive
to proactive and then to autonomous.
With this I thank you for taking the
time to listen. I hope you found this session useful.
You can find me on LinkedIn. If you have any questions,
send me a note on LinkedIn. There
are a lot of nice, thought-provoking
videos and presentations happening as part of
Observability 2024. I encourage you
to go and listen to the others, and I'm
sure everything will help you in your observability journey.
Thank you very much for taking the time. It was my
pleasure presenting.