Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, this is Samir here. Welcome to this Conf42 event on the observability domain. I'll be presenting the session Beyond Monitoring: The Rise of Observability, which is around the observability domain. Just to give you my introduction: I'm Samir, I'm part of the Eviden organization, and I'm based out of Mumbai. I provide architecture and technology leadership to large transformation programs and large deals. So let's begin with the session. The first slide provides the agenda that we will cover as part of this session.
We'll start by getting a view of what observability is under the hood. Then I'll walk you through the OpenTelemetry framework and what constitutes it. Then we have the observability building blocks: the functional building blocks, the framework and the architecture. Then there are the KRAs and KPIs that are impacted by this platform once it is deployed across various opportunities, so I'll touch upon those as well. Then there is the exact tooling stack, in terms of what constitutes the solution building blocks: the ISV solutions, the hyperscalers and the open source solutions. Then I'll touch upon self-healing infrastructure as well, and finally give a view on AIOps. That is what I want to cover in this session, so let's begin.
System unavailability and underperformance in a landscape negatively impact the user experience and customer satisfaction.
Obviously, this translates into revenue loss. If you have a portal with issues around performance, reliability or scalability, users will simply run away from it when the response time does not meet the SLAs. Typically that is about two or three seconds: if a page takes longer than that to load, customers or users will just abandon your website. That's just an example.
What observability does is enable organizations to find the needle in the haystack, identifying system issues even before customers are able to notice them. That is the core objective. Observability also provides stakeholders with actionable insights into distributed infrastructure, the kind found in large organizations with large, complex architectures, including the likes of Uber and AWS; that is where these solutions are deployed.
Obviously, what gets built depends on the requirements coming in from the application standpoint and the business standpoint, and you then build that platform incrementally. It is not a single-dimensional thing: there are various teams and stakeholders who use these systems to get insights and inputs around the various things happening in your IT landscape. If there are any bottlenecks, or any predictions of systems, applications or components going down, this solution tracks them in advance, which in turn ensures the uptime and availability of those solutions. So that is the overall background of the observability solution, or the observability platform.
So, next is under the hood. In a modern, complex infrastructure, in modern IT landscapes, there are tons and tons of applications and solutions running, be it on premise or on cloud, and the number is just humongous: application servers, database servers, web servers, load balancers, network elements, identity and access management systems, integrations, machine learning elements, data warehouses, and the list just doesn't stop for this kind of complex landscape. What's required is a system or platform in place that gathers insights from these endpoints and provides a single-pane-of-glass view, so you can understand what's happening in those nodes, in those various components and services, and take proactive action; any course correction that needs to be done is done through the rules and policies you implement as part of your self-healing infrastructure. That's basically what happens with the solution. From a solution standpoint, it gathers all the insights from the various endpoints, explores the patterns and the various properties, and then tries to deliver actionable insights in a way that maintains system uptime and availability rather than impacting them negatively. So that's about the elements of observability under the hood.
Next is a quick comparison between observability and monitoring. In earlier times, monitoring was the medium used to understand and analyze what was happening in the various applications deployed across your organization's IT landscape. But if you see the table here, monitoring was more reactive and situational: you really didn't have the intelligence to monitor and take corrective action. The first two rows compare exactly that: reactive versus proactive, and situational versus predictive. Those are the two key points. Monitoring was also more speculative than data driven. By speculative I mean that, based on the limited data you had about your applications and infrastructure, you would try to analyze and understand the root cause of why certain things happened: maybe a system going down, maybe a database not working, maybe a performance issue or a scalability issue. But it was speculative, so you really didn't have a concrete answer as to why it happened. Observability, by contrast, is data driven.
Observability consolidates and ingests all the data from your various infrastructure and applications, be it your cloud environment or your on-premise environment, and then analyzes that data to find the insights, the correlations and the patterns required to track down your bottlenecks. So that's the difference: the what and when, versus the what, when, why and how. That is basically the next generation of monitoring, if you have to define it that way, which is observability.
Then there are expected problems versus unexpected problems. Expected problems are the known unknowns: we know this problem has happened, and now we have to do a course correction. Unexpected problems are where you try to predict that something is going to happen in this node, this component or this service, and what you need to do to correct it. So that is the difference between predictive and situational.
Data silos: yes, monitoring was built on data silos, with limited access to the data about your applications, your infra and your various solution components. With observability the data is in one place; you aggregate data from various solutions and systems, then analyze it to find patterns and insights, and leverage that to take actionable decisions that improve uptime and availability. And then there is data sampling versus instrument everything, which means those elements are monitored 24x7, 365 days a year, making sure there's no downtime on those systems.
As per Gartner, more and more organizations are adopting it, knowing the value it brings, knowing how critical it is, and knowing the impact it can make on system availability and on the overall customer experience. So this is a critical platform, a critical layer of your IT landscape in the current era.
The next slide talks about telemetry. This is the OpenTelemetry architecture, driven by the CNCF; there are also ISVs, product companies, SIs and consulting organizations supporting it and driving the community part. These are the core pillars of telemetry: the three telemetry types, metrics, logs and traces, are the parameters leveraged to build your observability stack.
Metrics are nothing but numeric values measured over an interval of time. These can relate to CPU, memory, network bandwidth and so on, and they are gathered into your observability stack.
Then we have logs. These are timestamped text records of the events that occurred at a particular time, which includes the stack trace and the call stack: which API is getting called, at what time it is called, which other supporting APIs it calls and what else is embedded into it. That entire logging is the second type of telemetry.
The third is traces. A trace represents the end-to-end journey of a user request through the entire distributed architecture, which means it is not limited to a single process or application; it cuts across your various nodes, applications and systems to give the details of what is happening across the end-to-end user journey.
So these three telemetry types are the ones that are recorded and gathered, and their derivatives constitute the entire gamut of the observability platform. The user experience, the uptime, the availability, everything is then calculated based on these parameters, even the KPIs and the KRAs, which I'll come to in a bit. But these are the key ones.
And this is the OpenTelemetry framework. The ISVs, the open source tools and the hyperscalers all, more or less, have to follow it, abide by it and align with it, because ultimately these are the best practices. So that's about telemetry and its pillars.
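To make the three pillars a bit more concrete, here is a minimal sketch, not from the talk, of an application emitting two of them, traces and metrics, with the OpenTelemetry Python SDK. The service name, the tracer and meter names and the checkout_order workload are hypothetical placeholders.

```python
# Minimal OpenTelemetry sketch: emit traces and metrics to the console.
# Names ("checkout-service", checkout_order) are illustrative placeholders.
from opentelemetry import trace, metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader, ConsoleMetricExporter

resource = Resource.create({"service.name": "checkout-service"})  # hypothetical service

# Traces: the end-to-end journey of a request
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout")

# Metrics: numeric values measured over an interval of time
metrics.set_meter_provider(MeterProvider(
    resource=resource,
    metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())],
))
meter = metrics.get_meter("checkout")
order_counter = meter.create_counter("orders_processed", unit="1")

def checkout_order(order_id: str) -> None:
    # One span per request; child spans would capture downstream API calls.
    with tracer.start_as_current_span("checkout_order") as span:
        span.set_attribute("order.id", order_id)
        order_counter.add(1, {"status": "ok"})

checkout_order("A-1001")
```

Logs, the third pillar, would typically be attached through the standard logging integration and correlated with these spans via the trace context.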
Next is a good level of detail on the KPIs and KRAs that are impacted by observability. Customer experience is one: the impact on the end-to-end customer journeys is something that is also observed and monitored through observability. Then there are MTTR and MTBF, mean time to repair and mean time between failures. If you have a system in place that can predict issues even before they happen, I'm sure these parameters stay under control. The platform can do the analysis, find the insights, find the patterns and the correlations between those patterns, and come up with an insight that can be leveraged to do a course correction across your entire portfolio of applications and solutions.
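As a quick illustration of how those two KPIs are derived, and not something shown on the slide, here is a small sketch computing MTTR and MTBF from a list of incident start and resolution times; the sample incidents are made up.

```python
# Hypothetical MTTR / MTBF calculation from incident records (detected, resolved),
# expressed in hours. The incident data below is made up for illustration.
from datetime import datetime

incidents = [  # (detected, resolved)
    (datetime(2024, 1, 3, 2, 0), datetime(2024, 1, 3, 3, 30)),
    (datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 14, 45)),
    (datetime(2024, 2, 8, 9, 15), datetime(2024, 2, 8, 11, 0)),
]

repair_hours = [(end - start).total_seconds() / 3600 for start, end in incidents]
mttr = sum(repair_hours) / len(incidents)                      # mean time to repair

uptimes = [(incidents[i + 1][0] - incidents[i][1]).total_seconds() / 3600
           for i in range(len(incidents) - 1)]                 # time between failures
mtbf = sum(uptimes) / len(uptimes)                             # mean time between failures

print(f"MTTR: {mttr:.2f} h, MTBF: {mtbf:.1f} h")
```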
Then there is reliability and availability. Definitely, all of these parameters are directly impacted by observability. It improves reliability and it improves availability, based on the overall goals and objectives that you have, whether that is three nines, four nines or five nines. You can configure the policies, processes and rules to keep this under control, so that it is architected, defined, monitored and managed as per the SLAs.
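As a quick side calculation, not from the slides, here is what three, four and five nines mean as an annual downtime budget, which is the number those SLA policies and alerting rules are ultimately protecting.

```python
# Annual downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year for a given availability target (e.g. 0.999)."""
    return MINUTES_PER_YEAR * (1.0 - availability)

for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} availability -> {downtime_budget_minutes(target):7.1f} min/year")
# three nines ~525.6 min/year, four nines ~52.6, five nines ~5.3
```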
Performance: yes, it does have an impact on both performance and scalability. It provides you with so much data gathered about your services, your network calls, your distributed tracing and your logging; if you consolidate those details, you ultimately have the insights from the various endpoints and nodes needed to improve these parameters and keep them under control, so that they don't negatively impact the user experience, uptime and availability. So that's about the KPIs and the KRAs.
Next are the platform objectives. If one decides to build an observability platform, what constitutes it? What really goes into the platform: what are the solution building blocks, what are the infrastructure elements, and what solutions or tools would you want to leverage, whether ISVs, open source tooling, or something specific to a hyperscaler like Azure or AWS? So this slide is about the platform objectives, what you want to achieve via the observability platform.
It definitely enhances visibility into system performance and health, as we saw with the KRAs and KPIs. It discovers and addresses unknown issues with accurate insights, with elements built in for root cause analysis, pattern identification, insights and so on. It results in fewer problems and outages thanks to predictive capabilities: it can analyze the issues and bottlenecks you have and make sure the right set of rules and policies is deployed in your landscape.
Then comes predicting issues based on system behavior by combining observability with AIOps, machine learning and automation. This is the second stage of your maturity: the first is having observability in place, and then you build AIOps. AIOps includes other elements as well, not just monitoring and observability; it brings in the ITSM part and the automation part, so that everything you can monitor in those domains is leveraged with artificial intelligence built in, reducing the manual interventions needed to drive your processes and orchestrate the various scenarios you have. That is what AIOps is, and I have a slide explaining that as well.
Then there is catching and resolving issues in the early phases of the software development process. This relates to while you are actually building your software: if everything goes through your CI/CD automation, you can leverage a lot of tooling in pre-production and staging, before it finally goes to production, to make sure you deploy the right quality of platforms and applications. And then there is deep-diving into logs and inspecting stack traces and errors, which is already covered by the earlier points.
Then the framework part: each of these points says what the platform actually does once it is built and deployed. It makes sure the overall health of the system, which is essentially the uptime and the availability, stays within the stipulated SLAs. We have tooling to help debug production systems, so if something happens in production we have the exact set of tools to analyze why it happened and what is needed to correct and fix it. You can diagnose infrastructure problems in production and do the root cause analysis as well, which is a capability the platform provides. And as I said, there are the unknown unknowns: we really have no clue what may come up, but the platform has the intelligence to analyze the data from various endpoints and nodes, do the anomaly detection, and make sure uptime and availability are maintained.
So this is the platform framework, the reference architecture for observability. It is a kind of pipeline, or funnel, which starts with ingesting data from various endpoints into your observability stack. The first stage is data ingestion and collection, which collects the log data, the application performance monitoring data, the metrics data, the uptime data and the user experience data. The second stage is aggregation and processing, where you make sure the data is of the right quality and in the right format, and it is then aggregated to derive insights. The data itself typically passes through three states: raw data, then cleansed data, and then aggregated data. The next stage is the analysis and machine learning part, where a system runs through the data and finds the patterns and insights based on the various rules that are in place, so that it can be leveraged for anomaly detection, root cause analysis, pattern discovery and so on; this works in a kind of circular manner. And finally, the fourth stage is the alerts and dashboards, which are provided to your Ops team and your SRE team so they can understand the various elements of your IT landscape and make sure that all the systems are up and running.
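To make that analysis stage a bit more concrete, here is a minimal sketch, not from the slides, of one common approach: flagging anomalies in an aggregated metric against a rolling baseline. The window size, threshold and latency numbers are illustrative assumptions.

```python
# Minimal anomaly-detection sketch for the analysis stage: flag points in an
# aggregated metric series that deviate strongly from the recent baseline.
from collections import deque
from statistics import mean, stdev

def detect_anomalies(samples, window=30, threshold=3.0):
    """Yield (index, value, z_score) for samples far outside the rolling baseline."""
    history = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(history) >= 2:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0:
                z = (value - mu) / sigma
                if abs(z) > threshold:
                    yield i, value, z   # candidate alert -> route to dashboards / on-call
        history.append(value)

# Example: p95 latency (ms) per minute with a sudden spike at the end.
latency = [120, 118, 125, 122, 119, 121, 124, 117, 123, 120, 480]
for idx, val, z in detect_anomalies(latency, window=10):
    print(f"anomaly at t={idx}: {val} ms (z={z:.1f})")
```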
There is also the element of self-healing infrastructure. In case there is something that needs to be fixed, the systems can be built in such a way that, even without your service desk or your level one and level two support intervening, the system is able to roll out a fix on its own for some of the defined issues in the applications in your IT landscape. These systems are invariably converging towards self-healing, autonomous systems, where increasingly lean teams manage the entire gamut of your infrastructure and applications.
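As a purely hypothetical illustration of what such a rule might look like, here is a small sketch in which an unhealthy service triggers an automated restart before anyone on L1/L2 gets involved; the health check, remediation command and service name are placeholders for whatever your platform actually exposes.

```python
# Hypothetical self-healing loop: a policy maps a detected condition to an
# automated remediation, escalating to humans only if remediation fails.
import subprocess
import time

def service_is_healthy(name: str) -> bool:
    # Placeholder health check; a real platform would query its monitoring API.
    return subprocess.run(["systemctl", "is-active", "--quiet", name]).returncode == 0

def restart_service(name: str) -> None:
    subprocess.run(["systemctl", "restart", name], check=True)

def self_heal(name: str, retries: int = 2) -> None:
    for _ in range(retries):
        if service_is_healthy(name):
            return
        restart_service(name)          # automated fix, no L1/L2 involvement
        time.sleep(10)                 # give the service time to come back
    if not service_is_healthy(name):
        raise RuntimeError(f"{name} still unhealthy, escalate to on-call")  # human takes over

self_heal("payments-api")              # hypothetical service name
```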
So that's about the observability platform architecture. I also want to touch upon the logical architecture, which has six different data elements that are captured: log data, metrics data, synthetic data, APM data, user experience and uptime. If you look at each of them, the metrics data, for example, covers the host and container metrics, the database metrics and the network metrics. So whatever system you build, you make sure these data elements are the ones ingested into the platform, and this is what is leveraged to find insights and patterns and do the root cause analysis.
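As a small, hypothetical example of where the synthetic and uptime data elements come from, here is a sketch of an HTTP probe that records availability and response time for an endpoint; the URL and timeout are placeholders.

```python
# Hypothetical synthetic-monitoring probe: hit an endpoint on a schedule and
# record uptime (success/failure) and response time for the platform to ingest.
import time
import requests

def probe(url: str, timeout_s: float = 3.0) -> dict:
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=timeout_s)
        ok = resp.status_code < 500
    except requests.RequestException:
        ok = False
    return {
        "url": url,
        "up": ok,
        "response_time_ms": round((time.monotonic() - start) * 1000, 1),
        "checked_at": time.time(),
    }

print(probe("https://example.com/health"))   # placeholder endpoint
```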
It may sound very simplistic when I explain it, but the data that gets ingested runs into terabytes and petabytes, because we are monitoring the system every second to observe exactly what is happening in the applications and in your infrastructure. With the kind of complexity we have in applications today, there is no way but to leverage automation, to leverage AI, to leverage these cutting-edge platforms to make sure your uptime and availability are maintained.
I just want to cover a few more slides here. A quick one: this is the tooling stack. The one on the left is an open source hybrid stack, which constitutes Logstash, Prometheus, Nagios, Jaeger, JavaMelody, Elasticsearch and Kibana. The one on the right is an open source stack leveraging ELK, which is Elasticsearch, Logstash and Kibana. Which mix you use ultimately depends on your final requirements; the more concrete and focused the solutions, the more capability and features you get, but that is something to be analyzed on a case-by-case basis. These are the open source stacks typically leveraged for the various tiers or stages of your monitoring platform: log monitoring, infra monitoring, app monitoring, distributed tracing, user experience monitoring and so on. So this gives a view of the tooling part, the tooling architecture.
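As a small, hedged example of how a stack like this gets consumed, here is a sketch that pulls a metric out of Prometheus via its standard HTTP query API; the server address and the PromQL expression are assumptions, not part of the talk.

```python
# Query Prometheus's HTTP API (/api/v1/query) for an instant vector.
# The Prometheus address and the PromQL expression are illustrative.
import requests

PROMETHEUS = "http://localhost:9090"   # assumed address of the Prometheus server

def instant_query(expr: str):
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": expr}, timeout=5)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# e.g. average CPU usage per instance over the last 5 minutes (node_exporter metric)
expr = '100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100'
for series in instant_query(expr):
    print(series["metric"].get("instance"), series["value"][1])
```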
I have almost reached the end of the presentation. I just want to touch upon the fact that these KRAs and KPIs are typically built into some of these tools, so you really don't have to do the heavy lifting around them. You can take the templates rolled out by the various tools, do a bit of customization and some configuration, and you are up and running. So these are the various KRAs and KPIs, the ones you just saw, the log data and the metrics data; this is what constitutes them. And this slide is around the metrics part: a lot of templates can be leveraged to build these solutions, and it covers the elements around the applications, the databases and things like that.
This is the dashboard, an Azure monitoring dashboard, showing what it looks like for complex landscapes, monitoring the security, the application side of things, the infra and the network. This is just a simplistic view; I'm sure you have seen the command centers with any number of dashboards in a room where the SRE teams and the Ops teams are monitoring all the time. So this is just a view of what it constitutes, extrapolated to the complexity you have in the landscape and the number of elements you want to analyze. So that is about it.
One last thing I want to include is about AIOps. AIOps is the next stage of your solutioning, of your platform. It is not limited to monitoring: it includes your ITSM and the automation around it, and also the things you leverage to have an autonomous architecture. It is a mix of those things: you observe, you engage and you act, which is where the automation comes in. These are the next stage of capabilities you will have in your landscape. The diagram on the left shows that the data collection comes from various systems and applications, not just from your observability stack. Then you have the machine learning systems to do the real-time and historical processing, and then the next level of capabilities to do the anomaly detection and pattern discovery.
So that is about it. I have covered most of the things; some of the elements I couldn't cover are there in my slide deck. On the last slide I have my coordinates, so feel free to reach out to me if you have any doubts or any specific inputs around the presentation. I hope you have found this presentation insightful and informative. Thank you, thanks for joining, thanks for watching. Have a nice day. Bye.