Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everybody, I'm Rajulakshmi from Site24x7,
happy to virtually meet you all at this event, the Conf42
SRE conference. Thank you for joining my session
on the topic "Future of observability in an experience-driven
economy". And the agenda that
we'll be discussing today are about digital experience,
the importance of it, observability versus monitoring.
What are the differences, what are DevOps and SRE,
and what does observability have to do with them? What are the pillars
of observability, how can you achieve them, and how can
tools help in achieving observability?
So first, when we talk about experience,
particularly the digital experience,
the first thing that comes to my mind is this book by
Kathy Sierra. The book is named Badass: Making Users
Awesome. I'm sure some of you would have read this book,
or I would strongly recommend you to read this book because
she gives a different perspective of how you have to keep
your users in the center of your business. And this
opens our eyes in a different perspective. And there is
a question in this book, and this goes like this,
which would you rather have a user feel? And she goes on to give
four options, and you have to choose only one of the options.
Let's see what comes to your mind when you look at the options.
The options are, do you want your users to
have the feeling that the product is awesome? This is the product that
you are building, or you own the product. So do you want your users
to have the feeling the product is awesome or the company is
awesome? The company that is building the product is awesome or
the brand is awesome? The brand that's owned by the company that is building
the product is awesome. Sounds tricky, right?
All of these answers look like they are correct, but she
says none of this. There is one more choice,
which is the right answer and which is I am awesome.
So you have to make your users feel that
they are awesome. But wait,
people don't actually talk like that. Nobody says, I am awesome
because of this product. They only go on to say,
I like this product or the app is amazing. That's a question that comes
to all of us. This is exactly the snippet
from the book. So she goes on to say,
when a customer or when a user, he or she
says, this product is amazing, you should see what it does.
They actually mean I am amazing.
You should see what I can do with it. So when you have
such users who have a feeling that they are awesome because
they are using the product, they will do a lot
of things. When a user feels they are awesome, these are some of the things
that they will do: they will go about talking about the product,
evangelizing, free marketing. They will remain loyal,
they will tolerate problems. Whatever be the problem in the product,
they will be able to tolerate it and they'll resist competition.
They wouldn't want to go to the competition, they would want to stick with
the product that they are using. And they'll form community groups.
And there are situations where even before the
customer, even before the product owner answers the questions,
there are diehard fans who would want to solve problems for other users
and they'll show off their results and lot more. So these are all
the outcomes when you have the user feel that they are awesome.
So such users are the secret for
sustained success. The successful users, all of us at business,
we want to be doing business in the long run.
Sustained success. The secret lies with successful users.
And if you want successful users,
user experience is important. And in
this digital world, digital user experience is important.
User experience could be in any format. It could be in the way your
support technician picks up the calls and answers a query.
Or it could be the way how you are designing your UI,
keeping it very simple, having all the configurations,
easy to do, easily understandable.
Or you automate many things such that
the user may not do out of the box stuff.
It could be in any format. But experience plays a
major role as a differentiating factor if you want to stay ahead
of your competition. And there is an
article, Forbes article, that talks about 100 things in
the digital world and user experience, 100 stats, and I've just taken
few from that. 89% of companies have adopted digital
first strategy. They have started moving to the digital world.
And 86% of companies believe that cloud technology is
critical for digital transformation. You want to take your product to
the digital world. You can't say that I would want to
take it digital, but I will still be using my legacy code,
the legacy software and technology that I have been using. No, you have to
adopt the relevant cloud technologies if you want to adopt
digital. And 67% of consumers will pay
more for a great experience. So experience is important. They don't
mind paying extra for that.
87% of companies think digital will disrupt their industry.
So people have started feeling, and I should say this last
two and a half years, this pandemic has only accelerated
the digital adoption and people have started moving to
the digital world. 83% of enterprise workloads
are in the cloud. It's not that only the startups or SMBs are in
the cloud, even enterprises have realized the importance and they are
moving to the cloud. So all these facts and the references that I
quoted go on to say that digital experience
is important and we are
living in this era of an experience-driven economy.
The experience that you give your customers
is definitely going to have an impact on your business
and the economy of your business.
Let's now look into observability. What is
observability and how is it different from monitoring?
Is monitoring and observability, are they the same or are there any differences?
So I would want to put it in a simple way.
When you talk about observability versus monitoring,
the base of any system or any monitoring
that you want to do is observability. That is the underlying platform.
Whatever you
want to observe, be it your server or your application,
the system has to be observable. There must be
some ways, some APIs or some protocols, using
which you can fetch the relevant data that you want to
monitor. So the underlying layer is observability, on
top of which monitoring lies. Only if your systems are
capable of being observed
can you do monitoring on top of it and collect all the relevant
metrics. Depending on what you are monitoring, the metrics
will vary and you can collect the relevant metrics. So that monitoring is the
second layer and there's no point in just collecting all
these metrics and keeping it with yourself. You have to do analysis
on top of it with the data, the humongous amount of
data that any system is collecting. There has to be analysis
and there has to be segregation and make meaning out of the data
that you collect. There's no point in just collecting data if you're not going to
make sense out of it and give that benefit to customers.
So this is the stack that I would call observability is the underlying
platform. Then you can do monitoring and then you have to do analysis
on top of it. Don't just keep your data idle. If you're not going
to do anything with the data that you're collecting,
do not collect it. Why waste your resources?
So keep in mind, all the data that is being collected
has to be analyzed, maybe for various reasons. Only
if you have such data and such analysis can you apply
the latest AI capabilities.
I'll be talking about AI towards the end of the session too. But analysis
is important. Let's move on
to how I see DevOps and the definition
of DevOps and SRE, and what they have to do with observability.
DevOps. So if you take DevOps, there are different definitions.
A continuous feedback mechanism
is what we all know of, but I see it with a
different analogy. So there
used to be, no, I shouldn't say there used to be. There is still this
role called developers whose actual role is to
write code and then make sure that there are no errors
in the code, bundle it as a product,
build it, attend to all the code-level errors, and they think with that
their role is over. Possibly ten years ago,
and I've been in the industry for 22 years and have been a developer myself,
we used to do this. We thought that that's the role of
developers, and with that we are done. And we pass this on to another person
who's the operator, who actually takes care of
deploying the build, attending to customer complaints, making sure
the application is up and running, keeping up with the SLAs,
fixing any problems that happen in the system. So all this used to be the
role of operator. So these two roles were actually, or these
two roles, these two persons, they were actually different
and there were a lot of blame games. This is
not my problem. This is not my problem. This used to happen,
but with digital adoption,
with the latest technologies and the transformation that is
happening, where we are moving to an era of deploying
multiple builds within the same day. We at
Site24x7 develop and deploy three
to four builds a day, from what used to be
a build in three months. That's how systems or companies are evolving.
In such a situation, these two roles cannot be separate,
and they merged, and that's when DevOps was
born. This is how I see DevOps: as
a person or a character, or however you want
to define, they have to have some amount of knowledge of what
is happening in development to what is happening in deployment.
Only then they'll be able to quickly fix any problem
and then take it to the production environment.
And as seen in this diagram, he or she is
not a superhero who can do all this by themselves; they need to make use of
tools. There are a lot of tools, be it in house tools or third
party tools, open source tools, lot of tools available, which helps
DevOps people in their day to day activities
to make sure that they are able to solve problems easily.
And DevOps is also changing in
the cloud. As things move to the cloud and digital is
being adopted, DevOps is transforming into
DevSecOps, where security at all these layers is important.
Be it at the application layer or at the infrastructure,
at all the places, it has to be safe and secure, and that
has to be taken care of. That is added as an additional role for
DevOps. Now this is about DevOps. What is
SRE? Are they different or are they the same?
There are a lot of definitions for
these two. But the simple definition or the simple way
of putting things about SRE is if DevOps is about
principles of what to be done, SRE is about
how you do things. So if you take the differences between DevOps
and SRE, DevOps is about connecting the development and ops teams
with a set of principles, and the primary focus here is on
delivery. Whereas SREs are
more ops oriented and they are more towards
the production environment, where they respond to incidents,
monitor all the events, make sure that they reduce
faults, and take care of automation. They have to do with the reliability
of the systems, and particularly when
things are in the cloud and you have to take care of all your
deployments, SRE plays a major role. And if I have to put
this in development terminology, or in
Java terminology, I would say SRE implements DevOps.
That's how I would want to put it.
So with that SRE definition,
if we have to say, if you take a cloud architecture, there are various
layers in any cloud architecture, and starting from
end user layer to application to platform to infrastructure
layer. And it is important for the SRE to have
an end to end visibility across all these layers,
because the problem could be anywhere. If you
see it in the cloud, if your application is going down,
it could be because of an ISP problem in the customer
end, or it could be because there
is a problem in the way the application is written, there is an indefinite
loop that is happening in the code, or it could be in a database
connection being not closed, or there is a leakage of resources that
is being used in a file in a platform layer, or even it
could be in a problem in a port, in a switch at the
infrastructure layer. Any problem anywhere in the stack
is going to impact your application, it's going to impact your
business. So end to end visibility of all these
layers is important, and SRE has to know that for which
tools will be helpful. And that's about the observability.
SRE has to have an overall view of what is happening in
all these layers. Now, when we talk about the pillars
of observability. All of us know the three pillars.
There are three pillars. In fact there is another fourth pillar that
is getting added. But the main three pillars of observability that
are being discussed are these, and
we'll see all of them in detail too. Metrics
is the first pillar, then traces, where you get end to
end visibility or you get the line of code that is having an issue, and
logs. So these three are considered the three pillars
of observability. So let's get into the details of what
do you mean by metrics, traces and logs.
That's what we will cover in this section, achieving observability.
So when we talk about metrics, as I initially said
during the observability section itself, anything that
you want to monitor, the metrics that you are monitoring will vary
depending on the component that you are monitoring. If you are monitoring your
server, the metrics will be what is the CPU utilization,
what's the memory utilization, what are all the processes that are running in the system.
So those will be the metrics. If you're monitoring your application, the metrics
will be what is the response time, how many times a particular
transaction is being called and how many times
people are hitting the system. So those will be the metrics. What are the database
calls? Those will be the metrics too. And if
you're going to monitor your database,
the number of connections, are all the connections
closed? What are the slow queries? The metrics will vary.
So depending on the component that is being monitored,
the metrics will vary. And irrespective of whatever be the component
that we are monitoring, we have to make sure the
basic things are being collected. So in metrics,
whatever be the component that you are monitoring, the primary or
the important thing is the availability metric.
Uptime is an important metric to collect, be it your application or
your database component, or your infrastructure, server,
network and underlying components: are all of them up and
running? The industry standards expect 99.99%
availability. It's almost like 100%,
but you can have a small difference here and there. So the
industry standards have moved from three nines to five nines.
That's the expectation. That's what the competition is giving.
It has to be up and running all the time, or most of the time,
which means you have to monitor all the resources, the entire
stack. Availability metrics have to be monitored.
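To put numbers on what three, four and five nines mean, here is a minimal sketch in Python. It is only an illustration of the arithmetic, not something from the talk: the function names are made up for this example, and the year is assumed to be 365 days.

```python
# Illustrative only: turns an availability target into the downtime it allows.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period_minutes * (1 - availability_percent / 100)


def measured_availability(up_checks: int, total_checks: int) -> float:
    """Availability as the percentage of health checks that succeeded."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    return 100 * up_checks / total_checks


if __name__ == "__main__":
    targets = [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]
    for label, target in targets:
        minutes = allowed_downtime_minutes(target)
        print(f"{label} ({target}%): about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, each extra nine cuts the yearly downtime budget by a factor of ten, which is why uptime has to be measured on every layer of the stack and not only on the application.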
And in the cloud, security metrics
have to be monitored. Security metrics again will vary. If you are talking about an
application, the metrics will be about things like:
how are you defining your security XML?
Have I made sure that if there are going to be hundreds of requests
coming in at the same time, have I defined all the thresholds properly?
So those are some things that you have to monitor.
And at the network level, am I able
to monitor all the relevant details and
have I made sure that one user's data is not accessible
by another user? Being in the cloud, people are
trusting the vendors and giving
their data into your system. So you have to make sure that the user segregation
is taken care of properly when you are designing your application and when you
are designing your database itself. So there are various metrics which you
have to look into with respect to security aspects too.
That has to be taken care of. Then performance
metrics. You have to take care of all your availability metrics,
but that alone is not sufficient. You are having your resources up and
running, but if they are going to be performing very, very slowly,
it's going to have an impact on your business.
Industry standards expect just 2 seconds for any application
to respond. And it plays an important role in defining your
SEO, in defining your marketing standards
and making sure that you are able to come up in
search engine rankings. So performance plays a key role in
how quickly you are able to respond. And performance again has to cater
to all the layers: your application's performance, your database
performance, your server performance, your network performance, your end user's
performance, where you have to take care of your ISP,
your browser, your device type and the version.
All those play an important role. So performance metrics have to
be monitored. And then scalability metrics. There are
different areas, the different domains.
Whether you are a startup, an SMB or an enterprise, you have
to take care of how quickly you
can define and do auto scaling. And when we talk about
scalability, both the aspects of vertical scaling and
horizontal scaling have to be taken care of. There could
be situations where
your system can only handle a certain amount of load, and you want to
add more instances and take care of horizontal scaling.
Or there could be situations where there are CPU
intensive calculations happening, where you want to increase
the size and take care of vertical scaling. All these metrics have to be taken
care of. Then cost metrics have to be taken care of. When you are deploying in the
cloud, are you really utilizing the resources that you have
purchased? A lot of the time, what happens is when
we purchase the system or when we use a cloud environment,
we use our credit card and we start using it
with that. There is another department, the finance department,
that takes care of paying the bills every month. Are we fully utilizing
the resources? That has to be monitored. There are
surveys that say that 30%
of the resources are not being utilized, that is,
underutilization. So you need to measure all those metrics and
make sure that you make full use of whatever
you are paying for. So the cost angle has to be monitored. These are the
different types of metrics that you have to take care of when
you are monitoring your entire infrastructure. Moving on to
traces: what is it that you have to do in traces? Tracing has to
do with having an end to end visibility or
pinpointing to the line of code that is having issues,
particularly where the industry is moving from a monolith
architecture to a microservice architecture and each of the
components is running in its own container,
where the container can be spawned,
deployed, destroyed on its own. So it's very challenging.
Even though the architecture looks very simple, monitoring this environment
is very challenging. Each of them can have its own programming
language too. Then how do you make sure to find out where the problem
is? So in such situations, you need tracing across
all these tiers, be it your client or your web or your server
or your data tier, where within your server layer you
can have a set of machines for data securing, a set of
machines for data collection and a set of machines for data processing. You need
to know exactly where the time is being spent. There are
situations we have faced where a particular transaction
does millions of method calls, and you will get such
issues only in a real-time deployment. It's
very tough, it's very challenging, to find out the problem
in the real-time deployed environment. So the
tools will help. And tracing across all these layers is also
important: distributed tracing in a microservice architecture, where
one problem will have a cascading effect on the entire architecture
of your application. So it's important for you to trace
them. And then moving on to the third pillar, which is logs.
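Before getting into logs, the tracing idea above, knowing exactly where the time is being spent across the client, web, server and data tiers, can be sketched in a few lines of Python. This is not how any particular tracing tool works. The Span record, the tier names and the timings are all hypothetical, and as a simplification each span is assumed to cover only the time spent inside one tier, so spans do not overlap (real tracers nest spans).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One timed step of a single transaction (hypothetical record layout)."""
    trace_id: str
    name: str
    tier: str          # e.g. "client", "web", "server", "data"
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def time_by_tier(spans):
    """Sum the time spent in each tier for the given spans."""
    totals = defaultdict(float)
    for span in spans:
        totals[span.tier] += span.duration_ms
    return dict(totals)


def slowest_span(spans):
    """Return the single span that took the longest."""
    return max(spans, key=lambda span: span.duration_ms)


if __name__ == "__main__":
    # Made-up timings for one transaction, purely for illustration.
    trace = [
        Span("t1", "render page", "client", 0, 40),
        Span("t1", "route request", "web", 40, 55),
        Span("t1", "process order", "server", 55, 175),
        Span("t1", "SELECT orders", "data", 175, 995),
    ]
    print(time_by_tier(trace))
    print("slowest step:", slowest_span(trace).name)
```

With the made-up data, most of the transaction time sits in the data tier, which is the kind of answer an SRE is looking for when a request is slow end to end.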
Logs. Why have logs become a pillar? All
of us, whenever we have a problem, go look at the logs to
find out what the issues are. And when we have our deployment
in a distributed architecture, finding out exactly,
or taking remote control of each of those machines and going and looking
into where the problem is, is going to be challenging. Instead, if we are
able to collect all these logs from the distributed architecture,
do some processing on top of those logs, and store them in such
a way that you can easily search them and look at what the problem is,
that's going to be helpful for SREs. So consolidating
logs from across all the servers, doing all the real-time
prepping and storing them in an easily queryable format
is what log management is. In simple terms, converting
your unstructured data into structured data is log management.
And out of the box there is support for all the common
applications that are available. The logs are also helpful
for your audit trails. Be it your database logs,
application logs or network logs, there are different types of logs, and
all these logs are required in a cloud environment.
These are required for your auditing and compliance too.
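As a small illustration of converting unstructured data into structured data, here is a sketch in Python. The log line layout (timestamp, level, host, message), the host names and the helper names are assumptions made up for this example; real applications each have their own formats, and this is not how any specific log management product parses them.

```python
import re
from datetime import datetime

# Assumed line layout: "<YYYY-MM-DD HH:MM:SS> <LEVEL> <host> <free text message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<host>\S+) "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Turn one raw log line into a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["timestamp"] = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
    return record


def consolidate(lines):
    """Parse lines collected from many servers and order them by time."""
    records = [record for record in map(parse_line, lines) if record is not None]
    return sorted(records, key=lambda record: record["timestamp"])


def query(records, level=None, host=None):
    """Filter structured records by level and/or host."""
    return [
        record for record in records
        if (level is None or record["level"] == level)
        and (host is None or record["host"] == host)
    ]


if __name__ == "__main__":
    raw = [
        "2024-01-15 10:32:07 ERROR db-01 Connection pool exhausted",
        "2024-01-15 10:31:55 INFO web-02 Request served in 180 ms",
        "this line is not in the expected format",
    ]
    for record in query(consolidate(raw), level="ERROR"):
        print(record["timestamp"], record["host"], record["message"])
```

Once the lines are structured records, questions like "which host logged errors, and when" become simple filters instead of a manual search on each machine.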
And these logs will actually help you to find out what changes have
gone in, who has done the change and where was the change done, when was
this done? These are some important queries to be addressed to build
the trust because sometimes you might never know. The customer will come and say I
don't know when this change was done or who has done this change. Can you
help me in finding this out? For that, log management will be
helpful. So to put it in a nutshell, metrics, traces and logs
form the three pillars of observability, and definitely
the AIOps which people are using helps
SREs. Let's just spend some two or three minutes and look
into how AIOps helps SREs.
There are different ways how AI can help. It can help
with evaluating past performance because you have all the data
in your system. Based on the historical data that is available with
you, the AI system will be able to predict and tell you
what to expect. Suppose you had a sale last
time, new year sale, you had a huge number of people landing
on your system and you wanted some ten extra servers. Based on
these data, when you are planning for another sale,
the system will be able to tell you what it is that you
have to be prepared for. And it will also help in enabling
communication through chatbots, the chatbot integrations,
because all of us have our own communication channels. Possibly you're using Slack
for your communication, or you're using Microsoft Teams for your communication,
and you don't want to go into another machine
or another window to look at what the problem is. Whatever be
the monitoring solution, you can directly integrate it into your chat
communication using chatbots. That will help. That's again possible using
AIOps. Then, managing a flurry of alerts. Any monitoring tool
has to do with having a lot of alerts, segregating the alerts
based on the severity, assigning them to a particular technician.
With all these, the AI can help in doing it
automatically, so it can keep you
one step ahead in resolving the issues.
And then data correlation across tools.
As any system uses multiple tools,
the data and results from all
these tools can be correlated with the help of AIOps.
Again, correlating test results. We do a lot of testing. We do development
testing, changing testing, deployment testing, post production, pre production
testing. All these test results have to be correlated, where AI can
help. And when we talk about AI, one of the important things we have to
keep in mind is that exhaustive training is important.
The AI system is only as powerful as how much
it has been trained. The accurate data points that
are fed into the system will help it to make correct predictions.
So that is something that we have to keep in mind when we talk about
AI Ops.
Based on all these data, it can do an exhaustive self training so
that it prevents false alerts. So these are some ways in which AI can help.
It can help in decision making, coming up with forecasts and
dynamic threshold settings. The user need not set any threshold. Make it easy
for the customer. Let the system decide, based on the historical data,
"I will adjust the thresholds." For each transaction
the threshold can be adjusted. For each server the thresholds can be adjusted.
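One very simple way to picture a dynamic threshold is "mean plus a few standard deviations of the recent history", computed separately for each server or transaction. The sketch below, in Python, shows only that basic statistical idea. It is an assumption for illustration, not the method used by any AIOps product, and the function names, the multiplier k and the sample numbers are all made up.

```python
from statistics import mean, pstdev


def dynamic_threshold(history, k=3.0):
    """Threshold derived from history: mean plus k standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two historical data points")
    return mean(history) + k * pstdev(history)


def is_anomalous(value, history, k=3.0):
    """True when a new value exceeds the threshold learned from history."""
    return value > dynamic_threshold(history, k)


def thresholds_per_key(histories, k=3.0):
    """Compute one threshold per server or transaction name."""
    return {key: dynamic_threshold(values, k) for key, values in histories.items()}


if __name__ == "__main__":
    # Made-up response times in milliseconds for two hypothetical transactions.
    histories = {
        "checkout": [100, 102, 98, 101, 99],
        "search": [20, 22, 19, 21, 18],
    }
    for name, limit in thresholds_per_key(histories).items():
        print(f"{name}: alert above {limit:.1f} ms")
    print("checkout at 150 ms anomalous?", is_anomalous(150, histories["checkout"]))
```

Because the limit is recomputed from each component's own history, a fast transaction and a slow one get different thresholds without anyone configuring them by hand.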
You can take a lot of corrective actions using automation
and scripts, which can be run automatically based on the predefined
threshold settings that are available. These will help in
resolving the problems and will help the
SRE to make sure that they keep the system up
and running all the time. Conversational chatbots too.
The future is in AIOps:
finding the accurate anomalies, reducing the noise, more forecasting.
And we are moving towards being proactive rather
than being reactive. So all along the industry has
been in a situation where when a problem occurs,
how do I go and fix it? What are the tools that can help me
to fix the problems? From that situation it's moving towards:
let's be proactive, let's make sure that the problem does not happen in
the first place at all. So to achieve that, AI will
be helpful. So in a nutshell, if I have to say,
metrics, traces and logs combined with
AIOps will be able to help SREs in their day to
day activities and make it smooth and easy and pass
on that benefit to customers. These are possible with the
help of monitoring tools, collecting all your metrics,
your traces, your logs, all in one
console and apply AI on top of it.
There are many such tools available in the market. One such
is Site24x7, which is an AI-powered full stack monitoring
platform that lets you take care of all your monitoring needs,
the stack that I talked about, from one single console. So we
do have everything from website monitoring to server, cloud, network,
application performance, real time real user monitoring,
application log management, cloud spend and StatusIQ. On
top of it, we do have alerting and
reporting, and we apply AI on top of this. And
Site24x7 is hosted on Zoho's data centers. Zoho has been
in business for 25 years. Site24x7 is a mature product
in the market for close to 16 years now, and we are hosted
on Zoho's data centers. Zoho has its data centers in five different
regions, ten different data centers. In each of the regions, we have a primary and
a secondary data center. The customers can choose the data center
so that the data resides within the geographical boundary
of that particular region. We have one in the US, one in Europe,
one in India, one in China, one in Australia. We are coming up with more
data centers depending on customers' requirements, too.
Being a cloud provider, we do take privacy, security and compliance
very seriously and get all the required or relevant certifications
that are required for us to be a cloud provider.
So the key takeaways from this session, the last almost
30 minutes that I've been talking, are: experience is
important, and when it is in the digital world,
digital experience is important. And when we talk about observability,
there are the main three pillars and how you can achieve the same,
and AI can help SREs in achieving observability.
Site24x7 is an AI-powered full stack monitoring platform.
So I would like to close with this quote. This is one of my favorite
quotes, and the quote goes like this:
"We shape our tools and they in turn shape us" is a
famous quote by Marshall McLuhan. And what this means is
the tools that you have in your hand have a great impact
on the day to day activities that you do. When you have
a hammer in your hand, everything looks like a nail. There's a screw over
there, but you only have a hammer in your hand and you
will end up only hitting it. So it's important for you to
choose the right set of tools depending on what your business
needs are and what your customer requirements are, so that you can
take your business to the next level and be successful in whatever
role that you are playing. Thank you for your time.
Have a nice time in the event. If you have any
questions, feel free to write to me or write to the support
email ID that's provided here. I'll be happy to arrange a
one on one session if you need a demo of the product.
Thank you.