Transcript
Hi everyone, I am Pravar Agrawal, and I welcome you all to my session on debugging cluster-level issues as an on-call SRE. Let us quickly take a look at today's agenda. We start off by getting to know reliability engineering in brief, and also the role of an on-call engineer and what it is really about. We will then look at some commonly occurring cluster-level issues in the Kubernetes ecosystem, and try to understand different approaches to debugging such issues, including with the help of different types of automation. And if you are a beginner, what should you really be looking for in the role of an on-call engineer? A little bit about me: I work as a senior engineer at IBM on the IBM Cloud Kubernetes Service, and these are the different handles to stay in touch with me on Twitter, LinkedIn, and Slack.
Okay, so, with an introduction to site reliability engineering: I assume most of us have been hearing this term a lot and already have a fair amount of awareness around it. But to still define it, it is an approach, or a methodology, for IT operations to use different tools to solve different sets of problems and to automate operations-level tasks. It is a practice to ensure our systems are reliable and scalable, and SRE folks also try to maintain a balance between releasing new features and, at the same time, ensuring reliability for their customers.

There is also a word we often hear along with SRE, which is on call. It basically means that for a set period of time, you are available to respond to production-level issues or incidents with urgency. Different companies have their own implementation of the on-call process, but everyone's main aim is to ensure the availability of production and 24/7 support by managing incidents. There are a few rules around it: start off by acknowledging and verifying the alert, analyze the impact caused by the issue, communicate with the different teams involved through the proper channels used in your organization, and eventually apply a corrective action or a solid fix for the issue.

There are many tools which have become popular in the last couple of years for incident management: PagerDuty, Jira, Opsgenie, and ServiceNow. There are other tools as well, but these are the more popular ones. This diagram is a representation of the SRE and on-call ecosystem. It involves different phases, starting from deployment, troubleshooting, receiving alerts, then the escalation matrix that is involved, and tracking everything with the help of a proper ticketing system.
Moving on to different cluster issues which are very common and which we often see. We are taking the example of an environment comprising a single Kubernetes cluster, or multiple clusters, running in production. There could be issues related to the services running on a node, and it is not always possible to SSH in manually and try to debug those. We will take a look at how we can automate this process and how we can debug these sorts of issues in the next slide.
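In the meantime, as a minimal sketch of inspecting a node without SSH-ing into it, the snippet below uses the official Kubernetes Python client to read node conditions such as Ready, MemoryPressure, and DiskPressure; this is the same information `kubectl describe node` shows, and `kubectl debug node/<name>` is another option when you really do need a shell. It is illustrative only and assumes your kubeconfig already points at the cluster.

```python
# Sketch: read node conditions without SSH, using the official Kubernetes Python client.
# Assumes a working kubeconfig (e.g. ~/.kube/config) for the target cluster.
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    print(f"Node: {node.metadata.name}")
    for cond in node.status.conditions:
        # Conditions include Ready, MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable
        print(f"  {cond.type}: {cond.status} ({cond.reason})")
```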
Then there could be issues related to multiple pods getting stuck in the Pending or Terminating state, which is a very commonly occurring issue. The Pending state mainly happens when the pod cannot be scheduled onto any node during its lifecycle, and Terminating mainly when the application is not responding after you issue a delete command for the application resource inside the cluster. There are also errors related to API endpoints going down or responding slowly; these API endpoints could be at the application level, or related to the Kubernetes core components as well.
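To make the Pending and Terminating cases concrete, here is a hedged sketch that lists pods stuck in Pending and prints the scheduler's reason (typically Unschedulable, with a message about resources or taints), plus any pods that have a deletion timestamp set but are still around, i.e. stuck Terminating. Again this uses the Kubernetes Python client and assumes cluster access is already configured.

```python
# Sketch: find pods stuck in Pending or Terminating and surface why.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Pods the scheduler has not placed yet
for pod in v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending").items:
    for cond in (pod.status.conditions or []):
        if cond.type == "PodScheduled" and cond.status == "False":
            print(f"Pending {pod.metadata.namespace}/{pod.metadata.name}: "
                  f"{cond.reason} - {cond.message}")

# Pods that were deleted but have not gone away (stuck Terminating)
for pod in v1.list_pod_for_all_namespaces().items:
    if pod.metadata.deletion_timestamp is not None:
        print(f"Terminating since {pod.metadata.deletion_timestamp}: "
              f"{pod.metadata.namespace}/{pod.metadata.name}")
```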
Then there is the unavailability of etcd pods. Also, when you want to issue reloads of various worker nodes, doing it manually is not always an option. There could also be issues where your disks are getting full, maybe reaching 90% capacity, and you keep getting alerted for it, or your health checks are failing, maybe readiness or liveness probes.
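Many of these symptoms, failing liveness or readiness probes in particular, surface as Warning events in the cluster, so a quick first pass is often just to pull recent warnings. A minimal sketch, again assuming the Python client and an existing kubeconfig:

```python
# Sketch: list recent Warning events (probe failures show up with reason "Unhealthy",
# scheduling problems as "FailedScheduling", evictions as "Evicted", and so on).
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

events = v1.list_event_for_all_namespaces(field_selector="type=Warning")
for ev in events.items:
    obj = ev.involved_object
    print(f"{ev.last_timestamp} {ev.reason} {obj.kind}/{obj.name}: {ev.message}")
```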
So how can we approach debugging these sorts of issues? There is no golden rule here, but there are different right ways to do it. We start off by analyzing the error message we have received and try to see whether the blast radius can be lowered, that is, how many services or components it is impacting at the back end. The lower the blast radius, the better it is for us to debug and issue the corrective action. We also want to properly utilize our monitoring tools, like Prometheus, LogDNA, or Sysdig, to look at the last recorded state of the application with the help of the metrics that were pushed while it was still running fine, and also the logs that were pushed, the application-level logs as well as the cluster-level logs, to understand its current and previous state. If it is Kubernetes related, then we should try to get access to the cluster as soon as possible and find out whether the master components, or other important components, are functioning properly or not. If they are deployed as pods, we can get the logs of those pods; otherwise, if they are running as systemd services, we should check the service-level logs. And if it is a widely impacting issue, we should try to isolate that service by restricting its usage.
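As an illustration of "get to the cluster quickly and check whether the core components are healthy", the sketch below looks at the control-plane pods in kube-system and tails the logs of anything that is not Running. The namespace and tail length are assumptions; adapt them to how your clusters deploy their control plane (some setups run these as systemd services instead, in which case you would check journal logs on the node).

```python
# Sketch: flag unhealthy kube-system pods and grab their recent logs.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("kube-system").items:
    if pod.status.phase != "Running":
        print(f"{pod.metadata.name} is {pod.status.phase}")
        try:
            logs = v1.read_namespaced_pod_log(
                name=pod.metadata.name,
                namespace="kube-system",
                tail_lines=50,   # assumption: last 50 lines is enough for a first look
            )
            print(logs)
        except ApiException as err:
            print(f"  could not fetch logs: {err.reason}")
```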
Of course, we are going to need automation to help us. Why do we need it? So that we can reduce the time it takes to respond and get our infrastructure statistics at the earliest. We implement automation to collect cluster-level statistics, or to run some real-time commands with maybe a single line, which is not always possible if you first have to get access to the cluster and then run those commands. We also use it to handle node-level reboots or restarts of different core services, to query historical data and analyze patterns you might be seeing across one issue or several, and to prepare cleanup jobs that clean up resources so that you do not face a resource crunch whenever capacity is needed.
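As one example of such a cleanup job, the sketch below deletes pods that have already completed (Succeeded or Failed) and are older than a cutoff, the kind of thing you would run on a schedule; the two-day age threshold is an assumption to adapt to your environment.

```python
# Sketch: periodic cleanup of finished pods to avoid a resource crunch.
from datetime import datetime, timedelta, timezone
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

cutoff = datetime.now(timezone.utc) - timedelta(days=2)   # assumption: keep 2 days of history

for phase in ("Succeeded", "Failed"):
    for pod in v1.list_pod_for_all_namespaces(field_selector=f"status.phase={phase}").items:
        started = pod.status.start_time
        if started and started < cutoff:
            print(f"Deleting {phase} pod {pod.metadata.namespace}/{pod.metadata.name}")
            v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
```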
Now, what different types of automation might we want to implement? We start off by writing meaningful, well-documented runbooks for the on-call folks, which can direct them to the different pointers available for the particular issue that has been reported. These pointers could be directions to the different automations linked to it which the on-call SRE can utilize: it could be a Jenkins job, or maybe a GitHub issue that was reported in a similar way earlier. So runbooks are really essential in this case.
Another type is the implementation of bots, which can take care of ChatOps. These bots mainly run on our communication platforms like Slack, Teams, or Mattermost, and they can help in gathering real-time cluster statistics, issuing some commands, even kubectl commands, and also issuing reboots and reloads of our worker nodes if authorized to do so. One example of such a bot is Botkube, which is a very popular open-source tool with extensive usage and connectivity to the different platforms available.
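To give a feel for what such a bot does under the hood, here is a hedged sketch of a command handler that only allows a small whitelist of read-only kubectl commands before executing them. The whitelist, the `handle_chat_command` name, and the wiring to Slack or Mattermost are all illustrative, not Botkube's actual implementation.

```python
# Sketch: the kind of whitelisted command handler a ChatOps bot might call.
# The surrounding Slack/Mattermost integration is omitted; only the safety check is shown.
import shlex
import subprocess

ALLOWED = {
    ("kubectl", "get"),
    ("kubectl", "describe"),
    ("kubectl", "top"),
}

def handle_chat_command(text: str) -> str:
    """Run a read-only kubectl command coming from chat, or refuse it."""
    parts = shlex.split(text)
    if len(parts) < 2 or (parts[0], parts[1]) not in ALLOWED:
        return f"refusing to run: {text!r}"
    result = subprocess.run(parts, capture_output=True, text=True, timeout=30)
    return result.stdout or result.stderr

# Example: handle_chat_command("kubectl get pods -n kube-system")
```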
We can also deploy some custom scripts running as Kubernetes resources inside our cluster, in the form of a DaemonSet, inside sidecar containers, or maybe as custom resources. These will basically be invoked whenever they face a certain situation. One example could be: if my disk is getting full, then I can invoke a script which cleans up my disk and maybe tries to shuffle some resources off it.
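A minimal sketch of that disk-cleanup idea is below: it checks usage of a mount point and, above a threshold, removes old rotated log files. The path, the 90% threshold, and the seven-day age limit are assumptions for illustration; a real script would target whatever actually fills your nodes (images, logs, temporary data).

```python
# Sketch: clean up old files when a disk is close to full (e.g. run via a DaemonSet).
import os
import shutil
import time

MOUNT = "/var/log"        # assumption: the mount that tends to fill up
THRESHOLD = 0.90          # assumption: act at 90% usage, matching the alert
MAX_AGE = 7 * 24 * 3600   # assumption: delete rotated logs older than 7 days

usage = shutil.disk_usage(MOUNT)
if usage.used / usage.total >= THRESHOLD:
    now = time.time()
    for root, _dirs, files in os.walk(MOUNT):
        for name in files:
            path = os.path.join(root, name)
            if name.endswith(".gz") and now - os.path.getmtime(path) > MAX_AGE:
                print(f"removing {path}")
                os.remove(path)
```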
Then, of course, we cannot escape implementing, or taking the help of, the AI-based capabilities available nowadays. These could be analyzers that gather more detailed information about the cluster. We should also be looking at integrating our existing monitoring solutions and tools to extend their capabilities, like curating some custom dashboards which can give us a holistic as well as a detailed view of what is actually happening inside the cluster, irrespective of which tool you are using. Nowadays these tools have support for numerous types of dashboards which we can utilize.
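As a small example of extending an existing monitoring tool rather than standing up something new, the sketch below asks Prometheus for per-node filesystem usage over its standard HTTP API; the Prometheus URL and the node_exporter metric names are assumptions that depend on your setup.

```python
# Sketch: query Prometheus for filesystem usage per node (the data a disk-usage panel would show).
import requests

PROM_URL = "http://prometheus.example.internal:9090"   # assumption: your Prometheus endpoint
QUERY = (
    '100 * (1 - node_filesystem_avail_bytes{mountpoint="/"} '
    '/ node_filesystem_size_bytes{mountpoint="/"})'
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    instance = series["metric"].get("instance", "unknown")
    value = float(series["value"][1])
    print(f"{instance}: {value:.1f}% of / used")
```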
Moving on to the last part, for someone who is starting afresh in this field: basically, the more exposure you have, the more seasoned you will become in handling different situations. It is always essential to get a broader understanding of the entire architecture and infrastructure, so that you are able to connect the different components whenever there is a situation. Also, keep your runbooks handy to deal with different issues, errors, or warnings, and try to analyze the historical or recent occurrences of similar issues which might have caused outages in the recent past; these can prove to be very good learning points. Lastly, I would like to say: if an alert auto-resolves, it doesn't always mean that there is nothing wrong, so it's better to look anyway. That was all from my side. Thank you.