Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Maria and I'm the developer advocate at
Botkube. In today's session we'll be talking about Kubernetes
troubleshooting demystified, and I'll be presenting your five best
practices to level up your Kubernetes troubleshooting workflow.
Just a little bit about me: my name is Maria, as I said
before, and I'm a developer advocate at Kubeshop, where I work on
the Botkube project. I have a background in industrial
systems engineering and I've also been working in developer
relations in software and engineering for the past few years.
I also have a really cute dog named Malcolm.
To say that the Kubernetes space is complex is an understatement.
There's a steep learning curve and on the left you
can see that there are lots of
tools to learn. So here is the map of the CNCF landscape
and there are probably more tools that have been added since this
picture was created. But with Kubernetes you have to know about
container orchestration, you have to know about configuration management,
deployments and networking, all just to get your Kubernetes cluster up
and running. Additionally, troubleshooting is challenging,
especially in this hybrid world that we're living in.
Being able to communicate effectively with your teammates is
more difficult than ever, especially when you have teams across different
time zones with different levels of Kubernetes expertise.
And just being able to share context is very difficult.
So what is Kubernetes troubleshooting? So in short,
Kubernetes troubleshooting is a process of identifying and resolving issues
in a Kubernetes cluster. So this means solving
problems related to deployment,
networking challenges, resource allocation and more in a
timely manner. So here's an example of a Kubernetes
troubleshooting scenario. This is the OOMKilled
error, which occurs when a container uses more memory than it
is allowed, so Kubernetes automatically terminates the affected
pods. So first you need to identify the
container or pod that was terminated. Secondly,
you need to check memory usage of the container or pod.
Then you have to look for any errors in the container or pod
logs. Fourth, you need to update the container and pod
image, and fifth, increase the memory limit for the container
or the pod. So this is a five step process.
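
The first of these steps, identifying which container was terminated, can be sketched in code. The following is a minimal Python sketch, not something shown in the talk. It assumes you have pod data in the shape produced by `kubectl get pods -o json` and checks each container's `state` and `lastState` for a termination with reason `OOMKilled`. The function name `find_oom_killed` and the sample pods are hypothetical.

```python
def find_oom_killed(pod_list):
    """Return (namespace, pod, container) tuples for OOMKilled containers."""
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for status in statuses:
            # A container may be terminated right now ("state") or may have
            # been restarted after its last termination ("lastState").
            for key in ("state", "lastState"):
                terminated = status.get(key, {}).get("terminated")
                if terminated and terminated.get("reason") == "OOMKilled":
                    hits.append(
                        (meta.get("namespace"), meta.get("name"), status.get("name"))
                    )
                    break
    return hits


# Hypothetical sample data, shaped like `kubectl get pods -o json` output.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "prod", "name": "api-1"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "app",
                        "state": {"running": {}},
                        "lastState": {
                            "terminated": {"reason": "OOMKilled", "exitCode": 137}
                        },
                    }
                ]
            },
        },
        {
            "metadata": {"namespace": "prod", "name": "web-1"},
            "status": {
                "containerStatuses": [
                    {"name": "nginx", "state": {"running": {}}, "lastState": {}}
                ]
            },
        },
    ]
}

for namespace, pod, container in find_oom_killed(SAMPLE):
    print(f"{namespace}/{pod}: container {container} was OOMKilled")
```

Against a real cluster you would load the JSON from the kubectl output instead of using `SAMPLE`. The remaining steps (memory usage, logs, limits) still need a person or further tooling.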
And you might say, Maria, this is five steps. It's not that complicated.
However, when you're finding the root cause of the issue, you have
to go through multiple substeps in this five step process.
So these five steps can take minutes
or even hours or even days to solve if you don't have
an efficient troubleshooting workflow, and
it gets even more challenging when you add in multiple clusters.
So in a large scale production environment, it's very difficult to identify
the root cause of the issue. So if one cluster is having
issues, how can you tell what is going on with
each cluster? So how can you identify and diagnose
your problems when your problems are distributed across multiple systems?
Additionally, with multiple clusters, you're going to have multiple tools that you
use for your observability, your monitoring and
your resolution. And then being able to collaborate
and assign responsibility just becomes more difficult the more
complexity that you add in. So here are my five
Kubernetes troubleshooting best practices. Number one,
you want to centralize your monitoring and observability.
This means you want to put all of your information into one place
where everybody can have a shared context and a source of
truth to be able to act on the error.
Second, you want to have proper incident response
and collaboration. So what you need is some sort of avenue
that keeps your incident response and collaboration in one streamlined
place, so they're not in two separate channels.
Third, you want to establish a feedback loop.
So this means keeping track of all of your insights from
previous incidents and errors so you can have more insights
on what's going on in your system. Fourth, you want to be able
to streamline your command execution. As you scale, to avoid redundancy,
you want to be able to run a
single command across multiple clusters.
And fifth, you want to be able to automate your observability and
delivery process. Automation is key when it comes to
efficiency. So what is Botkube and
how does it help teams follow troubleshooting best practices?
Botkube is an open source collaborative Kubernetes troubleshooting tool.
This means you're able to monitor and troubleshoot your events in the same platform.
So instead of having to screen
share or hop on a meeting to solve an error,
you're able to solve everything in your
chosen platform. And today we'll be talking about how Botkube works well with
Microsoft Teams and Azure. And then with
Botkube you're able to improve your developer experience, because nowadays
if you're a developer working with Kubernetes, you're almost forced to be a Kubernetes
expert just to know the status of your applications.
But with Botkube you're able to get self-service access to your resources without
having to deal with the knowledge gap. And finally,
because Botkube can easily connect with any of your communication platforms,
you're able to use Botkube from a mobile device, meaning that
you can use Botkube on the go. So just a quick overview.
Botkube works with Slack, Microsoft Teams,
Discord and Mattermost, and currently you can monitor
your cluster via
Kubernetes events and Prometheus. We
also have a plugin system with
more sources
that you can link Botkube to. Additionally,
you can control your Kubernetes cluster, so act on those events,
with kubectl and Helm. And secondly,
you can automate your event responses with Botkube's
actions, and you can extend Botkube to
any source or executor via the plugin system I mentioned before. And via
the Botkube web hosted app, you're able to audit your events and commands
from all of your clusters. And in that web hosted app it's
easier to manage your Botkube installation and configuration for all of your clusters.
So back to our best practices.
First, empowering observability with Botkube.
You can see an example right here. You're able to receive
real-time updates in your communication platform,
and you can get notified about
new resources or updates that happen in your system.
And with Botkube it's very easy to create channels
and separate the information that you get. So for example,
the front end developer channel does not
need to have all of the Kubernetes alerts that you get,
whereas a platform engineering channel should have
access to everything that's going on in the cluster.
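
The idea of sending different alerts to different channels can be illustrated with a small routing sketch. This is a minimal Python example under stated assumptions, not Botkube's actual configuration format or API: the `ChannelRule` class, the `route` function, the event fields (`namespace`, `type`), and the channel names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChannelRule:
    """Hypothetical rule: which events a chat channel should receive."""
    channel: str
    namespaces: set = field(default_factory=set)   # empty set means all namespaces
    event_types: set = field(default_factory=set)  # empty set means all event types

    def matches(self, event):
        ns_ok = not self.namespaces or event["namespace"] in self.namespaces
        type_ok = not self.event_types or event["type"] in self.event_types
        return ns_ok and type_ok


def route(event, rules):
    """Return the channels that should be notified about this event."""
    return [rule.channel for rule in rules if rule.matches(event)]


RULES = [
    # Platform engineering sees everything in the cluster.
    ChannelRule(channel="platform-eng"),
    # Front end developers only see errors in their own namespace.
    ChannelRule(channel="frontend-dev", namespaces={"frontend"}, event_types={"error"}),
]

print(route({"namespace": "frontend", "type": "error"}, RULES))
print(route({"namespace": "kube-system", "type": "create"}, RULES))
```

In this sketch the first event reaches both channels and the second only reaches the platform engineering channel, which mirrors the separation described above.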
Secondly, incident response and collaboration.
So in this GIF you can see that the team is
reacting to an error that occurs and they're able
to run a command right in the communication platform
that they're using. So you're able to not only receive alerts,
but you're also getting context about what's happening. You get logs of what you're doing,
you're able to filter those logs and you're also able just
to have a history of events that is right in your
communication platform of choice.
And third, establishing a feedback loop.
So this is an example of an audit log that you'd be able to access with
the Botkube web hosted app. So you're able to
get insights about your team's performance
and potential issues. So if you notice that certain developers
on your team are the ones who ran the last
command before something goes down, you're able to get performance insights on what's going on
with your team. And as an industrial engineer,
I believe in continuous improvement, and you can't have continuous improvement
without having data to back it up. So this audit log is
your source of truth to be able to make changes to
improve your system. And next
we have streamlining command execution. So here you see
Botkube. You're able to change
your namespace, you're able to change the cluster, and you're able
to run commands across multiple clusters.
So this allows you to scale
fairly easily and fairly quickly.
And you're also able to give non-Kubernetes
experts the ability to run
Kubernetes commands or Helm commands or any executor
that you choose fairly easily and
very quickly within the communication platform.
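
Outside of a chat platform, running one command across several clusters usually means repeating it once per kubeconfig context. The following minimal Python sketch shows that fan-out using kubectl's `--context` and `--namespace` flags. It only builds the command lines and does not run them. The function name `build_kubectl_commands` and the context names are hypothetical, and this is not how Botkube implements the feature.

```python
import shlex


def build_kubectl_commands(contexts, kubectl_args, namespace=None):
    """Build one kubectl argv list per kubeconfig context."""
    commands = []
    for context in contexts:
        argv = ["kubectl", "--context", context]
        if namespace:
            argv += ["--namespace", namespace]
        argv += list(kubectl_args)
        commands.append(argv)
    return commands


# Hypothetical context names; use the ones from your own kubeconfig.
for argv in build_kubectl_commands(["prod", "staging"], ["get", "pods"], namespace="default"):
    print(shlex.join(argv))
```

Each argv list could then be passed to `subprocess.run(argv, capture_output=True, text=True)` to execute it and collect the output per cluster.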
And finally, you want to be able to streamline your automation and
developer empowerment. So here's an example of an automation
with Botkube. So this automation runs automatically
every time there's an error. This automation runs
the kubectl logs command.
So instead of having to repeatedly type kubectl
logs over and over again when you receive an
error, Botkube does that for you, and you're able to
reduce the amount of time in your troubleshooting workflow.
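
The automation described here, fetching logs whenever an error arrives, can be sketched as a function that maps an event to a command. This is a minimal Python illustration under stated assumptions: the event dictionary shape (`type`, `kind`, `name`, `namespace`) and the function name `logs_command_for` are hypothetical and do not reflect Botkube's internal event model or its actions configuration.

```python
def logs_command_for(event, tail=50):
    """Return a kubectl logs argv for an error event on a pod, else None."""
    if event.get("type") != "error" or event.get("kind") != "Pod":
        return None
    return [
        "kubectl",
        "logs",
        event["name"],
        f"--namespace={event['namespace']}",
        f"--tail={tail}",  # only fetch the most recent lines
    ]


# Hypothetical events for illustration.
error_event = {"type": "error", "kind": "Pod", "name": "api-1", "namespace": "prod"}
create_event = {"type": "create", "kind": "Ingress", "name": "web", "namespace": "prod"}

print(logs_command_for(error_event))
print(logs_command_for(create_event))
```

The point of the sketch is the design choice: the rule is written once, so nobody has to type the logs command by hand each time an error shows up.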
So this scales really well. You're able to
work with different tools across the cloud
native landscape. So you can use this with Prometheus,
you can use this with Argo CD, Flux CD and
many more, and you can reduce the time spent
in your troubleshooting workflow. So here is a new, improved
Kubernetes troubleshooting workflow with Botkube.
So with the automations in place
and the alerting in place, you're able to reduce your
five step process into a two step process.
And as you know,
this will scale really well. So imagine being able to
reduce your troubleshooting time by 30 to
40% and then scaling that across all of the
clusters that you work with. So this will allow
your teams to work more efficiently and quickly
and be able to work on the more important stuff
besides debugging errors. So in conclusion,
a strategic approach to Kubernetes troubleshooting is vital
for multi-cluster environments. And as we
know, complexity and scale are becoming more and
more important as we go on. So it's very important to have
a very calculated and targeted approach to Kubernetes
troubleshooting, and not just have an ad
hoc way of dealing with errors. By following
the best practices that
I've talked about, you will be able to take
your Kubernetes troubleshooting to the next level. And finally,
integrating solutions like Botkube will enhance your efficiency
and reliability across all your Kubernetes clusters.
So just quickly, how to get started with Botkube. It's very
easy. You can either install
Botkube via the web hosted app, or you can go
to our GitHub and install it manually in your cluster
with Helm. And it's very easy to configure
it for whatever you're working with via
our web hosted app. And I will show you how to get configured
with Botkube, Teams and AKS
in a moment. So here is the demo,
the Botkube dashboard. You would get
here by going to the Botkube website, and then you'd click sign
in, enter all your login information, et cetera, et cetera.
So I'm just going to make a new instance. I would do this the same
way that I would do all of my Botkube instances.
So here we're going to go to the official Botkube Slack
app, and this requires starting a free trial. So we have a
30 day free trial to be able to support multi-cluster management.
And then after that it's $25 per node per month. So here
you would just connect your Slack workspace, click add to Slack, then you
would just select whatever Slack workspace you'd be working in.
I have my own DevRel demo one, and I'm already
connected, so I can just continue.
I'll call this instance botkube demo production.
Then next, since I already have this pre-configured, I'm going to
call this cube tomorrow production.
And because this is going to be just for my production, I'm going to put
it in my production channel, in my dev prod channel, which is
going to host my cluster dedicated to production and my cluster
dedicated to staging. And I'm just going to show you how
easy it is to add Botkube Cloud to a channel, just as an example.
So click that, open Slack. Then I
would go to integrations, add an app,
and then click on this. And then basically you're good to go.
So now if I want to, I can add or
remove as many channels as I'd like. So for the purpose of this
demo, I'm just going to be using Helm, kubectl and Kubernetes.
So it's just the same standard process. And then
I'm going to make this bigger. Hopefully everybody can see that, and
it's the same installation process that you would use for a single cluster,
just copy and paste. Great.
Well, let's hop into Slack and take a
look. I was playing around with this earlier.
All right, perfect. So we have Botkube activated
in our production channel, in our dev prod channel.
So I'm just going to do a few Botkube commands just so you
can see Botkube up and running in some real world
scenarios. So first I always do the Botkube ping,
just so I know that my Botkube instance is
up and running. Then I'm going to be running the help command, which
will show you a guide to all of the
commands and plugins that we work with and give you
more detail on what's going on. So I'm going to
find out our list of executors.
So today we'll be working with Doctor, which is our ChatGPT
plugin, Helm and kubectl.
So next we're going to run some simple kubectl
commands in Botkube, and I'm
going to be using the Slack interactivity feature which
basically allows you to build out commands using buttons
instead of having to manually type them out. So we're just going to run a
quick get and then we're going to do a get pods
just so we can see what's happening in our cluster.
And you can add or remove
functionality like this in the Botkube Cloud web
hosted app. So if you want it to be read only you can
take this out. And you can see we have three pods
running, and we have one that's failing. And then we're
able to have our notification come in
that there's an error. Then I can run a quick describe
and see what else is going on. And then with the logs
that you get, you're also able to
filter the output
and see just what you want to see, because sometimes those logs
can be hundreds of lines long. So it's really great
for getting more information. So then we have some
more things coming in. So we have an ingress that
was created. I set up an automation
that runs a describe every time a resource is
created. And then I'm just going to do a quick helm list.
So then I'll be able to see everything that's
going on with the Helm charts that I have. So I have just Botkube
on there right now. But if I had a more complex cluster,
I'd be able to see more of what's going on.
And then we're just going to see the Doctor plugin. So we had an ingress
being created. What if I don't remember what an ingress is?
I'm just going to ask Doctor really quickly. Doctor almost
serves as docs inside of your platform, so you don't
have to navigate to another window. So it's going to tell me what
a Kubernetes ingress is, and I can take that information
and act on that alert that I just got
and that automation that I just got. So that's the demo,
and thank you so much for joining my presentation.
Right here, you can scan the QR code to get started with
Botkube. And thank you so much for having me.