K8s Troubleshooting Demystified: Five Best Practices to level up your troubleshooting workflow

Abstract

This talk covers: tackling K8s troubleshooting challenges; remote collaboration with Botkube for an easier troubleshooting workflow; and simplifying Kubernetes issue resolution with 5 Botkube best practices. Maria showcases Botkube’s features via live demo and tutorial.

Summary

In today's session we'll be talking about Kubernetes troubleshooting demystified. Maria will be presenting five best practices to level up your troubleshooting workflow.
Botkube is an open source collaborative Kubernetes troubleshooting tool. It works with Slack, Microsoft Teams, Discord and Mattermost. Today we'll be talking about how Botkube works well with Microsoft Teams and Azure.
With Botkube, you're able to receive real-time updates in your communication platform. You're also able to get insights about your team's performance and potential issues. And finally, you can streamline your automation and empower your developers. A strategic approach to Kubernetes troubleshooting is vital for multi-cluster environments.
Getting started with Botkube is very easy. You can either install Botkube via the web hosted app, or you can go to our GitHub and install it manually in your cluster with Helm. I will show you how to get configured with Botkube, Teams and AKS in a moment.
Botkube allows you to build out commands using buttons instead of having to manually type them out. You can add or remove functionality like this in the Botkube Cloud web hosted app. Here's the demo, and you can scan the QR code to get started with Botkube.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hello, my name is Maria and I'm the developer advocate at Botkube. In today's session we'll be talking about Kubernetes troubleshooting demystified, and I'll be presenting five best practices to level up your Kubernetes troubleshooting workflow. Just a little bit about me: my name is Maria, as I said before, and I'm a developer advocate at Kubeshop, where I work on the Botkube project. I have a background in industrial systems engineering and I've also been working in developer relations in software and engineering for the past few years. I also have a really cute dog named Malcolm.
To say that the Kubernetes space is complex is an understatement. There's a steep learning curve, and on the left you can see that there are lots of tools to learn. Here is the map of the CNCF landscape, and there are probably more tools that have been added since this picture was created. But with Kubernetes you have to know about container orchestration, you have to know about configuration management, deployments and networking, all just to get Kubernetes up and running. Additionally, troubleshooting is challenging, especially in this hybrid world that we're living in. Being able to communicate effectively with your teammates is more difficult than ever, especially when you have teams across different time zones with different levels of Kubernetes expertise. And just being able to share context is very difficult.
So what is Kubernetes troubleshooting? In short, Kubernetes troubleshooting is the process of identifying and resolving issues in a Kubernetes cluster. This means solving problems related to deployment, networking challenges, resource allocation and more in a timely manner. Here's an example of a Kubernetes troubleshooting scenario. This is the OOMKilled error, which occurs when a container uses too much memory and Kubernetes automatically terminates it. So first, you need to identify the container or pod that was terminated.
Secondly, you need to check the memory usage of the container or pod. Then you have to look for any errors in the container or pod logs. Fourth, you need to update the container or pod image, and fifth, increase the memory limit for the container or the pod. So this is a five-step process. And you might say, Maria, this is five steps. It's not that complicated. However, when you're finding the root cause of the issue, you have to go through multiple substeps in this five-step process. So these five steps can take minutes, hours or even days to work through if you don't have an efficient troubleshooting workflow, and it gets even more challenging when you add in multiple clusters.
In a large-scale production environment, it's very difficult to identify the root cause of an issue. If one cluster is having issues, how can you tell what is going on with each cluster? How can you identify and diagnose your problems when they are distributed across multiple systems? Additionally, with multiple clusters, you're going to have multiple tools that you use for your observability, your monitoring and your resolution. And being able to collaborate and assign responsibility just becomes more difficult the more complexity you add in.
So here are my five Kubernetes troubleshooting best practices. Number one, you want to centralize your monitoring and observability. This means you want to put all of your information into one place where everybody can have a shared context and a source of truth to be able to act on the error. Second, you want to have proper incident response and collaboration. What you need is some sort of avenue to have your incident response and collaboration in one place, so they're not in two separate channels and you can have everything in one streamlined place. Third, you want to establish a feedback loop.
So this means keeping track of all of your insights from previous incidents and errors so you can have more insight into what's going on in your system. Fourth, you want to be able to streamline your command execution, so that as you scale you avoid redundancy and can run a single command across multiple clusters. And fifth, you want to be able to automate your observability and delivery process. Automation is key when it comes to efficiency.
So what is Botkube and how does it help teams follow troubleshooting best practices? Botkube is an open source collaborative Kubernetes troubleshooting tool. This means you're able to monitor and troubleshoot your events in the same platform. So instead of having to screen share or hop on a meeting to solve an error, you're able to solve everything in your chosen platform. And today we'll be talking about how Botkube works well with Microsoft Teams and Azure. With Botkube you're also able to improve your developer experience, because nowadays if you're a developer working with Kubernetes, you're almost forced to be a Kubernetes expert just to know the status of your applications. But with Botkube you're able to get self-service access to your resources without having to deal with the knowledge gap. And finally, because Botkube can easily connect with any of your communication platform tools, you're able to use Botkube from a mobile device, meaning that you can use Botkube on the go.
So just a quick overview. Botkube works with Slack, Microsoft Teams, Discord and Mattermost, and currently you can monitor your cluster via Kubernetes events and Prometheus. We also have a plugin system with more sources that you can link Botkube to. Additionally, you can control your Kubernetes cluster, so you can act on those events with kubectl and Helm.
And secondly, you can automate your event responses with Botkube's actions, and you can extend Botkube to any source or executor via the plugin system I mentioned before. Via the Botkube web hosted app, you're able to audit your events and commands from all of your clusters. And in that web hosted app it's easier to manage your Botkube installation and configuration for all of your clusters.
So back to our best practices. First, empowering observability with Botkube, and you see an example right here. You're able to receive real-time updates in your communication platform, and you can see changes such as new resources or updates that happen in your system. And with Botkube it's very easy to create channels and separate the information that you get. So for example, the front end developer channel does not need to have all of the Kubernetes alerts, whereas a platform engineering channel should have access to everything that's going on in the cluster.
Secondly, incident response and collaboration. As you can see in this GIF, the team is reacting to an error that occurs and they're able to run a command right in the communication platform that they're using. So you're not only receiving alerts, you're also getting context about what's happening. You get logs, you're able to filter those logs, and you also have a history of events right in your communication platform of choice.
And third, establishing a feedback loop. This is an example of the audit log that you'd be able to access with the Botkube web hosted app. You're able to get insights about your team's performance and potential issues. So if you notice that certain developers on your team are the ones who ran the last command before something goes down, you're able to get performance insights on what's going on with your team.
And as an industrial engineer, I believe in continuous improvement, and you can't have continuous improvement without having data to back it up. So this audit log is your source of truth for making changes to improve your system.
Next we have streamlining command execution. Here you see Botkube. You're able to change your namespace, change the cluster and run commands across multiple clusters. This allows you to scale fairly easily and fairly quickly. And you're also able to give non-Kubernetes experts the ability to run Kubernetes commands, Helm commands or any executor that you choose, fairly easily and very quickly, within the communication platform.
And finally, you want to be able to streamline your automation and developer empowerment. Here's an example of an automation with Botkube. This automation runs automatically every time there's an error, and what it does is run the kubectl logs command. So instead of having to repeatedly write kubectl logs over and over again when you receive an error, Botkube does that for you, and you're able to reduce the amount of time in your troubleshooting workflow. This scales really well, and you're able to work with different tools across the cloud native landscape. You can use this with Prometheus, you can use this with Argo CD, Flux CD and many more, and you can reduce the time spent in your troubleshooting workflow.
So here is a new, improved Kubernetes troubleshooting workflow with Botkube. With the automations and the alerting in place, you're able to reduce your five-step process to a two-step process. And this will scale really well. Imagine being able to reduce your troubleshooting time by 30 to 40% and then scaling that across all of the clusters that you work with.
So this will allow your teams to work more efficiently and more quickly, and to work on more important things than debugging errors.
In conclusion, a strategic approach to Kubernetes troubleshooting is vital for multi-cluster environments. And as we know, complexity and scale are becoming more and more important as we go on. So it's very important to have a calculated and targeted approach to Kubernetes troubleshooting, and not just an ad hoc way of dealing with errors. By following the best practices that I've outlined, you will be able to take your Kubernetes troubleshooting to the next level. And finally, integrating solutions like Botkube will enhance your efficiency and reliability across all your Kubernetes clusters.
So just quickly, how to get started with Botkube. It's very easy. You can either install Botkube via the web hosted app, or you can go to our GitHub and install it manually in your cluster with Helm. And it's very easy to configure it for whatever you're working with via our web hosted app. I will show you how to get configured with Botkube, Teams and AKS in a moment.
So here is the demo, the Botkube dashboard. First you would get here by going to the Botkube website, and then you'd click sign in, enter all your login information, et cetera. I'm just going to make a new instance. I would do this the same way that I would do all of my Botkube instances. Here we're going to go to the official Botkube Slack app, and this requires starting a free trial. We have a 30-day free trial to be able to support multi-cluster management, and after that it's $25 per node per month. Here you would just connect your Slack workspace: click add to Slack, then select whatever Slack workspace you'd be working in. I have my own DevRel demo one, and I'm already connected, so I can just continue.
I'll call this instance botkube demo production. Next, since I already have this preconfigured, I'm going to call this cube tomorrow production. And because this is going to be just for my production, I'm going to put it in my production channel, my dev prod channel, which is going to host my cluster dedicated to production and my cluster dedicated to staging. And I'm just going to show you how easy it is to add Botkube Cloud to a channel, just as an example. So click that to open Slack. Then I would go to Integrations, add app, and then click on this. And then basically you're good to go. So now, if I want to, I can add or remove as many channels as I'd like.
For the purpose of this demo, I'm just going to be using Helm, kubectl and Kubernetes. So it's just the same standard process. And then I'm going to make this bigger. Hopefully everybody can see that. It's the same installation process that you would use for a single cluster, just copy and paste. Great.
Well, let's hop into Slack and see what's happening. I was playing around with this earlier. All right, perfect. So we have Botkube activated in our production channel, our dev prod channel. I'm just going to run a few Botkube commands so you can see Botkube up and running in some real-world scenarios. First I always do the Botkube ping, just so I know that my Botkube instance is up and running. Then I'm going to run the help command, which will show you a guide to all of the commands and plugins that we work with and give you more detail about what's going on. Then I'm going to find out our list of executors. Today we'll be working with Doctor, which is our ChatGPT plugin, Helm and kubectl. Next we're going to run some simple kubectl commands in Botkube, and I'm going to be using the Slack interactivity feature, which basically allows you to build out commands using buttons instead of having to manually type them out.
So we're just going to run a quick get, and then we're going to do a get pods just so we can see what's happening in our cluster. You can add or remove functionality like this in the Botkube Cloud web hosted app, so if you want it to be read-only you can take this out. And you can see we have three pods going, and one of them is failing. Then our notification comes in that there's an error. Then I can run a quick describe and see what else is going on. And with this log that you get, you're also able to filter the output, excuse me, and show just what you want to see, because sometimes those logs can be hundreds of lines long. So it's really great for getting more information.
Then we have some more things coming in. We have an ingress that was created. I just set up an automation that does a describe every time a resource is created. And then I'm just going to do a quick helm list, so I'll be able to see everything that's installed with Helm. I have just Botkube on there right now, but if I had a more complex cluster, I'd be able to see more of what's going on.
And then we're just going to look at the Doctor plugin. We had an ingress being created. What if I don't remember what an ingress is? I'm just going to ask Doctor really quickly. Doctor almost serves as having the docs inside of your platform, so you don't have to navigate to another window. So it's going to tell me what a Kubernetes ingress is, and I can take that information and act on the alert and the automation that I just got.
So that's the demo, and thank you so much for joining my presentation. Right here, you can scan the QR code to get started with Botkube. And thank you so much for having me.
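
As a rough illustration of the first step in the OOMKilled scenario described earlier in the talk (identifying which container was terminated), here is a minimal Python sketch. It is not part of the talk or of Botkube: the function name `find_oom_killed` and the sample data are hypothetical. It assumes input shaped like the JSON returned by `kubectl get pods -o json`, where a container's termination reason appears under `state.terminated.reason` or `lastState.terminated.reason` in `status.containerStatuses`.

```python
from typing import Any


def find_oom_killed(pod_list: dict[str, Any]) -> list[tuple[str, str, str]]:
    """Return (namespace, pod, container) for each container terminated as OOMKilled.

    Hypothetical helper: expects a dict shaped like `kubectl get pods -o json` output.
    """
    hits: list[tuple[str, str, str]] = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            # A killed container may already have restarted, so check both
            # the current state and the previous (last) state.
            for key in ("state", "lastState"):
                terminated = cs.get(key, {}).get("terminated")
                if terminated and terminated.get("reason") == "OOMKilled":
                    hits.append(
                        (meta.get("namespace", "default"),
                         meta.get("name", ""),
                         cs.get("name", ""))
                    )
                    break  # report each container at most once
    return hits


# Made-up sample data for illustration only.
sample_pod_list = {
    "items": [
        {
            "metadata": {"namespace": "prod", "name": "web-5c6b"},
            "status": {"containerStatuses": [
                {"name": "web", "state": {"running": {}}, "lastState": {}},
            ]},
        },
        {
            "metadata": {"namespace": "prod", "name": "api-7d9f"},
            "status": {"containerStatuses": [
                {
                    "name": "api",
                    "state": {"running": {}},
                    "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
                },
            ]},
        },
    ]
}

print(find_oom_killed(sample_pod_list))
```

Against a real cluster, the same function could be fed the parsed output of `kubectl get pods --all-namespaces -o json`; the remaining steps (checking memory usage, reading logs, raising the memory limit) would still follow as described in the talk.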

Slides

Download slides (PDF)

See all 21 talks at this event!

Conf42 Kube Native 2023 - Online

September 28 2023

Maria Ashby

Developer Advocate @ Kubeshop
