Conf42 DevOps 2025 - Online

Premiere: 5 PM GMT

Give Your Managed Kubernetes Troubleshooting Skills a Superpower with AI and K8sGPT


Abstract

Troubleshooting K8s clusters, including Oracle OKE, is challenging for DevOps teams. Enter the K8sGPT era: a new tool that leverages AI to provide real-time insights and contextual analysis. Join this session to see how it integrates with AI backends (OpenAI, Cohere, or Ollama) to enhance Kubernetes operations.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, and welcome. I'm very glad to be here at Conf42 DevOps 2025, the first conference of the year. Today I want to talk to you about how you can give your Kubernetes troubleshooting skills a boost using something called K8sGPT. We'll explore how this AI-powered tool can bring real-time insights and help you resolve issues faster and more effectively. But before we get started, let me do a quick intro. My name is Kosseila, a.k.a. the cloud dude. I'm the founder and principal consultant at Cloudthrill, with roughly two decades of experience in IT and Oracle tech. I now specialize in multi-cloud operations, DevSecOps, and a tiny bit of AI inference. There's not much more to say except that I'm a huge cloud and automation fan, and I share all my labs and discoveries on my blog at cloudthrill.ca/blog and in my GitHub repos at github.com/brokedba (sorry about the handle). I'm also the host of the TechBeats Unplugged podcast, which I started a year ago and am very excited about. You can find a cool episode where I hosted Kelsey Hightower, chilling about DevOps for almost two hours, so please check the QR code if that content speaks to you.

Okay, now that we're done with the intro, let's get to the reason why we're all here today: the agenda. What are we going to talk about? First, we'll set a bit of context and compare what AI is good at versus traditional troubleshooting methods and challenges. Then in section 2 we'll learn what K8sGPT really is: we'll explore its key features and benefits, along with some specs, AI backends, and installation. In section 3 we'll dive into all the K8sGPT commands you need to work with and describe the basic and the operator architectures, because there's a CLI and there's also an in-cluster installation. And finally we'll have some fun with a little demo that I'll be running; I hope you'll like it.

But let me start with the memes. Of course, AI today is stirring up endless headaches across the tech industry. I don't know if you heard, but Mark Zuckerberg recently announced AI will replace his mid-level engineers by 2025. That's a scary thing to tell your engineers, if you ask me. But hey, this is the same guy who burned almost 50 billion dollars on the Metaverse, remember? So who's right? Will AI replace your job? Maybe, maybe not, but it certainly won't hurt you to use AI today to do your job better; at least that's my opinion. On the flip side, you don't want to completely depend on GPT for your technical growth in the company, because that's not healthy and fundamentals are still important. So the key is really finding the balance between the two.

All right, done with the memes, back to business. I'm not going to explain what AI is, but it's important to highlight some key events and concepts. This slide is a language model history, a timeline of what happened and when. We can start by explaining neural networks. A neural network is a network that helps analyze complex data like text, images, video, or audio, and there are many types of networks depending on the type of data you have. For example, for analyzing images you would use a CNN, a Convolutional Neural Network; this is also what's used, for example, in self-driving cars to avoid obstacles. For text, you would use an RNN, a Recurrent Neural Network.
RNNs are the ancestors of today's language models. Back then there was no real handling of word order, and they couldn't deal with long texts: the model would forget the first word by the time it got to the last one, and it ran sequentially. So what happened after that period? Well, although OpenAI opened up shop in 2015, what's considered the game changer in the LLM journey is the introduction of the Transformer architecture, back in 2016 and 2017, by researchers from Google and the University of Toronto. The paper was called "Attention Is All You Need", and it marked the shift from sequential processing to self-attention, bringing features like positional encoding for word ordering, which is very important, and attention, which allows more accurate translation of words. Let's say, for example, you take the English phrase "European Union" and want to translate it into French. French is not gender-neutral like English: it becomes "l'Union européenne", a completely different gender and word order. Attention attaches the right gender link between the words, or flips them, based on the training data. You also get self-attention, which captures the underlying meaning of the words: grammar, gender- and tense-aware, context-aware synonyms, all that kind of stuff. And this is on top of handling huge datasets, gigabytes and terabytes of them. So the Transformer was the big breakthrough that laid the groundwork for the subsequent groundbreaking models, up to GPT, today's models, and beyond. There's a link in the slide if you want to learn more and dive deeper into the timeline, but the bottom line is that the Transformer is why we're using ChatGPT today. I hope that was informative enough and gave you a glimpse of the timeline.

Now let's talk about what AI is great at. It is of course very good at pattern recognition (spotting the signal in a lot of repetitive noise), translation (turning noise into signal), and prediction (predicting the next token, for example). What's funny is that the problems we usually struggle with as SREs on Kubernetes clusters are also pattern recognition, translation, and prediction. For example, consider failures from your workloads, your nodes, your backend, and how to actually spot those patterns. We also face the challenge of finding out how something went wrong and why, which can be very complex in Kubernetes clusters, and of correlating those errors throughout the stack. We also have trouble forecasting and doing capacity planning. Let's say you have a StatefulSet, just as an example, and you hit a PVC failure because something went wrong in another layer, like the storage. Take a managed cluster in the cloud, OCI for instance, though it could be any managed cluster: that PVC issue could be caused by the wrong storage class, or by the block volume being formatted with the wrong file system (ext4 instead of XFS) for the PV through the CSI plugin, or because somebody just went to the console and changed a parameter of the file storage that was dynamically created by the CSI plugin.
Any of these can be the source of your issue. But how do we correlate all of this today? Oftentimes it's painful, because it involves very deep knowledge of every layer of the stack. Once you pass the euphoria of day zero after deploying a cluster and moving your applications onto your platform, with time you realize how painful it is to troubleshoot clusters in production. The more clusters grow in size and complexity, the more difficult spotting the root cause of issues becomes, especially when multiple components are added every day by different teams: ingress, storage, services, CRDs, operators, you name it. Each time you add a Helm chart or an operator, things can break. Of course, the first thing you do is check the error messages or traces, but they're often cryptic; you wouldn't even know where the problem is coming from while sifting through piles of logs. The thing is, many Kubernetes issues are Linux issues under cover: sometimes you have ulimit issues, systemd config issues, OS issues. And the sheer volume of logs, events, and metrics that needs to be analyzed in Splunk or any other logging system makes remediation painfully long, which slows down your operations or even causes longer downtime, as we've all experienced in the past.

To make matters worse, the best SREs you have hold the knowledge of how every piece works, but they can never put it into documentation: it's too immense, they don't have the time, or it's based on a decade of experience. So it stays in their heads forever and is never shared with the junior SREs. Imagine how challenging onboarding can be, or the first day on call for a junior: pretty challenging, I must guess. And, as shown here, once a failure happens we have only a limited window to filter the signal from the noise (and God knows we have a lot of noise sometimes) to get clarity. This is the kind of conundrum SREs find themselves in during troubleshooting. What's interesting, though, is that there is always a pattern that can be extracted from these use cases throughout the lifecycle of the cluster or of the workload. So what if we could codify those repeatable conditions for those failure modes, in order to get closer to a best-guess, best-effort type of advice? What can we do about this?

Fortunately for us, some people started to work on the problem, and this inspired Alex Jones and a bunch of maintainers to create K8sGPT. So what is K8sGPT, really? Simply put, it's a tool that scans your cluster, diagnoses and triages your issues, and enriches them using LLMs, or AI backends, even if you're not a seasoned expert. Think of it as a guru that filters the relevant info from the noise on your behalf and comes up with advice to fix your problems. Sounds like magic, right? In a nutshell, it's an open source project that was donated to the CNCF a couple of years ago, with more than 92 contributors and 6,000 stars on GitHub (feel free to give it a star too). It allows fast triage with AI-driven health and issue analysis. The best part: it has SRE experience codified into 19 analyzers, which we'll talk about in the next slides, and it converts complex signals into understandable suggestions.
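To contrast the two approaches: the traditional first pass on that PVC example is a pile of kubectl spelunking like the one below, which is exactly the loop K8sGPT tries to short-circuit with a single command. The resource names, namespaces, and labels here are placeholders I made up for illustration; only the kubectl verbs themselves are standard.

# Where is the PVC stuck, and what does the provisioner say?
kubectl describe pvc data-mydb-0 -n prod

# Any events from the CSI driver, the scheduler, or the kubelet?
kubectl get events -n prod --sort-by=.lastTimestamp

# Does the storage class actually match what the workload expects?
kubectl get storageclass
kubectl describe storageclass oci-bv

# And finally the CSI controller logs, hoping the error is in there somewhere
kubectl logs -n kube-system -l app=csi-controller --tail=100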
That's really the best part: you get suggestions that are understandable and ready to use as a solution. It also supports over 12 AI endpoints (I'm saying 12, but the list has probably grown since), so think of pretty much any AI endpoint you can imagine: OpenAI, LocalAI, Cohere, Hugging Face, and the other cloud AI offerings such as Azure OpenAI, Amazon Bedrock, Google Vertex AI, and OCI Generative AI. The whole package. And last but not least, security, because security is important: K8sGPT also integrates with scanning tools like Trivy for CVE scanning, which is pretty cool, and it's integrated with Prometheus as well. I'll talk more about plugins in the commands slides. As I said, K8sGPT gives Kubernetes superpowers to everyone, because it makes the process so easy, through a one-liner instead of sifting through so many commands, that it becomes a no-brainer for anyone who's stuck and needs help with Kubernetes, no matter how junior they are. That can save you tons of hours, for sure.

Key features of K8sGPT. The bottom line to remember is that everything is built around "work smarter, not harder". We already said it has codified SRE knowledge, the guru part that knows what to look for and where to look for it, and the AI cuts through the noise and gives you the suggestion. But that's not all, because K8sGPT is only paired with AI to improve the text output; the real secret sauce is the built-in analyzers. There are 19 of them, covering most Kubernetes resources, as you can see here: ReplicaSet, Service, Ingress, StatefulSet, Node, NetworkPolicy, Pod, Deployment, PVC, you name it. All of those are analyzers, also called filters, that allow K8sGPT to pull the information it needs about those primitives from the API server. And the good thing is that you can even build new custom analyzers yourself if you want to, because that's supported, and you can contribute them back to the project. That's the cool part.

Then we have anonymization. Usually, even when you troubleshoot with a vendor, they let you sanitize the content: you don't want to send your sensitive data to your vendor, let alone to OpenAI or Cohere. So K8sGPT offers a way to mask the data. As you can see here, we have an error reported during the analysis, a StatefulSet with its full name; that's the original, unmasked data. When you specify anonymization, K8sGPT masks that name before it's packaged into the payload and sent to the AI backend. Once the backend receives it, inference runs on that input, the backend sends the information back, and K8sGPT processes the solution and unmasks the data before returning it to the user. As you can see here, the final output returned to the user carries the right name, so the user knows exactly which resource we're talking about. That's the cool part, but be aware that with some analyzers not everything is masked, unfortunately; some resources, such as Events, ReplicaSets, Pods, and PVCs, aren't, which might be problematic in some situations.
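For reference, here is roughly how the filters and the anonymization show up on the command line. This is a sketch based on the k8sgpt CLI flags as I know them (--filter, --explain, --anonymize), so double-check them against the version you installed.

# List the built-in analyzers (a.k.a. filters) that ship with k8sgpt
k8sgpt filters list

# Analyze only StatefulSets, enrich the findings with AI,
# and mask resource names before the payload leaves your machine
k8sgpt analyze --filter StatefulSet --explain --anonymize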
So what option do we have to solve that problem? Well, I do have one, and that's just me, but you can go fully private using local LLMs: you can run an inference server inside your own data center, and then you don't have to worry about sending sensitive information to OpenAI or Cohere, because everything stays within your perimeter. That's more safety.

We also have remote caching. Remote caching lets you offload the cached data to a remote storage service like S3 or Azure Blob Storage for persistence, and also to relieve your nodes from storage pressure. I'm talking specifically about cases where you have K8sGPT installed as an operator: you move that storage off your nodes just to free up a bit of space. And what is this data? I first asked myself what kind of data can be cached and offloaded: it can be the analysis results, the diagnostic data, and K8sGPT's config data, specifically for the in-cluster installation.

Then we have the integrations, which I already mentioned. The Prometheus integration allows K8sGPT to leverage real-time metrics to give more insight and more context to the AI backend so it can come up with the best solution; those metrics are scraped from your Prometheus server. You can also leverage Trivy through its integration, so you can scan the cluster for potential vulnerabilities and for misconfiguration cases as well. I could actually have listed other plugins too, because it also supports Kyverno and KEDA, just so you know. The way to activate an integration is a simple command, k8sgpt integration activate trivy, and right after that you can run a filters list and see two new entries. They're not really analyzers, because they're linked to the plugin that was integrated, which is why they appear with "(integration)" in parentheses. With them you can run two new types of analysis: a VulnerabilityReport for CVEs, and a ConfigAuditReport. That might prove very useful if you want to do everything in one place.

And of course the LLM/AI backends: you can see the list here, and it's very exhaustive. If you find a backend that isn't listed or supported, feel free to open an issue or reach out to the team, but to be honest I'm pretty happy with this list; if I want to choose an AI backend any day, or change it, I'm sure I'll find it there. Let me share an example, say OCI Generative AI from Oracle Cloud. What you need is a credential that gives you API access to your cloud provider. Then the second command is k8sgpt auth add: you specify the type of backend, which is oci, and since you'd already have a model provisioned, you specify the model ID and the compartment ID as well (a compartment is like a resource group in Azure or a project in GCP). It's really straightforward. Right after that, you can use K8sGPT to analyze the issues to be solved, adding the explain flag and the backend flag to specify which backend you want to use to get your solutions.
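As a rough sketch of those two steps, activating the Trivy integration and registering OCI Generative AI as a backend would look something like this. The OCIDs are placeholders, and the exact flag spellings for the OCI backend are as I recall them from the docs, so verify against your k8sgpt version before copy-pasting.

# Activate the Trivy plugin, then confirm the two extra (integration) filters appear
k8sgpt integration activate trivy
k8sgpt filters list

# Register OCI Generative AI as a backend (OCIDs below are placeholders)
k8sgpt auth add --backend oci \
  --model ocid1.generativeaimodel.oc1..xxxxx \
  --compartmentId ocid1.compartment.oc1..xxxxx

# Ask that backend for AI-enriched analysis
k8sgpt analyze --explain --backend oci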
In terms of installation, it's again very straightforward and cross-platform. If you want to install on Red Hat Enterprise Linux or Fedora, there are RPMs available; you can run a brew install if you're on macOS; you can install on Ubuntu through a Debian package; and you can even install it on Windows, or inside WSL on Windows. So it's pretty ubiquitous in terms of installation. All you need after that is access to your Kubernetes cluster, with a context set up through your kubeconfig. After installing K8sGPT, if the backend you want requires an API key, you'll need to create one, as you can see here. I just chose OpenAI, but it's pretty much the same everywhere; they now recommend project API keys, for security purposes and to keep track of all the keys you create. Once you've created that key, you run k8sgpt auth add and specify the backend. You actually don't need to specify the OpenAI backend because it's the default one: if you just run k8sgpt auth add, it will ask you for the OpenAI key by default without you even naming OpenAI. Once this is done, you can see that the status of the backend is now active, which means you can use it to explain on top of the analysis: you run k8sgpt analyze and add the explain flag plus the backend flag. Again, OpenAI doesn't require --backend openai because it's the default; if you don't specify any backend, it will go straight to your OpenAI backend if it's defined.

One more thing to note: once you've installed K8sGPT and added a backend, a configuration file is stored in your user directory. On Linux, for example, it's under the home directory at .config/k8sgpt/k8sgpt.yaml. And one thing you need to know is that the API key is stored there in plain text. That's one of the things I noticed: there is currently no encryption to protect that type of secret. So you have to be mindful of this, maybe tighten the privileges on that file, or run K8sGPT with a dedicated user. And again, I always have my own option, which is to just go fully private and use local LLM inference within your data center. When you do that, you don't need an API key at all; you just have SSL configured and a firewall, and you're safe on that front: no API key to store, no secret sprawl.

Now, the CLI is straight to the point: no fluff, just what you need. In terms of commands, it's only about ten beautiful, easy-to-remember commands, so let me explain a few useful commands and flags so we can focus on what matters. The first one we just described is auth: it lets you add, list, and remove AI backends, and after that you can get inference from your backend when you run K8sGPT. The second one is analyze. Analyze just finds problems within your Kubernetes cluster and provides you with a list of issues that need to be resolved. Think of it as events or logs on steroids: it shows you the issues that need to be resolved, but it will not give you solutions yet.
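To put the installation and first run together, here's a minimal sketch for macOS/Linux. The Homebrew tap is the one the project documents; adjust for your platform, and treat the chmod as a suggested workaround for the plain-text key rather than an official step.

# macOS: install via the project's Homebrew tap
brew tap k8sgpt-ai/k8sgpt
brew install k8sgpt
# (On RHEL/Fedora or Ubuntu you'd grab the RPM or .deb from the releases page instead.)

# Register the default backend (OpenAI); you'll be prompted for the API key
k8sgpt auth add

# The key is stored in plain text, so at minimum restrict the config file to your user
chmod 600 ~/.config/k8sgpt/k8sgpt.yaml

# First pass: list the issues k8sgpt finds, no AI call yet
k8sgpt analyze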
Why no solutions? Because at this point you're not connecting to your AI backend yet; right here it's just the list of issues that K8sGPT surfaces through analyze. After that you have the filter flag. We already said there are 19 analyzers, and you can specify which analyzer you want to run, because sometimes you have a multi-cluster environment, or many namespaces and different teams, say 20 teams with thousands of resources of different types across the cluster. You don't want to just run analyze on everything, especially if you add explain, which would be crazy; you want to run it per resource type and be a bit more specific, both to reduce the time it takes and to reduce the amount of output you get back. Because if your cluster is filled with issues, you'd get three, four, five pages of output, and that's not what you want.

Now, this one is special, because this is where the magic happens: explain, or -e, is the explain-through-AI flag. This is the flag that triggers the communication with your AI backend, and from this point you actually start to pay. If you added your API key from OpenAI or Cohere and you run analyze with explain, that's where you talk to the AI backend and get the enriched info. Before that you don't pay anything, but at this point the meter starts running, because it's an API call to your AI backend. After that you have the backend flag, obviously, because sometimes, good for you, you happen to have Cohere, Anthropic, maybe even Grok defined, and you can choose which backend to use. The default is OpenAI, but nothing stops you from using another one, and one of the advantages is that you're effectively building your own benchmark based on the quality and accuracy of the solutions provided: you can compare OpenAI with Anthropic and see which one answers better depending on the use case. You can also think of it as a second opinion. It's like when you go to a doctor and the doctor says "we need to amputate", and you think, you know what, I might need another opinion, so you ask another doctor just to be sure. That's also a very interesting feature.

We talked about integration, which lets you activate, remove, or list the integrations, Trivy or whatever plugins are available. Then we have -d, or with-doc. This one is cool because it's a no-brainer: sometimes you just want to attach the documentation, so whatever filter or analyzer you choose, you add -d and it pulls in the documentation related to the resource you're trying to fix. Instead of googling, you have it right there in the same command. Anonymize we already covered: it lets you mask your sensitive data. And then we have namespace. I already talked about filter, which filters by resource type, but how about filtering by namespace? Sometimes you have so many namespaces in a cluster because there are so many teams, but maybe you're actually responsible for just one namespace.
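Putting filter, explain, and backend together, the "second opinion" workflow described above looks roughly like this, assuming you've already registered both backends; the backend names are just examples.

# Cheap pass: list problems without calling any AI backend
k8sgpt analyze --filter Pod

# Same scan, now paying for inference through the default backend (OpenAI)
k8sgpt analyze --filter Pod --explain

# Second opinion: same question, different backend
k8sgpt analyze --filter Pod --explain --backend cohere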
Say there was an issue in a namespace and you want to check it, or get some solutions for a resource in that namespace: that's your go-to flag. Actually this is my favorite combo; namespace plus filter is the best you can think of, because it makes your search much more specific and granular, and it's also faster at getting you the answer. This one is new: the stats flag, which is only available since version 0.3.43, so it's a very recent addition. It displays statistics for each analyzer run, so you can see how long it took to run the analyzer for this or that type of resource. You can see here that some of them take milliseconds, others seconds to minutes, depending on your workload. It might be useful if you're trying to troubleshoot the K8sGPT analysis itself at some point, which is kind of cool.

All right, in terms of architecture there are two. I already talked about the CLI: you can install it on any operating system, but it's entirely process-based, it lives on the host where you run it, not inside the cluster. It's a binary, as you can see here. Within the binary you have the analyzers, and when you run analyze with a Pod filter, one of the analyzers is chosen. That analyzer fetches a bunch of primitives from the API server related to its resource type, as you can see here: the pod list, events, status, and so on. All of this is gathered and compacted into built results, so it can be packaged and sent as a payload to your AI provider. So the analyzer fetches the data, organizes it, and packages it; you get the built results, which is what a plain analyze shows you; and if you added explain, all that information is sent to the AI provider, inference runs on that input, and at the end the enriched results come back with the solutions. That's the main loop of K8sGPT with AI, and at the end the user gets access to those final suggestions.

But there's also the K8sGPT operator architecture. Why? Because this is what happens when you install it in-cluster; this is what we call the in-cluster installation. I'm just trying to show you the components: we have the K8sGPT config (I'll explain the K8sGPT config object in a moment), the K8sGPT operator, and then the deployment with the analyzers. This has access to the API server, as I explained for the binary version, and it also has access to whatever integrations are available, like Trivy or Prometheus. So it will scrape the metrics from Prometheus, you can run Trivy scans through K8sGPT, and then the analyzers build the results, send them to the AI for inference, and bring back the analysis results. One thing you need to know is that there is a prerequisite sequence for installing the K8sGPT operator. First you install the operator through its Helm chart; this does nothing except install the operator. Then you need to create a secret to store the API key of your AI provider. Why? Because unlike with the CLI, you no longer add it interactively.
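Here's a compact sketch of those last two ideas: the namespace/filter/stats combo on the CLI, and the first two operator steps (Helm chart, then the provider secret). The chart repo, release name, namespace, and secret name follow the operator's documentation as I remember it, so treat them as assumptions.

# CLI side: scope the scan to one namespace and one resource type,
# and print per-analyzer timings (--stats is available from roughly v0.3.43)
k8sgpt analyze --filter Pod --namespace broken --explain --stats

# Operator side, step 1: install the operator via its Helm chart
helm repo add k8sgpt https://charts.k8sgpt.ai/
helm repo update
helm install release k8sgpt/k8sgpt-operator \
  -n k8sgpt-operator-system --create-namespace

# Step 2: store the AI provider's API key as a Kubernetes secret
kubectl create secret generic k8sgpt-sample-secret \
  --from-literal=openai-api-key=$OPENAI_TOKEN \
  -n k8sgpt-operator-system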
Unlike with the CLI binary, everything is stored within the cluster, so the API key has to live in a secret. Right after that, you can install the config object. I mentioned the K8sGPT config: this object is a K8sGPT custom resource, just a manifest you apply, and it lets the operator create a deployment and a bunch of CRDs. If you want to upgrade, you just change the K8sGPT version value in that manifest and it upgrades for you automatically. The operator then gives you primitives that hold the suggestions returned from your AI model: instead of getting the result when you run a CLI command, it is now stored in special custom resources called Results. These Result resources can even be sent to a Grafana dashboard, as you can see on the left side; if you enable it in your Helm chart you get a dedicated dashboard, which is super nice, because instead of running kubectl to get the results you can read them in a web browser.

I also want to show you the format of that kind of resource. Here's a simple example: we deployed a pod with a fake, wrong image name, and all of this happened automatically, as opposed to the CLI where we have to run something ourselves. The finding was sent directly to the AI, which came back with suggestions, as you can see in the solution: check the image name, ensure the image exists, and so on. It's basically a simple ImagePullBackOff issue, but it shows you that these Result resources are created automatically and pile up, ready to be accessed, instead of you running a command separately. This next view just shows that the K8sGPT deployment has pods where the analyzers run and communicate with the API server and the inference API. There's also a way to do remote caching and to send notifications to Slack, so it's pretty well integrated, to be honest. I was going to play a GIF, but there's no way I can play it here, so this is the last frame: you can see two resources, the operator's controller-manager and the deployment that appears when you apply the K8sGPT config.

Demo time! Sorry again if I took too long, but I really wanted you to have enough background on the tool, the challenges, the AI side, and every feature and use case. Now let's have some fun and run a few commands. Right now in my terminal, one of the prerequisites is having k8sgpt installed, and I also have a kubeconfig already set up: I have access to a managed cloud cluster on OCI, and my k8sgpt version is 0.3.48, the most recent one. Now we can start to play with it. One of the first commands I need to run is k8sgpt auth list. You can see that there's a default OpenAI entry, but nothing is configured yet.
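For completeness, before going further with the demo, here is roughly what that K8sGPT config object and the resulting Result resources look like. The apiVersion and field names follow the operator's sample manifests as I recall them and can differ between versions, so check against the chart you actually installed.

# Apply a minimal K8sGPT custom resource; the operator turns this into a deployment
cat <<'EOF' | kubectl apply -f -
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-sample
  namespace: k8sgpt-operator-system
spec:
  ai:
    enabled: true
    backend: openai
    model: gpt-3.5-turbo
    secret:
      name: k8sgpt-sample-secret
      key: openai-api-key
  noCache: false
  version: v0.3.48   # bump this field to upgrade k8sgpt in-cluster
EOF

# The AI suggestions then pile up as Result custom resources
kubectl get results -n k8sgpt-operator-system
kubectl get results -n k8sgpt-operator-system -o yaml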
Now, if I run auth list in the other terminal, the one where I already configured my backends, I have OpenAI defined and Ollama defined as well. If I wanted to enable something new, I could enable Ollama, because I actually have a bunch of models running locally: they're all quantized models running on my laptop, which only has a tiny graphics card, but they still run and I can use them with K8sGPT easily. So what can we do? In my repo's README you can see there's a way to add the new backend, so I'm just copying it here. What it means is: I'm choosing Ollama as the backend, I'm choosing the model, which is llama3 latest, and I'm also specifying the endpoint. I run it, and that's as straightforward as it gets. If I do auth list a second time, you can see that Ollama has been enabled; Ollama is now active.

But let's get back to the kind of analysis we can run. A simple example, nothing crazy: I have a pod with a deliberately wrong image name, and I tried to deploy it in a specific namespace I called "broken", just to be sure. If I get the pods in there, we can see we already have an ImagePullBackOff. So the first thing I can do is run the analysis here: I specify the backend, which is Ollama, and it runs. You can see it's taking its time, because obviously it's running locally, and I'm sharing my screen, so I think it's capping out a bit. But I can also run the same thing calling OpenAI, because I happen to have that backend active too. See, this one was super fast because it was OpenAI. It gives you the analysis part, and then here it gives you the solution, the same thing you've seen earlier. I didn't want to do much more, but in the meantime the last thing I can run is the vulnerability report. I can run it and try to explain it; I can use OpenAI to explain it as well, but it might take some time. Without the explanation, it just lists all the CVEs that exist in my cluster; you can see how many we have. But if I run the inference on it to get the suggestions, this is the kind of CVE output you get, with the solution. The green part is the response from Ollama: this one is Ollama, and we get more or less the same thing. So this is an answer from a local open source model, and that one was from OpenAI. They differ depending on the issue, and of course the quality is better when you pay for OpenAI, but it shows that you can easily run local inference if you want to manage your expenses: use a local model for specific issues and OpenAI for others. This is just an example. Take, for instance, the CVE ending in 45337; I think it's about Log4j or something like that.
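For anyone who wants to replay the demo, here's the approximate sequence of commands. The Ollama model name and URL are whatever your local inference server exposes (11434 is Ollama's default port), the image name is deliberately bogus, and flag spellings should be checked against your k8sgpt version.

# Register the local Ollama endpoint as an extra backend
k8sgpt auth add --backend ollama --model llama3 --baseurl http://localhost:11434
k8sgpt auth list

# Reproduce the failure: a pod with a wrong image name in its own namespace
kubectl create namespace broken
kubectl run broken-pod --image=nginx-does-not-exist:latest -n broken
kubectl get pods -n broken          # shows ImagePullBackOff after a moment

# Ask each backend for its take on the same problem
k8sgpt analyze --filter Pod --namespace broken --explain --backend ollama
k8sgpt analyze --filter Pod --namespace broken --explain --backend openai

# And the Trivy-powered CVE review, with and without AI suggestions
k8sgpt analyze --filter VulnerabilityReport
k8sgpt analyze --filter VulnerabilityReport --explain --backend ollama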
If I go look that CVE up directly, it's really hard to see what needs to be done. But the advantage here is that K8sGPT leverages Trivy, sends the payload to the AI, and your LLM just gives you solutions. As you can see here: update Apache Log4j to version 2.19, consider using a more secure logging library, and so on. That's the kind of information it can give you, whether it's analyzing resources based on issues, as I said, or based on vulnerabilities and misconfigurations. I think I'm done with the demo; I don't want to do more than this because it's already a lot. The newest project the team has is called Schednex. Sorry about the name, but what is Schednex? It's a Kubernetes scheduler that uses insights from K8sGPT to make intelligent decisions about where to place your workloads, instead of the default scheduler scoring that exists in the vanilla version of your scheduler. You can use Schednex to place your workloads smartly. Maybe you don't have any topology key defined, for example; Schednex will dynamically help schedule the resources based on recent events, something that isn't always available to the default scheduler, because it relies on fairly basic scoring. So that's the latest project that is also integrated with K8sGPT; feel free to give it a look. In the meantime, thank you everyone. I know it was a lot. If you want to reach out to me: again, I work at Cloudthrill, and you can reach me on Twitter or LinkedIn as well. This is my GitHub repo. Thanks everyone, it was amazing to be with you. Appreciate it.

Kosseila Haddalene

Founder @ cloudthrill



