Conf42 Site Reliability Engineering (SRE) 2024 - Online

Journey to Next-Gen AIOps Powered by eBPF and GenAI

Abstract

Dive deep into the journey to Next-Gen AIOps! Beginning with foundational steps, such as observability, we progress to advanced AI-driven problem-solving powered by eBPF and GenAI, aiming to shift the paradigm from reactive to predictive and beyond, to zero-touch operations.

Summary

  • Michele is a site reliability engineer at Accenture. The topic of the talk is the journey to next-gen AIOps powered by eBPF and GenAI.
  • Anastasia Archangelskaya is a cloud advisory manager at Accenture. She discusses topics on AIOps and observability. Anastasia will take you on a voyage and explain the various steps of the journey. There will be a custom demo before wrapping everything up.
  • SREs aim to provide a unified view of the entire IT landscape through observability. The idea is to let AIOps do the correlation and highlight critical incidents for faster resolution. The key question: why do we need AIOps and observability for site reliability engineering?
  • The AIOps journey goes from reactive monitoring to zero-touch operations, passing through full-stack contextual observability. In the final stage, automated systems powered by AI take over routine tasks and decision-making processes. This is not an overnight process, and it is an iterative one.
  • eBPF stands for extended Berkeley Packet Filter. It is a technology that can run sandboxed programs in a privileged context. After the enablement steps towards AIOps, the talk dives deep into some concrete use cases.
  • AI can help prevent a feature deployment to an unstable or faulty production environment. eBPF enables context awareness on a more cloud-native level. If we want to reach zero-touch operations, we need more than that.
  • The demo is based on a DevSecOps use case. The goal of the pipeline is to deploy a containerized application, after which the collected eBPF data is fed to a GenAI copilot, which provides suggestions in case any vulnerabilities are found. The entire end-to-end process is visualized.
  • The AIOps enablement steps start from reactive monitoring and gradually add observability, contextualization and automation. The North Star is represented by zero-touch operations, where systems automatically resolve issues before failures occur. The final takeaway from the session: start simple and scale fast.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, and thank you for watching our session. The topic of our talk today is the journey to next-gen AIOps powered by eBPF and GenAI. My name is Michele, and today I will be presenting together with my colleague Nastia. I am a site reliability engineer at Accenture. In the past years, I've been focusing on various SRE-related topics such as AIOps, observability and chaos engineering. Those are also topics which I have presented on various other occasions as a public speaker. Over to you, Nastia. Hello everyone. My name is Anastasia Archangelskaya, and I'm working as a cloud advisory manager at Accenture, helping our customers with topics on AIOps and observability. Thank you, Nastia. So let's start by taking a quick look at what's on our agenda today. We start with the motivation: why AIOps, what do we solve with AIOps, and why do site reliability engineers need it? Nastia will then take you on a voyage and explain the various steps of the AIOps journey. After a short intro into eBPF, we will jump into a few cool AIOps use cases, followed by a custom demo before wrapping everything up. Okay, I suggest we get started. Why do SREs need AIOps? Here we see multiple challenges that we have identified in operations across different industries, so let's pick a few. A very common one in big organizations is complex and siloed IT environments, especially when talking about multi-cloud environments, which we aim to resolve by providing a unified view of the entire IT landscape through observability. Then we have incident resolution time. With traditional operations, it simply takes too much time to identify the issue and to manually fix it, and by the time it's resolved, the impact of the failure might have already spread. Then we have alert fatigue. Perhaps an organization has already set up a level of monitoring, but if it doesn't smartly correlate and analyze failures, errors and problems, the team will end up being spammed with many irrelevant alerts. So the idea is to let AIOps do the correlation, highlight critical incidents for faster resolution, and basically reduce the overall noise. As you can see, there are many challenges ahead. The reason for that is that systems today are becoming more and more complex and more and more critical. The figure to the right illustrates exactly this: due to heavy complexity, it's impossible to see what's going on with our system, because we are lacking a level of visibility. Okay, why do SREs need AIOps? In the beginning, I already mentioned that I am foremost a site reliability engineer, and when dealing with the type of challenges that I mentioned just now, or when architecting solutions to these challenges, I always tend to frame everything from an SRE perspective. So I always refer to SRE best practices such as measure everything, learn from failure, and most importantly, automate everything, or at least automate what makes sense. So the question here is: why do we need AIOps and observability for site reliability engineering? As mentioned, as SREs we tend to automate as much as possible, since the goal is always to reduce toil and improve key KPIs, such as mean time to detect, mean time to resolve, and so on. But in order to achieve that, we first need access to end-to-end insights, and this is why, as a prerequisite for AIOps, we also need to enable observability, which Nastia will soon explain in detail. So this diagram that you're seeing right now illustrates the big picture.
So how do we as SREs position AIOps and observability in the greater context? Today we're going to talk about the AIOps journey from reactive monitoring to zero-touch operations. So imagine yourself on a ship, and your ship is sailing through the sea of IT. On the ship you have the data that you gather from your main sensors: the data about speed, about fuel, about engine health. Those things are metrics, and those metrics are gathered by the vigilant lookout team, who react and inform the responsible team when something goes wrong. Here we're talking about reactive monitoring. The next stage would be observability. At the observability stage, in addition to the metrics that we are gathering, the lookout team will also look into the journal logs of the captain, and they will also follow the crew members, see what they are doing, and document what they are doing. That would be the traces, and the journal of the captain would be the logs. The combination of all three (metrics, traces and logs) is the magic triangle of observability. This triangle gives you an understanding of the behavior of the system, so it helps you to understand not only what is happening, but also why it is happening. And this is very important if you want to understand the root cause: not only what went wrong, but why exactly it went wrong, which is essential when you want to know how to fix it. The next stage would be full-stack contextual observability. At this stage, we are not only looking at the ship itself and gathering information from the ship itself, but we are looking into the wider context: the business context, the purpose of the ship, where we are sailing, why we are sailing, the depth of the sea, our route. If we are traveling from Italy to Spain, then we probably need one amount of supply, but if we are traveling from Italy to Spain via Alexandria, then we need another amount of fuel and supply. So we correlate all the information that we gather from the context in which the ship is sailing and from the ship itself, and we combine it to get a bigger picture and a better understanding of the behavior of our system. In the next step, we are talking about intelligent or predictive observability. Based on all the data that we are gathering, looking both into the context and into the ship itself, and based on this historical data, we can predict what will be happening, understanding and seeing the failure before it happens. That's our goal here, and this proactive approach allows us to do strategic planning, minimize disruptions, and optimize the ship's performance. However, all the resolution steps would still be manual, and thus we go into the next stage, which is autonomous or zero-touch AIOps. Here we have our autonomous voyager: automated systems powered by AI take over routine tasks and decision-making processes, the automated navigation system, for example. No one needs to steer the boat, and the ship becomes self-aware and self-healing, responding to challenges without human intervention. This is why we call it zero-touch operations. Here you see the full journey from reactive monitoring, through observability, contextual observability and predictive observability, moving towards zero touch powered by automation. Of course, this is a journey.
This is not an overnight process, and it is an iterative process, as different parts of an organization might be at different stages of maturity on this journey. Also, there is no one-size-fits-all: not every organization requires the same level of automation and maturity as another one. But it is good to know where you are, to make an assessment and to see where you stand on this journey. Before we jump into AIOps use cases, let's have a quick brush-up on eBPF. So what is eBPF? As we can see here, eBPF stands for extended Berkeley Packet Filter, and it is a technology that can run sandboxed programs in a privileged context. So basically, eBPF extends the OS kernel without changing the kernel source, without requiring a reboot, and without causing any crashes. So how does eBPF actually work? Let's look at the figure on the left side. Let's start from the user space, where we have our applications, microservices, networking and various other processes. On the other side, we have the kernel space, at this point entirely decoupled from the application. The application process will at some point make a system call. execve is often used as the example, but there are also others, as we can see from the figure. The system call creates an event, which then calls the eBPF program that we have injected. This way, every time the process executes something on the kernel side, it runs the eBPF program, and this is why we say that eBPF programs are event-driven. So basically, every time anything happens in our application, it triggers a system call inside the kernel, which then calls the eBPF program; the program runs, and the scheduler then starts our application. In this way we strengthen the coupling and context awareness between the kernel and our application sitting in the user space, and the kernel becomes a sort of big brother, as it is able to see everything. Nastia previously covered the various enablement steps towards AIOps, so now we want to dive deep into some concrete use cases. The idea of this part is to give you a flavor of the art of the possible. We will tackle some challenges and pain points, see what the current trends on the market are, and explore solutions that reflect recent state-of-the-art AIOps by leveraging technologies such as GenAI and eBPF. Okay, now that we have refreshed the relevant concepts, let's look at some SRE challenges that are currently very predominant in different industries. There are obviously numerous challenges, but for today we cherry-picked a couple of exemplary ones. Let's start with the first one, in the context of cloud native and Kubernetes: deploying containerized applications. Those who have already worked with Kubernetes and containers know that it's often a bit of a struggle, because it tends to be a black-box approach and it's very hard to debug what's really happening inside a container. So how do we do risk assessment inside containers? How can we be sure that there are no vulnerabilities before we deploy this into production? The question here is: how can we apply AIOps to mitigate such risks? We want to shift left, but how do we do it? When? What is the effort? Does it scale? These are all questions that we need to consider.
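To make the event-driven mechanics described above a bit more tangible, here is a minimal sketch of an eBPF program that traces execve system calls, written with the BCC Python bindings. This is not the tooling used in the demo (that uses Cilium's Tetragon); it is an illustrative example under the assumption that bcc is installed and the script runs with root privileges.

```python
from bcc import BPF

# Tiny eBPF program (C, compiled at runtime by BCC) that fires every time
# a process enters the execve system call and writes a line to the kernel
# trace pipe. This mirrors the "system call -> event -> eBPF program" flow
# described above.
prog = r"""
int trace_execve(void *ctx) {
    bpf_trace_printk("execve observed\n");
    return 0;
}
"""

b = BPF(text=prog)
# Attach our function as a kprobe on the architecture-specific execve symbol.
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="trace_execve")

print("Tracing execve calls... press Ctrl-C to stop")
b.trace_print()  # stream kernel trace output to stdout
```

Running this while starting any container or command on the node prints a line per execve, which is essentially the raw signal that higher-level tools enrich with pod and process context.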
Staying close to the concept of deployments, imagine now several developer teams or SREs deploying various features on a system in production. Now imagine that a new feature has passed all tests and is about to be deployed. However, in the meantime, an issue occurred in production and the system is unstable. Just taking a step back: is it really smart to deploy a new feature on an unstable production environment? Especially if you consider that we have multiple SREs deploying features concurrently, meaning we need a mechanism to prevent these situations. And the final challenge: when deploying something to production, we often forget to enable context-aware reliability from the very beginning of the software development lifecycle, so not just in production, but also in the previous stages. Okay, let's start with the first use case: a classic deployment setup, a DevSecOps pipeline running on Kubernetes and/or cloud native. Then we have our application that we want to deploy. As mentioned earlier, the application runs in a pod with possibly many containers, so we don't know what's happening inside the container. The question obviously arises: what about security vulnerabilities? With eBPF we have visibility into what is happening and what the vulnerabilities might be. On the left-hand side, we have just one singular application that we want to deploy, one user space and one pod. That is something we can monitor relatively easily thanks to eBPF, which feeds this information to my end-to-end SRE mission control, which can trigger actions depending on detected issues. However, what happens if I have much more than one deployment? Imagine a huge and complex microservice deployment. Our situation doesn't change much. Whether we are creating a container or accessing any file or the network (keep track of the bee icon underneath), eBPF is able to scale and see all of this. So eBPF is aware of everything that is going on in the node, and that is why we say that eBPF enables context awareness on a more cloud-native level. This is an example of how we can trigger specific remediation actions based on enriched insights coming from eBPF. Now that we know how to use eBPF to obtain actionable insights on a more cloud-native level, the question arises: how do we prevent, for instance, a feature deployment to an unstable or faulty production environment? We have our site reliability engineer who is about to deploy a new feature onto production. Let's consider our DevSecOps pipeline: as part of the various pipeline stages, we have a step where we filter and analyze relevant eBPF events, and based on this data we can create proactive alerts. But as mentioned before, if we want to reach zero-touch operations, we need more than that. This is where AI comes into play. There are obviously various benefits to placing AI here, but most importantly, based on context- and topology-aware data from numerous sources, including eBPF, we can predict anomalies much more efficiently. Such AI should also be capable of detecting an unstable prod environment and, based on this information, trigger an action that blocks any deployment to production until the environment is stable again. And this is a good example of zero-touch automation, if you remember the last pillar in the AIOps maturity curve that Nastia presented, because we are moving from reactive to predictive and our system is now self-healing, meaning no manual operations are required.
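As an illustration of what such a deployment gate could look like in practice, here is a hedged sketch of a pipeline step that asks a stability endpoint whether production is healthy before allowing the deploy job to continue. The endpoint URL and response fields are purely hypothetical assumptions for this sketch; in the setup described in the talk, that role is played by the eBPF-fed, AI-driven mission control.

```python
import sys
import requests  # third-party; pip install requests

# Hypothetical endpoint exposed by the AIOps / mission-control layer that
# aggregates eBPF events, error-budget status and anomaly predictions.
STABILITY_ENDPOINT = "https://mission-control.example.internal/api/prod-stability"


def production_is_stable() -> bool:
    """Ask the (assumed) stability API whether production is currently healthy."""
    resp = requests.get(STABILITY_ENDPOINT, timeout=10)
    resp.raise_for_status()
    status = resp.json()
    # Assumed response shape: {"stable": bool, "open_incidents": int}
    return status.get("stable", False) and status.get("open_incidents", 1) == 0


if __name__ == "__main__":
    if production_is_stable():
        print("Production looks stable, proceeding with deployment.")
        sys.exit(0)
    # A non-zero exit code fails the CI job and therefore blocks the deployment.
    print("Production is unstable, blocking this deployment.")
    sys.exit(1)
```

A step like this could sit right before the deploy stage of the GitLab pipeline, so the gate is enforced automatically rather than relying on whoever happens to be deploying at that moment.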
GenAI is another technology that has flooded the market, and it is an important part of our AIOps story. So we will now take a look at an SRE copilot powered by GenAI. Imagine that your observability platform on the top left detects an increased response time for a specific service. As a direct consequence, the error budget is burned and the SLO is breached. These are two KPIs, and we keep measure of both of them. The moment the error budget is burned, we trigger two processes, as you can see. One of the triggered processes proactively fires an alert to our SRE teams so that they know what's going on, so that they know the error budget has been burned. But since we have no time to wait, the second one in parallel launches the OpenAI feature, our SRE copilot, as we like to call it, which generates a postmortem and suggests a problem resolution, which is displayed on our SRE mission control dashboard so that the SRE can check the insights suggested by the copilot. This is an example of how we can leverage GenAI in order to obtain much richer insights into our issues. Now that we have an idea of what kind of use cases we are dealing with, let's jump straight into the demo. This demo is based on a DevSecOps use case. We have our DevSecOps CI/CD pipeline, which we have implemented in GitLab. The goal of the pipeline is to deploy a containerized application, so we are in a cloud-native context here. We also deploy a honeypot, where we will execute our eBPF experiment, after which we will use this eBPF data with our GenAI copilot, which will provide suggestions in case any vulnerabilities are found. This entire end-to-end process is then visualized in our end-to-end DevSecOps mission control, and for this demo we have used Dynatrace. Okay, let's start the demo. This is the DevSecOps pipeline that we implemented in GitLab. Imagine that we are trying to deploy a containerized application; here you can see all of the pipeline stages. In the first stage, check, we prepare the deployment of the application to a Kubernetes cluster. Then we have the deploy phase, where I deploy my honeypot with Kubernetes. Then the experiment-eBPF phase: this is where I execute my eBPF experiment by deploying Cilium's Tetragon, which is an eBPF-based security observability tool. Then the AI check: this is the point at which I feed the eBPF data collected in the previous step to my OpenAI SRE copilot, which, based on this data, provides relevant suggestions. And finally I have the cleanup phase, which simply cleans up my environment.
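To give a rough idea of what an AI-check stage like this could do, here is a hedged sketch that reads Tetragon events exported as JSON lines, flags any that mention /etc/passwd, and asks an OpenAI model for remediation suggestions. The file name, the way events are matched, the model name and the prompt are assumptions for illustration, not the exact implementation from the demo; real Tetragon event schemas depend on the tracing policy in use.

```python
import json
from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY to be set

EVENTS_FILE = "tetragon-events.json"   # assumed export: one JSON event per line
SENSITIVE_PATH = "/etc/passwd"


def find_sensitive_accesses(path: str) -> list[str]:
    """Collect raw Tetragon events that mention the sensitive file."""
    findings = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            # Crude check: field names vary by tracing policy, so we simply
            # look for the sensitive path anywhere in the serialized event.
            if SENSITIVE_PATH in json.dumps(event):
                findings.append(line)
    return findings


def ask_copilot(findings: list[str]) -> str:
    """Send the findings to the GenAI copilot and return its suggestions."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice for this sketch
        messages=[
            {"role": "system",
             "content": "You are an SRE copilot. Explain the risk and suggest remediations."},
            {"role": "user",
             "content": "These eBPF/Tetragon events were captured in the pipeline:\n"
                        + "\n".join(findings[:20])},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    hits = find_sensitive_accesses(EVENTS_FILE)
    if hits:
        print(ask_copilot(hits))
    else:
        print("No sensitive file access detected in the eBPF data.")
```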
Okay, I suggest we run the pipeline. As you can see, the first stage has already completed: it prepared the deployment of my containerized application. So now we can launch the deploy stage, and I can see that all pods have been deployed, as well as the honeypot, which we can see printed out here. Okay, let's move to the eBPF phase and launch it. If we look closely at the output, you will notice inside the Tetragon logs that /etc/passwd has been exposed in the container, and this is obviously a vulnerability, which brings us to the next stage. So if I run the GenAI stage, I am now feeding the eBPF data to my copilot, and we can notice that the copilot identified this vulnerability in our eBPF data and is warning us by providing suggestions. For example, as you can read, reading /etc/passwd can pose a security risk as it contains sensitive information, so it may lead to password cracking and other vulnerabilities and problems. So this is the part where my copilot is telling me: hey, be careful, you are trying to execute a deployment to production, but your /etc/passwd is exposed. This is the data that we collected in the previous stage via eBPF, and we are now feeding it to our GenAI copilot, which is alerting us and telling us: careful here, you don't want to deploy this to production. And this is obviously extremely valuable information for our SRE teams. Finally, we have the cleanup stage, which simply cleans up my environment after the pipeline has been executed. Okay, now we have seen how we built our DevSecOps pipeline and how we do the deployment. But the question now is: how do we monitor this? As already mentioned in the beginning, we have built the DevSecOps mission control demo with Dynatrace. What you're seeing here is a Dynatrace dashboard which shows the eBPF events that have been collected: the honeypot heartbeat as well as the trend of eBPF attack events. Now, this has been implemented with a classic Dynatrace dashboard. However, if you want a more custom feel, with Dynatrace AppEngine you can also build your own web application, which is exactly what we did. So this is another version of our mission control. As you can see, we are mapping and tracking the various stages of our GitLab pipeline in real time and reporting relevant analytics: pipeline status, failure ratio, heartbeat. But the most interesting piece of data is the copilot suggestions that we're seeing here. The copilot is warning us about the /etc/passwd vulnerability, which I just explained. So this is, in summary, a great example of how we can leverage GenAI and eBPF to enrich our overall end-to-end insights. Okay, time to wrap things up. The first thing we have seen in this session are the different AIOps enablement steps, starting from reactive monitoring and slowly enhancing observability, contextualization and automation. The North Star is represented by zero-touch operations, where systems are able to automatically resolve the issue before the failure occurs. After that, we looked into several AIOps use cases and a DevSecOps demo, through which we saw how we can leverage GenAI and eBPF to significantly enrich end-to-end insights, which can be of crucial assistance to site reliability engineers. And the final takeaway I would like to point out from today's session: start simple and scale fast. Thank you for watching.
...

Michele Dodic

SRE Associate Manager @ Accenture

Michele Dodic's LinkedIn account

Anastasia Archangelskaya

Cloud Advisory Manager @ Accenture

Anastasia Archangelskaya's LinkedIn account


