Transcript
Welcome everybody, and thank you very much for joining this talk.
This is Chaos Engineering in the Fast Lane: Accelerating Resilience with AI and eBPF. I'm Francesco Sbaraglia. I'm SRE tech lead at Accenture in ISG and observability lead for EMEA. I have the pleasure of having Michaela with me today.
Hi, my name is Michaela, and I'm a site reliability engineer and SME at Accenture. The past years I've been focusing on various SRE-related topics such as AIOps, observability and chaos engineering, which I have also presented on various occasions as a public speaker. So I'm also very excited to present our new topic to you. Therefore, I suggest we don't waste time and kick off right away with the agenda. We will start off by providing an overview of today's burning challenges and trends in the world of chaos engineering.
We will start with a list of pain points that we observe in the industry and then provide solutions for each one of them, step by step. We will then introduce an interesting technology, eBPF, and we will show you how we can leverage it to enhance our chaos engineering practices, especially when it's combined with the power of OpenAI, or AI in general. As always, we will be presenting a custom demo in order to put into action everything that we've learned, before finally wrapping up with the conclusion and takeaways. Furthermore, keep in mind that in this talk the focus is on security observability rather than security itself.
Here is an overview of the overall benefits that chaos engineering brings
to the table. As mentioned in the beginning,
both Francesco and I are site reliability engineers
and therefore it's only natural for us to formulate all these statements from an SRE perspective.
Having said that, the first point is more
of a cultural one. Without the knowledge and best practices
of chaos engineering, there is little anyone can do.
Therefore, the first step is to bring awareness and upskill
your SRE resources based on the latest trends and best practices.
So as it says, bring them up to speed.
Okay, so you're asking yourself,
what is the ultimate goal of chaos engineering? What is the purpose
of running a chaos experiment several times?
I'm quite sure it's not just an attempt to destroy
our system, but rather to test and discover its
blind spots, which are revealed via chaos experiments.
Once the vulnerabilities are revealed, we can start
implementing approaches that will reduce unknown behavior or anomalies.
So the goal here is to predict the unexpected and
in order to do that, we first need to make our system observable.
Obviously, once we have implemented mechanisms to discover
and resolve potential failures, we implicitly reduce
mean time to detect and mean time to resolve.
Once a chaos experiment has detected a blind spot in our system which could lead to a potential failure, we are able to determine its root cause, and therefore we also reduce the finger pointing that often occurs between teams in organizations. In such a case, the root cause is clear, and so is the team in the organization that needs to take care of it. Therefore, with chaos engineering we also introduce better visibility and improve the overall communication between SRE teams.
Okay, that was a high-level overview. We will now deep dive into the various challenges and provide a solution for each one of them. Let's start with the first one. As already mentioned, without observability I am unable to identify the issues that are impacting my system. Basically, I'm blind, especially in the case of connectivity issues. The question is: how can I then troubleshoot my Kubernetes clusters if I don't have the proper level of visibility, if I don't have the tools to look into my Kubernetes cluster? Think in general about containerized applications: what is really going on inside a container? There could be vulnerabilities, for instance a container trying to read something that it shouldn't have access to. In that case, how do we conduct a risk assessment? That's the question that we want to answer with this challenge.
In order to answer this question, we first need to introduce an interesting piece of technology. So let's start talking about eBPF. What is eBPF? eBPF stands for extended Berkeley Packet Filter. It can run sandboxed programs in a privileged context. What does that mean? It means it can make the kernel programmable. Furthermore, eBPF comes from the open source community. There are a lot of enterprise companies interested in and actively using eBPF, such as Meta, Google, Netflix and other big names. Also, keep in mind that eBPF is not really the new kid on the block. It already existed before, but with a different focus: eBPF originally had its roots in the networking area. It was used as a performance analyzer for networks, for example to detect malware and whatnot. But as we will see throughout this session, eBPF can be extended to much, much more. So basically, eBPF extends the OS kernel without changing the kernel source, without requiring a reboot, and without causing any crashes.
Let's now see how eBPF actually works. I suggest we start from the user space, where we have our applications, microservices, networking components and various other processes. Then we have the kernel space, at this point in time entirely decoupled from the application. At some point the application process will make a system call, for instance execve. The system call creates an event, which in turn calls the eBPF program that we have injected. This way, every time the process executes something on the kernel side, it runs the eBPF program. This is a great feature, as we can use eBPF to understand the system calls exactly as they are triggered in prod. Meaning, with eBPF we can replicate and detect a real incident as if it had occurred in prod. This is super helpful, because when an incident occurs in prod, I can then track and understand all system calls and replicate them more accurately in a chaos engineering experiment. Basically, every time anything happens in our program, it runs a system call inside the kernel, which calls the eBPF program, and then the scheduler starts our application. This is why we say that eBPF programs are event-driven, and this is how we increase awareness from the kernel side. Overall, eBPF tools have the ability to instrument the system without requiring any prior configuration changes. In this way, we strengthen the coupling and context awareness between the kernel and our application sitting in the user space. And remember, the kernel becomes a sort of big brother, because from this point onwards it is able to see basically everything.
Okay, so we learned about the basics of eBPF and its benefits, but the question remains: what actually is an eBPF program, and what does it look like? So the question we want to ask ourselves now, from a technical perspective, is how we load our eBPF program into the kernel. Let's start, for example, by taking a look at the Python code that compiles our eBPF program. The first thing we need is to load the eBPF library, which we need in order to move the program into the kernel. I then have my hello world program. So I load the eBPF program and attach it to the kernel via the execve system call. Remember, this system call will invoke the hello function. And this is how we get our program into the kernel. So if we look at the right-hand side: every time a new program runs on this virtual machine, my hello eBPF program will be triggered from inside the kernel. And this is how we increase awareness from the kernel side.
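The exact code from the slide isn't captured in this transcript, but a minimal sketch of such a loader, using the BCC Python library (the program text and the hello function name are illustrative), could look like this:

```python
from bcc import BPF

# The eBPF program itself, written in restricted C. It runs inside the
# kernel every time the probe below fires and writes to the trace buffer.
prog = """
int hello(void *ctx) {
    bpf_trace_printk("Hello from the kernel!\\n");
    return 0;
}
"""

b = BPF(text=prog)  # compile the program and load it into the kernel

# Attach the hello function to the execve system call via a kprobe, so it
# fires whenever any new program is executed on this machine.
syscall = b.get_syscall_fnname("execve")
b.attach_kprobe(event=syscall, fn_name="hello")

b.trace_print()  # stream the kernel trace output to the terminal
```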
Now that we have introduced eBPF, let's see how we can use it to improve our overall chaos engineering practice.
Consider a classic deployment structure: a CI/CD pipeline running on Kubernetes and/or cloud native infrastructure. Then we have an application that we want to deploy, and the application is running on a pod with possibly many containers. And let's say that we don't know what's happening in the containers as of now. Let's say that we now use a chaos engineering platform to inject a chaos experiment into our user space. So the question is: what about security vulnerabilities? How can we detect them? How can we detect that a chaos experiment is running in my Kubernetes cluster? I believe you can already anticipate the answer. With eBPF, we have visibility into what is happening and what the vulnerabilities might be, and therefore also visibility into the chaos experiment we just injected. Keep in mind that on the left-hand side we just have one single app that we want to deploy, one user space and one pod. That is something that we can monitor relatively easily, as I can see what is happening with my app thanks to eBPF, which also feeds this information to my end-to-end mission control. The mission control launches a trigger in case of any issues detected.
However, if we look at the right-hand side, we have a much more complex environment. We have much more than just one deployment; imagine a huge and complex microservice deployment. So what does that change? Not much, really, because when creating a container or accessing any file or network (pay attention to the little eBPF icons next to these squares), eBPF is attached to all of these, and eBPF is able to scale and see all of this. eBPF is aware of everything that is going on on the node and can help us reproduce any disruptive application and platform behavior. This is why we say that eBPF enables context awareness on a more cloud native level. Now that we have thoroughly analyzed the role of eBPF, let's move on to the next challenge.
We can now leverage eBPF to gain even better visibility into our system. How do we do that? We can use eBPF to collect even more events and metrics, which we can use to trigger the big red stop button. For those who are unfamiliar with the concept of the big red stop button: it is a metaphorical term used in chaos engineering to indicate an imaginary stop button which aborts the chaos experiments if we observe that things start to go wrong and we want to prevent further damage to our system. Interestingly, this term is inspired by an actual red stop button used in machine production: there is always a big physical red button there to abort any operations if things start going wrong. Nevertheless, going back to our approach, the goal is to make use of eBPF insights to generate automation actions.
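To make this concrete, here is a minimal sketch of such an automation action (the names and thresholds are illustrative; in practice the metric would be derived from eBPF events and the abort call would go to your chaos platform's API):

```python
import random
import time

ERROR_RATE_LIMIT = 0.05  # abort when more than 5% of requests fail

def read_error_rate() -> float:
    # Placeholder: in practice, query a metric derived from eBPF events
    # (for example, exported by Hubble or Tetragon) from your monitoring.
    return random.random() * 0.1  # simulated value for this sketch

def press_big_red_stop_button(reason: str) -> None:
    # Placeholder: call your chaos engineering platform's abort API here.
    print(f"BIG RED STOP BUTTON: aborting experiment ({reason})")

for _ in range(60):  # watch the running experiment for up to five minutes
    rate = read_error_rate()
    if rate > ERROR_RATE_LIMIT:
        press_big_red_stop_button(f"error rate {rate:.1%} over limit")
        break
    time.sleep(5)
```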
Let's take a good look at eBPF and at the benefits we can get from it, which we can use to enhance our chaos engineering approach. Firstly, no application changes: this also means no prior configuration changes, no need to change the kernel space, no need to reboot, and it doesn't cause any crashes. Secondly, eBPF sees all activities on the node, whether it's one deployment or more, one container or more, as we saw previously. The point is, eBPF always scales. Then we have the basis for security observability: eBPF increases context awareness, provides a deeper and more cloud native level of visibility, and makes it easier to detect and respond to security threats and vulnerabilities, which we obviously wouldn't be able to do at such a deep level without eBPF, for example seeing what is happening at the container level. Otherwise, we would be totally blind.
Finally, this data can be used to generate metrics and events, which are then used as input to our AI prediction in order to generate actionable insights. I mean, in the end, that's the whole point: why are we collecting all of this data, with or without eBPF, if we're not going to make use of it to generate some predictions? So, to summarize this slide: with eBPF, we don't need dozens or hundreds of tools to have better control of our experiments. eBPF does it all. And with this, we conclude our second challenge.
Thank you so much, Michaela. Let's have a look and move on to the next challenge.
The challenge: designing and injecting security chaos experiments within Kubernetes. How are we going to solve this? Yes, we can use eBPF. eBPF will help us design new security chaos experiments and will help us understand the behavior of an application under a security attack. We can also try to understand what the steps are at the different levels, and what the sequence of steps is for trying to breach and gather data from our cluster, or even for using our cluster to carry out other cyber attacks. What we can see here is the classical attack surface of a Kubernetes cluster. We know that in a Kubernetes cluster we have a master node, with components like etcd; we have a control plane; and we have worker nodes, each with a kubelet that can be accessed via API. And then, last but not least, we have our pods where our applications run.
Of course, at each different level we can have different vulnerabilities and a different attack interface. On the right-hand side you can see typical cyber attacks. For example: a workload can run on the same node; a privileged pod can be running; I can escape the pod, reach other pods, or try to inject malicious code. We can also have a malicious webhook, which calls outside of our cluster and tries to pull out other information. Another example can be gathering tokens from outside, or reading tokens that we are actually not allowed to read. In these cases, we need to understand how we can capture all of this data. Of course, we already have our security tooling and our SIEM monitoring this behavior, so we know exactly what can happen, and it can be really bad.
And then, at the bottom right, we see the top three vulnerabilities for Kubernetes attacks; let's take just three of them. Maybe I misconfigure my container, or maybe a malicious container image is run, which will try to escape or to gather data from someone else. But what we will focus on later is especially the unintentional cluster misconfiguration: a misconfiguration inside our cluster where one of the pods enables us to carry out other attacks or obtain other information. And here we are trying to understand how eBPF helps us do the discovery, but also to understand the sequence of steps and maybe replicate it in another experiment.
What you see here: what we can do is use eBPF's capability of creating network policies; in this case we use Cilium's eBPF. What are the benefits this gives us in the end? The first is that we can create better network experiments, because now I don't need any other external software; I can use eBPF to create the network experiment. I can isolate my pod, and of course I can create rules. As we will also see later, I have Hubble, which tells me the internal connections and the connections to the outside, and I can take a look there to troubleshoot if I have a problem. I can also run experiments in a service mesh; this helps me run other, more complex experiments, maybe crossing different clouds or different Kubernetes clusters. Multicluster experiments come in when we have designed our architecture as a federation of Kubernetes clusters; in that case, I can try to understand what will happen in the end if I isolate one region. And we always start with the question: what if? It's also future proof because, as you will see in the live demo later, when I do a node I/O resource exhaustion, I can now use eBPF in a more active way. We were already using it before for classical network performance or network observability, but in this case we create our eBPF module and inject these resource faults ourselves. Of course this is a bit more invasive than before because, with eBPF, as Michaela explained earlier, we move away from our user space and run in the kernel space. For network resource exhaustion, I can now do a bit more: I can use BGP, I can run other experiments with a bit more complexity. They are all just YAML files, and you can also read the source for these network policies to see how they work and how easy they are to apply. We will show one of them later, along with how we do the network troubleshooting.
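To give an idea of what these YAML policies look like, here is a minimal sketch of a CiliumNetworkPolicy (the labels and names are illustrative, not taken from the demo) that isolates a pod so it only accepts ingress from the frontend:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: isolate-checkout
spec:
  endpointSelector:        # the pods this policy applies to
    matchLabels:
      app: checkoutservice
  ingress:
  - fromEndpoints:         # only allow traffic coming from frontend pods
    - matchLabels:
        app: frontend
```

Because Cilium enforces this with eBPF in the kernel, applying and removing such a policy is itself a cheap way to run a network isolation experiment.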
Last but not least is pod resource exhaustion. I can run and use performance testing from eBPF, and here I can understand what will happen when a pod consumes all the CPU, all the memory, all the disk. In this case I can even prevent damage, because I can also act by blocking something. So in the end, eBPF removes the need to kill or delete the pod and to deploy any other new tools. eBPF is already installed and running in my kernel, so I can trigger whatever I want, and I don't need to restart anything in the end. Okay, now let's move on.
The next challenge is actually the last one before our demo: getting started with the very first chaos engineering experiments. That's always a question we get: where do we start, how do we start, how do we reduce this toil in the end? What we thought about is creating a chaos engineering copilot powered by generative AI. It gathers all the data that I have, like my postmortem data, my incident data, and any documentation I have about the architecture, and we can use this to generate our first experiment. We often don't know how to create the first experiment, the first hypothesis, or which steps are most effective for our usage; the copilot solves this for us. As we will see later, in the end it's a script, and it's super easy: we integrate it via our CI/CD pipeline, and every time we deploy something new, it starts to run and tries to generate experiments based on various data. Let's have a look at the architecture. As you can see, it's a really simple architecture. On the left side, our SRE uses the copilot, maybe manually the first time, and later it moves inside the CI/CD pipeline as just a Python script. We will also look at the Python script in the next slide. But what's interesting is that in this case we can generate a hypothesis, and we can also make use of historical data. What is this historical data in the end? Usually with eBPF we extract a lot of data, the data we mentioned before about application behavior and platform behavior, and what we can do is provide this as context to our generative AI.
We can also use data that comes in automatically from our observability tools. That would be real-time data, and it can be used as a stop button or as a trigger for the next experiments.
In fact, here we already see that we use AI for three different purposes. The first is to understand unknown patterns: not just using generative AI to generate something new, but also using AI to predict which patterns are the most effective while also being easy to use without a lot of risk. Second, we analyze the application behavior based on this eBPF data: we can try to understand the different steps, or the different system calls the application is making, and we can simulate and replicate them in a controlled way. And last but not least, based on our historical attacks, the ones we had before, and our postmortem reports, we can predict the areas where the next incident may occur, but also the areas where we should concentrate our experiments, because maybe we want to improve the classical MTTD and MTTR. So, in fact, what AI in combination with generative AI does for us is first of all answer the questions: what is the cause, which applications or components can be affected, and can we predict the behavior? And as you see here, there is a cycle: every time we finish one of the experiments, that experiment is used by the next run of generative AI, which improves and makes a better hypothesis, or creates new hypotheses looking at different areas. In this way, we can automate it completely. And these are not only the classical chaos experiments; of course, we can also extend this to security chaos experimentation, since with eBPF we are also collecting security and auditing data.
Okay, with that we have in fact also tackled the last challenge. But before moving on, let's have a look at this small example that we built with generative AI. This is a small script; we just use the OpenAI service. What you can see is a bit different from the classical approach. First of all, we are creating a system role: this is classical prompt engineering, and you can see that we are embodying an expert on chaos experiments. In the second step, we bring in more context, like more data: maybe the historical data, maybe past incidents, maybe some postmortem reports we have available, and maybe live data from our observability tools. This of course gives more context to the generation of the hypothesis. And the last command is about creating a hypothesis for an experiment, step by step, for a specific service. The system will of course generate the hypothesis for us, but it will also generate the possible experiments that we can use. We always need to validate before we run, so we really encourage you to do a dry run first and validate the output.
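A hedged sketch of what such a script can look like with the OpenAI Python SDK (the model name, prompts and context data below are illustrative, not the exact code from the slide):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# In practice this context would be pulled from postmortems, past
# incidents and observability tools; this string is a stand-in.
historical_context = """
Past incident: checkout service latency spike during node I/O saturation.
Postmortem finding: missing CPU limits on the payment deployment.
"""

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model choice
    messages=[
        # System role: classical prompt engineering, embodying an expert
        {"role": "system",
         "content": "You are an expert in chaos engineering experiments."},
        # Additional context: historical and observability data
        {"role": "user", "content": historical_context},
        # The actual request: a step-by-step hypothesis for one service
        {"role": "user",
         "content": "Create a hypothesis for a chaos experiment on the "
                    "checkout service, with step-by-step instructions."},
    ],
)
print(response.choices[0].message.content)
```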
But this is a really good starting point, and with that we can move on to today's demo.
Okay, let's move on to the demo. What we will see in the architecture is the classical boutique shop. We have a boutique shop with a frontend and a couple of services: a frontend service, a checkout service, payment services and so on. You can see that our target is in fact the checkout service. The view that you see here is from the Hubble UI. In the Hubble UI we can see the connections inside the cluster: you see there are a lot of TCP connections from inside and from outside, but also how the different services call each other. Okay, here we first have a look at our retail shop. This is the one that we saw before in the architecture, the classical one, where I can add something to the cart, maybe go shopping again, and last but not least place the order, and my order is placed. Under the hood, of course, there is the whole application making the calls that I just triggered: when I put something in the cart, the frontend makes a call to the shipping service and the cart service, and finally to the checkout and payment services. A super easy demo. Let's jump into our experiments.
So the first thing that you can see here: I want to attach to one of the pods and services that I'm running, so I take a look at what I have. As we said, we want to look at the checkout service, and I point at my checkout service here. Before I handle my checkout service, here is what we did: we deployed Tetragon in this case. Tetragon is listening to all eBPF events, and we will see what we can achieve. So first of all I list the pods, and the first thing I can do is enter my pod. Now I'm inside my container, and what I can do here, for example, is just print a simple message. In this case, you see, nothing is happening on the Tetragon side.
But the interesting part comes when I apply a tracing policy. Imagine that I want to understand the behavior and replicate the same behavior as an experiment. I prepared some policies, so I will apply one of them, the eBPF one. This one will try to catch some of the libraries that are loaded, to understand whether my container is doing something that is not allowed. The second one I want to start is one I created about capabilities: if my pod tries to use any capability, this will create an event that gets triggered in the end. So if I now run the message, it uses some of the capabilities, and next we will see this being caught by Tetragon. And I can do something a bit more.
I will apply the next one, on processes. In fact, every process in this case will be monitored and will create an event inside eBPF. And now imagine: here I'm attached to the one that is the load generator. In this case, I immediately see that there is a system call. So now I can go a bit deeper: I'm looking at all the system calls that this service is making, and I can see that it is Locust that is running these system calls.
Maybe I enter the pod again as a bad guy. What I can do is try to read the password file and the list of users we run with, and also try to understand whether I can do some privilege escalation, or whether I can run as a different user that is left open. And in this case, you see, every action will automatically create an event inside Tetragon.
Of course, in this case I'm consuming Tetragon via the command line, but I can also export the events and use them in another way. Now we will run one of the experiments, a super easy one that will just increase the CPU of my pod a bit. So I pick one of the pods and run one command, and you see immediately that this also generates an event. It looks a bit different, because in this case it's a process rather than a real system call, and now I'm generating load. So imagine that this is a normal experiment: I understand when my experiment really starts, and I understand what the behavior of my application is. Here I'm attaching to and listening on a basic application, which of course is for demo purposes, but if I know that there is a specific process that will need the resources, what I can do is attach, listen, and try to understand whether the behavior of this process changes based on the load that I'm injecting.
Of course, I can do a bit more: we can go into security observability, where I also know the steps, because the steps I'm running in this case form a cycle, and from them I can now generate a better experiment. So I will stop this now, and what I will do is run this message; mind that you need the privileges to run this command in the end. And what you see here: I have this command that is first of all running and starting the eBPF program that I loaded, and next what I see is all the events that get created by my call. Last but not least, as promised, we want to understand which connections we have, and whether we have problems with connections.
So what I created is a tracing point that I can attach to eBPF. This will help me understand the connections that are made: whether the connections are successful, but also which TCP connections may be broken and not going in the right direction, or maybe just slow.
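As a rough illustration, a Tetragon TracingPolicy of this kind, modeled on the project's documented tcp_connect example (not necessarily the exact policy used in the demo), looks like this:

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: tcp-connections
spec:
  kprobes:
  - call: "tcp_connect"   # kernel function to hook (not a syscall)
    syscall: false
    args:
    - index: 0            # first argument: the socket being connected
      type: "sock"
```

Once applied, Tetragon emits an event for every TCP connection attempt on the node, which is exactly what we inspect next.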
So let's have a look. First of all, I want to see the policies that I already have running in this case: I see that I created three tracing policies. What I want to add is the last one, the one that lists all TCP connections. Let me apply it, and you will now see that it logs all the events about these TCP connections. Yeah, you see.
In this case we have a process that is named Locust. What Locust is doing is opening a TCP connection to another endpoint; in this case we know that it is our frontend, and we know that it is also exchanging some calls, because Locust is used as our load generator and tries to test everything that we have. So it's making all the API calls, and we see all the API calls. But I also see something interesting: there are other calls that are not the standard, normal ones. So now you can imagine that in this case I can script all these calls, or TCP connections, that I have. What I can do is replicate an incident: I can listen in on some of the real production services, try to understand all the events that are generated and all the calls that are made, map them, and give them to AI to understand the behavior of my application. Then I will know exactly which lateral movements are happening, and I can replicate them and create better experiments for the next round.
So, in fact, we used here a combination of different tools, as we saw before. First of all, we instrumented the cluster and added Cilium; we are using version 1.15.0. We are also using Hubble, because we want to observe all the connections, as we saw in the UI before. I am also able to run Hubble from the command line, where it gives me other dimensions, because that data can be used later for other experimentation. And what I'm using in the end is Tetragon. Tetragon creates a JSON file, so I can consume this JSON file in automation, but I can also create metrics and connect this, for example, to my CI/CD pipeline.
So imagine, as always, that we have our customer deployment in a production system, or maybe, before that, in an integration environment. I want to run an experiment, but I also want to know whether the experiment that I created is producing exactly the behavior that I want to have: for example, that my application will not respond after maybe three or four calls, or that the load I injected is the right one. And here, in fact, I can be 100% sure that my chaos experiment is running as expected.
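A hedged sketch of that verification step, parsing Tetragon's JSON event export in a CI/CD stage (the file path, process name and check are illustrative assumptions, not code from the demo):

```python
import json

# Assumption: Tetragon's JSON export is enabled and written to this path.
TETRAGON_EXPORT = "/var/log/tetragon/tetragon.log"

def count_exec_events(path: str, binary: str) -> int:
    # Count process-exec events whose binary name matches the given string.
    count = 0
    with open(path) as f:
        for line in f:
            event = json.loads(line).get("process_exec", {})
            if binary in event.get("process", {}).get("binary", ""):
                count += 1
    return count

# Fail the pipeline stage if the CPU-stress process never actually ran,
# i.e. the chaos experiment did not behave as designed.
if count_exec_events(TETRAGON_EXPORT, "stress") == 0:
    raise SystemExit("chaos experiment did not run as expected")
```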
And on the other side, there is something we can do later: new experimentation will come that uses eBPF directly, where I don't need to inject anything manually anymore. Instead, I can use a different layer: I can go into the kernel space and generate the experiment directly there. That's a bit dangerous, which is why we are still tuning it, and maybe we can show it in the next demo next time. Thank you very much for joining this demo session today. If you are interested, you can also ping us after this talk. I will now hand over to Michaela for the wrap-up. Thank you so much.
Thanks, Francesco, for the fantastic demo. I suggest we wrap things up now. Today we saw how we can leverage AI, GenAI and eBPF to better detect running chaos experiments. Remember, eBPF goes beyond classical observability. eBPF is extremely helpful: when an incident occurs in prod, we can use it to track and understand all system calls and replicate them more accurately in a chaos engineering experiment. Additionally, let's not forget the role of AI, which can be used to significantly enhance threat and anomaly detection. And the final takeaway I would like to point out from today's session: start simple and scale fast. So you don't know where to start from? Well, start from a simple experiment, see how the system reacts, see how it goes, and as you proceed you can scale; you can basically build more and more on top of that. Well, it seems it's time to close the curtains. Thanks a lot for watching, and until next time.