Transcript
Welcome everybody, and thank you very much for joining this talk.
This is Chaos Engineering in the Fast Lane: Accelerating Resilience with AI and eBPF. I'm Francesco Sbaraglia. I'm SRE tech lead at Accenture in ISG and observability lead for EMEA. I have the pleasure of having Michaela with me today.
Hi, my name is Michaela, and I'm a site reliability engineer and SME at Accenture. The past years I've been focusing on various SRE-related topics such as AIOps, observability and chaos engineering, which I have also presented on various occasions as a public speaker. So I'm also very excited to present our new topic to you. Therefore, I suggest we don't waste time and kick off right away with the agenda. We will start off by providing an overview of today's burning challenges and trends in the world of chaos engineering.
We will start with a list of pain points that we observe in the industry and then provide solutions for each one of them, step by step. We will then introduce an interesting technology, eBPF, and we will show you how we can leverage it to enhance our chaos engineering practices, especially when it's combined with the power of OpenAI, or AI in general. As always, we will be presenting a custom demo in order to put into action everything that we've learned, before finally wrapping up with the conclusion and takeaways. Furthermore, keep in mind that in this talk the focus is on security observability rather than security itself.
Here is an overview of the overall benefits that chaos engineering brings
to the table. As mentioned in the beginning,
both Francesco and I are site reliability engineers
and therefore it's only natural for us to formulate all these statements from an SRE perspective.
Having said that, the first point is more
of a cultural one. Without the knowledge and best practices
of chaos engineering, there is little anyone can do.
Therefore, the first step is to bring awareness and upskill
your SRE resources based on the latest trends and best practices.
So as it says, bring them up to speed.
Okay, so you're asking yourself,
what is the ultimate goal of chaos engineering? What is the purpose
of running a chaos experiment several times?
I'm quite sure it's not just an attempt to destroy
our system, but rather to test and discover its
blind spots, which are revealed via chaos experiments.
Once the vulnerabilities are revealed, we can start
implementing approaches that will reduce unknown behavior or anomalies.
So the goal here is to predict the unexpected and
in order to do that, we first need to make our system observable.
Obviously, once we have implemented mechanisms to discover
and resolve potential failures, we implicitly reduce
mean time to detect and mean time to resolve.
Once a chaos experiment has detected a blind spot in our system which could lead to a potential failure, we are able to determine its root cause, and therefore we also reduce the finger pointing that often occurs between teams in organizations. In such a case, the root cause is clear, and so is the team in the organization that needs to take care of it. Therefore, with chaos engineering we also introduce better visibility and improve the overall communication between SRE teams.
Okay, that was a high-level overview. We will now deep dive into the various challenges and provide a solution for each one of them. Let's start with the first one. As already mentioned, without observability I am unable to identify the issues that are impacting my system. Basically, I'm blind, especially in the case of connectivity issues. The question is: how can I then troubleshoot my Kubernetes clusters if I don't have the proper level of visibility, if I don't have the tools to look into my Kubernetes cluster? Think in general about containerized applications: what is really going on inside a container? There could be vulnerabilities, for instance a container trying to read something that it shouldn't have access to. In that case, how do we conduct a risk assessment? That's the question that we want to answer with this challenge.
In order to answer this question, we first need to introduce an interesting piece of technology. So let's start talking about eBPF. What is eBPF? eBPF stands for extended Berkeley Packet Filter. It can run sandboxed programs in a privileged context. What does that mean? It means it can make the kernel programmable. Furthermore, eBPF comes from the open source community. There are a lot of enterprise companies interested in and actively using eBPF, such as Meta, Google, Netflix and other big names. Also, keep in mind that eBPF is not really the new kid on the block. It already existed before, but with a different focus: eBPF originally had its roots in the networking area. It was used as a performance analyzer for networks, for example to detect malware and whatnot. But as we will see throughout this session, eBPF can be extended to much, much more. So basically, eBPF extends the OS kernel without changing the kernel source, without requiring a reboot, and without causing any crashes.
Let's now see how eBPF actually works. I suggest we start from the user space, where we have our applications, microservices, networking components and various other processes. Then we have the kernel space, at this point in time entirely decoupled from the application. At some point the application process will make a system call, for instance execve. The system call creates an event, which in turn calls the eBPF program that we have injected. This way, every time the process executes something on the kernel side, it runs the eBPF program. This is a great feature, as we can use eBPF to understand the system calls exactly as they are triggered in prod. Meaning, with eBPF we can replicate and detect a real incident as if it had occurred in prod. This is super helpful, because when an incident occurs in prod, I can then track and understand all system calls and replicate them more accurately in a chaos engineering experiment. Basically, every time anything happens in our program, it runs a system call inside the kernel, which calls the eBPF program, and then the scheduler starts our application. This is why we say that eBPF programs are event-driven, and this is how we increase awareness from the kernel side. Overall, eBPF tools have the ability to instrument the system without requiring any prior configuration changes. In this way, we strengthen the coupling and context awareness between the kernel and our application sitting in the user space. And remember, the kernel becomes a sort of big brother, because from this point onwards it is able to see basically everything.
Okay, so we learned about the basics of eBPF and its benefits, but the question remains: what actually is an eBPF program, and what does it look like? So the question we want to ask ourselves now, from a technical perspective, is how we load our eBPF program into the kernel. Let's start, for example, by taking a look at the Python code that compiles our eBPF program. The first thing we need is to load the eBPF library, which we need in order to move the program into the kernel. I then have my hello world program. So I load the eBPF program and attach it to the kernel via the execve system call. Remember, this system call will invoke the hello function. And this is how we get our program into the kernel. So if we look at the right-hand side: every time a new program runs on this virtual machine, my hello eBPF program will be triggered from inside the kernel. And this is how we increase awareness from the kernel side.
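The exact code from the slide isn't captured in this transcript, but a minimal sketch of such a loader, using the BCC Python library (the program text and the hello function name are illustrative), could look like this:

```python
from bcc import BPF

# The eBPF program itself, written in restricted C. It runs inside the
# kernel every time the probe below fires and writes to the trace buffer.
prog = """
int hello(void *ctx) {
    bpf_trace_printk("Hello from the kernel!\\n");
    return 0;
}
"""

b = BPF(text=prog)  # compile the program and load it into the kernel

# Attach the hello function to the execve system call via a kprobe, so it
# fires whenever any new program is executed on this machine.
syscall = b.get_syscall_fnname("execve")
b.attach_kprobe(event=syscall, fn_name="hello")

b.trace_print()  # stream the kernel trace output to the terminal
```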
Now that we have introduced eBPF, let's see how we can use it to improve our overall chaos engineering practice.
Consider a classic deployment structure: a CI/CD pipeline running on Kubernetes and/or cloud native infrastructure. Then we have an application that we want to deploy, and the application is running on a pod with possibly many containers. And let's say that we don't know what's happening in the containers as of now. Let's say that we now use a chaos engineering platform to inject a chaos experiment into our user space. So the question is: what about security vulnerabilities? How can we detect them? How can we detect that a chaos experiment is running in my Kubernetes cluster? I believe you can already anticipate the answer. With eBPF, we have visibility into what is happening and what the vulnerabilities might be, and therefore also visibility into the chaos experiment we just injected. Keep in mind that on the left-hand side we just have one single app that we want to deploy, one user space and one pod. That is something that we can monitor relatively easily, as I can see what is happening with my app thanks to eBPF, which also feeds this information to my end-to-end mission control. The mission control launches a trigger in case of any issues detected.
However, if we look at the right-hand side, we have a much more complex environment. We have much more than just one deployment; imagine a huge and complex microservice deployment. So what does that change? Not much, really, because when creating a container or accessing any file or network (pay attention to the little eBPF icons next to these squares), eBPF is attached to all of these, and eBPF is able to scale and see all of this. eBPF is aware of everything that is going on on the node and can help us reproduce any disruptive application and platform behavior. This is why we say that eBPF enables context awareness on a more cloud native level. Now that we have thoroughly analyzed the role of eBPF, let's move on to the next challenge.
We can now leverage eBPF to gain even better visibility into our system. How do we do that? We can use eBPF to collect even more events and metrics, which we can use to trigger the big red stop button. For those who are unfamiliar with the concept of the big red stop button: it is a metaphorical term used in chaos engineering to indicate an imaginary stop button which aborts the chaos experiments if we observe that things start to go wrong and we want to prevent further damage to our system. Interestingly, this term is inspired by an actual red stop button used in machine production: there is always a big physical red button there to abort any operations if things start going wrong. Nevertheless, going back to our approach, the goal is to make use of eBPF insights to generate automation actions.
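To make this concrete, here is a minimal sketch of such an automation action (the names and thresholds are illustrative; in practice the metric would be derived from eBPF events and the abort call would go to your chaos platform's API):

```python
import random
import time

ERROR_RATE_LIMIT = 0.05  # abort when more than 5% of requests fail

def read_error_rate() -> float:
    # Placeholder: in practice, query a metric derived from eBPF events
    # (for example, exported by Hubble or Tetragon) from your monitoring.
    return random.random() * 0.1  # simulated value for this sketch

def press_big_red_stop_button(reason: str) -> None:
    # Placeholder: call your chaos engineering platform's abort API here.
    print(f"BIG RED STOP BUTTON: aborting experiment ({reason})")

for _ in range(60):  # watch the running experiment for up to five minutes
    rate = read_error_rate()
    if rate > ERROR_RATE_LIMIT:
        press_big_red_stop_button(f"error rate {rate:.1%} over limit")
        break
    time.sleep(5)
```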
Let's take a good look at eBPF and at the benefits we can get from it, which we can use to enhance our chaos engineering approach. Firstly, no application changes: this also means no prior configuration changes, no need to change the kernel space, no need to reboot, and it doesn't cause any crashes. Secondly, eBPF sees all activities on the node, whether it's one deployment or more, one container or more, as we saw previously. The point is, eBPF always scales. Then we have the basis for security observability: eBPF increases context awareness, provides a deeper and more cloud native level of visibility, and makes it easier to detect and respond to security threats and vulnerabilities, which we obviously wouldn't be able to do at such a deep level without eBPF, for example seeing what is happening at the container level. Otherwise, we would be totally blind.
Finally, this data can be used to generate metrics and events, which are then used as input to our AI prediction in order to generate actionable insights. I mean, in the end, that's the whole point: why are we collecting all of this data, with or without eBPF, if we're not going to make use of it to generate some predictions? So, to summarize this slide: with eBPF, we don't need dozens or hundreds of tools to have better control of our experiments. eBPF does it all. And with this, we conclude our second challenge.
Thank you so much, Michaela. Let's have a look and move on to the next challenge.
The challenge: designing and injecting security chaos experiments within Kubernetes. How are we going to solve this? Yes, we can use eBPF. eBPF will help us design new security chaos experiments and will help us understand the behavior of an application under a security attack. We can also try to understand what the steps are at the different levels, and what the sequence of steps is for trying to breach and gather data from our cluster, or even for using our cluster to carry out other cyber attacks. What we can see here is the classical attack surface of a Kubernetes cluster. We know that in a Kubernetes cluster we have a master node, with components like etcd; we have a control plane; and we have worker nodes, each with a kubelet that can be accessed via API. And then, last but not least, we have our pods where our applications run.
Of course, at each different level we can have different vulnerabilities and a different attack interface. On the right-hand side you can see typical cyber attacks. For example: a workload can run on the same node; a privileged pod can be running; I can escape the pod, reach other pods, or try to inject malicious code. We can also have a malicious webhook, which calls outside of our cluster and tries to pull out other information. Another example can be gathering tokens from outside, or reading tokens that we are actually not allowed to read. In these cases, we need to understand how we can capture all of this data. Of course, we already have our security tooling and our SIEM monitoring this behavior, so we know exactly what can happen, and it can be really bad.
And then, at the bottom right, we see the top three vulnerabilities for Kubernetes attacks; let's take just three of them. Maybe I misconfigure my container, or maybe a malicious container image is run, which will try to escape or to gather data from someone else. But what we will focus on later is especially the unintentional cluster misconfiguration: a misconfiguration inside our cluster where one of the pods enables us to carry out other attacks or obtain other information. And here we are trying to understand how eBPF helps us do the discovery, but also to understand the sequence of steps and maybe replicate it in another experiment.
What you see here: what we can do is use eBPF's capability of creating network policies; in this case we use Cilium's eBPF. What are the benefits this gives us in the end? The first is that we can create better network experiments, because now I don't need any other external software; I can use eBPF to create the network experiment. I can isolate my pod, and of course I can create rules. As we will also see later, I have Hubble, which tells me the internal connections and the connections to the outside, and I can take a look there to troubleshoot if I have a problem. I can also run experiments in a service mesh; this helps me run other, more complex experiments, maybe crossing different clouds or different Kubernetes clusters. Multicluster experiments come in when we have designed our architecture as a federation of Kubernetes clusters; in that case, I can try to understand what will happen in the end if I isolate one region. And we always start with the question: what if? It's also future proof because, as you will see in the live demo later, when I do a node I/O resource exhaustion, I can now use eBPF in a more active way. We were already using it before for classical network performance or network observability, but in this case we create our eBPF module and inject these resource faults ourselves. Of course this is a bit more invasive than before because, with eBPF, as Michaela explained earlier, we move away from our user space and run in the kernel space. For network resource exhaustion, I can now do a bit more: I can use BGP, I can run other experiments with a bit more complexity. They are all just YAML files, and you can also read the source for these network policies to see how they work and how easy they are to apply. We will show one of them later, along with how we do the network troubleshooting.
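To give an idea of what these YAML policies look like, here is a minimal sketch of a CiliumNetworkPolicy (the labels and names are illustrative, not taken from the demo) that isolates a pod so it only accepts ingress from the frontend:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: isolate-checkout
spec:
  endpointSelector:        # the pods this policy applies to
    matchLabels:
      app: checkoutservice
  ingress:
  - fromEndpoints:         # only allow traffic coming from frontend pods
    - matchLabels:
        app: frontend
```

Because Cilium enforces this with eBPF in the kernel, applying and removing such a policy is itself a cheap way to run a network isolation experiment.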
Last but not least is pod resource exhaustion. I can run and use performance testing from eBPF, and here I can understand what will happen when a pod consumes all the CPU, all the memory, all the disk. In this case I can even prevent damage, because I can also act by blocking something. So in the end, eBPF removes the need to kill or delete the pod and to deploy any other new tools. eBPF is already installed and running in my kernel, so I can trigger whatever I want, and I don't need to restart anything in the end. Okay, now let's move on.
The next challenge is actually the last one before our demo: getting started with the very first chaos engineering experiments. That's always a question we get: where do we start, how do we start, how do we reduce this toil in the end? What we thought about is creating a chaos engineering copilot powered by generative AI. It gathers all the data that I have, like my postmortem data, my incident data, and any documentation I have about the architecture, and we can use this to generate our first experiment. We often don't know how to create the first experiment, the first hypothesis, or which steps are most effective for our usage; the copilot solves this for us. As we will see later, in the end it's a script, and it's super easy: we integrate it via our CI/CD pipeline, and every time we deploy something new, it starts to run and tries to generate experiments based on various data. Let's have a look at the architecture. As you can see, it's a really simple architecture. On the left side, our SRE uses the copilot, maybe manually the first time, and later it moves inside the CI/CD pipeline as just a Python script. We will also look at the Python script in the next slide. But what's interesting is that in this case we can generate a hypothesis, and we can also make use of historical data. What is this historical data in the end? Usually with eBPF we extract a lot of data, the data we mentioned before about application behavior and platform behavior, and what we can do is provide this as context to our generative AI.
We can also use data that comes in automatically from our observability tools. That would be real-time data, and it can be used as a stop button or as a trigger for the next experiments.
In fact, here we already see that we use AI for three different purposes. The first is to understand unknown patterns: not just using generative AI to generate something new, but also using AI to predict which patterns are the most effective while also being easy to use without a lot of risk. Second, we analyze the application behavior based on this eBPF data: we can try to understand the different steps, or the different system calls the application is making, and we can simulate and replicate them in a controlled way. And last but not least, based on our historical attacks, the ones we had before, and our postmortem reports, we can predict the areas where the next incident may occur, but also the areas where we should concentrate our experiments, because maybe we want to improve the classical MTTD and MTTR. So, in fact, what AI in combination with generative AI does for us is first of all answer the questions: what is the cause, which applications or components can be affected, and can we predict the behavior? And as you see here, there is a cycle: every time we finish one of the experiments, that experiment is used by the next run of generative AI, which improves and makes a better hypothesis, or creates new hypotheses looking at different areas. In this way, we can automate it completely. And these are not only the classical chaos experiments; of course, we can also extend this to security chaos experimentation, since with eBPF we are also collecting security and auditing data.
Okay, with that we have in fact also tackled the last challenge. But before moving on, let's have a look at this small example that we built with generative AI. This is a small script; we just use the OpenAI service. What you can see is a bit different from the classical approach. First of all, we are creating a system role: this is classical prompt engineering, and you can see that we are embodying an expert on chaos experiments. In the second step, we bring in more context, like more data: maybe the historical data, maybe past incidents, maybe some postmortem reports we have available, and maybe live data from our observability tools. This of course gives more context to the generation of the hypothesis. And the last command is about creating a hypothesis for an experiment, step by step, for a specific service. The system will of course generate the hypothesis for us, but it will also generate the possible experiments that we can use. We always need to validate before we run, so we really encourage you to do a dry run first and validate the output.
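A hedged sketch of what such a script can look like with the OpenAI Python SDK (the model name, prompts and context data below are illustrative, not the exact code from the slide):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# In practice this context would be pulled from postmortems, past
# incidents and observability tools; this string is a stand-in.
historical_context = """
Past incident: checkout service latency spike during node I/O saturation.
Postmortem finding: missing CPU limits on the payment deployment.
"""

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model choice
    messages=[
        # System role: classical prompt engineering, embodying an expert
        {"role": "system",
         "content": "You are an expert in chaos engineering experiments."},
        # Additional context: historical and observability data
        {"role": "user", "content": historical_context},
        # The actual request: a step-by-step hypothesis for one service
        {"role": "user",
         "content": "Create a hypothesis for a chaos experiment on the "
                    "checkout service, with step-by-step instructions."},
    ],
)
print(response.choices[0].message.content)
```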
But this is a really good starting point, and with that we can move on to today's demo.
Okay, let's move on to the demo. What we will see in the architecture is the classical boutique shop. We have a boutique shop with a frontend and a couple of services: a frontend service, a checkout service, payment services and so on. You can see that our target is in fact the checkout service. The view that you see here is from the Hubble UI. In the Hubble UI we can see the connections inside the cluster: you see there are a lot of TCP connections from inside and from outside, but also how the different services call each other. Okay, here we first have a look at our retail shop. This is the one that we saw before in the architecture, the classical one, where I can add something to the cart, maybe go shopping again, and last but not least place the order, and my order is placed. Under the hood, of course, there is the whole application making the calls that I just triggered: when I put something in the cart, the frontend makes a call to the shipping service and the cart service, and finally to the checkout and payment services. A super easy demo. Let's jump into our experiments.
So the first thing that you can see here: I want to attach to one of the pods and services that I'm running, so I take a look at what I have. As we said, we want to look at the checkout service, and I point at my checkout service here. Before I handle my checkout service, here is what we did: we deployed Tetragon in this case. Tetragon is listening to all eBPF events, and we will see what we can achieve. So first of all I list the pods, and the first thing I can do is enter my pod. Now I'm inside my container, and what I can do here, for example, is just print a simple message. In this case, you see, nothing is happening on the Tetragon side.
But the interesting part comes when I apply a tracing policy. Imagine that I want to understand the behavior and replicate the same behavior as an experiment. I prepared some policies, so I will apply one of them, the eBPF one. This one will try to catch some of the libraries that are loaded, to understand whether my container is doing something that is not allowed. The second one I want to start is one I created about capabilities: if my pod tries to use any capability, this will create an event that gets triggered in the end. So if I now run the message, it uses some of the capabilities, and next we will see this being caught by Tetragon. And I can do something a bit more.
I will apply the next one, on processes. In fact, every process in this case will be monitored and will create an event inside eBPF. And now imagine: here I'm attached to the one that is the load generator. In this case, I immediately see that there is a system call. So now I can go a bit deeper: I'm looking at all the system calls that this service is making, and I can see that it is Locust that is running these system calls.
Maybe I enter the pod again as a bad guy. What I can do is try to read the password file and the list of users we run with, and also try to understand whether I can do some privilege escalation, or whether I can run as a different user that is left open. And in this case, you see, every action will automatically create an event inside Tetragon.
Of course, in this case I'm consuming Tetragon via the command line, but I can also export the events and use them in another way. Now we will run one of the experiments, a super easy one that will just increase the CPU of my pod a bit. So I pick one of the pods and run one command, and you see immediately that this also generates an event. It looks a bit different, because in this case it's a process rather than a real system call, and now I'm generating load. So imagine that this is a normal experiment: I understand when my experiment really starts, and I understand what the behavior of my application is. Here I'm attaching to and listening on a basic application, which of course is for demo purposes, but if I know that there is a specific process that will need the resources, what I can do is attach, listen, and try to understand whether the behavior of this process changes based on the load that I'm injecting.
Of course, I can do a bit more: we can go into security observability, where I also know the steps, because the steps I'm running in this case form a cycle, and from them I can now generate a better experiment. So I will stop this now, and what I will do is run this message; mind that you need the privileges to run this command in the end. And what you see here: I have this command that is first of all running and starting the eBPF program that I loaded, and next what I see is all the events that get created by my call. Last but not least, as promised, we want to understand which connections we have, and whether we have problems with connections.
So what I created is a tracing point that I can attach to eBPF. This will help me understand the connections that are made: whether the connections are successful, but also which TCP connections may be broken and not going in the right direction, or maybe just slow.
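As a rough illustration, a Tetragon TracingPolicy of this kind, modeled on the project's documented tcp_connect example (not necessarily the exact policy used in the demo), looks like this:

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: tcp-connections
spec:
  kprobes:
  - call: "tcp_connect"   # kernel function to hook (not a syscall)
    syscall: false
    args:
    - index: 0            # first argument: the socket being connected
      type: "sock"
```

Once applied, Tetragon emits an event for every TCP connection attempt on the node, which is exactly what we inspect next.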
So let's have a look. First of all, I want to see the policies that I already have running in this case: I see that I created three tracing policies. What I want to add is the last one, the one that lists all TCP connections. Let me apply it, and you will now see that it logs all the events about these TCP connections. Yeah, you see.
In this case we have a process that is named Locust. What Locust is doing is opening a TCP connection to another endpoint; in this case we know that it is our frontend, and we know that it is also exchanging some calls, because Locust is used as our load generator and tries to test everything that we have. So it's making all the API calls, and we see all the API calls. But I also see something interesting: there are other calls that are not the standard, normal ones. So now you can imagine that in this case I can script all these calls, or TCP connections, that I have. What I can do is replicate an incident: I can listen in on some of the real production services, try to understand all the events that are generated and all the calls that are made, map them, and give them to AI to understand the behavior of my application. Then I will know exactly which lateral movements are happening, and I can replicate them and create better experiments for the next round.
So, in fact, we used here a combination of different tools, as we saw before. First of all, we instrumented the cluster and added Cilium; we are using version 1.15.0. We are also using Hubble, because we want to observe all the connections, as we saw in the UI before. I am also able to run Hubble from the command line, where it gives me other dimensions, because that data can be used later for other experimentation. And what I'm using in the end is Tetragon. Tetragon creates a JSON file, so I can consume this JSON file in automation, but I can also create metrics and connect this, for example, to my CI/CD pipeline.
So imagine, as always, that we have our customer deployment in a production system, or maybe, before that, in an integration environment. I want to run an experiment, but I also want to know whether the experiment that I created is producing exactly the behavior that I want to have: for example, that my application will not respond after maybe three or four calls, or that the load I injected is the right one. And here, in fact, I can be 100% sure that my chaos experiment is running as expected.
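A hedged sketch of that verification step, parsing Tetragon's JSON event export in a CI/CD stage (the file path, process name and check are illustrative assumptions, not code from the demo):

```python
import json

# Assumption: Tetragon's JSON export is enabled and written to this path.
TETRAGON_EXPORT = "/var/log/tetragon/tetragon.log"

def count_exec_events(path: str, binary: str) -> int:
    # Count process-exec events whose binary name matches the given string.
    count = 0
    with open(path) as f:
        for line in f:
            event = json.loads(line).get("process_exec", {})
            if binary in event.get("process", {}).get("binary", ""):
                count += 1
    return count

# Fail the pipeline stage if the CPU-stress process never actually ran,
# i.e. the chaos experiment did not behave as designed.
if count_exec_events(TETRAGON_EXPORT, "stress") == 0:
    raise SystemExit("chaos experiment did not run as expected")
```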
And on the other side, there is something we can do later: new experimentation will come that uses eBPF directly, where I don't need to inject anything manually anymore. Instead, I can use a different layer: I can go into the kernel space and generate the experiment directly there. That's a bit dangerous, which is why we are still tuning it, and maybe we can show it in the next demo next time. Thank you very much for joining this demo session today. If you are interested, you can also ping us after this talk. I will now hand over to Michaela for the wrap-up. Thank you so much.
Thanks, Francesco, for the fantastic demo. I suggest we wrap things up now. Today we saw how we can leverage AI, GenAI and eBPF to better detect running chaos experiments. Remember, eBPF goes beyond classical observability. eBPF is extremely helpful: when an incident occurs in prod, we can use it to track and understand all system calls and replicate them more accurately in a chaos engineering experiment. Additionally, let's not forget the role of AI, which can be used to significantly enhance threat and anomaly detection. And the final takeaway I would like to point out from today's session: start simple and scale fast. So you don't know where to start from? Well, start from a simple experiment, see how the system reacts, see how it goes, and as you proceed you can scale; you can basically build more and more on top of that. Well, it seems it's time to close the curtains. Thanks a lot for watching, and until next time.