Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, and thank you for watching our session. The topic of
our talk today is the journey to next-gen AIOps, powered by eBPF and
GenAI. My name is Michele, and today I will be presenting
together with my colleague Nastia. I am a site reliability engineer
at Accenture. In the past years, I've been focusing on various
SRE-related topics such as AIOps,
observability and chaos engineering. These are also topics
I have presented on various other occasions as a public speaker.
Over to you, Nastia. Hello everyone.
My name is Nostesiar Hangriska, and I'm working as
a cloud advisory manager at Accenture, helping our
customers with topics around AIOps and observability.
Thank you, Nastia. So let's start by taking a quick look at what's
on our agenda today. We start with the motivation: why AIOps?
What do we solve with AIOps, and why do site reliability engineers need
it? Nastia will then take you on a voyage and explain the various steps
of the AIOps journey. After a short intro
to eBPF, we will then jump into a few cool AIOps use
cases, followed by a custom demo before wrapping
everything up. Okay, I suggest we get started.
Why do SREs need AIOps? Here we see
multiple challenges that we have identified in operations across different
industries, so let's maybe pick a few. A very common one
with big organizations is complex and siloed IT environments,
especially when talking about multi-cloud environments,
which we aim to resolve by providing a unified
view of the entire IT landscape through observability.
Then we have incident resolution time. With traditional
operations, it simply takes too much time to identify
the issue and to manually fix it. And by the time we
fix it, by the time it's resolved, the impact of the failure might
have already spread. Then we have alert fatigue.
Perhaps an organization has already set up a level of monitoring,
but if it doesn't smartly correlate and analyze
failures, errors, problems, the team will end up
being spammed with many irrelevant alerts. So the idea is
to let AIOps do the correlation, highlight critical incidents for faster
resolution, and basically reduce the overall noise.
As you can see, there are many challenges ahead. The reason for that
is that systems today are becoming more and more complex and more
and more critical. So the figure to the right illustrates exactly
this. So due to heavy complexity,
it's impossible to see what's going on with our system because we are lacking
a level of visibility. Okay, why do SREs
need AIOps? In the beginning, I already
mentioned that I am first and foremost a site reliability engineer.
And when dealing with the type of challenges that I
mentioned just now, or when architecting solutions
to these challenges, I always tend to frame everything from
an SRE perspective. So I always refer to
SRE best practices such as measure everything,
learn from failure, and most importantly, automate everything,
or at least automate what makes sense. So the question here is:
why do we need AIOps and observability for site reliability engineering?
So, as mentioned, as SREs we tend to automate
as much as possible, since the goal is always to reduce toil and
improve key KPIs, such as mean time to detect, mean time
to resolve, and so on. But in order to achieve that,
we first need access to end-to-end insights.
And this is why, as a prerequisite for AIOps, we also
need to enable observability, which Nastia will soon explain in detail.
So this diagram that you're seeing right now is illustrating
the big picture: how do we as SREs
position AIOps and observability in the greater context?
And today we're gonna talk to you about the AIOps journey,
from reactive monitoring to zero touch operations.
So imagine yourself on a ship, and your ship is sailing
through the sea of IT.
And on the ship you have the data that you
gather from your main sensors, the data about the speed,
about fuel, about engine health.
Those things are metrics. And those metrics are
gathered by the vigilant lookout team, which reacts
and informs the responsible team
when something goes wrong.
Here we're talking about reactive monitoring.
The next stage would be observability.
At the observability stage, in addition to the metrics
that we are gathering, the lookout team will also
look into the journal logs of the captain,
and they will also follow the crew members,
see what they are doing, and to document what they are
doing. That would be traces and the journal
of the captain would be the logs. And the combination of all
three, metrics, traces and logs, is the magic triangle
of observability. This triangle
gives you an understanding of the behavior of the system,
so it helps you to understand not only
what is happening, but also why it is happening.
And this is very important if you want to understand
the root cause, not only what went wrong,
but why exactly it went wrong,
which is very important when you address and want
to know how to fix it. So the next stage
would be full stack contextual observability.
At this stage, we are not looking only at the ship itself
and gathering information from the ship itself.
But we are looking into the wider context, we are looking
into the business context, the purpose of the ship,
where we are sailing, why we are sailing, the depth of
the sea, we are looking at our route.
And if we are traveling from Italy to Spain,
then we probably need one amount of supplies. But if we are traveling from
Italy to Spain via Alexandria, then we need
another amount of fuel and supplies.
So we gather and correlate all the information that
we gather from the context in
which the ship is sailing and
from the ship itself. We combine it to get
a bigger picture and better understanding of
the behavior of our system. In the
next step, we are talking about intelligent or predictive
observability. So based on all the data that we are gathering,
from both the context and the ship itself,
and based on this historical data,
we can predict what is going to happen,
seeing the failure before it happens.
That's our goal here. And this proactive approach
allows us to have strategic planning,
minimize disruptions, and optimize the ship's
performance. However, all the resolution
steps would still be manual,
and thus we go into the next stage,
which will be autonomous or zero
touch AI Ops. Here we have already
our autonomous voyager. Here,
automated systems powered by AI take
over routine tasks and decision making processes.
The automated navigation system, for example.
No one needs to steer the boat, and the ship
becomes self aware and self healing,
responding to challenges without human intervention.
This is why we call it zero touch operations.
Here you can see the full journey: from reactive
monitoring, through the awareness of observability
and proactive contextual observability, to preventive, predictive
observability, moving towards
zero touch operations powered by automation.
Of course, this is a journey. This is not an overnight
process, and this is an iterative process, as different
parts of an organization might be at different stages of maturity
on this journey. Also,
there is no one size fits all. That means that
not every organization needs to
reach the same level of
automation and maturity as another.
But it is good to know where you are and it
is good to make an assessment and to see where
you are on this part of the journey.
Before we jump into AIOps use cases, let's have a quick brush-up
on eBPF. So what is eBPF?
As we can see here, eBPF stands for extended Berkeley Packet
Filter. And eBPF is a technology
that is able to run sandboxed programs in a privileged context.
So basically, eBPF extends the OS kernel
without really changing the kernel source, without requiring a reboot,
and without causing any crashes. So how does eBPF
actually work? Let's look at the figure on the
left side. Let's start from the user space, where we have our application,
microservices, networking and basically various processes.
On the other side, we have the kernel space, which at this point
is entirely decoupled from the application.
Then the application process will at some point make a system call.
execve is often used, but there are also others, as we can see from the
figure. The system call will create an event, which
then calls the eBPF program that we have injected.
This way, every time the process executes
something on the kernel side, it will run the eBPF program.
And this is why we say that eBPF programs are event-driven. So basically,
every time anything happens in our
program that triggers a system call inside the kernel,
the kernel calls the eBPF program that we attached.
So in this way we strengthen the coupling and context awareness
between the kernel and our application sitting in the user space,
and the kernel becomes a sort of big brother, as it is able to see
everything.
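To make the event-driven idea a bit more concrete, here is a minimal sketch of an eBPF program attached to the execve system call, written with the BCC Python bindings. This is purely illustrative and not the tooling we use later in the demo; it assumes BCC is installed and that the script runs with root privileges.

```python
# Minimal, illustrative sketch of an event-driven eBPF program using BCC.
# Assumes the bcc package is installed and the script is run as root.
from bcc import BPF

# Kernel-side program (C): fires every time any process calls execve.
bpf_text = r"""
TRACEPOINT_PROBE(syscalls, sys_enter_execve) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    bpf_trace_printk("execve called by pid %d\n", pid);
    return 0;
}
"""

b = BPF(text=bpf_text)          # compile and load the program into the kernel
print("Tracing execve syscalls... press Ctrl-C to stop")
b.trace_print()                 # stream the events emitted by the kernel side
```

Every time any process on the node calls execve, the kernel runs this tiny program and emits an event that user space can consume, which is exactly the event-driven behavior described above.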
Nastia previously covered the various enablement steps towards AIOps. So now we want to dive deep
into some concrete use cases. The idea of this
part is to give you a flavor of the art of the possible. So we
will tackle some challenges and pain points, see what the current
trends on the market are, as well as explore solutions that
reflect the recent state of the art in AIOps by leveraging technologies
such as GenAI and eBPF.
Okay, now that we have refreshed relevant concepts,
let's look at some SRE challenges that are currently very predominant
in different industries. There are obviously numerous challenges,
but for today we cherry picked a couple of exemplary ones.
So let's start with the first one.
This is in the context of cloud native and Kubernetes.
So we are talking about deploying containerized applications.
Those who have already worked with Kubernetes and containers
know that it's often a bit of a struggle, because it's
often a black-box approach and
it's very hard to debug what's really happening inside the container.
So how do we do risk assessment inside containers?
How can we be sure that there are no vulnerabilities
before we deploy this into production?
So the question here is, how can we
apply AIOps to mitigate such risks? So we want to
shift left, but how do we do it? When? What is the effort?
Does it scale? These are all questions that we need to consider.
Staying close to the concept of deployments, imagine now several developer
teams or SREs deploying various features on a
system in production. Now imagine that a
new feature has passed all tests and is about to be deployed.
However, in the meantime, an issue occurred in production and the system
is unstable. So just taking a step back, is it
really smart to deploy a new feature on an unstable production environment?
Especially if you consider that we have multiple
SREs deploying features concurrently, meaning we
need a mechanism to prevent these situations. And the final
challenge, when deploying something to production,
we often forget to enable context-aware reliability
from the very beginning of the software development lifecycle.
So not just in production, but also in
the previous stages. Okay, let's start
with the first use case: a classic deployment structure, a
DevSecOps pipeline running on Kubernetes and/or cloud native.
Then we have our application that we want to deploy.
As mentioned earlier, the application is running on a pod with possibly
many containers, so we don't know what's happening inside the container.
So the question obviously arises, what about security vulnerabilities?
Obviously with eBPF we have visibility into
what is happening and what the vulnerabilities might be.
So on the left hand side, we have just one singular
application that we want to deploy,
one user space and one pod. So that is something that
we can monitor relatively easily thanks to eBPF, which feeds
this information to my end-to-end SRE mission control, which can
trigger actions depending on detected issues.
However, what happens if I have much more than one deployment? So imagine
a huge and complex microservice deployment.
Our situation doesn't change much. Whenever a container is created,
or any file or network is accessed,
keep track of the bee icon underneath:
eBPF is able to scale and see all of this. So eBPF
is aware of everything that is going on in the node, and that is why
we say that eBPF enables context awareness on a more
cloud-native level. So this is an example of how
we can trigger specific remediation actions
based on enriched insights coming from eBPF.
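As a rough illustration of what such a trigger could look like, here is a hedged sketch that consumes eBPF security events, assumed here to be Tetragon's JSON output with one event per line, and quarantines a pod when it touches a sensitive file. The field names, the sensitive-path list and the kubectl-based remediation are illustrative assumptions, not the exact mechanics of our mission control.

```python
# Hedged sketch: turn eBPF security events into a remediation action.
# Event field names, the sensitive-path list and the kubectl-based
# quarantine step are assumptions for illustration only.
import json
import subprocess
import sys

SENSITIVE_PATHS = ("/etc/passwd", "/etc/shadow")  # assumed policy

def quarantine(pod: str, namespace: str) -> None:
    # Example remediation: label the offending pod so that a NetworkPolicy
    # (not shown here) can isolate it. Purely illustrative.
    subprocess.run(
        ["kubectl", "label", "pod", pod, "-n", namespace,
         "quarantine=true", "--overwrite"],
        check=False,
    )

# e.g. piped from: kubectl logs -n kube-system ds/tetragon | python this_script.py
for line in sys.stdin:
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue
    kprobe = event.get("process_kprobe", {})          # assumed event shape
    pod = kprobe.get("process", {}).get("pod", {})
    for arg in kprobe.get("args", []):
        path = arg.get("file_arg", {}).get("path", "")
        if path in SENSITIVE_PATHS:
            print(f"Sensitive file access: {path} in pod {pod.get('name')}")
            quarantine(pod.get("name", ""), pod.get("namespace", "default"))
```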
Now that we know how to use eBPF to obtain actionable
insights on a more cloud-native level, the question now arises:
how do we prevent, for instance, a feature deployment to
an unstable or faulty production environment?
So we have our site reliability engineer who is
about to deploy a new feature to production. Let's consider
our DevSecOps pipeline: as part of the various pipeline stages,
we have a step where we filter and analyze relevant eBPF events,
and based on this data we can create proactive alerts.
But as mentioned before, if we want to reach zero touch
operations, we need more than that. This is where AI
comes into play. There are obviously various benefits
for placing AI here, but most importantly, based on context
and topology-aware data from numerous sources, including eBPF,
we can predict anomalies much, much more efficiently.
Such AI should also be capable of detecting an unstable prod environment
and based on this information, trigger an action that
blocks any deployment to production until the environment is
stable again. And this is a good example of zero
touch automation, if you remember the last pillar in the
AIOps maturity curve that Nastia presented, because
we are moving from reactive to predictive and our system
is now self-healing, meaning no manual operations are required.
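To give a feel for how such a deployment gate could be wired into a pipeline, here is a minimal sketch: a pre-deploy job asks an assumed anomaly-scoring endpoint how unstable production currently is and fails if the score is too high. The URL, response schema and threshold are illustrative assumptions, not part of the demo.

```python
# Hedged sketch of a "deployment gate" pipeline step. The endpoint URL,
# response schema and threshold are illustrative assumptions.
import os
import sys
import requests

ANOMALY_API = os.environ.get(
    "ANOMALY_API", "https://aiops.example.com/api/prod-anomaly-score")
THRESHOLD = float(os.environ.get("ANOMALY_THRESHOLD", "0.7"))

resp = requests.get(ANOMALY_API, timeout=10)
resp.raise_for_status()
score = resp.json()["score"]  # assumed: 0.0 (stable) .. 1.0 (highly anomalous)

if score >= THRESHOLD:
    print(f"Production looks unstable (anomaly score {score:.2f}); blocking deployment.")
    sys.exit(1)  # non-zero exit fails the CI job, so nothing gets deployed

print(f"Production looks stable (anomaly score {score:.2f}); deployment may proceed.")
```

Running this as the first step of the deploy stage means the pipeline never ships a new feature onto an environment that the AI already considers unhealthy.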
GenAI is another technology that has
flooded the market, and it is an important part of our AIOps story.
So we will now take a look at an SRE copilot
powered by GenAI.
So imagine that your observability platform
on the top left detects an increased response
time for a specific service. As a direct consequence,
the error budget is burned and the SLO is breached.
So these are two KPIs, and we are keeping
track of both of them. So the
moment that the error budget is burned, we trigger two processes.
As you can see, one of the trigger processes proactively fires
an alert to our SRE teams so that they know what's going on,
so that they know that the error budget has been burned. But since
we have no time to wait, the second one in parallel launches
the OpenAI feature, our SRE copilot, as we
like to call it, which generates a post-mortem and suggests a problem resolution,
which will be displayed on our SRE mission control dashboard
so that the SRE can check the insights suggested by the copilot.
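As a sketch of how this copilot step could be wired, assuming the OpenAI Python SDK (version 1.x), one might feed the alert context into a chat completion and ask for a post-mortem draft plus a suggested resolution. The model name, prompt and alert payload here are illustrative assumptions, not the exact implementation we used.

```python
# Hedged sketch of the SRE copilot step using the OpenAI Python SDK (>= 1.0).
# The model name, prompt and alert payload are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

alert_context = {
    "service": "checkout",  # assumed example service
    "symptom": "p95 response time increased from 300ms to 2.4s",
    "slo": "99.5% of requests under 500ms",
    "error_budget_remaining": "0%",
}

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model would do
    messages=[
        {"role": "system",
         "content": "You are an SRE copilot. Draft a short post-mortem and "
                    "suggest a concrete resolution for the incident below."},
        {"role": "user", "content": str(alert_context)},
    ],
)

# The generated text is what would be shown on the SRE mission control dashboard.
print(response.choices[0].message.content)
```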
So this is an example of how we can leverage
GenAI to obtain much richer insights
into our issues. Now that we have an idea
of what kind of use cases we are dealing with, let's jump straight into
the demo. This demo is based on a DevSecOps use case.
We have our DevSecOps CI/CD pipeline, which we have
implemented in GitLab. The goal of the pipeline is to deploy a
containerized application, so we are in a cloud native context
here. We also deploy our honeypot
where we will execute our eBPF experiment,
after which we will use this eBPF data with our GenAI copilot,
which will provide suggestions in case any vulnerabilities
are found. This entire end-to-end process is
then visualized in our end-to-end DevSecOps mission control. And for
this demo we have used Dynatrace. Okay,
let's start the demo.
So this is the DevSecOps pipeline that we implemented in GitLab.
Imagine that we are trying to deploy a containerized application,
and here you can see all of the pipeline stages.
So in the first stage check, we start by deploying the application
to a Kubernetes cluster. Then we have the deploy phase. Here I
am deploying my honeypot with Kubernetes,
then the experiment eBPF phase. This is where
I execute my eBPF experiment by deploying Cilium's Tetragon,
which is an eBPF-based security observability tool.
And finally, the AI check. This is the point at
which I feed my eBPF data collected in the previous step
to my OpenAI SRE copilot, which, based on this
data, will provide relevant suggestions.
And then finally I have the cleanup phase, which simply cleans up my
environment. Okay, I suggest
we run the pipeline. So as you can see, the first
stage already completed. I prepared the deployment of my containerized
application. So now we can launch the deploy
stage. And I can see here
that all pods have been deployed
as well as the honeypot, which we can see here
printed out. Okay, let's move to the eBPF
phase. Let's launch it.
If we look closely at the outputs here,
yes, exactly, you
will notice inside the Tetragon
logs that /etc/passwd has been exposed in
the container. And this is obviously a vulnerability,
which we address in the next stage. So if I
run the GenAI stage, I'm now feeding the eBPF
data to my copilot. So we
can notice here that the copilot
identified this vulnerability in our eBPF data
and is warning us by providing suggestions. For example, as you
can read, reading /etc/passwd can pose a security
risk as it contains sensitive information, so it may lead to
password cracking and other vulnerabilities and
problems. So this is the part in which my
copilot is telling me, hey, be careful, you are trying to
execute a deployment to production, but your /etc/passwd
is exposed. So this is the data that we have collected in
the previous stage via eBPF, and we are now feeding it to our
GenAI copilot, which is alerting us
and telling us: careful here, you don't want
to deploy this to production. And this is obviously extremely
valuable information for our SRE teams.
Finally, we have the cleanup stage,
which simply cleans up my environment after
the pipeline has been executed.
Okay, so now we've seen how we built our DevSecOps pipeline and
how we do the deployment. But the question now is: how do we
monitor this? So, as already mentioned in the beginning,
we have built the DevSecOps mission control demo with
Dynatrace. What you're seeing here is a Dynatrace
dashboard which shows the eBPF events that have been collected.
So the honeypot heartbeat as well as the trend of eBPF
attack events. Now, this has been implemented
with a classic Dynatrace dashboard. However, if you want a
more custom feel, with Dynatrace AppEngine you can also
build your own web application, which is exactly what we did.
So this is another version of our
mission control. As you can see here, we are mapping and
tracking the various stages of our GitLab pipeline in real time
and reporting relevant analytics, pipeline status,
failure ratio, heartbeat. But the most interesting piece
of data is exactly the copilot suggestions
that we're seeing here. So this copilot is warning us
about the /etc/passwd vulnerability, which I just
explained. So this is, in summary,
a great example of how we can leverage GenAI and eBPF
to enrich our overall end-to-end insights.
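For completeness, here is a hedged sketch of one way such eBPF signals could be pushed into Dynatrace: posting a honeypot heartbeat and an attack-event counter to the metrics ingest API using its line protocol. The tenant URL, token variable, metric keys and dimensions are illustrative assumptions; the demo dashboard could equally be fed through other ingestion paths.

```python
# Hedged sketch: ship a honeypot heartbeat and an eBPF attack-event counter to
# Dynatrace via the metrics ingest API (line protocol). Tenant URL, token,
# metric keys and dimensions are illustrative assumptions.
import os
import requests

DT_URL = os.environ["DT_TENANT_URL"]    # e.g. https://<tenant>.live.dynatrace.com
DT_TOKEN = os.environ["DT_API_TOKEN"]   # token with metrics ingest scope

lines = "\n".join([
    "custom.honeypot.heartbeat,cluster=demo 1",            # gauge: honeypot alive
    "custom.ebpf.attack_events,cluster=demo count,delta=3", # events this interval
])

resp = requests.post(
    f"{DT_URL}/api/v2/metrics/ingest",
    headers={
        "Authorization": f"Api-Token {DT_TOKEN}",
        "Content-Type": "text/plain; charset=utf-8",
    },
    data=lines,
    timeout=10,
)
resp.raise_for_status()
print("Metrics ingested, status:", resp.status_code)
```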
Okay, time to wrap things up. First thing
we have seen in this session are the different AIOps enablement
steps, starting from reactive monitoring and slowly
enhancing observability, contextualization and automation.
The North Star is represented by zero touch operations, where systems
are able to automatically resolve the issue before the failure
occurs. After that, we have looked into
several AIOps use cases and a DevSecOps demo,
through which we saw how we can leverage GenAI and
eBPF to significantly enrich end-to-end insights,
which can be of crucial assistance to site reliability engineers.
And the final takeaway I would like to point out from today's session,
start simple and scale fast. Thank you for watching.