Abstract
Ensuring reliability within deployment pipelines for complex systems dealing with massive flows of real-time data can be a challenge. However, by enabling Observability across our E2E infrastructure and combining AI techniques with SRE best practices, we can successfully prevent faulty deployments.
Everyone knows that automated testing is at the core of any DevOps pipeline, enabling early detection of issues such as CI/CD platform overload. However, this does not hold for large systems dealing with massive flows of real-time data, where failures are often extremely hard to identify due to the infrastructure’s complexity – especially in multi-hybrid cloud environments.
Imagine this: you’re a DevOps engineer at a major tech giant and you are responsible for the overall health of the system running in prod. Numerous alerts, server crashes, Jira tickets, incidents and an avalanche of responsibilities, which sometimes simply feel like a ticking time bomb. Furthermore, your production environment is constantly being updated with various deployments of new code. While automated tests in CI/CD pipelines make sure that the new code is functioning, they do not cover all the dark corners of your end-to-end deployment platform. Imagine, for instance, that the build/test/deployment service of your CI/CD platform is experiencing performance issues (or worse) at the exact same time you are deploying a brand new feature into live code. How do we prevent a disaster? And what if we are in a rush for a hotfix?
The first step is to establish observability, which focuses on enabling full-stack visibility of your environment. This enables end-to-end insights, which provide transparency into all your internal components and help you understand how they interact with each other, as well as how they affect the overall system. Unlike traditional monitoring approaches, the aim is to understand why something is broken, instead of merely focusing on what is broken. Ideally, we want to shift the paradigm from reactive to predictive.
Once observability has been enabled across our end-to-end environment, we employ AI/ML techniques to pre-emptively alert on, and ultimately prevent, any form of system degradation. We combine the strategies mentioned so far with SRE methodologies (e.g. SLOs, error budgets) when measuring our system’s overall health. This type of approach provides an additional layer of reliability to DevOps pipelines.
For this session, we have prepared and analyzed several use cases, followed main principles, summarized best practices and built a live demo through a combination of Observability and continuous deployment tools.
Transcript
This transcript was autogenerated.
Hello, and thank you for watching today's session on AI-driven DevOps CI/CD pipelines. This video was prepared by Francesco and myself. We're both part of an Accenture group of highly motivated SREs specializing in state-of-the-art SRE and DevOps practices, with the goal of promoting a growth mindset that embraces agility, DevOps and SRE as the new normal. I myself have a background in software engineering, AI and industrial automation, and I specialize in several SRE-related topics such as AIOps, observability and chaos engineering. Thank you very much, Michaela, and welcome, everybody. I'm Francesco Sbaraglia and I'm based in Germany. I'm working as SRE and AIOps tech lead for EMEA at Accenture. I have over 20 years of experience solving production problems in corporate, startup and government environments. Furthermore, I have deep experience in automation, observability, multicloud and chaos engineering. I'm currently growing the SRE and DevOps capability at Accenture. Let's have a look at the agenda for today.
First, we will have a look at why we need to monitor and observe the CI/CD delivery pipeline. Then we will have a look at the classic continuous delivery pipeline architecture. We will refresh observability and how to use OpenTelemetry in this case. Then we move to the AI-driven approach: we will explain what we are doing and how we are using AI. We will have a demo, and Michaela will close with the conclusions and takeaways. Let's move now to the first point: why monitor a CI/CD pipeline and platform? You can imagine that the CI/CD platform is really critical. Let's see our key challenges. SRE and DevOps need a CI/CD platform that is reliable, with predictable failures. SRE and DevOps need a data-driven approach to run proactive capacity management: when and how to increase resources. SRE and DevOps need the CI/CD platform to deploy hotfixes and bugfixes during outages. Now you can imagine how the CI/CD platform becomes really critical. Most of the time the CI/CD platform is a black box, so we want to understand what is happening inside, how we can improve all processes, and also how we can improve each step of the CI/CD pipeline. Thanks, Francesco. Now that
we've introduced the key motivation for today's talk, I suggest we dive into the first major topic, which is CI/CD.
What we see here is the architecture for our CI CD flow integrated
with the observability platform. So let's start from the top left.
Imagine a DevOps engineer who's writing some new code and when he's
done, he commits it to the git repository. Subsequently,
this commit will trigger our CI CD pipeline.
As you can see, we are using Jenkins, which will automatically initialize the pipeline after the commit. Specifically, Jenkins will
initialize the dynamic agent pod which we also see here and which
is dependent on the resources that Jenkins takes from the Kubernetes cluster.
In the next step, the changes will be deployed to the production environment,
ultimately reaching our end users. So far,
everything I've talked about is pretty much standard I would say.
But what we're really interested in is how we integrated the
pipeline with our observability platform. In the upcoming slides we will explain more about the observability platform and how we're using the OpenTelemetry standard to collect data from our Jenkins pipelines.
But right now what we're interested in are the KPIs
that we see on the right hand side that we want to derive from
the data that we're collecting with open telemetry. So I suggest we
analyze them one by one. Let's start with the first one, which is the speed
of the CD pipeline. I want to
stress the CD part here as we're only talking about the deployments.
Now consider the situation in which an outage occurs and the
DevOps engineer needs to deploy a hotfix as soon as possible. We are talking about a critical situation, because the system is unstable and the end user is expecting a fix; otherwise there is a risk that they might abandon the product, and that's something we want to avoid. So as you can see, by measuring the deployment speed we also improve our time to recovery: the quicker I manage to deploy the hotfix, the lower my MTTR will be.
The next KPI that we're deriving is build test success rate.
This one gives us an indication of how many core or unit tests are successfully completed once the code is deployed to production. Furthermore, we also want to count the total number of deployments we have each month, per pipeline and possibly per application. We want to know the lead time for a change or a deployment; this one is relevant for identifying the time between the moment the DevOps engineer executes a commit and the moment the deployment actually takes place. Furthermore, we're also measuring the change success rate, which corresponds to the total number of successful deployments divided by the total number of deployments; some of these might fail, which is why we want to track the success rate. And the last one we see on this slide is the availability of our CI/CD platform, which is obviously a critical KPI.
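To make these KPIs a bit more concrete, here is a minimal Python sketch (not part of the talk's tooling) showing how deployment count, lead time for change, change success rate and MTTR could be computed from a handful of deployment records; the record fields and sample values are hypothetical.

```python
# A minimal sketch, not the speakers' implementation: computing a few of the
# KPIs mentioned above from hypothetical deployment records.
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical deployment records: commit time, deploy time, outcome,
# and (for failed deployments) how long recovery took.
deployments = [
    {"commit": datetime(2023, 5, 1, 9, 0), "deployed": datetime(2023, 5, 1, 10, 30),
     "success": True,  "recovery": None},
    {"commit": datetime(2023, 5, 2, 14, 0), "deployed": datetime(2023, 5, 2, 14, 45),
     "success": False, "recovery": timedelta(minutes=38)},
    {"commit": datetime(2023, 5, 3, 8, 15), "deployed": datetime(2023, 5, 3, 9, 0),
     "success": True,  "recovery": None},
]

# Deployment count: total deployments in the observed window (e.g. per month).
deployment_count = len(deployments)

# Lead time for change: commit-to-deploy duration, averaged, in minutes.
lead_time = mean((d["deployed"] - d["commit"]).total_seconds() / 60 for d in deployments)

# Change success rate: successful deployments divided by total deployments.
change_success_rate = sum(d["success"] for d in deployments) / deployment_count

# MTTR: mean recovery time over the deployments that failed, in minutes.
failures = [d for d in deployments if not d["success"]]
mttr = mean(f["recovery"].total_seconds() / 60 for f in failures) if failures else 0.0

print(f"deployments: {deployment_count}")
print(f"lead time for change: {lead_time:.0f} min")
print(f"change success rate: {change_success_rate:.0%}")
print(f"MTTR: {mttr:.0f} min")
```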
Now, coming back to the big picture of this architecture.
You see, the benefit of collecting this kind of data and feeding it into the observability platform is that the DevOps engineer who does the deployments doesn't even need to use Jenkins to monitor the state of his deployments, since all of this data will be readily processed and available in the observability platform. So that is where our DevOps engineer will go and check the status. Okay, speaking of observability,
to introduce this concept, observability is the measure of how
well internal states of a system can be inferred from knowledge
of its external outputs. This basically translates to understanding how
the various components in our systems are connected to each other.
We pose questions such as: what are the dependencies between components? How do they work together? So in other
words, we are introducing transparency and visibility
across our entire end to end system.
Why? Because our ultimate goal is to derive actionable insights
from it. Here are a couple of notions relevant for observability.
We have different sources of data, illustrated in the center figure, that we need to observe. As we mentioned, observability is the ability to measure a system's current state based on the data that it generates, such as logs, metrics and traces, the so-called golden triangle of observability on the left-hand side, while on the right-hand side we see the golden signals, which consist of latency, traffic, errors and saturation. Okay, before we move on
to the AI side of things, it's also really important to understand
how we collect all of the data we just mentioned. We are using OpenTelemetry, which is an open source standard for generating and capturing what we see here: traces, metrics and logs, so basically our golden triangle. On the right-hand side you see a diagram illustrating how exactly OpenTelemetry works, so I suggest we quickly analyze it. Let's start from the top, where logs, traces and metrics are generated, in our specific case by the pod and the container; from the diagram we see that we're talking about raw data too. In the middle we have the OpenTelemetry collector, which enriches and processes all of this data. Specifically, this enrichment process occurs in a completely uniform manner for all three signals: basically, the collector guarantees that the signals have exactly the same attribute names and values describing the Kubernetes pod they come from. So what happens in our case is that the exporter takes all of this data from Jenkins, so we're talking about system logs, app logs, traces, metrics, et cetera, correlates all of it and forwards it to our backend. The key benefit of this dynamic approach is that we don't have to install anything on each single Jenkins agent. Instead, we simply attach OpenTelemetry to the Jenkins master, and this ensures that we receive all three signals with only one agent.
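Purely as an illustration of the pattern described here (a span per pipeline stage, shared resource attributes identifying the pod, and a single exporter sending everything to a collector), a hand-written version with the OpenTelemetry Python SDK might look like the sketch below. The endpoint, service name and pod labels are hypothetical, and this is not the speakers' actual setup, which attaches OpenTelemetry to the Jenkins master rather than instrumenting by hand.

```python
# Minimal sketch only; requires opentelemetry-sdk and opentelemetry-exporter-otlp.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Resource attributes play the role described above: every exported signal
# carries the same identifying attributes (values here are hypothetical).
resource = Resource.create({
    "service.name": "jenkins-ci",
    "k8s.pod.name": "jenkins-agent-42",   # hypothetical pod name
    "k8s.namespace.name": "ci",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("ci.pipeline")

# Each pipeline stage becomes a span, mirroring the stage-by-stage trace shown in the demo.
with tracer.start_as_current_span("pipeline-run"):
    with tracer.start_as_current_span("checkout"):
        pass  # e.g. git checkout
    with tracer.start_as_current_span("build"):
        pass  # e.g. compile and package
    with tracer.start_as_current_span("deploy"):
        pass  # e.g. kubectl apply
```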
Thank you very much, Michaela. Now let's have a look at the AI-driven approach. What you can see is that we have three different stages: observe, engage and act. In observe, we are using the OpenTelemetry collector agent to collect all the logs, metrics and traces from the CI/CD pipeline, but also from the underlying platform. In engage, we are going to run a machine learning prediction, so we are going to try to understand and correlate whether there is a disruption. You see here that we have two different KPIs: the first one is CI/CD reliability, the second one is CI/CD pipeline end to end. So we are running a smoke test every 15 minutes that tests all stages of a fake pipeline. Then we move on to correlation and prediction: correlation and prediction will try to understand if there will be a disruption and to predict the service health for the next half hour. Then this lands either in predictive alerting, where we send an alert before something happens, or in zero-touch operations, also known as zero-touch automation. In this case we have self-healing: scripts will run automatically to fix the CI/CD pipeline and platform. In this case we can also increase the performance; we can increase the resources that we have in our CI/CD platform to prevent any disruption before it happens.
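As a rough sketch of how the engage and act stages fit together (not the speakers' actual automation), routing a predicted health score into predictive alerting or zero-touch remediation could look like this; the thresholds and the helper functions are hypothetical.

```python
# Illustrative sketch only: routing a predicted health score (0-100) to either
# predictive alerting or zero-touch remediation. Thresholds and helpers are
# hypothetical placeholders, not the speakers' actual implementation.

ALERT_THRESHOLD = 80      # warn before users notice anything
REMEDIATE_THRESHOLD = 60  # act automatically: self-heal / scale the CI/CD platform


def send_predictive_alert(score: float) -> None:
    # Placeholder for a paging / ticketing integration.
    print(f"[predictive alert] health expected to drop to {score:.0f} in the next 30 min")


def scale_up_ci_platform() -> None:
    # Placeholder for zero-touch automation, e.g. adding Jenkins agents
    # or increasing Kubernetes resources.
    print("[zero-touch] increasing CI/CD platform resources")


def act_on_prediction(predicted_score: float) -> None:
    """Observe -> engage (prediction) -> act, as described above."""
    if predicted_score < REMEDIATE_THRESHOLD:
        send_predictive_alert(predicted_score)
        scale_up_ci_platform()
    elif predicted_score < ALERT_THRESHOLD:
        send_predictive_alert(predicted_score)
    # otherwise: healthy, nothing to do


act_on_prediction(72.0)  # example: alert only
act_on_prediction(55.0)  # example: alert and remediate
```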
Okay, this is a funny slide. What we did was try to understand whether our SLIs are correct, so we asked ChatGPT whether it could give us a couple of SLIs, and you can see here what we selected in the end. The first one is the build success rate, really interesting because it's also one that we are using. Then we have the build lead time; you can imagine what kind of measure this is, and it's really important for our SREs and DevOps engineers. Then we have the test success rate, the deployment success rate and the deployment lead time, a really interesting one because it's also the one we use in our prediction: the machine learning will run and try to understand whether there are correlations with the other events. And then what we are using here is also the change failure rate. But we will see this running automatically in the next part, because we will see it in the demo. Now let's move on.
Our demo will have two parts. In the first part, I will explain what we are doing with observability, using OpenTelemetry in a Jenkins pipeline, what kind of insights we get, and how to use these for the machine learning part. In the second part of the demo, Michaela will explain what we can do with AI.
Okay, let's first have a look at our pipeline. As you can see, we have our smoke test pipeline. It runs every 15 minutes and it covers all stages. Here everything is green, so let's have a look at something that is broken. At the end we see one run that was broken. We click inside and we have a look at our observability. As I mentioned, we are using OpenTelemetry and we are getting all of this dynamically, so let's have a look at the insights; everything here is collected automatically. This is a trace. For the first stage of our pipeline we see the name of the pipeline and we see that the run time is around 1.4 minutes. We can also click on a span and have a look at all stages, what kind of progression they had and when they stopped, in the waterfall view. It's really interesting, because you see from the start that the first stage is about the agent. We are running Jenkins inside our Kubernetes cluster, so you can imagine that every time the agent starts, a new pod is allocated, and this new pod will contain the Jenkins agent. The Jenkins agent of course does something first, which is this allocation: we request resources from Kubernetes and start a new pod. Of course, you can imagine that if we have a problem with our Kubernetes cluster, if the Kubernetes cluster is saturated, then this will take longer. So here we can already have a look, we can jump inside and try to understand if there are problems. In this case we see a checkout, which is the download of our Jenkinsfile; the Jenkinsfile is the description of all the steps. We see that it is getting downloaded and is consuming around 16 seconds, so we can already make some improvements. Then the script will deploy and compile our YAML file, the one that we want to deploy to our Kubernetes cluster in prod; it's taking 47 seconds.
And here you can see the deep dive into all stages; we can click inside and also understand what is going on. Then there is the build of a new version: we are building a new version for production, which will compile our source code, and in this case we are going to have a build. At the end we placed a gate, because as SREs we want to understand whether we already have a problem in production. If we have already consumed our error budget and don't have any left, we have a block, and this is the case here: we cannot deploy to production because our error budget has already been consumed.
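The gate described here is essentially a check against the remaining error budget. Below is a minimal sketch of such a check, assuming a hypothetical availability SLO and hypothetical request counts; the real gate in the demo is based on the error budget already tracked for production, not on this script.

```python
# A minimal sketch of an error-budget gate (not the exact gate from the demo).
# The SLO target and the measured numbers below are hypothetical.

SLO_TARGET = 0.995          # 99.5% availability objective for the 30-day window


def error_budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent (can be negative)."""
    allowed_failures = (1.0 - SLO_TARGET) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - (failed_requests / allowed_failures)


# Example: 2,000,000 requests in the window, 12,000 of them failed.
remaining = error_budget_remaining(total_requests=2_000_000, failed_requests=12_000)

if remaining <= 0:
    # Budget exhausted: block the deployment stage, matching the blocked case
    # described above.
    raise SystemExit("Error budget consumed - blocking deployment to production")
print(f"Error budget remaining: {remaining:.0%} - deployment allowed")
```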
Another interesting part is the performance summary that you can see on the right side. Here we see the wall time required to run this pipeline. There is only one contribution, which is the application; imagine that we had a database or other third-party software that we needed to connect here, then you would have seen something like network, compute and database as well.
We can now jump to the graphical overview. What you can see here is that from the Jenkins pipeline, using OpenTelemetry, we are also getting some interesting metrics out of the box. The first one is the request rate. It's really important for SRE, because we want to understand when there is a peak in requests, when it might happen, and whether we need to increase some resources; this is also the case here. The second one is the request latency, which tells us what kind of problem we have on this Jenkins master. Below that we can also see the error rate: if we have errors on the API, we can immediately create an alert.
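Purely to illustrate that last point, deriving the error-rate signal and deciding whether to alert could look like the following sketch; the request counts and the 5% threshold are hypothetical.

```python
# Rough illustration only: deriving an error rate from request counters and
# deciding whether to alert. The numbers and the 5% threshold are hypothetical.

ERROR_RATE_THRESHOLD = 0.05  # alert when more than 5% of API calls fail


def error_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of failed requests in the observation window."""
    return failed_requests / total_requests if total_requests else 0.0


rate = error_rate(total_requests=4_200, failed_requests=315)

if rate > ERROR_RATE_THRESHOLD:
    # In the demo this would surface as an alert in the observability platform.
    print(f"ALERT: Jenkins API error rate at {rate:.1%}")
else:
    print(f"error rate OK: {rate:.1%}")
```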
Another interesting part is the overview that we have in the APM. Here you see the whole overview of our application. We are going to select only Jenkins, so in this case we filter only on Jenkins, as we are interested in this service, and we select a window of one day because we also want to catch the problem that we had. Another view we see here shows the problems on a timeline, and in this case one of them is the problem from this run of the pipeline. As I mentioned before, we can also deep dive into another run, and you see that in this case the run took 1.86 seconds. We can also try to understand from a different run whether something changed; we can compare all of them. This is also really interesting about observability: our metrics, of course, are not standalone. They will be used mainly for debugging, but Michaela will tell us what we can do with AI.
Thanks, Francesco, for the observability preview. Now it's time to jump to the AI part. All of the metrics that you saw in Splunk Observability, which Francesco showed, have also been integrated into Splunk IT Service Intelligence, which is what you're seeing here and which we're using as our AI platform. So what you're seeing on this page is
the service tree overview which gives us insights on the structure of the CI CD
platform service. I suggest we look
at a couple of these services just to get an idea. We have
Vault here, which is used for secret management. We've got the Kubernetes cluster service mapped, which, if you remember, we actually saw in the CI/CD architecture slide. We've also got the GitLab CI/CD; however, this is
out of scope for today's demo and we finally get to
our Jenkins CI CD service node.
If we look under it, we have two other nodes
which are Jenkins end to end and the Jenkins reliability
service. If I click into one of these, a drill-down will open on the right-hand side with a list of KPIs related to this service. And if you look at these KPIs, you'll notice that these are some of the ones we already saw when we were analyzing CI/CD, specifically when we were looking at its architecture: we listed some of the KPIs there, and for this demo we implemented a few of them. Keep in mind that this data comes from the Splunk Observability platform via the OpenTelemetry collector, and the idea is that all of these KPIs contribute to the health score that we see up here, which is currently 100, therefore healthy.
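Splunk ITSI computes this score with its own logic, so purely as an illustration of the idea that weighted KPI severities roll up into a single health score, a simplified aggregation might look like the sketch below; the KPI names, weights and severity values are hypothetical.

```python
# Simplified illustration of KPI severities rolling up into one health score.
# This is NOT Splunk ITSI's actual algorithm; names, weights and values are hypothetical.

# severity: 1.0 = fully healthy, 0.0 = critical
kpis = {
    "build_success_rate":    {"weight": 3, "severity": 1.0},
    "deployment_lead_time":  {"weight": 2, "severity": 1.0},
    "change_failure_rate":   {"weight": 3, "severity": 1.0},
    "platform_availability": {"weight": 4, "severity": 1.0},
}


def health_score(kpi_map: dict) -> float:
    """Weighted average of KPI severities, scaled to 0-100."""
    total_weight = sum(k["weight"] for k in kpi_map.values())
    weighted = sum(k["weight"] * k["severity"] for k in kpi_map.values())
    return 100.0 * weighted / total_weight


print(f"service health score: {health_score(kpis):.0f}")  # 100 while everything is healthy
```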
So this was sort of a short overview of the service decomposition,
but I suggest we move now to the actual AI part.
So I go here into the predictive analytics section. This is, as I said, the predictive analytics feature, and I will use it to train a model based on my service and its KPIs. I will be using data from the last 14 days, and since what I want to predict is the service health, that is, whether it will be low, medium, high or critical, I will use the random forest regressor. So I choose the random forest regressor here; the train/test split is 70/30, which is fine, and I click on train. This is now training the model based on the Jenkins reliability service which I've selected and its KPIs, and this might take a couple of minutes.
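Conceptually, this step amounts to fitting a regressor on historical KPI values with the health score as the target. The sketch below shows the same shape of workflow using scikit-learn and synthetic data, with a 70/30 split and a RandomForestRegressor; it is an illustration, not Splunk ITSI's internal implementation.

```python
# Illustrative only: mirrors the 70/30 split and random forest regressor described
# above, but uses scikit-learn and synthetic data rather than Splunk ITSI.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic history: each row is one observation of the service's KPIs
# (e.g. build success rate, deployment lead time, error rate, saturation).
X = rng.random((14 * 24 * 2, 4))          # ~14 days of half-hourly samples, 4 KPIs
# Synthetic target: health score 0-100, loosely derived from the KPIs plus noise.
y = np.clip(100 * X.mean(axis=1) + rng.normal(0, 2, size=X.shape[0]), 0, 100)

# 70/30 train/test split, as in the demo.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print(f"R^2 on the 30% test set: {model.score(X_test, y_test):.2f}")

# Predicting the health score for the next 30 minutes from the latest KPI values.
latest_kpis = X[-1].reshape(1, -1)
print(f"predicted health score (next 30 min): {model.predict(latest_kpis)[0]:.0f}")
```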
Now we see that our model is ready; it has been trained but also tested. Let's quickly look at the test results. Specifically, I want to see how my model performed on the test set. If I scroll down here, I see different analyses, but what I'm really interested in is the predicted average versus the predicted worst-case health score. We see that in both cases it's 100%, which is okay; but if this were to drop, alerts would fire and an action would have to take place. This is something we also implemented, but for today's showcase it is out of scope. I will save this model, and now I can finally use it on actual, real data. So I again choose the Jenkins reliability service, and it loads the model for this service. I again select the random forest regressor. While I'm waiting for the results, let me explain what I'm trying to calculate here: the service health score for the next 30 minutes. This is the output that we get, and from this point on the prediction runs automatically.
Okay, let's summarize everything we learned so far today.
We identified different challenges that come with CI CD pipelines
and therefore concluded that we're talking about a critical platform that
requires end-to-end monitoring. Talking about end-to-end monitoring, we also introduced the concept of observability, which is necessary in order to introduce full transparency and visibility across our entire infrastructure.
In order to bring this data from our CI CD platform
into our observability platform, we made use of OpenTelemetry, an open source standard for collecting telemetry data, which is relevant for ensuring reliability.
Furthermore, we identified some relevant KPIs that help us derive
the state of our CI CD platform. And finally we applied AI
to all this data in order to create some predictions and derive actionable
insights. Why? Because our ultimate goal is
failure prediction, which refers to the use of historical data
in order to preempt failure before it actually occurs.
And a final takeaway, I would like to point out from today's session,
start simple and scale fast. So perhaps you don't know where to start. Well, maybe start with a simple experiment, see how the system reacts, see how it goes. And as you
proceed you can scale, you can basically build more and
more on top of that. Well, it seems it's time to close
the curtains. Thanks a lot for watching. I really hope you
got something from this session. Until next time.