Transcript
This transcript was autogenerated. To make changes, submit a PR.
Everyone, thanks for joining today's session
on observability in CI/CD pipelines.
I'm Siddhartha Khare. I work as a technical account manager with
New Relic. Prior to joining New Relic, I worked with
Citrix as a software developer. Today's agenda
will revolve around a quick recap of
OpenTelemetry. What is continuous
integration? What is continuous delivery?
Why do we need observability in CI/CD pipelines?
What are the benefits of implementing observability
in CI/CD pipelines? All of this is
followed by a short demo of how
these things can be implemented.
Let's discuss OpenTelemetry.
OpenTelemetry is an incubating project under the CNCF
umbrella. It was formed by merging OpenTracing
and OpenCensus, so if you have used
Jaeger or Zipkin, you have already had a taste
of OpenTracing. It also provides you a
set of APIs, libraries, and
integrations so that you can collect your
data more efficiently. It also
provides a standardized approach to collecting
data from your applications, which means
it defines what type of data you should be collecting
from your applications or systems to understand how they
are performing. What is continuous
integration? As the name suggests,
it is the practice of merging all developers' code
into the main codebase several times a day.
It helps you reduce the risk to already
released features. It decreases the number of bugs
which can occur in your application.
It minimizes the integration issues which anyone can
face. What is continuous delivery?
Continuous delivery is the approach where functionality
is delivered frequently.
With continuous
delivery you will be able to deliver faster
to the market at lower cost, and the
quality will also be enhanced. One can say that
CI/CD is a combined practice of integration and
delivery. Let's discuss why
you need observability in a CI/CD pipeline. With observability
you will be able to monitor your CI/CD pipeline
more effectively and resolve
issues before they escalate, which will save you time
and resources. By understanding the ins and outs
of your CI/CD processes, your team can
make more informed decisions about resource allocation,
process changes, or tool consolidation.
You can also detect anomalies, if there are
any, in your system. You can detect performance issues.
You can identify any misconfigurations which
have occurred, and teams which
work in silos can work in a more efficient manner.
Let's also look at some of the benefits which you
will get from monitoring your CI/CD
pipeline. The first one is
fault isolation, which is
the practice of designing systems such that
when an error occurs, the negative outcomes are limited
in scope. Limiting the scope of
problems reduces the potential for damage and
makes systems easier to maintain. You will get
a faster MTTR, the mean time to repair.
This measures maintainability: it is
the average time taken to repair a broken feature.
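
As a rough illustration of how MTTR can be computed, here is a minimal Python sketch. The incident timestamps and the `mean_time_to_repair` helper are hypothetical, made up for this example; they are not data from the demo.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average of (resolved - detected) over all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents as (detected, resolved) pairs.
incidents = [
    (datetime(2023, 5, 1, 10, 0), datetime(2023, 5, 1, 10, 30)),  # 30 minutes
    (datetime(2023, 5, 2, 14, 0), datetime(2023, 5, 2, 15, 30)),  # 90 minutes
]

mttr = mean_time_to_repair(incidents)
print(mttr)
```

A lower value over time suggests that failures are being found and fixed faster, which is the effect pipeline observability is meant to have.
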
You will see faster release rates.
You will also see that within
your team, transparency and accountability will improve.
CI/CD is a great way to get continuous feedback,
not only from your customers but also [...]. Next is
the demo of how we can implement
CI/CD pipeline monitoring, what
the prerequisites are to achieve this, and
how this will help you in the long run.
So let's quickly take a look at my Jenkins server
and the jobs I have. There are a couple of
jobs which I have on my server and I'm running
them. Let's take the example of this particular job.
I'm cleaning up the workspace first, then I'm checking
out my code. I'm building the Docker image, pushing that
Docker image to my Docker Hub account, and
deleting the images from the system.
Then I'm updating the Kubernetes deployment file
and eventually pushing it to a deployment.
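
The stages of this job can be pictured as a sequence of timed steps, which is also how they show up later as spans in a trace. Below is a minimal Python sketch of that idea using only the standard library. The stage names paraphrase the job described above, while `timed_stage` and the `sleep` call standing in for real work are assumptions for illustration; this is not what Jenkins or the OpenTelemetry plugin runs internally.

```python
import time
from contextlib import contextmanager

spans = []  # collected records, one per stage


@contextmanager
def timed_stage(name):
    """Record how long the wrapped block takes, like one span per stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({"name": name, "duration_s": time.perf_counter() - start})


# Stage names paraphrased from the job described in the talk.
STAGES = [
    "Cleanup Workspace",
    "Checkout",
    "Build Docker Image",
    "Push Docker Image",
    "Delete Docker Images",
    "Update Kubernetes Deployment File",
    "Push Deployment File",
]

for stage in STAGES:
    with timed_stage(stage):
        time.sleep(0.01)  # placeholder for the real work of the stage

for span in spans:
    print(f"{span['name']}: {span['duration_s']:.3f}s")
```

Because the duration is recorded in a `finally` clause, a stage that raises an exception would still produce a timing record.
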
I'll also show you what the actual app
looks like. So if I go to sage you
can see that I have an application;
it's Kubernetes hosted and I've just shown the
view. If I go to this particular
page, you can see I have a web page
as well; if it produces an error,
it will tell you. So this is how
my application works. Now, at any point
in time, let's say you want to monitor what
your Jenkins server is doing and where it is failing, and
you don't have access to the portal.
So what will you do? The process is very simple
here. What we have leveraged is OpenTelemetry.
So I go to Manage Jenkins and search
for the OpenTelemetry plugin. In my scenario I have already installed
it, so I'll show you here. This is the
plugin which we will be using. Once you install
the plugin, go back to Manage Jenkins, inside
System. You will see multiple sections and
multiple configurations which you need to do. Just search for
OpenTelemetry, and you need to provide
the OpenTelemetry endpoint. This endpoint can be
any endpoint of the backend service which you are planning
to use. In my scenario I'm using New Relic,
so the endpoint is otlp.nr-data.net
and 4317 is the port.
The authentication which I am using is the API key.
I have leveraged header authentication: I have named
that header, and the value is the New Relic
ingest license key. Okay. And I click on
save. As soon as I save this, automatically
within my New Relic portal, if you go to APM and
Services, inside OpenTelemetry,
you will start seeing the data. Here you will get
information about the response time, throughput,
the error rate of the application, and the instances of the application.
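
The two settings described above, an OTLP endpoint and an authentication header, are what any OTLP exporter needs. As a sketch, the following Python builds the standard OpenTelemetry environment variables `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_HEADERS` from them. The `otlp_env` helper is hypothetical, the key is a placeholder, and the header name `api-key` is the one New Relic documents for OTLP ingest; the transcript does not capture the header name used in the demo, so treat it as an assumption.

```python
def otlp_env(endpoint, headers):
    """Build the standard OpenTelemetry exporter environment variables."""
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        # Headers are a comma-separated list of key=value pairs.
        "OTEL_EXPORTER_OTLP_HEADERS": ",".join(
            f"{key}={value}" for key, value in headers.items()
        ),
    }


env = otlp_env(
    "https://otlp.nr-data.net:4317",
    {"api-key": "YOUR_INGEST_LICENSE_KEY"},  # placeholder, not a real key
)

for name, value in env.items():
    print(f"{name}={value}")
```

In the demo these values are entered in the Jenkins plugin's configuration screen instead; the environment variable form is shown only because it is the common way other OpenTelemetry components receive the same settings.
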
Now if you want to see
which other components your Jenkins is communicating with,
you can go to the service map. You can also validate
the different types of transactions which are happening.
Now the catch is to look for BUILD
in caps. All these are different builds
which I have in my Jenkins. So if you see here: Argo
CD GitOps workbench, Argo CD
GitOps, Argo CD CI operation, GitOps Argo
CD CI. If I go back here, you can see
all these names. Now, if you want to dig deeper into
what is happening, let's say this is the main
transaction, which is taking 93%.
So if I click on this particular transaction,
I'll see the complete percentile graph and
the throughput, and you will see the traces which are here.
Let's see if there is any trace which has
an error. There are none, but let's go
to this particular trace. You will see the entity map here,
just for this particular trace, and you will see a
nice indicator as well. Open the
drop-down and you will see all the process spans.
You will see when the pipeline is starting, when it starts running,
and when the agent got initialized, and here you will see the
stage-wise breakdown. The cleanup workspace took
this much time, the checkout took 1.65 seconds.
Building the Docker image took 3 seconds.
Then here, in pushing the Docker image,
while it is running some shell command, it took some time. It is
also showing me the anomaly, which says this
span is 3.79 seconds slower
than the average.
Then it is deleting. So all these stages are coming up here.
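
The "slower than average" message can be understood as a comparison between the current span duration and past durations of the same stage. The sketch below shows only that basic idea in Python. The history values are hypothetical, and this simple mean comparison is an assumption for illustration, not a description of how the anomaly detection engine in the demo actually works.

```python
from statistics import mean


def slower_than_average(history_s, current_s):
    """Seconds by which the current duration exceeds the historical mean."""
    return current_s - mean(history_s)


# Hypothetical past durations of one stage, in seconds.
history = [2.0, 2.5, 1.5]
current = 5.79

delta = slower_than_average(history, current)
if delta > 0:
    print(f"This span is {delta:.2f} seconds slower than average")
```

A real detector would normally also account for variance, so that a stage with naturally noisy timings does not raise an alert on every run.
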
Now, at any point in time, let's say things here are
working fine. If I go back to distributed tracing directly,
I'll show you the different jobs which
are running, with their details. So let me
just filter for errors.
Okay, so this is the particular build
where the errors are coming from. So if I go to this particular
one and click on one of the traces
here, you will see the indicator is red and orange.
If I click on the drop-down, you will
see the complete process span. Here you are seeing
one anomalous span. This anomalous span
is generated by our anomaly detection engine.
Here are the potential errors due to
which the problem has occurred. So if I go into the span,
you will start seeing when the pipeline started.
Here is the first error. Let's quickly check what that
error looks like. It is trying to clone from a
wrong repo. If I go to SCM,
you will start seeing this data. If I go to, let's say,
Build Docker Image, it will start giving this. So now
we know that this is the problem here, and there was
also an exception. If I click on this particular exception,
it will show me the complete stack trace of
what is happening, which will help me dig deeper
into how to fix this problem. This is the
information which you will start getting once
you instrument it, right? You will even see
the logs as well. If you go to logs, you will see the details:
there is some change which got detected in
the Argo CD GitOps workbench build,
right? So it will tell you which particular build has changed.
It will give you the details about the Jenkins URL.
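
To make the error view above more concrete, here is a small Python sketch of how a failing stage can carry its exception and stack trace along with it, and how the failing stages can then be filtered out. The `run_stage` helper, the span-like dictionaries, and the error message are all hypothetical; the real spans in the demo are produced by the Jenkins OpenTelemetry plugin, not by code like this.

```python
import traceback


def run_stage(name, action):
    """Run one stage and return a span-like record of the outcome."""
    span = {"name": name, "status": "OK", "exception": None}
    try:
        action()
    except Exception as exc:
        span["status"] = "ERROR"
        span["exception"] = {
            "type": type(exc).__name__,
            "message": str(exc),
            "stacktrace": traceback.format_exc(),
        }
    return span


def checkout_wrong_repo():
    # Stand-in for a clone that fails; the message is made up.
    raise RuntimeError("repository not found")


stage_spans = [
    run_stage("Cleanup Workspace", lambda: None),
    run_stage("Checkout", checkout_wrong_repo),
]

# Equivalent of filtering the trace view down to errors only.
errors = [span for span in stage_spans if span["status"] == "ERROR"]
for span in errors:
    print(span["name"], span["exception"]["type"], span["exception"]["message"])
```

Keeping the stack trace on the record itself is what lets you go from "this stage failed" to "this is why" without opening the build console.
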
Once you start seeing this data, there are
certain scenarios where you don't want
to go inside this every time to see what the problem is or how
your Jenkins server is performing. For that we have
the out-of-the-box dashboards. If
you go to any of the dashboards, you can see how your
CI/CD
pipeline is performing. You can see dashboard details like
the number of builds: 40, with 20 failed.
You will see the executed jobs,
the average job duration, the job failures,
and which step took
the longest duration. You will see the number of steps
and the count for each of those steps. You will see the max
error, meaning what type of errors are occurring most,
and if there are any failed steps, you will start seeing those.
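
The dashboard numbers mentioned above are simple aggregates over build records. The following Python sketch computes a few of them by hand. The build records, job names, and error name are hypothetical sample data, not values from the demo dashboard.

```python
from collections import Counter
from statistics import mean

# Hypothetical build records; not data from the demo.
builds = [
    {"job": "job-a", "status": "SUCCESS", "duration_s": 42.0, "error": None},
    {"job": "job-a", "status": "FAILURE", "duration_s": 12.0, "error": "GitException"},
    {"job": "job-b", "status": "FAILURE", "duration_s": 9.0, "error": "GitException"},
    {"job": "job-b", "status": "SUCCESS", "duration_s": 57.0, "error": None},
]

total = len(builds)
failed = sum(1 for build in builds if build["status"] == "FAILURE")
avg_duration = mean(build["duration_s"] for build in builds)
# Most frequent error type across all failed builds.
top_error = Counter(
    build["error"] for build in builds if build["error"]
).most_common(1)

print(f"builds={total} failed={failed} avg={avg_duration:.1f}s top_error={top_error}")
```

In practice the observability backend runs these aggregations over the telemetry it has ingested, so nothing like this needs to be written by hand; the sketch only shows what each widget is counting.
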
So let me just increase the time frame to 3 hours and
let's see. Yeah, so if you go to 3 hours, you will start
seeing the steps which failed. You will start seeing the
longer steps, and you will see the number of builds and
the failed builds, and you will see
what the queue time is. You will see all these details
coming up out of the box. I'll end
my session on observability in
CI/CD pipelines here. I hope you have enjoyed
this session, and please give it a
try. If you face any problems, feel free to reach out
to me on LinkedIn. Thank you. Have a nice day.
Happy learning.