Transcript
Folks at Conf42, thanks so much for joining my talk today.
Today I want to talk with you about observability.
What is the importance of having observability in your system
and some of the key pillars about it. I'm also going
to talk to you about opentelemetry and how it works and
where it fits within this entire observability system. So let's
get started. A little bit about me: I'm Siddhant. I'm a developer advocate at SigLens, and I'm also a co-organizer at the Cloud Native Community Group, Nashik. Along with that, I'm a community manager at a couple of tech communities.
Now if that didn't make it obvious already, I'm a
huge geek when it comes to tech. I love talking about Linux
as well as Kubernetes and I've also started to geek
out about various books and about health.
If you want to connect with me after this talk, feel free to
find me on my socials. Now before I
talk about anything technical, let's imagine a scenario.
You have a server. Let's say it's running on-prem in your own data center, and on top of that server
you have your applications running. You've got your healthy
applications, you've got your healthy databases, and all
of your users or your customers are able to properly
access your applications. And everybody is happy.
The developers are happy, your operation teams are happy and
most of all your customers are happy. Now all
of a sudden something happens and your server goes down
for whatever reason, maybe because of a power failure,
maybe because there's too much load on the system, or any of a plethora of other reasons. Your server has gone down, and that's going to lead to a
cascading number of other errors. You're soon going to
start seeing some faults in your application and
your database as well. In the extreme worst
case, your data might be completely lost.
You might start losing old data as well as new incoming
data. And this is going to cause a lot of unhappy
users, which at the end of the day
is going to cost a lot in terms of business value and
revenue as well. And none of us want that, right? And this entire thing has happened at 3:00 a.m., and your engineers, your operations folks, are getting a ton of calls about what's wrong with the server. They are working tirelessly to try and bring the server back up, but they have absolutely no idea what's wrong with it. This is because you haven't put in any way to get visibility into your servers, to see what's happening within them in the first place, right? So that's what we're going
to talk about today. That's where observability comes
into play and can really help you out.
So observability, in a nutshell, allows you to get deep insights into your system, and it lets you use all of that data to evaluate your performance and improve it. It's also useful for debugging issues and predicting future issues.
Let me give you an example of predicting issues. Let's say you run an e-commerce website and you see traffic spikes during certain periods of the day. Whenever there's a spike, you need more resources, so you have to allocate or provision more resources from your cloud provider in order to maintain healthy uptime. With observability data on past spikes, you can see those periods coming and scale ahead of time. There are three main pillars of observability, but before getting into those, let me give you an analogy to understand observability.
Let's say you drive a car and you do your own repairs and maintenance. Now, if you didn't have the right tools, how would you know what's going on inside your car? You can't understand what's happening in the engine without ever opening the bonnet. For example, say a tire is low on air. You wouldn't know that unless you have the right tools, correct?
That's exactly what we want to achieve within observability.
But for our software and for our servers.
Now, talking about the key pillars of observability, we have three key pillars. The first of them is logs. Logs are simply timestamped data with some information about an event that has happened. For example, my application
could have thrown a warning at 2:15 a.m. Now, obviously, nobody is going to sit up at 2:15 a.m. and continuously monitor the logs, right? So that's why we generate logs using some sort of observability tool and store them in some sort of observability backend.
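To make that concrete, a log is really just a timestamp plus details about an event. Here is a minimal sketch of what such a record might look like; the field names and values are purely illustrative, not from any particular tool:

```python
import json
from datetime import datetime, timezone

# Hypothetical record shape: a log is timestamped data about one event.
log_record = {
    "timestamp": datetime(2024, 5, 1, 2, 15, tzinfo=timezone.utc).isoformat(),
    "severity": "WARN",
    "body": "connection pool nearly exhausted",  # the event that happened
    "service": "checkout",                       # illustrative attribute
}

# Serialized, this is the kind of line a tool ships to a backend.
print(json.dumps(log_record))
```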
More on this later. The second pillar of observability is metrics. Metrics involve things like your CPU utilization over a long period of time, your memory utilization over a period of time, and other similar things. They can also include things like your HTTP requests: how many requests were dropped, how many were accepted, and similar details. Next,
we have traces. Traces are useful for figuring out the performance of your application.
Now, in this diagram, you can see that for going from
A to B, it takes 50 milliseconds. Now, A and
B are simply some function calls. So function
A makes a call to function B, and it takes 50 milliseconds
for that entire process to wrap up. Then function B calls function C, function C calls function D, and so on. The time taken for one of these individual calls, say from A to B, is what we call a span, and the entire end-to-end journey of the request through all of these functions is what we call a trace.
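Here's a toy sketch of that idea (this is not the actual OpenTelemetry API, just an illustration): each timed call plays the role of a span, and the outermost call covers the whole trace.

```python
import time

spans = []  # collected (name, duration in milliseconds) pairs

def timed_call(name, fn):
    # Toy "span": time a single call and record its duration.
    start = time.perf_counter()
    result = fn()
    spans.append((name, (time.perf_counter() - start) * 1000))
    return result

def function_c():
    time.sleep(0.01)  # pretend work

def function_b():
    timed_call("B->C", function_c)

def function_a():
    timed_call("A->B", function_b)

# The outermost call corresponds to the whole trace.
timed_call("trace", function_a)
for name, ms in spans:
    print(f"{name}: {ms:.1f} ms")
```

The innermost span finishes first, so it is recorded first; the "trace" entry covers everything beneath it.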
So now let's talk a little bit about what OpenTelemetry is. OpenTelemetry is simply a framework which you can use for implementing observability within your systems. Now, to give you a little bit of a backstory,
before the introduction of OpenTelemetry, there were around 14 or 15 different standards for observability.
If you come from the web development world, you know how much of
a pain this can be, having multiple standards for the exact
same thing. When OpenTelemetry was created, the project aimed to unify all of these standards, and so far it has achieved that goal: a lot of the existing standards have pretty much been merged into OpenTelemetry, and OpenTelemetry is becoming the de facto standard for observability. Moving on, I might refer to OpenTelemetry as OTel, which is just an abbreviation, a short form of OpenTelemetry.
Now, OpenTelemetry works in a couple of different ways. If you look on the left, in the microservices column, that is where you actually instrument your code.
So OpenTelemetry has software development kits, SDKs. Using those SDKs, you instrument your code: hey, this is my function, and I want OpenTelemetry to tell me how much time it takes to go from this function to the second function to the third function, or whatever. For that you have the OTel SDKs, and you have the OTel APIs as well.
And there's also a really useful feature, compatible with just a few languages for now, called OpenTelemetry auto-instrumentation. It's just what it sounds like: it tries to automatically instrument all of your code. So far I have seen it work with Node.js, but it supports a couple of other languages as well. Then you can also use OpenTelemetry for your infrastructure. For example, if you are running OpenTelemetry on a VM, you can collect the system logs, the memory (RAM) usage, the system calls, et cetera. You can do the same with Kubernetes: there is the OpenTelemetry Operator for Kubernetes, and you can install it with a simple Helm chart.
Now, once you have instrumented your code or your infrastructure, you have something called the OpenTelemetry Collector. The Collector simply acts as a way to collect all of your telemetry data; telemetry data and observability data are the same thing, and telemetry data covers all three pillars: logs, traces, and metrics. Once the OpenTelemetry Collector has collected all of that telemetry data and processed it, it sends it to an observability backend. This backend can be something like Grafana, Loki, or SigLens, for example, which helps you store all of this data, filter through it, create graphs, build some sort of analytics, and so on. Now, how does the OpenTelemetry Collector actually work? Here I'm taking an example with Argo CD. If you don't know what Argo CD is, Argo is basically a tool which allows you to implement GitOps within your entire software workflow.
Now, to talk about how the OpenTelemetry Collector works: Argo, within its code base, has some built-in mechanisms for emitting this telemetry data. You simply configure within it that the endpoint where you want this telemetry data to go is the OpenTelemetry Collector, over here. Once that is done, Argo will send the information to an OpenTelemetry receiver. The receiver's job is simply to receive whatever data is being sent by this external source; it doesn't necessarily have to be Argo. It could be a number of other applications, it could even be a custom application. Once the receiver has the data, it'll send it over to the processor.
Now, the processor is used for adding some additional data onto the existing telemetry data. For example, say I've gotten a warning log from Argo that, hey, your deployment has failed for whatever reason. I can configure the processor in such a way that it will attach some details about CPU usage, memory usage, maybe some batch process that's happening in the background, to this particular log which I've received from Argo.
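In the Collector's own configuration file, that kind of enrichment is done in the `processors` section. A small sketch of what it might look like (the attribute key and value here are illustrative, not from the talk's slides):

```yaml
processors:
  batch: {}              # groups telemetry into batches before export
  attributes:
    actions:
      - key: environment
        value: production
        action: insert   # attach this attribute to incoming records
```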
Once that processing is done, the next step is the exporter. The exporter's job is simply to send the data to some sort of observability backend. OpenTelemetry does not store any of the data which it collects; it'll collect it, process it, and its job is done. If you don't send it to some sort of external observability storage area, an observability backend, this data is going to be lost. So that's where the exporter comes into play: it will send the data to an observability backend, such as SigLens, for example.
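The whole receiver → processor → exporter flow can be sketched in a few lines of toy Python. This is just the idea, not the real Collector; the event fields and attributes are made up for illustration:

```python
# Toy sketch of the Collector flow; the Collector itself stores nothing.
exported = []  # stands in for the observability backend

def receive(raw_event):
    # Receiver: accept telemetry sent by an external source
    # (Argo, a custom app, anything).
    return dict(raw_event)

def process(event, extra):
    # Processor: enrich the event with additional attributes.
    return {**event, **extra}

def export(event):
    # Exporter: hand the event off to a backend for storage.
    exported.append(event)

event = receive({"severity": "WARN", "body": "deployment failed"})
event = process(event, {"cpu_percent": 87, "memory_mb": 412})
export(event)
print(exported)
```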
There are other options as well, but I'll take SigLens as the example here. Now, if you want to get started
with OpenTelemetry, there are two ways depending on who you are. If you are a developer, you can use the OpenTelemetry API. It also has a number of different SDKs which you can use for instrumenting your code, and you can find all of that on OpenTelemetry's website. And if you're an operations person, a system administrator, there's a completely different roadmap for you. I'm going to talk about it from an administrator's perspective, since that's the field I have some experience with.
So as an administrator, or rather an operations person, you have two ways to install it. You can either install it using Docker Compose, which, again, you can find in the documentation; I'm taking the Docker route and assuming that you want to run it on a simple VM. The second way is using Helm. Helm is a package manager for Kubernetes, so if you want to install OpenTelemetry onto Kubernetes, you would do that using Helm.
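For the Docker route mentioned a moment ago, a minimal Compose service definition might look something like this (the image tag and file paths are illustrative; check the official docs for the exact setup):

```yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otelcol/config.yaml
    ports:
      - "4317:4317"   # OTLP over gRPC
      - "4318:4318"   # OTLP over HTTP
```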
Now, there are more in-depth pieces, such as the OpenTelemetry Operator, that come into play for Kubernetes, but that goes outside the scope of this talk, so we're going to skip it for now. Once you have installed OpenTelemetry using Docker or using Helm, the configuration that you can see on the screen is what an OpenTelemetry configuration might look like. So let's quickly take a look through these configuration options.
First, you can see that we have the receivers. There, you're simply mentioning the type of protocol you're receiving the data with, whether it's via gRPC or HTTP, and you're mentioning an endpoint as well. So here, for the HTTP protocol specifically, OpenTelemetry is expecting data to come in on localhost, port 4318. After that, the telemetry data which the receiver has gotten heads over to the processor.
The processor has its own configuration. Over here you have a batch processor; you can attach a number of other processors as well, and you can find all of that either in the documentation or on OpenTelemetry's GitHub page. The next thing we have is the exporters, which I mentioned earlier: the exporter is simply going to send the data to some external storage.
Now, over here you have an exporter of type OTLP. This means the data being exported is going to be in the format that OpenTelemetry supports, and it goes to this endpoint. This is just an example endpoint, but it can be absolutely anything, as long as the observability backend supports ingesting OTLP; at this point, pretty much every observability backend supports the OpenTelemetry data format. Then you also have some extensions and service checks which you can put into place.
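Pulling those pieces together, a minimal configuration along the lines of the one on screen might look like this (the exporter endpoint is a placeholder, not a real address):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: localhost:4317
      http:
        endpoint: localhost:4318

processors:
  batch: {}

exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder backend address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```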
Talking about the pipeline, this is something important. The pipeline is simply the order in which you want all of the data to be handled. So, for example, for traces: I first want my OpenTelemetry traces to be collected, then I want my processors to run in this order. Since over here we have just one single processor, that's the only thing mentioned. But if we had a couple of other processors, for example ones for CPU utilization and memory utilization, we would list them here in whichever order we wanted. If I wanted CPU utilization first and then the batch details, that's how my order would be; if I wanted batch first, batch would come first, and then I would mention my CPU usage. And yeah,
that's the end of my talk. Thank you so much for being patient and listening. I hope you found it useful and informative. If you want to go ahead and try out OpenTelemetry, feel free to check out its website, and you can even use SigLens as one of its backends, the observability backend where you store all of the data; this is the website where you can find out everything about SigLens. If you found this informative, please do let me know, and share this on socials as well. Looking forward to connecting with all of you, lovely audience. Thanks for listening to my talk.