Transcript
This transcript was autogenerated. To make changes, submit a PR.
All right. Welcome, everybody.
It's a pleasure to be here, and today I'm going to show you some strategies
to implement observability in your company without requiring
engineering effort. If you are doing observability using
open source solutions, as I am, you probably think
about doing the same thing that most of the observability
vendors are doing with auto-instrumentation: they just deploy
an agent on your host and start collecting your
metrics, logs and traces. So the idea of this talk is to
show you some strategies, using things that you probably already have
in your infrastructure, to start collecting all those signals
and improve your developer experience without
requiring engineering effort. So the idea is to let developers
focus on what really matters for them. Okay? So I hope you
enjoy it, and let's get started. Cool. So before
we start talking about observability strategies, let me introduce
myself. My name is Nicolas Takashi. I'm a Brazilian software
engineer living in Portugal for the last seven years.
I'm an open source contributor, especially in the observability
ecosystem, for projects such as Prometheus Operator,
OpenTelemetry and many other projects in the observability ecosystem.
I'm currently working at Coralogix, an analytics
platform for logs, metrics, traces and also security
data. And you can find me on my
social media networks such as Twitter, LinkedIn and GitHub by
my name. I'm usually talking about Kubernetes,
observability, GitOps, distributed
systems and also, of course, personal life and
exchanging experiences. So I hope to see you there, and let's move
forward. Okay, cool. So now that
everybody knows who I am, let's move forward. Let's start talking about
observability strategies, folks. So before
we start talking about the strategies and seeing things in action,
let's ensure that everybody's on the same page. Let's ensure
that everybody has the same knowledge and the same understanding about observability
and its use cases. Okay? I know this may be very
basic, but it's very important for understanding
the strategies that we're going to see here. So folks,
in the context of software engineering, observability is crucial
for understanding how a system is behaving given
external inputs. And by external inputs, I mean users
using your system: if you are running an
e-commerce site, users buying, adding things to
their cart, doing payments
and all those things, okay? And you start collecting telemetry data.
So you start collecting traces, you start collecting logs,
you start collecting metrics, profiling and many other things,
okay? And with all that information,
which is a huge amount of information, because with observability it's very
easy to quickly end up handling a huge
amount of data, you can identify
issues, you can identify bottlenecks in your system, places where
your system can be improved, okay? And you can troubleshoot
problems very quickly. But when
we are talking about observability, it's very common that people start talking
about infrastructure observability, which is true
and which is very important actually,
because if you are running a healthy
infrastructure, your resilience is
better, your reliability is better, and your customers are
happy. And usually when we are talking
about infrastructure observability, we are talking about monitoring the Kubernetes
system, for example: whether your system is scaling or not,
whether you have new pods. Or, for example, if
you're running a Kafka broker,
you may pay attention to disk size,
disk throughput and many other things,
which is important, as I said. But when we're talking
about observability, we also need to talk about application observability,
which is a little bit more complex, because most of
the things on the infrastructure side are already
done. We have metrics, and we have logs by
default, because logs are the most common
observability data type.
But application observability is
a little bit more complex, because some systems are
not prepared yet to export all the things that we need.
Okay, because sometimes we
need technical information
like logs, traces and metrics,
but from the application perspective.
You want to know what the p99
latency is for a specific application. You may want to
know how many messages a specific
application is producing to your message broker, whether you're
using Kafka, RabbitMQ or
anything else; it's the same concept, more or
less. And on the application
side you may
also want to understand some business metrics. So, taking
the e-commerce example that I gave you in the beginning,
you may want to know how many orders your customers are placing
per second or per minute, what the
click path is on your system, how users navigate your platform.
And by collecting all that information, you can start
thinking about the places in your system where
you want to add
a lot of focus to improve resilience, to improve
performance, to reduce error rates, and so on and so forth.
Okay? But collecting all that information sometimes
is not easy, especially if it's not built into the framework that
you are using. And that leads us to a kind of work that
most engineers don't like to do, and product managers even less,
which is instrumenting code to get
the required information. Okay,
so the idea of this talk is to show some strategies to
let engineers put their focus on the things that really matter
for them, like delivering features, measuring user
experience, getting business information.
Let's avoid engineers spending time adding
telemetry to collect technical things like HTTP requests,
Kafka throughput,
DB latency, and so on and so forth. Okay,
so this is what we're going to show you today: how we
can collect standard metrics without requiring
engineering effort. And this is useful to set
some standards across your company and ensure that, no matter
the language you are writing your system in, you are
collecting the same kind of information using the
same structure. Okay, cool,
folks. When we are talking about instrumenting
code, instrumenting code means collecting
as much as we can. As an engineer, you want
to know every single piece of information about
your system, because everything is kind of available.
But doing that kind of job can
quickly become overwhelming for you and for your team, because
you need to use engineering
time to add metrics that
might be provided by the platform. If you
have a platform team, for example a DevOps team, they can do
all the automations and the strategies that
you're going to see today in your company and give you that kind of
information, and you can use your engineers'
time to do instrumentation that collects the things that really matter
for your system, for your product engineering and so on. And for
example, we have a meme here. And of course, this is just a joke,
but a funny one, because when
you tell product engineers that we need to instrument our code,
that we need to spend engineering time instrumenting code instead
of delivering features, it automatically gets a
low priority. Okay? And as I said, this is just a joke,
folks, because of course we are talking about two different professionals
looking at the same problem from two different perspectives,
okay? When you are an engineer, you're trying to push
the best system you can to production: you want to build
the most scalable, the most performant system. When you're
a product engineer, you want to push to production the best product you
can, with the best features, the best user experience
and so on and so forth. So this is
a trade-off. We need to talk to each other. And of course, if your
platform team is providing you some
basic information on the platform side, you just
need to instrument your system, your code, for the
information that really matters for your product and for your teams,
and you can use that information to
improve your product. And your product engineers can use the same information, because
it's both technical and business
information, as well as observability information. So
let's move forward and let's see what are the
strategies that we're going to see today. There are three.
Okay, cool folks. So those are the strategies
to not overload your engineering team: the proxy strategy, the OpenTelemetry
strategy, and the eBPF strategy.
I have a blog post for each one of those
strategies on my Medium account. You can check this information
there as well, and feel free to reach out to me and provide any
feedback that you may have. And folks,
the idea of these strategies is to design and give
your engineering team a solid foundation
of observability without having any code change.
And what I mean by this is simple. You, as an engineer,
when you want to deploy a service on your
company infrastructure, you don't want to change your code to start collecting
common metrics such as HTTP requests,
gRPC streams,
Kafka consumers and Kafka producers.
You don't want to instrument your code to collect latency
metrics. You don't want to instrument your code to start collecting
basic tracing information and
so on, because you want to leverage your platform.
You want to consume observability as a service.
I like to say that because we are offering many
things as a service, like CI as a service,
Kubernetes as a service, deployment as a service, but you
also want to have observability as a service. You want to deploy your system
without any code change and you want to start collecting
telemetry data. Okay? So in
the end, the idea of this talk is to provide some useful insights
that you may use separately, each one on its own, or by combining
those strategies together to get the information that
you may want. Okay? After I
show you the live demo of each strategy,
we are going to see a comparison table
so we can compare the strategies,
the benefits, the pros and cons of each one.
And this may help you understand when to
choose one and when to choose another one and so on.
Okay, so folks, I hope you enjoy it. This is going to be
very fun right now because it's a live demo. You're going to see things
in action and you're going to see open source solutions.
And yeah, let's go, let's move forward and see the proxy strategy
in action. Okay, cool folks. So let's go for
the first strategy, the proxy strategy.
We are going to leverage an existing piece of infrastructure
that you probably already have in your company, which is your
web proxies, okay? If you're running HTTP
applications, you probably have something like Nginx or
HAProxy, which are very common solutions
when we need this kind of strategy.
But this is not coupled to
any technology. Okay, I'm going to use Nginx
as an example just because I'm familiar with Nginx,
but you can do the same with HAProxy or any other web
server that you know better. Okay?
And the concept is the same. So as we can see
on this slide, we have a diagram showing
the flow. We have an ingress proxy. The ingress proxy is
responsible for handling the HTTP requests coming
from outside your platform to
inside your platform, and then it redirects them
to the proper service, so service A or service B.
And on the left side of the diagram
we have the three telemetry
backends, okay? We have Prometheus for metrics, Jaeger for traces,
and an open source solution for logs. And those backends
are going to store the telemetry data produced by
the ingress proxy. Okay, this is a very simple one.
And the idea of this strategy is that if you are running web
applications, and this strategy is very web
specific, you can ensure that you are going
to produce standard telemetry data like
traces, metrics and logs, independent
of the technology that you are using to build your service. So let's imagine
that service A is using Java and
service B is using Golang.
You can ensure that the telemetry data we are collecting
uses the same standard and doesn't care about the
technology that the service is
using. Okay? So let's move,
let's move to VS Code and see the
very simple setup we have.
So, quick spoiler. The first thing that you see here is
a Makefile. I'm just using this to abstract a few commands
and type a little bit less.
But in the app folder we have a very simple
Golang application where we are mimicking an e-commerce
checkout process. So when a user does a checkout,
the checkout service is going to call the payment service to do
the user payment. Okay, very simple.
And here you can see a few lines of Go code,
not important for us. And then we have a Dockerfile,
simple as well, nothing special.
And then we have a Docker Compose file where we have a
few containers running. And I'm going
to tell you about it a little bit.
So first we have two proxy containers.
The first one is the ingress proxy, the one that I told you about,
which is handling the requests coming from outside your platform.
And then we have the egress container,
which is acting as an ambassador container,
so it's handling the requests
going out from one service to another service.
Okay. And having those containers,
those two proxies, we can collect
and connect the points between service A and service
B with distributed traces. Okay. It's very similar to what
we have when we are using a service mesh in Kubernetes.
Okay. We have containers sitting
in front of every application to do this magic.
Cool. Besides that, we have the checkout and the
payment services, and then we have the exporters
for both proxies. For the ingress and the
egress I'm using a Prometheus exporter
which is creating metrics from
the Nginx HTTP logs, which
are very useful and already have a lot of information. So we
are taking the logs and creating metrics from the existing
logs to understand latency,
request rates and so on and so forth. And then we
have Prometheus and also Jaeger.
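To make the setup more concrete, here is a hedged sketch of what a Docker Compose file for this kind of demo could look like. The image tags, ports and exporter choice are illustrative assumptions, not the exact contents of the repo.

```yaml
# Sketch of a proxy-strategy compose file (names, images and ports are illustrative).
services:
  ingress-proxy:
    image: opentracing/nginx-opentracing   # Nginx with the OpenTracing module baked in
    volumes:
      - ./nginx/ingress.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/jaeger-config.json:/etc/jaeger-config.json:ro
    ports:
      - "8080:80"
  checkout:
    build: ./app/checkout
  payment:
    build: ./app/payment
  nginx-exporter:
    image: quay.io/martinhelmich/prometheus-nginxlog-exporter  # builds metrics from access logs
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
  jaeger:
    image: jaegertracing/all-in-one
    ports:
      - "16686:16686"
```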
Okay, so before we move to the next step, folks, let me
come back to the proxy configuration.
And I would like to highlight that we are using a
very specific image for this container, which is
the nginx-opentracing one, and this
Docker image already has all the required modules to
start spans when a request is received and
then export the spans and the traces
to the tracing backend, in our case
Jaeger. Okay, so an important thing
here: I know that an
OpenTelemetry version of this module was released just
this week. I didn't update this demo yet,
but if you are starting with this strategy right now,
I really recommend using the OpenTelemetry version
and not the OpenTracing one. Okay, cool.
So for the ingress configuration, we
have something very simple. For the proxy configuration,
it's like a forward proxy. We are just getting
the request and forwarding it to the services, and
we are just leveraging the proxy to collect
the telemetry data without needing
to change anything in the code. As
we saw, the Go application is very simple and
we don't have any instrumentation for
collecting HTTP metrics. Okay, cool.
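For reference, an ingress configuration along these lines might look roughly like the sketch below. It is a minimal, hedged example based on the nginx-opentracing module's documented directives; the service name, port and tracer plugin path are assumptions, not the exact config from the demo.

```nginx
# Minimal ingress sketch for nginx-opentracing (illustrative values).
load_module modules/ngx_http_opentracing_module.so;

http {
    # Load the Jaeger tracer plugin with its JSON config.
    opentracing_load_tracer /usr/local/lib/libjaegertracing_plugin.so /etc/jaeger-config.json;
    opentracing on;

    server {
        listen 80;

        location /checkout {
            # Propagate the trace context to the upstream service.
            opentracing_propagate_context;
            proxy_pass http://checkout:8080;
        }
    }
}
```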
So we already know the basics, we already
know all the pieces, and then we can
just start generating a
bit of load on those services and see things happening. So I'll
open the terminal. Let's look at the Makefile just to understand what's
happening behind the scenes. And the first thing that I'm
going to do is run make setup. Make setup
is going to start all the containers using
Docker Compose up. And I already have all
the containers up and running, as we can see here.
All those logs are here. We can
go to the web browser. Let me switch to the web browser
and, oh, actually let me just fix
one thing for the demo first. And now in the
web browser we can access localhost on
port 9090, which is the Prometheus web
interface. And we can see the Prometheus targets
here: the two Nginx proxies, the ingress and the egress,
and Prometheus itself. Okay,
so going back to VS Code,
let me see which port
Jaeger is running on, because I don't know it by heart,
and it's 16686.
And then we are going to move back to the
browser and we are going to
access localhost and then the Jaeger port.
So we don't have anything here yet. Okay. The only service
we have is Jaeger itself, because Jaeger is collecting its
own traces. And
now that we have all those things running as
expected, we can start producing some load
on this infrastructure. So let's go back
to VS Code, and then
we are going to open a new terminal and use a
make command we have here, which is make test.
What make test does is just create some load
on the checkout service using Vegeta. Vegeta is
a very simple load testing CLI.
I think it's just amazing for this kind of workload. Okay.
And I'm going to run the load test against
the checkout API for 6 seconds.
Okay. So let's go.
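Under the hood, a make test target like this typically wraps a Vegeta invocation similar to the sketch below; the target URL, rate and method are illustrative assumptions, not the exact values from the repo.

```sh
# Hammer the checkout endpoint for a while and print a latency/status report (illustrative).
echo "GET http://localhost:8080/checkout" \
  | vegeta attack -duration=6s -rate=50 \
  | vegeta report
```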
And it's producing, it's making a lot of HTTP requests.
So if we go to the logs, we may see a
few log lines happening here, and
I think we may already have some data
available. So let's go back to the browser.
And then in the browser, the first thing that we're going to see
are the metrics that we are collecting. So we have a few
metrics named nginx. Let me hit refresh
because the metrics might not be available yet. Yeah, we already
have them. And then we have the nginx
HTTP request count total.
So we have a few things
here, we can see all
those things. And then if we run an expression like
the rate of HTTP requests and
sum this by, I don't know,
service, URI and status
code, we can see all
those things. And then
we can see this increasing over time,
which is very cool. And since we are doing linear requests,
we are not going to see these values going up and down.
But here we already have the information
that we need. Okay. We can measure
the amount of HTTP requests for a given service,
and that's cool. That's pretty cool.
And using that information, seeing the amount of HTTP
requests, we can use this to build SLOs, for example,
like error rates. We can use that information
to measure HTTP responses, because we should
probably have, I'm not finding it right now,
but we should have a few metrics about latency.
But we can know, for
example, the response
size for each request, which is also
very useful information. And using the
Nginx exporter, you can build any metric
you want using the Nginx logs.
Okay. So you can take the logs and you can build all the required
metrics that you may need. Those are just a few examples.
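As a concrete example of the kind of expression shown here, the queries could look roughly like the following. The metric and label names are assumptions based on a typical Nginx log exporter setup; check the names your exporter actually emits.

```promql
# Per-service request rate broken down by URI and status code (illustrative names).
sum by (service, uri, status) (
  rate(nginx_http_request_count_total[5m])
)

# A simple error-ratio building block for an SLO.
sum(rate(nginx_http_request_count_total{status=~"5.."}[5m]))
/
sum(rate(nginx_http_request_count_total[5m]))
```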
And as I said, those are
standard metrics; it doesn't matter which technology you are using
behind the scenes. The next telemetry
data that we can see are the
distributed traces, okay. And then we can see that
we have two services here. The first one is the
checkout. And if we find traces from the checkout
service, we can see that the checkout
goes to payments and so on.
So we have two hops, we have two hops for every service.
This is Nginx internals, okay.
And in
that way we can look at the system architecture and we can
see, for example...
it's not zooming, let me see if it's better on the service view.
Yeah, we can see that the checkout is using the
payments API. And if you have many services, you'll be able to see
this architecture diagram,
which is nice, because if you are using
a microservices solution, where you have many microservices
talking to each other, it's very hard to know only from
memory what the service communication
flows are. Okay. One thing that's very
important: this distributed tracing information is
useful, but it doesn't give much detail about
the service internals. So you cannot identify performance
issues inside your services using that information. Okay.
This is useful to understand the network hops on
your platform, but not useful
to identify internal problems. Okay.
But given that information, given that telemetry data,
your teams can start looking and see, okay, we are talking to this service and
that service, and then they can say,
okay, on this payment flow
I want to improve the details. And then
the team can go and add telemetry for the flow and the path
they really want to know about. Okay. And they can reduce
the amount of work they need to do.
Cool. So this is the proxy strategy,
folks. It's very simple. As I said,
the idea is not rocket science: just using
a piece of infrastructure that you already have in your company
and starting to collect some telemetry data.
Okay, so before we finish, just talking about
the logs: the logs are available on
your host. So, for example, you can see all the logs produced
here. We are producing a lot because we are generating
a bunch of requests.
We don't have any errors in this case,
but we could mimic some errors, for example. And we also
have these access logs. You can use
some log shipper like the OpenTelemetry Collector or Fluent
Bit to collect those logs and ship them to an
open source solution. And my advice
is, since most of the information you
have in the logs is now available in the metrics,
like status code and path,
you can just choose to drop
some of those logs, especially the ones with 200
status codes. Okay, this is just another piece of advice that might save
you some money in terms of storage and also networking.
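If you go down that road with Fluent Bit, a hedged sketch of such a filter could look like this. It assumes your access logs are already parsed so that a status field exists; the match tag and regex are illustrative.

```ini
# Drop access log lines whose parsed status field is a 2xx (illustrative tag/field names).
[FILTER]
    Name     grep
    Match    nginx.access
    Exclude  status ^2\d\d$
```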
So this is the first strategy.
We are going to move to the next one now,
which is the OpenTelemetry strategy. So let's
go. Okay, cool folks, this is the
second strategy, the OpenTelemetry strategy.
And in my opinion this is the coolest one, because
it's using OpenTelemetry, which is an
amazing project maintained by amazing people,
offering a lot of integrations, and
wow, this is very interesting. Okay,
and this strategy, folks, also aims to rely on an
infrastructure piece that you deploy in your company,
and it will start collecting metrics and traces out
of the box for you. Okay, so the OpenTelemetry project
is a huge project composed of many different parts, such
as the OpenTelemetry specification, the OpenTelemetry Collector,
the auto-instrumentation, the Operator, the eBPF
agent and so on and so forth.
Today we are going to use the OpenTelemetry auto-instrumentation and the OpenTelemetry
Collector to
auto-generate and collect the
traces and generate the metrics. Okay, so looking
at the diagram, we have an OpenTelemetry Collector agent
running on your host and receiving the
traces that are produced by your application.
The OpenTelemetry agent is going to process the
traces, create metrics, and then export the
metrics and traces to the backends:
metrics to Prometheus, traces to Jaeger.
It's kind of language agnostic, so it doesn't matter which language
you're using, whether Python or Java
and so on and so forth. Okay, so let's move to VS Code.
In VS Code we have
almost the same thing. Before I
show you the solution, I'm just going to run make setup
to ensure that I have everything running, and I will start producing
some load on the
services. It's using the same Vegeta command that I showed you
in the previous demo.
So, let's look at the configs we have here.
First I have the app folder. In the app folder
we have a very simple Python application mimicking
the same behavior, checkout and payments doing the same
flow. Okay? And folks, as you can see
here, we don't have any OpenTelemetry or Prometheus or
Jaeger code. We are not creating traces, we are not creating
metrics. This is a plain and standard
Flask Python server. Okay, just
bear this in mind: all the things that you're going to see are auto-generated.
So, looking at the Dockerfile, this is
where the magic starts to happen, because we have to
install a few Python packages on
the host, dependencies like the opentelemetry-distro package and
the exporter for the OpenTelemetry format.
We also need to run a bootstrap command to
configure everything that we need on the host. And then
we have the opentelemetry-instrument command wrapping the
Python command. So it's running
opentelemetry-instrument, and that is what starts the
Python process when we do this, folks.
And this is where the magic starts to happen, because the
auto-instrumentation project of OpenTelemetry is
doing the same thing that a
vendor agent does. It's changing your code,
in this case Python code, during runtime to
add OpenTelemetry code. So the
OpenTelemetry auto-instrumentation is changing the Python code to
initialize the tracing context and ensure that we are propagating
the tracing headers when we are doing HTTP requests or producing
a Kafka message, and reading the trace context when we are receiving
a request or consuming a Kafka message. And those are just
examples; it's doing a lot of things behind the scenes.
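A minimal sketch of that kind of Dockerfile is shown below, using the documented opentelemetry-distro workflow. The base image, file names and app entry point are assumptions for illustration; the repo's actual Dockerfile may differ.

```dockerfile
# Illustrative Dockerfile enabling OpenTelemetry auto-instrumentation for a Flask app.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt \
    && pip install opentelemetry-distro opentelemetry-exporter-otlp \
    # Install instrumentation packages for the libraries found in the environment.
    && opentelemetry-bootstrap -a install
COPY . .
# Wrap the normal start command so the SDK is injected at runtime.
CMD ["opentelemetry-instrument", "python", "app.py"]
```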
Okay, cool. The traces being produced
by this application will be sent
to the OpenTelemetry Collector. Before I
show you the OpenTelemetry Collector configuration,
let me show you the containers we have running
inside the Docker Compose file. We have Prometheus, then we have Jaeger,
then we have the OpenTelemetry Collector, and then we
have the service containers, where we have the checkout API and
also the payments. As you can see, for
both containers I have a few environment variables defined.
We have the OpenTelemetry traces exporter,
which is going to be OTLP,
which is the format I'm going to export traces in; the
service name, in this case checkout; and the
OpenTelemetry Collector endpoint where
this application should push traces, which is otel
on port 4317. Okay,
otel is the container running the OpenTelemetry
Collector. Cool. Pretty simple.
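In Compose terms, those are roughly the standard OpenTelemetry SDK environment variables, something like the sketch below; the container names and the exact endpoint are assumptions.

```yaml
# Illustrative Compose service with the standard OTel SDK environment variables.
services:
  checkout:
    build: ./app/checkout
    environment:
      OTEL_TRACES_EXPORTER: otlp
      OTEL_SERVICE_NAME: checkout
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel:4317
```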
Moving on to the OpenTelemetry
Collector: the OpenTelemetry Collector is a piece of software
responsible for receiving telemetry data,
processing the telemetry data and then exporting the
telemetry data. It is literally
a software pipeline. Okay, so you can receive,
process and export. And we can
see we have a pipelines section in the
OpenTelemetry Collector config, and we have two pipelines running here.
The first one has a receiver, which is OTLP,
so the application is producing the traces for this
receiver. And then we have a batch processor: we are just
batching the spans and exporting those
spans in batches to the backend, which is
another OTLP exporter, and that OTLP exporter is sending to
the Jaeger endpoint. And in the
exporters we are also sending to something named
spanmetrics. And what is spanmetrics? Spanmetrics is a connector.
A connector is a part of the OpenTelemetry
Collector that acts as
a receiver and also as an exporter. So it's
literally a connector: it can receive and export
at the same time. Okay, so every
span we are exporting,
we are also sending to spanmetrics. Spanmetrics receives
the spans, processes those spans, and
basically creates metrics from
the spans. And later we are using
another pipeline, the spanmetrics metrics pipeline.
We are getting data from the spanmetrics connector, since it can
receive and export data. So we are getting
the metrics from spanmetrics and we are exporting them with the Prometheus
remote write exporter, which is remote writing to our Prometheus server.
Okay, this is basically it, folks.
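Putting that pipeline into configuration, a hedged sketch of the Collector config could look like the following. Component names follow the upstream Collector and spanmetrics connector docs; the endpoints are assumptions matching this demo's container names.

```yaml
# Illustrative OpenTelemetry Collector config: traces in, traces plus span-derived metrics out.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
connectors:
  spanmetrics: {}
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, spanmetrics]
    metrics/spanmetrics:
      receivers: [spanmetrics]
      exporters: [prometheusremotewrite]
```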
So the piece of code responsible
for creating metrics from traces is the spanmetrics connector.
To recap: the OpenTelemetry auto-instrumentation
in the Docker image is changing the Python code
to initialize the tracing context and ensure
that we are propagating headers and reading headers; the Collector
is receiving, processing and exporting the traces and the metrics.
Okay, just one note:
the way I did it in the Dockerfile is the simplest
way to get auto-instrumentation running.
Okay, there are other approaches.
If you are running Kubernetes, you can use the OpenTelemetry
Operator and you can inject sidecars
into your pods based on
your technology. So if you're using Java,
you just need to annotate your pods with the
OpenTelemetry Java annotation, and the OpenTelemetry
Operator is going to do all the magic for you.
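Just for reference, the Kubernetes approach mentioned here usually boils down to an Instrumentation resource plus a pod annotation, roughly like the sketch below; the names and endpoint are illustrative assumptions, and the exact fields depend on your Operator version.

```yaml
# Illustrative OpenTelemetry Operator auto-instrumentation setup.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317
---
# Then opt a workload in with a language-specific annotation on the pod template, e.g.:
#   instrumentation.opentelemetry.io/inject-java: "true"
#   instrumentation.opentelemetry.io/inject-python: "true"
```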
Okay, but I'm not showing you this today
because it's a little bit more complex. So, we
already have this demo running for a while now.
Let me just check one thing. I guess I forgot
to run make test.
Yeah, now it's running, so let's switch to the browser.
Now we are in the browser. We can go
first to the Jaeger view, and we
can see the traces from the services. We have checkout
and payments. If we look at
one trace specifically, we can see that
we have three spans. We have the first span, which is
the checkout, when the checkout service receives
the request; then we have a checkout
GET action, where the checkout service
is doing an HTTP request to the payment service; and
then we have the payment service receiving
the request. Since the payment service does nothing else,
we don't have any kind of continuation
span here. But if we were consuming a Kafka message or
something like that, we would see that span as
well. Okay, we can look at those spans
and see that we have some useful information like the user agent.
In this case, Vegeta is written in Go,
so the user agent is the Go client. We have the host
port, the peer IP,
the OpenTelemetry library name. We also have a few process
tags such as the SDK version.
It's also showing the auto-instrumentation
version. And the same is true for
the payment service. The difference here is that the user agent is
Python, because the checkout service calling it is a Python server
and not Golang. Okay, cool.
As we can see, with the OpenTelemetry strategy
we have more details about the service internals,
something that we didn't have with the proxy strategy,
because with the proxy strategy we are collecting telemetry
data from the layer above. Okay,
we also have the system architecture, as we can
see here, very useful, just like in
the other demo. But since we are using
the OpenTelemetry spanmetrics connector and we are creating
metrics from traces, we can leverage another Jaeger feature,
which is the Monitor feature. We can
see an APM-like view by
service and operation. Under each service we can see
the request rate, the p99 metrics,
we can see the action, and we can also
see the impact of this action on the service,
so whether this
action is used more or less.
This is very nice and very useful for quick troubleshooting,
if you want to see this on your Jaeger page.
Jaeger is reading data from Prometheus to build
this screen, which is kind of nice. Okay,
so moving to Prometheus:
if we go to Prometheus on port 9090,
we have two metrics here. The first one is calls,
which is basically a rate of actions.
Okay, in this case, since we are doing HTTP requests, this is
a rate of HTTP requests. And then we can
run a rate over five minutes by
HTTP method, status code and service
name, and we can see this going up.
Okay, let me just reduce this a bit.
Well, this is it. If we were
running Kafka producers, we would see the
same thing, and Kafka consumers, the same thing,
under the calls metric. So it's important
to understand this kind of operation. We also, of course, have
the span name,
for example, to help you understand which action
is being executed in this part
of the code. But it's basically that, okay.
Another useful metric we have is the duration bucket,
where we can measure the p99 and
p95, as we saw on the
Jaeger screen. So let's do a sum
by le and service name,
and we are going to use histogram_quantile with 0.99
for this. And we can see that, over the
last five minutes, the p99 for
the checkout service is around eight milliseconds, and for
the payment service it's around one millisecond and a
half. Okay, well, those are very useful metrics.
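For reference, the two queries sketched here would look roughly like the following; the exact metric names (calls_total, duration_bucket or duration_milliseconds_bucket) and the available labels vary with the Collector and spanmetrics version you run, so treat these as illustrative.

```promql
# Request rate per service, span and status, derived from spans (illustrative names).
sum by (service_name, span_name, status_code) (
  rate(calls_total[5m])
)

# p99 latency per service from the span-duration histogram.
histogram_quantile(
  0.99,
  sum by (le, service_name) (rate(duration_bucket[5m]))
)
```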
Again, we are doing all those things without any
code change. We are just deploying an agent on
the host or changing the Docker image. So this is very
simple to be executed by someone from
your platform team, which is nice. And we
ensure that those metrics are being produced
using an open standard, which is the OpenTelemetry specification.
Most of the vendors are supporting OpenTelemetry
data. Open source solutions are supporting OpenTelemetry
data as well. And if you are
using a vendor which is not supporting the OpenTelemetry format,
I do recommend you move away from that vendor, because
it's not good for you to keep using proprietary
telemetry data. Okay, this is nice as
well because, no matter which technology you are using to
build your service, whether it's Python, Golang
or Java, all those metrics will have
the same standard, the same labels and so on.
So this is very useful, especially if you want to build dynamic dashboards,
dynamic SLOs and so on.
Okay folks, so I hope you enjoyed
this demo. For me, this is one of the coolest ones.
And we are now switching to the third one, the eBPF
strategy. All right folks, so this is the
third observability strategy to not
overload engineering teams. And this strategy
is very interesting
because it's using eBPF. eBPF is an emerging
technology, especially in the cloud native space.
You may see a lot of products using eBPF
for observability, security, networking and
so on. eBPF, for those who
don't know what it means, stands for extended
Berkeley Packet Filter. BPF is very common in
the Linux kernel, and eBPF
is like BPF with some extra features,
really cool extra features
actually. And the idea of eBPF
is to extend your Linux kernel to
trace, monitor and analyze system performance
and behavior. So you can collect things that are happening
at the kernel level, derive insights
and provide that information to user space.
Okay. And eBPF
is not such a new technology, and you can already see a few
products leveraging eBPF. We are starting to
see a few more nowadays, but it's not highly
adopted yet. Okay. For many reasons:
people are still discovering it,
and so on. So the idea of this
demo, folks, is basically the same idea: we are
going to have an agent that will be able to collect
all the signals that we need, like metrics, traces and logs,
from the application level, not only from the
infrastructure. As we did for the other demos,
we're going to use the same
concept of collecting application-level observability.
to the VS code. And this
demo is going to be a little bit different because I'm
really focusing on kubernetes strategies right
now because I'm going to be using a solution which is Kubernetes based.
But we already have many other options to the ones that not
using kubernetes. Okay, so what
we have here, it's like pretty simple. We have a
cluster and then we are going to starting a
minikube cluster and then we are going to install on the cluster
cilian. What is cilian? Celine is
a networking interface for
Kubernetes. Okay. So we have many, like Falco,
we have some cloud specific
CNIs container network interface like AWS,
CNI and Cylin is another CNI,
okay. And Celine is fully built
on top of EBPF and it's using EBPF for
networking, for loading, balance and many other
things. And now so using EBPF to
provide observability inside your cluster. So using
Celine and its EBPF agent, we can collect
metrics like TCP,
HTTP networking metrics and so
on. And we can also understanding what
is the networking flow inside our cluster.
Very similar with the information we have
on the Yeager diagram architecture,
but it's build not using traces, it's build
using network flows.
But the concept is the same. It's just another sign or another kind
of information that we can use to get the same
site. Okay, so I have all the comments that
I need to install celeb to install the monitoring,
the monitoring stack on the Kubernetes cluster. And another thing like
we are leveraging some applications from Star wars to
start collecting traffic from there. Okay, I will not
cover each comment as I didn't on the previous
one, but you are free to check this out later.
All the source code will be available on the GitHub and
you have access for that. Okay folks, so meanwhile
let's start creating, make setup as we
did for the other ones. We are going to start in a Kubernetes cluster
and then we will start installing
all the things that we need like Celian, Grafana,
Prometheus and so on. Okay?
So as soon as it's finished, we will be back here.
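For context, a make setup target for this kind of demo typically wraps commands along these lines. This is a hedged sketch based on the Cilium Helm chart's documented values; the minikube flags and chart values are assumptions, so check the repo and the Cilium docs for the exact commands.

```sh
# Illustrative cluster setup with Cilium plus Hubble metrics enabled.
minikube start --network-plugin=cni --cni=false

helm repo add cilium https://helm.cilium.io
helm install cilium cilium/cilium --namespace kube-system \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true \
  --set hubble.metrics.enabled="{dns,drop,tcp,flow,http}" \
  --set prometheus.enabled=true \
  --set operator.prometheus.enabled=true
```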
Okay, cool. Now we already have all
the components running on the cluster. So if we run kubectl
get pods -A, to get pods from all
namespaces, we're going to see all the
pods that we need to have on our cluster.
We have the cilium-operator,
which is the operator that's going to ensure that
each node has a Cilium agent running there, that Cilium is healthy,
that it's collecting all the metrics and so on and so forth.
We have a few Kubernetes pods
like CoreDNS and etcd, those are pretty straightforward, and we have
Hubble Relay and Hubble UI. Okay, I have other Kubernetes
pods as well. So the thing here is that Cilium
by default is not providing any kind of observability.
Okay, the Cilium project works on a very
specific scope, which is Kubernetes CNI,
networking and load balancing:
ensuring that when you create a new pod,
the pod gets an IP, the nodes get IPs,
doing the communication with your cloud providers
to get IPs from your network
and so on. But the
Cilium project has another sub-project named
Hubble. Hubble is an observability
solution, if I can say that, which is
leveraging Cilium to get network flows from
your pod communications and then extract metrics
and provide network visibility for
your cluster and the applications that you are running. Okay, so this
is what Cilium is and this is what Hubble is, and we are going to see
this right now. For this we
need to port forward a few components to
our machine. Let me show you the pods again
and explain another thing. So we have each Kubernetes node with
a Cilium agent running
there, watching every network
communication inside those nodes. Okay,
and then Hubble is getting the network
flows and producing all the observability that we need,
creating metrics and so on.
And we have a few other things here:
we have Grafana and we also have Prometheus, because we need to
store the time series that
Hubble is creating. And we have Deathstar,
and also tiefighter and
xwing, to produce
some load inside the cluster so we can mimic
services communicating with each other. Okay,
so tiefighter and xwing are
doing some HTTP requests to Deathstar. And we're going to see
this in action right now. Cool.
So let's port forward a few components,
with make port-forward for the Hubble Relay,
because we are running all
those things from our machine. And this is a Hubble
requirement, because the Hubble
UI needs to talk to the Hubble Relay to get the information
that we need. So now I'm going to run
make port-forward for the UI.
Okay. Meanwhile we can switch to the browser.
So, on the browser, let me switch to the browser
as well. We can open localhost on port 12000,
I guess it's 12000.
Okay, this is the Hubble UI
home page. Okay, so you can see all
the namespaces we have inside the cluster. And then
if we click on a namespace, we can see all
the applications running inside this namespace and the traffic
flowing through each application.
Let me show you another one, like kube-system, the same idea.
We have a Hubble UI and a Hubble Relay running there.
What else? In cilium-monitoring we have
Grafana. Let me see if it loads.
No flows found for now, because we don't
have any traffic happening inside this
namespace, but we can go back to default and then
we can see that xwing and tiefighter are
talking to Deathstar. Okay. And we can see
all the actions happening right now, like the POST for
the /v1/request-landing endpoint. Okay.
And we can see it's forwarded. And if we click on
those things, let me see, maybe
down below, if we click here,
I just missed it, we can see a few details,
like when this communication happened,
we can see the verdict of that action, whether it's ingress
or egress. What else? We
know what the source pod is. So if you have many pods,
you can see where this
network action is coming from.
And you have a few labels: we have the IP,
we have the destination pod, we have the destination
labels, a lot of useful information.
Okay? And then we can apply
a few filters here,
like we can filter by name. Let me
see, Kubernetes... I'm not seeing it,
but maybe by clicking here,
namespace default. You see, it's already
filtering by label, namespace equals default. This is how
we are filtering. And it's the same thing
if we click here on the
service pod: we can see a few labels from
the pod, from the destination, the same thing.
And that's basically it. From here,
we can learn some network information,
but nothing very special.
But this is more of a UI view, because
from that application you cannot create any alerts,
you cannot create any dashboard. Okay? This is more information regarding
Cilium and Hubble than a proper
observability solution. But luckily
for us, Cilium and Hubble export all
that information, especially the metrics being created,
to a time series database such as Prometheus. And then we
can move back to VS Code. Okay,
so let's go back to VS Code, then let's create a
new tab here and let's port forward Grafana.
Now we have Grafana running. Let's go back to the
browser and open localhost on port 3000.
And we have a few dashboards in Grafana. The first dashboard
is about the Cilium operator.
This is not what we need; it's more related to the Cilium
operator's healthiness.
What else? We have Cilium metrics,
which is useful as well, but not what we need.
Okay, we may see how much memory
eBPF has been using, whether we have any
eBPF errors or not,
system calls, maps and so on. But this is
not what we want, right? Let's see again
what else we have. We have two other dashboards,
which are more related to Hubble and the metrics Hubble is
producing based on the network flows.
So we have the Hubble dashboard by itself.
We may see the amount of flows, and the flows,
folks, that we are talking about here are all those things happening
in this tab down below where I'm hovering
the mouse. So each communication between a
source and a destination is considered a flow. So we
may see the amount of flows we have, we may see the type
of flows we have, whether it's an L3/L4 flow or an
L7 network
flow. In this case we have a few L7 flows
because we are doing HTTP requests,
okay. And so on. We may also see if
we are losing any packets. And that's it
for this dashboard, which is nice because we
already start seeing a few HTTP
metrics that are being created on top of the network
flows; Cilium is collecting DNS
and so on. But what is also nice here is the
Hubble L7 HTTP metrics by workload dashboard,
where we can see what
the source workload is,
for example xwing or tiefighter,
and then we can see the destination, which in this case
is only Deathstar. We don't have any other destination,
but we can see metrics
by status code, by source
and destination and so on. So let's explore.
We may also see latency, as we can see here, like the p99
and the p95. We can build
SLOs using these metrics. We can build alerts for errors,
high latency and so on. So, just to show you,
we can explore a few metrics.
Labels, let me see, over five minutes.
And we can see a few labels here. We have the
destination namespace and the destination workload,
the destination IP as well.
We have which protocol we are using, because
we may use the Kafka protocol, we may
use HTTP or other kinds
of things. Okay, we may
use HTTP/2 and so on. We have the source workload
and the source namespace. We have the
method. And I guess we also have the status
code somewhere here, I'm not
100% sure. Yeah, we have the status and also
we have the method. So we
may create queries like:
okay, over the last five
minutes I want this summed by
source workload, status code
and method, and we can see something like that.
I guess it's status, maybe. Yeah, it's status.
So we may build some SLOs, like all the
good requests divided by all the requests,
and then we can filter by status, something
like that.
Okay, I just have a typo here.
Oh, it's the opposite. Yeah,
and then we may have something like that.
Yeah, I broke the query, but I don't know why,
because it's zero. Ah, this is the reason: because we don't have
enough requests with errors. But yeah, you got the idea.
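The SLO-style query being built here would, in a hedged form, look something like the following; hubble_http_requests_total and the workload labels are what recent Hubble versions can expose when the HTTP metric context options are enabled, so verify the exact names against your cluster.

```promql
# Share of non-5xx requests per source workload over the last five minutes (illustrative).
sum by (source_workload) (
  rate(hubble_http_requests_total{status!~"5.."}[5m])
)
/
sum by (source_workload) (
  rate(hubble_http_requests_total[5m])
)
```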
Okay, so you can use all those metrics with
the same standard that we had in the others. It doesn't matter which technology
you're using, whether it's Java, Golang or Ruby,
we don't care. The metrics we are collecting
are the same. The network flows we are collecting are the same.
We don't need to instrument our code to get all
those metrics, which is nice. Again, this is
at the application level, it's not at the load balancer
level. Okay. It's pretty close to your service.
And this is the idea of Cilium,
eBPF and so on. Okay,
so, back to the
slides, I would like to add a few things
here. I've been using Cilium as
the solution for this demo, but the truth is,
when I was building this demo, Cilium was the most mature
technology for observability using eBPF.
But nowadays we already have a few
more technologies that I haven't tested yet. We have the
OpenTelemetry eBPF agent, which is providing a
few metrics, not as detailed as what Cilium
is doing. But I do believe this is a project that's
going to be very mature in a few
weeks or a few months, maybe. There is still work ongoing
to have Helm charts to make this agent installation
easier, and the OpenTelemetry team is working to improve
this. I know that the community is
building a few other solutions using eBPF, but I haven't
tested them yet. Okay, so for sure
Cilium is not providing all the signals we need.
We are talking mostly about metrics here;
we are not including traces
and logs. Okay, we are only talking about metrics. But this
is a start. eBPF, as I said, is a continuously
growing technology. So day after
day we see the community evolving the
observability and security solutions using eBPF.
So I do believe this is a technology that you
must watch for
your observability systems, not only for metrics
and traces, but also for profiling. We see many profiling
solutions, like Parca, doing CPU and memory profiling
using eBPF. Okay, folks,
so this is the third strategy,
the eBPF strategy. Again, the idea is to not change
any line of code to include instrumentation, and to
collect as much as we can using the platform.
Okay, so this was the last one.
Let's now look at a table comparing all of them.
Okay, now you might be asking: what is the best solution?
What is the best strategy to not overload my team?
And this is the answer that engineers usually
hate: it depends. Of course, it depends on your
infrastructure, it depends on your team's knowledge, it depends on how many
people you have working on the platform, abstracting
features and providing things as a service inside your company.
So of course, it always depends. But I
listed a few things that I think might be important,
like being technology agnostic, context propagation,
being environment agnostic, and also the kind of telemetry data each one produces.
By technology agnostic, what I mean is
whether I have different implementations based on the technology
I've been using. For the proxy and eBPF
solutions, this is completely agnostic: it doesn't matter the
technology you are using, the implementation will be the same. With the
proxy we are collecting at the proxy level, and eBPF is doing all
the magic at the kernel level for you. The OpenTelemetry instrumentation,
though, depends on the technology that you are using.
The example that I showed you
is a Python solution. So if
you are using, I don't know, C#, there is another
way to implement the same concept; another
way if you are using Rust, another way if you're
using Node.js as well.
So OpenTelemetry instrumentation depends on your
technology, so you might have different implementations based on
the stack you have. But in the end
it's worth it: you saw the power of OpenTelemetry
auto-instrumentation and the things we can do with the OpenTelemetry
Collector. I do believe you should give it a try.
Okay, so about which options ensure context propagation:
the proxy kind of does.
It only ensures context propagation
in our demo because we have an ingress proxy and an egress
proxy, so I'm listening to all the traffic that's
coming into my platform and all the traffic that's going out from
my services. And the
other thing is that we only see the context
between proxies. We don't see things happening inside the
application. Okay? So we cannot
use this information to really
troubleshoot problems inside the application.
This is what I mean regarding traces, okay.
Regarding the metrics, well, we have the highest
level possible, and the
closer your proxies are to your customers, the more
realistic the latency and the error rates that
you're going to collect from your system will be. Okay.
OpenTelemetry and eBPF,
yeah, they ensure context propagation. OpenTelemetry sees the
requests going from one service to another,
and also what happens inside the service. With eBPF
it depends, of course, on the solution you are using. Cilium does not
handle traces, but if you try other solutions that are eBPF
tracing solutions, it's going to work.
All three options are environment agnostic, so it doesn't matter
if you're running Kubernetes,
virtual machines or bare metal, it's
going to work. Cilium itself
only works on Kubernetes, because
it's a Kubernetes CNI, but eBPF as such is not a
Kubernetes-based technology, so you can use the OpenTelemetry
eBPF collector on your plain Linux machines.
Okay. And all of those provide
different kinds of telemetry data. We have logs, we have
traces, and we also
have metrics, for sure, for all those three options.
Okay folks. And as I said, it depends.
You need to understand your use case, you need to understand your requirements
and your capabilities to choose the best fit for you.
One of them is going to provide what you need, the
metrics and the telemetry data that you are looking for.
Okay, cool. But there is another
option: if I'm able
to implement every solution, why not do that and collect
metrics at different levels? Okay, like using
the proxy strategy to collect metrics at the proxy
level, on the proxies that are closer to my customers, so that I
can measure latency pretty close to the
customer. Okay, why not use auto-instrumentation to
start collecting traces from my applications without any
code change, providing all the information that I need,
and then use an OpenTelemetry Collector to process and enrich
this telemetry data? And why not use eBPF
as well? Then I can collect network information and also
application metrics from the kernel level, because the kernel
is the best place ever to collect
observability and security data. And then you
can have different points of view, and you can decide which
metric is better for each level
you are looking at. If you're looking at the network level, eBPF
and the Cilium solution are probably going to be better for you. If you're only looking
for application-level metrics, OpenTelemetry
is going to be better, and so on and so forth. And then you can
use all those strategies to provide as much insight as you can for
your engineering teams. They can build alerts, they can do whatever they
want. Or, using all those standard metrics,
you can automatically provide them with dashboards and
also alerting out of the box, okay? And this is
the great thing about those strategies: ensuring that
the teams are going to get default observability
for their services without any code changes. They can
focus on delivering features and make the product owner happy,
make the customer happy as well, and increase revenue. And then,
when they really need to put effort into
observability, they're going to add observability for their context,
for their specific use case. Cool. So folks,
that's it. We can also try to combine all those
three. So that's it from my
side. I really hope you enjoyed it. If you have
any questions, feel free to ping me. It will be my pleasure
to talk to you and have a nice conversation about cloud
native, about observability and many other subjects.
And thank you for being here and thank
you for listening. So, see you, folks.