Transcript
This transcript was autogenerated. To make changes, submit a PR.
Today I'll be talking about zero-instrumentation observability based on eBPF. Let's start from the very beginning. What is observability? Observability is being able to answer questions about a system. How is the system performing right now? How does its current performance compare to the previous period? Why are some requests failing? Why are certain requests taking longer than expected? In other words, why does the system perform slower than before?
A while ago, web applications were much simpler than today. Usually it was a few replicas of the application running on dedicated nodes, plus a dedicated database. Pretty simple. If something went wrong, you just needed to analyze the application logs, maybe some metrics about how the application was doing, and the database metrics and logs. That's it. Pretty simple.
Now systems are getting more and more complex: microservice architectures, a lot of databases. A typical application these days can contain hundreds or even thousands of services running on a Kubernetes cluster. Nodes are dynamic: they can appear and disappear because of an autoscaler, spot nodes, and so on. While troubleshooting such a system, we need to follow the system topology from one service to another, to the database, to the infrastructure level, the database level, the network. A lot of things have to be analyzed to identify the cause. Let's discuss the steps: how do we make a system observable, and what does that mean? The first step is collecting telemetry data such as metrics, logs, traces, and the fourth telemetry signal these days, continuous profiling.
If we do that manually, we have to instrument every application with some observability framework such as OpenTelemetry. This approach is good, but it requires your team's time. It can be time consuming because we have a lot of services. An additional disadvantage is that you cannot achieve 100% coverage of such a system, because usually we have a lot of third-party or legacy services whose code we cannot modify. That means we cannot integrate OpenTelemetry into such applications.
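To give a sense of what manual instrumentation involves, here is a minimal sketch of initializing the OpenTelemetry SDK in a Go service. The collector endpoint, tracer name, and span name are placeholders, and every service in the system would need similar boilerplate plus per-request spans.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export spans over OTLP/gRPC; the collector address is an assumption.
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Register a tracer provider that batches and ships spans.
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	// Every operation you care about still has to be wrapped in a span by hand.
	ctx, span := otel.Tracer("checkout").Start(ctx, "process-order")
	defer span.End()
	// ... business logic ...
	_ = ctx
}
```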
Then we need to store this data somewhere. These days it can be open source databases like ClickHouse or the Grafana storages for telemetry data, or it can be commercial services such as Datadog, New Relic, and so on. From my perspective, the most challenging part is learning how to extract insights from all this data. We have hundreds of gigabytes or terabytes of logs, we have thousands or millions of metrics, and we have a lot of traces. How do we navigate them and turn them into insights: what's happening with my system right now, and what is the root cause of an issue, using all this data?
Our system is pretty complex, and we need to gather a lot of parameters, metrics, and other telemetry from each subsystem to be able to troubleshoot our application. Let's talk about what exactly we want to know about each application in our system. First of all, we need to understand how a particular application performs right now; these are SLIs, service level indicators. Usually it means we want to know how many requests the service is processing right now, its success or error rate, and the latency of each request, or a latency histogram or something like that; any aggregation of latency will work.
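As a rough sketch of what these SLIs look like when an application exposes them itself, here is a hand-instrumented HTTP handler using the Prometheus Go client; the metric names and buckets are illustrative, not a prescribed convention.

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Request throughput and error rate, split by HTTP status.
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "http_requests_total"},
		[]string{"status"},
	)
	// Latency distribution; any aggregation of latency will do.
	requestDuration = prometheus.NewHistogram(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Buckets: prometheus.DefBuckets,
		},
	)
)

func handler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	w.Write([]byte("ok"))
	requestsTotal.WithLabelValues("200").Inc()
	requestDuration.Observe(time.Since(start).Seconds())
}

func main() {
	prometheus.MustRegister(requestsTotal, requestDuration)
	http.HandleFunc("/", handler)
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```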
Second, if we have a distributed system, we should understand how each component communicates with other services or databases. It means that for each outbound request to an external or internal service we need to know how many requests are being performed right now, the success rate, and the latency. Third, we need to understand everything about resource consumption, because any service can degrade or perform slower than usual if there is a lack of CPU time, for example; an application instance can be restarted because of an out-of-memory event; and if you run a database, it is really sensitive to disk performance.
Next, our components usually communicate over the network, so we should understand how the network is performing right now. There can be network issues like connectivity loss between nodes, availability zones, or regions if our system runs in many regions; there can be packet loss, or some network calls can take longer than usual because of delays at the network level. Then we should be able to explain why an application is being limited in CPU time. The reason behind this can be running out of CPU capacity on the node, a node failure, or something like that.
The next class of possible failure scenarios is related to the application runtime, such as the JVM runtime, the .NET runtime, or database internals. So we need to collect a lot of metrics related to the application runtime, such as garbage collector metrics, the state of thread pools, connection pools, locks, and so on. This telemetry will allow us to explain why, for example, a Java application stopped handling requests for a while because of GC activity.
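For a Go service, this kind of runtime telemetry can be read in-process; here is a small sketch, purely as an illustration, that reads garbage collector statistics from the standard runtime package.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	for {
		// MemStats includes GC counters and cumulative pause time,
		// which help explain why a service briefly stopped handling requests.
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		fmt.Printf("heap=%d MiB gc_runs=%d gc_pause_total=%s\n",
			m.HeapAlloc/1024/1024, m.NumGC, time.Duration(m.PauseTotalNs))
		time.Sleep(15 * time.Second)
	}
}
```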
Then, if we run our application in a Kubernetes cluster, we need to gather orchestrator-related metrics. Some of our application instances can be in an unschedulable state, where the orchestrator cannot place an instance on a node because of lack of capacity or something like that. Then, to be able to investigate unknown, application-specific issues, we need to gather application logs. And the last one, I would say, is collecting profiling data, to be able to explain why a particular application consumes more CPU time than before; knowing only the CPU usage is not enough to understand the reason behind this.
I see two approaches to gathering this telemetry data: you can do it manually, or you can use some sort of automatic instrumentation based on eBPF. The advantage of automatic instrumentation is that you don't need to change anything in your applications. You just need to install an agent once, and that's it: you will have all the telemetry data, with some limitations, which we will discuss. Automatic instrumentation also allows you to avoid blind spots. Since you don't need to change the code of your applications, you can instrument even legacy services or services whose code you don't have access to. It allows you to cover your whole system, not only the most critical services. I've seen a lot of cases where companies started adding instrumentation to their applications, beginning with the most critical services, and after a while they still had only a small part of their system covered with telemetry data. And the last aspect is that manual instrumentation of your services is not a one-time project. It is a continuous process: if you add a new service to your system, you need to make sure that this service is also instrumented with the OpenTelemetry SDK or something like that.
Let's go deeper into what eBPF is and how it can be used to gather telemetry. eBPF is a feature of the Linux kernel: the kernel allows you to run your own small programs in the kernel space, and such programs are invoked on some kernel function call or some user-land function call. eBPF itself doesn't gather any data; it's just a way to instrument a system. You have to write your own program and run it in the kernel space, which is a pretty complicated task because the kernel applies a lot of limitations to such programs.
In simple terms, there are a few ways to use eBPF. The first way is to attach our program to a kernel function call (a kprobe), such as, for example, open() or the function that opens a TCP connection. But this approach is not the best option because of possible compatibility issues: different kernel versions can have different function calls, a function can be renamed, deleted, and so on, so you need to support many variants of them. The second way: to solve this problem, the kernel team tried to provide a sort of stable API. They inserted a set of tracepoints at statically defined places in the kernel code, your eBPF program can attach to these points, and there are some guarantees that the arguments of such tracepoints will not change over time. In Coroot we mostly try to use tracepoints and not kprobes. The third option is uprobes, a way to call your eBPF program when some user-land function is called. For example, you have some binary file, a Golang program, and you want to attach your program to calls into the crypto library. You can do that using uprobes.
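To make this concrete, here is a sketch of the user-space side using the cilium/ebpf Go library: it loads a pre-compiled eBPF object file and attaches one of its programs to a tracepoint. The object file name and program name are assumptions for illustration, not Coroot's actual ones.

```go
package main

import (
	"log"
	"os"
	"os/signal"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

func main() {
	// Load a compiled eBPF ELF object (hypothetical file and program names).
	coll, err := ebpf.LoadCollection("tracer.o")
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()

	prog := coll.Programs["on_inet_sock_set_state"]
	if prog == nil {
		log.Fatal("program not found in object file")
	}

	// Attach to a statically defined tracepoint; its arguments are
	// part of the kernel's (mostly) stable tracing ABI.
	tp, err := link.Tracepoint("sock", "inet_sock_set_state", prog, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Close()

	// Keep the attachment alive until interrupted.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig
}
```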
There are two more eBPF concepts I want to mention here. The first one is maps. Your eBPF program can store some data in the kernel space using eBPF maps, so you can keep state between program calls. Usually it is used to capture some data on one call and then use it on the next one. For example, a process opens a connection, and on the first call you store the file descriptor in a map keyed by the process id. Then, when the connection goes to the established state, you can get that data from the map using the process id, or another id like the socket.
The second one is perf buffers. They are a way to share data between the kernel space and a user-space program. A perf buffer looks like a circular ring buffer: it allows you, for example, to store some data in the kernel space and read this data from a user-space program. That's the main way to exchange data between user space and kernel space.
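On the user-space side, reading such a buffer with the cilium/ebpf library looks roughly like this sketch; the map is passed in by the caller, and the record layout would have to match whatever the kernel-side program writes.

```go
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/perf"
)

// readEvents drains a perf event array exposed by the kernel-side program.
func readEvents(events *ebpf.Map) error {
	rd, err := perf.NewReader(events, 64*1024) // per-CPU buffer size in bytes
	if err != nil {
		return err
	}
	defer rd.Close()

	for {
		rec, err := rd.Read()
		if err != nil {
			return err
		}
		// If user space falls behind, the kernel drops samples rather
		// than blocking; LostSamples tells us how many were dropped.
		if rec.LostSamples > 0 {
			log.Printf("lost %d samples", rec.LostSamples)
			continue
		}
		log.Printf("got %d bytes of raw event data", len(rec.RawSample))
	}
}
```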
But these are low-level eBPF primitives, and I believe you don't need to write eBPF programs yourself, because there are a lot of ready-made tools, Coroot among them. Still, I think it's good to know how it works, even if you don't necessarily need to write your own eBPF programs. Now let me tell you how Coroot uses eBPF to gather telemetry data. We have our own node agent. The node agent is a Golang tool, open source, distributed under the Apache 2.0 license.
It's an open source agent that should be installed on every node in your cluster. It discovers processes and containers running on the node, and it discovers their logs, which we can analyze for repeated patterns. It tracks all TCP communications between processes and containers, and it also captures application-level protocol data, so it can expose metrics and traces without any instrumentation. It supports the most popular protocols such as HTTP, Postgres, gRPC, Redis, Cassandra, and so on.
Let's talk about how the agent uses eBPF. As I mentioned before, we try to use tracepoints. For example, to be able to discover new processes in the system, we use the tracepoint that fires when a new task is created (task_newtask). It allows us to know in real time that a new process has been started, and we know its process id. Then the user-space part of the agent can resolve all the metadata, like the cgroup or the parent process, and understand the container name, labels, and so on. The second one is that we capture events where some process is marked as a victim of the OOM killer. We need this data to understand the reason why a particular process has been terminated. On this event we just put a flag into a kernel-space map: the process with process id 123 is marked as a victim and will be terminated. Then, on the next call, when the process is terminated, we can get this flag from the map and mark the event that goes to user land with the reason: this process has been terminated by the OOM killer. We also track file open events to be able to discover container logs, where a container or process stores its logs, and also to understand the actual storage partition a process communicates with.
Then there are a few tracepoints we use to track TCP connections. The first one is the connect() syscall, where a process wants to open a new TCP connection. On this system call we see that process id 123 opens a TCP connection to some destination IP. Then, using the inet_sock_set_state tracepoint, we can track the TCP handshake and see whether this connection was successfully established or failed, so we can track errors in establishing TCP connections as well as all successfully established ones. The last one is that we track TCP retransmissions associated with all connections in our system. It's really helpful for understanding that service A communicates with service B slower than usual because of packet loss, for example: we need to retransmit TCP segments, and that brings additional latency into the communication.
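As an illustration of what the user-space side might receive, here is a sketch that decodes a hypothetical TCP event record coming out of a perf buffer; the struct layout is an assumption and would have to mirror the kernel-side program's definition byte for byte.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"net"
)

// tcpEvent is a hypothetical, illustrative layout, not Coroot's actual one.
type tcpEvent struct {
	Pid   uint32
	Type  uint32 // e.g. 1=connect, 2=established, 3=failed, 4=retransmit
	SAddr [4]byte
	DAddr [4]byte
	DPort uint16
	_     [2]byte // padding
}

func decodeTCPEvent(raw []byte) (tcpEvent, error) {
	var ev tcpEvent
	err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &ev)
	return ev, err
}

func main() {
	// Example raw sample as it might arrive from the perf reader.
	raw := []byte{
		123, 0, 0, 0, // pid 123
		2, 0, 0, 0, // established
		10, 0, 0, 1, // source 10.0.0.1
		10, 0, 0, 2, // destination 10.0.0.2
		0x15, 0x15, // destination port
		0, 0, // padding
	}
	ev, err := decodeTCPEvent(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("pid=%d %s -> %s:%d type=%d\n",
		ev.Pid, net.IP(ev.SAddr[:]), net.IP(ev.DAddr[:]), ev.DPort, ev.Type)
}
```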
The last one is a tracepoint that allows us to capture application-level communications. The agent tracks that some process writes something to a file descriptor, usually a socket, and we can capture the payload of such communications and detect and parse application-level protocols within such connections. We have two-phase application-level protocol parsing. The first phase is performed in the kernel space: a lightweight, high-performance protocol detection. An eBPF program is very limited in its ability to analyze anything complex, because the eBPF verifier checks the complexity of each program; for example, you cannot use unbounded loops in your eBPF programs. So when we want to actually parse a protocol, for example to read the payload and extract the URL of a request or the status of a response, we use user-space protocol parsing, where we are not limited in the complexity of the program and can implement any logic.
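Here is a minimal sketch of what such user-space protocol detection might look like, checking a captured payload against a few well-known protocol signatures; the heuristics are simplified examples, not Coroot's actual parser.

```go
package main

import (
	"bytes"
	"fmt"
)

// detectProtocol applies simple signature checks to the first bytes
// of a captured payload. Real parsers go much further, extracting
// URLs, status codes, SQL statements, and so on.
func detectProtocol(payload []byte) string {
	switch {
	case bytes.HasPrefix(payload, []byte("GET ")),
		bytes.HasPrefix(payload, []byte("POST ")),
		bytes.HasPrefix(payload, []byte("HTTP/1.")):
		return "http"
	case bytes.HasPrefix(payload, []byte("*")): // RESP arrays start with '*'
		return "redis"
	case len(payload) > 8 && payload[0] == 'Q': // Postgres simple query message
		return "postgres"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(detectProtocol([]byte("GET /cart HTTP/1.1\r\n")))    // http
	fmt.Println(detectProtocol([]byte("*2\r\n$3\r\nGET\r\n$1\r\nk"))) // redis
}
```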
The last topic is encryption, because these days even communications inside your cluster usually have to be encrypted due to compliance requirements or something like that, and we have to deal with that somehow. The primary approach to capturing data within encrypted TCP connections is the obvious one: read the data before encryption, or after decryption when you receive a response. In our agent we support two types of programs. The first is programs that use the OpenSSL library, for example Python applications or other interpreted languages. For the second, we use special uprobes to capture Go crypto/tls library calls, so we capture the data before encryption. This allows us to have pretty good coverage: the agent captures roughly 90% of the data, even within encrypted connections.
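For the Go case, attaching a uprobe to the TLS write path of a target binary with cilium/ebpf could look roughly like this; the binary path, object file, and eBPF program name are placeholders, and the symbol name assumes a Go binary whose symbols haven't been stripped.

```go
package main

import (
	"log"
	"os"
	"os/signal"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

func main() {
	coll, err := ebpf.LoadCollection("tracer.o") // hypothetical object file
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()

	// Open the target Go binary (placeholder path) and attach a uprobe
	// to crypto/tls.(*Conn).Write, so the plaintext is seen before encryption.
	ex, err := link.OpenExecutable("/proc/1234/exe")
	if err != nil {
		log.Fatal(err)
	}
	up, err := ex.Uprobe("crypto/tls.(*Conn).Write", coll.Programs["go_tls_write"], nil)
	if err != nil {
		log.Fatal(err)
	}
	defer up.Close()

	// Keep the attachment alive until interrupted.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig
}
```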
A few words about performance impact, because at first glance it looks like a crazy idea to let users run their programs in the kernel space: the kernel should perform with low latency, and custom programs in the kernel space sound risky. But in fact there are a few guarantees from the kernel that allow us to run our programs without degrading the system. The first one is the verifier in the kernel, which validates every program before it runs: a program must have finite complexity, you cannot use unbounded loops, you cannot operate on huge amounts of memory, and so on. It's actually quite tricky to write an eBPF program that the verifier accepts. The second thing that guarantees your eBPF program or your user-space program will not affect the system is that communication between the kernel space and user space happens through a limited buffer. If the user-land program doesn't manage to read some data because of a performance degradation or something like that, we just lose that data; it does not introduce any blocking into the kernel space. From my perspective that's pretty fair, and for observability purposes it's fine, because I think it's more important to be sure that our system is not impacted by our observability tool. In the worst case we just lose some telemetry data, some traces or some metrics, but that's not critical.
As a result, we can instrument our entire distributed system and understand how its components communicate with each other: the latency between any particular services, the status of the TCP connections between them, and so on. We can see, for example, that the frontend communicates with services such as cart, catalogue, and so on; we can see latency, we can see the number of requests. You will have this whole service map just a few minutes after installation. You don't need to change the code of your applications, integrate the OpenTelemetry SDK, and so on. It's pretty useful if you want to understand how a system performs right now: you cannot wait a few weeks for your development team to add such an integration.
And this telemetry data is pretty granular. It means we can track connections not just between applications, but between any application instance and its peers, so we can see communications at whatever granularity we need. For example, we can see how a few Postgres instances communicate with each other: this is the primary instance, and the two replicas are connected to it for replication. For each of these connections we know the number of requests, the number of errors, and the latency of application-level requests; we know how the network performs between any instances, including the network round-trip time; and we also know connection-level metrics such as the number of connections, the number of failed connections, and the number of TCP retransmissions.
This is pretty useful because in distributed systems it's hard to understand what's happening right now and hard to know the current topology of services. You would have to know the nodes where your applications run, and to check the connectivity between two services you would need to perform many operations to understand the topology, then go to some pod to perform a ping or something like that. Here we have telemetry data that already accounts for the fact that an application can migrate between nodes, and it reflects the current picture of your system, its current topology.
The agent provides not only metrics, it also provides eBPF-based traces. So we can see, for example, that an application calls a database and that some queries perform slower than others, and we can drill down and see the particular requests, which is useful while troubleshooting.
But eBPF tracing has some limitations. The most significant limitation, from my perspective, is that eBPF traces are not actually traces; they are individual spans. When an application is instrumented with the OpenTelemetry SDK, it can originate a trace id and propagate it to other services, and as a result you get a single trace that contains the requests from all services. In the case of eBPF we don't have a trace id, so we can only inspect particular queries; we cannot connect them to show you the whole trace.
There are open source tools that have tried to solve that problem using an approach where they capture requests, modify them by inserting, for example, a trace context header, and then send them on. But from my perspective it's not a good idea, because I believe that observability tools should observe: they shouldn't change your data, they shouldn't modify payloads. Understanding this, we support both methods of instrumentation in Coroot. We support OpenTelemetry-generated traces, and we also support eBPF-based traces, for example for legacy services and so on. Right after installation you will have eBPF-based tracing, but then you can enrich your telemetry data with OpenTelemetry instrumentation. It's the best way to have all this data, and it allows you to extend the visibility of your system beyond what eBPF alone provides.
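To show why SDK-generated traces can be stitched together while eBPF spans cannot, here is a sketch of trace context propagation with the OpenTelemetry Go instrumentation: the otelhttp wrappers inject and extract the traceparent header, so spans from both services join one trace. The service name and downstream URL are placeholders, and a tracer provider is assumed to be set up as in the earlier SDK sketch.

```go
package main

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func main() {
	// Use the W3C trace context propagator so traceparent headers
	// are injected into outgoing requests and extracted from incoming ones.
	otel.SetTextMapPropagator(propagation.TraceContext{})

	// Server side: extract the incoming trace context and start a child span.
	handler := otelhttp.NewHandler(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Client side: this outbound call reuses the request context, so the
		// transport injects a traceparent header and the downstream service
		// (placeholder URL) continues the same trace.
		client := http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
		req, _ := http.NewRequestWithContext(r.Context(), "GET", "http://catalog:8080/items", nil)
		if resp, err := client.Do(req); err == nil {
			resp.Body.Close()
		}
		w.Write([]byte("ok"))
	}), "frontend")

	http.ListenAndServe(":8080", handler)
}
```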
A few words about another way to gather telemetry data using eBPF: eBPF-based profiling. What is profiling? Profiling allows you to answer the question of what a particular application was doing at time X: which code was executed and which code consumed the most CPU, so you can extract the most CPU-consuming parts of your application. The eBPF-based approach doesn't require integrating a continuous profiling tool into your application and doesn't require redeploying it. You just need to install the agent, and it will gather profiling data and store it in some storage. In the case of Coroot, we store all telemetry data except metrics in ClickHouse.
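For contrast, the in-process way of getting profiling data from a Go service, without eBPF, is to embed a profiler endpoint yourself and redeploy; a typical sketch looks like this.

```go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	// Each service has to expose this endpoint and be redeployed;
	// a scraper or engineer then pulls CPU profiles from
	// /debug/pprof/profile on demand.
	http.ListenAndServe(":6060", nil)
}
```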
Here we have an example of how to use profiling data to explain CPU consumption. We can take the CPU usage chart, select some area, and see a flame graph reflecting which parts of the code consumed the most CPU. In this case it's the CoreDNS component of Kubernetes, and we can see the particular function calls here. The wider a frame on the flame graph, the more CPU the corresponding code consumed compared to narrower ones. We can also easily compare the CPU usage of some spike against the previous period. It's not a significant spike in this case, but we can see that the service was serving more DNS queries than before.
And here are a few notes on how Coroot works. You install the agent on your nodes. The agents gather telemetry data about all the containers running on each particular node and their communications; they gather logs, traces, and profiles and send them to Coroot, which stores metrics in Prometheus and the other telemetry signals in ClickHouse. Also, if you instrument your applications with OpenTelemetry, you can point them at Coroot's endpoints, which implement the OpenTelemetry protocol, and store that telemetry data in the same storages. Every component is open source, and you can use it, or you can use Coroot Cloud to offload storing your telemetry data.
In the end, let's wrap it up. eBPF is awesome: it allows us to gather a lot of telemetry data without the need to instrument your code, and we can be sure that the performance impact on your applications is negligible. On our side, we have a page with benchmarks: we understand the principles behind these guarantees, but we decided to validate them with benchmarks as well. If you want to gain visibility into your system just a few minutes after installation, just install Coroot. You can reach out to me on LinkedIn or Twitter. Thank you for your time.