Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome to my talk about low-overhead Python application profiling
with eBPF. Let's begin. A bit about myself: my
name is Yonatan Goldschmidt. I have six years of experience as an R&D
specialist in the IDF. I like everything about computers and
software, and today I'm a team lead at Granulate's performance
research department. About Granulate we enable companies to
optimize their workloads, improve performance, and leverage that to reduce costs.
And I also like wine, especially in Italy.
So why is profiling amazing? It's not a new concept,
but it's definitely on the rise lately, and it's getting easier and easier
to apply and use, even in production environments.
You gain visibility into which parts of your code consume
the most resources, and this helps you expose
interesting performance improvement opportunities. Let's talk about
profiler types and focus on Python profilers.
We start with deterministic profilers or tracing profilers.
They track your program's execution in a deterministic
way, for example by instrumenting all code paths to give
deterministic results. They are very common and many types
exist. Probably the best-known one is
cProfile, which is included in the Python standard library,
and to the right of this slide we can see example output
from it. Now, deterministic profilers are very useful during
development, as they are very versatile and can give accurate
metrics at the function and line-of-code level.
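As a minimal sketch of the deterministic approach, the snippet below runs a small function under cProfile from the standard library and prints the top entries by cumulative time. The functions `slow_sum`, `main`, and `profile_call` are hypothetical names made up for this illustration; they are not from the talk.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately naive loop so it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def main():
    return [slow_sum(20_000) for _ in range(20)]


def profile_call(func):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()  # from here on, every call and return is recorded
    try:
        result = func()
    finally:
        profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 by cumulative time
    return result, out.getvalue()


if __name__ == "__main__":
    _, report = profile_call(main)
    print(report)
```

Note that the code under test had to be wrapped in order to be profiled, which is the kind of change the talk goes on to describe as a drawback for production use.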
However, their intrusive design, the need to
insert instrumentation in the code or in the interpreter,
means they can introduce high overhead to
the code execution. They also might require code changes, for example
to enable or disable the profiler, or require deployment changes:
for example, you may need to start your application with the profiler's
script. These reasons make them less suitable for production
use, because you do not want to introduce any overhead and you
preferably do not want to make any changes just for the sake of profiling.
Now, another profiler type is statistical profilers.
These work by taking snapshots, or samples, of your application
at a set interval, for example every millisecond or
every microsecond, instead of continuously
tracking everything that's happening. Over enough time, the accumulated
samples portray an accurate image of your application.
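To make the sampling idea concrete, here is a toy in-process sampler. It is only a sketch: a background thread wakes up at a fixed interval and records which function the target thread is currently executing, using the CPython-specific `sys._current_frames()`. This is not how the external profilers discussed in this talk collect samples; the names `sample_thread`, `busy`, and `profile` are hypothetical.

```python
import collections
import sys
import threading
import time


def sample_thread(thread_id, interval, stop_event, counts):
    """Periodically record which function the target thread is executing."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)


def busy(duration):
    """CPU-bound work for the sampler to observe."""
    deadline = time.monotonic() + duration
    x = 0
    while time.monotonic() < deadline:
        x += 1
    return x


def profile(func, *args, interval=0.005):
    """Run func(*args) while a sampler thread watches the calling thread."""
    counts = collections.Counter()
    stop = threading.Event()
    sampler = threading.Thread(
        target=sample_thread,
        args=(threading.get_ident(), interval, stop, counts),
        daemon=True,
    )
    sampler.start()
    try:
        func(*args)
    finally:
        stop.set()
        sampler.join()
    return counts


if __name__ == "__main__":
    print(profile(busy, 0.2).most_common())
```

The sample counts are approximate by nature: the longer the program runs, the closer the proportions get to the real distribution of time.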
One common example is py-spy, which is a sampling profiler
written in Rust. This image also shows one way
to visualize the output of py-spy. It's called a flame graph, and it
tells us the relative execution time of different functions and flows
in your application.
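A flame graph is built from aggregated stack samples. The sketch below shows the aggregation step under simple assumptions: each sample is a tuple of function names ordered from root to leaf, identical stacks are counted, and the width of a function in the graph corresponds to the share of samples in which it appears. The sample data and the helper names `fold_stacks` and `inclusive_share` are invented for illustration.

```python
from collections import Counter


def fold_stacks(samples):
    """Count identical stacks; each sample is a tuple ordered root -> leaf."""
    return Counter(";".join(stack) for stack in samples)


def inclusive_share(samples, function):
    """Fraction of samples in which `function` appears anywhere on the stack."""
    if not samples:
        return 0.0
    hits = sum(1 for stack in samples if function in stack)
    return hits / len(samples)


# Hypothetical samples, as a sampling profiler might have collected them.
samples = [
    ("main", "handle_request", "parse"),
    ("main", "handle_request", "parse"),
    ("main", "handle_request", "render"),
    ("main", "idle"),
]

folded = fold_stacks(samples)
for stack, count in sorted(folded.items()):
    print(stack, count)
# main;handle_request;parse 2
# main;handle_request;render 1
# main;idle 1
```

With these four samples, `handle_request` appears in three of them, so it would span 75% of the graph's width, with `parse` taking two thirds of that.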
Now, since these samples can be taken externally,
these profilers can be made external to the application, as in
not intrusive. Thus, to some extent, they do not
introduce any overhead to the application itself.
Now, the profiler itself is a program running on the system,
so it does introduce some overhead to the system, and we'll talk about
that overhead when we finally reach eBPF.
Now, since it's not intrusive, we do not
need to make any changes to the code or deployment. For example,
py-spy can start profiling any running CPython process just by being given
the process ID, which is very convenient. For these
reasons, statistical profilers are much more suitable and safe to use in production.
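As a small illustration of attaching by process ID, the helper below builds (but does not run) a `py-spy record` command line. The helper name and the default values are assumptions made for this sketch; attaching to another process may also require elevated privileges, depending on the system.

```python
import os
import shlex


def py_spy_record_command(pid, output="profile.svg", rate=100, duration=30):
    """Build (but do not run) a `py-spy record` command for a running process."""
    return [
        "py-spy", "record",
        "--pid", str(pid),            # attach to an already running process
        "--output", output,           # where to write the flame graph
        "--rate", str(rate),          # samples per second
        "--duration", str(duration),  # seconds to sample for
    ]


if __name__ == "__main__":
    # Print the command for the current process as an example.
    print(shlex.join(py_spy_record_command(os.getpid())))
```

Nothing in the target application has to change for this to work, which is the point being made here.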
Now, deterministic profilers are generally more versatile in their abilities.
So for development environments, when you want
to accurately measure a specific function or module, you might
still want to use them. That's all for the pre-eBPF
era. Now let's see what eBPF brings to the table for profilers.
A primer on eBPF. It's a technology that has evolved
from the old Berkeley Packet Filter, which is a mechanism in the kernel that
allows the user to define filter programs for sockets,
like the one displayed on the screen. It was used
mostly by sniffing programs such as tcpdump. The filter program is essentially
a small virtual machine with a set of opcodes and operations
that it can perform on packet data. For example, the program displayed here
checks if the packet's source IP address or destination IP address
is localhost, and if the source port or destination port is
80, and you can see the assembly
instructions, the BPF assembly instructions, for that program.
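The "small virtual machine" idea can be sketched with a toy interpreter. This is a simplified model inspired by classic BPF (an accumulator, a load, a conditional jump with relative offsets, and a return), not the real instruction encoding. It covers only the port check, not the IP address check, and it assumes a hypothetical packet layout in which bytes 0-1 hold the source port and bytes 2-3 the destination port.

```python
import struct


def run_filter(program, packet):
    """Interpret a tiny classic-BPF-like program; return accept length (0 = drop)."""
    acc = 0  # the accumulator register
    pc = 0   # index of the next instruction
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "ldh":  # load a big-endian 16-bit value at a byte offset
            (offset,) = args
            if offset + 2 > len(packet):
                return 0  # out-of-bounds load: treated here as "drop"
            (acc,) = struct.unpack_from("!H", packet, offset)
        elif op == "jeq":  # compare accumulator with a constant, jump relatively
            value, if_true, if_false = args
            pc += if_true if acc == value else if_false
        elif op == "ret":  # finish with a verdict
            (result,) = args
            return result
        else:
            raise ValueError(f"unknown opcode: {op}")
    return 0


# Accept when the source port or the destination port is 80.
PORT_80_FILTER = [
    ("ldh", 0),         # 0: A = source port
    ("jeq", 80, 2, 0),  # 1: match -> jump to 4 (accept), else continue at 2
    ("ldh", 2),         # 2: A = destination port
    ("jeq", 80, 0, 1),  # 3: match -> continue at 4 (accept), else jump to 5
    ("ret", 0xFFFF),    # 4: accept, keep up to 65535 bytes
    ("ret", 0),         # 5: drop
]

if __name__ == "__main__":
    print(run_filter(PORT_80_FILTER, struct.pack("!HH", 80, 5000)))    # 65535
    print(run_filter(PORT_80_FILTER, struct.pack("!HH", 5000, 5001)))  # 0
```

Because the program can only load, compare, jump forward, and return, it always terminates, which is part of what makes this style of in-kernel filtering safe.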
Now, years forward, this simple interpreter for user
programs has been enhanced with many more APIs, which are no longer
limited to packet inspection. Also, the programs can now be
attached to virtually any logical point in the kernel, not just
to the entry of packets. Together, this makes
eBPF the most capable tracing or observability infrastructure
on Linux. Here's a short example. To the right
we have the code of an eBPF program called opensnoop.
It's written in a language called bpftrace, which is later compiled to
the same BPF assembly we saw earlier. You can read about
bpftrace online. This program hooks onto the open
system call and thus intercepts all open calls throughout the system.
To the left you can see sample output from running it
on my box. You can see all sorts
of different PIDs and programs opening different files.
You can see how relatively easy it is to write
this simple code that attaches onto a syscall and traces all
calls throughout the system. And also, I didn't mention the negligible
performance effect, which is something that we just didn't
have before eBPF. This table describes the
difference between standard user code, kernel code, and eBPF.
The core thing you need to take from this is that eBPF is safe
by design: a verifier mechanism exists which ensures that
only safe programs execute. It also means that eBPF
programs are not entitled to do anything they please. For example,
they are not able to call arbitrary system calls or perform arbitrary
writes to memory. On the other hand, eBPF programs have
fast access to resources such as memory. For example, they can access the memory
of the currently running Python application much faster than py-spy,
which is an external application that has to run
some system calls in order to read the memory of the Python application.
Now let's get back to Python. We needed a low-overhead sampling
profiler, which can sample at high frequency and
can easily profile all Python applications running on the system.
Plus, we wanted it to be able to extract native stacks and
kernel stacks. py-spy, while not introducing
overhead on the application itself, does have some overhead on
the system. As I said, it needs to access the Python process's memory
in order to extract stack traces, and it makes
a lot of syscalls trying to do that, which takes time.
py-spy simply wasn't fast enough when we needed to profile
a large Python application with hundreds of threads at
high frequency. So we started looking into
the eBPF approach and quickly found PyPerf, which was
posted to BCC as a PoC of an eBPF-based
Python profiler. By the way, we also found a project called
rbperf, which is like PyPerf for Ruby,
but that's a different story. So we spent a while
and added many new features to PyPerf, trying to make it
the best Python sampling profiler. So first of all,
we made sure it supports all currently available
Python versions. We made it a system-wide
profiler. That is, it profiles all running Python applications
on the system, unlike py-spy, which works on a
per-process basis. If I want to profile 50 Python applications,
I need to invoke 50 different py-spy instances, which then
introduce more overhead. With PyPerf, I need to run it just
once, and it profiles the entire system.
Additionally, we have added logic to extract native stacks, such
as CPython extensions (for example json, pickle, NumPy), interpreter
code, and native libraries. And we also extract
kernel stacks, which can be, for example, the system calls your
application is making. These features were relatively easy
to add because
PyPerf is eBPF-based, and it would have been much harder, if
not impossible, had it not been eBPF-based.
So here's an example of how it looks. This is a simple,
uniform application, and in yellow
rectangles we can see the Python frames from the
Python application.
The purple frames denote native code, and the orange frames
denote kernel code. Together, the combination of those three
portrays a very accurate image of the application's execution.
Now, I will be speaking a lot about native code,
which is something that many profilers overlook intentionally, saying
that the developer should care only about the Python code because they do not have control
over the native code anyway, so they should just focus on the Python code,
and that the native frames and stacks are unwanted noise.
However, from our experience, we know that
taking the native profile into account is very important when you want to
truly understand what's going on and which operations on the Python
level are taking the most CPU and time.
Therefore, we have invested in making this feature work perfectly
in PyPerf. So now we'll do a small exercise.
I have this function written here.
Can you read its code and guess which operations take the
most time? And I'll give you a minute to think and
then we will check out the results. And actually it's
recorded, so you can just pause and continue when you're ready.
I'll continue now. So here I've
cut out the relevant native profile of this function.
The bottommost frame is the Python function itself,
and all frames above it are the native functions
that our Python function, func as I've named it,
is calling. I've added some arrows
to explain which is coming from where,
and we can see that originally,
after I wrote this, I did not expect the profile to look like
that. For example, I did think
that the string concatenation, which we can see to the right, would take a relatively
large part of the profile, and indeed
the string concatenation takes a large part. However,
I did not expect the cow calls to take a
large part of the profile. Also, the modulo operator takes a relatively large
part of the profile. And I only realized
that once I looked at the native profile.
What I'm trying to tell you by that is that once we
observe the native profile, even of a simple Python function,
we can quickly devise ideas on how to improve its Python
code. For example, after viewing this profile,
I now know that the most important optimization to
use is to switch from string concatenation to
StringIO. And after doing that, the next thing
I would do is probably to cache the results of car,
and after that I would try to avoid the modulo operator.
Now the comparison, which
I thought would take a large part of the profile, actually takes
almost nothing. You can see it in the middle; it's actually rather small.
So you need to profile, and you need to look at the native profile, in
order to truly understand how even a simple Python
function divides its execution time.
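The function from the slide is not reproduced in this transcript, so the sketch below uses a hypothetical stand-in with the same kinds of operations (a modulo check and repeated string concatenation) to show what such rewrites can look like. The names `func_concat`, `func_join`, and `func_stringio` are invented, and which variant is actually faster should be measured rather than assumed.

```python
import io
import timeit


def func_concat(n):
    """Hypothetical stand-in: modulo check plus repeated concatenation."""
    result = ""
    for i in range(n):
        if i % 3 == 0:
            result += str(i)
    return result


def func_join(n):
    """Same output: step through multiples of 3 directly and join once."""
    return "".join(str(i) for i in range(0, n, 3))


def func_stringio(n):
    """Same output, accumulating into an in-memory text buffer."""
    buf = io.StringIO()
    for i in range(0, n, 3):
        buf.write(str(i))
    return buf.getvalue()


if __name__ == "__main__":
    # Time each variant on the same input; results vary by machine and version.
    for f in (func_concat, func_join, func_stringio):
        seconds = timeit.timeit(lambda: f(10_000), number=100)
        print(f"{f.__name__}: {seconds:.3f}s")
```

All three variants return the same string, so they can be swapped freely while comparing profiles before and after the change.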
So that's it on PyPerf. I hope the last part was interesting.
Now, PyPerf is a part of gProfiler, which is
our system-wide, continuous profiler for production environments,
and it supports numerous runtimes: not only Python, but also Java,
Go, Rust, and Ruby.
So check it out, it's open source.
So thank you. Feel free to DM or connect on LinkedIn,
GitHub, whatever, and please try it.
It's fun. Try gProfiler at gprofiler.io. Thank you.