Conf42 Site Reliability Engineering (SRE) 2024 - Online

Zero-instrumentation observability based on eBPF

Abstract

Join us as we explore leveraging eBPF for gathering telemetry data like metrics, logs, and traces, enabling efficient troubleshooting of container activities.

Summary

  • Observability is being able to answer questions about the system. How is the system performing right now? Why are some requests failing? Or why are certain requests taking longer than expected? A while ago, web applications were much simpler than today. Now systems are turning into more and more complex microservice architectures.
  • eBPF is a feature of the Linux kernel that allows you to run your own small programs in the kernel space. Using eBPF maps you can keep some state between your program calls. You don't necessarily need to write your own eBPF programs.
  • The node agent is a Golang tool, open source, distributed under the Apache 2.0 license. It tracks all TCP communications between processes and containers, and it also captures application-level protocol data. The agent supports capturing encrypted traffic for two types of programs.
  • The eBPF-based approach doesn't require integrating continuous profiling tools into your application. You just need to install the agent, and it will gather profiling data and store it in some storage. eBPF allows us to gather a lot of telemetry data without the need to instrument your code.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Today I'll be talking about zero-instrumentation observability based on eBPF. Let's start from the very beginning. What is observability? Observability is being able to answer questions about the system. How is the system performing right now? How does its current performance compare to the past? Why are some requests failing? Or why are certain requests taking longer than expected? In other words, why does the system perform slower than before? A while ago, web applications were much simpler than today. Usually it was a dedicated node running the application, maybe a few replicas of the application, and a dedicated database. Pretty simple. If something went wrong, you just needed to analyze the application logs, maybe a few metrics on how the application works, and the database metrics and logs. That's it. Now systems are getting more and more complex: microservice architectures, a lot of databases. A typical application these days can contain hundreds or even thousands of services running on a Kubernetes cluster. Nodes are dynamic: they can appear and disappear because of autoscalers, spot nodes, or something like that. While troubleshooting such a system, we need to follow the system topology from one service to another, to the database, to the infrastructure level, the database level, the network. A lot of things have to be analyzed to identify the cause.

Let's discuss the steps: what does it mean to make a system observable? The first step is collecting telemetry data such as metrics, logs, traces, and the fourth telemetry signal these days, continuous profiling. If we do that manually, we should instrument every application with some observability framework such as OpenTelemetry. This approach is good, but it requires your team's time. It can be time consuming because we have a lot of services. An additional disadvantage is that you cannot achieve 100% coverage of such a system, because usually we have a lot of third-party or legacy services whose code we cannot modify. That means we cannot integrate OpenTelemetry into such applications. Then we need to store this data somewhere. It can be open source databases like ClickHouse or the Grafana storages for telemetry data, or it can be commercial services such as Datadog, New Relic and so on. From my perspective, the most challenging part is learning how to extract insights from all this data. We have hundreds of gigabytes or terabytes of logs, we have thousands or millions of metrics, and we have a lot of traces. How do we turn all of that into insights: what's happening with my system right now, and how do I understand the root cause of an issue using all this data? Our systems are pretty complex, and we need to gather a lot of metrics or telemetry from each subsystem to be able to troubleshoot our applications.

Let's talk about what exactly we want to know about each application in our system. First of all, we need to understand how the particular application performs right now: the SLIs, service level indicators. Usually it means we want to know how many requests the system is processing right now, the success rate or error rate, and the latency of each request. It can be a histogram or something like that; any aggregation of latency will work. The second one: if we have a distributed system, we should understand how each component communicates with other services or databases.
It means we need to know about each outbound request to an external or internal service: how many requests are being performed right now, the success rate, and the latency. We need to understand everything about resource consumption, because any service can degrade or perform slower than usual if there is a lack of CPU time, for example, or an application instance can be restarted because of out-of-memory, or, if you run a database, it is really sensitive to disk performance. Next, our components usually communicate over the network, so we should understand how the network is performing right now. There can be network issues like connectivity loss between some nodes, availability zones, or regions if our system runs in many regions; there can be packet loss, or some network calls can take longer than usual because of delays at the network level. Then we should be able to explain why an application is limited in CPU time; the reason behind this can be being out of CPU capacity on the node, or a node failing, or something like that. The next class of possible failure scenarios is related to the application runtime, such as the JVM runtime, the .NET runtime, or database internals. So we need to collect a lot of metrics related to the application runtime, such as garbage collector metrics, the state of thread pools or connection pools, locks, and so on. This telemetry will allow us to explain why, for example, a Java application stopped handling requests for a while because of GC activity. Then, if we run our applications in a Kubernetes cluster, we need to gather orchestrator-related metrics. Some of our application instances can be in an unschedulable state: the orchestrator cannot place the application instance on a node because it is out of capacity or something like that. Then, to be able to investigate unknown, application-specific issues, we need to gather application logs. And the last one, I would say, is collecting profiling data, to be able to explain why a particular application consumes more CPU time than before; knowing only the CPU usage is not enough to understand the reason behind it.

I see two approaches to gathering this data. You can do it manually, or you can use some sort of automatic instrumentation using eBPF. The advantage of automatic instrumentation is that you don't need to change anything in your applications. You just need to install an agent once, and that's it: you will have all the telemetry data, with some limitations, but we will discuss them. Automatic instrumentation also allows you to avoid blind spots. Since you don't need to change the code of your applications, you can instrument even legacy services or services whose code you don't have access to. It allows you to cover your whole system, not only the most critical services. I've seen a lot of cases where companies started to add instrumentation to their applications, beginning with the most critical services, and after a while they had only a small part of their system covered with telemetry data. And the last aspect is that manual instrumentation of your services is not a one-time project. It is a continuous process, because if you add a new service to your system, you need to be sure that this service is also instrumented with the OpenTelemetry SDK or something like that. Let's go deeper into what eBPF is and how it can be used to gather telemetry.
eBPF is a feature of the Linux kernel. It allows you to run your own small programs in the kernel space, and such programs are triggered by some kernel function call or some user-land function call. eBPF itself doesn't gather any data; it's just a way to instrument the system. You have to write your own program and run it in the kernel space, and that is a pretty complicated task, because the kernel applies a lot of limitations to such programs.

In simple terms, here is how we use eBPF. The first way is to attach our program to a kernel function call (a kprobe), such as, for example, open or the TCP connect function. But this approach is not the best option because of possible incompatibilities: different kernel versions can have different function calls, a function can be renamed or deleted and so on, so you would need to support many variants of them. The second way: to solve this problem, the kernel team tried to provide a more or less stable API. They inserted a set of tracepoints, statically defined places in the kernel code, and your eBPF program can attach to these points; there are some guarantees that the arguments of such tracepoints will not change over time. In Coroot, we mostly try to use tracepoints and not kprobes. The third option is uprobes, a way to call your eBPF program when some user-land function is called. For example, you have some binary, say a Golang program, and you want to attach your program to its crypto library calls.

There are two more things I want to mention here. The first one is maps. Your eBPF program can store some data in the kernel space using eBPF maps, so you can keep some state between your program calls. Usually it's used to cache some data: to capture something from one call and then use it on the next call. For example, an HTTP connection is opened, and on the first call you store a file descriptor in some map, keyed by process id; then, when the connection goes to the established state, you can get that data from the map using the process id, or another id like a socket id. The second one is perf buffers, a way to share data between the kernel space and a user-space program. It looks like a circular buffer and allows you, for example, to store some data in the kernel space and read that data from a user-space program. That's the main way to exchange data between user and kernel space. These are low-level primitives of eBPF, and I believe you don't need to write eBPF programs yourself, because there are a lot of ready-made tools, but I think it's good to know how they work.
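To make the maps idea concrete, here is a minimal, hypothetical sketch in libbpf-style C (not Coroot's code): one program attached to the sched_process_exec tracepoint stores a timestamp keyed by PID in a hash map, and a second program attached to sched_process_exit picks that state up later, the "capture in one call, use it in a later call" pattern described above. Map and function names are illustrative.

// start_ts keeps per-PID state between separate eBPF program invocations.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, __u32);    // PID
    __type(value, __u64);  // exec timestamp, nanoseconds
} start_ts SEC(".maps");

SEC("tracepoint/sched/sched_process_exec")
int on_exec(void *ctx)
{
    __u32 pid = bpf_get_current_pid_tgid() >> 32;
    __u64 ts = bpf_ktime_get_ns();

    // State written here survives until a later program call reads it.
    bpf_map_update_elem(&start_ts, &pid, &ts, BPF_ANY);
    return 0;
}

SEC("tracepoint/sched/sched_process_exit")
int on_exit(void *ctx)
{
    __u32 pid = bpf_get_current_pid_tgid() >> 32;
    __u64 *ts = bpf_map_lookup_elem(&start_ts, &pid);

    if (ts) {
        __u64 lifetime_ns = bpf_ktime_get_ns() - *ts;
        bpf_printk("pid %u lived %llu ns", pid, lifetime_ns);  // demo output only
        bpf_map_delete_elem(&start_ts, &pid);
    }
    return 0;
}

char LICENSE[] SEC("license") = "GPL";

A user-space loader (libbpf or similar) would load this object, attach both programs, and read the map or the trace output; the point here is only how state persists between program calls.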
Now let me tell you how Coroot uses eBPF to gather telemetry data. We have our own node agent. The node agent is a Golang tool, open source, distributed under the Apache 2.0 license. It should be installed on every node in your cluster. It discovers the processes and containers running on the node, and it discovers their logs; we can analyze logs for repeated patterns. It tracks all TCP communications between processes and containers, and it also captures application-level protocol data, so it can expose metrics and traces without any instrumentation. It supports the most popular protocols, such as HTTP, Postgres, gRPC, Redis, Cassandra and so on.

So, how does the agent use eBPF? As I mentioned before, we try to use tracepoints. For example, to be able to discover new processes in the system, we use a tracepoint that fires when a new task is started. It allows us to know in real time that a new process has been started and to know its process id, and then the user-space part of the agent can resolve all the metadata, like the cgroup, the parent process, the container name, labels and so on. The second one is that we capture the events where some process is marked as a victim of the OOM killer. We need this data to understand the reason why a particular process has been terminated. On this event, we just put a flag into a kernel-space map: the process with process id 123 is marked as a victim and will be terminated. Then, on a later call, when the process is terminated, we can get this flag from the map and mark the event that goes to user land, so the reason for the termination of this process is that it was killed by the OOM killer. We also track file openings, to be able to discover where a container or process stores its logs, and to understand the actual storage partition a process works with.

Then there are a few tracepoints that we use to track TCP connections. The first one is the connect syscall, where a process wants to open a new TCP connection. When this system call happens, we see that process id 123 opens a TCP connection to some destination IP. Then, using the inet_sock_set_state tracepoint, we can track the TCP handshake and see whether the connection was successfully established or failed, so we can track errors in establishing TCP connections, and we can also track all successfully established TCP connections. And the last one here: we track the TCP retransmissions associated with all connections in our system. It's really helpful to understand that service A communicates with service B slower than usual because of packet loss, for example: we need to retransmit our TCP segments, and that brings additional latency into the communication.

And there is a tracepoint that allows us to capture application-level communications. The agent tracks when some process writes something to a file descriptor, usually a socket; we can capture the payload of such communications, and we can detect and parse application-level protocols within such connections. We have two-phase application-level protocol parsing. The first phase is performed in the kernel space: a lightweight, high-performance protocol detection. An eBPF program is very limited in its ability to analyze anything, because the eBPF verifier checks the complexity of each program and, for example, you cannot use loops in your eBPF programs. So if you want to actually parse the protocol, for example to read the payload, to extract the URL of a request or the status of a response, we use user-space protocol parsing, where we are not limited in the complexity of the program and can implement any logic.
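As an illustration of the two-phase idea (cheap detection in the kernel, full parsing in user space), here is a hypothetical, simplified sketch, not Coroot's implementation: an eBPF program on the write() syscall tracepoint peeks at the first bytes of the buffer and tags the (pid, fd) pair as "looks like HTTP" in a map, leaving URL and status extraction to a user-space parser. The context struct layout mirrors the tracepoint's format file on typical kernels and is an assumption here; real code would use CO-RE / vmlinux.h.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// Assumed layout of tracepoint/syscalls/sys_enter_write arguments.
struct sys_enter_write_ctx {
    __u64 _common;          // common tracepoint header
    __s32 syscall_nr;
    __u32 _pad;
    __u64 fd;
    const char *buf;
    __u64 count;
};

struct conn_key {
    __u32 pid;
    __u32 fd;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, struct conn_key);
    __type(value, __u8);    // detected protocol: 1 = looks like HTTP
} conn_proto SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_write")
int detect_http(struct sys_enter_write_ctx *ctx)
{
    char prefix[8] = {};

    if (ctx->count < 8)
        return 0;
    // Only a tiny, bounded peek at the payload is done in kernel space.
    if (bpf_probe_read_user(prefix, sizeof(prefix), ctx->buf) != 0)
        return 0;

    // Cheap detection: does the payload start like an HTTP request line?
    int looks_http =
        (prefix[0] == 'G' && prefix[1] == 'E' && prefix[2] == 'T' && prefix[3] == ' ') ||
        (prefix[0] == 'P' && prefix[1] == 'O' && prefix[2] == 'S' && prefix[3] == 'T');
    if (!looks_http)
        return 0;

    struct conn_key key = {
        .pid = (__u32)(bpf_get_current_pid_tgid() >> 32),
        .fd = (__u32)ctx->fd,
    };
    __u8 proto = 1;
    // User space later reads this tag and runs the full, unrestricted parser.
    bpf_map_update_elem(&conn_proto, &key, &proto, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";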
And the last thing is about encryption, because these days even communications inside your cluster usually have to be encrypted, due to compliance requirements or something like that, and we have to deal with that somehow. There are two primary approaches to capturing data within encrypted TCP connections: the obvious way is to read the data before it is encrypted, or after it is decrypted when you receive a response. In our agent we support two types of programs. The first one is programs that use the OpenSSL library, for example Python applications or other interpreted languages. For the second one, we use special uprobes to capture Go crypto library calls, so we capture the data before encryption. This gives us pretty good coverage: we can say that the agent captures around 90% of the data, even when it is encrypted.

And a few words about the performance impact, because at first glance it looks like a crazy idea to allow users to run their own programs in the kernel space; the kernel should perform with low latency, and custom programs in the kernel space sound dangerous. But in fact there are a few guarantees from the kernel that allow us to run our programs without affecting how the system performs. The first one is the verifier in the kernel that validates every program before it runs: a program must have finite complexity, you cannot use loops, you cannot operate on huge amounts of memory, and so on. It's actually quite tricky to write an eBPF program that is successfully validated by the verifier. The second guarantee, that the performance of your eBPF program or your user-space program will not affect the system, is that communication between the kernel space and user space happens through a limited buffer. If the user-land program does not keep up with the data because of performance degradation or something like that, we just lose some data; it does not bring any blocking into the kernel space. From my perspective that's pretty fair, and for observability purposes it's fine, because I think it's more important to be sure that our system is not impacted by our observability tool. In the worst case we just lose some telemetry data, some traces or some metrics, but that's not critical.
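The "lose data instead of blocking the kernel" behavior can be seen directly in the kernel-to-user-space buffer APIs. The sketch below uses illustrative names and the newer BPF ring buffer (a perf buffer behaves similarly): the program reserves space for an event and simply drops the event when the buffer is full because user space has fallen behind.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct event {
    __u32 pid;
    __u64 ts_ns;
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 20);   // fixed-size buffer shared with user space
} events SEC(".maps");

SEC("tracepoint/sched/sched_process_exit")
int handle_exit(void *ctx)
{
    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;   // buffer full: lose this event instead of blocking the kernel

    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->ts_ns = bpf_ktime_get_ns();
    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";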
As a result, we can instrument our whole distributed system and understand how the components of such a system communicate with each other: the latency between any particular services, the status of the TCP connections between them, and so on. We can see, for example, that the frontend communicates with services such as the cart, the catalogue and so on. We can see latency, we can see the number of requests, and you get this whole service map just a few minutes after installation. You don't need to change the code of your applications or integrate the OpenTelemetry SDK. It's pretty useful if you want to understand how a system performs right now and cannot wait a few weeks for your development team to add the instrumentation. The telemetry data is also pretty granular: we can track connections not just between applications, but between any application instance and its peers, so we can see communications at whatever level of detail we need. For example, we can see how a few Postgres instances communicate with each other: this is the primary instance, and two replicas are connected to it for replication. And we know about each of these connections: the number of requests, the number of errors, the latency of application-level requests; we know how the network performs between any instances, we know the network round-trip time, and we also know connection-level metrics such as the number of connections, the number of failed connections, and the number of TCP retransmissions. It's pretty useful, because in distributed systems it's hard to understand what's happening right now and hard to know the current topology of services. You would have to know the nodes where your applications run, and to check the connectivity between two services you would need to perform many operations to understand the topology, then go to some pod and run ping or something like that. Here we have telemetry data that already accounts for the fact that an application can migrate between nodes, and it reflects the current picture of your system, the current topology of your system.

The agent provides not only metrics, it also provides eBPF-based traces. So we can see that, for example, an application calls a database and some of the queries perform slower than others, and we can drill down and see the particular requests, which is useful while troubleshooting. But eBPF tracing has some limitations. The biggest one, from my perspective, is that eBPF traces are not actually traces, they are individual spans. When you instrument applications with the OpenTelemetry SDK, it originates a trace id and propagates it to other services; as a result, you have a single trace that contains all the requests from all services. In the case of eBPF we don't have a trace id, so we can only look at particular queries, and we cannot connect them to show you the whole trace. There are some open source tools that try to solve that problem with an approach where they capture a request, modify it, for example by inserting a tracing header, and then send it on. From my perspective that's not a good idea, because I believe observability tools should observe: they shouldn't change your data, they shouldn't modify the payload. Understanding this, we support both methods of instrumentation in Coroot: we support OpenTelemetry-generated traces, and we also support eBPF-based traces, for example for legacy services and so on. After the installation you will have eBPF-based tracing, and then you can enrich your telemetry data with OpenTelemetry instrumentation. It's the best way to have all this data, and it allows you to extend the visibility of your system with eBPF.

And a few words about another way to gather telemetry data using eBPF: eBPF-based profiling. What is profiling? Profiling allows you to answer the question: what did this particular application do at time X, which code was executed, and which part of the code consumed the most CPU. The eBPF-based approach doesn't require integrating continuous profiling tools into your application and doesn't require redeploying your application. You just need to install the agent, and it will gather profiling data and store it in some storage; in the case of Coroot, we store all the telemetry data except metrics in ClickHouse. Here is an example of how to use profiling data to explain CPU consumption: we can take the CPU usage chart, select some area, and see the flame graph reflecting which part of the code consumed the most CPU. In this case it's the CoreDNS component of Kubernetes, and we can see the particular function calls. The wider a frame on the flame graph, the more CPU the corresponding code consumed. We can also easily compare CPU usage: we see some CPU spike and compare it with the previous period; it's not a significant spike, but in this case we can see that the service is serving more DNS queries than before.
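For reference, eBPF-based CPU profilers of this kind are typically built around a program attached to a timer-based perf event that samples stacks. The following is a generic, hypothetical sketch of that kernel-side part (not Coroot's profiler): each sample resolves the current user and kernel stacks via a stack-trace map and increments a counter per unique stack, and those aggregated counts are what a flame graph is rendered from. User space would attach this program with perf_event_open at a fixed sampling frequency and periodically drain the maps.

#include <linux/bpf.h>
#include <linux/bpf_perf_event.h>
#include <bpf/bpf_helpers.h>

#define TASK_COMM_LEN 16

struct stack_key {
    __u32 pid;
    __s32 user_stack_id;
    __s32 kernel_stack_id;
    char comm[TASK_COMM_LEN];
};

struct {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __uint(max_entries, 16384);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, 127 * sizeof(__u64));  // up to 127 frames per stack
} stacks SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 16384);
    __type(key, struct stack_key);
    __type(value, __u64);   // number of samples for this stack
} counts SEC(".maps");

SEC("perf_event")
int do_sample(struct bpf_perf_event_data *ctx)
{
    struct stack_key key = {};
    __u64 one = 1, *val;

    key.pid = bpf_get_current_pid_tgid() >> 32;
    key.user_stack_id = bpf_get_stackid(ctx, &stacks, BPF_F_USER_STACK);
    key.kernel_stack_id = bpf_get_stackid(ctx, &stacks, 0);
    bpf_get_current_comm(&key.comm, sizeof(key.comm));

    // Count how often this (pid, user stack, kernel stack) tuple is sampled.
    val = bpf_map_lookup_elem(&counts, &key);
    if (val)
        __sync_fetch_and_add(val, 1);
    else
        bpf_map_update_elem(&counts, &key, &one, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";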
And a few notes on how Coroot works. You install the agent on your nodes. The agents gather telemetry data about all the containers running on their nodes and their communications; they gather logs, traces and profiles and send them to Coroot, which stores metrics in Prometheus and the other telemetry signals in ClickHouse. Also, if you instrument your applications with OpenTelemetry, you can use Coroot's endpoints that support the OpenTelemetry protocol and store that telemetry data in the same storages. Every component is open source and you can use it, or you can use Coroot Cloud to offload storing your telemetry data.

In the end, let's wrap it up. eBPF is awesome: it allows us to gather a lot of telemetry data without the need to instrument your code, and we can be sure that the performance impact on your applications is negligible. On our side, we have a page with benchmarks; we understand the principles, but we decided to validate that with benchmarks as well. If you want to gain visibility into your system just a few minutes after installation, just install Coroot. You can reach out to me on LinkedIn or Twitter. Thank you for your time.
...

Nikolay Sivko

Founder & CEO @ Coroot



