Transcript
Hey everyone, welcome to my talk. I'll cover cloud
native observability today and how we can actually get true Kubernetes
observability using eBPF, which is a super interesting technology
that we'll dive into in a sec.
Before starting anything, just wanted to kind of align us
on the core value of observability. Observability is a core competency
for every single team out there. I mean, there's almost no team anywhere on the globe that isn't using some sort of observability, whatever you would call it. It could be logs,
infrastructure metrics, custom metrics that customers
create for themselves, or even deep application performance monitoring
like we know from advanced tools in the market. But today
we've reached a point where observability spends are
about 20% to 30% of infrastructure spend, which is,
say, what you would pay AWS or GCP for hosting
your cloud infrastructure. Now you have to agree with me that we should just stop here for a second, since these numbers are huge. Just wrap your head around these numbers: 20 or 30% of infrastructure costs. That's unheard of, and I think most teams aren't even aware of it. But besides being super costly, these bills are also unpredictable.
I mean, on the left we can see Datadog's pricing, and Datadog is clearly one of the market leaders in observability. And what you would eventually pay at the end of the month is super unclear. Vendors expect us as engineers and engineering managers to know too many details about the volumes of our production, from so many different aspects. I mean, just imagine you have to pay for logs. So on the left, you would have to know how many logs you have per month on a volume basis, like the number of log lines and the number of gigabytes that you would actually ingest into Datadog over a period of a month. You have to know that, right? No, no one knows that, and it's too complicated to know. And even if you do know it, one engineer can cause a 30% rise in your bill next month due to some cardinality or volume error they introduced into production or dev environments.
So that's the really scary part.
And before we go further, let's ask ourselves how we even got here. I mean, cloud native, which is clearly a huge buzzword of the last few years and has impacted a lot of our decision making and architectures, has just made things worse.
We see, and this is from a report by O'Reilly, that observability data is rising much faster than actual business-related data or the business volume of cloud-based companies. And the reason is very clear. In a sense, we're segregating our environments into a lot of different components, microservices that talk to each other and communicate over API-driven interactions. And to monitor one single flow of one customer request through your production, suddenly you have to monitor a distributed system with dozens of components talking to each other, creating their own logs, exchanging API calls and traces between one another. And it means that you have to store so much more to get the same insight into what exactly happened, where my bottleneck is, or why I'm returning a wrong response to the actual user.
So that actually made things worse. And it's easy to understand why, when a recent survey by Kong last year showed that most companies, or the average company, have over 180 microservices in production. I mean, that's a crazy amount if you just think of what's standing behind that enormous number. There are a lot of teams working in different ways, with different technology stacks and so on, just to create a mesh of business logic that communicates with itself and eventually serves value to your customers.
Now, part of the reason we got here is actually the advantage of how observability was originally built, back when we were just starting off. I mean, observability was in a sense built to be part of the dev cycle. It was very clear that developers would eventually integrate the solution, either by changing the runtime or integrating an SDK into their code that would help them observe their application. They would then decide what to measure, where to instrument code, what metrics to actually expose, and so on. And then
they would happily enjoy that value and build dashboards
and alerts, and set whatever goals and
KPIs they would like to measure as part of the new data that
they just collected. This had a clear advantage. I mean,
you could have one team in a big company creating
impact on their observability stack pretty easily. They could
just integrate their code, change what they wanted to
change, observe what they wanted to observe, and get value really fast.
Because the alternative would be integrating the observability solution through the infrastructure, through the lowest common denominator of all the company's teams that are working on different kinds of stacks at the same time. So if you integrate a solution through the infrastructure, you would maybe get a uniform value, but suddenly it's much harder to integrate, because developers don't get to decide which tools to install on their Kubernetes cluster, for example. They don't have access to it. It's a longer reach for a developer to impact the company's infrastructure. So observability was built, in a sense, in a way that made sense at the time.
Now imagine the different teams on top of it. Today we're creating a situation, which I think is good, where teams have the autonomy to select their own technology stacks. I mean, you have a data science team working in Python, a web team working in Node.js, a backend team working in Go. That's all super reasonable in today's microservices architecture.
But who's asking the question of whether all this data is really needed? I mean, if I let the data science team and the web team each collect their own data, instrument their own code, and decide what to measure for observability purposes, who is actually in charge of asking: is all this data needed? Are we paying the correct price for the insights that we need as a company, as an R&D group? That's harder when you work in a distributed manner like this. And I think another important question is, are we setting up whoever is responsible to succeed? I mean, that worried DevOps or SRE you see here, they're eventually responsible for waking up at night and saving production when things go bad. So are we giving them the tools to succeed if they can't eventually impact each and every team, or if they have to work super hard to align all the teams to do what they want them to do? Are they able to get 100% coverage? I mean, eventually, if there's one team maintaining some legacy code, can the SRE or DevOps really push that team to instrument its code? Not exactly. So they already have blind spots.
Can they control the cost versus visibility trade-off? I mean, can they control which logs a specific team inside a huge company is storing, to make sure the cost doesn't climb to a crazy amount next month? Not exactly. They have partial tools to do that, but not exactly. So we're putting them in a position where we expect so much: take care of production, make sure it's healthy, get all the observability you need to do that, it's your job. But eventually it's the developers who determine what value they will actually get, and what tools they will actually have, to succeed in that job.
Now, legacy observability solutions
also have kind of another disadvantage
when it comes to cloud native environments. We mentioned data
rising so fast and the amounts of money you have to pay to get
insights with these tools. But eventually it results from their
pricing models being completely unscalable.
Once you base a pricing model on volume, cardinality, or any of these unpredictable things, engineers can't really know what will happen next month. But they do know that it won't scale well, because if I currently have five microservices, I'm not exactly sure how much I would pay with these pricing models when I go to 50 microservices. And the reason is that communication doesn't grow linearly in a microservices mesh architecture. These pricing models have been proven to be unscalable. Once you pay 20 or 30% of your cloud cost, it already means you've reached a point where you're paying too much for observability, and it doesn't scale well with your production.
Another deficiency, as we mentioned, is harder organizational alignment. It's so hard to implement the same observability solution across such a heterogeneous R&D group in a big company. Getting everybody to align on the same value, the same observability standards, the same measures that they track in production, dev, and staging, that's really hard in a big organization once you go the legacy way of letting developers instrument their code and do the integration work for their observability vendor. And the third part is data privacy. I think that when these solutions were built over a decade ago, it made sense in most cases to send all your data to the vendor to be stored. And eventually these vendors are data companies, storing and charging you for the data volumes being stored on their side. So the data isn't private. You're working in a cloud native environment, usually already segregated inside cloud native primitives such as Kubernetes clusters and namespaces and all that, but eventually you just send all this data, logs and traces which contain sensitive information, sometimes PII, to your observability vendor, just because you have no other choice. And it doesn't make sense in modern cloud native environments.
Now the bottom line is very clear, I think: observability is under-adopted. On the left you can see a recent survey showing that deep observability solutions are heavily under-adopted, with about 70% of teams using logs as the basic measure, or the basic layer, of observability. Far fewer of these teams are implementing APMs, application performance monitoring tools, even though everybody agrees on the value of these tools. Eventually it means that there's a gap. It can be any one of the reasons we stated before: it could be the price, the difficulty of getting the solution onboarded into a big organization, or data privacy. It could be any one of these things. But the bottom line is very clear. Teams are not adopting these solutions at the rate you would expect from such high-value solutions that could help them troubleshoot better in production. Now that's exactly the reason that we've
built groundcover. groundcover is built to be a modern cloud native observability solution. We'll talk about how it's built in a second, but as a concept, groundcover is trying to create a modern approach that fits teams working in cloud native environments and solves the problems that we stated before. And we can state three different measures that groundcover took in order to build our vision of what we think is the correct way to build an observability platform.
One is that data is not collected through the dev cycle. Basically, to collect data, we don't need to be part of the development cycle and worry about each team inside a big company implementing and integrating the observability solution. It's not that developers aren't needed in the process; they're super important for determining what to measure, together with SRE and DevOps and all these teams. But we empower one DevOps or SRE person to integrate an observability solution all across a huge company without working their way through too many stakeholders inside the company. And how are we doing that? We're doing that with eBPF.
eBPF is a technology we'll talk about in a second, but it allows us to collect information, observability data, out of band from the application. We don't have to be part of the development cycle, we don't have to be part of the application code, to actually collect data about what the application is doing and to observe it in a deep way, and that's a really major leap forward. The second is that data is digested in a distributed way, where it lies. Basically, groundcover digests data as it flows through our agent, which is distributed across your cloud native environment, and decisions about the data are already being made at a very early stage of the process. It allows us to collect data really smartly, reduce the volumes of data that we collect for the same insights that you would expect, and basically break the volume-based pricing models. Instead of paying for volume, you can pay differently once you digest data in a distributed manner, the way your cloud native environment is actually built.
And the third is privacy. We use an in-cloud architecture, which is really sophisticated, where the whole data plane basically resides in your cloud environment. groundcover has no access to the data; all the data is kept private in your cloud environment, while we still provide a SaaS experience that can be shared inside the organization and so on. So these are three major differentiators that push an observability system into the cloud native domain and make it a better fit for modern teams. Now let's cover these three segments one by one. The first is that observability is basically hard to integrate. We see a lot of R&D effort and a lot of coordination inside teams. Just to prove that point: for one team to be able to instrument OpenTelemetry, they would have to work hard, integrate an SDK, and change lines of code across the application to eventually get the value of OpenTelemetry or any other vendor-based SDK. And that's the best we can get; for some of the languages, auto-instrumentation doesn't really work and you have to do the work manually.
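Just to make that concrete, here's a minimal sketch of what manual instrumentation looks like with the OpenTelemetry Go SDK; the service name, span name, and attribute are placeholders, and a tracer provider and exporter still have to be configured elsewhere.

```go
package checkout

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// handleCheckout shows the kind of change every team has to repeat,
// endpoint by endpoint, to get traces out of their own code.
func handleCheckout(ctx context.Context, cartID string) error {
	// Start a span around the business logic; the names are placeholders.
	ctx, span := otel.Tracer("checkout-service").Start(ctx, "handleCheckout")
	defer span.End()

	span.SetAttributes(attribute.String("cart.id", cartID))

	// ... the actual business logic goes here, now wrapped in tracing ...
	return nil
}
```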
And we're still stuck inside that dev cycle. I mean, developers instrument their code, they wait for the next version release to actually reach production, and then they get the value in production. That's a pretty long cycle. For one team it's maybe reasonable. But what happens in a real company, when you have 180 different teams or 180 different microservices that you have to work your way through to actually get to production? That is where things get really hard. Because eventually, getting all these teams to do all this at the same time, with the same principles, at the same depth, without making mistakes, that's the real pain of being part of the development cycle. And eventually you still get to the point where you have the same worried DevOps or SRE person responsible for aligning all these teams. Can they even do it? Can they get one uniform approach across a huge company? Can they set high professional standards that will allow them to, say, set and track SLOs? It's hard.
Now, eBPF is the next Linux superpower. That's one of the attributes it was given. eBPF basically allows you to run business logic in a sandboxed environment inside the Linux kernel. One interesting comparison was made by Brendan Gregg, an ex-Netflix engineer, who says that eBPF does for the kernel kind of what JavaScript did for HTML. Eventually it gives you the flexibility to enjoy the high performance of the Linux kernel, which was previously unreachable without writing a kernel module, and to use it to get the value that you need for your company in a programmable way.
What it actually means is that you can get value from the kernel really fast. I mean, before, as you see on the left, if I wanted something really special to happen inside the Linux kernel for my application's needs, I would have to push my idea, wait for the kernel community to adopt it, and then wait for the new distribution to eventually reach users, say five years from now.
eBPF is basically an old technology in a sense, in that it's already been part of our Linux kernel for a while. As a technology called BPF, it was already part of tcpdump and a lot of different packet filtering mechanisms inside the kernel. But in 2014, which is exactly the year Kubernetes was born, eBPF kind of rose and allowed a lot of new features to become relevant for actual application developers, to say, okay, part of my logic can actually run inside the Linux kernel and it can be safe and it can be fast. And around 2017, with kernel version 4.14, that's where things really picked up, and that's where a lot of the modern features of eBPF reached a maturity point where we can actually do things like deep observability using eBPF.
Now, the eBPF architecture is basically very interesting. We're not going to dive too deep into it, but from the user space I can load an eBPF program, which is transformed into eBPF bytecode and loaded into the kernel. It's checked by a verifier, which means the runtime of this specific program is limited, it's super safe, and it can't access parts of the kernel where it shouldn't. It behaves in a passive way, so I can't actually crash my applications or hurt the kernel in any way. And it's then compiled at runtime to actual machine code, which allows it to be super efficient. So basically, now I have the ability to run business logic inside the kernel, while I still have the ability, using primitives like maps, to communicate with the user space. So imagine an observability solution: I can collect data like traces or API calls between microservices from the kernel space, and process them back in the user space. And to do that, I don't have to be part of the application in the user space, because once I'm in the kernel space I have the privileges and the ability to observe everything that is happening in the user space.
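As a rough sketch of how the user-space side of such a collector can look, here's a minimal Go program using the open source cilium/ebpf library; the object file, the program and map names, and the tcp_sendmsg kprobe are illustrative assumptions, not any particular vendor's agent.

```go
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/ringbuf"
)

func main() {
	// Load compiled eBPF bytecode; the kernel's verifier checks it on load.
	coll, err := ebpf.LoadCollection("observer.o")
	if err != nil {
		log.Fatalf("loading eBPF collection: %v", err)
	}
	defer coll.Close()

	// Attach a program to a kernel function on the TCP send path, so traffic
	// can be observed without touching any application code.
	kp, err := link.Kprobe("tcp_sendmsg", coll.Programs["trace_tcp_sendmsg"], nil)
	if err != nil {
		log.Fatalf("attaching kprobe: %v", err)
	}
	defer kp.Close()

	// eBPF maps (here a ring buffer) carry events from kernel to user space.
	rd, err := ringbuf.NewReader(coll.Maps["events"])
	if err != nil {
		log.Fatalf("opening ring buffer: %v", err)
	}
	defer rd.Close()

	for {
		record, err := rd.Read()
		if err != nil {
			return
		}
		// record.RawSample holds the raw event emitted by the kernel program;
		// a real agent would decode it into a trace, span, or metric here.
		log.Printf("received %d bytes from the kernel", len(record.RawSample))
	}
}
```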
Now, why run things in the kernel anyway? The reason is very clear. One is efficiency: you get to enjoy the super efficient performance of the kernel's resources. Did you ever think about how much overhead, which is really hard to measure, your observability SDK currently adds? I mean, say you implement OpenTelemetry: do you know how much overhead, in CPU, memory, or response time, your SDK incurs in your application? That's really hard to measure, and really scary sometimes. Second is safety: an eBPF program can't crash the application in any way, whereas a badly implemented SDK or kernel module can definitely crash your application or even your entire server. And third is 100% coverage: you can see everything that happens in the user space, all at once, out of band. In the old world of one process, say one Java program, that wasn't such a big advantage; you could clearly instrument that one Java process and get whatever you wanted from it. But imagine our current servers. Our current servers are actually Kubernetes nodes, for example, in the cloud native domain. And a Kubernetes node can suddenly host 150 different containers on top of it. Observing all of them at once, at the same depth, without changing even one runtime or one line of container code running inside that node, that's really amazing. And that's part of what eBPF can do.
And that also empowers our DevOps or SRE to basically do things on their own. They don't have to convince R&D anymore to do what they want. They can set one uniform approach and a super high professional standard, and actually hold onto it and perform really well, because they're responsible for integrating the solution out of band from the application, and they're responsible for using it to measure performance, debug, and alert on the things they want to be alerted on. The second point is, as we said, that observability doesn't scale. I mean, you store huge amounts of irrelevant data, and it doesn't make sense anymore to get so little insight for so much data. And it makes the cost unbearable. The reason is the centralized architectures that we talked about before:
you have a lot of instrumented applications sending deep observability data, you also have agents monitoring your infrastructure, and everything goes back into huge data stores inside your APM vendor, where you get a friendly UI that allows you to query and explore your data. Using centralized architectures definitely introduced a big debt in how we treat data volumes as part of our observability stack. One issue is that we have to use random sampling. I mean, people ask all the time, how can I control these huge data volumes? I have super high throughput APIs like Redis and Kafka flowing through my production environment. How am I expected to pay for all this storage? Who's going to store that for me? So the vendor allows you to do things like random sampling: just sample half a percent of your production, store it, and in most cases it will be enough to get what you want. That's definitely a scary situation when you have to catch that interesting bug that happens one in 10,000 requests, and those are usually the bugs or issues that you care about. That's a scary primitive to work with.
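To put rough numbers on that, here's a tiny back-of-the-envelope calculation using just the example figures above:

```go
// Back-of-the-envelope: at a 0.5% random sample rate, how rarely does a
// one-in-10,000 failure actually land in the data you keep?
package main

import "fmt"

func main() {
	const sampleRate = 0.005        // keep 0.5% of requests
	const failureRate = 1.0 / 10000 // the interesting bug: 1 in 10,000 requests

	// Expected number of sampled failures per million requests (~0.5).
	fmt.Printf("sampled failures per 1M requests: %.2f\n", 1_000_000*failureRate*sampleRate)

	// On average you need about two million requests before a single bad
	// request is captured at all, and even then without its full context.
	fmt.Printf("requests per captured failure: %.0f\n", 1/(failureRate*sampleRate))
}
```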
The second issue is that raw data is stored so that the centralized architecture can be efficient. Raw data such as spans and traces is sent to the observability vendor, where it's processed for insights. I mean, an insight could be something as simple as a p50 latency, the median latency of a specific API resource over time. That's something that's super common for setting and enforcing, say, SLOs. So if all I want is that p50 metric, why the hell should I store all these spans, just to let the vendor deduce the metrics that I want on their back end? That doesn't make sense anymore. And that's one debt that we're facing. Another is that it's
that we're facing with another is that it's
really built around a rigid data collection.
You're built in a way where the data collection is simple.
You just send all the things back to the vendor's back end and
then the magic happens. Now that's great,
but what happens when you want to collect things at different depth?
Say I have 10,000 requests, which are
HTTP requests, which return 200. Okay, they're all perfect.
And I have one request failing with returning 500.
Internal server error. Do I really want to know the same about these
two requests? I mean, not exactly. I want to know much more details
about the failed request, such as give me the full body request
and response of that payload. I want to see what happened. I might not
be interested at knowing that for all the other
perfectly fine requests that flew through that microservice
eventually. Basically what happens is that you store irrelevant data
all the time. I mean, as we said before, to get
something as simple as a matrix, you have to store so many raw
data all the time to eventually get that value.
And that creates an equation that you can eventually
hold onto when it comes to pricing,
it forces you to be limited in cardinality, because where you
do care about a specific error, you don't get all the information that
you would have wanted, because otherwise you would have to store so
much data across all the information that you're collecting
with your observability vendor.
Now, that's where things are done differently in a cloud native approach, and that's where groundcover is actually different. We're, for example, using in-house span-based metrics, which means that we can create metrics from the raw data as it flies through the eBPF agent, without storing all these spans just to eventually pay for their storage on the vendor side, while still enjoying the value of the metrics that we actually want to observe.
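As a simplified sketch of the idea (not groundcover's actual implementation), this is what aggregating a p50 latency at the edge instead of shipping every span could look like; a real agent would use a bounded histogram or sketch rather than raw in-memory samples:

```go
// Package edgemetrics is a simplified sketch of span-based metrics computed
// at the edge: only the aggregate survives, the raw spans are never shipped.
package edgemetrics

import (
	"sort"
	"sync"
	"time"
)

// Span is a hypothetical, stripped-down span as seen by an edge agent.
type Span struct {
	Resource string        // e.g. "GET /api/checkout"
	Duration time.Duration // end-to-end latency of the call
}

// P50Aggregator keeps only latency samples per resource, not the spans.
type P50Aggregator struct {
	mu      sync.Mutex
	samples map[string][]time.Duration
}

func NewP50Aggregator() *P50Aggregator {
	return &P50Aggregator{samples: make(map[string][]time.Duration)}
}

// Observe is called for every span flowing through the agent; after this
// call the span itself can be dropped, only its duration is retained.
func (a *P50Aggregator) Observe(s Span) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.samples[s.Resource] = append(a.samples[s.Resource], s.Duration)
}

// Flush returns the median (p50) latency per resource and resets the state.
// That single number per resource is what an SLO actually needs.
func (a *P50Aggregator) Flush() map[string]time.Duration {
	a.mu.Lock()
	defer a.mu.Unlock()
	out := make(map[string]time.Duration, len(a.samples))
	for res, durs := range a.samples {
		sort.Slice(durs, func(i, j int) bool { return durs[i] < durs[j] })
		out[res] = durs[len(durs)/2]
	}
	a.samples = make(map[string][]time.Duration)
	return out
}
```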
We also support variable-depth capture, which means we can collect, for example, the full payload and all the logs around a specific failed request, and we can decide to collect other things, say for a high-latency event, for example the CPU usage of the node at the same time, or whatever we want. But we can collect very shallow information for a normal flow, for example not collecting all the OK requests of a normal flow and just sampling them differently. This is a major advantage when it comes to real systems at high throughput, where a lot of things are perfectly fine and you just want some information about how they behave, give me a few examples of a normal flow, but you really want to dive deep into high-latency and failed requests and so on.
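A hedged sketch of that decision logic, with hypothetical names and thresholds rather than a real product API:

```go
// Package capture sketches variable-depth capture; names, thresholds, and
// the sampling scheme here are illustrative assumptions only.
package capture

import "time"

// Depth says how much we keep about a single request.
type Depth int

const (
	DepthShallow Depth = iota // timing and status only
	DepthSampled              // occasionally keep one full example of a normal flow
	DepthFull                 // full request/response payloads plus surrounding logs
)

// Request is the minimal information the agent needs to make the decision.
type Request struct {
	StatusCode int
	Latency    time.Duration
}

// Decider chooses a capture depth per request: deep for failures and slow
// requests, shallow (with an occasional sample) for the thousands of 200 OKs.
type Decider struct {
	SlowThreshold time.Duration
	SampleOneIn   int
	seen          int
}

func (d *Decider) Choose(r Request) Depth {
	switch {
	case r.StatusCode >= 500:
		return DepthFull // e.g. 500 Internal Server Error: keep everything
	case r.Latency > d.SlowThreshold:
		return DepthFull // high-latency event: keep payloads, maybe node CPU too
	default:
		d.seen++
		if d.SampleOneIn > 0 && d.seen%d.SampleOneIn == 0 {
			return DepthSampled // a few examples of the normal flow
		}
		return DepthShallow
	}
}
```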
And the third thing is that, as we said, information isn't kept private. In the modern age it doesn't make sense to store traces and logs on the vendor side. So groundcover is built differently from the architectures we see here. It's built differently from digesting data, storing it on the vendor side, and then accessing it. It's built in an architecture where the data plane sits inside your in-cloud deployment, so eventually you get persistent storage of the data you actually care about, like logs, traces, and metrics, stored on your side, while the control plane allows you to access it from a SaaS experience. That has a few really interesting advantages. One is that you're eventually allowing teams to enjoy and reuse that value, and also allowing them to keep it private. Once I store that data inside my environment, I'm able to reuse it for different purposes, and I'm also able to protect it and maybe allow broader data collection around more sensitive areas that I'm debugging than I would otherwise allow if I stored it on the vendor side, which is a much scarier option for the security of my company. Now, this is our current reality.
This is not something we're dreaming of or a vision that is far away. This is the current reality of groundcover and of all the future cloud native solutions that will emerge in the next few years. We're using eBPF instrumentation that allows immediate time to value and out-of-band deployment, so one person can get instant value within two minutes across a huge company's cloud native environment. We're using edge-based observability, or distributed data collection and ingestion, which allows us to be built for scale and to break all these volume-based pricing trade-offs, which are unpredictable and costly. And we're also using an in-cloud architecture, where data is kept private and basically in your full control.
That's it. That was kind of a sneak peek into what cloud native environments will look like in the future, and how the disadvantages of legacy solutions inside cloud native environments can be transformed into solutions that actually fit the way you develop, think, and operate in a cloud native environment. groundcover is one of these solutions, and we encourage you to explore groundcover and other solutions that might fit your stack better than what you're currently using. Feel free to start for free with our solution anytime, and we're happy to answer any questions about it. Thank you, guys.