Transcript
Hey everyone, welcome to my talk. I'll cover cloud
native observability today and how we can actually get true Kubernetes
observability using eBPF, which is a super interesting technology
that we'll dive into in a sec.
Before starting anything, just wanted to kind of align us
on the core value of observability. Observability is a core competency
for every single team out there. I mean, there's almost no team anywhere on the globe that isn't using some sort of observability, whatever you would call it. It could be logs,
infrastructure metrics, custom metrics that customers
create for themselves, or even deep application performance monitoring
like we know from advanced tools in the market. But today
we've reached a point where observability spends are
about 20% to 30% of infrastructure spend, which is,
say, what you would pay AWS or GCP for hosting
your cloud infrastructure. Now you have to agree with me that we should just stop here for a second, since these numbers are huge. Just wrap your head around these numbers: 20 or 30% of infrastructure costs. That's unheard of, and I think most teams aren't even aware of it. But besides being super costly, these bills are also unpredictable.
I mean, on the left we can see Datadog's pricing, and Datadog is clearly one of the market leaders in observability. And what you would eventually pay at the end of the month is super unclear. Vendors expect us as engineers and engineering managers to know too many details about the volumes of our production, from so many different aspects. I mean, just imagine you have to pay for logs. So on the left, you would have to know how many logs you have per month on a volume basis, like the number of log lines and the number of gigabytes that you would actually ingest into Datadog over a period of a month. You have to know that, right? No, no one knows that, and it's too complicated to know. And even if you do know it, one engineer can cause a 30% rise in your bill next month due to some cardinality or volume error they introduced into production or dev environments.
So that's the really scary part.
And before we go further, let's ask ourselves how we even got here. I mean, cloud native, which is clearly a huge buzzword of the last few years and has impacted a lot of our decision making and architectures, has just made things worse.
We see, and this is from a report by O'Reilly, that observability data is rising much faster than actual business-related data or the business volume of cloud-based companies. And the reason is very clear. In a sense, we're segregating our environments into a lot of different components, microservices that talk to each other and communicate over API-driven interactions. And to monitor one single flow of one customer request through your production, suddenly you have to monitor a distributed system with dozens of components talking to each other, creating their own logs, exchanging API calls and traces between one another. And it means that you have to store so much more to get the same insight into what exactly happened, where my bottleneck is, or why I'm returning a wrong response to the actual user.
So that actually made things worse. And it's easy to understand why, when a recent survey by Kong last year showed that most companies, or the average company, have over 180 microservices in production. I mean, that's a crazy amount if you just think of what's standing behind that enormous number. There are a lot of teams working in different ways, with different technology stacks and so on, just to create a mesh of business logic that communicates with itself and eventually serves value to your customers.
Now, part of the reason we got here is actually the advantage of how observability was originally built, back when we were just starting off. I mean, observability was in a sense built to be part of the dev cycle. It was very clear that developers would eventually integrate the solution, either by changing the runtime or integrating an SDK into their code that would help them observe their application. They would then decide what to measure, where to instrument code, what metrics to actually expose, and so on. And then
they would happily enjoy that value and build dashboards
and alerts, and set whatever goals and
KPIs they would like to measure as part of the new data that
they just collected. This had a clear advantage. I mean,
you could have one team in a big company creating
impact on their observability stack pretty easily. They could
just integrate their code, change what they wanted to
change, observe what they wanted to observe, and get value really fast.
Because the alternative would be integrating the observability solution through the infrastructure, through the lowest common denominator of all the company's teams that are working on different kinds of stacks at the same time. So if you integrate a solution through the infrastructure, you would maybe get a uniform value, but suddenly it's much harder to integrate, because developers don't get to decide which tools to install on their Kubernetes cluster, for example. They don't have access to it. It's a longer reach for a developer to impact the company's infrastructure. So observability was built, in a sense, in a way that made sense at the time.
Now imagine the different teams on top of it. Today we're creating a situation, which I think is good, where teams have the autonomy to select their own technology stacks. I mean, you have a data science team working in Python, a web team working in Node.js, a backend team working in Go. That's all super reasonable in today's microservices architecture.
But who's asking the question of whether all this data is really needed? I mean, if I let the data science team and the web team each collect their own data, instrument their own code, and decide what to measure for observability purposes, who is actually in charge of asking: is all this data needed? Are we paying the correct price for the insights that we need as a company, as an R&D group? That's harder when you work in a distributed manner like this. And I think another important question is, are we setting up whoever is responsible to succeed? I mean, that worried DevOps or SRE you see here, they're eventually responsible for waking up at night and saving production when things go bad. So are we giving them the tools to succeed if they can't eventually impact each and every team, or if they have to work super hard to align all the teams to do what they want them to do? Are they able to get 100% coverage? I mean, eventually, if there's one team maintaining some legacy code, can the SRE or DevOps really push that team to instrument its code? Not exactly. So they already have blind spots.
Can they control the cost versus visibility trade-off? I mean, can they control which logs a specific team inside a huge company is storing, to make sure the cost doesn't climb to a crazy amount next month? Not exactly. They have partial tools to do that, but not exactly. So we're putting them in a position where we expect so much: take care of production, make sure it's healthy, get all the observability you need to do that, it's your job. But eventually it's the developers who determine what value they will actually get, and what tools they will actually have, to succeed in that job.
Now, legacy observability solutions
also have kind of another disadvantage
when it comes to cloud native environments. We mentioned data
rising so fast and the amounts of money you have to pay to get
insights with these tools. But eventually it results from their
pricing models being completely unscalable.
Once you base a pricing model on volume, cardinality, or any of these unpredictable things, engineers can't really know what will happen next month. But they do know that it won't scale well, because if I currently have five microservices, I'm not exactly sure how much I would pay with these pricing models when I go to 50 microservices. And the reason is that communication doesn't grow linearly in a microservices mesh architecture. These pricing models have been proven to be unscalable. Once you pay 20 or 30% of your cloud cost, it already means you've reached a point where you're paying too much for observability, and it doesn't scale well with your production.
Another deficiency, as we mentioned, is harder organizational alignment. It's so hard to implement the same observability solution across such a heterogeneous R&D group in a big company. Getting everybody to align on the same value, the same observability standards, the same measures that they track in production, dev, and staging, that's really hard in a big organization once you go the legacy way of letting developers instrument their code and do the integration work for their observability vendor. And the third part is data privacy. I think that when these solutions were built over a decade ago, it made sense in most cases to send all your data to the vendor to be stored. And eventually these vendors are data companies, storing and charging you for the data volumes being stored on their side. So the data isn't private. You're working in a cloud native environment, usually already segregated inside cloud native primitives such as Kubernetes clusters and namespaces and all that, but eventually you just send all this data, logs and traces which contain sensitive information, sometimes PII, to your observability vendor, just because you have no other choice. And it doesn't make sense in modern cloud native environments.
Now the bottom line is very clear, I think: observability is under-adopted. On the left you can see a recent survey showing that deep observability solutions are heavily under-adopted, with about 70% of teams using logs as the basic measure, or the basic layer, of observability. Far fewer of these teams are implementing APMs, application performance monitoring tools, even though everybody agrees on the value of these tools. Eventually it means that there's a gap. It can be any one of the reasons we stated before: it could be the price, the difficulty of getting the solution onboarded into a big organization, or data privacy. It could be any one of these things. But the bottom line is very clear. Teams are not adopting these solutions at the rate you would expect from such high-value solutions that could help them troubleshoot better in production. Now that's exactly the reason that we've
built groundcover. groundcover is built to be a modern cloud native observability solution. We'll talk about how it's built in a second, but as a concept, groundcover is trying to create a modern approach that fits teams working in cloud native environments and solves the problems that we stated before. And we can state three different measures that groundcover took in order to build our vision of what we think is the correct way to build an observability platform.
One is that data is not collected through the dev cycle. Basically, to collect data, we don't need to be part of the development cycle and worry about each team inside a big company implementing and integrating the observability solution. It's not that developers aren't needed in the process; they're super important for determining what to measure, together with SRE and DevOps and all these teams. But we empower one DevOps or SRE person to integrate an observability solution all across a huge company without working their way through too many stakeholders inside the company. And how are we doing that? We're doing that with eBPF.
eBPF is a technology we'll talk about in a second, but it allows us to collect information, observability data, out of band from the application. We don't have to be part of the development cycle, we don't have to be part of the application code, to actually collect data about what the application is doing and to observe it in a deep way, and that's a really major leap forward. The second is that data is digested in a distributed way, where it lies. Basically, groundcover digests data as it flows through our agent, which is distributed across your cloud native environment, and decisions about the data are already being made at a very early stage of the process. It allows us to collect data really smartly, reduce the volumes of data that we collect for the same insights that you would expect, and basically break the volume-based pricing models. Instead of paying for volume, you can pay differently once you digest data in a distributed manner, the way your cloud native environment is actually built.
And the third is privacy. We use an in-cloud architecture, which is really sophisticated, where the whole data plane basically resides in your cloud environment. groundcover has no access to the data; all the data is kept private in your cloud environment, while we still provide a SaaS experience that can be shared inside the organization and so on. So these are three major differentiators that push an observability system into the cloud native domain and make it a better fit for modern teams. Now let's cover these three segments one by one. The first is that observability is basically hard to integrate. We see a lot of R&D effort and a lot of coordination inside teams. Just to prove that point: for one team to be able to instrument OpenTelemetry, they would have to work hard, integrate an SDK, and change lines of code across the application to eventually get the value of OpenTelemetry or any other vendor-based SDK. And that's the best we can get; for some of the languages, auto-instrumentation doesn't really work and you have to do the work manually.
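Just to make that concrete, here's a minimal sketch of what manual instrumentation looks like with the OpenTelemetry Go SDK; the service name, span name, and attribute are placeholders, and a tracer provider and exporter still have to be configured elsewhere.

```go
package checkout

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// handleCheckout shows the kind of change every team has to repeat,
// endpoint by endpoint, to get traces out of their own code.
func handleCheckout(ctx context.Context, cartID string) error {
	// Start a span around the business logic; the names are placeholders.
	ctx, span := otel.Tracer("checkout-service").Start(ctx, "handleCheckout")
	defer span.End()

	span.SetAttributes(attribute.String("cart.id", cartID))

	// ... the actual business logic goes here, now wrapped in tracing ...
	return nil
}
```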
And we're still stuck inside that dev cycle. I mean, developers instrument their code, they wait for the next version release to actually reach production, and then they get the value in production. That's a pretty long cycle. For one team it's maybe reasonable. But what happens in a real company, when you have 180 different teams or 180 different microservices that you have to work your way through to actually get to production? That is where things get really hard. Because eventually, getting all these teams to do all this at the same time, with the same principles, at the same depth, without making mistakes, that's the real pain of being part of the development cycle. And eventually you still get to the point where you have the same worried DevOps or SRE person responsible for aligning all these teams. Can they even do it? Can they get one uniform approach across a huge company? Can they set high professional standards that will allow them to, say, set and track SLOs? It's hard.
Now, eBPF is the next Linux superpower. That's one of the attributes it was given. eBPF basically allows you to run business logic in a sandboxed environment inside the Linux kernel. One interesting comparison was made by Brendan Gregg, an ex-Netflix engineer, who says that eBPF does for the kernel kind of what JavaScript did for HTML. Eventually it gives you the flexibility to enjoy the high performance of the Linux kernel, which was previously unreachable without writing a kernel module, and to use it to get the value that you need for your company in a programmable way.
What it actually means is that you can get value from the kernel really fast. I mean, before, as you see on the left, if I wanted something really special to happen inside the Linux kernel for my application's needs, I would have to push my idea, wait for the kernel community to adopt it, and then wait for the new distribution to eventually reach users, say five years from now.
eBPF is basically an old technology in a sense, in that it's already been part of our Linux kernel for a while. As a technology called BPF, it was already part of tcpdump and a lot of different packet filtering mechanisms inside the kernel. But in 2014, which is exactly the year Kubernetes was born, eBPF kind of rose and allowed a lot of new features to become relevant for actual application developers, to say, okay, part of my logic can actually run inside the Linux kernel and it can be safe and it can be fast. And around 2017, with kernel version 4.14, that's where things really picked up, and that's where a lot of the modern features of eBPF reached a maturity point where we can actually do things like deep observability using eBPF.
Now, the eBPF architecture is basically very interesting. We're not going to dive too deep into it, but from the user space I can load an eBPF program, which is transformed into eBPF bytecode and loaded into the kernel. It's checked by a verifier, which means the runtime of this specific program is limited, it's super safe, and it can't access parts of the kernel where it shouldn't. It behaves in a passive way, so I can't actually crash my applications or hurt the kernel in any way. And it's then compiled at runtime to actual machine code, which allows it to be super efficient. So basically, now I have the ability to run business logic inside the kernel, while I still have the ability, using primitives like maps, to communicate with the user space. So imagine an observability solution: I can collect data like traces or API calls between microservices from the kernel space, and process them back in the user space. And to do that, I don't have to be part of the application in the user space, because once I'm in the kernel space I have the privileges and the ability to observe everything that is happening in the user space.
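As a rough sketch of how the user-space side of such a collector can look, here's a minimal Go program using the open source cilium/ebpf library; the object file, the program and map names, and the tcp_sendmsg kprobe are illustrative assumptions, not any particular vendor's agent.

```go
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/ringbuf"
)

func main() {
	// Load compiled eBPF bytecode; the kernel's verifier checks it on load.
	coll, err := ebpf.LoadCollection("observer.o")
	if err != nil {
		log.Fatalf("loading eBPF collection: %v", err)
	}
	defer coll.Close()

	// Attach a program to a kernel function on the TCP send path, so traffic
	// can be observed without touching any application code.
	kp, err := link.Kprobe("tcp_sendmsg", coll.Programs["trace_tcp_sendmsg"], nil)
	if err != nil {
		log.Fatalf("attaching kprobe: %v", err)
	}
	defer kp.Close()

	// eBPF maps (here a ring buffer) carry events from kernel to user space.
	rd, err := ringbuf.NewReader(coll.Maps["events"])
	if err != nil {
		log.Fatalf("opening ring buffer: %v", err)
	}
	defer rd.Close()

	for {
		record, err := rd.Read()
		if err != nil {
			return
		}
		// record.RawSample holds the raw event emitted by the kernel program;
		// a real agent would decode it into a trace, span, or metric here.
		log.Printf("received %d bytes from the kernel", len(record.RawSample))
	}
}
```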
Now, why run things in the kernel anyway? The reason is very clear. One is efficiency: you get to enjoy the super efficient performance of the kernel's resources. Did you ever think about how much overhead, which is really hard to measure, your observability SDK currently adds? I mean, say you implement OpenTelemetry: do you know how much overhead, in CPU, memory, or response time, your SDK incurs in your application? That's really hard to measure, and really scary sometimes. Second is safety: an eBPF program can't crash the application in any way, whereas a badly implemented SDK or kernel module can definitely crash your application or even your entire server. And third is 100% coverage: you can see everything that happens in the user space, all at once, out of band. In the old world of one process, say one Java program, that wasn't such a big advantage; you could clearly instrument that one Java process and get whatever you wanted from it. But imagine our current servers. Our current servers are actually Kubernetes nodes, for example, in the cloud native domain. And a Kubernetes node can suddenly host 150 different containers on top of it. Observing all of them at once, at the same depth, without changing even one runtime or one line of container code running inside that node, that's really amazing. And that's part of what eBPF can do.
And that also empowers our DevOps or SRE to basically do things on their own. They don't have to convince R&D anymore to do what they want. They can set one uniform approach and a super high professional standard, and actually hold onto it and perform really well, because they're responsible for integrating the solution out of band from the application, and they're responsible for using it to measure performance, debug, and alert on the things they want to be alerted on. The second point is, as we said, that observability doesn't scale. I mean, you store huge amounts of irrelevant data, and it doesn't make sense anymore to get so little insight for so much data. And it makes the cost unbearable. The reason is the centralized architectures that we talked about before:
you have a lot of instrumented applications sending deep observability data, you also have agents monitoring your infrastructure, and everything goes back into huge data stores inside your APM vendor, where you get a friendly UI that allows you to query and explore your data. Using centralized architectures definitely introduced a big debt in how we treat data volumes as part of our observability stack. One issue is that we have to use random sampling. I mean, people ask all the time, how can I control these huge data volumes? I have super high throughput APIs like Redis and Kafka flowing through my production environment. How am I expected to pay for all this storage? Who's going to store that for me? So the vendor allows you to do things like random sampling: just sample half a percent of your production, store it, and in most cases it will be enough to get what you want. That's definitely a scary situation when you have to catch that interesting bug that happens one in 10,000 requests, and those are usually the bugs or issues that you care about. That's a scary primitive to work with.
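To put rough numbers on that, here's a tiny back-of-the-envelope calculation using just the example figures above:

```go
// Back-of-the-envelope: at a 0.5% random sample rate, how rarely does a
// one-in-10,000 failure actually land in the data you keep?
package main

import "fmt"

func main() {
	const sampleRate = 0.005        // keep 0.5% of requests
	const failureRate = 1.0 / 10000 // the interesting bug: 1 in 10,000 requests

	// Expected number of sampled failures per million requests (~0.5).
	fmt.Printf("sampled failures per 1M requests: %.2f\n", 1_000_000*failureRate*sampleRate)

	// On average you need about two million requests before a single bad
	// request is captured at all, and even then without its full context.
	fmt.Printf("requests per captured failure: %.0f\n", 1/(failureRate*sampleRate))
}
```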
The second issue is that raw data is stored so that the centralized architecture can be efficient. Raw data such as spans and traces is sent to the observability vendor, where it's processed for insights. I mean, an insight could be something as simple as a p50 latency, the median latency of a specific API resource over time. That's something that's super common for setting and enforcing, say, SLOs. So if all I want is that p50 metric, why the hell should I store all these spans, just to let the vendor deduce the metrics that I want on their back end? That doesn't make sense anymore. And that's one debt that we're facing. Another is that it's
that we're facing with another is that it's
really built around a rigid data collection.
You're built in a way where the data collection is simple.
You just send all the things back to the vendor's back end and
then the magic happens. Now that's great,
but what happens when you want to collect things at different depth?
Say I have 10,000 requests, which are
HTTP requests, which return 200. Okay, they're all perfect.
And I have one request failing with returning 500.
Internal server error. Do I really want to know the same about these
two requests? I mean, not exactly. I want to know much more details
about the failed request, such as give me the full body request
and response of that payload. I want to see what happened. I might not
be interested at knowing that for all the other
perfectly fine requests that flew through that microservice
eventually. Basically what happens is that you store irrelevant data
all the time. I mean, as we said before, to get
something as simple as a matrix, you have to store so many raw
data all the time to eventually get that value.
And that creates an equation that you can eventually
hold onto when it comes to pricing,
it forces you to be limited in cardinality, because where you
do care about a specific error, you don't get all the information that
you would have wanted, because otherwise you would have to store so
much data across all the information that you're collecting
with your observability vendor.
Now, that's where things are done differently in a cloud native approach, and that's where groundcover is actually different. We're, for example, using in-house span-based metrics, which means that we can create metrics from the raw data as it flies through the eBPF agent, without storing all these spans just to eventually pay for their storage on the vendor side, while still enjoying the value of the metrics that we actually want to observe.
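As a simplified sketch of the idea (not groundcover's actual implementation), this is what aggregating a p50 latency at the edge instead of shipping every span could look like; a real agent would use a bounded histogram or sketch rather than raw in-memory samples:

```go
// Package edgemetrics is a simplified sketch of span-based metrics computed
// at the edge: only the aggregate survives, the raw spans are never shipped.
package edgemetrics

import (
	"sort"
	"sync"
	"time"
)

// Span is a hypothetical, stripped-down span as seen by an edge agent.
type Span struct {
	Resource string        // e.g. "GET /api/checkout"
	Duration time.Duration // end-to-end latency of the call
}

// P50Aggregator keeps only latency samples per resource, not the spans.
type P50Aggregator struct {
	mu      sync.Mutex
	samples map[string][]time.Duration
}

func NewP50Aggregator() *P50Aggregator {
	return &P50Aggregator{samples: make(map[string][]time.Duration)}
}

// Observe is called for every span flowing through the agent; after this
// call the span itself can be dropped, only its duration is retained.
func (a *P50Aggregator) Observe(s Span) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.samples[s.Resource] = append(a.samples[s.Resource], s.Duration)
}

// Flush returns the median (p50) latency per resource and resets the state.
// That single number per resource is what an SLO actually needs.
func (a *P50Aggregator) Flush() map[string]time.Duration {
	a.mu.Lock()
	defer a.mu.Unlock()
	out := make(map[string]time.Duration, len(a.samples))
	for res, durs := range a.samples {
		sort.Slice(durs, func(i, j int) bool { return durs[i] < durs[j] })
		out[res] = durs[len(durs)/2]
	}
	a.samples = make(map[string][]time.Duration)
	return out
}
```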
We also support variable-depth capture, which means we can collect, for example, the full payload and all the logs around a specific failed request, and we can decide to collect other things, say for a high-latency event, for example the CPU usage of the node at the same time, or whatever we want. But we can collect very shallow information for a normal flow, for example not collecting all the OK requests of a normal flow and just sampling them differently. This is a major advantage when it comes to real systems at high throughput, where a lot of things are perfectly fine and you just want some information about how they behave, give me a few examples of a normal flow, but you really want to dive deep into high-latency and failed requests and so on.
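A hedged sketch of that decision logic, with hypothetical names and thresholds rather than a real product API:

```go
// Package capture sketches variable-depth capture; names, thresholds, and
// the sampling scheme here are illustrative assumptions only.
package capture

import "time"

// Depth says how much we keep about a single request.
type Depth int

const (
	DepthShallow Depth = iota // timing and status only
	DepthSampled              // occasionally keep one full example of a normal flow
	DepthFull                 // full request/response payloads plus surrounding logs
)

// Request is the minimal information the agent needs to make the decision.
type Request struct {
	StatusCode int
	Latency    time.Duration
}

// Decider chooses a capture depth per request: deep for failures and slow
// requests, shallow (with an occasional sample) for the thousands of 200 OKs.
type Decider struct {
	SlowThreshold time.Duration
	SampleOneIn   int
	seen          int
}

func (d *Decider) Choose(r Request) Depth {
	switch {
	case r.StatusCode >= 500:
		return DepthFull // e.g. 500 Internal Server Error: keep everything
	case r.Latency > d.SlowThreshold:
		return DepthFull // high-latency event: keep payloads, maybe node CPU too
	default:
		d.seen++
		if d.SampleOneIn > 0 && d.seen%d.SampleOneIn == 0 {
			return DepthSampled // a few examples of the normal flow
		}
		return DepthShallow
	}
}
```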
And the third thing is that, as we said, information isn't kept private. In the modern age it doesn't make sense to store traces and logs on the vendor side. So groundcover is built differently from the architectures we see here. It's built differently from digesting data, storing it on the vendor side, and then accessing it. It's built in an architecture where the data plane sits inside your in-cloud deployment, so eventually you get persistent storage of the data you actually care about, like logs, traces, and metrics, stored on your side, while the control plane allows you to access it from a SaaS experience. That has a few really interesting advantages. One is that you're eventually allowing teams to enjoy and reuse that value, and also allowing them to keep it private. Once I store that data inside my environment, I'm able to reuse it for different purposes, and I'm also able to protect it and maybe allow broader data collection around more sensitive areas that I'm debugging than I would otherwise allow if I stored it on the vendor side, which is a much scarier option for the security of my company. Now, this is our current reality.
This is not something we're dreaming of or a vision that is far away. This is the current reality of groundcover and of all the future cloud native solutions that will emerge in the next few years. We're using eBPF instrumentation that allows immediate time to value and out-of-band deployment, so one person can get instant value within two minutes across a huge company's cloud native environment. We're using edge-based observability, or distributed data collection and ingestion, which allows us to be built for scale and to break all these volume-based pricing trade-offs, which are unpredictable and costly. And we're also using an in-cloud architecture, where data is kept private and basically in your full control.
That's it. That was kind of a sneak peek into what cloud native environments will look like in the future, and how the disadvantages of legacy solutions inside cloud native environments can be transformed into solutions that actually fit the way you develop, think, and operate in a cloud native environment. groundcover is one of these solutions, and we encourage you to explore groundcover and other solutions that might fit your stack better than what you're currently using. Feel free to start for free with our solution anytime, and we're happy to answer any questions about it. Thank you, guys.