Transcript
This transcript was autogenerated. To make changes, submit a PR.
All right, I'm really excited today to share a little bit about Fairwinds'
Kubernetes benchmark report for 2024. We're really
kind of proud of some of the work that our team has put together over
the past few weeks, getting this report ready for the new year,
and excited to kind of share some of the findings that we observed
by studying over 330,000 workloads.
So just as a quick introduction, my name is Joe Pelletier. I'm a
product manager here at Fairwinds. I've been in the Kubernetes
and container space for a little over five years
and have been working on the Fairwinds Insights product.
Like I mentioned, this is now the third year we've done this report,
and we've analyzed over 330,000 workloads to inform the data
behind it. It's the largest number of workloads we've ever
analyzed for the report, and it includes more than
a dozen different types of policies covering kind of reliability,
security, as well as cost efficiency.
And when it comes to Kubernetes, we find that organizations really have
to consider all three types of checks
here in order to make sure that they are running workloads that align with best
practices. This gives you a little bit of an example of the
types of policies we've evaluated. I won't go through every single one,
but we'll be covering some of these in today's presentation.
The final report actually covers a number
of different categories of policies and provides a lot more depth
than what we'll be able to cover in today's session. So I do recommend that
you download the report. I think what you'll see in today's presentation
is sort of a summary, an abridged version, and hopefully kind of
see a little bit of where the industry is going in terms of both
kind of Kubernetes best practices and sort of how well organizations are aligning to
those best practices as well. Okay, so a common graph that
you're going to see in this report will look like this. And what
I'll do is spend a little bit of time just explaining how to read the
report and how to read the different charts and graphs that you're seeing.
So on the y axis here, you'll see sort of the percentage of
workloads impacted within an organization. And on the
x axis, you'll see the percentage of organizations that
were evaluated in terms of how many
of their workloads were actually impacted as
a percentage. And so a great way to kind of read
this report or this example here is,
if you take this example, the number of organizations with less than 10% of workloads
impacted has fallen from 46% in
2022 to 21% in 2023.
And so when you see something like that,
where the number of organizations that have such a small percentage
of workloads impacted decrease, it actually demonstrates that the problem
might be getting harder to control. And so that's some of the ways that we've
been able to kind of highlight information in this report.
And you will see that in just kind of various different aspects of today's
presentation when you read the final piece.
So let's kick it off with the highlight from
this year's analysis, which is the really
interesting fact that over one third
of organizations today, specifically 37% of organizations,
need to right size their containers to improve efficiency.
And when we dig into some of those details, we'll notice that
actually this 37% of organizations have 50%
or more of their containers over-provisioned.
And it's an interesting finding, because what we notice
is that a lot of times developers have to guess their resource requests
and limits when they go to deploy, because
they don't really have the tooling or the feedback loops to tell them what those
resource requests and limits should be. And a lot of times developers
will guess high. They will over-provision by giving
their application too much memory or too much compute. And when left
unchecked or unmonitored, organizations end up
incurring lots of additional compute spend as a
result. And so this is the first year
that we actually started looking at this data. And again, you'll see
that the 37% of organizations that need to right size
50% or more of their containers really represents the
bottom part of this chart.
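To make that a little more concrete, right sizing really just means setting each container's resource requests and limits based on observed usage rather than a guess. Here's a minimal sketch of what that looks like in a container spec; the app name, image, and numbers are placeholders, not recommendations:

```yaml
# Container spec fragment for a hypothetical workload
containers:
- name: my-app
  image: registry.example.com/my-app:1.2.3   # placeholder image
  resources:
    requests:
      cpu: 250m        # sized from observed usage rather than a high guess
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
```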
But it's also interesting to see that there's actually a large cohort of organizations,
in this case 57% of organizations, that have
less than or equal to 10% of their workloads impacted. So some organizations
do seem to get this right. And what we're really excited
to do is monitor this progress over
the years. So this being our first year measuring this, we want to see
sort of how well has this trend
improved or not improved going into next year as well. So again,
first year we're kind of baselining this, but next year should help us understand where
that trend is going. Another aspect of Kubernetes efficiency that's important to solve
is making sure that both memory and CPU
requests are set on deployments before they're actually deployed
to Kubernetes. Now, Kubernetes technically makes these settings
optional, but when you don't set memory or CPU
requests, it can actually make it difficult for Kubernetes
to properly schedule that workload. And so what we're
seeing is that this is becoming more and more of a systemic
issue. This year, we're noticing that 78% of organizations
have at least 10% of their workloads missing CPU requests,
and this is up from about 50% last year.
So again, I think the numbers kind of are all over the map here.
You'll see that pretty much every organization
has some amount of this problem. But interestingly enough,
this is actually a fairly easy thing to solve with
guardrails and policy enforcement mechanisms for kubernetes,
where you can kind of give developers feedback at the time of their pull
request or at the time of their deployment when they are missing these settings,
and use that as an opportunity to educate developers as well.
So we think that while a lot of organizations may be missing
CPU requests, we also think this could be a problem that's relatively easy to solve
with policy enforcement and guardrail tools as well.
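As one example of the kind of guardrail that can backstop this, Kubernetes itself has a LimitRange object that fills in default requests and limits for any container in a namespace that omits them. This is just a minimal sketch with placeholder values, and it complements rather than replaces feedback at pull request or admission time:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-resources
  namespace: team-a            # placeholder namespace
spec:
  limits:
  - type: Container
    defaultRequest:            # applied when a container omits requests
      cpu: 100m
      memory: 128Mi
    default:                   # applied when a container omits limits
      cpu: 500m
      memory: 256Mi
```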
Let's shift a little bit to reliability and kind of look
at a few different trends that we've observed in this year's report.
So just at a very high level, about a quarter of organizations today are relying
on a cached version for 90% of their images.
And what does that mean? Well, it really means that the pull policy is not
set to Always, and setting it to Always is a general best practice. You kind
of want to make sure that your containers are pulling
the latest image so that you don't have inconsistency.
And it also can help you from a security perspective, to make sure
that, when you do push an image with updates,
it's actually pulling in that latest image, not just the cached version.
So this is just a general best practice that we're
seeing. Right now, about 24% of organizations
are relying on this pull policy not being set
to Always for 90% of their images.
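In the pod spec, that pull policy is a single field on the container; here's a minimal sketch with a placeholder image:

```yaml
# Container spec fragment
containers:
- name: web
  image: registry.example.com/web:1.4.0   # placeholder image
  imagePullPolicy: Always    # pull the latest image instead of reusing a cached one
```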
Another pattern that we're seeing is that container health checks seem to be missing or
ignored in some deployments as well. And so right now
about 66% of liveness and 69% of
readiness probes are missing in
Kubernetes deployments. And it's important to set these
because it helps Kubernetes automatically restart containers
and ensure that the applications are available to receive traffic
and then ultimately serve users. So this is actually considered
one of the more basic ways to ensure application reliability
in Kubernetes, and we're still seeing organizations struggle to various degrees
here. I think part of it is because the
configuration does require a little bit of application-specific input.
So development teams need to consider what
changes they need to make to their application in order to make sure that
the health checks work for Kubernetes. So we're hoping that
this trend improves over the years as well.
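For an HTTP service, the probes are a small addition to the container spec. This is a rough sketch that assumes the application exposes hypothetical /healthz and /ready endpoints on port 8080:

```yaml
# Container spec fragment; the endpoints and port are assumptions
containers:
- name: api
  image: registry.example.com/api:2.0.1   # placeholder image
  ports:
  - containerPort: 8080
  livenessProbe:          # lets Kubernetes restart the container if it stops responding
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 10
    periodSeconds: 10
  readinessProbe:         # keeps traffic away until the app is ready to serve
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
```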
Another trend we identified is that deployments are missing replicas. Another
general best practice is to make sure that there are at least a couple
of replicas available for pods. Right
now we're noticing that 30% of organizations actually have less than 10% of their
deployments missing replicas. So this is an
improvement over 2023, but still kind
of highlights that, if you look at the
graphic here, some organizations have much more than just 10%
impacted. There might be lots
of applications missing replicas. And I think sometimes this is because
of what we call the copy and paste problem. Sometimes a deployment from one
team that's missing this best practice gets copied by another
team who's looking to get their application deployed. And so they may
be propagating these misconfigurations: replicas
weren't set by the previous team, and now the new team is using
that same configuration without replicas. And so you can see that this becomes
sort of a wider problem. Again, this is usually a very quick
fix, like a one line change to your infrastructure as
code. And we're hoping to see, even though
we're on an improved path here, that even more and more organizations
have fewer and fewer deployments missing these replicas.
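That one-line change is just the replicas field on the Deployment; here's a minimal sketch with placeholder names:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2              # the one-line change: more than a single replica
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.2.3   # placeholder image
```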
Shifting gears a little bit, we'll also take a look at security. Now,
security in Kubernetes kind of can mean a lot of different things.
We look at security from two lenses in this report.
One is image vulnerabilities, and the other is
the configuration itself, so the YAML or
the Helm chart that's being deployed. At a very
high level, we're noticing that about 28% of organizations are running about
90% of their workloads with insecure capabilities. So that means
that they're adding some sort of insecure capability like NET_ADMIN.
And a lot of times it actually might be necessary
for some applications or workloads to have these additional capabilities,
but sometimes it may not be and it could be accidentally
added to apps going back to that original copy
and paste problem where one team copies the configuration from another team
as a starting point and inadvertently
propagates some of these misconfigurations going forward.
So we always look to make sure that applications start without
these dangerous or insecure capabilities added,
and that helps ensure kind of a good baseline from a
security perspective.
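In the container's securityContext, that baseline usually means dropping all capabilities by default and only adding one back when a specific workload genuinely needs it; a minimal sketch:

```yaml
# Container spec fragment
containers:
- name: worker
  image: registry.example.com/worker:3.1.0   # placeholder image
  securityContext:
    capabilities:
      drop: ["ALL"]          # start with no extra capabilities
      # add: ["NET_ADMIN"]   # only if this particular workload truly needs it
```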
One positive trend coming out of this year's report, though, is that we're actually seeing fewer containers
set to run as root. So 30% of organizations
today are running 70% or more of their
containers as root, which is actually a drop from 44%
in last year's report. And part of me thinks that
this is another example of a low-hanging opportunity
to make a quick win,
a quick fix to containers by essentially turning
off the ability to run as root, which again is a one-line
change. And I think this type of misconfiguration
also comes up a lot
when talking about the issues of misconfigurations
in Kubernetes. A lot of organizations point to running as
root as a common example of that.
So it's great to see that this trend is going
in the right direction, in that fewer and fewer organizations have
a vast majority of their containers running as root, which is awesome.
And I think it's important to note that
running a container as root just overall increases the risk of a malicious user
taking advantage of that root privilege as part of a larger
attack. So from a defense in depth perspective,
you want to have your container not run as root by default, unless it absolutely needs
to because of some special need or use case for
that app. So again, this is going in the right direction, and we hope it
stays that way going forward as well.
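That one-line change is typically runAsNonRoot in the securityContext, assuming the image already defines a non-root user; a minimal sketch:

```yaml
# Container spec fragment; assumes the image runs as a non-root user
containers:
- name: web
  image: registry.example.com/web:1.4.0   # placeholder image
  securityContext:
    runAsNonRoot: true     # the one-line change: refuse to start as root
    # runAsUser: 10001     # optionally pin a specific non-root UID
```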
Switching gears a little bit away from misconfigurations, we'll talk about image
vulnerabilities. And so these are the image vulnerabilities
that may exist in running containers, or that you find by scanning container
images as part of your CI/CD process or your shift-left process.
And I think this is an ongoing challenge for many organizations.
It's an ongoing problem. But we do see some signs of progress in this year's
report. So if we actually dig into
the first section where we show the percentage of workloads impacted,
26% of organizations have less than 10% of their
workloads affected, which is an improvement from 12%
in 2023. So we're seeing essentially a greater percentage of organizations
with fewer workloads impacted due to image vulnerabilities.
And I think that is a signal of both kind of organizations upgrading
their third party containers to newer, less vulnerable versions, but also
integrating and scanning more of their containers so that they have
a process in place for this. In the report,
you're also going to see a section where we talked about scanned images.
So Fairwinds is able to kind of help companies
identify if there are images running in their cluster that they have not scanned.
And this has greatly improved over the year.
We're actually seeing almost 84% of organizations getting
almost complete scan coverage of containers in their runtime. That's up
from 64% last year. So I think that's a
great sign that organizations are doing the first step, which is
scanning as many of their images as possible so that they understand
their risk, and then taking remediation steps after.
You know, we hope that next year we even see a higher percentage of organizations
with fewer workloads affected. One of the enhancements that
we made to Fairwinds Insights last year was that we added some specific
checks related to the NSA hardening
guide. So the NSA actually released Kubernetes hardening
guidance, I think, back in 2021, and there
were a number of great recommendations there, and we actually expanded the number of checks
that Fairwinds Insights offers to match the recommendations
in the NSA hardening guide. So a lot of new security
checks made their way into the Fairwinds Insights platform this year.
One of those checks is actually verifying if there's a network policy configured
for workloads. And network policies are
increasingly important because they help you segment workload traffic and
ensure that you've got controls around which pods can speak to which
pods. And so we wanted to get a sense of how the industry
is doing on this particular policy.
And so I think we see kind of two types
of organizations. 37%, or about a third of
organizations today, have less than 10% of their workloads without a network
policy. And that's actually a great sign that there's a lot
of network policy adoption happening in some organizations where
they're making sure that their workloads have a network policy
set. But on the other hand, there's still a majority of
organizations that have way more than 50% of their workloads without
a network policy. So it means that they're deploying to Kubernetes
and the workload is running fine, but that workload can
speak to any other workload in the cluster. And so I think it shows that
the industry still has a little bit of a ways to go to make sure that
network policy adoption becomes even more widespread.
And so just to give a little bit of an example of why
we think this is important, network policies help you
limit that egress and ingress traffic. And so when
you have that ability to control the traffic, it allows you,
again from a defense in depth perspective, to prevent any undesirable access
to those pods.
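As a rough sketch of what that looks like, here's a minimal NetworkPolicy that only allows ingress to an API pod from a frontend pod; the namespace, labels, and port are all placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: demo              # placeholder namespace
spec:
  podSelector:
    matchLabels:
      app: api                 # the pods being protected
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend        # only these pods may connect
    ports:
    - protocol: TCP
      port: 8080
```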
So those are some of the summaries and the highlights from the report. Again, I think we're only
covering about a quarter of the information that the report has this year,
but I also wanted to help organizations understand what a path forward looks like.
Like, if you're running lots of Kubernetes today, how do you ensure that
your teams are following reliability, security, and cost efficiency
best practices? And I think that's really where Fairwinds Insights can
provide a lot of value. It can provide guardrails to help you solve your business
problems. Whether it's ensuring that your images
are free of vulnerabilities or that your
workloads are aligned to standards like the NSA hardening guide, or aligned
to standards like SOC 2 or ISO 27001,
there's a big security reason to provide developers with guardrails
and feedback around their configuration hygiene.
I think increasingly in 2023 we did notice that a lot of organizations
were very cost conscious. So they wanted to make sure that they had
a way to measure their container usage, but also right size
containers to make sure they're using the
correct memory and CPU and they're not overspending
in ways that incur additional cost or just waste compute resources.
So Fairwinds does provide both Kubernetes cost allocation
as well as container right sizing recommendations. And that's
helped organizations in some cases save over 25% on their
container costs. And then finally, this notion of guardrails is
sort of core to everything that we do. So in order to make
sure that engineers have the tools to take action on this feedback,
you want to be able to provide guardrails at different steps in the process,
whether it's at time of pull request, when they're making their infrastructure as code changes,
or at the time of deployment, also known as the time of admission,
when applications are being deployed into the Kubernetes environment.
You want to give that feedback to developers and have both a way
for them to remediate things easily, but also ensure consistency, so
that you're not introducing risk or over-provisioned applications along
the way. And these are kind of the core capabilities that Fairwinds Insights provides
and how our customers are getting value. So I do encourage you
to kind of take a look at the Kubernetes configuration benchmark
report for this year. Like I said, we only really covered about a quarter of
what's in that report, and there's a lot more broken out by security,
cost and reliability, so you can kind of see the different patterns.
So if you're interested, I'd recommend reaching
out to me on LinkedIn. I'm happy to point you in the right direction,
and I think that's really kind of what we're hoping to cover
today. So thanks again for the time and looking forward to hearing
your thoughts out there in the community.