Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello there, and thank you so much for joining me
here at Conf42 for "Integrating Cloud Native Security
into the SRE Culture." It's really great to be here.
I hope you're enjoying the conference so far. Now,
my name is Anais Urlichs. I'm the open source developer advocate at Aqua
Security, and in this talk I want to
speak about the overlap between site reliability engineering and
cloud native security. How can both benefit from each
other? How can we best integrate cloud native security tools and best practices
into our site reliability engineering? Now,
last year I was actually working as an SRE in between positions
as a developer advocate. I have also been a
CNCF ambassador since 2021, and
I have a YouTube channel that you can see here where I talk about cloud
native tools and trends, with tutorials on how to set them up and how to
use them most effectively. I also have a weekly DevOps newsletter where I
share amazing content from across the space with the
community. So if you are curious, do check those resources
out. You can find all of the links on my Twitter.
I also have a puppy. She's just five months old
and she might make a little bit of noise in the background, so I apologize
for that. She's making up for it by being super adorable,
though. Now, last year when I was working
as an SRE, the SRE team at
the startup was just about setting up all of the practices
and the SRE culture within the company.
And this is a slide taken from the KubeCon talk
that a manager of mine and I gave in
2021 at KubeCon North America.
We basically talked about the different Kubernetes operators
that we were using across our infrastructure.
The infrastructure was very dynamic and very
complex, let's say, because we had
several superclusters in different regions across the world:
Frankfurt, New York, London, you name it.
What we had there were compute
racks, and those compute racks were our supercluster.
In those compute racks we had compute nodes,
and you can see them depicted here in these drawings. These
are all compute nodes, and the details won't
matter here. But basically tenant clusters, the customer clusters,
would then be scheduled on those compute nodes. So we
would have clusters within a large supercluster.
And as you can imagine, you need very advanced
observability tools to get the necessary
insights to understand what is going on where. For example,
if a tenant cluster is stuck in some state and
you need to repair it manually, you kind of need to kick it, you need
to know how to identify those tenant clusters in the
easiest and fastest way possible. The thing
is, at the time we were talking a lot about our operator design and
about our observability tools, but we weren't really talking about cloud
native security. So before we go into cloud native
security, I want to talk a little bit more about the SRE culture.
I mentioned that we were really focusing on establishing an SRE culture,
and that really focused
on these different areas. The first one is continuous improvement.
You don't want to keep the state of your services the same, even though
things might be working. You want to continuously improve your setup
and the tooling that you have in place to gain insights into your services.
And that's also related to embracing risk.
Of course you want to keep the risk profile low and not deploy
something that might accidentally bring down all your infrastructure.
But ultimately it's a balance between both, because without
embracing risk and taking new steps in advancing
your tooling, you can't really improve the tooling itself
and it will slowly deteriorate. Then the other thing is to analyze
learnings: analyze your failures and
learn from them. We had lots of incidents
in the early days, varying in all kinds
of severity and degree. We were using tools at
a scale that they hadn't been used at before,
so we encountered some real edge cases.
A lot of times we had to sit down with the other companies,
with the other projects, and really analyze what had happened, so that
both our company and the projects could learn
from that. And the last thing is autonomy. It was my
first SRE role. I had experience in the cloud native
space, but I didn't have experience working with production environments.
Still, I received a lot of autonomy, and I think it was really beneficial to
have that trust and focus within the team. So that's ultimately
what the SRE culture is about to me. I know
it's different across different teams. You will have different implementations and so on,
but this is kind of what you can think of when you think about the
SRE culture. What is DevSecOps? Usually I
ask people what DevSecOps is, but then people are really
shy, and really, this is a conference about DevSecOps,
so I think that everybody has kind of an idea of what
DevSecOps is. So just think to yourself, okay, this is what I think about
when I hear DevSecOps. Some people might think about
buzzwords, some people might have specific terminology in mind.
Now, I think about integrating security into all of our business functions by
empowering people and creating accountability.
I carefully picked every
word here. We want to incorporate security into every
business function, whether that's administration or engineering, because ultimately,
if everybody's empowered to take ownership of the part
they are working with, then you can cover all
areas within the business. So it's really about empowering
people to take that ownership, to know what they are supposed to do,
how they can do it, how they can ask for help, and so on.
And then, whether things go well or
things go badly, you can create accountability and
have more productive conversations, right? It's not about finger pointing,
it's about having more productive outcomes in the end. So that's
what DevSecOps is all about to me: to really
make things happen across the business through
shared ownership. So, next
thing: if you're taking anything away from this talk,
it should be that SRE practices and security practices
have a really tight overlap. Ultimately, when we define what healthy services
look like, we should also define what secure services look like,
because only secure services are
healthy services. So that's basically what this
talk boils down to.
When I moved from my work as an
SRE back into developer advocacy for open source
tools at Aqua Security, I realized that there's such a strong overlap
between both that it just doesn't make sense to completely decouple
them. I know in many businesses you will have a separate security team.
That's a great thing, right? But at the same time, we should
also see how different areas can
benefit from each other, and how we can make something like integrating
security as easy as possible. So the idea is:
if you have an SRE team, if you have people focused on
observability, start with those.
So here are some additional goals that you might have within
your SRE team that are also security goals,
or that are tightly coupled, let's say. When we
talk about how we can scale our services, we also have to talk
about how we can keep those services secure over time as they become
more complex. As we scale our services and our replicas up and down based
on demand,
we also have to talk about how we can keep those secure.
The next thing is visibility. With your observability
tools, you obviously want to gain visibility and insights into what's deployed
where, how it is deployed, who deployed it, when it was deployed,
how it is interacting with other services, whether it is maybe causing failures
in other services, and so on. Those are all questions and topics related
to visibility. Now, when you are getting started with cloud
native security, you want to focus on security scanning,
you want to focus on getting more insights into the
security posture of your services. All of that also contributes to
gaining more visibility into those different areas.
The next thing is reducing noise. There's something I'm going to talk
about in a little bit more detail, which is called vulnerability fatigue,
and which basically means that you're bombarded with
security issues and you can't keep up with fixing them or taking
care of all of them. So within your
cloud native security, you want to focus on the most
productive and the most efficient information,
the information that you can take actionable steps from.
Similarly, within your SRE team, you might have thousands and
thousands of logs that you obviously can't filter through manually.
So similar to that, you want to have processes and
workflows, but also tools in place that help you to reduce all that
noise. The next thing is automation. Automation is great for
different aspects. It obviously makes our lives easier.
But I'm going to talk a bit about the downsides of automation and what we
have to be careful about when we do automation for
SRE work, for observability tools, but also for cloud native security.
The last thing is what I already mentioned: ownership.
Communication is key for both areas.
So here are some of the more practical items that
a lot of SRE teams do and that we can also
adapt when getting started with
cloud native security. The first one is investing in runbooks and documentation.
When we define how to respond to different types of incidents,
when to escalate an incident, and what steps to take during an incident,
we can do the same thing for any security
issues that we might have within our tooling. So we
could, for example, define: okay, if there's a critical vulnerability, what steps
have to happen, who has to take those steps, and so on. Then
the other items are really also something that can be
adapted for both teams, if you have different teams, if you
have security teams or people focused on security versus people
focused on site reliability engineering. Or you can integrate
one into the other. So here are some of
the tools that we used in that startup that I mentioned
where I was working as an SRE. The observability tools are
really like your standard stack, I would say, with Grafana, Prometheus, and Jaeger.
We tried to install Tempo. We used Grafana Loki for
logs. For management, we mainly used Helm and Terraform.
It was very much Helm and Terraform focused, and
then we used GitLab CI/CD pipelines.
We talked a lot about these different tools and the different integrations
and installation of those different tools.
However, we didn't talk about security tools.
That's something we didn't really talk about.
I mean, we were following security best practices, right? Don't think we
were not. But at some point we had an intern,
a university student, who was helping us
implement tools such as kube-bench
from Aqua Security as well. Now,
just quickly mentioning: every tool that I showcase
here from Aqua is one of Aqua's open source
tools. I am not promoting any enterprise tools in
this talk. So you
don't have to sign up. It's all free to use on GitHub.
You're not sending us any data, or anything similar.
Since there is so little conversation about how we
can actually get started with cloud native security, for example
in your SRE team,
I've thought about different steps that you can take.
It's one approach, right? There are different approaches.
This might be one approach. As our main
security tool and security scanner,
we're going to focus on Trivy. Trivy is an all-in-one security
scanner; all-in-one because it can scan all of
those different scan targets. It also has SBOM
functionality and cloud
provider account scanning, starting with AWS. It can also do
in-cluster scanning of running workloads. So it's a very, very versatile
tool that's focused on different users and different workflows.
So step one in our ten-step journey
is understanding your need. That's really important because if you have no
idea what you're actually aiming for, then you don't know what to look out for,
right? Before we can
define our need, we have to be aware of the factors
influencing it, our goals, and what
we actually need to accomplish. The first
one is the size of our team. If you are working as an individual
contributor, the needs for the different tools
and the way that you need to integrate security tooling
and practices will be different than if you're working within a large-scale team.
The next thing is the industry you're working in.
Is it a highly regulated industry that requires you to choose
specific tools and work with a specific
stack? Or are you working for a startup where you just make things work
in the best way possible with the tools available? Then there's the
type of technologies you're working with, which is also related to the
integrations that are available. If you need to have a custom setup with your
custom on-premise infrastructure,
your need will be quite different from that of somebody who's managing
an open source project, for example, or managing,
I don't know, a small retail website.
Then there are the company goals and leadership.
A lot of times security, whether it's acquiring the skills or
the tools, is related to having budgets and expertise,
right? It's usually something that people keep
as the last thing to take care of, which is obviously
an issue. But yeah, it's one of the factors that you want to
take into account. When you want to get started with
integrating cloud native security, it doesn't mean you need to have a
budget and expertise already available within your team.
It just means that that is one of the factors that can influence
which tools you're using in the end. Now, tools will
differ in different ways. That's also something you want to keep in mind. The first
one is the installation. Different tools are installed differently. A lot of the
cloud native security scanners are used as CLI tools,
so you use them either in your local terminal or in your CI/CD
pipeline. Other tools come as Kubernetes operators
and other Kubernetes resources and can be installed within your cluster.
Now, you want to be careful about the tools that do something within
your cluster, because security scanners will need lots
and lots of privileges within your cluster to perform proper security
scanning. So whenever you are signing up for a
tool and you give it access to your cluster,
you want to be mindful of what it is actually doing within your cluster
and who's getting the data from those scans. Compare that with
installing, for example, an open source Kubernetes operator within your
cluster that performs the scans just within your cluster, where the reports
and resources of the scans are only available within the cluster. Then you know
it's really contained within your existing environment.
The next thing is scan coverage. We get lots of questions
about Trivy, in the project issues and so on,
where people are asking, why does this scan from Trivy differ from
that scan from another tool? Basically, Trivy
has a Trivy database, which is a separate project under the Aqua open source
umbrella, and it's pulling from different data sources,
for example, lists of vulnerabilities. The next way
in which tools differ quite significantly is the
number of integrations and the type of integrations available.
Especially if you're going with an open source security scanner,
you want to be mindful of the integrations that are available.
More mature scanners will usually have more integrations
available. The last thing is the focus.
Different tools are focused on different people, different types of audiences. Some might
be focused on security professionals, others are focused on engineers.
So here is an example of need-driven
development from the Wise engineering
blog. They basically detailed how they changed their security
scanning to gain better insight into the security posture of their
services. And here are the four goals that they wanted to accomplish with
that change. The first one is to assign ownership of vulnerabilities.
They wanted to have different people within the team
take ownership of different vulnerabilities, so that
it's actually somebody's job to take care of and fix that vulnerability.
The next thing is that they wanted to have a global view of the security state
of their services. That's very important. A
global view is not what helps you analyze
and fix a specific service, but only
if you have a global view can you see how
wider changes, for example in your workflows
or from adopting other, external tools, have an impact
on your overall security posture.
Then they wanted to develop dashboards for different users and requirements,
and that's more related to breaking down the security issues
related to specific services. And they wanted to overcome difficult-to-use
and differing UIs. A lot of times in the cloud native ecosystem, whenever you're
using a new tool, you're adopting a new workflow
and you're adopting a new UI, interface, and framework.
It takes time to get used to them and to learn
your way around them, and you will always have to do something separate
from what you have already been doing. So they wanted to integrate
their security tools into their existing workflows, to have
just this one thing
to go to. Then step two. Once we
know what we actually want to do,
what we want to achieve, how different tools differ,
and what factors we have to keep in mind, we want to choose
a cloud native security scanner. Here
is a list of different cloud native open source security scanners
in the space. They are focused on different types of scanning. For example,
some are just focused on vulnerability scanning, others are focused on infrastructure as
code misconfiguration scanning, and others do compliance scans.
Compliance scans, for example, would more likely
be used by security professionals, whereas in-cluster
scans might also be used by cluster admins.
As you can see, Trivy is really across those different areas,
since it's an all-in-one security scanner. It does lots
of different things, but if you just need vulnerability scanning, you might want to
consider, for example, another tool that focuses on vulnerability scanning.
And here's the list. Now, once we have
looked at the different scanners, in our case we're going with Trivy
because I'm familiar with Trivy. We want to set it up
and make sure everything is running properly. Sometimes you
might go with one scanner, and then you set it
up and you play around with it and you realize it's not the right tool,
either because the workflow is not intuitive for you or
something is just not working. It's completely fine to go back to step two
and be like, okay, we actually want to use a different scanner now.
In our case we're using Trivy, and now we want to make sure it's working
properly. So the first thing is to identify the best installation option.
Trivy also comes with different installation options. I usually go with the Helm
installation inside of my cluster, in addition to having automated
CI/CD pipeline scanning. Then you want to decide
upon the configuration. For example, if you're
using Trivy in combination with observability tools such as Prometheus,
you have to configure some parts slightly differently.
You then want to test those custom configurations and
ensure that it's working properly with all the tools that it's supposed to
work with. So for example, if you have some niche
cases where Trivy is supposed to perform,
I don't know, a thousand vulnerability
scans of different containers
on a regular basis, something like that, some real edge
case, you want to test it out in a small
environment first. And that's with every tool, right? You want to
test out your specific edge case in a small environment before
you implement it in a large-scale environment.
Now here is a very simplified overview of a Kubernetes
cluster and how it might look once you've installed Trivy.
The first thing is you have maybe an application namespace
with all your application-related resources. Then you have a monitoring namespace with
your Prometheus, Grafana, and other observability tools, and
then you have your trivy-system namespace with the Trivy operator. Now, the Trivy
operator is the part of Trivy that does continuous in-cluster
scanning of your running workloads.
In addition to that, you could then also use Trivy, the CLI
tool, in your CI/CD pipeline or on your developer
machines. The beautiful thing is, if everything is a Kubernetes resource,
you can then use the same processes across your stack. So for
example, if everything is a Helm chart,
Prometheus and Grafana as Helm charts, the Trivy operator as a Helm chart, you can
deploy and manage those applications through the same processes,
which is really nice, really handy. So here's what you
will then see inside of your trivy-system namespace.
Alongside the Trivy
operator, you will then also have several Kubernetes
custom resource definitions deployed, CRDs,
and they basically extend the Kubernetes API to
allow for custom security scans.
So here we have the metrics of our different security scans.
Trivy does vulnerability scans of any container image it
finds inside of your cluster. It does exposed secret scans:
are there any exposed secrets within your cluster? Then,
is there any misconfigured RBAC,
any role-based access control that should be changed,
maybe? And then it also does config audit scans.
Now, the thing is, things might change dynamically
inside of your cluster, even if they shouldn't. People might change things
around manually, they might try out things, they might deploy containers
to debug things. I don't know what your company or team does,
right? But Trivy will then identify any misconfigurations that
are present within your cluster in those newly set up resources, and it
can alert you on those. Now, these
are just the metrics from
the security scans, from the security reports. The security reports
themselves are just other Kubernetes resources. They are YAML manifests,
and you can read them like YAML manifests. And because the security
reports are YAML manifests and Kubernetes
resources, you can export them. For example, you can get the metrics out
and then you can integrate them into your observability stack.
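
To make the idea of reading a report like a manifest a bit more concrete, here is a minimal Python sketch. The dictionary is a hand-written stand-in for a parsed vulnerability report, not real scanner output. Its field names (`report`, `vulnerabilities`, `severity`, `fixedVersion`) and the placeholder IDs are assumptions for illustration and should be checked against the reports in your own cluster.

```python
from collections import Counter

# Hypothetical stand-in for a parsed security report manifest.
# Field names and IDs are assumptions for illustration, not a guaranteed schema.
sample_report = {
    "kind": "VulnerabilityReport",
    "metadata": {"name": "demo-app", "namespace": "app"},
    "report": {
        "vulnerabilities": [
            {"vulnerabilityID": "CVE-0000-0001", "severity": "CRITICAL", "fixedVersion": "1.2.3"},
            {"vulnerabilityID": "CVE-0000-0002", "severity": "HIGH", "fixedVersion": ""},
            {"vulnerabilityID": "CVE-0000-0003", "severity": "HIGH", "fixedVersion": "2.0.1"},
            {"vulnerabilityID": "CVE-0000-0004", "severity": "LOW", "fixedVersion": ""},
        ]
    },
}


def count_by_severity(report: dict) -> dict:
    """Count findings per severity, the kind of number a dashboard panel shows."""
    findings = report.get("report", {}).get("vulnerabilities", [])
    return dict(Counter(item["severity"] for item in findings))


severity_counts = count_by_severity(sample_report)
print(severity_counts)
```

Counts like these are what you would typically expose as metrics, so that the observability stack can chart them per namespace or per workload.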
That's the next step: setting up a dashboard.
So we have Grafana and Prometheus installed, we have our security
tools installed. It's time to set up a nice dashboard.
This is a dashboard created by the community where we have
a summary of our different vulnerabilities, and
they are broken down by severity. So in total we
have 175 vulnerabilities in our cluster.
You can also see all of the other metrics
directly through a dashboard in Grafana as well.
Basically, breaking out those
vulnerabilities into different categories makes it easier to
identify the different types of vulnerabilities that
you have. Now, the next thing, which you might
already be thinking about, is: how do you avoid vulnerability hell? Because if
you have 175 different vulnerabilities, how do
you go about addressing them, how do you go about managing them? Those are,
I'm not swearing, a lot of vulnerabilities
here, right? We can't manage them all
at once. Here's a screenshot from Alex Jones
on Twitter saying, "I just give up
and die." No, that's a difficult sentence.
Anyway, he scanned
a resource, I don't know what type of resource he actually scanned, but he scanned
it with Snyk and found
over 550 different vulnerabilities. They are broken
down into critical, high, medium, and low as well. But still, that's
lots of vulnerabilities. You can't look at 550 vulnerabilities
or similar, right? It doesn't work. So here are some practical
steps that you can take. The first one is to ignore all but critical vulnerabilities.
He only has three critical vulnerabilities, and that's easy to address.
Just take care of the critical vulnerabilities first, and then
look at the rest in a more productive way. Don't scan
everything at once. I don't know if he scanned just one resource or if he
scanned multiple resources, but there's really no need to scan everything at
once. Just scan the most critical workloads first.
Filter by vulnerabilities with known fixes. Trivy easily allows
you, with an additional flag, to specify that you only want to see
vulnerabilities that already have a fix available. So you could
go ahead and do that, and just look
at those vulnerabilities first. Then filter vulnerabilities
by team and by application. Really make them team and application specific. Give them
context. Give them meaning, so that they are not just a line of text about
something that's wrong with some resource, right? That's ultimately what
you don't want to have. And that's also related to the needs in the Wise engineering blog
post, right? So, next
thing, step six: what are metrics without alerts?
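
As a rough sketch of the triage steps just listed (critical first, fixable only, grouped by team), here is a small Python example. The findings and their fields (`fix_available`, `team`) are made up for illustration. On the Trivy CLI, the `--severity` and `--ignore-unfixed` flags cover the severity and known-fix filters; the grouping by team is an assumption about how you might label your own workloads.

```python
from collections import defaultdict

# Most severe first, so a slice of this list means "this severity and above".
SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]

# Made-up findings; the "fix_available" and "team" fields are hypothetical.
findings = [
    {"id": "VULN-1", "severity": "CRITICAL", "fix_available": True, "team": "payments"},
    {"id": "VULN-2", "severity": "HIGH", "fix_available": False, "team": "payments"},
    {"id": "VULN-3", "severity": "CRITICAL", "fix_available": False, "team": "platform"},
    {"id": "VULN-4", "severity": "LOW", "fix_available": True, "team": "platform"},
    {"id": "VULN-5", "severity": "HIGH", "fix_available": True, "team": "platform"},
]


def triage(items, min_severity="CRITICAL", fixable_only=False):
    """Keep findings at or above min_severity, optionally only fixable ones,
    and group the remaining IDs by owning team."""
    allowed = set(SEVERITY_ORDER[: SEVERITY_ORDER.index(min_severity) + 1])
    by_team = defaultdict(list)
    for finding in items:
        if finding["severity"] not in allowed:
            continue
        if fixable_only and not finding["fix_available"]:
            continue
        by_team[finding["team"]].append(finding["id"])
    return dict(by_team)


critical_first = triage(findings)
fixable_high = triage(findings, min_severity="HIGH", fixable_only=True)
print(critical_first)
print(fixable_high)
```

The point of the grouping is the ownership one: each team gets a short list it can act on instead of one long list that nobody owns.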
The thing is, and this is related to what I said earlier about automation,
I want to talk a little bit more about automation after I take a
sip of coffee.
Sorry, my throat is still a bit messed up from a cold.
So basically,
you need to define your deployment resources to deploy
your application, right? That's a necessity. Otherwise your application is not deployed,
it's not working, customers can't access it, customers are unhappy,
right? You don't want that.
So you obviously need to define those deployment resources. But the same doesn't
hold true for security, right? You don't need to define
how you scan your resources, you don't need to define scan
coverage, I don't know, all those things related to security. You don't
need to do them to deploy your application, to have it working, to make customers
happy. Customers are only unhappy when things go wrong in the security
world, right? Like when their data is ultimately exposed.
So it's not a necessity for engineers, for anyone
operating an application or operating a business, to actually take care of the
security of it. I mean, for most of the services that you use online, you probably
don't know what kind of critical vulnerabilities are within them, and you
shouldn't have to care about that. That's something for the business to care
about. But that's exactly why you want to set
up alerts and make your vulnerabilities scream at you,
right? Give them a voice,
set them up in such a way that you cannot ignore them.
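
One way to picture "vulnerabilities that scream at you" is a threshold check on the severity counts. This Python sketch is only an illustration of the logic; the thresholds are hypothetical, and in practice you would more likely express this as an alerting rule in your existing observability stack rather than as a script.

```python
# Hypothetical thresholds: alert on any critical finding, tolerate up to five high ones.
THRESHOLDS = {"CRITICAL": 0, "HIGH": 5}


def alerts_for(counts, thresholds=THRESHOLDS):
    """Return one message per severity whose count exceeds its threshold."""
    messages = []
    for severity, limit in thresholds.items():
        found = counts.get(severity, 0)
        if found > limit:
            messages.append(f"{severity}: {found} findings exceed the limit of {limit}")
    return messages


quiet = alerts_for({"HIGH": 3, "LOW": 40})
noisy = alerts_for({"CRITICAL": 2, "HIGH": 3})
print(quiet)
print(noisy)
```

Keeping the thresholds strict for critical findings and looser for everything else is one way to get alerts without adding to the noise.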
So once you do that, you can correlate
metrics. For example, if you have a new critical vulnerability here,
new vulnerabilities introduced,
we can then correlate the dashboard of our vulnerabilities
or of our misconfiguration issues that went up with our
deployment dashboards and see what happened
in our cluster. There's a new deployment,
there's a new replica set; okay, that caused
the number of vulnerabilities inside of the cluster to go up.
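
The correlation you would do by eye across two dashboards can be sketched in Python as follows. All timestamps, counts, and deployment names here are made up, and the one-hour window is an arbitrary assumption; the sketch only shows the idea of looking for deployments shortly before a rise in findings.

```python
from datetime import datetime, timedelta

# Made-up samples of the total vulnerability count over time.
vulnerability_counts = [
    (datetime(2023, 1, 1, 10, 0), 170),
    (datetime(2023, 1, 1, 11, 0), 170),
    (datetime(2023, 1, 1, 12, 0), 175),
]

# Made-up deployment events with hypothetical workload names.
deployments = [
    (datetime(2023, 1, 1, 9, 30), "frontend"),
    (datetime(2023, 1, 1, 11, 40), "debug-tools"),
]


def deployments_before_increases(counts, events, window=timedelta(hours=1)):
    """List (deployment, increase) pairs where a deployment happened within
    `window` before a sample at which the vulnerability count went up."""
    suspects = []
    for (_, prev_count), (sampled_at, count) in zip(counts, counts[1:]):
        if count <= prev_count:
            continue
        for deployed_at, name in events:
            if sampled_at - window <= deployed_at <= sampled_at:
                suspects.append((name, count - prev_count))
    return suspects


suspects = deployments_before_increases(vulnerability_counts, deployments)
print(suspects)
```

A match like this is only a hint about where to look first, not proof that the deployment introduced the findings.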
Step eight is some additional tips,
and some are iterating on the previous ones that I already mentioned. The first one is to
assign ownership: really make it somebody's responsibility.
Ideally, the person who's already managing that resource should look at its
vulnerabilities. Don't introduce many new tools
at once. That's something lots of people want to do when they get started with,
for example, cloud native security: implement it everywhere and everything
at once. That's complete overload,
and people are likely not going to be able to adapt to
those new processes. The next thing is to utilize existing workflows,
platforms, and processes. Utilize them as much as
possible, because that makes it easier for people to actually look at the security
reports. Step nine is to optimize based on what
works for your team. A lot of times we can follow
the initial setup, follow whatever another company said,
but ultimately every application will
be deployed differently depending on your environment. A lot of times when I get questions
about the Trivy operator specifically and its
deployment,
I cannot answer those questions before I get more information on
your setup, on your environment, on your needs, on all
those different parts that play a role, because
ultimately my answer will differ based on what
your setup looks like, what applications you're
already using, and so on. So there's really no one thing that
works for everybody. And step ten:
don't stop at security scanning. There are lots of different types of
security tools in the cloud native space.
For example, Tracee is a runtime security and forensics tool that
analyzes events on the node level.
So basically, while Trivy can scan for any
misconfigurations once they have happened inside of your cluster,
Tracee can detect if somebody uses a misconfiguration to do something they
shouldn't do. Those are the main differences. Here you can see just
a dashboard of its different logs. You would obviously want to
filter them more in different ways to then actually have
actionable steps from those logs, because over 2000 logs
is nothing you can really follow up on.
And here are some of the resources used: the
blog post from Wise engineering on their application security journey.
Then on the Aqua Open Source YouTube channel we have lots of different tutorials
to get started with. Here's the Trivy GitHub repository and the
Trivy operator repository. If you Google Trivy or Trivy operator, you should
find them as well. And here's a demo project that I've been using,
on GitHub as well. You can find us on
Slack if you have any questions about this presentation,
about anything I said, or about Trivy and other
projects within the Aqua ecosystem.
Now, I hope we have some time for questions. Otherwise, thank you so much
for attending my talk. I hope you have an amazing rest of your day, and
I hope to see you soon.