Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, thank you for joining my session,
"From Ghost Assets to Infrastructure Drift: Don't
Get Spooked."
We'll start with a short intro and continue with a reminder
of the concept of infrastructure drift.
We'll cover what ghost assets are and the top five
ghost assets according to more than 100 AWS
production accounts. We'll continue with what
you can do about ghost assets to prevent some nasty
stuff from happening to your cloud.
My name is Naor Paz. I currently work as director
of product management at Firefly, the cloud asset management
platform for DevOps, site reliability and
cloud engineers, driven by infrastructure as code.
A bit about myself: before joining Firefly,
I worked at F5 as a senior product manager in
application security, where I led the web application firewall
product. Before that I worked for
the Office of the Prime Minister of Israel as a security research and
analytics team lead, and before that I served at
Unit 8200 doing mainly network security
research and data analytics. So in the past decade,
I've gained a mixed background in DevOps, cloud, security,
data and networks.
So what do you need to know to gain some valuable
knowledge from this session? Well, not too much,
but you need to understand the following concepts.
Cloud: throughout this session I will mainly
show examples from the most significant public cloud to
date, Amazon Web Services or AWS,
but any cloud background is sufficient.
The session is just as relevant to Azure, Google Cloud
and other cloud service providers.
Infrastructure as code, or IaC:
this session focuses on examples from HashiCorp
Terraform, although it is relevant to Pulumi,
CloudFormation and other IaC technologies.
I recommend understanding state files, HCL,
which is HashiCorp Configuration Language, and some
terminology such as providers.
When I say cloud asset, I usually mean a Terraform resource when
it is a managed one.
Infrastructure drift: a short reminder. When we manage
cloud assets with Terraform or any other infrastructure
as code, we do that for multiple reasons:
speed, consistency, versioning,
documentation, and decreased risk.
The opposite of managing cloud assets with code
is doing it manually through the cloud console or user
interface, what we like to call ClickOps,
which stands for click operations.
Infrastructure drift starts when the infrastructure
drifts away from the code. When we want to
make just a small patch here and there,
we tell ourselves that we'll add it to the code later,
but we usually never do.
So we started with a fully managed codified infrastructure.
We used ClickOps to make just a small patch here and there.
And now our code no longer describes the
actual state of our cloud. When we run terraform plan,
for example, we get a list of all the changes
we've made manually and we're stuck.
We have to decide whether we want to revert everything to the
older codified state or adapt it somehow,
meaning changing the code to match our manual changes.
These drifts are all about changing existing assets,
which we created from code. This session
focuses on the very existence of those assets,
as you'll see right away.
So what are ghost assets? We have our codified
managed assets created with code using
the terraform apply command. Someone or
something completely removed one or more of
those assets with APIs or ClickOps,
meaning manually. They now cease to
exist in the cloud.
But because they were initially created
with code, they are still in the code itself
and in the state file, which refreshes
once you apply your code again.
These assets are what we call ghost
assets. They can hurt cloud reliability
by causing downtime or production issues,
create redundant costs, and generate security
issues.
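To make the definition concrete, here is a minimal sketch of the comparison behind it: an asset is a ghost when its ID is recorded in the state file but is missing from the cloud. The state layout below is a simplified excerpt in the style of Terraform's JSON state, and the function names, resource names and IDs are hypothetical. The set of live IDs would normally come from the cloud provider's API, which is outside this sketch, and this is not meant to describe how Firefly or any other tool implements detection.

```python
import json


def state_resource_ids(state: dict) -> dict:
    """Map 'type.name' addresses of managed resources to the IDs in the state."""
    ids = {}
    for resource in state.get("resources", []):
        if resource.get("mode") != "managed":
            continue  # data sources are read-only lookups, not managed assets
        address = f'{resource["type"]}.{resource["name"]}'
        for instance in resource.get("instances", []):
            asset_id = instance.get("attributes", {}).get("id")
            if asset_id is not None:
                ids.setdefault(address, []).append(asset_id)
    return ids


def find_ghosts(state: dict, live_ids: set) -> list:
    """Return (address, id) pairs that are in the state but not in the cloud."""
    ghosts = []
    for address, asset_ids in state_resource_ids(state).items():
        for asset_id in asset_ids:
            if asset_id not in live_ids:
                ghosts.append((address, asset_id))
    return sorted(ghosts)


# Hypothetical, heavily trimmed state file content.
EXAMPLE_STATE = json.loads("""
{
  "version": 4,
  "resources": [
    {"mode": "managed", "type": "aws_instance", "name": "web",
     "instances": [{"attributes": {"id": "i-0aaa"}}]},
    {"mode": "managed", "type": "aws_instance", "name": "scraper",
     "instances": [{"attributes": {"id": "i-0bbb"}}]},
    {"mode": "data", "type": "aws_ami", "name": "ubuntu",
     "instances": [{"attributes": {"id": "ami-0ccc"}}]}
  ]
}
""")

# Assumption: only the web instance still exists in the cloud account.
LIVE_IDS = {"i-0aaa"}

if __name__ == "__main__":
    for address, asset_id in find_ghosts(EXAMPLE_STATE, LIVE_IDS):
        print(f"ghost asset: {address} ({asset_id})")
```

With the example data, only the scraper instance is reported: it is still in the state but no longer in the set of live IDs, while the data source is ignored because it was never a managed asset.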
Anyway, there are two options. Either you
wanted to delete that asset, which you no longer need,
and what can happen is that it can be accidentally revived.
It is still in the code, so applying your code will just recreate
that asset. The other option
is that you did not want to delete it.
You still need it in your cloud deployment, but it was deleted
by mistake and now it is not available when
you need it. It can be a storage backup,
a compute asset such as a Lambda function that can cause
downtime, or a database replica, you name it.
What is most certainly true here is that you have a problem.
Firefly is a cloud asset management platform, and
our customers integrate their AWS accounts with it,
so we have plenty of data to look at.
By looking at only part of them, more than 100,
we get a fascinating sorted list of the top
five ghost asset types. EC2 instance, a
compute asset, essentially a virtual machine.
IAM policy, which is all about identity and
access permissions. SQS queue, which is
a managed message queuing service.
CloudWatch metric alarm, an alert configuration
on a cloud metric such as CPU utilization
for an EC2 instance or volume write bytes
for an EBS volume. And security group,
which is basically a virtual firewall.
So the EC2 instance is one of the most critical
assets. It is used by almost any AWS
customer for compute. Again, it is essentially a
virtual server you can do anything with. It was
also the second service AWS introduced, back in
August 2006, second only to the
S3 storage service. So what can happen
if an instance is accidentally deleted?
Well, in that situation we mainly
have reliability problems. Imagine we have a multiservice application
deployment. An EC2 instance can run its web server,
a backend service, a web data scraper,
and even a database. If this production
instance is suddenly gone, you will definitely
suffer from downtime. Services will not be accessible to customers
or any other type of consumer. It can also
remain undiscovered for days or weeks if it's not a core
service used for any operation. What can happen
if a deleted EC2 instance is accidentally revived?
When was the last time you spun up a temporary instance
just to try a new tool, an open source project or any
other experiment? If you deleted that
managed codified instance manually and
another engineer or team ran terraform apply,
it will be revived and you will pay for it. It can be
anywhere from three to four digits in dollars a month.
What about security? In that case, you have an
instance that might be publicly available to the Internet
running old, unmaintained, unneeded code.
You know where this goes: a breach. The IAM
policy is at the core of identity and permissions.
Any user or role can be attached to it and
get its needed permissions, such as writing to or
reading from specific services. It is used
for manual and programmatic access.
Managing policies as code has tremendous advantages,
especially when implementing least-privilege permissions.
Upon deletion, the policy is detached
from any entity attached to it. This means that any permission
granted by the deleted policy is now revoked.
Production processes can lose their ability to write and read
data, and customers can lose access to services.
But what about security? Policies can include
explicit deny statements that are always stronger than
allow statements. They can protect against unauthorized actions
and act as a guardrail that, once removed,
can pose a real threat of excessive permissions.
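The guardrail effect can be illustrated with a small sketch of the rule that an explicit deny beats any allow. This is a deliberately simplified model and not the full AWS policy evaluation logic: it looks only at the Effect and Action fields, ignores resources, conditions and other policy types, and matches action names case-sensitively. The function name and the sample statements are hypothetical.

```python
from fnmatch import fnmatchcase


def is_allowed(statements: list, action: str) -> bool:
    """Simplified evaluation: explicit deny wins, then allow, else implicit deny."""
    matching = [
        s for s in statements
        if any(fnmatchcase(action, pattern) for pattern in s["Action"])
    ]
    if any(s["Effect"] == "Deny" for s in matching):
        return False  # an explicit deny always overrides an allow
    # With no matching allow, the request falls through to an implicit deny.
    return any(s["Effect"] == "Allow" for s in matching)


# Hypothetical statements: a broad allow plus a deny that acts as a guardrail.
BROAD_ALLOW = {"Effect": "Allow", "Action": ["s3:*"]}
GUARDRAIL = {"Effect": "Deny", "Action": ["s3:DeleteBucket"]}

if __name__ == "__main__":
    print(is_allowed([BROAD_ALLOW, GUARDRAIL], "s3:DeleteBucket"))
    # Once the guardrail policy becomes a ghost, the broad allow is all that is left.
    print(is_allowed([BROAD_ALLOW], "s3:DeleteBucket"))
```

While the deny statement exists, deleting a bucket is blocked even though the broad allow matches it. When the policy holding the deny is deleted, the same request is allowed, which is the excessive-permissions risk described above.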
SQS is Amazon's queuing service,
a central pipe in any meaningful project,
usually connecting producers, who send
messages to it, and consumers, who read from it
and do something with the data. When a queue is accidentally deleted,
it is no longer available for the producers
to send messages to and for the consumers to read from,
therefore causing operational damage to any
dependent service, hurting the reliability of data
and halting jobs already in the queue.
The CloudWatch metric alarm is an elementary
configuration for monitoring cloud assets.
Whether it's the CPU or memory utilization of an EC2
instance that can lead to changing its type,
or space in your EBS volume,
the notifications it generates are actionable and essential
for production environments. It's pretty evident that
when such an alarm is deleted, it can cause
SLA problems, of course,
service slowdown and even downtime. Security
groups are virtual firewalls that control the traffic
that is allowed to reach and leave the assets they are associated
with. When accidentally deleted,
a security group can remove needed access to a specific asset.
Because multiple security groups can be assigned
to one asset, the deletion can be
detected late in the game, causing downtime,
although for this type it is a bit harder to accidentally
delete, as AWS does not allow deleting an
attached security group, whether attached to a network interface
or an instance.
This table summarizes the asset
types we discussed. Those are just the top five,
and any other deleted type can cause cost,
security and reliability issues.
So what can we do about this?
Well, you should all become your own cloud Ghostbusters.
There are several ways to tackle ghost assets
and we will cover just some of them.
There are four categories that we look at when we talk
about taking care of ghost assets.
One is detecting a ghost asset. It means
catching when an asset is deleted somehow
but is still in the state file.
Two, alert, is the ability to send
a notification to your favorite ChatOps solution
such as Slack. Three,
fix, means the ability to make a change to your
actual code or run a CLI command.
And four, prevention:
prevent means making sure that this never happens in
the first place. We'll start with our
SaaS solution, Firefly, for which we have a completely
free tier, no credit card needed.
You can start using it today and it supports the detection,
alerting and fixing of ghost assets.
We'll talk about prevention, which can be achieved by locking
your cloud against any ClickOps activity.
Although largely theoretical, some teams might
be able to get it done. We'll talk about ClickOps,
an open source tool that lets you detect and
alert on ghosts. Finally, we'll mention
driftctl, which helps you detect ghosts
as well.
So this is what the Firefly inventory looks
like. As you can see, we have a table with all of
our cloud assets up there. We filtered it
to show only ghost assets. We have quite a few
of those. Let's focus on one of them, an EC2
instance. So this instance
was created with Terraform code. It is no
longer in the cloud, meaning it was somehow
removed. It is still in the code and in the state file.
As you can see here, we have an alerts channel
on Slack and we send alerts on ghost
assets so we can immediately know when a codified
managed asset is removed from the cloud.
We can also look at the asset data and, in
the IaC tab, see exactly which stack or
state file managed it and which Terraform
code created it. When we click on
the code link we will get right
to the code in Git.
On the left you can see the resource code
in Terraform that created this asset.
And if it was accidentally deleted and we want to revive
it, we can just run the following command: terraform
apply -target with aws_instance,
which is the type, and bar three,
which is the resource name from the code.
If we want this asset removed from the code,
we click on remove asset code and we instantly
create a pull request to remove the relevant resource
block from our code. Once approved,
this asset will never come back unexpectedly.
Here you can see the pull request on GitHub, and
that concludes the first solution.
Well, the ultimate prevention option is to lock your cloud,
meaning disallowing any non-read ClickOps
through the AWS console and
allowing only your infrastructure as code solution
to make changes. This way you make
sure you never have any ghosts or drifted
assets as well. Still, the cost
is relatively high, especially when we
want to make a quick fix manually after a
PagerDuty alert woke us up, instead of writing
Terraform code at 2:00 a.m. But it
is still possible and it's definitely worth a
try.
ClickOps is an open source solution that helps
you detect and alert on any ClickOps actions someone
made in your cloud accounts. It can be provisioned
using AWS Control Tower for multi-account support, and
it basically uses CloudTrail logs to detect
those actions. You can also alert Slack
using a webhook configuration.
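As a rough illustration of the idea, and not of how the ClickOps project itself is implemented, the sketch below classifies a CloudTrail record as a ClickOps change when it came from a console session and was not read-only, then builds a Slack incoming-webhook style message for it. It assumes the record carries the CloudTrail fields sessionCredentialFromConsole, readOnly, eventName, eventSource and userIdentity.arn. The function names and the sample events are hypothetical, and actually posting the message to a webhook URL is left out.

```python
import json


def is_clickops_write(event: dict) -> bool:
    """True when the event came from a console session and changed something."""
    from_console = str(event.get("sessionCredentialFromConsole", "")).lower() == "true"
    # Assumption: a record without a readOnly field is treated as a write.
    read_only = str(event.get("readOnly", "")).lower() == "true"
    return from_console and not read_only


def slack_payload(event: dict) -> str:
    """Build the JSON body for a Slack incoming webhook (sending is not shown)."""
    who = event.get("userIdentity", {}).get("arn", "unknown identity")
    text = (
        f"ClickOps detected: {event.get('eventName')} "
        f"on {event.get('eventSource')} by {who}"
    )
    return json.dumps({"text": text})


# Hypothetical, trimmed CloudTrail records.
CONSOLE_DELETE = {
    "eventName": "TerminateInstances",
    "eventSource": "ec2.amazonaws.com",
    "readOnly": False,
    "sessionCredentialFromConsole": "true",
    "userIdentity": {"arn": "arn:aws:iam::123456789012:user/example"},
}
CONSOLE_READ = {
    "eventName": "DescribeInstances",
    "eventSource": "ec2.amazonaws.com",
    "readOnly": True,
    "sessionCredentialFromConsole": "true",
}
TERRAFORM_WRITE = {
    "eventName": "RunInstances",
    "eventSource": "ec2.amazonaws.com",
    "readOnly": False,
}

if __name__ == "__main__":
    for record in (CONSOLE_DELETE, CONSOLE_READ, TERRAFORM_WRITE):
        if is_clickops_write(record):
            print(slack_payload(record))
```

Only the console-originated termination produces a message. The read-only console call and the programmatic write are both ignored.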
ClickOps also provides a Terraform module
that is used to provision all the needed resources,
including a dedicated Lambda. It seems
like a great solution if you are willing to invest
time configuring, monitoring and troubleshooting
it. driftctl
is another open source tool that works with state files,
so you don't need any unique cloud deployment,
but you still need to invest the time and automate it
using cron jobs or any other automation
to catch ghosts as they happen. If you
want any alerts, you will have to build them yourself.
In addition, I recommend being careful when
handling state files, as they might contain sensitive
information when not cleaned and can cause damage
if accidentally changed.
Let's wrap up. Ghost assets are dangerous.
They have to be detected and fixed. One way
or another, we can prevent redundant costs,
reliability issues and security exposure.
I will end with our favorite blessing at Firefly.
May your infrastructure never drift.
Thank you.