Conf42 Cloud Native 2022 - Online

From Ghost Assets to Infrastructure Drift - Don't Get Spooked

Abstract

There’s a lot of confusion in the infrastructure operations world regarding emerging concepts such as state drift, ghost assets, and other cloud operations nightmares. In this talk we’re going to demystify the differences between all of these concepts, which are the byproduct of large-scale, rapid adoption and growth of cloud operations. Cloud operations were widely impacted by the introduction of Infrastructure as Code practices, which changed the way we manage our cloud. That said, cloud deployments that predate IaC were often not codified and were instead managed via UI or cloud APIs - which has led to a mishmash in the way cloud resources are managed and maintained today. We will go on to focus on how and when ghost assets happen, why you might have these haunting assets and what to do about them - hopefully without spooking you out too much.

We’ll wrap up with some real code examples of what ghost assets look like and when you should be concerned - from the cost to the security implications - as well as how to fix them and prevent any ghosts in your closet in the future. Boooo!

Summary

  • From ghost assets to infrastructure drift: don't get spooked. We'll cover what ghost assets are and the top five ghost assets according to more than 100 AWS production accounts, and what you can do to prevent some nasty stuff from happening to your cloud.
  • Naor Paz is director of product management at Firefly, the cloud asset management platform for DevOps. He explains how infrastructure drift starts when the infrastructure drifts away from the code. Ghost assets can hurt cloud reliability by causing downtime or production issues.
  • The ultimate prevention option is to lock your cloud, meaning disallowing any non-read ClickOps through the AWS console. The cost is relatively high, but it's definitely worth a try. ClickOps is an open source solution that helps you detect and alert on any ClickOps actions someone made in your cloud accounts.
  • Ghost assets are dangerous. They have to be detected and fixed. One way or another, we can prevent redundant costs, reliability issues and security exposure. May your infrastructure never drift.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, thank you for joining my session, From Ghost Assets to Infrastructure Drift - Don't Get Spooked. We'll start with a short intro and continue with a reminder of the concept of infrastructure drift. We'll cover what ghost assets are and the top five ghost assets according to more than 100 AWS production accounts. We'll continue with what you can do about ghost assets to prevent some nasty stuff from happening to your cloud.
My name is Naor Paz. I currently work as director of product management at Firefly, the cloud asset management platform for DevOps, site reliability and cloud engineers, driven by infrastructure as code. A bit about myself: before joining Firefly, I worked at F5 as a senior product manager in application security, where I led the web application firewall product. Before that I worked for the Office of the Prime Minister of Israel as a security research and analytics team lead, and before that I served at Unit 8200 doing mainly network security research and data analytics. So in the past decade, I've got a mixed background in DevOps, cloud, security, data and networks.
So what do you need to know to gain some valuable knowledge from this session? Well, not too much, but you need to understand the following concepts. Cloud: throughout this session I will mainly show examples from the most significant public cloud to date, Amazon Web Services or AWS, but any cloud background is sufficient. The session is as relevant to Azure, Google Cloud and other cloud service providers. Infrastructure as code, or IaC: this session focuses on examples from HashiCorp Terraform, although it is relevant to Pulumi, CloudFormation and other IaC technologies. I recommend understanding state files, HCL, which is HashiCorp Configuration Language, and some terminology such as providers. When I say cloud asset, I usually mean a Terraform resource when it is a managed one.
Infrastructure drift, a short reminder. When we manage cloud assets with Terraform or any other infrastructure as code, we do that for multiple reasons: speed, consistency, versioning, documentation, and decreased risk. The opposite of managing cloud assets with code is doing it manually through the cloud console or user interface, what we like to call ClickOps, which stands for click operations. Infrastructure drift starts when the infrastructure drifts away from the code. When we want to make just a small patch here and there, we tell ourselves that we'll add it to the code later, but we usually never do. So we started with a fully managed, codified infrastructure. We used ClickOps to make just a small patch here and there. And now our code no longer describes the actual state of our cloud. When we run terraform plan, for example, we get a list of all the changes we've made manually, and we're stuck. We have to decide whether we want to revert everything to the older codified state or adapt it somehow, meaning changing the code to match our manual changes. These drifts are all about changing existing assets, which we created from code. This session focuses on the very existence of those assets, as you'll see right away.
So what are ghost assets? We have our codified, managed assets created with code using the terraform apply command. Someone or something completely removed one or more of those assets with APIs or ClickOps, meaning manually. They now cease to exist in the cloud. But because they were initially created with code, they are still in the code itself and in the state file, which refreshes once you apply your code again. These assets are what we call ghost assets. They can hurt cloud reliability by causing downtime or production issues, create redundant costs, and generate security issues. Anyway, there are two options.
You either wanted to delete that asset, which you no longer need, and what can happen is that it can be accidentally revived. It is still in the code, so applying your code will just recreate that asset again. The other option is that you did not want to delete it. You still need it in your cloud deployment, but it was deleted by mistake and now it is not available when you need it. It can be a storage backup, a compute asset such as a Lambda that can cause downtime, or a database replica, you name it. What is most certainly true here is that you have a problem.
Firefly is a cloud asset management platform, and its customers integrate their AWS accounts, so we have plenty of data to look at. By looking at only part of them, more than 100, we get a fascinating sorted list of the top five ghost asset types. EC2 instance: a compute asset, essentially a virtual machine. IAM policy, which is all about identity and access permissions. SQS queue, which is a managed message queuing service. CloudWatch metric alarm: an alert configuration on a cloud metric such as CPU utilization for an EC2 instance or volume write bytes for an EBS volume. And security group, which is basically a virtual firewall.
So the EC2 instance is one of the most critical assets. It is used by almost any AWS customer for compute. Again, it is essentially a virtual server you can do anything with. It was also the second service AWS introduced, back in August 2006, second only to the S3 storage service. So what can happen if an instance is accidentally deleted? Well, in that situation we mainly have reliability problems. Imagine we have a multiservice application deployment. An EC2 instance can run its web server, a backend service, a web data scraper, and even a database. If this production instance is suddenly gone, you will definitely suffer from downtime. Services will not be accessible to customers or any other type of consumer.
It can also remain undiscovered for days or weeks if it's not a core service used for any operation. What can happen if a deleted EC2 instance is accidentally revived? When was the last time you spun up a temporary instance just to try a new tool, an open source project or any other experiment? If you deleted that managed, codified instance manually and another engineer or team hit terraform apply, it will be revived and you will pay for it. It can be anywhere from three to four digits in dollars a month. What about security? In that case, you have an instance that might be publicly available to the Internet, running old, unmaintained, unneeded code. You know where this goes: a breach.
The IAM policy is at the core of identity and permissions. Any user or role can be attached to it and get its needed permissions, such as writing to or reading from specific services. It is used for manual and programmatic access. Managing policies as code has tremendous advantages, especially when implementing least-privilege permissions. Upon deletion, the policy is detached from any entity attached to it. This means that any permission granted by the deleted policy is now revoked. Production processes can lose their ability to write and read data, and customers can lose access to services. But what about security? Policies can include explicit deny statements that are always stronger than allow statements. They can protect against unauthorized actions and act as a guardrail that, once removed, can pose a real threat of excessive permissions.
SQS is Amazon's queuing service, a central pipe in any meaningful project, usually connecting producers, who send messages to it, and consumers, who read from it and do something with the data. When accidentally deleted, it is no longer available for the producers to send messages to or for the consumers to read from, therefore causing operational damage to any dependent service, hurting the reliability of data and halting jobs already in the queue.
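To make the explicit deny point from the IAM policy discussion above more concrete, here is a minimal sketch of a heavily simplified policy check. It is not how AWS evaluates policies in full (it ignores conditions, principals, NotAction and the different policy types), and the function name, the sample statements and the bucket ARN are all hypothetical. It only illustrates why deleting a policy that carried a Deny guardrail can silently widen permissions.

```python
from fnmatch import fnmatchcase

# Hypothetical statements, written in the shape of IAM policy statements.
ALLOW_S3 = {"Effect": "Allow", "Action": "s3:*", "Resource": "*"}
DENY_DELETE = {"Effect": "Deny", "Action": "s3:DeleteBucket", "Resource": "*"}


def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]


def is_allowed(statements, action, resource):
    """Simplified check: an explicit Deny wins, then Allow, else implicit deny."""
    allowed = False
    for statement in statements:
        actions = _as_list(statement["Action"])
        resources = _as_list(statement["Resource"])
        # Action names are matched case-insensitively; "*" and "?" act as wildcards.
        if not any(fnmatchcase(action.lower(), a.lower()) for a in actions):
            continue
        if not any(fnmatchcase(resource, r) for r in resources):
            continue
        if statement["Effect"] == "Deny":
            return False  # a matching explicit Deny overrides any Allow
        allowed = True
    return allowed


if __name__ == "__main__":
    bucket = "arn:aws:s3:::backups"  # hypothetical bucket ARN
    # With the guardrail in place, deleting the bucket is blocked.
    print(is_allowed([ALLOW_S3, DENY_DELETE], "s3:DeleteBucket", bucket))
    # Once the guardrail policy becomes a ghost, the broad Allow takes over.
    print(is_allowed([ALLOW_S3], "s3:DeleteBucket", bucket))
```

The second call shows the risk described in the talk: nothing about the remaining Allow statement changed, yet the effective permissions grew as soon as the Deny statement disappeared.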
The CloudWatch metric alarm is an elementary configuration for monitoring cloud assets. Whether it's the CPU or memory utilization of an EC2 instance, which can lead to changing its type, or the space in your EBS volume, the notifications it generates are actionable and essential for production environments. It's pretty evident that when such an alarm is deleted, it can cause SLA problems, service slowdown and even downtime.
Security groups are virtual firewalls that control the traffic that is allowed to reach and leave the assets they are associated with. When accidentally deleted, a security group can remove needed access to a specific asset. Because multiple security groups can be assigned to one asset, the deletion can be detected late in the game, causing downtime. For this type, though, it is a bit harder to delete accidentally, as it is not allowed to delete an attached security group, whether it is attached to a network interface or an instance.
This table summarizes the asset types we discussed. Those are just the top five, and any other deleted type can cause cost, security and reliability issues. So what can we do about this? Well, you should all become your own cloud Ghostbusters. There are several ways to tackle ghost assets and we will cover just some of them. There are four categories that we look at when we talk about taking care of ghost assets. One, detect: it means catching when an asset is deleted somehow but is still in the state file. Two, alert: the ability to send a notification to your favorite ChatOps solution, such as Slack. Three, fix: the ability to make a change to your actual code or run a CLI command. And four, prevent: making sure that this never happens in the first place.
We'll start with our SaaS solution, Firefly, for which we have a completely free tier, no credit card needed. You can start using it today, and it supports the detection, alerting and fixing of ghost assets.
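As a rough, product-independent illustration of the "alert" category above, the sketch below builds a message for a Slack incoming webhook, which accepts a JSON body with a "text" field. The function names, the message wording, the resource address and the state file path are hypothetical; the webhook URL is something you would create in your own Slack workspace. This is only an assumption about how such an alert could be wired, not how any of the tools mentioned in the talk do it.

```python
import json
import urllib.request


def build_ghost_alert(address, state_path):
    """Build a Slack incoming-webhook payload describing one ghost asset."""
    text = (
        f"Ghost asset detected: {address} is still in {state_path} "
        "but no longer exists in the cloud. "
        f"To revive it, run: terraform apply -target={address}"
    )
    return {"text": text}


def send_alert(webhook_url, payload):
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical resource address and state file path, for illustration only.
    payload = build_ghost_alert("aws_instance.scratch", "prod/terraform.tfstate")
    print(json.dumps(payload, indent=2))
    # send_alert("<your Slack webhook URL>", payload)  # left commented out on purpose
```

Keeping the payload builder separate from the network call makes it easy to check the message without touching Slack.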
We'll talk about prevention, which can be achieved by locking your cloud against any ClickOps activity. Although largely theoretical, some teams might be able to get it done. We'll talk about ClickOps, an open source solution that lets you detect and alert on ghosts. Finally, we'll mention driftctl, which helps you detect ghosts as well.
So this is what the Firefly inventory looks like. As you can see, we have a table with all of our cloud assets up there. We filtered to show only ghost assets. We have quite a few of those. Let's focus on one of them, an EC2 instance. So this instance was created with Terraform code. It is no longer in the cloud, meaning it was somehow removed. It is still in the code and in the state file. As you can see here, we have an alerts channel on Slack and we send alerts on ghost assets, so we can immediately know when a codified, managed asset is removed from the cloud. We can also look at the asset data and, in the IaC tab, see exactly which stack or state file managed it and which Terraform code created it. When we click on the code link, we will get right to the code in Git. On the left you can see the resource code in Terraform that created this asset. And if it was accidentally deleted and we want to revive it, we can just run the following command: terraform apply -target with the address of the resource, which is aws_instance, the type, followed by bar three, the resource name from the code. If we want this asset removed from the code, we click on remove asset code and we instantly create a pull request to remove the relevant resource block from our code. Once approved, this asset will never come back unexpectedly. Here you can see the pull request on GitHub, and that concludes the first solution.
Well, the ultimate prevention option is to lock your cloud, meaning disallowing any non-read ClickOps through the AWS console and allowing only your infrastructure as code solution to make changes. This way you make sure you never have any ghosts or drifted assets as well.
Still, the cost is relatively high, especially when we want to make a quick fix manually after a PagerDuty alert woke us up, instead of writing Terraform code at 2:00 a.m. But it is still possible, and it's definitely worth a try.
ClickOps is an open source solution that helps you detect and alert on any ClickOps actions someone made in your cloud accounts. It can be provisioned using AWS Control Tower for multi-account support, and it basically uses CloudTrail logs to detect those actions. You can also alert Slack using a webhook configuration. ClickOps also provides a Terraform module that is used to provision all the needed resources, including a dedicated Lambda. It seems like a great solution if you are willing to invest time configuring, monitoring and troubleshooting it.
driftctl is another open source tool that works with state files, so you don't need any unique cloud deployment, but you still need to invest the time and automate it using cron jobs or any other automation to catch ghosts as they happen. If you want any alerts, you will have to build them yourself. In addition, I recommend being careful when handling state files, as they might contain sensitive information when not cleaned and can cause damage if accidentally changed.
Let's wrap up. Ghost assets are dangerous. They have to be detected and fixed. One way or another, we can prevent redundant costs, reliability issues and security exposure. I will end with our favorite blessing at Firefly: may your infrastructure never drift. Thank you.
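The "detect" step described in the talk, catching an asset that is still in the state file but gone from the cloud, can be sketched roughly as follows. This is a minimal illustration and not how Firefly or driftctl work internally. It assumes the version 4 Terraform state JSON layout (a top-level "resources" list whose entries carry "mode", "type", "name" and "instances"), and it takes the set of IDs that currently exist in the cloud as an input, since collecting them (for example by listing instances through the cloud API) is outside the sketch. The function names and the sample state are hypothetical.

```python
import json

# A tiny, made-up state file in the version 4 layout, for illustration only.
SAMPLE_STATE = """
{
  "version": 4,
  "resources": [
    {"mode": "managed", "type": "aws_instance", "name": "web",
     "instances": [{"attributes": {"id": "i-0aaa"}}]},
    {"mode": "managed", "type": "aws_instance", "name": "scratch",
     "instances": [{"attributes": {"id": "i-0bbb"}}]},
    {"mode": "data", "type": "aws_ami", "name": "base",
     "instances": [{"attributes": {"id": "ami-0ccc"}}]}
  ]
}
"""


def managed_resources(state):
    """Yield (address, id) for every managed resource instance in the state."""
    for resource in state.get("resources", []):
        if resource.get("mode") != "managed":
            continue  # data sources are read-only lookups, not assets we created
        address = f'{resource["type"]}.{resource["name"]}'
        if "module" in resource:
            address = f'{resource["module"]}.{address}'
        for instance in resource.get("instances", []):
            instance_address = address
            if "index_key" in instance:  # set when count or for_each is used
                instance_address += f'[{json.dumps(instance["index_key"])}]'
            yield instance_address, instance.get("attributes", {}).get("id")


def find_ghosts(state, live_ids):
    """Return addresses that are in the state but whose ID is not in the cloud."""
    return sorted(
        address for address, resource_id in managed_resources(state)
        if resource_id not in live_ids
    )


if __name__ == "__main__":
    state = json.loads(SAMPLE_STATE)
    live_ids = {"i-0aaa"}  # hypothetical IDs that still exist in the cloud
    print(find_ghosts(state, live_ids))
```

Because the sketch only reads the state, it respects the caution above about never changing state files by accident; the same caution about sensitive values in state files applies to wherever you run it.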

Naor Paz

Director of Product @ Firefly



