Conf42 Chaos Engineering 2024 - Online

Security Chaos Engineering: How to Fix Things (you didn't know were broken) and Make Friends While Doing It

Abstract

How do we fix a broken industry? We talk about ‘traditional approaches aren’t enough’ while delivering established (traditional) best practices. We fix it by going offensive. Security Chaos Engineering proactively validates our assumptions of how something truly works and provides insights to improve.

Summary

  • Security chaos engineering: how to fix things you didn't know were broken. The things I'll be talking about may sound like a job for offsec people, but they're not. The goal is to understand how things work before you hire a team to break things.
  • The Verizon DBIR regularly shows misconfigurations as the primary cause of breaches, not vulnerabilities. Basic fundamentals are still difficult, even without the smug jerks clumsily saying "just turn it on." We need to understand what we're trying to protect.
  • Secure today does not mean secure tomorrow. Trying to prevent failure doesn't work, and it's likely never going to work. Everything's going to fail at some point, so it's our job to understand how things will fail and plan accordingly. Key performance indicators help set benchmarks, measure progress, and set business goals.
  • Most adversaries look extremely similar to each other. Don't use actual malware because that can ruin your day. You should be able to see layers of defense in action. Getting started is probably one of the hardest parts.
  • We should not settle for 80 or 90% visibility. Everything else needs to be tested, identified, and measured. In experimentation, you're learning about how things actually work. Knowing that your product works enables better decisions when it comes to renewals.
  • Work is really all about making friends in the right places to help you get the things you want done. Security chaos engineering helps you ask better questions. Success means more work because that's really what life's all about.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Security chaos engineering: how to fix things you didn't know were broken and make friends while doing it. The things I'll be talking about, while they sound like a job for offsec people, are not really an offsec role. I'm not particularly interested in knowing something doesn't work. We all know that there are more than enough people out there to tell you that. The goal is not to find all the broken things. The goal is to understand how things work before you hire a team to break things.

Part one: nothing works how we think it works. Maybe it's me. If you're looking for the smartest person in the room, just shut this off. Just go watch a different one, because you're looking at the wrong person. But if you want to feel better about yourself, let's hang out. I don't really know how everything works all the time, and you might, and that's really awesome. But individuals don't scale.

The outside world's a scary, scary place. And the people pulling the big decision levers are constantly being fed information about the horrors of the outside world and the threats looming on the horizon. They're hearing about the magic dust that solves the world's problems they didn't know they had, the dust that will protect the business from the next big nasty thing. They're going to ask their teams, are we safe from that threat? And then they're going to hear, we've got signatures in place and configurations made to stop those threats. But that doesn't really answer the question. The answer should be: yeah, we've modeled it, here are the results, and we're tracking the resolution of those gaps. Or: I don't know, but here's how we're going to find out.

Everyone working in this industry is constantly bombarded with ads meant to terrify them into buying whatever jalapeno nacho cheese flavored security tool says it's the only real way to prevent all the cutting-edge quantum AI threats that are continuously trying to eat their lunch. Vendors make us worry about the hypothetical next-gen problems while the core of the business still struggles with fundamentals that have been in place for decades: know where your assets are, where the data lives, who can access the data. Take backups, validate the backups work, protect information, and keep the nasties out. Should most of us really be worrying about quantum ransomware while we're struggling with managing role-based access control?

My mind is still blown from hearing vendors talk about BloodHound and how they can detect BloodHound. It's like, great, I'm glad, it's super noisy, I hope you can pick it up. Maybe next we'll talk about detecting Mimikatz in its default configuration.

I was having trouble sleeping a few nights ago and I decided to look up vulnerability trending information. I took one look at the trending chart and said, wow, I'm glad I'm not in vulnerability management, and saying that means I'm probably going to be gifted with a project around vulnerability management soon, for cyber karma. So, going back on topic, it makes me wonder if we're actually improving as an industry. Based on those vulnerability numbers, we're clearly good at identifying these issues, especially since 2017, when a minor change to the CVE assignment process made CVEs easier to get. It's like a nice birthday gift to vulnerability management teams. The Verizon DBIR regularly shows misconfigurations as the primary cause of breaches, not vulnerabilities. Is the time that we're spending on patching improving our ability to defend? Probably.
But it's also true that misconfigurations contribute greatly to the security issues that plague us today. When you've got all these issues just dumped in front of you, it's really easy to buy a tool and say, okay, I've got the industry-standard tools, so I don't have to worry about this. This is covered. I don't have to think about this anymore. Then you only find out later that it might not be working the way you thought it worked.

As we buzzily declare that we're shifting left to embed security in all the platforms at the beginning, trouble is kind of brewing. Like, how do we know that the developers are making secure code? How do we know that the libraries we rely on aren't actually held together with tinfoil and bubblegum? We get more tools to tell us. It's extremely complicated, and none of it's easy. As we do our work, these systems constantly change, and constant change is the chaos experienced in day-to-day work. Chaos engineering and security chaos engineering help us design, build, operate, and protect these complex systems so that they're more resilient to attacks and failures.

A lot of these tools exist because it's kind of hard to afford the folks who have time to dig into the root cause and make real change. Are you afraid of malware? Buy AV. A lot of these successful attacks can be broken down to the basics, like don't let untrusted code execute. Don't let folks run untrusted macros from Excel docs. But if you do, make sure the one time it happens isn't going to bring down your whole environment. Are vendors where we should be focusing our efforts to secure things, or should it be in people? While I'm somewhat excited about AI, I'm not really excited that it's being trained on the same things humans are being trained on, probably Stack Exchange or outdated Reddit posts, materials that help you get what you want but not done securely, which will probably result in creating broken or insecure applications at a faster scale than a human can.

And the fundamentals are simple, but they're not easy. Even paid-for tooling is complicated. Like, how does it work with deployment pipelines? Can it work at scale? Can you get your data out of it, or are you essentially stuck with them for life? Token management, things that are considered table stakes and basic fundamentals, are still difficult even without the smug jerks clumsily saying "just turn it on." So what makes these things challenging? In some cases we don't know why we're doing it outside of third parties saying it's best practice, follow best practice. These basics aren't really basic, they are kind of hard. If you don't believe it, ask yourself why there are so many companies struggling with implementing native endpoint controls like a firewall, and instead relying on vendors to cover that gap for them. It's easier, with the resources available to them, to just manage a portal and not the entire platform. Have you ever really tried managing something like AppLocker or Windows Defender Application Control? It's got poor usability, poor user experience, and that makes it far more challenging than it should be. If you need to be certified to use a product or go through 40-plus hours of training, it's probably not well designed. And things that aren't well designed are prone to failure. Good security should not just be for those who can pay to play; it should be built into the foundation.
But regardless of who runs it, we need to understand how the tools work, if they work, and ensure they continue to work. And importantly, do they work as expected? To accomplish things, we need foundational work. We need to understand what we're trying to protect. We need to know what the goal is that we want to achieve. Is it the top ten adversaries? Is it opportunistic attackers? How the tools work doesn't necessarily have to be deep knowledge, but in general: where are the weak points in your detection? What does it not cover? Like, what can't it see? Are they being used for their intended purpose? If they work: sometimes things are secretly a beta with pretend machine learning and AI that only work under certain conditions. We need to make sure they continue to work: whatever value is being taken from these tools, that's the baseline of what's expected, and they need to continue doing that thing always.

I would argue that so few know how things work that it's more scalable and acceptable to say we manage that part of our portfolio with VendorX. And the more I learn about something and how things work, the more I'm amazed that anything works at all. So we need to understand how these tools work, if they work, ensure they keep working, and, importantly, that they work as expected. Because using security incidents as a way to measure detection, while it is a method, is not a great method to measure the ability to detect. People are stressed. There's a lot going on. Researching gaps during this time isn't really recommended. Obviously you can do it in a retrospective, but these are still stressful times, and a bad retrospective can end up being a game of blame the human. Don't blame the human; blame the system. Security awareness already puts enough responsibility on the person. Let's build resilient systems that can withstand an attacker. It's like shaming people for clicking on phishing. If clicking on a link just ruins everything for you, it's not the victim, it's the system. Stop shaming users for clicking on things. When things break in real life, we generally analyze. When we're wrong, we redesign it, we model it, and we build it better. But when we think people are the cause, we just typically blame them. Like cybersecurity awareness: we're well aware of cybersecurity. So let's build platforms and environments that are resilient enough to withstand a user clicking on a link.

Part two: secure today does not mean secure tomorrow. In real life, you only know how a system works the moment you deploy it. It has presumably been tested and documented, everyone signed off on the deployment, and chaos is set free to run through the fields of cybersecurity. But then tuning happens, patching happens, priorities change, humans get reassigned work, people leave. This is where the chaos comes in. Chaos engineering: we are not making the chaos. Everyday work is the chaos, and we're working to make sense of how things change and why. Planned and approved changes can fundamentally change how a system operates. An example: software development constantly invents new ways to change how systems work, usually features, performance, usability, and so on. Oftentimes, vendor platforms aren't equipped to move that quickly. So in these cases, what do you do? Do you hold back their releases? Do you stick with older versions of things? Sticking with older versions of things means you could have more vulnerable software.
So when we're looking for reliability, we know how it operates, we know what failure looks like, and we can validate multiple layers of defense to cover the gaps generated by the failure. Trying to prevent failure doesn't work, and it's likely never going to work. You cannot prevent all failure. Everything's going to fail at some point, so it's our job to understand how things will fail and plan accordingly. And these experiments help us understand how something's going to fail so we can learn from it, and that will enable us to build systems resilient to cybersecurity failures. You get something dumping LSASS and you have a whole slew of controls that have failed and questions that come along with it. How did it get there? Why did it run? Was privilege escalation needed? If so, why did that work? How did it get through our network defenses? Does that mean we need to revisit configurations? To do this, I enjoy building attack trees that help model these scenarios and guide me in creating new experiments, with visual aids to help with telling the story to other humans and to help me not look insane while I'm doing it.

Part three: experiments in madness. KPIs are super hot. Key performance indicators help set benchmarks, measure progress, and set business goals. There are nearly unlimited numbers of terrible KPIs floating around out there, based on things that you can't control, like KPIs on intrusion attempts, numbers of incidents, time to resolve. These can be gamed or just inaccurate. You can't control how many external entities are trying to ruin your parade. Something like mean time between failure, that I can get behind. Every successful event on the endpoint is a failure. If app control stops something, there's a failure at every previous layer, and that's just where the work begins. It's where we need skilled humans. Anybody can push the go button, but you need folks in place who understand what the button does and what it all means together.

We need to tie it back to what we care about. What are the customer concerns? Are they worried about threat coverage? Defensive resilience? Needlessly duplicative technology? You're going to want to take the time to figure out their goals, especially if they aren't sure what they need. So when you're looking at it: what concerns you? What threats concern you? What does your coverage look like? Can you safely withstand the attacks from these groups? How much failure can you safely withstand before you have a really bad day?

Testing versus experimentation. Testing implies a right or wrong answer, and we're not looking for right or wrong. We're looking for new information to help us make better decisions. I want to know what works, what doesn't work, why it works the way it does, and what cost-effective improvements we can make that will make other teams work harder. I'd like to know that the money we're spending is providing value beyond compliance checkboxes. Starting with design: what do we think will happen? Assuming there's a documented expectation, we can safely wager that the platform will do what's promised. Tons of people will say trust but verify, but I don't really trust claims. I want to test the claims and gather evidence. It's admittedly a less exciting part, but we don't want to eat the dry ramen and then just drink the water afterwards. We want to tie it back to business goals. So in building an experiment, we want to start with a nice, reputable source to use.
I generally go with either intel sources, or I map it out in front of me. If I'm not emulating a specific threat, I'm looking at each stage and finding out what's available to me at every single part. Odds are, soon you're going to find that most adversaries look extremely similar to each other. Maybe they'll deploy something differently, but in the end, it's just untrusted code executing. It's like 22 Jump Street: you infiltrate the dealer, you find the supplier. Just do the same thing, mostly. Obviously there are edge cases. The DFIR Report usually has great breakdowns of adversarial actions that you can use to plan your experiment. But don't use actual malware, because that can ruin your day. You can use EICAR or Mimikatz, things that aren't inherently destructive and won't get out of your control. You can even make AV flip out by appending Mimikatz commands after calc.exe, because next-gen AV likes to pretend it's learning everything, but it's going to trigger on those words. I wonder why.

This is just a visual example of a couple of different tests where we expect to see some observables, sort of a scorecard, if you will, about what you think you should see and where. And doing things is noisy. In this example, we're running a couple of atomic tests. In this scenario, we're seeing a file download, execution, and local discovery. When these typically run, you're looking for endpoint events, which makes sense because that's the easiest place to look for indicators of attack. In reality, downloading a file should provide no less than two to four pieces of telemetry through your stack, or even nothing, depending on how your organization is running. Point being, when you do things, you should be able to see layers of defense in action. So questions that I have from the above would be: why are these files allowed to be downloaded? Is that normal for that user's persona? Do people typically download things out of band? Do they use PowerShell to do it? Why are they able to launch setup from the desktop? Is that normal? Like, pretend it's malware: shouldn't the proxy handle malware at that layer? Do people normally start services from the command line? The answer to all of that could be yes. But if you don't know that, it's difficult to say whether your tools and platforms are working as expected.

We're going to be filling this out later. It's a quick, high-level overview of what the experiment ultimately looks like. At its conclusion, we'll have an easy-to-understand view of the overall status of the cybersecurity tooling. So this one is relatively straightforward: take your outcome, your goals, and validate your position. If you deploy a thing, does it do what it's supposed to do? You don't want to just assume the outcome; it needs to be exercised. And getting started is probably one of the hardest parts. So to help guide this, something like the MITRE D3FEND framework is going to help if you're unsure how your defenses measure up. ATT&CK helps if you've got absolutely no idea how to get started or what an attack looks like. If everything is super awesome, you've just created a baseline to measure future changes against: patching, tuning, whole platform replacements can all be measured against that baseline. One key note: this baseline should be changed over time to reflect positive changes. Otherwise, you'll be measuring the wrong thing, and you won't really know if you're improving from the previous iteration.
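To make the scorecard idea described above a bit more concrete, here is a minimal sketch in Python of how you might record expected versus observed telemetry per defensive layer for a couple of experiment steps. The layer names, the step descriptions, and the hard-coded "observed" sets are all hypothetical placeholders; a real experiment would pull observations from your SIEM or EDR rather than typing them in.

```python
# Hypothetical sketch of a per-layer scorecard for a security chaos experiment.
# The layers, steps, and "observed" results are illustrative, not real data.
from dataclasses import dataclass, field

@dataclass
class ExperimentStep:
    technique: str                                 # what the step does, in plain words
    expected: set = field(default_factory=set)     # layers that should produce telemetry
    observed: set = field(default_factory=set)     # layers that actually alerted or logged

    def gaps(self) -> set:
        """Layers we expected to hear from but didn't."""
        return self.expected - self.observed

# Two steps loosely modeled on the atomic-test example above.
steps = [
    ExperimentStep("file download via PowerShell",
                   expected={"proxy", "network", "endpoint"},
                   observed={"endpoint"}),
    ExperimentStep("local discovery commands",
                   expected={"endpoint"},
                   observed={"endpoint"}),
]

for step in steps:
    status = "OK" if not step.gaps() else f"GAPS at: {', '.join(sorted(step.gaps()))}"
    print(f"{step.technique:<35} {status}")
```

Run as written, this would flag that the download step produced endpoint telemetry but nothing at the proxy or network layers, which is exactly the kind of gap the "why didn't this get stopped earlier?" questions above are meant to chase down.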
And if everything's not super awesome, well, you now know where you've got to start.

Part four: measuring stuff. The phrase "we don't know what we don't know" is a traditional annoyance of mine. It's like you're throwing up your hands and giving up. Presumably there are things that have been done outlining what you know about your environment. Even a vuln scan is a start. Whatever that is, that's what you know. Everything else needs to be tested, identified, and measured. Whatever you know, that's your current baseline. So to put visibility into perspective: what if you could only guarantee that 80 or 90% of the time your Cheerios weren't poisoned or didn't contain bits of rock and glass? There'd probably be a big investigation. So we should not settle for 80 or 90% visibility. We don't want to be eating glass. Cyberglass. I don't think cyberglass is a thing yet.

And the view of experiments in practice: this is currently my favorite view of the world. It tells me exactly what defenses are working for a given technique. Now, this example is purely fictional, but the idea is that you take a view like this and use it to guide where you're going. In the delivery stage, you can see the payload was blocked at the endpoint. Mission accomplished, right? But why did it get to the endpoint? That means there were failures at the other layers coming earlier in the chain. Why didn't it get stopped earlier? Was it a misconfiguration? If nothing's wrong and everything is working as expected, is that normal? We don't know. That's what we're going to find out. And that's the beauty of experimentation: you're learning about how things actually work rather than guessing based on a configuration. And once you can do it that first time, you automate it, and now you have ways to identify drift over time. You're able to measure the impact of changes and avoid mistakes later on that are more expensive to fix. Like PG&E: fixing the infrastructure 50 years ago would have been way less expensive than paying for the deferred maintenance, the fires, the homes lost, and eventually the rate raises that we have today.

And it helps you make friends. Friends like these. So this was AI's attempt to make five half-human, half-ravioli people. There are six here. I had some free tokens that were expiring, and I needed to make some art for this presentation. Security chaos engineering should not just be limited to engineering teams. Don't think they're the only ones that can have all of the fun. Hunt teams can validate analytics to let them know that, in lieu of an active compromise, their analytic is still functioning. Parsing results can let groups zero in on where detection is missing and guide hunt activities to where the gaps live. Architecture teams or groups can be looking for duplicative technologies, maybe reduce tech debt. If you don't know where things overlap, then when you make changes, you might be losing critical tooling. You could also find that you've got too much emphasis in one place and not enough someplace else. And this is how you can find out without waiting for an adversary. Operations teams want to know that the tools they rely on work through all the nonsense of maintenance and updates. They want to make sure their alerts still work. They want to know where the holes are that they need to worry about. And product teams everywhere want to know how their portfolios perform, whether they provide actual value to their customers.
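Picking up the drift idea from above (automate the first experiment, then rerun it and compare), here is a minimal sketch of that comparison, assuming each run's per-technique detections have been saved as a simple JSON map of technique name to detected/not detected. The file names and result shape are assumptions made for illustration, not a prescribed format.

```python
# Hypothetical sketch: compare the latest experiment run against a saved baseline
# to spot detection drift. File names and result shape are assumptions.
import json

def load_results(path: str) -> dict:
    """A run is stored as {"technique name": true/false} -- detected or not."""
    with open(path) as f:
        return json.load(f)

def detect_drift(baseline: dict, latest: dict) -> dict:
    regressions = [t for t, seen in baseline.items() if seen and not latest.get(t, False)]
    improvements = [t for t, seen in latest.items() if seen and not baseline.get(t, False)]
    return {"regressions": regressions, "improvements": improvements}

if __name__ == "__main__":
    baseline = load_results("baseline.json")  # results from the first automated run
    latest = load_results("latest.json")      # results from last night's scheduled run
    drift = detect_drift(baseline, latest)
    print("Detections lost since baseline:", drift["regressions"])
    print("New detections since baseline:", drift["improvements"])
    # When things improve, roll them into a new baseline -- otherwise you end up
    # measuring against the wrong thing, as noted above.
```

The point is the habit rather than the code: lost detections become a work queue, and improvements get folded into a new baseline so the next comparison is against the current state of the world.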
Knowing that your product works enables better decisions when it comes to the renew-or-replace cycles, and it will help give you a baseline to measure against. It also helps validate that a new product will do what it claims, and you don't have to take their word for it. Or if they say, "just look at the MITRE ATT&CK evaluation results, you can see we saw everything": yeah, sure you did. I don't want to know what the best tool is. I want to know the best way that we can reduce the impact of a malicious entity getting into an environment. And that is what we can do with experimentation.

Part five: making friends along the way. And I will wholeheartedly attest to making friends. Being a director of security chaos engineering makes you popular, like popular with audit and second-line functions, which you might not want unless you really like writing responses to their requests. Because talking to them and giving them information means you'll be writing responses all the time, because you're bringing real-life views into what's typically a box-checking exercise and making other programs better. It helps you make friends, and that's really what working is about, isn't it? Work is really all about making friends in the right places to help you get the things you want done, actually done.

So in this example, I'm going with compliance. Now, everybody loves compliance. And many times, when asked about the security program of their business, people will answer, we're HIPAA certified, we're PCI compliant, which, if you're listening to this, you know exactly what that means. But it doesn't have to mean that. Compliance can be made mostly painless, and security chaos engineering supports measuring those tools. It helps measure the effectiveness of your program, find opportunities to improve your program, communicate recent achievements, demonstrate stewardship of your resources, show how your team supported the objectives of your organization, and identify possible actions that you want others to take to improve. And by doing it all the time, it's not an exercise in finding the right evidence at the right time to show compliance. It enables you to be compliant because you're secure, not secure because you're compliant.

Security chaos engineering helps you ask better questions. Questions like: does our endpoint security stack proactively reduce the impact of unwanted activity by malicious actors? How do we minimize the impact of ransomware? Do our network security tools effectively provide defenders with the data necessary to investigate and/or mitigate unwanted activity? So we're reframing the question to ask what the most effective use of resources would be, whether it's tuning, implementation, or build versus buy, instead of just asking what's the best AV we can buy. Making things palatable for others will make your life easier, and it takes patience to do this kind of work because it's not as common as it should be, or hopefully in the future will be. You may get empathy from any SREs out there when you explain, hey, here are my challenges, how do you solve this? And they'll smile, they'll nod, and maybe smugly chuckle the first time. So get started. Start small with business goals and use examples they'll care about. Start by experimenting against controlled endpoints to validate they function the way the owners think. Give them data to build KPIs to show how awesome they are. Be nice.
If you're not nice, nobody's going to want to help you trace back why something doesn't work. It's really easy to get sidetracked by finding oddities; don't go chasing waterfalls. And beware: it's work that people don't even know they need until it's done. And success means more work, because that's really what life's all about. Life's about succeeding and generating more work for yourself. Our ultimate goal is making things better with proven, well-designed, functional systems that provide depth, increase the cost of attack, and are resistant or resilient to failures. And I believe in you. So thank you for listening and/or watching, and I hope you learned something from this. And if not, I made a terrible mistake.
...

David Lavezzo

Director of Security Chaos Engineering @ Capital One



