A New Architecture to Free Incident Responders from False Positives

Abstract

Security tools base detection on the idea of “protect the conduit to protect the data inside”. But can we trust the correlation between APIs and data? High false positive rates are killing SOC teams. A new approach is presented that avoids correlation to maximize the ratio of real incidents to false positives.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hi, everyone. I'm Rob Quiros, CEO and co-founder of Caber Systems, and I'm here to talk to you today about how we can fix the high false positive rate you are all dealing with every day. I hope you'll find this talk valuable, and I invite you to connect with me via email or LinkedIn if you do. I spent 30 years on the vendor side of IT and security at the likes of Cisco, Riverbed, and Akamai. After helping build one of the first startups in the SASE, or Secure Access Service Edge, space, I realized our approach to cybersecurity today is completely broken.

In 2023, the false positive rate, according to the SANS Institute, hit 95%. That was up 20% since 2019. And we also missed 20 percent of all incidents despite that. And it can sometimes take years before we detect a compromise. So what's the problem here? One of the issues is that, collectively, companies spend 60 billion on tools to find and alert us on indicators that we might be compromised. And you already know false positives are when we see alerts on indicators that turn out not to be an actual compromise. So how do we tell the difference between an indicator and a true positive, an actual compromise? This goes back to security 101. It's when the indicator leads us to an actual compromise of the confidentiality, integrity, or availability of data.

As information security professionals, our job is to protect data. So why aren't we detecting compromises to data directly instead of looking for indicators? The newest version of the NIST Cybersecurity Framework calls out the same protections for all states of data. But the only one of those states where we can directly protect data is data at rest, when it's on a storage system. In fact, when we talk about data security products and tools, we're only talking about securing data at rest, and that's because storage systems are the only place where data and its ownership and permissions can be directly linked for access control.
So what do we do for access control in our cloud native applications, for data in transit and data in use inside those applications? We don't even know what the permissions are. We don't know who owns the data inside the APIs, because we don't copy that information when those APIs copy the data from storage systems. Instead, we rely on controlling access to the APIs, thinking that somehow that's going to stand in as a proxy for controlling access to the data. It's like we're arbitrating access to bags that are labeled sand and thinking that because it says sand on the outside, the bag contains sand on the inside. But we have no idea what the bags contain. There's no correlation between API parameters and what the data inside is, who owns that data, and who can access it. In fact, we might as well be calling zero trust network access for APIs "infinite trust, zero access", because that's what we're doing. We're completely trusting these header parameters, the labels on those bags, to tell us what that data is.

And our so-called data leakage prevention tools aren't any better. They tell us what data looks like, not who owns it or the permissions on who can access it. They can't tell you if it's your data or my data, and that's where we get into trouble. We have no way of effectively doing authorization and access control on data in APIs, and the top security vulnerabilities, no surprise, are authorization failures. So in order to keep you from getting access to my data, I need to know what is your data and what is my data. It can't be just a determination of, hey, this data contains PII, or it looks like health care information, because my health care information can't be distinguished from your health care information. My PII could be yours, or could be somebody else's. What we have to have is deterministic controls. Access control has to be deterministic.
We need to know the identity of the data, its ownership and permissions, like we know the identity of the services and users that access it. That's what we need to control data in use and data in transit, and that's going to help us solve this false positive problem. A lot of people have called this kind of direct control over data in use the holy grail of security, because with it, we can solve so many problems today. We can address broken object level authorization, privilege escalation, SQL injection, and other vulnerabilities in our applications just by checking authorization, rather than having to look for chains of indicators that these things may be happening. Moreover, instead of taking months, or in fact years, as in the case of First American, to find a compromise in our systems, we can detect them right away. We can detect them as they're happening, because they become authorization failures.

So why hasn't this happened before? Why haven't we been able to do this? There are three particularly hard problems that have to be solved. The first I've already mentioned: our APIs don't send permissions with data. We read the data, we copy it into an API or a service, we move it to another service, and we leave the ownership and permissions behind. We're expecting that our knowledge of how the software works is going to allow us to tell whether or not these APIs are authorized to move that data.

But that gets us to the second problem: service accounts. These services themselves hide the end user identity. If you have an end user that authenticates to a front end, that front end doesn't use the user's identity to make a request to the back end to get data from storage. So the storage system never knows who is ultimately going to get access to the data. It only knows that the back end, with broad access rights, read some data and passed it to the front end.
The front end doesn't know what the permissions are and passes it to the user. There's no place within our applications where the user permissions, the data permissions, and the user identity come together for authorization.

And the third problem is a little more subtle. It comes down to data deduplication. Common chunks of data, byte sequences that exist in multiple different documents and in multiple different records and databases, form the data that we move in these APIs. We can't actually tell, looking at the data itself, whether it came from, say, document A with policy A or document B with policy B. So we need a way to figure out not just what object it came from, but the total lineage of this data, so we can rationalize and determine what those permissions should be, whether they should be policy A or policy B.

This is the critical problem that we're facing with generative AI. We take documents, ingest them without permissions, and put them in a vector database, and then we expect that we can do some sort of access control as those chunks of data are pulled out of that vector database in RAG systems. But we can't. And even copying the permissions, as I mentioned before, doesn't work: because of data deduplication, we're not going to get the permissions on each chunk as they should be; we get the permissions of the last document that chunk was in. So we may be redacting information from users that they should have access to. Moreover, we can mix these chunks together in a single API call. Then how do we work out what the permissions of the data in aggregate should be?

So we need a new architecture to fix these problems. We need to be able to identify data as it's moving in API payloads, and we've tried this in the past. We've gone down the road of enterprise digital rights management. Google's come up with their own solution for this.
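The deduplication problem described above can be made concrete with a small sketch. This is not Caber's implementation: the document names, readers, chunk contents, and the "allow if any source is readable" policy are all assumptions chosen for illustration. It contrasts copying permissions onto deduplicated chunks, where the last document wins, with recording every source a chunk was seen in.

```python
from hashlib import sha256

SHARED = b"Standard confidentiality notice."
SECRET = b"Board-only acquisition details."

# Hypothetical corpus: each document has an allowed-reader set and a list of chunks.
DOCS = {
    "handbook":   {"readers": {"amy", "bob"}, "chunks": [SHARED, b"Vacation policy."]},
    "board_memo": {"readers": {"amy"},        "chunks": [SHARED, SECRET]},
}

def chunk_id(chunk: bytes) -> str:
    """Identify a chunk by the hash of its bytes, as a deduplicating store would."""
    return sha256(chunk).hexdigest()

def ingest_last_writer_wins(docs):
    """Copy each document's readers onto its chunks; duplicate chunks get overwritten."""
    index = {}
    for doc in docs.values():
        for chunk in doc["chunks"]:
            index[chunk_id(chunk)] = set(doc["readers"])
    return index

def ingest_with_lineage(docs):
    """Record every source document a chunk was seen in."""
    lineage = {}
    for name, doc in docs.items():
        for chunk in doc["chunks"]:
            lineage.setdefault(chunk_id(chunk), set()).add(name)
    return lineage

def can_read(user, chunk, lineage, docs):
    """Assumed policy: allow if the user can read any document containing the chunk."""
    return any(user in docs[name]["readers"] for name in lineage[chunk_id(chunk)])

flat = ingest_last_writer_wins(DOCS)
lineage = ingest_with_lineage(DOCS)

print("bob" in flat[chunk_id(SHARED)])         # False: Bob is wrongly denied shared text
print(can_read("bob", SHARED, lineage, DOCS))  # True: the handbook is readable by Bob
print(can_read("bob", SECRET, lineage, DOCS))  # False: only the board memo contains it
```

Which policy to apply when a chunk has several sources is a design decision; the point of the sketch is only that the decision needs the full set of sources, which the overwritten index no longer has.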
My company, Caber, is proposing a new solution. But let's look at the previous ones first, starting with EDRM. EDRM is a cryptography solution. Basically, we're encrypting the data that we're moving in these API payloads, and as they're reconstituted on the other side, the users that have the keys we've passed out to decrypt that data can access it. The ones that don't, can't. And it's a pretty effective way of doing access control on a per-object basis. But now let's shift our perspective and look at what happens when an API pulls a chunk of data out of a document and then moves that chunk to another API that ultimately gets it to a user. It starts to get pretty messy from there. If the service that extracts the chunk wants to send it securely to an end user, it has to send it back to some central crypto authority to have it re-encrypted with a private key. We get that chunk back, then we have to send the public key to the right people that have access to those chunks, which may not necessarily be the same people with access to the entire document. And as the number of chunks and the number of services grow, this becomes simply untenable. Imagine how many encryptions, decryptions, and re-encryptions we'd have to do.

So Google came up with a different solution to this. As they were putting together Google Cloud, they were thinking about data security in their BeyondProd project, and coupled with Zanzibar, which is their central authorization engine, they came up with a new way of doing data security between their microservices. What they've done is effectively what you would do if you decided to build data security correctly from the start: you actually pass the permissions of all the data that you are carrying in your APIs all the way down to the final point where those APIs are accessed by an end user, and you pass the user credentials all the way up, so that at every point where the data and the user identity meet, you can authorize access, be it the API's access or the user's access to that data.

Now, that's a great solution. The problem is that they had to re-engineer their services to be able to do this. And if you think about modern applications and enterprises that use hundreds, if not thousands, of services, many of which are open source or third-party services, how are you going to re-engineer them all to pass the credentials of the users or the permissions of the data? For most companies other than Google, that's a complete barrier to entry. They won't be able to make use of this solution.

So that's why at Caber we came up with a third option based on data lineage tracing. Effectively, we're looking at the byte sequences in every API payload around an application, tracing them back to all the source objects they could possibly have come from, and then building a map that allows us to determine which source's permissions those sequences should inherit. We're leveraging some basic technologies like content-defined chunking to make sure that the sequences we see in one API payload are the same as the chunks we see in other API payloads. That allows us to connect the dots between all of these APIs to build those relationships I mentioned before.

And that map is extensive. This is an example of how we look at data within an application: the chunks of data themselves, which are the small blue dots, the objects that hold them, and the APIs that move them. From this, we can eliminate the junk data that doesn't carry permissions or doesn't contribute to our ability to authorize access.
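To illustrate the content-defined chunking mentioned above, here is a minimal sketch using a gear-style rolling hash. It is not Caber's chunker: the gear table, the cut mask, the random test document, and the JSON-like payload wrapper are all assumptions for illustration, and production chunkers typically also enforce minimum and maximum chunk sizes. Because each cut decision depends only on the most recent 32 bytes, an excerpt embedded in a different payload tends to produce the same chunks as the source document.

```python
import hashlib
import random

# Assumed gear table: one pseudo-random 32-bit value per byte value.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]
CUT_MASK = 0xFC000000  # cut when the top 6 bits are zero, about once per 64 bytes

def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data at positions chosen by a rolling hash of the preceding bytes."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes shift out of the 32-bit state, so h covers the last 32 bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h & CUT_MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut point
    return chunks

def fingerprints(data: bytes) -> set[str]:
    """Hash each chunk so chunks can be matched across payloads."""
    return {hashlib.sha256(c).hexdigest() for c in cdc_chunks(data)}

# Hypothetical source object and an API payload that embeds an excerpt of it.
_doc_rng = random.Random(1)
document = bytes(_doc_rng.getrandbits(8) for _ in range(4096))
payload = b'{"preview": "' + document[1000:3000] + b'"}'

shared = fingerprints(document) & fingerprints(payload)
print(len(shared) > 0)  # chunks of the excerpt match chunks of the source document
```

Fixed-size chunking would not give this result, since the excerpt starts at a different offset in the payload than in the document and every block boundary would shift.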
And we can narrow down to just the data for which we know the permissions, so we can authorize access at any given point within the application. By doing that sort of graph analysis and coupling it with time series analysis, looking at event ordering and eliminating impossible paths, we can effectively do this. And the result, if we look at the management console of Caber's product, is that we can detect access to sequences of data that come from objects, even if those sequences are only a small fraction of the content that came from the object in the first place. In this case, we have Bob as a user, and Bob is accessing some data from an object owned by Amy. It's through an API where he got 7 byte sequences out of about 10,000 that came from her document.

This is very powerful, fine-grained access control for data. We don't need to see hundreds of kilobytes of data to be able to provide access control on it. We can do it on a very fine-grained basis, even down to 30, 40, or 50 bytes, as in the case of this demonstration that I'm showing here, where we're providing access control directly on the chunks that are coming out of a RAG system. In this slide, we're using Apple 10-Q statements, financial filings that Apple has made with the SEC. The user on top has access to every year's financial statements, but let's consider the 2024 statements to be material nonpublic information that's only accessible by the CFO on the top. The regular users on the bottom can ask questions about 2024, but as you can see, we can tell the difference between the data that exists in the 2024 statements and what exists in the 2023 statements, even though those two statements, if you put them side by side, have a lot of data in common.
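The 2023 versus 2024 filing scenario can be sketched as a filter on retrieved chunks before they reach the LLM. Everything here is an assumption for illustration: the chunk texts are placeholders rather than real filing content, the role names are invented, and exact-match hashing of whole strings stands in for the byte-sequence matching described in the talk.

```python
from hashlib import sha256

def fp(text: str) -> str:
    """Fingerprint a chunk by hashing its exact bytes."""
    return sha256(text.encode("utf-8")).hexdigest()

# Placeholder chunks; the first two are shared between both filings.
Q_2023 = [
    "Management's discussion boilerplate.",
    "Forward-looking statements disclaimer.",
    "2023 placeholder revenue line.",
]
Q_2024 = [
    "Management's discussion boilerplate.",
    "Forward-looking statements disclaimer.",
    "2024 placeholder revenue line.",
]

# Only sequences that appear in the restricted filing and nowhere else are restricted.
RESTRICTED = {fp(c) for c in Q_2024} - {fp(c) for c in Q_2023}

def filter_context(role: str, retrieved: list[str]) -> list[str]:
    """Redact restricted chunks from RAG context unless the role is 'cfo' (assumed)."""
    if role == "cfo":
        return list(retrieved)
    return ["[REDACTED]" if fp(c) in RESTRICTED else c for c in retrieved]

print(filter_context("analyst", Q_2024))
print(filter_context("cfo", Q_2024))
```

In this sketch the analyst still sees the two shared chunks, because they also exist in a filing the analyst is allowed to read; only the chunk unique to 2024 is held back.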
There's the boilerplate, the management discussions; it's really only the distinguishing characteristics of these documents that matter. And those are the sequences that we're redacting from this user and holding back from the LLM that's about to give them knowledge based on them. And this access control is deterministic. We know for a fact that we have a one-to-one exact match of those byte sequences to the data they contain and the source they came from. For this determinism, we need two or more observation points. We need to be able to see the data in two different places. This is unlike what you would have with DLP, where you're just looking at the data itself and trying to decide on the spot what it looks like. We need multiple sources that we can connect together to determine the identity of that data. That's the secret sauce for this.

Now, why we're doing this, again, is to be able to detect incidents without false positives. I'm going to be honest here: these are hashes, and we're still subject to hash collisions within a given environment. But the probability of a hash collision is 0.00001 percent or so. Compare that to the 95 percent false positive rate that we have today, and you see this completely flips the model.

So what can we do with this? Beyond just detecting access problems, Bob accessing Amy's data, what we want to be able to do is use this data not just to detect problems, but to solve them. And that's what we can do by tracing the path and the movement of these byte sequences through every API call that carries them. Here I'm showing a call graph of what happened in that alert we generated with Bob accessing Amy's data. And we can see exactly what happened.
Starting with Amy uploading a document that gets passed through this NuCo Next Cloud service and stored in an S3 bucket, we can then see that there's this service, NuCo Disrupt, at the bottom that accessed that data, extracted some content out of it, and wrote it back to a different place: a shared bucket where Bob can get access to it and see a preview of the text of the document she uploaded. By being able to trace this path, we have the full and complete picture of how the incident happened. But not only that, we know how to stop it. We can prevent this just by knowing that the mechanism by which he was able to get access to that data was this NuCo Disrupt service. Now think about that service as a piece of malware in your system, or as a bad actor that was able to escalate their privileges, gain access to an instance, and then read the data from one bucket and write it to another. These sorts of errors go undetected because they're all authorized accesses when you look at them individually, but they're unauthorized when you look at them from an end-to-end data perspective and know that the data Bob got came from an object owned by Amy.

But this ability to trace data through a series of API calls gives us an even more interesting window into application behavior. Beyond just detecting how an incident happened, think about black box testing, where you look at the inputs and outputs of a given software package to determine what that software package is doing, what its function is. Being able to trace how data flows, and to identify what that data is as it moves through these different services, gives us the ability to do that as well. And so what does this mean?
It means that as we're building our applications and our policies, we can know exactly what's happening across all of our services in terms of actual data flow, and then use that knowledge to build the policies we need to control that data flow. In fact, we're able to take this data along with some high-level policies, such as "prevent Bob from gaining access to Amy's data", feed that to a generative AI, and get back actual policy implementations and where to put those policies to enforce that compliance requirement. We think the direction we can take with this level of data identification, the granularity we can provide, and the rationalization of authorization on access to that data everywhere within an application is going to open up a wide new arena for security. We think this is the key to eliminating false positives in applications going forward.

Thank you very much for your time. I really appreciate it, and I look forward to hearing from you. If this is of interest, if you think this will solve a problem within your environments, let me know. I'd be really interested in hearing from you. Thanks again.
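The path tracing described in the talk's Bob and Amy example can be sketched as time-ordered propagation over observed API hops. The node names, timestamps, and chunk labels below are hypothetical and do not come from the demo; the sketch only shows how event ordering rules out impossible paths while each individual hop still looks authorized.

```python
from typing import NamedTuple, Optional

class Hop(NamedTuple):
    t: int                   # observation time; assumed strictly ordered
    src: str                 # where the payload came from
    dst: str                 # where the payload went
    fingerprints: frozenset  # chunk fingerprints seen in the payload

# Hypothetical observations; "c1".."c9" stand in for chunk hashes.
HOPS = [
    Hop(1, "user:amy", "svc:file-frontend", frozenset({"c1", "c2", "c3"})),
    Hop(2, "svc:file-frontend", "bucket:private", frozenset({"c1", "c2", "c3"})),
    Hop(3, "bucket:private", "svc:extractor", frozenset({"c1", "c2", "c3"})),
    Hop(4, "svc:extractor", "bucket:shared", frozenset({"c2"})),
    Hop(5, "bucket:shared", "user:bob", frozenset({"c2"})),
    Hop(6, "user:bob", "bucket:shared", frozenset({"c9"})),
]

def trace(fingerprint: str, origin: str, recipient: str, hops) -> Optional[list]:
    """Return the path a chunk took from origin to recipient, or None if it never arrived."""
    reached = {origin: [origin]}
    # Processing hops in time order means a node can only forward data it already holds.
    for hop in sorted(hops, key=lambda h: h.t):
        if hop.src in reached and fingerprint in hop.fingerprints and hop.dst not in reached:
            reached[hop.dst] = reached[hop.src] + [hop.dst]
    return reached.get(recipient)

# Assumed ownership record: Amy's object may only be read by Amy.
ALLOWED_READERS = {"user:amy"}

path = trace("c2", "user:amy", "user:bob", HOPS)
if path is not None and "user:bob" not in ALLOWED_READERS:
    print("ALERT:", " -> ".join(path))

print(trace("c1", "user:amy", "user:bob", HOPS))  # None: c1 never reached Bob
```

The returned path also names the hop to block: in this made-up graph, the extractor service writing to the shared bucket is the only link between Amy's object and Bob.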

Conf42 Incident Management 2024 - Online

October 17 2024 - premiere 5PM GMT

Rob Quiros

Founder @ Caber Systems
