Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey, all. My name is Christopher Weber, and I am the director
of product and IT operations at Open Raven.
We're going to talk a little bit today about staring
into the data abyss and how we can achieve a higher level of
cloud security. Mostly so you can sleep better at night,
because I think that's really the critical piece: not having to lie awake
thinking about all of these crazy things. Because the reality
that I think a lot of us are dealing with at this point
is that there is just so much
data, and I mean so much data.
It's easy to see when we look at something
like S3. Using that as
our starting point, it's really interesting to reason about just
how much is actually out there. So, first off, think about the number of AWS
accounts you have. Even in my small organization,
we've got over 30 accounts, and each one of those ends
up with a bucket per region for AWS Config,
AWS's configuration-tracking service. Add to that things
like CloudTrail, CloudFormation, all that sort of stuff, and we haven't even started talking
about your actual data, the applications that
write it, and how all of that plays together. So it's
really incredible just how much data ends up in the
cloud, if you will. And I think it's worthwhile taking
a step back. Right.
In the old days, for those of us that have been around a little
while,
the data kind of protected itself in some way, right?
You had to get into the system before you could get access to
the NetApp filer or to the big
EMC boxes. Whether they were exposed
via NFS or sitting on some Fibre Channel loop, you had to have access. And not
just that, but the data could only grow so big
because in those environments,
you could only afford to
buy so many shelves, you could only afford to add so
many controllers, because it was so expensive.
And not to mention the upper limits of those systems, right?
You could only get so much space on a given filer.
And I think this is where it becomes really interesting
and important to think about how much the world
has actually changed, because it wouldn't be so bad with
all of this data and the unlimited ability to write,
except all those darn breaches.
And I look at
S3 in particular, but we can
talk a little bit about RDS as well, or your
Elasticsearch servers. Or pick a different cloud:
if we're talking about Google and Google Cloud Storage
or BigQuery or any of those sorts of things,
you have similar sets of problems. But at the end of the day,
it really boils down to this: we have to think
about how we protect these environments a bit better.
And I don't want to belabor the point, but I think it's
important to really think through all of the
breaches in these environments. Right? So Corey Quinn
of the Duckbill Group, as part of the
Last Week in AWS newsletter,
regularly calls out a Bucket Negligence Award.
And it's really interesting to me to
think through just how
much data gets exposed in some of these larger breaches.
And the crazy part here is that the three we're showing
are simply the first, second, and third that
I found in my inbox. There's nothing particularly
interesting about any one of these three breaches, except that
it's personal data, it's customer data. And even more
so than that, when you look at something like
breastcancer.org (say that ten times fast),
it was personal images, things that really
make a huge difference, because we need to protect
folks. So I don't
want to go in and shame any particular organization,
but we all have this as a potential, right?
We all have this data from customers that we need and have a responsibility
to protect. So let's do that, right?
The reality is that we're going to take that seriously.
So the first thing we're going to do is we're going to add ourselves some
security tooling.
I think the starting point here is a CSPM tool.
And if this were a live studio
audience, I'd ask you all to raise your hands as to who knows what a
CSPM tool is. But since I can't do that, I'm going
to go ahead and define it so that we all make sure we're using
the same meaning for the same acronyms.
CSPM is cloud security posture management.
So in a nutshell, you apply
policies, and you get alerts when things
have an incorrect configuration, or a configuration
that's not secure by some
definition. And here's the
way that plays out. We
install the tool, and you know what? We're going to
lean on people that should know these things better than us: we're going to
apply the AWS CIS Benchmark policy. For those that
aren't aware, CIS is the Center
for Internet Security, and they
do a fantastic job putting together a set of benchmarks.
We all feel good, right? We're going to know all about our environment and
it plays out really well because we're going to come back into our
CSPM tomorrow once all the policies have run,
and then we find ourselves staring into the abyss. So let's
talk about this a little bit, because anybody that's done this before
knows where I'm headed.
There are five controls
in section 2.1, which deals with the security
of S3 buckets, in the AWS CIS Benchmark policy.
And I'm only going to deal with the automated ones because these are the ones
that any CSPM tool is going to
actually evaluate against. So let's look at these. First off,
ensure all S3 buckets employ encryption at rest.
This makes sense, right? Until you realize that
there are lots of places where you wouldn't necessarily want to use encryption
at rest. For example, things that are intentionally
made public, or, my favorite, things
that have heavy read loads.
Let's just say I got to know the CFO really
well after some mistakes made with Athena
and KMS and the cost of reading
from Athena. There's a great story there at some point, so catch
up with me afterwards to dig into that. But I digress.
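To make that concrete, here's a minimal boto3 sketch, not from the talk, with a hypothetical bucket name, of what the control asks for and where the cost surprise hides: default encryption is one call, and the SSE-S3 versus SSE-KMS choice matters because every read of a KMS-encrypted object incurs a KMS decrypt request.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name, for illustration only.
bucket = "example-analytics-data"

# Default encryption is a single call. SSE-S3 ("AES256") has no
# per-request cost; "aws:kms" adds a KMS decrypt call to every read,
# which is what bites you on heavy read loads (say, Athena scanning
# the bucket over and over).
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "AES256"  # or "aws:kms" + KMSMasterKeyID
                }
            }
        ]
    },
)
```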
Control 2.1.2: ensure the S3 bucket policy is set to
deny HTTP requests. This is really a way of prohibiting
what could effectively be anonymous calls, right?
If you're coming in via plain HTTP, you're likely
not an authenticated caller, and that's what this control
wants to stop. But there are lots of reasons you might want HTTP on:
we may want to serve up images, we might want to
serve up things that come in directly over
the various protocols,
like CloudFormation, that sort of thing. So there are lots of legitimate
reasons why that may be a thing.
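For reference, the usual way to satisfy that control is a bucket policy denying any request where aws:SecureTransport is false. A minimal sketch, with a hypothetical bucket name:

```python
import json

import boto3

s3 = boto3.client("s3")
bucket = "example-app-data"  # hypothetical bucket name

# Deny any request that arrives over plain HTTP.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```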
Next: ensure MFA Delete is enabled. So,
pro tip: if you use MFA Delete, you are going to need
the root account to delete
anything that has MFA Delete turned on. This seems
really good in practice, or rather,
really good in theory; in practice it is absolutely terrible.
I don't have to explain to this group, I don't think,
why you shouldn't be logging in as the root user,
and any security policy that effectively requires you to
access the account as the root user likely
has some concerns. And then, finally:
block public access. Well, first off,
AWS, by default, when it creates buckets for you, doesn't tick
this box, and it gets really interesting when that plays out.
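If you do want the box ticked, it's a single call. A minimal sketch, bucket name hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Tick all four Block Public Access boxes yourself.
s3.put_public_access_block(
    Bucket="example-app-data",  # hypothetical bucket name
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```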
So think
a little bit about that. Here's the
reality. Based on what we just talked about,
95% to 100% of your
buckets are going to flag.
They are absolutely going to show up
as being problems, as being in violation of
that security policy. And when you get to a point where 95%
to 100% of a given asset type fails
by default, those checks
are kind of useless. It really is hard to
think of a world in which it makes sense that
everything is in violation of that policy. And I think,
for me, what's really critical here is that I now have no ability
to prioritize what's bad, because it's all bad,
right? The sky is falling.
Well, which part of the sky am I even caring about at this point?
So I think the
real piece becomes: what can we
focus on to really
drill into, and how do we think a little bit differently
about what we need to know about our data and what
information we need to be aware of for our success
in this arena? So we'll start with: where
did the data come from? There's a bit of a history piece
around this first point, and I want to call it out because a
lot of folks aren't aware of this. So, back in the day
(as I was talking about EMC and
NetApp, you probably got a good feel that I'm a little on the older
side and have been around the block a couple of times),
AWS had this thing where you could only have so
many S3 buckets for a given account.
And one of the workarounds was to store things
that were loosely affiliated, but not necessarily the same data in
a single bucket. So what you might do is put your images,
or static assets
if you will, in one prefix, maybe some customer data in another prefix,
and maybe some separate application data in
yet another prefix, because you only got so many
buckets, and the limit was in the 100-bucket range.
That limit has since been lifted,
and it was at one time a hard limit: you couldn't actually get them
to raise it unless you were super special. That's not the case anymore,
which is fantastic. But those buckets still exist,
those applications still write to those places, and it's still a thing.
What region is it in? So I think it's really important to reason
about the regionality of the data, because a lot
of times it doesn't matter whether
it's protected. You can have stuff that's protected
completely properly and still be in violation of compliance requirements,
because you've got data in a region where it shouldn't be.
Not to mention, from my perspective, it's really interesting: we've got
a map at Open Raven where you can look at your infrastructure, and one of
the first things that catches a lot of customers' eyes, and
why I'm a super big fan of it, is you look at it and go,
wait, why do I have stuff in ap-southeast-1?
I shouldn't have anything there. And then sometimes it's,
oh, we turned on AWS Config and it put a bucket there.
Fantastic. Or you hover over and look at the buckets and go,
yeah, that shouldn't be there at all. We need to go take care of that.
So I think that's a really valuable tool.
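Even without a map, you can get a rough version of that answer with a few API calls. A minimal sketch, assuming credentials that can list buckets and read their locations:

```python
from collections import defaultdict

import boto3

s3 = boto3.client("s3")
by_region = defaultdict(list)

# Group every bucket in the account by the region it lives in.
for bucket in s3.list_buckets()["Buckets"]:
    loc = s3.get_bucket_location(Bucket=bucket["Name"])["LocationConstraint"]
    # A null LocationConstraint is S3's legacy way of saying us-east-1.
    by_region[loc or "us-east-1"].append(bucket["Name"])

for region, buckets in sorted(by_region.items()):
    print(f"{region}: {len(buckets)} bucket(s)")
    for name in buckets:
        print(f"  {name}")
```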
What apps actually write into this bucket? And I'll talk about the
write piece a little bit later, but it's understanding what
apps send data to that bucket and
keeping that in mind. The other thing
is: is everything coming from automated processes, or
is the bucket being manually uploaded to? That becomes a really
interesting question. When you look at some of the breaches, a lot of
times it's not uncommon that a
backup got uploaded to the wrong spot, or to a place that
someone thought was safe but wasn't, because
they were manually uploading it and none of the other controls from
the application side were in place. And I think it's really critical to look
at that and reason through: okay, is
it a normal thing for this bucket to be manually uploaded to, such that someone
could accidentally upload the wrong thing?
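One low-tech way to see who or what is actually writing is to turn on S3 server access logging and watch the PUTs. A minimal sketch; both bucket names are hypothetical, and the target bucket has to grant S3's log delivery permission to write:

```python
import boto3

s3 = boto3.client("s3")

# Ship access logs for the data bucket into a separate log bucket so
# you can see which principals and tools are actually doing PUTs.
s3.put_bucket_logging(
    Bucket="example-app-data",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "example-access-logs",
            "TargetPrefix": "s3/example-app-data/",
        }
    },
)
```

The delivered logs record the requester, source IP, and user agent for each request, which is exactly the automated-versus-manual signal we're after here.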
From there, we really want to talk about what kind of data is in the bucket.
This seems really straightforward, and you can take a bunch of
different approaches to go figure out what's there. If there's protected
health information, if there's personally identifiable information,
you should know; hopefully you're going to want to know if it's
there. On one hand, we can absolutely
go talk to each individual person, but if you are
in a large organization,
that probably won't work super well. So you can use tools like
Open Raven or Amazon Macie to
go and classify the data that's inside the buckets.
On the Open Raven side, you can do this with your
RDS instances as well, and we're looking to expand beyond just
S3. We've got a bunch of stuff coming down the pipe,
and it's going to be exciting. But you
need to know what kind of data is there.
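If you're going the Macie route, kicking off a one-time classification job over a bucket looks roughly like this; the account ID and bucket name are placeholders:

```python
import uuid

import boto3

macie = boto3.client("macie2")

# Kick off a one-time Macie classification job over a single bucket.
macie.create_classification_job(
    clientToken=str(uuid.uuid4()),
    jobType="ONE_TIME",
    name="classify-example-app-data",
    s3JobDefinition={
        "bucketDefinitions": [
            {"accountId": "111122223333", "buckets": ["example-app-data"]}
        ]
    },
)
```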
This one always makes me laugh a little bit, because the first place we always
jump to is: who owns it? And this
would be amazing to know. Like, I would love to know who owns
the data. It's a great thing to know, but the problem, and I want
to call it out here, is that in reality you're probably
not going to know; it's going to be hard to track down who owns
it. And just because someone owns it doesn't necessarily mean they have control
over, or any semblance of understanding of, what's actually going into
the buckets. I think it becomes a lot more critical to
understand who can write to the bucket. When you understand how
data can get in the bucket, you can start from there.
So even if one team owns the
data in that bucket,
can applications that are owned by
other teams write into that bucket, and can it get accidentally used? Are there
other opportunities for people to, once again, manually upload into
it? You can use tools like Open Raven;
we've got a feature coming out, API-only now but
available in our UI soon, where you can actually go in
and ask: okay, which security principals have the ability to write
into this bucket? You can use tools like Ermetic as well, which does
a bunch of things around IAM and helps you
better understand who can read and write to a bucket. It's
so common for us to focus on who can read,
but I think the starting point should be who can write,
because that's where you can actually start to identify where your
actual risk is.
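You can approximate the who-can-write question yourself with IAM's policy simulator. A minimal sketch checking one principal against one bucket; both ARNs are placeholders:

```python
import boto3

iam = boto3.client("iam")

# Ask the IAM policy simulator whether one principal can put objects
# into one bucket. Both ARNs are placeholders.
response = iam.simulate_principal_policy(
    PolicySourceArn="arn:aws:iam::111122223333:role/example-app-role",
    ActionNames=["s3:PutObject"],
    ResourceArns=["arn:aws:s3:::example-app-data/*"],
)

for result in response["EvaluationResults"]:
    print(result["EvalActionName"], result["EvalDecision"])
```

Note that this evaluates the principal's identity policies; to factor in the bucket policy you'd supply it via the ResourcePolicy parameter, and you'd loop over every principal in the account to build the full picture.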
So I've talked about a lot of the what, right? We want to
know all of those things, and I think it's really critical
to think in a different way
about where we can start and how we
really enable teams to start taking next steps.
The first thing is: don't protect data
that doesn't need protecting. If it isn't there, you don't have to
do anything with it. So I really want to
call out a couple of things. First off, use intelligent tiering.
This is going to sound silly, but it gives you the ability
to get an alert about the state of the world that isn't directly
tied to all the security tooling. If you're using intelligent tiering
and all of a sudden a bunch of stuff starts being accessed
and changing tiers so that your
costs go up, you're going to see that. And the reality is that we're all
watching cost a heck of a lot more than the security
tooling, because while the security team is looking at
the security tools, cost is being looked at by everyone.
And so, as a result, we can use things like intelligent tiering to
save money, because things shouldn't be being accessed all the time,
and it gives us the ability to see those anomalies in
the system.
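Turning intelligent tiering on for a bucket is one call. A minimal sketch; the bucket name and day thresholds are illustrative, not prescriptions:

```python
import boto3

s3 = boto3.client("s3")

# Archive objects as they go cold. Bucket name and day thresholds
# are illustrative choices, not prescriptions.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket="example-app-data",
    Id="whole-bucket",
    IntelligentTieringConfiguration={
        "Id": "whole-bucket",
        "Status": "Enabled",
        "Tierings": [
            {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
            {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
        ],
    },
)
```

One wrinkle worth knowing: this configuration controls the optional archive tiers; objects land in Intelligent-Tiering in the first place via the INTELLIGENT_TIERING storage class at upload or via a lifecycle transition.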
The next thing is applying lifecycle rules, and this ties really closely with using data retention
rules. So lifecycle rules are the technical implementation,
right? I go into the S3 bucket and I say, hey,
after some period of time, delete this thing.
Data retention rules are the business side of that, right?
It's the: hey, we're dealing with healthcare data,
so it must be kept for 24
months, or five years, whatever it happens to be. But at five years
and one day, we can get rid of it, and we should get rid of
it. And so the real key becomes: can you use
something like lifecycle rules on those S3
buckets to remove that data, so that you don't end
up having to protect it going forward?
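The technical side really is that small. A sketch of an expiration rule; the bucket, prefix, and five-years-and-a-day window are placeholders for whatever your retention policy actually says:

```python
import boto3

s3 = boto3.client("s3")

# Expire objects once the retention window is up.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-app-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "retention-expiry",
                "Status": "Enabled",
                "Filter": {"Prefix": "records/"},
                "Expiration": {"Days": 5 * 365 + 1},  # five years and a day
            }
        ]
    },
)
```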
There are also some great conversations to be had about keeping
data that you don't need and how
it plays into legal things like discovery and whatnot.
That's a bit broader than this talk goes into, but more
than anything, there's no reason to protect things that
don't need to exist. So get rid of it, so
that you're not protecting things unnecessarily.
Manage your riskiest buckets first. I think it goes without saying that
public buckets are going to, by definition, be the riskiest.
The problem is that we normally stop there
in our conversations. It's a good starting
point: hey,
go start there. But also look for a couple of other things. Look for broad write
permissions. If you can track down places
where everybody and their brother is able to write into
an S3 bucket, you've probably got a problem, because it's much
easier for something to be exposed there than
it would be if only two or three applications, and no
human users, are able to write into that S3
bucket. So that becomes a really important thing. And then one of
the things that we found in our environment is backups
aside, one of the real indicators that you've got actual legit
data somewhere is lots and lots of small files, whether it's lots
of images being uploaded from customers, whether it's
JSON, that sort of thing. A large number of files tends
to be an indicator that there's some automated process
putting data in there, and that's a really
good place to start, because it's actual data coming from
customers and not just a dump of
some source code archive out of npm
or something like that. We see all sorts of fun things,
but I think really the biggest thing is: focus on those large numbers
of small files as a good place to start, and hone in on
managing for risk.
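That heuristic is easy to script. A minimal sketch that flags buckets full of small objects; the thresholds are arbitrary assumptions you'd tune for your environment:

```python
import boto3

s3 = boto3.client("s3")

# Arbitrary thresholds for "lots of small files"; tune for your world.
MIN_OBJECTS = 10_000
SMALL_BYTES = 256 * 1024

def looks_like_real_data(bucket: str) -> bool:
    """Flag buckets holding many small objects, a hint that some
    process is steadily writing real data (images, JSON, etc.)."""
    count = small = 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            count += 1
            if obj["Size"] < SMALL_BYTES:
                small += 1
    return count >= MIN_OBJECTS and small / count > 0.8

for b in s3.list_buckets()["Buckets"]:
    if looks_like_real_data(b["Name"]):
        print("worth a look:", b["Name"])
```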
Ultimately, I think for me the biggest thing
is this, and yes, I get it, I work for Open Raven,
and there's a reason why I do: I believe that data
classification, being able to
understand what is actually out in the world,
matters.
It's so critical to be able to go out and say,
this is what's in that bucket. And you can start
really simply. Whether it's using Open Raven,
whether it's using Macie, go do
some scans and understand what you've got out there. From
there, run those scans regularly, making sure
that you are actually checking for things. And one of the cool things
we do at Open Raven is cache the results:
if the object's ETag hasn't changed, we're not going
to rescan that object in S3, because we know it hasn't
changed.
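If you're rolling your own scans, the same trick is cheap to implement. A minimal sketch of ETag-based skip logic, where scan_object stands in for whatever classifier you actually run:

```python
import boto3

s3 = boto3.client("s3")
seen_etags: dict[tuple[str, str], str] = {}  # (bucket, key) -> ETag

def scan_object(bucket: str, key: str) -> None:
    # Placeholder for whatever classification you actually run.
    print("scanning", bucket, key)

def scan_bucket(bucket: str) -> None:
    """Rescan only objects whose ETag changed since the last pass."""
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            cache_key = (bucket, obj["Key"])
            if seen_etags.get(cache_key) == obj["ETag"]:
                continue  # unchanged since the last scan; skip it
            scan_object(bucket, obj["Key"])
            seen_etags[cache_key] = obj["ETag"]
```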
And then, more than anything,
you need to have rules in place to alert on the things
that are actually critical. You want
to know if you find European data in
a US region. You want to know
if you find PII in a bucket that's
open. And that's the real critical differentiator:
it's not that you found PII, it's not that
you have an open
bucket, it's that you have PII in
an open bucket. It's those sorts of things that really
provide the value.
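That kind of rule is just a conjunction over signals you've already collected. A toy sketch; the Bucket fields are assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Bucket:
    # Illustrative fields; in practice these come from your scans
    # and configuration checks.
    name: str
    region: str
    is_public: bool
    data_classes: set[str] = field(default_factory=set)

def alerts(b: Bucket) -> list[str]:
    """Alert on combinations of findings, not on each finding alone."""
    found = []
    if b.is_public and "PII" in b.data_classes:
        found.append(f"{b.name}: PII in an open bucket")
    if b.region.startswith("us-") and "EU_PERSONAL_DATA" in b.data_classes:
        found.append(f"{b.name}: European data in a US region")
    return found

print(alerts(Bucket("example-app-data", "us-east-1", True, {"PII"})))
```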
So, to summarize, I think the real key is these three things.
Turn on intelligent tiering; it will
get more eyes on the problem, because if costs
bump heavily, you'll know that, hey, data that shouldn't be
accessed is being accessed. Classify your data; go figure out
what you've got. And then use those retention policies
and lifecycle policies to delete the stuff you don't need. Ultimately,
that's really going to be the game
changer for you going forward.
So with all of these things said,
I want to thank you for joining me for my talk. I can be
found on the interwebs, you can find me on Twitter, hit me up via
email, and I'm trying the new Mastodon thing. We'll see how that plays out.
But I hope you've enjoyed this talk, and I'm looking forward to catching
up with you in Discord.