Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, and welcome to my talk: who's going to secure the code the army of robots is going to be writing?
So, my name is Arshan Dabirsiaghi. I've been at
the intersection of code and security my
whole career, around 20 years. And all of
these bullets can be summarized as I've always been
fascinated with the idea of taking over your computer against your will
and all the different aspects that come out of
that, including protecting
your data. And so let's talk about
generative AI and how I see from my
vantage point how that's going to affect the security
of our software. So if
you look at the studies that have been put out by the major
players, you would be
led to believe that the tools we have today for helping us code, which largely take the form of these fill-in-the-middle models like GitHub Copilot, help us produce 25% to 60% more coding throughput. Now, these numbers, really, they depend on how you measure
and what is the activity. I think a lot of people
say that the acceptance rate is between 20% and 30%
for GitHub Copilot's suggestions. I haven't had a chance to play with the other tools, but these fill-in-the-middle models, really, you can think of them as autocomplete, right?
And just with autocomplete, we have to
acknowledge that throughput is higher.
And now we can argue about the numbers and what they mean, but with higher throughput come downstream consequences, some good, some bad. But right now, what most people
are using, what's been adopted as of now, has been this
autocomplete kind of feature.
And more recently, we've seen coding assistants
jump into our IDE, things like
Magic and GitHub Copilot in the IDE. And so these things are pretty new, so adoption is not through the roof yet. And what we saw on the autocomplete side was this improvement to throughput, I don't want to say modest, but one standard deviation away from what we do today. Right now, if we
have an assistant that is drafting whole
sections of code, whole files of code, you can see in this
example, we're asking Copilot to create a new button component,
and it's able to do that. And so
if a fill-in-the-middle model can deliver us 25% to 60% more throughput, we have to ask ourselves,
what is the throughput that a drafting assistant
like this could do? And the studies haven't been done on
this yet. The studies really are just coming out now for the fill-in-the-middle models, so we're left to guess here. I'm guessing sort of out of thin air here, 100%, just for argument's
sake. And if you watch the GitHub keynote,
just a few weeks ago,
you would have heard that GitHub has said that they're
refounding the company based on Copilot,
and they demoed something at the end, as their "just one more thing," that was really
impressive. And so they showed essentially
somebody opening up an issue and saying, hey, I want to add this feature.
And the GitHub Copilot feature that they were
demoing created a plan, as you can see here,
went through the code and changed files,
and will try to auto-fix things like compilation errors.
Or from the demo, it seemed like it was going to try to solve small
kinds of errors on its own. And so now
this is a tectonic shift from fill-in-the-middle, right? This is taking requirements, sort of guessing at the requirements, guessing at how those requirements translate into code changes, performing multiple file changes, and then iterating on those changes to get them to a working state.
And so if autocomplete can
give us these modest numbers, what can a feature
like this do?
We're again left to speculate. Seems like a lot.
So there have also been studies on the
models to help establish that they're
not that good at security. In fact, they're kind of poor.
The models produce insecure code, right? The statistics on the right help us understand
that. They've done studies where they
had a control group that didn't use a coding assistant,
and then a group that did use an assistant,
and they found that consistently, the group that used the assistant produced
more insecure code and perhaps more
dangerously, believed that their code was more secure
than the control group.
This also reflects my experience testing the top
commercial offerings. If you
ask the models about SQL injection,
it's fairly competent. SQL injection is a super common issue. There's a ton of literature out on the
Internet about it. But if you ask it about a vulnerability
class that's just slightly less popular,
I have not found it to be competent at all in
delivering fixes or analyzing whether its own code is secure. It can't reason about these issues with any level of competency. So our experience is that basically the LLMs don't produce secure code. Now we have
this issue where, and maybe this shouldn't be surprising, the LLMs are trained on human code, which contains bugs. And so in fact, they're overtrained, right? I mean, the data set that they are trained on is all insecure code, because that's the code that's in GitHub, that's publicly available. And so it's not a surprise that the LLM then produces insecure code. So if you wanted the why or how of them producing insecure code: the models are really a mirror held up to the existing, vulnerable human code.
So then the question was posed
to me, can't the models just generate secure code?
Can we teach them to do that? Maybe.
And this is harder than people think.
And to understand why, you need to understand the nature of a vulnerability.
And so if we look at a vulnerability that takes
place, all these blue bubbles are little pieces of code.
User input actually comes into the system here and then it bounces
all around the system. And so the system is not just what's
in your GitHub repo, right? The system, the application is
a combination of your custom code libraries, frameworks,
runtime, third party services, et cetera. And so
as the data, untrusted data kind of flows around your
system, you'll see that eventually,
in this example, it reaches a place in the runtime where it shouldn't.
And so lots of vulnerabilities can be modeled this way.
SQL injection, cross-site scripting. Many of the medium, high, and critical vulnerabilities look this way.
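Just as an illustration, here is what that flow can look like when, unlike in the diagram, it happens to live entirely in your own repo. This is hypothetical toy code, not from any real application:

```
# A toy example of a vulnerability as a data flow: the untrusted value enters in
# one function, passes through an "innocent" helper, and only becomes dangerous
# at the sink, where data turns into code (SQL) inside the runtime.
import sqlite3

def handle_request(params: dict) -> list:
    # Entry point: untrusted user input arrives here.
    return find_orders(params["customer"])

def find_orders(customer: str) -> list:
    # Business logic: nothing looks wrong in isolation, it just passes data along.
    return run_query(f"SELECT * FROM orders WHERE customer = '{customer}'")

def run_query(sql: str) -> list:
    # Data layer: the sink. The injection happens here, far from the entry point,
    # and partly inside library and runtime code you don't own.
    conn = sqlite3.connect("shop.db")
    return conn.execute(sql).fetchall()
```

No single function is obviously broken; the vulnerability only exists across the whole flow, which is the point of the diagram.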
So it's tempting to just say, well, we'll just
take the whole code base and shove it into the context window and
try to reason about the safety, the security of
the code. And you might notice that a lot of this
vulnerability data flow isn't even in your code.
So that's one problem. And then also code bases are
just way too big. The most popular, biggest context window today, I'm not even sure if it's available as a public offering yet, but there is a 100K context window from Anthropic, if I'm remembering that correctly. But 100K tokens is not going to be enough. It might be able to fit a microservice with one endpoint in it, but when you look at the manifest files, the data files, all the code, we're going to need context windows in the millions of tokens in order to fit most apps.
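To put rough numbers on that, these are back-of-envelope assumptions on my part, not measurements:

```
# Back-of-envelope arithmetic, assuming roughly 10 tokens per line of code.
tokens_per_line = 10             # assumption
context_window_tokens = 100_000  # the large window mentioned above
app_lines_of_code = 500_000      # assumed mid-sized enterprise app plus manifests, config, data files

lines_that_fit = context_window_tokens // tokens_per_line  # ~10,000 lines: a small microservice
tokens_needed = app_lines_of_code * tokens_per_line        # ~5,000,000 tokens: "millions" is the right order of magnitude
print(lines_that_fit, tokens_needed)
```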
So the next, maybe most obvious step
would be to take all the code and cram it into embeddings,
which are another way that we augment LLM usage
in order to give it knowledge about other
things that are big. I know that's a very simple way of thinking about
it. But, for instance, you might feed an LLM all
of your docs in order to make a useful chat bot.
And so you would give it the ability to sort of search over that space,
and that works relatively well. But searching
is different from reasoning. And so if we just were to cram all of the code into embeddings and then ask it to connect these dots for us, that hasn't worked in my experience, and we haven't seen anything from the wider marketplace suggesting that it's possible yet, either.
And generally,
models are confused by very
long series of events where you have
to reason across many different steps.
And most of these issues involve many steps
and sometimes many variables. So adding those two dimensions, LLMs tend to deliver less, in my experience, on those types of problems, where they have to keep track of the history of two variables across all these different events. It's quite difficult. If we wanted to solve this problem, I think we'd have to figure out how to make these problems smaller for the LLM. And just as a
point of reference,
we've tried to build static analysis tools, code analysis tools,
that are purpose built for exactly this problem,
and they can't do this fast and accurately. And so the hope that a very general-purpose model, working at the speed of inference, can get this right, when we can't even get it right with really purpose-built tools, feels very far-fetched to me.
So I want to show you, this is a
diagram put out by Pedro Tutti to talk about
what is the secure development lifecycle, where does security
fit in? And so what
you see here is this is
a company that takes security very seriously.
Whoever actually works according to this diagram has a range of activities: there are sort of things you do at the front end, things you do once to establish the context of that app for its lifetime, and there are things that you do
continuously, either during development,
testing, and then in production as well. And so they've
labeled here some of the things that are manually performed with this little m.
And I'm going to tell you that this
graph, this diagram, is way under-labeled; it doesn't have nearly enough
M's. And I'll give you an example. So, in the beginning,
yes, you would do threat modeling once. Threat modeling is an exercise where you sort of look at the inputs and outputs to the system, you look at the third party systems it connects to, and you try to predict ahead of time: what are the different threats you might face? What are the controls you want to make sure you have? And so this is an expert-led, human
process, usually looking at a
combination of cloud infrastructure and
literally papers to try to do this process.
So this is obviously very human, very manual. You're asking
a ton of questions when you lead a threat model to try to
enumerate the real world picture of this app.
And so, of course, this is going to happen one
time. Very unlikely it'll happen a lot after that.
But every time you
commit some code, your pipeline runs. You're going to run a static analysis
tool. You're going to need to look at the results of that when something is found, right? And so now imagine
we have all the generative AI,
all the robots are working, they're cranking out code. What are the robots
going to do when they find a static analysis finding?
Well, let's talk about what the humans do. But first, we're going to put m's
here, because now we've said, look, when we do
static analysis, scanning of the code, and there's a finding, we need to do
something about it, we need to triage it, we need to possibly fix it,
we need to do something. And so all of these activities,
static analysis, software composition analysis, which is
looking at your libraries that you have, especially new
libraries coming in. We have dynamic analysis, which is kind of
like fuzzing your web application from the outside, or your REST API. We have IAST, which is watching the internals of the system running.
You're scanning any containers that you would have built for vulnerabilities and
sort of the infrastructure of your app.
If you're lucky enough to work in an organization that has penetration
testing, you're also going to have pen testers looking at your app
sort of occasionally. And so hopefully you can see the through line here: we have a lot
of security processes and we have a lot of security
technology, and the process for acting on these
results is very manual. And so we
list some of the human interventions here.
And it's interesting to note that these activities are not just strictly a developer's. There's a lot of developer stuff happening here, but there's also some product management stuff, talking about trade-offs, there's compliance stuff, there's security engineering. There's a lot of activity here across a couple of different disciplines in order to make this secure software factory
work. Now, again,
what are we going to do when the robots come?
Because the question is going to be, are we going to slow down the
software factory in order to accommodate inserting humans in all these places? And typically, when somebody says go fast or go secure, businesses choose to go fast, because they have to compete, and they feel like they can't tie one hand behind their back, for a lot of different reasons.
And so just today
I want to talk about how our programs are limited. I think developers often
think that security is somebody else's job,
and to a certain degree that is true.
But I just want to give you a glimpse behind the
wall here. Developers, at least some studies say, outnumber security 100 to 1. My experience is, if you go to a giant bank or a giant financial institution, these ratios are much worse. I wouldn't want to speculate on what the numbers are, but it's at least maybe an order of magnitude off of this.
Also, the humans we have aren't cross-skilled. So if you think of it very simply, and we just say there's developers on one side and there's security on the other: security understands risk pretty well, in a way that the developers don't, to be honest; they think about security differently. But security often doesn't have the skills to pitch in directly or to review findings very deeply. They just don't have that skill set; a lot of times they don't come from an engineering background.
developers don't have great security skills.
So they don't understand vulnerability classes that well,
and we shouldn't expect them to because security is a
tough, complex, fast moving field with every
vulnerability class is its own interesting
rabbit hole. And so they don't have that
muscle memory to do a really great job
at working through vulnerabilities. And that's what on their own. And that's
why often we'll see developers struggle
to fix vulnerabilities after one
iteration. So if you read bug
bounty reports, you'll often see like the developer fixes something,
but the attacker is able to get around their
proposed fix right away. And so anyway,
we have people who are good at parts of this,
but we don't have that many people that are really great
at both. And then just from a math
perspective on the number of humans we have, regardless of
how skilled or cross skilled they are,
we just don't have enough people to do the jobs.
And so what ends up happening here? If this is our application portfolio, and you imagine you're the business owner at a big bank or a big technology company and you say, these are all the apps I have, what typically actually
happens is that you label a very small number
of those apps, let's say 10%,
as the most critical. So these are things that might be Internet
facing. They might directly touch some sensitive assets.
And so you might choose to say,
look, we're only going to run
the full barrage of activities and tools on
this 10% of our most critical applications.
And so this is why we have situations like,
there was a major retailer that was broken into a few years
ago and suffered a tremendous, really painful breach.
And the way they got in was through a
contractor HVAC portal.
And I'm sure that at some point, somebody looked at this asset of the company's and said, it's just for contractors, it's just HVAC, there's just not enough at risk here, it's a small number of people who have access to it, HVAC contractors. But to an attacker looking to get their foot in the door, it all looks the same. And so the attacker in this case breached this system.
I'm not sure if they had insider information or they knew a contractor,
but they found some way to get to this and
they pivoted from there to the soft
underbelly of the systems and
did a lot of damage. And so if we're only doing, let's say, what we want to do on 5% to 10% of the applications that we have, and we're about to raise the throughput of our developers by a lot, somebody's going to have to explain to me how we're going to secure all this code. And so this caused a lot
of soul-searching for me, and I'm sure for a lot of other people, to ask: what are the things that can scale with the robots? And I'm just choosing three things here today to talk about. I think there are some more opportunities, but I want to focus on the highest-yield things. And so one solution, one strategy we can have, is to make it hard to be insecure. And so Netflix
has a term for this. They call it paved roads.
So this is the idea where we have a
use case, and we give the developer a
very simple path to follow. We give them a framework,
we give them an abstract type to
work with that automatically enforces authentication for them.
Right? So they don't have to think about authentication anymore.
They don't have to think about identity authentication.
They just add the type, add their feature,
and security comes baked in. In that regard,
you might have another use case where, for a developer's code to even compile, the framework forces them to provide roles for access control enforcement. And so this is really forcing the developer to acknowledge, when they make a new feature, what security aspects they should be asking themselves about.
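To make the paved road idea concrete, here's a minimal sketch of what that kind of abstraction can look like. The names and the framework wrapper here are hypothetical, not any real product; the point is just that the route decorator refuses to register an endpoint until the developer answers the access control question:

```
# A minimal paved-road sketch (hypothetical framework wrapper):
# you cannot register an endpoint without declaring who may call it.
from functools import wraps
from typing import Callable, Iterable

class PavedRoadApp:
    def __init__(self) -> None:
        self.routes: dict[str, Callable] = {}

    def route(self, path: str, *, roles: Iterable[str]) -> Callable:
        roles = set(roles)
        if not roles:
            # Failing fast is the point: there is no way to ship this endpoint
            # without answering "which roles may call it?"
            raise ValueError(f"Endpoint {path} must declare at least one allowed role")

        def decorator(handler: Callable) -> Callable:
            @wraps(handler)
            def wrapped(user, *args, **kwargs):
                # Authorization is enforced by the road, not by the feature code.
                if roles.isdisjoint(getattr(user, "roles", set())):
                    raise PermissionError(f"{user} is not allowed to call {path}")
                return handler(user, *args, **kwargs)
            self.routes[path] = wrapped
            return wrapped
        return decorator

app = PavedRoadApp()

@app.route("/reports", roles={"analyst", "admin"})
def get_reports(user):
    return ["q3-report"]
```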
And we don't do this enough. Usually we tell the developer, hey, we need
you to go make an app that does XYZ.
And the product manager, the product owner doesn't really care many
times about the security of it. They just sort of assume that security is
baked in. And so the developer might stand
up a new app with
just the base, let's say, Express framework. And that's not going to come with any paved roads, right?
The developer is now going to have to reinvent all
the security controls and
they're probably going to get a lot of things wrong along the way. And so
these are some good ideas just to try to force them down a road
that's going to either provide the security or force them to
provide answers themselves. And you
might say, to prevent cross-site scripting, all the apps here have to be REST plus JSON, right? And that makes cross-site scripting patterns a lot harder to create accidentally.
And this doesn't just apply to code or
frameworks or something sort of at compile time,
but maybe if we say, look,
there's only one pipeline to use, or if you want
to run a GitHub action, we're automatically going to add
this in there, where we're going to force static analysis
on every build, or we're going to add a GitHub app that watches
your dependencies. And so these paved roads,
they really help. And if
a robot is going to take your code
and add something to it,
Copilot, in my experience, has been pretty good about following
the patterns that are there. And so if
all it sees around it are paved roads,
it'll probably use the paved roads. And so
it'll force it to reason about some of these things
that we talked about and provide some good
first draft settings for some of these things.
So what does it take to get this done? Sort of organizationally,
the first thing is we need strong DevEx and platform teams.
So strong teams who are centralized,
who understand developers, who understand
the requirements from security and can help developers
go down these paved roads. Now the second bullet here is also really
important. If you want to have
paved roads, unfortunately,
you can't say go build whatever you want in whatever language you want with whatever
framework you want. That ends up being really difficult,
because if you want to build
a paved road, like an access control mechanism,
for every different language, for every different framework,
it just doesn't scale. And so the fewer technology
stacks you have, the better. And of course,
if Express is really big in your organization and you have a
long tail of other technologies, of course it's still worth it to do paved roads
for those technologies that are predominant.
But it's hard to do sort of globally, hard to solve globally unless you're a
little bit more authoritarian about what technology stacks are allowed.
And then we love the developer
security champion model, and this has become a really popular model
in lots of different companies where we have
security cross-skilled developers who can chime in, who think about risk maybe a little bit differently than your average developer, and who can help create these paved roads and help inject security into them. And so I added a few vendors and tools here at the bottom for you to look into more, if this is interesting to you.
One other strategy: so, the first strategy was to make it hard to be insecure, right? Give developers paved roads so that it's difficult to get off those roads and create accidental vulnerabilities. Another strategy is to make it
hard to exploit your insecure code.
So traditionally, runtime protections
were dominated by these tools called web application firewalls
that watched HTTP traffic and sort of tried to use signatures to detect attacks.
And they weren't super accurate, but they did provide
visibility into traffic. You could detect
obvious attacks. It was
hard to rely on them in sort of a blocking mode because they had
lots of false positives. It's very difficult to watch
traffic and pick out the bad stuff and
not pick out good stuff accidentally. I've been in a job
like that, and it's quite difficult. And so in
the last few years, we've seen a class of tools, I worked on one, called RASP, Runtime Application Self-Protection. And so whereas traditional protection tools sat at the network level and tried to build a moat around the app, these RASP tools are really injecting sensors and actuators into the application itself, into the app, the frameworks, the libraries. So these are sort of language-level agents that are putting these sensors and actuators in and acting on behaviors rather than traffic signatures.
And so this is an example of a RASP tool. The user input that the attacker sent in, they're trying to exploit a SQL injection. They send in this tick, OR 1=1, followed by what is a comment in this SQL dialect. And so they're trying to attack this line of C# code where the user input is included. Now, a WAF only sees the input, right? It gets the traffic first, it looks at the input and says, is this an attack or not? It has to make a decision that's far too early, far too far away from what we call boom. But a RASP
can look at the application behavior. It can look at the SQL query that's actually being sent, it can tokenize it, and it can semantically analyze it and say: look, this query, some user input came in, I saw it go into the SQL statement, and two things about it irritated me. One is it caused data context to become code, so a token boundary was crossed here: this input looks like it became code at this OR 1=1 part. And then there's a clause that always evaluates to true, right? This is something we can evaluate at the sensor where the SQL is evaluated, and that kind of bugs us too. And so you have so many more degrees of freedom
with tools like these that sit in the runtime and
can look for malicious behaviors. People who used these tools were protected from the Log4j exploits when they came out, because the tools were watching for malicious behaviors in the runtime, not traffic.
And so there are some vendors here; there's not really a good open source option, but it's still a relatively new space. And so if you want to make your code much harder to exploit, this is a very good option, because now you have some confidence that even if the generative AI is producing code that may be insecure, it still can be protected from a lot of different vulnerability classes.
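To give a feel for why the runtime vantage point has more degrees of freedom, here's a deliberately naive sketch of the two checks described above, made at the point where the SQL actually executes. This is illustrative only, nothing like a vendor's real implementation:

```
# A naive sketch of a RASP-style check at the SQL sensor: (1) did user input
# cross a token boundary, i.e. did data become code, and (2) is there a clause
# that always evaluates to true?
import re

def tokenize(sql: str) -> list:
    # Extremely naive tokenizer, just enough for the illustration.
    return re.findall(r"[A-Za-z_]+|\d+|'[^']*'|[=()*,;]|--", sql)

def looks_like_injection(query: str, user_input: str) -> bool:
    # (1) Benign data should stay inside a single literal; more than one token
    # contributed by the input means the token boundary was crossed.
    crossed_boundary = user_input in query and len(tokenize(user_input)) > 1
    # (2) A tautology like OR 1=1.
    tautology = re.search(r"\bOR\s+(\d+)\s*=\s*\1\b", query, re.IGNORECASE) is not None
    return crossed_boundary and tautology

query = "SELECT * FROM users WHERE name = '' OR 1=1 --'"
print(looks_like_injection(query, "' OR 1=1 --"))  # True: block or alert at the sensor
```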
Now, these vendors and these strategies
work pretty well when it's a vulnerability class that looks the
same from your app to the next app to the next person's
app. So for instance, SQL injection is the same no
matter whose app it is.
But we also have a class of vulnerabilities called business
logic vulnerabilities that do look different. They look very custom
to your app. So you might have business rules that say
Mary is allowed to access this data unless it's Tuesday after 4:00 p.m. And so those types of weaknesses, the gaps in our security models there, the misses we create there, are different from app to app. And so it's harder both for static analysis tools, or any kind of analysis tool, and for protection tools to understand that an exploit occurred, because sometimes that access is allowed. How can these tools understand that a business requirement is being violated here?
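Purely to illustrate the kind of rule we're talking about, here is a made-up example in the spirit of the one above:

```
# A made-up business rule. Nothing here matches a known vulnerability "shape",
# so a scanner or a runtime protection tool can't tell a legitimate access from
# a violation; only the application's own requirements define the boundary.
from datetime import datetime

def may_access_report(username: str, now: datetime) -> bool:
    if username != "mary":
        return False
    # Mary may not access this data on Tuesdays after 4:00 p.m.
    if now.weekday() == 1 and now.hour >= 16:  # weekday() == 1 is Tuesday
        return False
    return True
```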
So although these tools can help with many things, there are still some things that they can't help with. So if we remember our diagram, we had all those M's on the board, and all those M's, all those manual activities, most of them were humans that had to respond to an interruption from a security tool.
And some of those tools were software composition analysis tools.
Some of them were Docker tools. But what we found is that most
of the results come from code scanning tools.
So we need to solve this problem for all of them. But it's
interesting that this problem of evaluating the results from security tools
also requires the hardest collection of skills across development and security. So you have to understand the vulnerability class, you need to understand sort of security concepts in general, and you need to understand the code in order to determine: is this a real issue?
Is it a false positive? Is it something I need to fix right now?
And so we
can imagine a tool here where we see Sonar finding something, Sonar finds a SQL injection vulnerability. We have a security copilot here that's reviewed the code and said, hey, look, I noticed you had some vulnerabilities in that code. I'm going to issue you a PR to try to fix those vulnerabilities.
And then after that PR gets merged,
the scanner doesn't find anything.
We need to be able to do two things to accomplish this reality. We need
to be able to triage results to determine if they should be fixed,
and then we need to be able to fix them confidently.
And so what we see here is a
tool doing that. And so there's still a human
in the loop here to approve this PR,
but the whole job of triaging the vulnerability and creating
a fix and verifying that all the tests pass and all that stuff,
this can be done by what we're calling a security tools copilot.
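As a rough sketch of the shape of that loop, using hypothetical placeholder functions rather than any real product's implementation, the copilot consumes the scanner's findings, triages each one, and proposes a fix for a human to review:

```
# Sketch of a "security tools copilot" loop: read a scanner's findings (SARIF is
# the common interchange format), triage each one, and propose a fix. The triage
# and fix functions are placeholders for rules, codemods, or model calls.
import json

def load_findings(sarif_path: str) -> list:
    with open(sarif_path) as f:
        sarif = json.load(f)
    return [r for run in sarif.get("runs", []) for r in run.get("results", [])]

def triage(finding: dict) -> bool:
    # Placeholder: decide whether this is a real, fixable issue.
    return finding.get("level") == "error"

def propose_fix(finding: dict) -> str:
    # Placeholder: in practice this is where a codemod or generated patch goes.
    return f"patch for {finding.get('ruleId', 'unknown-rule')}"

for finding in load_findings("scan.sarif"):
    if triage(finding):
        print(propose_fix(finding))  # a human still reviews the resulting PR
```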
And so I have an offering in this space, but there's also others that
I've listed here. And so I wanted to especially highlight, at the bottom here, this library called Codemodder, in which we open-sourced this technology to help other people perform this same kind of activity. And so I'm going to spend a little bit of time on Codemodder. So codemods are this cool idea. They came out of the Python community first, from an engineer, Justin, at Facebook. Oh my gosh, I can't remember his last name. And then we saw them
sort of jump to the JavaScript world, which is really the primary user of codemods today. And so the idea is a codemod is a little bit of code that changes a lot of code. And so the JavaScript community uses codemods today to do things like upgrade your React 4 to React 5 and update all your code. And so this is a cool use case, but they never really escaped that pattern of usage. When I was looking
into how we can automatically secure code on people's behalf, I wanted to do more and I couldn't get them to do more. And that's because they were missing the ability, they weren't very expressive. I couldn't get them to highlight or find complicated patterns of code. If you just want to change library A to library B, and you're just replacing APIs, it's not that difficult. But if you want to automatically refactor some code to be secure, well, you need to do a good job of finding the places where it isn't secure.
And so this is why I developed Codemodder with some of my friends. So Codemodder is an open source codemod library for Java and Python today. And what makes it different and what's so exciting about it
is that it is really an orchestration library. At first I started to build a library that was very much boiling the ocean. It tried to offer a query language, and it was too much to do to support a lot of languages. And we realized that there have already been so many hundreds of man-years invested in tools like Contrast, Semgrep, CodeQL, Sonar, Fortify, Checkmarx, et cetera. All these tools, they've invested a ton into identifying vulnerable or interesting shapes of code to go change. Codemod libraries shouldn't try to replicate that. We should
just take the results from those tools and then pair them with
tools that are great at changing code. And so tools
like JavaParser, LibCST, and jscodeshift. Go has refactoring as sort of a first-class feature of the language. And so we needed to create a library that orchestrates these things together. And so this is an example codemod in the Codemodder framework, which orchestrates a Semgrep rule.
Semgrep is a fun static analysis tool to use.
It's really good at building very expressive, simple rules.
And so in this example, we want to find any time you're using the random functions and replace the random.* function with a more secure version. If you don't know, most of the time in a language, when you say give me a random number or give me a random string of characters, it's actually quite predictable. You often need to use the secure version of that library in order to get actually unpredictable, unguessable entropy, which is very important for generating passwords and tokens, et cetera. And so if you wanted
to write a codemod in Python, this is what it would look like. We create a little Semgrep rule to help find the shapes of code we want to change, and then, through our magic, the developer doesn't have to do anything in terms of understanding how the tool gets invoked or anything. It will just jump on the results of that, and for every result, it changes the code to the secure version of that API. And so what we're trying to do with this open source project is create, I'm not sure, 50 or 100, something like that, rules or codemods in order to upgrade code automatically. And so this is obviously a fundamental tool if we want to keep up with the robots.
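The slide shows the real thing; purely as an illustration of the underlying idea, a codemod that swaps an insecure random call for the secrets module might look like the sketch below. This uses plain LibCST directly, not the actual Codemodder API, and the matcher here stands in for what a Semgrep rule would find:

```
# Illustrative codemod sketch using LibCST directly (not the Codemodder API):
# find calls shaped like `random.random()` and replace them with
# `secrets.SystemRandom().random()`. A real codemod would also add the
# `import secrets` statement and take its locations from the Semgrep results.
import libcst as cst
import libcst.matchers as m

class SecureRandomTransformer(cst.CSTTransformer):
    def leave_Call(self, original_node: cst.Call, updated_node: cst.Call) -> cst.BaseExpression:
        if m.matches(original_node, m.Call(func=m.Attribute(value=m.Name("random"), attr=m.Name("random")))):
            return cst.parse_expression("secrets.SystemRandom().random()")
        return updated_node

source = "import random\ntoken = random.random()\n"
module = cst.parse_module(source)
print(module.visit(SecureRandomTransformer()).code)  # the call site now uses SystemRandom
```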
If code comes in from a robot and
we can have code that changes that code to be secure,
that's a big deal, because now a lot
of these findings from these security tools won't happen. And the
ones that do get found, we can act on them and fix them automatically.
And so we can help stay on track with
all the code that's coming in. And so here's
some links for you to follow if you want to learn more about the open
source offering. So that's
what I came here to say. I think that there
are a lot of things we can do
to try to keep pace with the robots,
but we have to be realistic and we have to move
right now in
order to keep up. Most of the enterprises I
talk to now are doing PoCs with Copilot. And when Copilot comes in, when CodeWhisperer comes in, whenever whatever LLM is your preference comes in, we're going to see a lot more throughput. And security has suffered the same staffing challenges as the general tech industry has, so how are we going to keep up with fewer people than ever? We absolutely need to create some strategies today and start working on people, process, and knowledge to keep up, because the LLMs are not going to produce secure code. We have plenty of evidence of that,
but we're going to own the risk of it as
application makers. So happy to be here,
thanks for having me. And I've got some contact info
if you want to talk further. Take care.