Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. So first of all, thanks for joining me.
Thanks for having me at this conference. I'm excited
to talk to you all about Python code mods.
Before we get into it, let me just introduce myself briefly. My name
is Dan D'Avella and I work at an early-stage
startup called Pixee, where we are building an automated
product security engineer. So the idea is that we want to help
developers and security engineers to automate
improvements and fixes to their code.
I have a history of working on security tools.
I've previously worked on runtime application security
instrumentation, and I have a bit of experience working on static analysis
tools. And so it was kind of a natural transition
for me to come to Pixee and to work on tools that actually fix
security problems. If you've
ever worked at an early stage startup, you realize that you wear a
lot of different hats. But I would say the biggest hat that I
wear is actually as the lead maintainer of this
Python codemod framework, Codemodder, which I'm going to be talking
about today. This is an open source framework.
We're going to get a lot into the details
of this framework, some of the design decisions we've made. We'll talk about the
motivation for having a new code mod framework at all.
And I'm really excited to talk to you about that. But before
I get into that, I want to take a moment to sort of set up
the problem here. And so the problem
really boils down to the fact that there's a lot of insecure
code out there. And the even bigger problem
is that this insecure code is not actually getting
fixed. We're not fixing these security problems.
And while I was preparing this talk, I was
reading this state of software security report that was put out by Veracode
pretty recently, and there were a couple of statistics that really stood out
to me. The first one was this
first quote here that says that roughly 63%
of applications have flaws in first party code. And the
reason that this stood out to me is that over the past few
years, there's been so much emphasis on vulnerabilities
in third party packages, like in our open source software that we're using
and in our dependencies. And there's a lot of tools that have been helping us
fix these kinds of things. There's been Log4j and
Log4Shell, and a lot of big problems in third-party code.
We've tended to forget that there's also a lot of problems in our own
code, in our own application code, and the code that we write
and this is really important. And then this other statistic
that stood out to me was this idea that 42% of
all applications have flaws that persist unremediated
for longer than one year. And this is
how Veracode is defining this concept of security debt.
It's problems that aren't being resolved and
these security issues just continue to live in our code and
continue to get shipped. So there's a really big problem
here that we're not fixing the security problems in our
own code. And when you look at the security
tool environment, there's really no lack of tools that are capable
of finding security problems. If you
look at all these different products, some of them are open source, some of them
are enterprise grade commercial products. But if
I asked everybody in the room or in the virtual room to kind of
raise their hand if they were using at least one of these products
on at least one of these projects, most people
would probably raise their hand. And so we're all using security
tools, but that doesn't seem to change the fact that these
problems aren't actually getting fixed. And that's pretty
concerning. Now,
I will say that sometimes when I talk to developers and development teams
and I ask them, what security tools are
you using today, I actually do get a response
that looks a little bit like this. I think that there's some smaller teams
out there that have not really coalesced
on some formalized security practices yet.
So there are some teams out there that have not really adopted some of these
security tools. And so obviously this means that they're
not fixing some security problems as well because
they don't even know that they're there in some cases.
I think there's a couple of different reasons for this.
One of them is that some developers feel like they don't have the expertise
to really dive into security. And so it really requires
a team getting to the level where they have some formal security
engineering to really put these practices into
place. Sometimes when I talk to more senior developers,
they tell me that they feel like they are mostly writing
secure code. And I understand where that
thought process comes from because it's more likely for a
senior developer that they're going to correctly parameterize their
SQL queries and they're going to validate
user input before it gets rendered into HTML.
They're not going to put hard coded credentials in their code. But the fact
remains that there's still an awful lot of code out there that
has security issues and these issues aren't getting fixed.
So what are we going to do about it? The obvious solution
is that we need to fix and harden our code,
but I think the less obvious solution, but that the industry is starting
to arrive at, is that we need to do this automatically.
And this is going to do a couple of different things. First of all,
we're going to enable developers to
merge secure code. So they're going to be able to merge fixes to
their outstanding security problems. And then we're going
to be able to guarantee that any new code that gets added is
also secure because it's being validated and it's being fixed
if there's any problems before it gets merged and deployed in the first place.
We're also going to enable teams to work down their security
backlogs. So if you are using one of these tools
already, then we're going to be able to take the results of those tools
and automatically fix a large proportion of them. That
takes away distractions from the developers and gives
people more bandwidth. And the result of all of that is that
developers get to spend more time actually writing features
and focusing on the things that matter to them. They get to ship features without
feeling distracted by security problems.
That is the goal here. And that's what leads us
to this Codemodder framework that I'm going to be talking to you about today.
So just in a one-sentence summary of Codemodder,
it is an open source code mod framework that is
designed for fixing security issues.
So I've already covered the open source part of this. I mentioned this is
an open source project that's being maintained by Pixee.
But the rest of my talk is going to focus on what it means to
be a code mod framework, and then how we go about using that
to fix security issues.
So I've used this term code mod a couple of times now,
and I just want to make sure we define this for everybody. In case you're
not familiar with this concept. But the word code mod has a simple
etymology. It's just from a shortened form of code plus
modification. And what we really mean is that a code mod is
code that is capable of changing or updating other
code. So codemodder is not
the first code mod framework out there. There's some other prior art that I
just want to mention. So the first thing to mention is
this framework from Facebook that's actually
called code mod, and this was intended
to enable large scale refactoring with some level
of human intervention. So if you imagine that you're doing big structural
changes to your code, this is a framework that's
going to help you with this. It was implemented in Python, and I believe
that it's not actually actively maintained anymore.
If you're in the JavaScript or TypeScript ecosystem, you might already be familiar
with this framework called jscodeshift, which is designed
to quickly apply updates and framework migrations
and version updates and things like that to a large number
of files automatically. So this is quite popular,
quite actively used, but this only applies to JavaScript
and TypeScript code. And also there's
really not any emphasis on security in this particular framework.
And then there's also this project, also from Meta, from Instagram
I believe, called LibCST, which is
a framework for parsing and transforming Python code,
but it also provides an API for developing code mods,
and they also include some
pre built code mods as part of this framework, which include things like
removing unused imports or ordering your imports or
things of that nature. So we're going to talk a lot more
about LibCST going forward, just to place a
bookmark on that one. But the question
is, if there's already all of this prior art for
code mods, why did we need to design and develop
a new code mod framework? And so that leads
us to the Codemodder philosophy. So the
fundamental idea of Codemodder is that we want to
fix problems that are found by other tools and
specifically by other security tools.
So the whole idea of Codemodder is that we want to be able to
take the results of those security tools that I showed you a few slides
back, and use that to drive fixes
for problems that are identified. So we
want to use those tools to identify problems and
then fix them.
Another big part of the Codemodder philosophy is that we want our code
mods to tell a story and to educate users.
So if we're fixing security problems, we want
users and developers to understand, first of all,
what is the problem that is being fixed, and why is the new
code a safer solution? And this is going to enable developers
to write better code. It's going to teach them about security,
and it's going to help them write more secure code going forward.
And it's also very important for Codemodder to make changes that
are simple to understand and approve. So good storytelling
is part of this, but we
want to make changes that a developer can look at and understand that
yes, this is a good change. I want to make this change to my
code, and I'm going to go ahead and accept it and now have
more secure code. So it doesn't really matter if we propose changes
that nobody wants. We need to propose changes that are
understandable and that developers are willing to
accept. And so in
order to do this, we've decided that we can leverage existing
open source tools in order to build a solution
here. So we've got open source tools out there
like Semgrep, which are very good at identifying
security problems and other code quality issues.
And then on the other hand, we have this framework I mentioned before,
LibCST, which is very good at transforming code
and making changes to code. And so we
feel like these two things belong together. If we can
put these open source technologies together and orchestrate
them, then we can build a tool that's very useful for developers
and that can help automatically fix security
problems in their code. So one aspect
of this is we want to be able to process results
that are identified by other tools. So what
this means is if you're using tools like Sonar or CodeQL
or Semgrep, we want to be able to process the output of those
tools, which is often in the standardized file format
called SARIF. We want to process the results of those tools
and then feed them to Codemodder in such
a way that we can use LibCST to make transformations.
So these tools, we expect, are in some cases already being used by
developers, and we're going to identify the locations
that are insecure, that are pointed
out by these security tools, and then take that and make fixes
to those locations in the code. But the other thing that
we want to do is sometimes we want to be able to invoke
the open source tools ourselves. Sometimes we
want to be able to find problems ourselves and
use that to fix
code. So in this case, we've written code
mods that leverage Semgrep using custom rules that we've written,
and we feed the results that we've actually generated with Semgrep
into LibCST to fix problems in code. And this is very useful
for the development teams that haven't really adopted formalized security
practices yet, because Codemodder can give
these teams a tool that will both find and
fix problems. So we call this kind
of code mod a find-and-fix code
mod. Whereas the previous kind of code mod that I showed
you where we're consuming the results of external tools,
those code mods are going to be called fix-only code mods because
we're taking results that have already been generated.
So I mentioned that it's very important for us to educate users.
And what this means is that we want our code mods to tell a story.
So we believe that every fix that Codemodder provides is
an opportunity to educate developers, both about security
problems, but also about writing more secure code.
We also believe that the fixes we provide should be comprehensible to
developers and compelling. So if we
tell the story right, it should be very easy for a
developer to understand why the change is being made, what the original
problem was, and that should make it compelling in
terms of a fix from the perspective of a developer.
And the result of that is that it makes fixes easy to
merge. So when a developer sees a fix from Codemodder,
it should be very easy to accept that into their
upstream code base and say, yes, that is a change that I want to
make. We want these fixes to be easy to merge.
So at this point in the talk, after we've learned a bit about
the Codemodder framework, you're probably asking, how can I use it?
So Python Codemodder is available
as a package on PyPI. It's listed under the name codemodder.
And so you can just run pip install codemodder.
And when you do that, by default you get this
new executable called codemodder on your path. You can run
it with the -h option.
And what I'm showing you here is the output of
the help message to the terminal. We're not going to go through all these options
today, but I just want to give you the sense that there's
a lot of different knobs to turn here, and Codemodder is very
configurable. So that's the first step
to getting it installed and seeing what it can do.
And so then the next question you're asking is what does it actually do?
And so if we invoke this codemodder executable with
a path to your project, so a project that contains Python
code, it's going to do a couple of different things.
The first thing it's going to do is it's going to use
the find-and-fix code mods that I mentioned earlier,
which are using Semgrep rules in many cases, and it's going
to identify problems in your code, and then it's going to apply
fixes for the problems that it identifies.
And those fixes are going to be applied directly to your
files on disk by default. So it's going to make changes to
your code. The other thing that Codemodder does is
it generates output files in this
format that we've called CodeTF, which is designed as
an interchange format for representing the
results of codemod runs. So I'm not going to get into a lot
of details about what CodeTF looks like today.
It's not really important to this talk, but I will mention
CodeTF at least one more time towards the end of this talk.
The general idea is that CodeTF can be consumed by upstream
tools and it can be used to do interesting
things.
Okay, so at this point I'd like to show you a couple of
examples of the kinds of security problems that Codemodder
is capable of fixing. We have a pretty large catalog of
code mods that we currently support. I think it's
on the order of 40, 45, maybe close
to 50 code mods that are currently supported. We're always developing more,
but I'm just going to show you a couple of examples so you get a
sense of what this framework can do.
All right, so this first example is to replace
the unsafe PyYAML loader. So if you're familiar
at all with the PyYAML library in Python,
you might be aware that the default loader in PyYAML
is actually insecure. It potentially enables
arbitrary code execution if you load a YAML
file from a source that you don't trust. And so this change is
relatively simple. What we do is identify locations
where that unsafe loader is being used, and we replace it with
a safe loader which is not susceptible to arbitrary
code execution in the same way. So you can
see this is a pretty simple change that's being made. It should be pretty simple for
a developer to understand the reasons for this,
and we think it's a good change. It makes your code more secure.
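For illustration, here's a minimal sketch of the kind of change being described, assuming a typical yaml.load call (this is not the exact diff from the slide):

```python
import yaml

untrusted_text = "key: value"  # imagine this string came from an untrusted source

# Before: the default Loader can construct arbitrary Python objects,
# which can lead to code execution on malicious input.
data = yaml.load(untrusted_text, Loader=yaml.Loader)

# After: the code mod swaps in SafeLoader, which only builds plain data types.
data = yaml.load(untrusted_text, Loader=yaml.SafeLoader)
```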
The next one I'm going to show you is a personal favorite of mine.
This is one that uses defusedxml for parsing
XML. So if you're familiar at all with the standard
library XML parsers in Python, so the ones that are
provided out of the box with Python,
these are actually vulnerable to
different kinds of XML parsing attacks. And if you
go to the documentation for these modules on the Python docs,
you will see a big warning right at the top that says that these XML
libraries should not be used for parsing
untrusted XML data. And what the documentation
actually does recommend is the use of this third-party
package called defusedxml, because it has been
secured against many of these different types of XML attacks.
And so what this code mod does is it identifies places in your
code where you're using the standard library XML parsers
and it replaces them with parsers from defusedxml.
And so you can see in this diff here that we're adding some
imports and we're changing the parsers so that
they use defusedxml instead of the ones from the standard library.
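For illustration, a rough sketch of that kind of rewrite, using a hypothetical ElementTree example rather than the exact diff from the slide:

```python
# Before: the standard library parser is vulnerable to attacks such as
# entity expansion ("billion laughs") when given untrusted input.
# import xml.etree.ElementTree as ET

# After: the code mod adds the defusedxml import and swaps the parser.
import defusedxml.ElementTree as ET

untrusted_xml = "<root><item>hello</item></root>"
tree = ET.fromstring(untrusted_xml)
```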
Now the interesting thing about this code mod which we'll come back
to, is that in order for this to work properly,
it actually needs to add the defusedxml dependency to your
project if it's not already present. So again in a couple of slides we'll
talk a little more about that. Here's another code mod
that automatically closes resources. If you open
a file handle and forget to close it, in certain cases
that can lead to resource overconsumption.
It can make you susceptible to denial of service attacks and in
certain cases can be quite catastrophic depending on the application.
And so what this code mod does is it identifies any cases where the
file handle wasn't closed and it rewrites those
usages in terms of a context manager, which is the recommended
way for handling these kinds of I/O resources.
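A minimal sketch of that pattern, assuming a simple file read:

```python
# A throwaway file just so this example runs end to end.
with open("data.txt", "w") as out:
    out.write("hello\n")

# Before: if an exception is raised before close(), the handle leaks.
f = open("data.txt")
contents = f.read()
f.close()

# After: the code mod rewrites the usage as a context manager,
# so the file is closed even if an error occurs.
with open("data.txt") as f:
    contents = f.read()
```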
The interesting thing to me about this code mod is that when you look at
the change, it's actually a very simple diff
here. But the code that
implements this code mod is actually very sophisticated and
it's really quite impressive. So I think that this is a cool
code mod and a very useful one as well.
In a similar vein, this is another code mod where the change
looks pretty simple, but the logic behind it is very
sophisticated. This is one that parameterizes
SQL queries to make them safe against SQL injection. So if
you look on line 147 here of the old code
in the diff, you can see that string formatting
using an f-string is being used to generate this
SQL query, which is then executed, and that's potentially
insecure against SQL injection depending on where that token string came
from. So this code mod rewrites that query in terms of
a parameterized query, which secures it against potential
SQL injection. Again, I think this is a very cool,
very valuable code mod and a very impressive one too.
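To make that concrete, here's a hedged sketch of this kind of rewrite, using sqlite3 and a made-up query rather than the code from the slide:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, token TEXT)")
cursor = conn.cursor()
token = "abc123"  # imagine this value came from user input

# Before: interpolating the value directly into the SQL string
# allows injection if the value is attacker-controlled.
cursor.execute(f"SELECT name FROM users WHERE token = '{token}'")

# After: the code mod rewrites this as a parameterized query,
# letting the database driver handle escaping.
cursor.execute("SELECT name FROM users WHERE token = ?", (token,))
```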
And this last one I'm going to show you is called use generator
expressions. I like this one because it's not actually a
security fix per se, and it also looks
very simple just based on the diff. But it's a really interesting
one because it identifies places where list
comprehensions or other kinds of comprehensions have been used
and rewrites them in terms of generator expressions where possible.
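For illustration, a small example of the kind of rewrite (a made-up sum over squares, not the example from the slide):

```python
numbers = range(1_000_000)

# Before: the list comprehension materializes every element in memory
# before sum() ever sees it.
total = sum([n * n for n in numbers])

# After: the generator expression produces values lazily, one at a time.
total = sum(n * n for n in numbers)
```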
And the reason for this is that you can have, in some cases,
a very large performance benefit from doing this, especially if you're working with
very large data sets and you need a lot of memory.
This changes these data sets so that they're now lazily evaluated
instead of having to load all of them in memory. I guess in some cases
this could lead to denial of service. So there is a bit of a security
impact here. But this is another case where it would be very
hard to make this kind of change without the
kind of syntactic and semantic analysis
that we perform with these security tools and this transformation
library. So I like this code mod because it's simple to understand, but still
very interesting. So now that we've talked
about some of the code mods, I want to dive a little bit deeper
into the architecture of this framework.
So when we designed this
architecture, we realized that a code mod really consists
of three different components. The first of these is called the
detector. The second is called the transformer,
and the third is called metadata. So the
detector is responsible for finding problems.
These are the security tools that go out and find problems with your
code. In the case of Codemodder, this can be one
of two different things. It can be problems that were identified
by external tools, in which case the detector
is really a parser for the results of those tools and transforms
them into something that codemodder can use to fix.
But in other cases, it's us running Semgrep directly. So codemodder
is directly invoking Semgrep with custom
rules and using that to drive the fixes. The transformer
is what's actually responsible for changing the code and making the fix.
And then the metadata is the part of the code mod that actually tells a
story and helps the developer to understand the code mod and what
kind of change is being made. So if I
show you this diagram, this schematic of our base code mod class,
you can see on the left hand side, we've got the detector,
and that detector is feeding into what we've called a
transformer pipeline, which can potentially be multiple
transformers that are chained together. And then we
also have this box up top, which is metadata,
which includes some fields that I'll talk about on an
upcoming slide. But in practice, what this ends
up looking like is a little bit like this, where our detector
is something like Sonar or CodeQL or Semgrep.
Again, sometimes the detector is parsing the results of these tools.
Sometimes the detector is running Semgrep itself, in our find-and-fix
code mods, and then it's being fed into transformers that
are implemented in terms of LibCST.
And that transformer is what's responsible for actually changing
the code. So I mentioned metadata on the last slide
and I want to take a minute to talk about what that looks like.
So metadata consists of a name which
is really a unique identifier for a code mod.
You can see here in this example that this name has three different parts.
The first part is pixee, which is telling you the origin of
this code mod. It means that we wrote this at Pixee.
The second component of this is the language that it applies
to. So we're talking about Python codemods, but we
do support another code mod framework for Java and
we intend to build some others going forward.
And then the third component after the slash is the actual name of
the code mod itself, which is use-defusedxml.
The next part of code mod metadata that's interesting is
a summary, which is a short human-readable description
of the change being made. In this case, it's "use defusedxml for parsing
XML." This tells the developer what the code mod is doing and
then we've got a description. And remember,
we want to be able to tell a good story about a code mod.
So we want to be able to support a reasonably large long form
description. In this case, we've decided to
represent this as a separate markdown file which
is automatically associated with the code mod. And that allows us
to use markdown rendering and write a nice long form description about
this code mod. And it also enables us
to have this without cluttering up the code itself so that
the description doesn't live right next to the code, it lives in a separate place.
Okay, so I mentioned with the defusedxml code
mod that sometimes a code mod needs to add a dependency.
Sometimes the right thing to do to fix a security problem is
to use a different library that either has
a more secure implementation or sometimes even to
introduce a security package that is
capable of hardening certain operations that tend to be insecure.
So to do that we need to be able to add dependencies to our
project, to the project being modified. And if
you're familiar at all with the Python packaging ecosystem, you know that doing this
in Python is not that easy of a problem to solve.
In the simplest case, if you're using a requirements.txt file,
we can generally just add that dependency to the requirements.txt file
if it's not already present there. But in
the Python packaging ecosystem there's a bunch of different places where
packages can be, where dependencies can be expressed.
This includes pyproject.toml, which is currently recommended
for setuptools. It's also used by Poetry,
which we don't currently support, but we may going forward.
But some older projects might be using setup.cfg if they're
using setuptools. And then there's also setup.py,
which can express dependencies,
and that's sort of the older, less recommended way of doing things
now. But we need to be able to figure out which of
these is being used in a project and where the right place to add
the dependency is. This is a pretty tricky problem.
It's actually a bit harder to solve than the problem that
Dependabot has because they can just identify
existing dependencies and update them. But we need to find the right
place to add a new dependency. So I think that this is really
useful and it's cool. And it's also
something that definitely differentiates us from other code mod
frameworks. I don't think many other frameworks are necessarily
thinking about this kind of thing.
All right, so we've covered a lot of ground about the codemodder philosophy.
We've talked about some examples and then
the underlying codemodder architecture. So it's time for
us to dive right in and write a code mod.
So first of all, I want to mention that the Codemodder
framework supports a plugin infrastructure for
loading custom code mods. So if you write a custom
code mod, our framework is capable of automatically
loading that custom code mod and making it available to
the framework for use. I'm not going to get into a lot of detail about
how that plugin infrastructure works. That's probably better
covered in our documentation. But I will say that if you are interested
in following along and writing your own custom plugin,
your own custom code mod plugin, then you should start
with this code mod plugin template, which is a Cookiecutter template
that you can use to generate your own custom
code mod project. And what that's going to do is enable you to have a
project that if you pip install it, it's automatically going to be
picked up by Codemodder, and that custom code mod is going to be available.
So if you're interested in doing this yourself, go ahead and get started with
this Cookiecutter template. And for the sake of
the examples I'm going to show, we're going to assume that all of this is
within the context of this particular Cookiecutter template. Okay,
so here's an example code mod that we're going to
write. We're going to write a code mod called secure-random,
which is going to find places where the
standard random module is used in Python,
and we're going to replace it with the more secure
SystemRandom class from the secrets module. And the reason for
this is because if you're generating cryptographic primitives
or using this to generate passwords or other kinds of
keys, the standard random module is
not secure enough for those purposes. So we think that this is
a good hardening step to make. It really has no downsides, and
we recommend it. So for this code
mod, first notice that we're importing this CoreCodemod class,
which we're then using to define
a secure-random code mod. And remember when
I talked about code mod architecture and it having three different components?
We had metadata, we had a detector and
we had a transformer. So you can see that the secure random code mod
is defining each of those things. But the interesting thing
that we're going to get into is what the definition of each of these different
components looks like. So first of all, we're going to talk
about metadata. You can see here
we're defining this new object using the
Metadata class, and it has a name, which we're calling secure-random.
Now notice that this doesn't have the origin or the language component
that I pointed out previously. That's because our framework
is automatically going to add those based on the plugin.
We know that this is a Python code mod,
so the language doesn't have to be provided here.
And we also encode the actual origin name
in the plugin itself. So ours would be pixee,
but yours would have a different name for your project. So we're calling this secure
random and the summary that we're providing
is secure source of randomness. Now there's also
this other field here called review guidance, which just sort of gives developers
an idea of how much attention they need to pay to this
particular change before they merge it. And then recall that
I mentioned that the long form description is actually stored in a
separate markdown file which I'm not showing here, but that's automatically
going to get associated with the code mod that we're writing.
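To give a sense of the shape of this, here's a sketch of the metadata definition; the import path and exact fields are assumptions, so check the Codemodder documentation for the current API:

```python
# Sketch only: the import path for Metadata and ReviewGuidance is an assumption.
from core_codemods.api import Metadata, ReviewGuidance

metadata = Metadata(
    name="secure-random",  # origin and language get added by the plugin
    summary="Secure Source of Randomness",
    review_guidance=ReviewGuidance.MERGE_WITHOUT_REVIEW,
    # The long-form description lives in a separate Markdown file that the
    # framework automatically associates with the code mod.
)
```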
All right, so the next part that we need to implement is the detector for
this particular code mod. We're implementing it as a find-and-fix code
mod, which means we're going to find the problem, and we're going to
do that by writing our own custom Semgrep rule.
So we're using the SemgrepRuleDetector class here
to define the detector, and what we provide to
that is a Semgrep pattern.
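Roughly, the detector definition might look like this; the Semgrep rule below is illustrative rather than the exact rule from the slide, and the import path and constructor signature are assumptions:

```python
# Sketch only: the SemgrepRuleDetector import path and signature are assumptions.
from codemodder.codemods.semgrep import SemgrepRuleDetector

detector = SemgrepRuleDetector(
    # Match uses of the random module, but skip random.SystemRandom,
    # which is already backed by a secure source of randomness.
    """
    rules:
      - id: secure-random
        patterns:
          - pattern: random.$FUNC(...)
          - pattern-not: random.SystemRandom(...)
        languages: [python]
        message: Insecure source of randomness
        severity: WARNING
    """
)
```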
So I'm not giving a tutorial on Semgrep here. I'm not
going to get too far into the weeds about what this pattern
means. But suffice it to say that we're identifying all the
cases where the random module is being used,
but we're also making sure to exclude SystemRandom, which is already
secure. So this pattern is going to find all the locations in the code
that look insecure and then the results of that are going
to be fed to the transformer, which is what we show here.
So first of all, note that we are creating a transformer
class using the LibcstResultTransformer
as the base class. So we're explicitly
saying here that we're using LibCST for the transformation.
We've added this layer of abstraction because we expect we might want to
have other kinds of transformers in the future. So right now you
have to define explicitly that you're using a LibCST transformer.
Okay, so I'm going to jump to this method called
on_result_found. So this is where all the magic is happening.
This is a callback that the transformer class defines that's
going to get automatically called by the framework
in response to any of the results, in response to any
of the locations that are identified by the detector.
So remember we're using a Semgrep detector and it's going to find locations
in the code that look insecure. And the framework is going to automatically
call this method on the transformer anytime it sees one
of those locations. So what
we're really doing here is we're updating the call
target of this operation. So originally the call
target was random. Maybe we had a call to random.randint.
So the call target there was the random module, but we want to replace
that with secrets.SystemRandom(). So the new
call is going to look like secrets.SystemRandom().randint.
So that means that we're able to take advantage of this API
method called update_call_target and use
that to implement our transformation. Now this is a pretty common
use case. If you recall back to our defusedxml code
mod, this would actually be doing the same thing. It would be replacing
the original call target, which is the standard library XML
module, and it would be replacing it with defusedxml.
So this method shows up in a bunch of different places.
The other interesting thing to call out here is that if we're
using the secrets module now, we need to make sure that it's
imported. And so we call this method called add_needed_import,
which for each file is going to check is that secrets module already
imported? And if it's not, go ahead and add it.
And then on line ten above that you can see we're also calling
remove_unused_import, which just cleans up after ourselves and
makes sure that if there's any unused imports after this they
get cleaned up. So it makes the linters happy and keeps the code clean.
Okay, so that defines the transformer class, but we also need to
define the transformer pipeline, which for this
particular case only consists of a single transformer.
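Putting the transformer and the pipeline together, a sketch might look roughly like this; the method names follow what's described in the talk, but the import paths and exact signatures are assumptions:

```python
# Sketch only: import paths and signatures are assumptions.
from codemodder.codemods.libcst_transformer import (
    LibcstResultTransformer,
    LibcstTransformerPipeline,
)


class SecureRandomTransformer(LibcstResultTransformer):
    # Called by the framework for each location the detector reported.
    def on_result_found(self, original_node, updated_node):
        # Make sure `import secrets` exists, and drop the now-unused
        # `import random` so linters stay happy.
        self.add_needed_import("secrets")
        self.remove_unused_import(original_node)
        # Rewrite e.g. random.randint(...) -> secrets.SystemRandom().randint(...)
        return self.update_call_target(updated_node, "secrets.SystemRandom()")


# This particular code mod only needs a single transformer in its pipeline.
pipeline = LibcstTransformerPipeline(SecureRandomTransformer)
```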
Okay? And if we go back to this example,
we've defined the metadata class,
we've defined the detector and we've defined the transformer.
And so we've actually written a code mod that's capable of making a change
and making your code more secure. And if we look at
the diff that is generated by this code mod, if we apply this code mod
to PyGoat, which is a deliberately vulnerable
Python web application, you can see that the uses of
random have been replaced with secrets.SystemRandom.
And you can see on line nine up there that we've removed the
random import which is no longer used. You can see on line 40 that
we added the secrets import. And so this is
a change that we hope a developer would be willing to accept.
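In spirit, the resulting change looks something like this (an illustrative reconstruction, not the literal PyGoat diff):

```python
# Before: the standard random module is not suitable for security-sensitive values.
import random

value = random.randint(100000, 999999)

# After: the code mod removes the now-unused random import, adds secrets,
# and swaps the call target.
import secrets

value = secrets.SystemRandom().randint(100000, 999999)
```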
A big part of our philosophy with the code mod API is that
we want to make the easy things as easy as possible. And I think that
you saw some of that with the on_result_found
method and the methods that are being called there that are
intended to handle the most common use cases in a pretty
straightforward way. But I also think in that example
that I showed you that there was a lot of boilerplate, we had to define
a couple of different classes and we had
to put it all together. So to make the easy cases as
easy as possible, we've defined the simple
code mod API. And a simple code mod is one
that has a single detector and it has a single
transformer, and specifically a single LibCST transformer.
And if those two things are true, then we can use this SimpleCodemod
base class to implement our code mod. So this is the same code mod that
I showed you before, except it's rewritten in terms of the simpler API.
And you can see each of the components here. We define the metadata,
we define our Semgrep detector pattern here.
And then we define this on_result_found method. And so this
is all in 20, maybe 25 lines of code. It's easy to
read, and this, we believe, makes for a very nice interface for
defining some of the simpler code mods.
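For flavor, a compact sketch of what the simpler version might look like; the class and attribute names follow the talk, but the import path, field names, and signatures are assumptions:

```python
# Sketch only: import path and signatures are assumptions.
from core_codemods.api import Metadata, ReviewGuidance, SimpleCodemod


class SecureRandom(SimpleCodemod):
    metadata = Metadata(
        name="secure-random",
        summary="Secure Source of Randomness",
        review_guidance=ReviewGuidance.MERGE_WITHOUT_REVIEW,
    )
    # Illustrative Semgrep rule driving detection.
    detector_pattern = """
    rules:
      - id: secure-random
        patterns:
          - pattern: random.$FUNC(...)
          - pattern-not: random.SystemRandom(...)
    """

    def on_result_found(self, original_node, updated_node):
        self.add_needed_import("secrets")
        self.remove_unused_import(original_node)
        return self.update_call_target(updated_node, "secrets.SystemRandom()")
```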
However, we also want to make sure that the hard things are still possible.
We don't want to lose any expressive power by having a
simplified code mod API. And so I'm going to show you just a slightly more
complicated example. This one is called subprocess-shell-false.
It identifies any subprocess calls where
shell is set to True, and it flips it to False, which is
a safer default. And I'm not going to get too
into the details of this code mod, you don't need to understand it all.
But what I do want to point out is that instead of that
on_result_found method that we saw in the previous example,
we have this leave_Call method, and this actually
directly exposes the underlying LibCST transformer
interface. So we have the full expressive power of
LibCST here and can leverage that to do some fairly
sophisticated transformations for certain code mods.
Now there's also some other things that we had to do here that we didn't
have to do in the previous example because we're using a lower level
API. One of these in the first box is we
need to filter our file name by the path and
the line number. So this is something that can be given
on the command line to include or exclude certain
files or lines from analysis. So we need to explicitly
call that here, whereas with on_result_found that's already
being handled by the callback. And then down at the bottom in this
other box, we're calling this report_change method,
which is what's helping us generate that CodeTF file that I mentioned before.
And again, that's already automatically handled by the on_result_found method
as well. So we didn't have to do that in that previous case, but we
do have to do it here. So that's just giving you a sense of what
a more complicated code mod might look like.
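A very rough skeleton of that lower-level style is sketched below. The LibCST visitor method leave_Call is real; node_is_selected is a hypothetical stand-in for the path and line filtering step mentioned above, and report_change follows the name used in the talk:

```python
import libcst as cst
# Sketch only: this import path is an assumption.
from codemodder.codemods.libcst_transformer import LibcstResultTransformer


class SubprocessShellFalseTransformer(LibcstResultTransformer):
    # leave_Call is part of LibCST's transformer interface; it fires for
    # every call expression in the module.
    def leave_Call(self, original_node: cst.Call, updated_node: cst.Call) -> cst.Call:
        # Hypothetical helper: apply the path/line include-exclude filtering
        # that on_result_found would otherwise handle automatically.
        if not self.node_is_selected(original_node):
            return updated_node

        # A real code mod would also verify this is actually a subprocess call;
        # that check is omitted here to keep the sketch short.
        new_args = []
        changed = False
        for arg in updated_node.args:
            if (
                arg.keyword is not None
                and arg.keyword.value == "shell"
                and isinstance(arg.value, cst.Name)
                and arg.value.value == "True"
            ):
                # Flip shell=True to shell=False, the safer default.
                new_args.append(arg.with_changes(value=cst.Name("False")))
                changed = True
            else:
                new_args.append(arg)

        if changed:
            # Record the change so it shows up in the CodeTF output.
            self.report_change(original_node)
            return updated_node.with_changes(args=new_args)
        return updated_node
```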
All right, so here, getting to the end of my talk, I want to take
just a minute to talk about some future directions and where we're looking
ahead for this framework. So I showed you this
diagram before where we have our detector can be a
variety of different security tools as input, and then our
transformers are implemented in terms of lib CST. Now,
I think the elephant in the room in 2024 is
where do large language models or LLMs fit into
this? And should we be using those to implement some of our transformations?
So looking forward,
some of our transformations might look more like this, where we're using
an LLM provider. I've used OpenAI here as
probably the most well-known, most popular one. But this could be a variety
of different models. It could be Llama, it could be something else.
But should we be using LLMs to
perform our transformations? There's a
couple considerations here. First of all,
do developers trust LLMs to make security changes to
their code? I think that's an open question. Another thing
is that right now we have the advantage of having an open source framework.
People can see exactly the kinds of changes that we're
making. They can understand them in terms of code and they can make
a pull request or open an issue. And when you use
LLMs you lose some of that transparency. On the other hand,
there's definitely some code mods I've seen that would require some more context
than just the kind of syntactic and semantic analysis we're
doing can provide. And that's where an LLM could really help
us make some even more sophisticated and clever kinds of
changes. So it's something we're considering going forward.
The other thing we're thinking about is of course I'm talking to you about
the Python Codemodder framework. All of this is currently implemented
in terms of Python and it's also being applied to Python code.
But the question we have is, could you have a framework that's implemented
in Python? So the detection, all of this orchestration
is implemented in terms of Python, but could we apply
transformations to other kinds of code? And now
of course in this case we wouldn't be using LibCST because that's only for
Python. But maybe LLMs could help us here,
or maybe there's some other frameworks that could help us out.
So this is just something we've been thinking about. All right,
so I'm just going to shamelessly say we'd love to have your feedback. This is
an open source project. We'd love for you to open GitHub issues
with suggestions or bug reports.
We'd love for you to clone or fork the repo and try it out yourself.
We'd love to earn your stars. And more than anything
we would love to hear ideas from you about code mods
you want to see. And even if you would like to contribute directly
upstream, contribute your own code mod to our project,
that would be awesome. We would love to see it.
One thing I do want to mention right before the end of my talk
here is when I talked about CodeTF and this interchange
format: the way that we're using this at Pixee is we've built
a GitHub application that you can install for
free and that consumes the results of Codemodder.
And you can see here in this box. This is
where the summary field of that code mod is being used,
and then down here is where the description is being used.
So Pixeebot automatically applies Codemodder to your
code base, and it orchestrates all this together and opens pull requests
with suggested changes for your code. It's really cool. Again, it's free
to install. We'd love for you to try it out. The other thing
the Python Codemodder helps with is our Pixee
command line interface, or CLI. This is sort of a
higher-level user interface around both our Python and Java codemodders.
It provides a bit of a nicer user experience,
and the results of Codemodder are being used by this tool.
This is also free to use. It's installable from
Homebrew and we'd love for you to try it out and
give us feedback. So that's my talk.
Thank you so much for spending a bit of time with me and learning about
Python codemods. You can find me here on GitHub.
Here's my email address. I'd love to get feedback from you.
Check us out at pixee.ai and look
me up on LinkedIn. I'd love to hear your feedback, love to see your GitHub
issues or get an email from you. And thanks again.