Writing Python Codemods for Fun and Profit

Abstract

Do you want to automatically harden and improve your Python code? Are you struggling with processing the results of security and code quality tools? Come learn about the Python Codemodder framework: an open-source library for automating code quality and security fixes.

Summary

Dan Davella is the lead maintainer of the Python Codemodder framework. The idea is to help developers and security engineers automate improvements and fixes to their code. This is an open source framework. We'll talk about the motivation for having a new code mod framework at all.
There's a lot of insecure code out there. And the even bigger problem is that this insecure code is not actually getting fixed. We're not fixing the security problems in our own code. One solution is to automatically merge fixes to outstanding security problems.
A code mod is code that is capable of changing or updating other code. Codemodder is not the first code mod framework out there. Prior art includes jscodeshift and LibCST. Why did we need to design and develop a new code mod framework?
Codemodder takes the results of security tools and uses them to drive fixes for the problems that are identified. Another big part of the Codemodder philosophy is that we want our code mods to tell a story and to educate users. The fixes we provide should be comprehensible and compelling to developers.
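As a rough illustration of what consuming tool results can look like: the transcript below notes that security tools often report findings in the SARIF format. The sketch reads file paths and line numbers out of a small, made-up SARIF document using only the standard library. The rule ID, the file path, and the `iter_findings` helper are hypothetical, and this is not a description of how Codemodder itself parses results.

```python
import json

# A tiny, hand-written SARIF 2.1.0 document. The tool name, rule ID, and
# file path are invented for illustration.
SARIF_TEXT = """
{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {"driver": {"name": "example-scanner"}},
      "results": [
        {
          "ruleId": "hypothetical.sql-injection",
          "message": {"text": "Query built with string formatting"},
          "locations": [
            {
              "physicalLocation": {
                "artifactLocation": {"uri": "app/db.py"},
                "region": {"startLine": 147}
              }
            }
          ]
        }
      ]
    }
  ]
}
"""


def iter_findings(sarif):
    """Yield (rule_id, file_uri, start_line) for every reported location."""
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            for location in result.get("locations", []):
                physical = location.get("physicalLocation", {})
                uri = physical.get("artifactLocation", {}).get("uri")
                line = physical.get("region", {}).get("startLine")
                yield result.get("ruleId"), uri, line


findings = list(iter_findings(json.loads(SARIF_TEXT)))
for rule_id, uri, line in findings:
    print(f"{rule_id}: {uri}:{line}")
```

Each tuple gives a fix-only code mod what it would need to know: which rule fired and where in the code to apply a transformation.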
Codemodder can replace the unsafe PyYAML loader. The default loader in PyYAML is actually insecure: it potentially enables arbitrary code execution. Codemodder has a pretty large catalog of code mods that we currently support.
The next one I'm going to show you is a personal favorite of mine. This is one that uses defusedxml for parsing XML, because it is secured against many different types of XML attacks. In order for this to work properly, it actually needs to add the defusedxml dependency to your project if it's not already present.
Here's another code mod that automates closing resources. In a similar vein, this is one that parameterizes SQL queries to make them safe against SQL injection. I think this is a very cool, very valuable code mod and a very impressive one too.
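To make these two fixes concrete, here is a minimal before-and-after style sketch using only the standard library. It shows the target patterns (a context manager for file handles, and a placeholder-based query in `sqlite3`), not the code mods themselves. The table, the file name, and the helper functions are made up for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path


def read_first_line(path):
    # The context manager closes the file handle even if reading raises.
    with open(path, encoding="utf-8") as handle:
        return handle.readline().rstrip("\n")


def find_user_ids(conn, name):
    # The "?" placeholder makes the driver bind `name` as data, so it is
    # never interpreted as SQL text.
    cursor = conn.execute("SELECT id FROM users WHERE name = ?", (name,))
    return [row[0] for row in cursor.fetchall()]


with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "notes.txt"
    notes.write_text("first line\nsecond line\n", encoding="utf-8")
    first_line = read_first_line(notes)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

malicious = "alice' OR '1'='1"

# Unsafe: the f-string splices attacker-controlled text into the query,
# turning the WHERE clause into a condition that is always true.
unsafe_rows = conn.execute(
    f"SELECT id FROM users WHERE name = '{malicious}'"
).fetchall()

# Safe: the same input is treated as a literal name and matches nothing.
safe_ids = find_user_ids(conn, malicious)
conn.close()
```

With the formatted query the injected condition matches every row, while the parameterized version treats the whole input as an ordinary (non-matching) name.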
Use generator expressions identifies places where list comprehensions have been used and rewrites them in terms of generator expressions where possible. This changes these data sets so that they're now lazily evaluated instead of having to load all of them in memory. In some cases, loading everything into memory could lead to denial of service.
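A small sketch of the difference, assuming the comprehension is only consumed once (here by `sum`). The variable names are illustrative.

```python
import sys

values = range(100_000)

# List comprehension: the full list of squares is built in memory first.
list_total = sum([v * v for v in values])

# Generator expression: squares are produced one at a time as sum() asks.
gen_total = sum(v * v for v in values)

# Compare the container sizes themselves (not the integers they refer to).
squares_list = [v * v for v in values]
squares_gen = (v * v for v in values)
list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)

print(list_total == gen_total)
print(gen_size < list_size)
```

The rewrite is only safe where the result is iterated a single time and never indexed or measured with `len()`, which is why the summary says "where possible".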
A code mod really consists of three different components. The detector is responsible for finding problems. The transformer is what's actually responsible for changing the code and making the fix. And then the metadata is the part of the code mod that actually tells a story.
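The three-part split can be sketched with a toy model. This is not the Codemodder API: `ToyCodemod`, `Metadata`, `Finding`, and both helper functions are hypothetical names, and the transformer here uses plain string replacement where the real framework uses LibCST.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Metadata:
    name: str     # unique identifier for the code mod
    summary: str  # short human-readable description


@dataclass(frozen=True)
class Finding:
    line_number: int  # 1-indexed line that the detector flagged


Detector = Callable[[List[str]], List[Finding]]
Transformer = Callable[[str], str]


@dataclass
class ToyCodemod:
    metadata: Metadata
    detector: Detector
    transformers: List[Transformer]  # applied in order, like a pipeline

    def apply(self, source: str) -> str:
        lines = source.splitlines()
        for finding in self.detector(lines):
            index = finding.line_number - 1
            for transform in self.transformers:
                lines[index] = transform(lines[index])
        return "\n".join(lines)


def detect_unsafe_yaml(lines: List[str]) -> List[Finding]:
    # Naive text match; a real detector would use a tool such as Semgrep.
    return [
        Finding(number)
        for number, line in enumerate(lines, start=1)
        if "yaml.Loader" in line
    ]


def use_safe_loader(line: str) -> str:
    return line.replace("yaml.Loader", "yaml.SafeLoader")


codemod = ToyCodemod(
    metadata=Metadata("toy-safe-yaml-loader", "Use SafeLoader when loading YAML"),
    detector=detect_unsafe_yaml,
    transformers=[use_safe_loader],
)

fixed = codemod.apply("import yaml\ndata = yaml.load(stream, Loader=yaml.Loader)")
print(fixed)
```

Keeping the detector separate from the transformers is what lets the same fix be driven either by a rule the framework runs itself or by results parsed from another tool.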
Metadata consists of a name, which is really a unique identifier for a code mod. The next part of code mod metadata is a summary, which is a short human-readable description of the change being made. The long-form description is represented as a separate Markdown file.
Sometimes a code mod needs to add a dependency. In the Python packaging ecosystem there are a bunch of different places where dependencies can be expressed. This is a tricky problem to solve. It's something that definitely differentiates us from other code mod frameworks.
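To give a feel for the simplest case, the sketch below looks for the dependency files named in the talk and appends a package to `requirements.txt` only if it is missing. The helper names and the name-matching regular expression are assumptions made for illustration; they deliberately ignore extras, environment markers, and the other file formats, and they do not reflect how Codemodder decides where a dependency belongs.

```python
import re
import tempfile
from pathlib import Path

# File names mentioned in the talk; the ordering here is arbitrary.
CANDIDATES = ("pyproject.toml", "setup.cfg", "setup.py", "requirements.txt")


def find_dependency_files(project_root):
    """Return the candidate dependency files that exist in the project."""
    root = Path(project_root)
    return [name for name in CANDIDATES if (root / name).is_file()]


def requirement_name(line):
    """Extract a lowercased package name, or None for comments and options."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", line)
    return match.group(1).lower() if match else None


def add_requirement(requirements_path, requirement):
    """Append `requirement` unless the package is already listed."""
    wanted = requirement_name(requirement)
    if wanted is None:
        raise ValueError(f"not a package requirement: {requirement!r}")
    path = Path(requirements_path)
    lines = path.read_text(encoding="utf-8").splitlines()
    if any(requirement_name(line) == wanted for line in lines):
        return False
    lines.append(requirement)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    requirements = Path(tmp) / "requirements.txt"
    requirements.write_text("requests>=2.31\n", encoding="utf-8")
    found = find_dependency_files(tmp)
    first_add = add_requirement(requirements, "defusedxml")
    second_add = add_requirement(requirements, "defusedxml")
    final_lines = requirements.read_text(encoding="utf-8").splitlines()
```

The second call is a no-op because the package is already present, which is the "if it's not already present" behavior the talk describes.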
The Codemodder framework supports a plugin infrastructure for loading custom code mods. If you write a custom code mod, our framework is capable of automatically loading it. Here's an example of a code mod called secure random. It replaces uses of the standard random module with the more secure SystemRandom from the secrets module.
The secure random code mod uses a Semgrep detector to find locations in the code that look insecure. The results of that are then fed to a transformer. A call such as random.randint becomes secrets.SystemRandom().randint.
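The effect of that rewrite on application code looks roughly like this. It is a hand-written before and after, not output captured from the tool.

```python
import random
import secrets

# Before: the default generator is a deterministic pseudo-random generator
# and is not suitable for passwords, tokens, or keys.
before = random.randint(0, 9)

# After: secrets.SystemRandom draws from the operating system's source of
# randomness while keeping the same method names as the random module.
after = secrets.SystemRandom().randint(0, 9)

# SystemRandom is a subclass of random.Random, which is why only the call
# target has to change and the method call itself stays the same.
same_interface = isinstance(secrets.SystemRandom(), random.Random)
print(before, after, same_interface)
```

Because the interface is shared, the transformation only needs to swap the call target and make sure `secrets` is imported.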
Codemodder is an open source framework maintained by Pixee. Could we apply transformations to other kinds of code? Should we use large language models (LLMs)? We'd love to hear your feedback.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hi everyone. So first of all, thanks for joining me. Thanks for having me at this conference. I'm excited to talk to you all about Python code mods. Before we get into it, let me just introduce myself briefly. My name is Dan Davella and I work at an early stage startup called Pixee, where we are building an automated product security engineer. So the idea is that we want to help developers and security engineers to automate improvements and fixes to their code. I work on security tools. I have a history of working on security tools. I've previously worked on runtime application security instrumentation. I have a bit of experience working on static analysis tools. And so it was kind of a natural transition for me to come to Pixee and to work on tools that actually fix security problems. If you've ever worked at an early stage startup, you realize that you wear a lot of different hats. But I would say the biggest hat that I wear is actually as the lead maintainer of this Python Codemodder framework, which I'm going to be talking about today. This is an open source framework. We're going to get a lot into the details of this framework, some of the design decisions we've made. We'll talk about the motivation for having a new code mod framework at all. And I'm really excited to talk to you about that. But before I get into that, I want to take a moment to sort of set up the problem here. And so the problem really boils down to the fact that there's a lot of insecure code out there. And the even bigger problem is that this insecure code is not actually getting fixed. We're not fixing these security problems. And while I was preparing this talk, I was reading the State of Software Security report that was put out by Veracode pretty recently, and there were a couple of statistics that really stood out to me. The first one was this first quote here that says that roughly 63% of applications have flaws in first party code.
And the reason that this stood out to me is that over the past few years, there's been so much emphasis on vulnerabilities in third party packages, like in the open source software that we're using and in our dependencies. And there are a lot of tools that have been helping us fix these kinds of things. There's been Log4j and Log4Shell, and a lot of big problems in third party code. We've tended to forget that there are also a lot of problems in our own code, in our own application code, the code that we write, and this is really important. And then this other statistic that stood out to me was this idea that 42% of all applications have flaws that persist unremediated for longer than one year. And this is how Veracode is defining this concept of security debt. It's problems that aren't being resolved, and these security issues just continue to live in our code and continue to get shipped. So there's a really big problem here that we're not fixing the security problems in our own code. And when you look at the security tool environment, there's really no lack of tools that are capable of finding security problems. If you look at all these different products, some of them are open source, some of them are enterprise grade commercial products. But if I asked everybody in the room or in the virtual room to kind of raise their hand if they were using at least one of these products on at least one of their projects, most people would probably raise their hand. And so we're all using security tools, but that doesn't seem to change the fact that these problems aren't actually getting fixed. And that's pretty concerning. Now, I will say that sometimes when I talk to developers and development teams and I ask them, what security tools are you using today, I actually do get a response that looks a little bit like this. I think that there are some smaller teams out there that have not really coalesced on some formalized security practices yet.
So there are some teams out there that have not really adopted some of these security tools. And so obviously this means that they're not fixing some security problems as well because they don't even know that they're there in some cases. I think there's a couple of different reasons for this. One of them is that some developers feel like they don't have the expertise to really dive into security. And so it really requires a team getting to the level where they have some formal security engineering to really put these practices into place. Sometimes when I talk to more senior developers, they tell me that they feel like they are mostly writing secure code. And I understand where that thought process comes from because it's more likely for a senior developer that they're going to correctly parameterize their SQL queries and they're going to validate user input before it gets rendered into HTML. They're not going to put hard coded credentials in their code. But the fact remains that there's still an awful lot of code out there that has security issues and these issues aren't getting fixed. So what are we going to do about it? The obvious solution is that we need to fix and harden our code, but I think the less obvious solution, the one that the industry is starting to arrive at, is that we need to do this automatically. And this is going to do a couple of different things. First of all, we're going to enable developers to merge secure code. So they're going to be able to merge fixes to their outstanding security problems. And then we're going to be able to guarantee that any new code that gets added is also secure because it's being validated and it's being fixed if there are any problems before it gets merged and deployed in the first place. We're also going to enable teams to work down their security backlogs. So if you are using one of these tools already, then we're going to be able to take the results of those tools and automatically fix a large proportion of them.
So that takes away distractions from the developers and gives people more bandwidth. And the result of all of that is that developers get to spend more time actually writing features and focusing on the things that matter to them. They get to ship features without feeling distracted by security problems. That is the goal here. And that's what leads us to this Codemodder framework that I'm going to be talking to you about today. So just in a one sentence summary of Codemodder, it is an open source code mod framework that is designed for fixing security issues. So I've already covered the open source part of this. I mentioned this is an open source project that's being maintained by Pixee. But the rest of my talk is going to focus on what it means to be a code mod framework, and then how we go about using that to fix security issues. So I've used this term code mod a couple of times now, and I just want to make sure we define this for everybody, in case you're not familiar with this concept. The word code mod has a simple etymology. It's just a shortened form of code plus modification. And what we really mean is that a code mod is code that is capable of changing or updating other code. So Codemodder is not the first code mod framework out there. There's some other prior art that I just want to mention. So the first thing to mention is this framework from Facebook that's actually called codemod, and this was intended to enable large scale refactoring with some level of human intervention. So if you imagine that you're doing big structural changes to your code, this is a framework that's going to help you with this. It was implemented in Python, and I believe that it's not actually actively maintained anymore.
If you're in the JavaScript or TypeScript ecosystem, you might already be familiar with this framework called jscodeshift, which is designed to quickly apply updates and framework migrations and version updates and things like that to a large number of files automatically. So this is quite popular, quite actively used, but this only applies to JavaScript and TypeScript code. And also there's really not any emphasis on security in this particular framework. And then there's also this project, also from Meta, from Instagram I believe, called LibCST, which is a framework for parsing and transforming Python code, but it also provides an API for developing code mods, and they also include some pre-built code mods as part of this framework, which include things like removing unused imports or ordering your imports or things of that nature. So we're going to talk a lot more about LibCST going forward, just to place a bookmark on that one. But the question is, if there's already all of this prior art for code mods, why did we need to design and develop a new code mod framework? And so that leads us to the Codemodder philosophy. So the fundamental idea of Codemodder is that we want to fix problems that are found by other tools and specifically by other security tools. So the whole idea of Codemodder is that we want to be able to take the results of those security tools that I showed you a few slides back, and use that to drive fixes for problems that are identified. So we want to use those tools to identify problems and then fix them. Another big part of the Codemodder philosophy is that we want our code mods to tell a story and to educate users. So if we're fixing security problems, we want users and developers to understand, first of all, what is the problem that is being fixed, and why is the new code a safer solution? And this is going to enable developers to write better code.
It's going to teach them about security, and it's going to help them write more secure code going forward. And it's also very important for Codemodder to make changes that are simple to understand and approve. So good storytelling is part of this, but we want to make changes that a developer can look at and understand that yes, this is a good change. I want to make this change to my code, and I'm going to go ahead and accept it and now have more secure code. So it doesn't really matter if we propose changes that nobody wants. We need to propose changes that are understandable and that developers are willing to accept. And so in order to do this, we've decided that we can leverage existing open source tools in order to build a solution here. So we've got open source tools out there like Semgrep, which are very good at identifying security problems and other code quality issues. And then on the other hand, we have this framework I mentioned before, LibCST, which is very good at transforming code and making changes to code. And so we feel like these two things belong together. If we can put these open source technologies together and orchestrate them, then we can build a tool that's very useful for developers and that can help automatically fix security problems in their code. So one aspect of this is we want to be able to process results that are identified by other tools. So what this means is if you're using tools like Sonar or CodeQL or Semgrep, we want to be able to process the output of those tools, which is often in the standardized file format called SARIF. But we want to process the results of those tools and then feed them to Codemodder in such a way that we can use LibCST to make transformations.
So these tools, we expect, are in some cases already being used by developers, and we're going to identify the locations that are insecure, that are pointed out by these security tools, and then take that and make fixes to those locations in the code. But the other thing that we want to do is sometimes we want to be able to invoke the open source tools ourselves. Sometimes we want to be able to find problems ourselves and use that to fix code. So in this case, we've written code mods that leverage Semgrep using custom rules that we've written, and we feed the results that we've generated with Semgrep to LibCST to fix problems in code. And now this is very useful for the development teams that haven't really adopted formalized security practices yet, in that Codemodder can give these teams a tool that will both find and fix problems. So we call this kind of code mod a find and fix code mod. Whereas the previous kind of code mod that I showed you, where we're consuming the results of external tools, those code mods are going to be called fix only code mods because we're taking results that have already been generated. So I mentioned that it's very important for us to educate users. And what this means is that we want our code mods to tell a story. So we believe that every fix that Codemodder provides is an opportunity to educate developers, both about security problems, but also about writing more secure code. We also believe that the fixes we provide should be comprehensible to developers and compelling. So if we tell the story right, it should be very easy for a developer to understand why the change is being made, what the original problem was, and that should make it compelling in terms of a fix from the perspective of a developer. And the result of that is that it makes fixes easy to merge.
So when a developer sees a fix from Codemodder, it should be very easy to accept that into their upstream code base and say, yes, that is a change that I want to make. We want these fixes to be easy to merge. So at this point in the talk, after we've learned a bit about the Codemodder framework, you're probably asking, how can I use it? So Python Codemodder is available as a package on PyPI. It's listed under the name codemodder. And so you can just run pip install codemodder. And when you do that, by default, you get this new executable called codemodder on your path. You can run it with the -h option. And what I'm showing you here is the output of the help message to the terminal. We're not going to go through all these options today, but I just want to give you the sense that there's a lot of different knobs to turn here, and Codemodder is very configurable. So that's the first step to getting it installed and seeing what it can do. And so then the next question you're asking is what does it actually do? And so if we invoke this codemodder executable with a path to your project, so a project that contains Python code, Codemodder is going to do a couple of different things. The first thing it's going to do is it's going to use the find and fix code mods that I mentioned earlier that are using Semgrep rules in many cases, and it's going to identify problems in your code, and then it's going to apply fixes for the problems that it identifies. And those fixes are going to be applied directly to your files on disk by default. So it's going to make changes to your code. The other thing that Codemodder does is it generates output files in this format that we've called CodeTF, which is designed as an interchange format for representing the results of Codemodder runs. So I'm not going to get into a lot of details about what CodeTF looks like today. It's not really important to this talk, but I will mention CodeTF at least one more time towards the end of this talk.
The general idea is that CodeTF can be consumed by upstream tools and it can be used to do interesting things. Okay, so at this point I'd like to show you a couple of examples of the kinds of security problems that Codemodder is capable of fixing. We have a pretty large catalog of code mods that we currently support. I think it's on the order of 40, 45, maybe close to 50 code mods that are currently supported. We're always developing more, but I'm just going to show you a couple of examples so you get a sense of what this framework can do. All right, so this first example is to replace the unsafe PyYAML loader. So if you're familiar at all with the PyYAML library in Python, you might be aware that the default loader in PyYAML is actually insecure. It potentially enables arbitrary code execution if you load a YAML file from a source that you don't trust. And so this change is relatively simple. What we do is identify locations where that unsafe loader is being used, and we replace it with a safe loader which is not susceptible to arbitrary code execution in the same way. So you can see this is a pretty simple change that's being made. It should be pretty simple for a developer to understand the reasons for this, and we think it's a good change. It makes your code more secure. The next one I'm going to show you is a personal favorite of mine. This is one that uses defusedxml for parsing XML. So if you're familiar at all with the standard library XML parsers in Python, so the ones that are provided out of the box with Python, these are actually vulnerable to different kinds of XML parsing attacks. And if you go to the documentation for these modules on the Python docs, you will see a big warning right at the top that says that these XML libraries should not be used for parsing untrusted XML data. And what the documentation actually does recommend is the use of this third party module called defusedxml.
And that's because it is secured against many of these different types of XML attacks. And so what this code mod does is it identifies places in your code where you're using the standard library XML parsers and it replaces them with parsers from defusedxml. And so you can see in this diff here that we're adding some imports and we're changing the parsers so that they use defusedxml instead of the ones from the standard library. Now the interesting thing about this code mod, which we'll come back to, is that in order for this to work properly, it actually needs to add the defusedxml dependency to your project if it's not already present. So again in a couple of slides we'll talk a little more about that. Here's another code mod that automatically closes resources. If you open a file handle and you forget to close it, in certain cases that can lead to resource overconsumption. It can make you susceptible to denial of service attacks and in certain cases can be quite catastrophic depending on the application. And so what this code mod does is it identifies any cases where the file handle wasn't closed and it rewrites those usages in terms of a context manager, which is the recommended way of handling these kinds of I/O resources. The interesting thing to me about this code mod is that when you look at the change, it's actually a very simple diff here. But the code that implements this code mod is actually very sophisticated and it's really quite impressive. So I think that this is a cool code mod and a very useful one as well. In a similar vein, this is another code mod where the change looks pretty simple, but the logic behind it is very sophisticated. This is one that parameterizes SQL queries to make them safe against SQL injection.
So if you look on line 147 here of the old code in the diff, you can see that string formatting using an f-string is being used to generate this SQL query, which is then executed, and that's potentially insecure against SQL injection depending on where that token string came from. So this code mod rewrites that query in terms of a parameterized query, which secures it against potential SQL injection. Again, I think this is a very cool, very valuable code mod and a very impressive one too. And this last one I'm going to show you is called use generator expressions. I like this one because it's not actually a security fix per se, and it also looks very simple just based on the diff. But it's a really interesting one because it identifies places where list comprehensions or other kinds of comprehensions have been used and rewrites them in terms of generator expressions where possible. And the reason for this is that you can have, in some cases, a very large performance benefit from doing this, especially if you're working with very large data sets and you need a lot of memory. This changes these data sets so that they're now lazily evaluated instead of having to load all of them in memory. I guess in some cases loading everything into memory could lead to denial of service. So there is a bit of a security impact here. But this is another case where it would be very hard to make this kind of change without the kind of syntactic and semantic analysis that we perform with these security tools and this transformation library. So I like this code mod because it's simple to understand, but still very interesting. So now that we've talked about some of the code mods, I want to dive a little bit deeper into the architecture of this framework. So when we designed this architecture, we realized that a code mod really consists of three different components. The first of these is called the detector. The second is called the transformer, and the third is called metadata.
So the detector is responsible for finding problems. These are the security tools that go out and find problems with your code. In the case of Codemodder, this can be one of two different things. It can be problems that were identified by external tools, in which case the detector is really a parser for the results of those tools and transforms them into something that Codemodder can use to fix. But in other cases, it's us running Semgrep directly. So Codemodder is directly invoking Semgrep with custom rules and using that to drive the fixes. The transformer is what's actually responsible for changing the code and making the fix. And then the metadata is the part of the code mod that actually tells a story and helps the developer to understand the code mod and what kind of change is being made. So if I show you this diagram, this schematic of our base code mod class, you can see on the left hand side, we've got the detector, and that detector is feeding into what we've called a transformer pipeline, which can potentially be multiple transformers that are chained together. And then we also have this box up top, which is metadata, which includes some fields that I'll talk about on an upcoming slide. But in practice, what this ends up looking like is a little bit like this, where our detector is something like Sonar or CodeQL or Semgrep. Again, sometimes the detector is parsing the results of these tools. Sometimes, in our find and fix code mods, the detector is running Semgrep itself. And then it's being fed into transformers that are implemented in terms of LibCST. And that transformer is what's responsible for actually changing the code. So I mentioned metadata on the last slide and I want to take a minute to talk about what that looks like. So metadata consists of a name, which is really a unique identifier for a code mod. You can see here in this example that this name has three different parts.
The first part is pixee, which is telling you the origin of this code mod. It means that we wrote this at Pixee. The second component of this is the language that it applies to. So we're talking about Python code mods, but we do support another code mod framework for Java and we intend to build some others going forward. And then the third component after the slash is the actual name of the code mod itself, which is use-defusedxml. The next part of code mod metadata that's interesting is a summary, which is a short human-readable description of the change being made. In this case, it's "Use defusedxml for parsing XML". This tells the developer what the code mod is doing. And then we've got a description. And remember, we want to be able to tell a good story about a code mod. So we want to be able to support a reasonably long-form description. In this case, we've decided to represent this as a separate Markdown file which is automatically associated with the code mod code. And that allows us to use Markdown rendering and write a nice long-form description about this code mod. And it also enables us to have this without cluttering up the code itself, so that the description doesn't live right next to the code, it lives in a separate place. Okay, so I mentioned with the defusedxml code mod that sometimes a code mod needs to add a dependency. Sometimes the right thing to do to fix a security problem is to use a different library that either has a more secure implementation or sometimes even to introduce a security package that is capable of hardening certain operations that tend to be insecure. So to do that we need to be able to add dependencies to the project being modified. And if you're familiar at all with the Python packaging ecosystem, you know that doing this in Python is not that easy of a problem to solve.
In the simplest case, if you're using a requirements.txt file, we can generally just add that dependency to the requirements.txt file if it's not already present there. But in the Python packaging ecosystem there are a bunch of different places where dependencies can be expressed. This includes pyproject.toml, which is currently recommended for setuptools. It's also used by Poetry, which we don't currently support, but we may going forward. But some older projects might be using setup.cfg if they're using setuptools. And then there's also setup.py, which can express dependencies, and that's sort of the older, less recommended way of doing things now. But we need to be able to figure out which of these is being used in a project and where the right place to add the dependency is. This is a pretty tricky problem. It's actually a bit harder to solve than the problem that Dependabot has because they can just identify existing dependencies and update them. But we need to find the right place to add a new dependency. So I think that this is really useful and it's cool. And it's also something that definitely differentiates us from other code mod frameworks. I don't think many other frameworks are necessarily thinking about this kind of thing. All right, so we've covered a lot of ground about the Codemodder philosophy. We've talked about some examples and then the underlying Codemodder architecture. So it's time for us to dive right in and write a code mod. So first of all, I want to mention that the Codemodder framework supports a plugin infrastructure for loading custom code mods. So if you write a custom code mod, our framework is capable of automatically loading that custom code mod and making it available to the framework for use. I'm not going to get into a lot of detail about how that plugin infrastructure works. That's probably better covered in our documentation.
But I will say that if you are interested in following along and writing your own custom code mod plugin, then you should start with this code mod plugin template, which is a cookiecutter template that you can use to generate your own custom code mod project. And what that's going to do is enable you to have a project that, if you pip install it, is automatically going to be picked up by Codemodder, and that custom code mod is going to be available. So if you're interested in doing this yourself, go ahead and get started with this cookiecutter template. And for the sake of the examples I'm going to show, we're going to assume that all of this is within the context of this particular cookiecutter template. Okay, so here's an example code mod that we're going to write. We're going to write a code mod called secure random, which is going to find places where the standard random module is used in Python, and we're going to replace it with the more secure and safer SystemRandom from the secrets module. And the reason for this is that if you're generating cryptographic primitives or using this to generate passwords or other kinds of keys, the standard random module is not secure enough for those purposes. So we think that this is a good hardening step to make. It really has no downsides and we recommend it. So for this code mod, first notice that we're importing this core code mod class, which we're then using to define a secure random code mod. And remember when I talked about code mod architecture and it having three different components? We had metadata, we had a detector and we had a transformer. So you can see that the secure random code mod is defining each of those things. But the interesting thing that we're going to get into is what the definition of each of these different components looks like. So first of all, we're going to talk about metadata.
You can see here we're defining this new object using the Metadata class, and it has a name, which we're calling secure-random. Now notice that this doesn't have the origin or the language component that I pointed out previously. That's because our framework is automatically going to add those based on the plugin. We know that this is a Python plugin, so the language doesn't have to be provided here. And we also encode the actual origin name in the plugin itself. Ours would be pixee, but yours would have a different name for your project. So we're calling this secure-random, and the summary that we're providing is "secure source of randomness". There's also this other field here called review guidance, which gives developers an idea of how much attention they need to pay to this particular change before they merge it. And recall that I mentioned that the long-form description is actually stored in a separate Markdown file, which I'm not showing here, but that's automatically going to get associated with the codemod that we're writing.
All right, so the next part that we need to implement is the detector for this particular codemod. We're implementing it as a find-and-fix codemod, which means we're going to find the problem ourselves, and we're going to do that by writing our own custom Semgrep rule. So we're using the SemgrepRuleDetector class here to define the detector, and what we provide to that is a Semgrep pattern. I'm not giving a tutorial on Semgrep here, and I'm not going to get too far into the weeds about what this pattern means. Suffice it to say that we're identifying all the cases where the random module is being used, but we're also making sure to exclude SystemRandom, which is already secure. So this pattern is going to find all the locations in the code that look insecure, and the results of that are going to be fed to the transformer, which is what we show here.
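The metadata and detection ideas above can be approximated with the standard library alone. The sketch below is an illustration, not the code on the slide: `CodemodMetadata` is a hypothetical stand-in for the framework's metadata class, the review guidance value is made up, and a walk over Python's built-in `ast` tree stands in for the Semgrep rule.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class CodemodMetadata:
    """Hypothetical stand-in for the metadata fields named in the talk."""

    name: str
    summary: str
    review_guidance: str


METADATA = CodemodMetadata(
    name="secure-random",
    summary="Secure Source of Randomness",
    review_guidance="merge after cursory review",  # illustrative value only
)


def imports_random(tree: ast.AST) -> bool:
    """True if the module contains a plain `import random` statement."""
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(a.name == "random" and a.asname is None for a in node.names):
                return True
    return False


def find_insecure_random_calls(source: str) -> list[tuple[int, str]]:
    """Return (line, function name) for each random.<func>(...) call.

    Follows the idea described in the talk: match uses of the random module
    but skip random.SystemRandom, which is already secure.
    """
    tree = ast.parse(source)
    if not imports_random(tree):
        return []
    findings = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "random"
            and node.func.attr != "SystemRandom"
        ):
            findings.append((node.lineno, node.func.attr))
    return sorted(findings)
```

For a module that imports random and calls `random.randint(0, 10)` on its second line, this returns `[(2, 'randint')]`, and a call to `random.SystemRandom()` in the same file is not reported.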
First of all, note that we are creating a transformer class using LibcstResultTransformer as the base class. So we're explicitly saying here that we're using libcst for the transformation. We've added this layer of abstraction because we expect we might want to have other kinds of transformers in the future. So right now you have to state explicitly that you're using a libcst transformer.
Okay, so I'm going to jump to this method called on_result_found. This is where all the magic happens. This is a callback that the transformer class defines, and it gets called automatically by the framework in response to each of the locations identified by the detector. Remember, we're using a Semgrep detector, and it's going to find locations in the code that look insecure. The framework is going to automatically call this method on the transformer any time it sees one of those locations.
What we're really doing here is updating the call target of this operation. Originally the call target was random. Maybe we had a call to random.randint, so the call target there was the random module, but we want to replace that with secrets.SystemRandom(). So the new call is going to look like secrets.SystemRandom().randint. That means we're able to take advantage of this API method called update_call_target and use it to implement our transformation. Now, this is a pretty common use case. If you recall our defusedxml codemod, it would actually be doing the same thing: replacing the original call target, which is the standard library XML module, with defusedxml. So this method shows up in a bunch of different places. The other interesting thing to call out here is that if we're using the secrets module now, we need to make sure that it's imported.
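The call-target update and the import bookkeeping can be sketched without the framework. The version below swaps in Python's built-in `ast.NodeTransformer` for the libcst transformer used in the talk, so unlike libcst it discards comments and formatting; it also only looks at top-level imports. The names `SecureRandomTransformer` and `harden` are made up for this example, and `ast.unparse` needs Python 3.9 or later.

```python
import ast


class SecureRandomTransformer(ast.NodeTransformer):
    """Rewrite random.<func>(...) calls to secrets.SystemRandom().<func>(...)."""

    def __init__(self) -> None:
        self.changed_lines: list[int] = []

    def visit_Call(self, node: ast.Call) -> ast.Call:
        self.generic_visit(node)  # handle nested calls first
        func = node.func
        if (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "random"
            and func.attr != "SystemRandom"
        ):
            # Swap the call target: `random` becomes `secrets.SystemRandom()`.
            system_random = ast.Attribute(
                value=ast.Name(id="secrets", ctx=ast.Load()),
                attr="SystemRandom",
                ctx=ast.Load(),
            )
            func.value = ast.Call(func=system_random, args=[], keywords=[])
            self.changed_lines.append(node.lineno)
        return node


def harden(source: str) -> str:
    """Apply the rewrite, then fix up top-level imports."""
    tree = ast.parse(source)
    transformer = SecureRandomTransformer()
    tree = transformer.visit(tree)
    if not transformer.changed_lines:
        return source
    # Drop `import random` only if nothing refers to the name any more.
    still_used = any(
        isinstance(n, ast.Name) and n.id == "random" for n in ast.walk(tree)
    )
    body = []
    for stmt in tree.body:
        if isinstance(stmt, ast.Import) and not still_used:
            stmt.names = [a for a in stmt.names if a.name != "random"]
            if not stmt.names:
                continue
        body.append(stmt)
    has_secrets = any(
        isinstance(s, ast.Import) and any(a.name == "secrets" for a in s.names)
        for s in body
    )
    if not has_secrets:
        # Simplification: a real tool would insert after docstrings and
        # __future__ imports rather than at position 0.
        body.insert(0, ast.Import(names=[ast.alias(name="secrets", asname=None)]))
    tree.body = body
    return ast.unparse(ast.fix_missing_locations(tree))
```

Calling `harden("import random\nx = random.randint(0, 10)\n")` produces a module that imports secrets, no longer imports random, and calls `secrets.SystemRandom().randint(0, 10)`.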
To handle the import, we call this method called add_needed_import, which for each file checks whether the secrets module is already imported and, if it's not, goes ahead and adds it. And on line ten, above that, you can see we're also calling remove_unused_import, which cleans up after ourselves and makes sure that any imports left unused after this change get removed. That makes the linters happy and keeps the code clean. Okay, so that defines the transformer class, but we also need to define the transformer pipeline, which for this particular case consists of only a single transformer.
If we go back to this example, we've defined the metadata class, we've defined the detector and we've defined the transformer. And so we've actually written a codemod that's capable of making a change and making your code more secure. If we look at the diff that is generated when we apply this codemod to PyGoat, which is a deliberately vulnerable Python web application, you can see that the uses of random have been replaced with secrets.SystemRandom(). You can see on line nine up there that we've removed the random import, which is no longer used, and on line 40 that we added the secrets import. This is a change that we hope a developer would be willing to accept.
A big part of our philosophy with the codemod API is that we want to make the easy things as easy as possible. I think you saw some of that with the on_result_found method and the methods being called there, which are intended to handle the most common use cases in a pretty straightforward way. But I also think that in the example I showed you there was a lot of boilerplate: we had to define a couple of different classes and we had to put it all together. So to make the easy cases as easy as possible, we've defined the simple codemod API.
A simple codemod is one that has a single detector and a single transformer, and specifically a single libcst transformer. If those two things are true, then we can use this SimpleCodemod base class to implement our codemod. So this is the same codemod that I showed you before, except it's rewritten in terms of the simpler API. You can see each of the components here. We define the metadata, we define our Semgrep detector pattern here, and then we define this on_result_found method. And this is all in 20, maybe 25 lines of code. It's easy to read, and we believe this makes for a very nice interface for defining some of the simpler codemods.
However, we also want to make sure that the hard things are still possible. We don't want to lose any expressive power by having a simplified codemod API. So I'm going to show you a slightly more complicated example. This one is called subprocess-shell-false. It identifies any subprocess calls where shell is set to True and flips it to False, which is a safer default. I'm not going to get too far into the details of this codemod; you don't need to understand it all. What I do want to point out is that instead of the on_result_found method that we saw in the previous example, we have this leave_Call method, which directly exposes the underlying libcst transformer interface. So we have the full expressive power of libcst here and can leverage that to do some fairly sophisticated transformations for certain codemods. Now, there are also some other things that we had to do here that we didn't have to do in the previous example, because we're using a lower-level API. One of these, in the first box, is that we need to filter by file path and line number. This is something that can be given on the command line to include or exclude certain files or lines from analysis.
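As a rough picture of what working at the lower level involves, here is a sketch of the shell=True rewrite with an explicit line filter and an explicit record of each change. It uses the standard library's `ast.NodeTransformer` in place of libcst, so `visit_Call` plays the role that `leave_Call` plays in the talk; the class name, the `include_lines` parameter and the `changes` list are assumptions made for this example, not the framework's API.

```python
import ast
from typing import Optional


class ShellFalseTransformer(ast.NodeTransformer):
    """Flip shell=True to shell=False on subprocess.<func>(...) calls."""

    def __init__(self, include_lines: Optional[set] = None) -> None:
        # None means every line is eligible; otherwise only the listed lines.
        self.include_lines = include_lines
        # Line numbers of the calls that were modified.
        self.changes: list = []

    def visit_Call(self, node: ast.Call) -> ast.Call:
        self.generic_visit(node)
        # Explicit filtering step: skip calls outside the requested lines.
        if self.include_lines is not None and node.lineno not in self.include_lines:
            return node
        func = node.func
        targets_subprocess = (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "subprocess"
        )
        if not targets_subprocess:
            return node
        for keyword in node.keywords:
            value = keyword.value
            if (
                keyword.arg == "shell"
                and isinstance(value, ast.Constant)
                and value.value is True
            ):
                value.value = False
                # Explicit reporting step: record where the change happened.
                self.changes.append(node.lineno)
        return node
```

Running it with `include_lines={2}` over a file whose second and third lines both call `subprocess.run(..., shell=True)` changes only the call on line 2 and leaves `changes` equal to `[2]`.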
We need to call that explicitly here, whereas in on_result_found it's already handled by the callback. And down at the bottom, in this other box, we're calling this report_change method, which is what helps us generate the CodeTF file that I mentioned before. Again, that's handled automatically by the on_result_found method as well, so we didn't have to do it in the previous case, but we do have to do it here. That's just to give you a sense of what a more complicated codemod might look like.
All right, so getting to the end of my talk, I want to take just a minute to talk about some future directions and where we're looking ahead for this framework. I showed you this diagram before, where our detector can take a variety of different security tools as input, and our transformers are implemented in terms of libcst. Now, I think the elephant in the room in 2024 is: where do large language models, or LLMs, fit into this? Should we be using them to implement some of our transformations? Looking forward, some of our transformations might look more like this, where we're using an LLM provider. I've used OpenAI here since it's probably the best known and most popular one, but this could be a variety of different models. It could be Llama, it could be something else. But should we be using LLMs to perform our transformations? There are a couple of considerations here. First of all, do developers trust LLMs to make security changes to their code? I think that's an open question. Another thing is that right now we have the advantage of having an open source framework. People can see exactly the kinds of changes that we're making. They can understand them in terms of code, and they can make a pull request or open an issue. When you use LLMs, you lose some of that transparency.
On the other hand, there are definitely some codemods I've seen that would require more context than the kind of syntactic and semantic analysis we're doing can provide. That's where an LLM could really help us make some even more sophisticated and clever kinds of changes. So it's something we're considering going forward.
The other thing we're thinking about: of course, I'm talking to you about the Python Codemodder framework. All of this is currently implemented in Python, and it's also being applied to Python code. But the question we have is, could you have a framework that's implemented in Python, so that the detection and all of this orchestration is in Python, but that applies transformations to other kinds of code? Of course, in this case we wouldn't be using libcst, because that's only for Python. But maybe LLMs could help us here, or maybe there are some other frameworks that could help us out. So this is just something we've been thinking about.
All right, so I'm just going to shamelessly say we'd love to have your feedback. This is an open source project. We'd love for you to open GitHub issues with suggestions or bug reports. We'd love for you to clone or fork the repo and try it out yourself. We'd love to earn your stars. More than anything, we would love to hear ideas from you about codemods you want to see. And if you would like to contribute your own codemod directly upstream to our project, that would be awesome. We would love to see it.
One thing I do want to mention right before the end of my talk is how we use CodeTF, the interchange format I talked about. The way that we're using it at Pixee is that we've built a GitHub application that you can install for free and that consumes the results of Codemodder. You can see here, in this box, where the summary field of that codemod is being used, and then down here is where the description is being used.
So Pixeebot automatically applies Codemodder to your code base, orchestrates all of this together, and opens pull requests with suggested changes for your code. It's really cool. Again, it's free to install. We'd love for you to try it out. The other thing the Python Codemodder helps with is our Pixee command line interface, or CLI. This is a higher-level user interface around both our Python and Java codemodders that provides a bit of a nicer user experience, and the results of Codemodder are being used by this tool. It is also free to use. It's installable from Homebrew, and we'd love for you to try it out and give us feedback.
So that's my talk. Thank you so much for spending a bit of time with me and learning about Python codemods. You can find me here on GitHub. Here's my email address. I'd love to get feedback from you. Check us out at pixee.ai and look me up on LinkedIn. I'd love to see your GitHub issues or get an email from you. And thanks again.


Conf42 Python 2024 - Online

February 29 2024


Dan D'Avella

Principal Engineer @ Pixee
