Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, welcome to mapping the minefield of open source software risks,
where we'll be talking about the dumpster fire that is software supply chain
security. To give you a quick background on myself, my name is
Kyle Kelly. I am the founder at CramHacks, which is a weekly software supply
chain security newsletter. I'm also a security researcher at
Semgrep, working on the supply chain security team. And lastly,
I'm an executive consultant at Banksec, where we offer cybersecurity consulting
services to financial institutions across the United States.
As far as education goes, I do have a master's degree in computer science
and a number of cybersecurity related certifications.
Before jumping into the presentation, I'll quickly go over the
agenda, which is broken up into three sections: industry, prioritization,
and easy wins. Ultimately, I'll be covering what software
dependencies are, some generalizations around their usage and risks,
how to prioritize these vulnerabilities, and some easy wins
that can be applied to make life a heck of a lot easier.
So what is a software dependency? Well, a
software dependency is a specific module, library, or software
package that another piece of software relies on to function correctly.
In essence, with the help of dependency management, tooling for downloading,
version management, et cetera, dependencies are what enable reuse
of code at scale. Of course, these include both internal and open
source dependencies, but we'll be focusing on open source specifically.
So does your organization use open source code?
Almost definitely. If you don't, I honestly don't know how you get anything
done. Might be a testament to my own software developer skills,
but I know for a fact that without open source
code, I would not be getting anything done. In 2022 alone,
GitHub recorded over 52 million new open source projects.
That is absolutely nuts. But focusing on known software dependencies,
we can look at a public package repository such as NPM,
which hosts somewhere around 3 million unique packages.
Now NPM does by far host the largest number of unique packages,
but that makes sense given how modular JavaScript tends to be.
That being said, others, like PyPI, are still hosting about 500,000
unique packages, and after you account for different versions, that's over
5 million releases. It's really no surprise that over 90%
of organizations use open source software in some shape or form.
Maybe a more interesting data point is that any given application
is likely to be composed of 70% to 90% open source
code. That being said, there's one major caveat to
that statement: although many sources will reference the 70%
to 90% figure, in actuality the coverage is likely much less
when you consider active code, or lines of code that are
actually executed by the application. Open
source dependencies often have many, many lines of code, but you
are rarely, if ever, using all of them. And we've actually
concluded that for any given dependency import into a project,
you're likely only using roughly 10% of its functionality.
Now that we know your organization is using open source software dependencies,
let's talk about vulnerabilities. In 2023,
there were just over 28,000 total CVEs
assigned, and this number is trending upwards for sure. But we don't necessarily want
to use CVEs as a single source of truth for supply chain vulnerabilities.
Firstly, many CVEs are for commercial applications or
standalone projects. Secondly, the CVE process is a bit
painful, especially in comparison to the simplicity of GitHub
Security Advisories, at least in the context of open source software.
That said, we still have a lot of work to do on the vulnerability
discovery and disclosure side, so I project that this 28,000 total
CVE count in 2023 is going to be significantly surpassed
in the next couple of years. Taking a look at some of the ecosystems
covered by the GitHub Security Advisory database, we can see that ecosystems
don't all get the same amount of attention. And this also kind of
leads to determining how much value you could be getting from
a software supply chain security tool or software composition analysis
tool. If you're using Swift, Elixir, Dart, or Flutter,
there's really not a whole lot of value to be had, simply because there aren't
a lot of known vulnerabilities. Supply chain security tools aren't
going out there and discovering application security
vulnerabilities in open source code. We're simply saying,
hey, you're using this dependency. What are known vulnerabilities
related to it? How can we help you prioritize which ones to
remediate first, either through upgrading or replacing that
dependency with a more secure alternative? So, as of the last
time I checked, there were about 16,000 total advisories in the
GitHub Security Advisory database, and these have all been
reviewed and verified by the advisory maintainers. Roughly half
are either a high or critical severity, and in my experience
with reviewing thousands of advisories, anything less than a high
severity can usually be ignored, especially if you have highs
and criticals impacting your application. Definitely always prioritize
the highs and criticals first. You may have heard
of solutions like the Exploit Prediction Scoring System (EPSS)
or CISA's Known Exploited Vulnerabilities catalog (KEV),
which can also be used to prioritize vulnerabilities based on their likelihood of being
exploited. However, if you take a closer look at these, especially the
Known Exploited Vulnerabilities catalog, you'll find that they really focus
in on commercial products and standalone applications.
For instance, EPSS assigns a very heavy weight if
Microsoft is mentioned in the advisory, and the KEV
really only has a select few advisories related
to any open source project at all.
Moving along, let's take a look at how to manage these vulnerabilities.
Here's an example of just seven projects which I've run Semgrep Supply Chain on,
but without our reachability analysis filter enabled. This is intended
to be a one-to-one match of what you would see using something like
Dependabot, but I'll be walking through what we do differently to make this number
much more palatable. Now, 2,500 might seem
like a little or a lot, and that's just going to depend on the size
of your projects, how many dependencies you use, et cetera. But to put
it into perspective, I've seen orgs with a total vulnerability count
in the high six figures. Hopefully you're already using some sort
of supply chain security tooling and have a rough idea of your magic number.
For those of you with a big number, it's no surprise that Sonatype
has reported that supply chain attacks are increasing at a rate of
742% per year. Now that being said,
let's talk about prioritization. Well, the likelihood
is that you're using some tool other than Semgrep Supply Chain, if any.
So your security staff or developers are likely to approach the problem
with questions such as these, especially the obvious ones.
Are we even using this function and is it even exploitable?
Then they'll likely spend valuable time and resources only to learn the
code does not use the vulnerable function, making it unexploitable:
"We don't care about this." Or my personal favorite: "This vulnerability is
a critical severity, but it's a regular expression denial of service vulnerability,
and it's on an internal application. There's no risk; we can ignore
this." Don't spend valuable security engineering time
fixing these ridiculous vulnerabilities. That being said,
open source dependencies enable impressive efficiencies, there's no doubt about that.
But time spent by your staff to investigate these questions takes
away from that value. So let's touch on how reachability offers
near-effortless prioritization. In the traditional
sense, Semgrep Supply Chain still falls under the dependency scanner category,
no different than Dependabot, OWASP Dependency-Check,
or npm audit, just to name a few.
Traditional solutions like those mentioned
more or less take a vulnerability database with thousands of vulnerabilities
and report the hundreds which impact your dependencies based on each
dependency's name and version. But by introducing
code scanning reachability, you can narrow down from hundreds to just
the tens that result in an actual vulnerable usage,
meaning the vulnerability actually rests in your
code. Now I have a slide to get deeper into this, but at a
high level, reachability to us at Semgrep means that the piece of
a dependency that introduces risk, aka the vulnerable component,
appears in your code or is reachable in your code.
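As a toy illustration of the difference between version-only matching and code scanning reachability — a simplified sketch, not Semgrep's actual engine, and the advisory, package, and function names here are made up:

```javascript
// Hypothetical advisory: "leftpad-x" below 2.0.0 is vulnerable,
// but only if your code calls its render() function.
const advisory = {
  package: "leftpad-x",
  vulnerableBelow: "2.0.0",
  vulnerableFunction: "render",
};

// Traditional SCA: flag purely on dependency name + version.
// (Naive string comparison stands in for real semver comparison.)
function flaggedByVersion(dep, adv) {
  return dep.name === adv.package && dep.version < adv.vulnerableBelow;
}

// Naive reachability: additionally require a call site in your source.
function flaggedByReachability(dep, sourceCode, adv) {
  if (!flaggedByVersion(dep, adv)) return false;
  const callPattern = new RegExp("\\b" + adv.vulnerableFunction + "\\s*\\(");
  return callPattern.test(sourceCode);
}

const dep = { name: "leftpad-x", version: "1.4.0" };
console.log(flaggedByVersion(dep, advisory));                       // true
console.log(flaggedByReachability(dep, "pad(str);", advisory));     // false: render() never called
console.log(flaggedByReachability(dep, "ui.render(x);", advisory)); // true: vulnerable usage found
```

A real engine parses code rather than pattern-matching text, but the filtering effect is the same: the version match alone produces a finding, while the reachability check discards it unless the vulnerable usage actually appears.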
The magic sauce of how we do this is in our usage of the open
source Semgrep engine, which also powers our other commercial products.
This enables our researchers to write Semgrep rules that detect
the vulnerable usage of dependencies. If you've ever used Semgrep,
whether it be the paid version or open source, you probably know how easy
it is to write Semgrep rules, and so we can do this very efficiently.
Let's move on and discuss software composition analysis,
often called SCA, which is ultimately the prioritization enabler.
Afterward, we can look at some more research backing our claim here,
which suggests reachability reduces thousands of alerts into
tens of high quality findings. So here I've broken
up SCA into four different categories, manifest,
lock file, static analysis, and dynamic analysis,
and each of these offer insights into a project's dependencies, just in
different ways. Manifest files will tell you what direct dependencies
are used, but really not much else. This isn't all that effective
these days. Next we have lock files, which take things
a step further by also presenting specific versions tied to those dependencies,
and also include any transitive dependencies. So any dependencies
required by the dependencies being imported into your project.
Now, manifest and lock file analysis are what I often refer to
as traditional SCA, whereas the industry has since incorporated
other techniques like static and dynamic analysis. Now,
I'm not going to argue static versus dynamic analysis,
we just don't have enough time for that. But you can probably guess that I
prefer static analysis, especially in the context of software supply chain.
The two main reasons are that (a) build reproducibility is often a
nightmare, and with static analysis we don't necessarily care if your code
can run, and (b) I enjoy research at scale. Simple as that.
I can confidently run Semgrep against thousands of projects without manually
tuning each and every one of them to get dynamic analysis to work.
And dynamic analysis is significantly slower, at least based on my experience.
So that said, there are some potential benefits to dynamic
analysis, the obvious one being that in static
analysis we can't determine whether a path is likely to
or ever will be taken, which means dynamic reachability
could be slightly more effective in that regard. But based on the research I've
reviewed, the complexity tends to cause static analysis to be favorable,
with only marginal benefits to using dynamic analysis.
And that's why at Semgrep we use a combination of manifest,
lock file, and static analysis. This gives us the best bang
for the buck, enabling effective reachability, but without compromising performance
and usability. Now, I briefly described reachability earlier,
but we should narrow in on this because, similar to SCA, the industry
seems to be using it to mean all different types of things. For example,
cdxgen, which is the open source CycloneDX SBOM generation
tool, calls a dependency reachable if it is used at
all in your code. Other tools may call a vulnerability reachable
if it impacts a direct dependency, or if it affects a public-facing application.
Now, Semgrep reachability, which we commonly refer to as code scanning
reachability, does as the name implies: it scans your code to identify
if, and often how, you are using a known vulnerable function.
If it finds a path where the function will get called, then it's reachable.
And the way we determine this is based on
what the vulnerability is from the security advisory. If it says
you're only introducing risk if you call this particular function in
a particular way, well, we can write a Semgrep rule to identify
that usage in your code. And this has been proven
time and time again to be effective. But to reference some external research,
there's a paper by NC State titled "A Comparative Study of
Vulnerability Reporting by Software Composition Analysis Tools,"
which found only 2.1% of roughly 2,500
vulnerability alerts were found to be reachable, and that's through static
code analysis. Early on, we at Semgrep
conducted an internal study specifically using Semgrep's reachability analysis
and evaluated 1,100 open source projects.
Of these, Semgrep identified 932 total vulnerabilities,
but only twelve were determined to be reachable, and that's actually
less than 2%. But keep in mind there are several factors that
may play a role in your results, most commonly the language of the project.
For example, JavaScript and Python dependencies tend to benefit a
ton from reachability, and that's just due to their modularity, whereas other
languages like C# are more likely to be used by standalone projects where
static code analysis is less capable of determining function calls.
Discovering vulnerabilities is great and all, but what is objectively
more important is remediating them. Sonatype's 2023
report discovered that 96% of known-vulnerable downloads had
a fixed version already available, so that's a pretty good sign that open source project
maintainers are doing a good job of releasing fixed versions. Log4j
is a great one to highlight here, because Sonatype actually has a dashboard
for known vulnerable downloads, which last I checked had over
300 million vulnerable downloads since December 2021,
and still, I mean, roughly 25% of
downloads in the last seven days were vulnerable versions.
So let's focus on what orgs can be doing better to alleviate
some common struggles when it comes to remediating vulnerabilities,
besides the obvious one, which is to use some form of supply chain security tooling.
In the remaining slides, we'll touch on semantic versioning, manifest files,
and transitive risks. All right, so semantic
versioning is beautiful and I love it. I really can't express that enough.
It can just be super beneficial for identifying easy wins during the remediation
process. For example, maybe you have 200 vulnerabilities, but 100
of them have a fixed version that is just a patch upgrade. Now,
if I was the responsible developer and saw this, I'd probably just hit the
upgrade button. It's that simple. Although I maybe wouldn't suggest
this on anything super important in production. Just to be safe,
maybe an additional 50 are minor upgrades. Again, I'm just going
to hit the upgrade button, but if you're a more responsible person,
I'd say pay closer attention to minor upgrades versus patch
upgrades. Now, major upgrades, on the other hand,
are the ones where I definitely want to take a look at the changelog
or other relevant documentation. And that's just because I've been bitten
more than once by major upgrades, breaking functionality,
or even worse, changing functionality just enough where it only slightly impacts
your project. And maybe you missed a test case.
Lastly, just a reminder that most of what I've shared in this
presentation is ecosystem dependent, so breaking changes may be
more common in some languages than others.
Also, semantic versioning isn't actually enforced. For example, the study
"Breaking Bad? Semantic Versioning and Impact of Breaking Changes in Maven
Central" concluded that roughly 20% of non-major
releases in the Maven Central ecosystem were breaking changes.
But then again, that also implies that roughly 80% were not
breaking changes. So if you're upgrading 100 dependencies, 80%
of the time you're safe to just go ahead and hit the update button.
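That patch/minor/major triage can be mechanized. Here's a minimal sketch — my own illustration, not any particular tool's logic — that buckets fixes by upgrade effort:

```javascript
// Classify an upgrade from `current` to `fixed` as major, minor, or patch
// per semantic versioning (assumes plain x.y.z versions, no prerelease tags).
function classifyUpgrade(current, fixed) {
  const [cMaj, cMin] = current.split(".").map(Number);
  const [fMaj, fMin] = fixed.split(".").map(Number);
  if (fMaj !== cMaj) return "major";
  if (fMin !== cMin) return "minor";
  return "patch";
}

// Surface the easy wins: everything fixable without a major upgrade.
function easyWins(vulns) {
  return vulns.filter(v => classifyUpgrade(v.current, v.fixed) !== "major");
}

const findings = [
  { name: "pkg-a", current: "1.2.3", fixed: "1.2.9" }, // patch: just upgrade
  { name: "pkg-b", current: "1.2.3", fixed: "1.4.0" }, // minor: probably fine
  { name: "pkg-c", current: "1.2.3", fixed: "2.0.0" }, // major: read the changelog
];
console.log(easyWins(findings).map(v => v.name)); // → [ 'pkg-a', 'pkg-b' ]
```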
Now let's talk about this in the context of a JavaScript
developer, where you've likely seen a package.json
file, otherwise known as your project's manifest. Now, if you're developing
in a different ecosystem, you'll likely have a different file name, but the concept
is more or less the same. A manifest file contains all
direct dependencies used by your project and is what is used to
generate a project's lock file. We touched on this a bit earlier
when discussing SCA, but as an example, whenever you run npm install or
directly generate a lock file with the --package-lock-only flag,
npm is looking at your manifest file and your version ranges,
and then assigning specific versions to these dependencies, while also
determining all transitive dependencies as well as their respective
versions. Now, what most people don't realize is the potential
ROI when you keep a manifest file up to date, especially when paired
with meaningful version ranges. For example, for a dependency that is
widely used throughout your project, you may want to actually review all changes before
upgrading. But for a package where you're only using one
component, or one you know to be well maintained
and adherent to the semver specification,
you can safely assign a broader version range.
So back to our example with NPM, you can get fairly creative on how
you specify versioning for each direct dependency.
To quickly go over some common ones: you can specify an exact
version, use a tilde to allow any patch version,
or use a caret to allow both minor and patch versions,
so basically anything except a major version. Or you can
always go for the latest version via an asterisk. Just to show
you a quick example of what this might look like in the Semgrep app: you can
clearly see your version and the fixed version, so if you see
it's just a patch or a minor upgrade, well, it should be an easy one.
Just go ahead and upgrade. A research paper titled
"A Large-Scale Analysis of Semantic Versioning in NPM"
reported that for the NPM ecosystem specifically,
the minor flexible version specification was by far the most
commonly used. The study also determined that over 87%
of all dependencies in their test sample were configured
to receive updates automatically. But that
doesn't mean they actually get updated automatically.
A common occurrence is that new versions are available, but you haven't
updated your lock file in a while. Of course, this can be resolved
by regularly running npm i (or similar) to
regenerate your lock file and use the latest versions.
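To make the range specifiers and lock-file resolution concrete, here's a stripped-down sketch of how a resolver picks a version to pin for each manifest entry. Real npm semver handling has many more cases (0.x carets, prerelease tags, complex ranges), and the package names below are made up:

```javascript
// Does `version` satisfy the range? Supports exact "1.2.3", "~1.2.3"
// (patch drift), "^1.2.3" (minor + patch drift), and "*" (anything).
function satisfies(version, range) {
  if (range === "*") return true;
  const [vMaj, vMin, vPat] = version.split(".").map(Number);
  const [sMaj, sMin, sPat] = range.replace(/^[~^]/, "").split(".").map(Number);
  if (range.startsWith("^")) {
    return vMaj === sMaj && (vMin > sMin || (vMin === sMin && vPat >= sPat));
  }
  if (range.startsWith("~")) {
    return vMaj === sMaj && vMin === sMin && vPat >= sPat;
  }
  return vMaj === sMaj && vMin === sMin && vPat === sPat; // exact pin
}

// Numeric version comparison, ascending.
const byVersion = (a, b) => {
  const pa = a.split(".").map(Number), pb = b.split(".").map(Number);
  return pa[0] - pb[0] || pa[1] - pb[1] || pa[2] - pb[2];
};

// "Lock" each manifest entry to the highest available version in range.
function resolve(manifest, registry) {
  const locked = {};
  for (const [name, range] of Object.entries(manifest)) {
    locked[name] = registry[name].filter(v => satisfies(v, range)).sort(byVersion).pop();
  }
  return locked;
}

const manifest = { "pkg-a": "^1.2.0", "pkg-b": "~2.0.1", "pkg-c": "1.0.0" };
const registry = {
  "pkg-a": ["1.1.0", "1.2.4", "1.9.9", "2.0.0"],
  "pkg-b": ["2.0.0", "2.0.5", "2.1.0"],
  "pkg-c": ["1.0.0", "1.0.1"],
};
console.log(resolve(manifest, registry));
// → { 'pkg-a': '1.9.9', 'pkg-b': '2.0.5', 'pkg-c': '1.0.0' }
```

Regenerating the lock file simply reruns this resolution against whatever is newest in the registry, which is why flexible ranges only pay off if you actually refresh the lock file.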
All right, I'm going to be ending things on a bit of
a controversial topic, which is transitive vulnerabilities.
Now, transitive vulns are basically the dependency vulnerabilities impacting
your project's dependencies. I often refer to them as third-,
fourth-, or fifth-party vulnerabilities, but the chain can go much,
much further than just five, and in fact, it very commonly does.
A little fun fact for you: the 2020 GitHub Octoverse report
disclosed that the average number of indirect or transitive dependencies
for a JavaScript project with only ten direct dependencies
is 683 total. I always
found that to be crazy, and I still do. Just imagine running npm
install ten times, and all of a sudden you're downloading 683
different software packages just for your project to work. Now the
biggest reason why this is somewhat controversial is because, in reality,
what are you going to do about it? If you use a dependency affected
by another dependency's vulnerability, there's really not much you can do.
Unless you plan to dedicate engineering time to fixing the issue however far
down the chain, and mapping that fix at each and every level,
there's not really going to be any impact. That said, I do really enjoy
the research behind transitive vulnerabilities. However,
reachability via static code analysis hasn't really impressed me here.
This goes back to what I mentioned earlier about active versus inactive code.
For example, if I only import one function of an NPM package,
transitive reachability via static code analysis won't know to only look
at paths associated with that function. And this has a chain
effect as you continue down the dependency tree. But at
the end of the day, transitive risks are definitely real, and I'm hopeful future
research and development will continue to make this more manageable.
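The fan-out behind those numbers is easy to see with a walk of the dependency graph; the graph below is a made-up miniature of the effect:

```javascript
// Collect every transitive (indirect) dependency reachable from the
// direct dependencies. `graph` maps a package to its own dependencies.
function transitiveDeps(graph, directDeps) {
  const seen = new Set();
  const stack = [...directDeps];
  while (stack.length > 0) {
    const pkg = stack.pop();
    for (const dep of graph[pkg] || []) {
      if (!seen.has(dep)) {
        seen.add(dep);
        stack.push(dep);
      }
    }
  }
  // Exclude the direct dependencies themselves from the result.
  return [...seen].filter(p => !directDeps.includes(p));
}

// Two direct dependencies quietly pull in four more packages.
const graph = {
  a: ["c", "d"],
  b: ["d", "e"],
  c: ["f"],
  d: [], e: [], f: [],
};
console.log(transitiveDeps(graph, ["a", "b"]).sort()); // → [ 'c', 'd', 'e', 'f' ]
```

At real-world scale, each level of the tree multiplies again, which is how ten direct dependencies balloon into hundreds of indirect ones.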
Anyways, at the end of the day, odds are that more than 90% of your
vulnerabilities are caused by transitive dependencies.
And I say just smack the easy button and ignore them.
There's not much you can do about them anyway. If you see that there's any
subset of direct dependencies introducing a mass amount of transitive
vulnerabilities, maybe it's an unmaintained project,
maybe it's an older project, or maybe the maintainers just don't
care a whole lot. Or maybe those vulnerabilities
just aren't reachable in their code, and they don't care to fix them because they
don't introduce any actual risk. So to wrap
things up, I'll leave you with three main takeaways. Firstly, the effectiveness of reachability.
We routinely see up to a 98% false positive
reduction thanks to code scanning reachability.
Secondly, build reproducibility and semantic versioning solve so
many headaches; with the recent push for software
supply chain security, it's inevitable that we'll see these become more of a priority.
And lastly, the spicy take transitive vulnerabilities can usually
be ignored. I say usually because there are some very, very special cases,
but even then, if you're using respectable packages, they themselves will
often create a security advisory due to their usage of a transitive one,
which ultimately makes it a vulnerability impacting a direct dependency.
So transitive vulnerabilities don't matter at all if
you're making security advisories for the direct dependencies.
Now, this could very easily start a debate as to why the current
standards for reporting supply chain vulnerabilities are ineffective,
but we'll save that for another time, and that about wraps things up.
Thank you for joining this talk, and if you are at all interested in learning
more about reachability or the Semgrep Supply Chain product, I've included
links in the slide. Lastly, if you want to keep up with the latest in
software supply chain security news, feel free to check out my newsletter, CramHacks,
and subscribe. Or if you'd like to contact me directly,
I've included my LinkedIn as well.