Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. My name is Joseph Hejderup. I'm a
researcher and software developer at Endor Labs.
In this session, I'm going to talk about going beyond metadata,
or why we need to think about adopting static analysis
in dependency tools. This talk
is largely based on my PhD work, where I've been
using static analysis to better understand dependencies and
how we use third-party components in general.
Before we dive into why
we even want to consider static analysis, I think
it's a good idea to understand why we do
software reuse in the first place, and to look at the key principles and key
ideas behind software reuse. I happened to
find a very interesting article from the US
Department of Commerce, a management
guideline on how we should do software reuse.
In this guideline there are a few core principles
on what software reuse should be as an experience.
And there are two principles
that I found very interesting. One was on productivity. And here
we can find that it says reusing well designed,
well developed, and well documented software improves productivity and
reduces software development time, cost and risk. Right?
And then there's the other aspect of software reuse,
which is improvements in the quality of software developed
from well designed, well tested, and well documented reusable software
components. And here we can see that,
in general, when we want to use third-party components,
we of course want to
reduce software development time, but at the same time we also want to reduce
risk. And when we use third-party components,
we also expect them to be well tested and well documented, so that
there is as little friction as possible.
And the
way we have implemented these principles,
which I think many of you are familiar with, is
mainly through package managers. Here is an example of
using npm. With npm, directly
from our command line, we can access
thousands of libraries and frameworks.
Whenever we want to use a library
to solve a particular problem, we can simply run npm install
with the package name, and it will make
the library available in our workspace without any problems.
And the third part, which is also really nice, is that it's very
easy to publish a package. So if you're developing something
that could be useful for the rest of the world, it's very simple
to use these package managers as a distribution
channel.
There are, of course,
some problems with using package managers,
and I'm going to highlight a few of them.
So the first problem, in general,
is that whenever we install a third-party
component or library, we often end up importing
not just one dependency; we can be importing ten, hundreds,
or in some cases even thousands of dependencies.
For example, in this one here, we can see that it says
building 194 out of 307 dependencies.
That is quite a lot. And these
dependency trees are not a simple tree structure; they are
quite complex, because sometimes you can even have dependencies
on the same library but with different versions.
If you look at the figure on the left here, we can
see that it has accepts 1.3.8, but at the
same time there is also another, different version of accepts, right?
The other aspect of dependencies in some of these package
managers is that there are version
ranges. So, for example, if I
install accepts today, I get 1.3.7,
whereas if I do it three days later, it's 1.3.12.
It's constantly changing.
These are some of the properties of dependency management.
And then, at a more global,
ecosystem level, whenever
we read the news we keep coming across headlines
where hackers managed to hide
malicious code, as in the case of event-stream,
or the very popular
left-pad, where a developer removed
the package, breaking the builds of
hundreds of thousands of
clients. These
are, let's say, the main types of problems that we find with
package management in general.
So how have we been
able to detect or identify these types
of problems? When it comes to the temporal properties,
as I was saying, if you're using version ranges,
you can easily use something called version pinning, or lock files,
to ensure that whenever you build a project,
exactly the same set of dependencies will be resolved
every single time, and hopefully also within the same build environment.
And for everything else, for example
the malicious code I was mentioning in the previous slide,
or security bugs, or major changes
from one version to another, we have
to rely on tooling. Commonly this could be dependency analyzer
bots or plugins in dependency management tools.
If we look at the typical
workflow when we use a dependency analyzer bot or plugin,
whether it's for vulnerabilities, updates,
audits, quality, deprecation, whatever the problem is,
we usually analyze the dependency tree,
which we can see in the middle, and then work from
the package that has a problem, for example a security vulnerability.
We can see, for example in the bottom left
corner (or maybe it's on your right), that
there's a path from the vulnerable package all the way up to
the client.
With this, we are able
to quickly identify which
packages might be vulnerable. But as we can see
at the end here, there are a lot of false
positives, and just knowing that there
is a problem in a package may not be particularly actionable.
Some packages or libraries can be relatively large, with many APIs,
and if you're only using a small fraction of a package,
you may not in the end be vulnerable at all. Some warnings are simply not
relevant, depending on how you are using a third-party library.
So if we look at how well we have done against these
principles from the 1980s,
we could say that with package
managers we're able to quickly reduce software development time
and cost, but perhaps not risk.
So the
question here is: are these
problems we are having with package managers just a
typical, classic case of alert fatigue?
I think not. The reason is that metadata
is not source code, and most analyzers
are based on analyzing metadata. The
problem is that metadata does not really equate to usage:
how one client or user uses a third-party
library can be very different from how another person or package
uses it.
And the other point is that we need to start making
code a first-class citizen.
The reason why I'm saying that we need to make code a first-class citizen
is that if you just report, for example in the
dependency tree, that the green package version 1.2
is vulnerable, it doesn't really tell you much.
But if you use some type of code
structure, for example AST structures,
call graphs, et cetera, I could instead
say: hey, this specific function
is the one that is vulnerable in this green package version 1.2.
And if there is a reachable path from
that function to, let's say, the main function of the client,
we can clearly see that this client
is impacted by it. But if there aren't
any reachable paths, that could also be a way for us to conclude that this
user is not affected by it.
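To make that concrete, here is a minimal Rust sketch of the idea, with a module standing in for the vulnerable green package; the function names are purely illustrative and not the ones from the slide:

    mod green {
        // Hypothetically the vulnerable function; never called by this client.
        #[allow(dead_code)]
        pub fn parse(input: &str) -> usize {
            input.len()
        }

        // A harmless function of the same package.
        pub fn version() -> &'static str {
            "1.2"
        }
    }

    fn main() {
        // The client only calls green::version(); there is no reachable path
        // from main() to green::parse(), so a call-graph-based tool can
        // conclude that this client is not affected by the vulnerability.
        println!("using green {}", green::version());
    }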
By starting to discuss in terms of code, we also, in a
way, make developers more involved
in how these alerts and warnings
actually relate to how we are using code.
Discussing around code also makes discussions, I think,
much more actionable, and makes it much easier to
understand the effort needed to
solve a problem, or how much of the code is actually
impacted. One
main concern is: okay, great, but it's
very expensive to run program analysis tools,
and it's usually not very scalable if you have many dependencies.
And in the example I showed earlier, one
package might have 300 or 500 dependencies.
So the concerns are valid, because when
we do program analysis, the scope is usually
the project itself, and now we're expanding that scope to
the entire dependency tree, which makes it more difficult.
Because I have a bit of an
academic background, I've been doing analyses of the
whole Rust ecosystem, and I was able to build all
the packages that were at least compilable in
ten days without much problem. I think
the major trade-off to consider here is that
the point is not to build program analysis
tools that are very advanced or resource consuming,
but to aim for something that is lightweight. My
main argument is that using something lightweight is
probably better and more actionable than just looking at metadata declarations
in general.
And there are of course many questions. Some
may feel, for example: hey, for my dependency tool, or
for me as a tool maker, isn't using program analysis overkill?
Or: in my product we have a lot of
Python and JavaScript developers, what about them?
Then there's the aspect of false negatives:
my security customers won't be happy about that. So how
do we deal with all these types of questions?
To answer this, I of
course put my research hat on and started
doing some research. To better understand
these trade-offs, I first looked at a
very simple but interesting question: what is the difference in the
number of reported dependencies between traditional metadata-based
approaches and program analysis-based approaches?
And I did this for the
entire Rust ecosystem. If you're interested in the
work, down below there's a reference to
the paper that I worked on, and it is based on the Rust
ecosystem.
In the figure we have box plots of
three data sources. I'm not going to
go into the details, but all
of these data sources report
the number of direct dependencies per package, and this
comes from all packages in the Rust ecosystem.
What we find in general here is that
the metadata-based networks, which are basically
the crates.io and docs.rs data
sources in the figure, are about
the same as Präzi, and Präzi is the call-based
representation. We can
see that the medians are similar, which means that
the metadata-based count of direct dependencies
closely approximates the number of dependencies
a static analysis tool would report.
So what this is saying is that, in general,
if you are just counting the number of direct dependencies declared
in your project, it is highly likely that you are
also actually using those dependencies in practice.
Then, when we look at transitive dependencies
for the same data set, we now see
significant differences between them.
Looking at the median number of transitive dependencies,
we find that if you use the
metadata-based representation,
it reports around 17 dependencies,
whereas if you look at usage,
it's about six dependencies. By
usage, I mean looking at which dependencies
are actually being used in the source code.
Indirectly, this also means that we
are roughly not calling or using 60%
of the resolved transitive dependencies. We can see that
there's a huge gap between them, right?
So then the question is: why is there such a huge gap for those
transitive dependencies? It could either be that
there are problems with the static analysis tooling,
or the static analysis is actually correct in
its assessment that there are no edges to certain transitive dependencies.
To understand why this is the case, I manually analyzed
34 dependency relationships where the two approaches
differed, to see whether the static analysis
or the metadata-based approximation
is correct. The first
difference that I found was that in three
of those 34 cases, there were no import statements,
meaning that the dependencies were declared
in the project but never actually imported.
In other cases, I also found that there
were data structures imported, but they were actually never used. By
used, I mean there were no function calls to them;
they were not even used as argument types or return types
in functions. This also shows that if
you use static analysis, you can see directly
whether something is used or not, whereas if
you just look at declarations and manifest data,
you cannot see this.
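As an illustration, here is a minimal Rust sketch of that second case, with a module standing in for the dependency and purely illustrative names:

    // Minimal sketch: the module `dep` stands in for the dependency.
    #[allow(dead_code)]
    mod dep {
        pub struct Client;
        impl Client {
            pub fn call(&self) {}
        }
    }

    // Imported, but there are no function calls to it below, and it is not
    // used as an argument or return type either, so a call-graph-based
    // analysis sees no edge from this project into `dep`.
    #[allow(unused_imports)]
    use dep::Client;

    fn main() {
        println!("this project never actually uses `dep`");
    }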
There were also some other interesting findings:
for example, we found one case of conditional compilation,
there were cases of macro libraries, and also
a test dependency that was declared in the runtime
section. It's important to note, and this
is probably a very Rust-specific finding, that not all dependencies are runtime
libraries. In the case of Rust, for example,
you may want to generate serialization and deserialization
code for your data structures, and you can basically add those annotations
to the data structures; whenever the code is compiled,
all of that serialization code is automatically generated.
But those macro libraries are not
really runtime libraries. If you look at it
with dependency tooling, you will not be able to make
that distinction; it will just show that there is a dependency
from your project to this library, right?
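As a sketch of what that looks like in Rust, here is the typical serde pattern (assuming the serde and serde_json crates, with serde's derive feature enabled): the derive macro runs at compile time, so the proc-macro crate doing the code generation never shows up in the runtime call graph.

    use serde::{Deserialize, Serialize};

    // The derive annotations cause the (de)serialization code to be generated
    // at compile time by a procedural macro crate; that macro crate never
    // appears in the runtime call graph of this program.
    #[derive(Serialize, Deserialize)]
    struct Config {
        name: String,
        retries: u32,
    }

    fn main() {
        let cfg = Config { name: "demo".into(), retries: 3 };
        // serde_json is only used here to exercise the generated code.
        println!("{}", serde_json::to_string(&cfg).unwrap());
    }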
The other thing is conditional
compilation. For example, you can have certain feature flags
within your code; if you enable a feature,
say openssl, suddenly a new code section is
compiled in, and if the code is never compiled with that feature,
then you are not using the code in that section.
In one case, certain dependencies were
used only inside such a section, but in reality the project was never compiled
that way, so the dependency was never actually used.
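Here is a minimal Rust sketch of that situation, using a hypothetical "openssl" feature flag and assuming the openssl crate as the optional dependency:

    // The call into the openssl dependency only exists when the "openssl"
    // feature is enabled; a build without that feature never uses the
    // dependency, even though it may still be declared in the manifest.
    #[cfg(feature = "openssl")]
    fn fingerprint(data: &[u8]) -> Vec<u8> {
        openssl::sha::sha256(data).to_vec()
    }

    #[cfg(not(feature = "openssl"))]
    fn fingerprint(data: &[u8]) -> Vec<u8> {
        // Fallback path with no third-party dependency.
        data.to_vec()
    }

    fn main() {
        println!("{:?}", fingerprint(b"hello"));
    }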
And then I think the largest difference that we found was that in
16 cases there were non-reachable transitive dependencies.
What do we mean by this?
If you look here: how many dependencies is app version 1.0
using? If you look at it from a package dependency
network, just analyzing the dependency relationships,
we can see that app depends on lib one,
lib one depends on lib two, and lib two depends on lib three,
right? This is what tools would normally report.
If you use static analysis, we can see there is a call
from the function foo to the function bar, which means that app is using lib
one. And then from bar there is a call to used,
so we can see that lib one is using lib two,
right? But we can also see that the
whole reachable path, from main through foo and bar, goes
to used and terminates there.
In lib two, which is a transitive
dependency, there is a function, unused, that calls into lib three,
but there is no path that leads from the
app all the way down to lib three.
And those were the cases that we often found.
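Here is a minimal Rust sketch of that structure, with modules standing in for the crates app, lib one, lib two and lib three; the names mirror the figure but are otherwise illustrative:

    // lib three: only ever called from lib two's unused function.
    #[allow(dead_code)]
    mod lib3 {
        pub fn helper() {}
    }

    // lib two: a transitive dependency with one reachable and one unreachable function.
    mod lib2 {
        pub fn used() {}

        #[allow(dead_code)]
        pub fn unused() {
            super::lib3::helper(); // the only edge into lib three
        }
    }

    // lib one: the direct dependency.
    mod lib1 {
        pub fn bar() {
            super::lib2::used();
        }
    }

    // The app: main -> foo -> lib1::bar -> lib2::used. lib three is never
    // reached, even though the package dependency network reports it.
    fn foo() {
        lib1::bar();
    }

    fn main() {
        foo();
    }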
This really shows that context matters. If
you think about it, when we use a metadata-based
approach, we are implicitly assuming that all
APIs of all direct dependencies are used, and at the same time
that all APIs of the transitive dependencies are used, right?
But if you look at the figure, lib
two has, let's say,
three APIs or three functions,
and only one of those three functions is used.
In another case, where all the functions are used, then of course lib
three would be used as well. But it really shows that context
really matters, and that's something we are not taking into account when we use
regular dependency analyzers.
So now, to wrap up the
talk, let's look at the practical
implications. Coming back to the question:
what should we do? Should we use program analysis or not
when we do dependency analysis?
When it comes to direct dependencies,
we found that declared dependencies closely estimate
utilized dependencies, meaning that if
you have use cases around counting
the number of dependencies, or you want to know whether direct
dependencies are generally used or not, you
most likely do not need to implement any program analysis.
Another benefit, of course, is that if you have a very security- or
soundness-sensitive application where recall
is important, this also optimizes
for that. But the downside, as I was
showing in the manual analysis earlier, is that
it will not be able to detect things like
missing import statements or
APIs that are never used, et cetera.
It can also not eliminate
dependencies that are unused or have different purposes, for
example code generation, or being used only as a test dependency.
When it comes to transitive dependencies, we take a much stronger position:
you should probably prefer static analysis over metadata.
For example,
if you depend on a parser library,
and some part of that parser library also depends on
an additional regex library, but you're not using any
of the regex functionality of the parser, then you're
not really using the
regex library, which is the transitive dependency here.
By looking at how
we're using source code, we can directly understand
the general context of how we're using first the direct
dependencies, but also the transitive dependencies.
And with applications
having such large dependency trees, it makes a lot of sense
to do more static
analysis to help developers quickly know
which dependencies are problematic and which are not. If you have to
go through your transitive dependencies manually,
going through code, starting
from your own code, then moving to the direct dependencies,
looking them up on GitHub, and then further on to the other
dependencies, it becomes a very tedious job.
The problem that can happen is that there
are false negatives. The reason
is that static analysis has limitations, which I will talk
about in the next slide. Another challenging part
is that package repositories
are not a set of homogeneous
libraries. They are very diverse: one
library might use a lot of
static dispatch, whereas another library might rely heavily
on dynamic class loading
or dynamic execution, which static analysis
is not able to analyze well.
In such cases, if you're only going to analyze
libraries that do a lot of class loading, or that
run dynamic code that is not resolved
at compile time, then it might
make more sense to use a metadata-based analysis, because
with static analysis you might not be able to capture the relationships
between packages.
Moving on to program analysis: as I was saying,
there is this problem of false negatives, so you
have to think about recall.
If you're going to implement
a call graph generator for a programming language,
it's important to see which language features it
covers. For example, in the case of
Java, there are three popular call graph generators:
WALA, OPAL, and Soot.
When it comes to coverage of language features,
OPAL has more coverage because it can handle, for example, Java 11
features, whereas WALA cannot handle those as
well. The other thing is
that there are language features, for example
dynamic class loading and dynamic dispatch,
where we will probably lose some
precision, but I would still argue that it's better than metadata.
And if you're going to aim for higher precision:
when we handle dynamic dispatch, for example, we might be linking
a call site to all the implementations that are possible,
which can be tens or hundreds, because we
cannot know exactly at compile
time which implementation
will be invoked. We basically make the assumption that we link to all implementations.
There are algorithms that might be able to
narrow down the set of implementations, but the problem is
that they might not scale when you start analyzing the entire
dependency tree rather than a single project;
as I was saying before, the scope of the analysis is now the
project and its whole dependency tree. So you have to be careful
about what type of analysis you would like to do on it.
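As a small Rust sketch of that precision problem (trait and type names are illustrative): at the dynamic call site below, a conservative call graph generator would add edges to every implementation it can see, even though only one of them actually runs.

    trait Handler {
        fn handle(&self, input: &str);
    }

    struct Json;

    #[allow(dead_code)]
    struct Xml;

    impl Handler for Json {
        fn handle(&self, input: &str) {
            println!("json: {input}");
        }
    }

    impl Handler for Xml {
        fn handle(&self, input: &str) {
            println!("xml: {input}");
        }
    }

    fn dispatch(h: &dyn Handler, input: &str) {
        // Dynamic dispatch: statically, both Json::handle and Xml::handle are
        // possible targets of this call, so a conservative call graph links
        // this call site to every implementation.
        h.handle(input);
    }

    fn main() {
        dispatch(&Json, "{}");
    }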
And the other thing I mentioned earlier is that package repositories are not
a homogeneous collection of libraries.
Another consideration is that for languages
like Python or JavaScript, which are dynamic,
it's very difficult to build a static call graph.
But I would argue that there are techniques that do
hybrid analysis, where you do part static analysis
and part dynamic analysis to create a hybrid
representation of a project.
And yeah, that's it for me. I hope you enjoyed the talk,
and if you have any questions or want to reach out to me, feel free
to email me at my address.