Transcript
Hello, Avi Press here. Thanks for joining me. We're going to talk about the lessons we've learned from doing data analysis on 1 billion downloads of open source software on our platform at Scarf.
All right, so let's dive in.
Some stuff about me: I'm the founder and CEO of Scarf, based in Oakland, California, and my background is as a software engineer and open source maintainer. Scarf is about a four-year-old startup of about 16 people, and what we do is provide usage metrics and marketing and sales intelligence for commercial open source businesses.
A lot of projects and companies distribute their software through our gateway in front of any given package and container registry. This data set comes from both commercial and non-commercial open source projects, skewed a bit towards the commercial side; they tend to have more of a reason to want the kind of metrics we provide. But by and large, we're here to promote open source sustainability by offering responsible, anonymized analytics to businesses that have a vested stake in the usage of their open source.
So today, we're going to talk at a high level about what the data is and how we collected it, the trends we see in the data, what we can learn from it, and why you should care.
The data comes from the distribution of about 2,500 open source packages. These packages come from a variety of ecosystems. A lot of them are Docker containers or Helm charts. There are also a lot of binaries, tarballs, and other kinds of files for download: artifacts in GitHub releases, downloads from a download page on a website, files out of S3, these kinds of things. There are also npm packages, Python packages, Terraform modules, and a long tail of other kinds of artifacts that people distribute through our gateway.
As for the maintainers of the software being distributed: like I said, it's commercial and non-commercial open source. Some of it is single vendor, some of it is multi vendor. And a significant amount of these artifacts are open source projects hosted by various open source foundations, whether that's the Apache Software Foundation, the Linux Foundation, the Cloud Native Computing Foundation, or a variety of others. So this comes from a pretty wide range of types of open source.
Our title is actually a little bit misleading: we looked at more than a billion downloads, but we just figured we would look at what we had at the time of this analysis, which was a bit earlier in the year. That's about 19 million anonymized origin IDs, and we'll talk about what that means. The metadata around these IP addresses comes from a variety of partners we work with: IP metadata providers like Clearbit and 6sense and a couple of others, but also things as basic as WHOIS records, if you just do a whois lookup on the domain.
We collect this data in a few different ways, but the most novel thing we do is host a package and container registry gateway. What that means is that if you are pushing containers to Docker Hub or pushing Python packages to PyPI, Scarf makes it easy to distribute all those kinds of artifacts from one central place. Scarf just redirects the traffic to wherever the artifact is actually hosted. But by sitting in between the user and the registry, we can passively look at that traffic as it flows through and do this kind of data analysis for the maintainer, for our customer, for any given distributor of open source.
That's the main way: registry-level insights with Scarf Gateway. We also collect data in a few other ways; we do collect post-install telemetry in certain cases. For example, we have an npm library called scarf-js, which reports installs via a post-install hook. Once the data about a download is collected, we process the metadata associated with the IP address that we see and then anonymize it: we delete the IP address itself and any other associated PII, and just keep the metadata and use that for this analysis.
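To make that flow concrete, here's a minimal sketch of what it could look like. This is purely illustrative: the field names, the lookup_ip_metadata stub, and the record shape are hypothetical stand-ins, not Scarf's actual pipeline.

```python
def lookup_ip_metadata(ip: str) -> dict:
    # Stand-in for real enrichment calls to providers like Clearbit,
    # 6sense, or a WHOIS lookup (hypothetical return shape).
    return {"domain": "example.com", "connection_type": "business", "country": "US"}

def anonymize_download(event: dict) -> dict:
    """Enrich a download event with IP metadata, then drop the raw IP/PII."""
    metadata = lookup_ip_metadata(event["ip"])
    record = {
        "package": event["package"],
        "timestamp": event["timestamp"],
        "user_agent": event["user_agent"],
        "company_domain": metadata["domain"],
        "connection_type": metadata["connection_type"],
        "country": metadata["country"],
    }
    # The raw IP address itself is never stored.
    return record

print(anonymize_download({
    "ip": "198.51.100.9",
    "package": "mytool",
    "timestamp": "2024-03-12T10:00:00Z",
    "user_agent": "curl/8.4.0",
}))
```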
The total volume of this data has been increasing quarter over quarter, and this speaks more to the growth of Scarf than it does to most of the software being distributed on Scarf, but this is the data we're looking at. Back in Q1 of this year, we hit about 670 million downloads of open source packages, and it was increasing quarter over quarter; these are the volumes we were looking at.
Earlier in the talk, I said we were looking at about 19 million uniques, unique users. And what a "user" actually is in this context is not a very straightforward question. If an open source package or project says, "Look, we've been downloaded a million times," what does that mean? Was that a million downloads from one person, or a million downloads from a million different people? You generally don't know, and because the user isn't signing in, you will never know exactly what those real numbers are. But in Scarf's system, we have different ways of getting at what this number actually is, and we talk about two different kinds of identifiers.
We talk about an endpoint ID, which is essentially just an IP address; we hash that IP address and only store hashes. And then we also have the notion of an origin ID, where we start to include other bits of information in the hash that can add further uniqueness to any given source of traffic. Typically that's the IP address plus the user agent and any other kinds of headers we can find that help identify the source.
These two metrics will overcount and undercount in different ways. If you have one corporate VPN, you might have thousands and thousands of different people all sending traffic out of one single egress point, so you might have thousands of people on one IP address. Similarly, one person will have multiple IPs over time: they might look at a webpage from home, go to a coffee shop and download a package from there, then go into the office, and at each step of the way their IP is changing, so they'll have both different endpoint IDs and different origin IDs. The user agent and all these combinations of things basically get us closer to the distinct programs that are doing the downloading or the viewing or whatever kind of event we're looking at. But endpoint IDs will largely undercount, pretty consistently, and origin IDs will often overcount. So whenever we talk about a "user" in this context, we'll typically talk about both of these metrics; if you want to know the user count, it's somewhere in the middle.
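As a rough sketch of the distinction (this illustrates the idea, not Scarf's actual hashing scheme):

```python
import hashlib

def endpoint_id(ip: str) -> str:
    # Endpoint ID: a hash of just the IP. Tends to undercount, because
    # many people can share one egress IP (e.g. a corporate VPN).
    return hashlib.sha256(ip.encode()).hexdigest()

def origin_id(ip: str, user_agent: str, *extra_headers: str) -> str:
    # Origin ID: hash the IP plus the user agent and any other identifying
    # headers. Tends to overcount, because one person moving between
    # networks or clients produces several distinct IDs.
    material = "|".join((ip, user_agent) + extra_headers)
    return hashlib.sha256(material.encode()).hexdigest()

# The same laptop downloading from home and then from a coffee shop
# yields different endpoint IDs and different origin IDs:
home = origin_id("203.0.113.7", "pip/23.3.1")
cafe = origin_id("198.51.100.9", "pip/23.3.1")
print(home == cafe)  # False
```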
So at the Scarf Gateway level, we're talking about registry-level analytics. Whether you're distributing on npm or Docker Hub or GitHub Packages or PyPI or whatever, what do those registries actually see? Well, they see web requests that come in to download stuff. The registry is going to see what got downloaded and when it got downloaded, the user agent of the download (was this coming from curl or a browser or a package manager or what have you), and then any other headers that might be included: things like auth tokens, or other kinds of settings that may allow fingerprinting. And ultimately they'll see the IP address of the request. That's largely what the registry is working with.
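For illustration, here's roughly what that raw view looks like if you imagine the registry's access log in the common "combined" log format. This is a generic example, not any particular registry's real log format:

```python
import re

# One download request as a combined-format access log line (illustrative).
LINE = ('198.51.100.9 - - [12/Mar/2024:10:00:00 +0000] '
        '"GET /v2/myorg/myimage/manifests/latest HTTP/1.1" 200 524 '
        '"-" "docker/24.0.2 go/go1.20.4 (linux/amd64)"')

PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<request>[^"]*)" \d+ \d+ "[^"]*" "(?P<user_agent>[^"]*)"'
)

m = PATTERN.match(LINE)
if m:
    # What got downloaded, when, by which client, from which IP.
    print(m.group("request"))     # GET /v2/myorg/myimage/manifests/latest HTTP/1.1
    print(m.group("when"))        # 12/Mar/2024:10:00:00 +0000
    print(m.group("user_agent"))  # docker/24.0.2 go/go1.20.4 (linux/amd64)
    print(m.group("ip"))          # 198.51.100.9
```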
Just from that kind of analysis at the registry level, one immediate lesson is that open source is being used virtually everywhere. Now, Scarf is an American company, so our user and customer base does skew American. But in general, open source is used in pretty much every country. There were a few notable exceptions, like Western Sahara and the French Southern and Antarctic Lands, but otherwise every single country is represented in those billion-plus downloads. And interestingly, we see even the most remote areas as well. Actually, here, let me move myself out of the way so we can see the points here: very close to the North Pole, pretty close to the South Pole. Open source is being used in the most extreme and remote parts of the world, which is kind of interesting. Cool, let's keep going.
Another interesting one, and I'm going to move myself back here really quick: governments around the world also use open source. I have a note on this on the next slide. For any given IP address that comes in, there is typically a connection type associated with that IP address. The IP address can be owned by a business, an educational institution, a government, or a hosting provider (that would be something like an EC2 instance, a GCP machine, these kinds of things), or it can be an ISP (Internet service provider) connection, meaning someone on their home network. Any given IP address that we see falls into one of these categories.
So, to jump back for a second: we see IP addresses where the connection type is government, and we see government traffic coming in from around the world. Definitely not every country, but most countries have some kind of open source consumption footprint from the public sector. We see by far the largest number of public sector organizations coming from the United States. Brazil is another notable frontrunner in terms of overall traffic, and Australia is another high one as well. But otherwise, pretty consistently across Europe, Asia, South America, Australia, and parts of Africa, there is fairly widespread open source usage from governments. I think this is one of those things where on GitHub we don't see a lot of issues being created by people coming from government, but they're very much using the software.
And that's pretty cool to see in the data.
As for the overall volume of downloads we're seeing: a lot of it comes from plain old ISP networks. We do see a pretty high volume of hosting-provider-based downloads as well; that's a lot of automated systems in AWS or GCP or what have you. The next highest category is businesses. It's definitely the case that a lot of the ISP and hosting traffic indirectly comes from business purposes too, but this is the breakdown of IP address ownership that we have been seeing.
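In code, that breakdown is just a tally over the connection type attached to each anonymized download. A tiny sketch with toy data (the real classification comes from the IP metadata providers):

```python
from collections import Counter

# Hypothetical anonymized download records, each tagged with the
# connection type of the requesting IP address.
downloads = [
    {"package": "mytool", "connection_type": "isp"},
    {"package": "mytool", "connection_type": "isp"},
    {"package": "mytool", "connection_type": "hosting"},
    {"package": "mytool", "connection_type": "hosting"},
    {"package": "mytool", "connection_type": "business"},
    {"package": "mytool", "connection_type": "government"},
]

by_type = Counter(d["connection_type"] for d in downloads)
for connection_type, count in by_type.most_common():
    print(f"{connection_type}: {count}")
```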
Overall in this data, we see about 1.2 million different corporate-associated endpoint IDs. A few things about this. What we mean is that for any given IP address, if it belongs to a business, we'll see a domain associated with it, a business identity, so we'll know the organization that was behind the IP address. And we've seen about 1.2 million different businesses.
I think one thing that's glaringly obvious from this is that there's a lot of data here that can support commercial open source businesses. If you are a business commercializing open source, meaning you sell products and services on top of this software, the businesses using your open source are often potential customers. That's one of the main reasons Scarf has a business here to begin with: our customers really value this data. But more broadly, most open source maintainers do not have access to this data, because the registries don't provide it by default. In general, I don't know of any that do, unless you're paying a pretty substantial amount for it.
But yeah, 1.2 million businesses in the data. Those 1.2 million businesses also represent 95% of the Fortune 500. So literally just from these 2,500 packages being distributed on Scarf's platform, the vast majority of the Fortune 500 showed up downloading these artifacts. There are a lot of surveys out there, a lot of corroborating evidence of this, where people report "hey, we use open source at our massive enterprise," but this is a really nice independent verification of that, with no self-reporting going on. This is literally just watching live production download traffic and seeing that indeed most of the largest companies in the world are leveraging open source.
In terms of the public cloud, that is, connection type equals hosting, AWS dominates the traffic we see on the platform by a huge margin. Again, because Scarf is an American company, we skew more towards the American providers for public clouds. Interestingly, Hetzner, a more European public cloud, comes in at number three; if Scarf were based in Europe, we'd probably see Hetzner be quite a bit more prominent. But notably, AWS really dominated, even more than Google.
I think that's an interesting thing in the traffic: if you're getting a lot of downloads of your artifacts and you do a lot of open source, and you're wondering who you should partner with, or who you should optimize for in terms of fixing bugs or whatever that might look like, this is the breakdown as we have witnessed it.
A really interesting one: like I said earlier, the artifacts in this set of 2,500 packages definitely skew towards Docker containers, which are not the majority but the plurality of the packages we're talking about. And Scarf can redirect container downloads to any registry, so what we're able to see here is the market share of different container registries. What we've been seeing is that Docker Hub, unsurprisingly to many, is the dominant container registry. The interesting thing is that this market share is actually trending towards Docker Hub eating even more of it. I don't have slides for this, but earlier in Scarf's trajectory we saw GHCR very quickly eating up market share and becoming one of the primary registries; we're now starting to see Docker Hub push back and continue to take more of that download share. Quay.io, or "key IO", depending on how you like to pronounce it, is the third most popular container registry that we see. Every other registry out there makes up a really negligible portion of this pie otherwise. So if you are wondering where you should publish your containers, or which registries you might want to support if you're integrating with these things, these are the general traffic patterns that we see.
One question we get a fair amount with this kind of IP address metadata is: well, what about VPNs? Won't VPNs mess all this up? One of the really nice things is that a lot of the metadata providers are actually able to detect whether a connection is on a VPN, and what we've seen in practice is that about 2% of the downloads we see come through a VPN.
VPN providers do vary in how easy they are to detect. This will catch a lot of the more retail VPNs, but if you're a big company rolling your own VPN, that may or may not get detected by these metadata providers. So I would say the percentage is at least 2.2%; it's probably not a ton more than that, but these are the ballpark figures we're looking at.
I think one of the biggest surprises for most Scarf users is that total downloads versus unique users often looks very different from what the download numbers alone would suggest. Here, the red line represents the total number of unique users seen, in millions, on the right-hand axis, and on the left we see the total raw number of downloads. The ratio here, as you're seeing, is about 100 to 1 on average. So if any given open source package says, "hey, we have a thousand downloads," or "a million downloads," and you're wondering, sure, but how many users is that actually, how many distinct sources of traffic did that come from, then as back-of-envelope math you can just divide by 100 and that'll give you some kind of reasonable approximation.
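As a quick worked example of that back-of-envelope math (the 100:1 figure is the platform-wide average from this data; individual projects vary a lot, as discussed next):

```python
def estimate_unique_users(total_downloads: int, downloads_per_user: int = 100) -> int:
    # Rough approximation: divide raw downloads by an assumed
    # downloads-per-user ratio. 100:1 was the average across this data set;
    # some projects are closer to 15:1.
    return total_downloads // downloads_per_user

print(estimate_unique_users(1_000_000))      # ~10,000 users at 100:1
print(estimate_unique_users(1_000_000, 15))  # ~66,666 users at 15:1
```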
However, it depends on the package. These are the averages across the board; we see other projects where the ratio is actually closer to 15 to 1. This kind of thing really just depends on the type of software you're dealing with and what usage patterns tend to look like. Obviously, where something sits in CI pipelines, that will tend to explode the numbers you might see in terms of total downloads versus uniques. But it really depends on the type of thing you're looking at. A full-blown standalone application that you spin up as an internal tool is going to have very different usage patterns than, say, a library where every automated system in your infrastructure downloads it every time you spin it up.
What that means is that many surges in the downloads you see, if you're looking at your npm metrics, your Cargo metrics, whatever you might be looking at, might be totally unreal. One of our users published a really great blog post about this. LinuxServer.io, for those who are not familiar, is a completely non-commercial organization. What they do is repackage a lot of popular applications as Docker containers; they dockerize a lot of different things to make it really easy for anyone to spin them up on arbitrary machines. It's a really cool project, and I highly recommend checking them out. They've served, and I think they're probably well over this by now, but it was coming up on 20 billion downloads of the Docker containers they maintain. And they used Scarf to track the metrics around these downloads.
What they found was that the correlation between unique users and total downloads did indeed vary across different applications, and for some of them it varied wildly. What you can see at the top here, for WireGuard, is that there were multiple spikes in total downloads with very little movement in unique users. What they said was that about half of the pulls could be attributed to 20 users, and those 20 users probably had misconfigured or overly aggressive update services. In the Docker world, there are tools like Watchtower and Diun and some others that are just pulling and pulling and pulling, trying to make sure they have the latest version of the software. So the numbers get super inflated, and that's the case on a lot of different registries you might look at.
The really interesting thing to think about is that in the venture capital world, millions and millions of dollars are deployed to open source projects turning into companies that just have really exciting download figures. And what we're showing with some of these metrics is that those download graphs are not reliable in any way, shape, or form. When we think about the companies that have been started, the tens of millions of dollars that have been deployed, all of it rests on basically garbage download metrics. It's an unfortunate reality to understand, but that is what the data shows.
Like I said, a lot of the downloads you may have as an open source maintainer come from totally automated systems. Some automated systems are very real signals, and others much less so: container agents continually monitoring for updates, CI/CD pipelines, artifact mirrors that are just trying to mirror any given registry and stay up to date. If you have a link to your artifact on a website, there are web crawlers as well, from all different companies, just crawling your pages and downloading all your downloads. Those all inflate your stats. Among the top clients we've seen, there are a lot of repetitive downloaders: Renovate bot, Skopeo, Diun, Uptime Robot. These are all very low-signal but very high-volume user agents. For most maintainers this is noise, it's not relevant, but it will really impact your stats.
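If you're analyzing your own download logs, a crude first pass is to screen out the known repetitive clients by user agent. A minimal sketch, with an illustrative and deliberately incomplete blocklist:

```python
# Substrings of user agents that are high-volume but low-signal
# (illustrative list, matched case-insensitively; not exhaustive).
NOISY_AGENTS = ("renovate", "skopeo", "diun", "uptimerobot", "watchtower")

def is_probably_automated(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(bot in ua for bot in NOISY_AGENTS)

downloads = [
    {"user_agent": "Renovate/37.107.0"},
    {"user_agent": "Diun/4.25 go/go1.21"},
    {"user_agent": "docker/24.0.2 go/go1.20.4 (linux/amd64)"},
]
human_ish = [d for d in downloads if not is_probably_automated(d["user_agent"])]
print(len(human_ish), "of", len(downloads), "look non-automated")
```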
Like we were saying, there's quite a lot of rich information in those user agents, and it's actually a bit more complex than that: some user agents send tons of information about the program doing the downloading, and others much less. One really standout piece of software is pip, the Python package manager. It does an exceptional job: the user agent actually contains a human-readable JSON blob with all sorts of information. It literally tells you the CPU architecture, and right at the top there's a CI flag telling the server whether the pip install is running from within CI or not. Incredibly helpful. There's build information, system information, dynamic dependency information. Humans can read it, machines can read it. It's incredible. Hats off to pip.
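To give a feel for it, here's roughly what a pip user agent looks like and how a server can parse it. The blob below is abbreviated and the values are illustrative, but the shape (program/version, then a JSON object with a ci field near the top) matches what recent pip versions send:

```python
import json

# Abbreviated example of a pip user agent: name/version, then a JSON blob.
ua = ('pip/23.3.1 {"ci":null,"cpu":"x86_64",'
      '"installer":{"name":"pip","version":"23.3.1"},'
      '"python":"3.11.4","system":{"name":"Linux","release":"6.5.0"}}')

program, _, blob = ua.partition(" ")
details = json.loads(blob)

print(program)          # pip/23.3.1
print(details["cpu"])   # x86_64
print(details["ci"])    # None -> not detectably running in CI
```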
We'd love to see more programs include this kind of rich information in their user agents, because it's really beneficial for open source maintainers who rely on this kind of data. And we do see it in a lot of other places: Homebrew does a pretty good job of this, and Docker does a great job, even showing you what's upstream and downstream of Docker. That's really awesome. But then we see a lot of user agents that are not super helpful. We're looking at you, Go-http-client. For the Go developers, this isn't really a knock; they did exactly the right thing. But for the folks using Go's HTTP client under the hood, which is most of this traffic, it's usually not coming from raw Go. It might be coming from Helm or Kubernetes or what have you, a lot of platforms that include very little information about the actual client making the request. I would love to see more pips and Dockers, and fewer Kubernetes and raw Go-http-clients that give very little info on who's making the requests.
Let's see. Another big piece of info that has come from these trends is that Scarf users and customers are often very surprised that people do not upgrade versions in the way you might expect. In the container world, the default tag for the version of the artifact you're publishing is just "latest"; that's the default automatic tag that gets applied. So if you try to download a container and don't specify a version, it's just going to download latest. Unsurprisingly, that is by far the most downloaded version across all the different packages on our platform; for more than three quarters of the packages, latest is their top version.
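The defaulting behavior itself is simple. Here's a small sketch of how an image reference without an explicit tag or digest falls back to latest (simplified; real reference parsing handles more cases):

```python
def resolve_tag(image_ref: str) -> str:
    # Simplified model of container image tag defaulting.
    if "@" in image_ref:
        return image_ref                 # pinned by digest, e.g. ...@sha256:...
    name, _, tag = image_ref.rpartition(":")
    if name and "/" not in tag:
        return image_ref                 # an explicit tag was given
    return image_ref + ":latest"         # no tag -> implicit "latest"

print(resolve_tag("nginx"))                   # nginx:latest
print(resolve_tag("ghcr.io/org/app:1.2.3"))   # ghcr.io/org/app:1.2.3
print(resolve_tag("localhost:5000/app"))      # localhost:5000/app:latest
```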
But interestingly, most users will never download a second version once they get one. So if you see someone download a stable release, getting them to upgrade will be very tricky. It's kind of weird: you have a lot of people who will just use the bleeding edge of whatever the default is, and most other people will never grab something again, no matter how much you nag them to do it. This is why a lot of software you use is begging you to upgrade to the latest version when there are security updates and these kinds of things: because in practice, people don't really do it. That's the world we live in. If you are a maintainer and you've pushed a security patch and you think, "great, everything is fine, we pushed a patch," well, maybe not, because just because you pushed the patch doesn't mean anyone is using it. That is the reality.
So we've talked about a lot of assorted learnings and trends here, but who cares? Why does this matter? There are a few big reasons why this stuff is important. Anyone who has done open source work on their own, anyone who's been an open source maintainer or an author or creator of a project, probably knows that you're often working in the dark about how your work is actually being used. Any work that you're doing, whether it's deciding what to work on or not work on, what to prioritize or not prioritize, you can do more effectively if you understand how the software is actually being used. Interestingly, when we first got started working on Scarf, the attitudes in the open source community around collecting any kind of usage metrics were quite a bit different. That's changed quite a bit since we've been around.
But the reason this is really important for all of us is that we do look at some data; it's not like we don't use any data. We have some download metrics in most of our registries, but those metrics are very misleading. What makes it worse is that the registries actually already have the data. All the stuff we've shown here today from Scarf, whether it's uniques or version adoption or what have you, the registries have that information already; we're just the first to show it to you. That's problematic, because it means there are things that would be helpful to maintainers that are just locked away, that they don't have access to, even though they're kind of the rightful owners of the data in the first place.
And if you want to see fewer companies ditch open source licenses: the trend we see is that you have all these companies that hit a wall with how much they can grow, because they're competing with their own open source, or because their community and their business are not properly aligned with one another. One way or another, these businesses and their communities drift apart in their interests, and that causes the businesses to do what they need to do, and sometimes that means changing their license. If we want to see less of this, and I know I do, and probably a lot of you listening do too, we have to support open source businesses more systemically, more holistically, and make it easier to build a sustainable company around a successful piece of open source software. Analytics, better metrics, better data observability: in our opinion, that's a crucial part of the equation. If companies better understand how their software is used, they can more effectively commercialize. And if they can more effectively commercialize, they might reach for license changes less often. I think we all want to see a world where more companies are successful doing open source, and where open source becomes the dominant way to build software, period.
So, a few takeaways from the data we've looked at today. One: open source, as many of you already know, really is everywhere. Maintaining open source is sometimes a pretty thankless job; a lot of times, people only come to you when they have a problem, when they run into issues, when they're upset, when they have something to complain about. But your work as a maintainer has a huge impact on the world. It affects big companies, governments, universities, and people all over the world, in the most populated and the most remote parts of the globe. That's pretty cool. Even if people are not coming to you and thanking you, at least you can know that your software is probably getting used more than you think.
If you're building a business, or have a business, around the open source that you build, tracking the usage of that software can be critical to building and maintaining a thriving business.
If you are reporting metrics around the usage of your open source software project, just remember that download counts can be highly misleading. Our recommendation is to always pair raw download numbers with some notion of uniqueness if you can, because there are huge outliers in traffic, bursts from single users, lots of bots, lots of redownloads, and a lot of other things that really skew those numbers.
If you maintain any kind of package manager or program that makes requests on the Internet, please, please, please put rich information in the user agent. Others in the community will be very appreciative. And maintainers should keep in mind what user behavior actually looks like when it comes to upgrading to new versions, and keep a somewhat more pessimistic eye on that.
And the last and final thing: open source usage metrics can indeed help the ecosystem, and they can be collected responsibly. I think this is something that we as a community need to embrace more and more over time, and not shy away from, because the data is already being collected whether we like it or not; the registries are going to collect it whether or not we want them to. So it's not a matter of whether we should collect this data; it's how we should use it, and who should have access to it. That's a topic for a whole other talk, one that I've given in other places, so feel free to check out my website for links to discussions about that.
Thank you for your time, and thank you for listening. I hope you learned something. You can find me all over online, and you can find Scarf at scarf.sh if you want to learn more about the metrics we collect, or about how you can collect your own if you have an open source project whose usage you want to understand. Please don't hesitate to reach out and get in touch. Thank you so much for listening.