Conf42 Observability 2024 - Online

Open Source Success: Learnings from 1 Billion Downloads

Abstract

Walk away with actionable insights and best practices to optimize your open-source initiatives and excel in this competitive landscape.

Summary

  • Scarf provides usage metrics and marketing and sales intelligence for commercial open source businesses. The data comes from the distribution of about 2,500 open source software packages. We'll talk about the trends that we see in the data, what we can learn from it, and why you should care.
  • Governments around the world also use open source. The 1.2 million businesses in the data include 95% of the Fortune 500, so most of the largest companies in the world are leveraging open source. In the public cloud, AWS dominates the traffic.
  • A lot of the downloads that you may have as an open source maintainer come from totally automated systems. One really standout piece of software is pip, the Python package manager. We'd love to see more programs include this kind of very rich information in their user agents.
  • Most users will never download a second version once they get one. If you want to see fewer companies ditch open source licenses, note the trend that we see: companies hit a wall with how much they can grow. Better analytics, better data, and better observability are crucial.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, Avi Press here. Thanks for joining me. We're going to talk about the lessons that we've learned from doing data analysis on 1 billion downloads of open source software on our platform at Scarf. All right, so let's dive in. Some quick stuff about me: I'm the founder and CEO of Scarf, based in Oakland, California, and my background is as a software engineer and open source maintainer. Scarf is about a four-year-old startup of about 16 people. What we do is provide usage metrics and marketing and sales intelligence for commercial open source businesses. A lot of projects and companies distribute their software through our gateway from any given package and container registry. This data set comes from both commercial and non-commercial open source projects. It's skewed a bit more towards the commercial side, since they tend to have more of a reason to want the kind of metrics that we provide. But by and large we're here to try to promote open source sustainability by offering responsible, anonymized analytics for businesses that have a vested stake in the usage of their open source. So today we're going to talk, at a high level, about what the data is and how we collected it. We'll talk about the trends that we see in the data, what we can learn from it, and why you should care. The data comes from the distribution of about 2,500 open source software packages. These packages come from a variety of different ecosystems. A lot of them are Docker containers or Helm charts. There are a lot of binaries and tarballs and other kinds of files for download: artifacts in GitHub releases, downloads from a download page on a website, files out of S3, these kinds of things.
There are also npm packages, Python packages, Terraform modules, and a long tail of other kinds of artifacts that people distribute through our gateway. As for the maintainers of the software being distributed, like I said, it's commercial and non-commercial open source. Some of it is single vendor, some of it is multi-vendor. And a significant amount of these artifacts are open source projects hosted by various open source foundations, whether that's the Apache Software Foundation, the Linux Foundation, the Cloud Native Computing Foundation, or a variety of others. So this comes from a pretty wide range of types of open source. Our title is a little bit misleading: we actually looked at more than a billion downloads. We just figured we would look at what we had at the time of this analysis, which was a bit earlier in the year. It's about 19 million anonymized origin IDs, and we'll talk about what that means. The metadata around these IP addresses comes from a variety of different partners that we work with: IP metadata providers like Clearbit and 6sense and a couple of others, but also things as basic as WHOIS records, if you just do a whois on the domain. We collect this data in a few different ways, but the most novel thing that we do is host a package and container registry gateway. What that means is that if you are pushing containers to Docker Hub or Python packages to PyPI, Scarf makes it easy to distribute all those kinds of artifacts from one central place. Scarf just redirects the traffic to wherever the artifact is actually hosted. But by sitting in the middle, between the user and the registry, we can passively look at that traffic as it flows through and do this kind of data analysis for the maintainer, for our customer, for any given distributor of open source.
That's the main way: registry-level insights with Scarf Gateway. We also collect data a few other ways. We do collect post-install telemetry in certain cases: we have an npm library called scarf-js, which will send data from post-install hooks. But once the data about a download is collected, we process the metadata associated with the IP address that we see and then we anonymize it: we delete the IP address itself and any other associated PII, keep just the metadata, and use that for this analysis. The total volume of this data has been increasing quarter over quarter. This speaks more to the growth of Scarf than to most of the software being distributed on Scarf, but this is the data we are looking at. Back in Q1 of this year, we hit about 670 million downloads of open source packages. Earlier in the talk, I said that we were looking at about 19 million unique users. When we talk about what a user actually is in this context, it's not a very straightforward question. If an open source project says, "Oh, look, we've been downloaded a million times," what does that mean? Was that a million downloads from one person, or a million downloads from a million different people? You generally don't know. And because the user's not signing in, you will never know exactly what those real numbers are. But in Scarf's system, we have different ways of getting at what this number actually is. We talk about two different kinds of identifiers. We talk about an endpoint ID, which is essentially just an IP address. We hash that IP address and just store hashes. And then we also have the notion of an origin ID.
And that's where we start to include other bits of information in the hash that can further give uniqueness to any given source of traffic. Typically that will be the IP address, a user agent, and any other headers that we can find that help identify the source. These two metrics will overcount and undercount in different ways. If you have one corporate VPN, you might have thousands and thousands of different people all sending traffic out of one single egress point, so you might have thousands of people on one IP address. Similarly, one person will have multiple IPs throughout their life. They might look at a webpage from home, then go to a coffee shop and work from there, maybe download a package from there, then go into the office. At each step of the way their IP is changing, and they're going to have both different endpoint IDs and different origin IDs. And with user agents, all these combinations of things basically might map to various distinct programs that are doing the downloading or the viewing or whatever kind of event we are looking at. But endpoint IDs are largely going to undercount, pretty consistently, and origin IDs will often overcount. So whenever we talk about a user in this context, we'll typically talk about both of these metrics. If you want to know the user count, it's somewhere in the middle. At the Scarf Gateway level, we're talking about registry-level analytics. Whether you're distributing on npm or Docker Hub or GitHub Packages, PyPI, whatever, what do those registries actually see? Well, they see things from the web requests that come in to download stuff. The registry is going to see what got downloaded, when it got downloaded, and the user agent of the downloads.
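The two identifiers described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the talk does not say which hash function Scarf uses, how fields are joined, or which extra headers go into the origin ID, so SHA-256, the separator, and the function names `endpoint_id` and `origin_id` are all hypothetical.

```python
import hashlib
from typing import Mapping, Optional


def endpoint_id(ip: str) -> str:
    """Hash of the IP address alone (assumed: unsalted SHA-256)."""
    return hashlib.sha256(ip.encode("utf-8")).hexdigest()


def origin_id(ip: str, user_agent: str,
              extra_headers: Optional[Mapping[str, str]] = None) -> str:
    """Hash of the IP plus user agent plus any other identifying headers."""
    parts = [ip, user_agent]
    # Sort header names so the same request always produces the same hash.
    for name in sorted(extra_headers or {}):
        parts.append(f"{name.lower()}={extra_headers[name]}")
    # Join with a separator that cannot appear in the fields themselves.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


# Made-up traffic: two programs behind one IP, plus one program on another IP.
requests = [
    ("203.0.113.7", "pip/24.0"),
    ("203.0.113.7", "curl/8.5.0"),
    ("198.51.100.9", "pip/24.0"),
]
endpoints = {endpoint_id(ip) for ip, _ in requests}
origins = {origin_id(ip, ua) for ip, ua in requests}
print(len(endpoints), len(origins))  # 2 3
```

The toy data shows the counting behavior from the talk: the two programs behind one IP collapse into a single endpoint ID, while the origin ID keeps them apart.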
So was this coming from curl or a browser or a package manager or what have you, and then any other headers that might be included. There might be things like auth tokens, or other kinds of settings that may be fingerprinting, and ultimately they'll see the IP address of the request. That's largely what the registry is working with. Just from that kind of analysis at the registry level, one immediate lesson is that open source is being used virtually everywhere. Now, Scarf is an American company, so our user and customer base does skew American. But in general, open source is used in pretty much every country. There were notable exceptions, Western Sahara and the French Southern and Antarctic Lands, but otherwise every single country is represented in those couple billion downloads. And interestingly, we're seeing even the most remote areas as well. Let me move myself out of here so we can see the points. So: pretty close to the North Pole, pretty close to the South Pole. Open source is being used in the most extreme and remote parts of the world, which is kind of interesting. Cool, let's keep going. I'm going to move myself back here really quick. Governments around the world also use open source. For any given IP address that comes in, there is typically what is called a connection type associated with that IP address. That IP address can be owned by a business, an educational institution, a government, a hosting provider (that would be like an EC2 instance, a GCP machine, these kinds of things), or an ISP, an Internet service provider, for someone on their home network. Any given IP address that we see is going to fall into one of these categories.
To jump back for a sec, what we see is IP addresses where the connection type is government, and we see government traffic coming in from around the world. Definitely not every country, but most countries have some kind of open source consumption footprint from the public sector. We see by far the largest number of public sector organizations coming from the United States. Brazil is another notable frontrunner in terms of overall traffic, with Australia being another high one. But otherwise we see it pretty consistently across Europe, Asia, South America, Australia, and parts of Africa: fairly widespread open source usage from government. I think this is one of those things where on GitHub we don't see a lot of issues being created by people coming from the government, but they're very much using the software. And that's pretty cool to see in the data. As for the overall volume of downloads that we are seeing, a lot of it is coming from just plain old ISP networks. We do see a pretty high volume of hosting-provider-based downloads as well. This is just a lot of automated systems that are in AWS or GCP or what have you. Then the next highest category is businesses. I think it's definitely the case that a lot of the ISP and hosting traffic does, indirectly, come for business purposes. But this is the breakdown of IP address ownership that we have been seeing. Overall in this data, we see about 1.2 million different corporate-associated endpoint IDs. What we mean by this is that for any given IP address, if it belongs to a business, we're going to see a domain associated with it, like a business identity. So we'll know the organization that was behind the IP address. And we've seen about 1.2 million different businesses.
I think one thing that is really glaringly obvious from this is that there's a lot of data here that can support commercial open source businesses. If you are a business commercializing open source, meaning you sell products and services on top of this software, the businesses using your open source are often potential customers. This is one of the main reasons that Scarf has a business to begin with: our customers really value this data. But more broadly, most open source maintainers do not have access to this data because the registries don't provide it by default. I don't know of any that do, unless you're paying a pretty substantial amount for it. But yeah, 1.2 million businesses in the data, and those 1.2 million businesses also represent 95% of the Fortune 500. So literally just from these 2,500 packages being distributed on Scarf's platform, the vast majority of the Fortune 500 was seen downloading these artifacts. There are a lot of surveys out there, a lot of corroborating evidence of this, where people report, "Hey, we use open source at our massive enterprise." But this is a really nice independent verification of that, with no self-reporting going on. This is literally just watching live production download traffic and seeing that indeed most of the largest companies in the world are leveraging open source. In terms of the public cloud, this is connection type equals hosting that we see on the platform. AWS dominates the traffic by a huge margin. Again, because Scarf is an American company, we will skew more towards the American providers for public clouds. Interestingly, Hetzner, a more European-based public cloud, comes in at number three.
But if we were based in Europe, we'd probably see Hetzner be quite a bit more prominent. Notably, AWS really dominated, even more than Google. So if you're getting a lot of downloads of your artifacts and you do a lot of open source, and you're wondering who you should partner with, or who you should try to optimize for in terms of fixing bugs or whatever that might look like, this is the breakdown as we have witnessed it. Like I said earlier, the artifacts in this set of 2,500 packages definitely skew towards Docker containers, which are not the majority but the plurality of packages that we are talking about. And Scarf can redirect container downloads to any registry, so what we're able to see is the market share of different container registries. What we've been seeing is that Docker Hub, I think unsurprisingly to many, is the dominant container registry. The interesting thing here as well is that this market share is actually trending towards Docker Hub eating more of it. I don't have slides for this, but earlier on in Scarf's trajectory, we saw GHCR very quickly eating up that market share and becoming one of the primary registries. We're now starting to see Docker Hub push back and continue to take more of that download share. Quay.io, however you like to pronounce it, is the third most popular container registry that we see. And every other registry out there makes up a really negligible portion of this pie.
So if you are wondering where you should publish your containers, or which registries you might want to support if you're integrating with these other things, these are the general traffic patterns that we see. One question that we get a fair amount with this kind of IP address metadata is: "Well, what about VPNs? VPNs will kind of mess all this up, won't they?" One of the really nice things is that a lot of the metadata providers are able to detect if a connection is on a VPN. What we've seen in practice is that about 2% of the downloads that we see are through a VPN. VPN providers do vary in how easy they are to detect. I think this will catch a lot of the more retail VPNs, whereas if you're a big company rolling your own VPN, that may or may not get detected by some of these metadata providers. So I would say that the percentage is at least 2.2%. It's probably not a ton more than that, but these are the ballpark figures that we're looking at. I think one of the biggest surprises for most Scarf users is that your unique users often look very different from what you would expect if you were looking at total downloads alone. Here, the red line represents the total number of unique users being seen, in millions, on the right-hand side, and on the left we see the total raw number of downloads. The ratio, as you're seeing, is about 100 to 1 on average. So if you're looking at any given open source package and they say, "Hey, we have 1,000 downloads," or a million downloads, I think a lot of people will immediately wonder: sure, but how many active users do you have? How many distinct sources of traffic did that come from? For back-of-envelope math:
You can just divide by 100 and that'll give you some kind of reasonable approximation. However, it depends on the package, right? These are the averages across the data set. We will see other projects where the ratio is actually closer to 15 to 1. This really just depends on the type of software that you're dealing with and what usage patterns tend to look like. Obviously, anything that sits in CI pipelines will tend to explode the numbers you might see in terms of total downloads versus uniques. A full-blown standalone application that you spin up as an internal tool is going to have very different usage patterns than, say, a library that every automated system in your infrastructure downloads every time you spin it up. What that means is that many surges in the downloads that you might see, if you're looking at your npm metrics, your Cargo metrics, whatever you might be looking at, might be totally not real. One of our users published a really great blog post about this. LinuxServer.io, for those who are not familiar, is a completely non-commercial organization. Basically, they repackage a lot of popular applications as Docker containers. They dockerize a lot of different things to make it really easy for anyone to spin them up on arbitrary machines. It's a really cool project; I highly recommend checking them out. They're probably well over this by now, but they were coming up on 20 billion downloads of the Docker containers that they maintain. And they used Scarf to track the metrics around these downloads.
And what they found was that the relationship between unique users and total downloads did indeed vary across different applications, and for some of them it varied wildly. What you can see on the top here for WireGuard is that there were multiple spikes of total downloads with very little movement in unique users. What they said was that about half of the pulls could be attributed to 20 users, and those 20 users probably had misconfigured or overly aggressive update services. In the Docker world, there are tools like Watchtower and Diun and some others where all they're doing is pulling and pulling and pulling, trying to make sure that they have the latest version of the software. So what happens is the numbers get super inflated. And that's the case on a lot of different registries that you might look at. The really interesting thing to think about is that in the venture capital world, millions and millions of dollars are deployed to open source projects turning into companies that just have really exciting download figures. What we're showing with some of these metrics is that these download graphs are not reliable in any way, shape, or form. When we think about the companies that have been started and the tens of millions of dollars that have been deployed, it was all based on basically garbage download metrics. It's an unfortunate reality, but that is what the data shows. Like I said, a lot of the downloads that you may have as an open source maintainer come from totally automated systems. Some automated systems are very real signals and others are much less so. Like I said, there are container agents that are continually monitoring for updates. You have CI/CD pipelines. You have artifact mirrors that are just trying to mirror any given registry and make sure that they're always up to date.
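The gap between raw pulls and unique sources described above is easy to check on your own logs. Here is a minimal sketch, assuming you already have one source identifier per download (for example, a hashed IP). The function name `download_summary`, its output keys, and the sample log are all made up for illustration; the numbers are not from the talk.

```python
from collections import Counter


def download_summary(source_ids, top_n=20):
    """Compare raw downloads with unique sources and measure concentration."""
    counts = Counter(source_ids)
    total = sum(counts.values())
    unique = len(counts)
    # Downloads contributed by the top_n busiest sources.
    top = sum(n for _, n in counts.most_common(top_n))
    return {
        "total_downloads": total,
        "unique_sources": unique,
        "downloads_per_source": total / unique if unique else 0.0,
        "top_share": top / total if total else 0.0,
    }


# Hypothetical log: two aggressive updaters plus 100 one-time users.
log = ["updater-a"] * 500 + ["updater-b"] * 400 + [f"user-{i}" for i in range(100)]
summary = download_summary(log, top_n=2)
print(summary["total_downloads"], summary["unique_sources"], summary["top_share"])
```

In this toy log, two sources account for 90% of 1,000 downloads, so the headline number says far more about two update loops than about the 102 distinct sources behind it.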
If you have a link to your artifact on a website, there are web crawlers as well, from all different companies, that are just crawling your pages and downloading all your downloads. Those are all inflating your stats. In the top clients that we've seen, there are a lot of repetitive downloaders: Renovate Bot, Skopeo, Diun, UptimeRobot. These are all very low-signal but very high-volume user agents. For most maintainers, this is noise. It's not relevant, but it will really impact your stats. As we were saying, there's quite a lot of rich information in those user agents, though it's a little more complex than that: some user agents send tons of information about the program doing the downloading, and others less so. One really standout piece of software is pip, the Python package manager. It does an exceptional job: the user agent contains a human-readable JSON blob with all sorts of information. It goes to the point of telling you the CPU architecture and, right at the top, a CI flag telling the server whether this pip install is running from within CI or not. Incredibly helpful. There's build information, there's system information, there's dynamic dependency information. Humans can read it, machines can read it. It's incredible. Hats off to pip. We'd love to see more programs include this kind of very rich information in their user agents, because it's really beneficial for open source maintainers that rely on this kind of data. We see this in a lot of other places: Homebrew does a pretty good job of this. Docker does a great job of this, even showing you what's upstream from Docker and downstream from Docker. That's really awesome. But then we see a lot of user agents that are not super helpful.
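To make the pip example concrete, here is a rough sketch of reading that JSON blob on the server side. It assumes the user agent has the shape `pip/<version> <json>`; the sample string is hand-written and abbreviated for illustration, not captured from real traffic, and the helper name `parse_pip_user_agent` is hypothetical.

```python
import json


def parse_pip_user_agent(user_agent: str):
    """Return the JSON metadata from a pip-style user agent, or None."""
    if not user_agent.startswith("pip/"):
        return None
    # Everything after the first space is expected to be the JSON blob.
    _, _, blob = user_agent.partition(" ")
    if not blob:
        return None
    try:
        data = json.loads(blob)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


# Hand-written, abbreviated sample; real pip user agents carry more fields.
sample = ('pip/24.0 {"ci":true,"cpu":"x86_64",'
          '"installer":{"name":"pip","version":"24.0"},"python":"3.12.2"}')
info = parse_pip_user_agent(sample)
print(info["ci"], info["cpu"])  # True x86_64

# A terse client gives the server nothing comparable to work with.
print(parse_pip_user_agent("Go-http-client/1.1"))  # None
```

With a flag like `ci` available, a maintainer could split CI traffic from interactive installs without guessing, which is exactly what terse user agents make impossible.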
We are looking at you, Go HTTP client. This is not really on the Go developers; they did exactly the right thing. But for the folks that are using the Go HTTP client under the hood, which is most of this traffic, it's not usually coming from raw Go. It might be coming from Helm or Kubernetes or what have you: a lot of platforms that include very little information about the actual client making the request. I would love to see more pips and Dockers and fewer Kubernetes and raw Go HTTP clients that give very little info on who's making the requests. Another big piece of info that has come from these trends is that Scarf users and customers are often very surprised that people do not upgrade versions in the way that you might expect. In the container world, the default tag, the version of the artifact that you're publishing, is just latest. That's the default automatic tag that gets applied. So if you try to download a container and you don't specify a version, it's going to download latest. Unsurprisingly, that is by far the most downloaded version across all the different packages on our platform: for more than three quarters of the packages, latest is their top version. But interestingly, most users will never download a second version once they get one. So if you see someone download a stable release, getting them to upgrade will be very tricky. It's kind of weird. We have a lot of people who are just going to use the bleeding edge of whatever is the default, and most other people are never going to grab something again, no matter how much you nag them to do it. This is why a lot of software you use is begging you to upgrade to the latest version if there are security updates or these kinds of things, because in practice, people don't really do it. That's the world that we live in.
If you are a maintainer and you have pushed a security patch and you think, "Oh great, everything is fine, we pushed a patch," well, maybe not, because just because you pushed the patch doesn't mean that anyone is using it. That is the reality. So we've talked about a lot of random learnings and trends here, but who cares? Why does this matter? There are a few big reasons why this stuff is important. Anyone who has done open source work on their own, anyone who's been an open source maintainer or an author or creator of a project, probably knows that you're often working in the dark about how your work is actually being used. Any work that you're doing, whether it's making a decision about what to work on or what to prioritize, you can do more effectively if you understand how the software is actually being used. Interestingly, when we first got started working on Scarf, the attitudes in the open source community were quite a bit different around collecting any kind of usage metrics. That's changed quite a bit since we've been around. But the reason this is really important for all of us is that we do look at some data. It's not like we don't use any data. We have some download metrics in most of our registries, but these metrics are very misleading. And to make it worse, the registries actually already have the data. All the stuff that we've shown here today from Scarf, whether it's uniques or version adoption or what have you, the registries have that information already. We're just the first to show it to you. That's problematic, because it means there are things that would be helpful to maintainers that are just locked away, that they don't have access to, even though they're kind of the rightful owners of the data in the first place.
And so, if you want to see fewer companies ditch open source licenses: the trend that we see is that all these companies kind of hit a wall with how much they can grow. Whether it's that they're competing with their own open source, or that their community and their business are not properly aligned with one another, one way or another these businesses and their communities are skewing apart in their interests. That causes the businesses to do what they need to do, and sometimes that means they need to change their license. If we want to see less of this (I know I do, and I know a lot of you listening probably do also), we have to support open source businesses more systemically, more holistically, and make it easier to build a sustainable company around a successful piece of open source software. Analytics and better metrics, better data observability, is, in our opinion, a very crucial part of this equation. If companies better understand how their software is used, they can more effectively commercialize. And if they can more effectively commercialize, they might reach for license changes less often. I think we all want to see a world where more companies are successful doing open source, and open source becomes the more dominant way to build software, period. So, a few takeaways from the data today. One, as many of you already know, is that open source really is everywhere. Maintaining open source is sometimes a pretty thankless job. A lot of times people only come to you when they have a problem, when they run into issues, when they're upset, when they have something to complain about. But your work as a maintainer actually has a huge impact on the world. It affects big companies, governments, universities, and people all over the world, in the most populated and the most remote parts of the globe.
And that's pretty cool. Even if people are not coming to you and thanking you, at least you can know that your software is probably getting used more than you think. If you are building a business around the open source that you build, tracking the usage of that software can be critical to building and maintaining a thriving business. If you are reporting metrics around the usage of your open source project, just remember that download counts can be highly misleading. Our recommendation is to always pair raw download numbers with some notion of uniqueness, if you can, because there are huge outliers in traffic and bursts from one person. There are lots of bots, lots of redownloads, a lot of things that really screw up those numbers. If you maintain any kind of package manager or program that makes requests on the Internet, please, please, please put rich information in the user agent. Others in the community will be very appreciative. Maintainers should keep in mind what the behavior of users actually is when it comes to upgrading to new versions, and keep a bit more of a pessimistic eye on that. And the last thing is that open source usage metrics can indeed help the ecosystem. They can be collected responsibly. I think this is something that we as a community need to embrace more and more over time and not shy away from, because the data is already being collected, whether we like it or not. The registries are going to collect the data whether or not we want them to. So it's not a matter of whether we should collect this data; it's how should we use it, and who should have access to it? That's a topic for a whole other talk, one that I've given in other places, so feel free to check out my website for links to discussions about that. So thank you for your time. Thank you for listening.
I hope you learned something. You can find me all over online. You can find Scarf at scarf.sh if you want to learn more about the metrics that we collect and how you can collect your own, or if you have an open source project you want to know more about the usage of. So please don't hesitate to reach out and get in touch. And thank you so much for listening.

Avi Press

Founder & CEO @ Scarf
