Transcript
I joined my first startup in 1998. I had no background in computer
science and no industry experience in software, but they were
looking for someone to run QA or to start a QA function, and I was
interested enough, willing to give it a shot, take a chance.
So I showed up on my first day, excited to get going, but not really
sure what to expect. It turned out the first thing that they wanted me to
look into was a performance issue. This was
fairly straightforward, a page that had started
out responsive and was slowly degrading over time as the total user base
started to grow. And so, even not really knowing what to look for, it seemed like something that I could take on and give a shot. So I started to dig in, and the first thing I did was
go and look at some code. Now remember, this is 1998.
It predates most of today's tools for building web applications. We were using Java, and so we really had code that looked a little bit like this. This is probably not exactly right,
but we were building a table with a list of users
by writing strings with HTML tables directly embedded
in them. And so the key point here is that every time we wanted to add another username to the table, we would append to the string. This is now a well-known issue in Java, but back then it was not well understood by most of us: when you do this, you create the first string by allocating an object and writing the content into it. And when you append to it with the plus operator, you actually create a brand new string, allocate all of that memory, and release the old object to be collected by the garbage collector. And then when you append to it again, you do the same thing. So we were doing this over and over and over for all of the users in the list without realizing the implications.
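As a rough illustration only, the pattern looked something like the sketch below, shown next to the StringBuffer version described in a moment. The original code is long gone, and the markup, names, and method names here are invented.

```java
import java.util.List;

public class UserTable {
    // The pattern we had: every += creates a brand new String, copies the
    // old contents plus the new row into it, and leaves the old String
    // behind for the garbage collector.
    static String buildWithConcat(List<String> userNames) {
        String html = "<table>";
        for (String name : userNames) {
            html += "<tr><td>" + name + "</td></tr>";
        }
        return html + "</table>";
    }

    // The fix described next: a StringBuffer keeps one growable chunk of
    // memory, appends into it (growing the chunk itself when needed), and
    // only produces a String at the very end. Today you would reach for
    // StringBuilder, but StringBuffer was the tool at the time.
    static String buildWithBuffer(List<String> userNames) {
        StringBuffer html = new StringBuffer("<table>");
        for (String name : userNames) {
            html.append("<tr><td>").append(name).append("</td></tr>");
        }
        return html.append("</table>").toString();
    }

    public static void main(String[] args) {
        List<String> users = List.of("wanda", "carol", "natasha");
        System.out.println(buildWithConcat(users));
        System.out.println(buildWithBuffer(users));
    }
}
```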
In the virtual machine, what happens in this case is you allocate a StringBuffer and then you append to it with the StringBuffer's append method, and then ultimately write that out to the output by converting it to a String. So what's happening inside the system is that you've allocated a large chunk of memory, you can continue to write into that chunk, and when you get to the end, you allocate another chunk, which is actually all done by the StringBuffer, not by the developer. And when it's complete, you can write it all out. So the big takeaway for me in this, and it was a great early lesson to learn in my career, is this:
We all want to take advantage of the tools that are made available to us,
so that we can focus on the things that matter to our users. But not understanding how those things are implemented can ultimately lead to degradation for your users. So it's important to know that you get great advantage by leveraging the tools that are available, but it's also important to pay attention to how they are implemented, because there really is no magic, and to make good choices about the tools that you end up using. I'm Rob Zuber, CTO of CircleCI, and today I want to give you a few more examples from more recent history in the CircleCI platform of places where we've
been able to make great use of available technologies. But our initial
choices ultimately caused issues based on how those things were
implemented. So let's start with schemaless databases. At CircleCI, we use Mongo. We've used Mongo for a very long time. We're starting to use some other data stores in other pieces of our platform, but we still have a lot of Mongo. And one thing that I always tell people about schemaless databases
is there's always a schema. It's just a question of where that schema is
enforced. And in order to understand the issue that we ultimately ran into, it's important to understand a few things about Mongo, its implementation, and schemaless databases in general.
So this is a comparison of what it would look like to structure a row in a schema-defined database, like a relational database, and in a schemaless or document store, like Mongo. On the top you have a list of names and email addresses, and all you need to store is the names and email addresses, because the schema defines what a row looks like. This is a little bit more like a CSV than a true database table, because it's hard for me to print all the control characters on a screen, but you get the general idea: the amount of extra space is very limited. On the other hand, when storing a document in a schemaless database, you have to identify every field in every record or document, because there's no schema defining what could be present. So one document might have a name and an email and another might not, and that's totally valid inside the same collection.
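As a small sketch of that point with the MongoDB Java driver (the database, collection, and field names here are hypothetical), two differently shaped documents can live side by side, and each one carries its own field names along with its values:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class SchemalessExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> characters =
                    client.getDatabase("demo").getCollection("characters");

            // Every document stores its own field names alongside the values...
            characters.insertOne(new Document("name", "Wanda")
                    .append("email", "wanda@example.com"));

            // ...and a document with a different shape is perfectly valid in
            // the same collection, because nothing defines what "a row" is.
            characters.insertOne(new Document("name", "Carol"));
        }
    }
}
```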
So one thing to note when you do this is that that second document is about 25% bigger than the one above it, and that's not necessarily a good thing. In many places, bigger is better; in storage, bigger is worse. And in 2021, or even over the last ten years, we've come to believe that storage is effectively free, based on the price that we pay. But the size of rows or documents in storage has a lot of impact on the overall performance of your systems. So this growth is not necessarily a positive thing.
So, looking at this a little bit more specifically, if you think of a very
simple system like the one that we use to manage our Mongo instances,
you have compute, doing the actual calculations of what's happening and running the Mongo binary, and then you have disk, which is where
the data is stored, and most likely you have some sort of network
in between those two. In some cases you would have locally attached storage,
but in many cases, in our cloud systems, we're using network attached storage for
the benefits of that. And the result is you're moving this data back and forth
across these networks. And so latency and throughput are
both concerns when moving those large volumes of data.
So, as I said, the compute layer is effectively where parsing and editing of those documents happens. So when you have an unstructured document, you have to load the entire thing in order to find any of the pieces of it. And that's done within the Mongo binary. And that's also where editing
happens before it's flushed back out to the disk. And more realistically,
looking at that second example again, what truly happens is there's no defined ordering of the fields, so the same content could be written in a different order in two different documents. And the net result is that you have no way of knowing where in a document a given field might be. So when
you go looking for something, you have to iterate over the
entire document. Now, Mongo uses a format called BSON, which is effectively a binary version of JSON, or close enough: there are some special characters and control functions used, both to define indicators or separators of fields and to allow for a couple of additional data types that would not be available in a JSON document. But other than that, you can think about the parsing as very similar to JSON, meaning you open up an iterator and
you start moving over all the keys until you find the one that you want.
This particular example is from the Mongo C driver docs. So if you were writing this in C, it would look just like this; it's not quite a parser, it's a search for a particular element, but you'd start at the beginning and iterate until you found what you needed.
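In Java terms, a rough sketch of the same kind of scan might look like the following; the document contents are hypothetical, and this is conceptual rather than a claim about how the driver is implemented internally:

```java
import org.bson.BsonDocument;
import org.bson.BsonValue;

import java.util.Map;

public class FindField {
    // Walk the keys in the order they appear until we hit the one we want;
    // with no schema and no fixed field positions, there is no shortcut.
    static BsonValue find(BsonDocument doc, String wanted) {
        for (Map.Entry<String, BsonValue> entry : doc.entrySet()) {
            if (entry.getKey().equals(wanted)) {
                return entry.getValue();
            }
        }
        return null; // the field simply isn't present in this document
    }

    public static void main(String[] args) {
        BsonDocument doc = BsonDocument.parse(
                "{\"name\": \"Wanda\", \"email\": \"wanda@example.com\"}");
        System.out.println(find(doc, "email"));
    }
}
```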
Second, there are no joins, and this is a very common theme in schemaless databases.
Ultimately the goal being that you have different operational
characteristics and different models for deploying your systems
when you don't have to worry about joins between collections. So this
is considered generally to be a good thing. Unfortunately, it means that there are some specific behaviors that you have to account for in different ways,
and sometimes they don't work out the way that you would expect.
So, taking a look at the previous example, this is what it would actually look like laid out in a relational database. I mean, of course, there would be more to it. But say you had names and emails for your users, or the characters and superheroes that you're storing, and then you wanted to associate with them their powers. That would be a one-to-many relationship,
meaning an individual superhero could have many powers.
You would store a separate table of powers and then have a relational id back
to the character table to identify with which character they were
associated. Now, in a document store, that's still possible, but there's no enforcement of it in the database. So the more commonly proposed and supported pattern is to embed those.
So in this case, we have Wanda as Scarlet Witch, who has some specific
powers. We would list those inside of the document
representing Wanda, and for a small number of powers,
this is actually a totally reasonable approach, minus any
opportunity to enforce uniqueness or constraints on that
list of powers. Now, one of the amazing convenience functions that Mongo provides to allow you to manage these lists is something called $addToSet. So in that previous example, if I listed mind control and then wanted to add it again, calling $addToSet means Mongo will ensure that the value doesn't already exist inside of the record before it appends it, which sounds great and is super helpful.
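As a sketch with the MongoDB Java driver (the database, collection, and field names here are hypothetical), that deduplicating append looks something like this:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class AddToSetExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> characters =
                    client.getDatabase("demo").getCollection("characters");

            // First call appends "mind control" to Wanda's powers array.
            characters.updateOne(Filters.eq("name", "Wanda"),
                    Updates.addToSet("powers", "mind control"));

            // Second call is a no-op: $addToSet checks the existing array and
            // only appends if the value isn't already there.
            characters.updateOne(Filters.eq("name", "Wanda"),
                    Updates.addToSet("powers", "mind control"));
        }
    }
}
```

The convenience is real, but the deduplication implies scanning the existing array on every call, which matters later in this story.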
And it's something that we chose to use for one of our implementations, specifically our implementation of artifacts. So, CircleCI is a CI and CD platform. And one of the things that we allow you to do when you run
a build is store artifacts of that build. And this
is a small document that represents a single
artifact at the end of a build. In terms of what we would store, it describes the name, the actual path in S3 where we would have stored it, and then a URL to retrieve it. Now, these are obviously
adjusted to fit on the screen, but it's approximately this format.
In fact, I think they would mostly be longer than this. So for every single
artifact that was stored, we had some repetition, which, as I've discussed before,
was probably not great from a total size perspective. And we were storing these
documents in an array and using the $addToSet operator
to identify when we had duplication inside of that array and
to avoid that. So, similar to the short list of powers,
when we had builds that ran with a small number of outputs,
this was a great way to list them and a quick way to find the
data we needed and go and fetch those artifacts. We could show the list to
the user so they could see what artifacts were associated with their build,
click on one and download it. Now, what ended up
happening was we had customers who had 30,000 or more of these artifacts
being generated within a build, which, as you can imagine,
started to build some very, very large arrays inside of that
document. So, here's an example or a description of
what that sort of looks like as it's happening. So this is our
very simple Mongo instance again with compute, and then disk below
and the little square representing the specific document that we're
interested in editing and adding an artifact to.
And up above, we have a builder machine that's
actually doing the work and wants to write. And so when it makes a request
to Mongo to get access to that build document, Mongo loads
it off disk and holds it in memory in order to operate on it.
And ultimately, it will flush it back out to disk.
Now, one of the capabilities that we offer at CircleCI is the ability to parallelize your build. So you might have a very large number of tests, and in order to get through them faster, we will split those tests up onto different machines and run them at the same time.
Of course, that means that each one of those is computing a
different set of artifacts and writing it back to the
build. And the build is the collection of results from
all of those different machines, not one per machine.
So when each one of those tries to write, it has
to write to the same array. And when we call $addToSet, each one of those write requests is going to lock the build document and iterate over the entire array of artifacts, looking for any potential conflicts or duplicates before it ultimately writes at the end. And this could happen with 100 parallel builders trying to write 30,000 artifacts in total to that single document, each one of them locking the document and searching, by the end, through the previous 29,000 documents in order to see if there was any duplication or conflict. This was not a
great outcome for us. And even more exciting, we would be writing a large enough number of entries to that artifacts array that ultimately Mongo would have to grow the document: it would have to allocate a new chunk of memory, copy the existing document into that memory, and then continue to edit, which is another operation that happened while locked, blocking all of the potentially 99 or more other builders from writing.
And then we had to write that document back to disk, which requires
Mongo to allocate new space on the disk because there's not
room to fit it where it was before, find a new extent, move the
entire existing document over there and write the new content into
it. Or more likely, just flush the whole document and remove the old one.
So that whole operation had to happen before someone else could write.
Ultimately, this resulted in some very, very slow builds.
So the first quick fix we made was to remove the $addToSet operator and replace it with a $push operator, which just assumes there's no duplication, or that duplication doesn't matter, because it actually never did for us, and writes directly to the end.
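In driver terms, the quick fix was roughly a one-line change; this is a sketch with hypothetical field names rather than our actual schema:

```java
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class ArtifactWrites {
    // Before: every parallel builder's write locks the build document and
    // scans the whole artifacts array for duplicates before appending.
    static void recordArtifactSlow(MongoCollection<Document> builds,
                                   String buildId, Document artifact) {
        builds.updateOne(Filters.eq("build_id", buildId),
                Updates.addToSet("artifacts", artifact));
    }

    // After the quick fix: $push appends directly to the end of the array,
    // assuming duplicates either don't occur or don't matter -- and for us
    // they never did.
    static void recordArtifact(MongoCollection<Document> builds,
                               String buildId, Document artifact) {
        builds.updateOne(Filters.eq("build_id", buildId),
                Updates.push("artifacts", artifact));
    }
}
```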
The next step was effectively to rebuild large parts of that artifact management system,
shrinking the total amount of storage, because we knew the pattern to
get to a document, and then changing the storage model
so that we didn't need to store it inside the Mongo document at all.
So the key takeaway here for me is that there actually
was a great simple approach in the method that we chose. But when you take
a simple approach like that, it's really important to know when it's going to
break. For us, it was a surprise, and we had to go do some digging
and learning to figure out what was going on, and then find a new system.
If we had understood going in where the limitations of the Mongo
array capability would be, then we could have planned
for future work to make a more comprehensive or capable system
at the point where the scale would be a problem, or before that point.
Now, taking another look at how we use Mongo, but looking at a different aspect of it, let's talk a little bit about EBS. Now, this could probably be any block store; EBS is one that I happen to be fairly familiar with, because we use it a lot inside of CircleCI. And the one thing that
I will definitely highlight here is that many things feel like magic. There's never any actual magic; there are just important, difficult engineering challenges that have already been solved, and it's important to think about how they've been
solved. So, again, taking our simple example of a Mongo instance,
this is circa 2015, we actually had Mongo
outsourced to a third party provider who managed the database
for us. And again, your simple compute and disk
pairing. And in that model, those were actually locally attached.
So we were using AWS's largest instances at the time, which meant we had their largest local SSDs. Because disk and compute moved together, there was no way, at least at the time, to directly attach larger disks to the same sized instance. And we were running low on space for even a single collection, meaning we had already shifted off multiple different collections, left a single collection on a particular store, and the system was still running out of space. So we were constantly tweaking and managing the storage.
But no matter how small you get the documents, if you're continuing to add them
at a high rate of growth, you're ultimately going to run out of space.
Additionally, we had operational management issues like backup and restore. We were trying to do a backup daily, but we got to the point where a backup was taking greater than 24 hours, because we were pulling it out of the database and pushing it across the network, all through the compute engine, which was also trying to serve traffic. We ended up with stale, inconsistent backups by the time they were even created, and we obviously couldn't run them daily anymore because they weren't even done. Even worse, if we tried to use a restore to do any maintenance operations, that would take two to three days. So the ability to build out a new host, or even move us off when we were having operational issues, was severely limited.
We had operational problems all over and we
needed a new solution. So we decided to do two things at the same time.
We moved the overall Mongo operation into our own AWS environment, so we could customize the buildout, and we switched to using EBS as the disk storage. EBS is Elastic Block Store within AWS, so now we're back to having a network attached
storage model. We could optimize the disk and compute
separately. We had to pay attention to the operational characteristics,
but ultimately for us, the throughput and latency were both manageable.
Throughput is unbelievable inside of EBS,
and the latency was good enough, with the right mix of
computing power on the other side to store working sets, that it didn't end up
being a significant problem. Also, backups got a lot easier by using
snapshots. So snapshots are not actually instant, but they capture an instantaneous perspective of the disk itself, an instantaneous view, and then are capable of transferring that even though the transfer takes time. Meaning, if something is edited while the snapshot is still being transferred, the old version is held onto in order to complete the process of the backup, the snapshot transfer, inside of EBS. So ultimately you get a consistent view from a moment in time. And because of the way that Mongo operates with journaling, it's not even necessary to stop operations; the disk is known to be consistent at any particular instant.
Additionally, we now had the opportunity to attach different compute without having to make any significant data transfers. So off of the same disk, we could do a rebuild of the machine, whether it was because we wanted to upgrade the operating system, apply security patches, or upgrade Mongo itself. Without having to try to change the state of the existing machine, we could just build a new one, throw out the old one, and attach the disk to the new one. So our hosts became effectively immutable and ephemeral in that way.
And that was a significant improvement from an operational perspective.
And then finally, that same instant transfer made
restores really, really fast. It would take a couple minutes
to build out a multi terabyte EBS volume from a
snapshot that we had stored.
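For a sense of what that looks like mechanically, here is a minimal sketch with the AWS SDK for Java v2; the snapshot, instance, and device identifiers are placeholders, and this leaves out the error handling and the wait-for-available step you'd want in practice.

```java
import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.AttachVolumeRequest;
import software.amazon.awssdk.services.ec2.model.CreateVolumeRequest;

public class RestoreFromSnapshot {
    public static void main(String[] args) {
        try (Ec2Client ec2 = Ec2Client.create()) {
            // Build a fresh EBS volume from the stored snapshot. This returns
            // quickly even for multi-terabyte volumes, because blocks are
            // fetched lazily rather than copied up front.
            String volumeId = ec2.createVolume(CreateVolumeRequest.builder()
                            .snapshotId("snap-0123456789abcdef0") // placeholder
                            .availabilityZone("us-east-1a")       // placeholder
                            .build())
                    .volumeId();

            // Attach the new volume to a freshly built (immutable) host.
            // In practice you'd wait for the volume to become available first.
            ec2.attachVolume(AttachVolumeRequest.builder()
                    .volumeId(volumeId)
                    .instanceId("i-0123456789abcdef0")            // placeholder
                    .device("/dev/sdf")                           // placeholder
                    .build());
        }
    }
}
```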
And so this was quite magical and allowed us to again perform those kinds of operations. Let's say we wanted to add another
host to a replica set or replace one that wasn't working
the way that we wanted. Or sometimes even AWS needs to take back a VM, and we could just build a new one before they did that. So this was fantastic. Again, a couple of minutes to get multiple terabytes back online, and it seemed great until we actually tried to start Mongo up again. And then it was really, really slow. Like, kind of scary, did-I-break-the-production-database slow, when it takes minutes to start up a process that's normally pretty much instant.
And I happened to have the opportunity to
speak to some folks from Mongo and ask them if they had seen this type
of behavior before. It turned out they had. And they were able to give me
some clues about what might be happening. And it was
bad enough that it was slow to start, but the more important thing was
when we started the database and it actually did come online,
if we put it into production, even as a secondary for reads,
it would be extremely slow in its performance there as well for
an extended period. So what we learned was happening was
that the restore is not actually a transfer
of all of the data, but it's a reference to
the data that is placed on the new volume.
And so the underlying file system, when it recognizes
that something is missing, goes and fetches it. So first it
starts in sort of sequential order, moving across the disk, but then a request
comes in for something that it doesn't have. It prioritizes
that and fills that in. So there's an on-demand fetching model, and this is a great model for smaller
volumes and for very sequential reads. But in our case, with a
large database and random access across all of our users,
everybody was hitting very random locations around the disk
under high load in a production environment simultaneously and
causing great delays. Every single one of these reads was actually, instead of a read from disk, a fetch out to S3 or wherever the snapshot is stored, some data transfer, some decompression, and some placement of that data onto the disk before the read could actually be completed.
And so in the startup phase, it was the random sampling
of indexes that Mongo does to ensure that all the content is there and
consistent, or random sampling of the journal to then go look
at the disk to check for consistency. That was taking a
really long time because it's looking all over the disk. And then as soon as
we put it into production, it was customer reads that were taking a really long
time. And so once again, all of our customers are waiting
and not impressed with performance. So ultimately we
had to think about what was causing that and how we could manage it.
And what we ended up doing, and still do, is that when we make these large operations and move to new volumes, we run warming queries that we know will be most representative of the majority of content that will be fetched once we put it back into production, which is usually the most recently edited, most recently written data.
So we drive those queries to force all of this data
to be fetched, because if we wait for the entire disk
to fetch, it will take forever. And a huge amount of it is
not really that important to us.
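A warming query can be as simple as the following sketch; the collection, field names, and time window are hypothetical, and the point is just to touch the blocks that production traffic will need first.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

import java.util.Date;
import java.util.concurrent.TimeUnit;

public class WarmRecentData {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> builds =
                    client.getDatabase("demo").getCollection("builds");

            Date cutoff = new Date(System.currentTimeMillis()
                    - TimeUnit.DAYS.toMillis(30));

            // Read the documents production traffic is most likely to ask for
            // (most recently written first). Simply iterating forces the lazily
            // restored volume to fetch those blocks now, instead of during a
            // customer request.
            builds.find(Filters.gte("updated_at", cutoff))
                  .sort(Sorts.descending("updated_at"))
                  .forEach(doc -> { /* touching the document is enough */ });
        }
    }
}
```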
The key takeaway here is that your cloud provider, your tool provider, is
solving for a general case. They are building amazing things,
but they're building them for a very large audience and thinking about what's going to
matter to the largest subset of those people, probably applying an 80/20 rule, saying if we solve this 20% case, we're going to cover
80% of our customer base. And the question that you have to ask yourself is,
are you the target? Are you in that 20% bucket in terms of your use
case, such that your usage will be covered? Or do you
need to figure out how to adjust your approach or your problem space,
or work around that in order to have it work for you? Now I'd like
to turn to a final example, which is a little bit
less about the tools that we've used from others and more about challenges of our own that have come up while building out in a cloud native way. Concurrency has always been a problem,
has always been a challenging part of software development. And that's been true
for as long as I've been writing software, and it's always been true within
a single programming environment, and it hasn't gotten
any easier as a result of most of the tools that we've thrown at it.
So concurrency has always been hard from a programming perspective.
This is a newer article, but speaks to an old problem.
And the great news is distributed systems are also very hard,
and these days we are piling them together. So let's
take a look back at the early days of CircleCI. Like everybody else,
we started out with a monolith. This monolith did many things. In this
example, we're just talking about a couple of them writing to the database and
fetching data from GitHub, or writing to GitHub, things like statuses.
Even in our very early days, we would have had at least two instances of the monolith for redundancy and resiliency. These are very, very early days, but ultimately we ended up with many more than two, and far more than are even on this diagram. But at some point it
becomes a bit repetitive and we got to a point where each of
these monoliths was effectively thinking of itself still as the
owner of all of the work. And so the way that jobs
were distributed was an individual monolith would talk to the database, take the next
job, and try to start it. But there was a check to make sure that
there wasn't any contention, that no one else had taken that job. And at a
certain level there were so many attempts to take the same job
that we were losing the opportunity to process our queue effectively. So we
reached a point where we couldn't really scale any further, which is not great as
a business.
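The shape of that take-and-check was something like the following sketch, with hypothetical collection and field names rather than our actual code: each instance tries to atomically claim the next queued job, and at high instance counts most of those attempts discover that someone else got there first.

```java
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.FindOneAndUpdateOptions;
import com.mongodb.client.model.Sorts;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class JobClaim {
    // Atomically claim the oldest queued job for this instance. If another
    // monolith already took it, the filter no longer matches and we get back
    // null -- the "contention check". With dozens of instances polling, most
    // attempts return null, and the wasted attempts add up.
    static Document claimNextJob(MongoCollection<Document> jobs, String instanceId) {
        return jobs.findOneAndUpdate(
                Filters.eq("status", "queued"),
                Updates.combine(
                        Updates.set("status", "running"),
                        Updates.set("claimed_by", instanceId)),
                new FindOneAndUpdateOptions().sort(Sorts.ascending("created_at")));
    }
}
```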
So we took a fairly simple approach to solving this problem, because we were under time constraints. And the first simple approach was to
take a version or a copy effectively of this monolith and
run it in a slightly different role, meaning it executed
the work of determining whether or not a job should be processed
and distributing that out to one of the other instances,
rather than letting each instance make its own decisions about what
work should be done. Now, due to time constraints, as I mentioned,
this was effectively another copy of the monolith
that was started up with instructions to only do this, and all the other instances
were started up with instructions not to do that, but it was basically the same
code base. And so now, when it comes to statuses, you have this dispatcher, which is an instance of the monolith, writing to GitHub, and then the monolith doing the work also writing to GitHub. They write different statuses based on where they are in different parts of the job, and that was basically because that code path was where those things happened.
So if we dig into one of those a little bit and look at what happens in that original process, it's a little more complicated. Within the monolith, we have a concept of a build, which is the full piece of work that we're doing, or were doing at the time, and we might change the status of that from requested to queued, from queued to running, from running to completed, passed or failed. Pretty straightforward.
And we would do that by writing to the database. But instead of also writing to GitHub, which is an external system, with network issues and retry problems and timing issues, we had a watcher, which was effectively an asynchronous process identifying, oh, something has changed on the build, and therefore I'm going to go do this work. It's sort of an inverse of a pub/sub model, but really a concurrency model that allows us to continue doing the work on the build, knowing that the status will get updated to GitHub as soon as possible. The customer would prefer that the build get run rather than have us wait on our ability to write that status, so this makes great sense. It's a fairly straightforward programming model, plus or minus the fact that concurrency is always hard, with everything happening inside of that one instance of the monolith.
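Roughly, the shape of that watcher is sketched below, heavily simplified; the class names, the GitHub client, and the queue of change events are all invented for illustration.

```java
import java.util.concurrent.BlockingQueue;

public class BuildStatusWatcher implements Runnable {
    // Invented types for illustration: a change event carries the build id
    // and its new status, and the client knows how to post a commit status.
    public record StatusChange(String buildId, String status) {}
    public interface GitHubStatusClient {
        void postStatus(String buildId, String status);
    }

    private final BlockingQueue<StatusChange> changes;
    private final GitHubStatusClient gitHub;

    public BuildStatusWatcher(BlockingQueue<StatusChange> changes,
                              GitHubStatusClient gitHub) {
        this.changes = changes;
        this.gitHub = gitHub;
    }

    @Override
    public void run() {
        // The build-running code just updates the database and drops an event
        // here; this loop pushes statuses out to GitHub whenever it gets to
        // them, so slow or flaky GitHub calls never hold up the build itself.
        while (!Thread.currentThread().isInterrupted()) {
            try {
                StatusChange change = changes.take();
                gitHub.postStatus(change.buildId(), change.status());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```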
Now introduce the dispatcher. So the dispatcher is the first thing to see that build, and, again using effectively the same code base, it writes the build, notices that the build is now queued, and writes a status that the build is queued and ready for processing, or has been handed off to a machine and has started processing. And then one of these monoliths receives the work, executes the build, and updates the status again, which then gets picked up by a concurrent reader that goes and does the work of writing to GitHub. Which is great under normal conditions. However, it's highly
possible, it certainly is possible, we know it to be possible, for the dispatcher to be under a great degree of load, because it's processing every job coming into the system, or, since there may be a few of these, a huge percentage of the jobs coming into the system, and it can be very busy on its writes. And then the instance of the monolith, the builder we'll call it, is going to pick up that work and execute it, and the work might be very small, very short, and then it delivers its status. And its status watcher could be very quiet, because there are only a few builds running on any particular builder. So the net result is that the completed
status can actually get to GitHub before the starting
status, and GitHub doesn't understand what we're doing enough to determine
that those are out of order and we are not coordinating
in this particular model. So we send to GitHub a completed
status, and then we send a starting status which reverts the
pull request and the status of the pull request to incomplete
in a way that can never be completed. This is the worst kind of waiting.
This is waiting forever. And ultimately we ended up with customer tickets
saying my build cannot be completed or my build was completed,
but I never got the status. Which is another interesting thing about concurrency: no one told us, you're setting the complete status before you set the starting status. They said, you're never setting the complete status, because that's what a customer sees; the two writes were usually so close together. And they were unable to merge their work. So this minor concurrency issue on our side resulted in our customers being unable
to complete their work. So we ended up having to go find a model for
coordinating that work and ensuring the sequencing that was obviously no longer guaranteed within that single watcher on the monolith, but instead had to be coordinated and guaranteed across multiple systems in the platform. So I would say services
beget services. When you start to break apart your system,
you end up with more and more systems, often trying to
do the work of coordinating the systems that you have.
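One common shape for that kind of sequencing guard, shown purely as an illustration rather than as the actual CircleCI fix, is to give the statuses an order and refuse to let a write move a build's reported status backwards:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StatusSequencer {
    // Ordered lifecycle: a later status must never be overwritten by an
    // earlier one, no matter which service reports it first.
    public enum Status { REQUESTED, QUEUED, RUNNING, COMPLETED }

    private final Map<String, Status> reported = new ConcurrentHashMap<>();

    // Returns true if the incoming status is at least as far along as anything
    // already reported for this build; stale earlier statuses are dropped, so
    // a completed pull request can never be flipped back to incomplete.
    public boolean shouldReport(String buildId, Status incoming) {
        Status result = reported.merge(buildId, incoming,
                (current, proposed) ->
                        proposed.ordinal() > current.ordinal() ? proposed : current);
        return result == incoming;
    }
}
```

In a single process a map like this is enough; across services, the same comparison has to live behind shared state or a coordinating service, which is exactly the extra system the previous paragraph is talking about.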
And my takeaway in this case is that distributed systems really
are concurrency. There's no world, well, there's a very
small world in which you can build out a complex distributed system without
paying any attention to concurrency. It's very likely going to end up being
a parameter that you deal with because you're trying to handle complexity,
resiliency, breakdowns in the communication network,
whatever it might be. There is going to be concurrency in the conversations between
your systems, and so you want to pay attention to that and manage for it
and use it when it is most beneficial, but understand the cost.
So, to summarize, it's a great time to be a developer.
We are able to build on top of amazing systems
that do amazing things and focus our
time and attention on delivering value to our customers, building the systems
that we're really excited about building in order to support businesses that we're excited
about building, and all of that is novel and new
to us and very, very cool. On the other hand,
it's important to pay attention to what's happening. So do the simple thing,
because that will help you get something out faster. But make sure you understand
the simple thing that you're doing. Understand the tradeoff that you're making so that
you'll know when that tradeoff will expire and you will
want to make a different decision at a different point in the future.
Always consider the constraints of your tools. Think about what the designers
would have had to think about when they built that, and whether that's the case
that you have. If they're not solving for your case, you either want to change
the shape of your case to match theirs or find some mitigating approaches that will allow your problem to fit into the box of their solution. And finally, manage the complexity of distribution. Building distributed systems brings a lot of advantages, but as soon as we start, we're making the conscious decision, or it should be a conscious decision, to take on additional complexity, and we have to know that we're ready for that and that it's warranted for the particular problem that we are solving. Thanks so much for listening. And if you find these kinds of problems interesting, we're always looking to hire great folks, so come visit us at circleci.com.