Transcript
I joined my first startup in 1998. I had no background in computer
science and no industry experience in software, but they were
looking for someone to run QA or to start a QA function, and I was
interested enough, willing to give it a shot, take a chance.
So I showed up on my first day, excited to get going, but not really
sure what to expect. It turned out the first thing that they wanted me to
look into was a performance issue. This was
fairly straightforward, a page that had started
out responsive and was slowly degrading over time as the total user base
started to grow. And so, even not really knowing what to look for, it seemed like something that I could take on and give a shot. So I started to dig in, and the first thing I did was
go and look at some code. Now remember, this is 1998.
It predates most of today's tools for building web applications. We were using Java, and so we really had code that looked a little bit like this. This is probably not exactly right,
but we were building a table with a list of users
by writing strings with HTML tables directly embedded
in them. And so the key point here is that every time we wanted to add another username to the table, we would append to the string. This is now a well-known issue in Java, but back then it was not well understood by most of us: when you do this, you create the first string by allocating an object and writing the content into it. And when you append to it with the plus operator, you actually create a brand new string, allocate all of that memory, and release the old object to be collected by the garbage collector. And then when you append to it again, you do the same thing. So we were doing this over and over and over for all of the users in the list without realizing the implications.
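As a rough illustration only, the pattern looked something like the sketch below, shown next to the StringBuffer version described in a moment. The original code is long gone, and the markup, names, and method names here are invented.

```java
import java.util.List;

public class UserTable {
    // The pattern we had: every += creates a brand new String, copies the
    // old contents plus the new row into it, and leaves the old String
    // behind for the garbage collector.
    static String buildWithConcat(List<String> userNames) {
        String html = "<table>";
        for (String name : userNames) {
            html += "<tr><td>" + name + "</td></tr>";
        }
        return html + "</table>";
    }

    // The fix described next: a StringBuffer keeps one growable chunk of
    // memory, appends into it (growing the chunk itself when needed), and
    // only produces a String at the very end. Today you would reach for
    // StringBuilder, but StringBuffer was the tool at the time.
    static String buildWithBuffer(List<String> userNames) {
        StringBuffer html = new StringBuffer("<table>");
        for (String name : userNames) {
            html.append("<tr><td>").append(name).append("</td></tr>");
        }
        return html.append("</table>").toString();
    }

    public static void main(String[] args) {
        List<String> users = List.of("wanda", "carol", "natasha");
        System.out.println(buildWithConcat(users));
        System.out.println(buildWithBuffer(users));
    }
}
```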
In the virtual machine, what happens in this case is you allocate a StringBuffer and then you append to it with the StringBuffer's append method, and then ultimately write that out to the output by converting it to a String. So what's happening inside the system is that you've allocated a large chunk of memory, you can continue to write into that chunk, and when you get to the end, you allocate another chunk, which is actually all done by the StringBuffer, not by the developer. And when it's complete, you can write it all out. So the big takeaway for me in this, and it was a great early lesson to learn in my career, is this:
We all want to take advantage of the tools that are made available to us,
so that we can focus on the things that matter to our users. But not understanding how those things are implemented can ultimately lead to degradation for your users. So it's important to know that you get great advantage by leveraging the tools that are available, but it's also important to pay attention to how they are implemented, because there really is no magic, and to make good choices about the tools that you end up using. I'm Rob Zuber, CTO of CircleCI, and today I want to give you a few more examples from more recent history in the CircleCI platform of places where we've
been able to make great use of available technologies. But our initial
choices ultimately caused issues based on how those things were
implemented. So let's start with schemaless databases. At CircleCI, we use Mongo. We've used Mongo for a very long time. We're starting to use some other data stores in other pieces of our platform, but we still have a lot of Mongo. And one thing that I always tell people about schemaless databases
is there's always a schema. It's just a question of where that schema is
enforced. And in order to understand the issue that we ultimately ran into, it's important to understand a few things about Mongo, its implementation, and schemaless databases in general.
So this is a comparison of what it would look like to structure a row in a schema-defined database, like a relational database, and in a schemaless or document store, like Mongo. On the top you have a list of names and email addresses, and all you need to store is the names and email addresses, because the schema defines what a row looks like. This is a little bit more like a CSV than a true database table, because it's hard for me to print all the control characters on a screen, but you get the general idea: the amount of extra space is very limited. On the other hand, when storing a document in a schemaless database, you have to identify every field in every record or document, because there's no schema defining what could be present. So one document might have a name and an email and another might not, and that's totally valid inside the same collection.
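As a small sketch of that point with the MongoDB Java driver (the database, collection, and field names here are hypothetical), two differently shaped documents can live side by side, and each one carries its own field names along with its values:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class SchemalessExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> characters =
                    client.getDatabase("demo").getCollection("characters");

            // Every document stores its own field names alongside the values...
            characters.insertOne(new Document("name", "Wanda")
                    .append("email", "wanda@example.com"));

            // ...and a document with a different shape is perfectly valid in
            // the same collection, because nothing defines what "a row" is.
            characters.insertOne(new Document("name", "Carol"));
        }
    }
}
```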
So one thing to note when you do this is that that second document is about 25% bigger than the one above it, and that's not necessarily a good thing. In many places, bigger is better; in storage, bigger is worse. And in 2021, or even over the last ten years, we've come to believe that storage is effectively free, based on the price that we pay. But the size of rows or documents in storage has a lot of impact on the overall performance of your systems. So this growth is not necessarily a positive thing.
So, looking at this a little bit more specifically, if you think of a very
simple system like the one that we use to manage our Mongo instances,
you have compute, doing the actual calculations of what's happening and running the Mongo binary, and then you have disk, which is where
the data is stored, and most likely you have some sort of network
in between those two. In some cases you would have locally attached storage,
but in many cases, in our cloud systems, we're using network attached storage for
the benefits of that. And the result is you're moving this data back and forth
across these networks. And so latency and throughput are
both concerns when moving those large volumes of data.
So, as I said, the compute layer is effectively where parsing and editing of those documents happens. So when you have an unstructured document, you have to load the entire thing in order to find any of the pieces of it. And that's done within the Mongo binary. And that's also where editing
happens before it's flushed back out to the disk. And more realistically,
looking at that second example again, what truly happens is there's no defined ordering of the fields, so the same content could be written in a different order in two different documents. And the net result is that you have no way of knowing where in a document a given field might be. So when
you go looking for something, you have to iterate over the
entire document. Now, Mongo uses a format called BSON, which is effectively a binary version of JSON, or close enough: there are some special characters and control functions used, both to define indicators or separators of fields and to allow for a couple of additional data types that would not be available in a JSON document. But other than that, you can think about the parsing as very similar to JSON, meaning you open up an iterator and
you start moving over all the keys until you find the one that you want.
This particular example is from the Mongo C driver docs. So if you were writing this in C, it would look just like this; it's not quite a parser, it's a search for a particular element, but you'd start at the beginning and iterate until you found what you needed.
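In Java terms, a rough sketch of the same kind of scan might look like the following; the document contents are hypothetical, and this is conceptual rather than a claim about how the driver is implemented internally:

```java
import org.bson.BsonDocument;
import org.bson.BsonValue;

import java.util.Map;

public class FindField {
    // Walk the keys in the order they appear until we hit the one we want;
    // with no schema and no fixed field positions, there is no shortcut.
    static BsonValue find(BsonDocument doc, String wanted) {
        for (Map.Entry<String, BsonValue> entry : doc.entrySet()) {
            if (entry.getKey().equals(wanted)) {
                return entry.getValue();
            }
        }
        return null; // the field simply isn't present in this document
    }

    public static void main(String[] args) {
        BsonDocument doc = BsonDocument.parse(
                "{\"name\": \"Wanda\", \"email\": \"wanda@example.com\"}");
        System.out.println(find(doc, "email"));
    }
}
```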
Second, there are no joins, and this is a very common theme in schemaless databases.
Ultimately the goal being that you have different operational
characteristics and different models for deploying your systems
when you don't have to worry about joins between collections. So this
is considered generally to be a good thing. Unfortunately, it means that there are some specific behaviors that you have to account for in different ways,
and sometimes they don't work out the way that you would expect.
So, taking a look at the previous example, this is what it would actually look like laid out in a relational database. I mean, of course, there would be more to it. But say you had names and emails for your users, or the characters and superheroes that you're storing, and then you wanted to associate with them their powers. That would be a one-to-many relationship,
meaning an individual superhero could have many powers.
You would store a separate table of powers and then have a relational id back
to the character table to identify with which character they were
associated. Now, in a document store, that's still possible, but there's no enforcement of it in the database. So the more commonly proposed and supported pattern is to embed those.
So in this case, we have Wanda as Scarlet Witch, who has some specific
powers. We would list those inside of the document
representing Wanda, and for a small number of powers,
this is actually a totally reasonable approach, minus any
opportunity to enforce uniqueness or constraints on that
list of powers. Now, one of the amazing convenience functions that Mongo provides to allow you to manage these lists is something called $addToSet. So in that previous example, if I listed mind control and then wanted to add it again, calling $addToSet means Mongo will ensure that the value doesn't already exist inside of the record before it appends it, which sounds great and is super helpful.
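As a sketch with the MongoDB Java driver (the database, collection, and field names here are hypothetical), that deduplicating append looks something like this:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class AddToSetExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> characters =
                    client.getDatabase("demo").getCollection("characters");

            // First call appends "mind control" to Wanda's powers array.
            characters.updateOne(Filters.eq("name", "Wanda"),
                    Updates.addToSet("powers", "mind control"));

            // Second call is a no-op: $addToSet checks the existing array and
            // only appends if the value isn't already there.
            characters.updateOne(Filters.eq("name", "Wanda"),
                    Updates.addToSet("powers", "mind control"));
        }
    }
}
```

The convenience is real, but the deduplication implies scanning the existing array on every call, which matters later in this story.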
And it's something that we chose to use for one of our implementations, specifically our implementation of artifacts. So, CircleCI is a CI and CD platform. And one of the things that we allow you to do when you run
a build is store artifacts of that build. And this
is a small document that represents a single
artifact at the end of a build. In terms of what we would store, it describes the name, the actual path in S3 where we would have stored it, and then a URL to retrieve it. Now, these are obviously
adjusted to fit on the screen, but it's approximately this format.
In fact, I think they would mostly be longer than this. So for every single
artifact that was stored, we had some repetition, which, as I've discussed before,
was probably not great from a total size perspective. And we were storing these
documents in an array and using the $addToSet operator
to identify when we had duplication inside of that array and
to avoid that. So, similar to the short list of powers,
when we had builds that ran with a small number of outputs,
this was a great way to list them and a quick way to find the
data we needed and go and fetch those artifacts. We could show the list to
the user so they could see what artifacts were associated with their build,
click on one and download it. Now, what ended up
happening was we had customers who had 30,000 or more of these artifacts
being generated within a build, which, as you can imagine,
started to build some very, very large arrays inside of that
document. So, here's an example or a description of
what that sort of looks like as it's happening. So this is our
very simple Mongo instance again with compute, and then disk below
and the little square representing the specific document that we're
interested in editing and adding an artifact to.
And up above, we have a builder machine that's
actually doing the work and wants to write. And so when it makes a request
to Mongo to get access to that build document, Mongo loads
it off disk and holds it in memory in order to operate on it.
And ultimately, it will flush it back out to disk.
Now, one of the capabilities that we offer at CircleCI is the ability to parallelize your build. So you might have a very large number of tests, and in order to get through them faster, we will split those tests up onto different machines and run them at the same time.
Of course, that means that each one of those is computing a
different set of artifacts and writing it back to the
build. And the build is the collection of results from
all of those different machines, not one per machine.
So when each one of those tries to write, it has
to write to the same array. And when we call $addToSet, each one of those write requests is going to lock the build document and iterate over the entire array of artifacts, looking for any potential conflicts or duplicates before it ultimately writes at the end. And this could happen with 100 parallel builders trying to write 30,000 artifacts in total to that single document, each one of them locking the document and searching, by the end, through the previous 29,000 documents in order to see if there was any duplication or conflict. This was not a
great outcome for us. And even more exciting, we would be writing a large enough number of entries to that artifacts array that ultimately Mongo would have to grow the document: it would have to allocate a new chunk of memory, copy the existing document into that memory, and then continue to edit, which is another operation that happened while locked, blocking all of the potentially 99 or more other builders from writing.
And then we had to write that document back to disk, which requires
Mongo to allocate new space on the disk because there's not
room to fit it where it was before, find a new extent, move the
entire existing document over there and write the new content into
it. Or more likely, just flush the whole document and remove the old one.
So that whole operation had to happen before someone else could write.
Ultimately, this resulted in some very, very slow builds.
So the first quick fix we made was to remove the $addToSet operator and replace it with a $push operator, which just assumes there's no duplication, or that duplication doesn't matter, because it actually never did for us, and writes directly to the end.
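In driver terms, the quick fix was roughly a one-line change; this is a sketch with hypothetical field names rather than our actual schema:

```java
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class ArtifactWrites {
    // Before: every parallel builder's write locks the build document and
    // scans the whole artifacts array for duplicates before appending.
    static void recordArtifactSlow(MongoCollection<Document> builds,
                                   String buildId, Document artifact) {
        builds.updateOne(Filters.eq("build_id", buildId),
                Updates.addToSet("artifacts", artifact));
    }

    // After the quick fix: $push appends directly to the end of the array,
    // assuming duplicates either don't occur or don't matter -- and for us
    // they never did.
    static void recordArtifact(MongoCollection<Document> builds,
                               String buildId, Document artifact) {
        builds.updateOne(Filters.eq("build_id", buildId),
                Updates.push("artifacts", artifact));
    }
}
```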
The next step was effectively to rebuild large parts of that artifact management system,
shrinking the total amount of storage, because we knew the pattern to
get to a document, and then changing the storage model
so that we didn't need to store it inside the Mongo document at all.
So the key takeaway here for me is that there actually
was a great simple approach in the method that we chose. But when you take
a simple approach like that, it's really important to know when it's going to
break. For us, it was a surprise, and we had to go do some digging
and learning to figure out what was going on, and then find a new system.
If we had understood going in where the limitations of the Mongo
array capability would be, then we could have planned
for future work to make a more comprehensive or capable system
at the point where the scale would be a problem, or before that point.
Now, taking another look at how we use Mongo, but looking at a different aspect of it, let's talk a little bit about EBS. Now, this could probably be any block store; EBS is one that I happen to be fairly familiar with, because we use it a lot inside of CircleCI. And the one thing that
I will definitely highlight here is that many things feel like magic. There's never any actual magic; there are just important, difficult engineering challenges that have already been solved, and it's important to think about how they've been
solved. So, again, taking our simple example of a Mongo instance,
this is circa 2015, we actually had Mongo
outsourced to a third party provider who managed the database
for us. And again, your simple compute and disk
pairing. And in that model, those were actually locally attached.
So we were using AWS's largest instances at the time, which meant we had their largest local SSDs. Because disk and compute moved together, there was no way, at least at the time, to directly attach larger disks to the same sized instance. And we were running low on space for even a single collection, meaning we had already shifted off multiple different collections, left a single collection on a particular store, and the system was still running out of space. So we were constantly tweaking and managing the storage.
But no matter how small you get the documents, if you're continuing to add them
at a high rate of growth, you're ultimately going to run out of space.
Additionally, we had operational management issues like backup and restore. We were trying to do a backup daily, but we got to the point where a backup was taking greater than 24 hours, because we were pulling it out of the database and pushing it across the network, all through the compute engine, which was also trying to serve traffic. We ended up with stale, inconsistent backups by the time they were even created, and we obviously couldn't run them daily anymore because they weren't even done. Even worse, if we tried to use a restore to do any maintenance operations, that would take two to three days. So the ability to build out a new host, or even move us off when we were having operational issues, was severely limited.
We had operational problems all over and we
needed a new solution. So we decided to do two things at the same time.
We moved the overall Mongo operation into our own AWS environment, so we could customize the buildout, and we switched to using EBS as the disk storage. EBS is Elastic Block Store within AWS, so now we're back to having a network attached
storage model. We could optimize the disk and compute
separately. We had to pay attention to the operational characteristics,
but ultimately for us, the throughput and latency were both manageable.
Throughput is unbelievable inside of EBS,
and the latency was good enough, with the right mix of
computing power on the other side to store working sets, that it didn't end up
being a significant problem. Also, backups got a lot easier by using
snapshots. So snapshots are not actually instant, but they capture an instantaneous perspective of the disk itself, an instantaneous view, and then are capable of transferring that even though the transfer takes time. Meaning, if something is edited while the snapshot is still being transferred, the old version is held onto in order to complete the process of the backup, the snapshot transfer, inside of EBS. So ultimately you get a consistent view from a moment in time. And because of the way that Mongo operates with journaling, it's not even necessary to stop operations; the disk is known to be consistent at any particular instant.
Additionally, we now had the opportunity to attach different compute without having to make any significant data transfers. So off of the same disk, we could do a rebuild of the machine, whether it was because we wanted to upgrade the operating system, apply security patches, or upgrade Mongo itself. Without having to try to change the state of the existing machine, we could just build a new one, throw out the old one, and attach the disk to the new one. So our hosts became effectively immutable and ephemeral in that way.
And that was a significant improvement from an operational perspective.
And then finally, that same instant transfer made
restores really, really fast. It would take a couple minutes
to build out a multi terabyte EBS volume from a
snapshot that we had stored.
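For a sense of what that looks like mechanically, here is a minimal sketch with the AWS SDK for Java v2; the snapshot, instance, and device identifiers are placeholders, and this leaves out the error handling and the wait-for-available step you'd want in practice.

```java
import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.AttachVolumeRequest;
import software.amazon.awssdk.services.ec2.model.CreateVolumeRequest;

public class RestoreFromSnapshot {
    public static void main(String[] args) {
        try (Ec2Client ec2 = Ec2Client.create()) {
            // Build a fresh EBS volume from the stored snapshot. This returns
            // quickly even for multi-terabyte volumes, because blocks are
            // fetched lazily rather than copied up front.
            String volumeId = ec2.createVolume(CreateVolumeRequest.builder()
                            .snapshotId("snap-0123456789abcdef0") // placeholder
                            .availabilityZone("us-east-1a")       // placeholder
                            .build())
                    .volumeId();

            // Attach the new volume to a freshly built (immutable) host.
            // In practice you'd wait for the volume to become available first.
            ec2.attachVolume(AttachVolumeRequest.builder()
                    .volumeId(volumeId)
                    .instanceId("i-0123456789abcdef0")            // placeholder
                    .device("/dev/sdf")                           // placeholder
                    .build());
        }
    }
}
```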
And so this was quite magical and allowed us to again perform those kinds of operations. Let's say we wanted to add another
host to a replica set or replace one that wasn't working
the way that we wanted. Or sometimes even AWS needs to take back a VM, and we could just build a new one before they did that. So this was fantastic. Again, a couple of minutes to get multiple terabytes back online, and it seemed great until we actually tried to start Mongo up again. And then it was really, really slow. Like, kind of scary, did-I-break-the-production-database slow, when it takes minutes to start up a process that's normally pretty much instant.
And I happened to have the opportunity to
speak to some folks from Mongo and ask them if they had seen this type
of behavior before. It turned out they had. And they were able to give me
some clues about what might be happening. And it was
bad enough that it was slow to start, but the more important thing was
when we started the database and it actually did come online,
if we put it into production, even as a secondary for reads,
it would be extremely slow in its performance there as well for
an extended period. So what we learned was happening was
that the restore is not actually a transfer
of all of the data, but it's a reference to
the data that is placed on the new volume.
And so the underlying file system, when it recognizes
that something is missing, goes and fetches it. So first it
starts in sort of sequential order, moving across the disk, but then a request
comes in for something that it doesn't have. It prioritizes
that and fills that in. So there's an on-demand fetching model, and this is a great model for smaller
volumes and for very sequential reads. But in our case, with a
large database and random access across all of our users,
everybody was hitting very random locations around the disk
under high load in a production environment simultaneously and
causing great delays. Every single one of these reads was actually, instead of a read from disk, a fetch out to S3 or wherever the snapshot is stored, some data transfer, some decompression, and some placement of that data onto the disk before the read could actually be completed.
And so in the startup phase, it was the random sampling
of indexes that Mongo does to ensure that all the content is there and
consistent, or random sampling of the journal to then go look
at the disk to check for consistency. That was taking a
really long time because it's looking all over the disk. And then as soon as
we put it into production, it was customer reads that were taking a really long
time. And so once again, all of our customers are waiting
and not impressed with performance. So ultimately we
had to think about what was causing that and how we could manage it.
And what we ended up doing, and still do, is that when we make these large operations and move to new volumes, we run warming queries that we know will be most representative of the majority of content that will be fetched once we put it back into production, which is usually the most recently edited, most recently written data.
So we drive those queries to force all of this data
to be fetched, because if we wait for the entire disk
to fetch, it will take forever. And a huge amount of it is
not really that important to us.
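A warming query can be as simple as the following sketch; the collection, field names, and time window are hypothetical, and the point is just to touch the blocks that production traffic will need first.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

import java.util.Date;
import java.util.concurrent.TimeUnit;

public class WarmRecentData {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> builds =
                    client.getDatabase("demo").getCollection("builds");

            Date cutoff = new Date(System.currentTimeMillis()
                    - TimeUnit.DAYS.toMillis(30));

            // Read the documents production traffic is most likely to ask for
            // (most recently written first). Simply iterating forces the lazily
            // restored volume to fetch those blocks now, instead of during a
            // customer request.
            builds.find(Filters.gte("updated_at", cutoff))
                  .sort(Sorts.descending("updated_at"))
                  .forEach(doc -> { /* touching the document is enough */ });
        }
    }
}
```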
The key takeaway here is that your cloud provider, your tool provider, is
solving for a general case. They are building amazing things,
but they're building them for a very large audience and thinking about what's going to
matter to the largest subset of those people, probably applying an 80/20 rule, saying if we solve this 20% case, we're going to cover
80% of our customer base. And the question that you have to ask yourself is,
are you the target? Are you in that 20% bucket in terms of your use
case, such that your usage will be covered? Or do you
need to figure out how to adjust your approach or your problem space,
or work around that in order to have it work for you? Now I'd like
to turn to a final example, which is a little bit
less about the tools that we've used from others and more about challenges of our own that have come up while building out in a cloud native way. Concurrency has always been a problem,
has always been a challenging part of software development. And that's been true
for as long as I've been writing software, and it's always been true within
a single programming environment, and it hasn't gotten
any easier as a result of most of the tools that we've thrown at it.
So concurrency has always been hard from a programming perspective.
This is a newer article, but speaks to an old problem.
And the great news is distributed systems are also very hard,
and these days we are piling them together. So let's
take a look back at the early days of CircleCI. Like everybody else,
we started out with a monolith. This monolith did many things. In this
example, we're just talking about a couple of them writing to the database and
fetching data from GitHub, or writing to GitHub, things like statuses.
Even in our very early days, we would have had at least two instances of the monolith for redundancy and resiliency. These are very, very early days, but ultimately we ended up with many more than two, and far more than are even on this diagram. But at some point it
becomes a bit repetitive and we got to a point where each of
these monoliths was effectively thinking of itself still as the
owner of all of the work. And so the way that jobs
were distributed was an individual monolith would talk to the database, take the next
job, and try to start it. But there was a check to make sure that
there wasn't any contention, that no one else had taken that job. And at a
certain level there were so many attempts to take the same job
that we were losing the opportunity to process our queue effectively. So we
reached a point where we couldn't really scale any further, which is not great as
a business.
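The shape of that take-and-check was something like the following sketch, with hypothetical collection and field names rather than our actual code: each instance tries to atomically claim the next queued job, and at high instance counts most of those attempts discover that someone else got there first.

```java
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.FindOneAndUpdateOptions;
import com.mongodb.client.model.Sorts;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class JobClaim {
    // Atomically claim the oldest queued job for this instance. If another
    // monolith already took it, the filter no longer matches and we get back
    // null -- the "contention check". With dozens of instances polling, most
    // attempts return null, and the wasted attempts add up.
    static Document claimNextJob(MongoCollection<Document> jobs, String instanceId) {
        return jobs.findOneAndUpdate(
                Filters.eq("status", "queued"),
                Updates.combine(
                        Updates.set("status", "running"),
                        Updates.set("claimed_by", instanceId)),
                new FindOneAndUpdateOptions().sort(Sorts.ascending("created_at")));
    }
}
```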
So we took a fairly simple approach to solving this problem, because we were under time constraints. And the first simple approach was to
take a version or a copy effectively of this monolith and
run it in a slightly different role, meaning it executed
the work of determining whether or not a job should be processed
and distributing that out to one of the other instances,
rather than letting each instance make its own decisions about what
work should be done. Now, due to time constraints, as I mentioned,
this was effectively another copy of the monolith
that was started up with instructions to only do this, and all the other instances
were started up with instructions not to do that, but it was basically the same
code base. And so now, when it comes to statuses, you have this dispatcher, which is an instance of the monolith, writing to GitHub, and then the monolith doing the work also writing to GitHub. They write different statuses based on where they are in different parts of the job, and that was basically because that code path was where those things happened.
So if we dig into one of those a little bit and look at what happens in that original process, it's a little more complicated. Within the monolith, we have a concept of a build, which is the full piece of work that we're doing, or were doing at the time, and we might change the status of that from requested to queued, from queued to running, from running to completed, passed or failed. Pretty straightforward.
And we would do that by writing to the database. But instead of also writing to GitHub, which is an external system, with network issues and retry problems and timing issues, we had a watcher, which was effectively an asynchronous process identifying, oh, something has changed on the build, and therefore I'm going to go do this work. It's sort of an inverse of a pub/sub model, but really a concurrency model that allows us to continue doing the work on the build, knowing that the status will get updated to GitHub as soon as possible. The customer would prefer that the build get run rather than have us wait on our ability to write that status, so this makes great sense. It's a fairly straightforward programming model, plus or minus the fact that concurrency is always hard, with everything happening inside of that one instance of the monolith.
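Roughly, the shape of that watcher is sketched below, heavily simplified; the class names, the GitHub client, and the queue of change events are all invented for illustration.

```java
import java.util.concurrent.BlockingQueue;

public class BuildStatusWatcher implements Runnable {
    // Invented types for illustration: a change event carries the build id
    // and its new status, and the client knows how to post a commit status.
    public record StatusChange(String buildId, String status) {}
    public interface GitHubStatusClient {
        void postStatus(String buildId, String status);
    }

    private final BlockingQueue<StatusChange> changes;
    private final GitHubStatusClient gitHub;

    public BuildStatusWatcher(BlockingQueue<StatusChange> changes,
                              GitHubStatusClient gitHub) {
        this.changes = changes;
        this.gitHub = gitHub;
    }

    @Override
    public void run() {
        // The build-running code just updates the database and drops an event
        // here; this loop pushes statuses out to GitHub whenever it gets to
        // them, so slow or flaky GitHub calls never hold up the build itself.
        while (!Thread.currentThread().isInterrupted()) {
            try {
                StatusChange change = changes.take();
                gitHub.postStatus(change.buildId(), change.status());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```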
Now introduce the dispatcher. So the dispatcher is the first thing to see that build, and, again using effectively the same code base, it writes the build, notices that the build is now queued, and writes a status that the build is queued and ready for processing, or has been handed off to a machine and has started processing. And then one of these monoliths receives the work, executes the build, and updates the status again, which then gets picked up by a concurrent reader that goes and does the work of writing to GitHub. Which is great under normal conditions. However, it's highly
possible, it certainly is possible, we know it to be possible, for the dispatcher to be under a great degree of load, because it's processing every job coming into the system, or, since there may be a few of these, a huge percentage of the jobs coming into the system, and it can be very busy on its writes. And then the instance of the monolith, the builder we'll call it, is going to pick up that work and execute it, and the work might be very small, very short, and then it delivers its status. And its status watcher could be very quiet, because there are only a few builds running on any particular builder. So the net result is that the completed
status can actually get to GitHub before the starting
status, and GitHub doesn't understand what we're doing enough to determine
that those are out of order and we are not coordinating
in this particular model. So we send to GitHub a completed
status, and then we send a starting status which reverts the
pull request and the status of the pull request to incomplete
in a way that can never be completed. This is the worst kind of waiting.
This is waiting forever. And ultimately we ended up with customer tickets
saying my build cannot be completed or my build was completed,
but I never got the status. Which is another interesting thing about concurrency: no one told us, you're setting the complete status before you set the starting status. They said, you're never setting the complete status, because that's what a customer sees; the two writes were usually so close together. And they were unable to merge their work. So this minor concurrency issue on our side resulted in our customers being unable
to complete their work. So we ended up having to go find a model for
coordinating that work and ensuring the sequencing that was obviously no longer guaranteed within that single watcher on the monolith, but instead had to be coordinated and guaranteed across multiple systems in the platform. So I would say services
beget services. When you start to break apart your system,
you end up with more and more systems, often trying to
do the work of coordinating the systems that you have.
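One common shape for that kind of sequencing guard, shown purely as an illustration rather than as the actual CircleCI fix, is to give the statuses an order and refuse to let a write move a build's reported status backwards:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StatusSequencer {
    // Ordered lifecycle: a later status must never be overwritten by an
    // earlier one, no matter which service reports it first.
    public enum Status { REQUESTED, QUEUED, RUNNING, COMPLETED }

    private final Map<String, Status> reported = new ConcurrentHashMap<>();

    // Returns true if the incoming status is at least as far along as anything
    // already reported for this build; stale earlier statuses are dropped, so
    // a completed pull request can never be flipped back to incomplete.
    public boolean shouldReport(String buildId, Status incoming) {
        Status result = reported.merge(buildId, incoming,
                (current, proposed) ->
                        proposed.ordinal() > current.ordinal() ? proposed : current);
        return result == incoming;
    }
}
```

In a single process a map like this is enough; across services, the same comparison has to live behind shared state or a coordinating service, which is exactly the extra system the previous paragraph is talking about.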
And my takeaway in this case is that distributed systems really
are concurrency. There's no world, well, there's a very
small world in which you can build out a complex distributed system without
paying any attention to concurrency. It's very likely going to end up being
a parameter that you deal with because you're trying to handle complexity,
resiliency, breakdowns in the communication network,
whatever it might be. There is going to be concurrency in the conversations between
your systems, and so you want to pay attention to that and manage for it
and use it when it is most beneficial, but understand the cost.
So, to summarize, it's a great time to be a developer.
We are able to build on top of amazing systems
that do amazing things and focus our
time and attention on delivering value to our customers, building the systems
that we're really excited about building in order to support businesses that we're excited
about building, and all of that is novel and new
to us and very, very cool. On the other hand,
it's important to pay attention to what's happening. So do the simple thing,
because that will help you get something out faster. But make sure you understand
the simple thing that you're doing. Understand the tradeoff that you're making so that
you'll know when that tradeoff will expire and you will
want to make a different decision at a different point in the future.
Always consider the constraints of your tools. Think about what the designers
would have had to think about when they built that, and whether that's the case
that you have. If they're not solving for your case, you either want to change
the shape of your case to match theirs or find some mitigating approaches that will allow your problem to fit into the box of their solution. And finally, manage the complexity of distribution. Building distributed systems brings a lot of advantages, but as soon as we start, we're making the conscious decision, or it should be a conscious decision, to take on additional complexity, and we have to know that we're ready for that and that it's warranted for the particular problem that we are solving. Thanks so much for listening. And if you find these kinds of problems interesting, we're always looking to hire great folks, so come visit us at circleci.com.