Transcript
Hello everyone, my name is Ankit, and today we will be
talking about how to automate merges to keep your builds
healthy. Before we go into the details, a little bit about my
background: I'm the cofounder and CEO of Aviator. We are building
a developer workflow automation platform.
Previously I've been an engineer at several companies including
Google, Adobe, Shippo, Home, Enjoy and Sunshine.
You can also find me on Twitter at ankitxG.
So the first question you will be asking is: merging,
how hard can it be? Right? All you have to do is press this green
button and your changes are merged. Well, let's take a deeper
look. Before we go into details, let's think
a little bit about the different repositories that teams use.
So we have the monorepo, where the entire engineering team
is working in a single repository, and then we have the polyrepo,
where every team within a company may be using a separate repository.
In the case of a monorepo, the typical advantages are that it's easier to
manage dependencies, you can identify vulnerabilities
and fix them everywhere together, and it's easier to do refactoring,
especially if it is cross-project. You have standardization of
tools, and code sharing becomes straightforward:
you can share the same libraries across different projects.
In the case of polyrepos, some of the advantages are simpler CI/CD
management. Every team has its own CI,
so there are fewer build failures,
you have independent build pipelines, and build failures are typically localized
within the team. For this conversation,
we will primarily be focusing on the monorepo and identifying
the challenges a large team faces when working in a monorepo.
So the first question you would probably want to ask is: how
often do your mainline builds fail? And is
your current CI system enough? There are several reasons why
your mainline builds may fail. For instance, there could be stale
dependencies between different changes that are merging. There could
be implicit conflicts, where two developers are working on a similar part of the code base.
You can have infrastructure issues. There could be issues
with timeouts. There could be both internal as well as third-party
dependencies that can cause failures. And obviously there are race
conditions and shared state that also cause flaky tests.
In this particular conversation, we will primarily be talking about stale dependencies
and implicit conflicts and why those things are important
to think about.
So to give you an example,
let's take an example of why this is
important in the monorepo case. Let's say there are two pull
requests that are being merged together, but they were
branched off your main at two different times, and both
of them have a passing CI. Eventually, when you go and merge
those changes, the mainline fails, right? This could
happen because both of them may be modifying the same pieces of code
in ways that are not compatible with each other.
And the challenge is, as your team grows, these kinds
of issues become more and more common. Eventually there
will be teams setting up their own build police
to make sure that if a build ever fails, there is
somebody responsible for actually fixing it. Typically
people will do rollbacks, and releases get delayed.
There could be chain reactions where people are basing off of
failing branches and then eventually have to figure out how
to actually resolve those issues. And these kinds
of failures increase exponentially as your team size
grows. That's why, at a certain point, it becomes very critical for your team
to make sure you can actually take care of it, so
that developer productivity is not significantly impacted. You don't want a situation
where everyone is just waiting for the builds to become green,
they cannot merge any changes because builds
are always broken, and you're losing important
developer hours.
So then what's the solution?
The solution is merge automation. Many companies offer this;
for instance, you may have heard of GitHub launching their own merge queue.
GitLab also has something similar called merge trains. There are
some open source versions that exist today, like ours, that can also
provide similar capabilities. So we'll be
diving more into how this merge automation works and how you can adopt
it internally. To give you a very simple example,
let's look at this pull request, PR 1. Instead of
developers manually merging these changes, they would
inform the system that the PR is ready. At
that point, instead of merging the changes directly,
the system would actually merge the latest
main into this pull request and then run the CI.
The advantage here is you're always validating the changes with
the most recent main. Now, if the CI passes,
this PR will get merged. And meanwhile,
let's say a second PR comes in while the CI is running.
It's going to wait for the first PR to merge before it picks
up the latest main, and then it will go through the same process.
So once the second PR passes, it's going to
merge the same way.
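To make that flow concrete, here is a minimal sketch of a serial merge queue. The helpers run_ci, merge_latest_main_into and merge_into_main are hypothetical stand-ins I'm assuming for illustration, not GitHub's or Aviator's API:

```python
from collections import deque

def merge_latest_main_into(pr: str) -> str:
    """Stand-in: create a candidate branch that merges the latest main into the PR."""
    return f"candidate/{pr}"

def run_ci(branch: str) -> bool:
    """Stand-in: trigger CI for the branch and wait for the result."""
    return True

def merge_into_main(pr: str) -> None:
    print(f"{pr} merged into main")

def serial_merge_queue(ready_prs: deque) -> None:
    # One PR at a time: validate against the freshest main, then merge.
    while ready_prs:
        pr = ready_prs.popleft()
        candidate = merge_latest_main_into(pr)  # always re-test on the latest main
        if run_ci(candidate):
            merge_into_main(pr)
        else:
            print(f"{pr} rejected: CI failed against the latest main")

serial_merge_queue(deque(["PR-1", "PR-2"]))
```

The point is that CI always runs on the PR combined with the most recent main, so stale dependencies and implicit conflicts surface before the merge instead of breaking mainline afterwards.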
Let's look at some stats. Let's assume, for the sake
of this conversation, that the CI time is about 30 minutes,
and the number of pull requests you're merging in a small team is about ten PRs
a day. If you run them serially, it will take about 5 hours,
because you're waiting for each PR's CI
to pass before running the next one, and the total amount of
CI you will run is about 50. Now the challenge is, if
you're in a big team where you're merging 100 PRs a day
with the same CI time, you're looking
at about 50 hours of merge time. Now this
is unrealistic; no system can merge
changes that slowly. So then, can we do better?
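As a quick back-of-the-envelope check of those serial numbers (purely illustrative):

```python
CI_HOURS = 0.5  # a 30-minute CI run, as assumed above

for prs_per_day in (10, 100):
    # In a serial queue, each PR waits for the previous PR's CI to finish.
    print(f"{prs_per_day} PRs/day -> about {prs_per_day * CI_HOURS:g} hours of merge time")
# 10 PRs/day -> about 5 hours of merge time
# 100 PRs/day -> about 50 hours of merge time
```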
One way we can do better is by batching
changes. So instead of merging
one PR at a time, the system can
wait for a few PRs to collect before running the CI.
The advantage here is that by creating these batches you are,
one, reducing the number of CI runs you're running,
and two, helping reduce how long the wait time is. So in
this case, if the CI passes, it's going to merge
all four of the PRs together, assuming all of them
eventually pass the build.
And in case there's a failure, we are going to bisect
the batch so that we can identify which PR is
causing the failure and merge the rest of them. As you can imagine, this is
going to cause a little bit of slowdown in the system.
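Here's a minimal sketch of that batch-and-bisect idea, reusing the same kind of hypothetical helpers as before (illustrative only; a real system would also re-validate the remaining PRs against the latest main after each merge):

```python
from typing import List

def run_ci_on_batch(prs: List[str]) -> bool:
    """Stand-in: merge main plus all PRs into one candidate branch and run CI."""
    return True

def merge_batch(prs: List[str]) -> None:
    print(f"merged: {prs}")

def merge_or_bisect(prs: List[str]) -> None:
    # Happy path: the whole batch passes and merges as one unit.
    if run_ci_on_batch(prs):
        merge_batch(prs)
        return
    if len(prs) == 1:
        print(f"rejected: {prs[0]} broke the build")
        return
    # Failure path: split the batch in half and retry each half,
    # so the offending PR is isolated and the rest can still merge.
    mid = len(prs) // 2
    merge_or_bisect(prs[:mid])
    merge_or_bisect(prs[mid:])

merge_or_bisect(["PR-1", "PR-2", "PR-3", "PR-4"])
```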
So let's look at the stats. In the best
case, where there is no failure, if you're using a batch size of four, your total
merge time suddenly drops from 50 hours to twelve and a half hours.
That's a significant improvement, and the total number
of CI runs is also going to be smaller.
But in a real scenario you're going to have failures.
So if there is even a 10% failure rate, you can see the merge
time increases significantly: you're waiting something like 24 hours for
all your PRs to merge, and the total number of
CI runs also increases significantly.
So then the question is, can we still
do better? We are going
to think about merges slightly differently.
Instead of thinking of merges as happening in a serial
world, think about this as parallel universes. By parallel
universes, what I mean is: instead of thinking of main as
a linear graph, or a linear path,
you think of it as several potential futures
that main can possibly represent. To give an example,
let's think about optimistic queues. Let's say your main is at this particular
point and a new PR comes in that's ready to merge. What
we are going to do is something similar to before: we're going to pull the
latest mainline and create this alternate
main branch where we run the CI.
Once the CI passes, we are going to eventually merge
the changes, just like before. But now imagine that
while the CI is running, a second PR comes in as well. Instead
of waiting for the first CI to pass, we optimistically assume that
the first PR is going to pass, and on this alternate main
we're going to start a new CI with the second PR as well.
So once the CI for the first one passes, it's going to
merge, and likewise, as soon as the CI
for the second one passes, it's going to merge as well.
Now, obviously, we have to look at what happens if the CI for the first one fails.
If the CI for the first one fails, what we're going to do is
reject this alternate main and essentially create a
new alternate main where we run the rest of the changes
and follow the same pattern. And in this particular case,
we're going to make sure that PR 1 does not merge, because of the build
failure. So in the best
case scenario, given that we're now not waiting for each CI to finish,
you can technically merge all your 100 PRs in less
than an hour. Obviously, in the median
case, where we expect about 10% of the PRs to fail,
your merge time is still very reasonable. Now you're merging
in 6 hours instead of the twelve and a half hours that we were
seeing before. Alongside that, your CI runs
are slightly higher in this case, because you can possibly be running multiple of these CIs
at the same time. So then
the question is, can we still do better?
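Before we combine strategies, here is a minimal sketch of the optimistic queue we just walked through; the dictionary of speculative candidates and the helper name are assumptions for illustration:

```python
from typing import Dict, List

def build_candidates(queued_prs: List[str]) -> Dict[str, List[str]]:
    """Each PR gets a speculative candidate branch containing main plus
    every PR queued ahead of it, optimistically assumed to pass."""
    return {pr: queued_prs[: i + 1] for i, pr in enumerate(queued_prs)}

queue = ["PR-1", "PR-2", "PR-3"]
for pr, contents in build_candidates(queue).items():
    print(pr, "is validated against main +", contents)
# If PR-1's CI fails, its candidate is discarded and fresh candidates are
# rebuilt for PR-2 and PR-3 on a new alternate main that excludes PR-1.
```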
One way to improve on this is by combining some of the strategies we've discussed
before. Let's say you combine the strategy of the optimistic queue with
batching. So instead
of running a CI on every PR, we now combine them
together: essentially you're running these batches of
PRs, and again, as they pass, you merge them. If they fail,
you split them up and identify what caused the failure.
Let's look at the stats again. Now the total merge time is still
less than 1 hour, but what we have done
is reduce the total number of CI runs to 25
instead of 100. And even in the median case,
the merge time comes down from 6 hours to 4 hours, and your total
number of CI runs is still lower.
So can we still do better?
Now let's think about some more concepts. One of them
is predictive modeling, but before that, let's think about
what happens if we consider all possible scenarios
of what main could look like, depending on whether a particular CI, or a particular
PR, is going to pass or fail.
So in this case, we have represented these three PRs and all possible scenarios
where all three of them merge, two of them merge, or one of them merges.
If you run things this
way, then we never have to worry about failures, because we
are already running all possible scenarios and we know one of them is going
to be successful. The challenge here
is obviously that this runs a lot of CI, right? We don't want to be running too
much CI, and this is where it gets interesting. So instead
of running CI on all of the scenarios, what we can
do is calculate a score and,
based on that, identify
which paths are worth pursuing. You
can do this optimization based on the lines of code in a PR,
the types of files being modified, the tests added or removed
in a particular PR, or the number of dependencies. So in
this case we have specified the cutoff as 0.5, and as you
can see, we are running only a few
of these build paths, and that's how we reduce the number of
CI runs.
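As a sketch, a scoring function of this kind might look like the following; the features and weights here are made up for illustration, not the actual model:

```python
def pass_likelihood(pr: dict) -> float:
    """Toy heuristic: smaller, test-covered changes with few dependencies
    are assumed more likely to pass. Weights are illustrative only."""
    score = 1.0
    score -= min(pr["lines_changed"] / 2000, 0.4)      # big diffs are riskier
    score -= 0.1 * pr["dependencies_touched"]          # broad impact is riskier
    score -= 0.2 if pr["modifies_build_files"] else 0  # build/config churn is riskier
    score += 0.1 if pr["tests_added"] else 0
    return max(0.0, min(1.0, score))

CUTOFF = 0.5  # only speculate on build paths whose likelihood clears this

pr = {"lines_changed": 120, "dependencies_touched": 1,
      "modifies_build_files": False, "tests_added": True}
print(pass_likelihood(pr))  # about 0.94 -> worth including in the speculative paths
```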
So now you're obviously asking, can we still do better?
Now we're going to think about the concept of multi-queues.
This is applicable in certain specific cases where we can understand
the different builds. So instead of thinking of this as a singular
queue, we're now going to think of it as many different paths you
can take, many disjoint queues that you
can run. We use this concept called affected targets.
There are systems like Bazel that actually produce
these results, these affected targets. So essentially, if you
can identify which builds within your
primary repository a particular change impacts,
you can create disjoint queues, where all
of these queues can be run independently while making
sure the other builds are not impacted. So let's assume that there are
four different kinds of builds that your system
produces, A, B, C and D, and this is the order in which the PRs
came in. You can potentially create different queues
out of it. So, say, the PRs that impact A are
in the first queue, the PRs that impact B are in the second queue,
and so on. One thing to note here is that
a PR can be in more than one queue if it's impacting more
than one target, and that's totally fine. So essentially,
for PR 4 or PR 5 to pass, in this
case it needs to wait for PR 2 to pass or fail.
But at the same time, we are still making sure that in a worst
case scenario where a PR fails, we are still
not impacting all the PRs in the queue, but only the ones which are
behind it in that particular queue. This definitely
increases the velocity at which
you're merging changes, because it is, in some ways, localizing
failures to a particular affected target.
So this is a great example where we are looking at two separate queues.
Let's say one target is the back end, one target is the front end, and there are
multiple PRs queued, but they can be independently
compiled and run while making sure that
one change is not impacting the other. That way you can
run them in parallel as well as merge them
without impacting the builds. So now,
can we still do better?
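Here's a minimal sketch of partitioning a queue by affected targets; affected_targets() is a stand-in for whatever your build system (for example, Bazel query results) reports:

```python
from collections import defaultdict
from typing import Dict, List, Set

def affected_targets(pr: str) -> Set[str]:
    """Stand-in for build-graph analysis reporting which top-level
    targets a PR's changed files impact."""
    sample = {"PR-1": {"backend"}, "PR-2": {"frontend"},
              "PR-3": {"backend", "frontend"}, "PR-4": {"frontend"}}
    return sample[pr]

def partition_into_queues(prs: List[str]) -> Dict[str, List[str]]:
    queues: Dict[str, List[str]] = defaultdict(list)
    for pr in prs:  # preserve arrival order within each target's queue
        for target in affected_targets(pr):
            queues[target].append(pr)  # a PR may sit in several queues
    return queues

print(dict(partition_into_queues(["PR-1", "PR-2", "PR-3", "PR-4"])))
# {'backend': ['PR-1', 'PR-3'], 'frontend': ['PR-2', 'PR-3', 'PR-4']}
# A failure in the backend queue only blocks the PRs behind it in that queue.
```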
There are a few other concepts we can think about to
optimize these workflows further. One
of them is reordering changes.
For instance, you can select
the high-priority changes, or the changes with a lower failure
risk, and put them ahead in the queue. The advantage here
is that it's less likely to cause a chain
reaction of failures, and it reduces the number of
possible failures you can have. You can also order based on
priority: something which is a really big change, you can probably say it's
going to be lower priority and pick it up later.
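A sketch of that reordering idea, using the same kind of toy risk score as before (the priorities and weights are illustrative):

```python
from typing import List, Tuple

# Each queued PR carries a priority (higher is more urgent) and an
# estimated failure risk (0.0 to 1.0), however you choose to compute it.
QueuedPR = Tuple[str, int, float]  # (name, priority, failure_risk)

def reorder(queue: List[QueuedPR]) -> List[QueuedPR]:
    # Urgent, low-risk changes go first; big risky changes wait,
    # so a failure is less likely to cascade through the queue.
    return sorted(queue, key=lambda pr: (-pr[1], pr[2]))

queue = [("huge-refactor", 1, 0.6), ("hotfix", 3, 0.1), ("small-feature", 2, 0.2)]
print([name for name, _, _ in reorder(queue)])
# ['hotfix', 'small-feature', 'huge-refactor']
```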
There are other concepts, like fail fast,
where you reorder the test execution: the tests which typically fail
more often, you probably want to run sooner. That way, as soon
as the PR gets queued, you identify these failures early and are able to fail fast.
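For example, a test runner could sort suites by historical failure rate so the most failure-prone tests run first (a sketch; the suite names and rates are made up):

```python
# Historical failure rates per test suite, e.g. from the last N CI runs.
failure_rate = {"test_payments": 0.08, "test_ui_smoke": 0.01, "test_db_migrations": 0.15}

def fail_fast_order(suites):
    # Run the historically most failure-prone suites first,
    # so a doomed PR is kicked out of the queue as early as possible.
    return sorted(suites, key=lambda s: failure_rate.get(s, 0.0), reverse=True)

print(fail_fast_order(["test_ui_smoke", "test_payments", "test_db_migrations"]))
# ['test_db_migrations', 'test_payments', 'test_ui_smoke']
```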
The other thing you can do is split the test execution. This is
what many companies do: they
run many of the fast tests before merging,
making sure those are the ones which are more
critical, or the things which possibly fail more often, and
then run the smoke tests, or the things which
are typically stable but maybe slower,
after the merge. Obviously you expect
the tests you run after the merge to fail
very rarely, but if one fails, you can just roll back.
So essentially you're trying to find the best of both worlds: your
builds are generally passing, and in the rare case where a post-merge test fails,
you have a way of automatically rolling it back.
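A minimal sketch of that split, assuming hypothetical run_suite() and rollback behavior:

```python
FAST_CRITICAL_SUITES = ["unit", "lint", "integration_core"]  # gate the merge
SLOW_STABLE_SUITES = ["smoke", "end_to_end", "performance"]  # run after the merge

def run_suite(name: str) -> bool:
    return True  # stand-in for a real test run

def merge_with_split_tests(pr: str) -> None:
    # Pre-merge: only the fast, failure-prone suites block the queue.
    if not all(run_suite(s) for s in FAST_CRITICAL_SUITES):
        print(f"{pr} rejected before merge")
        return
    print(f"{pr} merged")
    # Post-merge: slow but stable suites; a rare failure triggers a rollback.
    if not all(run_suite(s) for s in SLOW_STABLE_SUITES):
        print(f"post-merge failure, rolling back {pr}")

merge_with_split_tests("PR-42")
```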
So that's about it.
I'm going to share a little bit about some of the other things we at
Aviator work on. We also support polyrepo scenarios.
To give you a very quick example: let's say you have different repositories
for different types of projects, and you're
merging a change. Let's say you're modifying an API; that's going to
also impact how that
API interacts with the web, how it interacts with iOS and Android.
Now the issue here is, let's say
you make this change, which requires all of them to
be modified, but you merge three of them and one of them fails.
This can cause eventual inconsistency,
and our system essentially ensures against that: we have
a workflow, we call it Changesets, that
defines these dependencies of changes across
the repositories so you can consider them as a single atomic
unit. Essentially, we make sure all of these PRs either
merge together or none of them merge at all.
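A sketch of that all-or-nothing behavior across repositories (the structure and helper names are illustrative, not Aviator's internals):

```python
from typing import Dict

def ci_passed(repo: str, pr: str) -> bool:
    return True  # stand-in for the real CI status of that repo's PR

def merge(repo: str, pr: str) -> None:
    print(f"merged {repo}#{pr}")

def merge_changeset(changeset: Dict[str, str]) -> bool:
    """changeset maps repo -> PR, e.g. {'api': 'PR-12', 'web': 'PR-7'}.
    Merge only if every linked PR is green, so the repos never end up
    in an inconsistent, half-merged state."""
    if not all(ci_passed(repo, pr) for repo, pr in changeset.items()):
        print("changeset rejected: at least one PR is failing, nothing merges")
        return False
    for repo, pr in changeset.items():
        merge(repo, pr)
    return True

merge_changeset({"api": "PR-12", "web": "PR-7", "ios": "PR-3", "android": "PR-9"})
```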
We also have a system to manage flaky tests. As you can imagine,
as your system and teams grow, everyone becomes very familiar with flaky
tests: your system becomes unstable, you have shared state or inconsistent
test behaviors where your tests may fail even though you may
not have made any change related to them. So what
our system typically does is identify all these
flaky tests in your runs and provide
a customized check that you can use to
automatically merge the changes. What we typically
do here is identify
whether a particular flaky test failure is related to your
changes or not. If it is not, then we suppress that test in the test
report so that you get a clean build health,
and then you're able to merge those changes. This
works very well when you're thinking about automatic merging, because you can use
this check as a validation to make sure your builds are
still healthy.
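A rough sketch of that kind of check; the known-flaky list and the related-to-change heuristic are assumptions for illustration, not how the product implements it:

```python
from typing import List, Set

KNOWN_FLAKY = {"test_checkout_retry", "test_websocket_reconnect"}  # from run history

def files_touched_by_test(test: str) -> Set[str]:
    return {"services/checkout.py"}  # stand-in for a coverage / ownership mapping

def custom_check(failed_tests: List[str], changed_files: Set[str]) -> bool:
    """Report green if every failure is a known flaky test that is
    unrelated to the files changed in this PR; those get suppressed."""
    for test in failed_tests:
        related = bool(files_touched_by_test(test) & changed_files)
        if test not in KNOWN_FLAKY or related:
            return False  # a real, relevant failure: block the merge
    return True

print(custom_check(["test_websocket_reconnect"], {"docs/readme.md"}))  # True
```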
We also have capabilities to manage stacked PRs. So if you have PRs which are
stacked on top of each other, we have capabilities through which you can automatically
merge these changes, and there's a CLI to
sync all your stacked PRs together as well as to
queue all of the PRs to be merged together.
And again, it follows a similar pattern to the Changesets we described:
we automatically identify the dependencies between all these stacked PRs
and consider them as an atomic unit, so all of them merge together or none
of them merge at all. I think
that's about it. So thank you very much. These are some of the references that
we used for this talk. I really appreciate you spending time
listening to this. If you have any questions, you can always reach out to me.
My email is ankit at aviator.co.
And yeah, check us out: at Aviator we do automatic
merging as well as many other capabilities, many automated workflows
for developers. If you have any questions,
as I said, you can reach out to me, or you can send me a
DM on Twitter. Thank you and have a great day.