Transcript
This transcript was autogenerated. To make changes, submit a PR.
Good morning, good afternoon, good evening, wherever you are.
Thank you for having me over at Conf 42 to
present you my talk titled the need for speed.
First off, I'd like you to have a think for
yourself about what's a fast build to you?
How long does it take? Or on the other hand,
what's a slow build? What's a frustratingly slow
build for you? And to continue from there,
have you ever had to deal with something
so agonizingly slow that you thought
about just kind of turning the CI CD off and
just going manual again? I certainly have.
And the story of my worst ever build went something like this. I wake up one
morning and first thing I do is obviously check my twitter
and next thing I do is check my email. Just naturally
comes like that. And at the top of my list of emails I have
an alert from my CI tool and it's saying that
the build has failed. Build from yesterday. I mean,
that's fine, right? We can make some coffee, open up the laptop,
make some changes, think, okay, this should do it.
And commit, push, forget about it, go to work.
I spent all day at work doing all that developer stuff developers do. Anyway, the story is some six years old, from 2015-ish.
I was working on a mobile team, and we were
really following the best engineering practices that we knew
at the time. So essentially everything was written
according to the solid principles. We were adhering
almost religiously to clean architecture, and we
were also writing everything test first, and not only unit tests,
so like TDD, but also functional tests first. What this meant
is that every single feature, every single
part of the application, before we would start developing,
we had a functional test agreed on with
the product manager of what this feature should actually do.
For example, if you were doing email verification,
that was the functional test you had, and those functional tests were testing the UI. So as far as we were concerned,
everything was 100% tested. We had tests before we
actually started development and this was the best practice that we
could ever think about. Anyway, I'm at work actually
just finishing up my day when I get another email.
And yes, you may see where this is going.
It is from the friendly neighborhood CI. Guess what
it's telling me? It's telling me that this build has failed
again. Still. I mean, it took a bit of time, it's the same
build that I kicked off in the morning, but that's still fine,
right? It's just two failing builds in a row. What could ever
go wrong? This happens. I must have missed something. And so I'll
just make some more changes, kick off another build, go home. Tomorrow morning,
everything is going to be better. Right? Well, this would have been a short talk
if it was. It wasn't better. And the day after, it also wasn't better. It went on
for several days or maybe even weeks. Before you know it,
we are facing this long, long queue
of features, bug fixes,
improvements, refactorings, everything.
We have all this kind of code, new code ready to
be released, but not being able to be released because we
were just sitting at this junction,
looking at the red signal from the CI. We couldn't really
release anything because, yeah, we were just stuck with
a failing build. Yeah, I mean, this is kind of the point where you're
starting to think, do you just turn around and go manual?
Just like that ship that blocked the Suez Canal a couple
of weeks ago? I imagine the captains of the ships behind
this waiting to get across, to go through Suez Canal, they were thinking about,
okay, do we kind of wait for this to clear up, or do
we kind of turn around, go around Africa and,
yeah, to get to our destination, whether that's Europe or Asia.
So yeah, that's the DevOps equivalent of waiting
on a build to clear and considering going manual.
But the thing is, every red build, even if it's super agonizing,
even if you're struggling with it very hard, even if you're really frustrated, it's still sending
you a signal. And that signal is why we have
CI systems in place in the first place. And instead of
starting to think about how we could bypass it,
we should instead start to think about how do we make this
better. My name is Zan. I'm a developer advocate at CircleCI,
and I have broken many, many builds,
definitely more than I would wish to admit. If you want
to get in touch with me, I'm quite active on Twitter and you can also
email me at zan@circleci.com. But yeah,
this talk is not about giving up. This talk is not about turning
CI off and going manual. This talk is about making your
builds faster and less frustrating
in the process. I mentioned I work for CircleCI. We're a CI CD platform. While all the examples are going to be agnostic of the technologies and tools that we'll be talking about, I will be sharing concrete examples of how to do things with CircleCI. So for that, I'll take a couple of minutes to talk you through how CircleCI looks, works, and how it's configured, so you have better context of how everything falls into place. So the
core of everything is the pipeline, which is what
gets triggered when you commit something. You push some code, that gets triggered, and the pipeline runs everything that you want it to do, from running tests to deploying to wherever you need to go. The pipeline is defined in the .circleci folder that you'll have at the top of your repository, in a config.yml file. In this file, as I said, everything is defined, including jobs, orbs and workflows. So jobs are your idea of what needs to happen and where, where being the environment. That could be a Docker container, that could be a virtual machine; we call it an executor. And in this environment we also specify a set of steps. Steps are the instructions that we want to run as part of that build or that job. Those can be command line instructions like npm run test, or things that tell CircleCI to do something, like checkout, which will check out the code at that particular commit. Or they
could be more complex commands that are stored in these things we call orbs, which are like reusable sets of config that you can pull in from our central repository and then have access to common commands, for example Kubernetes commands or Docker commands. You can either write it all manually or use an orb to do it for you. Last thing to do, after we've defined all our jobs and their steps, is to put them into the context of a workflow. A workflow essentially lets you order and arrange those jobs so that they get run however you wish. For example, you can say, okay, first run all these jobs, and only when these pass, run something else. So that's the configuration: it's only one YAML file. The other part is
the dashboard, which kind of gives you this view of
what's going on in your entire organization. So I'm looking at
my own GitHub organization, but you can have your
company's organization, anyone. And in there you will see
all the projects. You can go into a project, for example,
let's go back to the one that I was looking at earlier. So for this API here you can see all the pipelines that have been run. For example, we've run it like 96 times. And inside of a pipeline you can see, okay, these are the jobs that have run, in this order, and so on. And by going into a job, you see all the steps, so you have this visual representation of what happens and by whom. You can even go to a commit and so on. So that's kind of the idea of how CircleCI looks and what it does.
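To make that concrete, here's a minimal sketch of a config.yml that ties together an executor, steps, and a workflow, as described above (the image, commands, and job name are illustrative):

```yaml
version: 2.1

jobs:
  test:
    docker:
      - image: cimg/node:16.10   # the executor: a Docker container in this case
    steps:
      - checkout                 # built-in step: check out the code at this commit
      - run: npm ci              # plain command line instructions
      - run: npm run test

workflows:
  build:
    jobs:
      - test                     # the workflow orders and arranges the jobs
```

Pushing a commit would trigger a pipeline that runs this workflow.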
So as I go ahead with this talk, you'll see examples and you'll have a clearer picture of what's happening and how. Anyway, back to the talk. Slow builds. When do you know you have a problem? Obviously I've talked about one example, which is a very, very acute example, a glaringly obvious one, which is when you're considering turning off your CI because it's not going anywhere,
then you definitely have a problem. On the other hand, you might notice that
your builds are just becoming slower and slower. But that's very
hard to see very quickly because builds don't
become slower from today to tomorrow. They become slower with
weeks and months of work and adding features,
adding commits, growing the code base. That's how builds become
slow, because you kind of add all that complexity, but you do it so gradually.
Or maybe you're looking at how often does it
break? How often do builds break? Or how often do builds stay broken?
Which is probably an even more important metric. You may be looking at how a month ago it took you five minutes to fix a build and now it takes you 20 minutes. Where did that complexity emerge from? You can trace it from there. But my favorite method doesn't take any scientific tools to detect.
It's basically asking your team, your team will tell you that
they're struggling with the CI. They're going to complain about it on every
single retrospective or every one on one you have with them. You hear it
from one person, maybe it's just one person, but if you hear it
from multiple people on the team repeatedly, then you know you have
something to work with. Okay, so what do we do when we know that we have a slow build? First we need to figure out
what's slow. And to do that we need to actually measure something.
And measuring comes in many shapes and forms: from very rudimentary measurements, like how long the entire build or pipeline run takes, down to, for example, an individual job: how long it stays red, how long it took you to take it from red to green, or how often it goes red. All of these things are pretty
easy to measure. You can essentially track it in
a spreadsheet if you have no other tools. But usually you actually have tools at your disposal that are much better suited for that. In a lot of cases, you'll be able to drill down to an individual job and say, okay, these are my functional tests, these are my integration tests, these are my unit tests, this is my deployment, and see on a per-job basis what's taking so long and why. And so you pinpoint the problem a lot more easily. And in some cases you can even drill down within a test job to the individual tests in the suite that's going to run. And you'll see, okay, these tests, this subset of all my tests, is taking the longest or is the flakiest and causing most of the builds to fail in the process.
So depending on the tools you have at your disposal, you can
measure different things. But yeah, the more you measure,
the clearer you can present it to yourself, the better.
Essentially, with CircleCI, we have this Insights feature, which is in the dashboard, and it gives you everything from a bird's-eye view to a drill-down view into your builds. You have historic views of how things have fared. Have your builds increased in duration over the last 90 or so days? Which builds have the highest or lowest success rate? We also recently rolled out this test drill-down feature in preview, which allows you to actually see how your individual tests or test components are faring, which is pretty, pretty cool. So now we've hopefully identified what's going on, and we know enough to actually start improving our builds so that they are faster and less agonizing for us. First thing to do is increase horsepower,
or the computing equivalent is add more resources to
your builds. That's a very easy thing to do. It's very straightforward
to do. But how do you know that your jobs are actually running on underpowered executor machines? You'll see out-of-memory errors. That's a pretty clear example: you have an out-of-memory error, add more memory, and you'll be good. Sometimes you
know that the tools that you're using to build and run your
tests can benefit from more threads or more highly
parallel CPUs, so you can experiment there. My personal favorite example: if something runs faster on your local developer laptop than it does on your CI, it's usually because your developer laptop is, I don't know, an i7 with 16 gigs of RAM, so ample horsepower to work with. And if that laptop runs your builds in three to five minutes, and then your CI runs them in 20 minutes, and it turns out that your CI has two or four gigs of RAM at its disposal and far fewer CPU cores, that's probably why your builds are slow. Take it closer to what your laptop is working with, and you've got an instant improvement.
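On CircleCI, bumping a job's resources is a sketch like this; the image and job name are illustrative, and `large` stands in for whichever resource class fits your workload:

```yaml
jobs:
  test:
    docker:
      - image: cimg/node:16.10
    resource_class: large   # more vCPUs and RAM than the default class
    steps:
      - checkout
      - run: npm run test
```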
But that's a very, very easy thing to do, because with CircleCI, for example, you're just going to specify this resource class, and job done. It's as easy as that. Next thing to do
is start thinking if you can go parallel.
By that I mean speeding up all your work
by running it in different streams, as opposed to kind of
running it in a single stream. First thing to do is make sure that your
jobs are running in parallel themselves. By default,
that's what happens in CircleCI, so you don't have to worry about that.
But depending on what tool you're using,
your jobs may not run parallel by default.
So you make sure, okay, we're running everything, unit tests, security scans, functional tests, all at the same time, and only when they're all green do we continue with the latter stages of the build, which might be creating a production build or deploying it somewhere. The other one, which is more useful, I would say,
is splitting tests within an individual test suite.
So if you can imagine, our functional test suite had
a lot of tests, like hundreds of functional tests, that took most of the build time to complete. So you can actually split those and make sure that they run in multiple parallel streams. First thing to do is specify this parallelism setting here, and then use the CircleCI CLI to essentially tell it how to split or arrange those tests.
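Put together, splitting a suite might look roughly like this; the glob pattern and test command are assumptions for a JavaScript project:

```yaml
jobs:
  functional-tests:
    docker:
      - image: cimg/node:16.10
    parallelism: 4   # fan this job out across 4 identical containers
    steps:
      - checkout
      - run:
          name: Run this container's share of the tests
          command: |
            # glob all test files, then let the CLI pick this container's slice,
            # balanced by timing data recorded on previous runs
            TESTS=$(circleci tests glob "test/**/*.test.js" | circleci tests split --split-by=timings)
            npm run test -- $TESTS
```

Splitting by timings assumes test results have been stored on earlier runs so the CLI has timing data to balance against.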
My favorite one is by time. So essentially you run it once, which gives it an idea of how long each test runs, and then it will figure out a way to split your test job into chunks that run in roughly the same time, which is pretty cool. Now, we've
used the right size machines and we're running
everything nicely parallel. What else can we do? We can
utilize the cache to essentially deduplicate
parts of our work, because that's what caching really is. I like to think of caching as two different types: what you need and what you build.
So what you need is all the dependencies that
you need before you can actually start doing a
build or running tests. The git cache is one example, but you usually don't have to worry about that, because it's cached by default in most places anyway. The second one is dependencies. If you've ever run or installed create-react-app, you can see how long it takes to install all the NPM dependencies, and that's just a React application, just JavaScript. Different languages and different frameworks will have different tools for managing dependencies, so all of that depends. But essentially, once you load them once, you can actually cache them, and they can be reused across subsequent builds. As a matter of
fact, caching usually comes for free when you're using some of our orbs, because in the node orb, for example, you have a command called install-packages which will utilize the cache by default. So it's just one command to run for you, and under the hood it's actually doing five or six steps to restore the cache and save it, so that only when your package-lock.json changes will it reload things. Otherwise it'll just keep using whatever it has in the cache for the duration of that cache anyway.
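If you're not using an orb, the same restore-and-save dance can be written by hand; a sketch keyed on the lockfile checksum (the cache key name and paths are illustrative, assuming an npm project):

```yaml
steps:
  - checkout
  - restore_cache:
      keys:
        - deps-v1-{{ checksum "package-lock.json" }}   # exact hit while the lockfile is unchanged
  - run: npm ci
  - save_cache:
      key: deps-v1-{{ checksum "package-lock.json" }}
      paths:
        - ~/.npm   # npm's download cache, reused on the next build
```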
Lastly, if you are running your builds inside Docker containers, and you're relying on a lot of installation for your purposes, so if you need to install tools that aren't built into the image that you get, like CircleCI's Python image or Node image, you can think about creating your own Docker image that has all those tools bundled already. That way you save that installation time, especially if you need to compile some of the dependencies each time; you can imagine that can take quite a lot of time, so this speeds that up. And the second part
of caching is what you actually build.
So I'm talking about build artifacts: from finished applications that you want to run functional tests against, to intermediary artifacts that are used by different parts of the build. If you're building Android applications, for example, the build generates a lot of intermediary dex files and modules that you can then easily reuse.
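One common way to reuse build output between jobs on CircleCI is a workspace; a sketch with illustrative job names, commands, and paths:

```yaml
jobs:
  build:
    docker:
      - image: cimg/node:16.10
    steps:
      - checkout
      - run: npm run build
      - persist_to_workspace:   # hand the compiled output to downstream jobs
          root: .
          paths:
            - dist
  functional-tests:
    docker:
      - image: cimg/node:16.10
    steps:
      - attach_workspace:       # reuse the build instead of rebuilding it
          at: .
      - run: npm run test:functional
```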
So it's pretty cool. Lastly, if you are building Docker images, so previously we mentioned using Docker images, but if you're building Docker images to maybe deploy to Kubernetes or somewhere, you can actually turn on Docker layer caching, which essentially caches all the layers up to the point where you have made any changes. So if you think about how a Dockerfile is written, you have RUN commands, COPY commands, all of those different commands, and they will be automatically cached up to the first point where you have made any changes, which is pretty cool as well.
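Turning that on is a single flag when the job sets up a remote Docker engine; a sketch (the image name and tag are illustrative):

```yaml
jobs:
  build-image:
    docker:
      - image: cimg/base:stable
    steps:
      - checkout
      - setup_remote_docker:
          docker_layer_caching: true   # reuse unchanged layers from earlier builds
      - run: docker build -t myapp:latest .
```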
Beyond caching and beyond parallelism, we have to get a bit cleverer. We are out of pure technical solutions for our problem. We really need to think
about how we can reduce the scope of our
work to make everything go faster. Because when fewer things run, things go faster. So in our case, that
was where we could get a lot of gains because
we had a lot of tests. We had a lot of
tests for every single feature. And if you can imagine,
some of those features would be very critical to
be tested. For example, in a shopping application,
if someone can't register or make purchases,
that's a pretty critical aspect of the app.
But on the other hand, if they can't change
their profile picture, change their preferences
for how the homepage should look, that might not be the
most critical feature. So you really need to think about which tests you really need, especially if you have that many, and which ones don't need to run at every single build; you can always run those later or asynchronously.
But for most of the builds, you really kind of want to make sure that
they run as fast as possible. Or like you run the most valuable
ones all the time and then less valuable ones
less frequently. You can also make sure that your workflows
reflect what your actual development workflow is. For example,
if I'm working on a feature branch, I don't want to run all
the tests at every single commit. Only when I'm merging to main do I actually want to run a more substantial chunk of tests. And maybe when we are planning a release, when we're pushing a release tag, then run the entire scope of the tests. So you really think about what needs to run, and how, and in which case, and that really speeds up your entire process.
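In a CircleCI workflow, that kind of staging can be expressed with branch and tag filters; a sketch with illustrative job names:

```yaml
workflows:
  build-test-release:
    jobs:
      - unit-tests                  # cheap and valuable: run on every commit
      - functional-tests:
          filters:
            branches:
              only: main            # heavier suite only when merging to main
      - full-regression:
          filters:
            branches:
              ignore: /.*/
            tags:
              only: /^v.*/          # the entire test scope only on release tags
```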
And lastly, get your team on board.
DevOps is a cultural movement first and foremost,
and culture starts with a
team. You can't be successful in implementing CI CD if you're the only person implementing or maintaining the CI CD pipelines; you have to get others on board.
That's for maintenance, that's for writing improvements to
your pipelines, that's for making sure that you fix
builds effectively. Really, it's crucial that
everyone knows where things happen. And it's also
very valuable because if you think about it, if your entire
CI CD pipeline is defined in a single file which
tells you the whole story of how tests are run, which tests are run,
and what happens when you need to deploy, then adding a new member to the team is basically
as easy as telling them to look at your CI CD
build script and they'll see which ones are the moving
parts and what they might need to look into
to start being more productive. And also, if the entire team is
on board, then they can all kind of think about
how they could contribute to making this a smooth and positive experience, as opposed to the tedious, frustrating piece of work that we started off with,
perhaps. So, yeah. Now I want to ask you another question,
and it's what is the right build time for your team?
I can talk about build times, I can talk about speeding up
your builds, but I don't know what your team does.
I don't know how many developers you have on your team, what their skill levels
are, what kind of application or software you're building.
So it really is up to your own team to decide what limits or constraints you want to operate within, and to think for yourself about where you want to go with
your CI CD pipelines. What you can do, however,
is look at this state of software delivery report,
which we released late last year. Essentially, we looked at an obscene number of projects and organizations that build on CircleCI and how they compare when it comes to build speeds and time to recovery. And yes, time to recovery tends to be more important than the actual build times. You can read about this in these benchmarks and see how you and your team fare compared to others. So yeah, I alluded to this earlier, but I don't think that
the need for speed is the right thing to be
looking at, especially not with CI CD, because it's not a race. It much more closely resembles an ambulance, because yes, it should go fast. An ambulance does go very, very fast if it needs to, but it also needs to go reliably, because it's carrying that signal that we talked about. It's carrying a signal that is only useful if it emerges at the end as intact as possible. You can have a CI build run in 5
seconds, but that's probably not going to run all the tests that you actually want
to run or need to run for success. And your signal,
your CI CD's cargo, if you will, is the
most precious cargo. And yes, if you see the lights on
the ambulance or on the CI, if it goes red,
you really should be all hands on deck
fixing this because yes, you don't want to be lingering on
and having your builds in an unshippable state.
In any case, that was the need for speed. My name is
Zan Markan. I've greatly enjoyed giving this talk. Please do
reach out to me either on Twitter or in email if
you have any further questions. Thank you very much.