Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone.
My name is Mike Morgan and I head up the technical account
management team here at Buildkite.
I've been at Buildkite for about four years now, during which time
I've held a number of roles on both the pre- and post-sale side, working
with pretty much all of our enterprise customers during that time.
And that is a list that consists of some of the most exciting and
innovative tech companies on the planet.
So before I get started here, please go ahead and check out buildkite.com
for all of the logos and testimonials and all of that kind of stuff.
So, during my tenure here, I've worked with many teams on all
sorts of different types of projects to help them do CI better.
And one theme that's come up again and again during this
time is that of flaky tests.
So, really, as teams and projects grow, and tech stacks increase in complexity,
and developer velocity increases, there's really that need to better
understand and address flaky tests.
And if this problem isn't understood, addressed, and really woven into
the culture of your organization, then it can have a profoundly
negative effect on your ability to ship and build quality software and
meet those KPIs that are probably being set by your leadership teams.
So in this session, I'm going to talk a bit about test flakiness and why it's
a bad thing, what it means for your organization, and most importantly, a few
strategies to really nip this in the bud.
So to kick this off, we're going to start with the basics, real simple stuff
like: what is a flaky test?
Given that you're listening to a webinar about flaky tests at a
DevOps conference, the chances are that you already know what flaky
tests are; you're probably overly familiar with them, to be quite honest.
But for those of you that are uninitiated, a flaky test could be
summarized as a test that sometimes passes and sometimes fails without
any changes to the underlying code.
So it just flaps without any rhyme or reason.
And really, there are a ton of reasons that a test could be flaky.
Now, I'm not going to go into each one of these because there are so many,
you probably know a lot of them anyway, and it'd be a waste of everyone's
time, so I just threw a bunch of them on this slide here.
This is by no means an exhaustive list, but it definitely covers some of
the ones that we see very often.
Of all of these different reasons, and again, this comes from working with
customers, so it's purely anecdotal, I think order dependency, environment
issues, and resource constraints are probably the ones we see most often.
But again, that's an anecdotal statement, so please take it with a pinch of
salt. During this talk, we're going to be calling back to some of these and
referring to them as we go through some mitigation strategies.
So, what is the impact of flaky tests?
Well, first of all, we can say that flaky tests undermine developer confidence.
Fundamentally, if the result of a test can be changed simply by
spamming the retry button in your CI, then you really have to ask
what value it actually offers.
As is customary these days, I asked ChatGPT to define what a test
is, and it told me that in software engineering, a test is a process
or procedure used to verify that a software application or system behaves
as expected under specified conditions.
With that in mind, if you're testing in a supposedly controlled
environment with a supposedly controlled data set, but you're still
getting unpredictable results on back-to-back test runs, then
it's probably safe to say that that definition is not really being met.
The result of this is that engineers working on the project don't really
know whether the thing they're testing is actually broken or whether
it's just a test that's playing up.
So the best possible outcome in this scenario is that the engineer in
question is just mashing that retry button until they get a green build,
which is obviously not ideal.
And if a test is being flaky, the worst case scenario is that you ship
something broken and dodgy to production, and ultimately this just erodes
confidence across the board and is universally a bad thing.
So another bad thing about flaky tests: they slow down your CI pipelines.
Eroding confidence is bad enough, but slowing down CI is arguably even worse.
When a test fails, you'll probably have some sort of retry
logic that's baked into your CI.
It's gonna introduce some latency into your build as it retries.
Maybe you have the necessary gear to rerun just that single test
spec, which might only take a couple of seconds in the best case scenario.
But if you're batching a number of tests together in a single job
running in CI, then you might find that it reruns a whole load of stuff,
and that's going to add expense in terms of both time and money.
So I work with a ton of customers, helping them do CI better and optimize
their software delivery pipelines. I don't have any data to corroborate
this, it's a purely anecdotal statement, but for the customers I work with,
the vast majority of their build time is spent executing tests.
So by having to rerun tests due to flaky behavior, they're turning
the test phase into even more of a time sink.
And these days everyone has KPIs around build performance, and improving
test performance across the board is a good way to meet those KPIs.
So parallelizing tests, of course, can be the ultimate cheat code for
slashing build times across the board, but ensuring that tests are well
written, behave as expected, and don't need to be retried continuously is
definitely going to be a close second.
Now, thirdly, there's a monetary cost associated with tests being flaky.
Flaky tests, as I mentioned, erode confidence, they slow things down,
and they're annoying, but the impact goes beyond just getting on everyone's
nerves; there's also a financial hit here too.
So retrying a test is going to burn some compute time, whether it's just a
single test or a batch of tests that need to be retried, and that's going
to be more money that you're giving to AWS or GCP or whoever is hosting
that compute.
Longer build times will also hurt productivity, which means engineer
salaries that aren't necessarily being put to their best use, and overall
the quality of the thing you're shipping is going to be impacted.
That's perhaps a bit harder to attribute directly to a flaky test as a
dollar value, but it's still worth mentioning, and over time and at scale,
both in terms of the number of builds and the number of engineers involved,
this can really add up.
So I asked ChatGPT, as I'm wont to do, to provide some
statistics about flaky tests.
And here's what it told me, with citations, no less.
Google reports that 16 percent of their tests are flaky.
60 percent of developers regularly encounter flaky tests. And order
dependency is a dominant cause of flakiness, responsible for 59 percent of
flaky tests, followed by test infrastructure problems at 28 percent.
So based on my experience working with, again, hundreds or potentially even
thousands of customers, my anecdotal take would be that 16 percent of tests
being flaky is probably about right, give or take, based on what we see
with customers.
As for 60 percent of devs encountering flaky tests, I'd say that stat seems
a little bit optimistic, to be quite honest.
My feeling is that it would be something close to a hundred percent,
maybe not quite a hundred percent, but certainly higher than 60 percent.
So, we've established that flaky tests are bad: they're annoying, they cost
money, and they erode trust.
What we're going to do now is dive into some strategies that you
can use to address flaky test problems in your environment.
So first of all, we're going to want to isolate and identify flaky tests.
So to be able to address flaky tests you need to be able to establish
whether a test is flaky to begin with.
So in this session, we're not just talking about any failing tests.
Broken tests do share qualities with flaky tests, obviously, but instead
we're going to be specifically looking at tests that exhibit that flaky
behavior.
So the first thing we're going to want to do is look for signs of flakiness.
The real obvious question here, first things first, is whether a test is
exhibiting flaky behavior versus just being a broken test.
So we're going to be looking for certain distinct patterns of behavior
to determine whether a test is flaky.
First one would be intermittent test failures without any code changes.
So this could be a case of a different test result across multiple retries in a
single CI run, or unpredictable results across multiple commits and CI runs for a
portion of code that hasn't been touched.
So when nothing obvious has changed, but your tests have given
unpredictable results, there's probably something flaky happening there.
We could probably at least take a second glance and assume
that test could be flaky.
Second, tests that are passing locally but failing in your CI environment.
If it works locally on your laptop, then it should work in CI, right?
Well, in many cases that's true, but there are also good reasons why not,
and it could be an environmental factor caused by something local or,
again, something in your CI environment.
The environment that you run your tests in is very important,
and we'll be taking a closer look at that in a second, but this
would definitely be a behavior that could indicate some flakiness.
And thirdly, when the outcome of the test seems to depend on a
non-deterministic factor: it could be the time of day it's run, or again
the environment it's run in, or any other variable that just doesn't
imply stability.
To be honest, this one is less of a distinct pattern of failure and more
something that at face value can seem extremely random.
In these cases there can often appear to be no rhyme or reason;
a test just flaps unpredictably when it runs.
And if you see that, then there's a pretty good chance the test is flaky.
So really we're just looking to understand how a test performs over many
runs. And if we're seeing a patchwork of red and green results across many
builds, that would certainly imply flakiness.
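To make that concrete, here's a minimal Python sketch of that kind of check. It assumes you can export results from your CI or test-reporting tool as (test name, commit SHA, pass/fail) records; the record shape and test names are illustrative, not tied to any particular tool.

```python
# Minimal sketch: flag tests that both pass and fail on the same commit,
# i.e. the result changed with no code change. The input shape is
# hypothetical -- adapt it to whatever your CI or reporting tool exports.
from collections import defaultdict

def find_suspected_flaky(results):
    """results: iterable of (test_name, commit_sha, passed: bool)."""
    outcomes = defaultdict(set)
    for test, sha, passed in results:
        outcomes[(test, sha)].add(passed)
    # A test with both True and False outcomes on the same commit is a
    # strong flakiness signal; a test that only ever fails is just broken.
    return sorted({test for (test, _sha), seen in outcomes.items() if len(seen) > 1})

if __name__ == "__main__":
    sample = [
        ("test_checkout_total", "abc123", True),
        ("test_checkout_total", "abc123", False),  # flapped on the same SHA
        ("test_login", "abc123", False),
        ("test_login", "def456", False),           # consistently broken, not flaky
    ]
    print(find_suspected_flaky(sample))  # -> ['test_checkout_total']
```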
So depending on the size and complexity of your code base, identifying
these things manually might be possible.
But then again, it is 2025 now; you probably shouldn't have to do these
things by hand. It's way better to just pay a bunch of money for some
software to have it done for you.
So, second thing: how can we identify flaky tests?
Maybe try rerunning them a few times. It's a real brute-force technique,
not particularly sophisticated, just rerun the test and see what happens.
Flaky tests, of course, by their nature tend to give you different results
across multiple runs, and if you want to see whether a test is constantly
failing or flapping, then why not just rerun it a bunch of times?
I don't necessarily mean just as part of your standard CI workflow, but
specifically for the purpose of trying to surface this kind of flakiness.
So if you run your test job a bunch of times, ten times, whatever, does it
exhibit flaky-looking behavior? If so, then yes, it probably is flaky.
And finally here, utilize test observability tools and flaky detection tools.
I actually have a full section on this coming up, and it's very important
that I don't just burn through all my content in the first few minutes here,
because I don't really want to repeat myself later on in the session.
So I'm going to keep it real brief here and go into more depth as we get to
that point, but it did seem worthwhile to mention briefly at the top of the
session.
So if you're working on a large project in a large team, it's totally
reasonable to expect that your view could be somewhat blinkered to the thing
you're working on currently and your own CI runs, rather than looking at
your colleagues' runs and all of the different pipeline runs that have been
occurring, looking at the test results, and seeing whether the thing that's
flaky for you appears to be flaky for everyone else. Really understanding
the scope of the problem can in itself be a mammoth ask.
The good thing is there are now a number of tools that can help you out
with this, which will collect test results across any number of builds.
They'll surface failing and flaky behavior, assist with debugging
in general, and really give you a ton of visibility.
So unless your project is really small and manageable,
I would really recommend exploring this avenue.
So the takeaway from this first strategy would be that reliable
testing begins with understanding which tests are flaky to begin with.
A test failing doesn't necessarily make it flaky; flakiness is another
quality entirely.
Use the means at your disposal to look at your tests and their behavior,
and determine which ones are just totally broken versus which ones are
flapping due to some external factor.
So the second strategy, optimize test design and implementation.
Perhaps with some exceptions, most test flakiness can be attributed to
test design and implementation.
We spoke about that statistic a moment ago where 59 percent of flaky
tests are the result of order dependence, and this is something that can
absolutely be addressed through improved test design.
I have a couple of things to consider here, which might seem pretty
obvious to some, but it's worth covering them to build that solid
foundation.
So the first thing we have here is test atomicity, or making your tests
atomic. Primarily this is going to be relevant to unit tests, but there
are also some atomic principles that we can apply to your integration
tests and end-to-end tests as well.
So what is an atomic test, for those who do not know?
An atomic test is a test that's designed to be independent, meaning it
doesn't depend on the execution of any other tests and doesn't rely on
state set by other tests.
Again, think of that 59 percent order dependency stat, which honestly
sounds like a lot, but I don't have any better data, so it will have to do.
Anyway, by designing tests to run independently, hypothetically that number
could be much, much lower, possibly even zero percent, I don't know.
But tests that don't rely on other tests are inherently going to be more
reliable.
An atomic test should be tightly scoped: where appropriate, test a single
unit of functionality, such as a method or a class, rather than a bunch of
things. If you've ever heard the expression "less is more", live by that
when you write your tests.
An atomic test should be deterministic.
Ideally, you want your test to provide the same result every single time,
a nice big green tick next to that build, given the same inputs every time
it's run.
So whenever possible, avoid randomness in the inputs.
And on the subject of inputs, stabilize them: provide static, predefined
data sets that you know are good for your tests to use.
Also, clean up state between runs.
We're going to talk about the environment you run your testing in in a
moment, but in the context of determinism, we want to ensure that your
tests have a clean, predictable, and reproducible test bed to run in.
And an atomic test should be performant.
Tests should be small, ideally testing a single thing using a predictable
data set, so they should theoretically be quick to execute.
These high-performance tests are perfectly suited to be run during your
CI process, and they should be less susceptible to timeouts in most cases.
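To illustrate those tenets, here's a minimal pytest-style sketch; the apply_discount function and the values are hypothetical, but it shows tight scoping, static predefined inputs, pinned randomness, and per-test state cleanup.

```python
# A minimal pytest sketch of the tenets above: tightly scoped tests, a fixed
# predefined data set instead of random inputs, seeded randomness where you
# can't avoid it, and per-test temporary state that is cleaned up for you.
# The function under test (apply_discount) is hypothetical.
import random
import pytest

def apply_discount(price_cents, percent):
    return price_cents * (100 - percent) // 100

@pytest.mark.parametrize("price,percent,expected", [
    (1000, 10, 900),   # static, known-good inputs keep the result predictable
    (250, 0, 250),
])
def test_apply_discount(price, percent, expected):
    assert apply_discount(price, percent) == expected

def test_seeded_randomness_is_reproducible():
    rng = random.Random(42)        # pin the seed so every run sees the same "random" data
    price = rng.randint(100, 10_000)
    assert apply_discount(price, 50) == price // 2

def test_state_does_not_leak_between_runs(tmp_path):
    # tmp_path is a fresh directory per test, so no leftover files from earlier runs
    report = tmp_path / "report.txt"
    report.write_text("total=900")
    assert report.read_text() == "total=900"
```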
So I mentioned at the top of the section that atomicity primarily refers to
unit tests. Your end-to-end and integration tests won't be atomic in the
same way as your unit tests, but they can employ atomic-like principles.
Your integration tests can be atomic to an extent: while not as small as
unit tests, integration tests should focus on single integration points or
interactions, and you want to avoid coupling multiple unrelated components
in a single integration test if you can. Each integration test should have
a clear, single purpose, like does service A correctly parse the response
from service B, or something like that.
And of course, you should mock dependencies whenever appropriate to avoid
issues.
Your end-to-end tests usually can't be strictly atomic, because they're
less likely to be small or isolated.
However, they can also be designed to focus on a single, well-defined
user journey or scenario, minimizing overlap with other end-to-end tests.
And in the spirit of those four tenets of atomicity I listed a moment ago,
these tests should use a clean state between runs, use well-defined test
data to ensure consistency, and use mocks and stubs for third-party APIs.
So, next for this strategy: adopt mocking and stubbing.
Rather than relying on downstream services or other dependencies to provide
a correct response in a test, think about utilizing mocks and stubs to
provide quick and predictable responses to your tests.
There are a ton of different libraries and modules out there that integrate
with whatever language you're using and can really take on some of the
heavy lifting here, so it's by no means something you necessarily have to
build out yourself; this is well-trodden ground.
Again, in the interest of making tests truly atomic, we should be looking
to narrow the scope and make things as deterministic as possible, so
mocking data is totally reasonable in these circumstances.
Of course, if you have a database or API or other dependency that's acting
up and causing things to break, then we certainly do want tests to surface
that, but that would probably be in the realm of your end-to-end tests;
here I'm referring more to unit and potentially integration tests.
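As a small illustration, here's a minimal sketch using Python's standard unittest.mock to stub out a downstream call; the rate-client API and convert function are hypothetical.

```python
# Minimal sketch of stubbing a downstream dependency instead of calling it
# for real, using unittest.mock from the standard library. The rate client
# and the convert function are hypothetical examples.
from unittest import mock

def convert(amount, rate_client):
    # In production, rate_client.get_exchange_rate would hit a remote API.
    return amount * rate_client.get_exchange_rate("USD", "EUR")

def test_convert_uses_stubbed_rate():
    rate_client = mock.Mock()
    rate_client.get_exchange_rate.return_value = 0.9  # fixed, predictable response
    assert convert(100, rate_client) == 90.0
    rate_client.get_exchange_rate.assert_called_once_with("USD", "EUR")
```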
Now, finally, you'll want to structure your tests hierarchically.
You've probably seen this image before, which I grabbed off Google Image
Search: it's the test pyramid.
What we have at the bottom is our unit tests, with the integration tests
in the middle and the end-to-end tests at the top.
At the base, the unit tests: you're testing your individual components,
your functions, your methods, everything in isolation.
These tests are fine-grained and focused, just like we discussed a moment
ago. They're nice and fast, they give as much code coverage as possible,
and they make it really easy to identify a root cause in the event of a
failure. Unit tests should be quick and computationally inexpensive to run,
so ideally they cover as much surface area as possible, and when they fail,
it points to a very specific method or class, which will really help you
to do some debugging.
Your middle section is your integration tests, focused on how different
pieces of your application work together: one thing talking to a database,
for example, or some microservice or API or whatever.
These integration tests are much more thorough, and more expensive as a
result, but you need fewer of them than your unit tests.
Although less tightly scoped than unit tests, a failure in an integration
test should still point to an interaction between specific components,
which again will help with debugging.
And then at the top, finally, you have your end-to-end tests.
These replicate entire workflows in real-world environments to ensure that
the application as a whole is working as it should.
End-to-end tests are probably going to be the most expensive and fragile,
likely using something like Selenium, which loves to get stuck waiting for
a page element to appear or become interactive or something like that.
But they give you that representation of a user experience, so you want to
use these sparingly, for critical-path stuff.
When one fails, at the very least you want to be able to see where and how
it failed, and from a debugging perspective it's going to involve a little
more work than, say, looking at a unit test that's failed, depending on the
tool you're using, of course.
But maybe you have debugging tooling to help with this, like APM traces or
something like that, which can help provide some context here.
So why is the test pyramid important?
Well, a perfectly executed test pyramid will mean that everything is being
tested, everything's tightly scoped with as little overlap as possible, and
when flaky test behavior occurs, there's as little noise as possible in
your reporting.
It becomes much easier to see where the problem is occurring, which is
really important, vital even, for efficient and effective debugging.
So for strategy two, the takeaway is: design your tests to be tightly
scoped, performant, predictable, and repeatable, and remove reliance on
other services and dependencies by mocking and stubbing where possible.
This will really help improve simplicity and reduce noise.
Okay, the third strategy we're going to look at is improving the stability
of your testing environment.
We briefly touched on the importance of keeping a clean test environment
during some of the previous strategies, but this is really an important
consideration when it comes to keeping flaky tests at bay, so I figured it
warrants its own section.
Now, a bad testing environment could be caused by any number of factors.
Maybe you're using long-running hosts to build on, and someone didn't clean
up after a previous build and left something behind that's resulting in
unexpected behavior. Or maybe you have multiple jobs running in series from
a single build, one of the previous jobs didn't clean up the workspace, and
the workspace becomes polluted and causes strange behavior.
Maybe you have dependencies that get messy.
Maybe your build infrastructure is being stood up manually or inconsistently
rather than using automation.
Perhaps external dependencies aren't working reliably, or perhaps the test
infrastructure itself is just not suitable: it's under-resourced, without
enough memory or CPU or network bandwidth or whatever.
These are just a few examples, and again there are pretty much infinite
reasons why the test environment could be in a bad way or just not suitable.
But we've got some best practices here to hopefully avoid some of the
common pitfalls that you might experience.
So you want to strive to achieve a consistent test setup.
This is a very obvious one: you want your test environment to be
predictable and as consistent as possible, and there are many different
ways you can do this depending on what your CI tooling is, what it allows
you to do, how you design your infrastructure, and so on.
So, how do I keep my build environment clean?
Many CI platforms provide the ability to run CI jobs in fully hosted,
managed ephemeral environments that only live for the duration of the job:
they're spun up, you run the job in that runner or container or environment
that the CI provider provides, and then once the job is completed, that
compute is torn down and deleted and not reused.
These kinds of environments really eliminate the risk of the workspace
being contaminated by previously run jobs or builds.
It's a really nice clean state for every single job.
Although the environment will be free from contamination, it's good to bear
in mind that the piece of infrastructure itself will only be as good as its
original configuration.
It might be immune from previous runs messing it up, but if dependencies
aren't being handled properly, then you're still going to have a bad time
and run into issues.
These kinds of managed ephemeral runners are really useful in any number of
scenarios, in my opinion particularly when you're looking at Mac or iOS
jobs. I'm definitely an Apple guy, I have been for many years, and I've had
countless Macs during that time, but my experience has really led me to the
conclusion that whilst they work great as desktop machines, they can be a
real handful when you manage them as headless CI runners.
There's a litany of hurdles that you need to overcome to get them to work
well in this scenario, and maintaining a clean, stable workspace is, in my
opinion, at least one of the more challenging ones.
When you're self-managing Macs for CI, it's really common to see machines
that are really long-running, rather than that short-lived, one-and-done
configuration we just mentioned.
In these scenarios, there would typically be some attempt to clean the user
folder between jobs so that there's a relatively clean workspace and a
predictable environment to run tests in, but this is often quite difficult
to achieve successfully and effectively.
So being able to delegate the responsibility for managing Mac CI
infrastructure entirely, instead of having that long-running Mac, and to
have something that's completely ephemeral that runs, presumably in a VM
somewhere, before being recycled, is pretty compelling.
And I'm certain that anyone listening here who has ever been responsible,
whether in the past or currently, for maintaining a closet filled with Mac
minis, or maybe a bunch of cloud-based Mac instances, would definitely agree
that this is kind of a brutal job, and that not having to do that or worry
about that issue on a Mac would just be fantastic.
So, I am obliged at this point to mention that at Buildkite we do have this
capability for Linux and Mac.
If you're looking for fully managed ephemeral runners for your CI jobs,
then we can certainly help you out with that.
Next up, at Buildkite we provide a hybrid CI solution as well, where
customers own and manage the build infrastructure: hybrid in the sense that
we run the control plane as a SaaS service, and then you manage the compute
yourself, wherever you want to deploy it.
We do offer a few out-of-the-box agent stacks to achieve this, using
Kubernetes or EC2 instances, all auto-scaled, handling orchestration and
job lifecycles for you.
And in both cases, much like those managed runners I mentioned a moment
ago, CI jobs can be configured to run on fresh infra each time before it's
terminated and recycled, so that each CI job gets its own clean environment.
I can't speak for other CI providers on this capability; I believe some
others do offer it, and it should be possible, so just take a look at the
docs and see what's available.
Again, I can only vouch for Buildkite in this case.
And whether your infrastructure uses a short-lived, one-and-done
configuration or you have long-running instances dedicated to running CI
jobs, definitely consider running things in Docker.
I think it's a pretty ubiquitous capability for any modern CI solution, and
at Buildkite we definitely have plugins that you can integrate into your CI
workflows to make this really nice and easy.
If something is running in a Docker container, again, it's a really nice,
predictable environment to run your tests in.
And I was actually just talking with a customer of ours about this very
recently. They're a very well-known grocery delivery service, and they were
describing how they do just this.
In their case, jobs are executed using Buildkite with our Docker plugin, so
that every single job gets a fresh container to run in.
They have a single base image that all of their build and test jobs use,
the same base image as they use in production, in fact.
And then they have a bunch of scripts that run when the container is
orchestrated, depending on the CI pipeline that's running or the
application being built, and those pull in additional dependencies at
runtime to give them what they need.
This really just gives them that predictable, consistent environment to
run in, per job.
Next up, you want to try and determine which failures are related to
infrastructure.
Determining whether a job failure is being caused by the environment or the
infrastructure might not always be easy; these things can manifest
themselves in a number of different ways, and it's often not immediately
obvious.
However, you want to keep your eye out for a few things, such as timeouts
and IO errors, tests failing while supposedly touching other services, or
tests taking significantly longer between runs.
If there's a huge discrepancy in latency between test runs, that would
certainly indicate something is happening, perhaps at the infrastructure
or environment level.
And it's pretty much a given that you'll have some sort of infrastructure
monitoring set up in your environment, whether it's Datadog or Elastic or
something open source that you're running to monitor system stats and
metrics.
What you want to be doing is using this in your build environment too.
You want to be looking for spikes in various metrics that correlate to
undesirable test behavior, and this can certainly point to things like
resource constraints.
Now, this isn't by any means a golden ticket, an infallible or conclusive
way of picking out problems, but it's certainly worth monitoring, and when
you correlate behaviors, it can be quite powerful.
And definitely, if your observability tool can do things like use custom
tags on metrics, so you can tag containers or instances by job ID and then
look at the relationship between that build job and the metrics associated
with it, whether it's a container or an instance or whatever, and see the
correlation between failures and various spikes, then that is going to help
you immensely in debugging things like resource constraints.
So, going back to that grocery delivery service I mentioned earlier: when
they observe a test failure due to a test bombing out, their default
behavior is to scan the build logs from that job and look for signs that
the failure is infrastructure related.
They have a list of various error message strings that would point to
"this is an infrastructure problem", and if one matches, the failure goes
to the developer experience team.
If it's something that looks app-shaped, then it just goes back to the app
team and they can fix the problem.
It's a nice way of separating these things out and assigning responsibility.
So hopefully your CI solution provides the ability to store your logs and
access them programmatically, not necessarily just through the UI, but via
the API, or by storing them locally, or by copying them to an S3 bucket and
running a Lambda function across them, anything that lets you look for
certain behaviors. That's definitely going to be a useful thing for
identifying whether something is pointing to a system or environment issue.
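As an illustration of that log-triage idea, here's a minimal Python sketch that scans a job's log for error strings suggesting an infrastructure problem; the patterns, file path, and routing labels are illustrative only, not taken from any particular team's setup.

```python
# Minimal sketch of the log-triage approach described above: scan a job's
# log for error strings that suggest an infrastructure problem and route it
# accordingly. The patterns and labels are illustrative, not exhaustive.
import re

INFRA_PATTERNS = [
    r"connection (reset|refused|timed out)",
    r"no space left on device",
    r"out of memory|OOMKilled",
    r"agent lost|instance terminated",
]

def classify_failure(log_text):
    for pattern in INFRA_PATTERNS:
        if re.search(pattern, log_text, flags=re.IGNORECASE):
            return "infrastructure"   # route to the developer experience team
    return "application"              # route back to the owning app team

if __name__ == "__main__":
    # However you fetch logs: CI API, local file, S3 object, etc.
    with open("build.log") as f:
        print(classify_failure(f.read()))
```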
So finally, replace external dependencies with mocks where you can.
We've already touched on mocks, so I'm not going to burn any more time on
this, but on the subject of system and environment stability and
consistency, we want to remove dependencies on downstream services wherever
possible.
If we have a test that's interacting with a downstream service, use a mock
instead; it will not only make the test more performant, but also reduce
the risk of some unknown variable causing your test to flap.
So, the TLDR from this strategy: a consistent test environment is crucial
for reliability, and there are many ways to achieve this depending on your
requirements.
You really just need to pick and choose from what your CI tool provides
relative to those needs and lean into that where you can.
So, number four of five, strategy number four: you want to level up your
test management and monitoring tools.
I'm going to preface this section by saying that Buildkite has test
monitoring built in, with our own test monitoring and management tool
called Test Engine.
This talk is really not intended to be a sales pitch for the product,
although I will mention it in passing as it relates to some of the subjects
in this section.
We would, of course, love for you to try Test Engine out and for your
organization to maybe give us some money, but it's by no means the only
tool out there that provides test visibility.
Alternatively, you can just take a look and pick one that fits your use
case and budget and has all of the features that you need.
So I'll be speaking at a high level here rather than just talking about our
own thing specifically.
That said, I've worked at Buildkite for quite some time, and prior to the
launch of Test Engine, the main pain point that customers encountered was
around testing.
Especially when your engineering team is large, the code base is also
large, and there are tons of tests being executed, it really becomes
challenging to see the forest for the trees.
As an engineer, you can see that a test fails for you sometimes, and then
you're mashing retry until it goes away and everything's green again.
But does everyone have that same problem with that same test? Put yourself
in the shoes of the DevEx owner, the DevEx leader, or really anyone in the
DevEx team: there's going to be a subset of tests that are slow and flaky
for everyone, and they're causing a significant cumulative impact on build
times, wait times, engineer productivity, and compute spend.
To be able to really address this, having appropriate tooling to surface
these kinds of issues, monitor them, assist in debugging, and where
possible help address the problem is going to be a massive win for you.
So there are a couple of things you want to be looking for in your test
monitoring tool.
First, it wants to be able to integrate with your CI workflows.
You want to make sure it's got broad support for all of your required
technologies; most organizations, especially once they hit a certain size,
are going to have a pretty varied tech stack, and being able to view all of
this together in a single tool is going to be invaluable versus having
separate tools for this language and that system. Everything together in a
single place is going to be very valuable to you.
Additionally, it wants to be able to integrate with your CI pipeline.
Test observability can be a real rabbit hole and probably should be a
separate function from your CI workflow UX: you want your build logs and
the outcomes of the build jobs in one place, and perhaps some additional
tooling to drill into your test results adjacent to that, but the two
should be tightly integrated, because that will massively help usability.
Imagine running a test in your CI workflow, seeing a failure in the build
logs, and then being able to click through from that build log into the
test failure, which takes you into a dashboard with debugging data and run
history and things like that.
So it's that slick workflow, going from clicking run build, to looking at
the build results, to looking at the failed test, to deep diving into the
test history and that debugging data, that's going to be invaluable for you.
So observability, very important.
The problem that most teams are trying to solve is that whilst your CI
pipeline probably gives you tons of logs and metrics and details about the
one build you're running currently, it probably doesn't help you aggregate
test behavior from many builds and surface common issues.
And whilst build logs can be super useful for debugging, they're also very
verbose and a rough data source to use for things like trend analysis, or
really any kind of manual analysis at all.
So first things first, from your test observability tool you'll want
dashboards, you want charts, you want metrics all over the place.
How reliable is my test suite, and each of the tests individually?
How long does my test suite take to execute?
Are there any particular trends that I should worry about?
What are my worst-performing tests?
This is a screenshot I just grabbed from our own testing tool, with
pertinent information blanked out, showing: here's a test, this one's
deemed flaky, what's the history of this, and how long is this test taking
to execute over time?
This is perhaps a bad example of a test, but we can see some of its
behavior, whether there's been degradation over time, or whether there's a
very particular point in time where things started going awry.
So you want to use flaky test detection; you should be able to surface
flaky tests.
At Buildkite, again, we have these huge customers who run billions and
billions of tests every year, and probably the number one things they want
to know from our test observability tool are: which tests are flaky?
How often are they flaky?
And when did they become flaky?
There are undoubtedly an infinite number of ways that an engineering org
could fine-tune their tests to make their builds a little bit more
performant, but the most significant improvement that can be made is likely
identifying and squashing the bugs, and this kind of specialized tooling
will really help with that and ensure that your team and your organization
have ongoing visibility into the issue when it occurs.
Specialized, test-focused tooling should at the very least be able to group
the many historic executions of a given test, so that when you identify a
particularly problematic test, you're able to flip through that history,
see all those historic runs, examine latency, and debug failures over many
different invocations of that test.
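As a rough sketch of that kind of per-test aggregation, here's a minimal Python example that rolls a test's history up into a pass rate and average duration; the record shape is hypothetical, and a real test observability tool would give you this out of the box.

```python
# Minimal sketch of aggregating a test's history into simple health metrics
# (pass rate, mean duration), the kind of per-test view a test-observability
# tool provides. The Run record shape is hypothetical.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Run:
    test: str
    passed: bool
    duration_s: float

def summarize(runs):
    by_test = {}
    for r in runs:
        by_test.setdefault(r.test, []).append(r)
    report = []
    for test, history in by_test.items():
        pass_rate = sum(r.passed for r in history) / len(history)
        report.append((test, pass_rate, mean(r.duration_s for r in history)))
    # Worst pass rate first: these are the tests to investigate or quarantine.
    return sorted(report, key=lambda row: row[1])

if __name__ == "__main__":
    runs = [Run("test_checkout", True, 1.2), Run("test_checkout", False, 4.8),
            Run("test_login", True, 0.3), Run("test_login", True, 0.3)]
    for test, rate, avg in summarize(runs):
        print(f"{test}: pass rate {rate:.0%}, avg {avg:.1f}s")
```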
You'll also want to be able to assign responsibility for a bad test to
someone.
Being able to identify a test as flaky and bad is great, but then what?
Do you fix it yourself? Maybe. Is it your responsibility, though?
Does it fall within your remit, or do you need to go and chase someone else
up to do it?
A lot of these test observability tools now have the ability to assign a
test to someone.
So you see a flaky test and now you can make it somebody else's problem,
if it's their responsibility, of course.
You would then be able to take that flaky test and assign it to a person or
a team.
Maybe they get a report or a notification, maybe when they log in it will
give them some reminders that this is something they need to fix.
But at the very least it gives team-, product-, and org-wide visibility
into who needs to address this flaky test problem.
And finally, you want to be able to take those flaky tests and quarantine
them to prevent pipeline blockages.
Some test management tooling, including our own Test Engine, has this
ability: you see a flaky test, and you can quarantine it and prevent it
from blocking the build.
This is really the next step beyond just having the metrics and the
knowledge that a test is flaky. Now you can actually do something with it
and leverage this tooling to make decisions about which tests can run.
We're going to touch on test quarantining again in a moment, but being able
to proactively alter test execution based on observed performance or
flakiness is kind of a superpower in these kinds of tools.
The TLDR here would be that observability is key.
Not having the correct tool for the job is going to make your life
significantly harder, and you should really lean into tooling that not only
monitors tests, but also provides automation and proactive resolution of
flakies when they occur.
Okay, so, the final strategy, number five here: you want to foster a
reliability culture.
Here at Buildkite, we spend a lot of time obsessing about test reliability
and build performance, of course.
So naturally, all of our tests work perfectly, they never exhibit any flaky
behavior, and all of our builds pass green every single time.
Of course, that is a ridiculous, egregious lie.
Just like any other software engineering team, we have flaky tests and
build failures all over the place.
We do dogfood our own products to ensure we have visibility, and we try to
foster a culture around good test practice, but in reality that's easier
said than done.
We're very lucky: we have, honestly, some of the most talented,
sophisticated, and expensive engineering teams in the world as customers of
Buildkite, and if they struggle with flaky tests, what chance do we have?
But the truth is, you can have the best tooling in the world at your
disposal, the most sophisticated dev and build environments, and all the
money in the world to hire world-class developers and all of that kind of
stuff. But if you're not building that culture around test reliability,
then eventually you too will be configuring test jobs with 15 auto-retries
to get your builds to finish green, and everyone will get mad at you.
And that's a bad thing.
So we've tried to instill that culture at Buildkite.
As we've built and evolved our Test Engine product, the team has tried to
live and breathe test reliability and, again, to work it into all of our
engineering practices.
So first of all, you want to make unblocking everyone else a top priority.
First and foremost, it's everyone's responsibility to keep tests passing
reliably, and if you encounter a test that's exhibiting flaky behavior,
then just do something about it.
If it's not your test or you can't fix it, then talk to someone; you would
assign that test to the relevant person or team and then they can look into
it. Hopefully you have a tool that allows you to do this, something like
Test Engine or some other test observability tool, or maybe you assign it
in your dedicated tool, or write a Jira ticket or whatever.
Just take action when you see a flaky test. You know that old saying you
see in the subways, "if you see something, say something"? Well, that's
true for testing as well; live by that.
And if you log in and see that tests are broken and assigned to you, then
make it your business to prioritize fixing them.
Help your colleagues out. They will really appreciate you for it.
So quarantining tests is fine and good, great in fact, but then it's
important that you do something after that.
If a test is flaky or broken, then you might have the ability to quarantine
it, and that's good because it removes it from the test suite run, the
build passes, everyone's happy, everything's green.
But then what? The thing that the test was testing is now untested, which
is not great.
So test quarantining should be a temporary measure.
Once you've established that a test is problematic and should be
quarantined, it's probably time to fix it, or alternatively make a decision
about whether it should be kept around at all.
When necessary, swarm.
There have been several, perhaps even more than several, occasions when
Buildkite has seen big swathes of red through all of our test results,
everything failing to the point where it can't just be ignored.
On these occasions, the team will down tools and swarm on the flaky tests
to try and get the problems addressed.
Typically this works really well: enough progress is made in a relatively
short period of time, things turn green again, things get moving, and we
can all move on from it.
But obviously this is not a great strategy to be used day to day.
It's disruptive; it means that everyone has to down tools and pivot to
looking at tests, and it takes everyone away from the important work of
shipping features and providing shareholder value.
So swarming on tests is a really good break-glass tool that you can use to
fix problems when things have really gone awry, but it's probably not a
day-to-day solution for addressing tests.
And finally, you want to avoid silos of information.
Fixing tests is not always straightforward. Depending on the nature of the
failure, it could be buried deep down in the stack, in some library that's
being used, and it might require some specialist knowledge to debug.
Engineers, like anyone else on the planet, learn from experience: if
someone gains experience debugging an obscure issue, or maybe not even an
obscure issue, just any issue at all, they can apply that knowledge to
future bugs they're trying to fix.
So we really want to give those engineers that experience.
Ensure the team gets exposed to debugging tests throughout the stack, make
it a priority to document this stuff properly, with write-ups in internal
CMSs and knowledge bases, and think about how you can mob on tests to share
the information and experience.
As with anything in software engineering, it's very important to avoid
knowledge silos wherever possible, because they're going to make it
significantly harder and more prolonged to debug these complex and obscure
test failures.
So the final TLDR here is that the people and processes are as important as
the tools; making fixing tests a priority for your organization is going to
be key to removing flakies from the equation.
So those are my five strategies for getting a handle on flaky tests.
Just to list them again: figure out which tests are flaky to begin with,
optimize your test design, work on environment stability, use the best
tooling for the job, and foster a reliability culture.
Now, maybe you are doing some of these things already.
Maybe you're doing all of these things and you don't have any flaky tests
at all, in which case I don't really know why you're watching this
presentation, but thank you for tuning in anyway. Whatever the case, I
really hope this has been helpful, useful, insightful, maybe even
thought-provoking.
And if your team is actually struggling with flaky tests, then I can
confidently, absolutely assure you that you're not on your own and you're
in very good company.
Again, I work with many world-class engineering organizations as part of my
role at Buildkite, and testing is a struggle for all of them.
And really, that's why we've built this kind of tooling, to help folks
understand, monitor, debug, and ultimately mitigate test issues.
So, just to close this out, I want to say a big thank you to the Conf42
team for putting this whole event together.
And I'd really like to encourage you to pop along to buildkite.com and
check us out.
We have a world-class CI/CD platform focused on scalability, security, and
flexibility.
We have a test observability product, Test Engine, which will help you
solve many of the issues I described today.
And we also have a hyper-scalable package management and distribution
platform as well, so definitely check that out.
And if you'd like to join the likes of Uber, Slack, Shopify, Block,
Pinterest, and many, many others, just to name a few, then head over to
buildkite.com, sign up for a free trial, and come and talk to us about this
stuff. We're looking forward to chatting with you about your test problems.
And with that, I'd like to say thank you very much, everyone.
Thanks for tuning in.