Conf42 DevOps 2025 - Online

- premiere 5PM GMT

Tackling Flaky Tests: Strategies for Reliability in Continuous Testing


Abstract

Flaky tests disrupt CI/CD, causing unreliable results and slowing pipelines. Join this session to learn how to identify, isolate, and eliminate them. Discover techniques to diagnose root causes, boost test reliability, and improve software delivery quality.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. My name is Mike Morgan and I head up the technical account management team here at Buildkite. I've been at Buildkite for about four years now, during which time I've held a number of roles on both the pre- and post-sales side, working with pretty much all of our enterprise customers, a list that includes some of the most exciting and innovative tech companies on the planet. Before I get started, please go ahead and check out buildkite.com for all of the logos and testimonials and that kind of thing.

During my tenure here I've worked with many teams on all sorts of projects to help them do CI better, and one theme that comes up again and again is flaky tests. As teams and projects grow, tech stacks increase in complexity, and developer velocity increases, there's a real need to better understand and address flaky tests. If this problem isn't understood, addressed, and woven into the culture of your organization, it can have a profoundly negative effect on your ability to build and ship quality software and to meet the KPIs that are probably being set by your leadership. So in this session I'm going to talk a bit about test flakiness and why it's a bad thing, what it means for your organization, and most importantly, a few strategies to nip it in the bud.

To kick this off, we'll start with the basics: what is a flaky test? Given that you're listening to a talk about flaky tests at a DevOps conference, chances are you already know what they are, and are probably overly familiar with them, to be quite honest. But for the uninitiated, a flaky test can be summarized as a test that sometimes passes and sometimes fails without any changes to the underlying code. It just flaps, without any rhyme or reason. There's a ton of reasons a test can be flaky. I'm not going to go into each one, you probably know a lot of them anyway, so I've just thrown a bunch of them on this slide. It's by no means a conclusive list, but it covers the ones we see very often. Of all of these reasons, and this is purely anecdotal from working with customers, I think order dependency, environment issues, and resource constraints are probably the ones we see most often, so please take that with a pinch of salt. During this talk we'll be calling back to some of these as we go through the mitigation strategies.

So, what is the impact of flaky tests? First of all, flaky tests undermine developer confidence. Fundamentally, if the result of a test can be changed simply by spamming the retry button in your CI, then you really have to ask what value it offers. As is customary these days, I asked ChatGPT to define what a test is, and it told me: in software engineering, a test is a process or procedure used to verify that a software application or system behaves as expected under specified conditions.
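To make that definition of flakiness a little more concrete, here is a small illustrative sketch (my own hypothetical example, written as pytest-style test functions, not something from the talk). Both tests flap between passing and failing with no change to the code under test, one because of wall-clock time and one because of unseeded randomness; the `discounted_price` function and its 5pm rule are invented purely for illustration.

```python
import datetime
import random


def discounted_price(price: float) -> float:
    """Hypothetical function: applies a 10% 'happy hour' discount after 5pm."""
    if datetime.datetime.now().hour >= 17:
        return round(price * 0.9, 2)
    return price


def test_discounted_price():
    # Flaky: passes before 5pm local time, fails after, with no code change.
    assert discounted_price(100.0) == 100.0


def test_shuffle_keeps_first_element():
    # Flaky: depends on unseeded randomness rather than a controlled input.
    items = [1, 2, 3]
    random.shuffle(items)
    assert items[0] == 1
```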
With ChatGPT's definition in mind, if you're testing in a supposedly controlled environment with a supposedly controlled data set, but you're still getting unpredictable results on back-to-back runs, it's probably safe to say that definition isn't being met. The result is that engineers working on the project don't really know whether the thing they're testing is actually broken or whether the test is just playing up. The best possible outcome in that scenario is that the engineer in question mashes the retry button until they get a green build, which is obviously not ideal. The worst case is that you ship something broken and dodgy to production. Ultimately this erodes confidence across the board and is universally a bad thing.

Another bad thing about flaky tests: they slow down your CI pipelines. Eroding confidence is bad; slowing down CI is arguably worse. When a test fails, you'll probably have some sort of retry logic baked into your CI, and that introduces latency into your build as it retries. Maybe you have the necessary gear to rerun just that single test spec, which might only take a couple of seconds in the best case. But if you're batching a number of tests together in a single CI job, you might find that it reruns a whole load of stuff, and that adds expense in both time and money. I work with a ton of customers, helping them use Buildkite better and optimize their software delivery pipelines, and while I don't have data to corroborate this purely anecdotal statement, for the customers I work with the vast majority of build time is spent executing tests. By having to rerun tests due to flaky behavior, you turn the test phase into even more of a time sink. These days everyone has KPIs around build performance, and improving test performance is a good way to meet them. Parallelizing tests is, of course, the ultimate cheat code for slashing build times across the board, but ensuring that tests are well written, behave as expected, and don't need to be retried continuously is definitely a close second.

Thirdly, there's a monetary cost associated with tests being flaky. As I mentioned, they erode confidence, they slow things down, and they're annoying, but the impact goes beyond getting on everyone's nerves: there's a financial hit too. Retrying a test burns compute time, whether it's a single test or a batch that needs to be retried, and that's more money going to AWS or GCP or whoever is hosting that compute. Longer build times hurt productivity, which means engineer salaries aren't being put to their best use, and overall the quality of the thing you're shipping suffers.
It's perhaps a bit harder to attribute a dollar value directly to a flaky test in that sense, but it's still worth mentioning, and over time and at scale, both in terms of the number of builds and the number of engineers involved, it can really add up.

So I asked ChatGPT, as I'm wont to do, to provide some statistics about flaky tests, and here's what it told me, with citations no less: Google reports that 16 percent of their tests are flaky; 60 percent of developers regularly encounter flaky tests; and order dependency is a dominant cause of flakiness, responsible for 59 percent of flaky tests, followed by test infrastructure problems at 28 percent. Based on my experience working with hundreds, potentially even thousands, of customers, my anecdotal take would be that 16 percent of tests being flaky is probably about right, give or take. As for 60 percent of devs encountering flaky tests, that stat seems a little optimistic, to be quite honest. My feeling is it would be something close to a hundred percent, maybe not quite a hundred, but certainly higher than 60 percent.

So, we've established that flaky tests are bad: they're annoying, they cost money, and they erode trust. Now let's dive into some strategies you can use to address flaky tests in your environment.

First of all, you want to isolate and identify flaky tests. To be able to address flaky tests, you need to establish whether a test is flaky to begin with. In this session we're not just talking about broken tests, although those do share qualities with flaky tests; instead we're looking specifically at tests that exhibit flaky behavior. So the first thing you want to do is look for signs of flakiness, and the obvious question is whether a test is exhibiting flaky behavior or is simply broken. There are a few distinct patterns of behavior to look for.

The first is intermittent test failures without any code changes. This could be a different test result across multiple retries in a single CI run, or unpredictable results across multiple commits and CI runs for a portion of code that hasn't been touched. When nothing obvious has changed but your tests are giving unpredictable results, there's probably something flaky happening, and it's worth at least a second glance.

The second is tests that pass locally but fail in your CI environment. If it works locally on your laptop, it should work in CI, right? In many cases that's true, but there are also good reasons why not. It could be an environmental factor, either something local or something in your CI environment. The environment you run your tests in is very important, and we'll take a closer look at that in a moment, but this is definitely a behavior that can indicate flakiness.

And thirdly, the outcome of the test seems to depend on a non-deterministic factor: the time of day it's run, the environment it's run in, or any other variable that doesn't imply stability.
To be honest, this last one is less a distinct pattern of failure and more something that at face value seems extremely random. In these cases there can often appear to be no rhyme or reason: a test just flaps unpredictably when it runs. If you see that, there's a pretty good chance the test is flaky. Really, we're just trying to understand how a test performs over many runs, and if we're seeing a patchwork of red and green results across many builds, that certainly implies flakiness. Depending on the size and complexity of your code base, identifying these things manually might be possible, but it is 2025 now and you probably shouldn't have to do this by hand; it's way better to pay for some software and have it done for you.

Second, how can we identify flaky tests? Maybe try rerunning them a few times. It's a real brute-force technique, not particularly sophisticated: just rerun the test and see what happens. Flaky tests, by their nature, tend to give different results across multiple runs, so if you want to see whether a test is consistently failing or just flapping, why not rerun it a bunch of times? I don't necessarily mean as part of your standard CI workflow, but specifically for the purpose of surfacing this kind of flakiness. If you run your test job a bunch of times, ten times, whatever, does it exhibit flaky-looking behavior? If so, it probably is flaky.

And finally, utilize test observability and flaky-detection tools. I actually have a full section on this coming up, and I don't want to burn through all my content in the first few minutes or repeat myself later, so I'll keep it brief here and go into more depth when we get to that point. If you're working on a large project in a large team, it's totally reasonable to expect that your view could be somewhat blinkered to the thing you're working on currently and to your own CI runs, rather than looking at your colleagues' runs and all of the different pipeline runs that have been occurring, checking the test results, and seeing whether the thing that's flaky for you appears to be flaky for everyone else, so that you really understand the scope of the problem. That can be a mammoth ask in itself. The good thing is there are now a number of tools that can help you out with this: they collect test results across any number of builds, surface failing and flaky behavior, assist with debugging in general, and give you a ton of visibility. Unless your project is really small and manageable, I'd recommend exploring this avenue.

The takeaway from this first strategy is that reliable testing begins with understanding which tests are flaky to begin with. A test that fails isn't necessarily flaky; flakiness is another quality entirely. Use the means at your disposal to look at your tests and their behavior, and determine which ones are just totally broken versus which ones are flapping due to some external factor.
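As a rough sketch of that brute-force "just rerun it a bunch of times" idea, here's a small Python script that reruns one test repeatedly and tallies the outcomes. It assumes pytest is installed and the test node ID shown is a hypothetical placeholder; swap in a test from your own suite.

```python
"""Brute-force flakiness check: rerun one test many times and tally the results."""
import subprocess
import sys
from collections import Counter

TEST_ID = "tests/test_checkout.py::test_discounted_price"  # hypothetical test ID
RUNS = 20


def main() -> None:
    results = Counter()
    for i in range(RUNS):
        # Exit code 0 means the selected test passed; anything else counts as a failure.
        proc = subprocess.run(
            [sys.executable, "-m", "pytest", "-q", "-x", TEST_ID],
            capture_output=True,
            text=True,
        )
        outcome = "pass" if proc.returncode == 0 else "fail"
        results[outcome] += 1
        print(f"run {i + 1:02d}: {outcome}")

    print(f"\n{TEST_ID}: {results['pass']} passed, {results['fail']} failed over {RUNS} runs")
    if results["pass"] and results["fail"]:
        print("Mixed results with no code changes: this test looks flaky.")


if __name__ == "__main__":
    main()
```

Mixed results across identical runs are the signal here; a test that fails every time is simply broken, which is a different problem.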
The second strategy: optimize test design and implementation. Perhaps with some exceptions, most test flakiness can be attributed to test design and implementation. We spoke about that statistic a moment ago, where 59 percent of flaky tests are the result of order dependence, and that's something that can absolutely be addressed through improved test design. I have a couple of things to consider here, which might seem pretty obvious to some, but they're worth covering to build a solid foundation.

The first is test atomicity, or making your tests atomic. Primarily this is relevant to unit tests, but there are atomic principles we can apply to integration tests and end-to-end tests as well. So what is an atomic test, for those who don't know? An atomic test is a test that's designed to be independent, meaning it doesn't depend on the execution of any other tests and doesn't rely on state set by other tests. Again, think of that 59 percent order-dependency stat, which honestly sounds like a lot, but I don't have any better data, so it'll have to do. By designing tests to run independently, that number could hypothetically be much, much lower, possibly even zero. Tests that don't rely on other tests are inherently more reliable.

An atomic test should be tightly scoped: where appropriate, test a single unit of functionality, such as a method or a class, rather than a bunch of things. If you've ever heard the expression "less is more", live by it when you write your tests. An atomic test should be deterministic: ideally you want your test to provide the same result, a nice big green tick next to that build, given the same inputs every time it's run. So whenever possible, avoid randomness in the inputs; stabilize those inputs by providing static, predefined data sets that you know are good; and clean up state between runs. We're going to talk about the environment you run your tests in in a moment, but in the context of determinism, you want to ensure your tests have a clean, predictable, reproducible test bed to run in. And an atomic test should be performant: tests should be small, ideally testing a single thing using a predictable data set, and therefore quick to execute. These high-performance tests are perfectly suited to run during your CI process, and they should be less susceptible to timeouts in most cases.

I mentioned at the top of this section that "atomic" primarily refers to unit tests. Your end-to-end and integration tests won't be atomic in the same way, but they can employ atomic-like principles. Integration tests can be atomic to an extent: while not as small as unit tests, they should focus on single integration points or interactions. Avoid coupling multiple unrelated components in a single integration test if you can; each integration test should have a clear, single purpose, like "does service A correctly handle the response from service B?" And of course, use mocking of dependencies whenever appropriate to avoid issues.
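Before we get to end-to-end tests, here is a minimal sketch of what those atomic properties can look like in practice, assuming pytest; the `CartCalculator` class and the tests around it are hypothetical stand-ins, not code from the talk.

```python
import pytest


class CartCalculator:
    """Hypothetical unit under test, just for illustration."""

    def __init__(self, tax_rate: float) -> None:
        self.tax_rate = tax_rate

    def total(self, prices: list[float]) -> float:
        return round(sum(prices) * (1 + self.tax_rate), 2)


@pytest.fixture
def calculator():
    # A fresh, fully specified object per test: no shared state, no ordering concerns.
    return CartCalculator(tax_rate=0.2)


def test_total_applies_tax(calculator):
    # Tightly scoped: one behavior, one assertion, a static predefined data set.
    assert calculator.total([10.0, 5.0]) == 18.0


def test_total_of_empty_cart_is_zero(calculator):
    # Deterministic: same inputs, same result, every run, in any order.
    assert calculator.total([]) == 0.0
```

Each test builds everything it needs, asserts one thing, and leaves nothing behind, which is exactly what makes order dependence impossible for it.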
Your end-to-end tests usually can't be strictly atomic, because they're less likely to be small or isolated. However, they can still be designed to focus on a single, well-defined user journey or scenario, minimizing overlap with other end-to-end tests. And in the spirit of those tenets of atomicity I listed a moment ago, the test should use a clean state between runs, use well-defined test data to ensure consistency, and use mocks and stubs for third-party APIs.

Next for this strategy: adopt mocking and stubbing. Rather than relying on downstream services or other dependencies to provide a correct response in a test, think about utilizing mocks and stubs to provide quick and predictable responses. There's a ton of libraries and modules out there that integrate with whatever language you're using and can take on some of the heavy lifting here, so it's by no means something you have to build yourself; this is well-trodden ground. Again, in the interest of making tests truly atomic, we should be looking to narrow the scope and make things as deterministic as possible, so mocking data is totally reasonable in these circumstances. Of course, if you have a database or API or other dependency that's acting up and causing things to break, we do want tests to surface that, but that's probably the realm of your end-to-end tests; here I'm referring more to unit and potentially integration tests.

Finally, you'll want to structure your tests hierarchically. You've probably seen this image before, which I grabbed off Google Image Search: it's the test pyramid. At the bottom we have our unit tests, in the middle the integration tests, and at the top the end-to-end tests. At the base, the unit tests, you're testing your individual components, functions, and methods, everything in isolation. These tests are fine-grained and focused, just like we discussed a moment ago; they're nice and fast, they give as much code coverage as possible, and they make it really easy to identify a root cause in the event of a failure. Unit tests should be quick and computationally inexpensive to run, ideally covering as much surface area as possible, and when they fail, they point to a very specific method or class, which really helps with debugging.

The middle section is your integration tests, focused on how different pieces of your application work together: one thing talking to a database, for example, or some microservice or API. These integration tests are much more thorough, and more expensive as a result, but you need fewer of them than unit tests. Although less tightly scoped than unit tests, a failure in an integration test should still point to an interaction between specific components, which again helps with debugging.

And then at the top you have your end-to-end tests, replicating entire workflows in real-world environments to ensure that the application as a whole is working as it should. End-to-end tests are probably going to be the most expensive and fragile, likely using something like Selenium, which loves to get stuck waiting for a page element to appear or become interactive.
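Circling back to mocking and stubbing for a moment, here is a minimal hedged sketch using Python's built-in unittest.mock and the common requests library. The `fetch_order_total` function and its internal URL are hypothetical placeholders for whatever downstream service your own tests currently depend on.

```python
from unittest import mock

import requests


def fetch_order_total(order_id: str) -> float:
    """Hypothetical code under test: calls a downstream service over HTTP."""
    response = requests.get(f"https://orders.internal/api/orders/{order_id}")
    response.raise_for_status()
    return response.json()["total"]


def test_fetch_order_total_uses_stubbed_response():
    # Stub the HTTP call so the test never touches the real (possibly flaky) service.
    fake_response = mock.Mock()
    fake_response.json.return_value = {"total": 42.5}
    fake_response.raise_for_status.return_value = None

    with mock.patch("requests.get", return_value=fake_response) as fake_get:
        assert fetch_order_total("abc-123") == 42.5
        fake_get.assert_called_once_with("https://orders.internal/api/orders/abc-123")
```

The stubbed response makes the test deterministic and fast; whether the real service actually behaves is a question for a separate integration or end-to-end test.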
Coming back to the top of the pyramid: for all that fragility, end-to-end tests give you a true representation of the user experience, so you want to use them sparingly, for critical-path stuff. When one fails, at the very least you want to be able to see where and how it failed. From a debugging perspective it's going to involve a little more work than looking at a failed unit test, depending on the tooling you're using of course, but maybe you have debugging tooling to help with this, like APM traces, that can provide some context.

So why is the test pyramid important? A well-executed test pyramid means that everything is being tested, everything's tightly scoped with as little overlap as possible, and when flaky test behavior occurs there's as little noise as possible in your reporting. It becomes much easier to see where the problem is occurring, which is vital for efficient and effective debugging. So for strategy two, the takeaway is: design your tests to be tightly scoped, performant, predictable, and repeatable, and remove reliance on other services and dependencies by mocking and stubbing where possible. This will help improve simplicity and reduce noise.

Okay, the third strategy we're going to look at is improving the stability of your testing environment. We briefly touched on the importance of keeping a clean test environment during some of the previous strategies, but it's a really important consideration when it comes to keeping flaky tests at bay, so I figured it warrants its own section. A bad testing environment can be caused by any number of factors. Maybe you're using long-running hosts to build on, and someone didn't clean up after a previous build and left something behind that results in unexpected behavior. Maybe you have multiple jobs running in series from a single build, and one of the previous jobs didn't clean up the workspace, so the workspace becomes polluted and causes strange behavior. Maybe your dependencies get messy. Maybe your build infrastructure is being stood up manually or inconsistently rather than with automation. Perhaps external dependencies aren't working reliably, or the test infrastructure itself just isn't suitable: under-resourced, not enough memory or CPU or network bandwidth. These are just a few examples; there are pretty much infinite reasons why a test environment could be in a bad way. But we've got some best practices here to hopefully avoid some of the common pitfalls.

You want to strive for a consistent test setup. This is a very obvious one: you want your test environment to be predictable and as consistent as possible, and there are many different ways to do this depending on what your CI tooling is, what it allows you to do, and how you design your infrastructure. So, how do I keep my build environment clean? Many CI platforms provide the ability to run CI jobs in a fully hosted, managed, ephemeral environment that only lives for the duration of the job: it's spun up, and the job runs in whatever runner or container or environment the CI provider supplies.
Then once the job is completed, that compute is torn down and deleted, not reused. These kinds of environments really eliminate the risk of the workspace being contaminated by previously run jobs or builds; it's a nice clean state for every single job. Although the environment will be free from contamination, it's worth bearing in mind that the infrastructure itself is only as good as its original configuration. It might be immune to previous runs messing it up, but if the dependencies aren't being handled properly, you're still going to run into issues.

These managed ephemeral runners are useful in any number of scenarios, in my opinion particularly for Mac or iOS jobs. I'm definitely an Apple guy and have been for many years, I've had countless Macs during that time, but my experience has led me to the conclusion that whilst they work great as desktop machines, they can be a real handful when you manage them as headless CI runners. There's a litany of hurdles you need to overcome to get them to work well in this scenario, and maintaining a clean, stable workspace is, in my opinion at least, one of the more challenging ones. When you're self-managing Macs for CI, it's really common to see machines that are long-running, rather than the short-lived, one-and-done configuration we just mentioned. In these scenarios there's usually some attempt to clean the user folder between jobs so that there's a relatively clean workspace and a predictable environment to run tests in, but this is often quite difficult to achieve effectively. So being able to delegate the responsibility for managing Mac CI infrastructure entirely, and instead of that long-running Mac to have something completely ephemeral that runs in a VM somewhere before being recycled, is pretty compelling. I'm certain that anyone listening who is, or ever has been, responsible for maintaining a closet full of Mac minis or a bunch of cloud-based Mac instances would agree that it's a brutal job, and not having to worry about it would just be fantastic. I'm obliged at this point to mention that at Buildkite we do have this capability for Linux and Mac, so if you're looking for fully managed ephemeral runners for your CI jobs, we can certainly help you out.

Next up, at Buildkite we also provide a hybrid CI solution, where customers own and manage the build infrastructure: hybrid in the sense that we run the control plane as a SaaS service and you manage the compute yourself, wherever you want to deploy it. We offer a few out-of-the-box agent stacks to achieve this, using Kubernetes or EC2 instances, all auto-scaled, handling orchestration and job lifecycles for you. In both cases, much like those managed runners I mentioned a moment ago, CI jobs can be configured to run on fresh infrastructure each time before it's terminated and recycled, so each CI job gets its own clean environment. I can't speak for other CI providers on this capability; I believe some others do offer it, and it should certainly be possible.
So just take a look at the docs and see what's available; again, I can only vouch for Buildkite in this case. And whether your infrastructure uses a short-lived, one-and-done configuration or you have long-running instances dedicated to running CI jobs, definitely consider running things in Docker. It's a pretty ubiquitous capability for any modern CI solution, and at Buildkite we have plugins you can integrate into your CI workflows to make this really nice and easy. If something is running in a Docker container, it's a real nice, predictable environment to run your tests in. I was actually talking with a customer of ours about this very recently, a very well-known grocery delivery service, and they were describing how they do just this. In their case, jobs are executed using Buildkite with our Docker plugin, so that every single job gets a fresh container to run in. They have a single base image that all of their build and test jobs use, the same base image they use in production, in fact, and then a bunch of scripts that run when the container is orchestrated, depending on the CI pipeline that's running or the application being built, which pull in additional dependencies at runtime to give them what they need. This gives them that predictable, consistent environment per job.

Next up, you want to try to determine which failures are related to infrastructure. Working out whether a job failure is being caused by the environment or the infrastructure isn't always easy; these things can manifest in a number of different ways, and it's often not immediately obvious. However, you want to keep an eye out for a few things, such as timeouts and IO errors, tests failing while supposedly touching other services, or tests taking significantly longer between runs. If there's a huge discrepancy in latency between test runs, that certainly indicates something happening at the infrastructure or environment level. It's pretty much a given that you'll have some sort of infrastructure monitoring set up, whether it's Datadog or Elastic or something open source, watching system stats and metrics. What you want to do is use this in your build environment too, looking for spikes in various metrics that correlate with undesirable test behavior, which can point to things like resource constraints. This isn't a golden ticket or an infallible, conclusive way of picking out problems, but it's certainly worth monitoring, and when you correlate behaviors it can be quite powerful.
And if your observability tool can do things like apply custom tags to metrics, so that you can tag containers or instances by build job ID and then look at the relationship between that build job and the metrics associated with it, whether it's a container or an instance, and see whether failures correlate with various spikes, that's going to help you immensely when debugging things like resource constraints.

Going back to that grocery delivery service I mentioned earlier: when they observe a test failure due to a job bombing out, their default behavior is to scan the build logs from that job and look for signs that the failure is infrastructure-related. They have a list of error-message strings that indicate "this is an infrastructure problem", and if one matches, the failure goes to the developer experience team; if it looks app-shaped, it goes back to the app team so they can fix it. It's a nice way of separating these things out and assigning responsibility. Hopefully your CI solution gives you the ability to store your logs and access them programmatically, not just through the UI but via an API, or to store them locally or copy them to an S3 bucket and run a Lambda function across them, anything that lets you look for certain behaviors. That's definitely going to be useful for identifying whether something points to a system or environment issue.

Finally, replace external dependencies with mocks where you can. We've already touched on mocks, so I'm not going to burn any more time on this, but on the subject of system and environment stability and consistency, we want to remove dependencies on downstream services wherever possible. If a test is interacting with a downstream service, use a mock instead; it will not only make the test more performant, it will also reduce the risk of some unknown variable causing it to flap.

So, the TL;DR from this strategy: a consistent test environment is crucial for reliability, and there are many ways to achieve it depending on your requirements. You really just need to pick and choose what your CI tool provides relative to those needs and lean into it where you can.
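Before moving on, here is a rough sketch of that log-scanning idea: routing infrastructure-shaped failures to the platform or developer experience team and everything else back to the app team. The signature strings and routing targets are hypothetical placeholders, not the customer's actual list, and a real setup would pull logs from your CI provider's API or log storage rather than a local file.

```python
"""Classify a failed CI job as 'infrastructure' or 'application' from its build log."""
import re
import sys

# Patterns that, in this sketch, suggest the environment rather than the app is at fault.
INFRA_SIGNATURES = [
    r"connection (reset|refused|timed out)",
    r"no space left on device",
    r"OOMKilled|out of memory",
    r"DNS resolution failed",
    r"agent lost|runner disconnected",
]


def classify_failure(log_text: str) -> str:
    """Return 'infrastructure' if any known infra signature appears, else 'application'."""
    for pattern in INFRA_SIGNATURES:
        if re.search(pattern, log_text, flags=re.IGNORECASE):
            return "infrastructure"
    return "application"


if __name__ == "__main__":
    # Usage: python classify_failure.py path/to/build.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        category = classify_failure(f.read())
    owner = "developer-experience team" if category == "infrastructure" else "app team"
    print(f"failure category: {category} -> route to {owner}")
```

The list of signatures is the part you would grow over time, every infra incident tends to leave behind a new string worth matching on.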
So, number four of five, strategy number four: level up your test management and monitoring tools. I'm going to preface this section by saying that Buildkite has test monitoring built in; we have our own test monitoring and management tool called Test Engine. This talk is really not intended to be a sales pitch for the product, although I will mention it in passing as it relates to some of the subjects in this section. We would of course love for your organization to try Test Engine out and maybe give us some money, but it's by no means the only tool out there that provides test visibility, so alternatively just take a look around and pick one that fits your use case and budget and has all of the features you need. I'll be speaking at a high level here rather than about our own thing specifically. That said, I've worked at Buildkite for quite some time, and prior to the launch of Test Engine, the main pain point that customers encountered was around testing.

Especially when your engineering team is large, the code base is large, and there are tons of tests being executed, it becomes really challenging to see the forest for the trees. As an engineer, you can see that a test fails for you sometimes, and you mash retry until it goes away and everything's green again. But does everyone have that same problem with the same test? Put yourself in the shoes of the DevEx owner, the DevEx leader, or really anyone on the DevEx team: there's going to be a subset of tests that are slow and flaky for everyone, and they're causing a significant cumulative impact on build times, wait times, engineer productivity, and compute spend. To really address this, having appropriate tooling to surface these kinds of issues, monitor them, assist in debugging, and where possible help fix the problem is going to be a massive win for you.

So there are a couple of things you want to look for in your test monitoring tool. It needs to integrate with your CI workflows. First of all, make sure it has broad support for all of your required technologies: most organizations, especially once they hit a certain size, have a pretty varied tech stack, and being able to view all of this together in a single tool is invaluable versus having separate tools for this language and that system. Additionally, it needs to integrate with your CI pipeline. Test observability can be a real rabbit hole and probably should be a separate function from your CI workflow UX: you want your build logs and the outcomes of your build jobs in one place, and perhaps some additional tooling to drill into your test results adjacent to that, but the two should be tightly integrated, because that massively helps usability. Imagine running a test in your CI workflow, seeing a failure in the build logs, and being able to click through from that build log into the test failure, which takes you into a dashboard with debugging data and run history. That slick workflow, from clicking "run build", to looking at the build results, to looking at the failed test, to a deep dive into the test's history and debugging data, is going to be invaluable for you.

So, observability: very important. The problem most teams are trying to solve is that whilst your CI pipeline probably gives you tons of logs and metrics and details about the one build you're running currently, it probably doesn't help you aggregate test behavior from many builds and surface common issues. And whilst build logs can be super useful for debugging, they're also very verbose and a rough data source for things like trend analysis, or really any kind of manual analysis at all. So first things first: from your test observability tool you'll want dashboards, charts, and metrics all over the place. How reliable is my test suite, and each test individually? How long does my test suite take to execute? Are there any particular trends I should worry about? What are my worst-performing tests?
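Those dashboard questions are essentially aggregations over historic test results. As a rough, hedged illustration of what that aggregation involves under the hood, here's a sketch that assumes one JUnit-style XML report per build collected under a hypothetical ./reports/ directory, a format most test runners can emit; a real observability product does this continuously and correlates it with commits and environments, which this sketch ignores.

```python
"""Aggregate JUnit XML results from many CI builds to surface flaky and slow tests."""
import glob
import xml.etree.ElementTree as ET
from collections import defaultdict

stats = defaultdict(lambda: {"runs": 0, "failures": 0, "total_time": 0.0})

# Walk every collected report and tally per-test outcomes and durations.
for report in glob.glob("reports/**/*.xml", recursive=True):
    for case in ET.parse(report).getroot().iter("testcase"):
        key = f"{case.get('classname')}::{case.get('name')}"
        stats[key]["runs"] += 1
        stats[key]["total_time"] += float(case.get("time") or 0.0)
        if case.find("failure") is not None or case.find("error") is not None:
            stats[key]["failures"] += 1

# A test that both passes and fails across builds looks flaky rather than simply broken.
flaky = {k: v for k, v in stats.items() if 0 < v["failures"] < v["runs"]}

print("Suspected flaky tests (sorted by failure rate):")
for name, s in sorted(flaky.items(), key=lambda kv: kv[1]["failures"] / kv[1]["runs"], reverse=True):
    rate = s["failures"] / s["runs"]
    avg_time = s["total_time"] / s["runs"]
    print(f"  {name}: {rate:.0%} of {s['runs']} runs failed, avg {avg_time:.2f}s")
```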
This is a screenshot I grabbed from our own testing tool, with the pertinent information blanked out, showing a test that's been deemed flaky: what's its history, and how long has it been taking to execute over time? It's perhaps a bad example of a test, but we can see some of its behavior, whether there's been degradation over time, or a very particular point in time where things started going awry.

You want to use flaky test detection: your tooling should be able to surface flaky tests. At Buildkite we have huge customers who run billions and billions of tests every year, and probably the number one thing they want to know from our test observability tool is which tests are flaky, how often are they flaky, and when did they become flaky. There's undoubtedly an infinite number of ways an engineering org could fine-tune their tests to make builds a little more performant, but the most significant improvement is likely identifying and squashing these bugs, and this kind of specialized tooling will really help with that and ensure that your team and your organization have ongoing visibility into the issue when it occurs. Specialized, test-focused tooling should at the very least be able to group many historic executions of a given test, so that when you identify a particularly problematic test you're able to flip through that history, see all those historic runs, and examine latency and debug failures across many different invocations of that test.

You'll also want to be able to assign responsibility for a bad test to someone. Being able to identify a test as flaky and bad is great, but then what? Do you fix it yourself? Maybe. Is it your responsibility, though? Does it fall within your remit, or do you need to go and chase someone else up to do it? A lot of these test observability tools now have the ability to assign a test to someone, so you see a flaky test and you can make it somebody else's problem, if it's their responsibility of course. You take that flaky test and assign it to a person or a team; maybe they get a report or a notification, maybe when they log in they get a reminder that this is something they need to fix. At the very least it gives team-, product-, and org-wide visibility into who needs to address the flaky test.

And finally, you want to be able to take those flaky tests and quarantine them, to prevent pipeline blockages. Some test management tooling, including our own Test Engine, has this ability: you see a flaky test, and you can quarantine it and prevent it from blocking the build. This is really the next step beyond just having the metrics and the knowledge that a test is flaky: now you can actually do something with it and leverage the tooling to make decisions about which tests run. We're going to touch on test quarantining again in a moment, but being able to proactively alter test execution based on observed performance or flakiness is kind of a superpower in these kinds of tools. The TL;DR here is that observability is key, and you should really lean into tooling that not only monitors tests but also provides automation and proactive resolution of flakies when they occur.
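Since quarantining came up as one of those proactive capabilities, here is a minimal sketch of what it can look like at the test-runner level, assuming pytest and a plain-text quarantine list checked into the repo. The file layout is hypothetical; a product like Test Engine handles this for you, and the point stands either way: a quarantined test is skipped, not fixed.

```python
# conftest.py -- skip quarantined tests so known-flaky tests can't block the build.
#
# quarantine.txt contains one test node ID per line, e.g.:
#   tests/test_checkout.py::test_discounted_price

from pathlib import Path

import pytest

QUARANTINE_FILE = Path(__file__).parent / "quarantine.txt"


def _quarantined_ids() -> set[str]:
    if not QUARANTINE_FILE.exists():
        return set()
    lines = QUARANTINE_FILE.read_text().splitlines()
    return {line.strip() for line in lines if line.strip() and not line.startswith("#")}


def pytest_collection_modifyitems(config, items):
    """Mark any collected test whose node ID appears in the quarantine list as skipped."""
    quarantined = _quarantined_ids()
    skip_marker = pytest.mark.skip(reason="quarantined as flaky; see quarantine.txt")
    for item in items:
        if item.nodeid in quarantined:
            item.add_marker(skip_marker)
```

Treat that list as temporary, which is exactly where the final strategy picks up.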
Okay, so the final strategy, number five: you want to foster a culture of reliability. Here at Buildkite we spend a lot of time obsessing about test reliability and build performance, so naturally all of our tests work perfectly, they never exhibit any flaky behavior, and all of our builds pass green every single time. That is, of course, a ridiculous, egregious lie. Just like any other software engineering team, we have flaky tests and build failures all over the place. We do dogfood our own products to make sure we have visibility, and we try to foster a culture around good test practice, but in reality that's easier said than done. We're very lucky to have, honestly, some of the most talented, sophisticated, and expensive engineering teams in the world as customers of Buildkite, and if they struggle with flaky tests, what chance do we have? The truth is, you can have the best tooling in the world at your disposal, the most sophisticated dev and build environments, and all the money in the world to hire world-class developers, but if you're not building that culture around test reliability, then eventually you too will be configuring test jobs with 15 auto-retries to get your builds to finish green, and everyone will get mad at you. And that's a bad thing. So we've tried to instill that culture at Buildkite as we've built and evolved Test Engine, and the team has tried to live and breathe test reliability and work it into all of our engineering practices.

First of all, make unblocking everyone else a top priority. First and foremost, it's everyone's responsibility to keep tests passing reliably, and if you encounter a test that's exhibiting flaky behavior, do something about it. If it's not your test or you can't fix it, talk to someone: assign the test to the relevant person or team so they can look into it. Hopefully you have a tool that allows you to do this, something like Test Engine or some other test observability tool, or maybe you assign it in your dedicated tracker or write a Jira ticket or whatever. Just take action when you see a flaky test. That old saying you see in the subway, "if you see something, say something", is true for testing as well; live by it. And if you log in and see that tests are broken and assigned to you, make it your business to prioritize fixing them. Help your colleagues out; they will really appreciate you for it.

Quarantining tests is fine and good, great in fact, but it's important that you do something after that. If a test is flaky or broken, you might have the ability to quarantine it, and that's good because it removes it from the test suite run, the build passes, and everyone's happy that everything's green. But then what? The thing the test was testing is now untested, which is not great. So test quarantining should be a temporary measure.
Once you've established that a test is problematic and should be quarantined, it's probably time to fix it, or alternatively to make a decision about whether it should be kept around at all.

When necessary, swarm. There have been several, perhaps even more than several, occasions when Buildkite has seen big swathes of red through all of our test results, everything failing to the point where it can't just be ignored. On those occasions the team will down tools and swarm on the flaky tests to get the problems addressed. Typically this works really well: enough progress is made in a relatively short period of time that things turn green again, things get moving, and we can all move on from it. But obviously this isn't a great strategy for day-to-day use. It's disruptive, it means everyone has to down tools and pivot to looking at tests, and it takes everyone away from the important work of shipping features and providing shareholder value. So swarming on tests is a really good break-glass tool you can use to fix problems when things have really gone awry, but it's probably not a day-to-day solution for addressing tests.

And finally, avoid silos of information. Fixing tests is not always straightforward. Depending on the nature of the failure, it could be buried deep down in the stack, in some library that's being used, and it might require specialist knowledge to debug. Engineers, like anyone else on the planet, learn from experience: if someone gains experience debugging an obscure issue, or really any issue at all, they can apply that knowledge to future bugs they're trying to fix. So we want to give engineers that experience. Ensure the team gets exposed to debugging tests throughout the stack, make it a priority to document this stuff properly, write-ups in internal CMSs and knowledge bases, and think about how you can mob on tests to share the information and experience. As with anything in software engineering, it's very important to avoid knowledge silos wherever possible, because they make it significantly harder and slower to debug these complex and obscure test failures.

So the final TL;DR here: the people and processes are as important as the tools, and making test fixing a priority for your organization is key to removing flakies from the equation. That's my five strategies for getting a handle on flaky tests. To list them again: figure out which tests are flaky to begin with, optimize your test design, work on environment stability, use the best tooling for the job, and foster a reliability culture. Maybe you're doing some of these things already. Maybe you're doing all of them and you don't have any flaky tests at all, in which case I don't really know why you're watching this presentation, but thank you for tuning in anyway. Whatever the case, I really hope this has been helpful, useful, insightful, maybe even thought-provoking. And if your team is struggling with flaky tests, I can confidently assure you that you're not on your own, and you're in very good company.
Again, I work with many world-class engineering organizations as part of my role at Buildkite, and testing is a struggle for all of them. That's really why we've built this kind of tooling: to help folks understand, monitor, debug, and ultimately mitigate test issues. To close this out, I want to say a big thank you to the Conf42 team for putting this whole event together, and I'd really like to encourage you to pop along to buildkite.com and check us out. We have a world-class CI/CD platform focused on scalability, security, and flexibility; we have a test observability product, Test Engine, which will help you solve many of the issues I described today; and we also have a hyper-scalable package management and distribution platform as well, so definitely check that out. And if you'd like to join the likes of Uber, Slack, Shopify, Block, Pinterest, and many, many others, just to name a few, head over to buildkite.com, sign up for a free trial, and come and talk to us about this stuff. We're looking forward to chatting with you about your test problems. And with that, I'd like to say thank you very much, everyone. Thanks for tuning in.

Mike Morgan

Solutions Architect @ Buildkite

Mike Morgan's LinkedIn account


