Transcript
Hello and welcome to this talk on automated grenade testing for platform components.
My name is Jonathan Mounty and I'm a cloud engineer and engineering leader here at
Cisco, where we maintain a Kubernetes-based, multi-cloud platform at scale. Today I wanted to present to you a technique we've developed for efficiently testing platform components together and share that with the DevOps community. So to give a bit of background,
we maintain hundreds of clusters in different environments, owning everything between
the data center and the application layer. That includes environment definitions, the physical and virtual machines themselves, the control plane, the metrics and observability stacks, as well as the deployment pipelines. And at this scale, your configuration becomes a critical point of failure for your platform,
especially when it comes to version combinations. And the velocity
of the changes that you make in these environments necessitates a different approach
to integration testing than we have found in other fields of software engineering.
So grenade testing itself involves no actual grenades, but it does provide an output of compatible versions that don't
blow up when they're used together.
So to introduce grenade testing as a concept, I'll first walk you through our stack. Yours might be different to this,
but it likely has the same or similar layers involved. So we
start with provisioning the actual machines that we run on using terraform.
We then make those machines easy to manage by clustering them.
We then make the applications on top easy to manage by bundling them together using
a release management tool. We link that in with the other peer clusters so
we can see what's going on in one place, and then we add an awareness
of what's going on in different environments, for example, your integration environment and
your production environment and things like that. Grenade testing will help with
the first three of these considerations.
So we've mostly built our platform with CNCF projects.
So to start with, we define a manifest which includes the sizes, instance types and core versions of the actual infrastructure which we aim to provision. That manifest is transformed into Terraform by two tightly coupled tools which we call infrastl and federated infra.
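To give a rough flavour of what such a manifest captures, here is a minimal sketch; the field names and values below are illustrative assumptions rather than the real schema, since the actual manifest is a declarative file consumed by infrastl.

```python
from dataclasses import dataclass, field

# Illustrative shape of an environment manifest. The real thing is a declarative
# file consumed by infrastl; these field names and values are made up.

@dataclass
class EnvironmentManifest:
    name: str
    cloud: str                 # e.g. "private-dc" or "aws"
    node_count: int            # size of the cluster
    instance_type: str         # machine flavour to provision
    kubernetes_version: str    # core control plane version to pin
    base_image: str            # OS image the nodes boot from
    core_versions: dict = field(default_factory=dict)  # other pinned tooling

manifest = EnvironmentManifest(
    name="grenade-minimal",
    cloud="aws",
    node_count=3,
    instance_type="m5.xlarge",
    kubernetes_version="1.27.3",
    base_image="base-2023-10",
    core_versions={"cni": "1.2.3"},
)
```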
We then provision a cluster control plane using lots of relatively horrible Ansible and Kubespray, and we do application deploys using Helm.
We then use the concept of a command and control cluster, which we use
to run a lot of our automation and make things easy to manage. And we've
also defined a high level way of orchestrating applications across environments.
This is a relatively extensible tool, so we can always change the way we run, adding Cluster API, EKS or GKE clusters, whatever. The platform is really a fusion of these components, and it's constantly evolving, which is all the more reason that we need to test these things pretty comprehensively. It's great to see this stuff on a presentation slide. It looks like it makes sense, but actually, under the hood, what you're looking at is a pipeline.
So hopefully most of you are familiar with this absolute dream of everything going
well in Argo or your workflow engine of choice. Other workflow engines are available,
but the reality is, whenever we put these things
on a slide, you're missing a lot of links, like a misconfigured cluster that breaks
your Terraform rendering, or when you think your fix was backward compatible, but actually it wasn't. Or the engineer that pushed that fix at 4:47 on a Friday night before heading on PTO for a week, which you hit at 9:24 on Monday morning. Because you would,
wouldn't you? And the pipelines that link these all up, how do you
ensure that an update to your pipelines that adds a feature or allows you
to enable new functionality doesn't break things somewhere else in your tooling?
And often these things aren't tested as comprehensively as we might like.
In these cases, combinatorial testing is intractable once you have a few
versions, and unit testing will go some way to ensure individual components
aren't faulty. But what about when you put them together? And maybe you upgrade a
version of one tool that expects a different API version?
And traditional integration testing very quickly becomes very expensive for every change we make, especially across various different environments where we're running multiple tools on top of the same platform. When every tool, or every use case of the platform, wants its own integration environment, it becomes very, very difficult very, very quickly.
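As a rough back-of-the-envelope illustration of why that's intractable, the combinations multiply very quickly; the component names and version counts below are entirely made up.

```python
from itertools import product

# Hypothetical version sets for a handful of platform components.
versions = {
    "infra-manifest-tool": ["1.0", "1.1", "1.2"],
    "provisioner": ["2.4", "2.5", "2.6", "2.7"],
    "base-image": ["2023.01", "2023.02", "2023.03"],
    "chart-bundle": ["0.9", "1.0", "1.1", "1.2", "1.3"],
}

combinations = list(product(*versions.values()))
# 3 * 4 * 3 * 5 = 180 full environments to stand up, before you even
# multiply by clouds, regions or pipeline versions.
print(f"{len(combinations)} version combinations to integration test")
```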
So we've come up with a relatively simple solution
in what we like to call grenade testing.
So this is a brief introduction to grenade testing at
a very high level. We really identified two different failure modes in our platform tooling. One is before we have a control plane, which might be due to broken configuration, broken Terraform or broken Ansible. The other is afterwards, so things like an application incompatibility, or maybe we didn't set up the networks correctly, or we incorrectly configured base images. So the first
thing we did was to identify the core base tooling, the things most likely to cause control plane failure. In our case that's infrastl, federated infra, and the Kubernetes provisioning step, or alternatively the base image itself. And then we automatically run, you guessed it, a pipeline, which brings up a minimal integration environment in a variety of different cases: that might be in our private data center, maybe one in AWS, maybe one that runs Cluster API. We then verify that the control plane is healthy and conduct minimal testing afterwards, installing some base applications and ensuring that we have the basic functionality available. If that's successful, then we can commit and use that version with more confidence than we otherwise would, without actually testing with every single version of the base applications.
If that fails, we can automatically go back to the engineer in question.
Because all of these pipelines are automated, we tighten that feedback loop from
potentially weeks before we discover the issue to a matter of hours.
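Conceptually, a single grenade test run against one target looks something like the sketch below. The step names are hypothetical stand-ins for our pipeline stages, not real commands.

```python
import subprocess

def run(step: list[str]) -> None:
    """Run one pipeline step, raising if it fails."""
    subprocess.run(step, check=True)

def grenade_test(versions: dict[str, str], target: str) -> bool:
    """Bring up a minimal environment for one version combination and verify it."""
    try:
        run(["render-terraform", "--target", target,
             *[f"--set={tool}={ver}" for tool, ver in versions.items()]])  # manifest -> Terraform
        run(["provision-control-plane", "--target", target])   # Ansible / Kubespray
        run(["verify-control-plane", "--target", target])      # API reachable, nodes ready
        run(["install-base-apps", "--target", target])         # Helm deploys
        run(["smoke-test", "--target", target])                 # basic functionality checks
        return True    # record the combination as compatible
    except subprocess.CalledProcessError:
        return False   # feed the failure back to the engineer behind the bump
```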
So what we've discussed is an automated workflow
which has a relatively quick turnaround time. It improves developer productivity, and it verifies the tooling that will be deployed in the real world. It handles failure: in the case of failure, we identify the faulty component. We do have the option of including regression testing to check
whether it would work with the previous version. We consistently test
versions that people want to use and encourage the use of latest,
because every time we run this pipeline, it actually pulls the latest version of
all those base tools. It also
mimics the production environment. We currently don't grenade test in our
production environment live, but we have a mimic of it. We don't have any paging
involved in that, and it's not a chaos monkey, but it does provide insight into
how this would behave if it were running in production.
So the question then is, where does this fit in with other testing? It sounds
relatively similar to integration testing. So we
built grenade testing as a relatively efficient
tool to kind of fit around the other forms of testing and be something which
is a relatively broad brush, but which
is significantly easier to maintain than other environments.
So we built this probably moderately vaguely
accurate graph based on what feels right in the context of a platform where
we discuss all of the different kinds of testing that we do.
So as I mentioned, unit testing is relatively small,
relatively tied to one version of an API. You're really testing it
situationally against maybe issues that you've seen in the past. We then have
integration testing, which is good, but pretty expensive,
and everybody wants their own integration environment in our use case,
so that comes with a significant maintenance cost. And they also
don't like their integration environments being changed frequently, which is
something that we maybe want to do when we're constantly releasing new versions of
our platform. So we then have stubbed integration testing,
which is a bit of a middle ground. It's a lot easier, but it does
considerably reduce the chance of catching things if you're stubbing everything out, and you've still got the issue of version incompatibilities if something changes an API when you don't expect it. As much as I'd love
semantic versioning to work in all cases, often it's not respected
as much as we might like. We then have system testing.
This is when we're starting to get into the really good stuff. We're getting there on
how comprehensive we are, but it's still a lot of work to maintain
system testing or a system testing process. And then probably the most comprehensive thing, and the most difficult to maintain, is manual acceptance testing, which may well be the worst job ever, but it does deliver good confidence that your system is going to work. And when we get towards
the final points of a release, then maybe that's something we want to do.
So grenade testing here comes as a compromise. It's not wonderful,
it doesn't fix everything that you're going to run into, but it does make your
job a lot easier. It's relatively low cost and mostly automated.
And again, it just allows you to tighten that feedback loop at
a relatively low cost. So grenade testing is not an
open source project. It's a concept that we found useful, that we thought we'd
like to share. It's tied to your own pipelines,
and in our case it was built in a bespoke manner, but it
is something which is probably extensible, and we would support open source
implementations if someone would like to do so.
So I wanted to talk a little bit, or share a little bit, about what grenade testing looks like for us. We're very dependent on GitHub, so you can see that this is mainly driven using a pull request based model. So we have pipelines that monitor the releases of core platform tooling. In this case, we've upgraded federated infra to a new version, 10.1.6, and we've then pulled the latest versions of infrastl, the Kubernetes provisioning and the base image to test against that. And the monitoring pull request commits reports of status and failure to the master branch, so that we have a log of when things failed and when things passed.
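Conceptually, each run leaves a record along these lines in that log; the schema below is a made-up illustration, not our actual report format.

```python
import json
from datetime import datetime, timezone

# Made-up illustration of a status report a grenade test run might commit back
# to the master branch of the monitoring repository.
report = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "trigger": "federated infra 10.1.6 release",
    "versions": {
        "federated-infra": "10.1.6",
        "infrastl": "latest at run time",
        "kubernetes-provisioning": "latest at run time",
        "base-image": "latest at run time",
    },
    "targets": {"private-dc": "passed", "aws": "passed", "cluster-api": "failed"},
    "result": "failed",
}
print(json.dumps(report, indent=2))
```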
So these are then run in another environment. Again, all of our infrastructure pipelines are driven using GitOps, so all of our infra repositories use real pipelines. So we will then provision
this using generic accounts, and we then
perform all of our testing against that cluster as we would a normal
cluster. So we've kind of
tested this base infrastructure and workload control plane somewhat,
but then we have to consider the workload deployment and configuration.
Now, we actually leave a grenade
testing cluster up at any one time, the latest successful grenade testing cluster.
And that ensures that we always have one active target, which allows for plugin-style checks against the active cluster and for extensible testing.
We can automatically test a new chart version against a real cluster once it's released, and optionally automatically merge the bump into our integration environment if it works. And that allows us to reduce the drift between what we have in development and what we have in production, by automatically merging things that we are confident, or at least semi-confident, about into our integration environment.
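A rough sketch of that chart verification and promotion flow is below, assuming Helm and the GitHub CLI are available; the release naming, branch naming and flags are illustrative assumptions rather than our actual automation.

```python
import subprocess

def verify_and_promote_chart(chart: str, version: str, kubeconfig: str) -> bool:
    """Deploy a new chart version onto the live grenade cluster; auto-merge the bump if it works."""
    try:
        subprocess.run(
            ["helm", "upgrade", "--install", chart, chart,  # release name and chart ref assumed identical
             "--version", version, "--kubeconfig", kubeconfig,
             "--atomic", "--timeout", "10m"],
            check=True,
        )
    except subprocess.CalledProcessError:
        return False  # leave the bump pull request open and report the failure instead
    # Deploy succeeded: merge the hypothetical version-bump branch into integration.
    subprocess.run(["gh", "pr", "merge", f"bump-{chart}-{version}", "--squash"], check=True)
    return True
```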
So how does this affect the diagram we came to beforehand? This is a relatively trivial initial pass through, and we just add two different sections. So we have the continuous maintenance of a version of the cluster: if we have a successful grenade test run, then we'll destroy the old testing target, and we also do continuous chart verification against that running cluster. And we also have automated testing, plug-and-play automated testing, so you can write your own unit tests that will run against a cluster whenever a new grenade test is run. Now, that also allows you to test, for example, particular networking expectations that you have, or the way that you expect the cluster to behave under particular conditions. It's not designed for really, really comprehensive integration testing, but just the kind of initial indications that you might get as to whether your cluster is working as you expect it to or not.
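For a flavour of what one of those plug-and-play checks might look like, here is a hypothetical example written against the official Kubernetes Python client; it simply asserts that every node in the latest grenade cluster reports Ready.

```python
from kubernetes import client, config

def test_all_nodes_ready():
    # Assumes the pipeline has written a kubeconfig for the latest grenade cluster.
    config.load_kube_config()
    nodes = client.CoreV1Api().list_node().items
    not_ready = [
        node.metadata.name
        for node in nodes
        if not any(cond.type == "Ready" and cond.status == "True"
                   for cond in (node.status.conditions or []))
    ]
    assert not not_ready, f"nodes not Ready: {not_ready}"
```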
So, the challenges that we came across when we were implementing this initially: figuring out who to blame. We have a number of cases where we have maybe flaky APIs or we're running out of capacity, or maybe we're struggling to identify the source pull request of errors. We can identify that a new version bump was released, but maybe a version bump will include multiple pull requests' worth of changes. And then we also have the issue with
pipelines. So pipelines can be bumped independently of all
these base images and could also break the way that grenade
testing works. But it's nonetheless useful to know that your pipelines are
not working. Grenade testing, as we mentioned,
doesn't catch everything. It's not a silver bullet when it comes to testing,
especially when it comes to application configuration and specialized cluster configurations.
You still want to do comprehensive testing for that, and it's more of a complement
to other forms of testing that you might do. And it doesn't even replace
your integration environment. It's just a lower cost way of maybe
detecting issues before you hit integration.
It is also entirely dependent on the stability of your pipelines and data centers, so system maintenance does become a bit of a question. It does need oversight. Most version releases that we make are rather large, so identifying the issue within that delta can be challenging, so it does require intervention from engineers.
We also have end user application compatibility, so application configuration can break deploys. In many cases, platform components are also applications. So when we're trying to maybe roll out a CNI, for example, that could also break the cluster in various creative ways that you might discover during grenade testing, but which may not be related to the base images that you've been releasing.
We also talk about environment configuration and how that can also change things,
mentioning again config being a major
issue there, and then also validating
application behavior. We don't really do that at all within grenade testing,
and that's why we need to have normal integration testing, system testing,
and even acceptance testing as well. We never really do that validation of application behavior, merely checking that it deploys and things like that. We also have issues with noise from failed
builds. Often grenade testing can fail and can go down,
and obviously we then need to commit engineering time to maintaining it.
We also test in every cloud provider, so sometimes things will fail within AWS but won't fail in our private data center, especially when your Terraform is different between the two.
So there are some challenges when you're implementing grenade testing as well.
In general, though, we have definitely
improved our level of confidence in releases that pass grenade testing.
Previously, the first challenge when you joined the platform team was building your first cluster.
Now, your first challenge is still building your first cluster, but at least you have
the right versions, or at least you have confidence that your versions worked at least
once before you do that. We also have some automatically
generated compatibility information that enables you
to make those kinds of decisions when you're initially bringing up clusters, which was
again a big issue before we introduced grenade testing.
It's reduced our time to detect issues. Admittedly, we didn't
have great integration testing beforehand, but grenade testing does a reasonable job of kind of bridging the gap between
those two things. It acts as a data
source so we can gain information on build success rates, build frequency, et cetera,
which can be used to inform us on maybe the quality of our platform.
And we can also tag versions according to whether they
pass grenade testing and how confident we are with releases.
We don't have a test environment for each application or each component,
so it's reasonably cost effective when we're doing grenade testing, and there's no overhead of keeping these test environments up to date. We don't end up with one application team demanding that we
run a particular version because they haven't tested against the new one. It's already
consistently being torn down and replaced more frequently,
far more frequently than our integration environment. And we have some confidence
in whatever's running in master at any one time, that it
works with the rest of the platform rather than just in isolation.
So is grenade testing finished? Absolutely not.
We thought it was a cool concept that's relatively simple and kind of sufficiently different from the traditional integration testing that you might do, introducing more automation. And we think that it could be valuable to other people, whether you're a big or a small operator of cloud platform technology. That said, we would love to start a conversation on how other people do their testing within their platforms and share enhancements on top of what we've done here with the community. By generally working together, we can improve the reliability of all of these cloud platforms, which benefits
everybody in the long run.
So I hope this talk has been insightful for you or maybe raised some
questions. Thank you very much for taking the time to watch
this talk, and thank you very much to the Conf42 organizers for inviting us to talk about this. I'd just like to take a little bit of time to acknowledge JP and Lorn for their work within grenade testing that made it the reality
that it is today, and I hope you enjoy the rest of the conference.
Thank you very much for your time.