Transcript
Hello and welcome to this talk on automated grenade testing for platform components.
My name is Jonathan Mounty and I'm a cloud engineer and engineering leader here at
Cisco, where we maintain a Kubernetes-based, multi-cloud platform at scale. Today I wanted to present to you a technique we've developed for efficiently testing platform components together and share that with the DevOps community. So to give a bit of background,
we maintain hundreds of clusters in different environments, owning everything between
the data center and the application layer. That includes environment definitions, the physical and virtual machines themselves, the control plane, the metrics and observability stacks, as well as the deployment pipelines. And at this scale, your configuration becomes a critical point of failure for your platform,
especially when it comes to version combinations. And the velocity
of the changes that you make in these environments necessitates a different approach
to integration testing than we have found in other fields of software engineering.
So grenade testing itself involves no actual grenades, but it does provide an output of compatible versions that don't
blow up when they're used together.
So to introduce grenade testing as a concept, I'll first walk you through our stack. Yours might be different to this,
but it likely has the same or similar layers involved. So we
start with provisioning the actual machines that we run on using terraform.
We then make those machines easy to manage by clustering them.
We then make the applications on top easy to manage by bundling them together using
a release management tool. We link that in with the other peer clusters so
we can see what's going on in one place, and then we add an awareness
of what's going on in different environments, for example, your integration environment and
your production environment and things like that. Grenade testing will help with
the first three of these considerations.
So we've mostly built our platform with CNCF projects.
So to start with, we define a manifest which includes the sizes, instance types and core versions of the actual infrastructure which we aim to provision. That manifest is transformed into Terraform by two tightly coupled tools which we call infrastl and federated infra.
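To give a rough flavour of what such a manifest captures, here is a minimal sketch; the field names and values below are illustrative assumptions rather than the real schema, since the actual manifest is a declarative file consumed by infrastl.

```python
from dataclasses import dataclass, field

# Illustrative shape of an environment manifest. The real thing is a declarative
# file consumed by infrastl; these field names and values are made up.

@dataclass
class EnvironmentManifest:
    name: str
    cloud: str                 # e.g. "private-dc" or "aws"
    node_count: int            # size of the cluster
    instance_type: str         # machine flavour to provision
    kubernetes_version: str    # core control plane version to pin
    base_image: str            # OS image the nodes boot from
    core_versions: dict = field(default_factory=dict)  # other pinned tooling

manifest = EnvironmentManifest(
    name="grenade-minimal",
    cloud="aws",
    node_count=3,
    instance_type="m5.xlarge",
    kubernetes_version="1.27.3",
    base_image="base-2023-10",
    core_versions={"cni": "1.2.3"},
)
```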
We then provision a cluster control plane using lots of relatively horrible Ansible and Kubespray, and we do application deploys using Helm.
We then use the concept of a command and control cluster, which we use
to run a lot of our automation and make things easy to manage. And we've
also defined a high level way of orchestrating applications across environments.
This is a relatively extensible tool, so we can always change the way we run, adding Cluster API, EKS or GKE clusters, whatever. The platform is really a fusion of these components, and it's constantly evolving, which is all the more reason that we need to test these things pretty comprehensively. It's great to see this stuff on a presentation slide. It looks like it makes sense, but actually, under the hood, what you're looking at is a pipeline.
So hopefully most of you are familiar with this absolute dream of everything going
well in Argo or your workflow engine of choice. Other workflow engines are available,
but the reality is, whenever we put these things
on a slide, you're missing a lot of links, like a misconfigured cluster that breaks
your Terraform rendering, or when you think your fix was backward compatible, but actually it wasn't. Or the engineer that pushed that fix at 4:47 on a Friday night before heading on PTO for a week, which you hit at 9:24 on Monday morning. Because you would,
wouldn't you? And the pipelines that link these all up, how do you
ensure that an update to your pipelines that adds a feature or allows you
to enable new functionality doesn't break things somewhere else in your tooling?
And often these things aren't tested as comprehensively as we might like.
In these cases, combinatorial testing is intractable once you have a few
versions, and unit testing will go some way to ensure individual components
aren't faulty. But what about when you put them together? And maybe you upgrade a
version of one tool that expects a different API version?
And traditional integration testing very quickly becomes very expensive for every change we make, especially across various different environments where we're running multiple tools on top of the same platform. When every tool, or every use case of the platform, wants its own integration environment, it becomes very, very difficult very, very quickly.
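As a rough back-of-the-envelope illustration of why that's intractable, the combinations multiply very quickly; the component names and version counts below are entirely made up.

```python
from itertools import product

# Hypothetical version sets for a handful of platform components.
versions = {
    "infra-manifest-tool": ["1.0", "1.1", "1.2"],
    "provisioner": ["2.4", "2.5", "2.6", "2.7"],
    "base-image": ["2023.01", "2023.02", "2023.03"],
    "chart-bundle": ["0.9", "1.0", "1.1", "1.2", "1.3"],
}

combinations = list(product(*versions.values()))
# 3 * 4 * 3 * 5 = 180 full environments to stand up, before you even
# multiply by clouds, regions or pipeline versions.
print(f"{len(combinations)} version combinations to integration test")
```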
So we've come up with a relatively simple solution
in what we like to call grenade testing.
So this is a brief introduction to grenade testing at
a very high level. We really identified two different failure modes in our platform tooling. One is before we have a control plane, which might be due to broken configuration, broken Terraform or broken Ansible. The other is afterwards, so things like an application incompatibility, or maybe we didn't set up the networks correctly, or we incorrectly configured base images. So the first
thing we did was to identify the core base tooling, the things most likely to cause control plane failure. In our case that's infrastl, federated infra, and the Kubernetes provisioning step, or alternatively the base image itself. And then we automatically run, you guessed it, a pipeline, which brings up a minimal integration environment in a variety of different cases: that might be in our private data center, maybe one in AWS, maybe one that runs Cluster API. We then verify that the control plane is healthy and conduct minimal testing afterwards, installing some base applications and ensuring that we have the basic functionality available. If that's successful, then we can commit and use that version with more confidence than we otherwise would, without actually testing with every single version of the base applications.
If that fails, we can automatically go back to the engineer in question.
Because all of these pipelines are automated, we tighten that feedback loop from
potentially weeks before we discover the issue to a matter of hours.
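Conceptually, a single grenade test run against one target looks something like the sketch below. The step names are hypothetical stand-ins for our pipeline stages, not real commands.

```python
import subprocess

def run(step: list[str]) -> None:
    """Run one pipeline step, raising if it fails."""
    subprocess.run(step, check=True)

def grenade_test(versions: dict[str, str], target: str) -> bool:
    """Bring up a minimal environment for one version combination and verify it."""
    try:
        run(["render-terraform", "--target", target,
             *[f"--set={tool}={ver}" for tool, ver in versions.items()]])  # manifest -> Terraform
        run(["provision-control-plane", "--target", target])   # Ansible / Kubespray
        run(["verify-control-plane", "--target", target])      # API reachable, nodes ready
        run(["install-base-apps", "--target", target])         # Helm deploys
        run(["smoke-test", "--target", target])                 # basic functionality checks
        return True    # record the combination as compatible
    except subprocess.CalledProcessError:
        return False   # feed the failure back to the engineer behind the bump
```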
So what we've discussed is an automated workflow
which has a relatively quick turnaround time. It improves developer productivity, and it verifies the tooling that will be deployed in the real world. It handles failure: in the case of failure, we identify the faulty component. We do have the option of including regression testing to check
whether it would work with the previous version. We consistently test
versions that people want to use and encourage the use of latest,
because every time we run this pipeline, it actually pulls the latest version of
all those base tools. It also
mimics the production environment. We currently don't grenade test in our
production environment live, but we have a mimic of it. We don't have any paging
involved in that, and it's not a chaos monkey, but it does provide insight into
how this would behave if it were running in production.
So the question then is, where does this fit in with other testing? It sounds
relatively similar to integration testing. So we
built grenade testing as a relatively efficient
tool to kind of fit around the other forms of testing and be something which
is a relatively broad brush, but which
is significantly easier to maintain than other environments.
So we built this probably moderately vaguely
accurate graph based on what feels right in the context of a platform where
we discuss all of the different kinds of testing that we do.
So as I mentioned, unit testing is relatively small,
relatively tied to one version of an API. You're really testing it
situationally against maybe issues that you've seen in the past. We then have
integration testing, which is good, but pretty expensive,
and everybody wants their own integration environment in our use case,
so that comes with a significant maintenance cost. And they also
don't like their integration environments being changed frequently, which is
something that we maybe want to do when we're constantly releasing new versions of
our platform. So we then have stubbed integration testing,
which is a bit of a middle ground. It's a lot easier, but it does
considerably reduce the chance of catching things if you're stubbing everything out, and you've still got the issue of version incompatibilities if something changes an API when you don't expect it. As much as I'd love
semantic versioning to work in all cases, often it's not respected
as much as we might like. We then have system testing.
This is when we're starting to get into the really good stuff. We're getting there on
how comprehensive we are, but it's still a lot of work to maintain
system testing or a system testing process. And then probably the most comprehensive thing, and the most difficult to maintain, is manual acceptance testing, which may well be the worst job ever, but it does deliver good confidence that your system is going to work. And when we get towards
the final points of a release, then maybe that's something we want to do.
So grenade testing here comes as a compromise. It's not wonderful,
it doesn't fix everything that you're going to run into, but it does make your
job a lot easier. It's relatively low cost and mostly automated.
And again, it just allows you to tighten that feedback loop at
a relatively low cost. So grenade testing is not an
open source project. It's a concept that we found useful, that we thought we'd
like to share. It's tied to your own pipelines,
and in our case it was built in a bespoke manner, but it
is something which is probably extensible, and we would support open source
implementations if someone would like to do so.
So I wanted to talk a little bit, or share a little bit, about what grenade testing looks like for us. We're very dependent on GitHub, so you can see that this is mainly driven using a pull request based model. So we have pipelines that monitor the releases of core platform tooling. In this case, we've upgraded federated infra to a new version, 10.1.6, and we've then pulled the latest versions of infrastl, the Kubernetes provisioning and the base image to test against that. And the monitoring pull request commits reports of status and failure to the master branch, so that we have a log of when things failed and when things passed.
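Conceptually, each run leaves a record along these lines in that log; the schema below is a made-up illustration, not our actual report format.

```python
import json
from datetime import datetime, timezone

# Made-up illustration of a status report a grenade test run might commit back
# to the master branch of the monitoring repository.
report = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "trigger": "federated infra 10.1.6 release",
    "versions": {
        "federated-infra": "10.1.6",
        "infrastl": "latest at run time",
        "kubernetes-provisioning": "latest at run time",
        "base-image": "latest at run time",
    },
    "targets": {"private-dc": "passed", "aws": "passed", "cluster-api": "failed"},
    "result": "failed",
}
print(json.dumps(report, indent=2))
```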
So these are then run in another environment. Again, all of our infrastructure pipelines are driven using GitOps, so all of our infra repositories use real pipelines. So we will then provision
this using generic accounts, and we then
perform all of our testing against that cluster as we would a normal
cluster. So we've kind of
tested this base infrastructure and workload control plane somewhat,
but then we have to consider the workload deployment and configuration.
Now, we actually leave a grenade
testing cluster up at any one time, the latest successful grenade testing cluster.
And that ensures that we always have one active target, which allows for plugin-style checks against the active cluster and for extensible testing.
We can automatically test a new chart version against a real cluster once it's released, and optionally automatically merge the bump into our integration environment if it works. And that allows us to reduce the drift between what we have in development and what we have in production, by automatically merging things that we are confident, or at least semi-confident, about into our integration environment.
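A rough sketch of that chart verification and promotion flow is below, assuming Helm and the GitHub CLI are available; the release naming, branch naming and flags are illustrative assumptions rather than our actual automation.

```python
import subprocess

def verify_and_promote_chart(chart: str, version: str, kubeconfig: str) -> bool:
    """Deploy a new chart version onto the live grenade cluster; auto-merge the bump if it works."""
    try:
        subprocess.run(
            ["helm", "upgrade", "--install", chart, chart,  # release name and chart ref assumed identical
             "--version", version, "--kubeconfig", kubeconfig,
             "--atomic", "--timeout", "10m"],
            check=True,
        )
    except subprocess.CalledProcessError:
        return False  # leave the bump pull request open and report the failure instead
    # Deploy succeeded: merge the hypothetical version-bump branch into integration.
    subprocess.run(["gh", "pr", "merge", f"bump-{chart}-{version}", "--squash"], check=True)
    return True
```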
So how does this affect the diagram we came to beforehand? This is a relatively trivial initial pass through, and we just add two different sections. So we have the continuous maintenance of a version of the cluster: if we have a successful grenade test run, then we'll destroy the old testing target, and we also do continuous chart verification against that running cluster. And we also have automated testing, plug-and-play automated testing, so you can write your own unit tests that will run against a cluster whenever a new grenade test is run. Now, that also allows you to test, for example, particular networking expectations that you have, or the way that you expect the cluster to behave under particular conditions. It's not designed for really, really comprehensive integration testing, but just the kind of initial indications that you might get as to whether your cluster is working as you expect it to or not.
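For a flavour of what one of those plug-and-play checks might look like, here is a hypothetical example written against the official Kubernetes Python client; it simply asserts that every node in the latest grenade cluster reports Ready.

```python
from kubernetes import client, config

def test_all_nodes_ready():
    # Assumes the pipeline has written a kubeconfig for the latest grenade cluster.
    config.load_kube_config()
    nodes = client.CoreV1Api().list_node().items
    not_ready = [
        node.metadata.name
        for node in nodes
        if not any(cond.type == "Ready" and cond.status == "True"
                   for cond in (node.status.conditions or []))
    ]
    assert not not_ready, f"nodes not Ready: {not_ready}"
```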
So, the challenges that we came across when we were implementing this initially: figuring out who to blame. We have a number of cases where we have maybe flaky APIs or we're running out of capacity, or maybe we're struggling to identify the source pull request of errors. We can identify that a new version bump was released, but maybe a version bump will include multiple pull requests' worth of changes. And then we also have the issue with
pipelines. So pipelines can be bumped independently of all
these base images and could also break the way that grenade
testing works. But it's nonetheless useful to know that your pipelines are
not working. Grenade testing, as we mentioned,
doesn't catch everything. It's not a silver bullet when it comes to testing,
especially when it comes to application configuration and specialized cluster configurations.
You still want to do comprehensive testing for that, and it's more of a complement
to other forms of testing that you might do. And it doesn't even replace
your integration environment. It's just a lower cost way of maybe
detecting issues before you hit integration.
It is also entirely dependent on the stability of your pipelines and data centers, so system maintenance does become a bit of a question. It does need oversight. Most version releases that we make are rather large, so identifying the issue within that delta can be challenging, so it does require intervention from engineers.
We also have end user application compatibility, so application configuration can break deploys. In many cases, platform components are also applications. So when we're trying to maybe roll out a CNI, for example, that could also break the cluster in various creative ways that you might discover during grenade testing, but which may not be related to the base images that you've been releasing.
We also talk about environment configuration and how that can also change things,
mentioning again config being a major
issue there, and then also validating
application behavior. We don't really do that at all within grenade testing,
and that's why we need to have normal integration testing, system testing,
and even acceptance testing as well. We never really do that validation of application behavior, merely checking that it deploys and things like that. We also have issues with noise from failed
builds. Often grenade testing can fail and can go down,
and obviously we then need to commit engineering time to maintaining it.
We also test in every cloud provider, so sometimes things will fail within AWS but won't fail in our private data center, especially when your Terraform is different between the two.
So there are some challenges when you're implementing grenade testing as well.
In general, though, we have definitely
improved our level of confidence in releases that pass grenade testing.
Previously, the first challenge when you joined the platform team was building your first cluster.
Now, your first challenge is still building your first cluster, but at least you have
the right versions, or at least you have confidence that your versions worked at least
once before you do that. We also have some automatically
generated compatibility information that enables you
to make those kinds of decisions when you're initially bringing up clusters, which was
again a big issue before we introduced grenade testing.
It's reduced our time to detect issues. Admittedly, we didn't
have great integration testing beforehand, but grenade testing does a reasonable job of kind of bridging the gap between
those two things. It acts as a data
source so we can gain information on build success rates, build frequency, et cetera,
which can be used to inform us on maybe the quality of our platform.
And we can also tag versions according to whether they
pass grenade testing and how confident we are with releases.
We don't have a test environment for each application or each component,
so it's reasonably cost effective when we're doing grenade testing, and there's no overhead of keeping these test environments up to date. We don't end up with one application team demanding that we
run a particular version because they haven't tested against the new one. It's already
consistently being torn down and replaced more frequently,
far more frequently than our integration environment. And we have some confidence
in whatever's running in master at any one time, that it
works with the rest of the platform rather than just in isolation.
So is grenade testing finished? Absolutely not.
We thought it was a cool concept that's relatively simple and kind of sufficiently different from the traditional integration testing that you might do, introducing more automation. And we think that it could be valuable to other people, whether you're a big or a small operator of cloud platform technology. That said, we would love to start a conversation on how other people do their testing within their platforms and share enhancements on top of what we've done here with the community. By generally working together, we can improve the reliability of all of these cloud platforms, which benefits
everybody in the long run.
So I hope this talk has been insightful for you or maybe raised some
questions. Thank you very much for taking the time to watch
this talk, and thank you very much to the Conf42 organizers for inviting us to talk about this. I'd just like to take a little bit of time to acknowledge JP and Lorn for their work within grenade testing that made it the reality
that it is today, and I hope you enjoy the rest of the conference.
Thank you very much for your time.