Transcript
Hey everyone, this is Matt Schillerstrom. Today I'm going to be talking
to you about revolutionizing software development by integrating chaos engineering
and feature flags for enhanced reliability and agile response.
I'm currently a product manager for feature flags at Harness,
the modern software delivery platform. But prior to this I
was a product manager at Gremlin for chaos engineering. Before that, I worked
at Target Corporation out of Minneapolis, Minnesota,
building out their chaos engineering program, and at a nuclear power plant,
ensuring it was safe and reliable.
The opportunity I'd like to talk about is with software delivery.
Imagine if your development cycle looked like this, you know:
your development team confidently pushes code to production to
solve business outcomes and customer needs. Your customers use the
software. The customers give feedback on the product they
love, and the business responds with new ideas to test.
The development team solves problems and tests new features,
gets feedback, right, and everyone is happy, and
the business continues to grow and respond to the change in customer needs
and business outcomes. Now, I probably wouldn't be talking to
you today if this is what your software delivery lifecycle looked like.
All green and happy path, right?
There are many issues that exist today just with releasing
code to production, and even before that with testing,
right? But I want you to know there are some open source solutions that
exist today, provided by the Linux Foundation through the CNCF,
the Cloud Native Computing Foundation, such as LitmusChaos,
which was donated by Harness, and
OpenFeature, which is an open source feature flagging
tool as well. Now taking a step
back, my experience has always been around
understanding how systems work. And Andy Stanley, who's a
pastor and a podcaster, says it best: if you don't
know why it's working when it's working, you won't know how to fix it when
it breaks. And if you think about it,
you know, at 2:00 in the morning when an incident happens, or when
you're releasing software, you have to figure out how to fix it,
right? You don't always know how to respond. And that's where practicing
chaos engineering and using feature flags
help you learn and understand the behavior
of your system, such that you can be confident when you release software,
and such that you know how to respond proactively
when something breaks. A recent exercise I took
around this opportunity, I took it
with my team, and I said: what if an incident happened? Let's just talk
through it. Let's not even run a chaos experiment. Let's just get a Miro
board or a piece of paper and write it out. So in this diagram,
take a 30,000 foot view, right?
You have that green happy path, and you have a red
column with a variety of different types of incidents, whether it's
a network outage or a disk failure or
something else, you name it. But if you look at this,
all the yellow boxes here are things that a
development team has to respond to or do before
they get back on those green squares of software development happy path,
right? So if an outage happens, your pager goes
off or your customer notifies you, and then you have to look at a
dashboard, then you're responding, then you're fixing,
you're testing, you're troubleshooting, you're doing all these things
when in reality you could have proactively tested this in the past,
right?
Leaning into that a little bit more, I asked my team: what are
some of the things that happen when an incident
occurs, right? And often we related this to just what
happens when you release code to production, right?
So thinking about this, like support teams get
involved, ops teams, multiple development teams, security and database
teams, they're all trying to help resolve that incident,
right? So basically that feature release that was
supposed to make your business more productive and your customer happy now becomes
an incident rather than just like a simple deploy and release,
right? So all these things are occurring, which is interesting,
right? Because you don't think about all these things.
You get used to it, right? Like you get trained and you have muscle
memory on how to respond to incidents, and you normalize
it. But let's lean into the business impact
of all of this. So let's talk about cloud costs,
right? So what impacts the bottom line?
Cloud costs. But why do cloud costs
increase? Right? Like, incidents happen, so folks have to
provision more servers and more workloads to respond
and catch up to the recovery of the system.
Speed and velocity also impact the bottom line for businesses,
as far as how fast you're delivering the right solution to
your customer. I like to talk about churn, and about training
and onboarding of new developers or existing
teammates that are just learning a new system. The inefficiency
around that affects the bottom line.
Risk, obviously, like how risk averse your company
is or your customer is, and then being
too reliable, right? Like you don't necessarily need to be 100%
reliable, or even five nines or four nines, right?
You just need to be reliable for that customer experience,
for that business outcome you're trying to solve. Some
interesting facts: here in 2023,
we saw cloud infrastructure spending increase by 23%,
which highlights the need for effective cost management strategies
within DevOps and site reliability engineering practices.
We also see an increase in cloud costs in 2024,
simply from gen AI services, right.
More and more new companies are starting and more companies
are investing in this technology, but they're quick to
use it, right? So they're over-provisioning workloads
and servers just to support the business cases that they need.
Another business impact here is velocity, or lack
thereof. But continuous delivery isn't good enough,
right? So CI/CD gets you
so far, but it ends at the production deployment.
And what I like to get at here is there's risk in large deployments
that feature flags can solve.
Currently, with continuous delivery, one bad feature equals a
full rollback of a system, and there might be 15 features
within that deploy, right? Deploy and release are
the same. Developers don't have that control in
production that they want. They're babysitting that deployment out
there, babysitting the testing, and kind of nervous to
deploy it to prod because it's also releasing to customers,
right? So then the production issues affect all the users
and you can't resolve the issue well in prod, right? So you
have to roll back everything. And then ultimately, once you
get going with CI/CD, you have that diminishing return,
right? So the slow, cumbersome deployments, lack of production
governance, and then you get more tech debt and rework, right?
So what we see as an industry with software delivery is reduced
velocity, increased risk, and
ultimately poor developer experience, which I care deeply about,
right? So again, just to repeat: big deployments and
rollbacks equal more rework and fewer features.
And no control in production means the deployment must be perfect,
right? So again, CI/CD helps us get there, but now you're
worried about that release being perfect for your customer.
And the poor developer experience is that tightly coupled
deploy and release adds stress and toil to the developer.
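To make that decoupling concrete, here's a minimal sketch of what putting a new code path behind a flag can look like with the OpenFeature Go SDK. The service name and the flag key new-checkout-flow are made up for illustration, and in a real setup you'd register a provider (flagd, Harness, and so on) instead of relying on the SDK's default no-op provider, which just returns the default value:

```go
package main

import (
	"context"
	"fmt"

	"github.com/open-feature/go-sdk/openfeature"
)

func main() {
	ctx := context.Background()

	// In a real service you'd register a provider here (flagd, Harness, etc.).
	// Without one, the SDK falls back to a no-op provider that returns the
	// default value, so the old code path stays live.
	client := openfeature.NewClient("checkout-service") // hypothetical service name

	// The deploy can ship this code at any time; the release is the flag flip.
	enabled, err := client.BooleanValue(ctx, "new-checkout-flow", false, openfeature.EvaluationContext{})
	if err != nil {
		// Evaluation problems also fall back to the safe default.
		enabled = false
	}

	if enabled {
		fmt.Println("serving the new checkout flow")
	} else {
		fmt.Println("serving the existing checkout flow")
	}
}
```

If that one feature misbehaves in production, turning the flag off is the rollback; the other fourteen features in that deploy keep running.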
But now let's talk about reliability and resilience with the business impact.
So, common Kubernetes failure modes that exist today:
system instability, resource exhaustion, resource contention,
configuration errors, scaling issues. These all exist
in Kubernetes, which is supposed to be a self-healing system, right? But the
applications you deploy on Kubernetes might not survive
these instabilities, right? And that's where resiliency
patterns are being used in code and infrastructure today
to handle those failures gracefully, either degrading the experience
or giving a warning message to users.
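As a sketch of what degrading gracefully can look like in code, here's a hypothetical Go handler that calls a downstream recommendations service with a short timeout and falls back to a static default when that dependency is slow or unavailable. The service URL, endpoint, and fallback value are invented for illustration:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// fetchRecommendations calls a (hypothetical) downstream service that might be
// suffering from resource exhaustion, a pod restart, or a network blip.
func fetchRecommendations(ctx context.Context) ([]string, error) {
	ctx, cancel := context.WithTimeout(ctx, 300*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://recommendations.internal/api/v1/top", nil)
	if err != nil {
		return nil, err
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var items []string
	if err := json.NewDecoder(resp.Body).Decode(&items); err != nil {
		return nil, err
	}
	return items, nil
}

func handler(w http.ResponseWriter, r *http.Request) {
	items, err := fetchRecommendations(r.Context())
	if err != nil {
		// Degrade the experience instead of failing the whole page:
		// serve a static fallback and flag the response as partial.
		items = []string{"bestsellers"}
		w.Header().Set("X-Degraded", "true")
	}
	fmt.Fprintf(w, "recommendations: %v\n", items)
}

func main() {
	http.HandleFunc("/", handler)
	if err := http.ListenAndServe(":8080", nil); err != nil {
		panic(err)
	}
}
```

A chaos experiment that slows down or kills that dependency is how you find out whether the fallback actually fires before a real incident does.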
But what I'm seeing in the industry is a lack of testing,
right? And when I bring up chaos engineering and
proactively testing your system, people kind of laugh at it, right?
They don't see it as a priority. But what I like to speak about
is that technology and standards change. Similar to
this, child car seats from 56 years ago have evolved,
and what seemed okay then is humorous now. So looking
back five years from now, we might be like: of course
chaos engineering is required for all this testing, right?
It's silly to think you don't have to do anything proactively.
And when I try to box these failure modes in,
I always like looking at this type of a table, right, where we talk about
known failure modes or unknown failure modes. These are all
questions that I like to ask my team, and
chaos engineering can help answer them, right? Like, what are my single
points of failure? Where does my system tip over?
What happens when a Kubernetes pod restarts? Or does
my system scale appropriately during peak traffic? Did I
configure my dashboard correctly? Does my paging system work?
These are all questions that you should have. And if you're not doing chaos
engineering, you're not going to know the answer. And that's where integrating
chaos engineering and feature flags helps you.
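To make a question like "what happens when a Kubernetes pod restarts" testable, a tool like LitmusChaos will run a pod-delete experiment for you, but the core idea fits in a small sketch. This hypothetical Go snippet uses client-go to delete one random pod matching a label, so you can watch whether the deployment self-heals and whether your dashboards and pager actually notice; the namespace and label selector are made up:

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; inside a cluster you'd use rest.InClusterConfig().
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	namespace := "demo"                // hypothetical namespace
	selector := "app=checkout-service" // hypothetical label selector

	pods, err := clientset.CoreV1().Pods(namespace).List(ctx,
		metav1.ListOptions{LabelSelector: selector})
	if err != nil || len(pods.Items) == 0 {
		panic(fmt.Sprintf("no target pods found: %v", err))
	}

	// Pick one pod at random and delete it, then observe: does the deployment
	// recreate it, do requests keep succeeding, does your paging system fire?
	victim := pods.Items[rand.Intn(len(pods.Items))]
	if err := clientset.CoreV1().Pods(namespace).Delete(ctx, victim.Name,
		metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	fmt.Printf("deleted pod %s; watch your dashboards and alerts\n", victim.Name)
}
```

LitmusChaos packages this kind of fault up as a declarative experiment with steady-state checks, so you don't have to hand-roll it, but the questions it answers are the same ones above.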
So with this, I have a software release workflow,
and this is where you can really integrate chaos engineering and feature flags.
So for step one, think about this. Devs write
their code, but any changes are put behind a feature flag,
right? And the value there is that you can deliver code on time,
test features for impact. Step two,
DevOps releases the code to production, and they can
test in production, and nothing has changed, right? But the value there is that you
can deploy on time, every time,
with no change to your process, and test failure safely.
Then, step three, product managers can decide who gets
the new feature and who doesn't. And the value there
is that you control release variations, and you don't roll back,
you just turn the flag off, as in the sketch below. Step four,
product and development decide what to iterate on.
And this is faster and more collaborative with feature flags,
right? And the value there is that you can iterate on features faster
and have higher feature quality, right? So it's this continued process
of integrating feature flags and chaos engineering
into your development process. And again,
tools that you can use for this are OpenFeature and
LitmusChaos.
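Picking up step three, here's a hedged sketch of how that targeting decision and the kill switch might look with the OpenFeature Go SDK: the evaluation context carries who the user is, the flag provider's targeting rules decide whether they get the new feature, and turning the flag off in the provider is the "rollback". The flag key, attributes, and cohort name are all hypothetical:

```go
package main

import (
	"context"
	"fmt"

	"github.com/open-feature/go-sdk/openfeature"
)

func main() {
	ctx := context.Background()
	client := openfeature.NewClient("checkout-service") // hypothetical service name

	// Product managers decide who gets the feature via targeting rules in the
	// flag provider; the code only supplies context about the current user.
	evalCtx := openfeature.NewEvaluationContext(
		"user-42", // targeting key, hypothetical
		map[string]interface{}{
			"cohort": "beta-testers", // hypothetical attribute used in targeting rules
			"region": "us-east",
		},
	)

	enabled, err := client.BooleanValue(ctx, "new-checkout-flow", false, evalCtx)
	if err != nil {
		enabled = false // fail closed to the existing experience
	}

	if enabled {
		fmt.Println("user-42 gets the new feature")
	} else {
		// If the feature misbehaves, flipping the flag off in the provider
		// lands everyone here: no redeploy, no rollback.
		fmt.Println("user-42 gets the existing experience")
	}
}
```

And this is where the two practices combine: run a chaos experiment against just that small cohort, and if the flagged path can't survive a pod restart or a slow dependency, you flip the flag off while you fix it.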
So today, I hope you understand that integrating chaos engineering and feature flags
enhances your software reliability and your ability to respond.
If you need to contact me, please reach out to me on LinkedIn,
and I'm happy to engage in a conversation with you to help make your journey
in software reliability safer and ultimately more fun.
All right, thanks.