Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey, guys. My name is Joyce. This is my dog, Lucy. She's a
year old, and I am in from San Francisco.
She's alone. So I'm going to head back tomorrow to see Lucy.
So my talk is, who owns Chaos? Who is responsible for chaos?
Again, my name is Joyce. I'm a developer advocate at Postman.
So thank you, John and Manuel, for demoing Postman earlier this morning.
I'm not going to be talking about Postman, but it is an API
development platform used by more than 8 million developers.
So one of the best parts of my job is that there's a ton of
people who use Postman, and I get to talk with all of them
and find out what are they working on and how are they doing it.
So this is kind of a side project of mine, chaos. I've been interested in
it. So I've been going to community days, conferences, and one of the questions
that I had when I was in the audience listening to people talk
about chaos engineering, who, who, who, who, who?
Who is responsible for chaos? In the chaos engineering Slack
group, they've posted a diagram of
the people and the tools in this chaos community. These are the famous people.
Let's take a quick look at this data and break it down.
So what job titles are doing chaos? Who here identifies as being
an engineer in this room?
Okay, almost everyone here. So if
you look at that diagram of the tools and the famous people in
chaos, most of them include the word engineer
in their job title. There are specialized roles or vanity roles, as Villas
will call them, like chaos engineer. You have site reliability engineers
that also might handle chaos. And about a third of the
community comes from functions like security or ops or
R&D. So typically, the folks
that are most motivated to start a chaos program
are the ones who feel the pain of a failure in production.
So if you're on call, Kolton Andrus,
CEO at Gremlin, says it boils down to who gets paged. If that's an
SRE or ops team, they have the most incentive to start doing this
work and making their lives better. And this is how Kolton personally
started off doing chaos engineering when he was over at Netflix. So when
you're thinking about roles and responsibilities, which typical responsibilities
do the folks who are interested in chaos have?
So we have chaos specialists, those vanity roles.
You have a dedicated team for chaos engineering, and it might actually be
a core competency for your business. Other companies
have SREs or production engineers. They handle
continuous deployment and production support. So at Postman engineering, we have
a microservice architecture and the ones that are responsible for
deployment and uptime are the developers that are building the services themselves.
If your team has a traditional DevOps department,
they might be doing the deployment and uptime.
Other people who care about chaos might be responsible for
incident management. So Russ called it earlier this morning, he called
it a post mortem analysis. Right. But the
difference here between incident management and chaos is that you're shifting
the focus from the post mortem to the pre mortem, and you're actively proactively
trying to prevent errors from happening.
So at companies that have chaos engineers,
there are some teams that have domain knowledge,
right? So you're talking about the data, the storage, or the networking teams.
And lastly, we think about who else is responsible for
chaos, and this one might be a little bit controversial here:
those that have responsibility for testing in production.
Who here tests in production?
Okay, I see, like seven hands.
So this is the best environment for this. It has the most
information to accurately recreate situations
and demonstrate the true consequences of your attacks.
But some companies can't test in production. So if you have
finance or healthcare, you might have compliance
issues. You can't take down customer data or it's very,
very costly. So these are the general responsibilities of folks who do chaos.
Which roles tend to have these responsibilities? When you translate these responsibilities
to roles, a lot of the people doing chaos tend to
be quality-driven or production-focused operations engineers.
So here's my question. This is all good. Chaos engineers
are running chaos tests. They're identifying vulnerabilities,
and they're automating these experiments.
My question, why aren't testers doing chaos?
Does anyone also identify as a tester?
Okay. About the same amount of hands.
Okay. So before there was chaos engineering, there was chaos
testing. So if you go back and look at the earliest blog posts,
when Netflix first introduced Chaos Monkey, they actually called it chaos
testing, and they introduced it to the test community. So it makes
sense if you think about the traditional software development lifecycle,
you should be introducing the responsibility of resilience or quality
earlier in that cycle, when the cost of bugs is the lowest.
That's a noble goal. But as we talked about earlier, the testers
aren't the ones on call. They're not rolling out hot fixes
in production. So this is the reason why we see a bunch of
SREs and ops people pioneering the work in chaos engineering.
But because you want to build resilience a little bit earlier in the software
development lifecycle, we actually see a very new trend
of testers, people who solely identify as testers,
who are now focused on production testing in addition to
what we imagine that they focus on, pre-release testing.
So one such test engineer said the biggest
limitation in the fear of delivering software faster is the focus
on adding more pre-release testing.
Abby Bangser says chaos engineering is all about building trust, that your
systems are resilient and the mean time to recovery is
acceptable. She goes on further to say chaos engineering is
all about building confidence that we aren't fragile.
We have less fear that any one change will bring down our system.
And when issues do occur, we know how to triage and
deploy fixes faster. This is Abby Bangser, based in London
here, one of the very few testers I found that was approaching
and trying to get a Chaos program launched at her company. So why
aren't more testers doing chaos today? Have you even heard of such a thing?
So I've talked to testers at events like these that are curious
about chaos, but they're still spending
most of their time on pre-release testing.
And it's very rare, but especially at events like these, you find people that have it as a
side project or a passion project. We're starting now
to see chaos experiments created by SREs
and then run and automated by testers. So you can
see that this is an anonymous attribution here, but I've talked to a
few very large companies that do have this kind of workflow. And so we see
these emerging programs that are being pushed
by the testing function. But job titles aside, very few people
here identify solely as being a tester. Job titles aside,
who can start a chaos program? Who gets the ball rolling on
chaos? Okay, so who has the insights?
Who knows about potential vulnerabilities and
how to properly structure a chaos experiment?
Insights. Who has the access to pull the plug in
case you need to roll it back? Insights, access.
And lastly, who has the organizational pull to
convince management and adjacent teams to support
this chaos program? So this one's probably going to be the hardest
lever to pull. And it doesn't matter where you are, what function you're in,
what industry you're in, what company you're at, it becomes
actually a lot easier if you have a catastrophic failure. So, at the
last chaos event that I was at, I met somebody who told me,
"There's now a directive from our CTO to start a chaos
program." Note, this is a director of test, after they lost
$600 million in 22 minutes. Not going to talk about this,
but come grab me later to hear the gory details.
So this is the easiest way to start a chaos program when you
have a catastrophic failure. But for the rest of us,
do I need to wait for this? No. Clearly no.
Start thinking about chaos in order to prevent it.
And for you, if you're thinking about starting a Chaos program,
Casey Rosenthal has some advice for you. Casey says perhaps
aggregate bits and pieces from different frameworks that appeal to you and then create
a practice around it. You'll likely be the first person to
create a similar practice in your particular context. And he
goes on further to say, I wish you the best of luck in that undertaking,
but I wouldn't wager that you get it right on your first try or your
second. So be prepared for failure.
Who owns Chaos? Final thoughts here.
More teams and functions, new functions, are thinking about chaos engineering
because, to use Russ's buzz phrase from earlier,
they're going cloud native, stuff is getting complicated,
and they're thinking about how chaos testing can complement or
augment traditional testing. So as your
teams begin to cover the bases when it comes to pre-release testing,
we see them spending more time in production, testing in production.
And now we start to see chaos experiments created by
SREs and then run and automated by testers.
And lastly, a valuable chaos test.
Villas was talking about this a little bit earlier, alluding to it, a valuable
chaos test will not only teach you about your systems, but also your team.
So as you're thinking about building more resilient software, also think
about building resilience into your organization,
with your people, with your culture, celebrating the failures instead of hiding
them and sweeping them under the rug. That will impact the overall
resilience of your systems. Thank you.