Transcript
Hi there, this is Chaos Engineering and Service Ownership at
Enterprise Scale. My name's Jay and I'm a lead chaos engineer
at Salesforce. I've been with our site reliability team for about
six years, working on public cloud operations, incident response,
and our Chaos game day program. Now I'm an engineer on
the Chaos platform team where we're delivering a toolkit to service
owners that lets them safely run chaos experiments whenever
they'd like. I'm here today to share with you some of what
we've learned while shifting left and scaling up in the public cloud.
But our story actually starts about nine years ago in
2015. In those days, Salesforce infrastructure
was hosted out of first party data centers. We had
a simpler application footprint, security controls and ownership
models. In short, SRE owned the availability
of pretty much everything, and SREs had widely
scoped privileged shell access to be able to log into servers
and take actions for incident mitigation or maintenance and so
forth. Our Chaos game day program made use
of this, and we would rely on someone with that privileged shell
access to log in to a server and shut it off
or kill processes on the node. Or sometimes we
relied on a tight partnership with network and data center engineers
to do things like turn off network switch ports, or remove power
from hosts manually. So a game day back
in 2015 might have looked like this, where an SRE
would shell in and run a privileged command, and at
the same time the game day team would be observing critical
host or app metrics and writing down their findings.
But at the same time, Salesforce was undergoing some other transformations
that involved infrastructure and service ownership.
New business needs demanded larger and more flexible infrastructure, and this
meant things like new products and sales growth, as well as new
regulatory requirements across the world. So this leads to
the birth of the public cloud infrastructure known as Hyperforce.
And for internal engineers, Hyperforce enforces a lot of new internal
requirements and operational practices. The new infrastructure
brought a bevy of foundational shared services that
were used to ensure a safe and zero trust sort
of environment that developers would build on top of.
It also eliminated most manual interactive touch
points, such as shell access. Underlying these
changes was also the embrace of service ownership.
Salesforce decided early on in the process that service owners would
be responsible for the end to end operational health of their service and would
no longer be able to sort of give that burden to SRE.
So in short, we all had to learn new ways of working in the
public cloud, which simply didn't apply to our first party data centers.
And while we reimagined the infra and app ownership models, our approach
to operational practices such as chaos engineering needed
to be reimagined as well. We
started to imagine what chaos engineering as a part of service ownership
would look like, and we uncovered a few challenges.
First off, in a service ownership world,
SRE has less of a centralized role than before.
A centralized game day team would have difficulty learning all
the architectures and edge cases of new or
newly redesigned services for the new public cloud architecture.
And finally, like I mentioned, new technical constraints around privileged
access made previous chaos approaches unsuitable
for Salesforce.
But if we started to look at shifting our approach, we could frame
these issues differently. First off, service owners
know their service better than anybody else, so they'll be the best equipped to
talk about potential failure points as well as potential
mitigations for those. Also,
shifting left in the development cycle should reduce turnaround time
on discovering and fixing issues. Catching a problem earlier
in the development lifecycle lets a service owner respond to it faster
than if they had waited until it reached prod and a customer found
it. Finally, we decided that we should deliver a
chaos engineering platform that lets service owners run chaos
experiments safely and easily, eliminating the need for
manual interactive touch points.
And this was a great idea and it's working out well for us. But there
were some major scale and shift-left challenges to address: the size
and shape of our AWS footprint, the need to granularly attack
multi-tenant compute clusters, the need to
discover inventory and manage RBAC access,
and finally, maintaining safety, observability, and measurable
outcomes while doing all of the above.
Let's talk specifically about our AWS footprint. The core
CRM product that most people think of when they think of Salesforce
is hundreds of services spanning roughly 80 AWS accounts.
And for one contrived example, a service might
have their application container running on one shared
compute cluster in an account over here, but then might have
their database and cache and S3 buckets, all in different
accounts. So to begin with, it's infeasible for humans
to log into every AWS account we have for core and do
failure experiments. At the very least, we need to script that.
So that translates to a requirement: we need a
privileged chaos engineering platform that can run AWS attacks
in multiple accounts simultaneously.
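To make that concrete, here's a minimal sketch of the cross-account fan-out pattern, assuming each target account exposes an assumable role for the platform; the account IDs, role name, and "service" tag are hypothetical placeholders, not our actual platform code.

```python
# Minimal sketch: fan a single chaos action out across AWS accounts.
# Assumes each target account has an assumable role (the name
# "ChaosAttackRole" is hypothetical) scoped to just the attack actions.
import boto3

TARGET_ACCOUNTS = ["111111111111", "222222222222"]  # illustrative account IDs
ROLE_NAME = "ChaosAttackRole"                        # hypothetical role name


def session_for(account_id: str) -> boto3.Session:
    """Assume the chaos role in a target account and return a session."""
    creds = boto3.client("sts").assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/{ROLE_NAME}",
        RoleSessionName="chaos-experiment",
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )


def stop_service_instances(account_id: str, service_tag: str) -> None:
    """Example attack: stop the EC2 instances owned by one service in one account."""
    ec2 = session_for(account_id).client("ec2")
    reservations = ec2.describe_instances(
        Filters=[{"Name": "tag:service", "Values": [service_tag]}]
    )["Reservations"]
    instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)


for account in TARGET_ACCOUNTS:
    stop_service_instances(account, service_tag="example-service")
```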
The second challenge is around our multi-tenant Kubernetes clusters that I
alluded to before. Services here are deployed
across many namespaces, but also many clusters,
and so service owners should be able to attack their service in
their service's namespace, across the multiple clusters that
they might be deployed into. At the same time, service owners shouldn't
be allowed to attack the shared services or the cluster worker nodes themselves.
And finally, service owners may know or care less about Kubernetes
infrastructure than we assume.
So to translate to requirements, we need a privileged chaos
engineering platform that can orchestrate attacks in multiple namespaces and
clusters simultaneously, just like how we need to drive
multiple AWS accounts. Also, we need
the platform to provide failures without requiring
ad hoc cluster configs and service accounts. Basically, the service
owner should not need to know all the inner workings of the Kubernetes
cluster to be able to do chaos attacks on their service.
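As an illustration of what that can look like, here's a minimal sketch of a namespace-scoped pod-kill fanned out across clusters using the kubernetes Python client; the cluster contexts, namespace, and label selector are hypothetical, and a real platform would resolve them from inventory and enforce RBAC on the caller.

```python
# Minimal sketch: kill one pod of a service in its own namespace, in each
# cluster the service runs in. Contexts, namespace, and label are
# illustrative stand-ins; nothing outside the namespace is touched.
import random

from kubernetes import client, config

CLUSTER_CONTEXTS = ["prod-cluster-a", "prod-cluster-b"]  # hypothetical kubeconfig contexts
NAMESPACE = "example-service"                            # the service's own namespace
LABEL_SELECTOR = "app=example-service"                   # hypothetical label


def kill_one_pod(context: str) -> None:
    """Delete a random pod of the service in one cluster."""
    api = client.CoreV1Api(api_client=config.new_client_from_config(context=context))
    pods = api.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items
    if not pods:
        return
    victim = random.choice(pods)
    # Scoped to the service's namespace only; shared services and the
    # cluster worker nodes stay outside the blast radius.
    api.delete_namespaced_pod(victim.metadata.name, NAMESPACE)


for ctx in CLUSTER_CONTEXTS:
    kill_one_pod(ctx)
```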
The third challenge is around inventory and role-based access.
It's obviously a challenge to discover and account for all the different
resource types that a service team might own.
As I've said, we're spanning nearly 80 AWS accounts,
and within those we've got EC2 instances and S3 buckets
and RDS databases in addition to the containers.
So discovering and making a sensible inventory of those
is a challenge. Also, enforcing role-based
access is a challenge in its own right. We need to control the blast radius
and know that when a service owner logs into the platform, they should
only be allowed to do chaos attacks on a service that they own,
again removing shared infra and service components from
the blast radius. So for requirements,
the chaos engineering platform should discover and group
all sorts of infrastructure resources and
applications. Also, the chaos platform should integrate with
SSO to match service owners to their services. And our
chaos platform needs to make use of the opinionated Hyperforce
tagging and labeling scheme to make sure that
we're grouping services and service owners together appropriately.
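To sketch what that discovery and grouping could look like, here's a minimal example built on the AWS resource tagging API; the "service-owner" tag key is a stand-in for the real (internal) Hyperforce tagging scheme, and the SSO matching is only described in a comment.

```python
# Minimal sketch: build a per-service inventory of one account's resources
# by tag. The "service-owner" tag key is a hypothetical stand-in for the
# opinionated Hyperforce tagging and labeling scheme.
from collections import defaultdict

import boto3


def discover_inventory(session: boto3.Session) -> dict:
    """Group every tagged resource in one account by its owning service."""
    tagging = session.client("resourcegroupstaggingapi")
    inventory = defaultdict(list)
    for page in tagging.get_paginator("get_resources").paginate():
        for resource in page["ResourceTagMappingList"]:
            tags = {t["Key"]: t["Value"] for t in resource["Tags"]}
            owner = tags.get("service-owner")  # hypothetical tag key
            if owner:
                inventory[owner].append(resource["ResourceARN"])
    return inventory


# After SSO login, the platform would only surface (and allow attacks on)
# the inventory entries whose owner matches a service the user owns.
print(discover_inventory(boto3.Session()))
```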
And the fourth challenge is around safety, observability and outcomes.
We started to ask ourselves questions like what would happen if
there was an ongoing incident or maintenance or patching. It might be
unsafe for service owners to run experiments at this time.
And also, how should service owners measure the success of
their chaos experiments? How do we track improvement? So,
translating to requirements, the chaos engineering
platform should integrate with our change and incident management database
and refuse to attack if it's unsafe.
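A minimal sketch of that pre-flight safety gate might look like the following; the change-management endpoint, its URL, and its response shape are hypothetical placeholders for the internal integration.

```python
# Minimal sketch: refuse to run an attack if the target service has any
# open incident, change, or patch activity. The endpoint and response
# shape are hypothetical placeholders for an internal integration.
import requests

CHANGE_API = "https://change-mgmt.internal.example/api/v1"  # hypothetical URL


def is_safe_to_attack(service: str, environment: str) -> bool:
    """Return True only if no open change or incident touches the target."""
    resp = requests.get(
        f"{CHANGE_API}/activity",
        params={"service": service, "environment": environment, "status": "open"},
        timeout=5,
    )
    resp.raise_for_status()
    return len(resp.json().get("items", [])) == 0


def run_experiment(service: str, environment: str, attack) -> None:
    if not is_safe_to_attack(service, environment):
        raise RuntimeError(f"Refusing to attack {service}: open change or incident")
    attack()
```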
And for the observability point, service owners should
measure their chaos experiments through the same SLOs and monitors that
they use in production. This creates a positive feedback
loop so that learnings can be fed back into SRE
principles for production operations, but also used
to evaluate the service health and make tweaks coming
out of the chaos experimentation.
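As a simple illustration of judging an experiment by the production SLO, here's a sketch; the SLO name, objective, and event counts are made up, and a real run would pull the good/total counts from the service's existing monitors.

```python
# Minimal sketch: evaluate an experiment window against the same SLO the
# service uses in production. The SLO and event counts are made up.
from dataclasses import dataclass


@dataclass
class Slo:
    name: str
    objective: float  # e.g. 0.999 means 99.9% of requests must be good


def slo_met(slo: Slo, good_events: int, total_events: int) -> bool:
    """Return True if the service met its SLO during the experiment window."""
    if total_events == 0:
        return True  # no traffic observed, nothing to judge
    return good_events / total_events >= slo.objective


# Example: an availability SLO over a 30-minute pod-kill experiment.
availability = Slo(name="example-service-availability", objective=0.999)
print(slo_met(availability, good_events=99_950, total_events=100_000))  # True
# If operators saw customer pain but the SLO still passed (or vice versa),
# that disagreement is the cue to refine the SLO definition itself.
```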
So if you're facing similar challenges, these are my recommendations
for a self service chaos platform. Number one,
pick a tool that is multi-substrate and can support future flexibility.
You never know what sort of regulatory or business changes are ahead,
and for example, you might acquire a company that uses
a different sort of cloud. I've talked about AWS quite a
bit today, but Hyperforce is designed to run on other public cloud
providers, and if we need to support another provider tomorrow,
we want to be able to do that. The chaos tooling that we've got
supports that. Second, make use of RBAC and tags
and labels, et cetera, to control the blast radius and limit
your attack access. There should be no question who's
logging into a platform and what they are allowed to run an attack against.
Third, consider prioritizing extensibility to integrate
with custom systems like we did for our change management system.
Anything that offers custom extensions, webhooks,
or subscription-style notifications can be a good candidate.
Fourth, seek out a sophisticated toolbox of attacks to
support both large-scale, game-day-style experiments
as well as precision granular attacks on individual
services. And finally, use SLOs.
Make them part of your hypotheses and make sure that service owners
observe experiments as they would observe production. Anytime
there seems to be a disagreement between the outcome of a chaos
experiment and the SLO, that's an opportunity to
improve the definition of that SLO.
And I want to talk briefly just about the ongoing role of game day exercises,
because I don't mean to imply that we don't do game days anymore.
We absolutely do. In this case, we optimize game days
for purpose and for expertise. So service owners
can take charge of concrete technical fixes on their own service because
they're so close to it and they understand exactly what the failure
modes can be and what potential solution paths are easiest.
But game day teams that have a wide ranging view of
the infrastructure can support things like compliance exercises
or run organizational and people chaos. So,
for example, you might test your incident response mechanism,
but make sure that the first person paged is unavailable.
That could reveal some interesting insights about your incident response process,
your pager readiness, et cetera. The same goes,
I would suggest, for shared IT service chaos.
So consider attacking your wiki or your operational runbook
repository, right? What happens if the document
repository with all of the instructions for incident remediation is
slow or unavailable? I bet you'll come up with some ways
that your organization can build redundancy for that during incident
response. And finally, game day
teams continue to have a role in tabletop exercises to
help service owners scope their attacks. Because of their wide-ranging view,
game day teams can suggest issues to service owners that
have plagued other services in the past and give them recommendations
on running in the public cloud generally. So with
that, I want to say thank you. I hope this was useful and helped
shine a light on some of the things that we experienced as we shifted
left and scaled up. We're really passionate about giving this
power to service owners because everybody has a hand in chaos and
everybody has a hand in operating their service at scale. Thanks so much and
have a great rest of the conference.