Abstract
This talk covers Citi’s journey to adopting Chaos Engineering and the benefits and challenges encountered. It also offers a very lightweight guideline model comparing industry tools with Citi’s in-house products, and weighs the value proposition in terms of cost, compliance, SDLC, support, etc.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
My name is Charles Acol, and I work at Citibank.
I manage the USPB
SRE team. And our next
phase is transforming our L2
production support into an SRE (site reliability
engineering) team. As part of that, chaos engineering is a very
core principle that we like to adopt.
So chaos engineering really is a discipline of experimenting
on a system in order to build confidence in the
system's capability to withstand turbulent conditions in production.
If you Google that concept,
it exists under principlesofchaos.org.
But to give some background,
chaos engineering was pioneered at Netflix roughly around
2010, when they migrated from their
legacy hardware to AWS, and they started inducing
artificial failure. Or at least this is one avenue
where we see the initial use of it.
Fast forward to today, 2023: it's a common
practice in the industry, across tech
giants like Google and LinkedIn, and also across the
banking industry. And you
have Kolton Andrus from Netflix and Matthew from
Salesforce. They joined forces and
started Gremlin. Gremlin is a tool we also use,
which I'll briefly touch on. Today there is
also a component of Lightspeed,
which is Harness. That's another very nice
tool that can be used for chaos engineering. Of course,
there are other tools in the industry, and other flavors,
but we also have in-house tools. But in
general, that's the idea behind it. The
benefits of chaos engineering are to promote innovation,
elevate partnership, improve incident response,
generate knowledge, and really increase reliability
and resiliency, measuring the ecosystem
in terms of what it means to the customer.
We do several different flavors. One is a production
game day, which we do once a month. As part of that
exercise, we've written tons of Ansible playbooks
so that, through an in-house tool, we do a one-touch
failover, and then we run for the entire day out of a single data
center. So we're really doing a
production stress test of the ecosystem to see what
threshold it can tolerate. We
have tons of applications that we do that to. Then
we get the results and measure them; we have automated measurement,
and we're now working toward automated normalization as
well. Another flavor that we do in production is called
Wheels of Misfortune. Wheels of Misfortune is a very fun exercise.
What we do is
gather teams across sectors, from incident management,
problem management, product, SREs, and performance,
and we usually meet every couple of weeks for
30 minutes. We pick topics, usually major
outages, or we can go all the way to cybersecurity. After that,
we conduct one exercise
every quarter and record it. It
helps because the stress level gets elevated when you have
outages. We have a lot of takeaways, from
mean time to recover to questions like: do we have the right architecture
diagram? Do we have the right factoring? Any opportunities
for improvement? So it's kind of like a role play,
and you would have volunteers as non-player characters.
So that's a fun exercise. We record it and measure it
and come out with certain results to see what
improvements can be done. We also do chaos engineering
in the lower environment. We do it in Gremlin.
Gremlin is one of the industry-standard platforms,
available as SaaS. It allows you to inject failures
at various layers of the system. It can assess robustness
using one of several different attacks.
You can run Gremlin on legacy physical
JVMs on Linux servers, and then you
measure. Let's say you have
100 transactions per second within
a 30-minute time frame. At the ten-minute mark
you're measuring the average response time. Then you invoke a
high-CPU attack and observe whether there is
any impact on I/O, and so on and so forth. We do that
on a quarterly basis as well. What we're doing now is
integrating Gremlin with OpenShift, where you can
measure the pods and see the different
types of attacks that can be done. The other tool
that we use is Chaos Monkey. Chaos Monkey is
one of the original tools created by Netflix,
and it's one of my favorite tools
as well. It simulates failures by randomly
terminating instances. So you can stop one
of the namespace instances in OpenShift, for
example, or PCF, Google Cloud, Foundry, Dell, or
AWS, or whatever ecosystem you're in. And then you measure
what happened to the other layers or what
the customer experience was. We do have another
tool called Ape Army, an in-house tool where we
execute different types of chaos. You can also do basic
manual tests where you manually manipulate
the environment. You can change the YAML file.
If you have services that are Java-based, you can change the configuration or
a parameter to disable specific
components, or do a restart and measure the behavior.
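As a rough illustration of the measure-then-attack pattern described in this talk (a steady load, a baseline average response time, then the same measurement while a fault is active), here is a minimal Python sketch. It does not use the API of Gremlin, Chaos Monkey, or any in-house tool. The response times are simulated, and `simulate_response_times`, `compare_to_baseline`, the 1.5x slowdown, and the 20% tolerance are all hypothetical choices made for the example.

```python
import random
import statistics


def simulate_response_times(n, base_ms, jitter_ms, rng, slowdown=1.0):
    """Return n simulated response times in ms (a stand-in for real monitoring data)."""
    return [(base_ms + rng.uniform(0, jitter_ms)) * slowdown for _ in range(n)]


def compare_to_baseline(baseline, under_fault, tolerance=0.20):
    """Compare the average response time before and during an injected fault."""
    base_avg = statistics.mean(baseline)
    fault_avg = statistics.mean(under_fault)
    # Relative slowdown of the average response time while the fault is active.
    degradation = (fault_avg - base_avg) / base_avg
    return {
        "baseline_avg_ms": base_avg,
        "fault_avg_ms": fault_avg,
        "degradation": degradation,
        "within_tolerance": degradation <= tolerance,
    }


# Hypothetical experiment: 1000 requests before and 1000 during a fault
# that is assumed to slow every request down by a factor of 1.5.
rng = random.Random(42)  # fixed seed so the simulation is repeatable
baseline = simulate_response_times(1000, base_ms=100, jitter_ms=20, rng=rng)
under_fault = simulate_response_times(1000, base_ms=100, jitter_ms=20, rng=rng, slowdown=1.5)

result = compare_to_baseline(baseline, under_fault)
print(result)
```

In a real experiment the two lists would come from monitoring data captured before and during the attack window, and the tolerance would be whatever threshold the team has agreed the service should withstand.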
So those are different types of attacks that can
be done. You can do resource
attacks: high CPU, high memory,
or high I/O load. There are also
state attacks, where you can do a shutdown, a process
kill, or time travel. You can do network
attacks: latency, packet loss, or a
black hole of network connectivity. One of the other tests
that we do, not in production, is to
shut down one of the core services,
whether it's a major database or a core mainframe component,
and start measuring what that means to the end user.
Did they get the right message? Was their
data available to them? So there are different flavors around
it. And if you have any questions,
please let me know. But thank you for listening in
and enjoy the conference.