Transcript
Hello everyone. Welcome to this session on SRE enablement through chaos engineering at Conf42 SRE 2023. I am Chandra Dikshit, an SRE architect at HCL Cloud Native Labs, London. In this session, I'll be walking you through our chaos engineering journey, through which we have enabled SRE upskilling and SRE practice adoption across various delivery teams and clients here at HCL Tech.

So first of all, let me take you through what we do here at HCL Cloud Native Labs. We are the thought leaders for cloud native programs, strategies and cloud native engineering here at HCL Tech. We are located in three locations: the US, the UK and India. We work across four key areas, as you can see on the screen: strategy and direction, art of the possible, adoption and enablement, and cultural transformation.
You can see that the range of areas and practices we work across is the complete spectrum of cloud native engineering. We start from strategy, move on to building a cloud native state of mind through workforce upskilling and modernization, and go all the way to showcasing what is new in the ecosystem and what is coming up in the industry from the hyperscalers, and how that can be adapted to our clients' ecosystems and their specific scenarios.
We are a very skilled team of engineers, architects, strategists and technologists, and as I said, we work across the board on anything cloud native, so we interact with a lot of clients.
One area I am particularly inclined towards is the cultural transformation part: workforce modernization, the cloud native state of mind, and specifically the DevOps and SRE culture building within that. I'm part of a unit where we upskill our colleagues in SRE and DevOps practices, and a very key part of that has been chaos engineering. We'll expand on that in a bit.
So first, what services do we provide out of Cloud Native Labs, and what SRE services in particular? Our SRE services fall under three categories. First, SRE enablement services, under which we do a skill assessment of a team, a client or a business unit, and then we do the enablement: we design programs and custom learning journeys and run certification programs, providing end-to-end management from the labs. We also do quite a bit of SRE consulting for our clients, wherein we go into their environments, assess their tool sets, assess the maturity at which their operations and development areas are running, and look at how SRE-inclined and mature their processes are on an SRE maturity scale or index. We then work with them to design programs that enable SRE practices such as chaos engineering, SLIs and SLOs, observability enhancements, et cetera. And third, there are maturity assessment services, which are basically a pure consulting service: going in, looking at the current state, possibly doing a third-party assessment of architecture, automation setups, SRE setups, et cetera, and then maybe doing some coaching as well, just to guide teams onto the right path.
In doing so, we work with a lot of customers. Since we started, I think we have interacted with 100-plus customers in the SRE space from the lab. Since we started this program, about two years back now, we have enabled more than 1,000 SREs who are certified through the certification program that we run here at the labs. So it has been done at quite an extensive level, and it has now become the de facto standard for SRE enablement across our organisation.

So, coming to chaos engineering then: what has been our journey with it? Chaos engineering has been a niche, up-and-coming kind of practice. It started quite a while back now at Netflix, which I think came out around 2016-17 and explained what they do with tools like Chaos Monkey. Since then it has developed into a more methodical, more customized kind of practice. We started with the practice around August 2021.
We started exploring products, looking at the practice, its different concepts and aspects, the value it brings, and basically the art of the possible around the chaos practice: how we could benefit from it and what value we could bring to our customers. From there, we started utilizing a few tools in a smaller capacity, doing basic kinds of experiments. I think by the end of 2021 we were doing what can be called simple chaos experiments, like pod delete and node shutdown. But we also started doing them in a more as-code kind of way, integrating them with our CI/CD pipelines, et cetera. So we were doing chaos engineering, but at the same time we were also looking to automate their execution so that they could be attached to an already existing flow.
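For illustration, this is roughly what "chaos as code" looks like with a LitmusChaos ChaosEngine for a pod-delete experiment. It is a minimal sketch: the namespace, application labels and service account are placeholder assumptions rather than the exact values we used, and field details can vary between Litmus versions.

```yaml
# Minimal sketch: a pod-delete experiment declared as code with LitmusChaos.
# Namespace, labels and service account below are illustrative placeholders.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: cart-pod-delete
  namespace: shop
spec:
  engineState: "active"              # setting this to "stop" aborts a running experiment
  chaosServiceAccount: litmus-admin
  appinfo:
    appns: "shop"                    # namespace of the application under test
    applabel: "app=cart"             # label selecting the target deployment
    appkind: "deployment"
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # how long the fault injection runs (seconds)
              value: "60"
            - name: CHAOS_INTERVAL         # gap between successive pod deletions (seconds)
              value: "10"
            - name: FORCE                  # graceful vs forced pod termination
              value: "false"
```

Because it is just a manifest, it can be version-controlled and applied by the same pipelines that deploy the application itself, which is what lets it be attached to an existing flow.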
Furthermore, in late 2021 and into early 2022, around Christmas particularly, we started playing around with workflow kinds of scenarios where you can combine multiple experiments. The intention there was to create something very close to what an actual fault in a production scenario would look like, to see how close you can get to that scenario, and to basically test your services and their resiliency, and then work on them. Finally, around February-March 2022, we included this as a standard part of the offering that we demonstrate to our clients and that we include in our SRE cohorts and SRE enablement trainings.
Since then we have been maturing this, implementing and exploring it on the hyperscalers with tools like Azure Chaos Studio and AWS Fault Injection Simulator, as well as, obviously, the cloud native tools, Chaos Mesh and LitmusChaos, and standalone tools like Gremlin; now Harness has also come up. So we've been exploring these tools quite a bit, using them and building them into our client demos as well, and they have been received quite well.
This is a new practice, so we had to explain it and build some material around it to convey its value to a prospective client or to colleagues. But it has been well received, and as you can see from the numbers here, we've been running around 20-plus cohorts since then and have done around 100-plus client showcases, so it has been a major part of that.
Then, how have we progressed? As I was explaining a little on the previous slide, we started with simple experiments. We wanted to get a taste of how you inject faults. We started in VM-based scenarios, something replicating Chaos Monkey, maybe taking down a certain part of a VM-based or data-center-based stack. From there we went on to Kubernetes-based, cloud native stacks: taking down pods, blacking out particular services, dropping DNS, et cetera. So those were simple experiments, and they were being driven manually.
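As a hedged sketch of one of those simple, manually driven experiments, blacking out a particular service can be expressed with a manifest like the following. It uses Chaos Mesh's NetworkChaos resource; the "payments" service, namespace and duration are hypothetical.

```yaml
# Sketch: black out a hypothetical "payments" service by dropping all its traffic for 60s.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payments-blackout
  namespace: shop
spec:
  action: loss                   # inject packet loss
  mode: all                      # apply to every pod matched by the selector
  selector:
    namespaces: ["shop"]
    labelSelectors:
      app: payments              # label of the service to black out
  loss:
    loss: "100"                  # 100% loss is an effective blackout
    correlation: "0"
  duration: "60s"                # how long the fault lasts before Chaos Mesh reverts it
```

Applying and deleting manifests like this by hand is exactly the manual driving mentioned above; the later stages were about wiring such experiments into workflows instead.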
From there we went into workflow-based experiments, which were more logical: for example, taking an application with, say, five microservices and dropping their services or deleting their pods one by one, then cyclically seeing what the impact is and how well those microservices recover.
We started thinking in terms of CI/CD integrations, Argo CD integrations and GitHub integrations, basically making these more automated workflows, something we could attach to the end of an existing delivery or deployment workflow. So that, from the developer's point of view, from a developer velocity point of view, every time there is a deployment of a new version of a microservice or a Java-type service, the flows designed for chaos run after the service has been deployed, and that tests the resiliency of the service. And if that goes through, we just go ahead with the deployment.
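As a rough sketch of such a gate, the job below assumes a GitHub Actions runner that already has kubectl and the argo CLI configured against the cluster, a hypothetical deploy workflow called deploy-microservice, and a chaos workflow file at chaos/shop-resiliency.yaml; none of these names come from our actual setup.

```yaml
# Hypothetical post-deployment chaos gate (e.g. .github/workflows/chaos-gate.yaml).
# Assumes the runner already has kubectl and the argo CLI configured for the cluster.
name: chaos-gate
on:
  workflow_run:
    workflows: ["deploy-microservice"]   # run after the (hypothetical) deploy workflow
    types: [completed]
jobs:
  resiliency-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the chaos workflow and wait for it to finish
        run: argo submit chaos/shop-resiliency.yaml -n litmus --wait
      - name: Fail the gate if any experiment verdict is not "Pass"
        run: |
          verdicts=$(kubectl get chaosresults -n litmus \
            -o jsonpath='{.items[*].status.experimentStatus.verdict}')
          echo "Verdicts: $verdicts"
          for v in $verdicts; do
            [ "$v" = "Pass" ] || { echo "Chaos check failed, blocking promotion"; exit 1; }
          done
```

In a real pipeline you would scope the ChaosResult query to the resources created by that particular run rather than the whole namespace.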
The kinds of concepts we have taken care of while designing these experiments have been things like: design a hypothesis, select a blast radius, then test and observe, and based on the insights, improve your service and go around the loop again.

We have also expanded to the hyperscalers. The technical domains we have covered include the hyperscalers, Azure and AWS; Kubernetes-based environments in particular have been a big hit, and there are quite a few tool sets which cater to cloud native stacks and environments. We have also developed some solutions for private cloud or on-prem based chaos experimentation, because quite a few of the clients and environments that we work with are on prem or in a private cloud.

Next, I have an example of how a chaos workflow works in a cloud native kind of environment, and how we demonstrate these kinds of values, this kind of chaos engineering impact. How do we explain or emphasize, to SREs in particular, that this is how chaos engineering can make your service more reliable? In this example, what we're showing is how we can run a chaos workflow, which can contain multiple experiments, and then derive value and make decisions on the basis of that. This example was done with a tool set consisting mainly of LitmusChaos as the chaos tool, along with GitHub Actions, Argo Workflows (which is basically part of LitmusChaos), and Grafana and Prometheus for observability. The application we have used is the block shop microservices application.
So, how does the workflow work? You can design your experiment, you can design a complex workflow of experiments: take down one service, first see the impact on the other services, then take down another service, see the impact, and keep going like that. It can be written in the form of a YAML workflow file, and then the developer can simply check it in.
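A sketch of what such a checked-in workflow file can look like, with Argo Workflow steps that create LitmusChaos engines one after another; the service names, namespaces and service accounts are illustrative assumptions.

```yaml
# Illustrative chaos workflow: fault one service, then the next, strictly in sequence.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: shop-resiliency-
  namespace: litmus
spec:
  entrypoint: chaos-sequence
  serviceAccountName: argo-chaos             # needs permission to create ChaosEngines
  templates:
    - name: chaos-sequence
      steps:
        - - name: kill-cart-pods             # step 1: delete the cart pods, observe recovery
            template: pod-delete-engine
            arguments:
              parameters: [{name: app, value: cart}]
        - - name: kill-orders-pods           # step 2: starts only after step 1 completes
            template: pod-delete-engine
            arguments:
              parameters: [{name: app, value: orders}]
    - name: pod-delete-engine
      inputs:
        parameters:
          - name: app
      resource:
        action: create                       # create the ChaosEngine; Litmus runs the experiment
        # In practice you would also add a successCondition (or a wait step) so the
        # next step starts only after this engine's chaos has actually completed.
        manifest: |
          apiVersion: litmuschaos.io/v1alpha1
          kind: ChaosEngine
          metadata:
            generateName: "{{inputs.parameters.app}}-pod-delete-"
            namespace: shop
          spec:
            engineState: "active"
            chaosServiceAccount: litmus-admin
            appinfo:
              appns: "shop"
              applabel: "app={{inputs.parameters.app}}"
              appkind: "deployment"
            experiments:
              - name: pod-delete
```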
Once that is checked in, the GitHub Action is triggered, and the GitHub Action basically triggers the submission of that workflow to the Argo workflow server, which starts the Argo workflow. Argo Workflows is what executes this complex set of experiments. It does that by creating custom resources prescribed, or provided, by Litmus. Those are the ChaosExperiments, which are the definitions of the actual experiments that you want to run.
Then it creates the ChaosEngine, which is basically the running instance of the experiment against your application. And finally it generates something called ChaosResults, which are basically the outcomes that tell you how those experiments and those workflows have fared.
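An illustrative (and trimmed) ChaosResult might look roughly like this; the exact status fields differ between Litmus versions, but the verdict is what a pipeline or an engineer checks to see whether the hypothesis held.

```yaml
# Illustrative, trimmed ChaosResult; field names can differ between Litmus versions.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: cart-pod-delete-pod-delete     # conventionally <engine-name>-<experiment-name>
  namespace: shop
status:
  experimentStatus:
    phase: Completed
    verdict: Pass                      # Pass / Fail / Awaited
    probeSuccessPercentage: "100"      # how well any attached probes held during chaos
  history:
    passedRuns: 3
    failedRuns: 0
    stoppedRuns: 0
```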
So Argo triggers the workflow and generates those custom resources, and once those are generated, they in turn create the Kubernetes-native objects, like jobs and running pods, which then execute your experiments.
Once that happens, it obviously impacts the application, and that impact can then be captured through the golden signals, which can be observed on your observability dashboard, in this case Grafana. One thing we can also include, and we have done this, is a bot kind of scenario: if your blast radius is getting too big, or the impact on your golden signals such as latency is becoming too large and the service is starting to go down, the custom resources can be aborted and the chaos engine can be stopped. Those kinds of safeguards can also be incorporated.
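One way such a guardrail can be expressed in LitmusChaos is a Prometheus probe attached to the experiment, so the run is marked failed, and can be stopped, when a golden signal breaches its threshold; a running engine can also be aborted by patching its engineState to "stop". The endpoint, query and threshold below are hypothetical, and probe field names vary across Litmus versions.

```yaml
# Illustrative guardrail: a Prometheus-based probe evaluated continuously during chaos.
# The Prometheus endpoint, query and threshold are placeholders.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: cart-pod-delete-guarded
  namespace: shop
spec:
  engineState: "active"                # patch this to "stop" to abort the run
  chaosServiceAccount: litmus-admin
  appinfo:
    appns: "shop"
    applabel: "app=cart"
    appkind: "deployment"
  experiments:
    - name: pod-delete
      spec:
        probe:
          - name: p99-latency-within-slo
            type: promProbe
            mode: Continuous           # evaluated throughout the chaos run
            promProbe/inputs:
              endpoint: http://prometheus.monitoring.svc:9090
              query: 'histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket{job="front-end"}[1m])) by (le))'
              comparator:
                criteria: "<"          # hypothesis: p99 latency stays below 0.5s
                value: "0.5"
            runProperties:
              probeTimeout: 10
              interval: 5
              retry: 1
              stopOnFailure: true      # stop injecting chaos when the guardrail is breached
```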
This was just one example with LitmusChaos, but the same kinds of things can be done with other tool sets as well, and I'll talk about those tool sets on the next slide.

So, talking about tool sets: there is a rich selection of chaos engineering tools now available, and that makes it a very good time to practice chaos engineering in current ecosystems, whether that is the CNCF ecosystem, the hyperscalers, or individual enterprise players like Gremlin and Harness. It's really a good time to practice chaos engineering, I'd say.
What are the points we like about this practice with the tools that are available right now? Quite a few of these are open source, so you are free to experiment with them, and once you are happy with a product and it is the one you want to roll out into your environment or your client's environment, there are enterprise versions available for several of them. All of these tools are quite cloud native, so they are compatible with almost all the flavours of Kubernetes: lightweight Kubernetes and managed Kubernetes, as well as your on-prem and cloud platforms. So the range is quite large.
Then there is detailed API support. So if you want to do these things programmatically, like I was explaining, through workflows and YAML-based files checked into your GitHub repositories that trigger your CI/CD pipelines, or, in Kubernetes terms, through custom resource support so that you can control these via operators, et cetera, it's quite good. There is also support for good practices: GitOps through Argo, observability through Prometheus, and GitHub Actions and other CI/CD tools are compatible as well. Last but not least, great documentation, which is quite key.
Many times you see open source products not having such good documentation, but these tools, I think, are quite good with that, which especially helps when you're exploring them and looking to adapt them to your particular scenarios and use cases.
And then finally, the enterprise versions of many of these tools are now available. So once you come to a stage where you want to go to production with these practices, or you want to adopt these practices, introduce them and roll them out across your organization, you don't really have to worry about them being open source and not being able to find specific support; there are enterprise versions available. The hyperscalers are also coming up with their own tool sets now, which is another plus, because tool sets like Azure Chaos Studio are native to the Azure services they are built for, so if you want to run experiments in your Azure environments, Azure Chaos Studio works really well.
So basically, what we wanted to showcase through this presentation is how our journey with chaos engineering has been, and particularly how rolling out chaos engineering has helped enhance the understanding of our SREs and our clients on their journey of SRE adoption and chaos engineering tooling. The chaos engineering practice has been a wonderful enabler in that journey. So with that, I would say thank you very much.
If you want to reach out to us, you can reach out to us at sre_cnl@etscl.com. Thank you very much.