Transcript
This transcript was autogenerated. To make changes, submit a PR.
Real time feedback into the behavior of your distributed systems, and observing changes, exceptions and errors in real time, allows you to not only experiment with confidence, but respond instantly to get things working again.
Hello chaos folks, I'm so delighted to be part of Conf42 Chaos Engineering 2022 as a speaker alongside such great speakers.
Today I will be presenting to you chaos engineering with Litmus and Jenkins.
It reflects what I have done with my clients to improve the resilience of their apps before promoting to production, and to make it easier for developers and SREs to execute chaos. I'll explain it in a deep dive in a bit. I'm sure that you have already seen many great talks today about the why and when of chaos engineering, but I will be focusing on the how, and it will also be shown with a demo. Before that, I would like
to present myself. I'm Akram Riahi, an SRE doing chaos engineering at WeScale. I'm also the author of several blog posts related to chaos engineering and Litmus, and I'm an organizer of the Chaos Week, which is a week-long chaos engineering fest with great speakers such as Uma Mukkara from Litmus, folks from Gremlin, and many other chaos engineering people. This event is aimed at the cloud native community in France. I'm also part of the Paris Chaos Engineering meetup, and I recently participated in the Chaos Carnival as a speaker. So, who are we?
We are WeScale, created in 2015, already more than 50 experts whose mission is to help you become cloud native. We help you think, build and master your cloud native architecture, continuously adapting to your maturity. What sets us apart is our high level of expertise and personalized support on cloud native technologies, with the conviction that know-how only has value if it is shared. That's why we write many blog posts, run meetups, and so on. We are also a member of the CNCF and partners with HashiCorp, but also with AWS and GCP.
Here is our menu for today: we will have an introduction, then we are going to see together how we make chaos engineering easier with Litmus for developers and SREs to improve their resilience. We will also discover the harmony between Litmus 2, Jenkins and Slack, and we will deep dive into that via an amazing demo at the end.
With my clients, it's always hard to look for a solution to test and improve our application resilience before promoting to production. It's also hard to ensure that it will be highly resilient, that we are not going to get an incident early in the morning, and that we are not going to get surprised. Today we are faced with two
major choices. Either we can create scripts and tests, which can take a lot of time, investment and also a lot of consultants, or we can go for the chaos engineering discipline, with a scientific approach based on hypothesis and experimentation. But here the questions are: is it difficult? Do we have enough knowledge to do chaos? How can we deal with it on a daily basis, knowing that we are producing a lot of code that has to be tested in terms of resilience? And we also face this question: if we say okay, we can do chaos, is it going to be easy for developers and SREs to do it? Well, I'm very certain that the answer is yes. But how do we make chaos engineering easier? Now the delivery
process has many steps, from dev all the way through CI/CD. Each time the developer pushes code, it is tested by a classical approach called QA tests, with all types of tests that look for things we already know. It means that we are going to test things we already know about; we are not going to see what we can't expect, the unknowns in other words. But we also have the right to get surprised sometimes by a problem we don't know about and don't expect, in order to improve the app resilience. For that reason we have to enable developers to inject chaos into their DevOps pipeline as often as they want. So today our talk will be around this enablement and how they can easily inject chaos via a simple push or a pull request.
To do that, let's present the environment. Our environment will be based on AWS as the cloud provider and Kubernetes as the container orchestrator for the infrastructure. We will also have Jenkins for the CI/CD part, Terraform for the infrastructure configuration, and Slack for notification, alerting and communication. As we all know, communication in chaos engineering is very important in order to let people know that we are going to inject chaos, so they don't get surprised; it also helps us collaborate together to improve the app resilience and, in other words, the system's behavior. We also have GitHub for our SCM, or code management. For the chaos injection we will use our famous framework,
which is LitmusChaos. So what is LitmusChaos? LitmusChaos is an open source framework used for chaos engineering, which helps Kubernetes SREs and developers practice it in a Kubernetes native way. Litmus was a CNCF sandbox project and is now in incubation, thanks to the big community behind it that supports Litmus and keeps improving it continuously. You can find it in the GitHub repo litmuschaos/litmus, and it is now in version 2.6.0.
Well, the importance of Litmus lies in the chaos experiments it provides. Litmus provides a lot of chaos experiments for Kubernetes, AWS, et cetera. It will help us inject various scenarios such as CPU hogging and memory hogging, which target resources, and we can also have experiments that target the network, such as network latency. This is at the pod level, for example, and we can enlarge the blast radius in order to attack nodes, for example. These experiments are available in the ChaosHub, which groups all these chaos experiments.
These chaos experiments can be organized and executed, in other words orchestrated, within a chaos workflow. So here's a question: what is a chaos workflow? A chaos workflow is a set of different operations, or chaos experiments, coupled together to achieve a desired chaos impact on a Kubernetes cluster. The importance of the chaos workflow is that it is very useful for automating a series of preconditioning steps or actions which are necessary before triggering the chaos injection. A chaos workflow can also be used to perform different operations in parallel to achieve a desired chaos injection scenario.
For example, say I see that my application is very affected by CPU and memory. If I want to test the impact of chaos injection on these two resources, I can go for a chaos workflow and create, for example, a workflow with two experiments, CPU hogging and memory hogging, and make them run in parallel. Or say I have noticed lately that I get a lot of network latency on this app, and I would like to test that, randomly, on the different dependencies as well: I can create a workflow that chains, for example, two experiments running in parallel, CPU hogging and memory hogging, and then continues in serial with network latency, as sketched below.
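Under the hood a Litmus chaos workflow is an Argo Workflow, so this parallel-then-serial ordering can be expressed with step groups. The following is only a minimal structural sketch, not a complete runnable workflow; the template and step names are assumptions for illustration:

```yaml
# Minimal structural sketch of a chaos workflow (Argo Workflow used by Litmus).
# Steps in the same group run in parallel; successive groups run in serial.
# Names are illustrative assumptions, not the demo's actual manifests.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: resource-and-network-chaos-
spec:
  entrypoint: chaos-scenario
  templates:
    - name: chaos-scenario
      steps:
        - - name: pod-cpu-hog            # parallel group 1
            template: run-pod-cpu-hog
          - name: pod-memory-hog
            template: run-pod-memory-hog
        - - name: pod-network-latency    # serial: runs after group 1 finishes
            template: run-pod-network-latency
    # ... each run-* template would apply the corresponding ChaosEngine
```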
This chaos workflow can also be created through the Chaos Center, which is a portal that helps us see, observe and monitor our workflows, and even create them. We will see it in the demo shortly. So how is chaos engineering made easier with Litmus and Jenkins?
To do that, let's begin by presenting the infrastructure, which is based on AWS EKS and has been provisioned via Terraform for the sake of the demo. As chaos engineering requires, we have deployed and configured the monitoring stack, composed of Grafana and Prometheus. We have also configured Slack to get notified of the necessary actions and to communicate before executing chaos, because it's very important; I always try to insist that communication is one of the most important keys of the chaos engineering discipline. We have also configured Jenkins and GitHub so the pipeline is triggered via a pull request, and we have a container registry, such as Docker Hub or Artifactory.
So here the developer will update the app code and push it or create a pull request, which will trigger the pipeline. This pipeline will notify Slack that it has started, prepare the environment, then build the application and push it to dev, so here it will push to the Docker Hub container registry. Then we will start the QA tests; I'm not going to present the QA tests because they are not very important for the sake of our demo and our presentation. Then it will update the deployed app image, so the deployment will be updated with the new image, and then it's going to inject chaos. Injecting chaos is done by applying the workflow that we have already talked about; it's a Kubernetes CRD that will be applied. Here we will face two results, pass or fail. If it fails, it means that a chaos experiment failed and our app is not resilient, so we will get notified via Slack. If it passes, our app is resilient, so it will be tagged and promoted to production, pushed to the container registry with a prod tag. Then we will clean up the resources that have been created by the chaos workflow, for example the chaos workflow itself, and also clean up the chaos results, which are Litmus CRDs. Then it will notify Slack with the QA test results and the chaos report. Now we are going to
move to the amazing part, which is the demo. So get ready
for it. To begin with, I will present the code, which is very simple. Let's look at the repo together. It's a very simple code: our app just prints "hello chaos folks". Here, for example, it says "hello folks", and I will be updating it later. Then we have our app here: the Dockerfile of the app, an Apache/PHP app, and also the Jenkinsfile for the pipeline.
We have the prepare stage, the build image and push to dev stage, the QA test stage, and also the update of the app deployment; our app is deployed as a Kubernetes Deployment. Then the chaos injection stage runs via a script and a folder that contains the workflow. If everything goes fine, we will promote the app; the manifest of the Deployment for the app is here. We also have different scripts: chaos.sh and cleanup.sh. chaos.sh will apply the workflow which is here, for example the CPU hogging workflow, and cleanup.sh will clean up the workflow that has been created and also the chaos result, which reflects the reports that are sent to Slack. In the workflow we can find three steps: install the chaos experiment, pod CPU hog, which is the experiment that we are going to run, and revert chaos, in order to delete the runners and the resources that have been created through this workflow.
In order to target our app, we have to update the appinfo, which is here in the ChaosEngine resource. For example, we are going to target the app namespace, which is "app", with the app label, which is the chaos-carnival demo label, and the app kind, which is a deployment. For the CPU hogging we use a 60 second chaos duration and we're going to target one CPU core, and this is what the chaos workflow will trigger. And this is of course the revert part: if we don't need to revert, we can delete this part and it will keep the different runners up for the logs. We also have, for example, memory hogging, and the pod delete experiment to delete pods randomly, generally for a deployment.
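As a rough, hedged sketch of what is described above (the resource names, namespace, label and service account are assumptions, not the exact demo files), a ChaosEngine for this pod CPU hog experiment could look something like this:

```yaml
# Hedged sketch of a Litmus ChaosEngine for the pod CPU hog experiment.
# Names, namespace, label and service account are illustrative assumptions.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: app-chaos
  namespace: app
spec:
  appinfo:
    appns: app                          # target namespace
    applabel: app=chaos-carnival-demo   # target label (assumption)
    appkind: deployment                 # target kind
  engineState: active
  chaosServiceAccount: pod-cpu-hog-sa
  jobCleanUpPolicy: delete              # use "retain" to keep runner pods for logs
  experiments:
    - name: pod-cpu-hog
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"               # 60 second chaos duration
            - name: CPU_CORES
              value: "1"                # target one CPU core
```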
Okay, so for the workflow, we have seen that the workflow is just code. You can also get it from the Chaos Center. Here, for example, you have the Litmus Chaos Center where we can see the different workflows that have been run, the chaos agents which connect the clusters, and the ChaosHub, which contains many experiments for Azure, AWS, et cetera. We have the observability part; here we are going to use ours, which is Grafana, for example. We also have other things: settings, team management, user management, and we can also integrate it with GitHub. So, for example, if I create a workflow here, it will be pushed directly to the GitHub repository. So how
can I get this CPU hogging workflow? Here's the question, and it's very easy. From the Chaos Center I will create a new one; for example, I will call it conf42. Next I will add, for example, a pod CPU hog experiment. Here I'm going to target the app namespace, with the app kind deployment and the app label, which is chaos-carnival. I can also define the steady state with a probe, if I would like to define the steady state of the app; it's very important, but for the sake of the demo we are not going to use it. We can use HTTP, CMD, Prometheus probes, et cetera, as illustrated below.
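For illustration only, here is a hedged sketch of what an HTTP probe on the experiment could look like in the generated ChaosEngine; the URL, criteria and run properties are assumptions, since the demo does not actually use a probe:

```yaml
# Hedged sketch of a Litmus httpProbe checking the app's steady state.
# URL, criteria and run properties are illustrative assumptions.
probe:
  - name: check-app-availability
    type: httpProbe
    mode: Continuous            # evaluate throughout the chaos duration
    httpProbe/inputs:
      url: http://app.app.svc.cluster.local
      method:
        get:
          criteria: "=="        # compare the response code
          responseCode: "200"
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 1
```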
We can also tune the experiment: for example, I'm going to go for a 60 second chaos duration and one CPU core, then click finish. Here, for the resiliency score, for me the CPU hogging is very important, so I will give it ten; if it's not important, I can give it six, or even something from zero to three. I will give it ten and I will schedule it now. Then I can view the YAML, copy and paste it into VS Code, and push it to our GitHub repo.
So here, for example, I have already created a pull request called "Trigger Chaos conf". It has prepared the app for me, built the image and pushed it to dev, done the QA testing, then updated the app. At the beginning we received a notification saying that the pipeline has started; then at the end, after the cleanup, we get the results: a notification saying that the pipeline has succeeded, plus the chaos result with the experiment name, which is pod-cpu-hog-exec, the verdict Pass and the resilience score 100, which means that our app is fully resilient and everything is going great. Here, for example, I will
update the app. I will change it to "hello chaos folks from around the world" and I will push it: git add, git commit, git push origin. So here it will trigger the pipeline on master. Here, for example, the master build is triggered and it's pending, and then we will get a notification saying that it has started; for now it's waiting in the queue.
And here we have the start. Then it will inject the chaos, and we will go all the way through the different steps shown here: the QA tests, then updating the app. Here it has updated the app, here it has injected the chaos, and we will wait for the chaos to finish. When the chaos finishes, we will see that the app has been updated and we will see the CPU hogging. For example, we can see that several resources are created in the litmus namespace: the runners, which will execute the experiment, which is the pod CPU hog. Once it's finished, it will create for us the chaos results; you see, it's the chaos result that will be updated with the verdict and the reliability score. And we also have the workflow resources that will be created, which will be in a running state and which will run our experiment in the litmus namespace.
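Once the experiment completes, the chaos result that feeds the Slack report might look roughly like this; a hedged sketch, with the resource name and values assumed from the demo narrative above:

```yaml
# Hedged sketch of the ChaosResult produced by the pod-cpu-hog-exec experiment.
# The name and values are assumptions based on the demo described above.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: app-chaos-pod-cpu-hog-exec
  namespace: litmus
status:
  experimentStatus:
    phase: Completed
    verdict: Pass                  # the app survived the CPU hog
    probeSuccessPercentage: "100"
```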
Here the pipeline is waiting for the chaos to finish, and now it has finished on master.
If we take a look at the Slack notifications, we will get notified that the app has been updated with the new image. It sometimes takes a while to get notified; it looks like a connection issue. We will wait for it to finish, and then we will get the chaos results: the experiment name, CPU hogging, et cetera. It might take some time to get notified. Going back to it while it finishes... let's see: succeeded. So it's just a notification delay, the Internet connection is lagging, but if everything is good we will see the promotion. Yes, it has already finished, so normally the app is updated. And here we will normally get the experiment name with the result and the report.
Going back to our presentation,
I hope that you enjoyed the demo.
As we have seen, starting chaos injection is a must in order to improve our app resilience. Before chaos injection we always have to communicate what we are going to do; it's very important for the sake of the other teams and the improvement work, in order to keep everyone posted that there might be downtime or something like that. We also have to automate chaos more and more, for example with Jenkins or other tools, in order to improve it continuously. We have to keep enhancing one of the biggest requirements of chaos engineering, which is the alerting system, in order to get notified when we have errors and incidents, and we also have to keep enhancing the monitoring systems. All of that will reveal a lot of failures and expose many things that we have forgotten, or that we didn't have the chance to take into account in our infrastructure or our system. We don't have to be afraid of that; we have to keep moving forward. I also believe that the key to success is to hack failure before it hacks us, and it's very important to learn from our failures. I hope you enjoyed it, I would like to thank you very much for attending this session, and see you soon.