Transcript
Hey everyone, thanks for joining my talk. I'm happy to be here with other great folks who share their knowledge with others. Our talk today is about how we utilize chaos engineering to become better cloud native engineers.
Let me first introduce myself. My name is Eran and I'm leading Cyren Inbox Security engineering at Cyren. I'm an engineer, I'm a problem solver, and I love sharing my knowledge with others, obviously.
So before we start, I would like to start with this: what are you going to gain out of this talk? I would like to share with you how we leverage chaos engineering principles to achieve other things besides their main goal. At the beginning we wanted to bring more confidence to our engineers while responding to production incidents, and in addition to that, to train them to become better cloud native engineers, as that requires additional expertise which isn't just the actual programming job of shipping your code somewhere. I would like to share with you how we got there, what we are doing, and how it improves our engineering team's expertise.
You might ask yourself about the title. It refers to a series of workshops that we run at Cyren, which at the beginning, I must admit, were meant to bring more confidence to the engineers during their on-call shifts. But later on they became a great playground to train cloud native engineering practices and share knowledge around that. So during this session, I'm going to share what we are doing in such workshops. Stay with me to learn more.
So let's start with the buzzword: cloud native. Here's the definition I copied from the CNCF documentation; we call it the cloud native definition. I highlighted some of the words there. While you read this definition, you see the words scalable, dynamic, loosely coupled, resilient, manageable, observable. At the end you see something that I truly believe in and that I'm trying to make part of the culture of any engineering team I join: as engineers, we deliver products. From the definition, you see that this is what cloud native actually brings. As a result, engineers can make high impact changes. This is, in my opinion, what every engineering culture should believe in: make an impact, and as a result, you will have happy customers.
The evolution of cloud native technologies and the need to scale engineering are leading organizations to restructure their teams and embrace new architectural approaches such as microservices. We are using cloud environments that are pretty dynamic, and we might choose to build microservices to achieve better engineering scale. As a side note, you should remember that microservices are not the goal; we use them, the cloud, and other technologies as tools to scale our engineering and our product. As your system scales, it probably becomes more and more distributed. Distributed systems are by nature challenging: they are not easy to debug and they are not easy to maintain. And why aren't they easy? Because each of us sees only pieces of a larger puzzle.
Two years ago I wrote a blog post that tries to describe the engineering evolution. At a glance, I think that the role of engineers has grown to be much bigger. We are not just shipping code anymore: we design it, we develop it, we release it, we support it in production. The days when we threw the artifacts over to operations are over. As engineers, we are accountable for the full release cycle. If you think about it, it's mind blowing. We brought so much power to us as engineers, and with that power we should be much more responsible. You might be interested in reading that post that I wrote two years ago.
I just talked about the changes and the complexity, but we should embrace these changes. They enable teams to take end-to-end ownership of their deliveries and increase their velocity. As a result of this evolution, engineers these days are closer to the product and the customer needs. In my opinion, there is still a long way to go, and companies are still struggling with how to get engineers closer to their customers to understand in depth what their business impact is.
We talked about impact, but what is this engineering impact that we talk about? Engineers need to know what they solve, what their influence on the customer is, and what their impact on the product is. If you think about it, there is a transition in the engineering mindset: we ship products and not just code. We embrace this transition, which brings with it so many benefits to the companies that are adopting it, and we are among them. On the other hand, as the team and the system scale, it becomes challenging to write new features that solve a certain problem, and even understanding service behavior is much more complex.
And let's see why it becomes more complex. The advanced approaches that I just mentioned bring great value, but as engineers, we are now writing apps that are part of a wider collection of other services that are built on top of a certain platform in the cloud. I really like what Ben is sharing in these slides and I would like to share it with you. He calls them deep systems. Images are better than words, and the pyramid in the slide explains it all. You can see that as your service scales, you become responsible for a deep chain of other services that your service actually depends on. This is what it means: we are maintaining deep systems. Obviously, microservices and distributed systems are deep by nature. Let's try to imagine just a certain service you have; let's say you have some, I don't know, order service. What do you do in this order service? You fetch data from one service, then you need to fetch other data from another service, and maybe you produce an event to a third service. The story can go on, but you understand the concept. It's just complex. Deep systems are complex and you should know how to deal with them.
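To make the "deep chain" idea concrete, here is a minimal sketch of what such an order flow could look like. The service URLs, the fetch_json helper, and the event topic are my own hypothetical illustrations, not code from the talk.

# a hypothetical order flow: every request fans out to more services,
# so the "depth" of what you are responsible for grows quickly
import json
import urllib.request

def fetch_json(url):
    # plain standard-library HTTP call, with a timeout so a slow dependency fails fast
    with urllib.request.urlopen(url, timeout=2) as resp:
        return json.load(resp)

def publish(topic, payload):
    # stand-in for a real message broker client (Kafka, SQS, etc.)
    print("publishing to", topic, json.dumps(payload))

def handle_order(order_id):
    order = fetch_json(f"http://orders.internal/orders/{order_id}")                 # service 1
    customer = fetch_json(f"http://customers.internal/{order['customer_id']}")      # service 2
    publish("order-processed", {"order": order, "customer": customer})              # event to service 3

Even this toy handler already depends on two services and a message broker, and each of those has its own chain underneath.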
As part of transitioning to being more cloud native and distributed, and relying on orchestrators such as Kubernetes as your foundation, engineers face more and more challenges that they didn't have to deal with before. Just imagine this scenario: you are on call, there is some back pressure that is burning your SLO targets, there is some issue with one of your AZs, and a third of your deployment couldn't be rescheduled due to node availability issues. What do you do? You need to find it, and you might need to point it out to your on-call DevOps colleague. By the way, DevOps may be working on that already, as it might trigger their SLOs as well. These kinds of incidents happen, and as a cloud native engineer you should be aware of the platform you're running on. What does that mean? You should know that there are AZs in every region, that your pod affinities are defined in some way, and that the pods that are scheduled have some status. There are cluster events, and how do you read the cluster events during such a failure? This was just one particular scenario that happened to me, and it might have happened to many of you before. As you see, it's not just your service anymore, it's more than that. And this is what it means to be a cloud native engineer.
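As an illustration of the kind of platform awareness I mean, here is a minimal triage sketch using the official Kubernetes Python client. The "orders" namespace is a hypothetical example, and this is my own sketch rather than tooling we showed in the talk.

# minimal incident-triage sketch with the official Kubernetes Python client
# (pip install kubernetes); the "orders" namespace is just an example
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

# which pods are stuck Pending, e.g. couldn't be rescheduled after an AZ issue?
pending = v1.list_namespaced_pod("orders", field_selector="status.phase=Pending")
for pod in pending.items:
    print("Pending:", pod.metadata.name)

# what is the cluster telling us? Warning events usually explain why.
events = v1.list_namespaced_event("orders", field_selector="type=Warning")
for ev in events.items:
    print(ev.last_timestamp, ev.reason, ev.message)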
As I said already, being a cloud native engineer is fun but also challenging. So these days engineers aren't just writing code and building packages; they are expected to know how to write the relevant Kubernetes resource YAMLs, use Helm, or containerize their app and ship it to a variety of environments. So as you see, it's not enough to know it at a high level. Being a cloud native engineer means that it's not enough to just know the programming language you are working with well; you should also keep adapting your knowledge and understanding of the cloud native technologies that you are depending on. Besides the tools you are using, building cloud native applications involves taking into account many moving parts, such as the platform you are building on, the database you are using, and much more. Obviously there are great tools and frameworks out there that abstract some of this complexity away from you as an engineer, but being blind to them might cost you a day or even a night.
If you haven't heard of the fallacies of distributed computing, I really suggest you read further on them. They are here to stay, and you should be aware of them and be prepared. In the cloud, things will happen, things will fail. Don't assume you know what to expect; just make sure you understand the failures, handle them gracefully, and embrace them. As I said, they will just happen.
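A minimal sketch of what "handle them gracefully" can look like in code, assuming a hypothetical downstream HTTP endpoint and using the requests library; it is an illustration of the idea, not code from the talk.

# expect failure: a timeout and bounded retries with backoff around a downstream call
import time
import requests

def fetch_orders(url, retries=3, timeout=2.0):
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, timeout=timeout)  # never call a remote service without a timeout
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == retries:
                raise                      # give up and let the caller degrade gracefully
            time.sleep(2 ** attempt)       # simple exponential backoff

# usage, with a hypothetical internal URL:
# orders = fetch_orders("http://orders.internal/api/v1/orders")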
We talked a lot about the benefits and also about the challenges. These are the challenges we had to deal with here at Cyren, so let me explain what we did to cope with them. We utilized chaos engineering for that purpose, and we have found this method pretty useful, so I think it can be nice to share with you the practices and how we dealt with these challenges here.
So stay here with me.
Let me first give a quick brief of what chaos engineering is. The main goal of chaos engineering is explained in the slide, which I just copied from the Principles of Chaos Engineering website. The idea of chaos engineering is to identify weaknesses and reduce uncertainty when building distributed systems. As I already mentioned in previous slides, building distributed systems at scale is challenging, and since such systems tend to be composed of many moving parts, leveraging chaos engineering practices to reduce the blast radius of such failures proves to be a great method for that purpose.
So I've created a series of on-call workshops. These workshops intend to achieve two main objectives: the first is to train engineers on production failures that we had recently, and the second is to train engineers on cloud native practices and tooling, and on how to become better cloud native engineers.
A bit on our on-call procedure before we proceed; it will help you to better understand what I'm going to talk about in the next slides. We have weekly engineering shifts and a NOC team that monitors our system 24/7. There are alert severities defined, severity one, two and three, which actually reflect the business impact, and alerts are routed to the actual service owner; these are the alerts that we usually monitor. We have alert playbooks that assist the on-call engineer while responding to an event; I will elaborate on them a bit later. In case of a severity one, the first priority is to get the system back to a normal state. The on-call engineer who is leading the incident should understand the high-level business impact in order to communicate back to the customers, and whenever some specific expertise is needed to bring the system back into a functional state, the engineer makes sure that the relevant team or the service owner is on the keyboard to lead it. These are the tools that the engineer has in their toolbox to utilize in case of an incident. It's a pretty nice tool set, I must admit: we have Jaeger, Kibana, Grafana and all the rest. Now that we understand the big picture, let's drill down into the workshop itself.
The workshop sessions are composed of three parts. We have the introduction and goal setting; then we might share some important things that we would like everyone to know; and then we start the challenge, which is the most important part of this workshop. Let's dive into each of the parts I just mentioned.
The session starts with a quick introduction and motivation: why do we have this session, and what are we going to do in the upcoming session? We make sure that the audience is aligned on the flow and the agenda. It's very important to do that every time, as it makes people more connected to the motivation and helps them understand what is going to happen. This is part of your main goal: you should try to keep people focused and concentrated, so make sure that things are clear and concise at the beginning of every workshop session.
Sometimes we utilize the session as a great opportunity to communicate some architectural aspects, platform improvements, or process changes that we had recently. For example, we provide some updates on the on-call process, maybe a core service flow that we made some adaptations to, and more.
The last part, and the most important one, is that we work on at most two production incident simulations, and the overall session shouldn't be longer than 60 minutes; we have found that we lose the engineers' concentration in longer sessions. So if you work hybrid, it is better that these sessions happen when you are in the same workspace, as we have found that much more productive. Communication is key and it makes a great difference.
Let me share with you what we are doing specifically in this part, which is the core of this workshop. I think this is one of the most important things. Our on-call workshop sessions try to be as close to real-life production scenarios as possible, by simulating real production scenarios in one of our environments. Such real-life scenarios enable the engineers to build confidence in taking care of a real production incident. Try to have an environment that you can simulate these incidents on, and let people play in real time. As we always say, no environment is identical to production, but since we are doing specific experiments, it's not really necessary to have a production environment in place. Obviously, the more you advance, the better it might be to work on production, but that's much more complex by nature, as you already understand, and we haven't done that yet since we started utilizing chaos engineering here. I suggest having real experiments that you can execute within a few clicks; we are using one of our test environments for that purpose. I must say that we started manually, so if you don't have a tool, I really suggest you don't spend a dime on that. Don't rush to choose a specific chaos engineering tool. Just recently we started using LitmusChaos to run these chaos experiments, but you can choose anything else you would like, or you can just simulate the incidents manually, as we did in the beginning.
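To give a feel for what "simulating manually" can mean, here is a minimal sketch that kills a random pod behind a label selector using the Kubernetes Python client. The namespace and label are hypothetical, and this is my own illustration rather than one of our actual experiments.

# a tiny manual chaos experiment: delete one pod matching a label and watch recovery
# (pip install kubernetes); namespace and label selector are hypothetical examples
import random
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("orders", label_selector="app=order-service").items
victim = random.choice(pods)
print("deleting pod:", victim.metadata.name)
v1.delete_namespaced_pod(victim.metadata.name, "orders")
# now the group can follow the alerts, dashboards and cluster events
# while the deployment reschedules a replacement pod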
I think the most important thing is, as I said before, that we need to have a playground for the engineers to actually exercise in, not just have them hear someone talk over presentation slides or some demo. You will be convinced: when they are practicing, and not just listening to someone explaining, it makes the session very, very productive.
Right after the introduction slides we drill down into the first challenge.
Each challenge starts with a slide that explains a certain incident that we are going to simulate. We usually give some background of what is going to happen; for example, there is some back pressure in one of our services that we couldn't handle at some specific UTC time. We present some metrics of the current behavior; for example, we present the alerts and the corresponding Grafana dashboards. Usually I think that you should present something that is very minimal, because this is how it usually happens during a real production incident. Then we give the engineers some time to review the incident by themselves. Giving them time to think about it is really crucial: they exercise and think on their own, even if they haven't covered something similar before. This very important step encourages them to try to find out more information and utilize their know-how, such as gathering some cluster metrics, viewing the relevant dashboards, or reading the logs or the service status.
And I think that understanding the customer impact is a very important aspect. You should understand the customer impact, and even more importantly, when you are on an on-call shift, in case of a severity one, you should communicate the impact on the customer and see if there is any workaround until the incident is resolved completely. Engineers are not always aware of the actual customer impact, so the workshop is a very good time to discuss it, and I really think it's a good time to speak about such things. I think that you should also pause their analysis from time to time and encourage them to ask questions. We have found that the discussions around the incidents are a great place for knowledge sharing. Knowledge sharing can be anything from a design diagram to some specific Kubernetes command line, and if you are sitting together in the same space, it can be pretty nice because you can see who is doing what, and then you can ask them to show the tools that they are using, and other people can learn a lot from that.
What I really like in these sessions is that they trigger conversations, and engineers tell each other, for example, to send over the CLIs or tools that they recommend using. It's pretty nice, and it really makes the engineers' lives while debugging an incident much, much easier. The workshop sessions will teach you a lot about the know-how that people have, and I encourage you to update the playbooks based on that. If you don't have such playbooks, I really recommend you create them. We have a variety of incident playbooks.
Most of them are composed for major severity one alerts and provide the on-call engineer with some gotchas and high-level flows that are important to know and to look at when dealing with different production incidents or scenarios. And this is how our playbook template looks: you can see that there is a description, how to detect the issue, how to assess the customer impact, and how to communicate in case of a failure.
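As a rough illustration of those fields, a playbook entry could be sketched as simple structured data like this; the alert name and all the details are hypothetical, not our actual playbook.

# a hypothetical playbook entry, sketched as structured data just to show the fields
playbook_entry = {
    "alert": "orders-backpressure-sev1",          # hypothetical alert name
    "description": "Order queue depth keeps growing and burns the latency SLO.",
    "detection": ["Grafana: order-queue dashboard", "Kibana: error-rate query"],
    "customer_impact": "Orders are delayed; assess how many customers are affected.",
    "communication": "Post a status update to the incident channel and to support.",
    "first_steps": ["check recent deployments", "check node / AZ availability", "read cluster events"],
}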
Drive the conversations by asking questions that will enable you to share some of the topics that you would like to train on. A few examples that I have found to be efficient: you can, for example, ask one engineer to present the Grafana dashboard to look at, ask someone else to share the Kibana logging queries, or ask another one to present their Jaeger tracing and how they found the trace. It's pretty nice, I must admit. You sometimes need to moderate the conversation, as the time really flies fast and you need to bring back the focus, because the conversation can get very heavy sometimes. During the discussion, point out interesting architectural aspects that you would like the engineers to know about. Maybe you can talk about a specific async channel that you might want to share your thoughts about, or anything else. Encourage the audience to speak by asking questions around these areas of interest; that will enable them to even suggest new design approaches or highlight different challenges that they have been thinking about lately. You might be surprised, I tell you. You might be surprised by what people say, and sometimes you might even add some of the things that they suggest to your technical debt backlog in order to take care of them later on.
At the end of every challenge, ask somebody to present their end-to-end analysis. It makes things clearer for people who might not feel comfortable enough to ask questions, for example in large forums, or for engineers that have just joined the team, or junior engineers that might want to learn more afterwards. It's a great source for people to get back to what has been done, and also a fantastic part of your knowledge base that you can share in the onboarding process for new engineers who just joined the team. I found out that people sometimes just watch the recording afterwards, and it becomes handy even just for engineers to get an overview of the tools that they have. So just make sure to record it and share the meeting notes right after the session.
As you have seen, chaos engineering for training is pretty cool, so leverage it to invest in your engineering team's knowledge and expertise. It seems like it was successful for us, and it may be successful for you as well.
So to summarize some of the key takeaways: we found out that these sessions are an awesome playground for engineers. I must admit that I didn't think about using chaos engineering for this kind of simulation in the first place. We started with just manual simulations of our incidents, or just presenting some of the evidence we gathered during a failure to drive conversation around it. As we moved forward, we leveraged chaos tools for that purpose. Besides training to become better cloud native engineers, our on-call engineers are feeling more comfortable now in their shifts and understand the tools that are available to them to respond quickly. I thought it could be good to share this, as we always talk about chaos engineering experiments to build better, more reliable systems, but you can also leverage them to invest in your engineering team's training.
So thanks for your time, and I hope it was a fruitful session. Feel free to ask questions anytime, and I will be very happy to share much more with you if you want.