Abstract
As engineers, we used to write code that interacted with a well-defined set of other applications. You usually had a set of services running in well-defined environments. The evolution of cloud native technologies and the need to move fast led organizations to redesign their structure. Engineers are now required to write services that are just one of many others that together solve a certain customer problem. Your services are smaller than they used to be, they don't live in a vacuum, and you have to understand the problem space your service lives in. These days engineers aren't just writing code. They are expected to know how to deal with Kubernetes and Helm, containerize their service, ship to different environments, and debug in a distributed cloud environment.
In order to enhance engineers' cloud native knowledge and best practices for dealing with production incidents, we started a series of workshops called "On-Call like a King", which aims to enhance engineers' knowledge while responding to production incidents. Every workshop is a set of chaos engineering experiments that simulate real production incidents, and the engineers practice investigating, resolving, and finding the root cause.
In this talk I will share how we got there, what we are doing, and how it improves our engineering teams' expertise.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, thanks for joining my talk. I'm happy to be here with other great folks who share their knowledge with others. Our talk today is about how we utilize chaos engineering to become better cloud native engineers and improve incident response.
Let me first introduce myself. My name is Eran and I'm leading security engineering at Cyren. I'm an engineer and a problem solver, and I love sharing my knowledge with others.
So before we start, I would like to start with this one: what are you going to gain out of this talk? I would like to share with you how we leverage chaos engineering principles to achieve other things besides their main goal. At the beginning we wanted to bring more confidence to our engineers who are responding to production incidents, and in addition to train them to become better cloud native engineers, as that requires additional expertise beyond the actual programming needed to ship your code. So, I would like to share with you how we got there, what we are doing, and how it improves our engineering teams' expertise.
One more thing you might ask yourself: why the title "On-Call like a King"? It's a series of workshops I composed at Cyren which, at the beginning, I must admit, was meant to bring more confidence to engineers during their on-call shifts. But later on it became a great playground to train cloud native engineering practices and share knowledge around them. During this session I'm going to share with you what we are doing in such workshops, so stay with me.
Let's start with the buzzword "cloud native". I copied the definition from the CNCF documentation, which is why I call it the cloud native definition, and I've highlighted some of the words there. While you read the definition, you see the words scalable, dynamic, loosely coupled, resilient, manageable, observable. At the end you see something that I truly believe in and that I try to make part of the culture of every engineering team that I join. As engineers, we deliver products, and as the definition says, this is what cloud native brings: as a result, engineers can make high-impact changes. This is, in my opinion, what every engineering culture should believe in. Make an impact, and as a result you will have happy customers.
The evolution of cloud native technologies and the need to scale engineering are leading organizations to restructure their teams and embrace new architectural approaches such as microservices. We are using cloud environments that are pretty dynamic, and we might choose to build microservices to achieve better engineering scale. Just as a side note, you should remember that microservices are not the goal; we use them, the cloud, and other technologies as tools to scale our engineering and our end product. As your system scales, it probably becomes more and more distributed. Distributed systems are by nature challenging. They are not easy to debug, and they are not easy to maintain. Why aren't they easy? Because we ship small pieces of a larger puzzle.
Two years ago I wrote a blog post that tries to describe the engineer's role in production. At a glance, I feel that the role of the engineer has grown to be much bigger. We're not just shipping code anymore. We design it, we develop it, we release it, and we support it in production. The days when we threw artifacts over to operations are over. As engineers, we are accountable for the full release cycle. If you think about it, it's mind blowing. We brought so much power to the engineers, and with great power we should be much more responsible. You might be interested in the blog post that I just mentioned, which I wrote back in 2020.
I just talked about the changes and the complexity, but we should embrace these changes. They enable teams to take end-to-end ownership of their deliveries and enhance their velocity. As a result of these evolutions, engineers these days are closer to the product and to the customer needs, in my opinion. There is still a long way to go, and companies are still struggling with how to get engineers closer to the customer so they understand in depth what their business impact is. We talked about impact, but what is this impact engineers need to know? What do they solve, how do they influence the customer, and what is their impact on the product? There is a transition in the engineers' mindset: we ship products and not just code. We embrace this transition, which brings so many benefits to the companies that adopt it. On the other hand, as a team, as the system scales it becomes challenging to write new features that solve a certain business problem, and even understanding the service behavior is much more complex. Let's see why it's complex. The best practices that I've just mentioned bring great value.
But as engineers we are now writing apps that are part of a wider collection of other services built on a certain platform in the cloud. I really like what Ben shares in his slides, and I would like to share it with you: he calls these deep systems. Images are better than words, and the diagram in the slide explains it all. You can see that as your service scales, you are responsible for a deep chain of other services that your service depends on. This is what it means: we are maintaining deep systems. Obviously, microservices and distributed systems are deep.
Let's try to imagine a certain service you have. Let's say that this is the order service. What do you do in this order service? You fetch data from one service, then you need to fetch data from another service, and you might produce an event to a third service. The storage is your own, but you understand the concept. It's just complex. Deep systems are complex, and you should know how to deal with them.
As part of transitioning into being more cloud native and distributed, and relying on orchestrators such as Kubernetes at your foundation, engineers face more and more challenges that they didn't have to deal with before. Just imagine this scenario: you are on call, there is some back pressure that is hurting your SLO targets, there is some issue with one of your availability zones, and a third of your deployment couldn't be rescheduled due to node availability issues. What do you do?
You need to figure it out, and you might need to point it out to your on-call DevOps colleague. By the way, DevOps may be working on it already, as it might trigger their SLOs as well. This kind of incident happens, and as a cloud native engineer you should be aware of the platform you are running on. What does it mean to be aware of that? You should know that there are availability zones in every region, that your pod affinities are defined in a certain way, that the pods that are scheduled have a status, and that there are cluster events and you should know how to read them in case of such a failure.
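To make that concrete, here is a minimal sketch of the kind of checks this implies, using the official Kubernetes Python client. This is my own illustration rather than anything shown in the talk, and the namespace, labels, and selectors are made-up placeholders.

```python
# Hypothetical diagnostic sketch using the official Kubernetes Python client
# (pip install kubernetes). Namespace, labels and selectors are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

# Which of my pods are stuck unscheduled, and what affinity rules do they carry?
pending = v1.list_namespaced_pod("orders", field_selector="status.phase=Pending")
for pod in pending.items:
    print(pod.metadata.name, pod.spec.node_selector, pod.spec.affinity)

# How are the nodes spread across availability zones, and are they Ready?
for node in v1.list_node().items:
    zone = node.metadata.labels.get("topology.kubernetes.io/zone", "unknown")
    ready = next((c.status for c in node.status.conditions or [] if c.type == "Ready"), "Unknown")
    print(node.metadata.name, zone, "Ready:", ready)

# Recent warning events often explain why pods could not be rescheduled.
events = v1.list_namespaced_event("orders", field_selector="type=Warning")
for e in events.items:
    print(e.last_timestamp, e.reason, e.message)
```

The same information is of course available through kubectl; the point is simply knowing that pending pods, node conditions, and warning events are the places to look.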
This was just one particular scenario that happened to me, and it has probably happened to many of you before. As you see, it is not just your service anymore; it's more than that. And this is what it means to be a cloud native engineer. As I said already, being a cloud native engineer is fun, but also challenging. These days engineers are not just writing code and building packages; they are expected to know how to write their own Kubernetes resource YAMLs, use Helm, containerize their app, and ship it to a variety of environments. It is not enough to know it at a high level. Being a cloud native engineer means that it's not enough to just know the programming language you are working with well; you should also keep adapting your knowledge and understanding of the cloud native technologies that you depend on, beyond the tools you are using.
Building cloud native applications involves taking into account many moving parts, such as the platform you are building on, the database you are using, and much more. Obviously there are great tools and frameworks out there that abstract some of that complexity away from you as an engineer, but being blind to them might hurt you someday, or maybe some night. If you haven't heard of the fallacies of distributed computing, I really suggest you read further on them. They are here to stay, and you should be aware of them and be prepared. In the cloud, things will happen, things will fail. Don't assume you know what to expect; just make sure you understand failures, handle them carefully, and embrace them. As I said, they will just happen.
We talked a lot about the great benefits and also about the challenges, so we had to find a way to deal with those challenges. Let me explain what we did to cope with them. We utilized chaos engineering for that purpose. We have found this method pretty useful, and I think it is worth sharing the practices with you and with others.
Let's first give a quick brief on what chaos engineering is. The main goal of chaos engineering is explained in the slide, which I copied from the Principles of Chaos Engineering website. The idea of chaos engineering is to identify weaknesses and reduce uncertainty when building a distributed system. As I already mentioned in previous slides, building distributed systems at scale is challenging, and since such systems tend to be composed of many moving parts, leveraging chaos engineering practices to reduce the blast radius of such failures proves to be a great method for that purpose.
So I created a series of workshops called "On-Call like a King". These workshops intend to achieve two main objectives: train engineers on production failures that we had recently, and train engineers on cloud native practices and tooling and how to become better cloud native engineers.
A bit on our on-call procedure before we proceed. We have weekly engineer shifts and a NOC team that monitors our systems 24/7. There are alert severities defined, severity one, severity two and severity three, which range from the business-impacting alerts to the actual service owner alert monitors. We have alert playbooks that assist the on-call engineer responding to an event; I will elaborate on them a bit later. In the case of a severity one, the first priority is to get the system back to a normal state. The on-call engineer who is leading the incident should understand the high-level business impact in order to communicate it. In any case where specific expertise is needed to bring the system back to a functional state, the engineer makes sure the relevant team or service owner is on keyboard to lead it. These are the tools that the engineer has in the box to utilize in case of an incident, a pretty nice tool set. Now that we understand the picture, let's drill down into the workshop itself.
The workshop sessions are composed of three parts: the introduction and the goal setting, then some important things that have changed lately that we might need to share, and then the challenges right away. Let's dive into each one of them.
The session starts with a quick introduction of the motivation: why do we have the session, what are we going to do in the upcoming session, and making sure the audience is aligned on the flow and agenda. It's very important to show that every time, as it makes people more connected to the motivation and helps them understand what is going to happen. This is part of the main goal: you should try to keep people focused and concentrated, so make sure things are clear and concise. Sometimes we utilize the session as a great opportunity to communicate some architectural aspects, platform improvements, or process changes that we had recently, for example changes to the on-call process, core service flow adaptations, and much more. We work on a maximum of two production incident simulations, and the overall session time shouldn't be longer than 60 minutes; we have found that we lose the engineers' concentration in longer sessions. If you work hybrid, it's better to do the session when you are in the same workspace, as we have found it much more productive. The communication makes a great difference.
Let me share with you what we are doing specifically in this part, which is the core of the workshop. I think this is one of the most important things. Our "On-Call like a King" workshop sessions usually try to be as close to real-life production scenarios as possible by simulating real production scenarios in one of our environments. Such real-life scenarios enable engineers to build confidence while taking care of real production incidents. Try to have an environment where you can simulate the incident, and let people play in real time. As we always say, there is no environment identical to production, and since we are running specific experiments, it's not necessary to have a production environment in place. Obviously, the more you advance, the better it might be to work on production, but it's much more complex and we have never done that before.
Since we utilize chaos engineering here, I suggest having real experiments that you can execute within a few clicks. We are using one of our load test environments for that purpose. We started manually, and if you don't have any tool, I suggest you do not spend time on that; don't rush to a specific chaos engineering tool. Just recently we started using LitmusChaos to run the chaos experiments, but you can use anything else you would like, or you can just simulate the incident manually. I think the most important thing is, as I said before, to have a playground where engineers can actually exercise and not just hear someone talking over a presentation slide. You will be convinced that when they are practicing, and not just listening to someone explaining something on a slide, the session is much more productive.
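As an illustration of the "simulate the incident manually" option, the sketch below kills one pod of a target service in a test environment with the Kubernetes Python client. It is only a rough example under assumed names: the namespace, label selector, and service are made up, and a tool like LitmusChaos would express a similar pod-delete experiment declaratively instead.

```python
# Hypothetical manual "experiment": delete one pod of a service in a test
# environment to simulate a sudden crash. Namespace and selector are made up.
import random
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

TARGET_NS = "load-test"
TARGET_SELECTOR = "app=order-service"

pods = v1.list_namespaced_pod(TARGET_NS, label_selector=TARGET_SELECTOR).items
if not pods:
    raise SystemExit("nothing to break: no pods matched the selector")

victim = random.choice(pods)
print(f"deleting {victim.metadata.name} to simulate a sudden pod failure")
v1.delete_namespaced_pod(victim.metadata.name, TARGET_NS)
```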
Right after the introduction slides, we drill down into the first challenge. The session starts with a slide explaining a certain incident that we are going to simulate. We usually give some background on what is going to happen, for example: there is some back pressure that we couldn't handle since a specific UTC time. We present some metrics of the current behavior, for instance the alerts and the corresponding Grafana dashboard. You should present something very minimal, because this is how it actually happens during a real production incident. Then we give the engineers some time to review the incident by themselves. Giving them the time to think about it is crucial. They exercise alone, thinking about whether they have encountered something similar before. This is a very important step: it will encourage them to try to find out more information and utilize their know-how to get it, such as gathering cluster metrics, viewing the relevant dashboards, and reading the logs and service status.
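To give a feel for the kind of evidence gathering this step involves, here is a small sketch, again with the Kubernetes Python client and made-up namespace and labels, that pulls restart counts and recent logs for a suspect service; it is my own illustration, not part of the workshop material.

```python
# Hypothetical evidence-gathering sketch: restart counts and recent logs for
# a suspect service. Namespace and label selector are placeholders.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("orders", label_selector="app=order-service").items
for pod in pods:
    # Restart counts hint at crash loops or OOM kills.
    for cs in pod.status.container_statuses or []:
        print(pod.metadata.name, cs.name, "restarts:", cs.restart_count)
    # Tail the last few minutes of logs for this pod.
    logs = v1.read_namespaced_pod_log(
        pod.metadata.name, "orders", tail_lines=50, since_seconds=300
    )
    print(logs)
```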
Understanding the customer impact is a very important aspect, and it's even more important when you are on call in the case of a severity one. You should communicate the impact on the customers and see if there is any workaround until the incident is resolved completely. Engineers are not always aware of the actual customer impact, so it's a very good time to discuss it. Review their analysis from time to time and encourage them to ask questions.
We have found that discussions around the incidents are a great place for knowledge sharing. Knowledge sharing can be anything from design diagrams to a specific Kubernetes command line. If you are sitting together in the same space, it can be pretty nice because you can see who is doing what, and then you can ask them to show which tools they use and how they got there. What I really like about those sessions is that they trigger conversations; engineers tell each other to send over some of the CLIs or tools that make their life easier while debugging an incident. The workshop sessions will teach you a lot about the know-how that people have, and I encourage you to update the playbooks based on that. If you don't have such playbooks, I really recommend you create them. We have a variety of on-call playbooks; most of them are composed for major services' alerts. They provide the on-call engineer with some gotchas and high-level flows that are important to look at when dealing with different scenarios.
This is what our playbook template looks like. Drive the conversation by asking questions that will let you cover some of the topics that you would like to train on. Some examples that have proved to be efficient: ask an engineer to present a Grafana panel to look at, ask an engineer to share his Kibana log queries, or ask someone else to present Jaeger tracing and how to find such a trace. You sometimes need to moderate the conversation a bit, as the time flies pretty fast and you need to bring back the focus. During the discussion, point your finger at interesting architectural aspects that you would like the engineers to know about. Maybe you can talk about a specific async channel that you might want to share your thoughts on. Encourage the audience to speak by asking questions around the areas of interest; that will enable them to suggest new design approaches or highlight challenges they have been thinking about lately. You might be surprised, and even add items to your technical debt. At the end of every challenge, ask somebody to present the end-to-end analysis. It makes things clear for people who might not feel comfortable enough to ask questions in large forums, engineers who have just joined the team, or junior engineers who might want to learn more.
It's also a great resource for people to go back to what has been done, and a fantastic part of the knowledge base that you can use in the onboarding process for new engineers who just joined the team. I have found that people sometimes just watch the recording afterwards. It even comes in handy just for engineers to get a view of the tools that are in their toolbox. So just make sure you record the session and share the meeting notes right after it. As you can see, chaos engineering for training is pretty cool. Leverage it to invest in your engineering team's knowledge and skills. It seems to be successful, at least it was successful for us.
So, to summarize some of the key takeaways: we have found that these sessions are an awesome playground for engineers. I must admit that I didn't think about chaos engineering for this simulation in the first place. We started with just a manual simulation of our incidents, or just presented some of the evidence we gathered during a real failure to drive conversation around it. As we moved forward, we leveraged chaos tools for that purpose. Besides the training to become better cloud native engineers, our on-call engineers feel much more comfortable in their shifts and understand the tools available to them to respond quickly. I thought it would be good to share this, because we always talk about chaos engineering experiments as a way to make systems more reliable, but you can also leverage them to invest in your engineering teams' training. Thanks for your time, and I hope it was a fruitful session. Feel free to ask me any questions anytime; I will be very happy to share more. Thank you.