Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everybody, thanks for being here. It is an honor
to have the opportunity to share my thoughts on documenting
chaos engineering. I have given my talk this title:
Why Chaos Engineering Documents Matter.

Nice to meet you, I am Yury Niño. I am Colombian. This is the
flag of my country. I work as a cloud infrastructure engineer at Google Cloud.
I am also a chaos engineering and site reliability engineering
advocate. You can find me on LinkedIn,
Twitter and Medium at jolinin.

As mentioned in the poster of my talk, everybody here knows that software
documentation is key to communicating the decisions of a project.
But something that we have missed is that this includes chaos engineering
exercises, so we should have frameworks for that.
However, chaos engineering is still a new discipline.
There is no framework that defines
how to document experiments. In this talk I am going to
present one, but before that I would like to share a chaos
story that shows the importance of having documentation for chaos
engineering. After that, I am going to share the benefits
of documenting each stage of the software development lifecycle.
It is important to mention that I am not talking about documents
for building and operating systems. I am talking about
the documents for the chaos. At that point I will present a framework
composed of a set of documents that should be kept for each
stage of the chaos. I am going to talk
about the general documents for the team, for example, the team
charters, and I will explore other documents
for preparing, running and finishing the chaos.

So let me start with the story. Do not judge my skills for titling stories.
I am really bad at that. That is the reason for giving this simple
title to the story: A chaos story in the middle of chaos.
This story begins on a Friday at 13:50,
ten minutes before the controlled chaos. I am going to use my name
for the main actress. Yury is a reliability engineer and she is participating
in a chaos engineering exercise. For the first time she watches an
experiment. The co-host announces the routine for the
chaos in the chaos engineering network channel. It is a classic
experiment in which the engineering team unplugs some network
outlets in production. 14:00 is
the moment of truth. One of the co-hosts runs the prepared
command to initiate the failure. Yury's colleague, the note
taker, records the time.
After 14:00 it is time for all participants, except
the note taker, to spring into action. It is the
moment for collecting the evidence of the failure, the recovery and
the impact on the systems. Confirm or disconfirm
all the details of your hypothesis and make note of the remediations.
14:30 is the moment to restore the
service. It is the real moment of truth.
All the preparation and the exercise in the development environment
have led you to this moment. You notice that the
production environment does not return to the steady state.
Since something is wrong, the chaos lead decides to stop
here. But it is too late, because the automated
recovery procedure has put online the region isolated
for the chaos engineering exercise. The failure is now
noticeable to the customers. This feels truly terrifying.
For a moment, Yury has no idea what to do. Fortunately,
she is not alone. Yury proposes escalating the situation
to the development team. The team thinks it is a good idea.
Yury pages the development team. Unfortunately,
she gets no response. The chaos team searches
the intranet to find some documents that could help.
After precious minutes go by, Yury finds a
document referring to a script that can
apparently fix the problem. But this is the first time that the team
has heard of it. Since it
is the only option, they run the script and cross their fingers.
Fortunately, it works. Luckily, all the characters
and episodes in this story are fictional.
Let me analyze some thoughts about the story.
A perfect recipe for chaos in the middle of
the controlled chaos. This is the postmortem of our story,
starting with the things that went well. Let me mention that the team's
use of the communication channels and the chaos engineering isolation
were things that went well.
The things that did not go well:
lack of knowledge about the automated recovery script and, do you remember,
the development team was unavailable.
And of course, the lack of documentation. Finally, let
me close with the next actions plan: a meeting between the
development and SRE teams and, no less important, documentation,
which is the main topic of this talk. The moral
of the story: documentation is very, very important.
Let me explain why. It is common that organizations
depend on the performance of highly skilled individuals on
the team, and teams tend to preserve
important operational concepts and principles as nuggets of
tribal knowledge that are passed on verbally
to new team members. It is a fact that if these
concepts and principles are not codified or documented,
they will often need to be relearned painfully through
trial and error. As I mentioned here,
documentation helps developers communicate
with others, and documents help future
developers understand and maintain the code. For example,
these documents are important
in onboarding processes. Good documentation also
helps you learn from your mistakes. I would like to
share this fragment of an excellent article related to SRE documents:
SRE teams can prevent this process decay by
creating high quality documentation that lays the foundation
for teams to scale up and take a principled approach to
managing new and unfamiliar services.
This is the most awaited moment: presenting the documentation
framework. Here is the framework. I have classified the
documents according to the stages that are part
of a classic chaos engineering
exercise. In the general category, I have included
team charters, production readiness reviews and technical designs.
Before the chaos, it is important to consider the chaos policies,
service level agreements and on-call policies. I am
proposing to use chaos designs, playbooks and incident
management documents during the exercise. And finally, after the
chaos, our mission will be to document postmortems and reliability reports.
Let me go into detail on the general documents.
I am going to start with one of my favorite assets.
I am talking about team charters. A charter generally includes the
following elements. A vision statement: it should be an aspirational
description of what the team would like to achieve in the long term.
A short, high-level explanation of the space in which your
team operates. This includes the types of services
the team engages with, related systems and examples.
Also, a short description of the top two or three services managed
by the team. This section also highlights the key technologies
used and the challenges of running them. The benefits of chaos engineering
engagement and what chaos engineering does. Finally,
key principles and values for the team.

The next general document is the production readiness
review, or PRR as it is known. It is a document with
the criteria used to make sure that a service meets the
accepted standards of operational readiness and that service
owners have the guidance they need. A service goes through
the review process
before the launch to production. It is important to mention that
during this stage the service does not have SRE support,
so the product development team supports the service.
Some common sections include architecture
and dependencies: what is your request flow from user to
front end to back end? Capacity planning, which is very
important: a key question here is, have you obtained
all the compute resources needed to support your traffic? Failure
modes: for example, do you have any single points of failure
in your design, and how do you mitigate the unavailability
of your dependencies? Processes and automation,
also very important. And finally, external dependencies: it is
critical to know, do any partners depend
on your services?

Technical design documents, also known
as TDDs, are similar to proposals, but they
describe in detail how a specific solution will work.
Design documents are often written when the implementation
of a solution is not trivial or not well understood.
The author will usually already have a design
decided, but is looking for approval from reviewers.
Since a design document should describe the steady
state of a system, it is a really valuable asset for chaos
engineers. Some common sections include an overview
of the system, the system architecture, infrastructure,
services and documentation standards, which include naming
conventions, for example.
Other sections of a TDD include programming
standards, development tools, requirement traceability matrices,
document controls, document sign-offs and document change reports.
Now I will move to the documents that you should review before the chaos
experiments. A policy is a guide for
making decisions to achieve excellent outcomes.
Policies are generally adopted by a governance body within
an organization, so my first recommendation is to
include an asset that documents chaos policies.
Here are the sections: an overview, the policy goals,
a description of the meaning of the steady state in your services,
the specification of your SLOs and SLAs,
and the description of the key policies, which includes
outage policies, escalation policies and related documents.
These are the classic service level agreements,
famous in SRE. They usually include:
what is the target for measuring the user
experience? What process will indicate success or
failure for the average user? What systems
and pieces of the infrastructure are used to conduct
that process and meet the target? For each target
or metric you should identify the source, the measurement,
the aggregation, the duration, the threshold by
which we can determine success, and any
explanation behind the target selection.
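To make those fields concrete, here is a minimal Python sketch of one target entry and a check against its threshold. The class name `SloTarget`, its field names, the source string and the sample values are assumptions made for illustration; they do not come from the talk or from any real monitoring tool.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class SloTarget:
    """One row of a hypothetical SLO document (all names are illustrative)."""
    name: str            # what part of the user experience is measured
    source: str          # where the measurement comes from
    aggregation: str     # how samples are combined; only "mean" is supported here
    duration_days: int   # length of the evaluation window
    threshold: float     # minimum aggregated value that counts as success

    def is_met(self, samples: Sequence[float]) -> bool:
        """Aggregate the samples and compare the result with the threshold."""
        if not samples:
            raise ValueError("no samples to evaluate")
        if self.aggregation != "mean":
            raise ValueError(f"unsupported aggregation: {self.aggregation}")
        return mean(samples) >= self.threshold


# Hypothetical target: share of successful requests over a 28-day window.
checkout_slo = SloTarget(
    name="checkout success ratio",
    source="load balancer logs",
    aggregation="mean",
    duration_days=28,
    threshold=0.995,
)

daily_success_ratio = [0.999, 0.998, 0.997]  # made-up sample data
print(checkout_slo.is_met(daily_success_ratio))
```

Writing the target down in this form keeps the source, aggregation, duration and threshold next to each other, so the steady state can be checked the same way before and after an experiment.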
Finally, for this stage, the on-call policies, also very important,
which usually include an overview and
readiness, training and scheduling, shift details,
pager load, compensation, tools and processes,
and communication standards.

This is the real moment
of truth. The following are the documents that could be useful
during the chaos experiment. First, the chaos design. I think this is one
of the most important assets in the framework. I know I have
said that all documents are very important, but I think this
is one of the most important assets in the framework because
it describes the experiment. Here is registered what happened.
The hypothesis, very important, that you will verify
or refute. Then the domain in which the application is
running, and the duration and load applied in the experiment.
The results; it is like an airplane black box. Your observability
strategy and the actions that you will take based on the results.
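The sections listed above can be sketched as a small record that is filled in before the experiment and completed afterwards. This is only an illustrative Python sketch: the class `ChaosDesign`, its field names and the example values are assumptions, not a format defined in the talk.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChaosDesign:
    """Hypothetical record for one chaos experiment (names are illustrative)."""
    hypothesis: str                  # what you expect to verify or refute
    domain: str                      # where the application is running
    duration_minutes: int            # how long the failure is injected
    load_description: str            # load applied during the experiment
    observability: List[str] = field(default_factory=list)   # signals to watch
    hypothesis_confirmed: Optional[bool] = None               # None until the run ends
    follow_up_actions: List[str] = field(default_factory=list)

    def record_result(self, confirmed: bool, actions: List[str]) -> None:
        """Store the outcome and the actions decided from the results."""
        self.hypothesis_confirmed = confirmed
        self.follow_up_actions = list(actions)

    def summary(self) -> str:
        """One-line status, usable in a report or a chat channel."""
        if self.hypothesis_confirmed is None:
            status = "not run yet"
        elif self.hypothesis_confirmed:
            status = "hypothesis verified"
        else:
            status = "hypothesis refuted"
        return f"{self.domain}: {status} ({len(self.follow_up_actions)} follow-up actions)"


# Example values loosely based on the story; every detail is made up.
design = ChaosDesign(
    hypothesis="Unplugging network outlets in one region does not affect customers",
    domain="production, isolated region",
    duration_minutes=30,
    load_description="normal Friday traffic",
    observability=["error rate dashboard", "recovery automation logs"],
)
design.record_result(False, ["document the recovery script", "meet with the development team"])
print(design.summary())
```

Keeping the result and the follow-up actions in the same record as the hypothesis is what gives the document its black-box role: the plan and the outcome can be read side by side later.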
Next, the playbooks, also called runbooks.
With them, the on-call engineers respond to the alerts generated
by service monitoring. In the story, if Yury had had a playbook
telling her what to do in case of failure,
the incident could have been
resolved in a matter of minutes.
Playbooks reduce the time to mitigate an incident
and they provide useful links to consoles and procedures.
Playbooks contain instructions for verification,
troubleshooting and escalation for each alert generated
from network monitoring processes.
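As a rough illustration of that structure, a playbook can be modelled as a lookup from alert name to its verification, troubleshooting and escalation steps. The alert name, the script name `restore_region.sh` and the step texts below are hypothetical; the story never names the alert or the script.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PlaybookEntry:
    """Hypothetical playbook entry for a single alert."""
    alert: str
    verification: List[str]      # how to confirm the alert is real
    troubleshooting: List[str]   # commands and steps to try
    escalation: str              # who to page if the steps do not help


# Illustrative content only; none of these names appear in the talk.
PLAYBOOK: Dict[str, PlaybookEntry] = {
    "region_not_restored": PlaybookEntry(
        alert="region_not_restored",
        verification=["Check the steady-state dashboard for the isolated region"],
        troubleshooting=["Run restore_region.sh (hypothetical recovery script)"],
        escalation="development team on-call",
    ),
}


def steps_for(alert: str) -> List[str]:
    """Return the ordered steps for an alert, or a fallback when none is documented."""
    entry = PLAYBOOK.get(alert)
    if entry is None:
        return [f"No playbook entry for '{alert}': escalate to the chaos lead"]
    return [*entry.verification, *entry.troubleshooting, f"Escalate: {entry.escalation}"]


for step in steps_for("region_not_restored"):
    print(step)
```

The fallback branch matters as much as the documented entries: an alert with no entry is exactly the situation in the story, so the lookup makes the gap visible instead of leaving the engineer to search the intranet.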
Playbooks also contain commands and steps that should be reviewed for
accuracy.

Then there are the incident management documents.
Sometimes called incident response plans or emergency
management plans, they are documents that help
an organization return to normal as quickly as possible following
an unplanned event. The sections
here: overview and readiness, shift handoffs,
escalation, and incident responsibilities, very, very
important. If you remember, for example, Yury is the reliability
engineer in the exercise,
but she is not the lead of the team.
Prioritization, key tools, dashboards and monitoring,
and finally the useful links.

And finally, the documents
for after the chaos. This one is a classic.
As you know, I could not allow myself to leave out the postmortems.
Remember, a software postmortem is an analysis conducted
after a system failure. The goal is to understand
why an incident or error happened and to learn from the
experience. The mission of
this asset is that future software becomes more robust.
Remember that problems happen. By conducting
a postmortem analysis you can help ensure that they do not
happen again.

Reliability reports
are common assets for communicating the results of KPIs to
management. Remember that software reliability is
the probability of failure-free software operation
for a specified period of time in a specified
environment. Some sections that I think you can include
are indicator name, collection method,
assessment formula, state criteria,
target and performance threshold, source of data,
data frequency, data entries and, finally,
expiration or revision date.
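One possible "assessment formula" for such a report is sketched below in Python. It assumes a constant failure rate (the exponential model), under which the probability of failure-free operation for a period t is exp(-t / MTBF). That model, the function names and the example numbers are assumptions chosen for illustration; the talk does not prescribe a formula.

```python
import math


def reliability(period_hours: float, mtbf_hours: float) -> float:
    """Probability of failure-free operation for period_hours.

    Assumes a constant failure rate of 1 / mtbf_hours (exponential model).
    """
    if period_hours < 0:
        raise ValueError("period_hours must be non-negative")
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return math.exp(-period_hours / mtbf_hours)


def meets_target(value: float, target: float) -> bool:
    """Compare an indicator value with the target threshold from the report."""
    return value >= target


# Made-up example: a service with a mean time between failures of 1000 hours,
# assessed over a 24-hour period against a target of 0.95.
daily_reliability = reliability(period_hours=24, mtbf_hours=1000)
print(f"{daily_reliability:.4f}", meets_target(daily_reliability, 0.95))
```

Whatever formula a team chooses, recording it in the report next to the data source and the threshold lets management see how the indicator was computed.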
And that is all. Thank you. Thank you so much for
being here. Remember, you can find me on LinkedIn,
Twitter and Medium at Jorinino and that is my personal webpage
journino at.