Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everybody, thanks for coming. I'm very
excited to be here. The title of my talk
is "Postmortem Culture: Learning from Failure."
Nice to meet you. I am Yury Niño. I work
as a DevOps engineer for Aval Digital Labs, a company
in Colombia that provides technology and innovation services for
our banking group. I am also a chaos
engineering advocate in my country. These are photos
of my coworkers and my company.
Before starting, I would
like to tell you about my hometown,
Garagoa. Garagoa is a town located
in the Boyacá department in Colombia, and its name
means "behind the hill" in the Chibcha language.
For as long as I can remember, each December 16
people in Garagoa celebrate the end of the year with a postmortem.
This postmortem is called the death
of the sadness. In this ceremony, people evaluate their
actions over the last year and make resolutions for the
new year. This celebration made
me wonder: if we are doing postmortems in
our daily life, why don't software engineers
practice postmortems after an incident? According to a survey
of 45 engineers
in my country, software engineers are not practicing
postmortems. For example,
just 40% of them have read
a postmortem, and 60% of them
don't practice postmortems in their jobs.
So, knowing there is a problem, in the next 15
minutes I am going to talk about postmortems. I am going to try to explain
why we don't write postmortems.
I will explain how we can use chaos engineering
and chaos game days to promote this practice
in companies. And finally, I will share our
journey in my company, trying to implement this
practice in our daily life and to promote
chaos engineering in our jobs.
So let me start with some definitions of what
a postmortem is. According to Site Reliability
Engineering, the Google book, if you remember it,
a postmortem is a
written record with the details of what
happened after an incident. And according to PagerDuty,
a postmortem is a record
of what happened after the incident.
However, there is a definition I like more: it is an
answer to these two questions: what
went wrong, and how do we learn from it?
You probably remember these two postmortems. The first one,
written by GitLab, documented an incident in which
an engineer dropped a database in production.
It led to an outage of several hours. The
second one details one of
the most critical outages in AWS, in February
2017. This document
explained how they overcame a failure
in the S3 service in Virginia.
It had a huge impact
on many websites. So if companies
such as AWS and GitLab are practicing
postmortems, why don't we do it?
According to the same survey, the most common reasons
include ignorance and culture. According to it,
55% of people said that they
don't know what a postmortem is, and a second group thinks
that writing postmortems is an activity for DevOps engineers
and operations engineers: if they are software developers,
why write postmortems? These
answers were studied by
Adrian Hornsby and John Allspaw, who concluded that the
lack of accountability is related
to a blameful culture. John Allspaw
analyzed the Five Whys method,
one of the most famous techniques for writing postmortems.
According to his work, this technique is
not a proper way to promote a blameless culture.
Asking why leads us to answer another question:
the question is who, not why, which in almost
every case is irrelevant. It is common to
find postmortems that conclude with
root causes like these, in which a human is
tagged and blamed for the incident.
Now we know that there is a culture
of blaming, so how can we change it? How can we
move from these sentences to a
culture where failure is promoted?
Our proposal is chaos engineering. Chaos engineering
is the discipline of practicing or injecting
failures in production in order to reveal the weaknesses in
the systems. This definition is available on the
Principles of Chaos website. Our proposal also includes
chaos game days. They are an
easy way to introduce
engineering teams to this practice. In
this exercise we have three roles: a master of
disaster, a first on-call person, and the
engineering team. In this exercise, the master
of disaster decides, often in secret, what sort of
failure the system should undergo. With the
team gathered in one room, physical or virtual,
the master of disaster declares the start of the incident
and starts the attack.
One member of the team, who
acts as the first on-call
person, tries to see, triage, and mitigate
whatever failure the master of disaster has
caused. After that, the idea
is that the team analyzes and understands the
failure and provides a solution for it. The event finishes
when the team writes a postmortem. That
is a chaos game day. So what does it mean
in practice? What are the activities involved
in this? Planning a game day involves
a lot of work, because if we are planning
this event, we probably have to create an agenda.
We have to define which events are involved in
it. After that we have to define the users:
define who is the master of disaster, who is the first
on-call person, and who is part
of the team. In a third activity, the idea
is to send communications in order to keep in contact with
the people involved in the event. After that we
have to design the experiment, because chaos
engineering follows the scientific
method. So the next activity is to design the experiment.
Then we have to provision hardware,
software, chaos attackers, and chaos agents,
for example Chaos Toolkit or Gremlin,
in this part. And finally, we have to provide an observability
tool, because we need visibility into what is
happening during the event. Although Gremlin
and ChaosIQ have done great work generating
contributions, tutorials, documents, and experience reports for
doing chaos game days, the reality is that planning
a chaos game day is an activity that
takes a long time. On average, a person spends
90 days planning a game day.
So probably we don't have
time to have lunch. That is
it. So, in order to reduce these times, at Aval Digital
Labs, my company, we designed a tool,
and we are working on its implementation.
This is a view of Gaveta. Gaveta means toolbox in English.
It is a view that includes four layers with
separate activities for each of them. In the first layer we have
the roles and the users involved in the event: the master of disaster,
the first on-call person, and the team, who access
Gaveta, probably using a web browser
or a mobile device. Why a mobile device? Because
chaos game days are exercises in which people are relaxed,
with food and music, so some activities
are probably easier with
a mobile device. These people access
the core of Gaveta in the second layer.
Gaveta is implemented with Go, and it uses Terraform
to provision the infrastructure required for
the event. In the third layer, we have five
suppliers for managing the chaos game day. If you remember,
we have activities for planning these events.
So in this case we have a planner, a tool
to plan the event. The idea
is to have the possibility of using different tools
here. In this case, we are using Google
Calendar, but if you want to use another tool, it is
possible. In the second supplier we
have tools to send communications. For example, you can
use Slack or push notifications for mobile devices.
In a third supplier we have Terraform.
Terraform has the responsibility of provisioning
the infrastructure required for the event. In this case,
we probably need a cloud provider,
for example AWS or GCP. We need
a chaos agent; in this case we are using Gremlin or Chaos
Toolkit. And finally we need an observability tool.
In this case we are using Datadog,
but you can use New Relic
or any other. Finally, we
need two tools to provide the infrastructure
to write a postmortem and to document the actions and
next steps, because the idea is, in
this case with Jira, to provide tickets or
user stories in order to avoid a similar incident
in the future. And finally, we have the system.
You probably remember this kind
of architecture.
Gaveta at this moment is using this architecture to
provide the features mentioned in the last slide. In this case
we are using a hexagonal architecture.
The hexagonal architecture uses three concepts:
the core domain in the center of the hexagon, and adapters
and ports to connect the external world with the internal
core. This is a view of Gaveta
using this architecture. In this case we have,
in the core, handlers to manage the activities
involved in the chaos game day. We are using
a handler to plan the event, and
another handler to manage the
communicator, which sends communications and reminders
to the participants. We have a handler to manage
Terraform in order to provision the infrastructure required for this.
And finally we are using two
handlers to document and to register the next actions in
the backlog. The responsibility of Gaveta is to orchestrate
these handlers, and we are using another
concept, adapters, in order to have the possibility of
implementing these interfaces using different technologies.
In this case, for example, we are using Google Suite
for planning the event. But if you want to use,
I don't know, Outlook or any other tool, that is
possible using this concept of adapters and
interfaces, with a core that has the
responsibility of orchestrating them.
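The ports-and-adapters idea described here can be sketched in Go, the language the talk says the tool is written in. This is only an illustration under stated assumptions: the interface names, method signatures, and the in-memory `recorder` adapter are hypothetical and are not taken from the tool's actual code. A real adapter would wrap a calendar, a chat service, Terraform, a wiki, or an issue tracker behind the same interfaces.

```go
package main

import (
	"errors"
	"fmt"
)

// Ports: the interfaces the core depends on. All names are hypothetical.
type Planner interface{ Schedule(agenda string) error }
type Communicator interface {
	Notify(participants []string, message string) error
}
type Provisioner interface{ Provision(configPath string) error }
type Documenter interface{ PublishPostmortem(title string) error }
type Backlog interface{ CreateTicket(summary string) error }

// GameDay is the core. It knows only the ports, never a concrete tool.
type GameDay struct {
	Planner      Planner
	Communicator Communicator
	Provisioner  Provisioner
	Documenter   Documenter
	Backlog      Backlog
}

// Prepare plans the event, notifies participants, and provisions the
// infrastructure, stopping at the first error.
func (g GameDay) Prepare(agenda string, participants []string, configPath string) error {
	if err := g.Planner.Schedule(agenda); err != nil {
		return fmt.Errorf("plan: %w", err)
	}
	if err := g.Communicator.Notify(participants, "Game day scheduled: "+agenda); err != nil {
		return fmt.Errorf("notify: %w", err)
	}
	if err := g.Provisioner.Provision(configPath); err != nil {
		return fmt.Errorf("provision: %w", err)
	}
	return nil
}

// Close publishes the postmortem and registers one ticket per action.
func (g GameDay) Close(title string, actions []string) error {
	if err := g.Documenter.PublishPostmortem(title); err != nil {
		return fmt.Errorf("document: %w", err)
	}
	for _, a := range actions {
		if err := g.Backlog.CreateTicket(a); err != nil {
			return fmt.Errorf("backlog: %w", err)
		}
	}
	return nil
}

// recorder is an in-memory adapter that stands in for real ones.
// It only records the calls it receives.
type recorder struct{ calls []string }

func (r *recorder) Schedule(agenda string) error {
	r.calls = append(r.calls, "schedule: "+agenda)
	return nil
}
func (r *recorder) Notify(participants []string, message string) error {
	r.calls = append(r.calls, fmt.Sprintf("notify %d people: %s", len(participants), message))
	return nil
}
func (r *recorder) Provision(configPath string) error {
	if configPath == "" {
		return errors.New("missing infrastructure definition")
	}
	r.calls = append(r.calls, "provision: "+configPath)
	return nil
}
func (r *recorder) PublishPostmortem(title string) error {
	r.calls = append(r.calls, "postmortem: "+title)
	return nil
}
func (r *recorder) CreateTicket(summary string) error {
	r.calls = append(r.calls, "ticket: "+summary)
	return nil
}

func main() {
	r := &recorder{}
	day := GameDay{Planner: r, Communicator: r, Provisioner: r, Documenter: r, Backlog: r}
	if err := day.Prepare("Chaos game day", []string{"master", "on-call", "team"}, "gameday.tf"); err != nil {
		fmt.Println("prepare failed:", err)
		return
	}
	if err := day.Close("Game day postmortem", []string{"Add a timeout to the client"}); err != nil {
		fmt.Println("close failed:", err)
		return
	}
	for _, c := range r.calls {
		fmt.Println(c)
	}
}
```

Because the core depends only on the interfaces, swapping one calendar or chat tool for another means writing a new adapter, while `GameDay` stays unchanged.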
So let
me show a video with the mockups for Gaveta.
I think that it's
possible to see it. Yeah.
In this case I am using the master of disaster's credentials
to access Gaveta. Gaveta at this moment is
a web application that provides its functionality as
a service. The first thing you probably want
to do in Gaveta is define the adapters.
In this case, for example, we are using Google Suite for
planning the event, and
we are using Slack for sending communications.
But as I mentioned, it is possible to use
another tool if you provide an implementation
of the interface. We are also using Terraform
and the concept of infrastructure as code in
order to provision the tools required
to run the event. And finally we are using Confluence
and Jira for documenting and providing the
next steps after the chaos game
day finishes. Now
we have a view of the home page for
a master of disaster who is planning the event. In
Gaveta's top bar we have four options
to manage the event. We have a first option to plan
the event. What does it mean?
That we have to define the agenda.
Probably the text in
this text field will be transformed by Gaveta into activities
in Google Calendar, for example. In a second part we have the
possibility to define who is the master of disaster,
who are the users, and who is the first on-call person. We have to define
the users beforehand because we have to send communications and
reminders to the participants before
the chaos game day starts. In a
second option we have the possibility to define
and provision the infrastructure required to run the chaos game
day. In this case we are uploading a Terraform
file, because Terraform offers the possibility, using
its concept of providers, to define
which cloud providers we use. If we are
using AWS, for example, we have to define a .tf file
for Terraform with this definition. In
this case the file allows us to define,
for example, that we are using AWS,
Datadog, and Gremlin to attack the application.
In a third option we have the possibility to create
or design the experiment. Remember that
the intention in the chaos game day is to
design an experiment. In this case I am registering
the application name, the observability tool,
and the hypothesis for the experiment.
In this case, for example, I am trying to test whether
the Hystrix configuration and the circuit breaker pattern
work in this architecture. In the
next option we have the possibility to choose
an attack, in this case
lasting ten minutes. And finally
we have the possibility to define the blast
radius. We also have the possibility to add
some notes. The idea is that, once the experiment is
defined, we launch the attack during the event.
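The experiment form described here (application name, observability tool, hypothesis, attack, duration, blast radius, notes) could be modeled as a small Go type with a validation step before launch. This is a sketch under assumptions, not the tool's data model: the type, the field names, the example application name, and the choice to express blast radius as a percentage are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Experiment mirrors the form fields mentioned in the talk.
// The type and field names are hypothetical.
type Experiment struct {
	Application   string
	Observability string
	Hypothesis    string
	Attack        string
	Duration      time.Duration
	BlastRadius   int // assumed here to be a percentage of targets, 1 to 100
	Notes         string
}

// Validate rejects an experiment that is not ready to be launched.
func (e Experiment) Validate() error {
	switch {
	case e.Application == "":
		return errors.New("application name is required")
	case e.Hypothesis == "":
		return errors.New("hypothesis is required")
	case e.Attack == "":
		return errors.New("attack is required")
	case e.Duration <= 0:
		return errors.New("duration must be positive")
	case e.BlastRadius < 1 || e.BlastRadius > 100:
		return errors.New("blast radius must be between 1 and 100 percent")
	}
	return nil
}

func main() {
	exp := Experiment{
		Application:   "payments-api", // hypothetical application name
		Observability: "Datadog",
		Hypothesis:    "The circuit breaker opens and the fallback responds when a dependency is slow",
		Attack:        "latency",
		Duration:      10 * time.Minute,
		BlastRadius:   25,
		Notes:         "Run during the game day only",
	}
	if err := exp.Validate(); err != nil {
		fmt.Println("invalid experiment:", err)
		return
	}
	fmt.Printf("ready to launch %q on %s for %s\n", exp.Attack, exp.Application, exp.Duration)
}
```

Making the hypothesis and the blast radius mandatory keeps the exercise an experiment with a bounded impact rather than an unplanned outage.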
In a final option
we have the possibility to see
the postmortem and the actions provided by the team during the
event. So yeah,
that is so slow. Did you click on that?
Probably, yeah. That is
a view of a postmortem
provided by Gaveta. It is a postmortem prefilled by
Gaveta, and it is the responsibility of the team to
complete it. That is an example of what Gaveta could
do for the team: providing a template for
a postmortem, inspired in this case by the ChaosIQ
template for postmortems.
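A prefilled postmortem of this kind could be produced with Go's standard `text/template` package. The sketch below is an assumption-laden illustration: the `Postmortem` type, its fields, and the section layout are hypothetical and do not reproduce the ChaosIQ template. The two open sections reuse the questions from the definition given earlier in the talk: what went wrong, and how do we learn from it.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Postmortem holds what the tool already knows when the game day ends.
// The type and its fields are hypothetical.
type Postmortem struct {
	Title       string
	Hypothesis  string
	Attack      string
	ActionItems []string
}

// skeleton leaves the two key questions open for the team to answer.
const skeleton = `Postmortem: {{.Title}}

Hypothesis: {{.Hypothesis}}
Attack: {{.Attack}}

What went wrong?
(to be completed by the team)

How do we learn from it?
(to be completed by the team)

Action items:
{{- range .ActionItems}}
- {{.}}
{{- else}}
- (none yet)
{{- end}}
`

// Render fills the skeleton with the fields that are already known.
func Render(p Postmortem) (string, error) {
	tmpl, err := template.New("postmortem").Parse(skeleton)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, p); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	doc, err := Render(Postmortem{
		Title:       "Game day: circuit breaker under latency",
		Hypothesis:  "The fallback responds when a dependency is slow",
		Attack:      "latency for ten minutes",
		ActionItems: []string{"Create a ticket for each follow-up action"},
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Print(doc)
}
```

The tool fills in only what it can know from the experiment definition; the analysis itself stays with the team, which matches the division of responsibility described above.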
So, to close, I would
like to share the conclusions of Sidney Dekker. He
is the author of this classic book, a really
good book: The Field Guide to Understanding
Human Error. Sidney
Dekker tells us to see the world with a different
view, in which humans are not the cause of
the problems. If something goes wrong and
a human is involved, it is a symptom of a
deeper trouble, not the cause of it. So that
is all. Thanks for coming. Thanks for listening.
These are my social networks if you want to
contact me. Thank you, and welcome to Colombia. Welcome to Garagoa. Thank you.
Thank you very much.