Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everybody, thanks for being here. It is an honor to have a new
opportunity to share my latest work, Avoiding the Death of SRE
Documents That Matter. Nice to meet you. I am Yury Nino. I am
Colombian. I work as a cloud infrastructure engineer at Google Cloud.
I am also a chaos engineering and site reliability engineering
advocate. You can find me on LinkedIn, Twitter and Medium as
jolinita. Regarding this talk, I have to say that this material is
the result of a lot of reading and self-learning while trying to
build a framework for documenting my exercises in chaos engineering.
Along the way, I found an excellent article titled Why SRE Documents
Matter, written by Nukala and Vivek Rau, who are SRE product managers
at Google. I am going to start with this. After that, I am going to
describe a non-exhaustive list of SRE documents, which was completed
using the book Seeking SRE. With this in mind, I am going to share
some recommendations that I learned while studying this exciting
topic.
Since you already know about SRE, I will just share this: site
reliability engineering is a job function, a mindset, and a set of
engineering approaches for making web products and services run
reliably. SRE operates at the intersection of software development
and systems engineering to solve operational problems and engineer
solutions to build and run large-scale distributed systems. SRE is
focused on the life cycle of services, from inception and design
through development, operation, refinement, and eventual
decommissioning.
In these phases, SRE provides principles and practices for several
areas. Monitoring and metrics is about measuring how the service is
actually behaving and correcting discrepancies. Emergency response
means, in other words, noticing and responding to service failures in
order to preserve availability according to SLAs. Capacity planning
is very important here; it means projecting future demand and
ensuring that a service has enough computing resources in appropriate
locations to satisfy that demand. Service turn-up and turn-down is
about deploying and removing computing resources for a service in a
data center in a predictable fashion, often as a consequence of
capacity planning. Change management is about altering the behavior
of a service while preserving service reliability. And performance
covers design, development, and engineering related to scalability,
isolation, latency, throughput, and efficiency.
Regarding documentation, I would like to share that according to the
Stack Overflow 2022 developer survey, the most important resource for
people learning how to code is technical documentation. However, the
same participants rank documentation as the number two challenge
facing developers. This is a problem because missing, incomplete, or
outdated documentation hurts development velocity, software quality,
and something that is critical for SRE: service reliability.
So let me connect this idea with the reason for documenting SRE
processes. Teams used to preserve the important operational concepts
and principles as nuggets of tribal knowledge that are passed on
verbally to new team members. And it is a fact that if the concepts
and principles are not codified and documented, they often need to be
relearned painfully through trial and error. In this sense, documents
are very important for many reasons. They help developers communicate
with each other, and they allow future developers to understand and
maintain the code. In general, good documents help you learn from
your mistakes.
SREs often spend 35% of their time on operational work, which leaves
only 65% for development. This last percentage includes time for
documenting, of course. However, documentation takes time and effort,
and there are special challenges for SREs. It can be a major cause of
frustration and job unhappiness for developers.
Some developers consider that documenting might not be recognized or
rewarded during performance reviews or the promotion process. Having
mentioned the importance and the challenges of documentation, let me
focus on the types of SRE documents according to the article Why SRE
Documents Matter. Remember, this presentation is based on this
article. They can be classified in these categories: documents for
running a service, documents for new service onboarding, documents
for reporting service state, documents for production products,
documents for service decommissioning, and documents for running SRE
teams. So let me explore the documents in each category in the next
slides.
Documents for new service onboarding: the documents in this category
belong to a more general concept. I am talking about the PRR, a
production readiness review. A PRR is conducted to make sure that a
service meets accepted standards of operational readiness and that
its owners have the necessary guidance for running it. So let me go
through the specific documents for a production readiness review. The
first one covers architecture and dependencies, and it provides
answers to: what is your request flow from user to front end to back
end, and are there different types of requests with different latency
requirements? Next, capacity planning. It is a document with answers
to these questions: how much traffic and what rate of growth do you
expect during and after the launch? And have you obtained all the
compute resources needed to support your traffic? Another important
document here is failure modes, which provides answers to: do you
have any single points of failure in your design? How do you mitigate
unavailability of your dependencies? About processes and automation,
the questions are: are any manual processes required to keep the
service running, and how are we automating these processes? Finally,
in this subcategory of documents for new service onboarding
(remember, we were talking about production readiness review
documents), the last document covers external dependencies. The main
objective of this document is answering these questions: what
third-party code, data, services, or events does the service or the
launch depend upon? Do any partners depend on your service? If so,
do they need to be notified of your release?
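
The capacity planning questions above (expected traffic, rate of
growth, and whether enough compute has been obtained) can be turned
into a rough calculation. The following is only a minimal sketch: the
function name, the compound monthly growth model, and the 30% headroom
are assumptions for illustration and do not come from the talk or the
article.

```python
import math


def replicas_needed(peak_qps: float, monthly_growth: float, months: int,
                    qps_per_replica: float, headroom: float = 0.3) -> int:
    """Project peak traffic forward and return the replicas that cover it.

    All parameters are hypothetical inputs a capacity planning document
    might record; the compound growth model is an assumption.
    """
    # Compound the expected monthly growth over the planning horizon.
    projected_qps = peak_qps * (1 + monthly_growth) ** months
    # Keep spare capacity so a spike or a lost replica is survivable.
    required_qps = projected_qps * (1 + headroom)
    # Round up: a fraction of a replica still needs a whole replica.
    return math.ceil(required_qps / qps_per_replica)


if __name__ == "__main__":
    # 1000 QPS today, 10% monthly growth, six months ahead.
    print(replicas_needed(1000, 0.10, 6, qps_per_replica=200))
```

Comparing the result with the resources already obtained gives a
first answer to the question of whether the launch is covered.
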
In addition to PRRs, this first category includes documents for
onboarding services. The SRE organization needs to create overview
documents that explain the SRE role and responsibilities in general
terms to product development teams. This serves to set expectations
correctly. A primary goal of these documents is to ensure that
developer teams don't equate SREs with an operations team. When
developers don't fully understand what SREs do, miscommunication and
misunderstanding can result. Another important document here is the
engagement model document, which explains how the SRE team will
engage with developer teams during and after service takeover.
Topics covered in these documents include service takeover criteria
and the PRR process, the SLO negotiation process and error budgets,
new release criteria and release policies, the content and frequency
of service status reports from the SRE team, SRE staffing
requirements, and future roadmap planning processes. If you remember,
SREs need to be effective across two domains, development and
operations, so they need documents to understand and run a service in
production. Examples of the core documents include service overviews,
playbooks and procedures, postmortems, policies, and SLAs. So let me
review each one. SREs need to know the system architecture,
components and dependencies, and service contacts and owners. These
overviews are often an output of the PRR process, and they should be
updated as the service changes, for example if you include a new
dependency in the service. A complete service overview provides a
whole description of the service and how it interacts with the world
around it, as well as links to dashboards, metrics, and related
information that SREs need to solve unexpected issues. Playbooks,
also called runbooks: I love this type of document. On-call engineers
respond to the alerts generated by service monitoring. Playbooks
reduce the time to mitigate an incident, and they provide useful
links to consoles and procedures. Playbooks can contain instructions
for verification, troubleshooting, and escalation for each alert
generated from network monitoring processes, and they contain
commands and steps that need to be reviewed for accuracy.
Postmortems are a classic, as you know. Remember, a software
postmortem is an analysis conducted after a system failure. Each
postmortem derived from this template describes a production outage:
its timeline, a description of user impact, the root cause, action
items, and lessons learned.
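
The sections named above can be sketched as a small record type that
also reports which sections are still empty. This is a minimal
sketch: the class, the field names, and the sample outage are
assumptions for illustration, not the template shown in the talk.

```python
from dataclasses import dataclass


@dataclass
class Postmortem:
    """Sections of a postmortem; field names are illustrative."""
    title: str
    timeline: list[tuple[str, str]]  # (time, event) pairs
    user_impact: str
    root_cause: str
    action_items: list[str]
    lessons_learned: list[str]

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]


if __name__ == "__main__":
    draft = Postmortem(
        title="Checkout outage (hypothetical)",
        timeline=[("10:02", "Alert fired"), ("10:40", "Rollback done")],
        user_impact="Some checkout requests failed for 38 minutes",
        root_cause="Bad configuration push",
        action_items=[],
        lessons_learned=["Canary configuration changes"],
    )
    print(draft.missing_sections())
```

A check like this can keep a postmortem from being closed while it
still has no action items.
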
In this slide I am going to talk about policies, and the first kind
is technical policies. Technical policies can apply to areas such as
production change logging, log retention, and internal service
naming. Policies also apply to processes. Escalation policies help
engineers classify production issues as emergencies or
non-emergencies and provide recommendations on the appropriate action
for each category.
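
An escalation policy of this kind can be read as a classification
rule plus a recommended action per category. The sketch below is
purely illustrative: the two criteria and the recommended actions are
assumptions, since the talk does not say how issues are classified.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """A production issue; both criteria are hypothetical."""
    user_facing: bool
    slo_at_risk: bool


# Hypothetical recommended action for each category.
RECOMMENDED_ACTION = {
    "emergency": "Page the on-call engineer immediately",
    "non-emergency": "File a ticket for the next business day",
}


def classify(issue: Issue) -> str:
    """Treat an issue as an emergency only if users are affected and
    the service level objective is at risk (an assumed rule)."""
    if issue.user_facing and issue.slo_at_risk:
        return "emergency"
    return "non-emergency"


if __name__ == "__main__":
    category = classify(Issue(user_facing=True, slo_at_risk=True))
    print(category, "->", RECOMMENDED_ACTION[category])
```
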
On-call expectation policies outline the structure of the team and
the responsibilities of the team members.
An SLA is a formal agreement with a customer on the performance of a
service. SRE teams document their services' SLAs for availability and
latency, and monitor performance against them. Having SLAs allows SRE
teams to innovate more quickly while preserving a good user
experience. SREs running services with well-defined SLAs will detect
outages faster and therefore resolve them faster. Good SLAs also
result in less friction between SREs and software engineers, which is
a classic. In the next slides, I am going to describe documents that
are related to the products and tools SREs develop. These documents
are important because they enable users to find out whether a product
is right for them to adopt, how to get started, and how to get
support. They also provide a consistent user experience and
facilitate product adoption. This helps SREs and product development
engineers understand what the product or tool is, what it does, and
whether they should use it.
Let me start with the concept guide or glossary, which defines all
the terms unique to the product. The goal of a quick start guide is
to get engineers up and running with a minimum of delay. It is
helpful to new users who want to give a product a try.
How-to guides: this type of document is for users who need to know
how to accomplish a specific goal with the product. How-tos help
users complete important specific tasks, and they are generally
procedure based. Finally, engineers use developer guides, which are
really important here, to find out how to program to a product's
APIs. Engineers can use codelabs or tutorials, which you can find at
Google for example, combining explanation, example code, and code
exercises to get up to speed with the product. Codelabs can also
provide in-depth scenarios that walk engineers step by step through a
series of key tasks. I really like this. The FAQ page answers common
questions. The answers also point users to reference documents and
other pages on the site for more information.
The support page identifies how engineers can get help when they are
stuck on something. It also includes an escalation flow,
troubleshooting info, group links, dashboards and SLOs, and on-call
information. An API reference is usually generated from code comments
and sometimes written by tech writers. It provides descriptions of
functions, classes, and methods with minimal verbiage or narrative.
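
As a small illustration of an API reference generated from code
comments, the sketch below uses Python's standard inspect module to
turn docstrings into reference lines. The ReplicaPool class and the
api_reference helper are made up for the example; real projects
usually rely on a dedicated documentation generator.

```python
import inspect


class ReplicaPool:
    """Track the replicas serving one service."""

    def __init__(self, replicas: int = 0) -> None:
        self.replicas = replicas

    def turn_up(self, count: int) -> int:
        """Add count replicas and return the new total."""
        self.replicas += count
        return self.replicas

    def turn_down(self, count: int) -> int:
        """Remove up to count replicas and return the new total."""
        self.replicas = max(0, self.replicas - count)
        return self.replicas


def api_reference(cls: type) -> str:
    """Build a plain-text reference from a class and its docstrings."""
    lines = [f"{cls.__name__}: {inspect.getdoc(cls)}"]
    # getmembers returns the public and private functions sorted by name.
    for name, member in inspect.getmembers(cls, inspect.isfunction):
        if not name.startswith("_"):
            signature = inspect.signature(member)
            lines.append(f"  {name}{signature}: {inspect.getdoc(member)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(api_reference(ReplicaPool))
```

Because the reference is derived from the comments, it stays terse
and changes whenever the code comments change.
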
This part describes the documents that SRE teams produce to
communicate the state of the services they support. Information about
the state of a service comes in two forms: a quarterly service review
report and a presentation about it. The goal of the quarterly report
or presentation is to cover the state of the service, including
details about performance, sustainability, risks, and overall
production health. SRE is interested in quarterly reports because
they provide visibility into the following: on-call load, tickets,
postmortems, performance against SLAs, and risks. Quarterly reports
also provide opportunities for the SRE team to highlight the benefits
of SRE, request prioritization for solving problems, and request
feedback about the SRE team's focus. SRE teams need to have a
cohesive set of reliable, discoverable documentation to function
effectively as a team. So it is a good practice to create a team site
with a team charter, because it provides a focal point for
information and documents about the SRE team and its projects.
At Google, for example, many SRE teams use g3doc, which is an
internal documentation platform where documentation and source code
live together. Specifically, team charters explain the rationale for
the team and document its major engagements. A charter serves to
establish the team identity, primary goals, and role relative to the
rest of the organization. A charter generally includes the following
elements: a high-level explanation of the space in which the team
operates, a short description of the top two or three services
managed by the team, the key principles and values for the team, and
links to the team site and documents. SRE teams invest in training
materials and processes for new SREs because training results in
faster onboarding to the production environment.
SRE teams also benefit from having new members acquire the skills
required to join the ranks of on-call as early as possible. In the
absence of comprehensive training, an SRE could flounder during a
crisis, turning a potential minor incident into a major outage. Many
SRE teams use checklists for on-call training, and an on-call
checklist generally covers all the high-level areas team members
should understand well. Examples of these include production
concepts, the front-end and back-end stack, automation and tools, and
monitoring.
SREs also use role-play training. For example, at Google we use Wheel
of Misfortune, which is an educational tool for training team
members. In this exercise, an outage scenario is presented to the
team, with the team members taking turns playing the role of the
on-call engineer in order to mitigate the emergency. Role-play
exercises test the ability of the SREs to know where to find the
documents and to use them for solving the hypothetical incident.
Let me finish this presentation with some recommendations from Google
for keeping documents alive. The first one is communicating the value
of documents. That is very important. That is my first
recommendation. If you want to convince fellow engineers and
leadership to invest time and resources in documents, it is essential
that you gather data that demonstrates the quality, effectiveness,
and value of your documentation.
Remember, when you talk about the impact of your documentation work,
you are talking about the business value of your output. So while
structural quality is pretty easy to measure, functional data is more
complicated. Remember that it falls into three buckets: measurable
success, user behavior, and sentiment data. The second recommendation
is to create a centralized repository. SRE team information can be
distributed across a number of sites, local team knowledge, and
Google Drive folders, for example, which can make it difficult to
find correct and relevant information. So it is really important to
define a structure for all information and ensure that the team
members know where to store, find, and maintain information. That is
the most important thing. Here are some guidelines to create and
manage a team documents repository. Determine the relevant
stakeholders and conduct brief interviews to identify all needs.
Locate as many documents as possible and do a gap analysis on
content. Set up a basic structure for the site so that new documents
can be created in the correct location. Create a monitoring and
reporting structure to track the progress of migrating and archiving
documentation; that is very important. Verify which common terms are
used in the search feature, for example, and finally use signals such
as Google Analytics to track and analyze usage. After you determine
the functional requirements and quality indicators for each document,
it is really valuable to create templates. Templates make it easy for
authors to create new documentation by providing a clear structure
that they can populate with relevant information.
With a good template, creating a simple document can be as easy as
filling out a form. Templates also make it easy for readers to
quickly understand the topic of the document, the type of information
it contains, and how it is organized.
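
The idea of a template that is filled out like a form can be sketched
with Python's standard string.Template. The layout and field names
below are assumptions for illustration, loosely based on the service
overview contents mentioned earlier; they are not a template from the
talk or from any book.

```python
from string import Template

# Hypothetical service overview layout; the fields are illustrative.
SERVICE_OVERVIEW = Template(
    "Service: $service\n"
    "Owners: $owners\n"
    "Dependencies: $dependencies\n"
    "Dashboards: $dashboards\n"
)


def render_overview(**fields: str) -> str:
    """Fill in the template; substitute() raises KeyError if a field
    is missing, so an incomplete document cannot be produced."""
    return SERVICE_OVERVIEW.substitute(**fields)


if __name__ == "__main__":
    print(render_overview(
        service="checkout (hypothetical)",
        owners="checkout-sre",
        dependencies="payments, inventory",
        dashboards="checkout overview dashboard (placeholder)",
    ))
```

Failing on a missing field is a deliberate choice here: the template
guides the author and also signals when a section has been skipped.
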
The Site Reliability Engineering book contains several examples of
documentation templates. Here is an example of a playbook template
that provides structure and guidance for engineers filling in the
content. Another important recommendation is defining success
metrics. It is essential if you want to be able to communicate the
value of your work to the rest of your organization. For example, a
service review has high functionality if it provides the context
needed to handle an outage, and if you have the possibility to
measure its impact in the organization. It is also important to have
guidance from technical writers, because they provide a lot of
support that makes SRE documents effective and productive. Here is
some guidance for working with technical writers.
Technical writers should partner with SREs to provide operational
documents for running services and product documents for SRE products
and features. Writers should provide consulting to assess and analyze
documentation and information management needs. Writers should
evaluate and improve documentation tools to provide the best
solutions for SREs. Those are my recommendations when you are working
with technical writers. Finally, require documents as part of code
review. Documentation is like testing: nobody really wants to do it,
but you can minimize this sentiment by taking advantage of code
reviewers' powers. Not all changes require documentation updates, of
course, but here is a good rule of thumb for keeping documents
current. The best practice: if a developer, SRE, or user of your
project needs to change their behavior after this change, the change
should include a documentation change. That is the recommendation
here.
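
The rule of thumb above could be backed by a small check that runs
during review. The sketch below is an assumption-heavy illustration:
the file suffixes, the docs directory convention, and the
changes_behavior flag (which the author or reviewer would have to set
by hand) are all hypothetical and do not describe a real review tool.

```python
from pathlib import PurePosixPath

# Hypothetical convention for what counts as documentation.
DOC_SUFFIXES = {".md", ".rst", ".txt"}


def is_doc(path: str) -> bool:
    """Treat files under a docs directory, or with a documentation
    suffix, as documentation."""
    parsed = PurePosixPath(path)
    return parsed.suffix in DOC_SUFFIXES or "docs" in parsed.parts


def needs_doc_update(changed_paths: list[str], changes_behavior: bool) -> bool:
    """Return True when a behavior-changing change touches no docs."""
    touches_docs = any(is_doc(path) for path in changed_paths)
    return changes_behavior and not touches_docs


if __name__ == "__main__":
    change = ["src/server.py"]
    if needs_doc_update(change, changes_behavior=True):
        print("Please update the documentation in this change.")
```

A reviewer can then ask for the missing documentation before
approving, which is the power the recommendation relies on.
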
Require this as part of code review, and that is all. Thank you so
much for being here, and thank you so much for listening to this
presentation. I really hope that it will be useful for you. Thank
you. Have a good day.