Abstract
This talk shares my experience of joining a new engineering team. By a DevOps team, I mean a team in which developers and IT operations work collaboratively throughout the product lifecycle.
Bottom-up approach
Imagine you don't have the capacity to learn everything about your service at once. So let's break continuous learning into chunks.
The learning process is observe -> record -> act.
In the observe phase, let's jump into alerts and exceptions even if you're not familiar with them.
Gradually, you will get to know the critical points for your service (e.g. the website being down).
In the record phase, let's write up missing documentation on behalf of teammates. Even if you cannot solve issues directly, this will be helpful to your team.
Writing runbooks gives you a chance to understand your service and participate in operations. I would recommend writing the “Architecture” part to organize your understanding of your service's technical design.
In the act phase, let's try service operations (e.g. fixing broken data through manual operations). Pairing on operations is a good way to get into operational work.
Top-down approach
Check the big picture of your service's reliability (e.g. are there any SLAs, SLOs, or SLIs?).
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. In this talk I'm going to talk about how to
join operational work in a DevOps team.
When you join a new team, there is operational work such as manual recovery operations,
triaging alerts, addressing incidents, and so
on. Troubleshooting is essential,
especially for SREs, but it's difficult
for new members because you don't understand how your systems work
well enough yet. This talk will give you some tips
on quickly catching up on service-specific knowledge
and contributing to your team from early on.
I'm Kazki, a senior backend engineer at
Autify. The responsibility of our backend engineer
role includes infrastructure development and
operations like incident handling, being in on-call
rotations, etc. Autify is a
startup company which provides an AI-based software test
automation platform. We have two products,
Autify for Web and Autify for Mobile.
I'm involved in the development and operations of the Autify
for Web service. The idea in this talk is based
on my experience at Autify.
First, I would like to discuss what makes it difficult
to join service operations. Troubleshooting
is one of the critical activities for anyone who operates web services.
It's often viewed as an innate skill that some
people have and others don't.
However, the book Site Reliability Engineering
shows a general model of the troubleshooting process.
It's useful for analyzing what makes it difficult to
join service operations, so let me explain it briefly.
First, troubleshooting starts with a problem report, for
example metrics, alerts, application
exceptions, customer inquiries,
et cetera. After that,
we can look at system telemetry and logs, such as system metrics,
logs, and so on.
Through this exercise we can understand the current state
of the system and identify possible causes.
When we find potential solutions, we can actively
treat the system, that is, change the system in
a controlled way and observe the results.
This is a general process of troubleshooting,
but when you join a new team, there are some challenges
in the triage phase. Let's say you get an alert:
the CPU utilization is over 90%.
Once you receive the problem report, you need
to consider whether you have to take action immediately, after
answering the following questions in your mind.
Is this the first time your team has seen this alert? Which
workflow is the server used for? Do users use
the service, or is it only for internal use?
Is it a known issue for the team? However,
you will not have enough knowledge to answer these questions
yet when you have just joined. The
book Practical Monitoring discusses this in the
context of alerts. It says that, generally,
there are two different kinds of alerts.
The first one is an alert meant to wake someone up.
This kind of alert requires action to be taken
immediately, or else the system will go down.
For example, when all web servers are unavailable,
we should take action immediately.
The second one is an alert meant as an FYI, for your information.
It doesn't require any immediate action,
but someone ought to be informed that it occurred.
For example, when an overnight backup job fails,
software engineers ought to be informed so that
they recognize they may need to take action the next business
day. Contextual judgment is
one challenge in doing operational work.
It highly depends on knowing the system's failure patterns.
After a few months of watching various alerts
you may get the hang of it, but it will take
time. Also, in the examine,
diagnose, and later phases there are problems.
Even if you can figure out the severity
of problem reports, you need knowledge of how
the system is built, how it should operate,
and its failure modes.
Basically, the exercise depends upon two factors.
The first one is an understanding of how to troubleshoot
generically. The second one is a solid
knowledge of the system. For example,
let's say you get a problem report that queue
processing has stopped and the number
of waiting events has increased.
To solve this case, you need to know information such as:
What events does the queuing system handle?
Are there any known possible causes?
Have we implemented a retry mechanism? So again,
lack of service knowledge becomes an issue here.
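The two kinds of alerts described above can be sketched in a few lines of Python. This is only an illustration, not how any particular monitoring tool works: the `Alert` fields, the `classify` rule, and the sample alert names are assumptions made up for this example, loosely following the triage questions from the talk.

```python
from dataclasses import dataclass
from enum import Enum


class AlertKind(Enum):
    PAGE = "meant to wake someone up"      # act immediately
    FYI = "meant for your information"     # look at it the next business day


@dataclass
class Alert:
    name: str
    user_facing: bool            # do users use the affected service?
    system_will_go_down: bool    # will the system go down without action?


def classify(alert: Alert) -> AlertKind:
    """Hypothetical rule: page only if users are affected and the system will go down."""
    if alert.user_facing and alert.system_will_go_down:
        return AlertKind.PAGE
    return AlertKind.FYI


# The two examples mentioned in the talk.
web_down = Alert("all web servers unavailable", user_facing=True, system_will_go_down=True)
backup_failed = Alert("overnight backup job failed", user_facing=False, system_will_go_down=False)

for alert in (web_down, backup_failed):
    print(f"{alert.name}: {classify(alert).name}")
```

Filling in fields such as `user_facing` is exactly the contextual judgment that is hard for a new member, which is why the rule itself is trivial and the knowledge behind the inputs is not.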
So far I have introduced the troubleshooting process and
mentioned that a lack of service understanding can be a
challenge within that process.
Next, I'm going to explain three pieces of advice
for starting to participate in service operation work.
Firstly, I would advise you to look at problem reports
even if you are not sure about them. When you see problem
reports that you're not sure about, click the link to
them. It doesn't matter if you can't directly
contribute to solving the problem.
Let yourself be yourself and keep looking at incoming
reports in a casual manner. It's a
good idea to set a time box, in other words a time limit,
for example 30 minutes, as spending too much
time may interfere with your main work. The more you
jump into problem reports, the more knowledge you gain about problem
patterns. The next piece of advice is to leave
what you learn in documents.
After going into detail about alerts, let's create a
blank page in an internal documentation system.
In the case of Autify, we use Notion as a
knowledge sharing system and use Datadog as
an observability platform.
So I create a blank page in Notion, or sometimes create
a new investigation note in Datadog,
then leave what I've learned in the document.
For example, when you have learned about the system
architecture related to the problem, write it
down briefly. It's also recommended to note
any similar cases you find that have occurred
in the past. This will help you visualize your
learning. To keep participating in operations,
trust is essential.
Making your learning visible leads to
letting your peers know how much you understand about how
your system works and how to diagnose atypical system
behaviors. In the book 97 Things Every
SRE Should Know, Lorin Hochstein, who is an
engineer at Netflix, recommends watching experts
in action. He said that, in general,
the best way to facilitate skill transfer is to watch experts
in action. Ideally, you're working alongside them,
watching them solve problems and documenting
how they mitigate operational surprises.
I love this idea. In my case,
Autify's work style is remote, so if
I don't know how my peer investigated the problem,
I ask them on Slack after it's been
resolved and write a new document.
The third piece of advice is to write a
runbook. This may be a well-known concept, but let
me explain it briefly. A runbook is a detailed
how-to guide for completing a commonly repeated
task or procedure within a company's IT operations process.
It guides an operator with step-by-step instructions.
It's sometimes known as a playbook.
The merit for your team is that runbooks are a
shared base of knowledge and expertise that
would otherwise be kept solely in the hands of subject
matter experts. A subject matter expert is
a person who is an authority in a particular area or topic.
Once you write it down, you will be able to take over
the operation. At this point, you can
start participating in service operations.
Writing runbooks will mitigate a common system
antipattern called "Only Brent Knows".
The book Operations Anti-Patterns, DevOps Solutions defines this
antipattern. The book says that, unless purposeful
action is taken, information tends to coalesce
around key individuals. It makes those
individuals incredibly valued but also
equally burdened. Your documentation work will
reduce the burden on experts by sharing their knowledge and
expertise. Besides,
this is only possible because of you. The book Software
Engineering at Google puts it like this: the first time you
learn something is the best time to see whether the
existing documentation and training materials can be
improved. You are the best person to write
a new runbook if your team doesn't have a document about it.
A good runbook answers these questions.
Specifically, I would recommend you answer the
third and fourth questions in your document.
Regarding the third question, what dependencies does
it have? Modern systems are distributed systems,
and in some cases external services are
involved. In any case, this information is
beneficial for understanding the system architecture.
The fourth question, what
does the infrastructure for it look like, is also
a good question for understanding the system architecture.
Writing these parts is a great opportunity to organize
your learning and form reusable knowledge for the team.
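As a rough sketch of what a first-draft runbook page could look like, the snippet below renders an empty skeleton to fill in. Only the "Dependencies" and "Infrastructure" prompts come from the questions named in the talk; the other section names, the `RUNBOOK_SECTIONS` list, and the `render_runbook` helper are hypothetical and exist only for illustration.

```python
# Section headings and the question each section should answer.
RUNBOOK_SECTIONS = [
    ("Overview", "What task or alert does this runbook cover?"),          # assumed section
    ("Dependencies", "What dependencies does it have?"),                  # question from the talk
    ("Infrastructure", "What does the infrastructure for it look like?"), # question from the talk
    ("Steps", "What are the step-by-step instructions for the operator?"),  # assumed section
    ("Similar past cases", "Which similar cases have occurred in the past?"),  # assumed section
]


def render_runbook(title: str, sections=RUNBOOK_SECTIONS) -> str:
    """Return a plain-text runbook skeleton with one TODO prompt per section."""
    lines = [title, "=" * len(title), ""]
    for heading, prompt in sections:
        lines += [heading, "-" * len(heading), f"TODO: {prompt}", ""]
    return "\n".join(lines)


draft = render_runbook("Queue processing has stopped")
print(draft)
```

The output is deliberately incomplete: the point is to have an imperfect page that teammates can correct, rather than to wait until every section can be filled in.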
You may feel uncomfortable writing runbooks because
you don't feel you can write perfect, useful documentation
for the team. However, the book Seeking
SRE addresses this concern like this:
write a shitty first draft. An imperfect document
is infinitely more useful than a perfect one that
does not yet exist. You don't need to
finalize the documentation perfectly. I would recommend you
be the first writer. Finally,
let's recap the three tips. First, look at
problem reports, even if you are not sure about them.
Second, leave what you learn in documents.
Third, write runbooks for your team.
The books and blogs mentioned in this presentation
are listed on this slide. I hope this presentation
will help you. Thank you for listening to my
presentation.