Transcript
This transcript was autogenerated. To make changes, submit a PR.
Well, I hope you are having a wonderful day. My name
is Evgeny Korniv and today I would like to talk with
you about how to assemble and organize a development team in
a way that enhances both the resilience and the
quality of your product. Let's begin by looking at the
topics I want to talk about in my brief presentation.
I want to clarify that this talk is addressed more to engineering team
leaders than engineers themselves.
We will explore the lessons I learned from building product teams
that developed high load services. We will discuss
whether SRE is necessary for small and medium
sized teams. We will consider the obvious
alternative to creating an SRE department,
developer on-call duty. We will try
to determine at what point a company needs
to hire an SRE engineer. We will
also touch on how crucial communication
and people are in the implementation of SRE principles.
Okay, let's start. When it comes to small
teams, it's crucial to understand that SRE is not about creating
a team of on call engineers who will save your
service day and night from crashes. It is
a set of fundamental principles,
principles that can be implemented right from
the start. If your product is already in production
and you are even slightly concerned about its availability
to users, start by defining an SLA.
Understand what level of availability suits your product.
Define for yourself what availability means to
you and what in your product must fail for it to
be considered down. Set up monitoring
and alerts according to this.
Inform your team about the target metrics:
they need to respond to these alerts
and know that they are important to the business.
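To make an availability target concrete, it can help to translate it into allowed downtime. The following is a minimal Python sketch, assuming a 30-day window and a hypothetical `AvailabilityTarget` class with a hypothetical alert threshold; none of these names or numbers come from the talk.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass
class AvailabilityTarget:
    """Hypothetical availability target, e.g. 99.9% over a 30-day window."""

    target_percent: float
    window_days: int = 30  # assumed window length, adjust to your agreement

    def __post_init__(self) -> None:
        if not 0 < self.target_percent <= 100:
            raise ValueError("target_percent must be in (0, 100]")

    def allowed_downtime_minutes(self) -> float:
        """Total downtime the target tolerates within the window."""
        total_minutes = self.window_days * MINUTES_PER_DAY
        return total_minutes * (100.0 - self.target_percent) / 100.0

    def remaining_budget_minutes(self, observed_downtime_minutes: float) -> float:
        """Downtime still available before the target is missed."""
        return self.allowed_downtime_minutes() - observed_downtime_minutes

    def should_alert(self, observed_downtime_minutes: float,
                     threshold: float = 0.5) -> bool:
        """Alert once a given share (assumed 50%) of the budget is spent."""
        return observed_downtime_minutes >= threshold * self.allowed_downtime_minutes()


target = AvailabilityTarget(target_percent=99.9)
# 99.9% over 30 days tolerates roughly 43.2 minutes of downtime.
budget = target.allowed_downtime_minutes()
```

A number like this gives the team something specific to respond to: an alert fires when a chosen share of the downtime budget is used up, not on every blip.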
Try to automate the processes of code
compilation and deployment right from the start. The fewer
manual actions, the fewer mistakes.
Ensure your code is covered by tests.
Teams that overlook testing in the rush to launch
their product quickly are still common.
Sometimes this is justified, but often having
tests not only enhances stability,
but also accelerates further development.
Adopt modern approaches to DevOps.
Availability issues often fall on the shoulders of
DevOps engineers, so make sure the production infrastructure
you create for your product is resilient by itself.
Limit access to production systems,
especially databases. In the ideal scenario,
no developer should have such access,
only the operations team. This will
limit potential leaks as well as incidents
related to unauthorized data changes. This seems
like an obvious point, but many small teams
overlook it. Create tools for
servicing your service: dashboards,
admin panels, feature toggles,
et cetera. This will help you achieve the previous point.
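Of the tools listed, a feature toggle is the easiest to illustrate. Below is a minimal in-memory sketch in Python; the `FeatureToggles` class, the `new_checkout` flag, and the `checkout` function are hypothetical names chosen for illustration, and a real setup would usually keep flags in a config store that an admin panel can change.

```python
from typing import Dict


class FeatureToggles:
    """Minimal in-memory feature toggle registry (illustrative only)."""

    def __init__(self, defaults: Dict[str, bool]) -> None:
        self._flags = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags are treated as disabled, the safer default.
        return self._flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        # Only known flags can be changed, to catch typos early.
        if name not in self._flags:
            raise KeyError(f"unknown feature flag: {name}")
        self._flags[name] = enabled


def checkout(toggles: FeatureToggles) -> str:
    """Hypothetical code path guarded by a toggle."""
    if toggles.is_enabled("new_checkout"):
        return "new checkout flow"
    return "legacy checkout flow"


toggles = FeatureToggles({"new_checkout": True})
# If the new flow misbehaves in production, operations can switch it off
# without a developer touching the production system directly.
toggles.set("new_checkout", False)
```

The design choice here is that turning a feature off is a data change, not a deployment, which is what lets you keep developers out of production.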
As my experience has shown, these simple principles
are sufficient at the beginning of the product's journey,
as it has not yet turned into
a complex and multifaceted system.
Ok, let's talk about on call duties.
As your product grows, so does the number of different
subsystems and components. With the increasing
complexity of your system, the amount of time required to identify
and resolve failures also increases.
One obvious solution you may consider is to
appoint on-call developers from existing teams.
This means an employee works as usual, but in
case of failure, they are the first to respond to critical
alerts. It is usually assumed
that such an employee prioritizes work slightly over
life in our beloved work-life balance during the
on-call period. This approach presents
a couple of pitfalls that create an interesting dilemma.
On one side, if the on-call person only responds
to alerts about product availability, then during
periods without incidents, developers can become too
relaxed: "Everything is stable. I can
go to the pool at lunch. It's unlikely anything
will happen." And, as usually happens,
that's exactly when something does happen.
You could assign more duties to the on-call person so
they don't become complacent,
such as responding to all production
bugs. But then each on-call
period becomes grueling, leading to rapid
burnout. I strive to care for the mental health
of my employees, so I opt for the first approach.
To hedge against risks, I use an on-call
roster. If this week's on-call person is somehow unavailable,
the next person on the list responds, and if
they are also unavailable, then the next.
Moreover, anyone who has to respond out
of turn gets moved to the end of the list.
This delay serves as a pleasant reward.
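The roster rule described here (fall through to the next available person, then move whoever responded out of turn to the end of the list) can be sketched in a few lines of Python. The `OnCallRoster` class, its method names, and the engineer names are hypothetical; the weekly rotation in `end_of_week` is an assumption about how shifts advance.

```python
from typing import Callable, List, Optional


class OnCallRoster:
    """Ordered on-call list; the first entry is this week's on-call person."""

    def __init__(self, engineers: List[str]) -> None:
        if not engineers:
            raise ValueError("roster must not be empty")
        self._order = list(engineers)

    @property
    def scheduled(self) -> str:
        """The person scheduled for the current week."""
        return self._order[0]

    def order(self) -> List[str]:
        """Copy of the current roster order."""
        return list(self._order)

    def respond(self, is_available: Callable[[str], bool]) -> Optional[str]:
        """Return the first available person, walking down the list.

        Anyone who responds out of turn is moved to the end of the list,
        which delays their own scheduled shift.
        """
        for index, engineer in enumerate(self._order):
            if is_available(engineer):
                if index > 0:
                    self._order.append(self._order.pop(index))
                return engineer
        return None  # nobody available: escalate by other means

    def end_of_week(self) -> None:
        """Assumed weekly rotation: the scheduled person goes to the end."""
        self._order.append(self._order.pop(0))


roster = OnCallRoster(["ana", "ben", "chen", "dara"])
# "ana" is scheduled but unavailable, so "ben" responds out of turn
# and is moved to the end: ["ana", "chen", "dara", "ben"].
responder = roster.respond(lambda name: name != "ana")
```

Keeping the roster as a single ordered list means the reward for covering someone else's incident needs no separate bookkeeping.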
A week is the ideal duration for an on-call
shift. Our team tried rotating on-call duties daily,
but when a service fails, passing the incident
analysis to the next employee becomes inconvenient.
However, to prevent a week of on-call
duty from becoming a torment, do not overload the on-call
person with additional duties. They should only respond to the
alerts you defined as significant and disruptive to
the normal operation of the service. All other
incidents will follow your standard workflow.
Okay, let's discuss when you might need to assign a
dedicated SRE role in your team. There is
no strict criterion, but I can highlight
several points that you should evaluate to
make this decision. How strict are
the availability requirements for your product?
For instance, if you are operating a fintech service,
these requirements might be higher since you
are likely handling user funds.
Second point: how heavy is the load on the service,
or how popular is it? Failures in popular services
lead to more significant reputational losses than
failures in those with a smaller customer base.
Next point, how well is your development team
performing, both in terms of productivity
and the quality of the product released?
If on call duties are time consuming and you can't
find a way to improve the situation,
this might be an important signal. Next point:
how complex is your system? As your product
and team grow, the overall complexity of the system increases,
and on-call shifts alone no longer suffice.
Developers begin to know less about different
parts of the system and become less independent
in handling incidents. The last point:
what does your budget allow? Can you afford to
have a staff member whose peak activity occurs only
during major outages? As experience
shows, the first SRE engineer in the team is
usually a former developer within the same team,
but the vacancy created still needs to be filled.
So watch your budget. If, after assessing
these criteria, you conclude on
at least three of them that you need an SRE team,
then it's likely true.
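The "at least three of the criteria" rule of thumb can be written down as a simple checklist. This is only a sketch: the `SreAssessment` class and its field names are hypothetical paraphrases of the five points above, and each answer remains a judgment call by the team lead.

```python
from dataclasses import dataclass, fields


@dataclass
class SreAssessment:
    """One yes/no judgment per criterion from the talk (names are illustrative)."""

    strict_availability_requirements: bool
    high_load_or_popularity: bool
    on_call_hurting_team_performance: bool
    system_too_complex_for_on_call: bool
    budget_allows_dedicated_role: bool

    def criteria_met(self) -> int:
        """Number of criteria answered with yes."""
        return sum(bool(getattr(self, f.name)) for f in fields(self))

    def suggests_dedicated_sre(self, threshold: int = 3) -> bool:
        """True when at least `threshold` criteria point towards SRE."""
        return self.criteria_met() >= threshold


assessment = SreAssessment(
    strict_availability_requirements=True,
    high_load_or_popularity=True,
    on_call_hurting_team_performance=False,
    system_too_complex_for_on_call=True,
    budget_allows_dedicated_role=False,
)
needs_sre = assessment.suggests_dedicated_sre()
```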
However, even before that, understand whether you
can fix some issues with other methods.
Could you implement a coverage plan for testing to enhance quality
or refactor some services to reduce the complexity
of the system? Okay, let's talk about communications.
To ensure your team performs well in creating a
resilient service, it's important to keep them informed within
a unified information field. First,
communicate in shared chats.
In our company, each development team has a chat where employees
discuss tasks and technical solutions. This allows
all members to see current issues,
decisions made, and their impacts on the system.
Try to ensure your employees discuss work related
things as little as possible in private messages.
Organize all-team meetups.
At these meetups, employees can talk not only about
new technologies, but also about
how they solved major tasks and the outcomes of
those solutions. Team leader meetups can be
used to discuss team interactions, major system design
decisions, and how collaboration can enhance product stability.
Next, documentation and knowledge transfer. To
ensure that system support quality is maintained
in the event of staff rotation,
document at least the most important and complex
parts of the system. Make sure this documentation is included
in the onboarding process for new employees,
and after their probation period, check how well
they have understood it. A bit about people: I
truly believe that every team should have at least one specialist
who is significantly stronger than the average
market-level senior. An experienced and skilled developer not
only completes more and better tasks,
but also has extensive problem
solving experience. When we talk about SRE,
we often discuss readiness for various types
of problems. Specialists with over ten years
of experience have encountered these problems
far more often than recent graduates, even from
the best universities. And believe me, this experience will
more than once save you money and possibly even your
business. Even if your budget is limited,
it's better to sacrifice an extra position on the
staff and hire one experienced and costly senior
specialist. To finalize my
speech, it's important to understand that achieving
high availability for your product requires efforts
in all directions. So let's summarize what I
have discussed today. Determine your
availability criteria and SLI. Conduct improvements
in alerting and monitoring. Ensure maximum
transparency in team communication: there is no
room for secrets in an engineering team.
Improve code quality: testing, code reviews,
quality control, CI/CD. This will all be
fundamental. Introduce an on-call duty policy.
When the project has grown, assess the need to create an
SRE department. Develop an SRE engineer
within your company and, if necessary,
build an SRE team around them, which
they will train. This is a challenging path,
especially for teams that already have an
established culture, but it's worth
it. That concludes my presentation.
I hope you found it interesting and that it will be helpful
in the future. If you would like to discuss
any topic, this or another, feel free
to contact me. My contacts are on the screen.
Thank you and have a good day.