Transcript
Hello everyone. My name is Paul Marsicovetere and today
I'm going to talk about combat sports principles that apply to
site reliability engineering. So a little bit about myself.
I'm a senior cloud infrastructure engineer at Formidable in
Toronto and I've been here since October 2020.
Formidable partners with many different companies to help build
the modern web and design solutions to complex technical
problems. Our key values are inclusion,
autonomy and craft, and we have a sizable open source community thanks
to products like Victory, urql and Spectacle.
Previously I was working in SRE for Benevity in
Calgary for three years, and while I'm originally from Melbourne,
Australia, I've been living happily in Canada for over ten
years now. You can get in touch with me on Twitter at @paulmarsicloud,
on LinkedIn and via email. I'm always open to
chat with anyone about anything cloud computing,
SRE, DevOps, and certainly about anything related to
MMA and boxing. I also run a serverless blog
called The Cloud on My Mind in my spare time as well.

So combat
sports and SRE are not areas that would typically be associated
with one another. And let's face it, combat sports like boxing and
MMA are not typically associated with tech,
software engineering or systems administration either.
Specifically, these two things are not quite the same. In SRE,
we don't have to worry about protecting ourselves at all times or sustaining
any bodily injuries, depending on which company you work for, of course.
However, there are some principles from the combat sports world
that have quite an interesting application to our professional lives in
SRE. Today, I'll demonstrate five of
these combat sports principles, along with some examples of how
these specifically apply to SRE.

So let's get started
with number one: sometimes you're the hammer, sometimes you're the nail.
What exactly does this mean in combat sports terms?
This means that in some fights you will be the one winning and punching your
way to victory, but in the next fight, it might just be your turn
to be on the losing end. It's important to roll with the punches,
as you might find yourself as the nail from time to time.
For SRE, due to the complex nature of cloud
infrastructure, you are never too far
away from your next outage or difficult incident at any given time.
However, it's important to remember that when things are
going very well or very poorly, the inverse is always
right around the corner. We can all have those days
where things can be going incredibly well and the updates and
changes that you are making are helping tremendously.
These updates and changes could be adding an extra availability zone for
redundancy, optimizing slow database queries,
or putting together a backup and restore solution that you are confident
in during disaster recovery. This would be us
in SRE being the hammer. However,
outages such as cloud vendor or cloud service unavailability
can and will happen. Full regions can go
offline or be intermittently unavailable,
or your DNS could be completely unavailable.
Those days are when we are not the hammer;
we are definitely the nail. Like combat sports, in
SRE you need to roll with the punches. Your next
outage is never far away and, try as you might, you cannot
prevent or architect around each and every one of
them. There are going to be busy and complex days,
situations and incidents where you're clobbered from
many different angles. It's important to remember that
the days where awesome improvements are implemented
are always just around the corner, so never be too hard on yourself.

Number two: don't throw and hope, aim and fire.
So here's a visual representation of throwing and
hoping for everyone to understand,
and this is a visual representation of aiming and firing.
Homer didn't have quite enough power on the punch, but the
accuracy was pretty on point. So in combat sports,
this takes the shape of fighters sometimes wanting to throw haymakers
and hope that a knockout shot lands to win the fight.
However, many fighters are of the belief that precision beats
power and timing beats speed. In SRE,
similar situations certainly occur.
For example, when faced with a slow performing website,
you may just increase the infrastructure's memory or CPU
availability to see if that resolves a particular issue.
You could start terminating whatever services are running on a server
to free up available memory or CPU resources to
see if that helps as well. You might even perform the
classic "turn it off and on again" tactic and hope that
fixes an issue. All those actions could be viewed
as throwing haymakers rather than properly investigating
and debugging why a system or service is failing,
and then making the necessary changes to fix the underlying issue.
Granted, the haymakers thrown might solve
an issue completely or buy you extra debugging time
during a particular incident or outage, which is always welcomed.
However, always defaulting to these responses each
time an incident occurs is unfortunately not a valid
long-term tactical response. It's far
better for you and your SRE team to aim for
the root cause or causes as to why an issue
might be occurring and then fire your remediation actions
via monitoring, alerting or code fix deployments.

Number three: don't react, respond. No, this isn't
another framework for JavaScript. We're actually talking about our
actions and emotions during particular situations.
In combat sports, this principle is there to remind fighters
to keep their emotions in check when something is happening,
as it can help control the outcome of what happens next.
Uncontrolled, a reaction might cost a fighter victory
or expose a flaw that could be used against them in the
future. A response is always more calculated and
tactical in the long term. In SRE,
this is a very important principle when it comes to dealing with incidents
or outages. When faced with an unfortunate situation,
it's far too easy to think that the whole world is crumbling around you or
that the incident itself is completely unresolvable.
Some folks might even start to blame their colleagues,
vendors, or the software application and logic. This
is very easy to do and it's very easy to react
in this way. However, the consequences of directing this
blame are very hard to undo.
A more measured and mature response during an
incident or outage has several advantages. It allows
your SRE team to focus clearly on the task at hand
and not be distracted by the external noise. It also shows
great leadership and calmness under pressure to those involved with
the incident. No one wants to be on the receiving end of a negative
reaction to an incident or outage, and a more measured response
will always elicit a better outcome than a knee-jerk reaction.
We in SRE call this incident response and not
incident reaction for this specific reason. So remember
to take a moment to breathe, think, plan and
then respond accordingly. At the end of the day,
all problems have solutions.

Number four:
it only takes one punch to change a fight. This is very important
when it comes to combat sports, as a perfectly placed punch that
is thrown can cause you to win a fight or
a perfectly placed punch that is absorbed can cause you to lose a
fight. This gives fighters hope that even if
things are not going their way, one punch can change
everything. In SRE, we can take this principle
and apply it to many things, but in particular it applies to
incident response. Sometimes it might feel like a
particular problem or incident cannot be solved or fixed.
However, it's important to keep in mind that sometimes all
the team requires is one particular command or corrective action
to fix an incident or problem entirely.
These can take shape in the form of a firewall rule,
updating an application package, downgrading an application
package, or migrating traffic from one availability zone to
another. Sometimes that one particular action helps resolve
many different issues. Conversely, it's equally important to
keep good communication amongst the SRE group and teams and
not perform any commands or actions without letting the team know and potentially
having their authorization and permission to do so as well.
Nobody enjoys cowboy coding and a team member deciding,
"You know what? I'm going to update this database row," or "Screw it,
I'm going to restart this server." That can make a bad incident even worse.
I say these things not to stress anyone out or advise
folks to perform no actions. Rather, you must
be cognizant that those actions that you take can
have very positive or very negative effects.
So think carefully and talk to your team before you hit the enter key,
especially during an incident.

Finally, with number five,
it's important to remember that a fight is won or lost based
on preparation. What this means in combat sports
terms is that the true outcome of a fight is determined
by how well the fighters prepare. Preparation takes
place in the form of training in the gym, and when situations
are not prepared for, this is when fights can be lost.
Fail to prepare means prepare to fail.
For SRE, this translates to preparing for as many possible
outcomes as is feasible. Sure, you cannot
prepare for every outcome, but preparing for as many
as you can will set you up for success should
you encounter any roadblocks. For example,
it's important to practice game day exercises for disaster
recovery so that when an actual outage occurs,
things do not feel so unfamiliar. This can be
hugely beneficial as it gives the SRE team a chance to
refine their incident response coordination, be it through roles
and/or processes, with practice exercises
rather than during an actual outage where services are unavailable
for your clients and for your customers. This refinement
is essentially the preparation needed so that if similar incidents
happen in the future, the SRE team has some comfort
in being able to troubleshoot and resolve the incident faster.
Further, by practicing these game day exercises,
there is much less reliance on winging it or guessing
what to do when troubleshooting an issue, whether or
not it is actually related to an incident. Practice certainly
makes perfect, and the more an SRE team seeks the uncomfortable,
the more they will become comfortable.
Similarly, being prepared can also take shape
by practicing deployments in lower environments before releasing directly
to production. This can potentially save you from outages
or downtime simply by taking the time to preview
how a service might look and be deployed in a
pre-prod or staging environment rather than always deploying
directly to production. Yes, there are trade-offs in deployment
velocity for teams to consider here, but this can be
a helpful approach in establishing confidence in production
deployments as you set an expectation
on what you expect to see based on a similar
looking environment. The same can definitely
be said about automated testing in deployment pipelines as
well.

Now, I said I would talk through five combat sports principles,
but there's actually a bonus one that applies to SRE and to
life in general. And that is the infamous Mike Tyson quote:
"Everyone has a plan until they get punched in the face."
This is important to remember as this principle
still rings true for SRE. The best laid plans
for launching a service or product can quickly go awry due
to a whole host of reasons. Failing tests,
incorrectly sized scaling groups, even expired SSL
certificates can all cause incidents or affect
service availability very quickly.
SRE teams need to stay on their toes and stay focused
for when they too are metaphorically
punched in the face. In tech, not just in SRE,
plans often change quickly and drastically. While this
becomes easier to adapt to over time and with experience,
the threats of seemingly random errors, incidents and outages
that alter the original plan will forever remain.
So it's important to stay ready for what the machines decide to
throw at us.

So, to recap the combat sports principles
that apply to SRE. Number one:
sometimes you're the hammer, sometimes the nail. You need to appreciate the
good and the bad within our lives in SRE. Number two:
don't throw and hope, aim and fire, especially when troubleshooting
and resolving incidents. Number three: don't react,
respond. This is why it's called incident response and not incident
reaction. Number four, it only takes one punch to change
a fight. Think of this whenever you are troubleshooting or are
mid-incident: one command can alter the outcome for better
or for worse. And number five: the fight
is won or lost based on preparation. So always try to stay prepared in
your incident response and troubleshooting so that you're not having to
wing it as often. And of course, try to remember that even
the best-laid plans can go awry once things start
to go badly. Keep your head up, stay in the fight,
and as the referee says, protect yourself at all times.
With that, I want to thank you for tuning in and listening to this talk
on combat sports principles that apply to site reliability engineering.
A big thanks to Conf42 for providing this opportunity and I
look forward to hearing from everyone in the future. Thanks.