Transcript
Hi, this is Bridging the Gap: Leveraging Incidents to Align Platform and Product Engineering. The talk will cover the following: how to survive the 2:00 a.m. on-call apocalypse, how severe incidents happen, how to make the most out of an incident, the gaps in your testing, a metric you should be tracking and probably are not, how product management and product engineering think in very, very similar ways, and even more tips and conclusions. Let's start.
So this is going to be a choose-your-own-adventure. And let's start here with a story about how to survive the 2:00 a.m. apocalypse. So it's 2:00 a.m. and your pager app starts buzzing. Do you revert the code? You won't even check what it contains or how it broke your app; you're just going to revert and pray that it fixes the issue. Or do you go and investigate the issue further? Let's assume that you chose A: you're going to revert the code, and let's hope that it fixes your issues.
Nah, it didn't. Of course, this was clearly not the culprit, so you should go back and now try to look deeper into the issue. So you look deeper into the issue and, success, you find the issue and quickly deploy a fix. It was probably just a typo or something really simple that you caught quickly.
So yeah, you're happy. Now do you just go back to
bed? I mean, your shift is almost over. In 30 minutes it'll be someone else's
problem, so no need to worry. Or do you try something else?
Let's assume that you were lazy and you just went back to bed. Who cares? Someone else will deal with it if it comes up again. Unfortunately, this went terribly. Your site has gone down again. The next engineer just went ahead and reverted your fix, which in turn caused an even bigger outage, and then you receive a call from an angry customer demanding an explanation. The end. And this is a terrible ending. This is definitely a failure for your company, your team, and even yourself. There is definitely something else that you could have done. So how about we try something different? And the goal of this talk is to share with you all what that something else is.
But before jumping into that, let's have another story.
It's time for meltdown: Swiss cheese and the anatomy of severe incidents. So, severe incidents and cheese have a lot of things in common. Both are the product of letting something simple go unchecked: bacteria create the holes in cheese, and bugs create incidents. You also cannot predict where the holes will be; they'll appear at random, and because of that randomness, holes will align in very unexpected ways, which in turn will make even bigger holes. This is extremely common. It's so common that there is even a model named after this: James Reason's Swiss cheese model.
And as you can see, even with three very good layers, issues can get through. If the holes align just right, it's almost as if there was nothing in between them. And this is a very common phenomenon, so it's something that you always need to keep an eye on. So what can you do about these gaps? You can acknowledge that they will happen; I mean, distributed systems are hard, so incidents happen.
You can try to increase monitoring and alerting; yeah, data helps spot these issues. You can try to invest in tools to detect anomalies; this is very common, and most platforms now have anomaly detection built in. Or you can ask developers to instrument their apps; again, make it a developer's problem, you don't have to worry about it, and they will learn a lot if they do that.
Well, unfortunately, this is not enough. Yes, bugs happen, but we should strive to reduce them as much as we can. Increasing monitoring and alerting is great, but again, if you choose the wrong metric, you'll end up fixing the wrong thing, like we saw in that first example, where you assume that something small is the issue and it's not. Investing in tools to detect anomalies is great, but unfortunately, in our business, anomalies happen all the time for all kinds of reasons: the network is not reliable, solar flares happen, or data centers can have a power fluctuation. All those things will create anomalies in your code, and if you're not careful, you can end up addressing them in the wrong way. And finally, asking developers to instrument their apps without any guidance is going to end up being extremely expensive, and you're also contributing to the problem of people looking in the wrong place. Sometimes they're going to blame issues on the wrong metric, and that's definitely not what we want.
So you should never let a good crisis go to waste. Incidents are great as a wake-up call to invest in modernizing your code base, improving your infrastructure, adopting better tools. And that's because incidents expose our weak spots: incidents show where these issues are stemming from and identify the areas that need further investment.
It's important, though, to beware of survivorship bias: adding armor to the least-hit areas of a returning aircraft is more effective than reinforcing the areas that already show the damage. So when you're thinking about your issues, try to also look into the things that are not failing, because some of them can be the critical thing; those can be the load-bearing parts of your system. So it's important that you look at this holistically, right? And yes, a severe incident will energize your team to spring into action. But unfortunately, we cannot solve the problems with the same mindset that created them in the first place. We need to do something else. We need to look at this with a different lens. So now I'm going to reveal the secret of the test pyramid: tests are not
enough. Imagine this scenario. You have 800
tests for the user model, 80 tests for the endpoint,
40 tests for the React component, and even eight end-to-end tests with Playwright and Cypress. Things can still
go bad. And I've seen situations like this happen.
Someone deletes an important S3 bucket; someone accidentally ships a button color that matches the background; there's a route misconfiguration. These simple issues happen. DNS and BGP issues, these things are extremely common. And it's really hard to catch these issues if you rely on testing alone, because, as we saw previously, holes align very easily. So this level of testing is not enough to catch all the issues that you're going to face. So now, without further ado, I want to reveal what that something else is, that thing that can really help you feel more confident in your on-call.
That something else is called implementing business value proxy metrics, BPMs for short. I haven't found a better name for this metric, so I just came up with this one; it sounds fancy, but it's not complicated at all.
Let me go through an example that will clarify everything. So let's assume, in this first example, that you want to make sure that your login is working correctly. This is a metric that you can generate, and it's extremely simple to generate. You just say: okay, of all the people that go to the login page, how many get to the dashboard? Meaning that to reach the dashboard, they had to go through the sign-up or login process. And to generate a number, you build a heuristic and say: in normal situations, let's say 70% of people are able to do that. That's your metric, that's your BPM. That's something that you can track, and whenever that number goes way up or way down, there is something happening on your system, something that you need to look into. On the positive side, if people are logging in more and this turns out to be 80 or 90 percent, then okay, maybe your effort of simplifying the login process is paying off, or your auto-login is working. Versus the opposite: okay, maybe this metric is dipping. Did someone delete a button? Did someone do something wrong? Or even worse, what if the login is not working at all and people are just landing on the dashboard page because it's unauthenticated? All these three scenarios are really easy to detect with a BPM.
So yeah, alerts on BPMs are great. They allow you to move fast with confidence. This is something that you can just page on, and it will be a better proxy than just looking at those individual low-level metrics.
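To make that concrete, here is a minimal sketch of what a login BPM check could look like. It is illustrative only: the event counts, the 70% baseline, and the tolerance band are assumptions, not the implementation behind the talk.

```python
# Minimal sketch of a login BPM check (illustrative; baseline and tolerance are assumptions).

def login_bpm(login_page_views: int, dashboard_arrivals: int) -> float:
    """Fraction of visitors to the login page who reach the dashboard."""
    if login_page_views == 0:
        return 0.0
    return dashboard_arrivals / login_page_views

def check_login_bpm(login_page_views: int, dashboard_arrivals: int,
                    expected: float = 0.70, tolerance: float = 0.15) -> str:
    """Compare the BPM against the heuristic baseline and decide whether to page."""
    rate = login_bpm(login_page_views, dashboard_arrivals)
    if rate < expected - tolerance:
        return f"PAGE: login conversion dropped to {rate:.0%}"
    if rate > expected + tolerance:
        return f"INVESTIGATE: login conversion jumped to {rate:.0%} (auto-login? auth bypass?)"
    return f"OK: login conversion at {rate:.0%}"

# Example: 1,000 people hit the login page in the last hour, 430 reached the dashboard.
print(check_login_bpm(1000, 430))  # -> PAGE: login conversion dropped to 43%
```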
But there's a catch: as the name says, you need to understand what business value is. Let's go to a definition. I really like this definition by Coda Hale: business value is anything which makes people more likely to give us money. A simple example: you build a product, people are using it, people are paying for it; that is business value. And it's really important to remember that the moment your code generates business value is when it runs, not when we're writing it, not when we're testing it, not when people in QA are checking on it. Only when it reaches our customers and our customers are successful with our product, that's when the business value gets generated.
I hope it's clear at this point that platform engineering should care about the business as much as product management does. We need to be BFFs, business-focused friends. Hurrah. Okay, I know I keep going with these acronyms, but these are simple to understand, and the way to understand them is to just map them to BPMs. So retention: retention is simply the number of logins per hour. Easy. Activation: the number of sign-ups per hour, which is great; how many people that landed on your page actually signed up for your product. Revenue: how many subscriptions you've had per day. And then referral: how many marketing page views happen, or if you have landing pages, how many people ended up on those landing pages. BPMs are just objective measures of how well your system is performing business-wise.
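As a rough illustration of that one-to-one mapping, here is one hypothetical way to declare those BPMs as data; the event names and time windows are assumptions for the sake of the example.

```python
# Hypothetical one-to-one mapping of funnel metrics to BPMs, as described above.
BPMS = {
    "retention":  {"event": "login",                "window": "1h"},  # logins per hour
    "activation": {"event": "signup_completed",     "window": "1h"},  # sign-ups per hour
    "revenue":    {"event": "subscription_created", "window": "1d"},  # subscriptions per day
    "referral":   {"event": "marketing_page_view",  "window": "1h"},  # landing/marketing page views
}

def describe(bpms: dict) -> None:
    """Print a human-readable summary of each BPM definition."""
    for name, spec in bpms.items():
        print(f"{name}: count of '{spec['event']}' events per {spec['window']}")

describe(BPMS)
```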
As you can see, it's a one-to-one relationship, and this is great. This definitely allows us to build a lot of alignment between different parts of our business. And not only that, product development is shifting.
As you can see, at the beginning here on the left, product development was focused a lot on the experience. Like, I'm a user and I want to be able to do this, and hopefully we'll get some benefit out of it. That is good; we're definitely always going to be building features, but thinking about it that way is limiting, because sometimes you need more than one feature, more than one experience, to be able to achieve an outcome. So people are slowly shifting into this capability model, where we say: hey, we want users to be able to do this in our product, and we want this ability, this capability, to give us this business outcome. And this is great. This is a much better level of sophistication, because now we can say: hey, this business outcome is actually measurable. We know that this is working because we have the signal, whereas before we were just guessing whether the feature was working or not. We can actually now see what's going on and try to get some objective measures on that.
And nowadays product development is going a step further, which is to measure things. We want to say: hey, this is the signal that we need to measure, with this data. And then we're going to experiment on it, we're going to run A/B tests, we're going to do all this process to make sure that what we built is actually being used in the way that we intended and that the outcomes we're expecting happen. So thinking this way is a great way to achieve the results that we're looking for, and it's very similar to the way we're thinking on the platform side. A business value proxy metric is not just a signal that a feature is creating the outcomes that we want. It is that, but it's also doing more than that. Good business value proxy metrics show what is critical, what is mission critical to a business: what are the things that, if they're not working, you should be worried about?
This is a screenshot of a dashboard that we call mission control. This very simple board allows us to see when things are deviating from our expectations. So look at that large orange spike: it's clear that there's something going on there with login. We can ask: hey, did someone change something recently? Did we activate a feature flag recently? We can correlate this data with those changes, and just keeping an eye on these numbers will always allow us to go faster with confidence.
We know that as long as this is hovering within our standard range, we don't have to worry. But whenever something breaks, this is a great way to create actionable items.
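As a rough sketch of that kind of range check, something like the following could flag a BPM sample that leaves its recent normal band; the rolling window and the three-sigma band are illustrative assumptions, not the actual mission control logic.

```python
# Minimal "mission control" style range check: flag a BPM sample that deviates
# too far from its recent baseline (window size and 3-sigma band are assumptions).
from statistics import mean, stdev

def out_of_range(history: list[float], current: float, sigmas: float = 3.0) -> bool:
    """True when the current BPM value falls outside the recent normal band."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline yet
    baseline, spread = mean(history), stdev(history)
    return abs(current - baseline) > sigmas * max(spread, 1e-9)

# Example: hourly login counts were stable, then spiked (the "large orange spike").
hourly_logins = [410, 395, 420, 405, 415, 400]
print(out_of_range(hourly_logins, 408))  # False -> within the standard range
print(out_of_range(hourly_logins, 950))  # True  -> page and correlate with recent changes
```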
So now let's go back to our story. Let's assume that you're back at the 2:00 a.m. apocalypse.
And here's that something else. Instead of going back to sleep or jumping right into it, just start by checking mission control. If everything looks okay, you can mute the alert, wait until business hours to deal with it, and go back to sleep; you know that things on the system are looking okay. But if, for whatever reason, one of your BPMs is looking off, you can quickly look at it and try to correlate it with the alert signals. And once you do get the fix out, as long as there's a BPM, you can rest and go back to sleep, because the BPMs will detect if anything else is wrong. Also, I really encourage you, during an incident, to ask: hey, is there a BPM for this or not? Because if not, this is a great opportunity to add one.
And here are a couple more tips on how to use incidents as a call to action. Make sure that you're using some sort of incident management software to capture your learnings. It's a great way to spread the knowledge, and during a crisis, it's a great way to collaborate: people are able to jump in and out of an incident and just gain the context by looking at those records. Also, utilize the learnings to create BPM-backed alerts. So yeah, every time you have a serious incident that feels bad, look into it during your retrospective: hey, what are the BPMs that I can use to make sure this won't happen again?
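For illustration, a BPM-backed alert born from a retrospective could be captured as plain data like this; the metric name, incident reference, and thresholds are hypothetical, and this is not the syntax of any specific alerting tool.

```python
# Hypothetical BPM-backed alert definition derived from a retrospective learning.
CHECKOUT_BPM_ALERT = {
    "bpm": "completed_checkouts_per_hour",   # the business value proxy metric to watch
    "learned_from": "incident-2042",          # the retrospective that motivated this alert
    "expected_range": (50, 400),              # normal hourly band observed historically
    "owner": "platform-oncall",
}

def should_page(alert: dict, recent_values: list[int]) -> bool:
    """Page when the last two samples both fall outside the expected range."""
    low, high = alert["expected_range"]
    return all(v < low or v > high for v in recent_values[-2:])

print(should_page(CHECKOUT_BPM_ALERT, [120, 95, 12, 8]))  # True -> page the on-call
```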
BPMs are very similar to key performance indicators. So if your company has some sort of process like OKRs, V2MOMs, or any other framework, BPMs can guide you and create those conversations: what are the metrics that we want to change and improve? BPMs are almost one-to-one with KPIs. Don't forget to share your perspectives with product and other departments. Again, the most successful way to introduce BPMs is to have alignment with the rest of your company and talk about what's important for everyone.
And remember, the same tools that we use to fix things on the infrastructure side, the SRE side, the platform side, whatever you want to call it, can be used for business metrics. We can use the OODA loop to address business metric issues. So yeah, it's great to introduce these and share them with the rest of the company. And finally, when doing a retrospective, don't forget that finding the root cause is not enough; you should also try to implement preventive measures. BPMs, tests, dashboards: they're just as important as bug fixes. So please introduce them, and your whole company will be happier because of it.
All right, this is the talk. Thank you so much. If you have any questions, you can reach out to me on Mastodon, or I'm at Algons everywhere else: Twitter, whatever people are using these days, you can probably find me there. Thank you so much.