Transcript
This transcript was autogenerated. To make changes, submit a PR.
G'day,
welcome to Avoiding Goodhart's Law: Using SLOs as Tools,
Not Cudgels. Before I start sharing the PowerPoint,
let me just cover what we're going to talk about. See, the concepts of SLI,
SLO, and error budget are there to balance risk and reward:
risk around the acceptable rate of change, and reward being the business
success and customer contentment. Using such metrics to punish
teams for exceeding budgets, or to force acceptance of change
within the business, is a path to failure. And this session is
going to give you some hints for success. But first, maybe I should introduce
myself. I'm Marco Coulter. I am an ex-CTO who
has worked for one of the top 50 international banks. I've supported data
centers for hospitals and service providers. I've worked for
some of the industry's largest vendors. I've lived in three countries
and managed teams across 13 countries.
I also spent five years as an industry analyst.
I ran the data science team at 451 Research,
which has since been acquired by Standard & Poor's.
Seeing technology from every side, as an operator, a developer, an analyst,
a vendor, a buyer, and a CTO, gives me a unique view
on technology. I can read or, sorry, you can
read some of my writing or interviews in the publications on the
left here or on my website, tech whisperer.com.
So enough about me. It's good
to have targets, right? Think of Robin Hood. The story where
he places a child against a tree and he loads up an arrow and he
aims. Now, with an apple as the target on the child's
head, this is the story of a skilled archer. But without
a target, without the target apple, it's the story of a crazy
guy, a dangerous guy, shooting arrows at children.
So it's good to have targets as long as you use them correctly.
And that's what we're going to talk about. Today's session comes in three chapters.
I will talk about how I experienced Goodhart's law
before I even knew it existed. Then we will think
about SLIs in a better way, across dimensions.
And then finally, I'm going to throw in a few hints about negotiating your SLIs,
and give you some links to further reading. So let's get
going. I'll get to Goodhart's law in a moment, but first
I want to share a story of my experience with you. Depending on your personality,
you will either relax and enjoy my story, or you might already
be searching for Goodhart on Wikipedia. That sort of thing, searching
for the answer first, that's gaming the system. And some
folks see gaming the system as the smart play.
Others equate the phrase to, like, cheating.
I'm more in the second group. For me, gaming the system means manipulating
the rules meant to protect a system, to instead manipulate
the system towards a desired outcome.
In a prior life back in Australia, I worked for
a service provider that supported all of the hospitals in a
state. In hospitals, nurses, they take
lab samples and they get sent to the labs.
They are processed and the results get
transmitted back to the patient record where the nurses back in
the ward can then immediately look them up.
Pretty simple, right? By the way, the wards are up on
the 16th floor or somewhere high up in the hospital buildings, while the
labs are generally in the basement of another building on the campus. So there's
some physical distance between the two.
Technically, it looked a little like this.
The messages from the lab's Unix system would be sent
into message queues, and the queues would be read by the lab
update, which would feed the lab updates into the mainframe system holding
the patient records. Now, everything allegedly spoke a
common HL7 standard, so there was never going to be any problems.
All different vendors involved. I think you see where this is going.
The support of the HL7 standard was not perfect.
Malformed messages created by proprietary software along
the way would get stuck in the queue.
We would then get phone calls from hospitals that they
had to go to manual procedures. Now, the backup procedure,
if the patient record is not getting updated, was for the nurse to
physically run down from the ward to the labs to get the
results and bring it back to the ward. That was not optimal
as the patient's health was at risk both from the delay and
from the nurse being absent. To take care of this,
we agreed an SLA that if the message queues got higher
than 100, the service provider that I worked for had to refund
money back. And that should address things, right?
Looking at the thing that we thought was broken. So I coded a bash monitor
script, so that when the queue length approached 100, alerts would go off.
Monitor icons would turn from green to yellow to red. As technicians,
we were focusing on the measure as the target, the goal.
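As a rough illustration of that kind of monitor, here is a minimal Python sketch. The original was a bash script that is not shown in the talk, so this is not the speaker's code. The limit of 100 comes from the SLA described above; the 80 percent warning fraction and every name here are assumptions made for the example.

```python
from enum import Enum


class Status(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


SLA_QUEUE_LIMIT = 100   # from the talk: refunds were owed above this depth
WARN_FRACTION = 0.8     # assumption: start warning at 80% of the limit


def queue_status(depth: int, limit: int = SLA_QUEUE_LIMIT,
                 warn_fraction: float = WARN_FRACTION) -> Status:
    """Map a message queue depth to a traffic-light status."""
    if depth < 0:
        raise ValueError("queue depth cannot be negative")
    if depth > limit:
        return Status.RED       # SLA breached
    if depth >= limit * warn_fraction:
        return Status.YELLOW    # approaching the limit
    return Status.GREEN


if __name__ == "__main__":
    for depth in (0, 85, 101):
        print(depth, queue_status(depth).value)
```

Note that an empty queue reports green. The check says nothing about updates that never reached the queue, which is the blind spot this story is about.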
We even built capacity plans around making sure the queue processing
got all the power it needed. Now you might think, well, that's great,
Marco. Top-notch result. Lab results get back to the ward in time,
right? The only problem was that
we would get these pesky phone calls from nurses in the wards saying
the system sucked. They were always having to
run down and manually collect results. But the
message queues were empty. So the problem was that
transactions were often timing out before hitting the message queue.
We hadn't seen the whole picture. We were managing the capacity
plan, in fact, the whole application to the metric,
not to the outcome. Now, years later,
I learned that this behavior had a name. Yes, finally, we're going to get to
Goodhart. Goodhart was an economist
in the UK, and in 1975 he stated,
I'm just going to read this: "Any observed statistical
regularity will tend to collapse once pressure is placed upon it
for control purposes." It's kind of wordy, I guess. He was
English and a politician, and wordy is what you get.
Anyway, here is what he meant. Basically, Goodhart's law
says that when a measure becomes a target, it ceases to be a
good measure. You see, queue length was a good measure
of queue function. But we were managing to that as a
target for success, building capacity
plans and so on, based on the queue instead of the successful
laboratory transactions in patients' records.
So how do we avoid Goodhart's law?
Well, we needed to let the SLIs be measures and
then use the SLOs as the goal.
So what should we measure?
And the key is to stand in other people's shoes,
see everything from a few different angles. Hence three
dimensions. Remember that SRE is really
about balancing the risk of unavailability against
rapid innovation and efficient operation. And we embrace that
risk by giving it a value through well defined and governed
service level indicators, objectives and agreements.
As you are identifying SLIs,
you need measures that see the whole picture in the three
key dimensions. So let's step through a simple example of
each of these dimensions based on that hospital environment. The examples
will not give specific SLIs that you can apply to your environment.
They're not meant to. The example is intended to share the thought process.
So first, let's quickly review the SLI, SLO,
SLA model so that we're talking about the same thing.
So SLIs, or service level indicators, are the numbers,
and they work better when they are percentiles. Avoid averages;
you miss things. In mature environments, SLIs will be nested.
They will combine from SLIs that sit against the code
and the technology, up to SLIs that sit next to the customer.
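To make the point about averages concrete, here is a small Python sketch. The latency figures are invented for illustration, and the nearest-rank percentile helper is just one common way to compute a percentile, not something the talk prescribes.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical update times in seconds: most are quick, a few are very slow.
latencies = [2.0] * 95 + [240.0] * 5

mean_latency = statistics.mean(latencies)   # 13.9 s, describes nobody's experience
p50 = percentile(latencies, 50)             # 2.0 s, the typical update
p99 = percentile(latencies, 99)             # 240.0 s, the updates people complain about

if __name__ == "__main__":
    print(mean_latency, p50, p99)
```

The average sits between the two groups and hides the slow tail, while the percentiles show both the typical case and the outliers.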
They are a defined quantitative measure, a metric.
You will then set limits on the SLIs, say an upper bound
or maybe upper and lower, and that gives us the
SLO, the service level objective.
SLOs should capture the performance and availability levels that,
if barely met, would keep your typical customer
happy. They are generally a target (failures
are less than x) or a range (responses will
be between x and y). Generally, I like to translate the SLOs
into periodic budgets, of between x and y over
this period of time, that I can track weekly, but that weekly timing
might depend on your release cycles and so on. Maybe you can extend it to a
month, maybe you need it every single day. The SLAs
define what actions are acceptable once the budget gets used
up. Defining these ahead of time is critical.
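One way to picture the translation from an SLO into a periodic budget is the sketch below. The 99 percent target, the request counts, and the class name are all hypothetical; the talk does not give figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for one tracking period (for example, one week)."""
    slo_target: float       # hypothetical, e.g. 0.99 means 99% of requests succeed
    total_requests: int     # requests seen in the period
    failed_requests: int    # requests that missed the objective

    @property
    def allowed_failures(self) -> float:
        # The budget is whatever the SLO leaves over.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        return self.allowed_failures - self.failed_requests

    @property
    def exhausted(self) -> bool:
        # Once this is True, the pre-agreed SLA action applies.
        return self.remaining < 0


if __name__ == "__main__":
    week = ErrorBudget(slo_target=0.99, total_requests=10_000, failed_requests=40)
    print(round(week.allowed_failures), round(week.remaining), week.exhausted)
```

The period length is a parameter of how you collect the counts, so the same structure works whether the budget is tracked daily, weekly, or monthly.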
There's one additional thing. Full outages tend to
happen less these days, so the focus needs to be on sort of slowdowns
as well. So rather than traditional uptime
availability, I try to focus on the customer domain and their experience,
and then use successful customer requests as a general
goal instead. Now, to capture the overall environment,
those nested SLIs supporting the customer experience,
supporting the CX ones, should cover each of the three dimensions.
So first, let's start with the code. Now, from our code,
we want functional code that does not fail, and additional
measures and write-downs of technical debt.
So, you'll be dealing along the way with multiple languages.
So you want to make sure that the metrics will work across them generically.
Where you can, avoid metrics that would only apply to a specific coding
language. It's too much extra handling. You're creating work for yourself.
I guess you're creating tech debt in a sense, even in the measure.
Also, it will not be limited to applications you built in house.
So sometimes you're going to have to deal with third-party applications like SAP and
the ABAP language, or SaaS applications like Salesforce.
And maybe what you're managing is environmental, like, you know,
your reaction times in PagerDuty or something.
And then you're not just watching the code transaction errors,
you are looking at the configurations and looking for
configuration errors again. So some of these nested
SLIs there around the code might indicate YAML
code accuracy or something like that, if you're running a Kubernetes environment.
You want to avoid silos of data here. You don't want the SRE
team working off one source of metrics while the development team works off another.
You need a single source of SLI truth as part of deciding
what you will base the SLOs on.
So for the sample indicator in
healthcare, let's embed a few clarifications. Our first step
would be to focus on well formed updates. That specifies
the transaction that we're looking at. We want to update the patient record
and acknowledge completion successfully.
That acknowledgment of completion specifies the reaction. And in
this case, we agreed to using an APM tool,
an application performance monitoring tool, and that specifies
the source of the data, that single
place of truth. Now I'm familiar with AppDynamics for APM,
but you could just as easily use Datadog or
New Relic. In fact, as we are only concerned about code here,
you might prefer observability offerings like Honeycomb or,
you know, this observability segment, I don't know, IBM acquired Instana.
Observability is getting very interesting right now. Now, some SLI
books recommend good over bad ratios for SLIs.
If you control all the code, that can work. But in
this example, I'm avoiding the fail ratio as there were too
many elements out of our control. The HL7 update transactions
were coming out of purchased proprietary software running on the lab hardware.
We had no way of fixing that code. We had to wait for patches to
come in, so we were not going to be held accountable for
those failures. Also, the queuing systems were third-party software, different vendor,
so we couldn't tweak them to respond better to malformed entries, to not
get stuck. And we could not be certain that
we would reach a point where the HL7 outputs were always
well formed. So the basic code SLI would
need to be focused on well-formed updates getting to the code
that we were writing and controlling. So that's an SLI down the
bottom of the nesting,
but just an example. So for the code SLO,
we are already assuming well-formed HL7
records, so we can set this fairly high.
Again, we're being clear about the transaction, the reaction,
and the source. And it's often easy to
set this goal too high. Remember, people
like to say 100% for SLOs, but that means there
will be no experimentation, no innovation, no risk taking.
So it needs to be just over the level that will keep customers happy.
Too high, and opportunities
are being wasted. Okay, so for the code
SLA, we need
to apply an outcome now. Note the SLA goal also
allows some wiggle room with the SLO. We added a time range
here now, and the SLO must be met over a sliding range
of 28 days. The SLA should specify
what happens when the SLA is missed. Does one department
owe the other department a refund? If it's a service provider
relationship, it might be something like that, just pay us in cash.
But perhaps as the SLA is
getting missed, something gets locked down. We don't allow changes for the next couple of
weeks while we sort out all this nonsense. Or maybe the software release cycle
gets automatically frozen for 28 days. It is good to agree
those things before you hit the heat of the moment, when you're in
the middle of something going wrong and your coders are like, yeah, I think I've
got it. I think I can fix this immediately. But your
nurses are still phoning up and going, the results are not in the record.
What the hell is going on? So the SLA is the part
that is negotiation. Now, in a perfect world, this is
defined by the business or customer, but in reality it is a conversation.
Normally I would not put technical phrases like well-formed
HL7 in the SLA. It would be a customer
outcome, but we'll come to that a little bit later in this session
when I talk about negotiations. Okay, so you're
going to have some SLIs and SLOs around code.
Hopefully that gives developers a sense of balance, something to measure opportunities
to add features and innovate against
clearing technical debt against business impact.
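The code-dimension SLI and its 28-day sliding window can be sketched roughly as follows. Only the 28-day window and the focus on well-formed updates come from the talk; the record fields, function names, and any target value are hypothetical, and in practice the data would come from the agreed APM tool rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LabUpdate:
    received_at: datetime
    well_formed: bool     # malformed HL7 messages are outside this SLI
    acknowledged: bool    # patient record updated and completion acknowledged


WINDOW = timedelta(days=28)   # sliding range mentioned in the talk


def code_sli(updates, now, window=WINDOW):
    """Fraction of well-formed updates in the window that were acknowledged.

    Returns None when the window holds no well-formed updates.
    """
    start = now - window
    eligible = [u for u in updates
                if u.well_formed and start < u.received_at <= now]
    if not eligible:
        return None
    return sum(u.acknowledged for u in eligible) / len(eligible)


def slo_met(updates, now, target):
    """True when the SLI is at or above the (hypothetical) target."""
    sli = code_sli(updates, now)
    return sli is None or sli >= target
```

Filtering on `well_formed` before counting is what keeps the team accountable only for the code it controls.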
Now, code runs on infrastructure, so that can have customer experience
impacts as well. You have availability concepts of how
to support updates to the infrastructure in the same way that you want to update
your code, updating operating systems, moving to different cloud or network providers,
adding new locations to better support remote customers.
These risks to availability and performance need to be balanced as
well. So excuse
me, this is our new dimension. As we add the infrastructure dimension,
things get more complicated. You will be dealing with the full stack
and often multiple stacks in multiple locations.
In our hospitals, we had pretty much everything from
Windows client applications to Unix lab and MQ systems
and mainframe systems. And they're all scattered across a state,
Western Australia, that is physically one third the size of mainland USA
and, by the way, not close to any cloud providers.
So networks mattered. Some of the nested SLIs around infrastructure
could be inherited SLIs from your cloud, network, or service providers.
Like for code, you want to avoid silos of data here. It's best to have
a single source of SLI truth for infrastructure.
What you're looking for is the infrastructure's ability to
support, load, and deliver predictable latency.
Now, we need to include some more things
here: the impact of all the infrastructure components
in the SLI/SLO part of our nesting process.
So we look at the total transaction time as a way of doing that.
This would certainly have nested SLIs for each piece of the puzzle.
An SLI for the labs update leaving the lab hardware,
an SLI for the message queue, adding and leaving. Remember that first one that
I described at the start of this session? An SLI for the
labs update arriving at the patient record system. An SLI for traversing the
networks, an SLI for adding it into the patient record.
And you might even have an SLI for the database inserts on the patient record
system. You can get too crazy with SLIs.
So define SLIs at sort of system boundaries or team boundaries,
so that there can be a sense of ownership there.
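The nesting of per-hop SLIs under a total transaction time might look like the sketch below. The checkpoint names are invented stand-ins for the boundaries listed above, and the timestamps are made up.

```python
from __future__ import annotations

from datetime import datetime, timedelta

# Hypothetical checkpoints, one per system boundary, in processing order.
CHECKPOINTS = [
    "lab_sent",           # update leaves the lab hardware
    "queue_in",           # added to the message queue
    "queue_out",          # read off the message queue
    "record_system_in",   # arrives at the patient record system
    "record_updated",     # inserted into the patient record
]


def hop_durations(timestamps: dict[str, datetime]) -> dict[str, float]:
    """Seconds spent between each pair of adjacent checkpoints (the nested SLIs)."""
    durations = {}
    for earlier, later in zip(CHECKPOINTS, CHECKPOINTS[1:]):
        delta = timestamps[later] - timestamps[earlier]
        durations[f"{earlier}->{later}"] = delta.total_seconds()
    return durations


def total_transaction_seconds(timestamps: dict[str, datetime]) -> float:
    """End-to-end time from lab to patient record (the top-level SLI)."""
    return (timestamps[CHECKPOINTS[-1]] - timestamps[CHECKPOINTS[0]]).total_seconds()


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 9, 0, 0)
    offsets = [0, 2, 10, 11, 15]
    stamps = {name: t0 + timedelta(seconds=s) for name, s in zip(CHECKPOINTS, offsets)}
    print(hop_durations(stamps))
    print(total_transaction_seconds(stamps))
```

Each hop can be owned by a different team, while the total is the number the customer actually experiences.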
The strength of system boundaries is that they're less likely to
change, but it might be too detailed. The strength of team boundaries
is that you can assign responsibility more easily. Now,
for SLOs on infrastructure, you may want to
express the SLOs in the shape of performance curves. Here, we expect
the bulk to occur normally within 30 seconds,
well within the technical capabilities of the infrastructure. Right. It just needs
to be enough to keep the customer happy.
You don't want to set it so high that you're overspending
and under-innovating.
So, for example, where there's a high system load, you will see that we have
a long tail here, out to about five minutes at the top of the curve.
As we move to the SLA negotiations with the customer, you see a big jump.
We're only committing to the five-minute time with them.
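An SLO shaped as a performance curve can be written down as a list of percentile and limit pairs. In this sketch the 30-second and five-minute limits come from the talk, but the percentiles attached to them (90 and 99) are assumptions, since the slide values are not in the transcript.

```python
import math

# (percentile, maximum seconds). Percentiles are assumed; limits are from the talk.
SLO_CURVE = [(90, 30.0), (99, 300.0)]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def curve_violations(durations, curve=SLO_CURVE):
    """Return (percentile, limit, observed) for every point of the curve that is exceeded."""
    violations = []
    for pct, limit in curve:
        observed = percentile(durations, pct)
        if observed > limit:
            violations.append((pct, limit, observed))
    return violations


if __name__ == "__main__":
    busy_day = [10.0] * 85 + [45.0] * 15   # hypothetical transaction times in seconds
    print(curve_violations(busy_day))
```

The SLA would then commit only to the outer point of the curve, leaving the inner points as internal objectives.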
And this came about after conversations with the ward nurses, and this is
a key aspect of what I'm talking about here. Once
we'd screwed up on the message queue and realized that we
weren't making the nurses happy (and nurses are very clear when they're unhappy,
by the way), I went out and
talked to the nurses. I went out and watched what they did in the
wards, and I went down to the lab systems to see what these people were
actually doing, so that we could build genuine measures for them
of what they actually cared about. So we asked them about time, and we
expected them to be like most customers, seeking instant response times.
But their view was different. They know it takes time for the samples
they take to be delivered physically from the wards to the labs.
So they had a view of overall processing time. For them,
the time frame was about beating the time it took a nurse to
run from the ward to the lab when the system
was down. Now, that took about ten minutes.
So when we offered five minutes, they were happy.
That was an important lesson, and it
allowed us to avoid unnecessary infrastructure costs and
other things. It doesn't always have to be as fast as possible,
just as fast as necessary. And of course, when I worked
for banks on stock trading systems, it was different. The processing
time was a competitive differentiator for
the traders and as fast as possible. And never mind the cost
was the approach. The dimensions of code
and infrastructure are not the full picture,
however. So let's talk about the third dimension,
the business and customer experience. And I saved the best for last.
This third dimension is the business, or if you're a nonprofit or a government
body, the customer experience.
This is about the revenue and or service production capabilities
of the application.
As we add the business dimension, it can become difficult to measure
the full experience. You will want to get out
to the customer interface, and that may require business
integration or mobile platform agents for availability.
And to track predictability of response times, you may want to add in
synthetic testing tools into your environment. I'm going to keep it a little
simpler for our hospital example. We're sort of just looking at
the doctor and nurse experience now.
We built and owned the patient record application,
the mainframe piece. So we knew we could add in our own specific measure
for that piece. And from observing the behaviors in the wards,
we worked out that the nurses had an instinctive expectation of when
the labs would come back. They would start looking at the record.
They'd open up the patient record and go,
is the update there? And if it wasn't there, they would come back and try
again in a few minutes. Repeated record lookups
were our sign that we weren't meeting those instinctive expectations
for the people in the ward, and soon they would be calling us to complain.
So we coded a repeat counter into our patient record application.
Now, why did we set "beyond 10 seconds" in there as
part of the measure?
Well, we had just "within five minutes" at
first, and then we had a problem. We kept missing the target even when the
records were processing fine. You see,
one or two of the nurses would not wait at all. They would just sit
there hitting enter again and again and again. So we added the "beyond
10 seconds" as well as the "within five minutes" to get
around those sort of crazy impatient ones, so that we weren't getting beaten up
for their behavior.
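A repeat counter along these lines could be sketched as below. The 10-second and five-minute bounds come from the talk; keying on the user and record pair, measuring the gap from the immediately preceding lookup, and all names are assumptions, since the original mainframe code is not shown.

```python
from datetime import datetime, timedelta

MIN_GAP = timedelta(seconds=10)   # ignore rapid re-presses of enter
MAX_GAP = timedelta(minutes=5)    # later lookups are treated as unrelated


def count_repeat_lookups(lookups):
    """Count lookups that repeat the previous lookup of the same record by the
    same user, more than 10 seconds but no more than 5 minutes later.

    `lookups` is an iterable of (user, record_id, timestamp) tuples.
    """
    last_seen = {}
    repeats = 0
    for user, record_id, ts in sorted(lookups, key=lambda item: item[2]):
        key = (user, record_id)
        previous = last_seen.get(key)
        if previous is not None and MIN_GAP < ts - previous <= MAX_GAP:
            repeats += 1
        last_seen[key] = ts
    return repeats


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 14, 0, 0)
    sample = [
        ("nurse_a", "rec1", t0),
        ("nurse_a", "rec1", t0 + timedelta(seconds=2)),    # impatient, ignored
        ("nurse_a", "rec1", t0 + timedelta(seconds=182)),  # came back, counted
    ]
    print(count_repeat_lookups(sample))
```

The lower bound filters out the enter-key hammering, and the upper bound keeps an unrelated visit to the same record later in the shift from being counted as a retry.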
Now, the SLO here is
a little different, as we want a low number or zero as
the outcome. You might have expected a tiny percentage here, but the
SLO includes all the malformed transactions coming out of that crappy
lab system. So we needed to be realistic with them. In fact, this SLO
sort of nests everything else in the system.
And again, the SLA had reaction room against the
SLO. So the eight-hour timeframe came
from the nurses, as they thought in terms of their shifts: you
know, what happens in the time that I'm here in the ward.
So with this three-way approach, we were really starting to work
with the customer towards their success, instead of
managing towards a metric or a contract or a specific piece
of technology. Now, both parties were using the SLA
as a tool instead of a cudgel to beat each other up with. So of
course there were many more SLIs and SLOs and SLAs
in reality. With a service provider,
it's generally a fairly thick contract. But remember, my goal here with the examples wasn't
to match reality, but to step you through the thought process.
You need to consider all three dimensions for success. The SLAs
are not there to beat each other up. They are there to capture
the mutual understanding.
You reach the mutual understanding through negotiation.
SLIs, SLOs, SLAs, and error budgets are
the tools to support negotiations.
Now, negotiating is a key skill for any
SRE. There are some great books out there,
although they can be a little contradictory sometimes, getting to yes versus getting
to no. And many are targeted at sales forces and closing
a deal. Personally, I'm a win-win negotiator. I want everybody
to leave feeling like they won. That's not always possible.
So here's a few quick thoughts based on my experience around negotiation,
and I'll say it again, negotiating is a key skill in
SRE. You may think that you just
need technology skills, and no, there is more to being an SRE than that.
Now, know thyself is not a new idea. It was
carved into the temple of Apollo in Greece around the fifth century
BC. And knowing yourself is the best place
to start. Consider your level of maturity as an SRE team
and as the environment. How much can you control?
What risk can you absorb and the business survive and
you keep your job? Is that risk spread evenly throughout
the year, or do you have peak periods, like a Black Friday or a Super
Bowl, or a New Year's Eve? Are you in a period of significant
transformation as an enterprise? Will things be the same in twelve months,
or be so different as to make today's SLOs meaningless?
Use all this to gather your needs. You probably have a
feeling for what expectations the business will have,
hopefully. What will you need to deliver
that expectation? Could you accept tougher SLAs
if you could grow your team or purchase supporting tools?
Now, when consulting, I try and brainstorm this to identify where my outer
boundaries are, what will be unacceptable,
or what will be too easy. Preparing to engage
is about gathering information. So what you want to do is
build a strategic model from your information.
There is a people factor here as well. So gather opinions about
the people that you will be negotiating with. What are their goals,
their attitude to risk versus innovation,
even subtle things like what time of day are they more open
to ideas or in a better mood?
Because you're going to schedule your facilitations around that.
Also identify a facilitator.
If this is going to be you, then read up and practice ahead of time.
Facilitation is a very specific skill.
Consider bringing someone in. You can bring in a contractor or a
service partner who does this for a living.
Facilitation is a skill. Or it can be great to bring in a leader
from another part of the organization who you know is a natural facilitator,
who can park their own ego and needs and draw input from
everybody in the room or in the meeting.
And actually that can give that volunteer a career profile
within the company as well. It'll let them see other sides
of the company and the other sides of the company see them. So it
can be a win-win. I said I like win-wins. Now it's
time to schedule the meetings and get on with your negotiations.
The negotiating meeting has a general flow, right? You have your warm up, and here
you set the scope for the discussion, the application, the dimensions,
the business value, and what they will get out of participating.
Be brief. They're all experts in some aspect here,
so you don't want to turn over every little stone yet. Just get everybody to
talk a little bit. Ask them to spend one or two minutes describing their
aspect. The nominated facilitator should
politely close down anybody who starts to exceed a brief introduction.
And that's why it's good to have sort of an outsider, so that it doesn't create
resentment in the room. It's just like, people get controlled.
Then you hit your test drive, and that's where you present what some of the
indicators under consideration could
be. Give one specific example of an SLI, an
SLO, and an SLA to clarify for them again.
So in the same way that, although pretty much everybody attending this conference
would know the basics of SLAs,
I took one SLI to just step through to
make sure that you and I were working off a common understanding
of them. And you're testing the water when you do this in the negotiation.
So try to make your example
SLIs something that could live on in the final agreement,
something real. Now you want to assess.
So you've test driven something with them, now they have something
to talk about and assess the business value. Is this the best place to start?
Will pursuing this scope give return on investment,
if you're a revenue-based organization?
Do you need to balance innovation and risk on this application?
What actions will be effective for missed SLAs?
Will you just freeze changes for a while to return to
stability, and if so, for how long? An important part
of this phase is extracting and capturing assumptions.
Clarifying assumptions is why footnotes on SLAs can
sometimes be longer than the actual SLA.
And don't be surprised if you end up that way, where the definitions and footnotes
explaining things in the SLA are more detailed than the SLA.
Then we reach the point of proposing. So now you have more information for
everyone, you've updated your test drive, you can make a new
proposal now. Now a few hints here.
Predictability is often more important than speed.
The higher the variance in response times, the more user experience is
negatively affected. If they know it's going
to take 60 seconds every time, they can,
I don't know, can you make a cup of tea or coffee in 60 seconds?
Probably not, but they know that that's coming. But if it's
most of the time taking 15 seconds, and then every once in a while
taking five minutes, that's a negative experience.
So avoid spreads that are greater than
six standard deviations of the metric.
Assuming you have the metrics tracked already, greater than six standard deviations
is an indication of low capability of process. It's not good.
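The talk does not define "spread" precisely, so the sketch below takes one possible reading: compare the observed range (maximum minus minimum) with six population standard deviations of the same samples. That reading, the function name, and the sample data are all assumptions, and the check is only meaningful with a reasonably large number of samples.

```python
import statistics


def spread_exceeds_six_sigma(samples):
    """True when the observed range is wider than six standard deviations.

    Assumed reading of the rule of thumb: a few extreme outliers stretch the
    range far more than they stretch the standard deviation.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    sigma = statistics.pstdev(samples)
    spread = max(samples) - min(samples)
    return spread > 6 * sigma


if __name__ == "__main__":
    predictable = [60.0, 61.0, 59.0, 60.0, 62.0, 58.0] * 10   # about 60 s every time
    erratic = [15.0] * 99 + [300.0]                           # usually 15 s, sometimes 5 min
    print(spread_exceeds_six_sigma(predictable))
    print(spread_exceeds_six_sigma(erratic))
```

With these made-up figures the steady 60-second service passes the check and the mostly-fast service with a five-minute outlier fails it, which matches the experience described above.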
And now you've proposed and we discussed it,
now you recur. Now you assess the new proposal like you did for the test
drive. You expect to iterate or recur through these steps,
as this is where the real negotiating occurs. You may need to schedule
follow-up meetings now. For each meeting, always take a few minutes to
revisit the warm up, restate the scope and goal, recap the
conversations to date, and try to acknowledge something from each person
during that warm up piece. Again, during the warm up, you're trying to
get them in the room and feeling
like they belong and that they are a participant, because you want them
participating and feeling respected. Then of course,
the wonderful thing. Eventually you reach the agree. This is the
final presentation of the finished SLI, SLO, and SLA
for sign-off. That doesn't have to be a physical signature,
but it is worth saying that you will need everyone to, as an example,
confirm by email. They will need to commit
on the record. If they're reluctant, it means you missed
something during the assess phase. Return to that
and try again. This is the
process of the negotiating meeting or meetings.
Sometimes this might only take ten minutes and sometimes it might
take three or four meetings. Okay, so thanks for
coming on that journey with me. Here's what I hope you take away from this
session. Learn from my experience. Don't manage
to the metrics. Focus on the outcomes, the full transaction,
the complete process, the overall experience.
Don't use service levels to beat each other up.
Use them to become preemptive. Use them so that you can offer more services
ahead of time. And when you build out
your service levels, remember to assess them against the three
dimensions. Are you seeing the full picture?
Is there some critical aspect that's being overlooked? One of the
most critical things around customer experience
that I have learned is predictability. The higher the variance in
response times, the more user experience is negatively
affected. High variance is also an indicator, as I say,
of a low-capability process. So keep your eyes on the transactions that
are outliers. Outliers annoy the crap out of us
and they annoy the crap out of users as well. So finally,
realize that if you want to be great at SRE, you will need negotiation
skills. Negotiation is useful for life and it's useful
for SRE. So as I promised,
here are some links on reading on similar topics.
So for those people who've swapped screens, this is the time to switch back to
the screen and take a snapshot of the links here. You can catch up
with most of my thoughts on my website and if you
found the session interesting, then please feel free to connect with me on
LinkedIn or Twitter. I may have more interesting thoughts tomorrow.
You never know, so it might be worth a try.
This is not my first time with the session, so I
love feedback. Feedback would be great. So what caught your attention?
What was important that I missed? Because I'm sure there's something that
I've left out. But with that, I want
to thank Conf42 SRE for hosting this session. I want to thank
you for your time today and I will see you in the Discord
channels. You can reach out to me there as well if you wish.
I hope the rest of the event and the rest of your day
is fantastic.