Transcript
Good morning, good afternoon, good evening, whatever time of day it happens
to be for you when you are watching this video from Conf42 SRE.
This is my talk on implementing a learning team, and it's
a case study of a situation that occurred a number of
years ago when I was working at LinkedIn that I
think you can identify with. And I hope that you'll be able to benefit from
the approach that we took, even if the specifics of your situation are a
bit different. First off, I'd like you to imagine
that it is 5:00 or 6:00 a.m. your local time,
wherever you are, and you get
an important email from the CEO that says,
hey, some numbers don't quite look right.
We have at LinkedIn this lovely report,
which yes, I realize is a bit of an eye chart.
We called it the key business metrics, and this
would come out every hour. It lagged real time by two
to four hours by the time different cleanup processes
had been run on it. And our CEO
was unique in his capability for pattern detection.
He would look at this, and let me zoom
in on it a little bit for you. It's an hourly report
covering the last 24 hours for a
handful of crucial activities that would occur on the site.
Page views is one of them. And then I've just labeled the others as
activities one through eight. And you can see, when
you compare week over week, that a number of these are
down, a number of them are up, the patterns don't quite match,
and that's where the red and the green came in.
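To make the comparison behind that red and green concrete, here is a minimal Python sketch. It assumes the hourly counts live in a pandas DataFrame with one column per activity and one row per hour; the names and the hourly frequency are assumptions for illustration, not the actual report pipeline.

    # Illustrative sketch of the week-over-week comparison behind the red/green
    # report. Assumes `df` is indexed by hourly timestamps, one column per activity.
    import pandas as pd

    def week_over_week_pct(df: pd.DataFrame) -> pd.DataFrame:
        # Compare each hour to the same hour one week (24 * 7 rows) earlier.
        pct = df.pct_change(periods=24 * 7) * 100
        return pct.round(1)  # positive reads as green, negative as red
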
And this contrasted with the
metrics that engineering would typically look at.
Engineering would look at metrics on one-minute intervals.
They would lag real time by about five minutes, compared to the key
business metrics that would lag by two to three hours.
And this example from inGraphs shows
a point in time where
we had less activity than the
previous week, comparatively. And so this would be
the kind of thing that would often get flagged in the key
business metrics. The question that would come up,
and this occurred, of course, right as the shift would
change. We had teams that were geographically spread on
both sides of the world, about 12 hours apart.
And the question would come in just as one team was
about to go away and the next team was coming on.
The incoming team, of course, had very little context at
the time. The outgoing team wanted to head home.
And the question of, hey, what's going on?
would come up. And being, as it was,
important, and of course related to, as I say, key business metrics,
people would want to jump on this and understand what's going on.
So the teams would scramble around trying to answer the question
right at the top of the shift change. And it
was not an easy thing to do. We were
figuring out what was going on, and every time
we would come in and look at it, we'd have to potentially
consider a plethora of different systems that were affected.
So what to do about this? Well,
being engineers and coming from an engineering
culture, our first idea was, hey,
let's get more data and let's analyze
the heck out of it. So we came up with an approach,
an idea of going with Holt-Winters seasonality.
Because the idea with Holt-Winters is that it can identify
multiple periodicities in the data. If we
can predict, trend wise at greater detail,
what should be happening, then that should give us a better insight as to what
might be going wrong. As you can see here
with an overlay from a sample chart,
it's not so easy in practice,
and the variations are still pretty extreme
and don't lend themselves to answering
the question of what's gone wrong.
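As a rough sketch of the kind of Holt-Winters forecast we were attempting, the following uses the open source statsmodels library. The series name, the 25% threshold, and the weekly seasonal period are assumptions for illustration, not the production setup.

    # Illustrative sketch: fit Holt-Winters on hourly counts and flag hours that
    # deviate sharply from the forecast. Assumes `page_views` is a pandas Series
    # of hourly counts covering at least two full weeks of history.
    import pandas as pd
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    def flag_deviations(page_views: pd.Series, horizon: int = 24) -> pd.Series:
        train, recent = page_views[:-horizon], page_views[-horizon:]
        model = ExponentialSmoothing(
            train,
            trend="add",
            seasonal="add",
            seasonal_periods=24 * 7,  # weekly cycle in hourly data
        ).fit()
        forecast = model.forecast(horizon)
        deviation = (recent.values - forecast.values) / forecast.values
        deviation = pd.Series(deviation, index=recent.index)
        return deviation[deviation.abs() > 0.25]  # illustrative 25% threshold

Even with a reasonable forecast in hand, the residual swings were still too large to point at a cause, which is exactly the problem described above.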
The problem in this scenario is that we have a
lot of different systems that go into these key activities.
They could have been emails that were being sent out,
and maybe something went wrong with the send, maybe it went out late,
maybe it went to the wrong people, maybe something
was wrong with the site and people couldn't get in and perform the key activities.
Or maybe the page latencies had changed due
to some change that had been shipped out.
It's really difficult to answer these questions off
the top of your head because there were so many factors
that could have gone into it. So,
needless to say, our idea of getting more data,
diving into the data with more detail and attempting
to predict just wasn't bearing fruit.
So we took a step back and we got the
teams together and we said, okay, let's see if we can equip
the teams, both the outgoing team and the incoming team,
to be able to answer these questions. Let's start by answering the
questions. We'll figure out how to automate things down the road.
So we started off with what I call calibrating
the intuition of the teams. We got the
teams from the two different geographies, plus kind of an overseeing
team, together, actually key practitioners from these teams,
and said, let's all get on the same page as to what
could be happening, what is happening, and how
can we tune our intuition to match our CEO's
intuition so that we can be ahead of the game and
say, hey, we saw this and here's the reason,
and you don't need to worry. That was ultimately what
we wanted to be able to answer. Or, of course, if there was a real
problem, then we wanted to be able to send that off to the correct teams
to address. So we started with a regular weekly meeting.
Of course, with teams that are split by 12 hours,
this wasn't necessarily an easy thing to do,
but we got together, we went and combed
through last week's data. Each week we would get together, we would
look at the past week's data, and we would bring the
insights from everybody on the project team together to
say, what did we see? What did we miss?
And then look at why did we miss it?
What did we not know that would have allowed us
to answer the questions in the moment
if we had known it? So this is really, in a
joint cognitive system, improving the common
ground amongst the practitioners, making sure
that the people who are involved are talking,
starting from a common place that we know both
the performance of the system as
well as the first and second order contributors to
how the system is performing and what could be impacting
these results. We also found that a very
important thing that needed to be considered was environmental aspects,
being a global service. And of
course, people interact with a service when
they're awake; generally, unless they're a bot or a
scraper, they are not operating outside of their
normal daytime hours.
Also being work oriented,
a professional network, as LinkedIn is,
people would tend to interact with the service quite differently
if it was a holiday from work, if it was,
for instance, the recent Memorial Day holiday.
The performance and the activity from the United States
on Monday, when it was a national holiday in the United States,
would be very different than the performance, for instance,
from Canada or for Europe, where there is no holiday and
normal behaviors persist. The United
States would be quite different on a holiday
that's just for the United States. So we had to become aware of those holidays
and whether or not they were going to be impacting people's
use of the site on a
global basis.
Interestingly, we even found
that sporting events would cause anomalies
in the system. This is a graph from a talk that was given
a couple of years ago by one of my colleagues, where he looked
at the performance of the site in the context of
the United States Super Bowl. It turned out that the Sharks were also playing,
which is a local sports team from San Jose.
But you can see significant anomalies at key points
in the game. This was on a Sunday;
Super Bowls are always on Sunday in the United States. But we would also find similar
things for the World Cup, whether that was in
football or cricket. Of course, those things
affect different geographies and have differential impact by different
countries. So it was an interesting
challenge, to say the least, to understand how these
different aspects play out against each other.
So, having calibrated the intuition
of the folks on the team, we started
off to take dual control,
in a sense, like in this airplane.
We wanted ultimately to get everybody flight qualified.
We wanted every individual engineer that was doing
the on call response to be able to successfully
fly the plane and answer the CEO directly without having
to escalate or cross check their answers.
But we started off with an experienced engineer leading and
taking point, so to speak, and responding to the CEO,
but at the same time having the other learners involved,
not just on a weekly basis, but in real time:
Okay, what do we think? Okay, here's the answer.
Let's formulate it and send it off.
After a bit of time and people feeling greater
comfort in taking the lead on these things,
we would have the initial response written by
one of the first line engineers,
but then reviewed by the experienced engineer before sending it off.
That worked. We continued this for a couple
more weeks, with a number more incidents effectively
coming along. And then we moved
to the learners leading.
Their responses would be monitored by the experienced engineer who would
provide offline feedback to the
learner saying, hey, you could have tightened this up here.
Did you consider this? And essentially doing
as close to real-time feedback as possible.
Ultimately, we got to the point where our first line engineers were fully
flight certified. They were able to
take control, to jump
in, make their responses. And what
was really the best is that, over
a little additional time, they got to the point where they were able
to anticipate most of
the questions that were coming from the CEO.
They would be able to say, hey, there's a holiday in
India today. And as a result, the traffic from
India is going to be significantly anomalous,
or there's a significant holiday across Europe,
continental Europe, for example. And so during
the hours of X, Y and Z,
traffic is not going to match last week's traffic.
With this success, we were able to back off on the calibration meetings.
We were able to drop it down from meeting weekly to meeting
every other week, and then once a month. And then
ultimately we were able to discontinue the calibration meetings
because with the full flight certification, so to speak of the team,
they were able to do this,
and anticipating these problems made
everybody much happier because there wasn't a huge scramble to deal
with problems at the end.
Now, over time,
automation was built in order to help.
So we started off with
understanding what was going on in the environment
thanks to Time and Date (I have a link to this at the end
of the slides), we were able to get a picture of
holidays around the world. Now, what this
doesn't tell you is whether or not the holiday matters to people's
behavior on a professional networking site.
That's something that we had to learn, what the key holidays were
and whether or not it was significant enough to have a
measurable impact on these key business metrics.
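As one hedged illustration of that kind of check, and not the tool we actually used, the open source `holidays` Python package can answer the first half of the question, which markets observe a public holiday today; the country list below is an assumption, and judging whether a given holiday actually moves the metrics is the part that still had to be learned.

    # Illustrative sketch: list which key markets have a public holiday today,
    # so a week-over-week dip can be annotated up front. The country list is an
    # assumption; whether a holiday matters to the metrics is learned intuition.
    from datetime import date
    import holidays

    KEY_COUNTRIES = ["US", "CA", "GB", "DE", "FR", "IN", "BR"]

    def holiday_notes(day=None):
        day = day or date.today()
        notes = {}
        for code in KEY_COUNTRIES:
            calendar = holidays.country_holidays(code)
            if day in calendar:
                notes[code] = calendar.get(day)
        return notes

    # e.g. on the last Monday of May this returns {"US": "Memorial Day"}
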
The other automation that was developed over time was
a tool called ThirdEye, and ThirdEye has
been open sourced by LinkedIn. I also have a link to some blog
posts which link to the GitHub repositories at the end of this
talk. And this is an example screenshot
from one of the blog posts which shows that
based on a number of different dimensions
in the data, we see in this example here
that the iOS presence is dramatically
negatively affected and maybe something
was broken, maybe direct links for iOS were
broken. This is all hypothetical, but the result on
key business metrics is notable and negative.
And so with the ThirdEye tooling, the team
was able to drill in much more quickly and
understand the dimensions that mattered, whether it be by
country, whether it be by platform, or a number
of other factors that affected how
the performance of the system came out.
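To show the dimensional idea in miniature, here is a sketch of the concept rather than ThirdEye's actual API: group the metric by a dimension such as platform or country, then rank segments by how much of the week-over-week change they account for. The column names are assumptions for illustration.

    # Illustrative sketch of a dimensional drill-down (not ThirdEye's actual API).
    # Assumes `df` has one row per segment with the dimension value plus
    # "this_week" and "last_week" totals for a key business metric.
    import pandas as pd

    def top_contributors(df: pd.DataFrame, dimension: str, n: int = 5) -> pd.DataFrame:
        grouped = df.groupby(dimension)[["this_week", "last_week"]].sum()
        grouped["delta"] = grouped["this_week"] - grouped["last_week"]
        grouped["pct_change"] = grouped["delta"] / grouped["last_week"] * 100
        # The most negative deltas are the segments dragging the metric down,
        # e.g. an iOS segment whose deep links broke.
        return grouped.sort_values("delta").head(n)
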
I wanted to recap this
a bit. It did take some
time for this automation to come along,
and with time the team was able to
successfully continue their tradition of working
effectively and answering these questions largely before the
CEO was asking them, which I consider a huge
success. Now, how does this
whole system work from
a learning team perspective? Because the
learning team is a group of directly involved engineers,
participants, operators, whoever is involved; it might be
product managers for that matter, or business analysts
who get together and encounter the same
problem from diverse perspectives and figure
out, through that diversity, how to
address it. So this fits perfectly with
a group learning model called Reds.
We came together as a group with joint purpose.
We sensed, gathering insights from everybody in the group,
what was going on and how we could potentially respond.
We developed these plans, these response plans, for the next time, and it
never took more than a couple of days before we would have an opportunity
to experiment with one of these response plans.
With some other anomaly in the data, we were able
to observe the effect of the responses and meet
in our continued calibration meetings to understand how
to fix and improve our
responses. We were able to refine these plans
and collaborate on a continuing basis, at the same
time as tooling was being developed. Now, the tooling did take
three or four years to come into play.
If we had had to continue to scramble several
times a week for three or four years without having a good
answer for the CEO, that would have been a terrible experience for everybody involved.
Ultimately, we were able to share outside of the
project team with the wider group of on-call engineers
who were fielding these things and dealing with the key
business metrics data. And the larger
team was up-leveled because of the work of this core
project learning team. I want to
point out in summary,
that resilience is already amongst you.
Even if your technology is not doing
what you need it to do, if it can't answer the question because of too
much ambiguity, trust in your people.
They have encountered it, they have dealt with it.
While they may not know all of the pieces, if you get
people together and they work on it together, then they
can figure it out. So learning teams
bring you several steps closer to the characteristics
of the operating patterns of what are called high reliability organizations.
There are five main characteristics that I won't go into right now,
but four of the five are covered by a learning team. They increase
the awareness of the imminence of failure so that you become
aware and attuned to failure, and you can catch
the instance early and not get caught
afterwards having to respond.
They recognize the practical expertise of the people at the front line
and this is really important because the people who have their hands in
the game are the ones who are best equipped to answer problems.
It builds in a commitment to resilience amongst
the team that was working together,
different people from different organizations
and different geographies, all working together to solve this one
joint problem. This brings resilience.
It enhances people's
self-awareness of their own resilience and the resilience of
their teammates, and brings this way of working to
a higher level. And actively seeking diverse opinions
in dealing with these problems, and how to respond to them, is a
great way to make people feel included.
Frankly, learning teams exemplify,
at LinkedIn, our principle that
talent is our most important priority and
technology is second. And when you're doing reliability
engineering or resilience engineering, don't forget
the people. Don't forget that your talent is your number one value
and the technology is in service of
the teams that bring the value. I have
some resources and links here. I think the slides will be
available afterwards if you want to know more about any
of the content that I mentioned. Thank you
for joining me here at Conf42 SRE
and enjoy the rest of the conference.