Conf42 Site Reliability Engineering 2022 - Online

Implementing a Learning Team: A real-world case-study

Abstract

When you have an executive breathing down your neck, looking for answers, but no one person has the necessary information to provide those answers, what do you do?

This talk uses a case study from a cross-team collaboration effort at LinkedIn. The work was intended to address (and succeeded in addressing) specific knowledge, process, and skill gaps between teams via a learning team methodology.

With this specific case study as a focal point, the talk will connect sources from the wider literature about organizational learning and learning teams to highlight processes, organizational structures, and skills that can be used to foster a healthy work environment with inclusivity and inter-team camaraderie while also achieving important business metrics and getting answers for that executive!

Participants will learn:

  • principles guiding the implementation of learning teams,
  • strengths and applicability for learning teams,
  • how to foster a more humane and effective workplace by appreciating the importance of “work as done” above “work as imagined.”

Summary

  • A case study of a situation that occurred a number of teams ago when I was working at LinkedIn. We came up with the idea of going with Holt-Winters seasonality: if we can predict, trend-wise and at greater detail, what should be happening, then that should give us better insight into what might be going wrong.
  • A learning team brings you several steps closer to the resilience of high-reliability organizations. Frankly, at LinkedIn talent is our most important priority, and technology is second. The tooling took three or four years to come into play.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Good morning, good afternoon, good evening, whatever time of day it happens to be for you when you are watching this video from Conf42 SRE. This is my talk on implementing a learning team, and it's a case study of a situation that occurred a number of teams ago when I was working at LinkedIn that I think you can identify with. And I hope that you'll be able to benefit from the approach that we took, even if the specifics of your situation are a bit different. First off, I'd like you to imagine that it is 5:00 or 6:00 a.m. your local time, wherever you are, and you get an important email from the CEO that says, hey, some numbers don't quite look right. We had at LinkedIn this lovely report, which, yes, I realize is a bit of an eye chart. We called it the key business metrics, and this would come out every hour. It lagged real time by two to four hours by the time different cleanup processes had been run on it. And our CEO was unique in his capability for pattern detection. He would look at this, and let me zoom in on it a little bit for you. It's an hourly report covering the last 24 hours for a handful of crucial activities that would occur on the site. Page views is one of them, and I've just labeled the others as activities one through eight. You can see, when you compare week over week, that a number of these are down and a number of them are up; the patterns don't quite match, and that's where the red and the green came in. This contrasted with the metrics that engineering would typically look at. Engineering would look at metrics on one-minute intervals that would lag real time by about five minutes, compared to the key business metrics that would lag by two to three hours. And this example from inGraphs shows, obviously, a point in time where we had less activity than the previous week, comparatively. So this would be the kind of thing that would often get flagged in the key business metrics. The question that would come up, and this occurred, of course, right as the shift would change. We had teams that were geographically spread on both sides of the world, about 12 hours apart. And the question would come in just as one team was about to go away and the next team was coming on. The incoming team, of course, had very little context at the time, and the outgoing team wanted to head home. And the question of, hey, what's going on, would come up. And being, as it was, important and of course related to, as I say, key business metrics, people would want to jump on this and understand what was going on. So the teams would scramble around trying to answer the question right at the top of the shift change, and it was not an easy thing to do. We were figuring out what was going on, and every time we would come in and look at it, we'd have to potentially consider a plethora of different systems that could be affected. So what to do about this? Well, being engineers and coming from an engineering culture, our first idea was, hey, let's get more data and let's analyze the heck out of it. So we came up with an approach, an idea of going with Holt-Winters seasonality, because the idea with Holt-Winters is that it can identify multiple periodicities in the data. If we can predict, trend-wise and at greater detail, what should be happening, then that should give us better insight as to what might be going wrong.
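As a rough illustration of the Holt-Winters idea just described, here is a minimal sketch that fits a model with a weekly seasonal period and flags hours that stray from the forecast. It assumes the Python statsmodels library and uses synthetic data and a made-up 20% threshold; it is not the tooling LinkedIn built.

    # Minimal Holt-Winters sketch: synthetic data, purely to illustrate
    # "predict the trend, then compare reality against it".
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # Fake hourly page-view counts with a weekly (168-hour) cycle plus noise.
    rng = np.random.default_rng(42)
    hours = pd.date_range("2022-01-01", periods=24 * 7 * 6, freq="H")
    weekly_cycle = 1000 + 300 * np.sin(2 * np.pi * np.arange(len(hours)) / 168)
    observed = pd.Series(weekly_cycle + rng.normal(0, 50, len(hours)), index=hours)

    train, actual = observed[:-24], observed[-24:]

    # Additive trend plus a 168-hour seasonal component (the "multiple
    # periodicities" idea mentioned in the talk).
    model = ExponentialSmoothing(
        train, trend="add", seasonal="add", seasonal_periods=168
    ).fit()
    forecast = model.forecast(24)

    # Flag hours where reality deviates from the prediction by more than ~20%.
    deviation = pd.Series(
        (actual.to_numpy() - forecast.to_numpy()) / forecast.to_numpy(),
        index=actual.index,
    )
    print(deviation[deviation.abs() > 0.20])

As the talk goes on to explain, the hard part was not fitting such a model but that the remaining variation was still too large and too multi-causal to answer "what went wrong."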
As you can see here with an overlay from a sample chart, it's not so effective; the variations are still pretty extreme and don't lend themselves to answering the question of what's gone wrong. The problem in this scenario is that we have a lot of different systems that go into these key activities. It could have been emails that were being sent out, and maybe something went wrong with the send, maybe it went out late, maybe it went to the wrong people; maybe something was wrong with the site and people couldn't get in and perform the key activities; or maybe the page latencies had changed due to some change that had been shipped out. It's really difficult to answer these questions off the top of your head because there were so many factors that could have gone into it. So, needless to say, our idea of getting more data, diving into the data in more detail and attempting to predict just wasn't bearing fruit. So we took a step back and we got the teams together and we said, okay, let's see if we can equip the teams, both the outgoing team and the incoming team, to be able to answer these questions. Let's start by answering the questions; we'll figure out how to automate things down the road. So we started off with what I call calibrating the intuition of the teams. We got the teams from the two different geographies, plus kind of an overseeing team, together, actually key practitioners from these teams, and said, let's all get on the same page as to what could be happening, what is happening, and how can we tune our intuition to match our CEO's intuition so that we can be ahead of the game and say, hey, we saw this and here's the reason, and you don't need to worry. That was ultimately what we wanted to be able to answer. Or, of course, if there was a real problem, then we wanted to be able to send that off to the correct teams to address. So we started with a regular weekly meeting. Of course, with teams that are split by 12 hours, this wasn't necessarily an easy thing to do, but we got together and combed through last week's data. Each week we would get together, we would look at the past week's data, and we would bring the insights from everybody on the project team together to say, what did we see? What did we miss? And then look at why did we miss it? What did we not know that would have allowed us to answer the questions in the moment if we had known it? So this is really, in a joint cognitive system, improving the common ground amongst the practitioners: making sure that the people who are involved are talking and starting from a common place, so that we know both the performance of the system as well as the first- and second-order contributors to how the system is performing and what could be impacting these results. We also found that a very important thing to consider was environmental aspects, this being a global service. Of course, people interact with a service when they're awake; generally, unless they're a bot or a scraper, they aren't operating outside of their normal daytime hours. Also, being work oriented, a professional network as LinkedIn is, people would tend to interact with the service quite differently if it was a holiday from work, if it was, for instance, the recent Memorial Day holiday.
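The holiday awareness just described can be sketched with the open-source Python holidays package; the country list and helper below are illustrative assumptions, not the data source the team actually used.

    # Hypothetical sketch: annotate a date with per-country public holidays so a
    # week-over-week gap can be explained before anyone has to ask about it.
    # Uses the open-source `holidays` package (recent versions).
    from datetime import date

    import holidays

    COUNTRIES = ["US", "CA", "IN", "DE", "GB"]  # illustrative subset

    def holiday_context(day: date) -> dict:
        """Return {country: holiday name} for each tracked country on holiday."""
        context = {}
        for country in COUNTRIES:
            cal = holidays.country_holidays(country, years=day.year)
            if day in cal:
                context[country] = cal.get(day)
        return context

    # Example: US traffic will look anomalous on Memorial Day 2022.
    print(holiday_context(date(2022, 5, 30)))  # -> {'US': 'Memorial Day'}

As the talk notes later, the calendar only tells you that a holiday exists; whether it measurably changes behavior on a professional networking site is something the team still had to learn.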
The performance and the activity from the United States on Monday, when it was a national holiday in the United States, would be very different from the performance, for instance, from Canada or from Europe, where there is no holiday and normal behaviors persist. The United States would be quite different on a holiday that's just for the United States. So we had to become aware of those holidays and whether or not they were going to be impacting people's use of the site on a global basis. Interestingly, we even found that sporting events would cause anomalies in the system. This is a graph from a talk that was given a couple of years ago by one of my colleagues, where he looked at the performance of the site in the context of the United States Super Bowl. It turned out that the Sharks, a local sports team from San Jose, were also playing. But you can see significant anomalies at key points in the game. This was on a Sunday; Super Bowls were always on Sunday in the United States. But we would also find similar things for the World Cup, whether that was in football or cricket. Of course, those events affect different geographies and have differential impact by country. So it was an interesting challenge, to say the least, to understand how these different aspects play out against each other. So, having calibrated the intuition of the folks on the team, we started off to take dual control, in a sense, like in this airplane. We wanted ultimately to get everybody flight qualified. We wanted every individual engineer that was doing the on-call response to be able to successfully fly the plane and answer the CEO directly without having to escalate or cross-check their answers. But we started off with an experienced engineer leading, taking point, so to speak, and responding to the CEO, but at the same time having the other learners involved, not just on a weekly basis but in real time: okay, what do we think? Okay, here's the answer. Let's formulate it and send it off. After a bit of time, and people feeling greater comfort in taking the lead on these things, we would have the initial response written by one of the first-line engineers, but then reviewed by the experienced engineer before sending it off. That worked. We continued this for a couple more weeks, with a number more incidents effectively coming along. And then we moved to the learners leading. Their responses would be monitored by the experienced engineer, who would provide offline feedback to the learner saying, hey, you could have tightened this up here, did you consider this? Essentially giving as close to real-time feedback as possible. Ultimately, we got to the point where our first-line engineers were fully flight certified. They were able to take control, to jump in, make their responses. And what was really the best, over a little additional time, is that they got to the point where they were able to anticipate most of the questions that were coming from the CEO. They would be able to say, hey, there's a holiday in India today, and as a result the traffic from India is going to be significantly anomalous; or there's a significant holiday across continental Europe, for example, and so during the hours of X, Y and Z, traffic is not going to match last week's traffic. With this success, we were able to back off on the calibration meetings.
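Mechanically, the anticipation those first-line engineers developed amounts to a week-over-week comparison paired with the environmental context above. Here is a minimal sketch of the mechanical half, with a hypothetical helper name and a made-up 15% threshold; the real report and its cleanup pipeline were more involved.

    # Hypothetical helper: compare each hour of a metric to the same hour one
    # week earlier and surface the large gaps, mimicking the red/green report.
    import pandas as pd

    def week_over_week_flags(hourly: pd.Series, threshold: float = 0.15) -> pd.DataFrame:
        """Flag hours that deviate from the same hour one week earlier."""
        # Assumes a complete, gap-free hourly index (shift is positional).
        last_week = hourly.shift(24 * 7)
        pct_change = (hourly - last_week) / last_week
        report = pd.DataFrame(
            {"current": hourly, "last_week": last_week, "pct_change": pct_change}
        )
        return report[report["pct_change"].abs() > threshold]

    # Usage: flags = week_over_week_flags(hourly_page_views)
    # Each flagged row is a candidate "hey, what's going on?" question,
    # which the holiday/event context then helps to explain.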
We were able to drop it down from meeting weekly to meeting every other week, and then once a month. And then ultimately we were able to discontinue the calibration meetings, because with the full flight certification, so to speak, of the team, they were able to do this. And anticipating these problems made everybody much happier, because there wasn't a huge scramble to deal with problems at the end. Now, over time, automation was built in order to help. We started off with understanding what was going on in the environment thanks to the timeanddate.com site (I have a link to this at the end of the slides), which gave us a picture of holidays around the world. Now, what this doesn't tell you is whether or not the holiday matters to people's behavior on a professional networking site. That's something we had to learn: what the key holidays were and whether or not they were significant enough to have a measurable impact on these key business metrics. The other automation that was developed over time is called ThirdEye, and ThirdEye has been open sourced by LinkedIn. I also have a link to some blog posts, which link to the GitHub repositories, at the end of the talk. And this is an example screenshot from one of the blog posts, which shows that, based on a number of different dimensions in the data, the iOS presence in this example is dramatically negatively affected; maybe something was broken, maybe direct links for iOS were broken. This is all hypothetical, but the result on key business metrics is notable and negative. And so with the ThirdEye tooling, the team was able to drill in much more quickly and understand the dimensions that mattered, whether it be by country, by platform, or a number of other factors that affected how the performance of the system came out. I wanted to recap this a bit. It did take some time for this automation to come along, and with time the team was able to successfully continue their tradition of working effectively and answering these questions largely before the CEO was asking them, which I consider a huge success. Now, how does this whole system work from a learning team perspective? Because a learning team is a group of directly involved engineers, participants, operators, whoever is involved, it might be product managers for that matter, or business analysts, who get together, encounter the same problem from diverse perspectives, and figure out, through that diversity, how to address it. So this fits perfectly with a group learning model called as Reds. We came together as a group with joint purpose. We sensed, gathering insights from everybody in the group, what was going on and how we could potentially respond. We developed these plans, the response plans for the next time, and it never took more than a couple of days before we would have an opportunity to experiment with one of these response plans on some other anomaly in the data. We were able to observe the effect of the responses and meet in our continued calibration meetings to understand how to fix and improve our responses. We were able to refine these plans and collaborate on a continuing basis. At the same time, tooling was being developed. Now, the tooling did take three or four years to come into play. If we had had to continue to scramble several times a week for three or four years without having a good answer for the CEO, that would have been a terrible experience for everybody involved.
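To give a flavor of the dimensional drill-down that ThirdEye automates, here is a plain pandas sketch that ranks dimension values by their week-over-week contribution to a drop. The data and column names are made up, and this is not ThirdEye's API; the blog posts linked at the end of the talk describe the real tool.

    # Hypothetical drill-down: which dimension value (here, platform) explains
    # most of a week-over-week drop in a key business metric?
    import pandas as pd

    events = pd.DataFrame(
        {
            "platform": ["ios", "android", "web", "ios", "android", "web"],
            "this_week": [900, 1500, 2100, 880, 1490, 2080],
            "last_week": [1400, 1480, 2120, 1390, 1470, 2090],
        }
    )

    by_dim = events.groupby("platform")[["this_week", "last_week"]].sum()
    by_dim["delta"] = by_dim["this_week"] - by_dim["last_week"]
    by_dim["pct_change"] = by_dim["delta"] / by_dim["last_week"]

    # The most negative delta is the first place to look; in this made-up data,
    # iOS traffic is down roughly 36% week over week while other platforms are flat.
    print(by_dim.sort_values("delta"))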
Ultimately, we were able to share beyond the project team with the wider group of on-call engineers who were fielding these things and dealing with the key business metrics data, and the larger team was up-leveled because of the work of this core project learning team. I want to point out, in summary, that resilience is already amongst you, even if your technology is not doing what you need it to do. If it can't answer the question because of too much ambiguity, trust in your people. They have encountered it, they have dealt with it. While they may not know all of the pieces, if you get people together and they work on it together, then they can figure it out. So learning teams bring you several steps closer to the characteristics of the operating patterns of what are called high-reliability organizations. There are five main characteristics that I won't go into right now, but four of the five are covered by a learning team. They increase the awareness of the imminence of failure, so that you become aware and attuned to failure and can catch the incident early, rather than getting caught afterwards having to respond. They recognize the practical expertise of the people at the front line, and this is really important because the people who have their hands in the game are the ones who are best equipped to answer problems. They build in a commitment to resilience amongst the team that was working together: different people from different organizations and different geographies, all working together to solve this one joint problem. This brings resilience. It enhances people's self-awareness of their own resilience and the resilience of their teammates, and brings this way of working to a higher level. And actively seeking diverse opinions in dealing with these problems and how to respond to them is a great way to make people feel included. Frankly, learning teams exemplify our priority at LinkedIn that talent is our most important priority and technology is second. And when you're doing reliability engineering or resilience engineering, don't forget the people. Don't forget that your talent is your number one value and the technology is in service of the teams that bring the value. I have some resources and links here; I think the slides will be available afterwards if you want to know more about any of the content that I mentioned. Thank you for joining me here at Conf42 SRE, and enjoy the rest of the conference.

Kurt Andersen

SRE Architect @ Blameless



