Transcript
Hi. I'm super excited to be here at Conf 42,
talking about four key metrics for measuring your team's
performance. But before I get started, let me
just introduce myself. So my name is Christina.
I'm currently a founding engineer at Cortex. I've been at Cortex for
over a year now. We're a Series A startup backed
by Sequoia. And basically we're giving organizations
visibility into the status and quality of their microservices
and helping teams drive adoption of best practices so that they
can deliver higher quality software. Before joining Cortex,
I was a front end team lead at Bridgewater Associates
for four years, and I also previously interned at
Microsoft. I studied at the University of Pennsylvania,
and I'm originally from Colombia.
So what I'll be talking about today actually is DORA metrics. So DORA metrics are used by DevOps teams to measure their performance. And the origin of DORA metrics is pretty cool.
So it actually comes from an organization called DevOps Research and Assessment, hence DORA.
And it was a team put together by Google to survey thousands
of development teams across many different industries
to understand what a high performing team is, what a
low performing team is, and the key differences between those and
after this, basically, the DORA metrics, the four metrics that I'll be talking about, came to life.
You might have heard of them before. If any of you are using CircleCI,
I actually just got an email from them two days ago asking
me to fill out a survey because they're partnering with DevOps Research and Assessment to put out the next report.
And so the four key metrics are actually lead time for changes, deployment frequency, mean time to recovery, and change failure rate. So the first one I'll
be talking about and digging into is lead time for changes. This is basically
the amount of time between a commit and production.
Different teams choose to measure this differently. You could also choose
to have it be from the time that a ticket gets added
to a sprint and is actually in progress to getting to production,
or from the time it's merged to getting to production. That's up to you
and your team to actually decide and figure out what works best for you.
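Just to make the arithmetic concrete, here's a minimal sketch of the calculation, assuming you can pull a commit timestamp and a production deploy timestamp for each change out of whatever tools you're using (the timestamps below are made up):

    from datetime import datetime, timedelta

    def lead_time(committed_at: datetime, deployed_at: datetime) -> timedelta:
        # Lead time for a single change: from commit to running in production.
        return deployed_at - committed_at

    # Hypothetical data: (commit time, production deploy time) for each change.
    changes = [
        (datetime(2022, 3, 1, 9, 30), datetime(2022, 3, 1, 15, 0)),
        (datetime(2022, 3, 2, 11, 0), datetime(2022, 3, 4, 10, 0)),
    ]

    lead_times = [lead_time(c, d) for c, d in changes]
    average = sum(lead_times, timedelta()) / len(lead_times)
    print(f"average lead time: {average}")  # '1 day, 2:15:00' for this sample data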
But this is a good indicator of how agile and responsive your team
is. And so from that DevOps report that was put out in 2021,
what they found is that elite performers can have less than an hour of lead time for changes, and low performers have six
plus months. And so how do you
actually measure and improve lead time for changes?
Right. We've all been in this situation of opening up
a pull request, being ready to review it and seeing that
it has hundreds of files changed and many commits and lines of code, and you just close it back up. You're like, I'm not going to do this right now. And that's something that's actually going to increase your lead time for changes and mean that you're not doing so great.
So this image is actually a perfect example of what can lead
to long lead time for changes. If your team is trying
to make huge changes in one go, it's going to take a lot longer to
review it. It's going to take a lot longer to test it. It's going to
take a lot longer to be confident before you get it to production.
It also means that your code reviews might sit around for too long
and so you want to avoid making huge changes like this.
Another thing that can lead to long lead time for changes is potentially changing
requirements. So once you open up a pull request and you're testing it out,
say, with your design team or your product manager or your users,
and they're like, oh, but can you just add this one extra thing
or this other thing? That's where it's on you as an engineer
to say, no, these were my requirements going in and so
this is what my PR is going to do and just
make follow-up tickets for those additional tasks. And then the fourth thing that can lead to long lead time for changes is an insufficient CI/CD pipeline. So you could be
merging to production often, but if releasing is
actually a really long and painful process, you're probably not releasing that
often. And I'll talk more about that later when I talk about
deployment frequency. So what does short
lead time for changes look like? You want to make sure that everyone
on your team can review PRs and that there are no bottlenecks waiting on that one person to do the review.
You also want to reject tickets that aren't fully fleshed out.
So if something doesn't make sense,
they're like, oh, we'll get the design mocks to you later. That's going to mean
that your ticket is going to be open for a long time. You're going to
have merge conflicts. You're going to be going back and forth.
It's not worth starting development on it if it's not fully
fleshed out and the requirements aren't clear, and then
you want to again escalate changing requirements. So if
you see that something is taking super long and
people keep adding things to it, you probably want to say, let's pause on
development and make sure that we go back and flesh
out the tickets before starting. And so how
you can actually improve this lead time for changes is by breaking it up
into buckets. So you can see the time that a developer
is taking to work on the change and see if that's what's taking the
longest, or if it's the time that the pull request is open and the review is being done and it's being tested, or if it's actually after it's merged and getting to production. And you can identify which of those three buckets is taking the longest and focus on decreasing that amount of time.
And so a way to measure this, again, is just using Jira.
Look at your tickets, look at how long they've been open, look at the status
of your tickets and how long it's taking to go from column to column.
And you can see sprint over sprint if these numbers are
getting longer, decreasing, or just staying the same.
And again, spot where those bottlenecks are and figure out how your team
can improve on it.
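As a rough sketch of that bucket idea, here's one way you could sum up the time a ticket spends in each bucket, assuming you can export the ticket's status transitions from Jira; the statuses and bucket names here are placeholders for whatever your own board uses:

    from collections import defaultdict
    from datetime import datetime, timedelta

    # Hypothetical status history for one ticket: (entered_at, status).
    transitions = [
        (datetime(2022, 3, 1, 9, 0), "In Progress"),
        (datetime(2022, 3, 3, 14, 0), "In Review"),
        (datetime(2022, 3, 4, 10, 0), "Merged"),
        (datetime(2022, 3, 5, 16, 0), "In Production"),
    ]

    # Map each board status to one of the three buckets discussed above.
    bucket_for_status = {
        "In Progress": "development",
        "In Review": "review and testing",
        "Merged": "merge to production",
    }

    time_in_bucket = defaultdict(timedelta)
    for (entered, status), (left, _next_status) in zip(transitions, transitions[1:]):
        bucket = bucket_for_status.get(status)
        if bucket:
            time_in_bucket[bucket] += left - entered

    for bucket, duration in time_in_bucket.items():
        print(f"{bucket}: {duration}")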
So, moving on to my second metric, which I touched upon briefly in lead time for changes: deployment frequency.
So deployment frequency is how often you ship changes and
how consistent your software delivery is. So you want to ship small changes to production as often as you can.
And so a common misconception is that by shipping to production
more often, you're creating more risk and that you
might have more incidents. But actually it's the opposite
because it's going to be easier to figure out what
caused those incidents by having small changes.
And so you'll be able to actually pinpoint incidents
faster and get your mean time to recovery, another metric I'll be talking about later, to decrease.
And so basically, the idea is that if you ship
to production often, you deeply understand the small changes
going into it, and you'll be able to basically
improve upon that. A high deployment frequency
will end up actually reducing your overall risk,
even though you are deploying more often. And it's useful for determining whether your team is meeting its goals for continuous delivery and is actually continuously improving the customer experience.
So deployment frequency has an impact on
the end user, right. You're just getting stuff out to them way
more quickly than if you're waiting on
many releases to deploy. And then again, you don't know
how those changes can play with each other, and potentially it can create problems.
So going back to the report put out
in 2021, they basically found that
elite performers deploy multiple times a day and low performers
do it every six months. And so basically,
you want to encourage your engineers and your QA team
once again to work closely together to make sure that you can deploy
often and you want to build out good automated tests so that
you are confident in your releases as you're going through.
And so again, another image that we've all seen
before, it kind of looks like that cascading waterfall.
It kind of reminds me of back when Microsoft used to sell Office on CDs and put out releases every three years or so, and we would all go and buy the software and
upgrade. We're not in that world anymore. And so we're in a
world where you can be deploying and releasing changes to customers often.
And so you don't really want a waterfall looking thing
like this. So again, low deployment
frequency can be the result of having insufficient CI/CD
pipelines. It can
be that people are bottlenecked. So if you only have
say, three engineers who know how to deploy to production,
you're taking up their time, they might not be around, they might be
on vacation. That can mean that you're deploying less often. And then if
you have a lengthy manual testing process, that's also going to mean that you deploy
less frequently because it's going to be taking up your engineers
time. Whereas a high deployment
frequency comes from making it super easy
to release. You want to be shipping each PR to production, in an ideal world, on its own, so that basically you know exactly what the change is. And I totally get that this might not
work for big teams with a monolith, but in this case you can
use a technique called release trains where basically you
ship to production at fixed intervals throughout the day, and that can also help increase your deployment frequency.
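To sketch what that looks like, purely as an illustration (your CI system probably has its own way of scheduling this), a release train is really just a set of fixed departure times that deploys leave on:

    from datetime import datetime, timedelta, time

    # Hypothetical departure times for the release train.
    DEPARTURES = [time(10, 0), time(13, 0), time(16, 0)]

    def next_departure(now: datetime) -> datetime:
        # Return the next scheduled departure; merged changes wait for it.
        for t in DEPARTURES:
            candidate = now.replace(hour=t.hour, minute=t.minute, second=0, microsecond=0)
            if candidate > now:
                return candidate
        # After the last train of the day, the next one is tomorrow morning.
        first = DEPARTURES[0]
        return (now + timedelta(days=1)).replace(
            hour=first.hour, minute=first.minute, second=0, microsecond=0)

    print(next_departure(datetime.now()))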
You want to make sure that you're setting up good integration and end-to-end tests so that you're confident in your deployments and aren't spending a long time on manually testing each use case and
each application. And you want to make sure that you have good testing
environments with accurate data once again, just so that
you're more confident in these releases. And you really want
to drive a DevOps ethos across your whole team so
that everyone knows that this is how things work.
And so again, just ways to actually measure deployment
frequency. You can look at the number of releases in a sprint.
Everyone has different sprints. I've seen one week, I've seen two weeks,
I've seen three weeks. Whatever it is that your team is doing, just measure
how often you're actually releasing every sprint. Is your average once a day, or is your average once a week? And see how
you can actually get that to be more frequent.
And so you can do this by looking at GitHub, you can do it
by looking at your deployments and
seeing your pods. There's various
ways to measure deployment frequency using whatever tools you're
using today.
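As one deliberately simple example of that, here's a sketch that counts releases per week straight out of your Git history; it assumes you tag every production release, which of course is an assumption about your workflow:

    import subprocess
    from collections import Counter
    from datetime import datetime

    # List every tag with its creation date (assumes production releases are tagged).
    out = subprocess.run(
        ["git", "for-each-ref", "--format=%(creatordate:short) %(refname:short)", "refs/tags"],
        capture_output=True, text=True, check=True,
    ).stdout

    deploys_per_week = Counter()
    for line in out.splitlines():
        date_str, _tag = line.split(" ", 1)
        week = datetime.strptime(date_str, "%Y-%m-%d").strftime("%Y-W%W")
        deploys_per_week[week] += 1

    for week, count in sorted(deploys_per_week.items()):
        print(f"{week}: {count} release(s)")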
Moving on to our third metric: mean time to recovery.
So this is the average amount of time that it takes
your team to restore service when there's a service disruption,
like an outage. And so this one actually offers
a look into the stability of your software and the agility of your
team in the face of a challenge.
So again, the DevOps report found that elite performers
have less than an hour of mean time to recovery, and low performers can take anywhere upwards of six months to actually get back up. And by that point,
you've probably lost all of your customers and should really
evaluate why it took you that long to get back up and running.
But to dive into why this metric is
important a little bit more than I've done for the other two,
I'll just use a concrete example, which is Meta's
outage from October 4, 2021, that
lasted five and a half hours. So whether you
use Facebook Messenger, Instagram,
WhatsApp, you were probably impacted by this outage. I know I
was. I have all of my family in Colombia and couldn't
talk to them during that day because WhatsApp was down. But a
lot of businesses actually run on WhatsApp.
And so a lot of businesses were impacted by this outage as
well. And so the outage was actually triggered by a
system that manages the global backbone network capacity
for Facebook. And basically it's built to
connect all the computing facilities together.
And that backbone consists of tens of thousands of miles of fiber-optic cables crossing the globe, linking all of their data centers. And basically the
problem was that during a routine maintenance job
for these routers, there was a command issued
with the intention to assess the availability of that
global backbone capacity, which unintentionally
took down all of the connections and effectively disconnected all of Facebook's data centers.
And so their systems are designed to audit commands like this and prevent mistakes from happening.
But there was a bug in the audit tool that actually did not catch
this one. And as the engineers
were working to basically figure out what was
going wrong and how to get it back up, they faced two main obstacles.
The first is that it was not possible to physically access
the data centers because they were protected.
And then also the total loss of DNS ended up breaking many
of the internal tools that would help them diagnose these problems.
And so Facebook actually put out a long post mortem
on this and a long article about what they're going to do
to prevent this from happening in the future. And I encourage you to take a
look at it if you're interested. But at the end of the day, this outage
cost Facebook over $60 million
and again lasted five and a half hours. It's the longest outage they've ever
had. Another popular tool
that had a similar issue also in October of last
year is Roblox. I was at a party recently with
a bunch of kids and the seven year olds were talking my ear off
about Roblox. And actually
what happened was that they had an outage that lasted over three
days. And you may be saying, yeah, it's just a kids game that's impacted,
but it actually cost them about $25 million. So once
again, huge cost associated with
this outage. But what happened was
two issues, and once again, they put out a long post mortem on this and what they're going to do to fix it. So I encourage you to take a look
at it. But they were enabling a new
feature that created unusually high read and
write load and led to excessive contention and poor
performance. And these load conditions actually triggered
a pathological performance issue in an open source system
that is used to manage the write ahead logs.
And what this did is it actually brought down
critical monitoring systems that provide visibility into these tools.
And so this circular dependency, where the thing that's down is the thing that would help you diagnose it, is exactly what Roblox said they're going to fix going forward. And it's something that you need to be thinking about as your team thinks about mean time to recovery. You don't want your observability stack to be tied to the same systems it's supposed to monitor, because at the end of the day,
it's just going to make it harder for you to bring it back up when
these outages do occur.
So again, if we look at what could cause long
mean time to recovery,
risky infrastructure, poor ability to actually roll
back these changes, right. You want to make sure you always have a plan in
place so that if there is an outage, you can roll back while you
figure out what's wrong with that latest release. Having a
bad incident management process where potentially you don't know who's on
call or who the owner is or who to call, and then
having tribal knowledge or insufficient documentation. You want to make sure
that you have clear documentation for all the services that you have, runbooks that you have, and logs that are accessible to everyone. Basically anything that could be needed to actually debug what's going wrong, you want to make sure your team is trained to use.
And this is actually something that Cortex helps with.
We have a service catalog feature where you can see all this information
about your services and basically have one spot to go
as you are dealing with an incident and looking for this information. So for a short mean time to recovery, the big difference
here is having a tight incident management process.
Again, knowing who to call when, having the ability to
roll back quickly, having the tools needed to diagnose what's
wrong and having those clear runbooks easily accessible.
And again, a thing that personally I learned
from hearing about these two outages I went through is that you probably don't want the DNS for your status page to be the same as the DNS for your website. If your website's down,
so is your status page and you want to make sure that you're thinking about
those things and keeping them separate. And so
ways to actually measure mean time to recovery are using whatever on-call provider you have, so for example, PagerDuty, VictorOps, or Opsgenie. You want to measure how long that outage was, how much time passed before the fix was discovered, and how much time until it was released. Again, if you have insufficient CI/CD pipelines,
it might take longer to get that out, even if you know what the fix
is. And then also you want to look at how long it took
you to discover the outage. Like do you have the proper alerting so that
when an outage happens, you know immediately, or is it taking
a few hours and taking a customer calling it out for
you to see the outage? And so you can use whatever tools you're already using to measure this and see where those gaps are in order to improve your incident management process going forward.
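And just to show the arithmetic, here's a minimal sketch, assuming you can export when each incident started, when it was detected, and when service was restored from whichever on-call tool you use (the incidents below are made up):

    from datetime import datetime, timedelta

    # Hypothetical incident export: started, detected, resolved.
    incidents = [
        {"started": datetime(2022, 4, 1, 2, 10),
         "detected": datetime(2022, 4, 1, 2, 40),
         "resolved": datetime(2022, 4, 1, 3, 25)},
        {"started": datetime(2022, 4, 9, 14, 0),
         "detected": datetime(2022, 4, 9, 14, 5),
         "resolved": datetime(2022, 4, 9, 16, 35)},
    ]

    def mean(deltas):
        return sum(deltas, timedelta()) / len(deltas)

    mttr = mean([i["resolved"] - i["started"] for i in incidents])
    mttd = mean([i["detected"] - i["started"] for i in incidents])

    print(f"mean time to recovery: {mttr}")  # 1:55:00 for this sample data
    print(f"mean time to detect:   {mttd}")  # 0:17:30 for this sample data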
And that brings me to my fourth metric, which is change failure rate. So this is the percentage of changes that result in a failure.
This can include bugs that affect customers or
releases that result in downtime or degraded
service or rollbacks. This is again up to your
team to define what you want to include in that
change failure rate and basically figure
out which parts of it you want to measure. A common
mistake that teams make when measuring this is to just look at
the total number of failures rather than the rate. But the
rate is actually pretty important because the goal is
to ship quickly. So if you look at the number of failures
and you're shipping super often, that number might be higher,
right? But actually you want to make sure that
you have more deployments so that it's easier to
again have that mean time to recovery be lower.
And so you want to look at the rate,
not just the number of failures. This can also be a good indicator
for how much time your team is spending fixing
processes rather than working on new features.
And so again, looking at this report,
the state of DevOps 2021 report found that elite
performers have anywhere from zero to 15% change failure rate.
Anything 15% or higher isn't great.
And so we've all seen these memes, we've all kind of laughed
at these. But you know, that moment when you're looking at your code and
you're just patching up bug after bug after bug, that's something that
you want to evaluate because it is increasing the number
of bugs that your customers see and creating a poor customer
experience. So a high change failure rate can be the result of sloppy code reviews, maybe people are just looking at the code but not thinking about all the use cases or actually testing it out. Again, insufficient testing,
whether it's unit tests, integration tests, or end-to-end tests, and then having staging environments
with insufficient test data. So if your staging environment
doesn't reflect the data that customers are using, at the end of
the day, it may not be a good representation of testing
your changes before actually rolling them out.
And so a low change failure rate, the way you get to that is by promoting an ethos that is focused on DevOps. And so basically creating that culture of quality, making sure that you have representative
development and staging environments so that you can test this before it
gets to production, and having a strong partnership
between product and engineering so that you deeply understand
the use cases and actually know how to test before
going forward. And that you've handled all the potential edge cases and that you write tests for those edge cases,
anything basically to make you more confident that the features you're
releasing work in the way that they're meant to.
And so ways that you can measure this change failure
rate is you can look at how many releases have caused downtime,
you can look at how many tickets have actually resulted
in incidents, you can look at how many tickets have follow-up bug tickets attached to them. Because again,
ideally you would catch those bugs before they go out.
Even if they don't necessarily cause an outage, it still causes a bad customer experience. And then you can
honestly dig a step further and you can see
how many of these issues are a result of not having unit
tests in place. Like would a unit test have caught
the issue? Would an end
to end test have caught the issue? Or was it having bad data? And then make sure that you actually update that data in your staging environment so that going forward you can catch issues similar to whatever it is that caused the problem.
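To make the rate-versus-count point concrete, here's a tiny sketch; how you flag a deployment as failed (incident, rollback, follow-up bug ticket) is exactly the definition your team has to agree on, and the numbers here are made up:

    # True means the deployment caused an incident, a rollback, or a follow-up bug.
    team_a_deploys = [False] * 90 + [True] * 10   # 100 deploys, 10 failures
    team_b_deploys = [False] * 7 + [True] * 3     # 10 deploys, 3 failures

    def change_failure_rate(deploys):
        return sum(deploys) / len(deploys)

    # Team A has more failures in absolute terms (10 vs 3), but its rate is
    # lower (10% vs 30%), which is the number that actually matters here.
    print(f"team A: {change_failure_rate(team_a_deploys):.0%}")  # 10%
    print(f"team B: {change_failure_rate(team_b_deploys):.0%}")  # 30%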
And so that was a broad overview of all four metrics, but now if we put them together, so again, they're lead time for changes, deployment frequency, mean time to recovery, and change failure rate. What they're really
looking at is speed versus stability. So lead
time for changes and deployment frequency are really looking at speed.
How fast are you getting these changes out to your users?
And stability is mean time to recovery and change failure
rate. So how often is your app unstable due to changes that
you have gotten out? And so the key is actually to empower
your developers and give them the tools that they need to succeed.
At the end of the day, it's not literally about these metrics,
it's just about your team and figuring out how to use these metrics
to improve their performance. And so your
developers are the ones who are able to make the changes to help
your team reach its goals. And you want to make sure that they
understand these metrics, why they're important, and are using them to
improve their processes day to day.
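And if it helps to see the four of them side by side, here's the kind of per-sprint summary you could put in front of the team, grouped the same way into speed and stability (the numbers are made up):

    # Made-up numbers for one sprint, grouped into speed versus stability.
    dora_summary = {
        "speed": {
            "lead time for changes (hours)": 26.0,
            "deployments per day": 3.5,
        },
        "stability": {
            "mean time to recovery (hours)": 1.9,
            "change failure rate": "8%",
        },
    }

    for group, metrics in dora_summary.items():
        print(group)
        for name, value in metrics.items():
            print(f"  {name}: {value}")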
And to give a more concrete example on literally measuring
this, as I mentioned earlier, I'm an engineer
at Cortex, and we help teams define standards
and measure how they're doing. And so what you
see on the screen right now is one of our features, which is called scorecards.
It allows you to create rules for your team and
actually will measure all your services and how they're doing and
give you the scores based on the rules that you
created, based on your integrations for your services.
And then from this, you can create initiatives to help improve
those things going forward. So you can say, by Q3,
I really want to improve my deployment frequency.
I want to make sure that the CI/CD pipelines are sufficient and that we have better testing. So you can measure things like test coverage, and you can use scorecards to make this a moving target
across your teams. And so this is exactly what
we do. Thank you very much. I hope
you enjoyed learning about DORA metrics, and feel free to
put any questions in the chat.