Transcript
The 2022 Accelerate
State of DevOps Report. Executive summary.
For the last eight years, we have produced the Accelerate State
of DevOps... Hey, Nathan, you know, the report,
it's interesting, but I
had, like, something else in mind that we could do today.
Amanda, what. What is this?
Well, you see, Nathan, one of the key findings from the report
is that context matters. And so I know we
were planning to share the highlights from the 2022
state of DevOps report, but instead, let's bring the report
to life through story time.
Okay. As long as it's clear to you and to me and to
all of you that are watching, this is a completely fictional
story. Oh, absolutely. Right. And we probably
should add some disclaimers for the Google lawyers.
Oh, yes, of course. The story. All names,
characters, and incidents portrayed in this production are fictitious.
No identification with actual persons, living or deceased, places,
buildings, and products is intended or should be inferred.
And also, no animals were harmed in the telling of this story.
So I was thinking, good stories have a protagonist,
an antagonist, an inciting action. Then there's conflict,
challenges, and we get to a resolution.
So when you think about that, the protagonist must face obstacles
and setbacks throughout the story before they can reach
their goal. So, for today's story time,
let's talk about Log4Shell,
because, really, isn't it the gift we all received in December of
2021? So let's go back
there, Nathan. Let's go back to December 10,
2021. Where were you when you heard about Log4Shell?
Let's see.
Friday, obviously, I was planning a pretty
lightweight day. You know,
hashtag no deploy Fridays. And it was
December, so I'm sure I had some holiday shopping to
do. So your Friday was
looking like this, but then it changed.
Yeah. In fact, everything did change.
Would you hold my coffee? Sure.
All right, so walk me through it.
You started here. The CVE?
Yeah, kind of. It wasn't really like that,
though. It was more like a roller coaster. You know,
I went through the five stages of grief. Denial:
denial. I mean, look, Twitter was the first place I heard
about this issue. Was it really a thing?
And then I had anger: shoot, it sure is. It's a
real issue. Then began bargaining.
But, I mean, a bug in the logging software?
How bad could it be? Isn't this something that can wait until
Monday? Or better yet, that really quiet week
that's coming up? Can't I just put it on the backlog until, I don't
know, December 27?
But as I dug deeper into the issue, depression really started to set
in. I was talking about it with my colleagues
and the rest of the team, and I realized that my weekend was about
to take a turn for the worse. Finally, the fifth stage
of grief is acceptance. I declared
an incident and myself as the incident commander.
I started our incident response procedures, which include firing up
a Slack channel, gathering representatives from each team working on
all of our applications, and starting up some tracking
documents.
So it didn't really look like this single point in time.
It was really kind of more of a
flow. A roller coaster maybe.
It was definitely a roller coaster like that. So the next thing
I did, well, I picked up my phone to call my family
and let them know that another CVE was
going to change virtually everything
about my plans for the weekend. We needed a plan,
and so I went with one of my favorite tools,
the OODA loop. Do you know the OODA loop?
Can you remind me? I always forget what the second O stands for.
Right. OODA loop: observe,
orient, decide, and act.
We observed that there was a vulnerability. Next, we had to orient
which of our production systems were going to be impacted
or were currently impacted. Then we had to decide what we
were going to do. Well, actually, that was the easiest part. We were
going to upgrade Log4j to remediate this vulnerability. And
then act. That's the last step: we get the team to work.
Of course, it is a loop. So we act and then we go back through
the loop.
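To make the shape of that loop concrete, here is a minimal Python sketch of an OODA-style response loop. It only illustrates the structure Nathan describes; the step bodies, system names, and exit condition are placeholders, not details from the actual incident.

```python
# A sketch of an OODA-style incident response loop. Every function body here
# is a placeholder; in a real incident these steps are people and process.

def observe():
    """Gather what is known: the CVE, advisories, chatter from the team."""
    return {"cve": "CVE-2021-44228", "advisories": []}

def orient(observations):
    """Work out which of our systems are, or could be, impacted."""
    return ["order-management", "ecommerce-frontend"]  # illustrative names

def decide(impacted_systems):
    """Choose the remediation, e.g. upgrade Log4j on each impacted system."""
    return [("upgrade log4j", system) for system in impacted_systems]

def act(actions):
    """Hand the work to the teams and track it."""
    for action, system in actions:
        print(f"{system}: {action}")

def incident_resolved():
    """Placeholder exit condition: everything patched and verified."""
    return True

# It is a loop: act, then go back through observe, orient, and decide again.
while True:
    act(decide(orient(observe())))
    if incident_resolved():
        break
```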
All right, so how many production systems were impacted? Oh, yeah, our production systems.
So let's see, there was one, two,
about 400 different production applications,
and most of them were going
to be impacted by this vulnerability. We thought, wow,
so this was a gift. It was like 400 gifts,
right? So how long did it take you to assess 400
production systems? Oh, yeah, it took about two
minutes. We just did some querying through our SBOMs,
our software bills of materials, to find out which would be impacted.
It was pretty easy, really. I mean, that is
amazing. Then what did you do next?
Yeah, sorry, I wish it was that amazing. SBOMs
are pretty awesome, but honestly, we haven't
deployed them everywhere. There's maybe one application that's
not yet in production, but we have a good SBOM for that, don't worry.
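As an aside, the kind of SBOM query Nathan describes can be very short. Here is a minimal sketch, assuming CycloneDX-style JSON SBOMs collected in a single directory; the directory layout, file naming, and the "safe" version are assumptions for illustration, not details from the talk.

```python
# Sketch: find every application whose SBOM lists log4j-core, and report the
# version it declares. Assumes one CycloneDX-style JSON SBOM per application
# in ./sboms/, named after the application.
import json
from pathlib import Path

VULNERABLE_COMPONENT = "log4j-core"

def impacted_apps(sbom_dir: str):
    """Yield (application name, declared log4j-core version) pairs."""
    for sbom_path in Path(sbom_dir).glob("*.json"):
        sbom = json.loads(sbom_path.read_text())
        for component in sbom.get("components", []):
            if component.get("name") == VULNERABLE_COMPONENT:
                yield sbom_path.stem, component.get("version", "unknown")

if __name__ == "__main__":
    for app, version in impacted_apps("./sboms"):
        print(f"{app}: log4j-core {version}")
```

With SBOMs in place, assessing a fleet really can be a couple of minutes of querying; the point of the story is that the SBOMs have to exist first.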
So what really happened is we had to manually
inspect all 400 of those applications, which meant
calling in subject matter experts from each of those applications
and asking them to do some work over the weekend.
But by Monday morning,
we'd identified two applications that were the most critical.
We knew we needed to fix those.
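Without SBOMs, a crude first pass can still narrow the manual work: scan the build files of each checked-out repository for a declared Log4j dependency. The sketch below assumes Maven pom.xml files under ./repos/<application>/ and is purely illustrative; it misses transitive dependencies, which is exactly why subject matter experts still had to inspect each application.

```python
# Sketch: scan checked-out repositories for a declared log4j-core dependency
# in pom.xml, and flag versions older than 2.17.1 (the release that, as noted
# later in the talk, did not land until December 28).
import re
from pathlib import Path

LOG4J_DEP = re.compile(
    r"<artifactId>log4j-core</artifactId>\s*<version>([^<]+)</version>"
)

def needs_check(version: str) -> bool:
    """Crude numeric comparison; property placeholders like ${log4j.version}
    parse as empty and get flagged for a closer look."""
    parts = tuple(int(p) for p in re.findall(r"\d+", version)[:3])
    return parts < (2, 17, 1)

def scan_repos(root: str):
    """Yield (application, declared log4j-core version) for each pom.xml hit."""
    for pom in Path(root).glob("*/pom.xml"):
        match = LOG4J_DEP.search(pom.read_text())
        if match:
            yield pom.parent.name, match.group(1)

if __name__ == "__main__":
    for app, version in scan_repos("./repos"):
        flag = "CHECK" if needs_check(version) else "OK"
        print(f"{app}: log4j-core {version} [{flag}]")
```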
You know, Nathan, I've got to tell you, I love the jokes, the way
you're laying them in there. And I'm curious if you were telling
jokes like this when you were going through this.
So which two applications were the most critical?
Well, first, it was no laughing matter for sure.
But the two most critical applications that we knew we needed to
fix were our order management system. This system has
been around forever. It's truly the heart of our business.
If it's offline, customers can't buy anything and we can't ship anything.
And the other system that was top of mind was
our ecommerce site. This is the face of the business.
It's where our customers come to purchase things. So if it's down
or not working, we can't serve any of our customers.
So our two applications, the order management system and our ecommerce
front end. All right,
so I'm going to say let's start with
discussing the ecommerce website. I expect it was easier to
tackle that than the order management system, since
the order management system is older and the website is newer. And also,
since that's where an order starts, it seems like a great place to go
next. So can you tell me about the ecommerce website?
Yeah, you're totally right. It is a good place for us to start.
As you mentioned, it's the front end and the application itself
was built using microservices.
So maybe it's the right place to go.
But as it turns out, we actually did not have an easy resolution
for the website. You see, decisions were made years ago.
These decisions came back to haunt us. When we built
this site several years ago, our team didn't actually have
any expertise with microservices, but we knew we
wanted a modern architecture. And a modern architecture
requires microservices. So what did we
do? Easy. We hired in some
consultants and a vendor to help build and ship
the site. Ultimately, we paid
for functionality, not knowledge or
documentation. Well, I mean, I imagine
at the time this trade off made sense. Bringing in a partner is
an awesome solution when it's done in collaboration with the
organization's team, and then they're upskilled after that
engagement. So do you have access to the code or
do you need to work with that vendor to make these updates?
Well, the website is on our infrastructure and we have the
source code for the microservices.
There are about 27 different microservices that make up this application.
So the code is spread across about
27 different repositories. But since the marketing team has
a UI for adding and modifying, removing,
basically managing the content and the offerings that we have on
the site, we don't really have to touch the code base very frequently.
In fact, we only are putting out changes once or twice a
year. And those updates are each strategically planned.
They take at least two months to get through all of our
manual testing. But we were able to
quickly identify, across those 27 microservices, where
we had the vulnerability. Unfortunately, that was the only quick
and easy part of remediating this
microservices application. You see, there was no automated build process.
So when we found a Log4j library that needed patching,
we had to update it and then manually execute those builds.
And there was no testing in place. No automated testing in place,
anyhow.
Wow, there's a lot to unpack here.
I guess I'm a little bit surprised.
Okay, so you've really only been making
updates to the application a couple of times a year without
any automated build process or testing. I mean,
I can only imagine that the likelihood of failure is going
to be very high. Exactly. We found one
microservice first that had Log4j and we tried to upgrade it.
I mean, we upgraded Log4j on that microservice and we deployed everything to
a staging environment and everything broke.
It turned out that all of our microservices are very tightly coupled together.
Interesting. So how did you know everything was broken? And I don't know
if I even wanted to ask this question, but how long did it
take you to fix it? Well, we knew everything was broken because
we would deploy it and refresh the site to check
to see if anything was broken. And what we saw was
500s. 500 was not the number of orders we received;
instead, it was the server error code that we got.
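A smoke check of that sort does not need much to be automated. Here is a minimal sketch: the staging host and the list of pages are made up for illustration, and it only catches the 500-level errors the team was spotting by refreshing the site.

```python
# Sketch: hit a few key pages on the staging site and report any 5xx responses,
# instead of refreshing the browser after every deploy. Host and paths are
# hypothetical.
import urllib.error
import urllib.request

BASE_URL = "https://staging.example.com"         # hypothetical staging host
PATHS = ["/", "/catalog", "/cart", "/checkout"]  # hypothetical key pages

def smoke_check():
    """Return a list of (path, problem) tuples for pages that look broken."""
    failures = []
    for path in PATHS:
        try:
            with urllib.request.urlopen(BASE_URL + path, timeout=10) as resp:
                status = resp.status
        except urllib.error.HTTPError as err:
            status = err.code                    # urlopen raises on 4xx/5xx
        except urllib.error.URLError as err:
            failures.append((path, f"unreachable: {err.reason}"))
            continue
        if status >= 500:
            failures.append((path, f"HTTP {status}"))
    return failures

if __name__ == "__main__":
    for path, problem in smoke_check():
        print(f"FAIL {path}: {problem}")
```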
So we entered this process of build,
deploy, see it fail. Let's try the next microservice.
It was pretty painful. This does not sound
like it was very fun. No,
no. Remember, Amanda, we didn't even start working on these changes until Monday.
It took us the entire weekend to identify which
systems we should prioritize for remediation.
In the end, the teams spent all week
updating and testing those 27 services that
made up our front end website. By Friday afternoon,
though, they were ready to deploy the changes.
All right, so then it's about a week till you had everything fixed.
No. Remember, hashtag no deploy
Fridays. So the team was ready on Friday,
but we couldn't deploy. And the changes for
this site, they have to go through a change approval board who
only meets on Tuesdays and Thursdays. Luckily, though, we can call an
emergency CAB, especially for high risk
security incidents like this particular one. They agreed
to meet on Monday. So after looking over
all the changes, the change approval board, the CAB, was
a bit uncomfortable with this deployment. They asked that the development team do
some additional manual testing. And I'll
tell you, it was a good thing too, because one good vulnerability
deserves another. It turns out there were a few releases of
Log4j in rapid succession. The CAB
suggested holding off on any updates to
production until the Log4j releases stabilized.
This way we could batch up all of the changes into one
single release. Turns out there were four
updates. The last one wasn't even released until the 28th of
December. Wow.
So 20 plus days. I can totally understand
why the CAB made the decision they did,
but I have to tell you,
I'm listening to this and I'm thinking about it and putting myself
in the story and I
just feel a little burnt out. I mean, I imagine your
team was burnt out at the end of this.
Absolutely. Everyone on the ecommerce team was definitely feeling
pretty crispy by that Wednesday morning after
the final deployment. Long hours obviously
contributed to that, as well as the
stress of the vulnerability itself and the
uncertainty of whether the changes would work. But probably one
of the biggest stressors was the heavyweight change approval process.
In the end, it was difficult for the team to understand and assess what
it would take to go from commit to
approved, much less deployed and working.
That sounds rough. So I guess how
has the team fared? You know, I know a
couple of months ago we had the OpenSSL CVE.
Was that an easier experience? Have you learned from
this? I mean, the team has definitely learned a lot over
the last year and a half or so. And we
were spared from the OpenSSL CVE because,
well, that team is still on the 1.1 branch of OpenSSL.
So having advanced warning of the pending vulnerability
though did help some. But it also reminded the team
of the progress that we still need to make. This is
a journey. It's true. We all sometimes
need that reminder. Right? So that was a great recap
of the ecommerce team. What about the order management
system? That system you said is the heart of the business.
And I imagine since it's been around forever
that it was even slower to update than the microservices based
front end. Oh yeah, I can see how you would think that. Of course the
OMS is older, it's larger, it follows more of a
macroservice than a microservice architectural pattern.
But unlike the ecommerce system, the OMS is
something that our internal teams have been actively developing over
the years. In fact, over the previous two years, the OMS
team was able to go from quarterly releases to
deploying updates to the system on a weekly basis.
So in many respects, they were better prepared for log for Shell
than the ecommerce team was.
Wow, that's fantastic. And I love to hear that they've
been iterating and improving. So how did it go?
Well, on Monday morning, the team identified the three components that were
impacted. They upgraded the Log4j library in one
of the components, and then their continuous integration process
automatically kicked in. A JAR file was built,
automated tests were run. That JAR file was automatically
deployed to a test environment where some additional tests were run. The team
took their passing tests, their build pipeline,
they took both of those things to the CAB for approval, and
the deployment was rubber stamped. Ship it,
they said, and the team did. So wait,
okay, so that is incredible.
But didn't you say that there were three components and they
only updated and shipped one of those components?
Yes, that is true. But the components are built in a way that they
can be independently tested and deployed. And everyone is
comfortable with this because, well, frankly, that's how we've been working
in practice for well over a year now.
Okay, so one down, two to go.
Those must have been pretty easy. So this team
had it all fixed by Wednesday? Almost.
I mean, the tests failed on the second component.
When the second component was updated and the tests ran, they failed.
So it took a while to track down and fix that
bug. So I
feel like you told me once about this team, and they
were the ones that had that habit of prioritizing their
broken builds. Wasn't this the team? Oh, yeah, that's exactly
right. So after the first one was fixed, we split up the team
and said, work on components two and three. When the tests failed on
component two, the entire team swarmed. Let's figure
out what broke these best. And it was a good thing too,
because it took the teams most of the day to actually track
it down. It was kind of a hidden bug. It was elusive,
if you will, but they were ready to deploy by Wednesday
morning. Thursday came around, the third component was updated and
released. My goodness. I have to
say, do they have any open positions on this team? Because it almost
sounds fun, like, I would have enjoyed being a part of this process instead of
something scary. It sounds exciting and thrilling. This has
just been incredible. So I appreciate you sharing all of this
with me, and I think I may be able to help.
Oh, really? So, Amanda, how do we
help the website team have more of an experience
like the order management team in the future?
Well, Dora the Explorer.
No, not that Dora. Oh, the Digital
Operational Resilience Act. Not that one.
Oh, I know. The Designated Outdoor
Refreshment Area. Nathan, it's not
even that. You know, it does look like they have a lot of
fun in Ohio. Cheers to that. Right. So the
Dora we're talking about, for our purposes today, is DevOps
Research and Assessment. Dora is an ongoing
research program that's been around for about eight years. The research program
has primarily been funded by a number of different organizations
over those years. For a few years, the research program was funded
by the organization of that same name, Dora.
Dora was founded by Dr. Nicole Forsgren, Jez Humble,
and Gene Kim. Then in 2018,
Dora the company, was acquired by Google Cloud.
The Dora team at Google Cloud has continued the research into the
capabilities and practices that predict the outcomes
we consider central to DevOps. The research
has remained platform and tool agnostic.
And personally, it has been an incredible experience to
work with the research team, not only because of the learnings,
but also a better understanding of the research practice,
the ethos, the ethics, and the passion they bring to this body
of work. Yeah, I think it's super cool.
And one of the things that's really important is that focus on capabilities.
In fact, through the research, we're able to investigate those capabilities that
span across technical, process, and cultural
capabilities. And through our predictive analysis, we're able
to show that these capabilities are predictive
of, or drive, software delivery and operations performance.
Oh, which, by the way, predicts better organizational performance.
Oh. So, Nathan, it's like a maturity model with
a roadmap built in?
No, no, Amanda, context matters.
And in fact, there is no one size fits all roadmap
or maturity model for you to follow. You have to understand your team's
context and focus on the right capabilities.
That's right. In previous years, we had
learned that delivery performance drives organizational performance.
But like you said, context matters. Additional context
this year from the findings was that delivery performance
drives organizational performance, but only when operational
performance is also high. That's right.
And operational performance, we oftentimes talk about that as reliability.
But reliability itself is a very context
specific thing that's hard to measure. In fact,
reliability itself is a multifaceted measure
of how well a team upholds their commitments to
their customers. And this year, we continued our explorations
into reliability as a factor in that software delivery and operations
performance. We looked at some of those things, like, how does
a team reduce toil? How do they use their reliability
to prioritize or reprioritize the work that they're doing.
And one of the most interesting things that we found there is that
reliability is required. As you said, software delivery
doesn't really predict organizational success without that
operational performance as well. But we also saw that
SRE investment takes time. Teams that are
newly adopting some of these practices or capabilities,
or have only adopted one or two of them,
may see some initial setbacks in their
reliability, but as a team sticks with it, they can
see this curve really start to take effect,
where they will start ramping up their overall reliability.
Investment takes time and practice. This is a journey.
So while it's not a roadmap, these technical
capabilities are building on one another. Right. What I'm hearing
you say is that teams improve as they get better at additional capabilities.
That's right. And when you look at a number of capabilities together,
this is where you really start to see that multiplicative
effect. So, for example, teams that are embracing
and improving their capability with these technical practices,
like version control and loosely coupled architecture,
these teams show 3.8
times higher organizational performance.
And then security is a big part of this as
well. And of course, security fits very
well into our story about Log4j. And the truth is we're
all facing similar measures and, what's more, similar
constraints and capabilities. So one of the
things that we looked into this year was supply chain security
and specifically software supply chain security. And we used a number
of different practices to measure that. But what we've seen is
that adoption has already begun. So that's
really good to see. Of course, there's room for lots more.
Another thing that we see is that healthier cultures have a head
start. Culture was one of the top predictors of
whether or not a team was embracing these security
practices.
So when you say healthier cultures,
you're really talking about generative cultures,
right? Characterized by that high trust and free flow of
information. These kinds of performance
oriented cultures are more likely to establish those
security practices than those lower trust organizational
cultures. That's right, Amanda. And it turns out that
security also provides some unexpected benefits.
And thinking about the security of your supply chain:
sure, you're going to have a reduction in security risks.
That's not an unexpected benefit; that's the hoped-for
benefit. But better security practices can
also carry additional advantages, such as reducing burnout
on the team. Oh, and there's also a key integration point.
Adoption of the technical aspects of software supply chain security
appears to hinge on the use of good
continuous integration practices, which provides the
integration platform for many supply chain security practices.
So I guess here again is another example of how capabilities really
interact with each other and build upon each other.
Because when we compared the two, continuous
integration and security, we found that the teams
that were above average on both, they had the best overall
organization performance. So having good continuous
integration and good security is a real driver
for your organization. And I
think we saw this in practice as well. Think back to that order
management system team. They had a really good continuous integration practice
on that team. And as a result, they were able to
assess really how this updated
library was going to impact the application.
The continuous integration was building and running tests and building
their confidence, whereas the website team, without any
continuous integration to speak of,
they had to do everything manually.
Right. It's interesting because in both of these cases, they had change approval
boards, but on one side, you have this kind of mysterious,
spooky CAB that is just blocking all of your changes.
Right. They don't appear out
of nowhere, but maybe there's more for us to think about their
role and how they show up in our organization,
who's on it, how many people are on it,
who gets the final say, and what happens if
that person goes on vacation? So I think
we should also look at maybe, like, when was it formed?
Why was it formed? How have things changed since then?
And does our oversight need to change as well? I think we see
the OMS team clearly had a very different experience
with their CAB. And I'm going to guess that it has changed over
time, whereas with the website team,
perhaps that wasn't the case.
I've heard of a story where after process changes,
the CAB is no longer on the critical path.
They only deal with those outlier changes,
and as a result, deployment frequency increased 800
x. Yeah, it is really startling to see
that type of improvement. I have worked with a team that saw exactly
those results. But you're right. In each of these
cases, both teams had to go through the CAB, the change approval
board. It is really that demonstration of why context
matters so much. Amanda,
remember, we are just talking about two
of the 400 applications that needed updating.
There were a lot of meetings, negotiations, blood,
sweat, and tears. All that went into getting
the rest of the fleet updated. Oh, there were also
spreadsheets. Lots and lots of spreadsheets.
But in short, it was a very long tail to get everything
fully up to date. I might not have wanted to be around
for the whole journey, but you know my love for spreadsheets. So thank you
for letting me know about that. All right, so tell
me about. Whoa, Amanda, this is
too small. I can't read anything.
Hmm. Maybe you need new glasses. Or let me
zoom in a little bit for you. Oh, thank you.
So that previous chart was a bunch of the capabilities
that we've investigated as part of the research. And here we are zoomed in
on a couple of those capabilities. For example, continuous integration
and loosely coupled architecture. We can see that these capabilities
drive better security practices. Our culture
also drives better security practices. And those security practices
and culture together can help reduce burnout.
They can help reduce the errors that we see in our system
and lead to a bunch of other really interesting
outcomes. So when you think about how to apply the research
to your own team and your own organization, the idea
is that you start with the outcomes that you want to improve and
then work backwards to find the capabilities where you need to
get better. And the idea then is to understand
which capability is holding us back, and let's make an investment
in improving that capability.
All right, so we had zoomed in, and thank you
for kind of explaining how we can look at this and how to move
through it. So now I kind of zoomed back out so we could view
all of the capabilities,
but I would say we've got all this potential
of things that we could change. What's important is
that we remember to not boil the ocean.
Right. We can't go do all of these things tomorrow. You've inspired
me, Nathan. I want to go do that. I want to be on that team.
But the truth of the matter is that to really
effect change in our team, we cannot change it overnight.
We have to remember that it's an investment and we
should start out slow and that really we're going to reach
an inflection point where we start to see that improvement. But there
might be some pain along the way, and we really need to support one
another through that J curve that you showed us earlier.
Absolutely. And it is team specific. That
order management system team, they still have
areas to improve, but they're different areas than what the ecommerce
team has to improve. So you cannot use this as a roadmap,
but you can use it to help identify which capabilities
are holding your team back and then commit
to addressing and improving those capabilities and watching
as your outcomes improve.
Nathan, I just realized there's
one thing that we didn't do today. Oh, what's that?
Well, we forgot to introduce ourselves. So I'm Amanda
Lewis. I'm a developer advocate with Google Cloud, focused on
the Dora research program. Hi. And I'm
Nathan Harvey. I'm also a developer advocate focused on the Dora
research program and helping teams improve using the insights
and findings from the research itself.
One of my favorite parts about my role as a Dora advocate is
working with the community. And so back in September, when we
launched the 2022 report, we also launched a community of practice
around Dora. So I will hope that all of you out there
will come and join us. If you go to the Dora community,
you can join the Google Group, and that will give you the ability to join
in on some asynchronous conversations that are going on and also
invitations to our open discussions that we're having periodically.
And Nathan, do you want to share about maybe some experiences
you've had in some of our lean coffee discussions or topics
and things that we've been having with the community? You know, my favorite
part about these discussions is that we really cater
them to the people that show up each time for the discussions. That's one of
the benefits of using the lean coffee format. But the
other thing that is really beneficial is that we don't always know
exactly where the conversation will go. I like to say that we need
to be prepared to be surprised. And so we've
had really interesting conversations and perspectives from
practitioners that are putting these capabilities to work. But we're
also hearing from leaders and importantly,
researchers, both the researchers on the Dora project,
but also other researchers across the software delivery
field, the developer productivity field, and so
forth. So it truly is a community where we can bring
together practitioners, leaders, and researchers
to help us all improve. Absolutely. And I think as we've
seen, Nathan, as we're working with teams and helping them apply and use the research,
we realized that we really needed to connect people
together because you are the experts in your business and you can bring that
experience and how you've applied it together.
And I have learned so much since September. It's been
absolutely incredible. Absolutely. So thank you all so much
for tuning in to our presentation today. We hope that we will
see you on the Dora community. And before
you go, make sure you grab the URL or QR
code so that you can download your very own copy
of the 2022 Accelerate State of DevOps report.
Now, Amanda, can you give me that report back so I can continue on
with my reading aloud of the report?
Okay. I like reading it, but. All right, you can have it.
All right, well, maybe we'll save that for another
time. Thank you so much, everyone. Thanks,
Amanda.