Transcript
Welcome to There's No Place Like Production. I'm Paige Cruz from Chronosphere. This talk was inspired by all of the discourse that swirls around a little phrase: "I test in production." It's one of those phrases that people seem to want to take the wrong way and get into Internet fights about. So I want to tell you the story of how I learned that there's no place like production, and that I do test in prod, but there are a few things you should know first.
I started my tech career at a monitoring company.
And when you work for a monitoring company, no matter where
else you go after that, you are considered the monitoring expert,
and that is a big mantle to carry.
The second thing you should know is that I was hired into this organization
as a senior site reliability engineer,
and something that had been low key stressing me out since
being an SRE and holding the pager for like five or
six years, was that I hadn't yet caused
an incident. All of my friends who were seasoned had tales
of taking down production, accidentally deleting clusters or databases.
And I was just sitting there waiting for my moment. And the last thing
is that during this time, I had a lot of stressors going on.
I was stressed about the pandemic. I was stressed about a Kubernetes
migration that I was having to kind of close out,
which would be fast followed by a CI/CD migration, two really big projects back to back.
I was stressed about losing like 80% of my department to
attrition, and finally just normal personal life
stressors. So all of this is going on as the backdrop
to this story. Where did this all happen? A sprawling
cannabis platform that was 13 years old by the time I joined.
We connected consumers who were looking to buy cannabis with
dispensaries who were looking to sell cannabis. And as a part of this platform,
we offered many different features, such as an online marketplace with live inventory updating, a mobile app for delivery and messaging, lots of ad purchasing and promotion, as well as point-of-sale register systems: really bread-and-butter technologies that these dispensaries and consumers depended on. Our story starts with a little component
called Traefik. One day, I was investigating some issue in production, and I noticed something shocking about our Traefik logs. All of the logs were unstructured, therefore unindexed, and not queryable or visualizable.
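To show the difference, here are two invented example log lines, one unstructured blob and one structured record; the JSON field names roughly follow Traefik's access-log format, but treat the whole thing as illustrative rather than what our system actually emitted:

```yaml
# Unstructured: one opaque line of text, nothing indexed, nothing to filter on.
# 10.0.3.7 - - [12/Mar/2022:19:04:11 +0000] "GET /api/menu HTTP/1.1" 200 1532 "-" "Mozilla/5.0" 812ms

# Structured: every field is its own queryable dimension (JSON is valid YAML,
# so the record is shown as-is).
{"ClientHost": "10.0.3.7", "RequestMethod": "GET", "RequestPath": "/api/menu", "DownstreamStatus": 200, "RouterName": "menu@kubernetes", "request_Referer": "-", "Duration": 812000000}
```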
Basically, to me, useless. I couldn't do anything with those blobs of text. What I wanted were beautiful, structured, indexed logs that would let me group and filter by request path, remote IP, referer, router name, you name it. I wanted all of the flexibility that comes with structured logging. So what all was involved in making this change? To sum up, Traefik is essentially an air traffic controller, routing requests from the outside to the inside, or from one part of your system to another.
And it was one of two proxies doing this
type of work. So here is an approximation of my pre-incident mental model of how our system worked. A client would make a request like, hey, is there any Turpee Slurpee flower near me? That would hit our CDN, which would bop over to our load balancer, pass that information to Traefik, which would then give it to a Kubernetes service and ultimately land in a container in a
pod. All along the way, at every hop we were emitting
telemetry like metrics, logs and traces that were shipped
to one observability vendor and one on
prem monitoring stack. So these are the steps I followed to make my change happen.
I needed to update configuration that was stored in a Helm chart. We did not use Helm itself, but we did use Helm's templating. We would take those Helm charts, render them into raw Kubernetes YAML manifests, and pass those to an Argo CD app of apps: a first layer of Argo CD Applications, which would then itself point to an individual Argo Application, which would then get synced by Argo and rolled out to our clusters.
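If you have not seen the app of apps pattern before, the parent Application mostly just manages a set of child Application resources, each of which looks roughly like this (a minimal sketch; the names, repo URL, path, and namespaces are hypothetical, not ours):

```yaml
# Hypothetical child Application managed by the parent "app of apps".
# Argo CD's reconciliation loop keeps the cluster in sync with whatever
# rendered manifests live at this path in git.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: traefik                       # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/rendered-manifests.git   # hypothetical repo
    targetRevision: main
    path: production/traefik          # rendered Kubernetes YAML for this environment
  destination:
    server: https://kubernetes.default.svc
    namespace: edge                   # hypothetical namespace
  syncPolicy:
    automated:
      prune: true                     # delete resources that disappear from git
      selfHeal: true
```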
But hey, this was a one line configuration
change. I'd been making these types of changes for months.
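For a sense of scale, the diff was on the order of a single values line, something like this (the key names are hypothetical, shown in a Traefik v2 style; our chart's actual keys differed):

```yaml
# Hypothetical environment values file, e.g. values-production.yaml
traefik:
  accessLog:
    format: json   # the one-line change: switch access logs from plain text to JSON
```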
What could go wrong? Let's find out and embark
on deploying this change, shall we? I started
with a foolproof plan. Whenever I'm a little bit nervous about
a change, I really like to break down what my plan is
for getting from where I am to production. And in this case,
I started with extensive local testing. I wanted
to make sure that our deployment process was up to date, that these changes would
get picked up. I then wanted to announce these changes to the
developers and my other teammates when they hit each environment, acceptance and then staging. I planned,
of course, to just let it bake, give it time,
and finally, I would schedule and find a quiet time to try it out, tomorrow, in production. I made sure to request a PR review from the most tenured person on my team with
the most exposure to the system and how it had gotten built up over time.
This helped me feel a lot more confident that I wasn't going to make some
radical change by accident. So after it passed review and all of the PR checks, it was off to our first environment. This change landed first in acceptance,
which was sort of a free for all. It was the first place all these
changes would land to see if they would play nicely together.
Unfortunately, this environment was a little bit undermonitored
and relied on what we call the scream test,
where unless somebody complains or actively
screams that your change has caused an issue,
you consider things good to go and operationally fine.
So after letting it bake in acceptance for a little bit,
I decided I was brave enough to push this to staging.
So I deployed to staging and decided to let it bake a little bit longer, overnight. One day later, it was time to take this change to production.
And at this point, I had a little bit of what we now know is false confidence: it had passed through two environments, it had passed a human PR review and automated PR checks, and again, it was a one-line config change. So at this point, I was
feeling so confident that as soon as the deploy status turned green and said
it was successful, I went back to the circus act that
was juggling all of those migrations without a manager and
just trying to get my day job done. Which brings us
to the incident, because several
minutes later, I noticed the incident channel in Slack lighting up. And from a quick skim, it didn't seem like
this was just a small, isolated issue.
In fact, we had teammates from across the organization,
different components and layers of the stack reporting in
that they'd been alerted, and things were wrong.
So impact was broad and swift.
I kind of sat in the back, panicked, thinking to myself,
was it my change?
No. How? There's no way it was my change.
We had all those environments, it would have come up before then. We've tested
this. It's not my change. And having
convinced myself of that, I muted the incident channel because
I was not primary on call. It was not my responsibility to go
in and investigate this. And I had a bit of a habit of
trying to get involved in all the incidents that I could because I just
find the way that systems break fascinating. And with my
knowledge of monitoring and observability, I can sometimes help surface telemetry
or queries that speed along the investigation. But I
was on my best behavior, and I muted the channel and went back to work
until I got a Slack DM from my primary on-call, who, incidentally, had reviewed the PR, that said: hey, I think it was your Traefik change, and I'm going to need you to roll that back. And again, my brain just exploded with questions. How? Why? What is going wrong? What is different about production from every other environment where this change went out fine? But this was an incident. I didn't need to necessarily know or believe
100% that my change was causing the issues.
What needed to happen was very quick mitigation of the impact that was preventing our customers and their customers from being able to use our products. And so even though I was unsure that it was my change, I immediately went for that revert button and rolled back my changes. After that, because now I was a part of the incident response, I hopped on the video call and just said, hey, I think it's possible that it was my Traefik change. I have no idea why,
but I've gone ahead and rolled back. Let's continue to monitor
and see if there's a change. And very quickly,
all of the graphs were trending in the right direction. Requests were flowing
through our system just like normal, and peace had been
restored. Interestingly, during and after
this incident, I received multiple DMs from engineers
commending me on being brave enough to own
being a part of the problem and kind of broadcast
that I was rolling things back and it was probably my
change and really just kind of owning my
part of the problem. And that got me thinking that
we perhaps had some cultural issues with on call
and production operations. So I filed that one away
for the future. And even later that day, when I was telling a friend about how I finally had the day that I took prod down, they replied: #hugops! Oh my God, that must have been so stressful, that you were the reason that things broke. But I
didn't really see it the same way. I actually didn't
blame myself at all. I think I took all of the
precautions that I could have. I was very intentional and
did everything in my power to make sure that this change would be safe
before rolling it out to production. And I didn't see what the difference would have been between me making this change and someone else trying to make it. I think we would have ended up with the
same result. So I didn't blame myself at
all. And I credit this to the Learning from Incidents community and just generally seasoned SRE folks who've seen it all. So, thank you. And I went to bed. I woke up the next day, and I could
sense that all eyes were on me and
this incident. Engineers from all up and down the stack
were also asking what happened, how,
why? And I realized this was
no incident. No. This was actually a
gift that I was giving to the organization. This was an opportunity to learn, across the org, more about how our system works, and we were going to cherish this gift. I became the Regina George of incidents and said, get in, betch, we are learning. And I wanted to bring everyone along. I was really determined to capitalize on this organization-wide attention on how our infrastructure works.
I had a little bit of work to do because I myself was still
mystified as to how and why this could have happened,
and it was time to start gathering information for the incident review.
It boiled down essentially into a hard-mode version of spot the differences between these two pictures. After bopping around a few of our observability tools, I realized the quickest way to figure out what went wrong was to render the Helm charts into raw Kubernetes manifests and diff those. Facepalmingly simple, but in the heat of the moment it was not something that immediately sprang to mind, not when everyone was saying that everything was down. And let's remember my mental model of the system:
CDN, load balancer, Traefik, service, pod. I really didn't understand why changing from unstructured logs to structured logs would affect the path that a request takes.
And it turns out when I played spot the difference,
there was a key component missing from only the production environment. How was it that I missed an entire component getting deleted in my PR? Well, we were all in on GitOps, using repositories as the source of truth and leveraging Argo CD's reconciliation loop to apply changes from the repository into our clusters.
We used an Argo CD app of apps
to bootstrap clusters. How did Argo get Kubernetes manifests to even deploy? Helm. In our case, we would take the service, secret, and deployment templates, with whatever values YAML was the base plus whatever values YAML was for that environment, interpolate and bungee those all together, and spit out the raw Kubernetes YAML, which we passed over to Argo.
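That layering step is worth pausing on. Here is a hedged sketch, with hypothetical file names and keys, of how the values were supposed to combine:

```yaml
# values.yaml (base, lowest precedence): hypothetical keys
traefik:
  accessLog:
    format: common                    # plain-text access logs by default
  extraContainers:                    # hypothetical key holding the second container
    - name: auth-sidecar              # the security gatekeeper sidecar
      image: example/auth-sidecar:1.2.3

# values-production.yaml (environment overlay; later files win on conflicting keys)
traefik:
  accessLog:
    format: json                      # production-only override

# Effective values after the "bungee together" step, when the merge goes the
# way you expect (maps merged key by key):
traefik:
  accessLog:
    format: json
  extraContainers:
    - name: auth-sidecar
      image: example/auth-sidecar:1.2.3
```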
So what was missing? What I discovered was that
there was a second container in the pod where the Traefik container was. It was a security
sidecar that acted essentially as a gatekeeper,
vetting and letting requests into our systems.
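In the rendered production manifests, the pod template should therefore have contained two containers, something like this sketch (container names and images are hypothetical stand-ins):

```yaml
# Excerpt of a rendered Deployment's pod template (hypothetical names and images).
spec:
  template:
    spec:
      containers:
        - name: traefik                    # the proxy handling every request
          image: traefik:v2.6
        - name: auth-sidecar               # the security gatekeeper sidecar
          image: example/auth-sidecar:1.2.3
# After my change, the production render contained only the traefik container;
# the auth-sidecar block was simply gone.
```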
So diffing the Kubernetes manifests got me to what happened, but not really
to how. And lo and behold,
I noticed that where I entered my one line to do JSON-formatted logs, in a specific values YAML in the layers of the hierarchy, I had unknowingly overwritten the block for that second container definition in the pod, aka that security sidecar.
So I accidentally deleted it
and it caused a lot of havoc, but I had no warnings along the
way that this very critical component had disappeared.
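I can't reproduce our exact chart here, but this sketch, with entirely hypothetical keys and names, shows the general shape of that kind of failure: depending on how the layers get combined (shallow, last-file-wins merges and Helm's wholesale replacement of list values are the usual culprits), a change that looks like one line can quietly drop a sibling block from the render:

```yaml
# values.yaml (base): hypothetical keys
traefik:
  extraContainers:
    - name: auth-sidecar              # the security gatekeeper sidecar
      image: example/auth-sidecar:1.2.3

# values-production.yaml: where my "one line" actually landed
traefik:
  accessLog:
    format: json                      # the intended change
# If this whole traefik block is taken as the production value for the key
# (a shallow merge), or if the line is added inside a redefined list, the base's
# extraContainers entry never makes it into the rendered production manifests,
# and nothing in the render step warns you that it is gone.
```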
Let's talk about learnings, because during the course of this investigation,
I personally learned a lot about how my mental model
of the system didn't reflect reality, what was
actually happening when we merged PRs,
and what to look out for next time I made a config change.
But I also had a lot of curious engineers,
managers and leaders wondering what happened,
because anytime there's a sitewide outage, you get a lot of
attention. So here's how I shared my learnings. The first place for
sharing learnings, of course, is the incident retro, and I
made sure that my document had a clear timeline of events,
talked about in detail each step of translating from Helm template to Kubernetes manifest to Argo Application to app of apps, the whole process from start to end.
Because I wanted anybody, not just the people on
the SRE or the infrastructure team, I wanted anybody in the organization
to be able to understand how this happened. After that,
I took this little incident retro document, along with a life of a request diagram, first to chapter backend, which was a community group for sharing learnings and announcements across all the backend engineers. And then I also took it to chapter frontend to share all of this knowledge with our frontend engineers. There was really no shame in my learning
game. I was taking this presentation anywhere people would have
me. And finally I took it to SRE study time,
which was the dedicated space for learning, for my team to
really dig into the details. The thing that
came in the most handy, though, was a metaphor, because I was
talking to leaders. I was talking to engineers who maybe hadn't been exposed to Kubernetes. So I needed a
really handy metaphor to explain the impact. I said my PR essentially resulted in me taking out the front door of
our house and replacing it with a brick wall.
Nobody could go inside, but the people who were already inside could talk
to each other. This was a really simple analogy that answered
the immediate question of what happened and what was the impact,
which allowed the attendees to focus on the deeper details
of how and why. Something I kept in
mind when explaining this to teams outside of my own was
our information and knowledge silos, because we were a Ruby, Node, and Elixir shop; we weren't a Go shop. We didn't have everybody trained up on Kubernetes. As you saw from the first few slides, we actually had just migrated to Kubernetes, so a lot of the infrastructure was mysterious to our developers. So I made sure, specifically for chapter frontend and backend, to call out the idiosyncrasies of Go templating and its errors, what the order of precedence is for values YAML files in Helm, and to review the app of apps pattern with Argo and explain what actually happened when one of them merged a PR. This went a really long way toward building a shared foundation of understanding.
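And to give a flavor of the Go templating piece for anyone who has not touched a Helm chart, a template interpolates values roughly like this; the snippet is a generic illustration, not an excerpt of our actual chart:

```yaml
# templates/deployment.yaml excerpt (generic illustration).
# If .Values.traefik.extraContainers is unset, the `with` block renders to
# nothing at all, with no error and no warning, which is exactly the kind of
# quiet behavior that makes a misplaced values key so hard to spot.
      containers:
        - name: traefik
          image: "traefik:{{ .Values.traefik.version | default "v2.6" }}"
        {{- with .Values.traefik.extraContainers }}
        {{- toYaml . | nindent 8 }}
        {{- end }}
```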
The most popular part of my presentations, by far, was the life of a request diagram. It broke down the end-to-end path that a request would take, from the client all the way down to the application running in a pod, and this was the first time that some of these engineers had even
seen this fuller picture of what was going on in
our system. So it felt really good to be able to share this
knowledge. Ultimately, I kind of reflected on the central question
I had been asking myself. What was different
between the production environment, where my change blew everything
up, and staging or acceptance or local,
where my changes seemed to test just fine? It felt like I
was playing spot the difference in extreme mode, which I've recreated for
you here. Tell me which one of these is a crow and which one is a raven. Unless you're a birder, it's pretty hard to say. Sometimes working in these complex systems can feel like playing a very risky game of Jenga. I kept thinking
that on their own, YAML, Go templates, Kubernetes, Application definitions, and even the Argo app of apps pattern aren't terribly complex or confusing concepts to understand, but it's the way that they're all puzzle-pieced together into a system, stacked and pointed at each other and layered, that made this change in this complex system really difficult
to diagnose. To sum it up,
I learned the hard way that there literally is
no place like production. And it's not just that I test in prod; we all test in prod. Thanks so much for listening. I'm Paige Cruz with Chronosphere.
Catch up with me, @paigerduty, on Mastodon or LinkedIn, or email me; I would love to chat about the time you took production down.