Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome. Today's talk is titled Pushing Code: Don't Forget to Flag Your Canaries. Thanks for joining in. In this talk we'll cover a variety of topics. We'll go into reliability and how it's most importantly about the user, why we do canary releases in the first place, and when it's safe to expand those canary releases. We'll also dive into feature flagging various features, understand how our users rely on our services, and identify the ideal groups in your user base to target. And at the end we'll have a quick question and answer session. Let's get started
with reliability. Reliability is feature number one.
Without reliability, nothing else matters.
It doesn't matter what shiny new feature you put into your
application. If your application is not reliable, the customer won't
care. Now, reliability might mean a few different things to your team, based on where you are in the maturity of your application or service, and also the scale and growth of user adoption. Let's say you're a startup. You're just worried about, let's get the features out, let's not focus too much on reliability or the other bells and whistles, because we want to get the feature out, because if we don't get it out, we'll get beaten by competitors. So you're really focused on delivering features as fast as possible and delivering value to your customers as fast as possible. And then at some point, as you get more customers, reliability starts becoming more and more important, because now you've got customers paying for your product, and maybe they're complaining something's not working, and that is the reliability of your application. Your reliability is suffering because you moved fast, which you had to do, but now you're at a point where you need to address those concerns so your customers don't leave. And as you grow into a bigger company, that reliability becomes more and more important. You see a lot of companies become very stagnant sometimes as far as releasing products, because they've built such a massive platform that their customers are relying on that it's really hard for the company to release new versions or new features without going through a whole suite of testing, and that takes a long time. And that's the other extreme, where reliability takes over and innovation drops off. The State of DevOps report, sponsored by Google, states
that reliability is a combination of measures, which include availability, one of the most important. Availability can be tracked with the four golden signals: latency, traffic, saturation, and errors.
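To make that a bit more concrete, here is a minimal sketch in Python of computing those four signals from a window of request samples. The field names, the p99 choice, and using CPU utilization as a proxy for saturation are assumptions made for the example, not something from the talk.

    from dataclasses import dataclass

    @dataclass
    class RequestSample:
        latency_ms: float   # how long the request took
        is_error: bool      # whether the request failed

    def golden_signals(samples, window_seconds, cpu_utilization):
        """Summarize the four golden signals over one observation window."""
        traffic = len(samples) / window_seconds                               # traffic: requests per second
        error_rate = sum(s.is_error for s in samples) / max(len(samples), 1)  # errors
        latencies = sorted(s.latency_ms for s in samples)
        p99 = latencies[int(0.99 * (len(latencies) - 1))] if latencies else 0.0  # latency
        return {
            "latency_p99_ms": p99,
            "traffic_rps": traffic,
            "error_rate": error_rate,
            "saturation": cpu_utilization,   # saturation, using CPU utilization as a proxy
        }

    print(golden_signals([RequestSample(120.0, False), RequestSample(950.0, True)], 60, 0.4))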
If your users can't easily access the service or
if it's unbearably slow, or if there are errors popping up everywhere, then it's not reliable. We've all been there. We've all used a SaaS product or some other product where you're sitting there waiting for the loading spinner to end and it just keeps going in an infinite loop. Or you click a tab and you see a loading skeleton, which at first glance is like, oh, that looks a little better. That's why loading skeletons were invented, or implemented, because they give the user a perception of speed. But in fact you're still waiting. So we've all been there, we've all seen it, we know exactly how annoyed we get. And at that point all the bells and whistles, all the nice shiny new features, don't really matter, because we're so annoyed at the one feature we wanted to use at that time and it's not working. Our internal SRE expert, Kurt Anderson,
who spent many years as part of LinkedIn's SRE team, says reliability is a team sport. It's a constant balancing act between pushing new code to production and monitoring your services to make sure they perform as expected, within the thresholds you have set for resources, and are just overall healthy.
The practice of SRE is about bridging the gap between
the dev side of things and the operational side of things,
so that innovation can keep going
at a fast pace while also maintaining the reliability of
your application. Oftentimes it feels like innovation
goes super fast and then reliability falls off, and then other times reliability takes over and is the number one priority and innovation drops off. And that's what I talked about: different stages of companies tend to have different extremes sometimes, but really the perfect experience is having a balance between them, where innovation is fast but reliability is also high. That's the perfect balance. That's the perfect experience for your customer and for your development teams and your operational teams. SRE lives between dev and ops, and our job is to smooth it out as much as possible. We want to connect dev and ops so that they can keep going at an equal pace without causing too much friction.
And we have to think about, as an engineering team, what is planned work and what's unplanned work. When something goes wrong in production, yes, you released something really fast, but something went wrong in production, and that causes unplanned work for your teams as well. So it's not just impacting the customers, it's also impacting internal engineering teams. In our topic today, we'll go over how you can continue to release and innovate while meeting your customers' expectations of reliability and their needs for new features at the same time. To do that, you have to adopt an iterative release process using canaries and feature flags.
Before we do that though, let's take a simple example.
Now, I've got an ecommerce application where a customer can log in, search, and buy things. And I've got a simple user journey here. The user is logging in. How likely the login is to succeed is extremely important to the success of your application. It does pertain to reliability. If the user can't even get through the front door, what use is your product? What use are the shiny new features? They're of no use at all.
Now, let's say the customer was able to log in and now they're searching for
a product. How fast those search results come back is extremely
important. Obviously not as important as getting into the
front door in the first place, but fairly important
because it annoys the user when something is slow. We talked
about this just a couple of slides ago. We've been there.
You're using a product even if you're not buying something. You're trying to use a
product and it's slow. It annoys you, right? It gives you a bad experience
with that product and you'll hold that negative experience until it
gets resolved somehow. Now, the customer found the product they're looking for. They want to add it to their cart.
If it gets successfully added to the cart, awesome.
If it doesn't or it takes a long time to do so,
that's a problem because the customer
wasn't able to do what they came to your application to
do. And if they're paying you money, maybe a membership fee
or monthly fee, whatever they're paying
you, they're paying you to do what they came here to do.
And if they can't do it, they're not going to be happy.
All these three things put together,
obviously this is a simplified example, but these
three things put together encompass how happy a
customer is going to be with your product.
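Written down as data, purely to illustrate this simplified journey, those three steps and what you would watch for each might look something like this in Python; the targets are made-up numbers, not figures from the talk.

    # Hypothetical mapping of journey steps to the signal that matters most for each.
    user_journey = [
        {"step": "login",       "indicator": "login success rate",  "target": "99.9% of attempts succeed"},
        {"step": "search",      "indicator": "search latency",      "target": "p95 under 300 ms"},
        {"step": "add to cart", "indicator": "add-to-cart success", "target": "99.5% of attempts succeed"},
    ]

    for step in user_journey:
        print(f"{step['step']}: watch {step['indicator']} (target {step['target']})")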
Now, in this example, we talked about latency as one of the big metrics, and that would be the SLI, the service level indicator, that we want to use if we were measuring this. And the service level objective could be: I want this latency to be below a certain threshold over 30 seconds. The SLI would be how you measure, and the SLO would be what you want to measure and optimize, because you're probably always going to get the occasional response or two that is slow. But what really matters to the overall experience of your customer is how slow it is over a period of time. If it's just slow for one call and then it gets super, super fast, the customer probably won't be affected too much and won't mind too much. But if it's consistently slow, then we have a problem. And you can measure that using SLOs and SLIs. Blameless plug.
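As a rough sketch of what that measurement could look like, here is a bit of Python that checks a latency SLI against an SLO over a window of requests. The 300 millisecond threshold and the 99% target are assumptions invented for the example.

    def latency_sli(latencies_ms, threshold_ms=300.0):
        """SLI: the fraction of requests in the window that were faster than the threshold."""
        if not latencies_ms:
            return 1.0
        good = sum(1 for latency in latencies_ms if latency < threshold_ms)
        return good / len(latencies_ms)

    # SLO: over the measurement window, at least 99% of requests should be under the threshold.
    window = [120.0, 180.0, 250.0, 900.0, 140.0]   # made-up latencies from one window
    sli = latency_sli(window)
    print(f"{sli:.1%} of requests under threshold; SLO met: {sli >= 0.99}")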
Now, let's talk about canarying. What is canarying? Well,
the term canary comes from an old mining tactic
where miners would release canaries into tunnels to detect toxic gases.
If the canary survived, that means the tunnel was safe. If they
didn't, well, not so safe. How do we relate that
to software development? In software development, canary means releasing to
a set of users over time until you've rolled out a feature
to all customers. Why do we do this? Well, one,
there's less impact to reliability. As I said, you're releasing to a subset of users iteratively, so you could roll it out to one set of customers, monitor, and make sure things are looking okay in your metrics, tools, et cetera.
And if they don't, you could stop the rollout or even roll
back the rollout without impacting all of your customers, right?
So therefore, you're more reliable to all the other
customers. And two, continuous feedback from
various users that you're rolling out to. If you roll out to the first set of customers and they tell you, this sucks, they can't even use it, it's not even the workflow that they want, maybe you pause, go back, revisit the design, and don't roll out the feature to everybody else, because you never know. And then three, smoother operations. You're releasing to fewer users, therefore the surface area for incidents has decreased dramatically. And if you don't encounter incidents with the set of users you released to, then you can roll out to another set of users, keep doing the same thing, monitor, and then roll out to the next set of customers, hopefully reducing incidents as much as possible and causing less strain on internal engineering teams.
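Sketched in Python, and assuming a hypothetical is_healthy() check and a flag service that supports percentage targeting (neither is from the talk), the overall loop looks roughly like this.

    import time

    ROLLOUT_STAGES = [5, 25, 50, 100]   # percent of users, expanded one stage at a time

    def is_healthy() -> bool:
        # Placeholder: in practice, query your metrics tooling for error rate,
        # latency, and saturation against the thresholds you've set.
        return True

    def set_rollout_percentage(feature: str, percent: int) -> None:
        # Placeholder: tell your feature flag service what share of users sees the feature.
        print(f"{feature}: enabled for {percent}% of users")

    def canary_rollout(feature: str) -> bool:
        for percent in ROLLOUT_STAGES:
            set_rollout_percentage(feature, percent)
            time.sleep(3600)                          # let the canary bake, then check metrics
            if not is_healthy():
                set_rollout_percentage(feature, 0)    # roll back: turn the feature off again
                return False                          # stop the rollout and investigate
        return True                                   # the feature is now out to all customers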
Let's talk about software development lifecycle for a bit. You might
think, okay, well, you've got these iterative releases,
canary releases. How does that affect my lifecycle?
Well, the short answer is it doesn't. You're still going to be building
your application, testing it,
deploying it, monitoring, and then
if you find any issues, you'll be fixing them, reviewing the fix,
learning from it, and then repeating the process over
and over again. The difference here, and I've got sort of two different use cases being demonstrated here, the difference with canary releases is that the deploy step is really broken up into multiple steps. You're releasing to the first set of customers, doing the monitor, fix, review, learn, build cycle again after you release to that first set of customers, and then the same feature rolls out to the next set of customers, and then you repeat the process. So really you take every feature and repeat this process multiple times for every set of customers. So it's the same process, you're just doing it in a more iterative way over and over again. And the fact that you're breaking it out, the fact that you're releasing it to a smaller set of users, means that your monitoring and fixes are less urgent, because you can turn off the feature too. You've got feature flags, so you could turn off the feature as well if it introduced a major issue. And then the other part I want to showcase here is feature A and feature B. So not only are you releasing to subsets of users over time for feature A, you're also going to offset feature B, release it, and then do the same process for feature B independently. How does
that work in a timeline or chronological fashion? Well,
here is a good example. This is the same chart, essentially the same thing as the last one, but laid out chronologically and, I guess, more visually easy to understand. So here we've got feature A and feature B, just like the last diagram. And then let's just say one, two, three, four, five, six: those numbers represent sprints, the y axis here represents customers, and the x axis is chronological, showing the features over time.
So let's say feature A is done and we're ready to roll it out to the first set of customers. We roll it out to 25% of customers in sprint one, and we do that whole process: monitor, fix any issues, get feedback. And once we're ready and we feel comfortable, in sprint two maybe we'll roll it out to 25% more customers. And then in sprint three, we'll roll it out to 25% more customers. And maybe sprint three is the time when we're about to reach general availability, so maybe it's time to button up our documentation and make sure it's all up to date. And then in sprint four, we'll roll it out to all the customers. Now, going back to the start of sprint three, you can see feature B was done and feature B was also ready to release. So for feature B we repeated the same process, but feature B in sprint three is only released to 25% of customers, whereas feature A is released to 75% of customers. And hopefully feature A and feature B are decoupled enough that if something went wrong with only feature A in sprint one, two, or three, you could pause feature A and continue releasing feature B, because they're feature toggled, and continue the canary approach with feature B while feature A is sort of paused and you're working on fixing the issues. This allows you to unblock your pipeline.
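Written out as data, purely to mirror the schedule described above (the percentages and sprint numbers follow the example; everything else is an assumption), the staggered plan and the idea of pausing one feature while the other keeps ramping might look like this.

    # Percentage of customers each feature is enabled for, per sprint.
    rollout_plan = {
        "feature_a": {1: 25, 2: 50, 3: 75, 4: 100},   # started first, ramps a sprint ahead
        "feature_b": {3: 25, 4: 50, 5: 75, 6: 100},   # started later, same ramp
    }

    def pause_feature(plan, feature, from_sprint):
        """Freeze one feature's rollout at its last healthy percentage, leaving the other alone."""
        last_good = plan[feature].get(from_sprint - 1, 0)
        for sprint in plan[feature]:
            if sprint >= from_sprint:
                plan[feature][sprint] = last_good
        return plan

    # Something goes wrong with feature A in sprint 3: hold it at 50% while feature B keeps going.
    pause_feature(rollout_plan, "feature_a", from_sprint=3)
    print(rollout_plan)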
You don't have to do major git wrangling, which, let's be honest, in some cases is unavoidable. But hopefully you've decoupled the features enough that you don't have to do that git wrangling, and you can make the process easier for your engineers. Your developers don't have to rush, make sure they're reverting, make sure they're not losing code, make sure they're not having merge conflicts, all that good stuff that comes with messing around with git histories and making git merge commits and reverts. Now, feature flags. We talked about them a little bit with feature A and feature B in the last couple of slides,
but what do they do for us? They allow us
to inform timing, stability and the go forward
strategy. So you can proactively monitor
what got pushed because you can see when the feature
flag was enabled and then who it got pushed to because you can obviously specify
which customers you enabled the feature flag for.
And that's how we would do a canary: you would say feature flag A is enabled for customers A and B, and then feature flag A in two weeks gets enabled for customer C, customer D, and so on and so forth. You know exactly when you made those changes, and then you can watch for incidents before you continue to push. For example, in the diagram I've got here on the left from Sumo Logic, you can see the icon for the feature flag. You can see that around April 1 the feature flag was turned on, and you can see there's a spike in Prometheus memory usage. The pink line and the blue line both spiked, and the orange line sort of spiked there. What happened? Well, if we didn't have feature flags, we may not necessarily know when something was rolled out. You could obviously go to your releases and correlate that, but it's so much easier when you have something that you can explicitly turn on and, more importantly, explicitly turn off if you see issues. So if we saw this spike, we could potentially turn it off. So maybe we turned off a feature flag, and then it goes back down. And then once everything is stable and you've fixed any issues you've identified, you roll out to other users.
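As a sketch of the mechanics, assuming a simple hypothetical in-house flag store rather than any particular feature flag product, a per-customer check plus an explicit kill switch might look like this.

    # Hypothetical flag store: whether each flag is on, and which customers are targeted.
    feature_flags = {
        "new_checkout": {"enabled": True, "customers": {"customer_a", "customer_b"}},
    }

    def is_enabled(flag: str, customer: str) -> bool:
        """A feature is on for a customer only if the flag is on and that customer is targeted."""
        config = feature_flags.get(flag, {})
        return config.get("enabled", False) and customer in config.get("customers", set())

    def kill_switch(flag: str) -> None:
        """Explicitly turn the feature off for everyone, e.g. after a spike in your metrics."""
        feature_flags[flag]["enabled"] = False

    if is_enabled("new_checkout", "customer_a"):
        pass   # serve the new code path; otherwise fall through to the old one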
And in all of this, also test your rollback strategy, because in the worst case scenario you have to roll back the code, so make sure you have a strategy in place if you do need to do that. Let's take a simple case study regarding a project that we internally canaried, called Comsflow.
Comsflow is a communications flow tool that allows users to customize message templates and messages, and who they want to send those messages to, based on certain events. In the Blameless platform, for example, you could have incident status changes trigger an event. Incident severity changes or retrospective state changes trigger a Blameless event, which would trigger a message, with variables in it, to SMS, email, Slack, Microsoft Teams, or status page recipients. And those recipients could be internal or external. You could directly post to a customer status page you've got. You could directly post
to a public page if you wanted to.
Now, in this product, you can see
there are different areas that could be feature flagged individually. For example, we could take incident status, incident severity, and post mortem state. Those could all be separate features. Maybe we only release incident status, and that was the only thing that was complete at the time, so maybe we release that first behind a feature flag. And then maybe Blameless events and reminders are separated as well. Maybe we only want to do events at the moment; we don't want to enable reminders, or they're not done yet, the development isn't complete, so we'll just keep that feature flag off. And then in the message template we've got those variables that I mentioned. We could turn off variables because, like I said, they might not be done, or you want to minimize the release scope. And then even the recipients could be controlled with feature flags. So pretty much anything could be feature flagged here, but it really depends on your organization, your team, and how granular you want to be as far as feature flagging. Here's a rough overview and plan that we had for how we rolled out Comsflow. For example,
we had the first iteration in early December. We had a certain set of
features built for Comsflow, and at the bottom we had some customer accounts that we determined were asking for the feature and wanted to use it, et cetera. We just determined who those were and rolled out iteratively, using a canary approach, to those customers, and then we let them play with it for all of December. And then in January we came out with the next set of features, which were individually feature toggled as well. And then we decided, okay, all the customers can have this feature, because we felt comfortable and confident based on our previous releases.
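Purely as a hypothetical illustration of that kind of plan (the flag names, account names, and groupings are invented, not the real Comsflow configuration), the iterations could be written down like this.

    # Hypothetical rollout plan: each iteration ships individually toggled features
    # to a named set of pilot accounts before going to everyone.
    comsflow_rollout = [
        {
            "iteration": "early December",
            "flags": ["comsflow_incident_status_trigger", "comsflow_slack_recipients"],
            "accounts": ["pilot_account_1", "pilot_account_2"],   # accounts that asked for it
        },
        {
            "iteration": "January",
            "flags": ["comsflow_severity_trigger", "comsflow_template_variables"],
            "accounts": ["pilot_account_1", "pilot_account_2", "pilot_account_3"],
        },
        {
            "iteration": "general availability",
            "flags": ["all of the above"],
            "accounts": ["all customers"],
        },
    ]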
What did we learn in this process? Well, a couple of things. Production readiness: as we rolled out to customers iteratively, in a subset manner using a canary approach, we were able to get feedback from those customers. We were able to monitor our metrics and make sure nothing was breaking. So that gives us more confidence as far as how ready this application is for production. Obviously we do production readiness testing before we release to any customer as well, but just getting that feedback, getting that visibility into customers using it, with our logs looking okay and our services healthy, gives us even more confidence that yes, when we roll this out, it will work and it won't break and cause headaches. And then the other thing is a much more agile roadmap. We were able to break up all the features into different features, so that people could keep developing without worrying about overriding somebody's change or merging something that couldn't go out to production, because we had feature flags in place. And a couple of other things: ownership and code cleanup.
Well, ownership, what does that mean? Well, we had a lot of features and created a bunch of feature flags. One thing, in some cases we dropped the ball as far as who was the owner, who was responsible for enabling the feature flag. Is it the product manager, the engineering manager, or the engineer who's responsible for enabling the feature flag? That wasn't super clear for some of the features that we rolled out. And so we had features that we rolled out at one point that weren't enabled for customers, and we thought they were released, and then we had to go back and enable them. So, lesson learned. We obviously learned from that, we're working towards making it better, and we have a process for it now. But keep that in mind as you're adding feature flags throughout your code base. And since
you're adding feature flags throughout your code base, code cleanup becomes
an issue as well. Let's say you've gone through the canary process. You've enabled the
features for x number of customers over time.
Now you've got to go back and clean up that code
because otherwise you're going to have really bad code where you
have a bunch of if statements checking for certain features, et cetera, and those
aren't really useful anymore because you've already enabled it for everybody. So you want to
go back and take those out. That's an important aspect from a development experience perspective.
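As a tiny, hypothetical before-and-after of that cleanup (the function names are invented for the example): once a flag is on for everybody, the if statement and the dead branch can simply be deleted.

    def is_enabled(flag: str, customer: str) -> bool:
        return True   # placeholder: at this point the flag is on for every customer

    def render_old_checkout(customer: str) -> str:
        return f"old checkout for {customer}"

    def render_new_checkout(customer: str) -> str:
        return f"new checkout for {customer}"

    # Before cleanup: the flag check guards the new code path.
    def render_checkout(customer: str) -> str:
        if is_enabled("new_checkout", customer):
            return render_new_checkout(customer)
        return render_old_checkout(customer)

    # After general availability: remove the stale check and the unused branch.
    def render_checkout_after_cleanup(customer: str) -> str:
        return render_new_checkout(customer)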
Now, let's talk about how we went about building the right canary user groups. The first user group, and in my opinion the most important one to release a feature to, is the users who are already using the specific product area where you're adding the feature. You don't want to impact their
workflow. If you're releasing a feature and you're impacting their workflow,
you want them to test it out. First of all, let them know: hey, we're releasing this new feature, we'd kind of like you to take a look at it and let us know what you think. And then, maybe they've asked for this specific feature in the first place, so you obviously want their feedback right off the bat, without releasing to everybody else. And the great thing is you can tell customers, look, we're releasing this feature, and they'll trust you, because they'll see that you have a plan, that you're not just rolling things out willy-nilly, and that you're actually innovating. Then there's another set of
users that are less active, right? They're not currently using the product area. They didn't
really ask for it. Do they even care about the feature? Are these the right
group for canary testing? Probably not.
And there's a vocal set of users who will give you specific feedback even
if you don't ask for it. Sometimes it's valuable to get that opinion,
but you also don't want to have a bunch of noise as far as feedback
is concerned. You want to make sure you target the users that are the super users of your product, or of that feature specifically, because you want to know, when customers are actually using it, how they use it and how it impacts their workflow. Does it improve something for them or did it make things worse? You want to know from the power users; you don't necessarily want to know from someone who likes giving feedback, wants to give feedback, and doesn't really use the feature heavily, though we may even be some of those vocal users ourselves. But as far as canary testing, the best group to target is the ones that are already using your product area or have specifically asked for a feature set. The Harness blog had a quote that says a canary deployment is the lowest risk-prone compared to all other deployment strategies because of the level of control. And I would agree: you're releasing to a small
set of users, not everybody. And so you control who's
getting the experience that you've decided. And you also control when
you can roll it back as well, right? If something goes wrong, you can
roll it back. If things look good, then you roll out to the next set of customers. So you have that control within your hands, and feature flags just make that even easier by letting you turn a feature off really easily.
And that's all for today. Please let me know if you have any questions.
Please feel free to send feedback to my email hamad@blameless.com and
I'll be available for any questions. Thank you.