Transcript
This transcript was autogenerated. To make changes, submit a PR.
Get real-time feedback into the behavior of your distributed systems. Observing changes, exceptions, and errors in real time allows you to not only experiment with confidence, but respond instantly to get things working again.
Thank you for tuning in to my talk. I am going to be talking about continuous reliability and how it is that we can get there. Let's go ahead and just jump right in.
As we know, software is going to break.
The world that we're building continues relying on the stability
of this naturally brittle technology.
The challenge that we continue facing is very
much about making sure that our customers stay first.
How do we continue to innovate and deliver products and services so that our customers are happy, while minimizing that risk of failure as much as possible? But we actually have come a long way. Maybe we used to think that our technology stacks were very complex, but boy were we wrong.
Our legacy systems were much simpler than the systems that we have now, and this complexity continues increasing. We started out with just a few services on premises; maybe we had one service that was being hosted, and maybe we had one or two annual releases. We've since shifted left and rearchitected our monoliths into microservices. Now we have things hosted in the cloud where we don't even know the location of the data centers, and we have hundreds if not thousands of microservices that we have to look after. And thanks to DevOps, we collaborate a lot more, we have more frequent releases, and sometimes we even deploy on Fridays.
We have been thinking about this complexity. We have been preparing for the unexpected things that can happen to our systems. We are at a chaos engineering conference, so I am not going to cover the history of chaos engineering. It's been great to see this space continue evolving, the community getting larger and stronger, and more tools coming out to make this possible.
It's great to find tools that allow us to run simple and safe experiments without needing to do one-of-a-kind configurations. But what is it that we're missing? We have to keep reminding ourselves that as our systems get more complex, our failures also get more complex, and operating them at scale is even more of a headache. We still find ourselves doing a lot of this manual work, and that is a lot of toil for us, the site reliability engineers. We end up having to do a lot of remediation, a lot of looking for the proper dashboards and observability links, and we also spend a lot of time executing runbooks or shell scripts to make sure that our systems come back up.
We constantly find ourselves feeling like that little "this is fine" stick figure: my systems are broken, but they'll come back up. I'll have more coffee. I won't sleep. It's going to be okay. Maybe we'll actually escalate and get more help. But we are the ones that are on call; we're the ones getting paged and woken up in the middle of the night. At what point is this too much? At what point do we ask ourselves what we can do to make things better? How do we make things more automated? How do we make things more reliable?
I say that I am going to be talking about continuous reliability, but how is it that we're going to get there? Well, I believe that with these three words, we can get there. When I look back at my time as an SRE and working within SRE communities, three things always come to mind: automation, standardization, and experimentation. We learned that automation and standardization are core principles of site reliability engineering, and of course we cannot forget experimentation. From chaos engineering to feature flags to canary deployments, they've all helped us move the needle through the years. We know that automation helps our organizations and our teams not burn out and our systems be more reliable, and of course we know that defining reliability goals helps keep us online. Well, Keptn allows these things to come together under one roof. So let's go ahead and dive right in. First, I'm going to introduce myself.
My name is Ana Margarita Medina. I am a senior chaos engineer at Gremlin. I've been working here for almost four years with a focus on empowering others to learn more about chaos engineering and move their journey forward. Prior to that, I was at Uber working as an SRE, where I focused on chaos engineering and cloud infrastructure. Before that I was also a front end developer, a back end developer, and I even did some mobile applications. I've gotten a chance to take all my knowledge from those roles and use it to talk about making things more reliable. I also sit on the advisory board for the Keptn project, and it has been really cool to see this space continue growing. Representation is something that really matters to me, so shout out to all of you who are joining in from one of those groups. I was born and raised in Costa Rica, my parents are from Nicaragua, and I now reside in the San Francisco Bay Area.
So shout out to all of you. Let's go ahead and jump right back into the Keptn project. Keptn is a control plane for DevOps automation of cloud native applications. It uses a declarative approach to build scalable automation for the delivery and operations of those services, and it can scale to a large number of services. The cool thing about Keptn is that it works for cloud native applications in general and is not exclusive to Kubernetes. Keptn is also part of the CNCF, where it sits as a sandbox project. It's been great to watch this project grow, improve, and gather more adoption.
One of the awesome things about Keptn is that it gives you a lot out of the box. You're going to get observability, dashboards, and alerting that encode best practices. You also get to configure that monitoring and observability, whether you want default settings or customizable dashboards, along with extra alerting that is set up based on service level objectives for each managed service that you have within Keptn. And it allows you to bring delivery, operations, and remediation for your services under the same platform. Since I keep talking about service level objectives and service level indicators, I want to make sure to cover that terminology before we talk a little bit more about them. A service level agreement is the contract with your users, and it includes the consequences if that contract is not met. That comes down to the service level objective, which is the target value or range of values for that service level, and that in turn is measured by the service level indicator: a carefully defined quantitative measure of some aspect of the level of service that you are providing.
A perfect example: our service level indicator is the latency of a web request, which we want to be less than 500 milliseconds. The indicator is just the latency of every single request for that service. The service level objective on top of it is that 95% of those web requests have a latency of less than 500 milliseconds over a rolling month. And the service level agreement says that if 95% of web requests don't have a latency under 500 milliseconds for the month, the customer gets their money back. So there's an actual consequence for things not being done with reliability in mind. And of course we have this big idea that we care so much about reliability that we don't just want nines of reliability; we want to think we're reaching 100% of web requests. But that perfect, ideal world doesn't really hold up when we put it against real technology. The amount of dependencies that we have makes it really hard to reach five nines, four nines, even three nines of reliability. You have to do the work.
Keptn allows you to take these concepts, SLOs and SLIs, and gives you standardization. We have many tools that allow us to declare service level objectives, but we don't have them under one platform. We need a way to standardize them across tools, across the different stages that you have within your pipeline, and sometimes even just across the organization itself. Keptn allows you to do just that.
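To give a feel for what that standardized, declarative form can look like, here is a minimal sketch of the earlier latency example written as a single Keptn-style SLO objective. The indicator name and file layout follow the Keptn SLO spec as I understand it and are illustrative rather than taken from the talk:

    ---
    spec_version: "1.0"
    objectives:
      - sli: response_time_p95      # the indicator: 95th-percentile response time in ms
        pass:
          - criteria:
              - "<=500"             # the objective: stay at or below 500 milliseconds
    total_score:
      pass: "100%"                  # with a single objective, it simply has to pass

A fuller version of this file, with warnings and an overall score, shows up later in the demo.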
If you're interested in learning more specifically about the ways that SLOs can be created and used within Keptn, one of the contributors, Andreas Grabner, a great friend, has a lot of talks around this. I personally love the one he gave last year at SLOconf.
So keeping in mind that we have a declarative environment that allows us to set up service level objectives, that is one way we can think about building reliability. By bringing SLOs into all of this, we now have service level objectives that work within the pipelines. That means developers get to see how their code, their improvements, the features they're working on are actually impacting this reliability metric, and this allows a service level objective to work as a gatekeeper. They get a chance to see things gradually roll out to the dev environment and the staging environment, and they get to say, "Oh, actually, this is making requests even slower. We don't allow that to hit our customers," based on the service level agreement and the SLO we just covered. We now have SLOs as part of a platform, part of the CI/CD, and a great way that I love describing this is as test-driven operations. A lot of that SRE operations work now gets defined and done in a way where we actually have passes and failures based on the metrics we define.
With these SLOs built into pipelines, you can then think about what to do to make things better afterwards. Keptn allows you to define remediation actions: what to execute and how to reevaluate that service level objective. Since those objectives must be met in every single stage for every deployment within Keptn, Keptn runs tests to make sure the service level objective is not breached before that promotion. You get a chance to automate that delivery, and you get a chance to automate that extra step that SREs usually have to do.
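As a rough illustration, triggering one of those standalone quality-gate evaluations from the Keptn CLI looks something like the line below. The project, service, and stage names are placeholders from the sock shop example, and exact flags can vary between Keptn versions:

    # Evaluate the SLOs for the carts service in staging over the last five minutes
    keptn trigger evaluation --project=sockshop --service=carts --stage=staging --timeframe=5m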
When Dynatrace looked around this space using the 2020 DevOps report, they saw that 63% of folks were building internal delivery platforms, and they wanted to find a way to give back to the community. They then ran their own surveys and found that a lot of time was being wasted maintaining pipelines, doing manual tasks, and doing manual remediation. This can totally happen; I've seen it in multiple orgs where a lot of these things are just shell scripts folks have to run, or you have to send a message on Slack to one of your friends across the org and ask, "How do you bring a database back?"
We first start with the pipelines. We then bring in service level objectives, set up as the quality gates that Keptn allows you to define. As these service level objectives are defined within the development stage, and as they are met with no breach in reliability, the release gets promoted over to your pre-production environment, your staging environment. And as we see that those things are reliable and not harming our SLO, we then get to promote them to the production environment.
The cool thing too is that we also get to automate the operation of bringing our systems back when there is an issue that breaches that service level objective. Keptn gets to execute one of the remediation actions that you've set up, such as toggling a feature flag, and then the quality gate reevaluates that service level objective. If it passes, you have now remediated what was going on and it closes your reported issue. This takes in alerts and problems from your observability tooling and your dashboards.
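Here is a hedged sketch of what such a remediation definition can look like using Keptn's remediation spec. The problem type, action name, and feature flag are illustrative examples, not values from the talk:

    apiVersion: spec.keptn.sh/0.1.4
    kind: Remediation
    metadata:
      name: carts-remediation
    spec:
      remediations:
        - problemType: Response time degradation    # matched against the reported problem
          actionsOnOpen:
            - action: togglefeature                 # handled by a feature-flag action provider
              name: Toggle promotion feature flag
              description: Turn the EnablePromotion flag off to relieve pressure
              value:
                EnablePromotion: "off"

When the connected observability tool opens a problem that matches the problem type, Keptn runs the action and then re-runs the SLO evaluation to decide whether the issue can be closed.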
This is something that you can set up with tools such as Dynatrace
and Prometheus, of course. And I wanted to show you
a little bit of what that looks like in action.
One of my favorite things about Keptn is that it has multiple learning resources. tutorials.keptn.sh has really cool tutorials that you can run through. I'm going to follow the Keptn full tour with Dynatrace, but if you don't want to use Dynatrace, you can use Prometheus. You'll get a chance to see how you bring your own Kubernetes cluster, install Keptn, and so on. So let's see what this does. We get a chance to install Keptn by running a single command, and we get to give it the use case. Today we're going to focus on continuous delivery, since that comes with the quality gates and lets you see how things gradually roll out. We're going to onboard our own project, call it sockshop, and pass a YAML file that helps define it. As we start our project, we see that we have defined our three stages, and they are very simple: development, staging, and production.
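The YAML file being passed in here is Keptn's shipyard file. A trimmed sketch of a three-stage shipyard is below; the sequence and task names follow the Keptn shipyard spec as I understand it, and the deployment and test strategies are illustrative:

    apiVersion: spec.keptn.sh/0.2.0
    kind: Shipyard
    metadata:
      name: shipyard-sockshop
    spec:
      stages:
        - name: dev                        # direct deployments and functional tests
          sequences:
            - name: delivery
              tasks:
                - name: deployment
                  properties:
                    deploymentstrategy: direct
                - name: test
                  properties:
                    teststrategy: functional
                - name: evaluation         # the SLO-based quality gate
                - name: release
        - name: staging                    # only reached if the dev delivery finished cleanly
          sequences:
            - name: delivery
              triggeredOn:
                - event: dev.delivery.finished
              tasks:
                - name: deployment
                  properties:
                    deploymentstrategy: blue_green_service
                - name: test
                  properties:
                    teststrategy: performance
                - name: evaluation
                - name: release
        - name: production                 # final promotion target
          sequences:
            - name: delivery
              triggeredOn:
                - event: staging.delivery.finished
              tasks:
                - name: deployment
                  properties:
                    deploymentstrategy: blue_green_service
                - name: release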
We have a project; now we actually need to onboard some services. We're going to start by onboarding a carts service, passing the Helm chart for it, and we're going to onboard our carts database as well. Then we trigger the first delivery of our application: we start by triggering the delivery of the database and then the delivery of the carts application, giving the tag for the images that we are actually deploying.
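Spelled out, that walkthrough maps to a handful of Keptn CLI commands along these lines. They follow the style of the tutorials at tutorials.keptn.sh; the image names, tags, and chart paths are placeholders, and exact flags depend on the Keptn version:

    # Install Keptn with the continuous-delivery use case, which brings the quality gates
    keptn install --use-case=continuous-delivery

    # Create the project from the shipyard definition
    keptn create project sockshop --shipyard=./shipyard.yaml

    # Onboard the carts service and its database, passing their Helm charts
    keptn onboard service carts --project=sockshop --chart=./carts
    keptn onboard service carts-db --project=sockshop --chart=./carts-db

    # Trigger the first delivery: the database first, then the carts application with its image tag
    keptn trigger delivery --project=sockshop --service=carts-db --image=docker.io/mongo --tag=4.2.2
    keptn trigger delivery --project=sockshop --service=carts --image=docker.io/keptnexamples/carts --tag=0.12.1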
And things are going well. We see that the delivery went through development, staging, and production, everything is green, and the project is succeeding. We then go ahead and do another release; we're now releasing version two of our carts service. We see that production is still running version one of our application, but as we roll out version two, we see that it's starting to fail in our pre-production environment and in our development environment.
When we look over at Keptn, we can see that the delivery got started, but when it came down to the evaluation stage, it's actually having issues. Dynatrace is reporting a breach of the SLOs and is not allowing this to get promoted. This lets us take a deep dive into that evaluation. When we go see what it looks like in staging, we see that the response time at the 95th percentile is actually 1052 milliseconds. This does not meet the criteria we have for things to pass, so the result is a fail, and this is not getting promoted from staging to production. We then go ahead and release version three of this application, where we're thinking more about that response time. As we release it, we see that our dev environment has moved over to version three, our staging environment eventually got rolled out to version three, and our production has also adopted version three. This is the ideal delivery: our application has passed its service level objectives and is able to get promoted to the next stage.
If you're wondering how some of these service level objectives are defined and measured within Keptn, and how all of its magic happens, you get to see some examples of it. We have the service level indicators we create, such as the response time at the 95th percentile and at the 50th percentile, and of course we have our error rate and our throughput.
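Those indicators live in an sli.yaml that maps each name to a query in your monitoring tool. The sketch below assumes a Prometheus-backed setup with a response-time histogram; a Dynatrace-backed setup would use Dynatrace metric selectors instead, and the queries here are illustrative:

    ---
    spec_version: "1.0"
    indicators:
      # $SERVICE, $PROJECT, $STAGE, and $DURATION_SECONDS are placeholders filled in per evaluation
      response_time_p95: histogram_quantile(0.95, sum(rate(http_response_time_milliseconds_bucket{job="$SERVICE-$PROJECT-$STAGE"}[$DURATION_SECONDS])) by (le))
      response_time_p50: histogram_quantile(0.50, sum(rate(http_response_time_milliseconds_bucket{job="$SERVICE-$PROJECT-$STAGE"}[$DURATION_SECONDS])) by (le))
      error_rate: sum(rate(http_requests_total{status!~"2..",job="$SERVICE-$PROJECT-$STAGE"}[$DURATION_SECONDS])) / sum(rate(http_requests_total{job="$SERVICE-$PROJECT-$STAGE"}[$DURATION_SECONDS]))
      throughput: sum(rate(http_requests_total{job="$SERVICE-$PROJECT-$STAGE"}[$DURATION_SECONDS]))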
We get a chance to see how those indicators now have service level objectives on top of them. You get to define what the criteria needs to be for them to pass, and what it takes for them to be on warning, and then you get an overall score for that service based on the different quality gates the application has. You can have some objectives that are on warning, and the quality gate will give you that warning; others are just going to fail or keep passing. The awesome part is that Keptn allows you to define these as their own YAML files. You have your SLO YAML and your SLI YAML, where you just get to say what you want that indicator and objective to be.
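Putting that together, an slo.yaml along the following lines attaches pass and warning criteria to those indicators and rolls them up into an overall score. The thresholds and scoring percentages here are illustrative, not the demo's actual values:

    ---
    spec_version: "1.0"
    comparison:
      aggregate_function: avg
      compare_with: single_result
      include_result_with_score: pass
      number_of_comparison_results: 1
    objectives:
      - sli: response_time_p95
        pass:                        # passes at or below 500 ms
          - criteria:
              - "<=500"
        warning:                     # between 500 ms and 800 ms counts as a warning
          - criteria:
              - "<=800"
      - sli: error_rate
        pass:
          - criteria:
              - "<=1"                # assuming error_rate is expressed as a percentage
      - sli: throughput              # no criteria: reported for information, not scored
    total_score:
      pass: "90%"                    # overall score needed for the quality gate to pass
      warning: "75%"                 # below pass but above this yields a warning instead of a fail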
And now Keptn allows those YAML files to define the platform and the way it's going to work. So when we talk about continuous reliability, to me that gets created when we have service level objectives, when we have things like Keptn that come with quality gates, and when we bring in that experimentation piece, that chaos engineering. Those SLOs require us to do the work of setting up indicators, and now we get to inject chaos within the pipelines and see what those experiments do. There are multiple ways to do chaos engineering here. You can run it at every single stage and see what a chaos engineering experiment does in dev and in staging prior to releasing to production. Or you can have a dedicated chaos engineering stage: you have your development, your staging, and then a chaos engineering stage, and that is the last quality gate before you promote to production.
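One hypothetical way to express that dedicated stage is to add it to the shipyard between staging and production, with the chaos and load experiments running as the stage's test task and the usual SLO evaluation acting as the gate. The stage name, test strategy, and triggering event below are assumptions for illustration; how the experiments themselves get triggered depends on which chaos tooling you integrate:

    # Added to the shipyard's stages list, between staging and production (hypothetical)
    - name: chaos
      sequences:
        - name: delivery
          triggeredOn:
            - event: staging.delivery.finished
          tasks:
            - name: deployment
              properties:
                deploymentstrategy: blue_green_service
            - name: test
              properties:
                teststrategy: performance    # run load tests plus chaos experiments here
            - name: evaluation               # the SLO quality gate decides on promotion
            - name: release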
You can also think about doing chaos engineering alongside performance testing; this is one of the great ways to get a lot of learning out of chaos engineering. This is one of the things that we did at Uber: we did load testing and we also did chaos engineering. That's how we prepared for our Black Fridays, which for us were Halloween and New Year's. How do we make sure that we have enough bare metal racks to handle the large capacity load that we have on our peak traffic days, and how do we make sure that the 50 microservices it takes to run a trip are actually reliable on the days that really matter? We weren't just building that practice overnight; there was a lot of testing that got done.
And when we do this type of chaos engineering within the pipelines, we're asking ourselves: is this service level objective met? Yes? Cool, we're promoting that over to production. Is that service level objective not met? Then we actually identified a weakness. We get a chance to do multiple things with something like Keptn: you can have auto-remediation, in case you want to set that up, or you now do a new release and that fix actually goes through. How do we think about these experiments? We always keep the quality gates in mind. We don't even have to do the math or ask our team whether we think these chaos engineering experiment results are okay for the customer or not, because we went ahead and defined SLOs and SLIs. When the SLOs are met, we're good to go. When that SLO is not met, we're not okay shipping to our customers.
The way this all comes together, in the example of having a chaos engineering stage, is that the application rolls out to that chaos engineering stage, and that stage then triggers the chaos engineering experiment. Keptn takes all the data from the application and from the tools you have connected to it, and then it says: is this SLO a pass or a fail? Do we promote to production or not? And we get to see how a lot of that continuous learning comes about. We learn by doing, we learn by injecting failure into our systems, and of course there's the continuous aspect of it, where we keep improving and repeating as we do more releases of our application. The Keptn ecosystem continues growing.
There are a lot of tools that you can use, starting with a set for the delivery of applications. There's a lot on the testing side; when it comes to observability, you can tie in multiple tools; and for collaboration, you can send messages over to Slack or Microsoft Teams. Keptn also recently launched new integration options that give you a bit more freedom if you want to do things in a no-code way, and you can also use webhooks that allow things to be plug and play with your own internal systems or any other service that you don't see in the integrations. It's a great time to be playing around with Keptn, so make sure to join the Keptn community.
There are a lot of learning resources there for chaos engineering, SRE, and DevOps. You can head on over to keptn.sh to learn more about the project, follow them on Twitter, YouTube, and LinkedIn, and go ahead and give them a star on GitHub. Make sure to get your hands dirty: head on over to tutorials.keptn.sh and take a look. They even have a tutorial that you can do locally using k3s in just a few minutes, setting up a local Kubernetes cluster and getting a chance to play with these delivery pipelines. As we're closing out, I want to make sure I leave you with some final thoughts about the things we're building within our systems. We have to remember that we can't just build reliability overnight; it can't just be an OKR. We also have to remember that we can't buy reliability overnight. You can't bring in a new tool that promises you more nines of reliability; those don't exist. You actually have to do the work. You have to learn, you have to inject failure and learn from it, and you have to make sure you think of ways to make your team and your systems more robust and more reliable. The way to do that is by establishing processes, automating them, and continuously verifying that those processes are being run and that the proper results are being obtained. That includes things like experiments, those service level objectives, making sure that you're doing game days, that you're doing failovers, and that you're executing those runbooks so that they don't become stale on the days that really matter. Reliability is not an accident at all. You have to do the work. You have to make sure that you're thinking ahead about the unexpected things that can happen to your system and do chaos engineering around them. You also have to continue learning.
If you're interested in taking the next step in your learning journey, feel free to check out the Gremlin certification. You can head on over to gremlin.com/certification to learn all about it. There are currently two certification modules that Gremlin is providing: the practitioner level, which covers chaos engineering fundamentals, and the next level, the professional level, where you get to test your skills on advanced chaos engineering along with some of the Gremlin terminology. I hope you all get a chance to check it out. And with that, I would love to say thank you for tuning in to my talk. If you have any questions about Keptn, chaos engineering, Gremlin, SRE, or DevOps, don't be afraid to reach out. You can reach me by email at ana@gremlin.com or feel free to say hi on Twitter at @Ana_M_Medina. Gracias.