Transcript
This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud.
Hi there. I am Joshua Arvin Lat, and
I'm going to talk about pragmatic site reliability engineering for
Kubernetes in the cloud. So before we
start, I'm going to introduce myself. I am the chief
technology officer of Nuworks Interactive Labs and
I'm also an AWS machine learning hero. I'm also one
of the subject matter experts who help update the
certification exams globally.
And finally, I am the author of the book Machine
Learning with Amazon SageMaker Cookbook.
So in this book we have 80 proven recipes for data
scientists and developers to perform machine learning experiments
and deployments. So you'll see a lot of customizations there using Docker and using different techniques in order to port your machine learning model and get that to work in the cloud with SageMaker.
So let's start with the first practical
tip. When dealing with pragmatic site reliability engineering for Kubernetes, the first thing that comes to mind would be: if there's going to be an issue, how fast would we be able to detect and manage that issue? So we need to
find ways to visualize and know what's happening inside.
So if you have a lot of nodes in your Kubernetes
cluster, and there are different types of
logs, logs from your application, maybe logs from
your network, and so on, things get a bit
trickier when you have to deal with a lot of logs. Finding ways to visualize what's really happening, especially knowing the metrics, knowing what's wrong, knowing what part of the system is not working and what parts are working, is not as easy as it sounds while we are talking about it right now. So what we can do about it is
maybe we can start using additional tools on
top of the Kubernetes setup and allow different end users to access those monitoring tools.
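As a minimal sketch of giving a group of end users read-only access to the monitoring tools (names here are assumptions, not from the talk: it assumes the monitoring stack runs in a "monitoring" namespace and that your identity provider exposes a "software-engineers" group), Kubernetes RBAC can grant viewing rights without allowing changes:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: monitoring-viewer
  namespace: monitoring          # assumes the monitoring stack lives in this namespace
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "pods/log", "services", "deployments"]
    verbs: ["get", "list", "watch"]   # read-only: no create or delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: engineers-view-monitoring
  namespace: monitoring
subjects:
  - kind: Group
    name: software-engineers     # hypothetical group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: monitoring-viewer
  apiGroup: rbac.authorization.k8s.io
```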
Of course, we may be tempted to just allow the site reliability engineers to access those tools. However, the stability of
the entire system depends on both the software
engineers and the site reliability engineers. Because for one thing,
infrastructure is just one part and then the other
part would involve the application side, which the software
engineers are continuously building on top of.
So if there's an issue, let's say that you have an ecommerce website and there are three or four different parts, and there's something wrong with the checkout flow so that users can't check out, or for some reason the third page before the checkout page isn't working.
So being able to detect what's wrong is
a responsibility shared by both groups, both the software engineers
and the site reliability engineers. And finally, it's about just
making sure that people have the right access levels and we give them the appropriate charts and
tools to debug the problems. The next thing that we
need to think about is what if there are so many servers
and nodes and containers and services we need to worry about here?
So it's critical that we're able to manage the data, the log data, and connect it properly with the monitoring tools. Because if we're not able to clean things ahead of time and do it in an automated fashion,
then the software engineers and site reliability engineers
would have a hard time solving the problem.
So when dealing with production level stuff,
the number one thing that comes to mind would be time.
We can talk about concepts all day long,
but when we're dealing with reality, we have to worry
about time. So when you have an issue,
if the issue gets resolved right away, let's say automatically,
then that saves us time, because an automated process
is definitely better than a human who would
still have to process and look for the issues.
So if, let's say, there's a self-healing setup involved,
then that's much better, at least as the first layer.
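As a minimal sketch of that first self-healing layer (the service name, image, and health paths below are placeholders, not from the talk), a liveness probe lets the kubelet restart a misbehaving container automatically, with no human in the loop:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout                  # hypothetical service name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.0   # placeholder image
          ports:
            - containerPort: 8080
          livenessProbe:          # kubelet restarts the container when this keeps failing
            httpGet:
              path: /healthz      # assumed health endpoint
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          readinessProbe:         # traffic is withheld until the pod reports ready
            httpGet:
              path: /ready        # assumed readiness endpoint
              port: 8080
            periodSeconds: 5
```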
Next, if some analysis is needed and the
problem is custom, then it would involve both automated
and manual solutions for it. So let's say that
you deployed something and that caused
the system to become unstable. Then you would need
a human to solve that problem. Let's say by rolling back
or by doing some tweaks in the current setup or configuration.
Speaking of time, let's play a
quick game. So what's the game all about?
So in this game, I'm going to let everyone
count the number of apples in this slide and
I'm going to share a reward.
So for those who are going to win this game, we're going to give, I think, a 25% discount for those who will try to purchase the book, the book that I've written these past couple of months. So again, if you win the game
and you get the number of apples correctly in the next slide,
then we'll share that promo code with you so that you can use it to
buy the ebook. All right, so are you guys ready?
So I'm going to give everyone 20 seconds to
count the number of apples here. And as mentioned, for the winners, we'll give a 25% discount and we'll provide the promo code separately.
So, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, and one. So stop counting.
So, if you have guessed, let's say, 20 apples,
that's incorrect. All right, so let's
try something lower. Let's say 17. If you've guessed
17, unfortunately, that's also incorrect.
How about 24? So, for those who have tried guessing
that it's 24, you are also incorrect. So what's the correct
answer? There's no correct answer for this,
because, for one thing, it's impossible to count the number of
apples in this slide because there
may be apples underneath the initial layer
of apples. And the lesson that I want to share with
you guys here is that when things are not organized and
when it's hard to look for the needle in the haystack,
then it would be hard for us to identify what the problem is. So let's say that we were given a different problem similar to this one. Let's say that the apples are organized and there's a ten by ten block of apples, and then there are ten layers. So we're sure that ten by ten times ten, that would be 1,000 apples, which is much easier to count, because the data is organized. In the same way,
it would take us a lot of time to count the number of apples if
it's disorganized like this. And then if we were to
clean things up and organize things ahead of time, then it would
be easy to find and do aggregates and
counts, and even look for the specific apple that we're looking for.
So, that said, for those who are a bit disappointed that they were not able to win that prize, I'll just say that I'll be nice today and I'll just give out a 25% discount by using the code 25 Joshua. And yeah, I think it's valid between
October 22 and December 31 of this year.
So just take a screenshot and then use that
when the book is launched. I think next month.
Yeah, a couple of weeks from now. So,
that said, why are these things
important? These things are important because the
faster we're able to solve problems and the better the system is configured, the better our capability to manage the SLIs, SLOs, and SLAs.
So we won't talk about these things in detail here. But if you're
already aware of the fundamentals of SRE,
it's critical that we're able to manage this and have
a plan for it, because we cannot just assume that we'll
be able to reach these targets. And yeah,
we need to have a plan. And later, the other nine tips
that I'm going to share would help us do SRE properly.
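To make that a bit more concrete (a hedged sketch, not from the talk): if one of your SLIs is the error rate of a "checkout" service scraped by Prometheus, a rule like the following pages you when a 99.9% availability SLO is at risk. The metric name, job label, and threshold are assumptions for illustration.

```yaml
# prometheus-rules.yaml -- a hypothetical availability SLI/SLO alert,
# assuming the app exports standard http_requests_total counters.
groups:
  - name: checkout-slo
    rules:
      - alert: CheckoutErrorBudgetBurn
        expr: |
          sum(rate(http_requests_total{job="checkout", code=~"5.."}[30m]))
            /
          sum(rate(http_requests_total{job="checkout"}[30m])) > 0.001
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "Checkout error rate is eating the 99.9% availability SLO budget"
```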
The next tip would be knowing the
options and alternatives. So on the screen, we will see some of the possible tools that we can use with Kubernetes. Let's say we have Rancher, we have Prometheus, we have Kiali, we have Istio, we have App Mesh, and we have Helm.
So you may or may not use some of these tools, but what these
tools help us with is that instead of us trying to build our own
solution, let's just learn what the majority
of the users of Kubernetes are using,
so that if we have a requirement,
then we're pretty sure that maybe some other professional
out there has encountered the same problem.
Meaning these tools may have the features already.
All we need to do is learn how to install them and get them to work. However, and this is the next tip, using tools does not automatically solve the problem.
The tools are enablers. Like I said,
the tools are enablers. What do I mean by that?
At the end of the day, the tools will be used by people.
So, combining the concepts we learned from the previous slides,
let's say that the software engineering team and then the
site reliability engineering team are going to
use these tools. What if the tools are
properly set up, the tools are in place, but 80% of the team has no idea how to use these tools and has very little idea about the concepts involved when using them? And then finally, what if they
say that they're familiar with these tools, but in reality they're
not able to use these tools properly to troubleshoot the problems?
So what will happen here is that at the start, during the planning stage,
we ask people whether they're familiar with this tool, and then
everyone will say, yeah, I have used that before. And then something goes
wrong one month from now, and then only one or two people
are able to use the tools properly. So there, that's something that
you need to solve, because a big part of this
would be a human problem. So one of the possible solutions
you need to take note of would be,
let's say, giving them a training program to help
them understand the tools better and how to do troubleshooting better.
Another thing that you have to take note of is that sometimes
the best words, the most important things we need to hear about are
the words never said. So people will not really tell you what's really
happening. So it's important for the management team to
listen to the words never said by trying to understand what's really happening, by,
let's say, auditing and by checking things
even without people telling them or giving them feedback.
Sometimes management teams rely so much on feedback that they assume these feedback notes are the sources of truth, when in fact doing your own simulation may be a much better way to know what's going to happen or not. The next thing in our list
would be this one. So there's a bunch
of logos here, a bunch of tools.
So first we have here SageMaker, we have App Mesh, we have EKS, we have X-Ray, and we have CloudWatch.
The lesson I want to share here is that not everything
needs to be stored inside the Kubernetes cluster. There may
be tools, there may be open source tools that
can easily be installed inside the Kubernetes cluster.
You may be using some machine learning tools
and workflow tools to run machine learning workloads inside the Kubernetes
cluster. Right. But if there are other options available, let's say SageMaker, and it has proper integration with Kubernetes (in this example, you have SageMaker Operators for Kubernetes), then you'll be able to delegate and transfer some of the workload to SageMaker so that you're pretty sure that it's managed there. And then there's
less management burden inside your Kubernetes cluster, especially if
you're dealing with a lot of services and workload there.
And you also have App Mesh, you have X-Ray, you have CloudWatch. So if you have logs, yeah, you can use CloudWatch to store those logs and help us analyze them there.
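One common way to get container logs into CloudWatch is a Fluent Bit DaemonSet. This is a minimal sketch of just its ConfigMap, assuming the aws-for-fluent-bit image is used and the log group name is a placeholder; option names can vary slightly between plugin versions, and newer versions of the CloudWatch output also expose a log retention option, which ties into the cost discussion later in this talk.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [INPUT]
        Name    tail
        Path    /var/log/containers/*.log
        Tag     kube.*
        Parser  docker

    # enrich records with pod and namespace metadata
    [FILTER]
        Name    kubernetes
        Match   kube.*

    # ship everything to a CloudWatch log group (the group name is a placeholder)
    [OUTPUT]
        Name              cloudwatch_logs
        Match             kube.*
        region            us-east-1
        log_group_name    /eks/my-cluster/containers
        log_stream_prefix from-fluent-bit-
        auto_create_group true
```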
In the same way, instead of storing the data inside Kubernetes, inside databases there, maybe we can use an RDS instance outside of the cluster so that at least you get the managed-service advantages, the pros of using RDS,
which is outside of your Kubernetes cluster.
So if you need to vertically scale your
database, then instead of you trying to
use Kubernetes to manage the database resources there,
you can do that using RDS separately.
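A small pattern that keeps the database outside the cluster while applications still use a stable in-cluster name is an ExternalName Service; the endpoint below is a placeholder, not a real instance, and the service name is hypothetical.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders-db                 # the in-cluster DNS name the application uses
  namespace: default
spec:
  type: ExternalName
  # placeholder endpoint; point this at the actual RDS instance endpoint
  externalName: orders.abc123xyz.us-east-1.rds.amazonaws.com
```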
The next one involves the users,
the target users. When creating infrastructure,
we're not trying to build and manage the infrastructure
just for the sake of it. The goal of SRE,
the goal of what we're doing right now is to make sure
that our customers are happy and the system is up almost
100% of the time. Almost 100%. So it's
critical that we know who the target users are and we need to understand their
behavior. Because when dealing with systems,
it's always going to be a combination of planned
and unplanned work, and at the same time it will be a combination of
automated tasks and manual tasks.
There will be times where we will have to take more risk
than usual. And if we know the usage rate of
the system and we understand the behavior of the users, then maybe we
can schedule some of the more critical work,
maybe critical migration work, which can be done during
off peak hours. So in this example here,
maybe your customers for some reason use the site between twelve midnight and 3:00 a.m., and then maybe they go back to your site again from 3:00 p.m. to 7:00 p.m.
So it really depends on your application.
This really depends on the users,
the target market and the behavior changes.
Also, day to day, maybe during weekdays they behave like this, and during weekends they behave in some other fashion. And then there are specific days in the year where
you really have to plan ahead and prepare the infrastructure and configuration
to carry the workload, especially in cases where there's a spike.
So let's say there's a holiday or there's an event: you need to prepare ahead of time and you cannot just rely on
auto scaling to help solve the requirements
of the business. So again, this is very critical and
we need to know the behavior of the target users.
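As a hedged sketch of scheduling heavier work during off-peak hours, a CronJob can pin routine maintenance to a quiet window; the schedule, image, and task below are placeholders and the "right" window depends entirely on your own users' behavior.

```yaml
apiVersion: batch/v1              # use batch/v1beta1 on clusters older than 1.21
kind: CronJob
metadata:
  name: nightly-maintenance
spec:
  schedule: "0 4 * * *"           # 04:00, assumed to be off-peak for this audience
  concurrencyPolicy: Forbid       # never overlap two maintenance runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: maintenance
              image: registry.example.com/maintenance:1.0   # placeholder image
              args: ["--task", "reindex"]                   # hypothetical task
```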
The next one in our list involves managing the overall cost
of storing, managing and searching the logs.
Whenever we prepare infrastructure, one of
the primary requirements we have to deal with would be cost
estimation. It's one of the trickier things we need to worry
about, because in
addition to trying to predict the cost when dealing with auto scaling infrastructure,
we also have to worry about the things which may not be
noticeable or visible at that point in time.
So let's say that we want to prepare some cost
estimates. The first thing that comes to mind would be how much the database instance would cost and then how much the EKS cluster would cost.
Right. But after a couple of months, when you check the bill,
you'll see, hey, where did this additional cost come from?
Oh, the logs start to build up because we're
trying to collect a lot of logs because,
yeah, the more logs we have, the better, because it allows us
to debug the site better, right?
Yeah. So you just have to worry about the additional cost of storing logs and also making sure that there's also a backup, because you cannot just remove the logs and reduce the number
of logs being collected by the system just to reduce the cost.
So knowing the different options on where to store the logs,
that's also the second step.
And the third step here would be understanding the total cost
of managing the centralized log storage.
Because what you don't want is for some of your resources to be building a custom solution for your system; instead of doing other things, they will be forced to maintain a sophisticated centralized log storage setup.
And then finally, it's critical that we manage
the time spent searching the logs. So managing is
one of the things we need to worry about. Searching the logs
is critical also. So given that your system
will be collecting a lot of logs from different sources,
it's critical that searching the logs, especially when done manually by a user, is super fast and that we're able to connect the logs collected from the system's distributed components and connect the dots, especially if, let's say, component A calls component B and then component B calls component C and they are in different pods and different services,
right? So being able to identify
the problem and connecting the logs
from each of the components is something that you need to
set up ahead of time. So there are a lot of tools and options
out there. But having good collaboration between
the application development team and the SRE team
is crucial to get that to work.
Next in our list would be understanding the weakest links in your system.
Sometimes when we do not understand the system enough, there's always
that tendency. When there's a slowdown, we always
try to say, okay, if the system is slowing down, we just need
to scale it, we just need to do auto scaling.
However, it doesn't really work like that in real life.
Not all issues are due to lack
of, let's say, compute power when it comes to processing the application logic code. Right.
In some cases, maybe the bottleneck would be
the database. So instead of us trying to always blame
the Kubernetes cluster and the resources inside it, the resources
being managed by that cluster, the first step
is to actually know where the problem is. Maybe the database queries are not optimized. So maybe one solution would be to optimize the queries. Then the other solution would be maybe we need to have read replicas to solve
the problem. And there's really no need to do auto scaling right away.
Because even if you auto scale and the database is the bottleneck,
auto scaling will not help solve the latency issues.
So that's one of the things that you need to be aware of,
because there's always the tendency to have the mapped solutions, the canned solutions, and we feel that, okay, if the site is slow, do auto scaling. No, that's not the case. We need to spend some time, analyze what's wrong, and try to remove any sort of assumptions and biases to help us really solve the problem from its very real root cause.
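For reference, this is roughly what the canned "just auto scale" answer looks like as a HorizontalPodAutoscaler (the names are placeholders); the point above is that it only helps when application CPU is actually the bottleneck, and does nothing for a saturated database.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout                # hypothetical deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # adds pods on CPU pressure; a slow database stays slow
```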
The next one in our list would involve failure.
So the lesson here would be to prepare for failure so that nothing fails. So understanding
the common failure modes when
using Kubernetes is critical.
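Some of those failure modes are listed next; as a hedged illustration of rehearsing one of them on purpose, here is a minimal LitmusChaos sketch that deletes pods of a hypothetical "checkout" deployment. It assumes the LitmusChaos operator, the pod-delete experiment, and a suitable service account are already installed, and field names can differ between Litmus versions.

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-pod-delete
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: "app=checkout"           # hypothetical target deployment
    appkind: deployment
  engineState: active
  chaosServiceAccount: pod-delete-sa   # assumes RBAC for the experiment exists
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
```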
So some examples are shared here that would involve high CPU,
that would involve packet loss, maybe sudden
increase in latency would happen. And then sometimes there
will be node failure in rare cases,
maybe unavailable regions, maybe there's some service
failure from time to time, and maybe some CPU throttling as
well. So what do we do with this knowledge
or information? What's important here is that we
try to simulate these common failure modes
before launching things into production. Again, the best assumption would be that things will fail, and
the goal is to prepare a layered plan and
to prepare the automation work required so that the
system is self healing. So you have to plan
things ahead and include this in your timeline. Building the
site is the first step, and then the second part
there would be to harden the system to make it prepared
for anything that may come up. So the previous lesson, which is understanding the users and their behavior, is critical because, for one thing, if, let's say, the majority of your users are from a single country, let's say the Philippines,
then maybe you don't need to have a
system which is available globally,
right? So that makes your system much cheaper compared to,
let's say, requirements that involve the entire
world. So let's say you have a website and users from different parts of the world, different users across the globe, will use it; then that may definitely need a more expensive setup. Maybe you have a high availability setup across different regions.
So being prepared for those things and planning
things ahead is critical and also being practical
about it. Because what we want to do is to
also save on costs. We cannot just build and build, automate this, and make this available everywhere. It's about managing that balance
between availability, resilience and
cost. Next would be managing the security
risk. Kubernetes is like any other tool.
If it's not configured properly, then it
can be attacked and compromised. There are tools out there which help us run these checks and identify the risks and vulnerabilities, especially when it comes to misconfigured Kubernetes clusters. In some cases, some teams would just say, oh, it's secure. Of course, if you misconfigure something, if you use the wrong sequence of things, and you also have an application which is easily compromised, then someone might be able to attack the entire system. So knowing the weakest links of your system when it comes to security is critical.
So in this example, let's say that we have a Cloud9 control instance that has kubectl installed there, which helps manage the Kubernetes cluster.
What if that's the highest risk entity in our
environment? So even if we were able to secure
the Kubernetes cluster, the Kubernetes configuration, if that
gets attacked and the Kubernetes cluster
is deleted by the malicious actor, then boom,
your entire system is down. So feel free to check the possible risks and issues when dealing with these types of setups, let's say IAM PassRole.
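One small mitigation sketch, assuming EKS: instead of mapping the control instance's IAM role to system:masters, map it to a restricted group in the aws-auth ConfigMap so that a compromised instance cannot delete the whole cluster. The role ARN, username, and group name below are placeholders, and in practice you would merge this entry with the existing mapRoles list rather than replacing it.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    # placeholder ARN: the role attached to the Cloud9 / control instance
    - rolearn: arn:aws:iam::111122223333:role/cloud9-operator
      username: cloud9-operator
      groups:
        - deploy-only        # a custom group bound to a limited Role, not system:masters
```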
Think about how an attacker would be able to attack
your system using different techniques and
exploits. So that's the first step. The second
step would be trying to look at all the other resources
in that AWS account. So there are different techniques to
protect this system. For example, you use multiple accounts to make sure
that even if one account gets compromised, the other systems are not affected.
Because for one thing, let's say that one of the IAM users,
or maybe one of the EC2 instances launched by an IAM user, has super admin powers, and then for some reason that EC2 instance gets compromised. Then if that system gets compromised and it has the ability to delete EC2 resources, to delete all the resources in your account, then boom, even if this entire setup is secure, your entire AWS account would be finished, would be done. So, yeah, you might lose the data, you might lose the resources, so make sure that you are prepared for those edge cases. This one may
not be so obvious, but it's critical to
have a plan. But the timing is critical.
It's important that we plan when there is minimal pressure. Some of
us will tell everyone, and just assume this is the truth, that sometimes we do a better job when there's pressure. In reality, generally when there's an emergency, there's a time constraint. And then when we are planning, the more time that we have, the better. And when there's minimal pressure, or when there's no pressure at all, it's better to plan there because we have the opportunity to prepare an exhaustive plan, to prepare an exhaustive list of
what we need to do to be able to accomplish everything that we need.
But when there's an emergency, a lot of us will
panic, a lot of us will be under time constraints, and we will
not be able to prepare the best solution. So planning
things way ahead of time is critical, and it's better
to have a solid plan than rushing it when there's a problem
already. So that's pretty much it. We learned
a lot of things in this talk,
and yeah, if you have any questions,
feel free to reach out to me. But, yeah, thank you for listening, and I hope you learned something today in my talk on pragmatic site reliability engineering for Kubernetes in the cloud. So again, I am Joshua Arvin Lat, and I'm the chief technology officer of Nuworks Interactive Labs. Thank you, and bye.