Transcript
Hello everyone. Happy to be here. I'm here to talk about how to empower your
developers to troubleshoot Kubernetes independently. My name
is Itiel Shwartz and I'm the CTO and co-founder of Komodor,
the first Kubernetes-native troubleshooting platform. I really believe in
the shift-left movement and in dev empowerment.
Other than that, in the past I worked at eBay, Forter, and Rookout,
I have a lot of backend and ops experience,
and I'm also a Kubernetes fan. So the agenda
for today: first of all, we are going to talk about the benefits and
challenges of Kubernetes. I know everyone knows Kubernetes,
but I'm going to do a little bit more of a deep dive on
what's so good and what's not that good in Kubernetes.
Then we are going to talk about the complexities of troubleshooting Kubernetes
and how to overcome them. So first of all, I think everyone
knows that "move fast and break things" has been kind of
the motto for the past couple of years. We are
shipping code a lot faster, and elite teams are moving even
faster than normal teams, meaning a hundred, even thousands
of deployments per day. In this new
world, in this new space, what happened is that in order to move
fast, new infrastructure has emerged, and Docker
and Kubernetes are the de facto standard in this new world.
Something like 88% of organizations have already adopted
Kubernetes. So I know everything sounds
amazing, and Kubernetes has a lot of benefits. It helps you save
money, it's easy to port, it gives you multicloud capabilities,
a lot of things, and it sounds like it's
doing everything. But at the end of the day it
does come with a price. I think every organization
that is moving to Kubernetes because it looks sexy,
because it will help make their life a lot easier, is bombarded at
the beginning of the migration with issues
that arise because they moved to Kubernetes. It looks very
easy, very simple to get started, but once you are in
there, you feel the pain, you feel how hard it is, how complex
Kubernetes really is. And the
trick about Kubernetes is that it makes it very easy to build very complex
systems, and basically it allows you to
shoot yourself in the foot very easily.
So troubleshooting Kubernetes is complex
basically because a lot of things happen in the Kubernetes cluster
that you don't really know about. The master can change,
the node can change, and not all of those changes, not all of those issues,
are propagated. More than
that, one issue with one microservice can
impact the rest of the system. And because with the move
to Kubernetes we are moving from a monolithic application to a lot
of microservices, a very small and innocent change in
one service might crash the whole application.
Other than that, there are permission issues: some organizations
don't really trust everyone to see all of the data,
which basically causes a lack of ability
to act independently for some team members, and particularly
for developers. And because Kubernetes is still new and not
a lot of people are really experts in Kubernetes,
we see how the lack of knowledge is frustrating
for developers and for teams. They
don't really understand how they should operate this
very complex system, and how they are expected
to own the services even though they don't really know what is happening
under the hood. More than that, there is the
very big question of who is responsible for troubleshooting issues
in Kubernetes. I think if we had
asked the same question a couple of years ago, it was clear
that it is the NOC's and ops' responsibility
to manage things in production. Now, with the shift-left
movement, in a world where for each dev there
is one DevOps, we see more and more organizations trying to
shift this responsibility from the DevOps to
the developers, mainly because there are a lot of developers and they are already
responsible for deploying the code into production.
And if they are already deploying the code into production, then it
only makes sense that they will be responsible for
the troubleshooting, for the ability not only to
break things, but also to fix them and to understand
what broke and why it broke.
So the question is, if Kubernetes is indeed complex,
and if we want our developers to help us troubleshoot
Kubernetes, if we want them to take full ownership
of their application, fully end to end, from deployment
to troubleshooting, what do we need to do as operations, as
SREs, in order to
make them part of the troubleshooting cycle and not only
bystanders? So, step one.
When we speak with organizations,
we see how a lot of them try to throw everything
at the developers. Basically: "I don't know, you are now
responsible for it. We brought Kubernetes into the organization so
the devs can have more empowerment. It's your problem now."
But I think one of the key things that you must start with
is to get buy-in from the devs. You
don't necessarily have to get all of the dev organization; even a few
champions at first are good enough. You need people who
will want to troubleshoot, and who want to troubleshoot not
because they're masochists, but because
it will give them more ownership, more responsibility, and
more independence. So you need to understand how
to bring the developers along on the journey of troubleshooting,
how you can make them part of the team and make troubleshooting
something that they are not reluctant to
do. If you expect them to wake up in the middle of the night and
understand what is happening in the Kubernetes cluster, you have
to get the buy-in, you have to get them engaged;
if not, it's not really going to work. After you
get them engaged, and everyone really wants to troubleshoot,
you need to spend time on the training part. A lot
of developers are measured,
or used to be measured, on the quality of the code they
write and the number of features they write, not necessarily on
how good they are at troubleshooting. They have a very micro,
narrow scope (I don't like to say every developer
is like this, but as a whole), while the operations people who
troubleshoot have more of a macro-level view. What you need to do
is help them understand the world as you see it:
write the playbooks, go in and train
them, so they will have all of the necessary knowledge
in order to troubleshoot independently. You can't just
say, "Okay, you're a developer, I see that you want to troubleshoot, knock yourself
out." You should help them and make the
journey a lot easier by giving them some of the knowledge
that you already have, both about the technical parts of Kubernetes
and about the company-specific parts of how Kubernetes is
managed in your organization.
Step three is to give them the tools they need to succeed.
Again, I will say that most monitoring
tools, most troubleshooting tools today, are more
focused on the operations people. Komodor is an exception,
but I will say most of them are macro-level and
operations-focused. You need to help them have
the right tools in order to troubleshoot efficiently. That
means creating the relevant dashboards for them in Datadog and teaching
them how they can add to or change them, in Datadog or Prometheus,
it doesn't really matter. You have to make sure that once an issue
occurs in the system, they have the relevant places
and tools to look at. They should have the
required permissions to get there, and
part of the training should be teaching them how
to use those tools. And once
we have all of these tools, we must give them the permissions. You can't
really expect someone to be responsible
for the production environment when they have zero production
access, when they can't see the logs or take actions such
as rolling back or increasing replicas and so on. So if you
expect them to wake up in the middle of the night to troubleshoot an issue,
you must give them not only the necessary tools but, more than that, the permissions
and the ability to take actions
in order to remediate the issue. One of the most frustrating things that
I see in organizations is that the developers can
deploy code to production, but for some reason they can't roll back
to the previous version. And that in turn just
creates frustration for the developers and for the DevOps.
No one is happy, because the DevOps
is required to be in the loop every time we need to
roll back, and on the other hand, the developers need to call
the DevOps just so they can click the button. So free yourself,
open the bottlenecks, and allow the developers to take
full responsibility for the lifecycle of
the system. And I will say
that while we are talking about troubleshooting, about the
procedures, about the training, about the tools, it is important
to make sure you are a single team,
whose core mission is to make the system better and
to improve the troubleshooting process and time. So you need
to find the relevant partners. It can be even one or two tech
leads in the dev organization to get started. You need
to remember this is a marathon, not a sprint. There is no
one silver bullet after which all of your developers
will be just as good as you are. You need to remember it will take
time. You will need to change some of the playbooks, you will need
to change some of the tools, you will need to adapt them. You will maybe
need to create new actions, a new access mechanism, to
accommodate the fact that the developers now own the troubleshooting
process. But I don't think you should
take it as a bummer, like "now I need to write tools
for the developers to use" or something like that. It's the opposite.
You are now writing tools that will empower your developers, and
in this way you are basically freeing yourself from
troubleshooting and from waking up in the middle of the night.
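As a rough illustration of such an access mechanism, here is a minimal sketch
of a namespaced Kubernetes RBAC Role that covers the actions named in the
talk: seeing logs, rolling back, and increasing replicas. It is not something
shown in the talk. The role name "dev-troubleshooter", the namespace
"payments", and both helper functions are hypothetical, and the exact rule set
is an assumption that should be checked against your own cluster policy.

```python
import json


def developer_role(namespace, name="dev-troubleshooter"):
    """Build a namespaced RBAC Role manifest as a plain dict (hypothetical names)."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            # Read-only view of what is failing: pods, their logs, and events.
            {"apiGroups": [""],
             "resources": ["pods", "pods/log", "events"],
             "verbs": ["get", "list", "watch"]},
            # Read workloads and ReplicaSets, which hold the previous revisions.
            {"apiGroups": ["apps"],
             "resources": ["deployments", "replicasets"],
             "verbs": ["get", "list", "watch"]},
            # Rollback and scaling are writes against the Deployment.
            {"apiGroups": ["apps"],
             "resources": ["deployments", "deployments/scale"],
             "verbs": ["get", "patch", "update"]},
        ],
    }


def allows(role, api_group, resource, verb):
    """Return True if any rule grants verb on resource (exact match, no wildcards)."""
    return any(
        api_group in rule["apiGroups"]
        and resource in rule["resources"]
        and verb in rule["verbs"]
        for rule in role["rules"]
    )


if __name__ == "__main__":
    # "payments" is a placeholder namespace used only for illustration.
    print(json.dumps(developer_role("payments"), indent=2))
```

The printed JSON is a manifest that could be reviewed and applied like any
other. A Role grants nothing by itself: it still has to be bound to the
developers' users or groups with a RoleBinding, which is left out of this
sketch. Secrets are deliberately not listed, so the role widens what
developers can do without exposing all of the data.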
I think one thing that good
operations people and developers love to do is to automate themselves
out of the process. And this is the state of mind that you
need: how can I be less involved in troubleshooting?
How can I give more tools, more capabilities, to
the developers? So, to conclude
the talk today, I will say that it's important, when starting to
adopt Kubernetes, to think about all of these things.
You can't really move everything from the
legacy system into Kubernetes and expect things to work normally.
You should try to build the right foundation from the ground
up, meaning have the devs involved in picking the technologies
and in troubleshooting from day one, and get buy-in
from them. More than that, I will say that
at the end of the day this will increase the velocity of
everyone in the team. It helps the developers, because they
will be able to write code faster, and once they have an issue they will
be able to solve it by themselves. And it will also help
to free the DevOps and the SREs, because they
won't be bombarded by developers asking, "Why is
it not working? Can you fix my application? My application is crashing,"
and so on. So both the developers and operations
should be happy about this process, as it will make
both of them much more efficient and it will save time for
everyone. So I think having the developers as a
critical part of the troubleshooting process is a must for
every organization that wants to move fast. And I think
that it's not that hard, but it is a process.
It will take time, and you need to understand that even though
it might be a bit hard at first, over the
long term it's worth your time.
And that was me talking about troubleshooting
Kubernetes. It was really fun being here. Thanks a lot,
everyone.