Conf42 Site Reliability Engineering 2021 - Online

How to Empower Developers to Troubleshoot K8s Independently

Abstract

In the “good old days”, Ops/IT teams were responsible for handling issues when applications crash. In the world of microservices, however, developers are required to take on a bigger role in identifying and fixing issues in production, sometimes without having proper tools, privileges or even training. So how can we empower developers to troubleshoot efficiently and independently?

Join us as Itiel Shwartz, co-founding CTO at Komodor, discusses:

  • How to turn developers into k8s troubleshooting experts
  • What culture and mindset changes are needed to succeed
  • The link between change management and troubleshooting

Summary

  • You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud. How to empower your developers to troubleshoot Kubernetes independently.
  • Itiel Shwartz is the CTO and co-founder of Komodor, the first Kubernetes-native troubleshooting platform. He believes in the shift-left movement and in dev empowerment. He talks about the challenges of troubleshooting and how to overcome them.
  • I think one thing that good operations people and developers love to do is to automate themselves out of the process. Having the developer as a critical part of the troubleshooting process is a must for every organization that wants to move fast. Over the long term it's worth your time.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud.
Hello everyone. Happy to be here. I'm here to talk about how to empower your developers to troubleshoot Kubernetes independently. My name is Itiel Shwartz and I'm the CTO and co-founder of Komodor, the first Kubernetes-native troubleshooting platform. I really believe in the shift-left movement and in dev empowerment. Other than that, in the past I worked at eBay, Forter, and Rookout. I have a lot of backend and ops/IT experience, and I'm also a Kubernetes fan.
So the agenda for today: first of all, we are going to talk about the benefits and challenges of Kubernetes. I know everyone knows Kubernetes, but I'm going to do a bit more of a deep dive on what's so good and what's not that good in Kubernetes. Then we are going to talk about the complexity of troubleshooting Kubernetes and how to overcome it.
So first of all, I think everyone knows that "move fast and break things" has been kind of the motto for the past couple of years. We are shipping code a lot faster, and elite teams are moving even faster than normal teams, meaning hundreds, even thousands, of deployments per day. In this new world, in order to move fast, new infrastructure has emerged, and Docker and Kubernetes are the de facto standard. Something like 88% of organizations have already adopted Kubernetes.
So I know everything sounds amazing, and Kubernetes has a lot of benefits. It helps you save money, it's easy to port, it gives you multi-cloud capabilities, a lot of things, and it sounds like it does everything. But at the end of the day it does come with a price.
And I think every organization that moves to Kubernetes because it looks sexy, because it will make their life a lot easier, is bombarded at the beginning of the migration with issues that arise because they moved to Kubernetes. It looks very easy, very simple to get started, but once you are in there, you feel the pain, you feel how hard it is, how complex Kubernetes really is. And the tricky thing about Kubernetes is that it makes it very easy to build very complex systems, and basically it allows you to shoot yourself in the leg very easily.
So troubleshooting Kubernetes is complex basically because a lot of things happen in the Kubernetes cluster that you don't really know about. The master can change, the node can change, and not all of those changes, not all of those issues, are propagated. More than that, one issue with one microservice can impact the rest of the system. And because with the move to Kubernetes we are moving from a monolithic application to a lot of microservices, a very small and innocent change in one service might crash the whole application.
Other than that, there are permission issues, and some organizations don't really trust everyone to see all of the data. That basically causes a lack of ability to act independently for some team members, and particularly for developers. And because Kubernetes is still new and not a lot of people are really experts in Kubernetes, we see how the lack of knowledge is frustrating for developers and for teams. They don't really understand how they should operate this very complex system, or how they are expected to own their services when they don't really know what is happening under the hood.
More than that, there is the very big question of who is responsible for troubleshooting issues in Kubernetes. I think if we had asked the same question a couple of years ago, it was clear that it was the NOC's and ops/IT's responsibility to manage things in production.
Now, with the shift-left movement, in a world where for each dev there is one DevOps, we see more and more organizations trying to shift this responsibility from the DevOps to the developers, mainly because there are a lot of developers and they are already responsible for deploying the code into production. And if they are already deploying the code into production, then it only makes sense that they will be responsible for the troubleshooting as well, for the ability not only to break things, but also to fix them and to understand what broke and why.
So the question is: if Kubernetes is indeed complex, and if we want our developers to help us troubleshoot Kubernetes, if we want them to take full ownership of their application, end to end, from deployment to troubleshooting, what do we need to do as operations, as SREs, to make them part of the troubleshooting cycle and not only bystanders?
So step one. When we speak with organizations, we see how a lot of them try to throw everything at the developers. Basically: "I don't know, you are now responsible for it. We brought Kubernetes into the organization so the devs can have more empowerment. It's your problem now." But I think one of the key things that you must start with is getting buy-in from the devs. You need the devs, and you don't necessarily have to get the whole dev organization. Even a few champions at first are good enough. You need people who want to troubleshoot, and who want to troubleshoot not because they're masochists, but because it will give them more ownership, more responsibility, and more independence. So you need to understand how to bring the developers along on the troubleshooting journey, how you can make them part of the team, and how to make it something they are not reluctant to do.
If you expect them to wake up in the middle of the night and understand what is happening in the Kubernetes cluster, you have to get the buy-in, you have to get them engaged, and if not, it's not really going to work.
After you get them engaged, and now everyone really wants to troubleshoot, you need to spend time on the training part. Developers are measured, or used to be measured, on the quality of the code they write and the number of features they write, not necessarily on how good they are at troubleshooting. They have a very micro, narrow scope. I don't like to say every developer is like this, but as a whole they do, while the operations people who troubleshoot have more of a macro-level view. What you need to do is help them understand the world as you see it: write the playbook, go in and train them, so they will have all of the necessary knowledge to troubleshoot independently. You can't just say, "Okay, you're a developer, I see that you want to troubleshoot, knock yourself out." You should help them make the journey a lot easier by giving them some of the knowledge that you already have, both about the technical parts of Kubernetes and about the company-specific parts: how is Kubernetes managed in my organization?
Step three is to give them the tools they need to succeed. Again, I will say that most monitoring tools, most troubleshooting tools today, are more focused on the operations people. Komodor is an exception, but I will say most of them are macro-level and operations-focused. You need to help them have the right tools in order to troubleshoot efficiently. That means creating the relevant dashboards for them in Datadog, and training them on how they can add to or change things in Datadog or Prometheus, it doesn't really matter which. You have to make sure that once an issue occurs in the system, they have the relevant places and tools to look at.
They should have the required permissions to get there, but understanding how to use those tools should also be part of the training.
And once they have all of these tools, we must give them the permissions. You can't really expect someone to be responsible for the production environment when they have zero production access, when they can't see the logs or can't take actions such as rolling back or increasing replicas. So if you expect them to wake up in the middle of the night to troubleshoot an issue, you must give them not only the necessary tools, but, more than that, the permissions and the ability to take actions in order to remediate the issue. One of the most frustrating things that I see in organizations is that the developers can deploy code to production, but for some reason they can't roll back to the previous version. That in turn just creates frustration for the developers and for the DevOps. Everyone is unhappy, because the DevOps is required to be in the loop every time we need to roll back, and on the other hand, the developers need to call up the DevOps just so they can click the button. So free yourself, open the bottlenecks, and allow the developers to take full responsibility over the lifecycle of the system.
And I will say that while we are talking about troubleshooting, about the procedures, about the training, about the tools, it is important to make sure you are a single team whose core mission is to make the system better and to improve the troubleshooting process and time. So you need to find the relevant partners. It can even be one or two tech leads in the dev organization to get started. You need to remember that this is a marathon, not a sprint. There is no silver bullet after which all of your developers will be just as good as you are. You need to remember it will take time. You will need to change some of the playbooks, you will need to change some of the tools, you will need to adapt them.
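To make the permissions point concrete, here is a minimal sketch, not taken from the talk, of what read-logs, rollback, and scale access for developers could look like as a namespaced Kubernetes RBAC Role. It is written in Python only to build and sanity-check the manifest as a plain dictionary. The role name `dev-troubleshooter`, the `payments` namespace, the helper names, and the exact set of rules are all illustrative assumptions; the verbs a given cluster and kubectl version actually need may differ.

```python
import json


def developer_troubleshooting_role(namespace, name="dev-troubleshooter"):
    """Return a namespaced RBAC Role manifest as a plain dict.

    The role name and the choice of rules are illustrative assumptions,
    not a recommendation from the talk.
    """
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {   # See what is running and read its logs and events.
                "apiGroups": [""],
                "resources": ["pods", "pods/log", "events"],
                "verbs": ["get", "list", "watch"],
            },
            {   # Inspect deployments and their revision history (replica sets).
                "apiGroups": ["apps"],
                "resources": ["deployments", "replicasets"],
                "verbs": ["get", "list", "watch"],
            },
            {   # Roll back and scale: both modify the deployment or its scale.
                "apiGroups": ["apps"],
                "resources": ["deployments", "deployments/scale"],
                "verbs": ["get", "patch", "update"],
            },
        ],
    }


def allows(role, api_group, resource, verb):
    """Check whether any rule grants `verb` on `resource`.

    Exact string matching only; wildcards ("*") are not handled.
    """
    return any(
        api_group in rule["apiGroups"]
        and resource in rule["resources"]
        and verb in rule["verbs"]
        for rule in role["rules"]
    )


if __name__ == "__main__":
    # "payments" is a placeholder namespace.
    print(json.dumps(developer_troubleshooting_role("payments"), indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML. A Role grants nothing until a RoleBinding attaches it to users or groups, and `kubectl auth can-i` is a reasonable way to verify what a developer can actually do afterwards.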
You will maybe need to create new actions, or a new access mechanism, to accommodate the fact that the developers now own the troubleshooting process. But I don't think you should take it as a bummer, like, "Now I need to write tools for the developers to use," or something like that. The opposite: you are now writing tools that will empower your developers, and in this way you are basically freeing yourself from troubleshooting and from waking up in the middle of the night. I think one thing that good operations people and developers love to do is to automate themselves out of the process. And this is the state of mind you need: how can I be less involved in troubleshooting? How can I give more tools, more capabilities, to the developers?
So to conclude the talk today, I will say that it's important, when starting to adopt Kubernetes, to think about all of these things. You can't really expect to move everything from the legacy system into Kubernetes and have things work normally. You should try to build the right foundation from the ground up, meaning having the devs involved in picking the technologies and in troubleshooting from day one, and getting buy-in from them.
More than that, I will say that at the end of the day this will increase the velocity of everyone on the team: the developers, because they will be able to write code faster, and once they have an issue they will be able to solve it by themselves; and the DevOps and the SREs, because they won't be bombarded by developers asking, "Why is it not working? Can you fix my application? My application is crashing," and so on. So both the developers and operations should be happy about this process, as it will make both of them much more efficient and it will save time for everyone. So I think having the developer as a critical part of the troubleshooting process is a must for every organization that wants to move fast.
And I think that it's not that hard, but it is a process. It will take time, and you need to understand that even if it might be a bit hard at first, over the long term it's worth your time.
And that was me talking about troubleshooting Kubernetes. It was really fun being here. Thanks a lot, everyone.

Itiel Shwartz

CTO @ Komodor
