Conf42 Site Reliability Engineering 2022 - Online

On Call Like a King - How do we utilize Chaos Engineering to become better cloud native engineers?

Abstract

The evolution of cloud native technologies and the need to scale engineering are leading organizations to restructure their teams and embrace new architectural approaches. Being a cloud native engineer is fun! But it is also challenging. These days engineers aren't just writing code and building packages; they are expected to know how to write the relevant Kubernetes resource YAMLs, use Helm, containerize their app and ship it to a variety of environments. It isn't enough to know these things at a high level. Being a cloud native engineer means that it's not enough to know the programming language you work in well; you should also keep adapting your knowledge and understanding of the cloud native technologies you depend on. Engineers are now required to write services that are just one of many other services that together solve a certain customer problem.

In order to enhance engineers' cloud native knowledge and best practices for dealing with production incidents, we started a series of workshops called "On-Call Like a King", which aim to enhance engineers' knowledge while responding to production incidents. Every workshop is a set of chaos engineering experiments that simulate real production incidents, and the engineers practice investigating, resolving and finding the root cause.

In this talk I will share how we got there, what we are doing and how it improves our engineering teams' expertise.

Summary

  • How do we utilize chaos engineering to become better cloud native engineers? Through a series of workshops composed at Cyren, originally meant to bring more confidence to engineers during their on-call shifts. But later on they became a great playground to train cloud native engineering practices.
  • The evolution of cloud native technologies and the need to scale engineering are leading organizations to restructure their teams and embrace new architectural approaches. Engineers these days are closer to the product and the customer needs. But there is still a long way to go and companies are still struggling.
  • As part of transitioning to being more cloud native and distributed, engineers face more challenges. Deep systems are complex and you should know how to deal with them. Cyren is holding workshops to train engineers on cloud native practices.
  • A bit on our on-call procedure before we proceed: we have weekly engineering shifts, a NOC team that monitors our system 24/7, and alert playbooks that assist the on-call engineer while responding to an event. The first priority is to get the system back to a normal state.
  • The workshop sessions are composed of three parts. The session starts with a quick introduction and motivation, making sure that the audience is aligned on the flow and the agenda. The overall session shouldn't be longer than 60 minutes.
  • Our "On-Call Like a King" workshop sessions try to be as close to real-life production scenarios as possible. Simulating such real-life scenarios enables the engineers to build confidence while taking care of a real production incident. The discussions around the incidents are a great place for knowledge sharing.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, thanks for joining my talk. I'm happy to be here with other great folks that share their knowledge with others. My talk today is: how do we utilize chaos engineering to become better cloud native engineers? Let me first introduce myself. My name is Eran and I'm leading the Cyren Inbox Security engineering at Cyren. I'm an engineer, I'm a problem solver and I love sharing my knowledge with others, obviously. So before we start, I would like to start with this one: what are you going to gain out of this talk? I would like to share with you how we leverage chaos engineering principles to achieve other things besides its main goal. At the beginning we wanted to bring more confidence to our engineers while responding to production incidents, and in addition to that, train them to become better cloud native engineers, as that requires additional expertise which isn't just the actual programming job of writing code and shipping it somewhere. I would like to share with you how we got there, what we are doing, and how it improves our engineering team expertise. And you might ask yourself, why is this titled "On Call Like a King"? It's a series of workshops that were composed at Cyren, which at the beginning, I must admit, were meant to bring more confidence to the engineers during their on-call shifts. But later on they became a great playground to train cloud native engineering practices and share knowledge around that. So during this session, I'm going to share what we are doing in such workshops. So stay with me to learn more. Let's start with the buzzword cloud native. Here's the definition I copied from the CNCF documentation. We call it the cloud native definition. I highlighted some of the words there. While you read this definition, you see the words scalable, dynamic, loosely coupled, resilient, manageable, observable. At the end you see something that I truly believe in, and I'm trying to make it part of the culture of any engineering team I join: as engineers, we deliver to production. From the definition, you see that this is what cloud native actually brings. As a result, engineers can make high-impact changes. This is, in my opinion, what every engineering culture should believe in: make an impact, and as a result you will have happy customers. The evolution of cloud native technologies and the need to scale engineering are leading organizations to restructure their teams and embrace new architectural approaches such as microservices. We are using cloud environments that are pretty dynamic and we might choose to build microservices to achieve better engineering scale. As a side note, you should remember that microservices are not a goal; we use them, the cloud and other tools to scale our engineering and product. As your system scales, it probably becomes more and more distributed. Distributed systems are by nature challenging. They are not easy to debug, they are not easy to maintain. And why is it not easy? Just because we each see pieces of a larger puzzle. Two years ago I wrote a blog post that tries to describe the engineering evolution at a glance. I think that the role of engineers has grown to be much bigger. We are not just shipping code anymore. We design it, we develop it, we release it, we support it in production. The days when we had to throw the artifacts over to operations are over. As engineers we are accountable for the full release cycle. If you think about it, it's mind blowing. We brought so much power to ourselves as engineers, and with great power we should be much more responsible.
You might be interested in reading that post that I wrote two years ago. I just talked about the changes and the complexity, but we should embrace these changes. These changes enable teams to take end-to-end ownership of their deliveries and enhance their velocity. As a result of these evolutions, engineers these days are closer to the product and the customer needs. In my opinion, there is still a long way to go and companies are still struggling with how to get engineers closer to their customers, to understand in depth what their business impact is. We talked about impact, but what is this engineering impact that we talk about? Engineers need to know what they solve, what their influence on the customer is, and know their impact on the product. If you think about it, there is a transition in the engineering mindset: we ship products and not just code. We embrace this transition, which brings with it so many benefits to the companies that are adopting it, and we are among them. On the other hand, as the team and the system scale, it becomes challenging to write new features that solve a certain problem, and even understanding service behavior is much more complex. Let's see why it becomes more complex. The advanced approaches that I just mentioned bring great value, but as engineers, we are now writing apps that are part of a wider collection of other services that are built on top of a certain platform in the cloud. I really like what Ben is sharing in these slides and I would like to share it with you. He's calling them deep systems. Images are better than words and the pyramid in the slide explains it all. You can see that as your service scales, you need to be responsible for a deep chain of other services that your service actually depends on. This is what it means: we are maintaining deep systems. Obviously, microservices and distributed systems are deep by nature. Let's try to imagine just a certain service you have, let's say some order service. What do you do in this order service? You fetch data in one service, then you need to fetch other data from another service, and maybe you produce an event to a third service. The story can go on, but you understand the concept. It's just complex. Deep systems are complex and you should know how to deal with them. As part of transitioning to being more cloud native and distributed, and relying on orchestrators such as Kubernetes at your foundations, engineers face more and more challenges that they didn't have to deal with before. Just imagine this scenario: you are on call, there is some back pressure that is burning your SLO targets. There is some issue with one of your AZs and a third of your deployment couldn't reschedule due to some node availability issues. What do you do? You need to find it, and you might need to point it out to your on-call DevOps colleague. By the way, DevOps may be working on that already, as it might trigger their SLOs as well. These kinds of incidents happen, and as a cloud native engineer you should be aware of the platform you're running in. What does it mean? You should know that there are AZs in every region, that your pod affinities are defined in some way, and that the pods that are scheduled have some status. There are cluster events, and how do you read the cluster events in such a failure? This was just one particular scenario that happened to me and might have happened to many of you before. As you see, it's not just your service anymore, it's more than that.
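To make that scenario a bit more concrete, here is a minimal sketch of the kind of checks an on-call engineer might run against the cluster in that situation, using the official Kubernetes Python client. The namespace name is a hypothetical placeholder and this is not the exact tooling used at Cyren; it only illustrates reading node conditions, pending pods and recent cluster events.

```python
# Sketch: quick cluster triage during an AZ / node-availability incident.
# Assumes the official `kubernetes` Python client and a configured kubeconfig.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NAMESPACE = "orders"  # hypothetical namespace, replace with your own

# 1. Which nodes are not Ready (e.g. lost together with the AZ)?
for node in v1.list_node().items:
    ready = next((c.status for c in node.status.conditions if c.type == "Ready"), "Unknown")
    if ready != "True":
        print(f"node {node.metadata.name} is NotReady")

# 2. Which pods could not be (re)scheduled?
for pod in v1.list_namespaced_pod(NAMESPACE).items:
    if pod.status.phase == "Pending":
        print(f"pod {pod.metadata.name} is Pending")

# 3. What do the recent cluster events say (FailedScheduling, NodeNotReady, ...)?
for event in v1.list_namespaced_event(NAMESPACE).items:
    if event.type == "Warning":
        print(f"{event.reason}: {event.message}")
```

The equivalent kubectl commands (kubectl get nodes, kubectl get pods, kubectl get events) are usually the first thing people reach for; the point is knowing that this information exists and where to find it during a failure.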
And this is what it means to be a cloud native engineer. As I said already, being a cloud native engineer is fun but also challenging. These days engineers aren't just writing code and building packages, but are expected to know how to write the relevant Kubernetes resource YAMLs, use Helm, containerize their app and ship it to a variety of environments. So as you see, it's not enough to know it at a high level. Being a cloud native engineer means that it's not enough to just know the programming language you are working in well; you should also keep adapting your knowledge and understanding of the cloud native technologies that you are depending on. Besides the tools you are using, building cloud native applications involves taking into account many moving parts, such as the platform you are building on, the database you are using, and much more. Obviously there are great tools and frameworks out there that abstract some of this complexity away from you as an engineer, but being blind to them might cost you a day or even a night. If you haven't heard of the fallacies of distributed computing, I really suggest you read further on them. They are here to stay and you should be aware of them and be prepared. In the cloud things will happen, things will fail. Don't think that you know what to expect; just make sure you understand the failures, handle them gracefully and embrace them. As I said, they will just happen. We talked a lot about the benefits and also the challenges. These are the challenges we had to deal with here at Cyren, so let me explain what we did to cope with them. We utilize chaos engineering for that purpose, we have found this method pretty useful, and I think it can be nice to share with you the practices and how we dealt with these challenges here. So stay here with me. Let's first give a quick brief of what chaos engineering is. The main goal of chaos engineering is as explained in the slide that I just copied from the chaos principles website: the idea of chaos engineering is to identify weaknesses and reduce uncertainty when building distributed systems. As I already mentioned in previous slides, building distributed systems at scale is challenging, and since such systems tend to be composed of many moving parts, leveraging chaos engineering practices to reduce the blast radius of such failures proves itself to be a great method for that purpose. So I've created a series of workshops called On-Call Like a King. These workshops intend to achieve two main objectives: the first is to train engineers on production failures that we had recently, and the second is to train engineers on cloud native practices and tooling, and how to become better cloud native engineers. A bit on our on-call procedure before we proceed, so it will help you to understand better what I'm going to talk about in the next slides. We have weekly engineering shifts and a NOC team that monitors our system 24/7. There are three alert severities defined, severity one, two and three, which actually define the business impact: alerts that go to the actual service owner and alerts that we usually monitor. We have alert playbooks that assist the on-call engineer while responding to an event; I will elaborate on them a bit later. And in case of a severity one, the first priority is to get the system back to a normal state.
The on-call engineer that is leading the incident should understand the high-level business impact in order to communicate back to the customers, and in any case where some specific expertise is needed to bring the system back into a functional state, the engineer makes sure that the relevant team or the service owner is on the keyboard to lead it. These are the tools that the engineer has in their toolbox to utilize in case of an incident. It's a pretty nice tool set, I must admit: we have Jaeger, Kibana, Grafana and all the rest. Now that we understand the big picture, let's drill down to the workshop itself. The workshop sessions are composed of three parts. We have the introduction and goal setting; then we might share some important stuff that we would like to share with everyone; and then we start the challenge, the most important part of this workshop. Let's dive into each one of the parts that I just mentioned above. The session starts with a quick introduction and motivation: why do we have this session? What are we going to do in the upcoming session? And we make sure that the audience is aligned on the flow and the agenda. It's very important to do that every time, as it makes people more connected to the motivation and helps them understand what is going to happen. This is part of your main goal: you should try to keep people focused and concentrated, so make sure that things are clear and concise at the beginning of every workshop session. Sometimes we utilize the session as a great opportunity to communicate some architectural aspects, platform improvements or process changes that we had recently. For example, we provide some updates on the on-call process, maybe a core service flow that we made some adaptations to, and more. And in the last part, the most important part, we work on a maximum of two production incident simulations, and the overall session shouldn't be longer than 60 minutes. We have found that we lose engineers' concentration in longer sessions. So if you work hybrid, it is better that these sessions happen when you are in the same workspace, as we have found that much more productive. Communication is key and it makes a great difference. Let me share with you what we are doing specifically in this part, which is the core of this workshop and, I think, one of the most important things. Our On-Call Like a King workshop sessions usually try to be as close to real-life production scenarios as possible, by simulating real production scenarios in one of our environments. Such real-life scenarios enable the engineers to build confidence while taking care of a real production incident. Try to have an environment that you can simulate these incidents on and let people play in real time. As we always say, there is no environment identical to production, and since we are doing specific experiments, it's not really necessary to have a production environment in place. Obviously, as you advance, it might be better to work on production, but it's much more complex by nature, as you already understand, and we have never done this before. Since we utilize chaos engineering here, I suggest having real experiments that you can execute within a few clicks. We are using one of our lower environments for that purpose. I must say that we started manually, so if you don't have a tool I really suggest you not to spend a dime on that; don't rush to choose a specific chaos engineering tool. Just recently we started using LitmusChaos to run these chaos experiments.
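To give a feel for what "execute within a few clicks" can look like, here is a hedged sketch of triggering a LitmusChaos pod-delete experiment programmatically through the Kubernetes API. The namespace, target label and service account names are hypothetical placeholders, and this is not necessarily how the Cyren workshops run their experiments; it only illustrates the general shape of a ChaosEngine resource.

```python
# Sketch: kick off a LitmusChaos pod-delete experiment against a sample service.
# Assumes LitmusChaos and the pod-delete ChaosExperiment are already installed in the cluster.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

chaos_engine = {
    "apiVersion": "litmuschaos.io/v1alpha1",
    "kind": "ChaosEngine",
    "metadata": {"name": "order-service-pod-delete", "namespace": "staging"},  # hypothetical names
    "spec": {
        "engineState": "active",
        "chaosServiceAccount": "litmus-admin",   # service account with chaos permissions
        "appinfo": {
            "appns": "staging",
            "applabel": "app=order-service",     # hypothetical target label
            "appkind": "deployment",
        },
        "experiments": [{
            "name": "pod-delete",
            "spec": {"components": {"env": [
                {"name": "TOTAL_CHAOS_DURATION", "value": "60"},  # seconds of chaos
                {"name": "CHAOS_INTERVAL", "value": "10"},        # seconds between pod kills
            ]}},
        }],
    },
}

# Creating the ChaosEngine starts the experiment; the engineers then investigate the fallout.
custom.create_namespaced_custom_object(
    group="litmuschaos.io", version="v1alpha1",
    namespace="staging", plural="chaosengines", body=chaos_engine,
)
```

The same manifest could just as well be applied with kubectl or Helm; wrapping it in a small script simply makes it a one-click action for whoever facilitates the workshop.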
But you can use anything else you would like to, or you can just simulate the incidents manually as we have done in the beginning. I think that the most important thing is, as I said before, that we need to have a playground for the engineers to actually exercise in, and not just hear someone talking over a presentation slide or some demo. You will be convinced that when they are practicing, and not just listening to someone explaining, it makes the session very, very productive. Right after the introduction slides we drill down into the first challenge. The session starts with a slide that explains a certain incident that we are going to simulate. We usually give some background of what is going to happen; for example, there is some back pressure in one of our services that we couldn't handle at some specific UTC time. We present some metrics of the current behavior, for example the alerts and the corresponding Grafana dashboards. Usually I think that you should present something that is very minimal, because this is how it usually happens during a real production incident. Then we give engineers some time to review the incident by themselves. Giving them time to think about it is really crucial; they are exercising thinking on their own. Even if they haven't covered something similar before, this very important step will encourage them to try and find out more information and utilize their know-how to get more information, such as gathering some cluster metrics, viewing the relevant dashboards, or reading the logs or service status. And I think that understanding the customer impact is a very important aspect. You should understand the customer impact, and even more importantly, when you are in an on-call shift, in case of a severity one, you should communicate the impact on the customer and see if there is any workaround until the incident is resolved completely. Engineers are not always aware of the actual customer impact, so it's a very good time to discuss it, and the workshop is a good place to speak about such things. I think that you should also pause their analysis from time to time and encourage them to ask questions. We have found that the discussions around the incidents are a great place for knowledge sharing. Knowledge sharing can be anything from a design diagram to some specific Kubernetes command line, and if you are sitting together in the same space, it can be pretty nice, because you can see who is doing what and then you can ask them to show the tools that they are using, and other people can learn a lot from that. What I really like in this session is that it triggers conversations, and engineers ask each other, for example, to send them CLI commands or tools that they recommend using. It's pretty nice and it really makes the engineer's life while debugging an incident much, much easier. The workshop sessions will teach you a lot about the know-how that people have, and I encourage you to update the playbooks based on that. If you don't have such a playbook, I really recommend you to have one. We have a variety of incident playbooks; most of them are composed for major severity one alerts and provide the on-call engineer with some gotchas and high-level flows that are important to know and to look at when dealing with different production incidents or scenarios. And this is what our playbook template looks like.
You can see that there is some description of how to detect the issue, how to assess the customer impact, and how to communicate in case of a failure. Drive the conversations by asking questions that will enable you to share some of the topics that you would like to train on. A few examples that have proven to be efficient: you can, for example, ask an engineer to present the Grafana dashboard to look at, ask someone else to share the Kibana logging queries, or ask another one to present the Jaeger tracing and how they found the trace. It's pretty nice, I must admit. You sometimes need to moderate the conversation, as time really flies fast and you need to bring back the focus, because the conversation sometimes gets very heavy. During the discussion, point out interesting architectural aspects that you would like the engineers to know about. Maybe you can talk about a specific async channel that you might want to share your thoughts about, or anything else. Encourage the audience to speak by asking questions around these areas of interest, which will enable them to even suggest new design approaches or highlight different challenges that they were thinking about lately. You might be surprised, I tell you. You might be surprised what people say, and sometimes you might even add some of the things that they suggest to your technical debt in order to take care of them later on. At the end of every challenge, ask somebody to present their end-to-end analysis. It makes things clearer for people that might not feel comfortable enough to ask questions in large forums, for engineers that have just joined the team, or for junior engineers that might want to learn more afterwards. It's a great source for people to get back into what has been done and also a fantastic part of your knowledge base. You can share these in the onboarding training process for new engineers that just joined the team. I found out that people sometimes just watch the recording afterwards, and it becomes handy even just for engineers to get some overview of the tools that they have. So just make sure to record it and share the meeting notes right after the session. As you have seen, chaos engineering for training is pretty cool, so leverage that to invest in your engineering team's knowledge and expertise. It seems like it was successful for us and maybe it will be successful for you as well. So to summarize some of the key takeaways: we found out that these sessions are an awesome playground for engineers. I must admit that I didn't think about using chaos engineering for this simulation in the first place. We started with just manual simulations of our incidents, or just presenting some of the evidence we gathered during some failure to drive conversation around it. As we moved forward, we leveraged chaos tools for that purpose. Besides training to become better cloud native engineers, on-call engineers are feeling more comfortable now in their shifts and understand the tools that are available to them to respond quickly. I thought it could be good to share this, as we always talk about chaos engineering experiments to make better, more reliable systems, but you can also leverage them to invest in your engineering team's training. So thanks for your time and I hope it was a fruitful session. Feel free to ask questions anytime and I will be very happy to share much more with you if you want.
...

Eran Levy

Director of Engineering @ Cyren



