Chaos Engineering 2024 Panel Discussion

Abstract

How are chaos engineering tools performing in 2024? Join the discussion with Anurag Palwal (Chaos Mesh), Karthik Satchitanand (Litmus Chaos), Sylvain Hellegouarch (Chaos Toolkit) and Miko Pawlikowski (PowerfulSeal). The panel is moderated by Prithvi Raj (Harness).

Summary

This is a chaos engineering panel at Conf42. Prithvi Raj is a CNCF ambassador and technical community manager at Harness. With him are four amazing folks from the chaos engineering community.
Anurag, Sylvain, Miko and Karthik will talk about chaos engineering. The panel looks stacked. We have a lot of chaos engineering insights for folks tuning in.
Miko: How did you get involved in the chaos engineering community? Prithvi adds that he joined the community four or five years ago and has seen chaos engineering grow exponentially in the last two, three, four years.
Karthik: You were one of the founding members of the Litmus Chaos community. Tell us more about how Litmus Chaos came into play. The motivations to carry out chaos engineering are similar to what Miko just explained.
Sylvain has been sharing a lot of experiences on culture practices using chaos. The Chaos Toolkit format is designed to delegate to better tools for the fault itself. The Principles of Chaos it builds on were a great outcome of years of Netflix doing this.
Sylvain: The only thing that I do is I always treat people nicely. It's my only culture. As a community, as a global community, we should be more aligned and talk to each other more. Prithvi agrees that talking about it more would have helped the community.
Sylvain: As a community, we've not been good at joining forces into a bigger thing. There is still a lot of advocacy to do with the audience that we have in mind. But as open source communities, we don't have the bandwidth to be there for every single person.
Open source projects are not just open source projects. The idea of chaos suffers when people start using it, but then some hard call is taken that there's no budget for this, or this is not our priority right now. With how things are moving, chaos engineering is being cornered.
Karthik: What are the challenges that you have seen moving from open source to enterprise, or just the enterprise adoption side of things in today's world? How does chaos scale? There are a lot of operational constraints.
Reliability is understood differently by different people. How do you quantify the impact of chaos? Many organizations that are starting out with chaos expect some kind of solutioning. A lot of these conversations give us a lot of insights.
Miko: How do you see modern chaos engineering today? I mean, there's generative AI as well, which is being induced right now to move chaos forward. What's your thought process on how chaos has changed from back then to today?
With AI, we're at a very interesting time in the evolution, I think, of humans in general. For us, like in the SRE space and incident management and chaos engineering, that creates a massive opportunity. I think with things changing, and as you mentioned over the last couple of years, we'll see a lot of change.
Sylvain: I want to spend a lot of time trying to care for Chaos Toolkit. I would like to be a bit more developer centric. I sense that there is a fatigue of new things coming all the time. So I'm interested in going back to perhaps talking to developers or engineers.
Sylvain: How do you see the future of chaos engineering overall? As long as chaos engineering streamlines itself into the delivery pipeline, I think it's going to win. But if it stays that thing that lives in its ivory tower somewhere, it will struggle.
Anurag: Faults are going to be commoditized. It's going to be about what insights chaos engineering gives about the system. How do you measure, and what all can you measure? Karthik: Would you like to share some insights on the future direction of chaos engineering?
Miko: I would love to see us come together a bit more often. If we manage to figure out the human aspect a bit better, we'll be on the right track. And again, I hope to see more conversations happening around chaos and similar panels going forward.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Good morning, good evening, good afternoon, everyone who's tuning into this chaos engineering panel at Conf42 Chaos Engineering. I am Prithvi Raj, the host for this panel. I'm a CNCF ambassador working as a technical community manager at Harness, and I've been leading the community for the Litmus Chaos CNCF project for the last four years. Now, for the esteemed panel, I have with me four amazing folks from the chaos engineering community. Some of them have been part of the community since its inception and have been leading the community, the open source side of things, for a long time now. And I'm pleased to welcome Sylvain from the Chaos Toolkit community, Miko from the PowerfulSeal community, Karthik from the Litmus Chaos community, and Anurag from the Chaos Mesh community. And without any further ado, I will let each of them introduce themselves one by one. Let's start with Anurag.
Sure. Hello, everyone. Good morning, good afternoon, good evening, depending upon where you are. First of all, I want to thank Prithvi for inviting me to this session. I'm Anurag Palwal from Adobe, and for the past five years I've been working in a platform engineering team internally known as Ethos, where we are basically helping our service teams in Adobe to deploy their services on Kubernetes and also trying our best to make the best use of the CNCF projects as well.
Right, next up, Sylvain.
Hey, everyone. Thank you indeed for having me as well. For the past six years now, I've been leading the Chaos Toolkit project. It's been quite a fun ride because it's been good to see the market and the project and the community come together, having seen so many products, great products like we have today, coming up. And yeah, I'm really happy to have this crowd today because there's quite a lot I want to hear. So quite happy about it. So thank you very much again for having me.
Thank you so much, Sylvain. Next up, Miko.
Hi, everybody. Very nice to meet you. My name is Miko.
I've been somewhere in between platform engineering, SRE, and chaos engineering for the last decade, primarily working in finance. I initially started the PowerfulSeal project back at Bloomberg, I'm forgetting now, but a good few years ago, and ended up writing a book, Chaos Engineering, from Manning, published in 2021. Very nice to meet you all and looking forward to the panel.
Thank you so much, Miko. Next up, Karthik.
Hey, everyone, I'm Karthik. I'm one of the co-creators and maintainers of Litmus Chaos, which is a CNCF incubating project. I've been around for about the same time as Sylvain. I think we sort of started the chaos community together. I'm currently a principal engineer at Harness, focused on the Chaos module within the Harness platform. Prior to this, I was mostly on the storage side of things, focused on testing the resiliency of storage systems, and that's how the chaos journey began. So, yeah, very happy to be here and talk chaos engineering with you all.
Right, thank you so much, Karthik. I think the panel looks stacked, and we have a lot of chaos engineering insights for folks tuning in. But first up, I think Miko shared a lot about his journey and how he's involved in not just chaos engineering, but SRE practices and platform engineering, and how he has written a book on chaos engineering as well. So, Miko, we'll start with you. Just tell us about your early days in the chaos community. How did you get involved, and what's your experience been like as part of the chaos engineering community?
Sure thing. So for me, it was kind of very natural. So the year is 2015, sometime around there, and I landed on a new project back at Bloomberg, where we were basically trying to build a platform for quants on this brand new thing that had just landed called Kubernetes. And at the time, it was just a bunch of binaries that you had to kind of put together yourself following best practices that were in a README on GitHub.
So as we were building that, I quickly realized that first, we bumped into a few things that kept breaking. So we wanted kind of like a new way of testing against these behaviors. It's not unit tests, it's not integration tests. It's kind of like these dynamic behaviors that were appearing in the systems. And second of all, we found that actually trying to proactively find issues with this distributed system was making us sleep better at night, and being able to find things before they break on us became like the motto of the team that we worked on. And that slowly, over the next couple of years, evolved into what I later learned I was supposed to be calling chaos engineering. The result of that, I actually just had to look it up, but it looks like PowerfulSeal got open sourced about seven years ago. So somewhere around that time, we published this tool that was basically taking a YAML file and saying, okay, do this, and this, and that to my system, and verifying that it was still working. And then I discovered the rest of the community kind of building things around that, and different people, like Chaos Toolkit, who were basically building similar things in this domain. And we started talking, and that eventually actually was the reason for the first Conf42 Chaos Engineering back in 2020. So kind of like a natural way of, I keep joking that this is basically a sleep aid, this entire chaos engineering, to try to go a level above the standard ways of testing. And the community around that has been great. I met a lot of amazing people and had a lot of... so I'm very, very grateful for all of it.
Absolutely. Absolutely. Miko, I think you talk about seven years ago, and I joined the community like four or five years ago, and even till now, I've been seeing the exponential growth and the amazing people who have been coming up, contributing, making this community vital.
And I think chaos engineering has grown exponentially, to be honest, in the last two, three, four years, if I'm not wrong.
It certainly has. And to no small extent, thanks to ChaosNative and Uma and company. And now part of, you know, you guys contributed quite a lot to that. So kudos.
Appreciate it, Miko. So next up, Karthik, what are your thoughts? I mean, you've been part of this community for five years. You were one of the founding members of the Litmus Chaos community. Tell us more about how Litmus Chaos came into play. And with all these projects being there, what was the thought behind it?
Yeah, the motivations to carry out chaos engineering are, I think, very similar to what Miko just explained. I think we were at a similar point in our journey as an organization. We were trying to stand up a SaaS platform on Kubernetes, and Kubernetes was new to us as well. And we figured the best way to learn would be to start experimenting and causing failures and see how systems react, et cetera. So in that way it's very similar. And we did try finding out what tools were already available to us at that point. And like you said, that was around the time when all these tools were getting built, PowerfulSeal, Chaos Toolkit, et cetera. And we found that we did not have a consistent way to meet all our use cases. Of course, we learned more about PowerfulSeal and Chaos Toolkit after we started with Litmus, I should admit that. But one of the things we found at that point of time, one of the things that we lacked, was a homogeneous way to do chaos across different kinds of entities. So we were part non-Kubernetes, part Kubernetes, moving on to Kubernetes. And we saw that there were a lot of different kinds of tools. There was Chaos Monkey, which was mostly on cloud instances. We had a very nifty tool, I still look up to that tool today, called Pumba, which was doing a lot of container level faults.
And then there were a lot of scripts that we had written ourselves. There was an assortment, to be honest, and that was making management of our resiliency test suites very difficult. So we wanted a standard contract in the way we define a particular failure test and how I can actually implement it, how I can view the results, how I can sort of make it work in a pipeline. And all this sort of led us to go build what we called Litmus Chaos. It was actually part of the open source project called OpenEBS. So Litmus was just another repository inside the OpenEBS organization when it started. But over time, we learned more about the actual principles of chaos. So, like Miko said, we didn't really know the term called chaos engineering. It all started as a way to do fault injection, but we learned more, again thanks to folks like Sylvain and the initial version of the Chaos work group. We learned more about what chaos engineering actually entails, how it is assigned, and then we decided to sort of put a cloud native spin on it with custom resources and things like that. And that's how the journey began. And I think some of our choices were vindicated in the long run with all the feedback we got from the community. And of course, there were things to course correct along the way. But that was really how we got started. That was the inception of the project.
So thank you so much, Karthik. I think there was a lot of agreement I saw from the other panelists, and a lot of credit to Sylvain for starting this journey, and Miko for also being one of the initial members of this community. So next up, I'll just ask Sylvain. He's been sharing a lot of experiences on culture practices using chaos. So, Sylvain, just let us know about your journey, the inception of Chaos Toolkit, and how did it all start?
If you have like two or three hours ahead of you. Trying to squeeze that into a few minutes, it's really a shame.
And for disclosure, that panel should be way longer, because already it's quite interesting to me to go back through history with you. I started with Russ Miles, actually. Back then, Russ and I were working in a startup company called Atomist, and we left, and Russ said, look, there is that Principles of Chaos book that was released a few months or something earlier, from Casey Rosenthal and people from Netflix. And it captured at least the outcome of years and years of leading chaos engineering at Netflix. So we were really hooked by that. There was something that transpired, but we were less interested in the fault injection itself than in the reading of what it means to do chaos engineering. The whole experiment, learning curve, learning thing. And this is where Chaos Toolkit was trying to be at, say, can we glue this under a certain language? And we went declarative. We were quite Terraform in spirit, I guess. So can we go and have a description file that says, like Miko said, can we do this in YAML or JSON? Because to us, what was interesting was, can we learn from this? Right? Not how are we going to do the fault injection. There were already quite a few tools, like Pumba, indeed, and then the native Linux tools like stress-ng or very low level things. But we never wanted, and I never wanted, to stop at low level fault injection. To me, when you look at incidents, a lot of them come from someone having published a wrong configuration file somewhere. So it's nothing to do with your CPU or memory. So we thought, okay, can we create that format where any sort of experiment is a chaos engineering experiment and delegate to better tools for the fault itself, right? And let people drive those tools with a common interface. And this is where we came from.
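The declarative idea described above (a description file that names a hypothesis and a method, with the fault itself delegated to other tools) can be illustrated with a minimal Python sketch. This is not the Chaos Toolkit format or API. The field names (`steady_state`, `method`), the registries `ACTIONS` and `PROBES`, the `run_experiment` function, and the in-memory "system" are all hypothetical and exist only to show the shape of the approach.

```python
from typing import Any, Callable, Dict

System = Dict[str, Any]

# Hypothetical action registry. In a real setup each entry would delegate
# to a dedicated fault-injection tool instead of mutating a dict.
ACTIONS: Dict[str, Callable[[System], None]] = {
    "publish-bad-config": lambda s: s.update(config_valid=False),
    "remove-one-replica": lambda s: s.update(replicas=s["replicas"] - 1),
}

# Hypothetical probe registry: checks that define "the system is fine".
PROBES: Dict[str, Callable[[System], bool]] = {
    "service-is-healthy": lambda s: bool(s["config_valid"]) and s["replicas"] >= 2,
}

# The experiment is plain data, so it could equally be loaded from YAML or JSON.
EXPERIMENT = {
    "title": "Service survives losing one replica",
    "steady_state": ["service-is-healthy"],
    "method": ["remove-one-replica"],
}


def run_experiment(experiment: dict, system: System) -> str:
    """Check the hypothesis, apply the method, then check the hypothesis again."""

    def steady() -> bool:
        return all(PROBES[name](system) for name in experiment["steady_state"])

    if not steady():
        return "aborted"  # the system was not healthy before any fault was applied
    for name in experiment["method"]:
        ACTIONS[name](system)
    return "passed" if steady() else "deviated"


if __name__ == "__main__":
    print(run_experiment(EXPERIMENT, {"config_valid": True, "replicas": 3}))
```

Because the experiment is only data, swapping `remove-one-replica` for `publish-bad-config` needs no change to the runner, which is the point made about configuration mistakes being as valid a fault as CPU or memory pressure.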
Interestingly, the Chaos Toolkit format, and how could I say, its model, changed only once. Like three months after we started, we added something I'd missed, and since then it's never changed. I've added things, but never changed it. So it's a testimony that at least what we took from the Principles of Chaos was something meaningful. It was a great outcome after years of work at Netflix doing this. So it was quite interesting for us. What we didn't do, I think, as well as others after that, especially Litmus, was to promote that. And I think we'll touch on that later on, about how to grow from there. So the culture, just to finish on your question around culture, I don't have a particular culture at hand. The only thing that I do is I always treat people nicely. It's my only culture. I try to be kind. And when someone comes to me on Slack and says, I don't understand that, to me it's always about, I've got bad documentation, not people being stupid or anything. And that's the only culture I know of: being civil and kind. Because we're all learning, right? And that's the only thing I can do, honestly. So it's the way I see communities, but I'm not very good at driving the community itself, just me. So this is where I'm at. But what I find fantastic, and I promise I'll finish on that, what I find fantastic is I realize today how much we haven't talked together enough over the last few years, and perhaps the biggest fault here is we all do the same sort of thing and with the same spirit. The proof is we're here, and for the next phase, we should really spend more time together and talk. As a community, as a global community, we should be more aligned and talking more to each other. I think it's not even like we avoid each other. I don't want to sound like we are at war or anything, not at all. It's just I think we are in our lane and we have things to do and we just don't get the time. But it's a shame. We should talk more. Anyway.
I'm rambling as usual. You know me.
Absolutely. Sylvain, well said. Talking about it more and just the exchange of ideas, sharing thoughts, I think this is something that would have helped the community. I mean, I'll talk about the current state and my thoughts on the community as well. But Sylvain, as you said, I think we all are learning how to build the community and take it forward. But yeah, I think these panel discussions, meeting together, and the exchange of ideas are something that becomes vital. Moving on. Just talking about advocacy, I would ask Sylvain: you've been part of this community for seven, eight years, have driven a lot of advocacy behind it, given a lot of talks here and there. So what is your thought on how advocacy has helped chaos engineering reach that level? What's missing in that culture that people are still not adopting it? What are your thoughts or take on that, Sylvain?
It's a large subject, and if I had the answers, the definitive answer, I would be in a better position. In general, I think, like we were saying, probably as a community, as leaders of tools that we've created, we've not been good at joining forces into a bigger thing. So make it look bigger than it might be, right? I remember, I'm going to use one of the commercial tools that exist out there. When they started, we had discussions with them and we were telling them we need to have more players in the field. It's good to have competition at a commercial level and probably at the open source level as well, because the more you have, to a certain degree, the more it proves in a way that there is a valid need for this, right? Market, however you want to call this. So you need that. You need to have the players that sort of work together in a way to prove the existence, the validity of the effort, basically. But then you have to fight, like Anurag was saying, other fights. Like a lot of DevOps.
I'm going to use that word extremely loosely because I don't want to have any sort of, I don't have any definition. But I was actually recently looking at a Reddit thread where the person asked, is chaos engineering still a thing? And there are a lot of answers. And what struck me was how a lot of people did basically throw that away, dismiss the whole idea like it's a buzzword. Right? And I was surprised to see that in 2024 still, right? I would not have been surprised to see people saying, we'd love to do that, we just don't have the budgets and everything. And some people do say that, but there is still a lot of advocacy to do with the audience that we have in mind, whether it's SRE, DevOps, whoever, and the developers themselves. Right? So there's still a long way to go to not sound like we're trying to sell something that we find fancy. Right? And it does take a long time, either because a lot of people are more confident in themselves and what they put out. I remember talking some day to some engineers, or actually higher-ups, saying, we are quite reliable. Right. What do you do to know that? Well, we don't do anything. We have all the things that we need to tell us that we are reliable. Okay, let's see each other when you've had an outage. Right? So you do have that mindset in the audience we're targeting as a community. If they don't believe that it's useful, it's going to be difficult because they are your advocates, they are your champions. And from what we've seen in Chaos Toolkit, quite a lot of the time we've seen a single person becoming the champion, either for personal reasons or because we advocated. And you need to support that person. You need to give them everything they need to succeed within their own organization. But it has to be repeated with every single organization. So it's not really scalable. It's not easy to scale that because as open source communities, we don't have the bandwidth to be there for every single person.
So while I was saying that I'm kind and helpful whenever I can be, it doesn't mean I have the bandwidth to be the champion of the champion for every single person out there. It does take time and you end up not being responsive. So the community may think that you're not there for them, which is not the case. I would like to be there, but it does take time. So you do have that. And finally, as we were saying a little bit before, it's not prioritized. Until you see that prioritized in delivery teams, it's going to be so easily dismissed. Right? A lot of product deliveries just don't put resilience (it's not chaos for chaos, but reliability, however you name that) as an indicator of success. So you end up with, quite a few times I've seen companies where you have a team trying to champion it, saying, yes, we believe in that, we brought it. Sometimes we've even built our own platform, whatever we do. But we struggle. We struggle to make it a thing everywhere else. Even they internally struggle to advocate for it. So I think we still have a long way to go, but I think we are stuck in that. No, I'm rephrasing. We're not stuck. We've been cornered a little bit because of ourselves, but also because of, generally speaking, how the media, the market, however you want to see it, turned chaos engineering into something that is: do that when you're mature enough. So you need to look like Netflix to do chaos engineering, which I always fight against. I've talked to people who say, we are restarting our project, we for sure can't do chaos engineering. Yes, you can. You're just going to adjust to what you have and grow with it, because you're not talking about chaos engineering, you're talking about your operations. You want to operate right now the best you can. So it's something where you have to be relentless, continuously. I've got the same discussion now that I had six or seven years ago, and it's okay, I guess it's where we're at, right? It takes years to get there.
But then again, a lot of organizations still haven't even started on bringing in DevOps as a culture. So it's the same for everything, right? And finally, for the last two years, or at least the last year, project funding has decreased dramatically within organizations. So even when using open source tools, like the ones we have. Obviously that applies to commercial propositions for sure, but even open source tools suffer from that, because suddenly you have people turning their back and saying, well, I can't use this anymore because it's not prioritized anymore. And therefore you're left with, well, where do I go from this? Right? So it's a lot of things coming together. And while as a community we can push, there is a lot of the world that we can't control. So it's going to take time. It's going to take time.
Absolutely, Sylvain. I think the last point that you mentioned is how open source projects are not just open source projects. The way the idea of chaos suffers is people start using it, but then there's some hard call that's taken that there's no budget for this, or this is not our priority right now. I think it's a two-way suffering, where one way is the project suffers in terms of adoption or users, but the other way it suffers is also in terms of contribution. It just becomes stale. Or the company that's sponsoring the project believes that there's no interest in this anymore. Why do I need to sponsor this anymore? Why take it forward? So I think that's why you mentioned that chaos engineering is being cornered. And that is where I can partially agree with you. Yeah, with how things are moving and with this kind of a market, I think that's something that's evident for all open source projects out there. I mean, many projects out there, but, yeah, per se, chaos engineering as well. There's some impact that's there for sure. So with this, I would just move on to Karthik.
Karthik has been one of the members of the open source community, the chaos community, for some time now. And Karthik, I would just want to know from you, what exactly are the challenges that you have seen? I mean, you've seen the open source side of things. You have seen the enterprise side of things. And what would you like to list as the challenges that you have seen moving from open source to enterprise, or just the enterprise adoption side of things in today's world?
Yeah. Before getting into the enterprise adoption, et cetera, let me just dwell on the open source side of things for a few minutes. I think I resonate with the point that Sylvain made. You want to be there for everybody who is trying chaos, make them successful, but it's not a humanly possible task. Right? You will find someone on the other side who is interested in doing chaos, adopting your tool. You help them in all possible ways. They're able to get out a very technically appealing POC, which basically helps them sell the solution into, let's say, a couple of teams. And then there's no follow up on top of that, purely because of this reason. Right? No prioritization, no budget, et cetera. So you've spent a lot of energy trying to make someone successful. And of course, in that process you learn a lot from them about what's lacking in the tool, and you go back and build. That's how communities, those projects, grow. But then it stops at some point. That's probably the biggest barrier in terms of running with chaos. You will find some of that spilling over onto the enterprise scene as well. If you can imagine, this is the case with open source, where you're not really putting a cost on it. In the case of enterprise, it's sort of going to be exponential, right? Because there's a cost associated with it. So you're going to scrutinize it so much. Is this really needed for me?
So that cycle is very long. But purely from, let's say, a technical perspective, what do people ask for? Let's say they're all on board with respect to chaos engineering. They've got the buy-in from the management, they've managed to prioritize it, they've identified the target team or the domain or the services. They've gone ahead and allocated the infrastructure and budget for it all. That's good. Now, from that point onwards, how does chaos scale? Right? How successful is a chaos product there? There are a lot of operational constraints. I would say it's all good to start with some low intensity, low blast radius, simple kind of chaos scenarios. You might be, let's say, doing the hello world of the Kubernetes chaos engineering space. Right? That's the pod delete. You do that, you're very good. You actually can convince people to derive a lot of value out of a pod delete experiment. It sounds very innocuous, but it's actually very useful. That's what we found in our journey. So you get a lot of value. But from then on, the person who is very enthusiastic about wanting to do more and more chaos hits a roadblock with some of the other faults that they would want to do, something that's, let's say, more involved. They want to do a network packet drop or they want to do something that's more, let's say, a storage level I/O latency or something like that. You're entering into territory which wants you to run privileged containers. You would basically be adding a lot of container capabilities to be running it. You are now facing friction with the security teams, who come and tell you that we are not going to allow this unless this is really important. And then you sort of have to convince them. Except for a lot of verbiage and a lot of documentation and explaining, you don't really have any other sort of armor in terms of how you push those kinds of faults into their environment. Right? So those are some of the things that we see.
And then, of course, large organizations, when they try to operationalize chaos at, let's say, a larger scale. Now we are talking about more mature organizations who really recognize the need for chaos and they're okay to run with it. They're ready to give all these kinds of permissions and everything. They have requirements that will be something like, you will have a certain group of people who should be allowed to do chaos on, let's say, only certain services, only at a certain point of time, using only a certain set of roles associated with them. You might want to introduce some kind of freeze windows. You might want to go ahead and simplify the visualization around how much chaos is happening. Of course, all the dashboarding and integration with different APM tools, these are par for the course. You sort of hear them in most conversations. But there are a lot of other ways in which they want you to aid them with respect to how they carry out chaos operations. Right? They are very concerned about the security and impact of the chaos. They want to gain the benefit of chaos engineering and chaos experiments, but they want to be very careful in how they implement it. So there are a lot of constraints that come that way, and they look for tools to alleviate those kinds of problems. The other thing is, I would say when you have more folks coming in to do chaos engineering, there are a lot of perspective differences in terms of what they want to get from it. Reliability is understood differently by different people. Right? So how do you quantify the impact of chaos? So when you say calculating KPIs for chaos, that's a very broad topic. You might have seen in the Chaos Carnival, different people came with their own notion of how to measure the impact of chaos. So that changes from organization to organization. And how are you catering to that group who's looking for that insight from their application or from their teams?
How do you go and put a number that says, this is how impactful chaos has been for you, this is how effective chaos has been for you, and this is how you go from here? Right? And many organizations who are starting out with chaos expect some kind of solutioning as well. So you should not just be ready to say, this is how my tool works, but you should also try and understand how the tool can work for them. Right? And what are their use cases? And they also look for guidance on at what point you can actually help with chaos engineering. With all the services that they have right now, how can you sort of ease it into the system and how best to sort of sell it to their own management? If they want to scale it, they look for certain solutioning guidelines as well. So this is what we sort of deal with on a regular basis when you are trying to take the enterprise route. But I think a lot of this is worth it. A lot of these conversations give us a lot of insights, and of course, a lot of those insights translate into features or capabilities and move into the open source project again, because that's the karma, giving back thing. So I think that's where we are now in our journey, and we are learning, I would say.
Absolutely, Karthik. I think one point you mentioned was about reliability and how chaos is seen as part of reliability. I think there has been a vast definition to reliability, and a lot of people are trying to mix chaos with a lot of things. You see, I mean, the SLOs, and then there's incident management, and then there's so much more that's coming under the paradigm of reliability. And that is how modern chaos engineering has changed. And that's where I go to my next question, to Miko. I mean, Miko, you've been seeing chaos engineering for so long now, and you've been part of other practices as well, SRE, platform engineering. How do you see modern chaos engineering today?
I mean, there's generative AI as well, which is being introduced right now to move chaos forward. So what's your thought process on modern chaos engineering and how chaos has changed from back then to today? I love your question. And predictions have that nasty thing about them that they are almost always false. But I'm going to venture something because it's such an interesting question. A caveat to that is that I never actually had a commercial chaos engineering thing, so I can't really speak to what Sylvain was talking about, when you're trying to convince people to use that product. I was just sharing the things that I was doing, so it was somehow easier. But the response was very often exactly what he described, which was, oh yeah, this is such a cool idea. We're not going to do that. We're not mature enough. But I think there are basically two elements that are kind of now on a collision course together. I think one of the elephants in the room that I noticed a few years ago was that managing tooling to implement these things was the easy part of the equation. The harder part of the equation was actually making sense of the data that you found and doing all the things that you need to do to implement these lessons in practice, because a lot of that wasn't even technical. Like you said, sort of an initially messed up config. There's no tool that's going to prevent you from doing that. So you had that one elephant in the room, and then on the other side of the collision course is what you mentioned. With AI, we're at a very interesting time in the evolution, I think, of humans in general.
Like last year, effectively, our species figured out that if you take a big enough corpus of text, like 1 trillion tokens or something, and then you mush it up for long enough with a stash of GPUs whose prices just skyrocketed because mining wasn't enough, all of a sudden you can have this new type of software that can summarize text, that can kind of understand intent, that can do a lot of things that are creepily human-like. And I think for us, in the SRE space and incident management and chaos engineering, that creates a massive opportunity to solve at least a part of that problem, where it's easy to get the data but much less easy to actually make sense of it and summarize it, because typically it's a lot of data, and to actually leverage these large language models to do some of that harder work for you. And I think what I was picturing the other day is, why am I still writing bash all those years later? I'm still terrible at writing bash, and every now and then I have to look up how to do a loop, because that's bash for you. Why am I not talking to a small optimized LLM that can run locally and doesn't cost the price of a house in electricity to run? And if we apply a similar logic to understanding all the signals that we already have from all the tooling that we have for chaos and for SRE and in general for observability, I think we could really take it to the next level, and make a lot of these arguments about, oh, but this is work, and we don't have budget for that, somewhat moot, because you can do it automatically. So I'm super hopeful about that. I've not been in the AI space for very long, but last year I kind of realized that this is not like crypto, a hype that's going to die out. This is a legitimate new discovery, and that kind of human-like interaction actually turns out to be not as difficult to achieve as the sci-fi writers had us believe.
So I'm both creeped out about this and super excited, and I think that, well, I would love stuff like that to come out of our community, and I think it will. So, yeah, here's a prediction for you. Let's see how this ages. Absolutely, Miko. I think with things changing, and as you mentioned, over the last couple of years, how we have seen AI come in, I think we'll see a lot of change. And as you mentioned, it's not like crypto, where the hype will come up and die down. If you talk about chaos today as well, we've still been referring to what was done initially, Chaos Monkey, the Principles of Chaos. I think that still stays the same. So hopefully that stays alive tomorrow as well. So with this, I think, yes, Miko? Yeah. Don't get me wrong. Right now everybody's slapping AI on everything. Seems like this is the way to raise funds at the moment and all of that. So that bit will go away eventually, but I think there is a legitimate part of it that's definitely shifted the landscape. Yeah, absolutely, absolutely. So with this, we'll just cover 1 minute each of sharing what the future looks like. I think Miko has given a lot of insight. So before we move on to him giving the final future prediction direction, let's get an insight from Sylvain. What do you think about the future direction? Well, yeah, like you said, Miko, predictions. All right, future direction for Chaos Toolkit. I think I want to spend a lot of time trying to care for it where I couldn't for the last few years, because I was a little too busy with Reliably and all other things. So I think I owe the community a little bit more care for the product. So it's going to be essentially documentation. It's going to be a lot of tidying up. If I could rewrite that in Rust, not because I think Rust is better than Python, not at all, but purely because I wish I could just drop a binary and people could just run it, whereas with Python it's a mess. But bits and pieces like that.
So basically making the project, I think, more... I would like it to be a bit more developer centric. So go back to the developers, just because, and that goes back to our discussion earlier. There is one thing we didn't mention, but I have a sense that there is a fatigue of new things coming all the time, and there is a fatigue generally across the board. And I sense a little bit of, what do we do with all of this? So that's why I'm interested in going back to perhaps talking to developers, or engineers to be more global, and try to be out of their way and not promise too much, and just deliver that one tool they need, that does the thing that they want at the time they need. And that's it, right. Not more. So being more focused, lean and out of their way, and not necessarily a full-on platform. I guess that would be where Chaos Toolkit tries to be back. Yeah, good insight, Sylvain. But just one more question. How do you see the future of chaos engineering overall? There was one thing about Chaos Toolkit, but what's your comment on... Yeah, okay, so I completely, obviously, missed the question. So I'm sorry for this. To me, chaos engineering, I agree with what has been said, that it's still central. It's still central because we are glorified, when we... I say we as engineers, globally, what we do is we integrate with a gazillion third parties. Right? You want to build a platform, you're going to use Stripe or something like this for your payment. You want to use other services like this. So you end up connecting things together all the time. So if you don't consider this as a problem at every milestone, at every step, you're missing something as an engineer. So I think, I hope, that chaos engineering will be recognized for this. It's not a buzzword, it's not a vendor trying to push things, but it's a legitimate action that you do on a daily basis. That's why I'm focused.
But generally speaking, I hope chaos engineering, I'm going to use the abused verb of shifting left, which I don't like, but the idea of bringing chaos engineering closer to what engineers need to do to deliver their value, right. Not a thing that comes on top of it, but that supports what they need to do. So I think as long as chaos engineering streamlines itself into the delivery pipeline, I think it's going to win. But if it sticks to being that thing that lives in its ivory tower somewhere, it will struggle. It will struggle. So hopefully, to me, chaos engineering is going to be something that the masses, in a way, can just take and do, and that will make it. Absolutely. Anurag, some thoughts from you. What's next in your chaos engineering journey? How do you see chaos engineering? Thank you so much, Anurag, for those insights. I think a lot of enterprises or folks who are getting started, trying to see your direction, will get some insight from your thoughts. Karthik, would you like to share some insights on the future direction, according to you? I like the point about documentation. That's a sure-shot candidate for us to improve on in Litmus as well. I don't know if documentation is ever going to be not a problem. So it's always going to be something that we need to continuously keep working on. And that's really the top of mind thing for us as well. On predictions, I don't know if this is a prediction. This is probably something that's already out there. It's known: chaos engineering is just going to commoditize faults. It's more than fault injection and stuff. It's going to be about what insights it is giving about the system. So how do you measure, and what all can you measure, and let your developers or SREs get value out of that particular chaos experiment run? You don't want to be running an experiment, investing time and tooling, to not learn more about the system. So how much can you learn about your system by running that experiment?
And what all can you measure, not just over the course of one experiment, but across your entire chaos journey? How are you going to quantify it? Those are probably the questions that we're already being faced with, and there are a lot of different folks attacking it in different ways, but I think that's really critical, and probably that is also going to, in a way, I think, motivate people to do more chaos engineering, if they can sort of taste how much they will learn about the system, how much they can learn about it by, let's say, doing a chaos experiment. These are all the metrics that I've got. These are the places where probably I wouldn't have looked otherwise at all. Right. That's basically the original intent of chaos anyway. Probably also why CNCF put all the chaos tools in the observability category when they started out in the original. So that's probably the most important area in which the chaos tooling or the general discourse about chaos is going to evolve. Probably. All right, thank you so much, Karthik, for some insight. Before we close the panel, some conclusive statements from Miko on his thoughts. He's given a prediction, but his thoughts on the future. What are the changes he would like to see coming in? Thoughts? Yeah, so I ventured one already about the artificial intelligence. I would also like to venture one about the very human intelligence. I think it goes back to what Sylvain was saying, that we kind of all run in different directions and try to do things. And I would love to see us come together a bit more often, and maybe that would be enough to generate a little bit of hype and generate a little bit of an agreement to kind of accept this as a boring practice that's no longer crazy to do. And we're doing some of that with Conf42, shout out to Mark, my brother. We're doing some of that with SRE Day, but I'm not sure what form this should have. But I definitely would love to see more of that cooperation.
And I think basically what I'm saying is that the tools are there and we've figured out most of the technical aspect, and if we manage to figure out the human aspect a bit better, we'll be on the right track. Absolutely. Absolutely. Miko, very well said. I think there are so many projects that are coming up, I mean, supported by the CNCF, open source, on the enterprise side of things, so many events, talks. I mean, we are seeing Conf42 Chaos Engineering for the fourth year, fifth year running, and then there's Chaos Conf, Chaos Carnival, so many meetups, webinars happening. So I think these things have obviously led chaos engineering until now. And we hope to see that over the next at least four or five years, maybe, chaos engineering, with all the modern practices, with all the evolution, will still exist. But again, thank you so much, everyone, for joining in the panel. And I hope all of those watching just got some insight, are getting started with the practice, or have more inputs for the community. And again, I hope to see more conversations happening around chaos and similar panels going through. Thanks everyone.

Conf42 Chaos Engineering 2024 - Online

February 15 2024
