Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Very happy to be here, very happy to join Conf42. My name is Long Zhang and today my topic is maximizing error injection realism for chaos engineering with system calls. Of course, we could replace "maximizing" with "improving" here, because I can't claim that I have maximized the error injection realism, but it is still interesting to explore how to improve the realism of chaos engineering experiments. About myself: currently I'm doing my PhD at KTH Royal Institute of Technology in Sweden, and before that I was a software engineer at Tencent in China. I'm very proud to say that one of my main research topics is chaos engineering. So it's really nice to join a conference such as Conf42 and discuss all kinds of novel findings in chaos engineering with everyone here.
So, during my research work I have found a lot of nice talks, tutorials and also nice tools, and I find that people are also talking about the challenges of chaos engineering nowadays. Here are some challenges I found. The first one: how to evaluate different aspects of resilience. Chaos engineering can be done at different levels, like the operating system level, the network level or the application level, and then we need extra observability of an application and have to collect different kinds of metrics so we can precisely describe how our system behaves during a chaos engineering experiment. Evaluating the resilience of an application at different levels and in different aspects can be quite challenging. The second one: how to automate the chaos experiments.
Though we have lots of awesome tools today, and they do help a lot with chaos engineering experiments, it is still challenging, given an application's concrete scenario, to make this automation really helpful and efficient for developers: how we automate the workflow, how we learn from the chaos engineering experiments automatically, and how we use the learned knowledge for future chaos engineering. This still needs more research work. And the third challenge: how to improve the efficiency of chaos experiments.
For example, if we are going to explore a really large fault injection search space, we have a lot of different combinations of failures to inject in production, but given the limited experiment time we can't really explore every possibility. So how do we improve the efficiency of these chaos experiments, how do we learn as much as we can during an experiment? Exploring this can be quite helpful for our future chaos engineering experiments. Today I would like to focus only on this third challenge, how to improve the efficiency of chaos experiments, and I believe there are lots of different directions and solutions for it.
One of the ideas we came up with is to synthesize realistic failure models for chaos engineering. Let us take an example: if we have lots of different failures to explore and the experiment time is quite limited, and this will also be done in production, it is sort of risky to inject so many different kinds of failures, especially failures that may never actually happen in production. So how do we synthesize realistic failure models, errors that are actually possible to see in production, and then use chaos engineering experiments to learn more about how the system behaves under such an error, such a failure? This could be quite interesting, and then we can learn more in limited time. Based on this idea we would like to do some more research, and to be precise and to go deep, we would like to narrow down the scope: we want to focus on system call invocations at this moment. Of course this idea can also be used and explored at other levels. For example, at the application level we could synthesize failure models with respect to Java exceptions and then use them in chaos engineering experiments; that should also work. But today I would like to present only our research work at the system call invocation level, about how to synthesize realistic failure models for chaos engineering experiments.
Next, before going into the details, I would like to show you a short demonstration so we can have an overall picture of system call invocations and of the naturally happening errors at the system call invocation level. First of all, about system calls: a system call is a fundamental interface between an application and the operating system, let's say the Linux kernel. Usually the critical resources, such as the network, disk storage, or thread scheduling, et cetera, are controlled by the operating system kernel. When an application needs resources in these categories, it needs to invoke specific system calls. So the application invokes a system call, control transfers from user space to kernel space, and then the kernel can arrange the resources for the application. In the Linux kernel there are more than 300 different kinds of system calls, and Linux also provides more than 100 different error codes, so the kernel can precisely report different kinds of errors when a system call invocation fails. Imagine this as a big combination space: we have hundreds of system call types and hundreds of error codes. If we want to do experiments at the system call level, if we want to inject system call errors on purpose, it is quite time consuming to try to combine the different system call types with the different error codes, and actually some of the combinations are not realistic and can never happen in a production environment.
So we need a better strategy to generate error injection models, and then we can pick the most interesting errors to inject into our system and learn about its resilience.
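To get a rough feeling for the error-code side of this combination space, here is a tiny illustrative check using Python's standard errno module (just an illustration for this transcript, not part of the research tooling):

```python
# Quick look at the error-code side of the combination space: the errno module
# lists the error codes defined on this platform (well over 100 on Linux).
import errno

print(len(errno.errorcode))           # number of defined error codes
print(errno.errorcode[errno.ENOENT])  # 'ENOENT', "no such file or directory"
```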
Okay, so about the demo. Now I'm going to switch to my terminal. Here we have a script to monitor all kinds of system call invocations for us. Currently I'm using a piece of software called OBS to record this video, and when you watch this video or listen to it, I believe you won't notice anything abnormal: the picture is good and the sound is quite good. This means OBS is functioning correctly. Behind this, OBS must interact a lot with the system, and it invokes lots of different system calls.
So let's take a look. There is a script called system call monitor. Given the process name OBS, I would like to use milliseconds as the unit, I also want to monitor the latency of the different system call invocations, and I ask it to report the metrics every 1 second.
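For readers of this transcript, here is a minimal sketch of how such a per-syscall error counter can be built with eBPF and the BCC Python front end. It is illustrative only, not the exact script used in the demo: it assumes root privileges and the bcc package, and it leaves out the latency measurement and the per-process filtering mentioned above.

```python
#!/usr/bin/env python3
# Minimal, illustrative per-syscall error counter built on eBPF via BCC.
from time import sleep

from bcc import BPF
from bcc.syscall import syscall_name

prog = """
#include <uapi/linux/ptrace.h>

struct key_t {
    u32 pid;
    u32 nr;   // system call number
};

BPF_HASH(calls, struct key_t, u64);
BPF_HASH(errors, struct key_t, u64);

TRACEPOINT_PROBE(raw_syscalls, sys_exit) {
    struct key_t key = {};
    key.pid = bpf_get_current_pid_tgid() >> 32;
    key.nr = args->id;

    calls.increment(key);
    if (args->ret < 0) {      // a negative return value carries the errno
        errors.increment(key);
    }
    return 0;
}
"""

b = BPF(text=prog)
print("Counting system call errors, Ctrl-C to stop.")
while True:
    try:
        sleep(1)  # one monitoring interval of 1 second, as in the demo
    except KeyboardInterrupt:
        break
    for key, err in b["errors"].items():
        total = b["calls"].get(key)
        total_count = total.value if total else err.value
        rate = 100.0 * err.value / total_count
        print("pid=%-6d %-16s errors=%-5d total=%-7d (%.1f%% failed)"
              % (key.pid, syscall_name(key.nr).decode(), err.value,
                 total_count, rate))
    b["errors"].clear()
    b["calls"].clear()
```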
Okay, so now we can see that, yes, OBS indeed does lots of system call invocations. The types are here, futex, read, write, et cetera, with different invocation counts. Most of the system calls succeed, but we can still see that some of them failed, and we can see the percentage of these failures. Behind these failed system calls we can also see the error codes, for example futex failing with EAGAIN. Let's see: yes, most of them are futex failures, but if we keep watching this for maybe one or two minutes, I think we will have more failures to observe. For example, this one, execve, failed four times in one second, and all of them returned an ENOENT error code. So this is just to give an idea: we can feel the presence of the system call invocations behind one application. Every application indeed does lots of system calls, and some of them fail naturally.
So this is the first idea. If we want to make it a bit more complex, we can keep monitoring the application, keep extracting these metrics and put them into a database. Over a long time we then have a series of monitoring metrics at the system call level, and really diverse information about the naturally happening errors at the system call invocation level. For example, this is a dashboard. We deployed an email server into production and just kept it running as usual, and we also registered different email accounts to different mailing lists, so we do collect some natural traffic here. Then, by attaching the system call invocation monitor, we can collect metrics like the failure rate, the total number of system call invocations, the latency of the system call invocations, et cetera. You can see here that many different kinds of system calls failed with different error codes, EAGAIN, ECONNRESET, et cetera. This can be quite helpful and interesting for us to see: even though the email server is functioning, we still have lots of different errors at the system call level, they return different error codes, and the failure rates are quite different. Sometimes they fail a lot, and sometimes it's really rare to observe a single error in production.
So there is a question to be discussed here, because these are errors. An error can be handled by the operating system, it can be remediated, but if the operating system cannot handle the error, it usually wraps the error and reports it to the application. For example, if a system call invocation fails and the Linux kernel cannot handle it, the error is reported to the application; in a Java application it can surface as a Java exception, and then we can write a try-catch block to handle this concrete error scenario.
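As a tiny illustration of this error propagation (shown in Python rather than Java purely for brevity), a failing system call surfaces in the application as an exception carrying the kernel's error code:

```python
# A failed system call propagates to the application as an error code; in a
# high-level language the runtime typically wraps it in an exception. Python is
# used here purely for brevity; the JVM does something analogous for Java.
import errno
import os

try:
    os.open("/definitely/not/there", os.O_RDONLY)  # openat() returns -ENOENT
except OSError as e:
    assert e.errno == errno.ENOENT
    print("openat failed with", errno.errorcode[e.errno])
```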
So an interesting question is: these errors may be rare to observe, but if they happened far more frequently, in a more severe scenario, how would our application behave? It can be quite interesting to explore this using chaos engineering experiments.
Okay, let's come back to the slides. That's why we would like to do some research at the system call level, analyzing the realistic errors and generating realistic error models for system call invocations. We call our research prototype Phoebe, and this is an overall picture of its design. We start from here: we have an application in production, deployed normally, and the application interacts with the kernel space using different kinds of system calls. Phoebe attaches two different kinds of monitors. The first one is the natural error monitor: it uses some nice features provided by the Linux kernel, which I will introduce later, to observe different kinds of natural errors, and it reports them to our error model synthesizer. There is also another monitor, the application behavior monitor, because we do need some application-specific metrics like response codes, or maybe some performance-related metrics such as CPU usage and heap memory usage, et cetera. Since we would like to compare how the application behaves when an error is injected, we need some evidence about the steady state. Based on the information from these monitors, the error model synthesizer outputs realistic error models based on the naturally happening errors. Then Phoebe's orchestrator can conduct chaos engineering experiments. It uses a system call injector: this injector intercepts some system call invocations and replaces the return code with our predefined error codes. Finally we get a resilience report about how the application behaves when we actively inject specific errors at the system call level. And of course we can visualize the results using a visualizer, just like the Grafana dashboard I showed you before, which gives information about system call invocations and error rates.
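For the curious, here is a hedged sketch of the interception technique such an injector can rely on: an eBPF kprobe on the openat() syscall entry that overrides the return value with -ENOENT for a fraction of the calls. It requires a kernel built with CONFIG_BPF_KPROBE_OVERRIDE and root privileges, and it is only an illustration of the mechanism, not Phoebe's actual injector, which also targets a specific process and follows the synthesized error model.

```python
#!/usr/bin/env python3
# Illustrative system call error injection with eBPF's bpf_override_return().
# This sketch affects every process on the machine, so only run it on a
# disposable test box; a real injector would filter by target process.
from time import sleep

from bcc import BPF

prog = """
#include <uapi/linux/ptrace.h>

BPF_ARRAY(seen, u64, 1);

int inject_enoent(struct pt_regs *ctx) {
    int zero = 0;
    u64 *count = seen.lookup(&zero);
    if (count == NULL)
        return 0;
    (*count)++;
    if (*count % 20 == 0) {            // fail about 1 in 20 calls (5%)
        bpf_override_return(ctx, -2);  // -ENOENT
    }
    return 0;
}
"""

b = BPF(text=prog)
# get_syscall_fnname resolves the architecture-specific symbol of the syscall
# entry point, e.g. __x64_sys_openat on x86_64.
b.attach_kprobe(event=b.get_syscall_fnname("openat"), fn_name="inject_enoent")
print("Injecting ENOENT into about 5% of openat() calls, Ctrl-C to stop.")
try:
    while True:
        sleep(1)
except KeyboardInterrupt:
    pass
```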
So a question here is how to synthesize realistic error models. We have now got natural errors, and we can analyze them: what is the error rate, what is the error code, et cetera. Then we have to use different strategies to generate realistic error models. Our idea here is to amplify the error rate, because given the limited time, we would like to learn as much as we can in a chaos engineering experiment. So we need to somehow amplify the error rate while still keeping it realistic. This can be done by keeping the type of the system call the same and slightly increasing the error rate, the rate at which this specific error happens, so we can observe enough data in the chaos engineering experiments. Before doing that, I would like to propose some definitions.
First, a system call invocation error, which we define as follows: an invocation of a system call is deemed an error if it returns an error code. Then we need to define the monitoring interval, because we keep reporting these metrics to our database, and this reporting is based on the monitoring interval: a monitoring interval is a period of time along which the metrics of interest are collected. Based on these two definitions, we can define the error rate. The error rate is the percentage of errors of a given system call type with respect to the total number of invocations of the same type. For example, if in one second there are ten open system calls and one of them fails with an error code, for example ENOENT, then the error rate is 10%. This is information we can use when generating error models later, because we can amplify this error rate. The error rate was observed in production, so it serves as a baseline for us.
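In code, the error rate of one (system call type, error code) pair over a monitoring interval is simply the following (a trivial illustrative helper, not Phoebe's implementation):

```python
# Trivial illustration of the definition above: the error rate of one
# (system call type, error code) pair over a monitoring interval.
def error_rate(failed_invocations, total_invocations):
    if total_invocations == 0:
        return 0.0
    return failed_invocations / total_invocations

# Example from the talk: 10 open() calls in one second, 1 failed with ENOENT.
assert error_rate(1, 10) == 0.1  # 10%
```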
Now, talking about the amplification strategies, we have some strategies predefined. If we have a set of error rates with respect to a system call type and an error code e, we can calculate the maximum value of the error rate over the observation phase and also the variance of the error rate. If the maximum error rate is quite low, it means it is pretty difficult to observe such an error in production: it is rare and does not happen frequently. We categorize such errors as sporadic errors, and to observe them, and to learn more about how our application behaves under them, we amplify them to a fixed error rate. This fixed rate is predefined, of course, and usually we just use a number that is large enough for our chaos engineering experiments, so we get enough data to evaluate how our application behaves when this error happens. There are also errors that are quite fluctuating: the maximum value is relatively large, so we observe these errors in production more frequently, but the variance of the error rate is also larger than a certain value. This means that sometimes this system call invocation error happens quite a lot, but sometimes it's difficult to observe at all. To amplify this kind of error, we keep using the maximum value of the error rate: this error rate does exist in production, and by keeping it we somehow increase the frequency of these fluctuating errors. The last category is steady errors: the maximum value of the error rate is quite large, but the variance is not that large, so we can always observe these errors at a relatively steady frequency. That's why we multiply the maximum value by a factor: we are interested in increasing the rate at which these errors happen, just to see whether there exists a threshold for our application. In production the error happens at a certain frequency, but if we do a chaos engineering experiment and increase the error rate, how does the application behave? Is it still resilient to this error, or is there a threshold after which the application behaves abnormally, et cetera? Then we can use experiments to explore it. Of course, these are fairly basic amplification strategies. We could also define more complex strategies, for example combining different scenarios or considering the propagation of errors, et cetera; but that is another story.
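Put together, the three strategies can be sketched as follows; the thresholds, the fixed rate and the multiplication factor below are illustrative assumptions, not the values used in Phoebe.

```python
# Sketch of the three amplification strategies described above.
from statistics import pvariance

MAX_RATE_THRESHOLD = 0.05  # below this maximum, an error is "sporadic"
VARIANCE_THRESHOLD = 0.01  # above this variance, an error is "fluctuating"
FIXED_RATE = 0.5           # amplified rate used for sporadic errors
FACTOR = 2.0               # multiplier applied to steady errors

def amplified_error_rate(observed_rates):
    """Given the error rates observed in production for one (system call type,
    error code) pair, return the error rate to inject in the chaos experiment."""
    max_rate = max(observed_rates)
    variance = pvariance(observed_rates)

    if max_rate < MAX_RATE_THRESHOLD:
        # Sporadic: too rare to learn from naturally, so use a fixed, larger rate.
        return FIXED_RATE
    if variance > VARIANCE_THRESHOLD:
        # Fluctuating: reuse the maximum rate that really occurred in production.
        return max_rate
    # Steady: push beyond the observed rate to look for a resilience threshold.
    return min(1.0, max_rate * FACTOR)
```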
So, as a research prototype, we would like to evaluate how Phoebe performs when we use it to monitor system calls and also to inject system call invocation errors. We have different evaluation targets. For example, Hedwig is an email server, the one I showed you during the demonstration. We deploy Hedwig into production and then use it to send and fetch emails in a normal way, so we can calculate the steady state, of course, and also have some realistic traffic there. Based on the sending and fetching of emails, the number of failures, et cetera, we know how the system behaves without injecting any errors. But based on our observation, we can still see some errors that happened naturally in production. For example, in the table on your left we observed almost ten different kinds of system call errors. Some of them happen quite a lot; for some of them the error rate is quite low. These are interesting findings for us, and we would like to explore more about how Hedwig behaves if we inject more failures with respect to the different categories of system calls. So that is the first phase: we observe the system's behavior and collect all kinds of naturally happened errors. Based on these errors and on the amplification strategies I introduced just now, Phoebe outputs realistic error injection models for the subsequent experiments. Then we keep injecting different kinds of invocation failures, and we try to send and fetch emails again to see how Hedwig behaves. And this is the result.
Based on the result table, Hedwig has different capabilities for bearing these injected errors. Some injected invocation errors directly crashed the server, or caused state corruption, meaning that even after we turn off the error injection the email server still behaves incorrectly. But Hedwig is quite resilient to some of the system call invocation errors: even if we keep injecting them with a larger error rate than the original one, Hedwig is still resilient and provides the correct functionality. So these are some findings of the experiments on Hedwig. We also did some experiments on Ttorrent.
Ttorrent is a Java client used to download files with the BitTorrent protocol. Again, as a normal user we use Ttorrent to download files, we observe Ttorrent's normal behavior and collect all the naturally happening errors, and based on these naturally happening errors we generate error injection models for our chaos engineering experiments. This is the result of the evaluation experiments on Ttorrent. Similarly, we find that Ttorrent is resilient to some of the system call invocation errors, while some other system call invocation errors are quite critical to Ttorrent's functionality. Based on this resilience report, we can further explore how to improve the application's resilience when such system call invocation errors happen. We can also explore the relations between system call invocation failures and the application's behavior, or the exceptions in this specific application. So there are lots of interesting topics to investigate. Another important aspect is the overhead, which is low thanks to the feature we used in Phoebe.
So now I'm going to introduce it a little bit. The feature is called eBPF, and it is provided by the Linux kernel. By using eBPF it is quite easy to capture system call invocation information, and you can also intercept system call invocations to rewrite the return code, et cetera. eBPF is actually quite powerful: you can use it to get really nice observability of the Linux kernel.
An interesting story behind this is that I learned about it when I joined Conf42 last year. There was a workshop there about Linux observability and chaos engineering, and I learned a lot: I learned that eBPF and its front end BCC are really helpful to improve observability and get lots of nice insights into a running system. So we came back to Sweden and decided to explore more in this direction, then we came up with this research idea and finally we wrote a paper about it. It's really nice: we joined Conf42, shared with each other, learned a lot, and now I can come back again to tell you, here are my findings and here is some extra work we have done this year. I think it's quite interesting and I'm very happy to join Conf42 again. The slide I would like to show you here is the overhead of Phoebe: thanks to these nice kernel features, we don't have a very high overhead, even compared to other research prototypes, and even though we ran the experiments in a production environment. The response time, the CPU load and the memory usage are only slightly increased, not that much. I would say this makes it promising: with further improvements we can lower the overhead even more and try to learn more about the system calls, or other useful information at the system call level or at the operating system level.
But as a research prototype, Phoebe of course has its limitations, and lots of future work is needed to explore this field. For example, Phoebe is designed for Java applications. This is because when we do chaos engineering experiments we would like to have multi-layer observability: we want to collect metrics at the operating system level, metrics about the JVM, and also some application-specific metrics such as response codes, response times, et cetera. That's why we only evaluated Phoebe on Java applications. But I think the concepts are applicable to other software stacks: for all kinds of executables on a Linux operating system, we can always use Phoebe to monitor their system call invocations and to summarize the naturally happened system call errors. This means some components can be reused with minor adaptations, such as the monitor, the visualizer and the system call error injector. The second limitation is that the JVM itself, to be honest, may either create natural system call errors or remediate some errors directly. So this is only our first step, but in the future I think it would be interesting to explore the interplay between the JVM and the application's error handling mechanisms with respect to system call errors.
Okay, so the next page is a summary of our research work on Phoebe. Basically, we contribute some original insights about the presence of naturally happening system call errors in software systems. We also contribute the concept of synthesizing realistic error injection models for system calls based on amplification strategies, amplifying the naturally happening errors observed in production. And we implemented Phoebe as a research prototype of this concept. The corresponding paper is called "Maximizing Error Injection Realism for Chaos Engineering with System Calls". If you are a fan of reading papers, you are more than welcome to give us feedback on what you think of it.
Finally, about Phoebe the framework: we have already made it open source. So if you feel interested in Phoebe, and also in how we address the first two challenges I mentioned in the beginning, you can take a look at our royal-chaos GitHub repo (KTH is the Royal Institute of Technology, hence the name), which contains all of our research work in the field of chaos engineering: papers, source code and also experiments. Feel free to take a look and to create issues if you have further questions.
Okay, so as a summary: during this talk I presented a short demonstration so we could get an overall picture of system call invocations and the naturally happening errors behind them. Then I presented the design of Phoebe, which takes the naturally happening errors as input and automatically synthesizes realistic error models for chaos engineering experiments. In Phoebe the most interesting and important idea is error rate amplification: we have different strategies, and we have categorized the errors into different types. We also did some evaluation experiments with Phoebe on different applications, so we have some interesting findings, and based on those findings we have summarized the cool ideas and also the limitations of Phoebe. The last point: take a look at our royal-chaos GitHub repo if you feel interested. Basically, that's all of my talk today, and I hope you enjoyed it. If you feel interested, you are welcome to contact me at any time. I look forward to discussing more with you, and let's keep in touch. Thanks a lot. Have a nice day.