Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Very happy to be here, very happy to join Conf42. My name is Long Zhang and today my topic is maximizing error injection realism for chaos engineering with system calls. Of course, we could replace "maximizing" with "improving" here, because I can't claim that I have maximized the error injection realism, but it is still interesting to explore how to improve the realism of chaos engineering experiments. About myself: currently I'm doing my PhD at KTH Royal Institute of Technology in Sweden, and before that I was a software engineer at Tencent in China. I'm very proud to say that one of my main research topics is chaos engineering. So it's really nice to join a conference such as Conf42 and discuss all kinds of novel findings in chaos engineering with everyone here.
So, during my research work I have found a lot of nice talks, tutorials and also nice tools, and I find that people are also talking about the challenges of chaos engineering nowadays. Here are some challenges I found. The first one: how to evaluate different aspects of resilience. Chaos engineering can be done at different levels, like the operating system level, the network level or the application level, and then we need extra observability of an application and have to collect different kinds of metrics so we can precisely describe how our system behaves during a chaos engineering experiment. Evaluating the resilience of an application at different levels and in different aspects can be quite challenging. The second one: how to automate the chaos experiments.
Though we have lots of awesome tools today, and they do help a lot with chaos engineering experiments, it is still challenging, given an application's concrete scenario, to make this automation really helpful and efficient for developers: how we automate the workflow, how we learn from the chaos engineering experiments automatically, and how we use the learned knowledge for future chaos engineering. This still needs more research work. And the third challenge: how to improve the efficiency of chaos experiments.
For example, if we are going to explore a really large fault injection search space, we have a lot of different combinations of failures to inject in production, but given the limited experiment time we can't really explore every possibility. So how do we improve the efficiency of these chaos experiments, how do we learn as much as we can during an experiment? Exploring this can be quite helpful for our future chaos engineering experiments. Today I would like to focus only on this third challenge, how to improve the efficiency of chaos experiments, and I believe there are lots of different directions and solutions for it.
One of the ideas we came up with is to synthesize realistic failure models for chaos engineering. Let us take an example: if we have lots of different failures to explore and the experiment time is quite limited, and this will also be done in production, it is sort of risky to inject so many different kinds of failures, especially failures that may never actually happen in production. So how do we synthesize realistic failure models, errors that are actually possible to see in production, and then use chaos engineering experiments to learn more about how the system behaves under such an error, such a failure? This could be quite interesting, and then we can learn more in limited time. Based on this idea we would like to do some more research, and to be precise and to go deep, we would like to narrow down the scope: we want to focus on system call invocations at this moment. Of course this idea can also be used and explored at other levels. For example, at the application level we could synthesize failure models with respect to Java exceptions and then use them in chaos engineering experiments; that should also work. But today I would like to present only our research work at the system call invocation level, about how to synthesize realistic failure models for chaos engineering experiments.
Next, before going into the details, I would like to show you a short demonstration so we can have an overall picture of system call invocations and of the naturally happening errors at the system call invocation level. First of all, about system calls: a system call is a fundamental interface between an application and the operating system, let's say the Linux kernel. Usually the critical resources, such as the network, disk storage, or thread scheduling, et cetera, are controlled by the operating system kernel. When an application needs resources in these categories, it needs to invoke specific system calls. So the application invokes a system call, control transfers from user space to kernel space, and then the kernel can arrange the resources for the application. In the Linux kernel there are more than 300 different kinds of system calls, and Linux also provides more than 100 different error codes, so the kernel can precisely report different kinds of errors when a system call invocation fails. Imagine this as a big combination space: we have hundreds of system call types and hundreds of error codes. If we want to do experiments at the system call level, if we want to inject system call errors on purpose, it is quite time consuming to try to combine the different system call types with the different error codes, and actually some of the combinations are not realistic and can never happen in a production environment.
So we need a better strategy to generate error injection models, and then we can pick the most interesting errors to inject into our system and learn about its resilience.
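To get a rough feeling for the error-code side of this combination space, here is a tiny illustrative check using Python's standard errno module (just an illustration for this transcript, not part of the research tooling):

```python
# Quick look at the error-code side of the combination space: the errno module
# lists the error codes defined on this platform (well over 100 on Linux).
import errno

print(len(errno.errorcode))           # number of defined error codes
print(errno.errorcode[errno.ENOENT])  # 'ENOENT', "no such file or directory"
```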
Okay, so about the demo. Now I'm going to switch to my terminal. Here we have a script to monitor all kinds of system call invocations for us. Currently I'm using a piece of software called OBS to record this video, and when you watch this video or listen to it, I believe you won't notice anything abnormal: the picture is good and the sound is quite good. This means OBS is functioning correctly. Behind this, OBS must interact a lot with the system, and it invokes lots of different system calls.
So let's take a look. There is a script called system call monitor. Given the process name OBS, I would like to use milliseconds as the unit, I also want to monitor the latency of the different system call invocations, and I ask it to report the metrics every 1 second.
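For readers of this transcript, here is a minimal sketch of how such a per-syscall error counter can be built with eBPF and the BCC Python front end. It is illustrative only, not the exact script used in the demo: it assumes root privileges and the bcc package, and it leaves out the latency measurement and the per-process filtering mentioned above.

```python
#!/usr/bin/env python3
# Minimal, illustrative per-syscall error counter built on eBPF via BCC.
from time import sleep

from bcc import BPF
from bcc.syscall import syscall_name

prog = """
#include <uapi/linux/ptrace.h>

struct key_t {
    u32 pid;
    u32 nr;   // system call number
};

BPF_HASH(calls, struct key_t, u64);
BPF_HASH(errors, struct key_t, u64);

TRACEPOINT_PROBE(raw_syscalls, sys_exit) {
    struct key_t key = {};
    key.pid = bpf_get_current_pid_tgid() >> 32;
    key.nr = args->id;

    calls.increment(key);
    if (args->ret < 0) {      // a negative return value carries the errno
        errors.increment(key);
    }
    return 0;
}
"""

b = BPF(text=prog)
print("Counting system call errors, Ctrl-C to stop.")
while True:
    try:
        sleep(1)  # one monitoring interval of 1 second, as in the demo
    except KeyboardInterrupt:
        break
    for key, err in b["errors"].items():
        total = b["calls"].get(key)
        total_count = total.value if total else err.value
        rate = 100.0 * err.value / total_count
        print("pid=%-6d %-16s errors=%-5d total=%-7d (%.1f%% failed)"
              % (key.pid, syscall_name(key.nr).decode(), err.value,
                 total_count, rate))
    b["errors"].clear()
    b["calls"].clear()
```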
Okay, so now we can see that, yes, OBS indeed does lots of system call invocations. The types are here, futex, read, write, et cetera, with different invocation counts. Most of the system calls succeed, but we can still see that some of them failed, and we can see the percentage of these failures. Behind these failed system calls we can also see the error codes, for example futex failing with EAGAIN. Let's see: yes, most of them are futex failures, but if we keep watching this for maybe one or two minutes, I think we will have more failures to observe. For example, this one, execve, failed four times in one second, and all of them returned an ENOENT error code. So this is just to give an idea: we can feel the presence of the system call invocations behind one application. Every application indeed does lots of system calls, and some of them fail naturally.
So this is the first idea. If we want to make it a bit more complex, we can keep monitoring the application, keep extracting these metrics and put them into a database. Over a long time we then have a series of monitoring metrics at the system call level, and really diverse information about the naturally happening errors at the system call invocation level. For example, this is a dashboard. We deployed an email server into production and just kept it running as usual, and we also registered different email accounts to different mailing lists, so we do collect some natural traffic here. Then, by attaching the system call invocation monitor, we can collect metrics like the failure rate, the total number of system call invocations, the latency of the system call invocations, et cetera. You can see here that many different kinds of system calls failed with different error codes, EAGAIN, ECONNRESET, et cetera. This can be quite helpful and interesting for us to see: even though the email server is functioning, we still have lots of different errors at the system call level, they return different error codes, and the failure rates are quite different. Sometimes they fail a lot, and sometimes it's really rare to observe a single error in production.
So there is a question to be discussed here, because these are errors. An error can be handled by the operating system, it can be remediated, but if the operating system cannot handle the error, it usually wraps the error and reports it to the application. For example, if a system call invocation fails and the Linux kernel cannot handle it, the error is reported to the application; in a Java application it can surface as a Java exception, and then we can write a try-catch block to handle this concrete error scenario.
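As a tiny illustration of this error propagation (shown in Python rather than Java purely for brevity), a failing system call surfaces in the application as an exception carrying the kernel's error code:

```python
# A failed system call propagates to the application as an error code; in a
# high-level language the runtime typically wraps it in an exception. Python is
# used here purely for brevity; the JVM does something analogous for Java.
import errno
import os

try:
    os.open("/definitely/not/there", os.O_RDONLY)  # openat() returns -ENOENT
except OSError as e:
    assert e.errno == errno.ENOENT
    print("openat failed with", errno.errorcode[e.errno])
```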
So an interesting question is: these errors may be rare to observe, but if they happened far more frequently, in a more severe scenario, how would our application behave? It can be quite interesting to explore this using chaos engineering experiments.
Okay, let's come back to the slides. That's why we would like to do some research at the system call level, analyzing the realistic errors and generating realistic error models for system call invocations. We call our research prototype Phoebe, and this is an overall picture of its design. We start from here: we have an application in production, deployed normally, and the application interacts with the kernel space using different kinds of system calls. Phoebe attaches two different kinds of monitors. The first one is the natural error monitor: it uses some nice features provided by the Linux kernel, which I will introduce later, to observe different kinds of natural errors, and it reports them to our error model synthesizer. There is also another monitor, the application behavior monitor, because we do need some application-specific metrics like response codes, or maybe some performance-related metrics such as CPU usage and heap memory usage, et cetera. Since we would like to compare how the application behaves when an error is injected, we need some evidence about the steady state. Based on the information from these monitors, the error model synthesizer outputs realistic error models based on the naturally happening errors. Then Phoebe's orchestrator can conduct chaos engineering experiments. It uses a system call injector: this injector intercepts some system call invocations and replaces the return code with our predefined error codes. Finally we get a resilience report about how the application behaves when we actively inject specific errors at the system call level. And of course we can visualize the results using a visualizer, just like the Grafana dashboard I showed you before, which gives information about system call invocations and error rates.
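For the curious, here is a hedged sketch of the interception technique such an injector can rely on: an eBPF kprobe on the openat() syscall entry that overrides the return value with -ENOENT for a fraction of the calls. It requires a kernel built with CONFIG_BPF_KPROBE_OVERRIDE and root privileges, and it is only an illustration of the mechanism, not Phoebe's actual injector, which also targets a specific process and follows the synthesized error model.

```python
#!/usr/bin/env python3
# Illustrative system call error injection with eBPF's bpf_override_return().
# This sketch affects every process on the machine, so only run it on a
# disposable test box; a real injector would filter by target process.
from time import sleep

from bcc import BPF

prog = """
#include <uapi/linux/ptrace.h>

BPF_ARRAY(seen, u64, 1);

int inject_enoent(struct pt_regs *ctx) {
    int zero = 0;
    u64 *count = seen.lookup(&zero);
    if (count == NULL)
        return 0;
    (*count)++;
    if (*count % 20 == 0) {            // fail about 1 in 20 calls (5%)
        bpf_override_return(ctx, -2);  // -ENOENT
    }
    return 0;
}
"""

b = BPF(text=prog)
# get_syscall_fnname resolves the architecture-specific symbol of the syscall
# entry point, e.g. __x64_sys_openat on x86_64.
b.attach_kprobe(event=b.get_syscall_fnname("openat"), fn_name="inject_enoent")
print("Injecting ENOENT into about 5% of openat() calls, Ctrl-C to stop.")
try:
    while True:
        sleep(1)
except KeyboardInterrupt:
    pass
```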
So a question here is how to synthesize realistic error models. We have now got natural errors, and we can analyze them: what is the error rate, what is the error code, et cetera. Then we have to use different strategies to generate realistic error models. Our idea here is to amplify the error rate, because given the limited time, we would like to learn as much as we can in a chaos engineering experiment. So we need to somehow amplify the error rate while still keeping it realistic. This can be done by keeping the type of the system call the same and slightly increasing the error rate, the rate at which this specific error happens, so we can observe enough data in the chaos engineering experiments. Before doing that, I would like to propose some definitions.
First, a system call invocation error, which we define as follows: an invocation of a system call is deemed an error if it returns an error code. Then we need to define the monitoring interval, because we keep reporting these metrics to our database, and this reporting is based on the monitoring interval: a monitoring interval is a period of time along which the metrics of interest are collected. Based on these two definitions, we can define the error rate. The error rate is the percentage of errors of a given system call type with respect to the total number of invocations of the same type. For example, if in one second there are ten open system calls and one of them fails with an error code, for example ENOENT, then the error rate is 10%. This is information we can use when generating error models later, because we can amplify this error rate. The error rate was observed in production, so it serves as a baseline for us.
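In code, the error rate of one (system call type, error code) pair over a monitoring interval is simply the following (a trivial illustrative helper, not Phoebe's implementation):

```python
# Trivial illustration of the definition above: the error rate of one
# (system call type, error code) pair over a monitoring interval.
def error_rate(failed_invocations, total_invocations):
    if total_invocations == 0:
        return 0.0
    return failed_invocations / total_invocations

# Example from the talk: 10 open() calls in one second, 1 failed with ENOENT.
assert error_rate(1, 10) == 0.1  # 10%
```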
Now, talking about the amplification strategies, we have some strategies predefined. If we have a set of error rates with respect to a system call type and an error code e, we can calculate the maximum value of the error rate over the observation phase and also the variance of the error rate. If the maximum error rate is quite low, it means it is pretty difficult to observe such an error in production: it is rare and does not happen frequently. We categorize such errors as sporadic errors, and to observe them, and to learn more about how our application behaves under them, we amplify them to a fixed error rate. This fixed rate is predefined, of course, and usually we just use a number that is large enough for our chaos engineering experiments, so we get enough data to evaluate how our application behaves when this error happens. There are also errors that are quite fluctuating: the maximum value is relatively large, so we observe these errors in production more frequently, but the variance of the error rate is also larger than a certain value. This means that sometimes this system call invocation error happens quite a lot, but sometimes it's difficult to observe at all. To amplify this kind of error, we keep using the maximum value of the error rate: this error rate does exist in production, and by keeping it we somehow increase the frequency of these fluctuating errors. The last category is steady errors: the maximum value of the error rate is quite large, but the variance is not that large, so we can always observe these errors at a relatively steady frequency. That's why we multiply the maximum value by a factor: we are interested in increasing the rate at which these errors happen, just to see whether there exists a threshold for our application. In production the error happens at a certain frequency, but if we do a chaos engineering experiment and increase the error rate, how does the application behave? Is it still resilient to this error, or is there a threshold after which the application behaves abnormally, et cetera? Then we can use experiments to explore it. Of course, these are fairly basic amplification strategies. We could also define more complex strategies, for example combining different scenarios or considering the propagation of errors, et cetera; but that is another story.
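Put together, the three strategies can be sketched as follows; the thresholds, the fixed rate and the multiplication factor below are illustrative assumptions, not the values used in Phoebe.

```python
# Sketch of the three amplification strategies described above.
from statistics import pvariance

MAX_RATE_THRESHOLD = 0.05  # below this maximum, an error is "sporadic"
VARIANCE_THRESHOLD = 0.01  # above this variance, an error is "fluctuating"
FIXED_RATE = 0.5           # amplified rate used for sporadic errors
FACTOR = 2.0               # multiplier applied to steady errors

def amplified_error_rate(observed_rates):
    """Given the error rates observed in production for one (system call type,
    error code) pair, return the error rate to inject in the chaos experiment."""
    max_rate = max(observed_rates)
    variance = pvariance(observed_rates)

    if max_rate < MAX_RATE_THRESHOLD:
        # Sporadic: too rare to learn from naturally, so use a fixed, larger rate.
        return FIXED_RATE
    if variance > VARIANCE_THRESHOLD:
        # Fluctuating: reuse the maximum rate that really occurred in production.
        return max_rate
    # Steady: push beyond the observed rate to look for a resilience threshold.
    return min(1.0, max_rate * FACTOR)
```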
So, as a research prototype, we would like to evaluate how Phoebe performs when we use it to monitor system calls and also to inject system call invocation errors. We have different evaluation targets. For example, Hedwig is an email server, the one I showed you during the demonstration. We deploy Hedwig into production and then use it to send and fetch emails in a normal way, so we can calculate the steady state, of course, and also have some realistic traffic there. Based on the sending and fetching of emails, the number of failures, et cetera, we know how the system behaves without injecting any errors. But based on our observation, we can still see some errors that happened naturally in production. For example, in the table on your left we observed almost ten different kinds of system call errors. Some of them happen quite a lot; for some of them the error rate is quite low. These are interesting findings for us, and we would like to explore more about how Hedwig behaves if we inject more failures with respect to the different categories of system calls. So that is the first phase: we observe the system's behavior and collect all kinds of naturally happened errors. Based on these errors and on the amplification strategies I introduced just now, Phoebe outputs realistic error injection models for the subsequent experiments. Then we keep injecting different kinds of invocation failures, and we try to send and fetch emails again to see how Hedwig behaves. And this is the result.
Based on the result table, Hedwig has different capabilities for bearing these injected errors. Some injected invocation errors directly crashed the server, or caused state corruption, meaning that even after we turn off the error injection the email server still behaves incorrectly. But Hedwig is quite resilient to some of the system call invocation errors: even if we keep injecting them with a larger error rate than the original one, Hedwig is still resilient and provides the correct functionality. So these are some findings of the experiments on Hedwig. We also did some experiments on Ttorrent.
Ttorrent is a Java client used to download files with the BitTorrent protocol. Again, as a normal user we use Ttorrent to download files, we observe Ttorrent's normal behavior and collect all the naturally happening errors, and based on these naturally happening errors we generate error injection models for our chaos engineering experiments. This is the result of the evaluation experiments on Ttorrent. Similarly, we find that Ttorrent is resilient to some of the system call invocation errors, while some other system call invocation errors are quite critical to Ttorrent's functionality. Based on this resilience report, we can further explore how to improve the application's resilience when such system call invocation errors happen. We can also explore the relations between system call invocation failures and the application's behavior, or the exceptions in this specific application. So there are lots of interesting topics to investigate. Another important aspect is the overhead, which is low thanks to the feature we used in Phoebe.
So now I'm going to introduce it a little bit. The feature is called eBPF, and it is provided by the Linux kernel. By using eBPF it is quite easy to capture system call invocation information, and you can also intercept system call invocations to rewrite the return code, et cetera. eBPF is actually quite powerful: you can use it to get really nice observability of the Linux kernel.
An interesting story behind this is that I learned about it when I joined Conf42 last year. There was a workshop there about Linux observability and chaos engineering, and I learned a lot: I learned that eBPF and its front end BCC are really helpful to improve observability and get lots of nice insights into a running system. So we came back to Sweden and decided to explore more in this direction, then we came up with this research idea and finally we wrote a paper about it. It's really nice: we joined Conf42, shared with each other, learned a lot, and now I can come back again to tell you, here are my findings and here is some extra work we have done this year. I think it's quite interesting and I'm very happy to join Conf42 again. The slide I would like to show you here is the overhead of Phoebe: thanks to these nice kernel features, we don't have a very high overhead, even compared to other research prototypes, and even though we ran the experiments in a production environment. The response time, the CPU load and the memory usage are only slightly increased, not that much. I would say this makes it promising: with further improvements we can lower the overhead even more and try to learn more about the system calls, or other useful information at the system call level or at the operating system level.
But as a research prototype, Phoebe of course has its limitations, and lots of future work is needed to explore this field. For example, Phoebe is designed for Java applications. This is because when we do chaos engineering experiments we would like to have multi-layer observability: we want to collect metrics at the operating system level, metrics about the JVM, and also some application-specific metrics such as response codes, response times, et cetera. That's why we only evaluated Phoebe on Java applications. But I think the concepts are applicable to other software stacks: for all kinds of executables on a Linux operating system, we can always use Phoebe to monitor their system call invocations and to summarize the naturally happened system call errors. This means some components can be reused with minor adaptations, such as the monitor, the visualizer and the system call error injector. The second limitation is that the JVM itself, to be honest, may either create natural system call errors or remediate some errors directly. So this is only our first step, but in the future I think it would be interesting to explore the interplay between the JVM and the application's error handling mechanisms with respect to system call errors.
Okay, so the next page is a summary of our research work on Phoebe. Basically, we contribute some original insights about the presence of naturally happening system call errors in software systems. We also contribute the concept of synthesizing realistic error injection models for system calls based on amplification strategies, amplifying the naturally happening errors observed in production. And we implemented Phoebe as a research prototype of this concept. The corresponding paper is called "Maximizing Error Injection Realism for Chaos Engineering with System Calls". If you are a fan of reading papers, you are more than welcome to give us feedback on what you think of it.
Finally, about Phoebe the framework: we have already made it open source. So if you feel interested in Phoebe, and also in how we address the first two challenges I mentioned in the beginning, you can take a look at our royal-chaos GitHub repo (KTH is the Royal Institute of Technology, hence the name), which contains all of our research work in the field of chaos engineering: papers, source code and also experiments. Feel free to take a look and to create issues if you have further questions.
Okay, so as a summary: during this talk I presented a short demonstration so we could get an overall picture of system call invocations and the naturally happening errors behind them. Then I presented the design of Phoebe, which takes the naturally happening errors as input and automatically synthesizes realistic error models for chaos engineering experiments. In Phoebe the most interesting and important idea is error rate amplification: we have different strategies, and we have categorized the errors into different types. We also did some evaluation experiments with Phoebe on different applications, so we have some interesting findings, and based on those findings we have summarized the cool ideas and also the limitations of Phoebe. The last point: take a look at our royal-chaos GitHub repo if you feel interested. Basically, that's all of my talk today, and I hope you enjoyed it. If you feel interested, you are welcome to contact me at any time. I look forward to discussing more with you, and let's keep in touch. Thanks a lot. Have a nice day.