Transcript
This transcript was autogenerated. To make changes, submit a PR.
Today I would like to discuss how to build a more
robust Apache APISIX ingress controller with
LitmusChaos. Let me introduce myself first.
I'm Jintao, an Apache APISIX PMC member,
a maintainer of the Kubernetes ingress-nginx project,
and a Microsoft MVP. If you would like to
get in touch with me, you can find my GitHub
profile and email address on the slides of
this talk. The agenda is to discuss why
we need chaos engineering, how to design chaos
experiments for an ingress controller,
how to put it into practice, and the benefits and
the future of this field.
First, why do we need chaos engineering?
Let's review the definition of chaos engineering.
Chaos engineering is a process of evaluating
a software system by simulating destructive
events such as server or network outages or
routing errors. In this process,
we test the system's resilience and reliability
under unstable and unexpected conditions
by introducing chaos, for example,
server failures. Chaos engineering can
also help teams simulate real-world
scenarios in a safe, controlled
environment to uncover
hidden risks and identify performance
bottlenecks in distributed systems.
This approach is an
effective way to prevent system downtime or
production interruptions.
Netflix's approach to handling system failures
inspired us to take a more scientific
approach, which drove the birth and
development of chaos engineering. What
is chaos engineering? I think the first aspect
is the introduction of disruptive
events. Chaos engineering involves
introducing disruptive events,
such as network partitions, service degradation,
and resource constraints,
to simulate real-world scenarios
and test the system's ability to handle unexpected
conditions. The purpose of this is to identify
weaknesses and
use that information to improve the system's
design and architecture,
making it more robust and resilient.
The second aspect is testing the system's resilience.
Today's technical landscape is
constantly evolving and fast-paced.
To ensure that systems are robust, scalable, and
able to handle unexpected challenges
and conditions, it's very important
to test the system's resilience under real-world
conditions.
Chaos engineering is an effective way
to do this. It involves introducing
disruptive events to observe
the system's response and measure its
ability to handle unexpected
conditions. To measure the impact of
a disruptive event on the
system's resilience, organizations can monitor system
logs, performance metrics, and user experience.
By tracking these metrics, organizations can
gain a better understanding of the system's behavior
and identify areas for improvement.
The next aspect is discovering
hidden problems.
Distributed systems can be prone to hidden
issues such as data loss,
performance bottlenecks, and communication
errors. These problems can be
hard to detect, as they often only
become visible when the system is under pressure.
Chaos engineering can help uncover these
hidden issues by introducing
disruptive events. The information gained
can then be used to improve the system's
design and architecture, making it
more reliable.
By identifying and resolving these
problems, organizations can enhance the
reliability and performance of the system.
This can help prevent
downtime, reduce the risk of
data loss, and ensure the system continues
to run smoothly.
That is what chaos engineering is, but why do we need it?
First, distributed systems are complex,
with much inherent chaos
in the system.
The use of cloud and microservice
architectures provides us
with many advantages, but it
also comes with complexity and chaos which
can lead to failures. The engineer's
responsibility is to make the system as
reliable as possible. Without testing,
we have no confidence to let
our product be used in a production environment.
In order to make it more robust,
in addition to conventional
unit tests,
we decided to introduce chaos testing.
When an error occurs, repairing it
takes time and can cause immeasurable
loss, which may have long-term
effects in the future.
In the process of the repair, we need to consider
various factors, including the complexity
of the system, the type of the error,
and possible new problems, in order to ensure
that the final fix is effective.
Moreover,
when an open source project brings
serious faults to users in a production
environment, many users will choose to
switch to another product. Back to today's
topic: how to design chaos
experiments for an ingress controller.
Let's talk about what Ingress
is first. Ingress is a resource
object in Kubernetes. It contains
rules for how clients outside the cluster
can access services inside the cluster.
These rules include which
clients can access which services, how to
route client requests to the services,
and how to handle those client requests.
On the right is a simple
example. As you can see, Ingress is a very
simple resource; there is no need to make it more
complicated than it needs to be.
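The example on the slide isn't captured in the transcript, but a minimal Ingress of the kind described looks roughly like the sketch below; the host, path, service name, and ingress class are illustrative placeholders rather than the slide's actual values.

```yaml
# A minimal Ingress: route requests for example.com/ to the "web" Service
# on port 80. Host, path, and service names are placeholders; "apisix" is
# assumed to be the ingress class the APISIX controller watches (configurable).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  ingressClassName: apisix
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web
            port:
              number: 80
```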
Next, what is an ingress controller? An Ingress resource
requires an ingress controller to process
it; otherwise it has no practical
use. The ingress controller translates the
Ingress rules into configuration
for a proxy, allowing external
clients to access services within the cluster.
An ingress controller is a specific
type of load balancer that receives
Ingress rules from the cluster and then translates
them into configuration that can proxy
client requests. This effectively
manages how external clients
access services within the cluster.
However, in a production environment
we need more complex capabilities,
such as limiting access
by source and request method,
authentication, and authorization.
The Ingress resource object doesn't include
this part, so most ingress
controllers extend the semantics of
Ingress through annotations on
the Ingress resource. Different ingress
controllers have different implementations.
For example, the annotations used by Kubernetes
ingress-nginx and Apache APISIX Ingress
are different.
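As a hedged illustration of how annotations extend Ingress semantics, the sketch below shows a well-known ingress-nginx annotation alongside an APISIX-style one. The APISIX annotation key is an assumption taken from the project's documentation as I recall it, so verify both keys against each controller's docs; in practice only one controller would process a given Ingress.

```yaml
# Illustrative only: each controller honors its own annotation prefix, so you
# would use one set or the other on a real Ingress, not both.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: annotated-ingress
  annotations:
    # ingress-nginx: rewrite the matched path before proxying upstream
    nginx.ingress.kubernetes.io/rewrite-target: /
    # APISIX Ingress (assumed key; check the project docs): restrict source IPs
    k8s.apisix.apache.org/whitelist-source-range: "10.0.0.0/8"
spec:
  ingressClassName: apisix
  rules:
  - host: example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api
            port:
              number: 80
```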
Okay, what is Apache APISIX Ingress?
The Apache APISIX ingress controller is a controller for
Kubernetes Ingress resources that helps
administrators manage and
control ingress traffic. It uses
Apache APISIX as the data plane to provide
users with dynamic routing,
load balancing, security
policies, and other features to improve
network control and ensure high
availability and security
for their business. APISIX Ingress
supports three configuration methods: you can
use Kubernetes Ingress, custom resources,
or the Gateway API. Each of
these has its own advantages.
For example,
if you use the Ingress resource,
it is simple to describe and is
a resource carried by Kubernetes by default.
It's also easy to integrate
with other components. Next is the Gateway
API. The Gateway API is the next
generation of Ingress, providing richer semantics
and functions. The
last one is CRDs: Apache APISIX Ingress
provides a set of custom
resources that map to APISIX's
own resources, which is convenient
for users to use and understand.
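As a sketch of the CRD option, an ApisixRoute might look roughly like the following. The apiVersion and field names follow the apisix-ingress-controller documentation as I recall it and may differ between CRD versions; the host and backend service are placeholders.

```yaml
# Hypothetical ApisixRoute: route httpbin.example.com/* to the "httpbin"
# Service on port 80. Verify apiVersion and field names against your
# apisix-ingress-controller version before use.
apiVersion: apisix.apache.org/v2
kind: ApisixRoute
metadata:
  name: httpbin-route
spec:
  http:
  - name: rule1
    match:
      hosts:
      - httpbin.example.com
      paths:
      - /*
    backends:
    - serviceName: httpbin
      servicePort: 80
```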
APISIX Ingress adopts a separated
architecture, with the control plane handling
routing rules without carrying business
traffic. All client requests
are proxied through the data plane, therefore any
abnormality in the control plane
will not affect the traffic.
In addition, the APISIX ingress controller has
a retry module:
after the control plane component is restored,
the routing rules can be synced
to the data plane. APISIX
Ingress also supports integration
with external service discovery components.
Then, what is LitmusChaos? LitmusChaos
is an open source chaos engineering
framework and an incubating
project of the CNCF.
It provides an infrastructure
experimentation framework to validate
the stability of controllers and microservice
architectures. It can simulate
container-level and application-level faults, as
well as failures and upgrades,
to understand how
the system responds to these
events. The framework can also
explore the behavior changes
between controllers and applications, and how
a controller responds to challenges
in specific states.
In addition,
LitmusChaos offers convenient observability
capabilities. It is highly
extensible and can be integrated with
other tools to enable the
creation of custom experiments.
Kubernetes developers
and SREs use Litmus to
manage chaos in a declarative
manner and identify weaknesses in
their applications and infrastructure.
Someone asked me why I chose LitmusChaos
over other products. That is
a topic for another time, but to summarize,
LitmusChaos has the functionality I
need and I'm more familiar with
it. Okay, how do we design a chaos
experiment?
This is a general procedure applicable
to the design of chaos experiments
in any scenario.
First, you should define the
system under test: identify the specific
components of the system you want to experiment on
and develop a clear and measurable
objective for the experiment.
This includes creating a comprehensive
list of the components, such as the hardware
and software that will be tested,
as well as defining
the scope of the experiment
and the expected outcomes.
Next, choose the right
experiment: select an experiment
that is aligned with the objective you
have set and closely mimics a
real-world scenario.
This will help ensure that the
experiment produces meaningful results
and accurately
reflects the behavior of the system.
Next is establishing a hypothesis.
Establish a hypothesis about how the system will
behave during the experiment and
what outcome you expect.
This should be based on past
experiments, experience, or
research, and it should be reasonable and
testable. Next is running the
experiment. Run experiments
in a controlled environment,
such as a staging environment, to limit the
potential for harm to the production
system. Collect all relevant
data during the experiment and
store it securely.
There may be different opinions on
whether the experiment should take
place directly in the production environment.
However, for most scenarios we
need to ensure that the service level
objective of the system is
met. The last step is to evaluate
the results: evaluate the
results of the experiment and
compare them to your hypothesis.
Analyze the data collected and document
any observations
or findings. This includes identifying
any unexpected results or discrepancies
and determining how they might
affect the system. Additionally,
consider how the results of the experiment can be
used to improve the system.
Okay, let's look at the main usage
scenarios of the ingress controller.
Proxying traffic is the most important capability,
so I wrote it three times on the slide.
The other functions are all based on
this core function.
Consequently, when conducting
chaos engineering, whether traffic is still
proxied normally is the key metric.
Next, we
can use the general method above to define
the system under test. For APISIX Ingress, users
need to create route configurations
such as Ingress, Gateway API, or CRD resources
and apply them to the Kubernetes cluster.
This process goes
through the Kubernetes API server for authentication and authorization,
and the resources are then stored in
etcd. The APISIX ingress controller
continually watches for changes
in these Kubernetes resources.
The configurations are then translated
into configuration on the data plane. When a client sends a request
to the data plane, it proxies the request to the upstream service according
to the routing rules. It is
clear that if the Kubernetes API
server has an exception,
it will prevent the configuration from being
created or the ingress controller from getting
the correct configuration.
This is an obvious and
certain scenario, so no experimentation
is needed.
Likewise, if there is an exception in the
data plane, such as a network interruption,
a crash, or the pod being killed, it will
also not be able to proxy
traffic normally.
This also doesn't need an experiment.
Therefore, the scope of our experiment
is mainly the impact on the system
if the ingress controller has an exception.
Next, we should choose
the right experiments based on the
reasons above. We can directly cover many
scenarios of incorrect
configuration through end-to-end tests,
so chaos engineering is mainly used to verify whether
the data plane can still proxy traffic
normally when the ingress controller encounters an exception,
such as a DNS error,
a network interruption, or the pod being killed.
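For reference, these failure types roughly correspond to generic experiments shipped in the LitmusChaos ChaosHub; the mapping below is an assumption to verify against the Litmus version you install.

```yaml
# Rough mapping of the failure scenarios to LitmusChaos ChaosHub experiments
# (names assumed from the public ChaosHub; confirm for your Litmus version).
experiments:
- name: pod-delete        # the ingress controller pod is killed
- name: pod-network-loss  # network interruption for the controller pod
- name: pod-dns-error     # DNS resolution failures in the controller pod
```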
Then, establish the hypothesis.
For each scenario, we can create the following
hypothesis: when the ingress controller
encounters one of these exceptions,
client requests can still
get a normal response.
This is our hypothesis.
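One way to encode this hypothesis in Litmus is an HTTP probe that keeps requesting a route through the APISIX data plane while the chaos runs. The sketch below shows what such a probe section of a ChaosEngine could look like; the URL is a placeholder and the exact probe field formats vary slightly between Litmus versions.

```yaml
# Hypothesis as a Litmus probe: while the controller is disrupted, requests
# through the APISIX data plane must keep returning HTTP 200.
probe:
- name: check-proxy-traffic
  type: httpProbe
  mode: Continuous            # evaluated throughout the chaos duration
  httpProbe/inputs:
    url: http://apisix-gateway.ingress-apisix.svc/headers   # placeholder route
    method:
      get:
        criteria: "=="
        responseCode: "200"
  runProperties:
    probeTimeout: 5
    interval: 2
    retry: 1
```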
Next, we should run the
experiments. The experiment and
its variables have been determined,
so all that is
left is to conduct the
experiment. LitmusChaos provides
various ways to conduct experiments.
We can do this through the Litmus portal.
To do this, we need to create a chaos scenario
and select the application to be experimented on,
and these steps are relatively
straightforward. However, we must pay attention
to the fact that
LitmusChaos includes a probe
resource. These probes are
pluggable checks that can be defined within the
ChaosEngine for any chaos experiment.
The experiment pod executes these
checks based on the mode they are defined in,
and factors their success as a necessary condition in
determining the verdict of the
experiment. In addition to these standard
built-in checks, we can also
schedule experiments, which is a very valuable function.
Additionally, Litmus also
supports running experiments by submitting
YAML manifests.
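A hedged sketch of such a manifest is shown below: a ChaosEngine that runs the pod-delete experiment against the ingress controller. The namespace, application label, and service account name are placeholders for whatever your APISIX Ingress deployment actually uses.

```yaml
# Sketch of a ChaosEngine that kills the ingress controller pod.
# Namespace, labels, and the service account are placeholders.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: apisix-ingress-chaos
  namespace: ingress-apisix
spec:
  engineState: active
  appinfo:
    appns: ingress-apisix
    applabel: app.kubernetes.io/name=ingress-controller   # placeholder label
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: "60"      # seconds of chaos
        - name: CHAOS_INTERVAL
          value: "10"      # seconds between successive pod kills
        - name: FORCE
          value: "false"   # graceful pod deletion
```

Applying this with kubectl and then inspecting the resulting ChaosResult resource gives the verdict of the experiment.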
Then, how do we evaluate
the results?
Litmus has built-in statistical
reports that clearly show
the results of the experiments.
There are also other rich reports, such
as comparisons of experiments
and execution
records. It can also be integrated
with Prometheus and
Grafana to provide a unified
dashboard.
However, due to my current experiment
scenarios, I only used
the built-in reports.
Finally, the benefits and the future. Apache APISIX
is an open source project that is used
by various companies and in various environments.
Chaos engineering has given us
confidence that the
delivered APISIX Ingress is
stable and reliable. Thanks to our
complete end-to-end tests, we no
longer need to worry about unexpected
behavior due to the introduction of
new PRs. Chaos engineering
has also helped us to identify a
bug: when the APISIX ingress controller
pod was killed multiple times in succession,
it could cause a configuration
failure. Fortunately,
this problem has been fixed,
and I'm now running continuous
chaos tests in a private
deployment environment. I plan to
introduce chaos experiments based on
Litmus into the CI environment of the Apache
APISIX Ingress project, and I want to
provide reference documents and examples
for other users to implement chaos
engineering for APISIX Ingress in their own
environments. That's all. Thank you. I'm honored
to be here to share some of my
experience with you.
If you are interested, feel free to
contact me anytime. See you.