Transcript
Hi there, I'm Wen Chao, a software engineer at Kong. Today I'm so glad to have the opportunity to share with you some exciting updates and innovations in Kong Gateway. Before that, let me spend some time telling you a little bit more about our company and our products. Kong is a company that provides API gateway products and solutions. We have been focusing on helping our clients manage and secure their APIs effectively, empowering global companies to embrace an API-first approach.
Kong Gateway is an open source, scalable, high-performance API gateway product that provides features such as authentication, traffic control, HTTP transformation, and logging and monitoring. In addition to the open source gateway, Kong also provides enterprise editions, which offer advanced features and support for organizations with specific requirements.
We also provide a cloud native solution called Konnect, which enables you to manage APIs across any cloud and any team from one place. Konnect provides a hosted, unified control plane to manage gateways and ingress controllers. This empowers your organization to reduce operational overhead, speed up time to market, increase security, and achieve federated governance at scale.
So in today's presentation I will be focusing on Kong Gateway from several angles. The presentation will be divided into three parts, each focusing on a different aspect of our enhancements and innovations. First and foremost, I will introduce the significant improvements we are making in some plugins and explore the key innovations in some of the core components that power Kong Gateway. Following that, I will show you how we do performance testing at Kong. In the third part, we'll explore some techniques for fine-tuning performance for an enhanced user experience.
Kong has hundreds of out-of-the-box plugins to address various use cases without introducing any platform complexity. This makes Kong flexible enough to boost developer productivity and ensure operational resilience. In today's talk, we will just have a quick look at the latest improvements in some widely used plugins.
So let's start with the Prometheus plugin. As a commonly used cornerstone of modern monitoring systems, Prometheus always plays a crucial role in Kong Gateway. While the plugin performs well in scenarios with a relatively small number of routes, we recognized a decline in performance as the number of routes increases substantially. We took this as a challenge, and what follows is what we have worked out.
As illustrated on this slide, we have accelerated metric scraping for the Prometheus plugin, and we have also implemented a reduction in high-cardinality metrics. These enhancements eventually result in an improvement in the P99 latency, which means it can now run smoothly even in large-scale systems. Three factors contribute to these improvements. Firstly, we've adopted a table pool for the creation of temporary tables, which optimizes the process and mitigates the cost of excessive table creation during Prometheus scraping, so that we avoid the performance hit of creating new tables all the time.
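As a rough illustration of the idea (not the plugin's actual code), OpenResty's lua-tablepool library lets you fetch a recycled table and return it to the pool instead of allocating a fresh table on every scrape:

```lua
local tablepool = require "tablepool"

-- Hypothetical helper: render one metric line into a pooled buffer table.
-- The pool name and the array/hash size hints are illustrative values.
local function render_metric(name, value)
  local buf = tablepool.fetch("prometheus_buf", 4, 0)  -- reuse a table from the pool
  buf[1] = name
  buf[2] = " "
  buf[3] = tostring(value)
  local line = table.concat(buf)
  tablepool.release("prometheus_buf", buf)             -- return it for the next call
  return line
end
```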
Secondly, a few yield functions have been introduced in critical code paths, which allow the current handler to give up control of the CPU to other request-processing events rather than holding the CPU for too long. This improves the responsiveness of the gateway quite a lot.
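In OpenResty, one common way to yield inside a long loop is `ngx.sleep(0)`, which hands control back to the Nginx event loop. A minimal sketch of the idea (illustrative, not the plugin's exact code):

```lua
-- Emit a large number of metric lines without monopolizing the worker.
-- emit() is a hypothetical function; the yield interval (200) is arbitrary.
local function emit_all_metrics(metrics)
  for i = 1, #metrics do
    emit(metrics[i])
    if i % 200 == 0 then
      ngx.sleep(0)            -- yield so other requests can be processed
    end
  end
end
```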
Lastly, we've also replaced all the NYI (not-yet-implemented) functions involved in this plugin with alternatives that the LuaJIT compiler can compile, which fully harnesses the performance benefits of the Lua VM.
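To give a flavor of this kind of change (generic examples, not the plugin's actual diff): iterating an array-like table with a numeric loop instead of `pairs`, and using OpenResty's `ngx.re` API instead of Lua string patterns, both keep hot paths JIT-compilable:

```lua
local labels = { "service", "route", "code" }
local buf = {}

-- Before: pairs() is on LuaJIT's NYI list, so this loop aborts the trace.
-- for _, label in pairs(labels) do buf[#buf + 1] = label end

-- After: a plain numeric loop over an array-like table stays on the JIT fast path.
for i = 1, #labels do
  buf[#buf + 1] = labels[i]
end

local header_value = "a  b   c"

-- Before: string.gsub is not JIT-compiled.
-- local normalized = string.gsub(header_value, "%s+", " ")

-- After: ngx.re.gsub goes through lua-resty-core's FFI bindings
-- ("jo" enables PCRE JIT and compile-once caching of the regex).
local normalized = ngx.re.gsub(header_value, [[\s+]], " ", "jo")
```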
These changes collectively contribute to the performance boost. The following bar chart shows a comparison of the P99 latency between versions 3.3 and 3.4. We just need to pay attention to the red and green columns: the red one stands for version 3.3 and the green one for version 3.4. Here we can see a significant improvement in P99 latency in 3.4.
Now let's move on to the rate limiting plugin, which is a plugin for the purpose of traffic control. For this plugin, we've also achieved a notable improvement in both P99 latency and RPS. The enhancement is mainly attributed to the introduction of a batch queue, a fundamental component newly introduced in 3.4, which helps especially when the backend for the rate limiting counters is an external storage. With the help of this queue, there is no need to communicate with the external storage for every single request. Each time the counters are calculated, the plugin just puts the data into this internal, in-memory batch queue with almost no latency, and the queue then flushes these counters to the external storage asynchronously. This optimizes the way the plugin communicates with external storage, which is particularly beneficial in high-concurrency scenarios.
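A minimal sketch of the batching idea (a simplification, not Kong's actual queue implementation) could buffer counter increments in the worker and flush them on a timer:

```lua
-- Accumulate rate-limiting counter increments locally and flush them
-- to external storage in batches. flush_to_storage() and the 1-second
-- interval are illustrative assumptions.
local pending = {}

local function increment(counter_key, delta)
  pending[counter_key] = (pending[counter_key] or 0) + delta  -- no network call here
end

local function flush(premature)
  if premature then return end
  local batch = pending
  pending = {}
  flush_to_storage(batch)  -- hypothetical: one bulk write instead of one per request
end

-- Flush asynchronously on a fixed interval (akin to a sync rate).
local ok, err = ngx.timer.every(1, flush)
if not ok then
  ngx.log(ngx.ERR, "failed to create flush timer: ", err)
end
```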
We've also introduced a new configuration option, the sync rate, so that we can even decide the flush frequency. The graph below illustrates the results of a test we ran against a Golang-based open source gateway. As we can see, the performance in both P99 latency and RPS is nearly twice as good as the competitor's.
Now that we have discussed the improvements in plugins, let's delve into the enhancements in some core components. To begin our discussion of the core components, let's first focus on the router component, which plays a key role in route pattern matching.
Why is there a need for improvements in the router? Well, given that users heavily depend on its flexibility and efficiency to configure their routing requirements effectively, the router stands out as the cornerstone of a gateway. Moreover, nearly every individual request has to be checked by the router to determine whether it matches any of the API paths we've configured. As traffic volume increases, the router becomes a critical focal point and its performance emerges as a key factor. Since version 3.0, we have introduced a new router implementation called the ATC router. The ATC router is a DSL-based approach, and it's written in Rust. The main reason for us to move to a DSL-based approach is to enhance flexibility. The new router empowers users to define routes with expressions. Configuring routes using expressions allows for more flexibility and better performance without losing readability when dealing with complex or large configurations.
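For example, with the expressions router enabled, a route can be described by a single predicate over request fields. Here is an illustrative route definition, shown as a Lua table for brevity; the field names assume Kong's expressions route format:

```lua
-- Illustrative only: a route matched by an ATC expression instead of
-- separate paths/hosts/methods lists.
local route = {
  name       = "orders-api",
  priority   = 100,
  expression = [[http.method == "GET" && http.path ^= "/api/orders"]],
}
```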
Furthermore, we rewrote it in Rust, because Rust is not only generally quite efficient, comparable to C, providing high performance with low-level control, but also much safer than C in terms of memory handling. So it was ideal as the language of choice for the development of a new module. That said, it is never an easy thing to rewrite a module completely in Rust, because Rust is not a natively supported language in OpenResty, and several engineers invested a considerable amount of effort in this project.
The results proved that everything was worthwhile. The new router reduces rebuild time when routes change, while increasing runtime performance when routing requests, and it brings the P99 latency down from 1.5 seconds to 0.1 seconds when testing with 10,000 routes. Besides the rewrite of the router itself, we've made two more optimizations to the router's logic. Firstly, we optimized the route rebuild logic. Secondly, conditional rebuild of the routes was introduced, which means the route builder becomes more intelligent: it is aware of whether routes have changed and whether they need to be rebuilt. These optimizations also contribute to the final enhancements.
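Conceptually, the conditional rebuild boils down to remembering a version or hash of the route configuration and skipping the rebuild when nothing changed. A minimal sketch of that idea (not Kong's actual implementation):

```lua
-- Rebuild the router only when the route configuration actually changed.
-- compute_routes_hash() and build_router() are hypothetical helpers.
local current_hash

local function rebuild_if_needed(routes)
  local new_hash = compute_routes_hash(routes)
  if new_hash == current_hash then
    return false                 -- routes unchanged, keep the existing router
  end
  build_router(routes)           -- expensive: only done when something changed
  current_hash = new_hash
  return true
end
```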
The second core component that I want to share with you is the improvement in our approach to configuration management. In our previous iterations, we relied on shared memory as the configuration storage, which is a kind of native approach that provides node-level storage accessible across all worker processes. While this was effective in providing fast in-memory storage, we recognized the need for enhancement due to excessive memory consumption. Another concern is that shared memory is not a database-like storage: it does not support transactional operations. In some cases this is unacceptable because of the inability to guarantee atomicity, which means we would need to write extra code to ensure data consistency.
This increases the complexity of the code. From version 3.0, we handle configuration by introducing LMDB as the backend for Kong, which is an in-memory transactional database. By doing this, we've not only achieved a notable reduction in memory usage, but also unlocked the power of transactional storage.
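To illustrate what transactional storage buys us, here is a hedged sketch based on the lua-resty-lmdb library that Kong ships for this purpose; the module name and the `txn.begin()` / `t:set()` / `t:commit()` calls are my assumption of its interface, so treat this as pseudo-usage rather than verified code:

```lua
-- Assumed lua-resty-lmdb usage: write two related configuration entries
-- atomically, so readers never observe one without the other.
-- serialized_service and serialized_route are hypothetical values.
local txn = require("resty.lmdb.transaction")

local t = txn.begin()
t:set("services/orders",       serialized_service)
t:set("routes/orders-default", serialized_route)
local ok, err = t:commit()      -- both writes land, or neither does
if not ok then
  ngx.log(ngx.ERR, "failed to commit config transaction: ", err)
end
```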
The benchmark results here also reveal that LMDB significantly improves RPS during rebuilds with constant config pushes. What's more, we observed a remarkable 50% to 70% additional drop in rebuild time, particularly noticeable in cases with a large number of workers.
Kong Gateway is based on Nginx and OpenResty. It has a unique master-worker architecture for efficient CPU usage, forking multiple worker processes, and it therefore faces information-sharing challenges across workers due to the isolated nature of worker processes. To address the communication needs between workers, especially for tasks like router rebuilding and health checking, Kong designed an event propagation mechanism.
Previously, the event component we implemented relied on shared memory. Though this approach was effective, it had drawbacks such as high lock overhead: accessing a single shared memory zone requires a lock to avoid race conditions and keep consistency. This leads to high overhead when there are numerous workers, which we consider a kind of inefficiency. Another concern is the size of the shared memory. It has to be decided and allocated in advance when Nginx starts, and it cannot be changed at runtime. This means that regardless of whether that part of shared memory is fully utilized, it has already been allocated. To enhance efficiency and avoid these risks, we upgraded to a new event library called lua-resty-events. The new event library operates as a classic publish-subscribe model and is implemented in pure Lua code, eliminating the need for shared memory.
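Inside Kong, plugins and core components interact with this layer through the worker events interface; the sketch below shows the publish-subscribe shape of it (the source/event names and payload here are made up for illustration):

```lua
-- Subscribe: run a handler in every worker when an invalidation event
-- arrives ("my_source"/"invalidate" are illustrative names).
kong.worker_events.register(function(data)
  ngx.log(ngx.NOTICE, "invalidating cache key: ", data.key)
end, "my_source", "invalidate")

-- Publish: broadcast the event to all workers from wherever the change happened.
local ok, err = kong.worker_events.post("my_source", "invalidate", { key = "routes:version" })
if not ok then
  ngx.log(ngx.ERR, "failed to post worker event: ", err)
end
```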
We ran tests to compare the two event libraries. We designed two test cases to do this: one posts 1,000 events every five milliseconds, and the other posts 10,000 events every 100 milliseconds. The benchmark here reveals that under the low-pressure condition the two libraries have nearly the same RPS, while the new library wins as the workload increases.
Considering that performance is essential for a commercial product to deliver an excellent user experience, we carefully designed a testing system and a set of statistical methods. Here is the infrastructure of the testing system. Two bare-metal machines have been set up: one works as the load generator and the other is used for the deployment of Kong. We even bind the different components, the NIC, the upstreams, and the Kong Gateway, to different groups of CPUs, trying to get rid of CPU contention, and we connect these two bare-metal machines with a private network to fully utilize the network bandwidth. We even put the upstream and the Kong Gateway on the loopback interface to give wrk more bandwidth.
Here are the methods we use to check all the key metrics: RPS, latency, and memory usage. This graph shows how we check for degradation in these metrics. We compare two groups of sample results to see if there is a significant difference.
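As a rough sketch of what such a comparison can look like (not necessarily the exact statistical method we use), the two groups of samples can be compared with a simple significance check on their means:

```lua
-- Flag a regression when the difference between two sample groups of a
-- metric (e.g. P99 latency) is large relative to its standard error.
-- The 3x threshold is an arbitrary illustrative choice.
local function mean_and_var(samples)
  local sum = 0
  for i = 1, #samples do sum = sum + samples[i] end
  local mean = sum / #samples
  local var = 0
  for i = 1, #samples do var = var + (samples[i] - mean) ^ 2 end
  return mean, var / (#samples - 1)
end

local function is_regression(baseline, daily)
  local m1, v1 = mean_and_var(baseline)
  local m2, v2 = mean_and_var(daily)
  local stderr = math.sqrt(v1 / #baseline + v2 / #daily)
  return (m2 - m1) > 3 * stderr      -- daily run significantly worse than baseline
end
```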
We run the test every day, and if a significant difference appears, the test is considered a failure and we get a notification about it. Of the two groups of results, one comes from the baseline, a post-release performance test carried out for each new version release, whose results are persisted in storage. The other group is collected every day and persisted in the same storage in case we need to review it. The charts illustrated on this slide show the effect of our methods: as you can see, it is easy to spot the variations between the test runs.
When a difference is found, perf is run to collect profiling data and generate a flame graph so that we can get insight into the details. From the flame graph we can generally find the problematic function, and after we make changes in our code to fix the issue, we run the perf test again to confirm the issue has gone. This is just an example to show you how we do performance work. So in the last part of today's presentation, let's go through some practical insights into tuning the performance.
Sometimes performance tuning can be intricate, and today I will just talk about some common and fundamental strategies as examples. For more in-depth information, please explore the technical blogs on Kong's official website. Kong, as an API gateway, comes out of the box well tuned for proxying traffic. However, there are some cases where Kong is used to process requests with large payloads, so let's explore this case first.
Kong uses a request buffer to hold the payload, and when the payload exceeds the buffer size, Nginx resorts to using the disk to buffer the request. This avoids excessive memory usage and prevents a potential memory shortage, but it can lead to extensive disk operations. Such disk operations can be costly and may block the Nginx event loop, resulting in performance degradation. To mitigate this, a simple solution is to increase the buffer size, which defaults to 8k, as illustrated in this graph: the larger the buffer size, the better the performance we can get.
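In practice, this kind of buffer can be raised through Kong's Nginx directive injection in kong.conf (or the matching KONG_ environment variable); the value below is only an example, so size it against your real payloads and available memory:

```
# kong.conf: inject a larger client body buffer into the generated Nginx config
# (directive injection prefix assumed; check your Kong version's docs).
nginx_http_client_body_buffer_size = 1m
```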
Another case arises when Kong receives responses from upstreams with large payloads. Similar to the previous case, a buffer is in place for holding response bodies, and the same principle holds true, as demonstrated in this graph. Although buffering can increase performance, there are cases where it should not be enabled, including real-time use cases, long-lived connections, streaming transmission, and streaming processing. We also should not enable it when the memory of the environment is limited.
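For those streaming-style cases, buffering can be switched off per route. The sketch below shows the shape of such a route, written as a Lua table for consistency with the earlier example, with the `request_buffering`/`response_buffering` flags assumed from Kong's route entity:

```lua
-- Illustrative route for a streaming endpoint where proxy buffering is
-- more harmful than helpful.
local route = {
  name               = "event-stream",
  paths              = { "/stream" },
  request_buffering  = false,   -- pass the request body through as it arrives
  response_buffering = false,   -- stream the upstream response without buffering
}
```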
Okay, I think that's almost everything I would like to cover in today's presentation. Due to the limitations of the slides, some topics may not have been fully discussed. If you have any questions, please feel free to reach out to me; I will be glad to help. Thank you.