Conf42 DevOps 2024 - Online

Elevating Excellence of Kong Gateway: Performance Enhancement and Best Practices


Abstract

Kong Gateway is a robust and flexible open-source API gateway that empowers organizations to efficiently manage, secure, and optimize their APIs. Join us as we uncover the key aspects of Kong in this presentation, exploring the core platform's capabilities, performance improvements and extended features that can revolutionize your organization. Navigate through our latest enhancements in core functionalities and components to witness how Kong significantly improves scalability and boosts overall efficiency. Engage in this enlightening session that goes beyond the basics, offering in-depth insights into Kong's continuous evolution. Whether you're new to Kong or looking to maximize its potential, this talk delivers actionable insights for leveraging Kong's capabilities and achieving transformative results.

Summary

  • Kong Gateway is an open-source, scalable and high-performance API gateway product. Kong also provides enterprise editions which may offer advanced features. The presentation is divided into three parts, each focusing on different aspects of our enhancements and innovations.
  • The Prometheus plugin has seen significant improvements in P99 latency and RPS. The rate-limiting plugin for traffic control has also seen a notable increase in performance. The enhancement is mainly attributed to the introduction of a batch queue.
  • Since version 3.0, we have introduced a new router implementation, called the ATC router. The new router empowers users to define routes with expressions. Configuring routes using expressions allows for more flexibility and better performance. The new router reduces rebuild time when routes change and increases runtime performance when routing requests.
  • From version 3.0, we handle configuration by introducing LMDB, an in-memory transactional database, as the backend for Kong. LMDB significantly improves RPS during rebuilds under constant config pushes. We observed a remarkable 50% to 70% additional drop in rebuild time.
  • Kong utilizes a request buffer to hold the payload, and when the payload exceeds the buffer size, NGINX resorts to using the disk to buffer the request. Although buffering can increase performance, some cases are not suitable for enabling it.
  • Due to the limits of the slides, some topics may not have been fully discussed. If you have any questions, please feel free to reach out to me. I will be glad to help you.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi there, I'm Wenchao, a software engineer at Kong. I'm so glad to have the opportunity to share with you some exciting updates and innovations in Kong Gateway. Before that, let me spend some time telling you a little bit more about our company and our products. Kong is a company that provides API gateway products and solutions. We have been focusing on helping our clients manage and secure their APIs effectively, empowering global companies to embrace an API-first approach. Kong Gateway is an open-source, scalable and high-performance API gateway product that provides features such as authentication, traffic control, HTTP transformation, logging and monitoring. In addition to the open-source gateway, Kong also provides enterprise editions, which may offer advanced features and support for organizations with specific requirements. We also provide a cloud-native solution called Konnect, which enables you to manage APIs across any cloud from one place. Konnect provides a hosted, unified control plane to manage gateways and ingress controllers. This empowers your organization to reduce operational overhead, speed up time to market, increase security, and achieve federated governance at scale.
So in today's presentation, I will be focusing on Kong Gateway from several aspects. The presentation will be divided into three parts, each focusing on different aspects of our enhancements and innovations. First and foremost, I will be introducing the significant improvements we are making in some plugins and exploring the key innovations in some core components that are powering up Kong Gateway. Following that, I will show you how we do performance testing at Kong. In the third part, let's explore some skills in fine-tuning the performance for an enhanced user experience.
Kong has hundreds of out-of-the-box plugins to address various use cases without introducing any platform complexity. This makes Kong flexible enough to boost developer productivity and ensure operational resilience. In today's talk, we will just have a quick look at the latest improvements in some widely used plugins. So let's start with the Prometheus plugin. As a commonly used cornerstone of modern monitoring systems, Prometheus always plays a crucial role in Kong Gateway. While the plugin can perform well in scenarios with a relatively small number of routes, we recognized a decline in performance as the number of routes increases substantially. So we took this as a challenge, and the following is what we have worked out. As illustrated on this slide, we have accelerated metric scraping for the Prometheus plugin, and we have also implemented a reduction in high-cardinality metrics. These enhancements eventually result in an improvement in the P99 latency, which means the plugin can now run smoothly even in large-scale systems. Three factors contribute to these improvements. Firstly, we've adopted a table pool for the creation of temporary tables, which optimizes the process and mitigates the cost of excessive table creation during Prometheus scraping, so that we can avoid the performance hit of creating new tables all the time. Secondly, a few yield functions have been introduced in critical code paths, which allow the Lua code to release control of the CPU to other request-processing events rather than holding it for too long. This improves the responsiveness of the gateway quite a lot.
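To make these two techniques concrete, here is a minimal OpenResty-style sketch, not Kong's actual plugin code: the pool name, the metrics argument and the emit helper are hypothetical. It shows table reuse via the lua-tablepool library that ships with OpenResty, and a cooperative yield via a zero-delay ngx.sleep, which hands control back to the NGINX event loop.

```lua
local tablepool = require "tablepool"  -- bundled with OpenResty

local POOL = "prom_scrape"             -- hypothetical pool name

local function serialize_metrics(metrics)
  for i, metric in ipairs(metrics) do
    -- fetch a recycled table instead of allocating a fresh one
    local buf = tablepool.fetch(POOL, 4, 0)
    buf[1] = metric.name
    buf[2] = " "
    buf[3] = metric.value
    emit(table.concat(buf))            -- hypothetical output helper
    -- hand the table back to the pool for reuse by later iterations
    tablepool.release(POOL, buf)

    -- yield to the NGINX event loop every 100 metrics so a long scrape
    -- does not starve other requests; sleep(0) returns immediately
    -- (this must run in a phase where yielding is allowed)
    if i % 100 == 0 then
      ngx.sleep(0)
    end
  end
end
```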
Lastly, we've also replaced all the NYI (not yet implemented) functions involved in this plugin with alternatives the LuaJIT compiler can compile, which fully harnesses the performance benefits of the LuaVM. These changes collaboratively contribute to the performance boost. The following bar chart shows a comparison of the P99 latency between versions 3.3 and 3.4. We just need to pay attention to the red and the green columns: the red one stands for version 3.3 and the green one stands for version 3.4. Here we can see a significant enhancement in the P99 latency in 3.4.
Now let's move on to the rate-limiting plugin, which is a plugin for the purpose of traffic control. For this plugin, we've also achieved a notable improvement in both P99 latency and RPS. The enhancement is mainly attributed to the introduction of a batch queue, a fundamental component newly introduced in 3.4, which helps especially when the backend of the rate-limiting counters is an external storage. With the help of this queue, there is no need to communicate with the external storage for every single request. The plugin just puts data into this internal in-memory batch queue, almost without any latency, every time the counters are calculated, and the queue flushes these counters to the external storage asynchronously (see the sketch after this section). So it optimizes the communication between the plugin and the external storage, which is particularly beneficial in high-concurrency scenarios. We've also introduced a new configuration option, sync_rate, so that we can even decide the flush frequency. The graph below illustrates the test results compared to a Golang-based open-source gateway. As we can see, the performance in both P99 latency and RPS is nearly twice as good as the competitor's.
Now that we have discussed the improvements in plugins, let's delve into the enhancements in some core components. To begin our discussion, let's first focus on the router component, which plays a key role in route pattern matching. Why is there a need for improvements in the router? Well, given that users heavily depend on its flexibility and efficiency to configure their routing requirements effectively, the router stands out as the cornerstone of a gateway. Moreover, nearly every individual request has to be checked by the router to determine whether it matches any of the API paths we've configured. As traffic volume increases, the router becomes a critical focal point, and its performance emerges as a key factor. Since version 3.0, we have introduced a new router implementation, called the ATC router. The ATC router takes a DSL-based approach, and it's written in Rust. The main reason for us to upgrade to a DSL-based approach is to enhance flexibility. The new router empowers users to define routes with expressions (see the example expression below). Configuring routes using expressions allows for more flexibility and better performance, without losing readability when dealing with complex or large configurations. Furthermore, we rewrote it in Rust, as Rust is not only generally quite efficient, comparable to the C language, providing high performance with low-level control, but is also much safer than C in terms of its handling of memory. So it's ideal as the language of choice for the development of a new module. It is never an easy thing to rewrite a brand-new module completely in Rust, because Rust is not a natively supported language in OpenResty, and several engineers invested a considerable amount of effort in this project. The results proved that everything was worthwhile.
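Here is the promised sketch of the batch-queue pattern used by the rate-limiting plugin. It is a deliberately simplified illustration with hypothetical names, not Kong's internal queue API: the hot path only appends to an in-memory table, and a recurring timer pushes the accumulated batch to the external storage.

```lua
-- Simplified batch-queue sketch (hypothetical names, not Kong's internal API).
local buffer = {}

-- Hot path: record a counter update in memory and return immediately,
-- so no request ever waits on a round trip to the external storage.
local function enqueue(key, delta)
  buffer[#buffer + 1] = { key = key, delta = delta }
end

-- Cold path: a recurring timer drains the buffer and pushes the whole
-- batch to the external storage (e.g. Redis) in a single round trip.
local function flush(premature)
  if premature or #buffer == 0 then
    return
  end
  local batch = buffer
  buffer = {}
  push_batch_to_storage(batch)  -- hypothetical storage helper
end

-- sync_rate decides the flush frequency, in seconds.
local sync_rate = 1
assert(ngx.timer.every(sync_rate, flush))
```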
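And to give a flavor of the expressions DSL, a route in the new router is described by a single predicate over request fields instead of separate lists of paths, hosts and methods. A route expression looks roughly like the following (the exact set of fields and operators is documented in Kong's expressions reference; this sketch matches GET requests on example.com whose path starts with /api/v1):

```
http.method == "GET" && http.host == "example.com" && http.path ^= "/api/v1"
```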
The new router reduces rebuild time when routes change. Meanwhile, it increases runtime performance when routing requests, and it reduced the P99 latency from 1.5 seconds to 0.1 seconds when testing with 10,000 routes. Besides the rewrite of the router, we've made two more optimizations to its logic. Firstly, we optimized the route rebuild logic. Secondly, conditional rebuild of the routes was introduced, which means the route builder becomes more intelligent: it is aware of whether routes have been changed and whether they need to be rebuilt. Indeed, these optimizations also contribute to the final enhancements.
The second core component that I want to share with you is the improvement in our approach to configuration management. In our previous iterations, we relied on shared memory as the configuration storage, which is a kind of native approach that provides node-level storage accessible across all worker processes. While this was effective in ensuring fast in-memory storage, we recognized the need for enhancement due to excessive memory consumption. Another concern is that shared memory is not database-like storage; it does not support transactional operations. In some cases this is an unacceptable limitation because, without the ability to guarantee atomicity, we would need extra code of our own to keep the data consistent, which would increase the complexity of the codebase. From version 3.0, we handle configuration by introducing LMDB, an in-memory transactional database, as the backend for Kong. By doing this, we've not only achieved a notable reduction in memory usage, but also unlocked the power of transactional storage. The benchmark results here also reveal that LMDB significantly improves RPS during rebuilds under constant config pushes. What's more, we observed a remarkable 50% to 70% additional drop in rebuild time, particularly noticeable in cases with a large number of workers.
Kong Gateway is based on NGINX and OpenResty. It has a master-worker architecture that uses the CPU efficiently by forking multiple worker processes, and it faces information-sharing challenges across workers due to the isolated nature of worker processes. To address the communication needs between workers, especially for tasks like router rebuilding and health checking, Kong designed an event propagation mechanism. Previously, the event component we implemented relied on shared memory. Though this approach was effective, it had drawbacks such as high lock overhead: accessing a single piece of shared memory requires a lock to avoid race conditions and keep consistency, which leads to high overhead when there are numerous workers and is considered a kind of inefficiency. Another concern is the size of the shared memory. It has to be decided and allocated in advance when NGINX starts, and it cannot be changed at runtime, meaning that the memory is allocated regardless of whether it is ever fully utilized. To enhance efficiency and avoid these risks, we upgraded to a new event library called lua-resty-events. The new event library operates on a classic publish-subscribe model and is implemented in pure Lua code, eliminating the need for shared memory. We ran tests to compare the two event libraries, with two test cases: one posts 1,000 events every five milliseconds; the other posts 10,000 events every 100 milliseconds.
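To make the publish-subscribe model concrete, here is a schematic sketch. The subscribe and publish helpers are hypothetical and deliberately simplified, not the real lua-resty-events API: the point is that events travel through a broker rather than through a locked shared-memory zone.

```lua
-- Schematic publish-subscribe sketch (hypothetical helpers, not the real
-- lua-resty-events API); events flow through a broker, not shared memory.

-- Any worker: register a handler for route CRUD events.
subscribe("crud", "routes", function(data)
  -- e.g. mark the in-worker router as stale so it gets rebuilt lazily
  router_stale = true
end)

-- The worker that handled the config change: publish the event; the
-- broker fans it out asynchronously to every subscribed worker.
publish("crud", "routes", { id = "route-123", operation = "update" })
```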
And the benchmark here reveals that under the low-pressure condition, the two libraries have nearly the same RPS, while the new library wins when the workload increases.
Considering that performance is essential for a commercial product to deliver an excellent user experience, we carefully designed a testing system and a set of statistical methods. Here is the infrastructure of the testing system. Two bare-metal machines have been set up: one works as the load generator, and the other is used for the deployment of Kong. We even bind the different components, the NIC, the upstreams and the Kong gateway, to different groups of CPUs, trying to get rid of CPU contention, and we connect the two bare metals with a private network to fully utilize the network bandwidth. We even put the upstream and the Kong gateway on the loopback interface to give wrk more bandwidth. Here are the methods we use to check all the key metrics: RPS, latency and memory usage. This graph shows how we check for decreases in the metrics. We compare two groups of sample results to see if there is a significant difference. We run the test every day, and if a difference shows up, the test is considered a failure and we get a notification about it. Of the two groups of results, one comes from the per-release test carried out for each new version release, whose results are persisted in storage. The other is collected every day and persisted in the same storage in case we need to review it. The charts illustrated on this slide show the effect of our methods. As you can see, it is easy to spot the variations between the test cases. When a difference is found, perf is run to collect profiling data and generate a flame graph, so that we can have an insight into the details. From the flame graph we can generally find the problematic function, and after we make changes in our code to fix the issue, we run the perf test again to confirm the issue has gone. This is just an example to show you how we do performance testing.
So in the last part of today's presentation, let's learn some practical insights into tuning the performance. Sometimes performance tuning can be intricate, and today I will just talk about some common and fundamental strategies as examples. For more in-depth information, please explore the technical blogs on Kong's official website. Kong, as an API gateway, comes out of the box well tuned for proxying traffic. However, there are some cases where Kong is used for processing requests with large payloads, so let's explore this case first. Kong utilizes a request buffer to hold the payload, and when the payload exceeds the buffer size, NGINX resorts to using the disk to buffer the request. This avoids excessive memory usage and prevents a potential memory shortage, but it can lead to extensive disk operations. Such disk operations can be costly and may block the NGINX event loop, resulting in performance degradation. To mitigate this, a simple solution is to increase the buffer size, which defaults to 8k. As illustrated in this graph, the larger the buffer size, the better the performance we get. Another case arises when Kong receives responses from upstream with large payloads. Similar to the previous case, a buffer is in place for holding response bodies, and the same principle holds true, as demonstrated in this graph.
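As a concrete illustration, one way to raise these buffer sizes is through Kong's NGINX directive injection in kong.conf. The values below are illustrative only, not recommendations; tune them against your actual payload sizes and memory budget:

```
# kong.conf -- illustrative values, not recommendations

# Request side: in-memory buffer for the client request body
# (bodies larger than this spill to disk).
nginx_http_client_body_buffer_size = 1m

# Response side: buffers used when reading the upstream response.
nginx_proxy_proxy_buffer_size = 64k
nginx_proxy_proxy_buffers = 8 64k
```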
Although buffering can increase performance, some cases are not suitable for enabling it, including real-time use cases, long-lived connections, and streaming transmission or streaming processing. We should also not enable it when the memory of the environment is limited. Okay, I think that's almost everything I would like to cover in today's presentation. Due to the limits of the slides, some topics may not have been fully discussed. If you have any questions, please feel free to reach out to me. I will be glad to help you. Thank you.

Wenchao Xiang

Senior Software Engineer @ Kong


