Conf42 Site Reliability Engineering (SRE) 2024 - Online

Scaling OpenTelemetry Collectors using Kafka

Abstract

There have been multiple discussions in observability communities about whether or not to use Kafka with OpenTelemetry.

In this talk, we will look into ways to scale OpenTelemetry Collectors and the pros and cons of different approaches, based on our experience maintaining high-scale observability backends at SigNoz.

Summary

  • Today I am talking about how you scale OpenTelemetry Collectors using Kafka. I am one of the founders and maintainers at SigNoz, and today I'll be talking about our experience with SigNoz Cloud, which uses the OpenTelemetry Collector underneath, and how we scaled it.
  • OpenTelemetry is a CNCF project and a vendor-neutral standard for sending telemetry data from your applications. It is now becoming the default standard for instrumentation. At SigNoz we are natively based on OpenTelemetry, and I'll talk about how we leverage it for our SigNoz Cloud product.
  • OpenTelemetry Collectors are key components of OpenTelemetry. We use Kafka to scale our SigNoz Cloud service, which has both multi-tenant and single-tenant architectures. It seemed imperative to have a queuing system after the gateway OTel Collectors.
  • Kafka enables very fast ingestion and can handle spikes in load from customer workloads. One of the key factors we use for scaling is consumer lag. There are a few areas where I think there could be improvements, which we are working on currently.
  • We are an actively growing community around OpenTelemetry. SigNoz is a complete observability stack, so check out the SigNoz repo. Feel free to create an issue or participate in our Slack community. Looking forward to hearing more from you.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Today I am talking about how you scale OpenTelemetry Collectors using Kafka. To introduce myself, I am Pranay, one of the founders and maintainers at SigNoz. I have worked as a product manager in the past, at Microsoft and multiple other startups. I love reading and taking walks in nature. Just to set context on why I'm talking about the OpenTelemetry Collector: as I mentioned, I am a maintainer at SigNoz, an open source observability platform. We have been around for about three years. We have around 16,000 to 17,000 GitHub stars, 4,000-plus members in our Slack community, and 130-plus contributors, and it's an OpenTelemetry-native single pane of glass for observability. So we have support for traces, metrics and logs, and you can see all of them in a single pane and correlate across them. Today I'll be talking about our experience with SigNoz Cloud, where underneath we use the OpenTelemetry Collector, how we scaled it, and our experience regarding that. So that's where I'm coming from. Just to set context for people who are not aware of what OpenTelemetry is: OpenTelemetry is a CNCF project, a vendor-neutral standard for sending telemetry data from your applications, and it supports all the signals. You can send metrics using it, you can send traces using it, you can send logs using it. This is sort of becoming the default standard for how people do observability. So why is it important? OpenTelemetry is the second fastest growing project in CNCF, just after Kubernetes, so you can understand how popular it is. What it enables is sending data to any backend. It standardizes on the OpenTelemetry protocol, through which applications and infrastructure can send data to a backend. This is now becoming the default standard for instrumentation, and people can use any backend which supports OpenTelemetry for their use cases. The big advantage it gives is that users are not getting locked in to a particular vendor's ecosystem and hence have more flexibility, which has promoted a lot of innovation in the ecosystem. The other key advantage is that it has been a unique standard in the sense that it supports all three signals, metrics, traces and logs, from the get-go, and new signals like profiling are in progress. At SigNoz we are natively based on OpenTelemetry. When we started the project in 2021, we took a bet on being natively based on OpenTelemetry; those are the only SDKs we support, and for most of the things we do, we try to rely on OpenTelemetry as much as possible. So in this talk, if you have used OpenTelemetry Collectors or have tried them for setting up your observability, we'll talk about how you can scale them and use them at huge scale. I'll talk about our context and how we leverage them for our SigNoz Cloud product, and hopefully it will be helpful for you to get a sense of where things are. Cool. So now we understand what OpenTelemetry is. One of the key components of OpenTelemetry is the OpenTelemetry Collector. You can think of it essentially as a pipeline through which you can receive data, process data and send it to different destinations. There are three key components of the OpenTelemetry Collector: receivers, processors, and exporters. Through receivers you can receive data from different sources and in different formats.
For example, there is a host metrics receiver, which can collect infrastructure metrics like CPU and memory usage from machines. There's a Kafka metrics receiver, through which you can get metrics from Kafka. There are 90-plus such receivers which enable you to receive data from different sources. Next are processors, which help you do different types of processing: you can change attributes, or filter particular types of logs, traces or metrics. And then you export the data to destinations. You can export it to destinations like ClickHouse, to backend providers like SigNoz, or you can use an exporter like the Kafka exporter to send data via Kafka. So OpenTelemetry Collectors are really the key components in OpenTelemetry, and we'll focus on how we have used them to provide our SigNoz Cloud service. Just to take you a bit deeper into SigNoz Cloud's single-tenant architecture, first without Kafka: we have both multi-tenant and single-tenant architectures, and in this talk I'll primarily focus on the single-tenant architecture, because that's where we have used Kafka a lot to scale. So imagine there's a customer who is using an OTel Collector as an agent and sending data to the SigNoz backend, or sending data directly from their applications. In the architecture which doesn't involve Kafka, they send data directly to a load balancer, and it is then directed to the individual tenant OTel Collectors. In this architecture, each SigNoz tenant has its own OTel Collector, and the load balancer sends data from a particular customer to their specific SigNoz OTel Collector. The problem with this architecture is that if the tenant goes down, or the DBs in that tenant have issues, then the OTel Collector starts giving 5xx errors to the agents, and that leads to loss of data. Also, if a customer suddenly spikes their ingestion rate, say 10x in a few minutes, and that traffic goes directly to a single OTel Collector, the collector may not scale as quickly, and that leads to loss of data for the tenant. There's always some time which the tenant DB takes to scale up, and during that time there will be loss of data for the tenant. So effectively customers may see some data getting dropped, which is obviously not a good thing. These were some of the problems which we immediately identified with the single-tenant architecture. It seemed imperative that there should be a queuing system after the gateway OTel Collectors. So in this example, SigNoz customers send data to the SigNoz load balancer, and that talks to gateway OTel Collectors, basically an autoscaling fleet which receives data from the customers. We use something called the Kafka exporter. As I mentioned earlier, the OTel Collector has receivers, processors and exporters. In this example, the gateway OTel Collectors receive the data from the customers, and we use the Kafka exporter in those OTel Collectors to send data to Kafka. So we have a Kafka setup here; I'll get into more details on what the setup and the configuration for that look like.
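To make this concrete, here is a minimal sketch of what a gateway-side Collector configuration with the Kafka exporter can look like. The broker addresses, topic name and pipeline are placeholders chosen for illustration, not SigNoz's production values, and only a traces pipeline is shown.

    receivers:
      otlp:
        protocols:
          grpc:
          http:

    processors:
      batch:                       # batch before export; batching also helps compression

    exporters:
      kafka:
        brokers: ["kafka-0:9092", "kafka-1:9092", "kafka-2:9092"]   # placeholder broker addresses
        topic: otlp_spans          # placeholder topic name
        encoding: otlp_proto       # ship OTLP protobuf payloads
        # partition_traces_by_id: true   # if your collector version supports it, keys messages by trace ID

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [kafka]

Similar pipelines can be defined for metrics and logs, with the Kafka exporter pointing at separate topics per signal.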
And then in the SigNoz tenant, which also has an OTel Collector in it, we enable the Kafka receiver and get data from Kafka there. As you can see, even if there is a huge spike in load from a particular customer, Kafka acts as the queuing system in between to absorb that spike, and then as the tenant's OTel Collector gets scaled up, it can start consuming at a higher rate. So there's no data loss or dropped packets, which we had earlier when there was no Kafka in place. These are some of the use cases where Kafka can be helpful. It enables highly available ingestion. This is our current Kafka setup: a 6-hour retention period, a replication factor of three, and a 10 MB max message size. We have configured Kafka to be highly available, with a replication factor of three. Kafka acts as a buffer for 6 hours, it has much higher availability, it can handle bursty traffic, and the tenant can continue consuming at its own rate and has some time to scale up as needed, while Kafka acts as a line of defense against a very high spike in load. Those are the two main factors, but a parallel advantage is that this can also enable a lot of additional processing at the Kafka layer, especially in the case of tail-based sampling for traces. Sometimes people want to group spans by trace ID, accumulate all spans of a trace in one place, and then do some processing on them. For example, you may want to reject, or not store, traces which have a particular attribute in them, say all traces for health check endpoints, which essentially don't add value. If there are just OTel Collectors, you have the challenge of getting all spans of a trace to the same collector when there are multiple collectors, because the OTel Collector is by nature horizontally scalable and stateless, so you don't really control which collector a span will go to. With Kafka, this problem can be solved much more easily, because you can use the trace ID as the partition key, which ensures all spans with the same trace ID go to a particular partition, and hence you can do tail-based sampling much more easily there. So having a queuing system in between helps you do a lot of additional processing, and makes things like tail-based sampling at the OTel Collector level much easier. As I said, this is our Kafka setup. Just to give an example, these are some of the typical things we monitor. If you have set up Kafka, you will also need to monitor, for each tenant, how many records are getting produced and how many records are getting consumed. The difference between the number of records produced and consumed basically tells you whether there is a lag in between and whether you need to scale your system or not.
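On the tenant side, the Collector consumes from the same topic with the Kafka receiver. This is again a hedged sketch: the broker addresses, topic, consumer group and ClickHouse endpoint are placeholders, and SigNoz's actual tenant pipeline uses its own exporters and processors.

    receivers:
      kafka:
        brokers: ["kafka-0:9092", "kafka-1:9092", "kafka-2:9092"]   # placeholder broker addresses
        topic: otlp_spans            # must match the topic used by the gateway's Kafka exporter
        encoding: otlp_proto
        group_id: tenant-acme        # hypothetical per-tenant consumer group

    processors:
      batch:

    exporters:
      clickhouse:                    # illustrative: the contrib ClickHouse exporter, not SigNoz's own exporter
        endpoint: tcp://clickhouse:9000   # placeholder endpoint

    service:
      pipelines:
        traces:
          receivers: [kafka]
          processors: [batch]
          exporters: [clickhouse]

Topic-level settings such as the 6-hour retention, the replication factor of three and the 10 MB max message size mentioned above are configured on the Kafka (or Redpanda) cluster itself, not in the Collector.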
So when we set up our system with Kafka, we actively monitor all these metrics, and as you can see, we use SigNoz to monitor SigNoz, so this is a case of the monitor being monitored itself. We use SigNoz heavily internally to monitor all our SaaS customers and use cases. You can monitor all the different Kafka queues here: how many records are being consumed and how many records are being produced. Monitoring consumer lag is very important, because it gives you a sense of how much is being ingested and how much the individual tenant OTel Collector is consuming. For example, say a particular customer starts sending data at a very high rate while the tenant for that customer is consuming at a particular rate. If the ingestion rate suddenly spikes, the consumer lag will increase, and we monitor that actively to throw alerts, which signal that the SigNoz tenant may need to be scaled automatically and that we need to add more resources there. This scaling happens automatically, but consumer lag is one of the pointers which helps you decide what to scale and when to scale the tenants. So we monitor consumer lag actively, along with the state of the Kafka cluster. We use Redpanda internally, so we monitor how many Redpanda brokers there are and what the consumer lag is in between. One of the key factors we use for scaling is consumer lag: it tells us whether to increase the number of partitions or not, and it can also be used as a metric to scale up the consumer group. If we see that the consumer lag is increasing a lot, we have scripts to automatically scale things up, basically enabling higher consumption from the tenant OTel Collectors, which reduces the consumer lag and the system becomes stable again. Kafka acts as a buffer in between to handle these workloads; if there was no Kafka, as we discussed earlier, the OTel Collector would suddenly get overloaded, and if it doesn't scale at the same rate, it starts dropping data; with Kafka present, that doesn't happen. So consumer lag is an important thing to monitor, and it helps you scale. Another thing we actively monitor is producer-consumer latency. Here we have plotted how much time it takes for the producer to get a message into Kafka and the latency with which the consumer reads it. If this latency increases beyond a particular amount, we throw alerts in SigNoz, which basically indicates that something is wrong and some steps need to be taken to fix it. So far, the Kafka-based architecture has been working quite well for us. It enables very fast ingestion, it can handle spikes in load from customer workloads, and data is retained for 6 hours. We even get a huge compression factor of ten to fifteen, because the OTel Collector by default works in a batch model: it sends data in batches, and we are able to get a lot of compression before ingestion into Kafka. So in that sense it also works quite well, and it handles spikes, as I mentioned earlier, very well. There are a few areas where I think there could be improvements, which we are working on currently.
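The talk describes custom scripts that scale tenant collectors when consumer lag grows. As one hedged illustration of the same idea on Kubernetes (not SigNoz's actual setup), a KEDA ScaledObject can scale a collector Deployment based on Kafka consumer lag; all names and thresholds below are placeholders.

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: tenant-otel-collector        # hypothetical Deployment running the tenant Collector
    spec:
      scaleTargetRef:
        name: tenant-otel-collector
      minReplicaCount: 1
      maxReplicaCount: 10
      triggers:
        - type: kafka
          metadata:
            bootstrapServers: kafka-0:9092   # placeholder broker address
            consumerGroup: tenant-acme       # the consumer group used by the tenant's Kafka receiver
            topic: otlp_spans
            lagThreshold: "5000"             # target lag per replica before scaling out

Whatever the mechanism, the design choice is the same: consumer lag, not CPU, drives the scaling decision, because lag directly measures how far consumption has fallen behind ingestion.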
So, can we make the number of partitions increase automatically based on the scale of ingestion of a topic? If a partition gets stuck for a tenant OTel Collector, can we use something like a dead letter queue to drop messages after a few retries, so that it doesn't get stuck in a permanent failure? Can we make the whole tenant OTel Collector pipeline, from the Kafka receiver through the processors to the exporter, a synchronous module, so that the consumer commits an offset only after the message has been successfully written to the DB? That would give some guarantee about when the message has been written. And the other key piece is: can we make this exactly-once delivery, which, as you know, is not easy to do in queuing systems? These are some of the improvements which we foresee in the future and which we are working on. But overall, adding Kafka around the OTel Collectors has been a step improvement for us, and hopefully other teams which are running at this scale can take this as a guide. So that's all I had for my talk. Just want to give a shout-out for SigNoz. We are an actively growing community using OpenTelemetry, and we are a complete observability stack, so you can check out the SigNoz repo. If you try it out, feel free to create an issue or participate in our Slack community. As I mentioned, we have quite an active Slack community where you can come and get your questions answered, and if you have any follow-on thoughts, feel free to email me. That's all from my side. Looking forward to hearing more from you, and feel free to reach out to us if you have any feedback. Thank you.
...

Pranay Prateek

Co-Founder @ SigNoz



