Transcript
This transcript was autogenerated. To make changes, submit a PR.
Today I am talking about how you can scale OpenTelemetry collectors using Kafka. To introduce myself, I am Pranay, one of the founders and maintainers at SigNoz. I have worked as a product manager in the past, at Microsoft and multiple other startups, and in my free time I love reading and taking walks in nature.
Just to set context on why I'm talking about the OpenTelemetry collector: as I mentioned, I am a maintainer at SigNoz. It is an open source observability platform. We have been around for about three years, we have around 16,000-17,000 GitHub stars, 4,000+ members in our Slack community, and 130+ contributors, and it's an OpenTelemetry-native single pane of glass for observability. We have support for traces, metrics, and logs, and you can see all of them in a single pane and correlate across them. Today I'll be talking about our experience with SigNoz Cloud, which uses the OpenTelemetry collector underneath, and how we scaled it. So that's where I'm coming from.
Just to set context for people who are not aware of what OpenTelemetry is: OpenTelemetry is a CNCF project, a vendor-neutral standard for sending telemetry data from your applications, and it supports all the signals. You can send metrics with it, you can send traces with it, you can send logs with it. This is now becoming the default standard for how people do observability. So why is it important?
OpenTelemetry is the second fastest growing project in the CNCF, just after Kubernetes, so you can understand how popular it is. What it enables is sending data to any backend. It standardizes on the OpenTelemetry protocol (OTLP), through which applications and infrastructure can send data to a backend. It is now becoming the default standard for instrumentation, and people can use any backend which supports OpenTelemetry for their use cases. The big advantage is that users are no longer getting vendor locked in to a particular ecosystem and hence have more flexibility, and this has promoted a lot of innovation in the ecosystem. The other key advantage is that it has been a unique standard in the sense that it supports all three signals, metrics, traces, and logs, from the get-go, and new signals like profiling are in progress.
At SigNoz, we are natively based on OpenTelemetry. When we started the project in 2021, we took a bet on being natively based on OpenTelemetry; those are the only SDKs we support, and for most of the things we do, we try to rely on OpenTelemetry as much as possible. So in today's talk, if you have used OpenTelemetry collectors or have tried them for setting up your observability, we'll talk about how you can scale them and use them at very large scale. I'll talk about how we leverage them for our SigNoz Cloud product, and hopefully it will help you get a sense of where things are. Cool. So now we understand what OpenTelemetry is.
One of the key components of OpenTelemetry is the OpenTelemetry collector. You can think of it essentially as a pipeline through which you can receive data, process it, and send it to different destinations. There are three key components of the OpenTelemetry collector: receivers, processors, and exporters. Through receivers you can receive data in different formats. For example, there is a host metrics receiver, which can collect infrastructure metrics from machines, like CPU and memory, and there's a Kafka metrics receiver through which you can get metrics from Kafka. There are 90+ such receivers which enable you to receive data from different sources. Next are processors, which help you do different types of processing: you can change attributes, or you can filter particular types of logs, traces, or metrics. And then you export the data to destinations. You can export it to databases like ClickHouse, you can export it to backend providers like SigNoz, or you can use something like the Kafka exporter to send data via Kafka. So OpenTelemetry collectors are really key components in OpenTelemetry, and we'll focus on how we have used them to provide our SigNoz Cloud service.
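To make the anatomy concrete, here is a minimal sketch of a collector configuration with a receiver, processors, and an exporter wired into a pipeline. This is an illustration only; the ClickHouse endpoint and the attribute values are placeholders, not our production setup.

```yaml
receivers:
  hostmetrics:                    # infrastructure metrics from the host
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}

processors:
  resource:                       # example of changing attributes
    attributes:
      - key: deployment.environment
        value: production         # placeholder value
        action: upsert
  batch: {}                       # batch data before export

exporters:
  clickhouse:                     # send to a ClickHouse database
    endpoint: tcp://clickhouse:9000   # placeholder endpoint

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      processors: [resource, batch]
      exporters: [clickhouse]
```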
Just to take you a bit deeper into our SigNoz Cloud single-tenant architecture, first without Kafka. We have both multi-tenant and single-tenant architectures; in this talk I'll primarily focus on the single-tenant architecture, because that's where we have used Kafka a lot to scale. So imagine there's a customer who is using the OTel collector as an agent and sending data to the SigNoz backend, or sending data directly from their applications. In an architecture which doesn't involve Kafka, they send data directly to a load balancer, and it is then directed to the individual tenant OTel collectors. In this architecture, each SigNoz tenant has its own OTel collector, and the load balancer sends data from a particular customer to that customer's specific OTel collector. The problem with this architecture is that if a tenant goes down, or the DBs in that tenant have issues, then the tenant's OTel collector starts returning 5xx errors to the agents, which leads to loss of data. Also, if a customer suddenly spikes their ingestion rate, say they 10x it within a few minutes, and the data goes directly to a single OTel collector, that collector may not scale as quickly, and that will lead to loss of data for the tenant. There is always some time which the tenant DB takes to scale up, and during that time there will be loss of data for the tenant. So effectively customers may see some data getting dropped, which is obviously not a good thing. These were some of the problems which we immediately identified with the single-tenant architecture.
It seemed imperative that there should be a queuing system after the gateway OTel collectors. So in this setup, SigNoz customers send data to the SigNoz load balancer, which talks to a gateway of collectors. This is basically a horizontally scaling fleet of collectors which receives data from the customers. Here we use something called the Kafka exporter. As I mentioned earlier, the OTel collector has receivers, processors, and exporters; in this setup the gateway OTel collectors receive the data from the customers, and then we use the Kafka exporter in those collectors to send data to Kafka. So we have a Kafka setup here, and I'll get into more details on its setup and configuration. Then, in the SigNoz tenant, which also has an OTel collector in it, we enable the Kafka receiver and get data from Kafka there.
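As a rough sketch of what this looks like in collector configuration, assuming placeholder broker addresses, topic name, consumer group, and ClickHouse endpoint (not our actual production values):

```yaml
# Gateway OTel collector: receives OTLP from customers, exports to Kafka
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  kafka:
    brokers: ["redpanda-0:9092"]   # placeholder broker address
    topic: otlp_spans              # placeholder topic name
    encoding: otlp_proto
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [kafka]
---
# Tenant OTel collector: consumes from Kafka, writes to the tenant's DB
receivers:
  kafka:
    brokers: ["redpanda-0:9092"]
    topic: otlp_spans
    encoding: otlp_proto
    group_id: tenant-a             # placeholder consumer group
processors:
  batch: {}
exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000   # placeholder endpoint
service:
  pipelines:
    traces:
      receivers: [kafka]
      processors: [batch]
      exporters: [clickhouse]
```

The same pattern can be repeated for metrics and logs pipelines.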
As you can see, even if there is a huge spike in load from a particular customer, Kafka acts as the queuing system in between to absorb that spike, and then, as the tenant's OTel collector gets scaled up, it can start consuming at a higher rate. So there is no loss of data, none of the packet drops which we had earlier when there was no Kafka in place. These are some of the use cases where Kafka can be helpful.
This enables highly available ingestion. Our current Kafka setup has a 6-hour retention period, a replication factor of three, and a 10 MB max message size. Kafka, as you know, can be configured to be highly available, and we do that with a replication factor of three. Kafka acts as a buffer for 6 hours: it has much higher availability, it can handle bursty traffic, and the tenant can continue consuming at its own rate and has some time to scale up as needed, while Kafka acts as the single line of defense against a very high spike in load. These are the two key factors.
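The retention period and replication factor are properties of the Kafka (in our case Redpanda) topics themselves; on the collector side, the Kafka exporter's producer settings should be kept consistent with the 10 MB message limit. A hedged sketch with placeholder values and an example compression codec:

```yaml
exporters:
  kafka:
    brokers: ["redpanda-0:9092"]    # placeholder broker address
    topic: otlp_spans               # placeholder topic name
    producer:
      max_message_bytes: 10000000   # keep in line with the broker-side 10 MB limit
      compression: zstd             # example codec; batching + compression happens before Kafka
```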
A parallel advantage is that this also enables a lot of additional processing which we can do around Kafka, especially in the case of tail-based sampling for traces. Sometimes people want to group spans by trace ID, accumulate all the spans of a trace in one place, and then do some processing on them, for example rejecting, or not storing, traces which have a particular attribute in them, say all traces for health check endpoints, which essentially don't add value. If there are just OTel collectors, and there are multiple of them, you have the challenge of making all the spans of a trace arrive at the same collector, because the OTel collector is by design horizontally scalable and stateless, so you don't really control which collector a given span will go to. With Kafka, this problem can be solved much more easily: you can use the trace ID as the partition key, which makes all spans for a trace ID go to a particular partition, and hence you can do tail-based sampling much more easily there. So having a queue system in between helps you do a lot of additional processing, and it makes things like tail-based sampling at the OTel collector level much easier.
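For illustration, newer contrib builds of the Kafka exporter expose an option to key messages by trace ID, and the tail sampling processor can then run on the consuming collector. The sketch below assumes your exporter version has that option; the attribute key and endpoint values are placeholders:

```yaml
# Gateway side: key Kafka messages by trace ID so all spans of a trace
# land in the same partition (and hence the same consuming collector)
exporters:
  kafka:
    brokers: ["redpanda-0:9092"]
    topic: otlp_spans
    partition_traces_by_id: true

# Tenant side: tail-based sampling, e.g. dropping health-check traces
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: drop-health-checks
        type: string_attribute
        string_attribute:
          key: http.target                  # placeholder attribute key
          values: ["/health", "/healthz"]   # placeholder endpoints
          invert_match: true                # keep only traces that do not match
```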
So, as I said, this is our Kafka setup. Just to give an example, these are some of the typical things we monitor. If you have set up Kafka, you will also need to monitor, for each tenant, how many records are getting produced and how many records are getting consumed. The difference between the number of records produced and consumed basically tells you whether there is a lag in between, and whether you need to scale your system or not.
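One way to collect these Kafka metrics is the collector's own kafkametrics receiver, which scrapes broker, topic, and consumer group metrics, including consumer lag. A minimal sketch with a placeholder broker address:

```yaml
receivers:
  kafkametrics:
    brokers: ["redpanda-0:9092"]   # placeholder broker address
    protocol_version: 2.0.0
    collection_interval: 30s
    scrapers:
      - brokers      # broker count
      - topics       # partitions and offsets per topic
      - consumers    # consumer group lag and offsets
```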
When we set up our system with Kafka, we actively monitor all these metrics, and as you can see, we use SigNoz to monitor SigNoz, so this is a case of the monitoring system being monitored itself; we use SigNoz heavily internally to monitor all our SaaS customers and use cases. You can monitor all the different Kafka topics here: how many records are being consumed and how many records are being produced. Monitoring consumer lag is very important, because it gives you a sense of how much ingestion is coming in versus how much the individual OTel collector is consuming. For example, say a particular customer starts sending data at a very high rate while their tenant is consuming at a particular rate. If the ingest rate suddenly spikes, the consumer lag will increase, and we monitor that actively to throw alerts, which signal that the SigNoz tenant may need to be scaled automatically and more resources added there. This scaling happens automatically, but consumer lag is one of the pointers which helps you decide when to scale the tenants. So we monitor consumer lag actively, as well as the Kafka cluster itself. We use Redpanda internally, so we monitor how many Redpanda brokers there are and what the consumer lag is in between.
One of the key factors which we use for scaling decisions is consumer lag: it tells us whether to increase the number of partitions or not, and it can also be used as a metric to scale up the consumer group, that is, the tenant. If we see that the consumer lag is increasing a lot, we have scripts that automatically scale this up, basically enabling higher consumption by the tenant OTel collectors, which reduces the consumer lag and makes the system stable again. So Kafka acts as a buffer in between to handle the workload. In the scenario without Kafka, as we discussed earlier, the tenant OTel collector would suddenly get overloaded, and if it didn't scale at the same rate, it would start dropping packets; if Kafka is present, that doesn't happen. So consumer lag is an important thing to monitor, and it helps you scale.
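We do this with our own scripts, but for teams on Kubernetes, one common off-the-shelf way to scale consumers on lag is KEDA's Kafka scaler. The sketch below is an illustrative alternative with placeholder names and thresholds, not our internal setup:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: tenant-a-otel-collector      # placeholder name
spec:
  scaleTargetRef:
    name: tenant-a-otel-collector    # placeholder Deployment running the tenant collector
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: redpanda-0:9092   # placeholder broker
        consumerGroup: tenant-a             # placeholder consumer group
        topic: otlp_spans                   # placeholder topic
        lagThreshold: "5000"                # scale out when per-partition lag exceeds this
```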
Another thing which we also monitor is producer-consumer latency. Here, as you can see, we have plotted how much time it takes for the producer to produce a record and get it into Kafka, and what the latency on the consumer side is. If this latency increases beyond a particular threshold, we throw alerts in SigNoz, which basically indicates that something is wrong and some steps may need to be taken to fix it.
So far, the Kafka-based architecture has been working quite well for us. It enables very fast ingestion, it can handle spikes in load from customer workloads, and data is retained for 6 hours, which works well. We even get a huge compression factor of 10 to 15, because the OTel collector by default works in a batch model: it sends data in batches, and we are able to compress heavily before ingestion into Kafka. So in that sense also it works quite well, and, as I mentioned earlier, it handles spikes very well.
There are a few areas where I think there could be improvements, which we are working on currently. Can we automatically increase the number of partitions based just on the ingestion rate of a topic? If a partition gets stuck for a tenant OTel collector, can we use something like a dead letter queue to drop messages after a few retries, so that they don't get stuck in a permanent failure? Also, can we make the whole tenant OTel collector pipeline, from the Kafka receiver through the processors to the exporter, a synchronous module, so that the consumer commits an offset only after the message has been successfully written to the DB, giving some guarantee about when the message has been written? And the other key piece: can we make this exactly-once delivery, which, as you know, is not easy to do in queuing systems? These are some of the improvements which we foresee in the future and are working on. But overall, just adding Kafka behind the OTel collectors has been a step improvement for us, and hopefully other teams which are running at a similar scale can take this as a guide.
So that's all I had for my talk. I just want to give a shout out for SigNoz: we are an actively growing community using OpenTelemetry, and we are a completely open source, OpenTelemetry-native stack. You can check out the SigNoz repo, and if you try it out, feel free to create an issue or participate in our Slack community. As I mentioned, we have quite an active Slack community where you can come and get your questions answered, and if you have any follow-on thoughts, feel free to email me. That's all from my side. Looking forward to hearing more from you, and feel free to reach out to us if you have any feedback. Thank you.