Transcript
Hello everybody, and thank you for choosing this talk.
Today we are going to talk about an interesting subject.
We are going to talk about how to push your streaming platform to the limit.
We're going to talk about performance, benchmarking and
how to measure it. Now, before we begin,
let me introduce myself - My name is Elad Leev,
and I'm a big advocate of everything related to
distributed systems, data streaming, data as a whole,
the whole concept of data mesh and products
like Kafka, Flink, Pinot and so on. So if
you want to hear my opinions about those subjects, go ahead to my Twitter
account and follow me. And all the links from this
talk will be later on posted as a thread in my account. So definitely
check it out. Now, before we begin with the
actual content, let's start with a little bit of marketing,
because, that's life, and we have to do it ;) So what
is Dojo? Dojo is one of the fastest-growing
fintechs in Europe. We power almost everything related
to the face-to-face economy,
whether it's bars, pubs, restaurants and so on.
So if you are UK-based, you have probably seen
our devices around those places.
And as you can imagine, we are dealing with tons of
data, which is quite awesome.
I want to start with a quote from a book by this
guy, a computer scientist named Jim Gray. Now,
I'm a big sucker for those kinds of books from the '80s and '90s
that are still relevant even today. I mean, almost, what,
40 years later? It's kind of amazing. Now,
the book itself, or the handbook, contains a lot of gems regarding
everything from system performance to how to measure it, and so on.
And those gems are actually relevant even today, in the "cloud native" era
of Kubernetes and so on. Now,
we need to understand that measuring systems is *hard*.
There are no magic bullets. We can't have a single
metric that will tell us whether our system is behaving well
or not, right? It's a really hard task. So even
if, for example, we take the biggest streaming platform
in the world, which is of course Kafka, you all know it,
the use cases may vary,
because we might have different message sizes,
we might have different traffic patterns, we might
have different configurations between the different services and
the different consumers and producers in our systems.
And as a result of these different aspects,
we might get a completely different performance
from the same machines, from the same cluster. So it's
crucial to understand that. And it's crucial to understand that
when we evaluate a system, most people just look at the TCO,
the total cost of ownership. And it's okay, it makes sense in a way,
but we can't just rely on the performance benchmarks
that the vendors give us. Because eventually,
if you think about it, those benchmarks
are marketing jobs, right? There are almost no engineers
involved in those processes.
Now, that's okay, but as data-driven professionals,
this is not something we can trust, right? We need to
actually run it on our own to understand those
limits. We can't just believe that the system will
scale as we grow our business and so on.
This is something we actually need to test and see
with our bare eyes. Now,
we need to know those limits. We need to know the limits
of our system, because knowing how much our system
can handle, whether it's, I don't know,
the biggest message size we can process,
or how many RPS our database can handle,
how many messages per second, and so on, and especially
knowing what our limiting factor is,
might later assist us in different ways:
for example, in preventive maintenance, in better and
more accurate capacity planning, and even in eliminating toil,
because eventually toil is a cost, right?
So we need to know the limits of our system.
Now, when you look at those systems
in general, at servers and computers and
performance, you can actually understand that
we only have four pillars of failure,
right? It's either going to be the disk, the memory,
the CPU, or the network, nothing else,
usually. So we need to actually understand which of them is
the limiting factor while running those benchmarks.
Now, understanding those limits
is nice, but what are the key criteria
for our benchmarks? Before we actually run them,
we need to build them properly. First and
foremost, it might sound obvious, but when running those
benchmarks we can't do any tricks. We can't use,
I don't know, faster machines, we can't use a different JVM,
we can't use anything like that,
right? Even if it costs us a bit more,
running those benchmarks on
the same hardware and the same systems as we have in production (or any
other environment that you are targeting) is crucial,
because remember: it's not a game.
We are not trying to aim for a
better result or better numbers or
anything like that. We are aiming for an accurate
result, right? So even if it costs us more,
it's better to mimic our production
environment in our benchmark and use
the same machine types, the same cluster sizes,
the same number of nodes, and so on.
Now, the second bit is that we should aim
for the peak, right? Because if you think about it,
metrics are crucial for the success of the benchmark.
So to begin with, if you don't collect those metrics,
the system performance metrics and of course the application metrics,
you should start by doing that, because it will be crucial
later on. But you also need to understand what your
SLAs and SLOs are:
for example, what is
the acceptable, I don't know, end-to-end latency of your
system? It might change between different clusters, right?
Because you have different services and different use cases and so
on. And also,
what do we consider a downtime? If the
latency is spiking, is that considered a downtime?
A downtime is when the system is not performing well,
and so on. And one of
the most crucial metrics to find is
your peak traffic, because you should aim for
that traffic, right? You should aim for your peak traffic
and, better yet, add some buffer, because you need
to expect the unexpected.
There is a great post by AWS CTO Werner Vogels
where he mentions that
failures are a given: everything will fail over time,
whether it's the disk, the CPU or anything else.
But in those cases we will still need to serve our users
during peak time. So we actually need to understand what our
peak time is, what our peak traffic is. And maybe
we should aim for N-1, right? Because we still
want to serve successfully even when
we lose one or two of the machines. So in
your benchmarks, aim for that point.
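Just to make that concrete, here is the kind of back-of-the-envelope math I mean; the numbers below are made up purely for illustration:

```bash
# Hypothetical numbers, for illustration only.
peak=120000      # observed peak traffic, messages per second
buffer=1.3       # 30% headroom, because we expect the unexpected
brokers=6        # brokers in the production cluster

awk -v p="$peak" -v b="$buffer" -v n="$brokers" 'BEGIN {
  target = p * b                 # load the benchmark should sustain
  per_broker = target / (n - 1)  # what each broker must absorb at N-1
  printf "benchmark target: %d msg/s total, ~%d msg/s per broker at N-1\n", target, per_broker
}'
```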
Now, the benchmark should be scalable and
portable. For now we might run it
against one system, let's say Kafka or Redpanda or anything else. But in
the future we might decide to move to a different system,
right? We might decide to use RabbitMQ or
Pulsar or NATS or anything else the future will
bring. So the benchmark should apply to
every other system that we use. It should be portable.
And we also want to test our benchmark in different
use cases. So our benchmark should be able to scale up
and scale down the same as our services,
right? The same idea. It should be possible to
scale our benchmarks and our worker nodes up and down,
which, as we'll see later, is reflected in the
actual performance of the cluster.
The next bit is that simplicity
is the key. Don't try to overcomplicate things.
The benchmark must
be understandable, and the benchmark must be reproducible,
because otherwise the test kind of loses its credibility.
You want to document everything in the process and
you want to document the key results. And you want your
users, whether they are internal or external users, depending on
your case, to be able
to reproduce the same test
that you ran, because later on they might test it on
their end, if you know what I mean. So it's crucial that your
test is as simple as possible.
Now we understand why it's super important
to run those tests and how
we should build them. But what should
we look for when doing those tests, right? Because this is something
that's important to understand.
So one of the best methods that
I know of for defining a system's performance is the USE
method. It was created originally by an
engineer called Brendan Gregg. Amazing. Check his blog.
Definitely, it's an amazing blog. So Brendan
actually created a method
to identify 80% of a server's issues with
5% of the effort. The same way, for example,
a flight attendant
has some kind of manual, a tiny emergency
checklist that is basically idiot-proof, of what
to do when there is an emergency, with
the USE method we have a straightforward, complete
and fast manual for how to test our system.
So for each one of the resources we already mentioned, the CPU,
disk, network and so on, we want to look at the
utilization, the saturation and the errors,
in the logs or wherever. This will help
us identify what our problem is as fast as
possible. For example, looking at
the same pillars: if we are looking at the CPU,
we can look at our system time,
user time, idle and so on, and we can check the load average.
If we are talking about memory,
we might want to use the metrics related
to the buffers and the cache,
or look at the JVM heap. If it's a JVM-based
system, we might want to see the GC time.
When we talk about the network, we might want to
look at the bytes in and out, and see how many packets, if any,
were dropped, and so on.
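To make that a bit more concrete, here is a rough sketch of a USE-style pass over a Linux host using the standard tools; the exact commands are my choice, not something the method prescribes:

```bash
# A quick USE-style pass over the four pillars on a Linux host:
# utilization, saturation and errors for CPU, memory, disk and network.

uptime               # CPU saturation: load average vs. number of cores
mpstat -P ALL 1 5    # CPU utilization: user, system, idle per core
free -m              # memory utilization: used, buffers, cache, available
vmstat 1 5           # memory saturation: si/so columns show swapping
iostat -xz 1 5       # disk utilization (%util) and saturation (queue, await)
sar -n DEV 1 5       # network utilization: bytes in/out per interface
sar -n EDEV 1 5      # network errors and drops
dmesg | tail -n 50   # recent kernel errors: OOM kills, I/O errors, NIC resets

# For JVM-based systems such as Kafka, also look at heap usage and GC time,
# e.g. with:  jstat -gcutil <broker-pid> 1000 10
```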
So how do we benchmark those systems? Now, like I mentioned, today,
in the streaming platform area,
there are billions of products already, right? And we have more
and more products launching every day. Now, if
it's a JVM product, maybe we can use the old,
I don't want to say rusty, but the old, nice JMeter
that everyone used to run in the past.
But again, we are in an era where not all of the data
systems are JVM based. So we want to use something
else, and we
could use the system-specific benchmark tools. For example,
Kafka is packed with its own performance
test shell scripts that you can run against your cluster.
The same goes for RabbitMQ, and NATS, for example,
has its own "bench" utility.
But as I mentioned earlier, our goal is a benchmark
that is easy to move between
different systems, right? So we don't
want to use a system-specific benchmark tool.
Exactly for that use case we have
a nice project from the Linux Foundation
which is called the OpenMessaging Benchmark, or
OMB for short. This system
is a cloud-native, vendor-neutral,
open-source distributed messaging benchmark.
A lot of buzzwords, yeah, but it basically
allows us to run benchmarks against most of those common
systems that we all use, in a simple way.
Now, the system itself is built out of two components, which is
easy to understand: you have the driver and you have the OMB workers.
Again, OMB, OpenMessaging Benchmark. The driver
itself is responsible for assigning the tasks to the
workers. It's also
responsible for everything related to the metadata itself.
So, for example, it is the one that actually
creates the topics in Kafka, creates the
consumers, and so on. And we have the OMB workers,
which are the benchmark executors.
So the driver communicates over the network with
the workers and assigns tasks, and
the workers actually execute the test against the
cluster. Now, it's super easy to
use. You can just install it using the provided Helm chart
and easily deploy OMB on any Kubernetes cluster,
or even deploy it outside of Kubernetes, of course, if you want.
In the Kubernetes case, our driver will read
all the configuration from a config file, for example, and will distribute the
load, as I said, to the workers, which are eventually
pods. So you might want to spawn
the same number of workers as your
biggest service or something like that,
like I mentioned in the criteria. It's super easy to scale
the system: you can scale the number of workers up and down
to match any number that you want.
And like I said, it's good practice to run them against the
same number of pods that you have in your most crucial
system or service. Of course, again I will
say it: use the same Kubernetes machines, use the same instance
types and so on, because it's crucial for the test.
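Roughly, the deployment looks like the sketch below; the chart path and resource names are assumptions from memory, so check the OMB repo for the exact layout:

```bash
# Sketch only: the Helm chart location and resource names have moved around
# between OMB versions, so treat these paths as assumptions and check the repo.
git clone https://github.com/openmessaging/benchmark.git
cd benchmark

# Recent versions of the repo ship Kubernetes manifests and a Helm chart
# under deployment/kubernetes/ (the exact path may differ in your checkout):
helm install omb ./deployment/kubernetes/helm/benchmark

# The workers come up as pods; scale them to mirror your biggest service.
kubectl get pods                                     # find the worker pods / their controller
kubectl scale --replicas=4 statefulset/omb-worker    # resource name is hypothetical
```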
Now, here you can see an example of the configuration.
On the left side you have the Kafka configuration, on the right
side you have the Pulsar configuration. You can see that we start
with the name of the test and a driver class.
So again, it makes sense: Kafka is Kafka, Pulsar is Pulsar.
Next, we assign the basic configuration
for the test. For example, we have the connection
string to Pulsar or to Kafka and the port, we have the number of
I/O threads, we have the request
timeout, and so on. Next we have
the producer configuration. So we
might want to use different producer configurations, like I mentioned, based on
our services. In the Kafka case,
for example, you can set the acks to all,
to minus one, or whatever you want. You can change the
linger, the batch size, and so on. The same goes for Pulsar: you can
change the producer settings. And lastly you define your consumers.
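As a rough sketch, a Kafka driver file looks something like this; the field names are recalled from the examples shipped under driver-kafka/ in the OMB repo, so double-check against those rather than this from-memory version (the bootstrap.servers address is a placeholder):

```bash
# From-memory sketch of an OMB Kafka driver file; compare it with the
# examples under driver-kafka/ in the OMB repo before using it.
cat > driver-kafka.yaml <<'EOF'
name: Kafka
driverClass: io.openmessaging.benchmark.driver.kafka.KafkaBenchmarkDriver

replicationFactor: 3

topicConfig: |
  min.insync.replicas=2

commonConfig: |
  bootstrap.servers=kafka-0.kafka:9092

producerConfig: |
  acks=all
  linger.ms=1
  batch.size=131072

consumerConfig: |
  auto.offset.reset=earliest
  enable.auto.commit=false
EOF
```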
Now, in the project repo itself, on GitHub
of course (and later on I will post a link, like I mentioned),
you have many examples. So make sure to do
your research: make sure to identify your
biggest producers and consumers and
the configurations that they are using, because later on you might
want to use those configurations in your tests.
Based on this information, make sure to generate
different use cases, and edge cases, and test
them against your cluster.
This is an example of a workload file. Eventually the workload,
the message that we send, is just a binary payload,
but in the workload definition, as
you can see here, you have the message size, you have some randomized
payload configuration and other things that you
can change, you have the producer rate,
and many other knobs.
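Here is a from-memory sketch of what such a workload file looks like; verify the field names against the examples under workloads/ in the repo:

```bash
# From-memory sketch of an OMB workload file; verify the field names against
# the examples under workloads/ in the OMB repo.
cat > workload-1-topic-1kb.yaml <<'EOF'
name: 1 topic / 16 partitions / 1kb

topics: 1
partitionsPerTopic: 16

messageSize: 1024
useRandomizedPayloads: true
randomBytesRatio: 0.5
randomizedPayloadPoolSize: 1000

producersPerTopic: 4
producerRate: 50000        # target messages per second across the producers

subscriptionsPerTopic: 1
consumerPerSubscription: 4

testDurationMinutes: 15
warmupDurationMinutes: 5
EOF
```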
Running the test is quite easy. With OMB you
can run the same test again and again, against different systems.
And after you run the test, the results
will be printed to your screen, but they will also
be saved to a file which you can later
share. And the project itself also contains
a nice Python script that allows you to generate
charts from the results, which you can then put in your documentation
so everyone can see the results of the system.
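Putting the pieces together, a run looks roughly like this, using the hypothetical driver and workload files from the sketches above; script names and paths follow the repo layout as I remember it, so adjust them to your checkout:

```bash
# Build the project once (it is a Maven build), then point the benchmark
# runner at a driver file and a workload file.
mvn -q clean install -DskipTests

bin/benchmark \
  --drivers driver-kafka.yaml \
  workload-1-topic-1kb.yaml

# Results are printed to the console and written to a JSON file, which the
# bundled Python script can turn into charts for your documentation:
python3 bin/create_charts.py *.json
```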
Now that we understand how to do it, and that it's
kind of better to do it with this kind of system,
what kind of insights might we get
from these kinds of tests? First of
all, we could potentially find our average
latency, right, the end-to-end latency, which is important.
We can use that latency
to find any kind of problems, because maybe
we have services for which latency is crucial,
and maybe we could play a little bit
with the configuration, lower the linger.ms, lower the
batch size (again, in the world of Kafka),
and then get a better end-to-end latency.
Maybe we can find a better tuning for
our services and for our configuration, right? So it's really
important to find it.
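One cheap experiment along those lines is to run the exact same workload with two producer profiles and compare the end-to-end latency numbers afterwards; the values below are purely illustrative, not recommendations:

```bash
# Two hypothetical producer profiles to benchmark against each other.

# Throughput-leaning: batch aggressively and accept a bit more latency.
cat > producer-throughput.properties <<'EOF'
acks=all
linger.ms=20
batch.size=262144
compression.type=lz4
EOF

# Latency-leaning: send as soon as possible, with small batches.
cat > producer-latency.properties <<'EOF'
acks=all
linger.ms=0
batch.size=16384
EOF
```

In OMB terms you would paste each of these sets of lines into the producerConfig block of the driver file from earlier and rerun the same workload twice.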
The next bit is that we have a unique opportunity to stress test our
monitoring services and dashboards, because usually,
I hope, we don't have a lot of those extreme cases
where the system is super overloaded. So we have a unique
opportunity to see that our Grafana dashboards, or
Datadog, or whatever you use, can actually handle these
kinds of values and these kinds of extreme cases. Because one
of the things that you don't want is for your monitoring service to
be broken when the shit hits the fan.
Sorry for my language. Next, you might find
the potential bottlenecks, because, as we mentioned,
it's super important to identify our limiting
factor, because then we can actually address
it in advance, right? We can, I don't know, put more alerts on our systems.
If, for example, it's the disk, maybe we can lower the
threshold of the alert so we will be notified before
the disk gets into I/O starvation or something like that.
Maybe we can change the type of the disk, and yada yada yada.
So it's super important to do it.
Next, we might get some kind
of, I call it, scale-up ballpark. Because when
running those tests, you can actually
add more brokers to your Kafka cluster and
see the impact on the latency during that time,
but also the impact on
the overall cluster performance after you added that node. So you
will have some kind of ballpark of how many nodes you
need to add to your system, to your cluster, in the future
in order to sustain the growth of the system.
Last but not least, if you run enough benchmarks,
like I said, with different use cases, extreme cases, and so on,
you can easily create some kind of system of recommendations.
Because if you have enough stress
test cases, and if you test all your producer and
consumer configurations, you might be able to collect
them and put them in some kind of backend with a nice UI,
where your developers can actually
change the different parameters and see the results based on your tests.
So maybe, as a developer, I want to see if
I can raise my batch size
or lower my batch size, but I want to see the impact
on the latency. So based on your configurations, you might be
able to give your developers the ability to see
the result in advance, without them spending time
and money testing it. Yeah.
And this is it! First of all, I hope
that you liked it. And second of all, I hope that this talk gave you
enough reasons to run these kinds of benchmarks.
If you have any questions, feel free to reach out to me on LinkedIn
or Twitter or wherever, and I hope you enjoy
the conference. Thank you again, and see you later.