Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, and welcome to my talk today. When you
need a data streaming platform, when you need to do real time
analytics or real time data ingestion, what probably
comes to mind is Kafka. And if
you've been using Kafka for a while, you might be facing some
limitations or pain points in your daily operations,
like certain rigidity when you need
to scale out, right? And if you're using
traditional message bus solutions like RabbitMQ,
you might also be facing certain limitations.
But I have good news for you. There is an alternative that's gaining popularity
right now: Apache Pulsar. I'm Julian,
developer advocate at StreamNative. You can see a QR code and
a link on your screen. Feel free to scan this QR code.
That's the best way to reach out to me and to get access to
additional resources, for example, case studies or benchmarks.
For those who don't know StreamNative: StreamNative was founded
by the original creators of Pulsar. We are one
of the largest contributors to Apache Pulsar and the ecosystem surrounding it,
and we provide a fully managed Pulsar service with enterprise
features. This service can be deployed on either our infrastructure
or your infrastructure. So what is Apache
Pulsar? Apache Pulsar is a cloud native
messaging and data streaming platform.
Cloud native means that Pulsar is designed for
running in containerized environments and is
designed to scale out, to scale horizontally. But it's
not only about scalability, it's also about elasticity.
Elasticity is about adapting to workload changes,
and Pulsar is both a messaging and a data streaming platform,
as we will see in this presentation. So Pulsar
is a messaging and streaming platform. But what is data
streaming? Data streaming is a process of transmitting
data continuously and in real time from
a source to a destination. The data is
sent in small pieces and is processed or analyzed as it is
received. So streaming works best when the data needs
to be processed in the right order: these events can be
related to one another, so they
need to be processed in that order. This data may
be persisted for a long time, but also be transformed,
aggregated, or even replayed. And you
may also need to read millions of records very quickly
and perform big catchup reads, so you need throughput.
Streaming platforms can handle massive amounts of data,
so these platforms scale out horizontally.
Typical use cases for data streaming include data ingestion.
That's the process of collecting and importing data from various
sources for storage, processing, and analysis,
and other use cases can be about real time analytics.
You will make the right decisions based on fresh data so
you won't wait for a batch process to be run tomorrow. For example,
you can implement a dashboard that's updating in real
time, or you can implement fraud detection.
The most common streaming platform is Kafka. I can also
mention Amazon Kinesis, and of course Pulsar is also a great
streaming platform. Now let's talk about messaging.
That's very different. Sometimes we
need to assign time consuming tasks to a group of workers.
One producer service needs tasks to be performed by
a consumer service, that is a worker, and these
tasks can take some time to process, or the
consumer service can be unavailable for a couple of
minutes. And we don't want these events to impact the producer
service. We need decoupling to do
that. You can set up a work queue.
This queue contains messages from the producers, and each of
these messages is a task to be performed by consumers.
You don't need to perform these tasks in a strict order.
For example, if the task consists of pushing notifications to Alice
and Bob's iPhones, it doesn't matter if I
notify Alice first or Bob first. I won't break anything.
I won't break my system if I do that. A message
broker manages this message queue, accepts the producer messages
and delivers them to the consumers so they can perform
those tasks. So there are some essential features
for a message queue. Queues can grow faster
than consumers can handle. So one expected feature is
the ability to add new consumers to consume the queue faster
and also remove consumers when the workload decreases.
Another expectation is that when a consumer is busy processing
a message, you don't want to wait for the processing to
be completed. The broker will deliver the next task to
another consumer instance without waiting for the current
task to be complete. When a consumer fails to
handle a message because of an error or timeout, you may
need the message broker to redeliver the message later to
retry the task, and you also need to remove
undeliverable messages from the queue and move them to
what we call a dead letter queue. Because of that,
messages may then be consumed in a slightly different
order than produced. But as we said,
the ordering is not a strong requirement here.
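To make this concrete, here is a minimal sketch of such a work queue using the Pulsar Java client. The service URL, topic, and subscription names are placeholders I chose for this example; the shared subscription, negative acknowledgement, and dead letter policy are the client features that correspond to the queueing behavior just described.

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.*;

public class NotificationWorker {
    public static void main(String[] args) throws Exception {
        // Connect to the broker (placeholder URL).
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // A Shared subscription acts as a work queue: each message goes
        // to only one of the consumers attached to the subscription.
        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("notification-tasks")
                .subscriptionName("notification-workers")
                .subscriptionType(SubscriptionType.Shared)
                // Failed messages are redelivered after a delay...
                .negativeAckRedeliveryDelay(30, TimeUnit.SECONDS)
                // ...and moved to a dead letter topic after too many attempts.
                .deadLetterPolicy(DeadLetterPolicy.builder()
                        .maxRedeliverCount(3)
                        .deadLetterTopic("notification-tasks-dlq")
                        .build())
                .subscribe();

        while (true) {
            Message<String> msg = consumer.receive();
            try {
                sendPushNotification(msg.getValue()); // the actual task
                consumer.acknowledge(msg);            // done, remove it from the queue
            } catch (Exception e) {
                consumer.negativeAcknowledge(msg);    // ask the broker to redeliver later
            }
        }
    }

    private static void sendPushNotification(String payload) {
        // placeholder for the real work
    }
}
```

Starting a second copy of this worker with the same subscription name is all it takes to add a consumer, and stopping it is all it takes to remove one.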
RabbitMQ and Amazon SQS match these requirements
because they implement this message queuing semantic, and we
cannot expect these requirements from a streaming platform.
These platforms are not designed for that.
However, Pulsar implements both the streaming requirements
and the message queuing requirements,
so you can see that data streaming and messaging require a different
set of features. In the data streaming
world, the data streaming requirements are that these events
need to be processed in order. That's essential.
You need to ingest a large amount of data.
That's not quite the same in the messaging world.
You need data retention, you need to handle big
catch-up reads. And the most common platform that provides
this is Kafka. However, message queuing requirements
are quite different. It's about decoupling
task execution. It's not about processing events
or processing streams of data. It's about decoupling
task execution. You need to be able to add or remove
consumers dynamically so they can catch up with the
backlog of tasks. Right.
You don't want to block the queue when a consumer is busy
or fails to consume a message. And you need to
redeliver those failed messages later to
retry them. So one of the most common platforms that provides this
is RabbitMQ. So on one side you are in the Kafka
world and on the other side you are in the RabbitMQ world.
Okay, so does it mean that you have to choose between messaging and
streaming, or do we have to manage two different platforms,
one for messaging and another for data streaming?
Well, this is not a dilemma with Pulsar. With Pulsar you
can do both using the same platform, the same
technology. So you have only one SDK to learn
as a developer and you have only one broker
to manage in production. That's pretty cool.
This is why we say that Pulsar is a unified messaging
and streaming platform. It can handle both
messaging and streaming use cases, and this is one of the key features of
Pulsar. But now I want to talk about Pulsar
scalability and elasticity. And these are two very different things.
Elasticity means you can grow or shrink resources
quickly to adapt to workload changes,
so you can save on infrastructure costs by avoiding
over-provisioning. Some data streaming platforms like
Kafka and Pulsar can scale very well,
but Pulsar is both scalable and elastic.
The scalability requirements are determined by the bottleneck you
have to address. So if your bottleneck is the number of messages
remaining to be consumed in a topic, then you need to scale
on the consumer side. When the bottleneck is the number of
topics or the number of connections with clients,
then you need to add more processing power to the pulsar cluster.
And when the bottleneck is the storage capacity,
then you need to add more storage capacity to the pulsar cluster.
Let's delve into the first bottleneck, which is likely to
be the most common one you will encounter. Suppose you have multiple
consumers consuming a topic. What happens if
the number of messages to be consumed grows faster
than the consumers can process them? With Pulsar,
you just have to add new consumers to what we call a
shared or key-shared subscription, so you can increase the throughput
by adding new consumers to the subscription.
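As a small sketch of what that looks like with the Pulsar Java client (the topic and subscription names here are placeholders of mine), scaling out is simply starting another instance of the same consumer; a key-shared subscription additionally keeps all messages with the same key in order.

```java
import org.apache.pulsar.client.api.*;

public class ScalableConsumer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder URL
                .build();

        // Start this same program once more and the new instance simply
        // joins the subscription: no partition to add, no rebalance to plan.
        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("clickstream-events")
                .subscriptionName("clickstream-processors")
                // Key_Shared spreads messages across consumers while keeping
                // all messages with the same key in order on one consumer.
                .subscriptionType(SubscriptionType.Key_Shared)
                .subscribe();

        while (true) {
            Message<String> msg = consumer.receive();
            process(msg.getKey(), msg.getValue());
            consumer.acknowledge(msg);
        }
    }

    private static void process(String key, String value) {
        // placeholder for the real processing
    }
}
```

Each additional instance shares the backlog automatically; no broker-side operation is needed.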
And that's it. With Kafka, when you
need to add a new active consumer to a consumer group, then you need
to add a new partition to the topic. For that, you need to
perform an operation on the broker side and you will also
perform a rebalance of the data across the partitions.
This can be heavy and this can lead to performance loss or
even downtime. So you have to anticipate and plan carefully.
Right, but Pulsar doesn't have this issue because
there are no partitions involved. And of course when you have fewer
messages, you may end up with underutilized consumers so you
can remove them to save on infrastructure costs. And that's
pretty straightforward. All of that while preserving the
ordering guarantee, unlike traditional messaging brokers.
Now that we've delved into the most common bottleneck, let's explore
how Pulsar can handle a rapidly increasing number of
consumers and producers. But first I need
to explain the unique architecture of Pulsar. This architecture
is more sophisticated than that of other platforms, and it
brings many benefits. In Pulsar, there are two
types of nodes, the broker nodes and the bookie nodes. The broker nodes
are responsible for managing all the communication and the
processing of the topics. So they are stateless,
they don't store data. So a broker node
deletion won't impact the data.
In contrast, the bookie nodes are responsible for storage. They have
state, they store the messages and the bookies are
Apache BookKeeper nodes. So let's say that I need more
processing power on my cluster, I'll add more brokers
because their state is stored in BookKeeper. Here I can add a
new broker, and the load can then be migrated to
the new broker. So some of the magic of Pulsar is that it will
take care of all the moving of connections, and this moving is
transparent to your application. If you compare this
to Kafka, then when adding a new broker, I would have to
manage the movement of all of the data to another broker to rebalance
the data across my cluster.
But here with Pulsar, there is no
heavy partition rebalance, there is no data movement involved.
Now when you need to store more data, you'll
just add more bookies. As soon as
you add a new bookie, it's going to be eligible for getting new messages
right away. It immediately becomes functional
and there is no need to wait for any data rebalance here;
the node is instantly available. Pulsar does
not have this issue when you need scalability on your storage.
So here is a recap of the three levels of elasticity and
the benefits compared to other streaming platforms. When you
do data streaming or messaging, your most common bottleneck
is the consumer. The advantage of Pulsar is that scaling consumers
doesn't require complex operations like adding partitions.
If your bottleneck is the processing power on the cluster
side, you just have to add new broker nodes, without the need for
data movement across nodes. And finally,
if storage capacity is the limiting factor,
adding new bookie nodes resolves the issue, and unlike other platforms,
these nodes immediately receive new messages.
And of course you can easily downscale to save on infrastructure
costs. To ease the use of Pulsar in Kubernetes environments,
we developed Kubernetes operators. The operators
facilitate the deployment of Pulsar clusters on Kubernetes.
You can define your desired cluster configuration using familiar
Kubernetes manifest files, and this allows for
seamless scaling and facilitates the installation of additional
components as well. You can find the documentation on
the StreamNative website, and StreamNative offers these operators under
a simple and free-to-use community license. Okay,
so I just presented how you can scale Pulsar with its
multi-layered architecture, which is the second key feature of Pulsar.
Now let's shift our focus to another critical aspect:
how does Pulsar safeguard the durability of your
data? Pulsar topics are made of segments,
and these segments contain messages. Pulsar can
distribute the segments to separate bookies and this is how a single
topic is distributed across several data nodes.
It's important to note that the storage model is completely
different from the Kafka storage model, which is a partition-based model.
So here you have a replication factor of three. So every segment
is replicated on three different bookies,
as you can see. So if I kill one bookie here,
bookie two, I still have all the segments
in the other bookies, so I haven't lost any data.
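For illustration, here is how a replication factor like this can be expressed, as a rough sketch assuming the Pulsar Java admin client and placeholder tenant and namespace names: a namespace-level persistence policy controls how many bookies each segment is written to and how many acknowledgements are required.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.PersistencePolicies;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        // Placeholder admin endpoint and namespace.
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        // Write each segment to an ensemble of 3 bookies, keep 3 copies
        // of each entry, and require 2 acknowledgements per write.
        admin.namespaces().setPersistence("my-tenant/my-namespace",
                new PersistencePolicies(3, 3, 2, 0.0));

        admin.close();
    }
}
```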
You may need to store a large amount of data and retain
it for a long time. Depending on your use case,
you may need to read old messages that were produced days
or months ago. And with Pulsar you can offload
those messages to an external storage.
So instead of using the fast and expensive disks in
the cluster nodes, we can rather leverage third-party
cloud storage systems, moving this data into a more
cost-effective storage tier, and this is
transparent for the consumers. So if you need to replay a topic,
some messages will be read from the cloud storage and others
will be read from the bookies. But the consumer doesn't see
this; it's transparent. And this offloading
is facilitated by the segment-based architecture I presented.
This architecture allows older data segments to be offloaded seamlessly.
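As a rough sketch of that configuration, assuming a tiered-storage offloader (for example a cloud bucket) is already configured on the cluster and using placeholder names, a namespace-level threshold can trigger the offload automatically:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class EnableOffload {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // placeholder endpoint
                .build();

        // Once a topic in this namespace holds more than ~10 GB on the bookies,
        // older segments are offloaded to the configured cloud storage tier.
        admin.namespaces().setOffloadThreshold("my-tenant/my-namespace",
                10L * 1024 * 1024 * 1024);

        admin.close();
    }
}
```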
So we've seen what happens when you lose a
node, but what if you lose a region
or a data center? That's where geo-replication
steps in. Geo-replication provides disaster recovery.
So you have several clusters deployed in different regions or
different data centers, and if you lose a region, you can recover
from it. Pulsar can replicate the data to different regions
automatically in a bi-directional way. And this is
a built-in feature. So setting this up is basically about configuration.
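To give an idea of what that configuration can look like, here is a minimal sketch with the Pulsar Java admin client, assuming two clusters registered under the placeholder names us-east and eu-west and a tenant that is already allowed to use both:

```java
import java.util.Set;
import org.apache.pulsar.client.admin.PulsarAdmin;

public class EnableGeoReplication {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // placeholder endpoint
                .build();

        // Messages published to topics in this namespace are replicated,
        // in both directions, between the listed clusters.
        admin.namespaces().setNamespaceReplicationClusters(
                "my-tenant/my-namespace", Set.of("us-east", "eu-west"));

        admin.close();
    }
}
```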
Now I'd like to introduce another very cool feature of
Pulsar: multi-tenancy. Multi-tenancy allows
different departments or teams within an organization to
share a Pulsar cluster while keeping their data isolated.
So multi-tenancy helps applications work in a shared environment
by providing structure, security and resource
isolation. The benefits include easier management,
as you only need to operate a single cluster for multiple teams.
Plus, sharing resources can lead to a significant
reduction in the number of nodes in your infrastructure,
which can save on costs. This is not a hack
or another layer on top of Pulsar. Pulsar is
designed for that. This is a built-in feature.
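As a small sketch of how that structure looks, assuming a recent Pulsar Java admin client and placeholder names: each team gets its own tenant, and namespaces inside the tenant group that team's topics and policies.

```java
import java.util.Set;
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TenantInfo;

public class CreateTenant {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // placeholder endpoint
                .build();

        // Each team gets its own tenant, isolated from the others,
        // restricted here to a single cluster named "standalone".
        admin.tenants().createTenant("team-payments",
                TenantInfo.builder()
                        .allowedClusters(Set.of("standalone"))
                        .build());

        // Namespaces group that team's topics and carry their policies
        // (retention, persistence, replication, quotas, and so on).
        admin.namespaces().createNamespace("team-payments/orders");

        admin.close();
    }
}
```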
So now you could say, well, Julian, Pulsar
has impressive features, really impressive features. But you know,
in my company I have an existing software ecosystem,
right? I'm sending or consuming messages with
Kafka and RabbitMQ. I have a bunch of microservices
with thousands or maybe millions of lines, and I can't rewrite all
of them. Well, I have good news for you.
Pulsar has a high level of compatibility, and I'll explain that
right now. Messaging and streaming involve clients
and a broker, and they communicate using a protocol.
Pulsar provides its own binary protocol,
but with the addition of protocol handlers,
Pulsar becomes compatible with Kafka clients,
RabbitMQ clients, and MQTT clients.
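To give an idea of what this means in practice, here is plain Kafka client code talking to Pulsar, assuming the KoP (Kafka-on-Pulsar) protocol handler is enabled on the broker and its Kafka listener is reachable on port 9092; the address and topic are placeholders. The application code itself is unchanged Kafka client code.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaClientOnPulsar {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Point the unchanged Kafka client at the Pulsar broker's
        // Kafka-protocol listener instead of a Kafka cluster.
        props.put("bootstrap.servers", "pulsar-broker:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-42", "created"));
        }
    }
}
```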
So by leveraging your existing apps, you can avoid the
need to rewrite everything and ensure a seamless migration
path. Pulsar also benefits from a great
ecosystem of client applications using the
Pulsar native protocol, which benefit from all the features.
You have many Pulsar client libraries available. You will surely
find one for your favorite language, and don't hesitate to check
out hub.streamnative.io, where you'll find a
wide ecosystem of connectors, libraries,
protocol handlers, et cetera. When it comes to
choosing a technology, open source is key.
Some of the benefits of choosing an open source technology are
sustainability, avoiding vendor lock-in,
and community support. And Pulsar of course is open source
because it's Apache Pulsar. All these features I presented
are available in open source, so if you download Pulsar you
have all of them. This is great because you don't depend on
a specific vendor. You're free to call on a vendor to provide
Pulsar as a service, like StreamNative, or you can manage a Pulsar
cluster by yourself. There is no vendor lock-in.
Now, some data on the Pulsar open source community: there are
more than 600 contributors to Pulsar, and there are more making
contributions to the ecosystem surrounding Pulsar.
The entire Pulsar code base is growing year over year,
and you can discuss with the community on Slack. The number
of Slack members reaches 10,000. That's so many
people who can help you. Additionally, there are now
thousands of organizations using Pulsar. And now
let's take a look at the history of Pulsar at a glance.
Pulsar was developed by Yahoo in 2012 as their
cloud messaging service. It then went open source in 2016.
By 2018, Pulsar had graduated to a top-level
Apache project, and since 2019
there's been a surge in the Pulsar community's growth, with rapid
adoption and an increasing number of contributors. It's worth
mentioning that Pulsar has been in production for over
ten years. This means that it's tested,
it's proven to be robust and mature. So here is a
quick recap. Pulsar is a unified messaging and streaming
platform handling both patterns, both semantics, at the
same time, so you can have only one platform to manage.
Pulsar is doing great at both scaling and being elastic,
and provides three levels of elasticity.
Pulsar ensures the durability of the data and can
offload to external, cheap, and unlimited storage.
Pulsar has geo-replication built in, which is great for
disaster recovery. Pulsar is natively
multi-tenant, Pulsar is compatible with your existing software ecosystem,
and all these great features are available as open
source. So you have no vendor lock-in.
All right, that's the end of my talk. I hope you enjoyed
it. Thank you for attending. If you have any questions, I'll be
very happy to answer them. Feel free to scan this QR code on the left to
contact me and to get access to additional resources.
Feel free to try out Pulsar by downloading it or by installing the operators.
And you're welcome to join the Apache Pulsar Slack channel.
See you there.