Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, welcome to my talk. My topic today is event streaming and processing with Apache Pulsar. I'm very happy to be back at Conf42 Cloud Native 2022. This is my second year joining this conference, so thank you to all of the organizers, especially to Mark, for your help. My name is Mary Grygleski. I'm a streaming developer advocate at DataStax, and here's my Twitter handle if you'd like to get a hold of me.
So who is Mary? I'm a senior developer advocate at DataStax on the streaming team. DataStax is a company that specializes in data management, NoSQL databases, and cloud native database-as-a-service managed platforms. We specialize in big data and can handle data in really huge volumes. So that's about me. Before this I was at IBM for about three and a half years, first on the Java team and then the WebSphere team. My specialty is in Java, cloud native, data, and DevOps, and now I'm moving more into streaming. I also worked on some reactive stuff in the past.
I'm also a very active community builder. I'm the president of the Chicago Java Users Group as well as the co-organizer for several IBM-sponsored meetups in the Chicago area. Before this, I was a developer for over 25 years in various capacities. Primarily I was heads down doing engineering and development work, and then I moved more into technical architecture. I've been involved, so to speak, in the IT department, working in the trenches, doing design, integration, testing, development, deployment, and all kinds of stuff. When things got into DevOps, I got my hands dirty with that too. Here are several ways you can stay in touch with me: my Twitter handle, my LinkedIn profile, my Twitch stream, and a Discord channel. I will share these with you towards the end of this presentation.
So here's today's agenda on event streaming and event processing using Apache Pulsar. First, let's start out with some fundamentals: what is event streaming, what is event processing, what is complex event processing, and then event-driven versus message-driven messaging. There are some subtle differences in there. Very often today we hear about event streaming, and all of these things kind of bundle up together. It's all about asynchronous processing, concurrency, and scalability, especially in the cloud native world, which makes the problem even more challenging than before.
Okay, so after this introduction, I'll talk a bit about event processing semantics: pub/sub and queuing, which are the typical semantics. Then, why event streaming? Why do we want to do this? Then I'll give you a bit of an intro to Apache Pulsar. Apache Pulsar also depends very heavily on Apache BookKeeper as well as Apache ZooKeeper, so I'll give you some background information on those. And essentially, Apache Pulsar is meant, in my opinion, to extend what we have today. For example, Kafka does its job perfectly well, but there are certain limitations, so this is where Pulsar can come in, augment the situation, and make it even better. So with this, let me start the talk. Okay: the many facets of computing events.
So first of all, what is an event? And bear with me if you're coming in here already knowing some of this, but I have to assume that not everybody has this background, so I want to go over some basics. What is an event, generically speaking? For me, I looked up the dictionary at merriam-webster.com right away. First of all, common to any kind of event scenario, events are really about something that just happens: an occurrence. Now, as I paged through that list of interpretations, number four really caught my attention. It's essentially talking about an event being the fundamental entity of observed physical reality, represented by a designated point. So what is this point? The point is designated by three coordinates of place, the X, Y, Z coordinates, and one of time, in the spacetime continuum postulated by the theory of relativity. That's essentially what we do in computing, and that's how we look at events: an event is an occurrence that happens in time and is represented by a point, given by the three coordinates of place, X, Y, Z, and then by time. So I feel it's really an exciting topic. It gets more into the abstract, but also, if you think about it, events are actually more attuned to how human beings work than traditional computing.
I think when we were all first learning about computers, we looked at things very mechanically, so to speak. We define data, we flatten out whatever it is, or maybe flatten out some structure, assign properties to our objects, and then try to make sense out of it and do the computing. It's actually a bit awkward if you think about it. All of the processing we used to do was synchronous, meaning blocking. If you issue a request, for example over HTTP, even in web interactions, HTTP is a stateless protocol. So you issue a request and then you essentially have to wait: you send a request over and you have to wait for the response to come back before you can proceed. It's a very blocking, very time-consuming type of protocol, and it doesn't really mimic what human beings do. Do we actually wait like that, even when talking on the phone? And as you can see with talking on the phone, real-time communication actually isn't all that new; our telephone systems are essentially real-time communication. But think about doing it synchronously: wouldn't it seem kind of awkward? It would be like the old walkie-talkies, where only one person can talk at a time, and when you're done you say "over," and the sender and receiver have to take turns to speak. So it's a bit less natural. But again, this is events: we want events to essentially help with the human condition and make things happen a lot faster.
So, first of all, what is event stream processing? It is really the practice of taking action on a series of data points that originate from a system that continuously creates data. I know that's a mouthful, but an event essentially refers to each data point in the system, and the stream then refers to the ongoing delivery of those events. If you represent a stream as a straight line, it can be many events, represented by dots, that happen over time. So this is essentially how you can represent events, and a series of events can also be referred to as streaming data or a data stream.
complex event processing? So, complex event processing, CEP is
a set of techniques for capturing and analyzing
the streams of data as they arrive to identify opportunities
or threats in real time. So we find that,
for example, some kind of security scanning,
monitoring, they are really kind of involves complex event processing.
If something happens, it triggers comes alarm, that type of
scenario. So CEP enables systems and applications
to respond two events to trends and then patterns
in the data as they happen. So that's complex event processing,
essentially. I think we deal with this every single day in our lives.
So we want two do that in computer.
Okay. And now, event-driven versus message-driven messaging. This interpretation of event-driven I take from Lightbend, the company that makes Akka, the reactive framework for reactive systems that abide by the Reactive Manifesto. For event-driven, what it says is: a sender emits messages, and interested subscribers can subscribe to the messages. Think of it more like what we call publish/subscribe or producer/consumer type systems. The sender basically sends the messages without having to worry about who's going to receive them; it's up to the subscribers who are interested to subscribe to the messages. And that's how it works. Now, message-driven is not the same. It's basically sender and receiver, and they have to be known to each other beforehand, so the address needs to be known. That's message-driven messaging, and usually message-driven is really about two parties that are communicating with each other, so they know who each other are. So as you can see, these are the differences between event-driven and message-driven messaging.
Now, if we look into it, what happened before taking an event-driven approach? Think of today's IT systems: in IT departments there's still a lot of batch processing happening. And essentially this is what happens: we collect all of the data, group it all together, batch it up, and then have the processing go on, perhaps in the middle of the night when nothing else is happening, so that you don't take away processing capability in your system while crunching through all of these heavy data sets. That's what happened, especially in the 90s and the early 2000s: a lot more batch processing. But as systems become more sophisticated, we're seeing more and more event-driven types of systems come into play. Now, don't get me wrong, batch processing still has its place and is still very relevant, especially if you are in an IT department dealing with systems of record, where you're processing something that can be done in batch. I believe a lot of machine learning processing is also being done in batch; the reason is that the data set is just too big, and it's hard to process it all in real time. But this is exactly why we want to develop systems and capabilities such as Apache Pulsar, which is targeting really big data sets. And that's what it is.
Now let's take a look at event messaging, some semantics and patterns. So, streaming. This is what we are all about, and my job is essentially about advocating for streaming. When we talk about streaming, we're really referring to the pub/sub, the publish/subscribe, paradigm. The publishing client sends the data, and in a pub/sub system there's always a middle person, an agent in the middle called a broker. The broker is the one that mediates everything: it receives the messages from the publisher, so the publisher sends the data and the broker gets it, and then, based on different types of configuration parameters, the broker delivers the messages to those who are interested in them. It requires subscribers to subscribe. So the subscribing client will receive the data, but the subscribing client has to initiate things and subscribe to topics. Very often in a pub/sub system we're talking about using topics: Kafka uses topics, Pulsar has them, and even IoT messaging such as MQTT is a pub/sub system that has topics. Topics are essentially just labels so that you can group these messages together, and then those who are interested have to subscribe to that topic to receive them.
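To make that flow concrete, here is a minimal sketch using the Pulsar Java client, since Pulsar will be our focus shortly. The service URL, topic name, and subscription name are placeholder assumptions, not anything specific from the talk:

```java
import org.apache.pulsar.client.api.*;

public class PubSubSketch {
    public static void main(String[] args) throws Exception {
        // Assumes a Pulsar broker reachable at this service URL.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Publishing client: sends data to a topic without knowing who will consume it.
        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("game-scores")                 // hypothetical topic name
                .create();
        producer.send("player-42 scored");

        // Subscribing client: must subscribe to the topic to receive messages.
        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("game-scores")
                .subscriptionName("scoreboard-app")   // hypothetical subscription name
                .subscribe();
        Message<String> msg = consumer.receive();
        System.out.println("Received: " + msg.getValue());
        consumer.acknowledge(msg);                    // tell the broker this message is processed

        producer.close();
        consumer.close();
        client.close();
    }
}
```

The producer never learns who consumes the message; the broker delivers it to every subscription attached to the topic.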
Message queuing. So what is message queuing? Message queues are a form of asynchronous service-to-service communication used in serverless and microservices architectures. Messages are stored in the queue until they are processed and deleted, so essentially a first-in, first-out principle applies, and each message is only intended for a single consumer, so it's processed only once. Message queues can be used to decouple heavyweight processing, to buffer or batch work, and to smooth out any kind of spiky workload. So as you can see, queuing also helps in offloading some of this heavier processing. That's what message queuing is.
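One way to get this queue-style, processed-once behavior in Pulsar, which I'll introduce shortly, is a shared subscription: several consumers attach under the same subscription name and the broker hands each message to only one of them. A rough sketch, with made-up topic and subscription names:

```java
import org.apache.pulsar.client.api.*;

public class WorkQueueSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")    // assumed local broker
                .build();

        // A Shared subscription lets many consumers attach under the same subscription name;
        // the broker gives each message to only one of them, which is the queue pattern:
        // process once, acknowledge, and the message is removed from the backlog.
        Consumer<String> worker = client.newConsumer(Schema.STRING)
                .topic("billing-jobs")                    // hypothetical queue topic
                .subscriptionName("billing-workers")      // start more workers with this same name to scale out
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        while (true) {
            Message<String> job = worker.receive();       // blocks until a job arrives
            System.out.println("processing " + job.getValue());
            worker.acknowledge(job);                      // acknowledged jobs are not redelivered
        }
    }
}
```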
So, about event streaming. Why event streaming at this point in time? Earlier I talked about how event streaming really mimics more of our human behavior; it's a lot more natural to us. But at the same time it's much harder to do, because now we're bringing in concurrency, and bringing in events that happen at different times: an event hasn't happened yet and you have a subscriber waiting for it. So it involves a lot more computing, or rather managing the computing at the back end. It's more complicated, but you reap the rewards, the benefits of it being real time. So that's what it is.
So what's driving this change? If you look into it, we use real-time data to enhance customer experience and create a competitive advantage for your business. On a high level, that's what it is; marketing people like to talk about advocating for real time. Essentially, things happen right away. If you are at a football game, you want to know the statistics of a player right away: that data can be streamed immediately, analyzed, and output for you, and within seconds you can get it. So that's one advantage. Then there are also data pipelines, which are actually very useful for building AI and machine learning smart models from time-series event streams, for example. So data pipelines are another good use of event streaming, or rather, event streaming can help with building data pipelines. Another important aspect of event streaming is that it allows scalability and elasticity to happen in a much more flexible way. It helps meet the demands of the large volumes of data generated by applications operating at the edge, like IoT systems. For example, if you use a pub/sub kind of broker, you can handle maybe petabytes of data; well, maybe I'm exaggerating, but there have actually been benchmarks done with on the order of 100 billion messages being published across topics. So it's really the scalability aspect that it helps with, especially in today's cloud native world. Now then, why event streaming?
Let's talk a bit more about it. You want to watch for events within a system or application, and you also want to subscribe to topics to see certain event types. Then you want to make decisions on data in real time, not after the event. You want things to happen, you want to see the data, you want to see it being transformed and churning out something useful for you. The real-time aspect is the core of why we want to do event streaming. We want to be able to ingest a high frequency of messages, but with very low latency, so there's not much of a gap between when you receive your messages and when you're able to process them and output them in some format, for example. So essentially, it's the real-time aspect. And now comes Apache Pulsar.
So what is Apache Pulsar? It is open source, governed by the Apache Software Foundation as a proper top-level project. It was actually first developed by Yahoo, back in around 2014 or 2015. Yahoo then contributed it to the Apache Software Foundation in 2016, and in 2018, within less than two years, it already became a top-level project. The reason is that it has really built in a lot of the cloud native capability that is lacking in most of the other systems, in fact in all of the other systems, so that's where it comes out as outstanding. It is, again, a cloud native design: it's cluster based, multi-tenant, and the client API, surprisingly, isn't overly complicated. It supports Java, C#, Python, Go, and JavaScript, among others; Scala too, for example, if you want JVM languages. There are other ones as well, so you can check the website. And basically the key is that Pulsar separates out compute and storage, and that's what makes it much easier from a scalability perspective.
Another thing: Pulsar, being a pub/sub broker, already has a guaranteed message delivery mechanism built in. If a message successfully reaches a Pulsar broker, it will be delivered to its intended target. Depending on the message-level quality of service, it can be QoS level zero, which is fire and forget; in that case the broker isn't held accountable for guaranteeing message delivery. Then you can have QoS one, which is at-least-once delivery, or QoS two, which is exactly once, so that's a lot more precise. All of these are supported by Pulsar.
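Those QoS levels are really MQTT-style terminology; in Pulsar's own client API the guarantees show up as acknowledgments: the broker acknowledges a publish back to the producer, and the consumer acknowledges (or negatively acknowledges) each message to control redelivery. A small sketch of that handshake, with placeholder topic and subscription names:

```java
import org.apache.pulsar.client.api.*;
import java.util.concurrent.TimeUnit;

public class DeliverySketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")        // assumed local broker
                .build();

        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("orders")                              // hypothetical topic
                .create();

        // send() blocks until the broker has acknowledged and durably stored the message,
        // so a returned MessageId means the broker now takes responsibility for delivery.
        MessageId id = producer.send("order-1001 created");
        System.out.println("stored as " + id);

        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("orders")
                .subscriptionName("fulfillment")              // hypothetical subscription
                .ackTimeout(30, TimeUnit.SECONDS)             // unacked messages get redelivered
                .subscribe();

        Message<String> msg = consumer.receive();
        try {
            System.out.println("handling " + msg.getValue());
            consumer.acknowledge(msg);                        // done: broker will not redeliver
        } catch (Exception e) {
            consumer.negativeAcknowledge(msg);                // failed: ask the broker to redeliver
        }

        producer.close();
        consumer.close();
        client.close();
    }
}
```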
Pulsar also has a very lightweight serverless functions framework called Pulsar Functions, which lets you create complex processing logic within a Pulsar cluster. For example, if you're building a data pipeline, you want to fetch the data and then do some work transforming it, or mediating it, or whatever you need to do. You can leverage Pulsar Functions to help you transform and mediate all of these messages, so it comes in very handy. It already has a command line that helps you build your function, deploy it, and eventually get it running, or you can do it from the console.
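As a sketch of what such a function might look like, here is a tiny transformation step of the kind you could drop into a pipeline. The class name and the transformation itself are invented; the input and output topics are wired up when you deploy the function:

```java
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

/**
 * A minimal Pulsar Function: it receives each message from its input topic,
 * transforms it, and the returned value is published to the configured output topic.
 */
public class UppercaseEnricher implements Function<String, String> {
    @Override
    public String process(String input, Context context) {
        // Hypothetical transformation/mediation step: normalize and tag the payload.
        context.getLogger().info("processing {}", input);
        return input.trim().toUpperCase() + " [processed]";
    }
}
```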
For example, I work for DataStax now, so we also have Astra Streaming, which is the commercial managed version; Astra DB is the commercial managed Cassandra database, and Astra Streaming is the streaming that Apache Pulsar enables alongside Cassandra. So then you really enable more of the real-time processing aspect, and you can actually use Pulsar Functions to do data transformation as well. Okay, that's just an example.
an example. And then they're also like the storage offloads too, are also
like tiered storage offloads. What it means
is that it offloads data from hot and warm
storage to cold and long term storage when the data is bring
out. So when it's kind of fresh, maybe you still keep it in
hot or warm storage or in memory as well.
Or essentially two I wanted to bring out is the bookkeeper.
And so basically it knows, it has built in capability
to recognize that if some data is kind of aging out, not being
used as much, then we may as well write it,
store it into persistent long term storage and cold storage,
kind of like that. So essentially Pulsar has this built
in to handle the tier storage offloads.
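As a rough illustration of how that might be configured, the sketch below uses the Java admin client to set a per-namespace offload threshold. The namespace, the endpoint, and the assumption that the setter is exposed under this exact name are mine, so check the current admin API docs; the offload destination itself (for example S3) is configured separately on the broker:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class OffloadConfigSketch {
    public static void main(String[] args) throws Exception {
        // Assumed local admin endpoint; adjust for your cluster.
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        // Hypothetical namespace. Once its topics hold more than ~1 GB of backlog,
        // older ledger segments are offloaded to the configured long-term (cold) storage,
        // while fresh data stays on the BookKeeper (hot/warm) tier.
        admin.namespaces().setOffloadThreshold("my-tenant/analytics", 1024L * 1024 * 1024);

        admin.close();
    }
}
```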
So again, what is Pulsar? Let's go into a bit of detail about it. It's a unified, distributed messaging and streaming platform. What does that mean? Basically, it supports messaging, and not only pub/sub but also queuing. And not only the messaging itself: it does transformation and mediation through Pulsar Functions and things of that nature. So it's really a truly distributed messaging platform that's suited for today's cloud native world. It's open source; again, it came out of Yahoo and is now part of the Apache Software Foundation, and it's in fact one of the fastest-growing projects, with a lot more committers, as you can see on these graphs. GitHub stars have increased since 2018, when it became a top-level project, and you can see a pretty big jump around June 2021, and so have the number of contributors and monthly active contributors; in fact, in June of 2021 it overtook Kafka on that measure. So as you can see, it's gaining popularity.
And again, it's cloud native ready and Kubernetes ready. There's also K8ssandra, which is basically Kubernetes for Cassandra. The point is that Pulsar is basically ready to work with Kubernetes and with Cassandra too, and it supports multi-cloud and hybrid cloud. In fact, I will shortly share a link to how you can quickly test out Pulsar through Astra Streaming. So, four reasons why Pulsar is essential to the modern data stack; we'll look at that after I show you a bit more information. Who else is using Pulsar? Look at these companies: Comcast, Yahoo, Overstock, Splunk, General Motors, Iterable, Cargill, Verizon, Tencent, Shopin, Nutanix, and more. These are pretty major players. And also, sorry for the more marketing-style slides; I know I'm talking to developers, but I hope you appreciate this too. The fact is that there have been comparisons showing it really lowers the three-year cost compared to Kafka because of its capability: there are higher-performance savings for high-complexity scenarios and savings for higher-data-volume scenarios.
And again, some brief history of Apache Pulsar, but I've essentially already talked about it, so I won't go through all of it again. It's a cloud native, distributed, unified messaging and streaming platform that has been open source as an Apache top-level project since 2018, and we're seeing the trends: it keeps growing. And as you can see, it also had the Log4j 2 issue fixed back in December; as soon as that came up, the community quickly delivered a patch for it.
So now let's go a little deeper into Pulsar and why it is different. Here it is: I just want to point out that there's a producer, the client application sending messages to a topic managed by the broker. It's up to the consumers, the ones interested in consuming the messages, to subscribe to the topic where they know the producer is going to send messages. So consumers subscribe to topics, and then there's the broker. The broker is essentially a stateless process that handles incoming messages and message dispatching, communicates with the Pulsar configuration store, and stores messages in BookKeeper instances. So the broker itself actually interacts with BookKeeper as well. So what is BookKeeper?
Again, Pulsar is unique in the sense that it doesn't want its compute side to be worried about doing the bookkeeping. As we all know, as human beings we do accounting; do we like accounting? Most of us don't, because it's crunching through a lot of numbers. It's tedious work, yet of utter importance. So Apache has a project called BookKeeper. Essentially it does electronic ledgers and journals and all of those things. If you're an accounting enthusiast, or you're familiar with it, then you're familiar with those terms; essentially it's bookkeeping done electronically, going digital. And that's what Pulsar leverages for its logging and storage. As you can see, a bookie itself has different segments in its storage, for better organization and management and faster retrieval. So the broker is the one that's serving, but it also interacts with BookKeeper.
Now there's also ZooKeeper. I bet some of you, or most of you, probably already know of ZooKeeper. It manages the cluster: the metadata and the coordination tasks between different Pulsar clusters. The name ZooKeeper is basically about keeping order in the zoo; as you can imagine, this whole cloud native thing can become very confusing. So ZooKeeper comes in, manages all of this metadata, handles all the coordination, and makes sure nobody steps on each other's toes, essentially. So those are the Pulsar components.
I just wanted to highlight the design principle of Pulsar. It has adopted a tiered architecture design approach, and it's a traditional multi-node architecture. As you can see, we already looked at producer and consumer. They interact with the broker: the producer sends messages to the broker, and a consumer that's interested subscribes to the topic. Actually, I should say the producer sends messages to the topic, and it's the broker that manages it; but essentially the broker also has to acknowledge back at the network layer. And as you can see, topics can also be partitioned, so a broker has different topics being partitioned off. This distributed architecture supports horizontal scaling really well, and partitioned topics, as an abstraction, mask complexity for consumers: the consumer doesn't have to worry about it, because it's the broker that manages how these topics are partitioned.
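As a concrete illustration of that abstraction: a partitioned topic is created once through the admin API, and producers and consumers then address it by name as if it were a single topic. The names and endpoints below are placeholders:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.*;

public class PartitionedTopicSketch {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")       // assumed admin endpoint
                .build();

        // Split the topic across 4 partitions so several brokers can share the load.
        admin.topics().createPartitionedTopic(
                "persistent://my-tenant/app/clickstream", 4);  // hypothetical topic
        admin.close();

        // Clients keep using the plain topic name; the partitioning stays hidden behind the broker.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();
        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://my-tenant/app/clickstream")
                .create();
        producer.send("page-view /home");
        producer.close();
        client.close();
    }
}
```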
So, common challenges: what are some of those? If you look at why you'd use Apache Pulsar: traditional scaling requires partitioning, rebalancing, and all of these things, and having Pulsar actually separate out the compute and the storage makes scalability and rebalancing much cleaner than it would be with a component that combines both compute and logging. If you need to rebalance in that combined case, how do you do it? It's quite a messy situation. But if you separate out the concerns, each one takes care of things independently, yet they are all tied together through the broker. Tightly coupled persistence and message serving capabilities impose high costs on historical data, and the trade-offs made to support partitioned topics came at the expense of the messaging semantics needed for use cases such as queuing. So basically, if you don't separate out these concerns, then the messaging semantics have to take care of use cases such as queuing. Essentially, that's what it is.
Okay, let's take another look at this tiered architecture design of Pulsar, a multi-node architecture. So what's the big deal? Essentially it's fast, it's low impact, it supports horizontal scaling, and it reduces the capex and opex, the capital expenditure and operational expenditure, at your company. The broker is stateless, it has built-in load balancing, and the scaling is pretty much instantaneous. Disaster recovery is essentially zero impact: anytime you need to recover, it can pretty much scale up and down and take care of things itself, whereas for traditional disaster recovery you need to duplicate certain setups and so on. Because of the nature of the multi-tiered architecture design and the separation of compute and storage, it's actually a lot more flexible and more dynamic in that sense.
Now let's get into BookKeeper a little bit. BookKeeper is very scalable and is WAL-based, using a write-ahead log; this is the mechanism that handles the writing of all of these log records, and it helps with ordering as well. It provides fault-tolerant, low-latency storage services, and tunable consistency for message replication: the ensemble size, the write quorum, the ack quorum, and so on, which you can all tune in BookKeeper.
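To make those tunables a little more tangible, this is roughly what they look like in BookKeeper's own client API, where the ensemble size, write quorum, and ack quorum are chosen per ledger. Pulsar does this for you under the hood; the ZooKeeper address and the values here are just assumptions for the sketch:

```java
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerHandle;

public class LedgerSketch {
    public static void main(String[] args) throws Exception {
        // Assumes a local ZooKeeper that the bookies are registered with.
        BookKeeper bk = new BookKeeper("localhost:2181");

        // ensemble = 3 bookies hold this ledger, write quorum = each entry is written to 2 of them,
        // ack quorum = 2 acknowledgements are required before a write counts as durable.
        LedgerHandle ledger = bk.createLedger(
                3, 2, 2, BookKeeper.DigestType.CRC32, "secret".getBytes());

        ledger.addEntry("append-only log record".getBytes());  // WAL-style, ordered append
        ledger.close();
        bk.close();
    }
}
```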
As you can see, if you lumped all of this into Pulsar itself, it would be overburdening Pulsar. That's why Pulsar doesn't concern itself with things like bookkeeping: Pulsar has its own fish to fry that are important to handle, all of the brokers, all of the messaging guarantees, all of that. So rather than overburdening itself worrying about bookkeeping, it lets BookKeeper do it. There's also journaling in BookKeeper: fast writes are guaranteed through the journals, and the electronic ledgers give segment-centric data persistence. So as you can see, it goes very deep into BookKeeper. BookKeeper is a library, a project of its own under Apache, so you can always look it up for more details if you'd like, but for this particular presentation I won't go into all the deep details just yet.
Okay, so now let's look into the capabilities of Apache Pulsar. What problems is it trying to solve? Essentially it also solves the bolt-on problem, because it represents the next generation of enterprise messaging. Think of it as not coming in as a disruptor; well, a disruptor in a sense, but it doesn't really disrupt you. Meaning, if I'm on-prem, I just keep going on-prem and don't worry about it; I essentially pop in Pulsar. It's like a big, giant bolt-on: just nail it in and it can work. You configure it, obviously, and then you can get your whole thing running as is. For example, in this picture here you've got Kafka, JMS (Java Message Service), and RabbitMQ. Some of these are a bit older messaging systems now, but you have them in your company and you don't want to change them quickly; it costs a lot of time and money, and they contain very important business knowledge that would be hard to migrate. So you can keep those, and in the meantime drop in Pulsar. Pulsar is non-disruptive like this: you just plug it in and it gets to work. Why? Because it is actually a unified solution. It comes in like an ambassador, unifying everything: a unified solution for pub/sub, for streaming, for messaging, for queuing, and also for message mediation, enrichment, and transformation. If you think about that kind of capability, I don't think there's another project or library out there that can support all of these: pub/sub, queuing, streaming, message mediation and enrichment.
And then the out-of-the-box capabilities: think of the fact that you can run on-prem or hybrid, and there's geo-replication, which is a big thing in Pulsar as well, for messaging where you want replication across geographical areas. It already has a lot of very useful, even user-level, capability built in, and we'll take a look over the next couple of slides. So, geo-replication with multi-region support: you can have multiple regions, maybe even within one geographic area, and it has fairly fine-grained support. For each area you can have different countries; say you're a huge enterprise corporation, you have different rules in different regions, and basically it's a matter of configuration within Pulsar.
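To give a sense of what "just configuration" can look like, geo-replication is typically switched on per namespace by listing the clusters it should replicate across. The cluster and namespace names below are invented, and the tenant would also need to be allowed on those clusters:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import java.util.Set;

public class GeoReplicationSketch {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")   // assumed admin endpoint
                .build();

        // Messages published to any topic in this namespace are now replicated,
        // hands-off, across the named regional clusters.
        admin.namespaces().setNamespaceReplicationClusters(
                "my-tenant/payments",                      // hypothetical namespace
                Set.of("us-east", "eu-west", "ap-south")); // hypothetical cluster names

        admin.close();
    }
}
```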
And if you get into data lakes, data mining, all of that is also supported, and much, much more. And this is only the tip of the iceberg: the project is still not that old, and there's still a lot being planned for it. Okay, so basically it's a unifying platform for all events in the enterprise. It's essentially the same kind of picture, but in this particular case you can have this already on-prem, and Pulsar will do the job with very minimal disruption.
Okay, here's another picture: unified infrastructure with built-in geo-replication. As you can see, there's multi-cloud, hybrid cloud, and multi-region in here. And if you look at the right-hand side, a lot of large corporations have systems that are very complicated; it's really like a zoo. Look at this example: you've got almost every kind of software you can think of. Oracle for the database, traditional RDBMSs, Postgres, MySQL. Then you have cloud messaging, Amazon SQS and Confluent. Then there are different types of programs: Python, Golang, Java, JavaScript, all of these things. NoSQL databases, like Cassandra, MongoDB, and Redis, whatever you want. All of these can coexist. You don't get rid of things; they just keep running. Pulsar will come in and enable and augment everything. That's the goal, the benefit of Pulsar. And if you take a look at it, it gives you this universal upstream connectivity and also universal downstream connectivity. If you look below here, you can also connect to stream analytics and processing like Flink, Spark, and Databricks, and there are data lakes and warehouses like Amazon S3, Snowflake, and Hadoop, for example. And for traditional messaging, like Java JMS, Kafka, and MQ, it also has compatibility layers built to talk with them. So it has really positioned itself very well.
So how is Pulsar different? It's a next-generation architecture: it provides a distributed, tiered architecture, it separates compute from storage, and ZooKeeper holds the metadata for the cluster. The stateless broker handles producers and consumers, and storage is handled by Apache BookKeeper, which we already talked about a couple of slides back. Then there's a rich ecosystem of connectors and clients. There are plenty of connectors; a really common use, for example, is building data pipelines and sending data into a sink. It can be Elasticsearch, MongoDB, Hadoop, HBase, Cassandra, ClickHouse, Flume, all of these. So the possibilities are, I should say, limitless.
Okay, so Pulsar features rapid patching with zero downtime, and it's multi-tenant. You can have soft isolation via read/write I/O separation, independent storage quotas, and message flow control and throttling mechanisms. Hard isolation you can do via physically separate brokers and bookies for tenants. So it's very flexible in terms of multi-tenancy capability. There's also IAM, identity and access management, with pluggable authentication supporting TLS (the next-generation version of SSL), JWT, Athenz, and Kerberos, and role-based authorization providing control at the cluster, tenant, message broker, producer, and consumer level. It also provides end-to-end encryption: in-transit TLS encryption and application-managed content encryption, so you can rest assured your data will be safe.
So, key differentiator number one: separation between compute and storage, as I already talked about. That's why scaling can be independent; the two don't interfere with each other. Storage is handled by Apache BookKeeper, with segment-centric message storage management and fast, low-impact horizontal scaling capability. The next thing is native geo-replication. You can have hands-off, real-time message replication across data centers, which is really nice: you can have data centers in Europe, in North America, in all of these places, and it handles all of the geo-replication for you. It also helps you meet data compliance requirements across geo regions, so that's a really big plus.
Then there's also multi-tenancy. As I talked about, it's like an apartment building: you have independent units, isolated from one another, but all handled by Pulsar. Likewise, in a company you can have one Pulsar cluster and separate it out by function: a tenant for finance, another tenant for marketing, another for product or engineering. Within each, they have different namespaces that handle different functionality, as you can see here. You can have a namespace for a microservice that handles message topics and everything. Then there's marketing, with namespaces for handling things like campaign management and lead generation, and then finance, with fraud detection, for example. All of this is already built in and it's not too complicated; some things are just configuration, and then your system is up and can handle all of these capabilities that you'd otherwise have to write yourself.
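A rough sketch of what that configuration could look like through the Java admin client, with invented tenant, namespace, role, and cluster names (the TenantInfo builder shown here assumes a reasonably recent Pulsar client version):

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TenantInfo;
import java.util.Set;

public class MultiTenancySketch {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")         // assumed admin endpoint
                .build();

        // One cluster, several isolated tenants: finance, marketing, engineering.
        for (String tenant : new String[]{"finance", "marketing", "engineering"}) {
            admin.tenants().createTenant(tenant, TenantInfo.builder()
                    .adminRoles(Set.of(tenant + "-admin"))        // hypothetical role names
                    .allowedClusters(Set.of("us-east"))           // hypothetical cluster
                    .build());
        }

        // Namespaces group topics by function inside each tenant.
        admin.namespaces().createNamespace("finance/fraud-detection");
        admin.namespaces().createNamespace("marketing/campaign-manager");
        admin.namespaces().createNamespace("engineering/microservices");

        admin.close();
    }
}
```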
Number four: a flexible message processing model. As you can see, there's the pub/sub kind of model, or you can do queuing too. What's interesting is that on a topic you can have an exclusive subscription, in which a consumer and the topic are tied very tightly together: the subscription is exclusively for that consumer. Then there's the failover type of scenario: you can have more than one consumer, but only one is the primary, and if something fails, the other consumer steps in to take charge and receive the messages. Then there are also shared subscriptions. Shared is a very important concept, especially in a cloud native environment. You can share the subscription, and part of the benefit is also cost: in the cloud you're billed by how many CPU cycles you use, and with a shared subscription you're actually saving on that cloud cost. And then there's a special kind called key_shared, where everything is essentially partitioned by key, so messages with the same key go to the same consumer. It's a good fit for streaming use cases as well, and Kafka has challenges in that area.
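Here is a rough sketch of how those four subscription modes look on the consumer side with the Java client; the topic and subscription names are placeholders:

```java
import org.apache.pulsar.client.api.*;

public class SubscriptionModesSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")            // assumed local broker
                .build();

        ConsumerBuilder<String> base = client.newConsumer(Schema.STRING)
                .topic("telemetry");                              // hypothetical topic

        // Exclusive: exactly one consumer owns the subscription (the default).
        Consumer<String> only = base.clone()
                .subscriptionName("audit")
                .subscriptionType(SubscriptionType.Exclusive)
                .subscribe();

        // Failover: several consumers attach, one is primary, another takes over if it fails.
        Consumer<String> standbyAware = base.clone()
                .subscriptionName("alerting")
                .subscriptionType(SubscriptionType.Failover)
                .subscribe();

        // Shared: messages are spread across consumers, queue style.
        Consumer<String> oneOfMany = base.clone()
                .subscriptionName("workers")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        // Key_Shared: shared, but all messages with the same key go to the same consumer,
        // which preserves per-key ordering.
        Consumer<String> keyed = base.clone()
                .subscriptionName("per-device")
                .subscriptionType(SubscriptionType.Key_Shared)
                .subscribe();
    }
}
```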
So, why Pulsar? I want to talk a bit about why it's better than Kafka. I wouldn't say it's better in every way; sometimes it depends on what your scenario is. If you're in a situation where you don't need all of this fancier geo-replication and a giant setup, then maybe it doesn't apply to you. But if you're interested in reading about it, there's a link here: under Confluent there's a piece on Kafka versus Pulsar, and another one is a Kafka versus Apache Pulsar event streams comparison, plus a features-and-myths exploration. So, when should you consider Pulsar? If you need both queues, like RabbitMQ, and stream processing, like Kafka; or you need easy geo-replication; or multi-tenancy is a must and you want to secure access for each of your teams. Or you want to persist all of your messages for a long time and you don't want to offload them to another storage. Or performance is critical for you, and your benchmarks have shown that Pulsar provides lower latency and higher throughput, so you can count on the performance being quite good. And if you run on-prem and you don't have experience setting up Kafka but you have Hadoop experience, these might all be reasons why you'd want to use Pulsar.
So, DataStax flavors of, sorry, not Kafka, of Pulsar. I'll go really quickly. DataStax is essentially taking Pulsar ten times further: we have added our special sauce, so to speak. All of these things are already there, and there's binary compatibility with JMS, MQ, and Apache Kafka. We have libraries that help you with legacy systems, things like transformation and migration. So think of us as enabling you to be even more productive; Pulsar meets you where you are. Again, I already mentioned Astra Streaming, which is managed Pulsar. There's Luna Streaming, which is basically the open source version but with the option of enterprise support for Pulsar. And then there's also the pure open source route: if you want to go all open source, you can use the open source version, which is all community driven.
Comparing against Kafka, I just want to highlight a few pain points. Basically, Kafka doesn't have the separation of compute and storage. When you start off, that's maybe more straightforward: you know how many topics, how many partitions, how many brokers, you plan it all out ahead of time and it's fine. However, if there's a need to keep growing your system, it can become more difficult, because what do you do now? You've already defined a number of topics and partitions, and if you want to change that, it's a bit harder to do. You can do it, but it just requires a lot more work. And if you look into it, cluster rebalancing can impact the performance of connected producers and consumers. There is a geo-replication mechanism, for example the tool called MirrorMaker, but it's not very ideal at this point, and companies like Uber have created their own solutions to overcome these kinds of issues. So overall, you can look at Pulsar as an extension, essentially augmenting what Kafka does.
So, partition-centric versus segment-centric. I think I already showed this to you. As you can see, it compares Kafka, over here, where everything is partition-centric and Kafka manages the storage itself, so all log segments are replicated in order across brokers. Whereas if you look at Pulsar, BookKeeper handles all of that magic for you, the ledgers and the journaling. It becomes easier if you need to replicate or do any kind of scaling out, and it's built more for scalability and for when you need to rebalance. That's another thing: you need to rebalance if you need to grow your system with more clusters, more nodes, and Pulsar actually grows nicely because it already has that built in; it knows how to rebalance all of these topics and how they're partitioned, things like that. All right, the architecture advantage: compute and storage separation, which we already talked about, and segment-oriented log messaging, over here.
So, where to go from here, and let's keep in touch. Okay, I think I'm running out of time to do a demo, but I wanted to bring these resources to your attention: pulsar.apache.org, bookkeeper.apache.org, zookeeper.apache.org. You can read up on all of the details there. And at DataStax we offer Astra Streaming over here, or Luna Streaming; Luna Streaming is essentially all free unless you want to purchase the enterprise support. Again, DataStax is very much about open source; we are strong supporters, and we have folks working as Cassandra and Pulsar committers and PMC members. And please follow my Twitch stream every Wednesday at 2:00 p.m. US Central time; these days I'm getting into talking more about event streaming and Pulsar. I've been with DataStax for a little over a month, so I haven't gone as deep yet, but I will. I also have other topics too, like developer chats, for example last week at DevNexus, and things like that. And please join us in the neighborhood: this is all new, we're starting off a website called Apache Pulsar Neighborhood, over here, and there are also meetups on meetup.com, so please follow us there, Apache Pulsar Neighborhood. And with that, I want to thank you all. Thank you for listening to my talk, and please stay connected with me: join my Discord server, I'll be happy to talk about anything, and also follow me on Twitter; my LinkedIn handle is right here as well. With that, thank you very much.