Transcript
This transcript was autogenerated. To make changes, submit a PR.
I'm very happy to be here at the 2023 edition of Conf42's Cloud Native track, and thank you to everyone for listening to my talk. My topic is the event-driven change data capture pattern using Apache Pulsar. You might be wondering what this has to do with a cloud native track. In fact, I want to point out that this particular pattern is ideal if you are running things in a cloud native kind of environment, and I'll explain why shortly. So let's first take a look and see what this pattern is. But before we do that, let me give a quick introduction of myself.
If you have not heard me talk before, my name is Mary Grygleski. I'm a streaming developer advocate at Datastax, a company based in California that specializes in data management software. Its flagship product is based on Apache Cassandra; if you're familiar with NoSQL, I'm sure you'll be very familiar with Apache Cassandra as well. We are the commercial company behind Apache Cassandra, which is a NoSQL database supporting big data, very fast and very efficient. Our streaming platform, which is newer, about two years old, and where I work, is also open source based, built on Apache Pulsar. And that's what I'm going to spend some time talking about today: Apache Pulsar, an event streaming platform built ideally for cloud native environments. One other note: Datastax also recently acquired a company called Kaskada, which makes real-time AI software, so the company is moving in that direction too. Given that we already have data at rest covered very strongly with Apache Cassandra, and we're also supporting data in motion with this event streaming platform, it then comes time to apply it to higher forms of specialty, which is real-time AI. So that's something new I just wanted to point out. That's the company I'm working for right now; I've been here for a year, and previously I worked for IBM as a Java developer advocate.
My areas back then included reactive systems, and also the open source WebSphere side of things, which is Open Liberty, MicroProfile, and Jakarta EE. So those were my areas. I'm based in Chicago; I'm also a Java Champion and a community builder. I currently lead the Chicago Java Users Group, and I've been its president since the beginning of the pandemic and am still keeping it strong. I should also give you a bit of background: I was previously an engineer myself, so I have hands-on experience, but I became an advocate about five years ago. My background includes a lot of software engineering work, design and development, as well as some technical architecture. I came primarily from Java and open source, then moved into cloud systems, DevOps, and all those things.
So that's briefly about me. Now let me show you the agenda. We're going to talk about change data capture, or CDC. For those of you based in the US, don't mistake that for the Centers for Disease Control; it's change data capture we're talking about here. Its purpose is to serve data sources such as databases, data lakes, and data warehouses. I also want to point out a bit of the history of CDC and how it came about, basically by pointing out ETL: extract, transform, load. Then I want to introduce the components of a CDC system, and also the requirements of a truly modern, cloud-native-supported CDC system. Then I'll introduce Apache Pulsar and why it's a good choice to use in combination with, for example, Apache Cassandra, which my company Datastax has, as CDC support for AstraDB, the managed cloud platform. And I'll show you a quick demo after that.
First, let's start: why change data capture? What is CDC for? As I already pointed out, it's to serve data sources such as databases, data lakes, or even data warehouses. Essentially, we're treating the data sources as the source of truth for the data; that's what CDC serves. But how does it serve them? In today's world, the state of data keeps changing, and we need to be able to capture those changes. Any time something happens to the data, we want to make sure we don't lose track of it; we have to keep track of certain types of changes. Then, with those captured changes, we sometimes want to transform them: filter them, enhance them, or do whatever is needed to the data. After the data has been transformed, the changes are propagated to downstream data systems. So what we're looking at is basically this: CDC captures changes in the source of truth of the data, does whatever manipulation is needed, and then propagates that downstream, where it becomes what is called a derived data source. Now, here is a very
simplified illustration of CDC, in case just talking about it isn't clear enough. Let's take a look. With data sources, there can be changes. Up at the top, we could be using some sort of monitoring process whose job is to detect changes. In the old ways, very often we might use polling: every so often you go out and check your data source to see if there are any changes. But polling can be slow, because in today's world, when something happens, we want to capture it in real time, as it happens. So polling may not be the way to do it, but it illustrates the idea. The detection of changes can be polling, or it can be some form of database trigger, for example. But a database trigger, as we know, may not be ideal either, because the trigger can only be fired from one database and is mapped to that database, so it may be subject to database-specific limitations and dependencies on the database itself. The point, though, is to detect changes and then capture them. The captured changes are sent through a capturing engine to a CDC system, which transforms them. Sometimes we may want to filter, do mapping, or do some sort of enhancement or even remediation: we know that some data that comes through may have discrepancies we need to detect and adjust. In other words, some sort of transformation needs to happen to the data that has been captured. After the data has been captured and transformed, the next step is to propagate it, to send it down to the downstream data system. That's essentially what the whole CDC process is, regardless of how we want to implement it.
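To make that flow concrete, here is a minimal sketch in Python of the detect, capture, transform, propagate loop. All of the function and field names (`detect_changes`, `transform`, `propagate`) are made up for illustration; this is not any real CDC product's API.

```python
# Minimal sketch of the CDC flow: detect -> capture -> transform -> propagate.
# All names here are illustrative, not from any real CDC system.

def detect_changes(source_rows, last_seen):
    """Return rows that are new or changed since the last snapshot (a naive diff)."""
    return [row for row in source_rows
            if row["id"] not in last_seen or last_seen[row["id"]] != row["value"]]

def transform(change):
    """Example transformation: enhance the captured change (here, uppercase it)."""
    return {**change, "value": change["value"].upper()}

def propagate(changes, downstream):
    """Apply transformed changes to the derived (downstream) data store."""
    for change in changes:
        downstream[change["id"]] = change["value"]

source = [{"id": 1, "value": "alice"}, {"id": 2, "value": "bob"}]
last_seen = {1: "alice"}        # row 2 is new, row 1 unchanged
downstream = {}

captured = detect_changes(source, last_seen)
propagate([transform(c) for c in captured], downstream)
print(downstream)               # -> {2: 'BOB'}: only the changed row arrives downstream
```

The point of the sketch is the shape of the pipeline, not the naive diff: a real CDC system replaces `detect_changes` with triggers or, better, a transaction log, as discussed next.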
But now let's take a look at what was there before CDC. CDC is not new; in fact, CDC can be done in a slow way, but today's talk is really about a new way of doing things in real time, very responsively. Before CDC, there was a process called ETL. The short form describes it: extract, transform, load. As such, it describes the process of extracting the data from the sources, transforming it, and then loading it into a target database or data source of some kind. As we all know, ETL still has its place in the IT world, don't get me wrong. But you have to bear in mind that ETL is traditionally a very slow type of processing, because it is synchronous in nature. Processing occurs in batches: maybe it's collecting some data and extracting a set of it, bundling it up until a certain amount of data has been captured within a certain time period. From there, we take that batch of data and send it through another process of transformation of some kind. As you can see, it's a very synchronous type of process, meaning that things are not decoupled. It sends the data through the transformation, and when the transformation is done, the data is loaded into some sort of data source. It can also use a large amount of network bandwidth, because now we're dealing with data in bulk, with large sets of data. Everything is done nicely and methodically, but synchronously, so it is very slow. The process setup tends to be very heavy, too. If you look around in today's world, there are ETL tools out there, sometimes produced by companies that cater more to the older, traditional enterprise type of environment. The licensing for these tools can be quite expensive, as you know, and they very often make use of a lot of database replication behind the scenes. They tend to be expensive, with less agility associated with them, and essentially unable to keep up with the times. But then again, it's not that ETL is bad; the key is that it depends on your usage circumstances. In today's world, we want systems that respond in real time. We want to capture data and process it as it comes, not wait for it synchronously. We're looking for ways of doing things in a more asynchronous fashion, and the advantage of that is speed: we want data to be ingested at high speed, with a high frequency of ingestion, and processed with very low latency.
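As a toy illustration of why batch ETL behaves this way, here is a sketch in Python, with made-up function names, of the extract-transform-load steps happening synchronously over a whole batch: nothing moves downstream until the entire batch has been pulled and transformed.

```python
# Toy contrast with streaming: batch ETL waits until the whole batch is
# collected, then transforms and loads it in one synchronous pass.
# All names are illustrative.

def extract(source):
    return list(source)                  # pull the full data set in bulk

def transform(rows):
    return [row * 2 for row in rows]     # one synchronous transformation pass

def load(rows, target):
    target.extend(rows)                  # load the whole batch at once

source = [1, 2, 3]
warehouse = []
load(transform(extract(source)), warehouse)
print(warehouse)                         # -> [2, 4, 6], but only after every step finished
```

Each stage blocks on the previous one, which is exactly the coupling a streaming, event-driven CDC pipeline avoids.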
Understanding these kinds of needs, next come the components of a CDC system: what makes up a CDC system? I already showed you that picture, but let's look a bit more closely at the components. One is the change detection piece. Again, we're looking at this abstractly: change detection can be a process; if you are on a Unix system, theoretically it could even be some sort of script that watches for these changes, or a program that goes into the database and senses the kinds of changes occurring there. In some cases it could be a file system, too; it isn't limited to just a database. So some form of change detection process is needed. From there, any change that happens is captured and sent through a capturing engine process. It then may or may not need transformation, though a lot of the time we probably want some sort of transformation. From there, the changes get propagated, sent down to some target system.
Let's also take a look at this: traditionally, we might capture data using SQL, but by polling. For example, with SQL you can select some field and see whether it has changed, maybe since a certain time, by setting some criteria to test. That would be the SQL-database polling approach.
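Here is a small runnable sketch of that polling approach, using Python's built-in sqlite3 just for illustration; the table, columns, and timestamps are all made up:

```python
# Hypothetical polling-based change detection against a SQL table.
# The schema and the updated_at timestamps are invented for this example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, balance INTEGER, updated_at INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100, 10), (2, 200, 25)")

last_poll_time = 20  # timestamp recorded on the previous polling pass

# Each polling pass selects only rows modified since the last check.
changed = conn.execute(
    "SELECT id, balance FROM accounts WHERE updated_at > ?", (last_poll_time,)
).fetchall()
print(changed)  # -> [(2, 200)]: only rows touched after the last poll show up
```

Notice the drawback: anything that changed and changed back between polls is invisible, and the source database pays the cost of the repeated queries.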
But there's an approach that is actually more efficient, and in fact more desirable: a log-based approach to CDC.
So let's take a look here. Say we have two clients writing to a database: client one sets a variable, say x equals 1, and another client comes along and sets x to 2. As you can see, there are now changes to the column we are monitoring, x, and the data gets captured immediately. What's happening is that, at the same time, transaction log messages are sensing the changes and recording them in a transaction log, and they must be recorded in the order in which they happen; that part is very important. So the transaction log captures all of the changes, and the next step is to propagate those changes to the downstream system. Now we apply the changes: we first take the x equals 1 write and apply it to the target data system, and the next thing we know, it gets overwritten, because there was a set of x equals 2. Using this log-based system, whenever you query at the target data system level, it is always the most recent state change for this particular column that gets returned to you. That's a log-based type of system: it doesn't involve any kind of polling; the logs simply get written whenever there is a change.
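A tiny sketch of the log-based idea in Python; the log and the downstream store here are just in-memory stand-ins, not a real database transaction log:

```python
# Sketch of log-based CDC: every write is appended to an ordered transaction
# log, and the downstream system is built by replaying that log in order.
transaction_log = []

def write(column, value):
    # Record the change in the order it happened -- ordering is the
    # crucial property of log-based CDC.
    transaction_log.append((column, value))

# Two clients write to the same column, as in the x = 1 / x = 2 example.
write("x", 1)
write("x", 2)

# Propagation: apply the log entries, in order, to the downstream store.
downstream = {}
for column, value in transaction_log:
    downstream[column] = value

print(downstream["x"])  # -> 2: the latest state change wins at the target
```

Because the log is append-only and ordered, no polling is needed, and no intermediate state is lost the way it can be between polls.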
Now, understanding a little bit more about CDC, what are the requirements for a modern CDC system? Let's look at the definition of modern. It can mean many things to many people, but in today's cloud native world, so many things are about the cloud, whether on-prem, private cloud, public cloud, or hybrid cloud. These are really cloud native systems. We want systems to be very efficient. Especially if you're cloud native, we don't want a system to take up extra resources. Why? Because especially in a public cloud environment, you get charged by how many resources you are using. So you want to be very careful about how you utilize resources; the system needs to be absolutely lean and cost the least for your business. So, cloud native: we want that. The system also has to be very responsive: if requests come in, you want it to be able to respond immediately with the right thing, with a very responsive flow of data through the system. Not only that, at certain times you could have tons of data flowing through the system, and at other times very little. So the system itself needs to be very scalable, very elastic, so to speak. You have different components, different nodes in a cloud-based system, and you want the nodes to be able to shrink and grow as the need arises: a very scalable type of system. Another thing is the resiliency aspect: if a node goes down, it shouldn't slow everything down; another node should pop up to take over whatever went down. In other words, it's a very dynamic type of environment that we're talking about for CDC. That's the environment; now let's also look at the data itself. We're talking about data being transmitted, data in motion, being sent as messages. So let's look at what kinds of techniques we're using, going back to
that we're using. Okay, let's take a look again, going back
some of these requirements, reliability,
resiliency, right. So there should be, this system should
implement like a quality of service type of scenario.
Basically messages must be delivered at least once or some
cases, or actually I should say exactly once or
at least one time too. All these kind of mechanisms probably
needs to be properly defined for data to guarantee to
be delivered, to be able to get transmitted to the destination,
delivery needs to be guaranteed. In other words, something goes down,
should not affect delivery. The data, once your messages get
sent, it must be delivered accordingly to destination.
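As a toy model of at-least-once delivery, here is a Python sketch in which a message stays in the queue until the consumer acknowledges it, so a failed first attempt gets retried. The failure is simulated, and none of this is a real broker API:

```python
# Toy illustration of at-least-once delivery: a message is removed from the
# queue only after the consumer acknowledges it, so a failed delivery
# attempt is simply retried on the next pass.
import collections

queue = collections.deque(["change-1"])
delivered = []
attempts = 0

while queue:
    message = queue[0]
    attempts += 1
    # Simulate the first delivery attempt failing (e.g. consumer crash):
    ack = attempts > 1
    if ack:
        delivered.append(message)
        queue.popleft()      # remove only after acknowledgement

print(attempts, delivered)   # -> 2 ['change-1']: retried once, then acknowledged
```

This is also why "at least once" can mean duplicates on the consumer side: if the ack itself is lost, the broker redelivers, which is where deduplication or exactly-once mechanisms come in.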
Another key point of a modern cloud native system is responsiveness. How is it achieved? Basically, your system needs to be lightweight and very loosely coupled. You can have multiple components that send messages in an asynchronous manner. The sender sends messages and doesn't need to wait for the receiver; it's basically, "I'm going to send the messages, and somebody will take care of them." In a pub/sub, publish-subscribe type of system, it's the broker that handles the messages, labeling them accordingly, and on the receiving side, the subscribers say, "I need these messages," and subscribe to them. You can go about doing your own thing; on the producer side, when I have messages to send, I send them to the broker through what is called a topic, and the subscriber then consumes from the topic when the data comes, and only when the data comes. At other times it could be doing something else, not holding up the whole system, so to speak. So that's responsiveness. And then there's also the scalability aspect.
I just brought up the pub/sub type of approach, and it's really a good approach, because a publish-subscribe system deals in loosely coupled components. The publisher and the subscriber are not tightly coupled to each other: my job is to publish, so I just keep publishing; the subscriber needs data, so it just subscribes to the topics. If you think about it, in that sense the scalability aspect of a system can be guaranteed too: you can produce many messages and let the broker take care of all of them, dealing with things like message retries and deduplication of messages. So the scalability aspect is absolutely crucial. One last thing to point out here: when it comes to data transmission, the ordering of the messages is very important, especially in some systems. For example, the messages coming in might represent the steps of some money being deducted from an account and then maybe added back. All of these need to be kept in the order in which they happened, so we have to make sure a CDC system preserves the ordering of the messages as well.
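Here is a minimal in-memory sketch of that pub/sub decoupling in Python: the producer publishes to a named topic and moves on, the subscriber pulls when it is ready, and the broker preserves the order of publication. The `Broker` class is purely illustrative; a real broker like Pulsar does far more (persistence, acks, retries, deduplication):

```python
# Minimal pub/sub sketch: producer and subscriber never talk to each other
# directly; the broker decouples them and preserves publication order.
import collections

class Broker:
    def __init__(self):
        self.topics = collections.defaultdict(list)

    def publish(self, topic, message):
        self.topics[topic].append(message)   # producer does not wait for anyone

    def subscribe(self, topic):
        return list(self.topics[topic])      # subscriber pulls when it is ready

broker = Broker()
broker.publish("db-changes", {"x": 1})
broker.publish("db-changes", {"x": 2})

received = broker.subscribe("db-changes")
print(received)  # -> [{'x': 1}, {'x': 2}], in the order they were published
```

Because the list per topic is append-only, a subscriber always sees the x = 1 write before the x = 2 write, which is the ordering property the account-debit example depends on.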
Just now we talked about the requirements, but how can they be met? There are so many different types of systems, and basically no single traditional system can handle that type of requirement for modern-day processing. The good news is that in today's world we have solutions, and they require all of us, if we're implementing or designing these systems, to think in a different way; it basically requires a paradigm shift from what we are used to. So how about taking an event-driven approach? That would actually be perfect for a CDC type of implementation. Now let me give you a very quick
overview of Apache Pulsar. Folks have probably heard more about Apache Kafka, but Apache Pulsar is capable of the same kinds of things, and it takes the same kind of approach, making use of the pub/sub model of message delivery: receiving messages from producers and delivering them. Pulsar itself is open source, created at Yahoo back in around 2013. Yahoo then decided to donate it to the Apache Software Foundation in 2016, and it quickly became a top-level project in 2018. It is designed with cloud native in mind, with a very good design and clustering already built in. It also has what is called multitenancy, which is basically a way of allowing you to organize your data according to how you want to segregate it: you can have a company with different departments and have different tenants handle their messages accordingly. That's already built into Pulsar. Also, if you want to work with Pulsar, the client bindings, the APIs, are available in different languages, including Java, C# if you're a .NET person, Python, and Go, and even community contributions like Scala and Ruby. Pulsar also separates out compute and storage. With CDC, we're dealing with tons of messages coming through; it's detecting changes in the database, and there could be a lot of them. Basically, you want Pulsar to worry about delivering all these messages to the target, propagating them downstream faster, and not worry about the storage part as such.
As messages come in, we don't just throw them away; we need some way of logging them. Think of how I talked about using a log-based type of system: in some ways, we are leveraging Pulsar as the logging mechanism for storing all of these messages. Pulsar does this very efficiently; it makes use of Apache BookKeeper to help with storing all of the messages, and you can also do message replay afterwards, which I won't go into in detail now. For Pulsar, the requirements around message delivery and ordering are very important, and not just ordering: the delivery part is absolutely important. Messages you send out should never disappear, and Pulsar has that built in already. The Pulsar broker is a serverless runtime, and it guarantees that any message that comes through Pulsar will be delivered to the intended target. There's also the Pulsar Functions framework, which allows you to transform data as well. It comes in very handy for the transformation step I talked about: without using any external dependencies, you can immediately leverage Pulsar Functions to transform the data as it comes in, before you propagate it down to a derived data source target system.
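As a sketch of what such a transform step can look like: in Pulsar's native Python runtime, a function can be written as a plain Python function that takes the incoming message value and returns the transformed output, which Pulsar then writes to the output topic. The masking rule below is my own made-up example of remediation, not anything from the Pulsar docs:

```python
# Sketch of a transform in the style of a native Python Pulsar Function:
# a plain function that receives each message value and returns the
# transformed value. The redaction rule is an invented example.

def process(input):
    # e.g. mask anything that looks like a long card number before
    # propagating the change downstream
    fields = input.split(",")
    masked = ["****" if f.isdigit() and len(f) >= 12 else f for f in fields]
    return ",".join(masked)

print(process("alice,4111111111111111,chicago"))  # -> alice,****,chicago
```

Deployed as a Pulsar Function, the framework would call this once per message between the input topic and the output topic, so the transformation needs no separate service of its own.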
Pulsar is also very smart about recognizing when your data becomes cold, meaning it's sitting in warm storage, which costs more money, without being used. If you're not using that data, Pulsar can move it to offline storage such as S3 buckets or HDFS, the Hadoop file system, which are much less expensive kinds of storage. Pulsar takes care of things like that for you as well.
Okay, here I just want to quickly show you what Apache Pulsar looks like as a project. Over on the right-hand side, you can see the increasing number of committers, GitHub stars, and activity, so it's definitely increasing in popularity, and in 2021 it was actually one of the top five Apache projects in the foundation. This is a brief history; I won't go into all the details. And who else is using Pulsar? As you can see, big companies such as Yahoo, Overstock, Splunk, Verizon, and GM are already using it. It also has connectors and clients, as you can see over here from all the logos from different companies, as well as the language bindings, the language clients, on the right-hand side. It interacts well with other messaging systems, including Kafka; it's absolutely a very flexible type of system. I won't go into all the details here; we already talked about producers and consumers, which are what you write code for, and there's actually a new UI coming out that helps you write your producer and consumer code. The broker is a serverless process that very efficiently manages all of the message delivery. It communicates with BookKeeper, which manages all the backend storage, and the broker also communicates with ZooKeeper, another Apache project that helps with managing your cluster metadata and handles the coordination tasks between the Pulsar clusters, and so on. I won't get into all of the details; I just thought I'd provide them here.
Now let's get back to CDC: enabling CDC for AstraDB. I work for Datastax, so we have the data and storage part covered with Apache Cassandra. As you can see, for a database table we want to capture changes: you create a serverless database, then create some tables, one for data and one for CDC messages, and then create the streaming tenant. That's the procedure for how you build out change data capture: you enable CDC for Astra to connect everything together. Those are essentially the steps. Then, for the downstream targets, you build a streaming sink to receive the changes that have been transformed and send them back down to the CDC messaging table, so you can then add those data back to the data table. In other words, you can detect changes in an Astra database, a Cassandra database, set up the streaming tenant, and enable CDC (make sure you do that). Then, for any kind of transformation, you can use a Pulsar Function, as I mentioned before, and write all the transformed data back down to the downstream targets. You can also send it back into the Cassandra database, but transformed.
So that's CDC for Astra with a streaming sink, and these are things our company can do for you. I also want to point out that Pulsar meets you where you are. We have a managed cloud platform called Astra Streaming, which is our managed Pulsar. If you want to manage your own cloud, you can use our Luna Streaming: you can write your Helm chart and deploy it to your Kubernetes cluster and everything. Or you can use the completely open source version and run it on your desktop as well. So it's really, really flexible. Datastax has different streaming-related products: there's Luna Streaming, which, as I said, is the customer-managed platform that you manage yourself, and there's also CDC for Cassandra. If you want to use our managed cloud, you can use Astra Streaming, and there's also CDC for Astra database within the GUI, which I'll show you shortly.
So let me do that. Actually, I probably don't have enough time, given the 30 minutes or so, so what I'll do is give you the link, and then you can take a look at it yourself. Okay, maybe let me still do it quickly. Here we go. This actually is the link to our documentation; it describes how you can do CDC for AstraDB. The link is also towards the end of our slides, so if you want to reference it, go to this page, and it will guide you through how to actually set up CDC, change data capture. This particular example helps you create a tenant and create a topic using Astra Streaming, and then you can create a table, or use any table you want, from which you want to capture changes to a column. Then you have to enable the CDC feature for AstraDB. Once it's enabled, you are all set, and from that point on you can connect your streaming tenant to a sink; this example uses Elasticsearch, but you can actually replace it with AstraDB. The sink is the place where you receive the changes after they've been captured. This particular example assumes the use of Elasticsearch, and it will guide you through how to do it. As such, I don't have quite enough time, so I won't get into all the details; again, this particular link has been shared in the slides themselves. Let me get back to the slides, but I also want to really quickly point out our Astra database over here. We offer a $25 credit per month that you can use for free, and every month it refreshes itself, for you to do personal experimentation and projects. To sign in, in my case, I sign in with my account; somehow I think things are a bit slow today, but let's see. Okay, so it's just as easy as this: when you sign up, you don't need a credit card for the $25-credit free tier; you give your name and your email address, and that's all that's needed. Over here, if you want to create a database, it's over here, and you can create your tenant over here; on the left side it shows you Streaming and Databases. So this is how you create them, and then, as I mentioned earlier with the CDC example, you can step through that example and see how it is done at your own pace.
Okay, let me go back now; we're finishing this presentation. Where to go from here? Please do keep in touch. These are resources for Apache Pulsar and Astra from Datastax, and also the Pulsar, BookKeeper, and ZooKeeper projects from Apache. We also have, in the middle right here, our Astra database link. Actually, I should have gone back a little bit here: all of these are Astra Streaming, Luna Streaming, and also CDC for Astra, so that's where that page comes from. Please do take a look at that. And here is additional credit if you're interested: if you're a small business and you need more credits, our company is very supportive of you too, so please visit this link and use this particular "open source 200" code to get an additional $200 in credits, or stay in touch with me and I'll help you too. For community resources, if you want to stay in touch with the community side of the open source Apache Pulsar project, here are the links, and also the Slack channel. I myself also have a Twitch stream every Wednesday at 2:00 p.m. when I'm not traveling; just follow me, and when it's up and you're available you can watch it, or there are also recordings kept for 60 days. Thank you. With that, I really want to thank you for having sat through and watched today's presentation. Yes, please stay in touch: follow me on Twitter and LinkedIn, connect with me, and also on my Discord channel; I'll be open to any kind of conversation at any time. Everybody, good luck with all your projects, and happy coding.