Transcript
Hi, everyone.
Welcome to the Conf42 Kubernetes event.
I am glad to present this session.
This talk demonstrates how to build a real-time data processing pipeline on serverless infrastructure from day one.
I hope you enjoy the session and that it helps you understand the different technologies and tools for real-time data processing in today's modern streaming environments.
So with that, let's get started.
Let me first introduce myself.
My name is Babur.
I am a developer advocate at Glassflow, and I am also a Microsoft MVP for Azure AI Services.
I like to build AI applications, and I am an evangelist for data streaming tools.
If you would like to connect with me, you can always send a request on LinkedIn.
If you have any questions about this session, during or after it, I would be more than happy to address them.
So let's start with our story.
Our story begins with a data engineer, Bob Dreamer, who works at a medium-sized company called Dream together with more than 50 colleagues.
They are building an innovative online car-selling platform where you can register and list your vehicles to buy or sell.
Bob was highly motivated by the recent AI hype, since everybody around him talks about AI.
He comes up with a brilliant idea: creating a real-time price recommender for their existing website using modern data tools.
Let me explain how that works.
As new cars are registered in their primary database, which could be a NoSQL database, the app should suggest a price for the vehicle using AI, and it needs to show the recommended price on the website in real time so users quickly receive appropriate feedback.
Bob presents his idea to the team.
The team is curious about the outcome and gives Bob some time to build a proof of concept.
Bob starts his investigation by understanding the difference between batch processing and real-time data processing.
With his background in batch-processing tools, he discovers that in practice both collect and process data for use in downstream applications.
The only real distinction between batch and stream processing is that they operate at slightly different time intervals.
For a long time, the inflexibility of batch orchestrators and schedulers kept data teams maintaining separate sets of systems: one system for batch processing, another system for stream processing.
But it doesn't have to be that way.
Nowadays, real-time data streaming pipelines can solve this problem and combine batch processing and stream processing into one workload, so that you can use the same tools, technologies, and infrastructure to process and transfer your data.
Here in the diagram you can see how a real-time data streaming pipeline works.
This solution is essential for applications that need up-to-date information, like monitoring live events or updating dashboards in real time.
As you can see, we have a data source and a data destination, and the middle layer is where all the transformation happens.
As changes happen on the source side, they are captured in real time by the transformation part.
In the transformation you can do aggregation, filtering, and enrichment, or maybe bring in ML or AI models, and then send the output to the intended destinations: data lakes, data warehouses, analytical services, or even real-time applications and microservices.
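To make that transformation step concrete, here is a minimal, framework-agnostic sketch in Python of what filtering, enrichment, and a simple aggregation can look like for a stream of events. The event fields and the running-average state are illustrative assumptions, not part of any specific tool.

```python
from collections import defaultdict

# Running state for a simple aggregation: average price per car brand.
price_sum = defaultdict(float)
price_count = defaultdict(int)

def transform(event: dict):
    """Filter, enrich, and aggregate one source event (illustrative fields)."""
    # Filtering: drop events that are not car listings.
    if event.get("type") != "car_listing":
        return None

    # Enrichment: add a derived field.
    event["price_per_km"] = event["price"] / max(event["mileage_km"], 1)

    # Aggregation: maintain a running average price per brand.
    brand = event["brand"]
    price_sum[brand] += event["price"]
    price_count[brand] += 1
    event["brand_avg_price"] = price_sum[brand] / price_count[brand]

    return event  # forwarded to the destination (warehouse, dashboard, microservice, ...)

if __name__ == "__main__":
    sample = {"type": "car_listing", "brand": "vw", "price": 18000, "mileage_km": 90000}
    print(transform(sample))
```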
Once Bob understands data streaming pipelines, he comes up with his first solution for the real-time price recommendation system.
As you can see in the diagram, he uses Debezium to capture data changes and streams those events via Kafka to a prediction service.
The prediction service is actually just a Python application that connects to the OpenAI API to calculate the suggested prices and sends the enriched output to the web application or to an analytics tool, so they can see how prices are changing and how often, and understand which users are buying or selling, and so on.
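A rough sketch of what that first solution could look like in Python: a Kafka consumer reads Debezium change events, a placeholder prediction step stands in for the OpenAI call, and the result is forwarded to another topic. The topic names, field names, and Debezium envelope layout are assumptions for illustration; adjust them to your connector configuration.

```python
import json
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

consumer = KafkaConsumer(
    "dbserver.public.cars",                      # CDC topic produced by Debezium (assumed name)
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def predict_price(car: dict) -> float:
    # Placeholder for the AI call (e.g. the OpenAI API) that Bob's prediction
    # service would make; a real implementation would prompt a model here.
    return round(25000 - 0.05 * car.get("mileage_km", 0), 2)

for message in consumer:
    change = message.value
    # Debezium change events typically carry the new row under "after"
    # (possibly nested in "payload" depending on converter settings).
    car = change.get("payload", change).get("after") or {}
    if not car:
        continue
    car["recommended_price"] = predict_price(car)
    producer.send("car-price-recommendations", car)  # consumed by the web app / analytics
```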
However, Bob faces several problems with this first solution; let's say he lacked the experience.
First of all, with Kafka: he has a Python background, and he didn't want to deal with the infrastructure complexities because he doesn't have enough experience in DevOps or Kubernetes.
Due to time constraints, he wants to build a simple proof of concept and bring it to production.
His goal is also to implement everything in Python, obviously, because there might be a data science approach of calling ML models to calculate the predicted price.
He also wants to make sure that multiple data engineers on his team can collaborate on one pipeline.
In addition to this, he needs the prediction service to notify the web application and other systems at the same time.
Bob continues his discovery by searching on Google, asking GPT, and asking his friends about their experiences with Kafka or other real-time data processing tools.
Bob hears stories from Kafka users about taking nine months to implement production-ready Kafka-based data pipelines when they are self-managed.
There are customers with 50 teams relying on a single Kafka cluster that is maintained by only one person, and sometimes data engineers cannot easily simulate a production environment without a complex initial setup.
We also see data scientists struggling with data integration: building offline ML pipelines, experimenting with and reproducing models without pushing them to prod, or debugging them locally.
There are many problems with this Kafka-related infrastructure, and he also determines that self-managing Kafka presents its own problems.
It is difficult to determine which team or individual is actually responsible for Kafka operations: Bob, the DevOps team, or the software engineering team.
Configuring Kafka correctly from scratch is also problematic; you don't know how many clusters your organization needs, or whether you need a single cluster or multiple clusters.
Deploying changes continuously to Kubernetes or other environments brings additional problems as well.
Upgrading Kafka brokers is always challenging.
Sometimes you don't know how far to scale your Kafka brokers or how much storage to give to your topics, and so on.
Another big topic, of course, is monitoring Kafka performance and health, which is quite challenging when you have a self-managed Kafka environment.
Managing Kafka clusters, as we said, can be expensive, both in terms of infrastructure and operational costs.
At this point, you start to consider using managed Kafka providers.
As you can see on the screen, I highlighted some well-known managed Kafka providers like Redpanda, Aiven for Apache Kafka, Confluent, one of the biggest ones, and new startups coming up like WarpStream.
However, this still doesn't solve Bob's problem, because they still need a dedicated engineer who understands his specific use cases and business needs and can spin up the Kafka environment in a cloud provider or in Confluent.
The chosen provider also has to be secure enough; you need to make sure that everything is secure.
You additionally need to watch out for over-provisioning and fluctuating costs, because if you don't manage these costs you may end up with a high bill.
You need to understand the cost model of the Kafka provider you are using.
Another drawback: once you start using a managed provider, you need to bring in your data transformation logic dynamically, but to do that you need a separate environment in your Kafka provider.
Amazon MSK, for example, offers quite a useful tool for managing your Kafka clusters, but if you would like to do some transformation, you also need to use, and additionally pay for, Lambda functions.
Bob also starts to understand that data engineers often depend on other teams to maintain data infrastructure.
Sometimes data engineers need to ask DevOps to help provision cloud resources, manage Kafka, or deploy new data pipelines just for the sake of a proof of concept, and this all creates delays.
So Bob wants some self-sufficiency in Python: he can take care of the logic for the data pipeline and orchestration himself, and the rest, when it comes to infrastructure, should be handled by some other technology.
Bob combines all the problems that, as I said, data engineers and scientists are facing in 2024, and he writes a blog post.
If you scan this QR code, you can learn more about other problems data scientists and engineers are facing nowadays.
My advice is to start by discovering your options, first of all without using any Kafka providers.
Nowadays there are other tools from the cloud providers like Azure, Google, or Amazon, as well as other message brokers coming up like NATS, which is open source, used by many people nowadays, and gaining a lot of popularity.
And Bob, as he is a dreamer, dreams that there are also stream processing frameworks in Python that can solve his existing problem.
So he starts to investigate stream processing frameworks in Python.
Let's focus on when and why we can use a stream data processing framework for Python.
First of all, we know that Python is the go-to language for data science and machine learning.
As you can see in the diagram, a Python-based framework can unify the streaming data platform and stream processor components of the streaming data pipeline architecture.
You can connect directly to any data source using the built-in connectors these frameworks provide, right from your application code, without knowing how Kafka or other message brokers work, and you can focus only on your business logic.
In other words, stream processing frameworks combine both the Kafka and Flink environments for, let's say, stateful data stream processing operations.
Then Bob asks himself: okay, why use Python frameworks for data streaming?
There are several reasons.
If you are a Python developer, you can use these frameworks out of the box with any existing Python library: no JVM, no wrapper, no orchestrator, and no server-side engine are needed.
You can bring pandas or any libraries you are used to working with.
If you are a data engineer, you don't need to deal with infrastructure; you can start with such a framework without any complex initial setup, such as creating clusters, and Python frameworks offer a zero-infrastructure environment for you.
If you're a data scientist, there is no need to learn Kubernetes; you can run your code locally, right from a Jupyter notebook, to try out your pipeline.
And if you're worried about your data, your original data stays where it is in your primary database; the framework just runs a transformation layer on top of it and sends data to your output destinations.
For example, you can use change data capture tools like Debezium by connecting directly to the primary database, doing the computational work, and saving the result back to your primary database or sending the data to output streams.
With that in mind, our team built a stream processing framework for real-time and event-driven data.
It's called Glassflow.
Glassflow is a serverless, Python-centric real-time data processing pipeline.
You can build your data pipelines within minutes without setting up any infrastructure.
With just a few clicks or using the CLI, you can bring up your first data pipeline, let's say, within the first 15 minutes.
Glassflow excels at real-time transformation of your events, so that applications can immediately react to new information.
In other words, creating real-time data pipelines with Glassflow is as simple and casual as Yusuf shooting at the Olympics.
No infra, no Kafka, and no Flink are required.
How we build data pipelines is very simple.
You just connect to your data sources, which send real-time data, using built-in integrations, without writing any code in a low-code environment; and if there is no built-in integration, you can build your own without spending much time by implementing a custom connector using the Glassflow SDK.
Then you write your data transformation logic in Python.
Once you deploy this transformation function to your pipeline, it runs on auto-scalable serverless infrastructure, and you can easily deal with billions of log records in real time.
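Here is a minimal sketch of what such a transformation function can look like. The `handler(data, log)` signature follows the pattern in Glassflow's published examples but should be treated as an assumption; the event fields are illustrative.

```python
def handler(data, log):
    # "data" arrives as a Python dict from the pipeline; field names are illustrative.
    log.info(f"received event: {data}")

    # Business logic only: enrich the event, no broker or cluster code needed.
    price = data.get("price", 0)
    data["price_band"] = "premium" if price > 30000 else "standard"

    # Whatever you return is forwarded to the configured destination.
    return data
```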
Glassflow is also built using modern technologies you already know, like Docker, NATS, and Kubernetes under the hood, and it abstracts them away so you don't have to manage that infrastructure yourself.
Let me show you some real examples of pipelines I built myself using Glassflow.
For example, you can build a real-time clickstream analytics dashboard.
You can use some analytics API, like Google Analytics, in Python to collect clickstream data from your website and send it to a Glassflow pipeline.
Your transformation function might analyze the data to calculate additional metrics, and you can use these metrics or the enriched data to visualize something using visualization tools like Streamlit or Plotly in Python.
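A minimal sketch of the visualization side with Streamlit; reading the enriched metrics from a local JSON-lines file is an assumption standing in for whatever output sink your pipeline writes to, and the column names are illustrative.

```python
# dashboard.py - run with: streamlit run dashboard.py
import json
import pandas as pd
import streamlit as st

st.title("Real-time clickstream metrics")

rows = []
with open("clickstream_metrics.jsonl") as f:   # hypothetical output of the pipeline
    for line in f:
        rows.append(json.loads(line))

df = pd.DataFrame(rows)                        # e.g. columns: page, clicks
st.bar_chart(df.set_index("page")["clicks"])   # clicks per page
st.dataframe(df)                               # raw enriched metrics
```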
Another example: I built a data processing pipeline to enrich classified ads in real time.
You can use Glassflow to process website ads, enrich them with additional information or categorize them using LangChain or OpenAI, and store these enriched ads in a quick and advanced search database like Redis, so that the website is always aware of new ads in real time as they come into your system.
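A minimal sketch of that idea, assuming a local Redis instance and illustrative ad fields; the `categorize()` function is a placeholder standing in for a LangChain or OpenAI classification step.

```python
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def categorize(ad: dict) -> str:
    # Placeholder for an LLM-based classifier (LangChain / OpenAI) that would
    # label the ad, e.g. "electronics", "vehicles", "real-estate".
    return "vehicles" if "car" in ad["title"].lower() else "other"

def handler(ad: dict) -> dict:
    ad["category"] = categorize(ad)
    # Store the enriched ad so the website can query it with low latency.
    r.hset(f"ad:{ad['id']}", mapping={k: str(v) for k, v in ad.items()})
    r.sadd(f"ads:by_category:{ad['category']}", ad["id"])  # fast category lookups
    return ad

if __name__ == "__main__":
    print(handler({"id": 42, "title": "Family car, low mileage", "price": 9500}))
```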
Or what about processing, for example, Airbnb listing data stored in a Supabase database?
You can enrich this data with AI and continuously update a vector database such as Weaviate.
This example pipeline can also be used to create personalized recommendation solutions or generate targeted ads based on real-time data.
The last example I would like to show is a data transformation pipeline that demonstrates log data anomaly detection.
It uses AI and ML models to monitor your server logs, detect unusual patterns or activities, and send notifications to your notification systems or store the logs in long-term storage so you can analyze them later.
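To show the shape of that idea without a full ML model, here is a minimal sketch using a simple rolling z-score on response latency; the log field names are illustrative assumptions.

```python
from collections import deque
from statistics import mean, stdev

window = deque(maxlen=200)  # recent response times in milliseconds

def handler(log_event: dict) -> dict:
    latency = log_event["response_time_ms"]
    if len(window) >= 30:
        mu, sigma = mean(window), stdev(window)
        if sigma > 0 and abs(latency - mu) > 3 * sigma:
            log_event["anomaly"] = True
            # In a real pipeline you would notify an alerting system here
            # and/or write the event to long-term storage for later analysis.
    window.append(latency)
    return log_event
```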
If you scan this QR code, you can see all the other use cases with code examples, tutorials, and architecture explanations, so you know where to apply them, or you can use them as templates for your future projects.
So let's get back to Bob's topic of building a real-time price recommendation.
Here is the second version of the real-time price recommendation data pipeline architecture that he developed.
As you can see, there are a few components to process the car price data.
He ends up using Supabase because of its simplicity and because it is open source.
Whenever a vehicle is registered in the PostgreSQL, or you could say Supabase, database, it sends the event to the Glassflow pipeline using a webhook connector, so that the pipeline can capture every change: every insert, delete, or update operation on Supabase.
Within milliseconds, this data arrives at the pipeline, and the pipeline connects to the OpenAI API to get insights from AI.
The predicted car prices can then be sent to the output destinations.
As for the tools we use: first of all, Supabase is pretty much like Firebase or Heroku, but it's an open-source SQL database.
It's also 100 percent portable with your PostgreSQL; you can easily bring your existing PostgreSQL database or build real-time data applications on top of it.
In our architecture, Supabase triggers an event whenever a new entry with car data is added to the car price database, let's say.
It then sends it directly to the Glassflow pipeline using a webhook data source connector, as I mentioned earlier.
Then we use the Glassflow web app, which is one way to build a pipeline in Glassflow without writing any code, just through a UI: you specify the data source, the transformation, and the data destination, as you will see soon in my demo, and then your pipeline is ready.
And we are going to use OpenAI, obviously, to use AI models like GPT to predict the car price based on the information we get from Supabase.
Now that the use case is clear, let me demonstrate how it works and how you can build your first pipeline within five or six minutes.
If you scan this QR code, you can also find the project on GitHub and give it a try.
There is sample code for the transformation and also sample code for setting up the queries on the Supabase side, and so on.
So let me dive into the demo.
Assume that you have a primary database like Supabase, with a project, some databases inside the project, and a car price table with attributes belonging to the cars announced for selling or buying, as well as some sample car price data.
Let me dive into creating the pipeline.
Navigate to the Glassflow web app.
Using the web app you can create a pipeline in a few easy steps within a minute.
Start by creating a new pipeline in Glassflow.
I will provide a name for the new pipeline, like car price data, and click next.
As the data source, I'm going to use a webhook, because we are going to set up Supabase to push every data change to Glassflow via a webhook.
In the next step, the third step, I need to define my transformer function.
You can use existing templates or write Python code to define your transformer function.
I already have a transformer function defined that uses OpenAI and makes a call to the chat completions endpoint to predict the car price for the incoming car event changes.
So I'm going to copy this transformation function and paste it into the editor of the web app.
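For reference, here is a sketch of what such a transformer function can look like; the `handler(data, log)` signature, the car fields, and the model choice are assumptions for illustration rather than the exact code used in the demo.

```python
import json
import os
from openai import OpenAI  # pip install openai

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def handler(data, log):
    # Supabase database webhooks typically carry the new row under "record".
    car = data.get("record", data)
    prompt = (
        "Suggest a fair market price in EUR for this car and answer with a number only: "
        f"{json.dumps(car)}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    data["predicted_price"] = response.choices[0].message.content.strip()
    log.info(f"predicted price: {data['predicted_price']}")
    return data
```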
You can also find the transformation function code in our GitHub repository.
You don't have to use your own OpenAI API key either; we provide a free OpenAI API account to try out the OpenAI use case.
In the next step, I'm going to set up the destination as a webhook.
That can be your real website or any external service where you build custom workflows; for the demo case, I'm going to use webhook.site to create a random webhook so that the Glassflow pipeline can send the output to it once the data is enriched in the transformer stage.
I click next, check my transformation function and pipeline setup one final time, then click next to create my pipeline.
Now everything's good: my pipeline is created.
I only need my access token and pipeline ID to push events from the database, or from your primary database.
In that case, you can create a webhook connector for Supabase.
You create it with a few clicks and fill in the webhook connector details; in this case I call it something like the car price data change webhook.
Supabase detects every change when we do inserts and then pushes the event to the Glassflow pipeline automatically.
You can also push your events to the Glassflow pipeline using the Python SDK and write custom code to push from any data source to Glassflow.
Let's also define the HTTP method and add an HTTP request header to specify the access token for my pipeline.
I'm going to copy and add the access token header for the pipeline, then put in the value, to make sure that we are securely pushing the events to the pipeline.
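For reference, this is roughly what the same authenticated push looks like from Python code, if you go the custom-code route instead of the Supabase webhook; to stay library-agnostic the sketch uses plain requests, and the endpoint URL format and the header name are assumptions, so copy the exact values the Glassflow web app shows for your pipeline.

```python
import requests  # pip install requests

PIPELINE_ID = "your-pipeline-id"
ACCESS_TOKEN = "your-access-token"
# Assumed URL format - take the real endpoint from the web app for your pipeline.
ENDPOINT = f"https://api.glassflow.dev/v1/pipelines/{PIPELINE_ID}/events"

event = {
    "type": "INSERT",
    "table": "car_prices",
    "record": {"brand": "bmw", "model": "320d", "year": 2019, "mileage_km": 85000},
}

resp = requests.post(
    ENDPOINT,
    json=event,
    headers={"X-Pipeline-Access-Token": ACCESS_TOKEN},  # assumed header name
    timeout=10,
)
resp.raise_for_status()
print("event accepted:", resp.status_code)
```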
Now, the webhook is ready.
Once the webhook starts pushing events, you can immediately see the transformed output on the Glassflow side.
Let's go to the next step.
Let's go back to the sample data on the source side: I'm going to insert some sample data into my primary database.
If I run this SQL query, I will add four more inserts with some sample car prices.
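As a side note, the same inserts could be done from Python with the Supabase client instead of raw SQL; a minimal sketch, where the table name and columns are assumptions for this demo.

```python
import os
from supabase import create_client  # pip install supabase

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

rows = [
    {"brand": "toyota", "model": "corolla", "year": 2018, "mileage_km": 95000},
    {"brand": "audi", "model": "a4", "year": 2020, "mileage_km": 40000},
]
result = supabase.table("car_prices").insert(rows).execute()
print(f"inserted {len(result.data)} rows")  # each insert fires the webhook into the pipeline
```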
Then, as you can see in the output, we get some logs coming back from the price prediction step.
Wow, here you can see some failures.
In that case, you can also specify the Python dependency requirements in the web app.
I'm going to specify openai, since I use it inside my transformer function.
Now, if I try once again, as you can see in the logs, I get all of the transformed and processed events from our primary database, Supabase.
As you can see, all four events are transformed, and Glassflow immediately sends the transformed data as output to the webhook we defined when we created the pipeline.
This is the magic of a pipeline that reacts to real-time data.
As you can see, creating a pipeline is quite easy when you're using the right tools for your needs.
As you can see, Bob got his Python code running in real-time mode on serverless infrastructure from day one.
Compared to the challenges he faced before, he now has a simple solution.
A few words about Glassflow itself and what other people use it for.
Let's say you have batch jobs that might take minutes or hours to execute; Glassflow enables you to process your data in milliseconds to seconds, making real-time analysis a reality from day one.
And if you would like to reduce the costs of running Spark or similar jobs that consume a large amount of computing resources, Glassflow offers a more cost-efficient solution.
As I mentioned earlier, you can combine the data streaming part and the data transformation part inside the same pipeline; you don't have to run separate Lambda or Azure Functions for the transformation part.
Launching a new data project with Glassflow is, as you can see, easy, and integrating with real-time data sources and destinations is also quite fast: Glassflow excels at real-time data ingestion and processing of your events, and so on.
You can have a look at our documentation and blog to learn more about this.
In summary, here are a few insights into what we have learned along the way.
Bob understood, and we understand, that engineers nowadays face problems with JVM-based real-time processing tools like Kafka.
Bob as a data engineer, and maybe the data scientists and data analysts on his team, want self-sufficiency in Python.
There are stream processing frameworks in Python nowadays, so you don't have to learn a JVM environment.
And Bob also found some serverless stream processing solutions and tools to apply to his first pipeline.
So with that, we are about to finish the demo.
Now, if you have any questions after the session, I will be
more than happy to address them.
So thank you for attending my session.
See you soon.
Thank you.