Conf42 Kube Native 2024 - Online

- premiere 5PM GMT

No JVM, no Kafka: Shape the future of streaming data pipelines

Abstract

Learn how to build streaming data pipelines with serverless infrastructure from day 1.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, everyone. Welcome to Conf42 kubernetes event. I am glad to provide this session. This talk demonstrates how to build a real time data processing pipeline with serverless infrastructure from your day one. I hope you will like the session and it will be helpful to, for you to understand the different technologies and tools for real time data processing nowadays in modern streaming environment. So with that, let's get started. Let me introduce first myself. My name is Babur. I am a developer advocate at Glassflow, and also I am a Microsoft MVP for Azure AI services. I like to build recent AI applications, and I am also an evangelist for data streaming tools. If you would like to connect with me, you can always request on LinkedIn. Please, if you have any questions about this session or after this session, I would be more than happy to address your questions. So let's start with our story. Our story begins with a data engineer, Bob. Dreamer who works at a medium sized company called dream together with more than 50 his colleagues and they are building an innovative car selling online platform where I like you can register and announce your vehicles to buy or sell. Bob was actually highly motivated by the reason AI hype, since everybody around him talks about AI. He comes up with a brilliant idea of creating a real time price recommender solution for their existing website using modern data tools. Let me explain how that works. As new cars are registered in their primary database, such as it can be NoSQL database, the app should suggest the price for the vehicle using the AI and this application need to show in real time recommended price on their website so users can quickly receive appropriate feedback. Bob presents his idea to the team. And the team was curious about the outcome and the team that gives him to prove some time to Bob to provide the proof of concept. And Bob starts his investigation by understanding what is batch processing and real time data processing because with his background in using the tools for batch processing, he discovers that in practice bots collect, they collect and process data for using downstream applications. The only real distinction between batch and stream processing is that they, how they operate at slightly different time intervals. For a long time, like in flexibility of batch orchestrators and schedulers, keep the data teams to maintain a separate set of systems, like one system for batch processing, another system for stream processing. But it doesn't have to be that way. Nowadays, a real time data streaming pipelines can solve this problem and combine for both batch processing and stream processing into one workload so that you can use the same tools and technologies infrastructure. to process and transfer your data. Here, as you can see in the diagram, how the real time data streaming pipeline works. Actually, this solution is essential for some applications. They need to up to date, bring some up to date information, like monitoring live events or updating dashboards in real time. As you can see, we have the data source and we have the data destination. And in the middle layer, we Where all the transformation happens as data changes, change happens on the source side. These changes are in the real time captured from the transformation part and then you, in the transformation, you can do aggregation. 
Once Bob understands data streaming pipelines, he sketches his first solution for the real-time price recommendation system. He uses Debezium to capture data changes, as you can see in the diagram, and streams those change events via Kafka to a prediction service. The prediction service is just a Python application that connects to the OpenAI API to calculate suggested prices and sends the enriched output to the web application and to an analytics tool, so the team can see how prices change, how often, and which users are buying or selling, and so on.

However, Bob faces several issues with this first solution. First of all, he lacks experience with Kafka: he has a Python background, and he does not want to deal with infrastructure complexities, because he doesn't have much experience in DevOps or Kubernetes. He is also under time constraints to build a simple proof of concept and bring it to production. His goal is to implement everything in Python, because there will be some data science work involved in calling ML models to calculate the predicted price. He also wants to make sure that multiple data engineers on his team can collaborate on one pipeline. In addition, he needs the prediction service to notify the web application and other systems at the same time.

Bob continues his discovery by searching on Google, asking GPT, and asking his friends about their experiences with Kafka and other real-time data processing tools. He hears stories from Kafka users about taking nine months to implement production-ready Kafka-based data pipelines when they are self-managed. We see customers with 50 teams relying on a single Kafka cluster maintained by only one person, and sometimes data engineers cannot easily simulate the production environment without a complex initial setup. We also see data scientists struggling with data integration: building offline ML pipelines, experimenting with and reproducing models without pushing them to prod, or debugging them locally. There are many problems with this Kafka-centered infrastructure.

Bob also identifies the problems that self-managing Kafka presents. It is difficult to determine which team or individual is actually responsible for Kafka operations: Bob, the DevOps team, or the software engineering team? Configuring Kafka correctly from scratch is also problematic; you don't know how many clusters your organization needs, whether a single cluster or multiple clusters. Continuously deploying changes to Kubernetes or other environments brings additional problems. Upgrading Kafka brokers is always challenging; sometimes you don't know how much to scale your brokers or how much storage to give your topics. Another big topic, of course, is that monitoring Kafka performance and health is quite challenging in a self-managed environment. And managing Kafka clusters, as we said, can be expensive, both in terms of infrastructure and operational costs.
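To make Bob's first design concrete before we look at alternatives, here is a rough, hypothetical sketch of that Kafka-based prediction service: a plain Python consumer that reads car-registration events and asks the OpenAI API for a suggested price. The topic name, event fields, and model choice are assumptions for illustration, not code from the talk.

```python
# Hypothetical sketch of Bob's first prediction service: Kafka in, OpenAI out.
# Topic name, event fields, and model are assumptions for illustration.

import json

from kafka import KafkaConsumer   # pip install kafka-python
from openai import OpenAI         # pip install openai

consumer = KafkaConsumer(
    "car-registrations",                    # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
client = OpenAI()  # reads OPENAI_API_KEY from the environment

for message in consumer:
    car = message.value  # e.g. {"make": "BMW", "model": "320i", "year": 2021}
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "You suggest a fair market price in EUR for used cars."},
            {"role": "user",
             "content": f"Suggest a price for this car: {json.dumps(car)}"},
        ],
    )
    suggested_price = response.choices[0].message.content
    # In Bob's design this would be pushed to the web app and an analytics tool.
    print(car, "->", suggested_price)
```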
At this point, you start to consider using managed Kafka providers. As you can see on the screen, I have highlighted some well-known managed Kafka providers: Redpanda, Aiven for Apache Kafka, Confluent, which is one of the biggest, and new startups such as WarpStream. However, this doesn't really solve Bob's problem, because you still need a dedicated engineer who understands the specific use cases and business needs to spin up the Kafka environment in a cloud provider or in Confluent. The provider you choose also has to be secure enough; you need to make sure everything is secure. Additionally, you have to watch out for over-provisioning and fluctuating costs, because if you don't manage them you may end up with a high bill, and you need to understand the cost model of the Kafka provider you are using. Another drawback: once you start, you will want to apply data transformation logic dynamically, but to run that transformation you need a separate environment alongside your Kafka provider. Amazon MSK, for example, offers quite useful tools for managing your Kafka clusters, but if you would like to do some transformation, you need to use, and additionally pay for, Lambda functions.

Bob also starts to understand that data engineers often depend on other teams to maintain data infrastructure. Sometimes data engineers need to ask DevOps to provision cloud resources, manage Kafka, or deploy new data pipelines just for the sake of a proof of concept, and all of that creates delays. So basically Bob wants some self-sufficiency in Python: he can take care of his logic for the data pipeline and its orchestration, but the rest, when it comes to infrastructure, should be handled by some other technology. Bob collected all the problems that data engineers and scientists are facing in 2024 and wrote a blog post; if you scan this QR code, you can learn more about the other problems data scientists and engineers face nowadays.

My advice is to start by discovering options without using any Kafka providers at all. Nowadays there are other tools from cloud providers like Azure, Google, or Amazon, and other message brokers are coming up, like NATS, which is open source, used by many people, and gaining a lot of popularity these days. And Bob, as he is a dreamer, dreams that there are also stream processing frameworks in Python that can solve his problem, so he starts to investigate them.

Let's focus on when and why we can use a stream data processing framework for Python. First of all, we know that Python is the go-to language for data science and machine learning. As you can see in the diagram, a Python-based framework can unify the streaming data platform and the stream processor components of the streaming data pipeline architecture. You can connect directly to any data source using the built-in connectors these frameworks provide, right from your application code, without knowing how Kafka or other message brokers work, and you can focus only on your business logic. In other words, stream processing frameworks combine both the Kafka and Flink environments for stateful data stream processing operations. Then Bob asks himself: okay, why use Python frameworks for data streaming?
There are several reasons. If you are a Python developer, you can use these frameworks out of the box with any existing Python library: no JVM, no wrapper, no orchestrator or server-side engine is needed. You can bring pandas or whatever libraries you are used to working with. If you are a data engineer, you don't need to deal with infrastructure; you can start with such a framework without any complex initial setup, such as creating clusters, because Python frameworks offer a zero-infrastructure environment. If you are a data scientist, there is no need to learn Kubernetes; you can run your local code right from a Jupyter notebook to try out your pipeline. And if you are worried about your data, your original data stays where it is, in your primary database; the framework just runs a transformation layer on top of it and sends the results to your output destinations. For example, you can use change data capture tools like Debezium, connect directly to the primary database, do the computational work, and save the results back to your primary database or send them to output streams.

With that in mind, our team built a stream processing framework for working with real-time, event-driven data. It's called GlassFlow. GlassFlow is a serverless, Python-centric, real-time data processing pipeline: you can build your data pipelines within minutes without setting up any infrastructure. With just a few clicks, or using the CLI, you can bring up your first data pipeline in, let's say, the first 15 minutes. GlassFlow excels at real-time transformation of your events, so applications can immediately react to new information. In other words, creating real-time data pipelines with GlassFlow is as simple and casual as Yusuf shooting at the Olympics: no infra, no Kafka, no Flink required.

Building a data pipeline is very simple. You connect the data sources that send real-time data, using built-in integrations, without writing any code, in a low-code environment; or, if there is no built-in integration, you can build your own custom connector quickly using the GlassFlow SDK. Then you implement your data transformation logic in Python and deploy this transformation function to your pipeline, where it runs on auto-scalable serverless infrastructure, so you can easily deal with billions of log records in real time. GlassFlow itself is built with modern technologies you are used to, like Docker, NATS, and Kubernetes under the hood, but it abstracts all of that away so you don't have to manage the infrastructure yourself.

Let me show you some real examples of pipelines I built myself using GlassFlow. For example, you can build a real-time clickstream analytics dashboard: you use a data analytics API, like Google Analytics, in Python to collect clickstream data from your website and send it to a GlassFlow pipeline. Your transformation function can analyze the data to calculate additional metrics, and you can visualize those metrics or the enriched data using visualization tools like Streamlit or Plotly in Python. A rough sketch of such a transformation function is shown below.
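Here is a minimal sketch of what such a clickstream transformation function could look like. The handler(data, log) signature follows the convention GlassFlow uses for Python transforms (verify it against the current docs), and the event fields and derived metrics are assumptions made up for this example.

```python
# Hypothetical clickstream transformation function. The handler(data, log)
# signature is assumed from GlassFlow's transform convention; event fields
# and derived metrics are invented for illustration.

def handler(data, log):
    """Enrich a single clickstream event with derived metrics."""
    # Assumed incoming event shape:
    # {"user_id": "u-42", "page": "/cars/bmw-320i", "duration_ms": 5400, "clicks": 3}
    duration_s = data.get("duration_ms", 0) / 1000.0

    data["duration_s"] = duration_s
    data["clicks_per_second"] = (
        round(data.get("clicks", 0) / duration_s, 2) if duration_s else 0.0
    )
    data["engaged"] = duration_s > 30 or data.get("clicks", 0) >= 5

    log.info(f"enriched clickstream event for user {data.get('user_id')}")
    return data  # the returned event is what the pipeline sends to the sink
```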
Another example is a data processing pipeline I built to enrich classified ads in real time. You can use GlassFlow to process website ads, enrich them with additional information or categorize them using LangChain or OpenAI, and store the enriched ads in a fast, search-friendly database like Redis, so that the website is always aware of new ads in real time as they come into your system. Or what about processing, for example, Airbnb listing data stored in a Supabase database? You can enrich this data with AI and continuously update a vector database such as Weaviate. This example pipeline can also be used to create personalized recommendation solutions or generate targeted ads based on real-time data. The last example I would like to show is a data transformation pipeline that demonstrates log data anomaly detection: with AI and ML models you monitor your server logs to detect unusual patterns or activities, send notifications to your notification systems, or store the logs in long-term storage so you can understand them later. If you scan this QR code, you can see all the other use cases with code examples, tutorials, and architecture explanations of where you can apply them, and you can use them as templates for your future projects.

So let's get back to Bob's goal of building a real-time price recommendation. Here is the second version of the real-time price recommendation data pipeline architecture he developed. As you can see, there are a few components to process car price data. He ends up using Supabase because of its simplicity and because it is open source. Whenever a vehicle is registered in the PostgreSQL, or let's say Supabase, database, the change is sent to the GlassFlow pipeline using a webhook connector, so that the pipeline captures every insert, update, or delete operation on Supabase. Within milliseconds this data arrives at the pipeline, the pipeline connects to the OpenAI API to get insights from AI, and the predicted car prices are then sent to the output destinations.

As for the tools we use, let's start with Supabase. It is pretty much like Firebase or Heroku, but it is an open-source SQL database, 100 percent portable with your PostgreSQL, so you can easily bring your existing PostgreSQL database or build real-time data applications on top of it. In our architecture, Supabase triggers an event whenever a new entry with car data is added to the car price table, and it sends that event directly to the GlassFlow pipeline using the webhook data source connector, as I mentioned earlier. Then we use the GlassFlow web app, which is one way to build a pipeline in GlassFlow without writing any code, just through the UI: you specify the data source, the transformation, and the data destination. You are going to see it soon in my demo. And then we use OpenAI, obviously, with AI models like GPT to predict the future car price based on the information we get from Supabase. Now that the use case is clear, let me demonstrate how it works and how you can build your first pipeline within five or six minutes. If you scan this QR code, you can also find the project on GitHub and give it a try; there is sample code for the transformation and sample queries to set up on the Supabase side, and so on.
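Before the demo, here is a hypothetical sketch of the kind of HTTP call the Supabase webhook makes toward the pipeline when a new car row is inserted. The endpoint URL, header name, and payload fields are assumptions for illustration only; the real values come from the pipeline settings you will see in the demo.

```python
# Hypothetical illustration of pushing one car event to the pipeline's webhook
# endpoint. URL shape, header name, and payload fields are assumptions; take
# the real values from your pipeline's settings in the GlassFlow web app.

import requests

PIPELINE_URL = "https://api.glassflow.dev/pipelines/<pipeline_id>/events"  # assumed URL shape
ACCESS_TOKEN = "<pipeline_access_token>"

new_car_event = {
    "type": "INSERT",
    "table": "car_prices",
    "record": {"make": "Audi", "model": "A4", "year": 2020, "mileage_km": 38000},
}

response = requests.post(
    PIPELINE_URL,
    json=new_car_event,
    headers={"X-Pipeline-Access-Token": ACCESS_TOKEN},  # header name is an assumption
    timeout=10,
)
response.raise_for_status()
print("event accepted:", response.status_code)
```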
So let me dive into the demo. Assume that you have a primary database like Supabase, a project with some databases inside it, and a car price table with attributes belonging to the cars announced for selling or buying, plus some sample car price data. Let me walk through creating the pipeline. Navigate to the GlassFlow web app; using the web app you can create a pipeline in a few easy steps within a minute. Start by creating a new pipeline in GlassFlow. I provide a name for the new pipeline, something like car price data, and click next. As the data source I am going to use a webhook, because we are going to set up Supabase to push every data change to GlassFlow. In the third step I need to define my transformer function. You can use existing templates or write Python code to define it. I already have a transformer function that uses OpenAI and calls the chat completions endpoint to predict the car price for each incoming car change event. I copy this transformation function and paste it into the web app editor; you can also find the transformation function code in our GitHub repository. You don't even have to use your own OpenAI API key; we provide a free OpenAI API account so you can try out the OpenAI use case.

In the next step I set up the destination, which is also a webhook. That can be your real website or any external service where you build custom workflows; for the demo I use webhook.site to create a random webhook, so that the GlassFlow pipeline can send the output there once the data has been enriched in the transformer stage. I click next, check my transformation function and pipeline setup, then click next again to create my pipeline. Now everything is good and my pipeline is created. I only need my access token and pipeline ID to push events from my primary database. In this case, I create a webhook connector on the Supabase side and fill in the webhook details, calling it something like car price data change. Supabase detects every change when we do inserts and then pushes the event to the GlassFlow pipeline automatically. You can also push your events to the GlassFlow pipeline using the Python SDK and write custom code to push from any data source. I define the HTTP method and add an HTTP request header carrying the access token for my pipeline: I copy in the access token header and its value, to make sure we are pushing the events to the pipeline securely. Now the webhook is ready; once it starts pushing events, you can immediately see the transformed output on the GlassFlow side.

Let's go to the next step and insert some sample data into my primary database. If I run this SQL query, it will add four more inserts with some sample car prices. Then, as you can see in the output, we should get logs for the price prediction calls. Oh, and here you can see some failures. In that case, you can also specify the Python dependency requirements in the web app; I am going to specify openai, since I use it inside my transformer function. Here is roughly what that transformer function looks like.
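This is a rough sketch of the demo's price-prediction transformer, not the exact code from the GitHub repository. The handler(data, log) signature follows GlassFlow's Python transform convention (check the docs), and the payload fields and model name are assumptions; openai must be listed in the pipeline's requirements, as shown in the demo.

```python
# Rough sketch of the demo's price-prediction transformer. The handler(data, log)
# signature, payload fields, and model name are assumptions for illustration.

import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is available in the transform's environment


def handler(data, log):
    """Ask an OpenAI model for a suggested price for the incoming car record."""
    car = data.get("record", data)  # Supabase webhook payloads nest the row under "record"

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "You estimate a fair market price in EUR for used cars."},
            {"role": "user",
             "content": f"Suggest a price for this car: {json.dumps(car)}"},
        ],
    )

    data["predicted_price"] = response.choices[0].message.content
    log.info(f"predicted price added for car id {car.get('id')}")
    return data  # the enriched event is delivered to the webhook destination
```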
So now, if I try once again, as you can see in the logs, the pipeline transforms and processes the events from our primary database, Supabase: all four events are transformed, and GlassFlow immediately sends the transformed data as output to the webhook we defined when creating the pipeline. This is the magic of a pipeline that reacts to real-time data. As you can see, creating a pipeline is quite easy when you use the right tools for your needs. Bob got his Python code running in real-time mode on serverless infrastructure from day one; compared to the challenges he faced before, he now has a simple solution.

A few words about GlassFlow itself and when people use it. If you have batch jobs that might take minutes or hours to execute, GlassFlow lets you process your data in milliseconds to seconds, making real-time analysis a reality from day one. If you would like to reduce the costs of running Spark-like jobs that consume a large amount of computing resources, GlassFlow offers a more cost-efficient solution. As I mentioned earlier, you can combine the data streaming part and the data transformation part inside the same pipeline, so you don't have to run separate AWS Lambda or Azure Functions for the transformation. Launching a new data project with GlassFlow is, as you can see, easy, and integrating with real-time data sources and destinations is also quite fast; GlassFlow excels at real-time data ingestion and processing of your events, and so on. You can have a look at our documentation and blog to learn more.

In summary, here are a few insights into what we learned along the way. Bob understood, and we understood, that engineers nowadays face problems with JVM-based real-time processing tools like Kafka. Bob as a data engineer, and maybe the data scientists and data analysts on his team, want self-sufficiency in Python. There are stream processing frameworks in Python nowadays, so you don't have to learn a JVM environment. And Bob also found serverless stream processing solutions and tools to apply to his first pipeline. With that, we are about to finish. If you have any questions after the session, I will be more than happy to address them. Thank you for attending my session. See you soon. Thank you.

Bobur Umurzokov

Developer Relations Manager @ GlassFlow

Bobur Umurzokov's LinkedIn account Bobur Umurzokov's twitter account


