Transcript
Hello, everyone.
Welcome to the streaming plane, the plane that brings all
other data planes together.
I'm Hubert Dulay, developer advocate at StarTree, the company that provides
Apache Pinot, a real time OLAP database.
I'm really happy to be presenting among many other data experts here at Conf42.
About me, I'm the primary author of two O'Reilly books, Streaming
Data Mesh and Streaming Databases.
And in this presentation, you'll get to see some ideas from both books.
In this session, you'll learn about a few concepts that are
shaping the future of data.
First, we'll explore the data divide, what it is and why it's becoming a
critical challenge in data management.
Next, we'll discuss data gravity and how it influences where
and how data is processed.
We'll also introduce the streaming plane, a crucial layer
for handling real time data.
And finally, we'll cover how to effectively consume real time analytics,
enabling you to make data driven decisions instantly.
The data divide describes data movement within a business and
is crucial for understanding how data drives business decisions.
In summary, the data divide distinguishes between two main
data planes: the operational and the analytical data planes.
The operational data plane consists of data stores that support the
applications powering a business, while the analytical data plane is
focused on data analysis and insights.
The idea is that the insights derived from the analysis are fed back to
the operational plane in hopes of generating more revenue or, more
importantly, preventing revenue loss.
As data moves through a typical pipeline, it doesn't just get copied.
It undergoes transformations, cleansing, and security processes like encryption.
By the time it reaches the end of the pipeline, the data is secure and reliable.
However, getting this refined data out of analytical systems is tough
because these systems often reformat data into formats like Parquet, which
are great for analysis but make it hard to access complete data sets.
This is where ETL, Extract Transform Load, and ELT, Extract Load
Transform, processes come into play.
While there's debate on which is better, the lines between
them are starting to blur.
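To make the distinction concrete, here's a minimal, self-contained Python sketch; the in-memory lists stand in for a real source and warehouse, and the point is only the order of the steps:

```python
# Minimal sketch contrasting ETL and ELT with in-memory stand-ins;
# a real pipeline would use connectors and a warehouse engine instead.

raw_source = [{"amount": " 12.50 ", "region": "emea"},
              {"amount": "7.00", "region": "apac"}]

def transform(rows):
    # Cleansing step: normalize types and casing.
    return [{"amount": float(r["amount"]), "region": r["region"].upper()}
            for r in rows]

# ETL: transform *before* loading, so only refined rows land downstream.
warehouse_etl = transform(raw_source)

# ELT: load the raw rows first, then transform inside the destination
# (in a real stack, that second step is typically SQL run in the warehouse).
warehouse_elt = list(raw_source)
warehouse_elt = transform(warehouse_elt)

assert warehouse_etl == warehouse_elt  # same end state, different step order
print(warehouse_etl)
```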
Each plane has its own type of database.
On the operational plane, you get OLTP databases that are optimized for
operational workloads like inserts, updates, deletes, and single-row lookups.
These databases are usually row based.
On the analytical plane, databases are optimized for analytical
workloads like aggregations and joins.
These are called OLAP databases, or Online Analytical Processing databases.
What makes them special is that they store data in columnar formats, which is more
efficient for running complex analyses.
OLAP systems keep a history of all the data that originated
from the operational side.
That's really useful for tracking changes over time and seeing how things evolve.
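A minimal Python sketch of the two layouts, with plain lists and dicts standing in for real storage engines, shows why each shape favors its workload:

```python
# Row-based (OLTP-style): each record is stored together, which makes
# single-row lookups and updates cheap.
rows = [
    {"id": 1, "user": "a", "amount": 10.0},
    {"id": 2, "user": "b", "amount": 25.0},
    {"id": 3, "user": "c", "amount": 7.5},
]
row = next(r for r in rows if r["id"] == 2)   # single-row lookup

# Columnar (OLAP-style): each column is stored contiguously, so an
# aggregation touches only the one column it needs.
columns = {
    "id": [1, 2, 3],
    "user": ["a", "b", "c"],
    "amount": [10.0, 25.0, 7.5],
}
total = sum(columns["amount"])                # scan one column, not every record
print(row, total)
```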
In the operational plane, you'll see microservices
supporting various applications.
And on the analytical side, you'll find data lakes, data
warehouses, and lakehouses.
But here's the thing: since the data on the analytical side is already clean and
transformed, why not put it to work on the operational side?
Moving data from analytical to operational is hard.
It's an upstream path.
OLAPs don't like to return all rows and columns.
That's not what they are optimized to do.
Remember that these databases are intended for analytical workloads.
Selecting all the raw records from a table in an OLAP is expensive and slow.
Solutions like reverse ETL and data activation are designed to help bring this
polished data back into your applications.
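As a rough illustration of the reverse ETL pattern, here's a small Python sketch; sqlite3 stands in for the warehouse and a plain dict for the operational store, both stand-ins rather than a real stack:

```python
# Minimal reverse-ETL sketch: pull a narrow, aggregated result out of the
# analytical store and push it into an operational system. A real job would
# use a warehouse driver and a CRM or application API instead.
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE churn_scores (customer_id TEXT, score REAL)")
warehouse.executemany("INSERT INTO churn_scores VALUES (?, ?)",
                      [("c1", 0.92), ("c2", 0.11)])

operational_store = {}  # stand-in for the application's own database

# Select only the polished slice the application needs, never a full
# "SELECT *" over raw history, which is what OLAP systems are bad at.
for customer_id, score in warehouse.execute(
        "SELECT customer_id, score FROM churn_scores WHERE score > 0.5"):
    operational_store[customer_id] = {"churn_risk": score}

print(operational_store)  # {'c1': {'churn_risk': 0.92}}
```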
Data gravity is a concept that compares data to a planet, where accumulated data
attracts services and applications, much like a planet's mass pulls in moons.
In traditional data architectures, data flows from operational systems,
the moons, to an analytical system, the planet, leading to potential overload
and latency as the analytical system becomes burdened with historical data.
However, with a streaming plane, data can be processed closer to its source,
acting like satellites that alleviate the gravitational pull, enabling smoother
data flow and real time analytics.
Data gravity highlights the challenges of managing growing data, while the
streaming plane offers a solution for more effective data movement and processing.
In the streaming plane, you'll typically find several key systems working together.
First, there are connectors, which enable seamless data flow between
various sources and destinations.
Then, we have streaming databases that store and manage real time data streams.
Real time OLAP data stores provide the ability to perform fast
analytical queries on streaming data.
Streaming platforms like Kafka, Redpanda, and Pulsar are the backbone,
handling the distribution and routing of data streams.
And finally, stream processing frameworks, which allow you to
process and analyze data in motion, ensuring timely insights and actions.
There are two ways to serve streams of data, synchronously and asynchronously.
Synchronous serving is handled by databases that consume directly
from the stream, such as streaming databases and real time OLAP databases.
Real time OLAP databases are specifically optimized for analytical queries,
allowing you to query real time data with low latency and high concurrency.
High concurrency means these systems can efficiently serve
many end users simultaneously.
On the other hand, asynchronous serving of processed data
streams is managed by Kafka.
In the diagram, the two Kafka icons represent either the same
Kafka cluster or different clusters replicating data across them.
The key difference between synchronous and asynchronous serving lies in
how data is consumed and delivered.
In synchronous serving, the client requests data and waits for an immediate
response, which can lead to delays if the data is large or complex.
Conversely, asynchronous serving allows clients to subscribe to data changes
and receive updates in real time without waiting for a request-response cycle.
This approach is more efficient for handling continuous streams
of data, as it enables clients to react to changes as they occur,
ensuring that no updates are missed.
Asynchronous systems, like those utilizing Kafka, can handle high volumes
of data and provide a more scalable solution for real time data processing.
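Here's a hedged sketch of the two serving styles in Python. It assumes a Pinot broker on localhost:8099 and a Kafka topic named page_views on localhost:9092, both hypothetical endpoints, and uses the requests and kafka-python packages:

```python
# Sketch of synchronous (pull) vs asynchronous (push) serving; the broker
# addresses, topic name, and table are assumptions for illustration.
import requests
from kafka import KafkaConsumer

# Synchronous (pull): the client asks and waits for the answer.
resp = requests.post(
    "http://localhost:8099/query/sql",
    json={"sql": "SELECT country, COUNT(*) FROM page_views "
                 "GROUP BY country LIMIT 10"},
)
print(resp.json())

# Asynchronous (push): the client subscribes and reacts as records arrive,
# with no request-response round trip and no missed updates.
consumer = KafkaConsumer(
    "page_views",
    bootstrap_servers="localhost:9092",
    group_id="apac-dashboard",
    auto_offset_reset="latest",
)
for record in consumer:
    print(record.value)  # react to each change as it occurs
```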
Data locality and replication are crucial concepts in the streaming plane.
For example, if your operational data is in EMEA but you need to serve
it in APAC, you wouldn't want your APAC systems to constantly reach
out to EMEA for real time data.
This would create latency and potential bottlenecks.
Instead, you replicate that data between regions, and tools
like Kafka make this easy.
By creating replicas of your data in other regions, you ensure
that applications local to those regions can consume streaming
data efficiently and in real time.
And keep in mind, because this data is still within a Kafka topic, it's
still streaming in real time, operating within physical limits
like the speed of light.
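As a sketch of what local consumption looks like, assume MirrorMaker 2 has mirrored the EMEA cluster's orders topic into the APAC cluster; MirrorMaker 2's default policy prefixes mirrored topics with the source cluster alias, hence emea.orders. The broker address here is hypothetical:

```python
# Consume the locally replicated topic instead of reaching across regions.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "emea.orders",                                 # the local replica, not a remote call
    bootstrap_servers="kafka.apac.internal:9092",  # APAC-local brokers (assumed)
    group_id="apac-fulfillment",
)
for record in consumer:
    # Consumed in-region: no cross-continent round trip per read, only the
    # one-way replication lag between the two clusters.
    print(record.value)
```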
A materialized view is a database object that contains the results of a
query, storing them as a physical table.
Unlike traditional views, which execute the underlying SQL statement at query time
and do not store results, materialized views precompute and cache the results,
allowing for faster data retrieval.
They can be refreshed periodically or on demand to keep the data up to date.
But in the context of stream processing, materialized views are continuously and
incrementally updated as new data arrives.
This capability makes them particularly valuable for improving query performance,
reducing data duplication, and simplifying analytics, especially in environments
where real time data access is crucial.
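Here's a minimal Python sketch of the idea: instead of re-running a query, each arriving event incrementally updates a stored result. A streaming database derives this maintenance logic from a SQL definition; the sketch below hand-rolls it for one aggregation:

```python
# Minimal sketch of a continuously maintained materialized view.
from collections import defaultdict

# The "view": order totals per customer, kept like a physical table.
order_totals = defaultdict(float)

def on_event(event):
    # Incremental maintenance: apply just the delta for this event,
    # rather than recomputing the aggregate from scratch.
    order_totals[event["customer_id"]] += event["amount"]

for event in [{"customer_id": "c1", "amount": 10.0},
              {"customer_id": "c2", "amount": 4.0},
              {"customer_id": "c1", "amount": 2.5}]:
    on_event(event)

# Reads are now cheap lookups against precomputed results.
print(order_totals["c1"])  # 12.5
```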
Balancing between push and pull queries is essential for optimizing performance and
meeting specific use case requirements.
One effective approach is to utilize materialized views, which can act as
a bridge between the two query types.
By submitting a push query to create a materialized view, clients can benefit
from the heavy lifting done by the push query, while simultaneously subscribing
to changes in the materialized view.
The balance between push and pull queries can be adjusted
based on the application's needs.
For instance, if low latency is paramount, a greater reliance on push queries may
be warranted, while scenarios requiring more complex queries might benefit
from the flexibility of pull queries.
Ultimately, the choice depends on the specific requirements of
the application and the nature of the data being processed.
The diagram that models how to balance push and pull queries
illustrates the tradeoff between query flexibility and latency.
It features a box in the middle representing a materialized
view, which serves as a bridge between the two types of queries.
As you move down the diagram, the materialized view provides less
flexible queries but offers better performance, meaning that the
queries execute with lower latency.
This is ideal for scenarios where immediate data availability is critical,
such as user facing applications that require real time updates.
Conversely, as you move up the diagram, the queries become more
flexible, allowing for deeper insights and more complex queries.
However, this increased flexibility comes at the cost of higher latencies,
as the serving engine has to perform more work to fulfill these requests.
The overall concept is that push and pull queries can work
together to find the right balance between latency and flexibility.
Push queries are preferred for low latency requirements, while pull
queries offer the flexibility needed for more complex analytical tasks.
The materialized view acts as a compromise, allowing users
to benefit from both approaches depending on their specific use case.
This balance is crucial for optimizing real time analytics, as it enables
systems to react to events while also providing the capability to
perform ad hoc queries when necessary.
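A small Python sketch, building on the incremental view above, shows how one materialized view can serve both styles; the subscriber callbacks are a stand-in for what a streaming database exposes as a push query:

```python
# One materialized view bridging push and pull consumption.
from collections import defaultdict

view = defaultdict(float)   # the materialized view
subscribers = []            # push-side clients

def on_event(event):
    # The heavy lifting happens once, as data arrives.
    view[event["customer_id"]] += event["amount"]
    for notify in subscribers:
        notify(event["customer_id"], view[event["customer_id"]])

# Push query: subscribe and react with minimal latency.
subscribers.append(lambda key, total: print(f"push: {key} -> {total}"))

# Pull query: ad hoc, more flexible reads against the same precomputed state.
def pull(predicate):
    return {k: v for k, v in view.items() if predicate(v)}

on_event({"customer_id": "c1", "amount": 10.0})
on_event({"customer_id": "c1", "amount": 5.0})
print("pull:", pull(lambda total: total > 12))  # {'c1': 15.0}
```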
Global replication and local consumption are vital in the
streaming plane for optimizing performance and reducing latency.
Global replication duplicates data across multiple nodes, ensuring consistency and
enabling efficient workload distribution, while local consumption allows for faster
access to data at its source, enhancing performance and resource optimization.
Together, they create a robust framework for efficient data processing
and scalability in streaming environments, supporting federated governance
and tailored security measures.
The streaming plane enables the creation of virtual data products that
can be consumed locally, even if the source of the data resides remotely.
This is achieved through a streaming platform like Kafka.
Users can query and interact with data locally while it is incrementally
replicated from different global regions.
A streaming data catalog is essential to managing these
virtual data products effectively.
It serves as an inventory of data and its metadata, enabling users to discover
and subscribe to streaming data products.
The catalog provides crucial metadata, including table definitions,
validation rules, data types, and lineage information, which
helps consumers understand the data's provenance and processing.
In summary, the streaming plane facilitates the consumption of virtual
data products that may exist remotely, while a robust streaming data catalog
ensures effective governance and understanding of these data products.
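To illustrate, here's a sketch of what a streaming data catalog entry might carry; the field names are hypothetical rather than any specific catalog's schema:

```python
# Illustrative catalog entry for a virtual data product; every key below is
# an assumed, generic name, not a particular product's metadata model.
catalog_entry = {
    "name": "orders.enriched",
    "description": "Orders joined with customer attributes, EMEA origin",
    "schema": {                       # table definition and data types
        "order_id": "STRING",
        "customer_id": "STRING",
        "amount": "DOUBLE",
        "ts": "TIMESTAMP",
    },
    "validation": ["amount >= 0", "order_id IS NOT NULL"],
    "lineage": {                      # provenance and processing history
        "sources": ["emea.orders", "emea.customers"],
        "processing": "stream-stream join, PII fields encrypted",
    },
    "replicated_to": ["apac", "amer"],
}
# A consumer can discover this entry, inspect its metadata, and then
# subscribe to the underlying topic in their local region.
print(catalog_entry["lineage"]["sources"])
```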
The Venn diagram of the operational, analytical, and
streaming planes illustrates how these areas of data processing
are distinct yet interconnected.
The operational plane focuses on transaction processing and generating
real time data, while the analytical plane is dedicated to complex tasks.