Transcript
Hello, everyone.
Welcome to the streaming plane, the plane that brings all
other data planes together.
I'm Hubert Dulay, developer advocate at StarTree, the company that provides
Apache Pinot, a real time OLAP database.
I'm really happy to be presenting among many other data experts here at Conf42.
About me, I'm the primary author of two O'Reilly books, Streaming
Data Mesh and Streaming Databases.
And in this presentation, you'll get to see some ideas from both books.
In this session, you'll learn about a few concepts that are
shaping the future of data.
First, we'll explore the data divide, what it is and why it's becoming a
critical challenge in data management.
Next, we'll discuss data gravity and how it influences where
and how data is processed.
We'll also introduce the streaming plane, a crucial layer
for handling real time data.
And finally, we'll cover how to effectively consume real time analytics,
enabling you to make data driven decisions instantly.
The data divide describes data movement within a business and
is crucial for understanding how data drives business decisions.
In summary, the data divide distinguishes between two main
data planes: the operational and the analytical data planes.
The operational data plane consists of data stores that support the
applications powering a business, while the analytical data plane is
focused on data analysis and insights.
The idea is that the insights derived from the analysis are fed back to
the operational plane in hopes of generating more revenue or, more
importantly, preventing revenue loss.
As data moves through a typical pipeline, it doesn't just get copied.
It undergoes transformations, cleansing, and security processes like encryption.
By the time it reaches the end of the pipeline, the data is secure and reliable.
However, getting this refined data out of analytical systems is tough
because these systems often reformat data into formats like Parquet, which
are great for analysis but make it hard to access complete data sets.
This is where ETL, Extract Transform Load, and ELT, Extract Load
Transform, processes come into play.
While there's debate on which is better, the lines between
them are starting to blur.
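To make the distinction concrete, here's a minimal, self-contained Python sketch; the in-memory lists stand in for a real source and warehouse, and the point is only the order of the steps:

```python
# Minimal sketch contrasting ETL and ELT with in-memory stand-ins;
# a real pipeline would use connectors and a warehouse engine instead.

raw_source = [{"amount": " 12.50 ", "region": "emea"},
              {"amount": "7.00", "region": "apac"}]

def transform(rows):
    # Cleansing step: normalize types and casing.
    return [{"amount": float(r["amount"]), "region": r["region"].upper()}
            for r in rows]

# ETL: transform *before* loading, so only refined rows land downstream.
warehouse_etl = transform(raw_source)

# ELT: load the raw rows first, then transform inside the destination
# (in a real stack, that second step is typically SQL run in the warehouse).
warehouse_elt = list(raw_source)
warehouse_elt = transform(warehouse_elt)

assert warehouse_etl == warehouse_elt  # same end state, different step order
print(warehouse_etl)
```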
Each plane has its own type of database.
On the operational plane, you get OLTP databases that are optimized for
operational workloads like inserts, updates, deletes, and single-row lookups.
These databases are usually row based.
On the analytical plane, databases are optimized for analytical
workloads like aggregations and joins.
These are called OLAP databases, or Online Analytical Processing databases.
What makes them special is that they store data in columnar formats, which is more
efficient for running complex analyses.
OLAP systems keep a history of all the data that originated
from the operational side.
That's really useful for tracking changes over time and seeing how things evolve.
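A minimal Python sketch of the two layouts, with plain lists and dicts standing in for real storage engines, shows why each shape favors its workload:

```python
# Row-based (OLTP-style): each record is stored together, which makes
# single-row lookups and updates cheap.
rows = [
    {"id": 1, "user": "a", "amount": 10.0},
    {"id": 2, "user": "b", "amount": 25.0},
    {"id": 3, "user": "c", "amount": 7.5},
]
row = next(r for r in rows if r["id"] == 2)   # single-row lookup

# Columnar (OLAP-style): each column is stored contiguously, so an
# aggregation touches only the one column it needs.
columns = {
    "id": [1, 2, 3],
    "user": ["a", "b", "c"],
    "amount": [10.0, 25.0, 7.5],
}
total = sum(columns["amount"])                # scan one column, not every record
print(row, total)
```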
In the operational plane, you'll see microservices
supporting various applications.
And on the analytical side, you'll find data lakes, data
warehouses, and lakehouses.
But here's the thing: since the data on the analytical side is already clean and
transformed, why not put it to work on the operational side?
Moving data from analytical to operational is hard.
It's an upstream path.
OLAPs don't like to return all rows and columns.
That's not what they are optimized to do.
Remember that these databases are intended for analytical workloads.
Selecting all the raw records from a table in an OLAP is expensive and slow.
Solutions like reverse ETL and data activation are designed to help bring this
polished data back into your applications.
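As a rough illustration of the reverse ETL pattern, here's a small Python sketch; sqlite3 stands in for the warehouse and a plain dict for the operational store, both stand-ins rather than a real stack:

```python
# Minimal reverse-ETL sketch: pull a narrow, aggregated result out of the
# analytical store and push it into an operational system. A real job would
# use a warehouse driver and a CRM or application API instead.
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE churn_scores (customer_id TEXT, score REAL)")
warehouse.executemany("INSERT INTO churn_scores VALUES (?, ?)",
                      [("c1", 0.92), ("c2", 0.11)])

operational_store = {}  # stand-in for the application's own database

# Select only the polished slice the application needs, never a full
# "SELECT *" over raw history, which is what OLAP systems are bad at.
for customer_id, score in warehouse.execute(
        "SELECT customer_id, score FROM churn_scores WHERE score > 0.5"):
    operational_store[customer_id] = {"churn_risk": score}

print(operational_store)  # {'c1': {'churn_risk': 0.92}}
```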
Data gravity is a concept that compares data to a planet, where accumulated data
attracts services and applications, much like a planet's mass pulls in moons.
In traditional data architectures, data flows from operational systems,
the moons, to an analytical system, the planet, leading to potential overload
and latency as the analytical system becomes burdened with historical data.
However, with a streaming plane, data can be processed closer to its source,
acting like satellites that alleviate the gravitational pull, enabling smoother
data flow and real time analytics.
Data gravity highlights the challenges of managing growing data, while the
streaming plane offers a solution for more effective data movement and processing.
In the streaming plane, you'll typically find several key systems working together.
First, there are connectors, which enable seamless data flow between
various sources and destinations.
Then, we have streaming databases that store and manage real time data streams.
Real time OLAP data stores provide the ability to perform fast
analytical queries on streaming data.
Streaming platforms like Kafka, Redpanda, and Pulsar are the backbone,
handling the distribution and routing of data streams.
And finally, stream processing frameworks, which allow you to
process and analyze data in motion, ensuring timely insights and actions.
There are two ways to serve streams of data, synchronously and asynchronously.
Synchronous serving is handled by databases that consume directly
from the stream, such as streaming databases and real time OLAP databases.
Real time OLAP databases are specifically optimized for analytical queries,
allowing you to query real time data with low latency and high concurrency.
High concurrency means these systems can efficiently serve
many end users simultaneously.
On the other hand, asynchronous serving of processed data
streams is managed by Kafka.
In the diagram, the two Kafka icons represent either the same
Kafka cluster or different clusters replicating data across them.
The key difference between synchronous and asynchronous serving lies in
how data is consumed and delivered.
In synchronous serving, the client requests data and waits for an immediate
response, which can lead to delays if the data is large or complex.
Conversely, asynchronous serving allows clients to subscribe to data changes
and receive updates in real time without waiting for a request-response cycle.
This approach is more efficient for handling continuous streams
of data, as it enables clients to react to changes as they occur,
ensuring that no updates are missed.
Asynchronous systems, like those utilizing Kafka, can handle high volumes
of data and provide a more scalable solution for real time data processing.
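Here's a hedged sketch of the two serving styles in Python. It assumes a Pinot broker on localhost:8099 and a Kafka topic named page_views on localhost:9092, both hypothetical endpoints, and uses the requests and kafka-python packages:

```python
# Sketch of synchronous (pull) vs asynchronous (push) serving; the broker
# addresses, topic name, and table are assumptions for illustration.
import requests
from kafka import KafkaConsumer

# Synchronous (pull): the client asks and waits for the answer.
resp = requests.post(
    "http://localhost:8099/query/sql",
    json={"sql": "SELECT country, COUNT(*) FROM page_views "
                 "GROUP BY country LIMIT 10"},
)
print(resp.json())

# Asynchronous (push): the client subscribes and reacts as records arrive,
# with no request-response round trip and no missed updates.
consumer = KafkaConsumer(
    "page_views",
    bootstrap_servers="localhost:9092",
    group_id="apac-dashboard",
    auto_offset_reset="latest",
)
for record in consumer:
    print(record.value)  # react to each change as it occurs
```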
Data locality and replication are crucial concepts in the streaming plane.
For example, if your operational data is in EMEA but you need to serve
it in APAC, you wouldn't want your APAC systems to constantly reach
out to EMEA for real time data.
This would create latency and potential bottlenecks.
Instead, you replicate that data between regions, and tools
like Kafka make this easy.
By creating replicas of your data in other regions, you ensure
that applications local to those regions can consume streaming
data efficiently and in real time.
And keep in mind, because this data is still within a Kafka topic, it's
still streaming in real time, operating within physical limits
like the speed of light.
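As a sketch of what local consumption looks like, assume MirrorMaker 2 has mirrored the EMEA cluster's orders topic into the APAC cluster; MirrorMaker 2's default policy prefixes mirrored topics with the source cluster alias, hence emea.orders. The broker address here is hypothetical:

```python
# Consume the locally replicated topic instead of reaching across regions.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "emea.orders",                                 # the local replica, not a remote call
    bootstrap_servers="kafka.apac.internal:9092",  # APAC-local brokers (assumed)
    group_id="apac-fulfillment",
)
for record in consumer:
    # Consumed in-region: no cross-continent round trip per read, only the
    # one-way replication lag between the two clusters.
    print(record.value)
```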
A materialized view is a database object that contains the results of a
query, storing them as a physical table.
Unlike traditional views, which execute the underlying SQL statement at query time
and do not store results, materialized views precompute and cache the results,
allowing for faster data retrieval.
They can be refreshed periodically or on demand to keep the data up to date.
But in the context of stream processing, materialized views are continuously and
incrementally updated as new data arrives.
This capability makes them particularly valuable for improving query performance,
reducing data duplication, and simplifying analytics, especially in environments
where real time data access is crucial.
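Here's a minimal Python sketch of the idea: instead of re-running a query, each arriving event incrementally updates a stored result. A streaming database derives this maintenance logic from a SQL definition; the sketch below hand-rolls it for one aggregation:

```python
# Minimal sketch of a continuously maintained materialized view.
from collections import defaultdict

# The "view": order totals per customer, kept like a physical table.
order_totals = defaultdict(float)

def on_event(event):
    # Incremental maintenance: apply just the delta for this event,
    # rather than recomputing the aggregate from scratch.
    order_totals[event["customer_id"]] += event["amount"]

for event in [{"customer_id": "c1", "amount": 10.0},
              {"customer_id": "c2", "amount": 4.0},
              {"customer_id": "c1", "amount": 2.5}]:
    on_event(event)

# Reads are now cheap lookups against precomputed results.
print(order_totals["c1"])  # 12.5
```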
Balancing between push and pull queries is essential for optimizing performance and
meeting specific use case requirements.
One effective approach is to utilize materialized views, which can act as
a bridge between the two query types.
By submitting a push query to create a materialized view, clients can benefit
from the heavy lifting done by the push query, while simultaneously subscribing
to changes in the materialized view.
The balance between push and pull queries can be adjusted
based on the application's needs.
For instance, if low latency is paramount, a greater reliance on push queries may
be warranted, while scenarios requiring more complex queries might benefit
from the flexibility of pull queries.
Ultimately, the choice depends on the specific requirements of
the application and the nature of the data being processed.
The diagram that models how to balance push and pull queries
illustrates the tradeoff between query flexibility and latency.
It features a box in the middle representing a materialized
view, which serves as a bridge between the two types of queries.
As you move down the diagram, the materialized view provides less
flexible queries but offers better performance, meaning that the
queries execute with lower latency.
This is ideal for scenarios where immediate data availability is critical,
such as user facing applications that require real time updates.
Conversely, as you move up the diagram, the queries become more
flexible, allowing for deeper insights and more complex queries.
However, this increased flexibility comes at the cost of higher latencies,
as the serving engine has to perform more work to fulfill these requests.
The overall concept is that push and pull queries can work
together to find the right balance between latency and flexibility.
Push queries are preferred for low latency requirements, while pull
queries offer the flexibility needed for more complex analytical tasks.
The materialized view acts as a compromise, allowing users
to benefit from both approaches depending on their specific use case.
This balance is crucial for optimizing real time analytics, as it enables
systems to react to events while also providing the capability to
perform ad hoc queries when necessary.
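A small Python sketch, building on the incremental view above, shows how one materialized view can serve both styles; the subscriber callbacks are a stand-in for what a streaming database exposes as a push query:

```python
# One materialized view bridging push and pull consumption.
from collections import defaultdict

view = defaultdict(float)   # the materialized view
subscribers = []            # push-side clients

def on_event(event):
    # The heavy lifting happens once, as data arrives.
    view[event["customer_id"]] += event["amount"]
    for notify in subscribers:
        notify(event["customer_id"], view[event["customer_id"]])

# Push query: subscribe and react with minimal latency.
subscribers.append(lambda key, total: print(f"push: {key} -> {total}"))

# Pull query: ad hoc, more flexible reads against the same precomputed state.
def pull(predicate):
    return {k: v for k, v in view.items() if predicate(v)}

on_event({"customer_id": "c1", "amount": 10.0})
on_event({"customer_id": "c1", "amount": 5.0})
print("pull:", pull(lambda total: total > 12))  # {'c1': 15.0}
```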
Global replication and local consumption are vital in the
streaming plane for optimizing performance and reducing latency.
Global replication duplicates data across multiple nodes, ensuring consistency and
enabling efficient workload distribution, while local consumption allows for faster
access to data at its source, enhancing performance and resource optimization.
Together, they create a robust framework for efficient data processing
and scalability in streaming environments, supporting federated governance
and tailored security measures.
The streaming plane enables the creation of virtual data products that
can be consumed locally, even if the source of the data resides remotely.
This is achieved through a streaming platform like Kafka.
Users can query and interact with data locally while it is incrementally
replicated from different global regions.
A streaming data catalog is essential to managing these
virtual data products effectively.
It serves as an inventory of data and its metadata, enabling users to discover
and subscribe to streaming data products.
The catalog provides crucial metadata, including table definitions,
validation rules, data types, and lineage information, which
helps consumers understand the data's provenance and processing.
In summary, the streaming plane facilitates the consumption of virtual
data products that may exist remotely, while a robust streaming data catalog
ensures effective governance and understanding of these data products.
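To illustrate, here's a sketch of what a streaming data catalog entry might carry; the field names are hypothetical rather than any specific catalog's schema:

```python
# Illustrative catalog entry for a virtual data product; every key below is
# an assumed, generic name, not a particular product's metadata model.
catalog_entry = {
    "name": "orders.enriched",
    "description": "Orders joined with customer attributes, EMEA origin",
    "schema": {                       # table definition and data types
        "order_id": "STRING",
        "customer_id": "STRING",
        "amount": "DOUBLE",
        "ts": "TIMESTAMP",
    },
    "validation": ["amount >= 0", "order_id IS NOT NULL"],
    "lineage": {                      # provenance and processing history
        "sources": ["emea.orders", "emea.customers"],
        "processing": "stream-stream join, PII fields encrypted",
    },
    "replicated_to": ["apac", "amer"],
}
# A consumer can discover this entry, inspect its metadata, and then
# subscribe to the underlying topic in their local region.
print(catalog_entry["lineage"]["sources"])
```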
The Venn diagram of the operational, analytical, and
streaming planes illustrates how these areas of data processing
are distinct yet interconnected.
The operational plane focuses on transaction processing and generating
real time data, while the analytical plane is dedicated to complex tasks.