Transcript
All right, so let's get started by looking at some terms that are used in streaming. I've put this slide up because it's important to look at these different terms so that we think about streaming with the same vocabulary as the slides going forward. So let's look at some terms.

Event time. The time at which the event actually occurred. For example, if you're playing a video game: when did you actually hit a button? When did you actually hit pause? When did you actually do an activity? The actual time of that activity is what's called the event time.

Processing time. It's different from event time in that it's the time at which a streaming data ingestion pipeline actually processes your event. In ideal terms it would equal the event time, but that's not usually the case; it's typically a few seconds or milliseconds after the event time, when the stream processor gets the data and processes it.

Watermark. That's the notion of completeness. We are going to discuss this in much more detail, but for now, just assume that the watermark indicates when the processing of a given chunk of streaming data is considered complete by a pipeline.

Window. Chopping the data up into temporal boundaries for groupings and aggregations. What it means is: if you have to aggregate a certain chunk of streaming data, you window it, basically similar to batch, where you have group-bys and aggregates. You need a certain start and stop to group by and aggregate, and that certain start and stop in the streaming world is called a window.

Triggers. Triggers determine when in processing time the results are materialized. When do you actually materialize the results?

Accumulation. Accumulation is the relationship between multiple results in the same window. Do you want to create a running sum? Do you want to create just one sum in a given window? What do you want to do?
So these are all the different terms that we are going to use in the world of streaming data. Let's look at some interesting challenges in streaming, as these challenges form the fundamentals of how we look at streaming. So let's look at this stream. The black line below shows the actual wall time, or clock time, and then you see the events coming in, the red boxes. One was an event that happened at 08:00 a.m. but came into the pipeline after 11:00 a.m. Then there was one event that actually happened at 09:00 a.m. and came into the streaming pipeline between 11:00 a.m. and 11:10 a.m. And then the third one was an event that happened at 10:00 a.m. and came into the streaming data pipeline between 11:00 a.m. and 11:15 a.m. So what does this show us?
It shows that this data is out of order and arrived late. The event happened at 08:00 a.m. but came in after 11:00 a.m. Should we process it, even though it's late data? Should we not process it? So there are these challenges where you have data coming in out of order and late almost all the time, because of networking issues, or because of bandwidth constraints between where the stream processor is and where the data originates. So you will not always get the data, or the stream, at the time the event happened. You can have a lot of use cases where the data is coming in late. So this forms the basis of interesting challenges in streaming, where now we are worried about how to process this late-arriving data, or whether we should process it at all. That's the dilemma between correctness and completion.

In the batch world, correctness is: I process the bounded data based on the business logic, right? It's what we test for whenever we write our unit test cases and do the actual testing before productionizing the code. The notion of correctness on a batch platform is that, okay, I had a bounded data set, and the word bounded here means it had finite boundaries. Whether it's a table or a file, whatever it is, I have a bounded data set, and I have to apply a set business logic to that bounded data set. If the business logic that sits between the input and the output gives me correct results, I'm good. That's correctness for me in the world of batch.

In the world of streaming, it's everything I said about batch, plus figuring out how to handle out-of-order data. We saw in the previous slide that there was out-of-order data. If we have to think about correctness, should we add that data, because otherwise it would be missing, or should we discard that data because it's late? What would be more correct? That depends on your business logic, and it adds complexity in the world of streaming that we did not have in the world of batch.
Second is completion. In the world of batch, we think about completion as processing all the records in a bounded data set, be it a file, be it a table, et cetera. If a file had 10,000 records and I process all those 10,000 records with the business logic, that's completion. If I'm partially processing a file, meaning, for example, I process only 6,000 records and have 4,000 unprocessed records because of some schema issues, et cetera, that's incomplete. That's how we think about completion in the world of batch. But in the world of streaming, completion is: I try to process all the data, on time and late arriving. How much of the late-arriving data should I process? Should I process data that's late by 30 minutes, or 40 minutes? What's completion in the world of streaming? So there's always a dilemma between correctness and completion in the world of streaming that we struggle with, and to delineate and demystify some of that, we have these important concepts in the world of streaming.
The first one is the watermark. The watermark is the notion of completeness. What it means is: all the input data with event time less than X has been processed. X is the watermark here. It's not perfect, because you still need a strategy to process late arrivals, right? The second term is the trigger: when to emit the aggregated results for a window. There are event time triggers, there are processing time triggers, and then there are data-driven triggers. And the third term is accumulation. The accumulation mode determines whether the system accumulates the window panes or discards them. Basically, what it means is: if I'm processing button presses on my video game console, PlayStation or anything, does every click accumulate into my score for that given window? Or should I discard everything and just consider the last one? So those are the three terms that we use, and they help us reason, for a given streaming pipeline, about how to balance correctness and completion.
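To make the accumulation question concrete, here is a minimal sketch using Apache Beam's Python SDK (Beam appears in the slide examples later in this talk); the `clicks` collection and the every-ten-clicks trigger are illustrative assumptions, not from the slides:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

# `clicks` is an assumed PCollection of (player, 1) pairs with event
# timestamps. ACCUMULATING re-emits the running total for the window on
# every trigger firing; DISCARDING would emit only the clicks seen since
# the previous firing.
running_score = (
    clicks
    | beam.WindowInto(
        window.FixedWindows(60),
        trigger=trigger.Repeatedly(trigger.AfterCount(10)),
        accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
    | beam.CombinePerKey(sum))
```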
Now, bringing all these concepts together, let's look at them from a completely different angle. The first one is: what results are being calculated? If you have to define what results you are calculating, you can define them with the help of transformations, filtering, and aggregation. You can run some SQL queries, or you can do some training if it's a machine learning pipeline.
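As a hedged sketch of that "what" question in Beam's Python SDK, where the results are defined purely by the transformations (the `events` collection and its fields are assumptions for illustration):

```python
import apache_beam as beam

# `events` is an assumed PCollection of dicts describing game activity.
# The "what" is defined by the transforms themselves: filter to button
# presses, then count clicks per player.
clicks_per_player = (
    events
    | beam.Filter(lambda e: e["type"] == "button_press")
    | beam.Map(lambda e: (e["player"], 1))
    | beam.CombinePerKey(sum))
```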
The second concept is: when in event time are the results being calculated? This is where we think about the windowing part. There is event time windowing, which is similar to sharding in batch, except now we are windowing the data based on the window logic. There are fixed windows, there are sliding windows, and there are session windows. I've also given an example here in Apache Beam, which is an open source framework, where a window is defined as a fixed window of 60 seconds in event time.
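The slide code isn't reproduced in the transcript, but a minimal reconstruction of that fixed 60-second window in Beam's Python SDK could look like this (the `events` collection is an assumption; `SlidingWindows` and `Sessions` would swap in for the other window types mentioned):

```python
import apache_beam as beam
from apache_beam import window

# Window an assumed `events` PCollection into fixed 60-second windows
# in event time, then aggregate per key within each window.
# window.SlidingWindows(...) or window.Sessions(...) would replace
# FixedWindows for the other windowing strategies mentioned above.
windowed_counts = (
    events
    | beam.WindowInto(window.FixedWindows(60))
    | beam.CombinePerKey(sum))
```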
And then comes the third one, which is: when in processing time are the results being materialized? You have the watermark, which is the notion of completeness that we went through before. You have triggers, which determine when you trigger the results in a given window. And then you have accumulation, which determines how you actually compute the results across a window: do you do a running sum, do you do nothing and just take the last value in a given window, et cetera? And then you have late arrivals as well, which are also handled by triggers. So if you see here, there is an example where we have a fixed window, and we have a trigger that fires 60 seconds after the processing time, which means we are allowing for late arrivals for up to 60 seconds in this example. And the accumulation mode is: do not accumulate the results, just give me the final result; no running total. So this is how we bring all the concepts together in the world of streaming.
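Again, the slide code isn't in the transcript, but a hedged reconstruction of that example in Beam's Python SDK, with a late trigger 60 seconds into processing time and discarding accumulation, might look like:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

# Fire at the watermark, allow late data for up to 60 seconds
# (re-firing 60s into processing time for late elements), and discard
# prior panes so each firing emits only its own result. `events` is an
# assumed PCollection.
results = (
    events
    | beam.WindowInto(
        window.FixedWindows(60),
        trigger=trigger.AfterWatermark(
            late=trigger.AfterProcessingTime(60)),
        allowed_lateness=60,
        accumulation_mode=trigger.AccumulationMode.DISCARDING)
    | beam.CombinePerKey(sum))
```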
Now, we spoke about some terms, we spoke about how to bring those concepts together, and I showed you some code examples as well. Let's talk about some real examples of how streaming at scale has challenges, and how we can solve them using infrastructure and code considerations.

The first one is autoscaling. We have heard of autoscaling in terms of batch, and for most of us, when we speak about autoscaling, it's mostly horizontal autoscaling, where we keep adding more machines of the same type to a given cluster to improve performance. In the world of streaming, that may or may not work, because there might be a particular instance in which a stream needs much more memory on the same machine itself, rather than additional workload distribution across different machines, which is what happens with horizontal scaling. In that case, it is good to have infrastructure that supports both horizontal and vertical autoscaling in flight. What does in flight mean? It means while the stream is being processed, because we cannot stop the stream, do the autoscaling, and then come back and start processing the stream again. It's running water; it's running data. So you need capabilities to autoscale your existing machines' RAM and number of cores in flight, while the stream is being processed. And that's why vertical autoscaling is important.
Dynamic work rebalancing. In the world of batch, we have multiple machines working in the same world of distributed compute. With streaming, there could be some machines working on some data, let's call it x, and some machines working on some other stream, call it y. It could very well be possible that some machines are not working as much as others, which means there is a disparity of work between some worker nodes. This can cause skew across the machines, it can cause performance issues, et cetera. To avoid this, it's good to have dynamic work rebalancing: infrastructure that allows workers to redistribute work amongst each other, so that you avoid stragglers and idle nodes.
Window processing decoupling. This is a very important one, because a stream pipeline comprises different things. You have the code logic, say a filtering step, and then you have windows that have to be processed, right? It's always good to decouple the windows being processed from the other stream operations, because it helps you scale better. You can have your primary SDK operations for streaming on one machine, and then decouple the window processing so it goes to different machines, which can help you scale your operations.
Giving memory- and GPU-related resource hints for a particular pipeline, or for specific steps in a pipeline, can be really helpful. You can give hints in Python, or in any other language your streaming pipeline supports, and say: hey, this function would ideally need this much memory or GPU. That way your streaming pipeline can scale up or down accordingly, without relying too much on autoscaling capabilities, because now you're being very definitive about what the stream will need.
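In Beam's Python SDK, for instance, resource hints can be attached per transform; a hedged sketch (the `RunModelFn` step, the memory value, and the accelerator string are illustrative assumptions):

```python
import apache_beam as beam

# Attach resource hints to one step so the runner can provision
# appropriately sized workers for just that stage. RunModelFn is an
# assumed DoFn; the values below are placeholders.
predictions = (
    events
    | "RunModel" >> beam.ParDo(RunModelFn()).with_resource_hints(
        min_ram="8GB",
        accelerator="type:nvidia-tesla-t4;count:1"))
```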
I think this is a ubiquitous principle that applies to both batch and streaming: it's good to use a combine (CombinePerKey in Beam) instead of a group-by (GroupByKey), because combining reduces shuffle, while a group-by introduces a lot of shuffle; use a group-by only when you really have to.
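A hedged side-by-side in Beam's Python SDK: both branches compute per-key sums, but CombinePerKey can combine values on each worker before the shuffle, while GroupByKey shuffles every raw value:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    scores = p | beam.Create([("a", 1), ("b", 2), ("a", 3)])

    # Preferred: partial sums are computed before the shuffle,
    # so far less data crosses the network.
    combined = scores | "Combine" >> beam.CombinePerKey(sum)

    # Works, but every raw element is shuffled, then summed.
    grouped = (
        scores
        | "Group" >> beam.GroupByKey()
        | beam.MapTuple(lambda k, values: (k, sum(values))))
```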
Retrying forever. I had a customer who had a default retry-forever setting, and their streaming pipelines would hang a lot because of that option. Always have a time period or a retry count that returns an error, and ideally send the element to a dead letter queue after a particular number of retries; you can always reprocess those dead letter queues later. This gives you a definitive point after which you stop retrying, so your streaming pipelines will not hang.
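A hedged sketch of that bounded-retry, dead-letter pattern in Beam's Python SDK; `MAX_RETRIES`, `process_element`, and the destinations are assumptions for illustration:

```python
import apache_beam as beam

MAX_RETRIES = 3  # assumed retry budget; tune for your pipeline

class BoundedRetryDoFn(beam.DoFn):
    def process(self, element):
        for _ in range(MAX_RETRIES):
            try:
                yield process_element(element)  # assumed business logic
                return
            except Exception:
                continue
        # Retry budget exhausted: route to a dead-letter output instead
        # of retrying forever and hanging the pipeline.
        yield beam.pvalue.TaggedOutput("dead_letter", element)

outputs = events | beam.ParDo(BoundedRetryDoFn()).with_outputs(
    "dead_letter", main="processed")
# outputs.processed flows on; outputs.dead_letter goes to a DLQ sink
# (for example a Pub/Sub topic) for later reprocessing.
```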
File I/O. This is a very typical one that is common between batch and streaming: right-size your files, input files, output files, et cetera, keeping each shard between 100 MB and 1 GB depending on the total size.
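One way to keep output shards in that range, as a hedged Beam Python sketch (the path and shard count are illustrative; derive the count from your expected output volume):

```python
import apache_beam as beam

# Example sizing: ~3 GB of expected output / ~100 MB per file -> ~30 shards.
output | beam.io.WriteToText(
    "gs://example-bucket/results/part",  # hypothetical path
    file_name_suffix=".json",
    num_shards=30)
```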
Network. It's always good to have all the sources and sinks in the same region to reduce network latency. That might not be possible all the time, but if it is, please follow it. Type hints. When you use type hints, Apache Beam raises exceptions during pipeline construction time itself, so you don't have to worry about runtime issues, because you're catching most of the issues before the pipeline ever runs.
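A minimal sketch of Beam Python type hints: with the decorators in place, a mismatched pipeline fails while being constructed rather than on the workers.

```python
import apache_beam as beam

@beam.typehints.with_input_types(int)
@beam.typehints.with_output_types(str)
class FormatScore(beam.DoFn):
    def process(self, score):
        yield f"score={score}"

with beam.Pipeline() as p:
    # Feeding strings instead of ints here would raise a type-hint
    # violation at pipeline construction time, not at runtime.
    _ = (p
         | beam.Create([1, 2, 3])
         | beam.ParDo(FormatScore()))
```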
Those are some of the nuanced infrastructure and code considerations that I have learned throughout my field experience in streaming. I do hope some of these resonated with you as well, and that you can use them in your production pipelines. Thank you so much for having me today, and I hope you enjoyed the session.