Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey folks, thank you very much for joining me today at Conf42 Python.
Today you get to learn about sketching algorithms and how you can make sense of
big data in a single pass. My name is Tun Shwe
and I'm the VP of Data at Quix. Quix is a remote company and we
have teams across Europe and the US, with our headquarters in London
in the United Kingdom. A bit about myself: I have a background working
in high growth startups as a data engineer and head of data.
I help companies determine their data and AI strategy
and where possible, center it around streaming technologies.
I'm a big fan of the mantra less is more and simplifying
things because I think that just makes everything easier to build on and maintain.
And I'm really interested in the real time data and AI ecosystem,
which should explain why I'm so excited about sketching algorithms.
I've popped my LinkedIn there in the corner, so feel free to connect.
First of all, a bit of information about Quix: our mission is Python-first
stream processing. The streaming data landscape was
started and built in Java, and to this day it's mainly accessible to
those who use Java and SQL. Quix focuses on
the Python developer experience, and we have two products. We have
Quix Streams, which is an open source library that uses data frames to work
with data and build streaming and event driven applications.
And since sketching is based on streaming data, it goes hand in
hand naturally with Quix. Quix Cloud is a fully
managed platform where you can run your streaming applications and build out real time
data pipelines quickly using open source connectors.
We're pushing the idea of streaming data frames, and as you can see,
it's reminiscent of the other data frames you may have encountered in pandas or
PySpark. You can build your data transformations as Quix Streams
applications, and they're designed to work with Docker and be containerized.
You can deploy those containers on your own infrastructure, or if you prefer not
to manage infrastructure, you can deploy them on Quix Cloud. We provide
a nice interface so you can build out streaming data pipelines. You can get
started quickly integrating data with connectors to
data sources such as Kafka, MQTT, and databases
such as Postgres, as well as popular destinations
for data such as Snowflake, Redis, and InfluxDB.
The connectors and the transformation code samples are all open source,
and we have contributions from our friends and partners such as InfluxDB,
and we'd love to have more contributions, so please get in touch if you're interested.
So what is a sketch? Note that this is a really large
topic, because there are families of sketching algorithms, all for
different use cases. And my hope here is to introduce you to a few of
my favorites, inspire you, and open you up to
the possibilities, so you go out and explore by yourselves
after internalizing all the documentation. From my experience using sketches,
I've come up with this definition: sketches are small, stateful streaming
data programs that deal with a huge volume of data,
and they provide approximate answers orders of magnitude faster
than exact techniques. It's grounded in the truth
that when it comes to big data, approximate solutions are favorable over exact
solutions, because they're a good tradeoff for memory resources
and time. And you might not realize it, but you're
probably already accepting of approximations. You may have
heard about this little tool called ChatGPT, which works by responding with
its best guess, an approximation. And here are some
characteristics of sketches. The first is that they're small,
and they have to be small, and they're usually only a few kilobytes in
size, meaning that they have reduced memory requirements and will grow
sublinearly in space. They're stateful, and they
maintain a data structure that keeps the state of the observations. So every
data point that comes in isn't retained in its exact representation.
They're streaming, meaning that they deal with data in a single pass,
so the sketch needs to look at each item in the stream only once.
And they're fast. They're orders of magnitude faster than exact
techniques. The results are also mergeable without a loss
in accuracy, and that keeps everything fast. And these
are the components. So the first typical stage of a sketching process is
a transformation that makes the input data stream resemble white noise,
so you have a uniform distribution of values, and this
is usually achieved by hashing the input keys and then normalizing the result into
a fraction, that is, a random number between zero and
one. The second stage of the sketch is the creation
and updating of a data structure that follows a set of rules
for compressing down the values it receives from the transform stage.
So in the middle there, when you initialize the sketch, you can configure
the parameters for the data structure, which enable you to manage
its memory. In the last stage, there are estimator algorithms
that take a query, look it up in the sketch's data structure,
and return a result. Now this result will be an approximation,
but the important thing to note is that it will have mathematically proven error
distribution bounds. So this helps in dealing with the confidence
of the results. So whilst tools like ChatGPT
give you an approximation, they're not always transparent about their error thresholds,
which is why you get those things called hallucinations.
But with sketches you have mathematically provable error bounds, which helps
with confidence that the approximation is sufficient.
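Going back to that first transform stage, here's a minimal illustrative sketch of hashing input keys and normalizing the result into a fraction between zero and one. This is my own example, not code from the talk; the choice of SHA-256 and the 64-bit truncation are just assumptions for illustration.

```python
import hashlib

def to_uniform_fraction(key: str) -> float:
    """Hash a key and normalize it to a pseudo-random fraction in [0, 1)."""
    # Hash the input key to a stable digest (any good hash function works here).
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Take the first 8 bytes as an unsigned 64-bit integer.
    value = int.from_bytes(digest[:8], "big")
    # Normalize into a fraction between zero and one.
    return value / 2**64

for item in ["burger", "pizza", "salad"]:
    print(item, to_uniform_fraction(item))
```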
And let's talk about why being exact is slow.
We live in a world where the majority of the time we use exact
techniques. And what are exact techniques? It's when you run a
query to get the exact answer. For example, a common one I'll go through
today is a count of the number of unique identifiers.
The most common example is a database query, and hopefully everyone here
has experienced a query that runs for a really long time.
If you have billions of rows with high cardinality, that is, you have lots
of unique identifiers, you could experience a really slow query
when you're running a group by query. So distributed systems
exist and they use parallelization to attempt
to speed up these queries. And here's an example
of how parallelization works to solve a problem such as word
count. So on the left there you have some input data coming in.
And to get the count of unique words,
it has to go through these stages of splitting the data,
mapping, shuffling and reducing. And these operations happen on
different nodes or machines, so each box there represents a node or
a separate machine. So splitting and mapping are self explanatory.
But why does it need to go through shuffling and reducing?
The reason is that if you look at the mapping step, the nodes each
have local answers and they are non-additive, meaning that you
can't simply sum them up since the same word occurs in multiple nodes.
Here you can see that at the mapping step you can see the word deer
and the word bear appear in two different nodes.
So shuffling ensures that the same unique words are colocated
so that they can be accurately counted. So when the reducer performs
the sum operation, you get accurate results. There at the end you
get bear is two, car is three, deer is two, river is
two, and this is taken from Hadoop's documentation.
And this is common to all massively parallel processing systems
such as Spark and Snowflake. And really all distributed
databases work in a similar way.
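To picture those stages in code, here's a tiny single-process illustration of the map, shuffle, and reduce steps for word count. This is my own sketch, not from the talk; in Hadoop or Spark each stage would run across separate nodes rather than in one script.

```python
from collections import defaultdict

lines = ["deer bear river", "car car river", "deer car bear"]

# Map: each "node" emits (word, 1) pairs for its own split of the data.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: co-locate all pairs for the same word (the network-heavy step).
shuffled = defaultdict(list)
for word, count in mapped:
    shuffled[word].append(count)

# Reduce: sum the co-located counts to get exact totals.
reduced = {word: sum(counts) for word, counts in shuffled.items()}
print(reduced)  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```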
And the take home message here is that shuffling is slow. And when processing queries
you have to do what you can to avoid ad hoc network I/O.
And there are techniques in distributed systems like Spark that help you to get around
this by using broadcasting, where you would use broadcast joins
when you want to join two tables, so that rather than a
huge amount of shuffling taking place, a copy of the smaller table
can be stored upfront alongside the larger table on the same machine,
so it's available when the join needs to be performed.
And I refer back to the classic notes here of latency numbers
that every programmer should know. When you compare the time it
takes to read from memory versus reading from disk versus
round trips over a network, we're talking orders of magnitude difference.
Even though at a certain scale we're counting things
in nanoseconds and milliseconds, when the data gets large
and you're dealing with many different unique items, we're talking
orders of magnitude difference. So why are sketches
fast? There are four main parts of sketching I should
go through first. So the first is the stream processor. It processes the
data as it streams in, using a random selection algorithm.
There's also a data structure which starts off as an empty summary and
is updated with each data point over time. As data comes in,
its size grows sublinearly. The query processor
computes the desired result and models the error properties using probability
statistics. And lastly, we have merge and set
operations. So these are the operations that operate
on multiple sketches, and there is no loss in accuracy doing
so. Here's an illustration of those four parts.
So you have the front end stream processor that uses a stochastic process
to randomly select data, which it uses to populate a data structure.
This data structure can be queried at any time using a back
end query processor that has probabilistic analysis to return
an approximation with known error thresholds. And at
the bottom there you have a way of dealing with multiple sketches where you can
stream them in and perform merge or set operations to combine them
into their resulting sketches. And you should note that the sketches are
designed to be cross language compatible as well.
The data structures should be easily queryable
by the implementation in any language.
I used the word sublinear earlier, and what does that mean?
So here on the x axis we're plotting the number of items or the cardinality,
and on the y axis it's the memory or the space. So in most
data structures that take exact representations of input data.
So for example, if you're using words in a key value store,
the growth in memory as cardinality goes up is somewhere between
linear and superlinear. But with sketches it's always
smaller. So with sketches it's always sublinear.
The mergeability I mentioned also keeps things fast.
This enables the merging of multiple sketches without loss in accuracy, and this
solves non-additive challenges. So I've thrown another
term at you, so I'm going to define that too. So what are non-additive
challenges? Well, aside from being absolutely everywhere,
let's take this example where Quix has gone into the retail industry.
So we've opened up four stores in some of the busiest cities in the world.
They happen to be the fashion centers of the world. We have the annual sales
figures from each store, as well as the count of unique products in
each store. And with the actual sums of
the annual sales figures, since money is additive, you can
sum them up to get a global total sum. So you can see we're doing
pretty good here. But when it comes to the unique item counts,
you can't do this, because the last time I checked, the Quix
London store definitely has some of the same items as the Quix Paris store.
And the Quix Tokyo store has some of the same items as the Paris store.
So there's usually an overlap. And since that information about unique keys
is lost at aggregation time between the stores, you cannot
simply sum them up. You would just get incorrect unique counts.
But if you use sketches, you could solve this. You could save
the sketch for each day's records in the database in its own column. Here on
the left, we have two tables. So one is for all
the days in February for the London store, and the one below
is all the days in February for the Quix Paris
store. We could union the sketches across the days in the
month to get the sketches for the unique products per store. Then to get
the monthly unique count, it would be a case of doing a merge on the
stores' sketches. So here we can combine the sketches from each
of the European stores to get a European approximation
of uniques. And of course, you could employ data modeling
to get the exact unique counts, but that would involve combining data
across stores. That's probably a lot of network I/O. And you
would have to do this in a batch job. You'd wait for all the data
to come in and perform it
as a historical batch job. So this is a really good example of using
sketches where you're able to do approximate analysis with
data that is being streamed in, and to perform that approximation
in real time or near real time. And this is really why
mergeability with sketches is such a superpower.
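To make that pattern concrete, here's a small illustrative sketch using the Apache DataSketches Python bindings. I'm assuming the datasketches package's hll_sketch and hll_union classes here, and the store product IDs are made up for the example.

```python
from datasketches import hll_sketch, hll_union

lg_k = 12  # sketch size parameter: 2^12 buckets

# One sketch per store, updated as product IDs stream in.
london, paris = hll_sketch(lg_k), hll_sketch(lg_k)
for product_id in ["tee-001", "tee-002", "cap-009"]:
    london.update(product_id)
for product_id in ["tee-001", "cap-009", "coat-114"]:
    paris.update(product_id)

# Merge the per-store sketches to approximate unique products across Europe.
union = hll_union(lg_k)
union.update(london)
union.update(paris)
europe = union.get_result()
print(round(europe.get_estimate()))  # ~4 unique products across both stores
```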
Let's have a look at some of my favorite sketches.
So the first is counting uniques. So the ones I really
like are the ones that estimate count distinct. So there's one
called the count min sketch. Also, for estimating
the cardinality amongst many duplicates, you have HyperLogLog
and you have compressed probabilistic counting (CPC) sketches.
You also can estimate frequent items. So this is good for
recommender systems as well. So when you estimate what are termed the
heavy hitters or top k, for example,
during Conf42, if we were streaming in the data, we could
perform an approximation for what are the top most viewed talks in this conference.
We have quantiles and data distributions. So this estimates the distribution.
So you've got the different percentiles, like 25th percentile,
95th percentile, 99th percentile. You can do all of those with well-defined
error bounds. So if you're streaming in data again from this conference,
you could determine and approximate how much time people spend on
each of the web pages. And lastly, we have sampling.
So sampling keeps some semblance, or a transformation,
of the items from the stream. For example, with reservoir
sampling, you choose k samples from a list of n items and you retain
those, as in the sketch below.
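Here's a minimal, illustrative version of reservoir sampling, my own sketch rather than code from the talk: it keeps a uniform random sample of k items from a stream of unknown length in a single pass.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream in one pass."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, i)    # replace with decreasing probability
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```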
And to give you an idea of how the algorithms
work, I'll cover the count min sketch. Let's start with a
naive example, and I'm going to go with food items. So here we're counting,
or rather approximating food items. So on the left there we
have a continuous stream of food items coming in. So in practice,
this could be food items from orders in an online supermarket or
a food delivery service. And on the right we have
a hash or key value data structure where we're using
the key as a representation of the food item, and the value is the
incrementing count of the number of times it's observed in the stream.
So as we go through the stream one at a time, we're counting this.
So by the end, when we get to the last item there, we have counted
one burger, three pizzas, one salad, one burrito, et cetera,
et cetera. So this example does only have ten items,
but this approach won't scale at high cardinality. So when there's
a huge increase in the number of unique items,
it just simply wouldn't scale. So if this stream introduced all the food
items in the known universe, the data structure would have to grow linearly
or superlinearly to hold
all the keys. It would mean reserving huge amounts of memory and
long wait times when you're querying to retrieve the counts for a food item.
And if you wanted this in real time or near real time, forget about it.
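For reference, the naive exact approach is just a key value counter like the one below. This is my own illustration; every unique key is stored in full, which is exactly what stops it scaling at high cardinality.

```python
from collections import Counter

stream = ["burger", "pizza", "pizza", "salad", "burrito",
          "pizza", "sushi", "taco", "noodles", "curry"]

# Exact counting: one key per unique item, so memory grows with cardinality.
counts = Counter()
for item in stream:
    counts[item] += 1

print(counts["pizza"])  # 3, the exact count
```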
And from here, there's not much you could do to improve things, like in a
naive way, you could reduce the representation of the food item down to
perhaps the first three characters. So, for example, instead of storing
the whole word, you could save pizza as
P-I-Z and burger becomes B-U-R. But then eventually,
as the list got longer and the cardinality increased,
you could start having key collisions, like when a burrito comes along
and tries to increment B-U-R as well.
So let's see an example with hashing, which is a better representation
of the count min sketch. So, in this example, rather than taking
the exact representation of the food item, we'll run it through two
hashing functions. And you can see that there on the right, we've got two
of the columns, each of which represent the hashing function and
the number that they return. And we're going to use this to know how
to store it in the data structure. So it's kind of like encoding a burger as
one and six and storing it in the data structure here in this array in
the middle, that way. So we'll use the
hashing functions there on the right to increment the counts in this 2D array
in the middle. So this array starts off all blank,
or rather all zeros. And the first item
in the stream is a burger. So we pass it through to the two hashing
functions there on the right. And we see, for burger, we get the representation
of one and six. We then go to the one
in the first column and increment that by one. So it goes from zero to
one. And then we go to six in the second column, and we increment
that also by one. Now we have one and one in those positions, and we're
done for this item. So we read the next item in the stream.
It's the pizza. We look it up on the right there, and it's
encoded as four and one. So we go to positions four and
one in our array in the middle, and we increment at those positions,
four and one. So we have a count of one and one. And we'll
keep doing that. So we can do that for salad as well. But let's see
what happens when a duplicate appears. So when another pizza appears,
we look up its hash representation again, which is four and one.
And since we already have a count of one in each, we increment them again.
So now we have the number two in both those positions. So you
just keep incrementing them each time you encounter them in the stream.
So let's fast forward to the last item. So this is what it looks
like when all ten food items are stored in the array. And now
let's try querying the data structure. Because it's a stream, you can query
it at any point in time. That's the beauty of this. So you can perform
an approximation at any time that you like.
So let's say at the 10th item mark, we want
to query this data structure. We want to get an approximation for the count of
burgers. So we look there, on the right there, we see the hash representation
is one and six. So we look into our data structure in
the middle and we get the counts of three and one.
Now you've probably figured out why this sketch is called count min, because we take
the minimum value of the obtained counts. So out of three and one,
the smaller number is one. So we have our approximation of
a count of one for burgers, which in this case matches the
exact count. Let's do another example. Do the same with pizza,
which has a hash representation of four and one. So we
look up four and one in the middle data structure, and that gives us a count of
three and four, and we take the min. So that's an approximate count
of three for pizza, which again matches the exact count. So you get the idea.
You can see that for some of the food items, for example, if you look
at the last one there for tacos, that's six and one,
which gives us a count in the middle of one and four. So you can
see that though we take the min as one and our approximation becomes one,
one of its counts got as high as four. So some errors can start creeping
in with an array of this particular size. And so
this would suggest that you either need to increase the space, making the arrays
bigger, or add more hashing functions. And these are sort of parameters you
can control in sketching. So this would all work together to
reduce the collisions and keep the counts as accurate as possible.
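Here's a minimal, illustrative count min sketch along the lines of that walkthrough. It's my own simplification, not the DataSketches implementation: a 2D array of counters, one salted hash function per row, updates that increment one cell per row, and a query that takes the minimum.

```python
import hashlib

class CountMinSketch:
    def __init__(self, num_hashes=2, num_buckets=8):
        self.num_hashes = num_hashes
        self.num_buckets = num_buckets
        # One row of counters per hash function, all starting at zero.
        self.table = [[0] * num_buckets for _ in range(num_hashes)]

    def _bucket(self, item, row):
        # Salt the hash with the row index so each row acts like a different hash function.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.num_buckets

    def update(self, item):
        for row in range(self.num_hashes):
            self.table[row][self._bucket(item, row)] += 1

    def estimate(self, item):
        # Take the minimum across rows; collisions can only inflate counts.
        return min(self.table[row][self._bucket(item, row)]
                   for row in range(self.num_hashes))

stream = ["burger", "pizza", "pizza", "salad", "burrito",
          "pizza", "sushi", "taco", "noodles", "curry"]
cms = CountMinSketch()
for item in stream:
    cms.update(item)
print(cms.estimate("pizza"))   # 3 (or slightly higher if collisions occur)
print(cms.estimate("burger"))  # 1 (or slightly higher if collisions occur)
```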
So now that you know what sketches are, how they work and how
useful they are, how do you get started with them today? So you're probably thinking,
will I need to implement them myself? It sounds like there's some complex statistics and math
going on. Is there anything available in open source? Well, the good news
is you'll be glad to hear such a project does exist. So please go and
check out the Apache DataSketches project. It's available
in Java and C++, and you'll be happy to know it's also available in
Python. This project was started at Yahoo and entered the Apache Incubator in 2019,
and it achieved Apache's top level project status in 2021.
So it has a really active community and it's got a good roadmap for implementing
sketches. There's a very clear picture of the sketches
that have been implemented and the ones that are coming up. And the beauty
of this project is that it's cross language compatible,
meaning that you can mix and match languages for different components.
So like I mentioned before, with the stream processor you
could implement that in Java and then the back end query processor could be
implemented in Python and it would all work. But we're at Conf42 Python,
so of course we're going to do it all in Python and I'm
going to do that here. So I'm going to use these final few slides to
show you some code in Python. So this is an example in a Jupyter notebook
where I load in some Spotify data and filter it
for the artist information. So you see I've imported the library there,
and I've also imported pandas to load in the data, so they're interoperable.
You can see at the bottom there that I've specified a confidence score
that I want, which is a percentage. So I want 90% confidence.
And you see that the library gives you some convenience functions
that help you get a suggestion for the number of hashes you should be
using to get that level of confidence. And the same again for
the relative error, where I'm going for 1%: it will suggest the number
of buckets that you want to instantiate this count min sketch
with. And to make it easier to
follow, you'll see that I load in the records one at a time using a
loop. So when you run it with such a
large data source as the
Spotify data that I'm using, it actually takes a while.
And in reality what you'd be doing is you'd be reading the records one
at a time from a stream and then updating the sketch that way.
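For a rough idea of what that notebook code looks like, here's an illustrative sketch. I'm assuming the datasketches package's count_min_sketch class with its suggest_num_hashes and suggest_num_buckets helpers, and the file path, column name, and artist are made up for the example.

```python
import pandas as pd
from datasketches import count_min_sketch

# Hypothetical Spotify export with an "artist" column.
df = pd.read_csv("spotify_tracks.csv")

# Let the library suggest sizing for 90% confidence and 1% relative error.
num_hashes = count_min_sketch.suggest_num_hashes(0.9)
num_buckets = count_min_sketch.suggest_num_buckets(0.01)
sketch = count_min_sketch(num_hashes, num_buckets)

# Update the sketch one record at a time, as if the rows were streaming in.
for artist in df["artist"]:
    sketch.update(artist)

print(sketch.to_string())                  # summary: hashes, buckets, filled bins
print(sketch.get_estimate("Taylor Swift")) # approximate row count for an artist
```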
So for that streaming approach you could use Kafka with Quix, because
it's sort of a general purpose use case. But if your use case was
something else, like getting approximate statistics such as quantiles
and counts from changes that are happening in one of your databases,
you could use change data capture with something like Debezium to
obtain a stream of changes, and then you'd perform the updates in that
way. So getting into a stream is really the first part of the puzzle.
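As a rough sketch of that streaming setup, and assuming a kafka-python consumer with a topic name and broker address I've made up, the per-record update loop might look something like this.

```python
from kafka import KafkaConsumer
from datasketches import count_min_sketch

# Assumed topic and broker address; adjust to your own setup.
consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: v.decode("utf-8"))

sketch = count_min_sketch(count_min_sketch.suggest_num_hashes(0.9),
                          count_min_sketch.suggest_num_buckets(0.01))

# Update the sketch as each record arrives; it can be queried at any time.
for message in consumer:
    sketch.update(message.value)
```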
There you can see the sketches provide functions
which allow you to specify different things. You've
got a lovely print statement there that shows you
the sketch summary with the number of hashes, the number of
buckets, the number of filled bins, et cetera. So this is a good way for
you to keep an eye on how you should size out your sketches.
So yeah, you can see they maintain
the desired confidence score. You've got nice summaries, and at the end you can see
I've requested the estimations for some
of the popular artists in that data source.
And this one's my last slide. So you can pip install the Python library today
and get started. You may already even have sketching algorithms
available to you if you use certain database technologies or processing engines.
So we have Druid, we have Postgres and Hive,
which have supported them for some years, and it's been implemented
more recently in Apache Pinot and Apache Spark last
year as well. So if you're using any of these, you should be able to
just search the documentation for the sketching algorithms available,
and a lot of them will have things like HyperLogLog and some
sort of counting or quantile operation.
So as I mentioned, there is an active community of mathematicians,
data scientists and engineers working on sketches, so you'd
be amongst really good company if you do check out the project and want to
contribute. And so yes, I'm certainly
going to be exploring more sketches in future, but I'm curious to know how you'll
get on as well. So please do let me know what you think. Here are
the links to the Quix Streams open source library on GitHub,
where you can star the project to
follow its progress, and most importantly, show us your support. And also
please do come hang out with me and the rest of the Python team in
the Quix community on Slack, and we'd love nothing more than
to help you get started with streaming data using Python today.
So if you're going to implement the algorithms and you
don't know where to start, please join us there. So thank you
very much for taking the time to join me today, and I hope to see
you soon. Peace out.