Abstract
When you hear “decision-maker”, it’s natural to think, “C-suite”, or “executive”. But these days, we’re all decision-makers. Restaurant owners, bloggers, big-box shoppers, diners - we all have important decisions to make and need instant actionable insights. In order to provide these insights to end-users like us, businesses need access to fast, fresh analytics.
In this session, we will learn how to build our own real-time analytics application on top of a streaming data source using Apache Kafka, Apache Pinot, and Streamlit. Kafka is the de facto standard for real-time event streaming, Pinot is an OLAP database designed for ultra-low latency analytics, and Streamlit is a Python-based tool that makes it super easy to build data-based apps.
After introducing each of these tools, we’ll stream data into Kafka using its Python client, ingest that data into a Pinot real-time table, and write some basic queries using Pinot’s Python SDK. Once we’ve done that, we’ll glue everything together with an auto-refreshing dashboard in Streamlit so that we can see changes to the data as they happen. There will be lots of graphs and other visualisations!
This session is aimed at application developers and data engineers who want to quickly make sense of streaming data.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everybody, and welcome to this talk in which we're going to learn how to
build a real-time analytics dashboard with Streamlit and Apache Pinot. Let's first start by defining real
time analytics. So Gartner has
this quite nice definition where they say that real time analytics
is the discipline that applies logic and mathematics to data to provide insights for making
better decisions quickly. And so I've highlighted the
key bit here, which is that we want to make decisions quickly. That's the idea.
The data is coming in and we want to quickly be able to make some
sort of actionable insight based on that data. They further break it down
into on-demand and continuous analytics, where on-demand is
where the user makes a query and gets some data back, and continuous is where
maybe the data gets pushed to you. So we're going to be focusing on the
on demand part of that. So that's real time analytics.
But what about real time user facing analytics? So the extra bit here
is the idea that maybe we've always had these analytical query
capabilities, but we want to give them to the end users rather than just having
them for our internal users.
We want to give our end users analytical querying capabilities on fresh data.
So we want to get the data in quickly and be able to query it.
And in the bottom right hand side of the slide, you can see some examples
of the types of applications that we might use where we might
want to be able to do this. We can further break
it down like this. So the real time part
of this definition means the data should be available for querying as soon as possible.
So the data comes in often from a real time stream. We want to be
able to query it as quickly as we can. So we want very low latency
for the data ingestion. And then on the user
facing side, we want to be able to write queries that are similar in
query response time to what we'd get with an OLAP database, i.e.
it should be in milliseconds, like how quickly the queries come back. And then in
addition to that, lots of users are going to be executing those queries
at the same time. So we need to be able to handle high throughput,
low latency queries on very fresh data.
And let's have a look at some examples of places where those type of
dashboards have been built. So we've got one example here. So this is the Uber
Eats one. So imagine we're running a restaurant on Uber Eats. So in the middle
of the screen we can see there are some examples of the data
of what's happened in the last twelve weeks, like how much money have
we been making? Can we see changes in what's
happened over the last week? And then on the right hand side are some things
that might need our attention. So we can see there are some missed orders,
there's some inaccurate orders, there's some downtime, and those are the places where we
might want to be able to go and do something about it, and do it now.
So it's allowing us to see in real time there's something that
you need to do that might be able to fix a problem for somebody.
LinkedIn is another good example of where these
types of dashboards have been built. So we've got three different ones on here.
So on the left hand side we've got a user facing one. So that might
be for you and me using LinkedIn. So for example, we might want
to see who's viewed our profile. So it sort of shows the traffic to our
profile over the last few months. And then if you
remember this page, we can also see who's been viewing it. And where do they
work? Over in the middle we've got a recruiter
dashboard. So recruiters are trying to find people to fill the jobs that they have
available. And so this is kind of giving them like a view of all the
data that's there to try and help them work out where they might best target
those jobs. And then finally, on the right hand side is more of an internal
product or BI tool. So this is capturing metrics
and we want to be able to see in real time what's happening. So maybe
this is a tool that's being used by a product manager.
Okay, so that's the overview of what a real-time analytics dashboard is. Now let's
try and build our own one. And we're going
to be using three tools to do this. So we're going to be using Streamlit,
Pinot and Kafka. So let's have a quick overview of what
each of those tools are. So, Streamlit, I first came across it a couple of
years ago, so it's sort of beginning of 2020. So it's a Python based web
application framework, and the idea is that it should be used to build
data apps, often machine learning apps, and it integrates
with lots of different Python libraries. So Plotly, PyTorch, TensorFlow,
scikit-learn. If you have a Python client driver for your database,
it'll probably integrate with that as well. And if you want to share your apps
with anyone else, they have a thing called Streamlit
Cloud that you can use to do that. Apache Pinot
is next up. So that's a columnar OLAP database,
but it also has additional indexes that you can put on top of the
default columns, and it's all queryable by SQL.
And the idea here is that it's used for real-time, low-latency analytics.
And then finally we've got Apache Kafka, a distributed event
streaming platform. It uses a publish and subscribe
sort of setup for streams of records,
and those records are then stored in a fault tolerant way and we can process
them as they occur. So how are we going to
glue all these components together? So we're going to start in the top left hand
corner. So we're going to have some sort of streaming API, so some events coming
in from a data source, and we'll have a Python application processing
those. We'll then load them into Kafka. Pinot will listen to the Kafka
topic and process those events and ingest them into Pinot. And then finally
our streamlit application will query Pinot to
generate some tables or some charts so that we can see what's actually going on
with our data. So the data set that we're going to use is
called the Wikimedia recent changes feed. So this is a continuous stream of all
the changes that are being made to various Wikimedia properties. So like for
example Wikipedia, and it publishes
the events over HTTP using the server-sent events (SSE) protocol.
And you can see in the bottom of the slide an example of what a
message looks like. So you can see we've got an event, it says it's a
message, we've got an id and we've got a bunch of data. So it indicates
like the title of the page, it's got the URI of the page, the user,
what type of change they made, when they made it and so on.
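Just to illustrate the shape of those events, here's an abbreviated, made-up example of a recent-change payload written as a Python dict. The field names follow the mediawiki.recentchange schema; the values are placeholders, not a real event.

```python
# Illustrative only: field names from the mediawiki.recentchange schema,
# values are placeholders rather than a real event.
event = {
    "meta": {"uri": "https://en.wikipedia.org/wiki/Example_page",
             "domain": "en.wikipedia.org"},
    "id": 123456789,
    "type": "edit",            # edit, new, log, categorize, ...
    "title": "Example page",
    "user": "ExampleUser",
    "bot": False,
    "timestamp": 1634567890,   # epoch seconds
    "wiki": "enwiki",
}
```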
So what we're going to do now is we're going to move onto our demo.
So we'll go into visual studio here and what you can see
on the screen at the moment is a Docker compose file that we're going to
use to spin up all those components on our machine. So first up we've
got Zookeeper. So this is used by both Pinot and Kafka to manage their
metadata. Next one down is
of course Kafka. So that's where the stream of data is
going to be. It connects to Zookeeper and then it also
indicates a couple of ports where we can access it from Pinot and then from
our Python script. And then finally on the last three bits we've got
the various Pinot components. So we've got the Pinot controller.
That's what manages the pinot cluster and sort of
takes care of all the metadata for us. We've got the pinot broker.
So that's where the queries get sent to. And then it then sends them out
to the different servers. In this case we've only got one server, but in a
production setup you'd have many more and it would get those results
back and then return them to the client. And then finally we've got the Pinot
servers themselves. So this is what stores the data
and processes the queries. So I'm just going to open up a terminal window
here and we'll run the
docker compose script, so docker compose up and that
will then spin up all of these containers
on our machine. So we'll just minimize that for the moment. And we're
going to navigate over to this wiki.py. So this is the
script that we're going to use to get the data, the event stream, from
Wikimedia. So you can see we're on lines 10 to 13.
We've got the URI with the recent changes. We're then
going to call it with the requests library and then we're going to wrap it
in this SSE client. So you can see here, this is a server-sent events
client for Python and it then gives us back this events function.
It just has a stream of all these events. So if we open up the
terminal again and we'll just open a new tab and if we call python
wiki.py, we'll see we get like a stream of all those events.
You can see loads and loads of events coming through. We'll just quickly stop that.
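For reference, here's a minimal sketch of roughly what that wiki.py script does, assuming the requests and sseclient-py libraries. The URL and variable names are assumptions on my part rather than taken from the actual demo code.

```python
import json
import requests
import sseclient  # the sseclient-py package

# Wikimedia's recent changes feed, published via server-sent events.
URL = "https://stream.wikimedia.org/v2/stream/recentchange"

response = requests.get(URL, stream=True,
                        headers={"Accept": "text/event-stream"})
client = sseclient.SSEClient(response)

# events() yields each change as it arrives.
for event in client.events():
    if event.event == "message" and event.data:
        print(json.loads(event.data))
```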
And if you have a look at one of the events, if we just highlight
one down here, you can see it looks very,
very similar to what we saw on the slides. We've got the schema, it indicates
if it's a bot. We've got the timestamp, we've got the title of the
page and a bunch of other metadata as well. Okay, so that
was the top left of our diagram. So we've got the
data from our event stream into our python
application. Now we need to get it across into Kafka. So that's our
next bit. So that's Wiki to Kafka. So the
first bits are the same. So we've got the same code
getting the data from the source. We've now got a
while true loop. So this handles like if we lose the connection to the recent
changes, it will just reconnect and then carry on processing
events. The extra bit we've got is here on line 39, where we're adding
an event to Kafka. And so producer is defined up here on line 25.
And that's a Kafka producer for putting data onto a
Kafka topic. In this case, the topic is Wikipedia events.
And then the last interesting thing is on line 41 down to
44. Every hundred messages, we're going to flush those messages into Kafka.
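As a rough sketch of those extra bits, assuming the confluent-kafka Python client, a broker on localhost:9092, and a topic called wikipedia-events (all assumptions, since the demo code isn't shown here in full), the producer loop might look something like this:

```python
import requests
import sseclient
from confluent_kafka import Producer

URL = "https://stream.wikimedia.org/v2/stream/recentchange"
producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker address

events_processed = 0
while True:  # reconnect and carry on if the stream drops
    try:
        response = requests.get(URL, stream=True,
                                headers={"Accept": "text/event-stream"})
        for event in sseclient.SSEClient(response).events():
            if event.event != "message" or not event.data:
                continue
            producer.produce("wikipedia-events", value=event.data)  # assumed topic name
            events_processed += 1
            if events_processed % 100 == 0:
                producer.flush()  # flush every hundred messages, as in the demo
                print(f"Flushed {events_processed} messages")
    except requests.exceptions.RequestException:
        continue
```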
So let's just get our terminal back again.
And instead of calling that wiki one, we'll call Wiki to Kafka.
And so if we run that, the output will just be every hundred messages.
It's going to say I've flushed a hundred messages. I've flushed another hundred messages.
So you can kind of see those are going in nicely. If we want to
check that they're making their way into Kafka, we can
run this command here. So this is going to call the Kafka console
consumer, connect to the Kafka running on localhost:9092, subscribe to
this topic, and then get all the messages starting from the beginning.
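If you don't have the Kafka CLI tools to hand, a quick Python consumer would do a similar sanity check. This is a sketch, again assuming the confluent-kafka client and the wikipedia-events topic name:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "wiki-check",               # throwaway group just for this check
    "auto.offset.reset": "earliest",        # start from the beginning of the topic
})
consumer.subscribe(["wikipedia-events"])    # assumed topic name

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        print(msg.value().decode("utf-8"))
finally:
    consumer.close()
```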
So if we open ourselves up another tab here, let's just have a
quick check that the data is making its way into Kafka. So there we go,
you can see all the messages flying through there. And if we kill it,
it says, hey, I've processed 922 messages. And that was
because we killed it. There are probably more of them in there now. And again
we can see it's a JSON message. It's exactly the same message
that we had before. We've just put it straight onto Kafka. Okay, so so far
we've got to the middle of our architecture diagram. So we took the data from
the source, we got it into our python application. Originally we printed it to the
screen, but now we've put it into Kafka. So all the events are going into
Kafka. So our next thing is, can we get it into Pinot?
So let's just minimize this for the moment. What we need to do now is
we need to create a pinot table. So that's what this code here will
do. And a pinot table also needs to attach to a schema.
So let's start with that. So what's a schema? So a schema defines
the columns that are going to exist in a table. So our
schema is called wikipedia. We've got some dimension
fields, so these are like the descriptive fields for a table.
So we've got id, we've got wiki, we've got user. These are sort of
all the fields that come out of the JSON message.
We could have metrics, optionally you can have metric fields. So for example, if there
was a count in there, we could have a field that we use for that.
But in this case we don't actually have that, but we do have
a date time field. So you need to specify those separately. So in this case
we've got a timestamp. So we specify that. So that's a schema.
The table, a table is where we store the
records and the columns that basically store the
data that's going into Pinot. The data is stored in segments.
So that's why you'll see the word segments being used.
And so we first need to specify the segment config. So we're going to say,
okay, where's the schema? If you had a replication
factor, you could specify it here, although we've just set it to one.
And then you need to indicate the time column name if
you're doing a real time table. So real time table basically means I'm
going to be connecting to some sort of real time stream and the data is
going to be coming in all the time. And I need you to process that.
We then need to say, well, where is that data coming from? So we specified
the stream config. So in this case it's a Kafka stream coming
from the Wikipedia events topic. And then we need to tell it, where is
our Kafka broker? So it's over here on port
9093. And then we can just say like, well, how should I process those
messages that are coming from Kafka? And then finally the segments
that store the data, how big should they be? So in this case, we've said
they're going to be 1,000 records per segment. That's obviously
way smaller than what we'd have in a real production setup.
But for the purpose of this demo, it works quite well.
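In the demo this is done with a config file and an admin command, but as a rough, trimmed-down sketch, an equivalent schema and real-time table could also be created through the controller's REST API from Python. The field names, topic name, and broker address here are assumptions based on what was described above, not the actual demo config.

```python
import requests

# Trimmed-down sketch of the schema and real-time table config described above.
schema = {
    "schemaName": "wikipedia",
    "dimensionFieldSpecs": [
        {"name": "id", "dataType": "STRING"},
        {"name": "wiki", "dataType": "STRING"},
        {"name": "user", "dataType": "STRING"},
        {"name": "title", "dataType": "STRING"},
        {"name": "type", "dataType": "STRING"},
        {"name": "bot", "dataType": "BOOLEAN"},
    ],
    "dateTimeFieldSpecs": [
        {"name": "ts", "dataType": "TIMESTAMP",
         "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS"}
    ],
}

table = {
    "tableName": "wikipedia",
    "tableType": "REALTIME",
    "segmentsConfig": {
        "schemaName": "wikipedia",
        "timeColumnName": "ts",
        "replication": "1",
    },
    "tableIndexConfig": {
        "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "wikipedia-events",    # assumed topic name
            "stream.kafka.broker.list": "kafka:9093",          # assumed internal listener
            "stream.kafka.consumer.type": "lowlevel",
            "stream.kafka.consumer.factory.class.name":
                "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
            "stream.kafka.decoder.class.name":
                "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
            "realtime.segment.flush.threshold.rows": "1000",   # tiny segments, demo only
        }
    },
    "tenants": {},
    "metadata": {},
}

# The controller's REST API usually listens on port 9000.
requests.post("http://localhost:9000/schemas", json=schema).raise_for_status()
requests.post("http://localhost:9000/tables", json=table).raise_for_status()
```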
So let's copy this command, let's get our terminal back again,
and we'll create our pinot table. So that
command is going to run, and you can see down here it's been successful.
So the pinot table is ready to go. So now we're going
to go into our web browser so that
we can see what's going on. So we'll just load that up. There we go.
And so this is called the Pinot Data Explorer. So it's like a web-based
tool for seeing what's going on in your Pinot cluster. So you can see here
we've got one controller, one broker, one server. Normally you'd have more than
one of those, but since we're just running it locally, we just have one.
And then on the left hand side over here you can see we've got the
query console and we can navigate into this
table. We can write a query so we can say, hey, show me how
many documents or rows you have. And you see each time you run it,
the number is going up and up and up. We can also go back into
here and we can see, we could navigate to the table. And so you can
see that, hey, there's a table. It's got a schema. We could navigate
into the table. The schema is defined in more detail there. You
can edit the table config if you want to, and down here it indicates
what segments are in there. So you can see this is like the first segment
that was created. What we're going to do now is we're going
to go and have a look at how we can build a streamlit application on
top of this. So we'll go back into visual studio again and
we've got our app.py script.
So this is a streamlit script. It's basically just a normal python bit of
code, except we've imported the streamlit library at the top and then we've
got a bunch of other things that we're going to use and then there's just
some python code. And whenever you want to render
something to the screen there's like a streamlit dot something.
So streamlit.markdown, streamlit.header,
whatever it is. And that will then put the data on the screen. So we'll
come back and look at this in a minute. But let's just have a look
what a streamlit application actually looks like. So if we
do streamlit run app.py,
that will launch a streamlit application on port 8501.
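To give a flavour, a heavily trimmed-down version of that kind of app.py might look like this. It assumes the pinotdb client, the streamlit-autorefresh helper, and the table and column names sketched earlier, so treat it as an illustration rather than the actual demo code.

```python
import pandas as pd
import plotly.express as px
import streamlit as st
from pinotdb import connect
from streamlit_autorefresh import st_autorefresh

# Re-run the whole script every 5 seconds so the dashboard stays fresh.
st_autorefresh(interval=5000, key="refresh")

st.title("Wikipedia Recent Changes")

# The Pinot broker's SQL query endpoint (assumed to be exposed on port 8099).
conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")

query = """
SELECT type, COUNT(*) AS changes
FROM wikipedia
GROUP BY type
ORDER BY changes DESC
LIMIT 10
"""
cursor = conn.cursor()
cursor.execute(query)
df = pd.DataFrame(cursor.fetchall(), columns=["type", "changes"])

st.dataframe(df)
st.plotly_chart(px.bar(df, x="type", y="changes"))
```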
So we'll just copy that. Let's navigate back to our web browser.
So we'll open a new tab on here. Just paste that
in and you can see here we've got a streamlit
application running. So this is actually refreshing every 5 seconds. You can see
here the last time that it updated. So I've got it like running on
a little loop that refreshes it every 5 seconds and you can
see that the number of records is changing every time it
refreshes. The last 1 minute there's been 2000 changes by
398 end users on 61 different domains. The last
five minutes is a bit more, last ten minutes, obviously a bit more
than that. And then actually the last 1 hour and all time
are exactly the same because we haven't actually
been running it for that long. We can then zoom in. So on our navigation
on the left hand side, that's an overview of what's happened. We can
then see who's been making those changes. So we can see is it bots,
is it not bots? So at the moment, and I'm
not sure exactly how it defines what a bot is, but 27% of the changes
have been made by what they define as bots and 73%
not by bots. And every 5 seconds this updates. So these percentages
will be adjusting as we go. We can then see which users
it was, who were the top bots and who were the top non-bots.
So you can see these people are making a lot of changes. Like this one
here has made 548 changes in the seven
minutes or so since we started running it. We can also see where
the changes are being made. So if we click onto this next tab here, where
are the changes being made, on which Wikimedia properties?
So it's mostly on commons.wikimedia.org, which is surprising to me.
I'd expected most of it to be done on Wikipedia, but it's actually not.
It's mostly on Wikidata and Commons Wikimedia. And then we
can see the types of changes. So we've got edit, categorize, log and then new
pages and then interestingly conf 42 I'm not entirely sure what that is.
We can also do drill down. So when you're building analytics dashboards that's a
pretty common thing. So you might have like these overview pages,
but then you want to do a drill down. So like take me into one
of them. So maybe it's show me what's being done by a particular user.
So in this case we can pick a user, see where they've been making changes
and what type of changes they've been making. So this
user here, I'm not going to attempt to pronounce that, but they've made like a
lot of changes on wikidata.org, and mostly editing stuff, not
really any categorizing. If we pick like a different user,
the charts change
back again. But if we were to pick another one, we could sort of see
what they've been up to. So this is a sort
of dashboard that we can build. So now let's
go back and just have a quick look at how we went about
building that. So we might be able to build one for ourselves. So we'll
just minimize this here. So you can see we can
import any Python libraries
that we want to use. This is where we're setting the refresh. So this is
refreshing the screen every 5 seconds. You don't necessarily have to have that.
If you wanted to just have a manual button to refresh it, you could have
that. Instead. We can define the title. So this was the title that we had
on the top of the page. We've also got this, which is how we're
printing like the last update that was made.
I've done a bit of styling myself here on
this overview. So this is the overview tab. So I've actually just explained how that
works. We've got down here, we've got a sidebar title
and then we're building a little map that has a function representing each
of the radio buttons. So we've got overview, who's making the changes, where are the
changes being done, and drill down. You don't have to have this. If you had
just a single page app, you wouldn't need to bother with this. You could just
literally just print everything out straight
in one script and it will show all of it on one page. But I
wanted to be able to break it down and then if we narrow in on
one of them, so say this one here, this is showing the types of
changes being made. We've got our query here. So it's saying select
type, count the number of users, group by the type, and for
a particular user. So where user equals whichever user you
selected. And then
it executes the query, puts the results into a data frame here,
and then finally builds a plotly chart or graph
on line 303 and prints it out to the screen. And the rest of the code
is pretty similar to that. It's quite procedural code. We're not really doing
anything all that clever. It's just sort of reasonably
basic python code. It's just that streamlit is making it
super easy to render it to the screen. So hopefully
that's given you an idea of what you can do with these tools. So I
just want to conclude the talk by sort of going back to the
slides and just recapping what we've been doing.
So just to remind ourselves, what did we do? So we had this streaming
API of Wikipedia events.
We used a python application using a couple of python libraries, including the SSE
client, that processed those events. We then connected to Kafka
using the Kafka Python driver. We put the events into Kafka.
We then had Pinot connected up to that. So we created, wrote some YAML,
I guess not only Python, but a bit of YAML, connected that to the Kafka
topic and ingested that data into Pinot. And then we were using a pinot Python
client to then get the data out. And then finally we
used the streamlit Python library to get the data in here. And so we've used
lots of different python tools. So we used other tools outside of
Python, but we were able to glue them all together really nicely and build a
web app really quickly. That looks actually pretty good.
It's pretty nice, it's pretty interactive, and it's all done using
Python. And finally, if you want to play around with any of
these tools, this is where they live. So we've got Streamlit, we've got Apache
Pinot, Apache Kafka, and the code used in
the demo is all available on this GitHub
repository. If you want to ask me any questions or if
anything doesn't make any sense, you can contact me here. So I've put my
Twitter handle, but otherwise, thank you for listening to the talk.