Transcript
Welcome to my lecture, Anomaly Detection with Apache Pinot and ThirdEye.
By the end of this lecture, I want you to be able to
solve this, well,
this, or at least recognize this as
soon as possible. So what happened was I was working
at this big data company, an ad tech company, and we were
dealing with 1 million events per second.
And all I needed to do was add a new parameter
to the data. The problem was that
there was a misconfiguration or a misunderstanding
about the data type. See, it was sent as a string, but I implemented it as a long value, and this was only relevant for mobile phones of type iPhone 7.
Needless to say, this problem went unrecognized for 12 hours, in which time we lost about $10,000 of billing data, and we then wasted five days trying to remedy the lost data,
to no avail. So,
what I'm actually talking to you about today could
have saved all of this mess, because we
would have recognized the problem immediately.
Let me introduce myself. My name is Yoav Nordmann.
I am a technology enthusiast. I love
working with new and emerging technologies.
You can call me either a nerd or a geek, or both; it is definitely a compliment. At client sites, I usually work as a tech lead or architect.
And at Tikal, the company I'm working for,
I am a group leader and mentor for fellow
workers in the back end.
So, let's start and talk about what anomaly detection is. But in order to understand what anomaly detection is, let's first try to define what an anomaly is. Right? So, an anomaly is defined as a deviation from the common rule; something different, abnormal, peculiar, and even not easily classified. Now that we have an understanding of what an anomaly is, we can define anomaly detection. So, anomaly detection is understood to be the identification and/or observation of data points and events that deviate from a dataset's normal behavior.
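To make that definition concrete, here is a minimal sketch in Python of what a detector does at its core. This is a toy z-score rule, not ThirdEye's implementation:

```python
# A minimal sketch of the idea (not ThirdEye's actual logic): flag
# points that deviate too far from the series' normal behavior.
def detect_anomalies(values, threshold=2.0):
    """Return indices of points more than `threshold` standard
    deviations away from the mean of the series."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [i for i, v in enumerate(values)
            if std > 0 and abs(v - mean) / std > threshold]

# The 500 clearly deviates from the otherwise flat series.
print(detect_anomalies([10, 12, 11, 9, 500, 10, 11]))  # [4]
```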
Simple enough, right? So, what is the problem?
Well, the problem is quite complex.
Let's say an issue occurs, just as the issue occurred at the company I was working for. Many, many times, if an issue occurs, it just ends up in a black hole, because nobody even recognizes the issue. Now let's say somebody actually recognizes there is an issue and invests time and effort to research the problem. So he was able to identify the issue, but he still might not be able to find the root cause. So again, it'll end up in the black hole, and it might occur again. Finally, let's say many people try to identify the root cause and we find it; only after that can we actually fix the issue.
So what we are trying to do today with Apache Pinot and ThirdEye is, first of all, eliminate the black hole. There is no more issue which will occur undetected, and no more identifying an issue without being able to get to the root cause. So there is no more black hole. Second of all, we are trying to reduce time to detection, and we are also reducing time to resolution.
So let me take you on this journey through the twilight zone, and let's talk about StarTree's ThirdEye.
So what is ThirdEye? ThirdEye is an anomaly detection, monitoring, and interactive root cause analysis platform. Remember, I said StarTree ThirdEye. So what is StarTree? Or rather, who is StarTree? StarTree is a company, and they are the ones offering Apache Pinot and ThirdEye as a SaaS solution. But Apache Pinot, of course, is open source and can be used freely. ThirdEye is not open source; it is actually given out with a community license, which means you cannot take ThirdEye and create a SaaS product with it. Other than that, you are allowed to use it at your own discretion and enjoy the full benefits of this great platform.
If we look at the architecture of ThirdEye, we can see that this platform is not just a simple tool or a simple UI. There is more to this application and its architecture, but the interesting part is that it is working against a data source. It is actually always querying this data source. So this data source has to be very, very efficient for all of these queries to be run and returned in sub-second response time, right? And this data source is none other than Apache Pinot. So let's talk a little bit about Apache Pinot. What exactly is Apache Pinot?
So Apache Pinot is a real-time distributed OLAP datastore, purpose-built to provide ultra-low-latency analytics even at extremely high throughput.
Let us try to put all of this into context and
explain the problem.
In the analytics world, we usually talk about dimensions and metrics, right? Dimensions are the labels used to describe data, and metrics are the quantitative measurements of the data. So, for example, a dimension would be device type, which could be Android or iPhone, or the dimension could be country, which would be Israel, the US, Mexico, or any other country. On the other hand, the metrics would be, for instance, temperature, which has a value, or views, which also is a value.
And all we want to do with the dimensions and
metrics is slice and dice.
So slice and dice might be easy with three dimensions; as you can see here, this would be seven combinations. What happens if we have five dimensions? That's already a little bit harder, because those are 31 combinations. And what about seven dimensions, which has 127 combinations? And of course, we would like to have many, many more dimensions.
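Where do those numbers come from? Every non-empty subset of dimensions is one way to slice the data, so n dimensions give 2^n - 1 combinations. A quick sketch, with hypothetical dimension names:

```python
from itertools import combinations

# Each non-empty subset of dimensions is one way to slice the data,
# so n dimensions give 2**n - 1 combinations: 3 -> 7, 5 -> 31, 7 -> 127.
dims = ["device_type", "country", "os_version"]  # hypothetical dimensions
slices = [c for r in range(1, len(dims) + 1)
          for c in combinations(dims, r)]
print(len(slices))                    # 7
print([2**n - 1 for n in (3, 5, 7)])  # [7, 31, 127]
```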
To understand the problem,
let's see how the data is usually kept.
So, data is usually kept in raw form at first, and it is processed going down the processing line. After raw data, it might be joined and aggregated, and in the end it might be cubed. So why would we do this to data? The more we keep data in raw format, the more flexibility we have to do certain calculations; the more we preprocess the data, the less flexibility we have. But on the flip side, the more we keep the data in raw format, the higher the latency: it'll take a long time to run those computations. And the more we preprocess the data, the lower the latency. So Apache Pinot sits right in the middle, between joined data and aggregated data, trying to have maximum flexibility and minimum latency.
But why would Apache Pinot be better than its
competitors, of which there are a few?
Well, this has to do with the history of Apache Pinot.
See, Apache Pinot was actually invented and written at LinkedIn. At first it was used as an internal analytics database for the business users to see what is happening. As soon as those business users saw the immense potential of Apache Pinot, they said it should be expanded to be used by all 500 million users on LinkedIn. And if you go on LinkedIn today, you might even see all the queries which are being sent to Apache Pinot at any given time. As you can see just on this page, I have seven queries, and there might be more, which are being run for each user with low latency, most of the time a sub-second response time, for any given user on LinkedIn.
Some statistics from LinkedIn: as you can see, there are 200,000 queries per second. Those statistics are actually a year old, so there might be more by now. They have a max ingestion rate of about 1 million events per second, and when querying, they have about 20 billion records scanned each second. So as you can see, Apache Pinot is a database built for speed and efficiency. Now, this speed and efficiency is delivered in particular by a pluggable indexing technology. As you can see, they implemented a lot of indexes to help achieve minimum-latency response times.
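To give a feel for that pluggable indexing, here is roughly what the index section of a Pinot table config looks like, sketched as a Python dict. The column names are hypothetical and the exact keys can vary between Pinot versions, so check the docs:

```python
# Rough sketch of the index section of a Pinot table config (shown as a
# Python dict). Column names are hypothetical; keys may vary by version.
table_index_config = {
    "invertedIndexColumns": ["country", "device_type"],  # fast equality filters
    "rangeIndexColumns": ["event_time"],                 # fast range filters
    "bloomFilterColumns": ["user_id"],                   # prune segments quickly
    "sortedColumn": ["customer_id"],                     # one physically sorted column
    "noDictionaryColumns": ["raw_payload"],              # store raw, skip dictionary
}
```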
So let's get back to
what I'm trying to solve.
So at the same company there was another problem. We had a lot of data, right? And there was this one odd user who every day would enter the system and download a lot of data. The problem was that when this person initiated his query to download the data, there was heavy usage on the system, so all the other queries had a higher response time. Basically, other people had to wait longer for their data to arrive. And the funny thing was, do you know when we found out about this problem? Only a day later, because there was a batch job which would go over the log files of the day before and extract all the query latency response times. So only a day later we would know that there were certain queries at a certain point which had a high response time. So again,
the problem we are trying to solve is this: what happens if at a certain point there is a sudden degradation of performance and we do not know about it in real time? So the way ThirdEye works is actually using an alert template. The alert template is the detection logic or boilerplate that can be used to create an alert. An example could be: we would like to create an anomaly if a certain metric is bigger than a certain maximum value.
Using this alert template, we can then create an alert. An alert is an anomaly detection rule configuration, right? That would be our anomaly detection. An example for this would be: create an anomaly if revenue, that is our metric, is bigger than 20,000; we would like to check every hour, and an anomaly would be registered whenever this alert is triggered. So at a certain time when querying the data, we would see that revenue is 30,000, which is above the threshold of 20,000, on Thursday the 3rd between 9:00 p.m. and 10:00 p.m.
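Sketched as a Python dict, such an alert in ThirdEye's JSON configuration looks roughly like this. The field names follow the StarTree ThirdEye docs as I remember them and may differ between versions, and the dataset and metric names are hypothetical:

```python
# Rough sketch of a ThirdEye alert built from a threshold template.
# Field names are from memory of StarTree ThirdEye and may differ by
# version; the dataset and metric names are hypothetical.
revenue_alert = {
    "name": "revenue-threshold-alert",
    "template": {"name": "startree-threshold"},
    "templateProperties": {
        "dataSource": "pinot",
        "dataset": "revenue_events",
        "aggregationColumn": "revenue",
        "aggregationFunction": "SUM",
        "monitoringGranularity": "PT1H",  # evaluate hourly buckets
        "max": "20000",                   # anomaly if SUM(revenue) > 20000
    },
    "cron": "0 0 * * * ? *",              # Quartz cron: run every hour
}
```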
The interface looks as follows. As you can see, we have a view of certain metrics, and those of you with great eyesight can see that on February 28th there is a dotted line and there is a solid line. The solid line is the actual data and the dotted line is the expected data. So as you can see here, this is an anomaly which can be traced in ThirdEye. There are
multiple detector algorithms which can be used. In the example you saw the threshold rule; there is also a mean-variance rule, a percentage rule, and an absolute change rule. And if you get the service from StarTree, there is also a Holt-Winters rule. That one is proprietary to StarTree; if you are using ThirdEye for free, under the non-commercial license, then you will not have the Holt-Winters rule. But say you want to write your own: this platform is actually pluggable, so if you want, you can write your own detector algorithms based on your needs.
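To give the flavor of what such a detector does, here is the core of a percentage-change rule in plain Python. The real ThirdEye detectors are plugins with their own API; this only sketches the logic:

```python
# Core logic of a percentage-change rule: compare the current series
# to a baseline and flag points that moved more than pct_change.
# (ThirdEye detector plugins have their own API; this is just the idea.)
def percentage_rule(current, baseline, pct_change=0.25):
    return [i for i, (cur, base) in enumerate(zip(current, baseline))
            if base != 0 and abs(cur - base) / abs(base) > pct_change]

# 140 is 40% above its baseline of 100, so index 1 is flagged.
print(percentage_rule([100, 140, 90], [100, 100, 100]))  # [1]
```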
Now, this is all great and nice, and we would know when there is an anomaly. But as I said before, ThirdEye is also a root cause analysis platform. So when there is an anomaly, you will be able to go into the root cause analysis and see what exactly the problem is. Those of you with great eyesight might see the difference: on the left we have the current date range, and on the right side we have the baseline. And so we can see what exactly the difference is and why there is an anomaly. Now, as you can see, I have different colors: I have blue, and I have red, reddish. So it's pretty simple. If a certain metric shows as deep red, that is a big change down. And if it shows as intense blue, it is a big change up. So now, looking back at this root cause analysis in ThirdEye, we can see that there are certain values which are higher than the baseline and certain values which are lower than the baseline. All of this helps us with our root cause analysis.
Now, what about alerts, right? I mean, any company has its own alerting system. So how could I integrate those anomaly detections, those anomalies, with my alerting system? Well, there is the possibility of a subscription group. If we would like to create a subscription group in ThirdEye, as you can see, at the moment the channels are either email or Slack, and there is also an option for a webhook. So we can definitely integrate the anomalies which occur in ThirdEye with our alerting system.
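A subscription group configuration, sketched again as a Python dict. I am writing this from memory, so treat the field names as an assumption; the webhook URL is of course hypothetical:

```python
# Rough sketch of a ThirdEye subscription group that routes anomalies
# to a webhook. Field names are from memory and may differ by version;
# the URL is hypothetical.
subscription_group = {
    "name": "oncall-webhook",
    "cron": "0 */5 * * * ? *",  # check for new anomalies every 5 minutes
    "alerts": [{"name": "revenue-threshold-alert"}],
    "specs": [
        {"type": "webhook",
         "params": {"url": "https://alerts.example.com/thirdeye-hook"}},
    ],
}
```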
But still, there are a few skeptics. The baseline, remember? What if the baseline day was a holiday? Or even more: what if the baseline fell on a day where we had a change in the system, or a new version of the product?
Well, this is one of the greatest issues, so we can create events. We would create an event on certain dates: for instance, each time we actually deploy a new version of the product, that could be an event. If there is a holiday, that would be an event. This is a very simple tool to create events, and they would be integrated into our baseline and into our anomaly detection. And with our root cause analysis, we will be able to see that there were specific or special circumstances at any given point.
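An event is essentially a named time range. Sketched as Python dicts, with field names from memory and epoch-millisecond timestamps, both of which are assumptions:

```python
# Rough sketch of ThirdEye event records: a named time range with a
# type. Field names are from memory; timestamps are epoch milliseconds.
deployment_event = {
    "name": "release-v2.4.0",    # hypothetical product release
    "type": "DEPLOYMENT",
    "startTime": 1700000000000,
    "endTime": 1700003600000,
}
holiday_event = {
    "name": "New Year's Day",
    "type": "HOLIDAY",
    "startTime": 1704067200000,
    "endTime": 1704153600000,
}
```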
So let's remember why we are here. As I said, at the end of this session, I want you to be able to, if not solve this, at least find it as soon as possible. When I say as soon as possible, I mean within minutes, if not seconds.
I really hope I gave you an option to do just that. Now I would like to demo Apache Pinot and especially ThirdEye.
So, something short about the demo. I will demo this on my computer. I've set up a Kubernetes cluster using k3s, and I am running Kafka, Pinot, and ThirdEye on this Kubernetes cluster on my own laptop. Via Telegraf, I'm sending metrics of my CPU performance to Kafka, which are ingested into Apache Pinot. ThirdEye is going to query this data every minute (the lowest granularity we can get in ThirdEye is every minute), and it will try to see whether there is an anomaly in the data which I'm sending.
So let me show you the demo of Apache Pinot and StarTree ThirdEye.
First of all, I'm going to spin up k9s and we can have a look at Kubernetes. Everything is in the Pinot quickstart namespace. As you can see, I have a small Kafka cluster, and then I have the StarTree MySQL, I have all the different Pinot servers, and then here we have the different StarTree components at the end. ZooKeeper is also being used for StarTree. As you can see, Pinot itself has many components, and StarTree as well. If you want, you can run Apache Pinot straight away; there's a Helm chart for Apache Pinot, the open source version. I just used the quick start guide from StarTree, which includes Pinot as well, just for ease of use.
So going ahead, looking at the UI: I have now entered the UI of Apache Pinot. This is what you get; as you can see, again, a lot of components. Let's go to the tables. There's this one table I have, host metrics CPU real-time. This is the table I configured in Apache Pinot to receive all the events I'm sending via Telegraf to Kafka, and it is ingesting straight away from Kafka, so we can have a look at the data. As you can see, at the moment I have over 10,000 data points. I can run different queries. Now, if I run a query, it doesn't matter what I have in here; as you can see, the total number of documents just went up, because every five to ten seconds more documents are being entered into this table. This is a real-time table, meaning it receives data from Kafka and is updated in real time.
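As an aside, the same table can be queried from outside the UI by POSTing SQL to the Pinot broker. A minimal sketch, where the hostname, port, table, and column names match my local demo setup and are assumptions:

```python
# Minimal sketch: query the real-time table via the Pinot broker's
# /query/sql endpoint. Host, port, table, and column names match my
# local demo setup and are assumptions.
import requests

resp = requests.post(
    "http://localhost:8099/query/sql",
    json={"sql": "SELECT COUNT(*) AS docs, SUM(usage_system) AS total_cpu "
                 "FROM hostmetrics_cpu"},
)
print(resp.json())
```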
So let's go and have a look at ThirdEye.
This is StarTree ThirdEye. This is what you get when you enter. As you can see, I have this one alert and I have ten different anomalies. If I would like to look at the alerts: so this is a CPU alert. This is what's going on on my laptop as we speak. Let's go first into configurations. I have configured the data source, which is my Apache Pinot. I have a dataset which I configured, the host metrics CPU, and these are all the parameters in the dataset. This can also be seen in the table in Apache Pinot. Then, these are all the alert templates that exist in StarTree ThirdEye; there are many which come out of the box, and you can always add your own. Here are the subscription groups; I didn't add anything, because there is nothing for me to add. And here are the events; I didn't add anything here either. So let's take a look at the anomalies.
As you can see, I have a lot of anomalies going on on my computer at the moment. Let's first take a look at my alert, the alert I configured. There are two views: there's a simple view, and there is an advanced view, which is basically all the JSON. I will take a look at the simple view. I configured the name to be cpu alert. I would like to run it every minute of every hour of every day. It is based on the template type startree-threshold. I would like to do an aggregation of a sum on the usage_system metric, and I defined a threshold of 170. If I do a reload preview, I can actually see the data, and I can see the rule being applied to the data which I have.
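For reference, the JSON behind this simple view would look roughly like the following, again sketched as a Python dict, with field names from memory and the dataset name an assumption:

```python
# Rough sketch of the demo alert's JSON: sum usage_system per minute
# and flag an anomaly above 170. Field names from memory; the dataset
# name is an assumption.
cpu_alert = {
    "name": "cpu-alert",
    "template": {"name": "startree-threshold"},
    "templateProperties": {
        "dataSource": "pinot",
        "dataset": "hostmetrics_cpu",
        "aggregationColumn": "usage_system",
        "aggregationFunction": "SUM",
        "monitoringGranularity": "PT1M",  # one-minute buckets
        "max": "170",                     # anomaly if the sum exceeds 170
    },
    "cron": "0 * * * * ? *",              # Quartz cron: every minute
}
```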
So I have this already configured. Let's cancel this and go back to the dashboard. At the moment I have 14 anomalies. I can go into the anomalies; let's take a look at the last anomaly. Let's see the last hour, I think that's enough. Okay. And the anomaly just happened: right now I'm already at the threshold of 170. Again, you can see the dotted line, which is the expected data, and the solid line, which is what's actually happening. So let's go and investigate our anomalies.
So this is the anomaly; it's right here on the right side. And here we have the heat map, as I've shown you before. You can look at the top contributors, but there's not much going to be in here; again, I don't have a lot of data. And I am able to look at the events. First of all, I can also change the baseline. So this is a week; let's look at a baseline of one day, which again is not something I have, because I don't have a day's worth of data. I've just collected data for the past maybe two hours. Let's go back to see my anomalies, and let's take something a little before that. Again, let's take a look at the last hour.
Okay, so this is the anomaly. And then I can also say, okay, no, this is not an anomaly, and this feedback is recorded. Let's go back, and as you can see, before it was at 16, now it is at 15. Again, I can also view all the anomalies here, and I can preview this and look at it. There are also a lot more parameters I can change. For instance, I am able to say that I would like those two anomalies to be counted as one. This is all within the configuration here: the merge max duration, for how many minutes I would like anomalies to be merged, for instance, and the merge max gap.
So, well, this is not a new alert, so it's not going to change this on the fly. But again, if I had created the alert like this to begin with, it would count these two as one.
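The two merge settings I pointed at would sit in the alert's template properties, roughly like this (names from memory, values as ISO-8601 durations):

```python
# Rough sketch of the anomaly-merging knobs (names from memory,
# ISO-8601 durations): merge anomalies closer than 5 minutes apart,
# but cap any merged anomaly at 1 hour.
merge_settings = {
    "mergeMaxGap": "PT5M",
    "mergeMaxDuration": "PT1H",
}
```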
Thank you very much for joining this lecture. I hope you've learned something today, and I hope I can help you, or have helped you, achieve better data consistency and data integrity.