Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, I'm Madhumita, product lead at StarTree. I'm very excited to be here along with my colleague Suvadeep. We are going to give a talk on troubleshooting real-time business metrics. It's a favorite topic of mine, and without further ado, I would like to get started.
I always start with a quote, and today's favorite quote for me is: real-time business metrics are constantly changing, which is why it's so important to be proactive in detecting issues. Don't wait for a problem to become a crisis before you take action. That's the crux of today's talk and something very close to my heart; I have experienced it and been living through it for some time.
Today I'm very excited to share my journey and my story, along with my colleague Suvadeep, with all of you. So what are we going to talk about? I will cover some of the challenges and introduce real-time metrics monitoring and anomaly detection. Then I'll hand it over to my colleague, who will walk through a real-life use case and how to troubleshoot it, with a live demonstration. He will also touch on some of the benefits and advantages, and then we will open up the floor for Q&A. So stay tuned; we are going to share a lot of interesting things with you in the next few minutes.
To start with, what are the challenges associated with troubleshooting real-time business metrics? Real-time business metrics are the metrics you monitor to identify issues as they happen. One example is in the IoT space: if you have devices deployed across different locations and those devices are overheating, you want to know as soon as possible, before it becomes a crisis where the devices shut down, which is not good for your business or your users. Net-net, it is very important to monitor these metrics, and when something goes wrong, to troubleshoot as quickly as possible so that you can take action.
Now, what are the challenges involved in doing that? When we talk about metrics, and real-time metrics in particular, data is what stands out. In the era of generative AI, data is everywhere, and data volume and velocity are becoming critical. To handle massive amounts of data in real time, you need efficient storage and processing solutions, and there are not many out there. That's the first challenge. Second, when you are troubleshooting, you are making decisions about what to do next, whether that's fixing the device or something else. To do that efficiently, you need good-quality, consistent data; otherwise you're not going to make sound decisions in real time, and that's a big challenge.
What's next? Data integration. As I was saying, data is everywhere, and especially in the IoT world it lives on multiple devices, on the internet, and offline. You have to integrate all of that data to make sense of it in real time. That's very complex, and it can be a hindrance if you don't do it on time and accurately, so that's another important aspect to consider. Latency and scalability are another big problem: even once you have a solution, latency becomes a challenge, because data may not arrive on time or you may not process it on time, and there are multiple issues involved. And as the data grows, you may not be able to scale either. These are some of the things that limit your ability to troubleshoot in real time in a smart and efficient way.
So what's the solution? Since we are talking about metrics, we obviously have to monitor them, but monitoring alone is not enough. You need to identify anomalous events as quickly as possible, because the sooner you detect an issue, the higher the impact in terms of revenue or cost savings. What are these anomalous events? They could be spikes or drops, or a gradual change over a period of time, which is very hard to detect manually. And just detecting anomalous events is not enough: you need to know what is causing them, so that you can troubleshoot as quickly as possible in real time, getting answers and then taking action. These are the actionable insights that are critical for making good decisions, and they carry big cost implications.
In general, then, the solution to troubleshooting real-time business metrics is having a metrics monitoring system, detecting anomalies, and identifying actionable insights so you can make good decisions as quickly as possible.
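To make that concrete, here is a minimal sketch of spike/drop detection using a rolling z-score. This is a toy baseline for illustration only, not how any particular product detects anomalies; the window size and 3-sigma threshold are assumptions you would tune.

```python
import pandas as pd

def detect_spikes(values: pd.Series, window: int = 24, z_thresh: float = 3.0) -> pd.Series:
    """Flag points whose rolling z-score exceeds z_thresh (spikes or drops)."""
    # Baseline statistics come from the trailing window, shifted by one
    # so the current point does not influence its own baseline.
    mean = values.rolling(window).mean().shift(1)
    std = values.rolling(window).std().shift(1)
    z = (values - mean) / std
    return z.abs() > z_thresh

# Example: hourly device temperatures with one injected overheating spike.
temps = pd.Series([70.0 + 0.5 * (i % 3) for i in range(48)])
temps.iloc[30] = 95.0
print(temps[detect_spikes(temps)])  # flags the spike at index 30
```

A rule this simple breaks down as soon as the metric has trend or seasonality, which is exactly the gap discussed next.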
Now, before we get to the different scenarios for these anomalous events, let's talk about how you would build a solution like that. First, you bring in all of this data and store it in efficient storage. Then you make meaning out of it: you identify anomalies by applying algorithms, determine insights by looking at the data, and surface those insights. Stitching it all together gives you an automated anomaly detection solution.
But that alone is not enough, because metrics can take any shape and form: a metric could be steady, upward trending, or seasonal. Detecting anomalous events across these varied data patterns in real time, with the kind of solution I described on the previous slide, is not easy to stitch together.
On top of that, detecting accurate outliers is also not easy, and applying smart algorithms is not easy either: you need to write the algorithm and apply it, or, if you are using out-of-the-box algorithms from libraries, you still have to stitch them together, apply them in real time, and then detect accurate anomalies based on your business context. Combining it all is hugely challenging.
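As one concrete illustration: for a seasonal metric, a fixed threshold would flag every daily peak. A common library-based approach, shown here as a hedged sketch rather than ThirdEye's actual algorithm, is to strip out trend and seasonality first and only threshold the residual.

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

def seasonal_outliers(series: pd.Series, period: int = 24, z_thresh: float = 3.0) -> pd.Series:
    """Decompose into trend + seasonal + residual, then flag residual outliers."""
    parts = seasonal_decompose(series, model="additive", period=period)
    resid = parts.resid.dropna()  # edges are NaN where the trend is undefined
    z = (resid - resid.mean()) / resid.std()
    return z.abs() > z_thresh
```

Even this covers only one metric shape; picking the right method per metric, at scale and in real time, is the hard part.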
Now, is there a solution out there? Of course, and that's what we will be giving a live demo of, to help you understand how you can troubleshoot in real time.
This solution is StarTree ThirdEye. It provides real-time (or near-real-time) metrics monitoring and anomaly detection on large, complex time series data, and it fast-tracks problem solving by unlocking the actionable insights I was talking about. It has three core pillars. First, you stitch together real-time and offline data and store it in a very efficient storage and compute resource: ThirdEye is built on Apache Pinot, a columnar OLAP data store that lets you run massive queries on your data in a very compute-efficient way and returns results in fractions of a second. Second, you can apply smart algorithms in an automated fashion and detect accurate outliers as quickly as possible. Third, an interactive UI surfaces actionable insights so you can investigate quickly, get to the root cause, and make informed decisions. This solution is awesome, and a favorite of mine. It has a smart low-code/no-code UI and is also available as an API-based solution.
With that, I will hand over to my colleague, who is going to give a live demo of some real-life use cases and show how you can use this tool to troubleshoot in real time and be self-sufficient. It's something very interesting that you don't want to miss, so stay tuned for the next half. Welcome, Suvadeep; I'm handing it over to you to walk through the exciting use cases in the live demonstration. Looking forward to the next segment of the talk. Thanks.
Thanks, Madhumita, for the intro. Hey everyone, my name is Suvadeep. I'm a founding engineer at StarTree. I'm going to walk you through a couple of use cases around anomaly detection and also show you, under the hood, how ThirdEye works and how we are trying to tackle some of these issues. For this demo, I'll use our rideshare use case, as well as an IoT use case built around sensor data. And yeah, let's jump in.
Here we have a demo instance of ThirdEye. This is what the dashboard looks like: it gives you an overview of what's happening in your system. Are there any anomalies? What do your charts look like? How are your metrics behaving? Let me quickly jump into Create Alert. I'll click here and show how a simple ThirdEye alert works and what it takes to create one.
Let me go ahead and create a basic alert here. I'm going to use the rideshare dataset, and let me choose wait time. Just to explain the context: we have Pinot, serving as a time series database and exposing all of these datasets, and we're choosing a metric, in this case wait time. Think of rideshare as a dataset for something like Uber, where cab rides are being taken all over the place and we are monitoring different metrics around them to figure out if there are anomalies. I'll go ahead and load the chart. Notice that the granularity is daily, meaning we aggregate wait times per day over this period, and in this case I'm aggregating with sum, so I'm adding up all the wait times and showing them on a per-day basis. Now, the metric looks reasonably well behaved, except for this weird spike here. Let me change the granularity to hourly to see how that looks. As you can see, the metric is fairly seasonal, except for this particular spike. So I'm going to create an alert on top of this, so that if we see this kind of spike we can make sure the system is actually behaving okay.
ThirdEye offers a whole suite of algorithms to help you detect anomalies. I'm going to choose the matrix profile in this case, which is a StarTree algorithm that does pattern-level analysis and figures out anomalies, especially for use cases like this one. Let me select that and click next.
I'm not going to change any settings; I'll just load the chart. As you can see, with the default settings the algorithm correctly flags that there is a weird spike here, very different from the rest of the metric. I'm pretty happy with this, so I'm going to click next, skip over the remaining options, and create the alert.
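For context on the technique itself: matrix profile methods are also available in open source. The sketch below uses the stumpy library to find the most anomalous subsequence (the "discord") in a series. It illustrates the general idea only; StarTree's matrix profile detector and its tuning are internal to the product.

```python
import numpy as np
import stumpy

def top_discord(series: np.ndarray, m: int = 24) -> int:
    """Return the start index of the most anomalous length-m subsequence.

    The matrix profile stores, for each subsequence, the distance to its
    nearest neighbor elsewhere in the series; the largest distance marks
    the subsequence least like anything else, i.e. a likely anomaly.
    """
    mp = stumpy.stump(series.astype(np.float64), m=m)
    return int(np.argmax(mp[:, 0].astype(np.float64)))

# m=24 assumes daily seasonality on hourly data, as in the demo.
```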
We've now created a basic alert on a metric, and in the background ThirdEye runs a task that performs the entire detection routine on it and eventually comes back with results. In this case it took a few seconds and then showed us: hey, this is the anomaly we have. So this is roughly the workflow of ThirdEye.
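Since ThirdEye is API-first (more on that at the end), the same alert can presumably also be created programmatically instead of through the UI. The endpoint and payload below are illustrative assumptions, not a documented contract; check the API docs of your ThirdEye instance.

```python
import requests

THIRDEYE_URL = "http://localhost:8080"  # assumed local demo instance

# Hypothetical payload shape; the field names here are assumptions.
alert = {
    "name": "rideshare-wait-time-spikes",
    "description": "Flag unusual spikes in summed hourly wait time",
    # dataset, metric, and detector settings as configured in the UI
}

resp = requests.post(f"{THIRDEYE_URL}/api/alerts", json=[alert])
resp.raise_for_status()
print(resp.json())
```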
ThirdEye doesn't only do basic alerting; it is also really good at multidimensional alerting. Let me show you what I mean by that. I'm going to go back to the homepage, go to Create Alert, and show you what a multidimensional alert means. Imagine you have this rideshare metric and we were doing a sum over wait time. What if I wanted to know the wait times across different cities or different ride types? Let me choose the dataset and pick wait time as the same metric, but this time with a bunch of dimensions: I can choose city, driver rating, maybe device type. Ride type also looks interesting, so let me choose that. Now I can see the distribution of wait time across ride types. In this case the ride types are shared, premium, and economy, and they have more or less similar wait times across this entire time period.
The reason is that we are using a simulated dataset here; with real data you'll see a lot more color, and it really shows which dimensions or combinations of dimensions are important. I can also choose both city and ride type. ThirdEye will then run combinations of these dimensions and figure out which particular fraction of your data has a larger impact on your overall metric. In this case, as you can see, folks are waiting a lot longer in New York than in the other slices of the data; San Francisco economy, for example, is only about 16% of the overall traffic.
In the interest of time, I'll just select all of them and click Create Multidimensional Alert. ThirdEye will then create a single alert that runs through all of these different time series and monitors them together. That gives you an overview of the different slices of the metric, all in one alert, which is extremely powerful.
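Conceptually, a multidimensional alert fans a detector out over every dimension slice. A rough pandas sketch of that fan-out, reusing the earlier detect_spikes example and assuming hypothetical column names:

```python
import itertools
import pandas as pd

def slice_series(df: pd.DataFrame, dims: list[str], metric: str = "wait_time"):
    """Yield one time series per combination of dimension values."""
    for r in range(1, len(dims) + 1):
        for combo in itertools.combinations(dims, r):
            # Sum the metric per timestamp within each dimension-value slice.
            wide = df.groupby(["timestamp", *combo])[metric].sum().unstack(list(combo))
            for label, series in wide.items():
                yield combo, label, series.dropna()

# for combo, label, series in slice_series(rides, ["city", "ride_type"]):
#     anomalies = detect_spikes(series)  # one detector run per slice
```

The product does this server-side within one alert; the point here is just the shape of the computation.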
From there I go through the usual process and create the alert. For the purposes of this demo, I already have an alert created, so let me quickly jump to that.
I see that the dimensionality here is slightly different, but let me walk you through what the analysis of an anomaly looks like in ThirdEye. ThirdEye not only helps you figure out what is an anomaly; it also helps you figure out what could have caused it. In this case I see we have ride type equal to economy, and I can expand that chart. The algorithm is pointing to a couple of cases where these spikes are not favorable, so I can look at what these anomalies mean, or why they are anomalies. If I click on this particular red dot, it jumps straight to the anomaly, and I see this big spike here, which is a little odd compared to the surrounding metric. So maybe it is an anomaly I should investigate. I click Investigate, and this gives me a much deeper view into the underlying metric and how ThirdEye analyzes it and presents everything in one place. This is ThirdEye's root cause analysis (RCA) module.
Here, ThirdEye's root cause analysis algorithms are analyzing the underlying data and sharing what could have caused this anomaly. There are a few results, and the common theme, as you can see, is that the traffic condition was heavy or moderate; we don't know exactly which, but that factor keeps coming up.
ThirdEye also offers different kinds of tools; the heat map is one of them, and it helps you figure out which dimensions were behaving oddly. In this view, anything blue increased and anything red decreased. Since the RCA top contributors suggested that the traffic condition could have been one of the reasons, we can confirm that in the heat map: moderate traffic increased quite a bit, a 161% increase, and heavy traffic increased as well, in this case roughly tripling, while light traffic stayed pretty small. So maybe the traffic in this particular time frame was heavy, and that was probably the reason there were issues in the wait time.
We can add some of these slices to the chart and see how they behave. I'll search for traffic equal to heavy and click it. When I do, an underlying chart shows me what percentage of this metric was contributed by heavy traffic. I can do the same for moderate traffic.
In this case I'm adding both, and I'll add this to the investigation. As you can see, a decent part of the spike comes from traffic condition equal to moderate, and that may well be one of the reasons we are seeing some of these spikes in wait times. I'm going to go to the next step.
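Before moving on: the heat map we just used boils down to comparing each dimension value's contribution between a baseline window and the anomaly window. A simplified sketch of that computation, with assumed column names:

```python
import pandas as pd

def contribution_change(baseline: pd.DataFrame, current: pd.DataFrame,
                        dim: str = "traffic_condition",
                        metric: str = "wait_time") -> pd.DataFrame:
    """Percent change per dimension value between two windows."""
    before = baseline.groupby(dim)[metric].sum()
    after = current.groupby(dim)[metric].sum()
    pct = (after - before) / before * 100.0  # e.g. moderate: +161%
    return pd.DataFrame({"before": before, "after": after,
                         "pct_change": pct.round(1)})
```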
ThirdEye also gives you an event framework to work with. The whole point of RCA is to take an anomaly and trace it back to an event that caused it. So if you had a spike in wait time, what event could have caused it? ThirdEye puts a lot of this information in front of you so you can make an informed decision. In this case we have a whole bunch of events, which are holidays coming in from different regions. I'll select all of them, and you can see that ThirdEye finds related events around the time period in which the anomaly occurred and plots them across time. For example, there was an Armed Forces Day, which could have been one of the reasons there was traffic, maybe in San Francisco. This is how ThirdEye correlates different things together and puts them in one place so that you can make an informed decision.
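At its core, this event step is a window join: find events whose dates fall near the anomaly. A hedged sketch of the filtering (the real engine presumably also scores relevance; the example event and dates are made up):

```python
from datetime import datetime, timedelta

def related_events(events: list[dict], anomaly_start: datetime,
                   anomaly_end: datetime, pad_days: int = 3) -> list[dict]:
    """Return events (e.g. holidays) within pad_days of the anomaly window."""
    lo = anomaly_start - timedelta(days=pad_days)
    hi = anomaly_end + timedelta(days=pad_days)
    return [e for e in events if lo <= e["date"] <= hi]

holidays = [{"name": "Armed Forces Day", "date": datetime(2023, 5, 20)}]
print(related_events(holidays, datetime(2023, 5, 19), datetime(2023, 5, 21)))
```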
I'll click next, and you can see there is a space where I can save the investigation. In this case I'll write that heavy or moderate traffic was the reason for the increase in wait time, and save it. The investigation is now a shareable entity that you can pass to your team or to other analysts who want to know what's going on, what your analysis was, and how you got there. So this is broadly the workflow from anomaly detection to RCA. Now let me quickly go through another use case, this one related to sensor data. I have an alert here that does temperature analysis. Maybe not this one; give me one second. It's probably this one. All right.
In this case we can see results for different devices. This alert is not working on the overall temperature but rather on the temperature of each device ID, which is very interesting because you get much deeper insight into every device: why was the temperature of this particular device much higher than usual? We can go through the same flow as before and get insight into which dimensions are having an impact. Here we see a much smaller set of dimensions than in the rideshare use case, but it seems that a lot of devices simply started becoming more active, and that's probably one of the reasons the overall temperature went up. I believe the alert was set on a sum of temperatures, so I assume that could be the reason.
In short, if you have a set of metrics with a set of dimensions associated with them, ThirdEye can do a deep drill-down and find all sorts of correlations and causal candidates around them, so that you can make an informed decision about what your anomalies might be related to, or which correlated events could be the reason for a particular incident.
That's all I had in terms of the demo. Let me now share a few things about how ThirdEye works under the hood. I want to touch on these concepts to illustrate that there is real work under the hood needed to make sure your dataset is performant and anomaly detection goes smoothly. ThirdEye is actually pretty noisy in terms of querying: it fetches the current value of the metric every minute or every five minutes, depending on the responsiveness you need, and this can be expensive. So it's very important that the data is modeled right. This is where Apache Pinot really shines: because it is a database designed for time series, it helps us give timely responses to the queries that are constantly monitoring your critical metrics.
Another important thing is how your data is organized in Apache Pinot, or any database of this kind, because how your schema and data are set up matters immensely for performance. For example, a timestamp stored as a string is typically much more expensive to parse than one stored as a long. Keeping these factors in mind when modeling your data really adds up in making sure your monitoring system is performant and can live up to the requirements your analysts are setting.
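In Pinot schema terms, that means declaring the timestamp as an epoch-based LONG rather than a STRING. A minimal fragment, with an illustrative column name:

```json
{
  "dateTimeFieldSpecs": [
    {
      "name": "ts",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```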
Time buckets are another interesting case. Whenever you want to bucket time, say round it off to the nearest minute, five minutes, fifteen minutes, hour, or day, there is a lot of translation happening inside queries, and when you're monitoring a high-throughput metric, doing all of that aggregation in real time gets difficult and adds cost at runtime. Making sure you have the right derived columns, and that your data types are set right, is crucial for keeping your alerts performant.
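One way to get such a derived column in Pinot is an ingestion-time transform that pre-buckets the timestamp when the row is written, so queries don't have to truncate it on every read. A sketch with assumed column names; verify the function name and config shape against the Pinot docs for your version:

```json
{
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "ts_hour",
        "transformFunction": "dateTrunc('hour', ts)"
      }
    ]
  }
}
```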
Lastly, certain kinds of systems can handle aggregations in a much faster and more efficient way. The whole point is to avoid expensive table scans, and the star-tree index is one such example. Given a configuration, the star-tree index pre-computes a lot of these aggregates under the hood and keeps them ready, so that for a certain set of queries there is no table scan anymore: the numbers are already computed and returned right up front.
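For reference, enabling a star-tree index in a Pinot table config looks roughly like the fragment below, pre-aggregating summed wait time over city and ride type (values illustrative):

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["city", "ride_type"],
        "functionColumnPairs": ["SUM__wait_time"],
        "maxLeafRecords": 10000
      }
    ]
  }
}
```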
This is extremely powerful when you're aggregating across a large number of dimensions on a high-cardinality dataset, and if you have a lot of rows being ingested every minute it becomes super important, because you're no longer reading a lot of data.
Coming back to how we position ourselves: ThirdEye generates a high query volume and requires low latency, because we want to be responsive in reporting anomalies. We pair really well with Pinot, a real-time time series database that is capable of handling these aggregations, and it has a lot of indexing capabilities that ThirdEye leverages to stay performant across the entire vertical.
Another thing we make heavy use of is Pinot's ability to slice across different dimensions with features like the star-tree index. ThirdEye builds on that capability so that we don't just monitor things at an overall level; we can slice much deeper and give you insights much more quickly, drilling down into exactly the right dimensions and their slices. All right.
Let me talk a little bit about what detection looks like, in terms of how we detect such anomalies. ThirdEye has a sophisticated anomaly detection flow, and this is what we use to model alerts. The same flow and architecture that models a single metric also models multiple dimensions, and that capability gives us a lot of benefit: we can model derived metrics, do runtime computations, and monitor a diverse set of use cases to give you the results you need without a lot of pre-computation.
ThirdEye handles a lot of this flexibility on its own. We also have a solid notification framework: currently we support email and Slack, with PagerDuty coming soon. The main goal is that once your anomalies are detected, we present them to you in a consumable way, and notifications are designed exactly for that. The notifications flow consolidates the reports properly and feeds them to you however you like; you can also build your own integrations with webhooks. This framework prepares and sends all of these reports to you live, in real time.
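If you go the webhook route, the receiving end can be a tiny HTTP service. The payload fields below are assumptions for illustration; consult the ThirdEye docs for the actual schema.

```python
from flask import Flask, request

app = Flask(__name__)

@app.post("/thirdeye-webhook")
def on_anomaly():
    # Assumed payload shape; the real schema is defined by ThirdEye.
    report = request.get_json(force=True)
    for anomaly in report.get("anomalies", []):
        print("anomaly:", anomaly.get("metric"), anomaly.get("startTime"))
    return {"status": "ok"}

if __name__ == "__main__":
    app.run(port=9000)
```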
ThirdEye is, of course, API-first; we do this for pretty much everything at StarTree. The idea is to let you build applications on top seamlessly, and this gives you a lot of scope for doing that. Pretty much everything we have shown in the UI can be driven by an API provided by ThirdEye. We are cloud ready as well: you can deploy us on Kubernetes, we have a Helm chart, and we are available on GitHub. We have a community edition, available at github.com/startree/thirdeye; feel free to try it out and let us know your feedback. Most of the tools we have shown here are available in the community edition, so there should be no problem reproducing this.
That's pretty much it. Here are the links as well; feel free to try these out and join the StarTree community, or try the community edition. If you'd rather be hands-off and have everything delivered as a whole package, the Enterprise version is also available for you to play with via the self-serve trial. With that, I will wrap up. Thank you so much, and happy holidays.