Conf42 Internet of Things (IoT) 2023 - Online

Troubleshooting real-time business metrics with StarTree ThirdEye

Abstract

There is a rise in monitoring key business metrics in real-time to make informed decisions whenever something goes wrong. Troubleshooting these metrics and identifying blindspots in real-time is challenging and resource-intensive. This talk focuses on how to close some of these gaps.

Summary

  • Madhumita, product lead at StarTree, introduces the talk on troubleshooting real-time business metrics. Her colleague then walks through a real-life use case with a live demonstration of how to troubleshoot it, touches on some of the benefits and advantages, and the floor opens for Q&A.
  • Real-time business metrics are the metrics you monitor to identify issues as they happen. In the era of generative AI, data is everywhere, and data volume and velocity are becoming very important. To troubleshoot in real time in a smart and efficient way, you need good-quality data.
  • StarTree ThirdEye helps with near-real-time metrics monitoring and anomaly detection on large and complex time series data. Suvodeep gives a live demo of how you can use this tool to troubleshoot in real time.
  • Suvodeep: I'm going to walk you through a couple of use cases around anomaly detection and show you, under the hood, how ThirdEye works and how we are trying to tackle some of these issues. ThirdEye not only does a basic level of alerting, it is also really good at multidimensional alerting.
  • ThirdEye helps you figure out what is an anomaly and what could have caused that particular anomaly. ThirdEye also gives you an event framework to work with, and it puts a lot of this information in front of you so that you can make an informed decision.
  • ThirdEye can do a deep drill-down and try to find all sorts of correlations and causations, so that you can make an informed decision about what your anomalies are related to and which events could be the reason for a particular incident.
  • How your data is organized in Apache Pinot, or any database of this kind, is important. Certain systems are capable of handling aggregations in a much faster and more efficient way. ThirdEye is able to leverage these capabilities to stay performant across the entire vertical.
  • ThirdEye has a very sophisticated anomaly detection flow, which is used to model alerts. Most of the tools shown in the talk are available in the community edition. The Enterprise version is also available for you to play with.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, I'm Madhumita, product lead at StarTree. I'm very excited to be here along with Suvodeep, who is my colleague. We are going to give a talk on troubleshooting real-time business metrics. It's a favorite topic of mine, and without further ado, I would like to get started.

I always start with a quote. Today's favorite quote for me is: real-time business metrics are constantly changing. That's why it's very important to be proactive in detecting issues. Don't wait for a problem to become a crisis before you take action. That's the crux of today's talk, and it's something very close to my heart. I have experienced it and been living through it for some time, and today I'm very excited to share my journey, my story, along with my colleague Suvodeep, with all of you.

So what are we going to talk about? I will cover some of the challenges and an introduction to real-time metrics monitoring and anomaly detection. Then I'll hand it over to my colleague, who will give a walkthrough of a real-life use case and how to troubleshoot it, with a live demonstration. Then he will touch on some of the benefits and advantages, and we will open up the floor for Q&A. So stay tuned. You're going to learn a lot, and we have a lot of interesting stuff to share with you in a few minutes.

To start with, what are the challenges associated with troubleshooting real-time business metrics? Real-time business metrics are the metrics that you are monitoring to identify any issues that are happening in real time. One example is in the IoT space: if you have several devices planted across different locations and those devices are heating up, you would like to know as soon as possible, before it becomes a crisis, meaning the device shuts down or something like that. That's not good for your business and not good for your users.
So net-net, it's very important to monitor that and, when something goes wrong, to troubleshoot as quickly as possible so that you can take action. Now, to do that, what are the challenges involved? Well, when we talk about metrics, and in real time at that, data is something that stands out. In the era of generative AI, data is everywhere, and data volume and velocity are becoming very, very important. To handle a massive amount of data in real time, you need efficient storage and processing solutions, and there are not many out there. So that's a challenge.

The second thing is, if you are troubleshooting, you are making decisions about what you are going to do next: whether you have to fix the device or whether something else is going on. To do that efficiently, you need good-quality data and consistent data. Otherwise you're not going to make quality decisions in real time, and that's going to be a big challenge.

Now, what's next? Well, we talked about data quality. Now, data is everywhere, as I was saying, and especially in the IoT world: data is on multiple devices, data is online, data is offline. You're integrating all that data to make sense of it in real time. That's very complex, and it can also be a hindrance if you're not doing it on time and accurately. So that's another important aspect you need to consider.

Latency and scalability are another big problem. If you are trying to solve this with a solution, latency becomes a challenge. Data may not be arriving on time, or you're not processing it on time. There are multiple issues involved. And as data grows, you can't scale it either. So these are some of the things that limit troubleshooting in real time in a smart and efficient way. So now, what's the solution?
Well, since we are talking about metrics, we have to monitor the metrics, obviously, but monitoring alone is not enough. You need to identify anomalous events as quickly as possible, because the sooner you detect them, the greater the impact in terms of revenue or cost savings. Now, what are these anomalous events? They could be spikes or drops, or a gradual change over a period of time, which is very hard to detect manually. And just detecting anomalous events is not enough. You need to know the cause, what is causing the anomalous event, so that you can troubleshoot as quickly as possible in real time, getting answers and then taking action. These are actionable insights, which are critical for quality decisions and carry significant cost implications. So in general, the solution to troubleshooting real-time business metrics is having a metrics monitoring system, detecting anomalies, and identifying these actionable insights so you can make quality decisions as quickly as possible.

Now, what are the different scenarios for these anomalous events? Before we talk about the scenarios, let's talk about how you would build a solution like that. The first thing is bringing in all this data, storing it in efficient storage, and making sense of it. Then you identify anomalies by applying some algorithms, determine insights by looking at the data, and provide those insights. Stitching it all together gives you an automated anomaly detection solution. And that alone is not enough, because we were talking about metrics, and metrics can take any shape and form: steady, upward trending, or seasonal.
So detecting anomalous events across these various data patterns in real time, with the kind of solution I described on my previous slide, stitching it all together, is not easy. On top of that, detecting accurate outliers is also not easy. Applying a lot of smart algorithms is not easy either, because you need to write the algorithm and apply it, or if you are using out-of-the-box algorithms from libraries, you still have to stitch them together, apply them in real time, and then detect accurate anomalies based on your business context. Combining it all is hugely challenging.

Now, is there a solution out there? Well, of course, and we will be giving a live demo of it to help you understand how you can troubleshoot in real time. This solution is StarTree ThirdEye. It helps with real-time metrics monitoring, almost near real time, and anomaly detection on large and complex time series data. And it fast-tracks problem solving by unlocking those actionable insights that I was talking about. It has three core pillars. First, you stitch together real-time and offline data and store it in a very efficient storage and compute resource: ThirdEye is built on Apache Pinot, which is a columnar OLAP data store that allows you to run massive queries on your data in a very compute-efficient way and gives you results in fractions of a second. Second, you can apply your smart algorithms in an automated fashion and detect accurate outliers as quickly as possible. Third, it provides an interactive UI that surfaces those actionable insights so you can investigate faster, get to the root cause, and then make informed, quality decisions. This solution is awesome. It's a favorite of mine. It also has a smart low-code/no-code UI and an API-based solution.
So with that, I will hand over to my colleague, who is going to give a live demo of some real-life use cases and show how you can use this tool to troubleshoot in real time and be self-sufficient. It's something very interesting that you don't want to miss, so stay tuned for the next half. Welcome, Suvodeep; I'm handing it over to you to walk through the exciting use cases you have in the live demonstration. With that, I'm looking forward to the next segment of the talk. Thanks.

Thanks, Madhumita, for the intro. Hey everyone, my name is Suvodeep. I'm a founding engineer at StarTree. I'm going to walk you through a couple of use cases around anomaly detection and also show you, under the hood, how ThirdEye works and how we are trying to tackle some of these issues. For the purposes of this demo, I'll share the rideshare use case that we have, as well as an IoT use case with sensor data. And yeah, let's jump in.

So here we have an instance of ThirdEye, which is a demo instance. This is what the dashboard looks like. It gives you an overview of what's happening around your system: are there any anomalies, what do your charts look like, how are your metrics behaving, et cetera. Let me quickly jump into Create Alert. I'll click here and show how a simple ThirdEye alert works and what it takes to create one. So let me go ahead and create a basic alert here. I'm going to use the rideshare data set, and let me choose wait time. Just to explain the context: we have Pinot, which is a time series database that is serving all of these data sets, and we're choosing a metric here, which in this case is wait time.
Think of rideshare as a data set for something like Uber, where you have cab rides being taken all over the place, and we are monitoring different metrics around them and trying to figure out whether there are anomalies associated with them. So I'm going to go ahead and load the chart. Notice that the granularity is daily, meaning that we are aggregating daily wait times over this period. In this case I'm aggregating with sum, so I'm adding up all these wait times and showing them on a per-day basis. Now, the metric looks reasonably well behaved, except probably for this weird spike here. Maybe I'll change the granularity to hourly to see how that looks. As you can see, the metric is fairly seasonal, except for this particular spike. So I'm going to create an alert on top of this, so that if we see spikes like this, we can check that the system is actually behaving okay.

ThirdEye shows you a whole suite of algorithms that help you detect anomalies. I'm going to choose the matrix profile in this case, which is a StarTree algorithm that helps you with patch level analysis and figures out anomalies for use cases especially like this one. So let me select that and click Next. I'm not going to change any settings; I'm just going to load the chart. As you can see, with the default settings the algorithm correctly flags that there is a weird spike here that is very different from the overall metric. I'm pretty happy with this, so I'm going to click Next. I'm going to skip over all of this and create the alert.
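As a rough illustration of what a matrix-profile style detector looks for, here is a minimal brute-force sketch in Python. It is not StarTree's implementation: the function names, the window length `m`, and the synthetic hourly wait-time series are all assumptions made for this example. For every window of the series it computes the z-normalized distance to the nearest non-overlapping window; the window with the largest distance (often called the discord) is the one that looks least like anything else in the series.

```python
import math


def znorm(window):
    """Z-normalize a window so that shape, not scale, is compared."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(x - mean) / std for x in window]


def matrix_profile(series, m):
    """Brute-force matrix profile: for each length-m window, the distance
    to its nearest neighbor among windows that do not overlap it."""
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    profile = []
    for i, a in enumerate(windows):
        best = math.inf
        for j, b in enumerate(windows):
            if abs(i - j) < m:  # exclusion zone: skip overlapping windows
                continue
            best = min(best, math.dist(a, b))
        profile.append(best)
    return profile


# Hypothetical hourly wait-time totals: a repeating pattern plus one spike.
pattern = [10, 12, 15, 20, 15, 12]
series = pattern * 8
series[27] = 60  # injected spike (illustrative data, not from the demo)

profile = matrix_profile(series, m=6)
discord_start = max(range(len(profile)), key=profile.__getitem__)
print("most unusual window starts at index", discord_start)
```

On this synthetic series, every window that simply repeats the pattern has an exact match elsewhere, so the highest-scoring window is one that contains the injected spike. The double loop is quadratic in the series length, which is fine for a sketch but not for production-sized data.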
Now what happens is that we have just created a basic alert on a metric, and ThirdEye is running a task in the background that performs the entire detection routine on this metric and eventually comes up with its results. In this case, as you can see, it took a few seconds and then showed that, hey, this is the anomaly that we have. So this is roughly the workflow of ThirdEye.

ThirdEye not only does a basic level of alerting, it is also really good at multidimensional alerting. Let me show an example of what I mean by that. I'm going to go back to the homepage, go to Create Alert, and show you what a multidimensional alert means. Imagine that you have this rideshare metric where we were doing a sum over wait time. What if I wanted to know the wait times across different cities or different ride types? So let me choose the data set here, and I'm going to choose wait time as the same metric. But in this case I have a bunch of dimensions. I can choose city, driver rating, maybe device type. Ride type also looks interesting, so let me choose that. Now I see the distribution of wait time across the different ride types. In this case the ride types are shared, premium, and economy, and they have more or less similar wait times across this entire time period. The reason is that we are using a simulated data set; with real data you'll see a lot more color here, and it really tells you which dimensions or combinations of dimensions could be important. I can also choose both city and ride type here.
ThirdEye will run combinations of these dimensions and try to figure out which particular slice of your data has a larger impact on your overall metric. In this case, as you can see, folks are waiting a lot more in New York than in the other slices of data. San Francisco economy is probably just 16% of the overall traffic. In the interest of time, I can just select all of them and click Create Multidimensional Alert. ThirdEye will then create an alert that goes through all of these different time series and monitors them in one single alert. That gives you an overview of different slices of the metric, all in one alert, which is extremely powerful. Then I can go through the usual process and create the alert here.

For the purpose of this demo, I already have an alert created, so let me quickly jump to that. Oh, I see that the dimensionality is slightly different here, but let me walk you through what the analysis of an anomaly looks like in ThirdEye. ThirdEye not only helps you figure out what is an anomaly, it also helps you figure out what could have caused that particular anomaly. In this case, what I see is that we have ride type equal to economy, and I can expand that chart. I can see that the algorithm is pointing to a couple of cases where these spikes are not favorable. So I can go ahead and see what these anomalies mean, or why they're anomalies. If I click on this particular red dot, it jumps to the anomaly right away, and I see that there is a big spike here, which is a little bit odd compared to the metric surrounding it. So maybe it is an anomaly that I have to investigate and figure out.
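The breakdown by combinations of dimensions described above can be pictured with a small Python sketch. This is only an illustration of the idea of ranking slices by their share of the overall metric, not how ThirdEye computes it; the function name, the column names (`city`, `ride_type`, `wait_time`), and the sample rows are all made up for the example.

```python
from collections import defaultdict


def slice_contributions(rows, dimensions, metric):
    """Return each dimension combination's share of the summed metric,
    largest share first. Assumes the overall total is non-zero."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[metric]
    grand_total = sum(totals.values())
    shares = {key: value / grand_total for key, value in totals.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


# Hypothetical pre-summed wait times per slice (illustrative numbers only).
rows = [
    {"city": "New York", "ride_type": "economy", "wait_time": 50.0},
    {"city": "New York", "ride_type": "premium", "wait_time": 30.0},
    {"city": "San Francisco", "ride_type": "economy", "wait_time": 16.0},
    {"city": "San Francisco", "ride_type": "shared", "wait_time": 4.0},
]

ranked = slice_contributions(rows, ["city", "ride_type"], "wait_time")
for key, share in ranked:
    print(key, f"{share:.0%}")
```

A multidimensional alert can then be thought of as running the same detection on the time series of each selected slice, rather than only on the overall sum.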
So what I can do is click Investigate, and this will give me a much deeper view into the underlying metric and how ThirdEye is able to analyze all of that and present it to you in one place. This is the root cause analysis module of ThirdEye. What you can see here is that the algorithms in ThirdEye's root cause analysis are analyzing the underlying data and sharing with you what could have caused this anomaly. There are a few results here, and the common theme, if you look, is that maybe the traffic condition was heavy or moderate. We don't know exactly, but in this case that seems to be coming up a lot.

The other thing ThirdEye does is offer you different kinds of tools. The heat map is one such tool, which helps you figure out which dimensions were behaving unusually. In this case, anything that's blue increased and anything that's red decreased. Now, since the RCA top contributors suggested that the traffic condition could have been one of the reasons, we can see from the heat map that moderate traffic increased quite a bit. We see a 161% increase in moderate traffic, and also an increase in heavy traffic, which in this case tripled. And light traffic is actually pretty small. So it could be that in this particular time frame the traffic was pretty heavy, and that was probably the reason why there were issues in the wait time.

We can add some of these things to the chart and see how they behave. In this case, I'm going to search for traffic equal to heavy and click that. When I do, you can see that there is an underlying chart that shows me what percentage of this metric was contributed by the heavy traffic.
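The kind of comparison a heat map like this summarizes can be sketched in a few lines of Python: take the metric per dimension value in a baseline period and in the anomalous period, and compute the percent change for each value. This is a simplified illustration, not ThirdEye's RCA algorithm; the function name is hypothetical and the input numbers were chosen only so that they reproduce the figures mentioned in the demo (a 161% increase for moderate traffic and a tripling for heavy traffic).

```python
def percent_change_by_value(baseline, current):
    """Percent change of a metric per dimension value between two periods.
    Values missing from the baseline get None (change is undefined)."""
    changes = {}
    for value in sorted(set(baseline) | set(current)):
        before = baseline.get(value, 0.0)
        after = current.get(value, 0.0)
        if before == 0:
            changes[value] = None
        else:
            changes[value] = (after - before) * 100.0 / before
    return changes


# Hypothetical wait-time totals per traffic condition (illustrative only).
baseline = {"light": 100.0, "moderate": 100.0, "heavy": 50.0}
current = {"light": 90.0, "moderate": 261.0, "heavy": 150.0}

changes = percent_change_by_value(baseline, current)
for value, change in changes.items():
    print(value, change)
```

Positive numbers correspond to the cells that increased and negative numbers to the cells that decreased; a value that tripled shows up as a 200% increase.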
I can do the same for moderate traffic as well. In this case, I think I'm adding both. I'm going to add this to the investigation. As you can see here, it seems that a pretty decent spike is coming from traffic condition equal to moderate, and that might well be one of the reasons why we are seeing some of these spikes in wait times. I'm going to go to the next step.

ThirdEye also gives you an event framework to work with. The whole point of RCA here is to take an anomaly and translate it to an event that caused it. So in this case, if you had a spike in wait time, what event could have caused this anomaly? ThirdEye tries to put a lot of this information in front of you so that you can make an informed decision. In this case, we have a whole bunch of events, which are just holidays coming in from different areas. I'm going to select all of them, and you can see that ThirdEye tries to find related events around the time period in which the anomaly occurred and plots them across time. So in this case, I see there was an Armed Forces Day, for example, which could have been one of the reasons why there was traffic, maybe in San Francisco. This is how ThirdEye tries to correlate different things and put them in one place so that you can make an informed decision.

I'm going to click Next, and you see that I have this space where I can save the investigation. In this case, I'll say that heavy or moderate traffic was the reason for the increase in wait time, and I can save this investigation. This is now a shareable entity that you can share with the team or with other analysts who might be interested in knowing what's going on, what your analysis was, and how you got to that conclusion.
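Finding events around the time of an anomaly, as in the event step of the investigation, comes down to an interval-overlap check. The Python sketch below shows that idea under stated assumptions: the event records, their dates, the anomaly window, and the one-day padding are all invented for illustration, and the function is not part of any ThirdEye API.

```python
from datetime import datetime, timedelta


def related_events(events, anomaly_start, anomaly_end,
                   padding=timedelta(days=1)):
    """Return events whose time range overlaps the anomaly window
    after widening the window by `padding` on both sides."""
    window_start = anomaly_start - padding
    window_end = anomaly_end + padding
    return [
        event for event in events
        if event["start"] <= window_end and event["end"] >= window_start
    ]


# Hypothetical holiday events and anomaly window (illustrative dates).
events = [
    {"name": "Mother's Day",
     "start": datetime(2023, 5, 14), "end": datetime(2023, 5, 15)},
    {"name": "Armed Forces Day",
     "start": datetime(2023, 5, 20), "end": datetime(2023, 5, 21)},
    {"name": "Memorial Day",
     "start": datetime(2023, 5, 29), "end": datetime(2023, 5, 30)},
]
anomaly_start = datetime(2023, 5, 20, 10)
anomaly_end = datetime(2023, 5, 20, 14)

matches = related_events(events, anomaly_start, anomaly_end)
print([event["name"] for event in matches])
```

An overlap only suggests a candidate explanation; deciding whether the event actually caused the anomaly is still left to the analyst, which is why the investigation is saved with a written conclusion.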
So this is broadly the workflow, from anomaly detection to RCA. Let me quickly go through another use case, which is more related to sensor data. In this case, I have an alert that is doing temperature analysis. Maybe not this one. Give me one second. Probably it is this one. All right.

So in this case, we can see the different devices. Again, this is not an alert working on the overall temperature, but rather on the temperature of each device ID. That is very interesting, because you can get much deeper insight into every device that is working. So why was the temperature of this particular device much higher than usual? Again, we can go through the same flow that we did earlier and get insight into which dimensions may be having an impact. In this case, we see a much smaller set of dimensions compared to our previous rideshare use case. But as you can see here, it seems like it's simply because a lot of devices started becoming more active, so that's probably one of the reasons why the overall temperature went up. I think the alert was set on a sum of temperature, so I assume that might be a reason.

In short, what ThirdEye is trying to do here is this: if you have a set of metrics and a set of dimensions associated with them, ThirdEye can do a deep drill-down and try to find all sorts of correlations and causations around them, so that you can make an informed decision about what your anomalies are related to, or which correlated events could be the reason for that particular incident. So, yeah, that's all I had in terms of the demo. Let me now share a few things about how ThirdEye works under the hood.
I want to touch on some of these concepts to illustrate that not everything is automatic; there is some work under the hood that needs to be done to make sure that your data set is performant and the anomaly detection goes smoothly. The way ThirdEye does all of this analysis is that it is actually pretty noisy in terms of querying: every minute or every five minutes, depending on the level of responsiveness you need from ThirdEye, it fetches the current value of the metric. And this can be expensive. So it's very important that the data is modeled right.

One of the things I would like to call out here is that this is where Apache Pinot really shines. Because it is a database designed for time series, it helps us give timely responses to the user for these queries, which are constantly monitoring your critical metrics. Another thing that is important is how your data is organized in Apache Pinot, or in any database of this kind, because how your schema and your data are set up is immensely important for getting performant results.

I picked a couple of examples here. If you have your timestamp as a string, for example, it is typically much more expensive in terms of compute to parse than something like a long. Keeping these factors in mind when you're modeling your data really adds up in making sure that your monitoring system is performant and able to live up to the requirements of your analysts. Time buckets are an interesting case as well. The idea here is that you often want to bucket time, say, rounding it to the nearest minute, five minutes, 15 minutes, hour, or day.
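To make the string-versus-long point concrete, here is a small Python sketch of time bucketing. With an epoch-millisecond long, rounding down to a bucket is a single integer operation; with a string timestamp, the same result first requires parsing. The function names and the timestamp format are assumptions for this example, and the sketch rounds down to the start of the bucket rather than to the nearest boundary.

```python
from datetime import datetime, timezone

FIVE_MINUTES_MS = 5 * 60 * 1000
TEXT_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed string layout, in UTC


def bucket_epoch_millis(ts_millis, bucket_millis):
    """Round an epoch-millisecond timestamp down to its bucket start."""
    return ts_millis - (ts_millis % bucket_millis)


def bucket_string_timestamp(ts_text, bucket_millis):
    """Same bucketing for a string timestamp: parse first, then round."""
    parsed = datetime.strptime(ts_text, TEXT_FORMAT).replace(tzinfo=timezone.utc)
    return bucket_epoch_millis(int(parsed.timestamp() * 1000), bucket_millis)


ts_millis = 1_700_000_123_000  # arbitrary example timestamp
ts_text = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc).strftime(TEXT_FORMAT)

from_long = bucket_epoch_millis(ts_millis, FIVE_MINUTES_MS)
from_text = bucket_string_timestamp(ts_text, FIVE_MINUTES_MS)
print(from_long, from_text)
```

Both paths give the same bucket, but the string path does extra work on every row. Storing the bucketed value as a derived column at ingestion time moves that work out of the query path entirely.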
There is a lot of translation that happens in queries, and when you're monitoring with a high-throughput query, aggregating all of this in real time gets difficult and can add up to costs at runtime as well. So having the right derived columns and the right data types is crucial for making sure that your alerts are performant.

Lastly, I'll add that certain kinds of systems are capable of handling aggregations in a much faster and more efficient way. The whole point here is to avoid expensive table scans, and the star-tree index is one such example. Given a configuration, the star-tree index is able to precompute a lot of these metrics under the hood and keep them prepared, so that when you ask for a certain set of queries, it's not a table scan anymore; it has all of these things precomputed and gives you a number right up front. This is extremely powerful when you're working with a large number of dimensions across a high-cardinality data set. And if you have a lot of rows getting ingested every minute, this becomes super important so that you're not reading a lot of data.

This is also how we're positioning ourselves, because ThirdEye has a high query volume and requires low latency: we want to be responsive to the user in reporting anomalies. We pair up really well with Pinot, which is a real-time time series database that is capable of handling these aggregations. It has a lot of indexing capabilities that ThirdEye is able to leverage to stay performant across the entire vertical. Another thing that we make a lot of use of is Pinot's ability to slice across different dimensions with things like the star-tree index.
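The pre-aggregation idea behind this can be sketched in Python. The toy below is not Pinot's star-tree data structure: it simply precomputes the sum for every combination of concrete dimension values and a wildcard, so that a filtered sum becomes a dictionary lookup instead of a scan over rows. All names, columns, and rows are hypothetical.

```python
from collections import defaultdict
from itertools import product

STAR = "*"  # stands for "any value" of a dimension


def build_preaggregates(rows, dimensions, metric):
    """Precompute SUM(metric) for every combination of concrete
    dimension values and STAR that occurs in the data."""
    cube = defaultdict(float)
    for row in rows:
        choices = [(row[d], STAR) for d in dimensions]
        for key in product(*choices):
            cube[key] += row[metric]
    return dict(cube)


def lookup_sum(cube, dimensions, filters):
    """Answer SUM(metric) for equality filters with one lookup."""
    key = tuple(filters.get(d, STAR) for d in dimensions)
    return cube.get(key, 0.0)


dimensions = ["city", "ride_type"]
rows = [  # illustrative data only
    {"city": "New York", "ride_type": "economy", "wait_time": 50.0},
    {"city": "New York", "ride_type": "premium", "wait_time": 30.0},
    {"city": "San Francisco", "ride_type": "economy", "wait_time": 16.0},
    {"city": "San Francisco", "ride_type": "shared", "wait_time": 4.0},
]

cube = build_preaggregates(rows, dimensions, "wait_time")
print(lookup_sum(cube, dimensions, {"city": "New York"}))
print(lookup_sum(cube, dimensions, {}))
```

This naive version grows exponentially with the number of dimensions, which is one reason a real index is built from a configuration that limits what gets precomputed rather than materializing every combination.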
ThirdEye builds on top of that capability to make sure that we don't just monitor things at an overall level, but are able to slice much deeper and give you insights much more quickly by drilling down and looking at exactly the right dimensions and their slices.

All right, let me talk a little bit about how we're detecting such anomalies. ThirdEye has a very sophisticated anomaly detection flow, and this is what we use to model alerts. The same flow that you saw modeling a single metric is the same architecture that we use to model multiple dimensions as well. This capability gives us a lot of benefit in terms of being able to model derived metrics, do runtime computations, and monitor a diverse set of use cases, giving you the results that you need without a lot of precomputation. ThirdEye is capable of handling a lot of this flexibility on its own.

We also have a solid notification framework. Currently we support email and Slack, and PagerDuty is coming soon. The main goal here is that once your anomalies are detected, we want to present them to you in a way that's consumable, and notifications are designed for exactly that. We have a solid notifications flow that consolidates all the reports properly and feeds them to you in whatever way you prefer. You can also build your own integrations with webhooks, but this framework prepares and sends all of these reports to you live, in real time.

ThirdEye is obviously API-first. We do this for pretty much everything at StarTree, and the idea is to be able to build applications on top seamlessly, and this gives you a lot of scope for doing that.
So pretty much everything that we have shown you in the UI can easily be integrated through an API provided by ThirdEye. We are cloud-ready as well. You can deploy us on Kubernetes; we have a Helm chart, and we are available on GitHub. We have a community edition, which is available at GitHub.com slash Startree slash Thirdeye. Feel free to try it out and let us know your feedback. Most of the tools that we have shared here are available in the community edition, so there should be no problem reproducing this. And that's pretty much it. Here are the links as well. Feel free to try these out, join the StarTree community, or try the community edition. If you want to be a little more hands-off and want things delivered to you as a whole package, the Enterprise version is also available for you to play with. You can go ahead and use the self-serve trial. With that, I will wrap up. Thank you so much, and happy holidays.

Madhumita Mantri

Product Lead @ StarTree

Suvodeep Pyne

Founding Engineer and Lead @ StarTree
