Transcript
This transcript was autogenerated. To make changes, submit a PR.
Good morning, good afternoon, good evening,
wherever you are in our virtual world. My name
is Ron Dagdag, and I will be talking about developing
spidey senses: anomaly detection for
JavaScript applications. Let's get started.
So what is this spidey sense?
Most likely you've heard about Spider-Man. It's that tingling
sensation at the back of Peter Parker's skull
that gives him the ability to sense and
react to danger. It increases his ability to
figure out and detect clones,
navigate when he is impaired and can't see anything,
and find secret passageways and different
hidden and lost objects. It actually helps him fire his
web shooters and swing instinctively. And I
think it helps him change into his costume.
And the really amazing
part is the real
spidey sense that
spiders have. It's called hyperawareness.
They have these long, thin hairs called trichobothria
that actually allow them to detect low-level vibrations,
even from sound. And that's the interesting
part of it: they can even detect
insects up to 3 meters away because of
that. And every time I see spiders and all these
different hairs,
you feel different. It gives you the heebie-jeebies.
Yes. And then of course, if you're a
new web developer or JavaScript developer
and you're still starting out and trying to cast your
web out there in the World Wide Web and
there's nothing coming out, it's okay. We're here to help
you understand what this anomaly detection is.
Okay, what is this spidey sense? It's that gut
feel, that vibe or
intuition that you learn through time. Right? You learn from the past.
As a developer, being a developer
for more than 20 years now, you get that sense
or feeling about a project, whether it can
become successful or not. I guess you learn it through time and
you learn it through different
experiences from the past. And you kind of build
that intuition about what the technology can deliver in terms of
requirements and those things. So today
we'll be talking about what anomaly
detection is and what time series data is.
And we'll do anomaly detection specifically for time series.
And then we'll do some demos and some takeaways.
Okay, let's get started.
What is anomaly detection? It is identifying
unexpected items and events
which are different from what is normal. It's something
weird, like the pandemic, right?
It's not something that we're used to. I guess we're getting used to it by
now, so it's becoming the new normal, right?
Sometimes an anomaly is called an outlier. The assumptions
are that anomalies rarely occur in your data set
and that their features differ from the normal instances
significantly.
So there are two causes of outliers: either
an artificial, non-natural cause or a natural
cause. One
cause could be data entry errors. Think about
100,000 versus 1 million. That extra zero
makes a whole lot of difference, and that is an outlier.
Measurement errors, which are very common. Experimental
errors: when you start late in
the sprint, you start collecting late in the sprint, even though
you're supposed to collect the whole data
set at certain
intervals. Intentional outliers:
one good example of that is if you ask high
school students or college students about their
consumption of alcohol, most likely they will underreport
it, for one reason or another. So it depends on how you collect your
data. Data processing errors are where
you extract from one service
to put it in another service, or take one
data set and pass it to another data set. Sometimes you may encounter
extraction errors that may cause outliers.
Sampling errors: in this case, you're trying
to report the height of all the athletes, but most of
your data set is basketball players. So your data would get
skewed, and that may cause some outliers. And
of course there are natural outliers, when it is
not artificial and it wasn't
caused by data processing or
data collection. So at the end of the day,
you have an input data stream. This area
right here is what you're trying to detect. These are
good data. And of course you have some defective data right there, and then
you're trying to analyze if this data is anomalous
or not. So most of
the time it's good data, but sometimes you would
encounter defective data that you haven't really figured out is defective,
so it will go to this column. But hopefully,
most of the time you'll be able to detect the things that are defective.
And sometimes it is not really defective, but you
would still catch it here. So it just depends on how
you would implement this anomaly detector.
Sometimes finding anomalies in a data set
is kind of like finding a needle in a haystack.
If the needle is that big, yeah, that is easy.
Most of the time it's not that big.
So what are the different methods for doing anomaly
detection?
Sometimes it could be rule-based systems, sometimes it's statistical
techniques, and sometimes you would use machine learning.
We'll go through each one of these. Rule-based systems:
most likely you already know that this is where you specify
the specific rules and assign
thresholds or limits. For example,
at a certain temperature level, you want to
alert when it reaches a certain threshold or
certain limits that you set up, or if it
goes below a certain value, send an alert.
The advantage, or in a way both the advantage and
the disadvantage, is that it requires the experience of
an industry expert to detect known anomalies.
So you have to interview people and ask, what do you think are
the possible problems that cause this
type of issue? And if it goes past a certain threshold,
then we can alert for certain conditions.
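A minimal sketch of the rule-based idea described here, assuming a room temperature sensor. The limits (30 and 10 degrees Celsius) and the rule names are hypothetical placeholders for whatever an industry expert would supply; they are not from the talk.

```javascript
// Hypothetical rules: limits an expert might give for a room temperature sensor.
const rules = [
  { name: "too hot", test: (celsius) => celsius > 30 },
  { name: "too cold", test: (celsius) => celsius < 10 },
];

// Returns the names of every rule the reading violates (empty array = normal).
function checkRules(reading, ruleSet = rules) {
  return ruleSet.filter((rule) => rule.test(reading)).map((rule) => rule.name);
}

console.log(checkRules(34.2)); // one alert: "too hot"
console.log(checkRules(21));   // no alerts
```

Because the limits are hard-coded, they have to be edited by hand whenever the normal pattern changes.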
Right. The disadvantage of rule-based
systems is that they do not adapt as patterns
change. Once you set up the formula to calculate
and set up the rules, it will not
adapt, because you have to go in and modify that
logic again. And of course, it requires data
labeling, knowing that, okay, this data set is anomalous
and this data set is not anomalous. Next,
statistical techniques.
It's where you can flag the data points
that deviate from common statistical properties. So this
is where you calculate the mean, the median, or quantiles,
or in some other cases you figure out rolling averages or
moving averages. If you're buying stocks,
or buying and selling
securities, those kinds of things, it gives you the moving averages. You use a lot of
statistical techniques to identify if
something is out of the ordinary, and of course the trend, where it's
going. Right. You can also sometimes have
simple moving averages, which are sometimes called low-pass
filters. One good example of that is Kalman filters.
There's a formula specifically for that. Sometimes it's
histogram-based outlier detection that can be implemented. The good
thing, the advantage of statistical techniques, is that they're more
interpretable and sometimes more useful than machine
learning methods. It's easy to explain to someone,
to one of the bosses: this is the formula I use. I use the mean
and the median, and this is how we detect anomalies
that way.
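One way the moving-average idea could look in code: a sketch that flags a value when it sits more than k standard deviations away from the mean of the previous few values. The window size of 5, the multiplier of 3, and the sample readings are assumptions chosen for illustration.

```javascript
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function standardDeviation(values) {
  const m = mean(values);
  return Math.sqrt(mean(values.map((v) => (v - m) ** 2)));
}

// Flags indexes whose value is more than `k` standard deviations away from
// the mean of the previous `windowSize` values (a trailing moving average).
function rollingAnomalies(values, windowSize = 5, k = 3) {
  const flagged = [];
  for (let i = windowSize; i < values.length; i++) {
    const recent = values.slice(i - windowSize, i);
    const m = mean(recent);
    const sd = standardDeviation(recent);
    if (sd > 0 && Math.abs(values[i] - m) > k * sd) flagged.push(i);
  }
  return flagged;
}

const readings = [20, 21, 20, 22, 21, 20, 21, 35, 21, 20];
console.log(rollingAnomalies(readings)); // flags index 7, the value 35
```

The rule is easy to explain: "more than three standard deviations from the recent average".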
So for machine learning methods, you can
do anomaly detection as supervised,
unsupervised, or self-supervised. Supervised
is more like decision trees.
For unsupervised, we would talk about k-means and
hierarchical clustering. Self-supervised is
when we start talking about autoencoders. We're not
going to cover the formulas for each one.
I'm just showing you the different ways you could apply machine
learning methods here. But we
want to know when to use anomaly detection
versus supervised learning. For anomaly detection
in machine learning: when you have a very small number of
positive examples and a very large number of
negative examples, you would use anomaly detection techniques.
If it's supervised learning, most likely you have a large number
of both positive and negative examples,
and you have enough positive examples for the
algorithm to learn from.
For the anomaly detection type,
sometimes it's hard to learn from the positive
examples, as compared to supervised learning.
And sometimes the anomalies have not been discovered
yet. So in that case you want to use anomaly detection
techniques
rather than supervised learning. Because supervised learning
assumes that future positive examples are likely to be
similar to the ones in your training set, and if they are not, it might not
know how to detect them.
So when would you use anomaly detection techniques?
If you're doing fraud detection. In manufacturing, engines
or machinery have a certain routine,
or a machine just goes through its cycles, and if it's out
of that cycle, that's when you know it's anomalous.
When you're trying to monitor data centers, that would
be a good use case for anomaly detection. And the Internet of Things,
which I will explain a little bit more later. For supervised
learning: email spam classification. Why is
that? Because there are a lot of good examples of what spam and not
spam are. You're detecting a certain type
of email, and you can use
those examples for supervised learning. Weather prediction:
there are specific criteria for weather
to identify it, and that would be a good example for using
supervised learning. And cancer classification:
because an expert already knows what they're looking
for, specific cancer cells and those
kinds of things, it might make
sense to use supervised learning rather than anomaly
detection. So for machine learning,
sometimes it could be density-based anomaly detection,
where you cluster the
data set based on k-nearest neighbors.
The normal data points occur in a dense neighborhood,
so that means they're closer to each other, and anything that's outside of it
is anomalous, because it's
not close to the center of your data set.
Clustering-based: the assumption is
that data points that are similar tend to belong to clusters, measured from
the local centroid,
and then anything outside of that, anything farther away,
can be detected as anomalous.
You can also use a Gaussian distribution, where you calculate,
for any given data point, the probability
of that data point being
normal. These are all the normal ones in terms of the Gaussian distribution right here.
Anything outside of that, very far away, would be
considered anomalous, or an outlier. Support
vector machine based anomaly detection,
that's also a good technique. At the end of the day, what it's trying to
do is split your data into two.
On this side
are your normal data; anything outside of that line is most likely
anomalous. Those are some of the ways you would
do different anomaly detection techniques.
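As an illustration of the Gaussian approach mentioned here, the sketch below fits a mean and variance to one-dimensional training values and flags a point when its probability density falls below a cutoff. The cutoff `epsilon = 0.01` and the sample values are assumptions; in practice the cutoff would be tuned against known examples.

```javascript
// Fit a one-dimensional Gaussian (mean and variance) to the training values.
// Assumes the values are not all identical, so the variance is above zero.
function fitGaussian(values) {
  const mu = values.reduce((s, v) => s + v, 0) / values.length;
  const variance = values.reduce((s, v) => s + (v - mu) ** 2, 0) / values.length;
  return { mu, variance };
}

// Probability density of x under the fitted Gaussian.
function density(x, { mu, variance }) {
  return Math.exp(-((x - mu) ** 2) / (2 * variance)) / Math.sqrt(2 * Math.PI * variance);
}

// A point is flagged when its density falls below epsilon.
function isAnomalous(x, model, epsilon = 0.01) {
  return density(x, model) < epsilon;
}

const model = fitGaussian([20, 21, 19, 22, 20, 21, 20, 19]);
console.log(isAnomalous(20.5, model)); // false: close to the mean
console.log(isAnomalous(30, model));   // true: far out in the tail
```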
All right, so let's try to do a simple anomaly detection,
and we're going to focus on JavaScript. So let
me try to pull in my data set.
Right now I'm using this
Jupyter notebook, and under the Jupyter notebook I'm
running a TypeScript application. I think
this is more of a JavaScript application right here.
The reason I'm showing this is so I can execute line
by line and show you the results. In
this case I'm using stats-analysis,
which is an npm package, and I have this array of
numbers right here. What I would like
to do is filter out the outliers in this
and just keep the ones that are normal. In a way they cluster
together, right? They're kind of close to each other, and these are so
far away from the rest of the data set
that they are considered outliers. So I run
it this way, and it gives you the results. The result here
is that all the outliers are taken away,
leaving just the good data. So that's the simplest
explanation, the simplest code that I could find
to start using outlier
detection in our JavaScript application.
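The notebook code is not in the transcript, so here is a self-contained stand-in for the same idea rather than the stats-analysis package's own API. It filters an array using the median absolute deviation (MAD); the threshold of 3 and the sample numbers are assumptions.

```javascript
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Keeps values whose distance from the median is at most `threshold`
// scaled median absolute deviations (MAD).
function filterOutliers(values, threshold = 3) {
  const med = median(values);
  const mad = median(values.map((v) => Math.abs(v - med)));
  // No spread to measure against, so nothing is removed.
  if (mad === 0) return [...values];
  return values.filter((v) => Math.abs(v - med) / (1.4826 * mad) <= threshold);
}

const data = [12, 14, 13, 15, 14, 13, 200, 12, 14, -90];
console.log(filterOutliers(data)); // 200 and -90 are dropped
```

The median is used instead of the mean because extreme values such as 200 barely move it.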
Okay, let's go back to the presentation.
Okay, so let's talk about time series
data. Time series data is
a series of data points indexed in time order.
Good examples of that are logs, stock market data,
sales data, or sensor-related data.
At the end of the day, what we're talking about here is any data
captured with a timestamp. So you have your timestamp and
then the value, timestamp and value, timestamp and value,
right? And you can have multiple values,
or whatever those values are, as long as they're
indexed against time. Most likely
this is very common, because if you start looking at log files,
you'll see it's all time series based.
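To make the timestamp-and-value shape concrete, this sketch parses log lines into points and checks that they are in time order. The log format (an ISO timestamp, whitespace, then a number) and the sample values are hypothetical.

```javascript
// Parse hypothetical log lines of the form "<ISO timestamp> <value>" into points.
function parseLogLine(line) {
  const [timestamp, rawValue] = line.trim().split(/\s+/);
  return { timestamp: new Date(timestamp), value: Number(rawValue) };
}

// A series is in time order when no timestamp is earlier than the one before it.
function isTimeOrdered(series) {
  return series.every((p, i) => i === 0 || series[i - 1].timestamp <= p.timestamp);
}

const series = [
  "2021-06-01T10:00:00Z 33.9",
  "2021-06-01T10:01:00Z 34.1",
  "2021-06-01T10:02:00Z 34.3",
].map(parseLogLine);

console.log(isTimeOrdered(series)); // true
```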
Of course, the Internet of Things has a lot
of time series data, because whatever data
you collect from sensors is from a specific
time, right? So the Internet
of Things is happening because you have increased data volume, since you can start collecting data
from these sensors. The sensors are getting cheaper, and of course there
is increased data speed, meaning the networking
to collect this data and send it to the cloud or get it processed
makes it possible. It's very
important that the data that
you're collecting from these sensors is moving very fast.
But failures happen, and these systems
are becoming more and more critical day to day.
Right. Tell me about that.
Because sometimes whenever our Google Home
or Alexa device is down,
we have trouble turning off the TV, and we have to find that
remote again, the remote control, those kinds of things,
little things here and there. But it's becoming critical in
our household. So whenever it's the Internet
of broken things, it feels something like this. You're trying to debug
what actually happened on that data stream that
you are receiving.
So there are different time series anomaly
types. It could be an outlier, a spike
and level shift, a pattern change, or seasonality.
And we'll go through each one of these.
An outlier would look something like this, right? You have
your data set, your time series data
through time as you received it, and of course the values of each
point. And then there is a spike here, an
outlier, and this is out of
the ordinary. So this
is what you want to detect. It could be a spike
and level shift. One good example of this:
the data goes along at this level, and then suddenly
it shifts up. What happened? Sometimes you
want to detect this area right here, where
you're detecting that spike. And of course, a downward level shift
is also possible.
Notice how the data is flowing through like this,
and now it's lower. Why did that level
change? Pattern changes
look something like this. The way I kind of imagine
this is that you're watering your
garden and there's a specific flow of water
as it flows
out of the hose. And suddenly someone steps on it or
there's a kink in the hose, and then suddenly the water just
slows down. And you want to know when that happened,
where it happened, those kinds of things. So you're trying to detect
pattern changes because of that.
And, of course, seasonality. You have to consider that too
whenever you're detecting anomalies.
If you think about it, at certain times of the year
there's seasonality. Around
summertime, of course,
ice cream sales are higher compared to the
winter months. There's also, here
in the United States, when we have football
season or around the Super Bowl,
pizza sales are higher compared to any other time.
Everyone wants to watch their
favorite game, those kinds of things. So you have to consider
that as part of your data set and identify if there's seasonality
around that.
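A level shift differs from a single spike in that the values stay at the new level. One simple way to sketch that, which is not how the Anomaly Detector service does it, is to compare the median of a window before each point with the median of a window starting at that point. The window size of 4 and the minimum shift of 5 are assumptions.

```javascript
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Returns the index where the median of the next `windowSize` values differs most
// from the median of the previous `windowSize` values, or -1 when no difference
// exceeds `minShift`. Medians keep a single spike from looking like a shift.
function findLevelShift(values, windowSize = 4, minShift = 5) {
  let bestIndex = -1;
  let bestShift = minShift;
  for (let i = windowSize; i + windowSize <= values.length; i++) {
    const before = median(values.slice(i - windowSize, i));
    const after = median(values.slice(i, i + windowSize));
    const shift = Math.abs(after - before);
    if (shift > bestShift) {
      bestShift = shift;
      bestIndex = i;
    }
  }
  return bestIndex;
}

const flow = [10, 11, 10, 11, 10, 11, 20, 21, 20, 21, 20, 21];
console.log(findLevelShift(flow)); // 6, where the level moves from about 10 to about 20
```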
So what you're trying to do here in
terms of time series is to detect these types of instances
where it's out of the ordinary. So this is the pattern,
and suddenly this data is outside of the pattern that you
would expect, and this one too. Through time
you have the series, and based
on this data set you identify if the last point is
an anomaly or not. So it depends on how far off it is, and
you have to specify the sensitivity, how sensitive
you want to be in triggering an anomaly.
Okay, so far
what I've been talking about is called univariate, where
you have one variable
in your time series data set. But there's
also the concept of multivariate, where you have
different time series data and you're trying
to identify if this one is out of the
ordinary or this one is out of the ordinary. This is more
complex to implement compared to univariate,
so we're going to focus on univariate today. But I just
want to let you know that sometimes, depending on what the
needs are, you might need to implement a multivariate
system.
Okay. Azure Cognitive Services
is AI for every developer, without
the need for machine learning
expertise. At the end of the day, what it is, it's an API
call. Each Azure Cognitive Service has different
capabilities, and today we're focusing on the decision
capability. There's this Anomaly Detector
right here, which identifies potential problems
early on, so it's more about decision
making. So we're going to focus on the
Anomaly Detector. Anomaly Detector
can detect anomalies as they occur in real time,
and you can also detect anomalies as a batch.
So you have a choice when you pass your data to this API:
do you want it in real time or do you want it as a batch?
It automatically adapts and learns
from new data sets, and you can fine-tune its sensitivity
for detecting anomalies. So there are settings
that you can adjust. These are REST APIs. It does
not require machine learning expertise and it does
not need labeled data. That's the crazy part about
this, because you don't need to send training data.
You just call the API, send your data, and it will detect
anomalies based on a time series data set.
It automatically identifies and applies the best-fitting
model for you at the back end. And it actually has
this gallery of algorithms, and a lot of
these I do not know how to implement. Sometimes it's using
the Fourier transform, which is kind of like in the computer vision
side. You would do extremes, all these
different algorithms that it has implemented. But the
interesting part of the Anomaly Detector is that it
classifies what type of algorithms it's
going to use. So if it figures out your data
set has some seasonality in it, it will use the algorithms
related to seasonality. If it has coarse
granularity without seasonality, it will use a different set of
algorithms. And it's doing this every time you call the
API. So that's the interesting part. It's trying out different
algorithms all at the same time too.
There are some limitations on how you would use the Anomaly
Detector API. The data granularity
is either daily, hourly, minutely,
monthly, weekly, or yearly.
And the series of
data points that you have to pass in looks something like this,
where it says series in this JSON, where
you have the timestamp and the value. The minimum
is 12 items, so 12 in this array,
and the maximum is 8640. And you
specify that granularity. The interesting part
is that if you want every five minutes, you have to specify this custom
interval so that it knows that, hey,
this is every five minutes.
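A sketch of how a request body with these limits might be assembled. The field names `series`, `timestamp`, `value`, `granularity`, and `customInterval` follow the Anomaly Detector v1.0 REST request format as I understand it; the helper name `buildDetectRequest` and the sample readings are made up for illustration.

```javascript
const MIN_POINTS = 12;
const MAX_POINTS = 8640;

// Builds the JSON body for a detect request from an array of
// { timestamp: Date, value: number } points.
// For readings every five minutes: granularity "minutely" with customInterval 5.
function buildDetectRequest(points, granularity, customInterval) {
  if (points.length < MIN_POINTS || points.length > MAX_POINTS) {
    throw new RangeError(`series needs ${MIN_POINTS} to ${MAX_POINTS} points, got ${points.length}`);
  }
  const body = {
    series: points.map((p) => ({ timestamp: p.timestamp.toISOString(), value: p.value })),
    granularity,
  };
  if (customInterval !== undefined) body.customInterval = customInterval;
  return body;
}

// Twelve made-up readings, five minutes apart.
const start = Date.UTC(2021, 5, 1, 10, 0, 0);
const points = Array.from({ length: 12 }, (_, i) => ({
  timestamp: new Date(start + i * 5 * 60 * 1000),
  value: 34 + i * 0.01,
}));
const body = buildDetectRequest(points, "minutely", 5);
console.log(body.series.length, body.granularity, body.customInterval); // 12 minutely 5
```

Checking the point count before calling the service avoids a round trip that is known to fail.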
Okay, so there are two ways
you can call the Anomaly Detector API. It's either
through a client SDK, for C#, Python, or Node, and today I'm going
to demo how to use the Node client SDK,
or it's through the REST API, so it can support any language
as long as you can make HTTP or REST calls.
So let's start with our demo.
So what I have here,
this Jupyter notebook right here,
is actually running on one of my
Raspberry Pis right here. And this Raspberry Pi
has this Sense HAT, so I can
get temperature data of the room, and it also has
some LED pixels so I can display
whether the data that we've collected is anomalous.
And then we display something here.
Okay, so before we start, I can show you
the package.json that I'm using for this. In
order to call Anomaly Detector, there's an npm package,
@azure/ai-anomaly-detector,
and of course @azure/ms-rest-js, which we would need.
dotenv allows us to read
environment variables. Then there's node-sense-hat,
which allows me to talk to the
Raspberry Pi HAT. It's called the Sense HAT.
And then this is the stats-analysis package I demoed a few minutes
ago. Okay, let's look at
this Sense HAT right here. What we'll do is,
I'm going to clear all my outputs.
Not yet. Well, I just wanted to
show you how I ran it a while ago. Right here,
see how I'm running it
with this TypeScript kernel. I'm actually using tslab
to be able to have TypeScript and
JavaScript running in Jupyter
notebooks. Right here is the version of tslab I'm using.
So this one right here is node-sense-hat.
I would like to get the LEDs on that matrix.
And then I wanted to read some
data, like that acceleration data. So I'll show you
what the output looks like right here.
Let me try to run that. Notice how the acceleration
data looks something like this. It reads it. And
in this case, I was able to get the temperature
from my Raspberry Pi right here and
display that value. And then I
went through here and collected it.
What I'm doing here is I read every minute, and every
minute I push it into an array, and
after that I have something like this.
So I have this value with this timestamp.
This is the time series data
that I collected.
So this is where I was running it. I wanted to
get it every minute and then make
it look something like this.
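The collection loop itself is not shown in the transcript, so this is only a sketch of the pattern described: sample once a minute and push a timestamped value into an array. `readTemperature` is a hypothetical function standing in for the actual Sense HAT call, and the cap of 8640 points simply mirrors the API maximum mentioned earlier.

```javascript
// Appends one { timestamp, value } point and keeps only the newest `maxPoints`.
function recordReading(series, value, now = new Date(), maxPoints = 8640) {
  series.push({ timestamp: now, value });
  while (series.length > maxPoints) series.shift();
  return series;
}

// Starts sampling once a minute. `readTemperature` is a hypothetical function
// that returns the current sensor value; it stands in for the Sense HAT call.
function startCollecting(readTemperature, series = [], intervalMs = 60000) {
  const timer = setInterval(() => recordReading(series, readTemperature()), intervalMs);
  return { series, stop: () => clearInterval(timer) };
}

// Recording two readings by hand, one minute apart:
const collected = [];
recordReading(collected, 34.25, new Date("2021-06-01T10:00:00Z"));
recordReading(collected, 34.275, new Date("2021-06-01T10:01:00Z"));
console.log(collected.length); // 2
```

Calling `stop()` on the object returned by `startCollecting` clears the timer so the process can exit.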
So once I've got my time series data, now it's
time to process it and send it to Anomaly Detector.
That requires me to use this
@azure/ai-anomaly-detector client SDK. I need
@azure/core-auth to be able to pass the credentials.
Before I can do this, before I can use
Anomaly Detector, I need to create an
instance of Anomaly Detector through
the Azure CLI. These are the commands I ran
to create the resource group and the Cognitive Services
instance, and then to get the keys. So there
are two things that you need in order to call the API.
You need the endpoint, meaning the URL
where you would send the call,
and you also need the access key, or API key. So that's what
I'm doing here. I have those in this config,
this environment file, that
I just loaded into memory.
So in order to call the Anomaly Detector
API, you need to use this AnomalyDetectorClient.
You specify the endpoint and then you
pass the key to this AzureKeyCredential,
and it gives you this AnomalyDetectorClient.
Once you have that client, you can
pass things to it. This one right here,
what it's doing is sending a
data set, right? And it's
detecting on the last entry of that data set.
So I have to send a certain set of data,
a time series data set, and that's why I'm
putting this into the body, and
I'm specifying that my data set
is every minute.
And what this one does is
give me a response saying whether the last
item on my list is anomalous or
not. So it would say true if it's anomalous or
false if it's not anomalous. So you can
actually run this.
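The demo uses the Node client SDK. As a dependency-free alternative, the same last-point check can be sketched against the REST API. The URL path and the `Ocp-Apim-Subscription-Key` header are my understanding of the v1.0 REST interface and should be checked against the service documentation; the function names are made up, and `fetchImpl` defaults to the global `fetch` found in Node 18+ and browsers.

```javascript
// Builds the URL and fetch options for a last-point detection request.
function buildLastPointRequest(endpoint, apiKey, body) {
  const base = endpoint.replace(/\/+$/, ""); // drop any trailing slash
  return {
    url: `${base}/anomalydetector/v1.0/timeseries/last/detect`,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": apiKey,
      },
      body: JSON.stringify(body),
    },
  };
}

// Sends the request and resolves to true or false for the last point.
async function isLastPointAnomalous(endpoint, apiKey, body, fetchImpl = fetch) {
  const { url, options } = buildLastPointRequest(endpoint, apiKey, body);
  const response = await fetchImpl(url, options);
  if (!response.ok) throw new Error(`Anomaly Detector returned HTTP ${response.status}`);
  const result = await response.json();
  return result.isAnomaly;
}
```

The endpoint and key would come from the environment file described earlier, for example `isLastPointAnomalous(process.env.ENDPOINT, process.env.KEY, body)`, where the two variable names are placeholders.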
Of course, the important part is to run this first,
right? Initialize the AnomalyDetectorClient.
Now I can call it right here,
and that's what I did. So it tells me right here that the last point
on my list, which is row 15, is not detected as an
anomaly. And then,
what I did here
is I'm creating a new array. This one's called
new points. What I want to do is
take the last item. This is the last item
on my list, right? So 34.275.
And I just want to force it to be anomalous,
right? So in this case it's going to be 134 instead of 34.
So now my new points look something like this,
where this one is normal and
this one is outside of normal, abnormal.
So this one should be detected as anomalous.
And this one right here is
just some constant that I want to pass in to,
what do you call these, let's go back there, to
my LEDs. I want to put an X
if it's detected as anomalous.
Okay, let's go back, and I would like to show you
what that would look like. Let me try to set it up real quick.
I want to make sure that you can actually see what
it's going to do.
So let's try to run this one again.
Come on,
set it up. See if we can fit all
that data. So when I run this,
if the last item on my list is anomalous, I
set the pixels to a cross. So this one would have an X on it.
And let's see what happens. Boom.
There it is. Well, it's kind of hard to see, but there's a letter.
The LEDs right there are a little bit too bright, if
you ask me. It has an X.
That means there is an anomaly there. Let me clear that up.
And there you go. So it kind of cleared it. Okay.
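The X on the display can be sketched as plain data: the Sense HAT LED matrix is 8 by 8, so a cross is a list of 64 colors with both diagonals lit. How the pixels are pushed to the hardware depends on the LED library in use and is not shown; the color choice and function names here are assumptions.

```javascript
const RED = [255, 0, 0];
const OFF = [0, 0, 0];

// Builds a size x size grid, flattened row by row, with both diagonals lit.
function crossPattern(size = 8, on = RED, off = OFF) {
  const pixels = [];
  for (let row = 0; row < size; row++) {
    for (let col = 0; col < size; col++) {
      const onDiagonal = row === col || row + col === size - 1;
      pixels.push(onDiagonal ? on : off);
    }
  }
  return pixels;
}

// Chooses what to display for a detection result: an X, or a blank screen.
function pixelsFor(isAnomaly) {
  return isAnomaly ? crossPattern() : crossPattern(8, OFF, OFF);
}

console.log(pixelsFor(true).length); // 64
```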
Isn't that cool? What just happened? What we did was
read data from
our sensor. Right here, I'm using JavaScript to read data
from the temperature sensor of this
Raspberry Pi. Then I collected it into an array.
I used the Anomaly Detector API
to send the data set that I collected, and it
gave me a result that says the last item on my list is
anomalous. And then I sent an alert that
says, hey, there's something wrong with my data set, and set the pixels
on this Raspberry Pi to show an X,
and then I cleared it out. Cool.
All right, so let's go back to the presentation.
So where can you use the Anomaly Detector API? It has
C#, JavaScript, and Python SDK clients.
There are Docker containers. You can actually integrate
it with Power BI, or with Azure Databricks if you
want streaming data. So there are a lot of use
cases where you could integrate Anomaly
Detector. So where
can... we already talked about that. Those are just different links.
The cool thing is that there are Docker containers, so you can easily
integrate it into your application
too and run it at the edge. There is
also another Azure Cognitive Service called Metrics Advisor.
Metrics Advisor specifically has
a web portal where you can diagnose anomalies
and get help with root cause analysis. It's more of a
software as a service application where
you can collect time series data from different data sources and
detect anomalies from there, and then you can configure
it to send alerts, and it will help you find the root
cause of that issue.
All right, so the best superpower
that you can give to your project is anomaly
detection, which is sometimes called spidey sense.
So if you're interested in learning more about
what I did today, if you want to get the
code, this is the GitHub link where you can
get and download the code.
So just to recap: what is anomaly detection? It is the process
of identifying unexpected items or events
in our data set. What is time series data?
It's a series of data points indexed
in time order. And then today I demonstrated
the Anomaly Detector API.
It's an API to detect anomalies that automatically
adapts and learns from new data sets without
needing training data.
Cool. If you're interested in learning more about me, my name is
Ron Dagdag. I'm a lead software engineer at Spacee.
I'm a fifth year Microsoft MVP awardee.
The best way to contact me is through Twitter,
@rondagdag, or LinkedIn. Connect with me through LinkedIn,
Ron Dagdag. Thanks for geeking out with
me about spidey senses and anomaly detection.
And now that you've been bitten by this virtual
spider, feel free to test out your new
superpowers that you just learned today.
Thank you very much. I appreciate your time and
have a good day.