Transcript
Hi everyone, welcome to Monitoring AI Pipelines as a Product. My name is Hila Fox and I'm a squad leader and a backend developer at Augury, currently leading a squad that is responsible for taking the AI insights from the engine and distributing them to different end products.

So let's talk about the agenda. We're going to talk about Augury as a company and the product. We're going to talk about machine health AI and how we do it, the detection management layer and why it's important, monitoring and what we want to achieve with it, the hybrid approach we took, and being proactive. And at the end, the conclusions.

First of all, let's talk about Augury. Augury is a ten-year-old startup in the machine health business, which is in the manufacturing industry. Like it says on the slide, the world runs on machines and we're on a mission to make them reliable. We do this by giving our customers a full SaaS solution that includes avoiding unplanned downtime, aggregated insights, and a line of expert vibration analysts. This helps our customers reach resilient production lines.
Something we saw was very important during COVID: with the increase in demand, and also having no or fewer people on site, this became very, very important. We have a lot of customers which are enterprise companies, like P&G, Frito-Lay, Pfizer and Roseburg. And even though it's not written here on the slide, we also have Heineken and Essity, which keep our beer and our toilet paper coming even in these hard days. We are operating in the US, in Israel and in Europe, and we are expanding.
So how does it work? Augury's main flow starts from our IoT devices. We make our own IoT devices, and these are sensors. We have three types of sensors in these devices: vibration, temperature and magnetic field. And we monitor the machines 24/7. Once the data is recorded, we pass it on to the cloud, into our AI engine, to get diagnosed. Alongside this AI engine, we also have a line of expert vibration analysts to give an even more precise diagnosis. Once we have the diagnosis, we need to visualize it and communicate it to the customers. And we do this via web, mobile, emails, SMS and more.
So let's just see how it looks. We can see here some manufacturing plant. We have here three and a half pumps. The one at the back is a bit cut off, but we have four pumps here. Each of the pumps has our sensors installed on it, and the sensors are communicating with what you can see at the upper left corner. This is also a device we develop; it's called a node. What it does is communicate with the sensors over Bluetooth, aggregate the information and send it to the cloud. This is a snapshot of one of the machines, and we can see that on each machine we have four sensors, two sensors for each component. In this picture, we can see one component, which is a motor, and another component, which is a driven pump.

So, machine health AI:
we talked about how we install on the machines and
how we collect the data. Let's talk about the AI and the complexity
in it. First of all, some numbers, because numbers help us understand the amount of data that we have and the complexity we are tackling. So we have over 50 million hours of machine monitoring and tens of thousands of machines diagnosed, around 80,000, across multiple customers, with global expansion. We also have dozens of machine learning algorithms, which are based on time series, deep neural networks, MLPs, decision trees, and more. On top of that, we have three product squads, which are developing the customer-facing products that use these insights, and also three algo squads and a data engineering team that work on the AI engine itself.
So let's take a deeper dive into the whole AI flow. I'm not going to bother you with the specific calculations, but for each machine we collect 1.3 points per hour and send them to the cloud. In the cloud, we first reach the transformation process. The transformation process is built from a validity algorithm, and afterwards it calibrates the data from electrical units to real physical units like acceleration, velocity and more. After we have calibrated our data, we pass it on to our feature extraction pipeline, which is also a model. This is actually a dimensionality reduction technique to capture the essential parameters for machine health. We collect roughly around 1,000 features per hour per machine, save them, and pass them on to, of course, another machine learning algorithm, which is time series. In these time series algorithms, we collect and calculate features that are relative to themselves.
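As a rough illustration of the calibration and feature extraction steps just described, here is a minimal Python sketch. The function names, the sensitivity constant, and the specific features are assumptions made for illustration, not Augury's actual pipeline.

```python
import numpy as np

# Hypothetical sketch of the transformation + feature extraction steps.
# Names, units, and features are illustrative assumptions only.

def calibrate(raw_samples: np.ndarray, sensitivity_mv_per_g: float) -> np.ndarray:
    """Convert raw electrical readings (millivolts) into acceleration in g."""
    return raw_samples / sensitivity_mv_per_g

def extract_features(acceleration: np.ndarray, sample_rate_hz: float) -> dict:
    """Reduce a raw vibration window to a few health-related features."""
    velocity = np.cumsum(acceleration) / sample_rate_hz          # crude integration
    spectrum = np.abs(np.fft.rfft(acceleration))
    return {
        "rms_acceleration": float(np.sqrt(np.mean(acceleration ** 2))),
        "peak_acceleration": float(np.max(np.abs(acceleration))),
        "rms_velocity": float(np.sqrt(np.mean(velocity ** 2))),
        "dominant_frequency_hz": float(np.argmax(spectrum) * sample_rate_hz / len(acceleration)),
    }

# Usage: one hourly window from one vibration sensor.
raw = np.random.randn(4096)    # stand-in for a recorded waveform
features = extract_features(calibrate(raw, sensitivity_mv_per_g=100.0), sample_rate_hz=4096.0)
```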
Once we have this information, it is passed on to our ML platform. The ML platform is designed with two major layers. The first one is for high recall, meaning we never want to miss a machine health issue; we want to always alert our customers when there is something going on with their machine. These are called anomaly detection algorithms. The other type of detectors that we have is fault detectors, which detect specific faults that can happen on a machine. So the anomaly detector is a semi-supervised machine learning algorithm which actually calculates a relative baseline for each machine. And this is very important, because it compares the data against the same machine's states, meaning we are finding anomalies per machine and not against our whole machine pool. The other detections, like we said, are fault detections, and these are specific faults which are identified by a specific signature that is correlated with the fault. Each hour, all of the detectors generate detections and output them to the detection management layer. Each detection has a confidence, and the confidence is pretty similar to the probability from the machine learning algorithm. Once a detection reaches the detection management layer, it gets handled there.
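To make the ideas of a per-machine relative baseline and a detection confidence more concrete, here is a small hypothetical sketch. The Detection fields, the z-score rule, and the way the deviation is squashed into a 0..1 confidence are assumptions for illustration, not the real algorithm.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Detection:
    machine_id: str
    detector: str        # e.g. "anomaly" or a specific fault such as "bearing_wear"
    confidence: float    # roughly the model's probability for this detection

def anomaly_detection(machine_id: str, history: np.ndarray, current: np.ndarray) -> Detection:
    """Score the current hourly feature vector against this machine's own baseline."""
    baseline_mean = history.mean(axis=0)
    baseline_std = history.std(axis=0) + 1e-9       # avoid division by zero
    z_scores = np.abs(current - baseline_mean) / baseline_std
    # Squash the largest deviation into a confidence-like score between 0 and 1.
    confidence = float(1.0 - np.exp(-z_scores.max() / 3.0))
    return Detection(machine_id=machine_id, detector="anomaly", confidence=confidence)
```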
What's amazing is that actually 99.87% of the detections are not passed on from the detection management layer to the customers. And this is amazing because it helps us give our customers only the relevant information they need to actually handle their machines. Some of the detections are passed on directly to the customers and some of them are passed on to our analysts. When our analysts add labels to the detections, we can use them afterwards to retrain our algorithms. And this is a big picture of the whole flow together.

So we talked about the AI engine from the inside, but now let's see how we use it in the product, because this is closely related to what I'm going to talk about as we move on with the presentation. So we have here two main usages for detections, and this is a real snapshot from the website, for a specific machine.
And we have two graphs over here. The graph at the top, like its name says, is machine health events: events happening on the machine's lifeline as we go along. And it's pretty self-explanatory: green is good, red is danger. And we can see the gray circles on the graph, and each gray circle means something happened. In our case, those gray circles are detections. Each time a detection is propagated out of the detection management layer, it reaches our vibration analysts, who decide whether it actually caused a change in the machine's health. Once the analyst says it did, we notify our customers, and the customers can choose to take action accordingly. We can see here that a customer decided to perform a repair on the machine, and after the repair the machine's health actually went back to green. So this is very good.

The second graph that we are seeing over here is actually detector confidences over time. And this is very interesting, because I chose to show here the bearing wear confidence output over time. And we can see it's very correlated with what's going on in the graph above: on each detection that is actually propagated to our customers, we can see an increase in the confidence of this specific fault of bearing wear, which means that with high probability, this is what's going on in their machine. So we can not only notify them that there is something going on, but also give them a specific confidence for what the specific fault is.

So, the detection management layer:
we talked about the AI engine and the overall flows, and we talked about how we use it in the product. So let's do a deeper dive into the detection management layer. It's important for two main reasons. The first one is that it connects the AI engine to the customer-facing products. As this diagram shows us, we have the AI engine on the right, and it generates all of the detections, which go downstream to the detection management layer and are then distributed to the different end products. So we could even call this a single point of failure. So being confident that the detection management layer is working as we expect it to is very important. It's also a very delicate area, because it consumes from multiple producers and also produces to multiple consumers. Yes, you got me. Okay, it's complicated. That's the point. Another important point here is that it contains logic and makes decisions about where to propagate to. So this makes it a very important component in our flow. So we need to be confident in the changes we make.
And as I said, we have two types of changes: expected changes and unexpected changes. Expected changes are new features, and we can mitigate the risks there by testing in a staging environment, or even running a dry run in production using feature switches and writing to logs, without making real changes that will affect our customers. But the other type of changes that we have, which is in my opinion a little bit more interesting, is the unexpected changes: all sorts of bugs. So in the AI engine itself, we can enhance our visibility to see if there are things going on in the engine itself. But we also have the detection management layer, which contains logic but also consumes the information from the AI engine. So we can add metrics over here as well. But what are we trying to achieve?
So our motivation is to avoid product issues, from simple bugs to bad deployments. And when I'm saying bad deployments, I mean someone made a change and the change is valid, but something in the technicality of performing the deployment to production failed, and for some reason a detector stopped generating detections. This happens every now and then and just needs to be handled, so this is something we would want to know about. Then there are changes in interfaces between squads. And this is also a very important point because, as I said, we have three product squads, three algo squads and a data engineering team. This is a lot of people and a lot of communication that needs to happen, and it's natural that sometimes things won't be perfect. So we want to be on top of that and figure out changes before they make a big impact.
Another thing that we would like to avoid is negative effects from configuration changes. And I'm going to explain this one using an example that actually happened to us after we started the monitoring initiative, and the monitoring actually caught it. So what happened is that our DevOps team made security changes, something that needed to be done, and two of our detectors stopped generating detections. Now, it's all good, stuff happens, right? But we need to figure it out very quickly. So the detections stopped being generated, we got an alert, and then we just told them: hey, can you revert this change, and please investigate and see how we can make this change again in the proper manner. And that's what happened, and we figured it out very quickly.
Another type of common production issue is making changes which you think are correct, but which have effects that you can't even imagine. Especially in complicated systems like this, it's very hard to understand how a change is going to affect things. So all of this can happen. And due to the nature of downstream flows, an error at the top of the funnel can cause major issues for several consumers, so it can affect a lot of products and a lot of customers. So this is very important to us.

Monitoring. It's the moment you've all been waiting for.
So what do we want to achieve? First of all, we want to achieve good service and good support. It's the core of our product, and we want to catch issues before our customers do. It can go either way: even if they didn't know an issue happened, we caught it beforehand; and even if a customer did notice the issue, we can already tell them we are handling it. So this makes us look very professional. Also, we want to find issues as fast as possible. Sometimes nobody notices an issue until it's too late, right? So we want to be on top of things, because of how important it is to our product. We want to have consistent AI insights. The quality of our insights is very important; it's about giving our customers the consistency they expect. We want to find machine health issues, but also minimize the amount of false alerts we give them. We want to improve the collaboration between our teams. I've already mentioned this, but we've grown from eight people working on the diagnosis flow to seven squads. This is a lot of people and a lot of teams, and we need a way to improve our communication and enable first response. So our top goal is actually to retain the trust of our customers. We want to be able to give them a product where they know that when we give an alert, it's valid, and when we don't, it's all good.

According to the Google SRE book, there are two types of monitoring. We have white-box monitoring, which is based on metrics of internal state: CPU, memory usage and more. And we also have black-box monitoring, which is testing externally visible behavior as a user would see it.
So let's look at our use case. The title already gives away where I'm going with this, and so does the drawing: we are talking about a hybrid approach. And why is that? Because on one side, I don't necessarily want to know about each component in my system and whether its CPU is running low or we are out of memory. But on the other side, monitoring each end product by itself is just a piece of the puzzle; it's not the whole picture. So what can I do? Well, the detection management layer is actually a consumer, or a customer, of the AI engine, so this is pretty similar to black-box monitoring, right? But it also executes product logic and decides on detection state. And this is very interesting too, because it actually affects what the external users are seeing, so this is very similar to white-box monitoring. So what we decided is to merge the two ideas together and monitor an internal product process that also makes decisions about how external customers get this information.
So this led us to believe that there are patterns we can commit to, and they are very related to the product. We saw this example early on with the two graphs, the machine health events and the detection confidences over time, and actually we can commit to the amount of detections that are going to be propagated to users; not for a specific machine, but statistically, propagated to users from our whole pool of machines, also taking into consideration the ways that detections can be filtered in the detection management layer. Another thing that we can commit to is the amount of detections being generated overall and sent to the detection management layer, per detector and in general. So this led us to understand that we actually have a detection lifecycle. A detection lifecycle is what a detection goes through in the detection management layer. First of all, it reaches the detection management layer. Afterwards, it is either filtered by the detection confidence, meaning we are not confident enough in this specific detection and we don't need to propagate it to our users; or, even if the detection confidence is high enough, it might be filtered due to the machine state, maybe because we already alerted the user on this machine and we don't need to put another alert on it; or, in the end, it is propagated to the customers. So these are the states that we have for a detection.
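A minimal sketch of that lifecycle, under the assumption that it can be modeled as a handful of states and two filter checks; the state names, the threshold, and the machine-state rule are illustrative, not the actual implementation.

```python
from enum import Enum

class DetectionState(Enum):
    RECEIVED = "received"
    FILTERED_BY_CONFIDENCE = "filtered_by_confidence"
    FILTERED_BY_MACHINE_STATE = "filtered_by_machine_state"
    PROPAGATED = "propagated"

def handle_detection(confidence: float, machine_already_alerted: bool,
                     confidence_threshold: float = 0.8) -> DetectionState:
    """Decide what happens to a detection inside the detection management layer."""
    if confidence < confidence_threshold:
        return DetectionState.FILTERED_BY_CONFIDENCE
    if machine_already_alerted:
        # The customer already has an open alert on this machine; don't pile on.
        return DetectionState.FILTERED_BY_MACHINE_STATE
    return DetectionState.PROPAGATED
```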
And this actually led us to add metrics on the detection lifecycle. Using Graphite and Grafana, we're actually able to visualize a lot of aggregated views of the state of our AI engine as a whole.
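One simple way to emit such lifecycle counters to Graphite is the Carbon plaintext protocol; the metric naming scheme, host and port below are assumptions for illustration, not the actual setup.

```python
import socket
import time

CARBON_HOST, CARBON_PORT = "graphite.internal", 2003   # hypothetical address

def emit_metric(detector: str, state: str, value: int = 1) -> None:
    """Send one data point such as 'detections.bearing_wear.propagated 1 <timestamp>'."""
    path = f"detections.{detector}.{state}"
    line = f"{path} {value} {int(time.time())}\n"
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=2) as sock:
        sock.sendall(line.encode("utf-8"))

# Called once per detection as it moves through the lifecycle, for example:
# emit_metric("bearing_wear", "propagated")
```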
In this graph we can see the amount of detections coming into the detection management layer, daily, per detector. And this really gives us a full-flow understanding of the differences between them. This is another very interesting graph, because we can see here the differences between each step in the detection lifecycle. And again, not specifically for one machine and not for one specific detection, but in general, how our system behaves. So let's get proactive.
So once we had all of these aggregated views and we knew what our data looks like, we could actually use Grafana's alerts to set up alerting and know when something is not working as expected. So what we did was decide on the first four alerts. We chose a detector, or rather I chose one: the bearing wear detector. And I decided on four alerts that we would like to monitor, meaning four patterns that we would like to commit to. The first one is the amount of detections arriving at the detection management layer, meaning not too many and not too few. The other alert was about the amount of detections being filtered. And I also added not too many detections being propagated and not too few. Right. So once we
had all of this running, I set up a Slack channel called detections monitoring and started getting these alerts. Now, it took some time, because we needed to tweak the values; we chose really simple absolute values to set our alerts by, and it was really noisy. Understanding the different behaviors of the detector took some time, but it did mellow down. And just as a sort of FYI, we are talking right about now about changing our strategy here, maybe moving to calculating the percentage change in these numbers instead of just monitoring absolute values. So this is also very interesting, but out of scope for this talk.
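To illustrate the difference between the two strategies, a fixed absolute threshold versus alerting on the percentage change from a trailing average, here is a hypothetical sketch; the bounds and the 50% figure are made up.

```python
def breaches_absolute(daily_count: int, low: int = 1, high: int = 500) -> bool:
    """Alert if the daily detection count falls outside fixed bounds."""
    return daily_count < low or daily_count > high

def breaches_relative(daily_count: int, trailing_counts: list[int],
                      max_change: float = 0.5) -> bool:
    """Alert if today's count moved more than max_change (50%) from the recent average."""
    baseline = sum(trailing_counts) / len(trailing_counts)
    if baseline == 0:
        return daily_count > 0
    return abs(daily_count - baseline) / baseline > max_change
```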
So after the bearing wear alerts were stabilized, I created a workshop, and together with all of the algo teams we added dashboards and alerts for all of our detectors. And now we have a very full view of our entire AI engine, including the detection management layer and all of the detectors and all of the pipelines and everything that you can imagine. Everything is actually in there, in one place, because we have alerts that indicate a working or not-working status at a very high level, in terms of how our customers would expect to get this as a product. This is an example of one of the graphs. It's a consistent detections generation graph, and it's pretty straightforward. We have the red line, which indicates the alert threshold. There's also one at zero, so we can't see it, but it's there, I promise you. Another very interesting point here is the purple, barely visible dotted line that we have over here, which is labeled with a deployment tag. Tags are actually a feature that Grafana enables: you can use its open API and each time send out a tag that has extra information on it; it gives you a point in time and you can enrich your graphs with it. So what you actually see here in this purple line is a deployment tag. For each service and each component that we have in our system, we added a deployment tag that is created when we deploy to production. On this deployment tag we have a Git hash, the name of the person that did the deployment, and also the name of the service they deployed to.
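A sketch of what posting such a deployment tag could look like, using Grafana's annotations HTTP endpoint; the URL, token handling and annotation text are assumptions for illustration, not the exact setup described in the talk.

```python
import time
import requests

GRAFANA_URL = "https://grafana.example.com"   # hypothetical
API_TOKEN = "..."                             # Grafana API token

def tag_deployment(service: str, git_hash: str, deployed_by: str) -> None:
    """Create a Grafana annotation marking a production deployment."""
    requests.post(
        f"{GRAFANA_URL}/api/annotations",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={
            "time": int(time.time() * 1000),          # epoch milliseconds
            "tags": ["deployment", service],
            "text": f"{service} deployed by {deployed_by} ({git_hash})",
        },
        timeout=5,
    )

# e.g. tag_deployment("detection-management", "a1b2c3d", "hila")
```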
So when you have a very complicated system where different components keep being deployed, but all of them can affect the downstream flow, we can use this to have a really quick way to identify what change was made, and then just ping the person: hey, I saw you did this change, and I see this detector stopped generating detections, can you please take a look? So this is very powerful.
So in conclusion: keep the customers in the center, whether they're internal or external; internal teams can consume products from each other. It's not about having a zero-bug product, it's about fast response. To move fast, we need high confidence in our process. And having an easy way to communicate across teams is crucial. Thank you, I hope you enjoyed it. And if you have anything to add or say or ask, feel free to contact me.