Transcript
Hello. Today we're going to talk about detecting silent machine learning model failure in models that are already deployed in production, where they are making business decisions. It's going to serve as a sort of introduction to machine learning monitoring. So let's get started. On the agenda today we have three things. First, we're going to discuss the two main reasons why machine learning models can fail in production: concept drift and data drift. We will define them, define how they can impact the performance of your machine learning models, and discuss to what extent they normally impact these models. Then we're going to talk about performance estimation versus calculation. We'll discuss why calculation is rarely possible, since access to target data after you make predictions is often limited, and why we need to estimate performance instead. And then we're going to come back to data and concept drift, and we'll try to find the link between a drop in performance and the potential reasons for it that we can find in data drift.
And before we start doing that, let's set the stage with an example use case that most of you should already be familiar with. We're going to talk about a simple binary classification use case where we're trying to predict whether a client is going to default on their loan. So we take credit scores and customer information, and we try to predict whether a customer is going to default on the loan. Our target is going to be non-payment within one year, which means we'll have to wait for an entire year after making a prediction before we can get access to the ground truth data, before we can simply calculate the performance. And we're going to use a technical metric to evaluate the quality of our model; for that we're going to use ROC AUC.
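As a reference point for later: once the ground truth does arrive after that year, the realized metric is just a direct comparison of the logged scores with the labels. A minimal sketch, with hypothetical column names, of what that calculation looks like:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical frame: one row per loan, with the model's score logged at prediction
# time and the label ("defaulted within one year") joined in a year later.
df = pd.DataFrame({
    "predicted_probability": [0.10, 0.85, 0.40, 0.70, 0.05],
    "defaulted_within_1y":   [0,    1,    0,    1,    0],
})
print("realized ROC AUC:", roc_auc_score(df["defaulted_within_1y"], df["predicted_probability"]))
```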
So before delving into the details of data and concept drift, let's do a quick refresher and define again what we are actually trying to do when we train a machine learning model. There exists some true pattern in reality, not even in the data, but just in reality: something that relates one variable to another, or multiple variables to some other variable. In this example here, we have just one input feature, x, and we have the relative frequency of the positive and negative class. As x increases, there is some function that maps this increase in x to the relative frequency: the higher the value of x, the less likely it is that the data point belongs to the positive class. Then what we do is sample from this pattern, from this population, to get our dataset, and this dataset is all the data we actually have access to. It can be our training, validation and testing data, for example. That is what we take, and then we try to find this true underlying pattern by using machine learning algorithms.
And let's say that we capture this pattern in some way, maybe imperfectly, maybe perfectly. Most likely it's not going to be a perfect capture of the true pattern, but it's going to be close enough.
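To make this concrete, here is a small synthetic sketch of that setup, with every number made up for illustration: a true pattern P(y=1|x) that decreases as x increases, a dataset sampled from it, and a model fit to approximate it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

def true_pattern(x):
    """P(y=1 | x): the higher x, the less likely the positive class."""
    return 1.0 / (1.0 + np.exp(2.0 * x))

# Sample a dataset from the population: this is all the data we ever get to see.
x_train = rng.normal(loc=0.0, scale=1.0, size=5_000)
y_train = rng.binomial(1, true_pattern(x_train))

# Try to recover the underlying pattern with a (hopefully close enough) model.
model = LogisticRegression().fit(x_train.reshape(-1, 1), y_train)
print("learned coefficient:", model.coef_[0][0])   # should land close to -2
```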
Now let's examine what happens if we experience data drift. Data drift can be defined as a change in the sampling method: the true pattern that exists in reality remains the same, but what changes is how we sample the data. So the data is going to be different, the input to the model is going to be different, but the underlying pattern between the model inputs and the targets is going to remain the same. Now that we've built a bit of intuition about what data drift is, we can define it formally and say that it's a change in the joint model input distribution. So again, it's all about the model inputs; it has nothing to do with the targets, although of course it does have to do with the model outputs.
And just to illustrate: here we see data before and after the data drift, and we see that the class balance might actually change, because the class balance might depend, and normally does depend, on where in the input space the data lives. So that's data drift. Data drift can, but does not have to, impact your model performance. If your data moves from one region where the model is performing very well to another region where the model is performing equally well, we do not expect to see a significant drop in performance.
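A toy sketch of that benign case, reusing the same made-up pattern as above: the input distribution shifts, the mapping from x to the target stays fixed, and because the model has captured that mapping well everywhere, the realized ROC AUC barely moves in this particular example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
true_pattern = lambda x: 1.0 / (1.0 + np.exp(2.0 * x))    # P(y=1 | x) stays the same

x_ref = rng.normal(0.0, 1.0, 10_000)                      # sampling before the drift
x_drift = rng.normal(1.0, 1.0, 10_000)                    # sampling after the drift: inputs shifted
y_ref = rng.binomial(1, true_pattern(x_ref))
y_drift = rng.binomial(1, true_pattern(x_drift))

model = LogisticRegression().fit(x_ref.reshape(-1, 1), y_ref)
for name, x, y in [("reference", x_ref, y_ref), ("drifted", x_drift, y_drift)]:
    auc = roc_auc_score(y, model.predict_proba(x.reshape(-1, 1))[:, 1])
    print(f"{name:9s} ROC AUC: {auc:.3f}")                 # roughly similar in this toy case
```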
Now, let's define concept drift. Unlike data drift, what changes here is the true pattern. We sample our data in exactly the same way, assuming, of course, that we have concept drift only and no data drift component, and what changes is the true pattern: the actual underlying function between the model inputs and the target that we're trying to find. So our data, the model inputs, can look more or less the same, or even exactly the same, but how they map to the target will change. We can visualize it here in a slightly different way. Imagine that we have a two-dimensional dataset with training data and production data, and we see that the data looks very similar, the data points are basically in the same regions of space, but the class boundary is completely different. That also means that if your use case experiences strong concept drift, your models are almost guaranteed to fail, because the pattern that they have learned is no longer valid.
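And a mirror-image sketch of that failure mode, again with made-up numbers: the inputs are sampled exactly as before, but the mapping from x to the target reverses, and a model trained on the old concept collapses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
old_concept = lambda x: 1.0 / (1.0 + np.exp(2.0 * x))     # P(y=1 | x) before the drift
new_concept = lambda x: 1.0 / (1.0 + np.exp(-2.0 * x))    # the relationship has reversed

x = rng.normal(0.0, 1.0, 10_000)                          # the inputs look exactly the same
y_before = rng.binomial(1, old_concept(x))
y_after = rng.binomial(1, new_concept(x))

model = LogisticRegression().fit(x.reshape(-1, 1), y_before)
scores = model.predict_proba(x.reshape(-1, 1))[:, 1]
print("ROC AUC before concept drift:", round(roc_auc_score(y_before, scores), 3))
print("ROC AUC after concept drift: ", round(roc_auc_score(y_after, scores), 3))   # far below 0.5
```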
And now that we've defined concept drift and data drift, and we know how they can potentially impact performance, let's talk about why we need to worry about performance at all. First and foremost, just monitoring data drift is not enough, because data drift does not always lead to a drop in performance. And if you were to do it anyway: data changes constantly, especially if you have a lot of features, and if you look at data drift from a feature-by-feature perspective, you will really get a lot of alerts, so many that they become basically useless. So because data drift does not always mean a performance drop, we cannot just monitor data drift. Another reason why we should monitor performance directly is that this is what we've been directly optimizing for in training: we chose the model that has the best performance, however we define performance. In our example that is ROC AUC; it can be precision or recall for classification problems, root mean squared error for regression problems, et cetera. A final reason why we need to look at performance is that it is the best proxy we have for business impact. Of course, when we develop machine learning use cases in an industry setting, what we care about is creating value, creating business impact, and the only way we can quantify that easily during development and within a technical setting is to look at these technical metrics for machine learning performance.
So now that we know we need to measure, or monitor, performance, how do we do this? The easiest thing would be to take the ground truth after we've obtained it, compare it with our prediction, and see what the difference is: just measure it, literally calculate the performance. However, this is very rarely possible. Why is that? First and foremost, in some use cases the target data is delayed. Take our example again: we'll have to wait one year, sometimes maybe even longer, to get the actual target data. So our model will always be operating for a year without us knowing its performance for a fact. That means we are exposed to a huge risk of giving loans to people who should not have received them, and of generally mispricing these loans. And this is a risk that is not acceptable and that you should always try to minimize.
Another thing is that even when we do get the labels, these labels are not complete. What I mean by that is: we do know whether the people that received a loan paid it back or not, but what we don't know is whether the people who were not given a loan would have paid it back or not, had they been given the loan. And the last reason, or example, where we do not have the ground truth is automation use cases. These are the use cases where we try to automate some menial human task, such as document classification or most computer vision tasks, where we want to replace or augment a human doing something. In these cases, we train the model on data that was manually processed by humans, and now we want to replace them. That means we will never get ground truth for all the data; that would literally defeat the purpose of developing such an algorithm. We can, of course, do some kind of spot checking, where we take maybe 1% of the data and have it double-checked by a human, but that will not give us a full picture of the performance of the algorithm. So that means we need to estimate the performance of a machine learning model. So we arrive at performance estimation.
And that is possibly the most interesting part of machine learning monitoring and detecting silent model failure: you can indeed estimate the performance, and specifically, you can fully estimate the impact of data drift on the performance. I'm going to give you a high-level intuition of how this algorithm works. It's an algorithm that we developed; it's part of our open source library, so feel free to check it out.
So what we're going to do is look at the confidence of the model. In our example of binary classification, this is just the model score: a number between zero and one. If this number is close to either one or zero, it means that the model is very confident; if the number is close to 0.5, it means that the model is not confident. The first thing we need to do is confirm that these scores actually represent probabilities. They don't always, and sometimes we need to do probability calibration to make sure that these scores are turned into something that represents real probabilities, so that if the model outputs 0.6, there is actually a 60% chance that this data point belongs to class one.
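As a rough illustration of that calibration step (generic scikit-learn here, not necessarily what our library does internally): wrap the model in a calibrator, then check that within each score bin the observed positive rate matches the mean score.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=10, random_state=0)
X_fit, X_check, y_fit, y_check = train_test_split(X, y, random_state=0)

# Isotonic calibration: remap the raw scores so they behave like real probabilities.
model = CalibratedClassifierCV(RandomForestClassifier(random_state=0), method="isotonic", cv=3)
model.fit(X_fit, y_fit)

# Reliability check: in each bin, the observed positive rate should match the mean score.
frac_positive, mean_score = calibration_curve(y_check, model.predict_proba(X_check)[:, 1], n_bins=10)
print(np.round(mean_score, 2))
print(np.round(frac_positive, 2))
```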
Once we have properly calibrated probabilities, we're going to look at the expected performance of the model as a function of this uncertainty. Imagine that in training we have the picture on the left here, and in this picture you will see that most of the data points are in the high-confidence regions. The two wings of this butterfly-like shape are the concentrations of the negative class and the positive class, and the body of the butterfly is the class boundary, where the algorithm is really uncertain whether a point is positive or negative. Then imagine that some time passes and we see significant data drift, and this data drift has a specific form: it moves data from high-confidence regions to low-confidence regions. That means we would expect the model to perform worse, and the model itself expects to perform worse. We can then take that and convert it into an expected confusion matrix, and from there we can calculate the expected value of basically any metric you want, be it accuracy, precision, et cetera.
I will not go into the details of how you'd actually go about it, as that would take too much time, but do feel free to check our docs for a more thorough explanation.
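To give a flavour of the expected-confusion-matrix idea (a toy sketch of the intuition only, not the actual implementation in our library): with well-calibrated scores, every prediction contributes fractionally to the confusion matrix, and no targets are needed.

```python
import numpy as np

def expected_confusion_matrix(calibrated_scores, threshold=0.5):
    """Expected TP/FP/FN/TN from calibrated scores alone (no targets required)."""
    p = np.asarray(calibrated_scores)
    pred_pos = p >= threshold
    tp = p[pred_pos].sum()           # a predicted positive is a true positive with probability p
    fp = (1 - p[pred_pos]).sum()     # ...and a false positive with probability 1 - p
    fn = p[~pred_pos].sum()          # a predicted negative is a false negative with probability p
    tn = (1 - p[~pred_pos]).sum()
    return tp, fp, fn, tn

scores = np.random.default_rng(0).beta(2, 5, size=10_000)   # stand-in for production scores
tp, fp, fn, tn = expected_confusion_matrix(scores)
print("expected accuracy :", round((tp + tn) / len(scores), 3))
print("expected precision:", round(tp / (tp + fp), 3))
```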
So now we have estimated the performance. What is the next step? The next step is to identify issues and see if there are failures. And if we see that there are some issues with performance, we want to figure out why they happened. But just before we jump there, let's look at an example of how such a performance estimation algorithm performs on a real-life dataset.
For the purposes of this presentation, we took the California housing dataset, which is available basically everywhere, in scikit-learn among other places. We turned it into a classification problem and trained a very simple algorithm, I think it was a random forest, on the training part of the dataset, and then we evaluated it in production. And you see here that this algorithm that I explained, which estimates the impact of data drift on performance, so the performance estimation algorithm, works quite well: the estimated ROC AUC is very, very close to the realized ROC AUC.
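A rough sketch of that experimental setup (the exact binarization, data split and model settings behind the slide may differ): load California housing, binarize the target, fit a random forest, and compute the realized ROC AUC on a held-out "production" slice, which is the number the estimator should track.

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
y_bin = (y > np.median(y)).astype(int)          # turn the regression target into a binary label

X_train, X_prod, y_train, y_prod = train_test_split(X, y_bin, test_size=0.5, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Realized ROC AUC on the "production" slice; a performance estimator would only
# get to see X_prod and the model's scores, not y_prod.
scores = model.predict_proba(X_prod)[:, 1]
print("realized ROC AUC:", round(roc_auc_score(y_prod, scores), 3))
```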
So now let's jump into figuring out why models can fail, and for that we go back to data drift. Here we basically want to figure out which features, sets of features, or segments of the data can be responsible for the drop in performance. We can do that in two ways. First, we can look at the data feature by feature and just see which features changed significantly; this is univariate data drift. Or we can look at multivariate data drift, where we look at maybe all features at once, or at some subsets of features, and try to figure out whether there is a significant change in the joint distribution of these features. The simplest and most interpretable option is of course univariate data drift. To detect it, we can use simple statistical tests such as the Kolmogorov-Smirnov test or the chi-square test. We take a reference dataset, for which we know that the performance is stable and high and the data looks like it should, which could be, for example, our test set or the first month of production, and we compare it to our analysis dataset, which is the part of the data for which the performance has decreased, and we check whether there are any significant changes in the data.
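For a continuous feature, this univariate check boils down to a two-sample test per feature. A minimal sketch, with synthetic data standing in for one drifted feature:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)    # e.g. the test set or the first month
analysis = rng.normal(loc=0.4, scale=1.0, size=5_000)     # later production data, slightly shifted

stat, p_value = ks_2samp(reference, analysis)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")   # a tiny p-value flags this feature
```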
These tests of course have a few shortcomings. The first one is that if you have hundreds of features, you will get a lot of false positives in absolute terms; at a 5% significance level, a thousand features means on the order of fifty false alarms per check. So if you go into thousands of features, you will just not be able to go through all these false positives, and you will not be able to find the real issues in the data. The other shortcoming of these approaches is that they fail to find more subtle types of data drift. If there is only a shift in the correlation between features, or some change in the internal structure of the data that is not visible from the univariate changes, these tests will fail to detect that kind of data drift. So, to alleviate this problem, we can look at multivariate data drift detection approaches, and here I'm going to present one that we developed, which is based on data reconstruction. What we do is take our original data, so all the features that we have, and compress it: we project it onto a lower-dimensional latent space and then apply the inverse transform, reconstructing the data. Then we compare the original data with the reconstructed data and look at the compression loss, at how strongly the two differ. We can use basically any dimensionality reduction or compression algorithm that is fitted on the data, so any algorithm that learns the internal structure of the data. Let's delve deeper into that.
First, when it comes to the choice of this algorithm, there are a few requirements. As I already mentioned, the encoding needs to learn the internal structure of the data, because we want to track the changes in that structure. That's the entire underlying intuition here: we want to track how the internal structure of the data is changing, and we can measure that through the dislocation of points between the original space and the reconstructed space, before and after the reconstruction. Second, the encoding needs to reduce the dimensionality of the data, because we want to compress the data in some way, so that the internal structure of the data has to be learned in order to perform the compression well. Then, of course, the inverse transformation needs to be possible, because we want to reconstruct the data; that one is obvious. And there is one more important requirement: the latent space needs to map in a stable way to the original space. If you took an autoencoder that is not variational, a traditional one, you would see that the latent space can map in a completely unpredictable way to the original space, which means that the reconstruction error would not be a reliable metric for measuring the change in the data structure.
So what we do is take, let's say, PCA. We fit the PCA, we reduce the number of components, taking, say, the top n components that keep 95% of the variance in the data, and then we do the inverse transformation. We measure the dislocation of points before and after the reconstruction; to do that, we can use any distance metric, and we use the mean Euclidean distance between the original and the reconstructed points. That gives us the reconstruction error.
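A minimal sketch of that recipe using scikit-learn's PCA on synthetic data; the scaling, component count and data here are illustrative choices, not the exact settings of our implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def reconstruction_error(scaler, pca, X):
    """Mean Euclidean distance between points and their PCA reconstructions."""
    Z = pca.transform(scaler.transform(X))                        # project to the latent space
    X_hat = scaler.inverse_transform(pca.inverse_transform(Z))    # inverse transform back
    return np.linalg.norm(X - X_hat, axis=1).mean()

rng = np.random.default_rng(0)
cov = 0.8 * np.ones((5, 5)) + 0.2 * np.eye(5)                     # five correlated reference features
X_ref = rng.multivariate_normal(np.zeros(5), cov, size=10_000)

scaler = StandardScaler().fit(X_ref)
pca = PCA(n_components=0.95).fit(scaler.transform(X_ref))         # keep ~95% of the variance
print("baseline reconstruction error:", round(reconstruction_error(scaler, pca, X_ref), 3))
```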
Then we only need to keep track of one metric to see whether there is data drift. Of course, this does have a certain shortcoming: it becomes less and less interpretable. If we look at maybe five features at a time and do the data reconstruction there, it's still reasonably interpretable; if we look at the entire dataset and perform this kind of multivariate drift detection, it's not going to be interpretable.
But we will still know that if there is a drop in performance and we see a spike in our multivariate data drift metric, then data drift is responsible, and we'll have to dig deeper to find out what exactly changed in the data to affect performance. And just a few words about how to actually interpret this reconstruction error. We will always have some baseline reconstruction error, because this compression is meant to be lossy. As you deploy your model, the reconstruction error can stay roughly constant, which is the perfect-case scenario: we see no drift. Or it can increase or decrease. If it increases, the internal structure that was learned by the encoding is no longer appropriate, so the compression doesn't work as well, we see an increase in the reconstruction error, and we have data drift. If, however, there is a drop in the reconstruction error, that means the internal structure learned by the encoding is even more appropriate to the data than it used to be, so we still have data drift. This case is rare, but it can happen.
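One simple way to operationalize that interpretation (the per-chunk errors and the three-sigma band here are made-up, illustrative choices): establish a baseline band from the reconstruction error on reference chunks and alert whenever a production chunk falls outside it, in either direction.

```python
import numpy as np

# Per-chunk reconstruction errors: a baseline from the reference period, then production chunks.
# In practice each value would come from a reconstruction_error(...) call like the one above.
reference_errors = np.array([0.41, 0.43, 0.40, 0.42, 0.44, 0.41])
production_errors = np.array([0.42, 0.43, 0.58, 0.61, 0.30])

lo = reference_errors.mean() - 3 * reference_errors.std()
hi = reference_errors.mean() + 3 * reference_errors.std()

# Alert on moves in either direction: an increase means the learned structure no longer fits,
# a decrease means it suddenly fits better than before, and both indicate the data has changed.
for i, err in enumerate(production_errors):
    print(f"chunk {i}: error = {err:.2f}", "DRIFT" if not (lo <= err <= hi) else "ok")
```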
Now I want to give you a very quick example of where such an algorithm would be necessary to detect drift and where univariate approaches would fail. Imagine two very simple datasets: the reference dataset in blue, the one where we know everything is fine, and the orange analysis dataset, where we see some kind of drop in performance and want to find out why. If we just look at univariate data drift detection methods, we see no increase in the D statistic and basically no alerts, because from a univariate perspective, if you just look at the x and y axes separately, these datasets look basically the same. However, if we do our encoding and decoding and measure the distance between the reconstructed and the original data, we will see that there is a strong difference, because the internal structure of the data has obviously changed significantly. So this is the simplest possible example where multivariate drift detection would be absolutely necessary.
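A self-contained toy version of that example: two features whose marginal distributions stay identical while their correlation flips sign. Per-feature Kolmogorov-Smirnov tests see nothing, but the reconstruction error of a PCA fitted on the reference data jumps.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_ref = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=5_000)     # correlated
X_ana = rng.multivariate_normal([0, 0], [[1.0, -0.9], [-0.9, 1.0]], size=5_000)   # correlation flipped

# Univariate view: each feature's marginal distribution is unchanged, so no alerts.
for j in range(2):
    print(f"feature {j}: KS p-value = {ks_2samp(X_ref[:, j], X_ana[:, j]).pvalue:.2f}")

# Multivariate view: reconstruct through a PCA fitted on the reference data.
pca = PCA(n_components=1).fit(X_ref)
recon_error = lambda X: np.linalg.norm(X - pca.inverse_transform(pca.transform(X)), axis=1).mean()
print("reference error:", round(recon_error(X_ref), 2), "| analysis error:", round(recon_error(X_ana), 2))
```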
So we're slowly nearing the end of the presentation; let's summarize. First, there are two main reasons why performance can drop for machine learning models deployed in production: data drift and concept drift. Data drift does not always lead to a drop in performance, whereas concept drift tends to almost always lead to a drop in performance. Ideally, we'd like to always calculate the performance of a machine learning model in production to really know whether there are any issues or not. However, this is rarely possible, because production targets are often not available: they are either delayed, not available at all, or we are dealing with an automation use case where you can only get a very small percentage of them. And that means that performance estimation, without access to target data, is key to machine learning monitoring, because it allows us to estimate the current performance of the model and whether we need to be worried or not, and then we can go back to data drift detection and figure out why.
So thanks for listening. If you'd like to learn more about the topic, about detecting silent machine learning failures, either visit our website, shoot me an email, add me on LinkedIn, or, most importantly, check out our GitHub. We've just launched our product and it's publicly available on GitHub: you can just pip install the library and use the methods I described in the presentation. So, yeah, that's it. Thank you very much, and I'm happy to talk more on LinkedIn or anywhere else.