Transcript
Machine learning in production: this session is an introduction to running machine learning in production, which is being called MLOps. I'm Ryan Dawson and I'm an engineer working on MLOps solutions at Seldon.
The MLOps scene is complex and new. It's distinct from mainstream DevOps, so we'll start by comparing MLOps to DevOps. To understand why it's so different,
we need to understand how data science is different from programming.
We'll find out that the difference centers on how data is used.
When we're clear about that difference, then we'll look at how the build-deploy-monitor workflow for DevOps differs from the MLOps one. From there, we'll be able to go deeper on particular steps in the MLOps build-deploy-monitor workflow. I'll try to explain that MLOps challenges vary by use case, and that some use cases
have especially advanced challenges. Lastly,
I'll go into some of the advanced challenges and how they relate to the
topic of governance for running machine learning.
So before we try to understand MLOps, let's make sure we're clear about DevOps.
As I see it, DevOps is all about making the build-deploy-monitor workflow for applications as smooth as possible. It tends to focus on CI, CD, and infrastructure. SRE, or site reliability engineering, as I see it, is an overlapping role, but with a bit more focus on the monitoring stage of the workflow. This whole workflow is a key enabler for software projects.
Fortunately, there are some great tools in the space that have become pretty well established across the industry, tools like Git, Jenkins, Docker, Ansible, Prometheus, et cetera.
MLOps is in a very different space right now. There are surveys suggesting that 80% to 90% of machine learning models never make it to production, and at least part of that is
due to the complexity of running machine learning in production.
There's a famous paper called "Hidden Technical Debt in Machine Learning Systems", and it explains all the effort that goes into running production-grade machine learning systems. It has a diagram with boxes showing the relative size of different tasks, and there's this tiny little box for ML code and really big boxes for data collection, data processing, runtime infrastructure, and monitoring.
The Linux Foundation for AI has tried to help by producing a diagram of the whole MLOps tool landscape. It's great, but it has loads of tools in loads of sections, and even the section titles won't make much sense to newcomers to MLOps.
But let's try to understand more about the fundamentals of MLOps and where it's coming
from.
Fundamentally, MLOps is different from DevOps because machine learning is
different from programming. Traditional programming codifies rules
explicitly, rules that say how to respond to inputs.
Machine learning does not codify explicitly. Instead,
rules are set indirectly by capturing patterns from data and
reapplying the extracted patterns to new input data.
This makes machine learning more applicable to problems that center on data,
especially focused numerical problems.
So with traditional programming, we've got applications that respond
directly to user inputs, such as terminal systems or GUI
based systems. You code these by starting with hello
world and adding more control structures.
Data science problems fall into classification problems and regression problems. Classification problems put data into categories. An example would be:
is this image a cat or not a cat?
Regression problems look for numerical output, for example,
predicting sales revenue from how advertising spend is
directed. The "hello world" of data science is the MNIST dataset, which is a dataset
of handwritten digits. And the problem is to categorize each
handwritten sample correctly as the number that it represents.
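To make that concrete, here's a minimal sketch of what a classification "hello world" can look like, using scikit-learn's small bundled digits dataset, which is an MNIST-like set of 8x8 images. The model choice and the split are just illustrative.

# A minimal classification sketch using scikit-learn's small digits dataset
# (an MNIST-like set of 8x8 handwritten digits). Illustrative only.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)   # a simple classifier for the demo
model.fit(X_train, y_train)                 # "capture patterns from data"
print("accuracy:", model.score(X_test, y_test))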
When I think of machine learning as capturing patterns from data, I think about fitting. For regression problems, you basically have data points on a graph, and you draw a line through the data points and try to get the line as close to as many of the
data points as possible. The distance from each data
point to the line is called the error, and you keep adjusting the
equation of the line to minimize the total error.
The coefficients of the equation of the line correspond to the weights of a
machine learning model, and you then use that to make new predictions.
Of course, the machine learning training process is more complex than the way I'm explaining
it. For example, there's more to the process of adjusting the weights than just trying to get the line to fit the data. It's done programmatically by using an algorithm called gradient descent. Essentially it randomly picks a way to shift the line, but it's only pseudorandom, as it will take a step in a given direction and then check whether that reduced the error before deciding whether to keep going that way or go a different direction. That step size
can be tweaked and you can get different results,
so the overall process is tunable.
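As a rough illustration of that idea, here's a toy gradient descent sketch in plain numpy. The data, step size, and number of iterations are all made up for the example.

# A toy gradient descent sketch: fit y = w*x + b by repeatedly nudging
# the weights in the direction that reduces the mean squared error.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 1, 50)   # a noisy line to recover

w, b = 0.0, 0.0
step_size = 0.01                           # the tunable "step size"
for _ in range(1000):
    pred = w * x + b
    error = pred - y
    # gradients of the mean squared error with respect to w and b
    w -= step_size * 2 * np.mean(error * x)
    b -= step_size * 2 * np.mean(error)

print("fitted coefficients:", w, b)        # roughly 3.0 and 2.0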
So basically, data scientists are looking for patterns in data and trying to find
which methods are best for capturing those patterns in models.
This is an exploratory process, and the tools data scientists
use reflect this. Jupyter notebooks, for example, are great for
playing around with slices of data and visualizing patterns.
These differences between programming and machine learning have implications for how we can best build, deploy, and run machine learning systems. So let's get into more detail about how different these build-deploy-monitor journeys are.
Let's go on an imaginary development journey. We can
start with a user story. Let's say we're building
a calculator and our user story says that our lazy users want to
put numerical operations into a screen so they don't have
to work out the answers.
We could write a Java program to satisfy the story,
compile it, and distribute it as a
binary. But this
is 2020, so we'll more likely package the code to
run as a web server so that users will interact with it via
a browser. Most likely we'll also dockerize
the web app and run it on some cloud infrastructure.
Now let's think of a machine learning build journey.
This is more likely to start with some data and maybe a question.
Let's say we've got data on employees and their experience and
skills and salaries, and we want to see whether we could use it to benchmark salaries
for other employees during a pay review.
Let's assume the data is already available and clean, though this
is a pretty big assumption. But let's assume
we've got good data and we can create a regression model that maps employee experience to pay, maybe using scikit-learn.
So we train the model and then it can be used to make a prediction for any given employee about what the salary benchmark would be.
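As a sketch of what that might look like, with a deliberately tiny, made-up dataset:

# Sketch of the salary-benchmarking idea: train a regression model that
# maps years of experience to salary. The data here is invented.
import numpy as np
from sklearn.linear_model import LinearRegression

years_experience = np.array([[1], [2], [3], [5], [8], [10]])
salary = np.array([30000, 34000, 40000, 52000, 65000, 72000])

model = LinearRegression()
model.fit(years_experience, salary)

# The trained model gives us a predict function we can reuse.
print(model.predict([[4], [7]]))   # benchmark salaries for new employees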
So let's say we give our predictions for a particular set of employees
to the business and they're happy with that.
So happy that they want to use it again next year or more
regularly. Then our situation changes.
Because then what we want isn't just a prediction but a predict function, as we might not want to have to rerun the training process every time the
business has some new employees to check.
This problem would be magnified if another department says that they want to
make predictions too. Actually, that would add extra complication
as even if we know that the patterns from our
training data are applicable to our department, we don't necessarily
know about the new department. But let's assume
that it is applicable. Then our main problem is a problem
of scaling. How do we make all these predictions
without burning ourselves out?
Probably we're going to be interested in using the machine learning model in
a web app.
So maybe we add a REST API around our Python code and look to run it as a web application. We might naturally package it in a Docker container like we would for a traditional web app. This is a valid and common approach, but it's just one approach. With machine learning, deploying does present a challenge about how to dockerize the predict function without including the training data in the Docker image. So it's also common to
separate the model from the data by taking the Python variable for the trained model and serializing that to a file using Python pickling. Then the file can be loaded into another running Python application.
So if we load the model into a suitable Python
web server app, then we can serve predictions that way.
This varies a little from framework to framework, and can vary
quite a lot if the language is not Python.
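Here's a rough sketch of that pickle-and-serve pattern. Flask is just one option for the web layer, and the route and payload format are made up.

# Sketch: serialize the trained model, then serve predictions from a small
# web app. Flask is just one option; the route and payload format are made up.
import pickle

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)              # "model" is the trained object from earlier

# --- in a separate serving application ---
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["data"]      # e.g. {"data": [[4], [7]]}
    return jsonify(predictions=loaded_model.predict(features).tolist())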
But basically this is a good picture for the machine learning lifecycle.
Get data, clean it, experiment with it, train a model,
package the model into something that can serve predictions.
And there are tools pitched at each stage of this lifecycle. For data storage and prep, there are tools like S3 and Hadoop. Training can use a lot of compute resource and take a long time.
So there are tools that help with running long-running training jobs, and also tools for tracking the operations performed during training.
There are tools specifically aimed at helping make batch
predictions on a regular cycle,
say for just getting predictions every month or whatever the cycle
is that the business works to.
Or predictions could be needed at any time, and then there are tools for real-time serving of predictions using a REST API.
Some real-time serving tools are specific to the framework and some are more general. I personally work on Seldon Core, which is a framework-agnostic open source serving tool. The Seldon team also collaborates on another tool called KFServing. Both of these are part of the Kubeflow ecosystem, which is an end-to-end platform. That's another space of tools, end-to-end platforms that try to join up the whole journey.
Platforms can save you the effort of stitching together several different
tools, but platforms can also be opinionated, so they don't necessarily fit every use case. I'm listing these types of tools because I think it helps to divide the machine learning lifecycle up like this into
data prep, training and serving.
This helps us make sense of the landscape of MLOps tools out there, as we can then put them into categories mapped to the
lifecycle. There's also the monitoring part of
the lifecycle, but we'll get to that later.
For now, the key point to see is that MLOps is different from DevOps,
mostly because of the role of data. In particular,
models are built by extracting patterns from data using
code, so that the training data is
a key part of the model. The training
data volumes can be large,
and that leads to complexity in storing and processing the data,
which there are specialized tools to help with.
You also get different toolkits for building machine learning models, which results in models in different formats and adds some complexity to the space of tools for getting predictions out of models, a space called serving. So the complexity of the way the ML build-deploy-monitor lifecycle uses data has knock-on effects on the tool landscape.
We've not talked about the post deployment stage yet,
but there's also complexity there. For example,
you can sometimes need to retrain your model,
your running model, not because of any bugs in it, but because the data
coming in from the outside world changes.
Think, for example, of how fashion is seasonal.
Let's say you've got a model trained to recommend clothes for an online fashion
store, and you trained it based on purchases made in winter.
Then it might perform great in winter and make lots of money.
But when it comes to summer, it's still going to be recommending coats when
people are looking for summer clothes.
So you would need to be regularly updating the model with new data and
ideally checking that it's leading to sales. That's a complexity you don't normally get with traditional software.
These complexities about handling data ripple all the way through the whole MLOps lifecycle.
We've talked about this at a high level so far, but let's now think
about the individual steps of the workflow and the tools used in them.
So let's just remind ourselves of the workflow steps with traditional DevOps. We'll start with a user story specifying a business need. From that a developer will write code and submit a pull request. Hopefully tests will run automatically on the pull request. Somebody will review it and merge it. Once it gets merged to master, our pipeline will build a new version of the app and deploy that to the test environment. Perhaps further tests
will be run and it'll get promoted to the next environment where there might be deeper tests, and then it'll go to
production. And in production we'll monitor for anything going
wrong, probably in the form of stack traces or error codes.
The pipeline producing these builds and running the tests will most likely be
a CI system like Jenkins. The driver for the pipeline
will most likely be a code change in git. The artifact
we'll be promoting will probably be an executable inside a
docker image.
ML workflows are different. The driver for
automation might be a code change, or it might be new data,
and the data probably won't be in Git, as Git isn't a great store for data getting into the gigabytes. The workflows are more experimental and data-driven.
You start with a data set and need to experiment to
find usable patterns in the data set that can
be captured in a model. When you've got a model, then it might not be enough to just check it for pass/fail conditions and monitor for errors like you would with traditional software. You'll likely have to check how well it performs against the data in numerical terms. There can be quite a bit of variation with ML workflows.
One major point of variation is whether the model is trained offline or
online. With online learning,
a model is constantly being updated by adjusting itself through each
new data point that it sees. So every prediction it makes also adjusts
the model. Whereas with offline learning, the training is
done separately from prediction. You train the model and deploy it, and when you want to update the model, you need to train a new one. We have to pick somewhere to
focus, and offline learning is probably the more common case. So let's
focus on offline training workflows.
As we've talked about already, an ML workflow starts with data.
It can be very large and typically needs to be cleaned and processed.
A slice of that data can be taken so that the data scientists can
work with it locally to explore the data on their own machine.
When the data scientist has started to make some progress, then they might
move to a hosted training environment to run some longer running experiments
on a larger sample of the data. There will
likely be collaboration with other data scientists, most likely using
Jupyter notebooks. The artifact produced will
be a model, commonly a model that's pickled
or serialized to a file. That model can be integrated into a running app to serve real-time predictions over HTTP.
There will probably be a consumer of those predictions, which may be
another app, perhaps a traditional web app. So you may need to integration-test the served model against the consumer. And when you roll out the model to production, you want to monitor the model by picking some metrics that represent how well it's performing against the live data. The rollout and monitoring phases of the workflow can be linked.
An example might help to understand this. Say we've
got an online ecommerce store. A common way to roll out new versions of a model is an A/B test. With an A/B test, you'd have a live version that's already running and that's called the control. And then you run other versions alongside it. Let's call them version A and version B. So we're running three versions of the model in parallel, each trained a bit differently to see which gives the best results.
You can do that by splitting the traffic between the versions. To minimize the risk, we'd send most of the traffic to the control version. A subset of the traffic will go to A and to B, and we'll run that splitting process for a while until we've got a statistically significant sample. Let's say that variation A has the highest conversion rate, so a higher proportion of the recommendations lead to sales. That's a useful metric, and it might be enough for us to choose variation A, but it
might not be the only metric. These situations can get
complex. For example, it might be that model A is recommending controversial products. So some customers might really
like the recommendations and buy the products, but other customers are
really put off and they just go to a different website.
So there are trade-offs to consider, and monitoring
can need more than one metric depending on the use case.
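Just to illustrate the mechanics of comparing versions, here's a small sketch that computes conversion rates for the control and the two variations and applies a chi-square test as one simple significance check. The counts are invented for the example.

# Sketch: compare conversion rates across the control and two variations.
# The counts are invented; a chi-square test is one simple significance check.
from scipy.stats import chi2_contingency

#                 conversions, non-conversions
observed = [[120, 2880],    # control
            [160, 2840],    # variation A
            [130, 2870]]    # variation B

for name, (conv, non_conv) in zip(["control", "A", "B"], observed):
    print(name, "conversion rate:", conv / (conv + non_conv))

chi2, p_value, dof, expected = chi2_contingency(observed)
print("p-value for any difference between versions:", p_value)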
So we're seeing that MLOps is complex, and in many organizations right now, the complexity is compounded by challenges from organizational silos. You can find data scientists
that work just in a world of Jupyter notebooks and model accuracy
on training data that then gets handed over to
a traditional DevOps team with the expectation they'll
be able to take this work and build it into a production system. Without proper context, the traditional DevOps team is likely
to look at those notebooks and just react like, what is this stuff?
In a more mature setup, you might have better understood handoffs.
For example, you might have data engineers who deal with
obtaining the data and getting it into the right state for the data scientists.
Once the data is ready for the data scientists, then they can take over and
build the models. And from there, data science will have an
understood handoff to ML engineers. And the ML engineers
might still be a DevOps team, but a DevOps team that knows about the context
of this particular machine learning application and knows how to run it in production.
This is new territory. There are special challenges for MLOps that are not a normal part of DevOps, at least not right now. Now that we've got a high-level understanding of where MLOps is coming from, we can next go into more detail on particular MLOps topics.
So let's take these in order and go first into training,
then serving, finally rollout and monitoring.
So there are tools that are pitched particularly at the training space. To name a few examples, there's Kubeflow Pipelines, MLflow, and Polyaxon. These are all about making it easy to run long-running training jobs on a hosted environment. Typically, that means providing some manifest that specifies which steps are to be done and in which order. That's a manifest for a training
pipeline. As an example,
a pipeline might have as a step an action to download data from wherever it's stored. That could be the first step.
Then it gets split into training and validation data.
The training data will then be used to train the model, and the validation
data will be used as a check on the quality of the model's predictions.
When we check the quality of the predictions, we'll want to record those checks
somewhere and ideally also have an automated way
to decide whether we should consider this as a good model or not.
If we do consider it a good model, then we'll probably want to serialize it
so that the serialized model would be available for promotion to
a running environment. This is
probably sounding rather like continuous integration pipelines. It is
similar, but also different. The difference can be seen in the specialized
tools dedicated to training. One tool for handling training is Kubeflow Pipelines. In Kubeflow Pipelines, you can define your pipeline with all its steps, and also visualize it and watch it progress and see any steps that fail. But the pipelines aren't only called pipelines, they're also called experiments, and they're parameterized, so there are options in its UI where you can enter parameters.
Remember I mentioned before that the process can be
tunable. There are tunable parameters on training, such as
the step size, so you can kick off runs
in parallel of the same pipeline using different parameters to
see which parameters might result in the best model.
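To give a feel for the shape of such a pipeline without tying it to any one tool's API, here's a plain-Python sketch. Real tools like Kubeflow Pipelines or MLflow express these steps in their own manifests and interfaces; the function names and the score threshold here are made up.

# A plain-Python sketch of the shape of a parameterized training pipeline
# (download -> split -> train -> evaluate -> serialize). The helper names
# and the threshold are illustrative only.
import pickle
from sklearn.model_selection import train_test_split

def run_training_pipeline(download_data, build_model, step_size, threshold=0.8):
    X, y = download_data()                                   # step 1: get the data
    X_train, X_val, y_train, y_val = train_test_split(X, y)  # step 2: split it
    model = build_model(step_size)                           # step 3: train
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)                        # step 4: validate
    print("step_size", step_size, "-> validation score", score)
    if score >= threshold:                                   # step 5: keep good models
        with open(f"model-{step_size}.pkl", "wb") as f:
            pickle.dump(model, f)
    return score

# Kick off several "experiment runs" with different parameters, e.g.:
# for step_size in [0.001, 0.01, 0.1]:
#     run_training_pipeline(my_loader, my_model_builder, step_size)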
Kubeflow Pipelines is not alone in having this idea of being able to kick
off runs of an experiment with different parameters.
MLflow, for example, uses the same terminology and has
a similar interface. So there's
similarity here with traditional CI systems, as the training platforms
execute a series of steps and an artifact gets built.
But it's different, as you've also got this idea of running experiments with
different parameters to see which is best.
That means you have to have a definition of which is a good model, or which is the best model, whereas traditionally with continuous integration you would just be building from master, and if it passes the tests, then you're good to promote. But so long as you can automate what counts as the best model, then your training can build an artifact for promotion, much like with CI. And sometimes these training systems do have integrations available to CI systems. Let's
say we've got a way of building our model and we want to be able
to serve it. So we want to make predictions available in real time via HTTP, perhaps using a REST API. We might use a serving solution, as there's a range of them out there, some that are particular to a machine learning toolkit, such as TensorFlow Serving for TensorFlow, or TorchServe for PyTorch. There are also serving solutions provided by cloud providers, as well as some that are more toolkit-agnostic. For example, there's the toolkit-agnostic open source offering that I work on, Seldon Core. Typically,
serving solutions use the idea of a model being packaged and hosted,
perhaps in a storage bucket or a disk location,
so the serving solution can then obtain the model from that location and
run it. Serving solutions often come with support for rollout and some support for monitoring as well.
As an example of a serving solution, I'll explain the concept
behind Seldon and how it's used. Seldon is aimed
in particular at serving on Kubernetes, and the models are served by creating a Kubernetes custom resource. The manifest
of the custom resource is designed to make it simple to plug in a
URI to a storage bucket containing a serialized model.
So at a minimum, you can just put in the URI to the storage bucket
and specify which toolkit was used to build
the model. Then you submit that manifest to Kubernetes
and it will create the lower level Kubernetes resources necessary
to expose an API and serve the model's HTTP traffic.
There's also a docker option to serve a model from a custom image.
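As a rough sketch of what such a manifest can look like for a pickled scikit-learn model, with placeholder names and a placeholder bucket URI (the exact schema may vary between Seldon Core versions, so check the docs):

# A rough sketch of a SeldonDeployment manifest for a serialized scikit-learn
# model stored in a bucket. The names and URI are placeholders.
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: salary-benchmark
spec:
  predictors:
    - name: default
      replicas: 1
      graph:
        name: classifier
        implementation: SKLEARN_SERVER       # which toolkit was used
        modelUri: gs://my-bucket/salary-model # where the serialized model lives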
I'm emphasizing the serialized or pickled models in this talk,
mostly because it's common to see those with serving solutions,
and it's not very common outside of the MLOps space.
The serving stage links into rollout and monitoring. I've talked a little bit already about A/B testing as a rollout strategy.
With that strategy, the traffic during the rollout is split between
different versions, and you monitor that over a period of time until
you've got enough data to be able to decide which is best.
There's a simpler rollout strategy, which also involves splitting the traffic between different versions. With the canary strategy, you split traffic between the live version of a model and a new version that you're evaluating. But typically with a canary, you just have one new model, and you evaluate it over a shorter period of time than with the A/B test. It's more of
a sanity check than an in depth evaluation,
and you just promote if everything looks okay.
Another strategy is shadowing.
With shadowing, all of the traffic goes to both the new and the
old model, but it's only the responses from the old model, the live model, that are used and go back to the consumer. The new model is called the shadow version,
and its responses are just stored. They don't go back to any live consumers.
The reason for doing this is to monitor the shadow and compare it against the
live version, so it makes sense to be storing the shadow's output
for later evaluation.
Serving solutions have some support for rollout strategies. In the
case of Seldon, for example, you can create a Kubernetes manifest
with two sections, one for the main model and one for the canary. The traffic will automatically be split between these two models. By default, Seldon will split traffic evenly between the models. In the manifest, you can set a field against each model called traffic. That field takes a numeric percentage that tells Seldon how much of the traffic each model should get.
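A sketch of how that traffic split can look in the manifest, again with placeholder names and URIs:

# Sketch of splitting traffic between a main model and a canary in a
# SeldonDeployment; the percentages, names, and URIs are illustrative.
spec:
  predictors:
    - name: main
      traffic: 75
      graph:
        name: classifier
        implementation: SKLEARN_SERVER
        modelUri: gs://my-bucket/model-v1
    - name: canary
      traffic: 25
      graph:
        name: classifier
        implementation: SKLEARN_SERVER
        modelUri: gs://my-bucket/model-v2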
Each of these rollout strategies involves gathering metrics on running models. With Seldon, there's out-of-the-box integration available for Prometheus,
and some Grafana dashboards are provided.
These cover general metrics like frequency of requests and latency.
You may also want to monitor for metrics that are specific to your use case,
and there are defined interfaces so that extra metrics can be exposed in
the usual Prometheus way.
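For instance, with Seldon's Python wrapper a model class can return custom metrics alongside its predictions. Treat the details below as a hedged sketch and check the current docs for the exact interface; the model file and metric name are made up.

# Sketch: a model class for Seldon's Python wrapper that exposes a custom
# metric alongside its predictions. Illustrative only.
import pickle

class SalaryModel:
    def __init__(self):
        with open("model.pkl", "rb") as f:
            self.model = pickle.load(f)
        self.last_batch_size = 0

    def predict(self, X, features_names=None):
        self.last_batch_size = len(X)
        return self.model.predict(X)

    def metrics(self):
        # picked up by the wrapper and exposed in the usual Prometheus way
        return [{"type": "GAUGE", "key": "batch_size", "value": self.last_batch_size}]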
I mentioned earlier that in the shadow use case, you might want to be
recording the predictions that the shadow is making so that you can compare its performance
against the live model. This can be handled through
logging all of the requests and responses to a database.
There are other use cases as well where recording all predictions can be
useful. For example, an auditor in a compliance-heavy industry might require a record of every prediction that's been made. In the shadow use case, you'd then use that database to run queries against the data and compare the shadow's performance against
that of the live model.
In the case of Seldon, there's an out-of-the-box integration which provides a way to asynchronously log everything to Elasticsearch
so that everything can then be made available for running queries on later, but without
slowing down the request path of the live models.
This idea of taking the live request and asynchronously sending
it elsewhere can also be useful for some monitoring use cases,
and not just for audit. In particular, there are some advanced monitoring use cases that relate to the data that's coming into the live model and how well it matches the training data.
If the live data doesn't fall within the distribution of the training data,
then you can't be sure that your model will perform well on that data.
Your model is based on patterns from the training data.
So data that doesn't fit that training distribution might have different patterns.
One thing we can do about this is to send all the request data
through to detector components that will look for anything that
might be going wrong so that we can flag those predictions if we need to.
So let's drill a little bit further into what we might need to detect.
One thing we might need to detect is an outlier. This is when
there's the occasional data point which is significantly outside of the training
data distribution, even though most of the data does fall
within the distribution. Sometimes models express
their predictions using a score. So, for example,
classifiers often give a probability of how likely a data point is to
be of a certain class. You might expect the model to give a lower probability on everything when the data points are outliers.
Unfortunately, it doesn't work that way. And for outliers, sometimes models can
give very high probabilities for data points that they're getting completely
wrong. This is called overconfidence. So if
your live data has outliers and your use case has risk associated
with those, then you might want to detect and track outliers.
Depending on your use case, you might choose to make it part of your business
logic, for example, to handle outlier cases differently,
perhaps scheduling a manual review on them.
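Here's a deliberately simplistic sketch of the idea, flagging points far from the training distribution with a z-score check; it assumes X_train is the training feature matrix, and real outlier detectors, for example in libraries like Alibi Detect, are much more sophisticated.

# A very simplistic outlier check: flag incoming points that are many
# standard deviations away from the training data. Assumes X_train exists.
import numpy as np

train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

def is_outlier(x, threshold=4.0):
    z_scores = np.abs((x - train_mean) / train_std)
    return bool(np.any(z_scores > threshold))

# if is_outlier(incoming_request):
#     schedule a manual review instead of acting on the prediction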
Worse than the outlier case is when the whole data distribution is different
from the training data. It can even start
out similar to the training data and then shift over time.
Think, for example, of the fashion recommendation example that we mentioned earlier. It was trained on data from winter, and then you continue using it into the summer. Then it's recommending coats when it should be recommending t-shirts.
If you have a component that knows the distribution of the training data, then you can asynchronously feed all of the live requests into that component. That component will keep a watch, so you can use it to set up notifications in case the distribution shifts. You could then use
that notification to decide if you need to train a new version of the model
using updated data. Or perhaps you've
got other metrics that let you track model performance, and if
they're still showing as good, you might just choose to check those metrics more
frequently while you look more closely into what's happening with live data distribution.
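One simple way to check for this kind of shift on a single feature is a two-sample Kolmogorov-Smirnov test. Dedicated drift detectors, such as those in Alibi Detect, handle the multivariate case properly, so treat this as a sketch with a made-up threshold.

# A simple drift check: compare the distribution of a live feature against
# the training data with a two-sample Kolmogorov-Smirnov test.
from scipy.stats import ks_2samp

def check_drift(training_feature, live_feature, p_threshold=0.01):
    statistic, p_value = ks_2samp(training_feature, live_feature)
    if p_value < p_threshold:
        print("possible drift detected, consider retraining:", p_value)
    return p_value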
These monitoring and prediction quality concerns also feed into
governance for machine learning. It's a big topic and we
can't go into everything in detail, but I want to give an impression
of the area, so I'll at least mention a few things. I've talked about detection for data drift and outliers. Another thing detectors might be applicable for is adversarial attacks.
These are when manipulated data is
fed to a model in order to trick the model. Think, for example,
of how face recognition systems can sometimes be tricked by somebody wearing a mask.
That's a big problem in high security situations,
and there are analogous attacks that have appeared for other use cases.
I also mentioned that in high compliance situations you might
want to record all of the predictions in case you need to review them later.
This can also be relevant for dealing with customer complaints. This relates to the topic of explainability. For example, if you've got a system that makes decisions on whether to approve loans, and you're denying somebody a loan, then you might want to be able to explain why you denied them the loan. You'll want to be able to
revisit exactly what was fed into the model. The explainability part is a data science challenge in itself, but it links into MLOps because you'll need to know what data to
get explanations for and what model was being used to
make the original loan decision.
The topic of explainability also relates to concerns about bias and ethics.
Let's imagine that your model is biased and is unfairly denying
loans to certain groups.
You'll have a better chance of discovering that bias if you can
explain which data points are contributing most towards its decisions.
There's also a big governance question around being able to say exactly what was trained and when. In traditional DevOps, it's a familiar
idea that we'd want to be able to say which version of the software was
running at a given point in time and what code it was built from,
so that we can delve into that code and build it again if we need
to. This can
be much more difficult to achieve with MLOps, as it would also require being able to get access to all the data that was used to train the model, and likely also being able to reproduce all of the transformations that were performed on the data and the parameters that were used in the training run. Even then, there can be elements of randomness in the training process that can scupper reproducibility if you don't plan for them.
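Fixing random seeds is one small piece of that planning, alongside versioning the data, code, and parameters. A sketch of the kind of thing involved:

# Fixing random seeds is one small piece of planning for reproducibility;
# you also need versioned data, code, and training parameters.
import random
import numpy as np

random.seed(42)
np.random.seed(42)
# frameworks usually have their own seeds too, e.g. a scikit-learn
# estimator's random_state parameter or torch.manual_seed in PyTorch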
So let's finish up by summarizing what we've learned.
MLOps is new terrain. ML workflows are more exploratory and data-driven than traditional dev workflows. MLOps enables ML workflows. It provides tools and practices for enabling training runs and experiments that are very data-intensive and which use a lot of compute resources. It also provides facilities for tracking the artifacts produced and the operations on data during those training runs.
There are MLOps tools specifically for serving machine learning models
and specialized strategies for safely rolling out new models for serving
in a production environment. There are also tools and approaches for monitoring models running in a production environment and
checking that model performance stays acceptable, or at least
that you find out if something does go wrong.
So that's my perspective on the field of MLOps, at least as it is right now. Thanks very much for listening.