Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. My name is Melisa, and I'm a developer advocate
here at Iterative.ai. And today
I'm going to talk to you about using reproducible experiments
to create better machine learning models.
So feel free to reach out to me personally on Twitter at @FlippedCoding if you have any questions, or if you want to get in touch with the whole DVC team, feel free to reach out to us on Twitter at @DVCorg.
But to get started, if you want to follow along, or at some point go back and reference the project that I'm going to be using as an example throughout this talk, you'll need a few things installed. So you'll need Python 3. You don't need VS Code, but VS Code does make it a lot easier. And you'll need to fork this repo here, and it'll give you the exact project that you see me show in this presentation.
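If you want to get set up locally, a rough sketch looks something like this; the repo URL is a placeholder for your fork, and the requirements file is an assumption about how the project lists its dependencies:

```bash
# Rough setup sketch; <your-username>/<forked-repo> is a placeholder for your fork
python3 -m venv .venv && source .venv/bin/activate
pip install dvc                           # DVC installs with plain pip
git clone https://github.com/<your-username>/<forked-repo>.git
cd <forked-repo>
pip install -r requirements.txt           # assuming the project ships a requirements.txt
```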
So let's just jump into it. There are a few common
issues when it comes to machine learning projects.
First, we're tuning to find the best combination of
hyperparameter values, algorithms,
data sets, environment configurations.
There's a lot that goes into each of these
models we produce. And every time we get new
requirements, or we get new data, or we get access to
new resources, we still have to come back
to this fundamental thing where we find
this best combination of all of these
different things that go into our model.
So this is fine, right?
We're just going to keep trying out different things.
We'll do our hyperparameter tuning.
We'll read through academic papers
and find the most cutting edge algorithms to practice
with or to try out. And we have to keep track
of all of these changes, because eventually you will find
a model that's really, really good.
Like it'll give you some kind of incredible accuracy,
or you'll notice that the performance is
a lot faster or something. And you need to keep track
of all of the changes you made throughout all of these
different experiments, so that when you get this incredible model,
you can reproduce it. So as
we go through all of these experiments, trying to find
the best combination of hyperparams
and data sets, we have to keep track of each
experiment we run. And the problem with
that is that over time, it gets really hard to
follow those changes. So there's this thing
where we might have hundreds of hyperparameter
values to test out. How do you manually keep track of that many different experiments? And in between the times you're changing hyperparameters, you might take a look at your code and think,
oh, maybe if I just change this one line
here, that might do something different, and then
you get more data from production or
something. So these experiments layer on really fast, sometimes without you even noticing that it's happening. That's why, over time, it's hard to follow those changes that led you to your current best model, the one that actually has that best combination of all of those factors that go into training this model.
So we want to make sure we have some
kind of way to keep track of all of these experiments,
make sure that we know what code,
data and configs were associated with our
model. So when we get ready to deploy to production,
we don't have that weird drop in
accuracy or there's this strange
environment difference that we just couldn't account
for before. When we're able to follow these changes over
time, when you're ready to deploy to production,
it just becomes a lot more consistent, a lot more reliable.
So let's look at how we actually fix these issues.
The first way is just by thinking of each
experiment as its own little bundle.
So an experiment consists of your data
set, any hyperparameters you have,
and maybe you have a model to start with, or you just have
some algorithm you want to test out. But for each experiment you
run, they all have these same
things in common. So as you're adjusting
your parameters, as you're updating your
data set, you want to be able to track
each of those experiments, kind of like you see on the screen here.
So that's why we're going to talk about a little background on hyperparameter tuning before we jump too deep into how we fix the problem.
So with hyperparameter tuning,
we know that hyperparameters are the values
that define the model. If you're working with a neural
net, that means values like the number of layers in your neural net, or if you have a random forest classifier, that would be something like the max depth for that classifier.
So these aren't the things that your model predicts
for you. These are the values that actually build
the model that does the prediction. And there are a couple of common ways to approach hyperparameter tuning, and that's through grid search and random search.
With grid search, you have sets
of values for each of your hyperparameters,
and you go through all of them. So if
there is a best combination of hyperparameter
values, grid search is definitely going to find that
for you because it's testing everything.
And random search is another method that we
use for hyperparameter tuning. And it's similar to
grid search in that you give it sets
of values for each hyperparameter, but the difference is
that it just jumps around random combinations of these values instead of going through each of them very systematically. A lot of times, if you run a random search for about the same amount of time as a grid search, you'll end up getting better combinations of hyperparameter values. And that's
just because random search samples a wider
variety of those values.
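Just to make the two strategies concrete, here's a minimal sketch in shell with made-up hyperparameter names and candidate values; it only prints the combinations each strategy would try:

```bash
# Grid search: walk through every combination of the candidate values
for n_est in 50 100 250; do
  for min_split in 2 8 32; do
    echo "grid trial: n_est=$n_est min_split=$min_split"
  done
done

# Random search: sample combinations at random for a fixed budget of trials
n_values=(50 100 250 500); s_values=(2 8 32 64)
for trial in 1 2 3 4 5; do
  n=${n_values[RANDOM % ${#n_values[@]}]}
  s=${s_values[RANDOM % ${#s_values[@]}]}
  echo "random trial $trial: n_est=$n min_split=$s"
done
```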
So we know our problem is keeping track
of all of these experiments. We know we want to
solve it by making these little
bundles for each experiment. And we know with hyperparameter tuning,
we have a lot of values that we're going
to experiment with. So let's take a look
at DVC, which is a tool that helps us manage all of this experiment tracking. DVC is an open source tool. You can go check out the GitHub
repo. It works on top of git.
So think of DVC as
git for your data. So you're able to
check in your code with git. You're able to check
in your data with DVC, and it works on top
of git so that you're able to bundle your
code, your data, and any hyperparameters
or other environment configs together for
each experiment you run. And the best
part is, it's not opinionated at all.
So to use DVC, you don't actually
need to install any particular libraries.
You just need to initialize DVC in your
git repo and use the commands.
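As a rough sketch of what that looks like, assuming you already have a Git repo and a data file at a made-up path:

```bash
dvc init                                  # creates a .dvc/ directory that Git tracks like normal files
git commit -m "Initialize DVC"
dvc add data/data.xml                     # hypothetical data file; DVC writes a small .dvc pointer file
git add data/data.xml.dvc data/.gitignore
git commit -m "Track the data set with DVC"
```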
That's really it. There are no API calls to slow down your training. I know this comes up a little bit with MLflow, because it makes API calls to their service. But with DVC, there are no API calls. Everything is right there on your local machine. If you
decide to set up some kind of remote environment,
you can use DVC there. It works with AWS,
which is the one that I think most people work with when
they're handling their file storage. But you can use GCP and Azure as well.
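Pointing DVC at a remote is a one-time configuration; a hedged sketch, with made-up bucket and container names:

```bash
# Pick the line that matches your cloud; the paths are placeholders
dvc remote add -d storage s3://my-bucket/dvc-store      # AWS S3
# dvc remote add -d storage gs://my-bucket/dvc-store    # Google Cloud Storage
# dvc remote add -d storage azure://my-container/path   # Azure Blob Storage
dvc push                                                # upload DVC-tracked data to that remote
```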
The main thing that I really like about DVC,
there's a lot of stuff to it, but my favorite thing is experiments.
So every time you run this dvc exp run command,
it takes a snapshot of your code, your data,
and your configurations, and it
stores this as metadata in DVC.
And all of that is attached to whatever model
you produce from this experiment run.
So let's say we've decided to update
a hyperparameter, and we run an experiment with this command. That will be bundled together with our data and everything else we already have in place, tied to the model that we're going to get from that experiment.
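In practice that's a single command; a minimal sketch, assuming a params.yaml with a train section like the example project's:

```bash
# Run one experiment, overriding a hyperparameter on the fly
# (train.n_est is assumed to exist in params.yaml; the value is made up)
dvc exp run --set-param train.n_est=100
```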
So basically, what we're looking at is something like this.
A single experiment has our current data set that we just used to train our model. It has the hyperparameters that we used to train the model. And again, it has a model, maybe if you're working with one from production and you need to do some kind of comparison.
It's all lumped together in this one experiment,
so you don't have to keep some kind of ridiculous spreadsheet
stashed off to the side where you have a link to your GitHub commit for this one hyperparameter value you changed, and then another link to this zip file on Google Drive that you can't change because it was just for this particular experiment, and then another link to some other Git repo that has all of your configurations, because of course that's in a separate repo. You don't have to do that anymore.
It's all right there in DVC,
and you can look at your experiments as you
run them. So when you run dvc exp run, it's going to go through your training script, it's going to look at whatever dependencies you gave it,
and it's just going to run that experiment. And once it's
finished, you can take a look at the results from that experiment
and decide which way you want to go from there.
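Those dependencies live in the project's dvc.yaml pipeline file. One hedged way to declare a training stage like that, with file paths and parameter names that are only placeholders for this project's actual layout:

```bash
# Declare a training stage so DVC knows its code, data, params, and outputs
dvc stage add -n train \
  -d src/train.py -d data/features \
  -p train.n_est,train.min_split \
  -o model.pkl -M metrics.json \
  python src/train.py
```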
So in this example, we have some experiment
we run. It has an average precision and this ROC AUC value, and a couple of hyperparameters.
So the average precision looks really good,
but we want to do some more hyperparameter tuning
because we think we can get something better. And what we'll do, which is pretty common, is set up a queue of experiments. So we have a bunch of different hyperparameter values that you can see over here under our train.n_est and our train.min_split columns.
These are all the different hyperparameter values
that we want to test out for this
particular project. So we've queued
up these experiments in DVC, and you'll see they have their own IDs associated with them, and they are in the queued state. We don't have any results yet
because these haven't run. One really big advantage
of queuing experiments like this is not only can
you see the values before you run the experiments,
you're also able to push these
experiments off to some cloud environment. If you
want to run them on a different server, use a
GPU or some other resources. So we have
those experiments queued and now we're going
to run them all. So we'll use this exp run command, and then we'll use this dvc exp show command
to take a look at the results from all of
those queued experiments. Now you
can see all of the different experiments
that were run with all of the different hyperparameter
values. And you can see all of the results.
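Putting those last few steps together, the workflow is roughly this; the parameter names and values are placeholders, and on newer DVC versions the run-all step is `dvc queue start` instead:

```bash
# Queue a few combinations without running them yet
dvc exp run --queue -S train.n_est=100 -S train.min_split=8
dvc exp run --queue -S train.n_est=250 -S train.min_split=32

# Run everything in the queue, then compare results in one table
dvc exp run --run-all
dvc exp show
```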
Now take a second and imagine if you had to document all of this manually. So you have to go back in that spreadsheet or in some kind of document and manually say, when I had an n_est of 26 and a min_split of 355, these were my outputs. Then you have to attach the data somewhere. You have to attach the code somewhere. Now you don't have to worry about how everything is being tracked.
All you have to do is take a look at this table,
decide which experiments you want to keep working with,
which experiments you want to share with other people.
And you already have these results here just
to share and look at at any time.
No more managing all of those things separately.
So let's say that you want
to make some kind of plot to compare a couple of
experiments, because you see some that just have some
interesting results. Maybe you want to get a second opinion
from somebody else on the team. So we have
this dvc plots command. You can take the experiment IDs you want to compare, and it generates this
plot for you based on the parameters
you define. So all of this data is coming
from either some kind of JSON file or TSV that's generated, typically within your training script. So DVC isn't adding anything new, it's just using the information that you provided it with. But you're able
to quickly generate these plots. And again, I want you
to think about: if you had run that many experiments and you wanted to create a plot like this,
how much effort would it take to actually do that?
I'm pretty sure it would take a little bit more than just one command.
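That one command is roughly this; the experiment IDs are placeholders you'd copy out of the dvc exp show table, and the plot source is whatever file your training script writes:

```bash
# Compare the plots from two experiments side by side
dvc plots diff exp-1a2b3 exp-4c5d6
```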
And yeah, it's just easy
to use some of the tools that are already built to handle
this stuff for us, but we're not done.
We have more hyperparameter values to try out, of course.
So we're going to queue up a few more experiments. This time, let's just say we're doing a random search, so you'll see the values jumping around. These aren't combinations we've tried before. Let's just see what we get.
So we'll go ahead and run all of the experiments we had
queued up, and we'll take a look at our table again.
And now you see these new values.
Well, it still looks like one of our earlier experiments
gave us better results. But we might
not need to use numbers quite as big as we
did in the first experiment. So just being
able to quickly look at these metrics shows you
which direction you should take your hyperparameter tuning.
Or maybe it tells you that it's time to try
a different algorithm, or maybe it's
time to try a different data set, or you need to slice up your data set differently, or you need new data points. But whatever your next step will be, this is a very quick and easy way to see how your experiments should guide your model training.
And of course, we have to do hyperparameter tuning
one more time, because we have all
of these different experiments to run. We have all of
these different values to try, and it's
not uncommon to run hundreds of experiments in a day for
a machine learning engineer or a data scientist.
So we're going to queue up a few more experiments to see
maybe how low we can get those values. Or maybe
we just have another theory we want to test out from showing our
results to somebody else on the team.
And again, we'll run our experiments and look at the
table, and we see some promise.
So this one looks a little bit better than the previous one,
and these values are definitely a lot smaller.
So maybe we're getting a better feel for the
range of the hyperparameter values, or maybe
which hyperparameter values are the most important.
So with something like DVC, you're just able to
do this kind of whimsical
experimentation without worrying about taking
notes every 2 seconds. You're able to focus
on finding that good model instead of
having this eureka moment and no idea how
to get back to it.
And just to make sure that we
are not crazy and we're looking at our values
correctly, we might take another look at some plots just
to see if these experiments are going in the
direction we think they should. So you might share these with somebody
else on the team. They might just be for you to get
a range of what you should be expecting or what you
should do next. Either way, DVC just makes it
easy to do that and play around with your metrics in whatever
way you need to. So these are all
of the experiments that we've run over the course of this
talk. And actually, these aren't even all of
them. These are a few of them just from this table.
So, as you can see, there are a lot of experiments that we ran really fast, and we didn't have to keep track of all of it. You can
see right here in the table that DVC has
tracked every hyperparameter combination we've
run, and we don't have to worry about it.
With each experiment, it's taken a snapshot of the
code and the data set that we have associated
with it, and it's created those little bundles with our hyperparameters,
our data, and our code to associate
with each model produced by each experiment.
And basically what that means is, if you wanted to go back and redo any of these experiments, all you need is the experiment ID over here and a few DVC commands, and you have an exact reproduction of the conditions that led up to the model, that awesome thing you need to get out to production right now.
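As a sketch of those few commands, with a placeholder experiment ID taken from the dvc exp show table:

```bash
# Restore that experiment's exact code, params, and data in your workspace
dvc exp apply exp-1a2b3

# Or turn it into a Git branch so it's easy to review, share, or deploy
dvc exp branch exp-1a2b3 best-model
```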
But I hope that you see just how we can solve some of those problems that are in the machine learning community,
and how we can use tools that already exist
to do this heavy lifting for us. Please don't use
spreadsheets to keep up with your machine
learning experiments when we have stuff that'll do it for
you now. But there are a few key takeaways that
I hope you get from this. First,
adding reproducibility to your experiments is important. When it's time to deploy your model to production, you want to make sure that you have the exact same accuracy and the exact same metrics in production that you had while you were testing, so that there isn't any weirdness happening and you need to roll everything back.
And DVC is just one of the tools that helps you track every part
of your experiments. Of course, there's still MLflow and some others in the MLOps area,
but you always want to make sure you have some kind
of tool that's tracking every part of your experiments.
DVC is, at least from what I've seen around, probably the best one, just because it tracks your data changes too. So when you're
dealing with data drift in production,
you still have the exact copies
of those data sets before the drift happened.
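Getting back to one of those older copies is just a checkout; a hedged sketch, where the revision is a placeholder for whatever commit or experiment the pre-drift data lives in:

```bash
# Jump back to the revision that had the pre-drift data, then restore the matching files
git checkout <old-commit-or-experiment-branch>
dvc checkout
```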
So if there's any research you need to do, you can check it out. If there's anything you want to go back to and reference in your model, you can check it out. DVC just does all of this for you. And then, don't be
afraid to try new tools. I know a lot of
times those of us who write code feel a
need to build our own tools for every issue that
pops up. You don't have to do that. It's not cheating to use tools that are already there for you. It makes you faster, it makes it easier for you to have an impact, and it takes a lot of stress off when there are already tools out there.
Even if you spend an hour or two and it's not quite
what you're looking for, it's at least good to know that they exist
just in case something pops up and you need it later.
And I want to leave you with a few resources, so if
you're interested in DVC, make sure to check out our
docs. We have a very active Discord channel, so if you want to drop in, ask some questions, say what's up to the MLOps community, feel free to do that. And if you want to see a more GUI-type version of DVC, head to DVC Studio here at studio.iterative.ai and check it out. Hook up your GitHub repo and start
running experiments. And of course,
if you want to see these slides, you can go to my speaker
deck link here and download them and
get whatever you need. So thank you and
I hope that this talk was useful for you.