Transcript
Hello everyone. My name is Tommy Fernandez. I'm a technical writer at Semaphore, and this talk is based on a workshop we did in Belgrade about machine learning, DevOps, and MLOps. Since we don't have as much time here, I will assume you know the basics of machine learning and cover the tools and practices you need to take a model from Jupyter notebooks and deploy it to production in a safe and automated way. So why use MLOps?
First, because automation means less work for us; we can do more in less time. Automation also means we can scale our work: take the same workload and use bigger machines without additional effort. It also provides consistency, because we track everything that goes into the model; we know exactly what data points, scripts, and parameters were used to train it. And finally, we have traceability: we know exactly which datasets went into our training and fine-tuning.
Let's start with a quick overview of the machine learning model we are going to use. This is on kaggle.com; I will leave a link in the slides so you can check out the code for yourself. We're using ResNet-34, a convolutional neural network, and using the Oxford-IIIT Pet dataset to fine-tune this model to recognize cats and dogs.
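As a rough sketch of what that notebook does (assuming the fastai library, which is the usual way ResNet-34 is fine-tuned on this dataset; the exact code is in the linked notebook):

```python
# minimal fine-tuning sketch, assuming fastai; the real notebook differs in details
from fastai.vision.all import *

path = untar_data(URLs.PETS)  # download the Oxford-IIIT Pet dataset

# in this dataset, cat breeds have capitalized file names
def is_cat(fname): return fname[0].isupper()

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path / "images"),
    valid_pct=0.2, seed=42, label_func=is_cat, item_tfms=Resize(224))

learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)  # fine-tune the pre-trained network on cats vs. dogs
```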
I don't want to spend a lot of time explaining the code, because you probably know it very well. The main problem I think Jupyter notebooks have is that they work like Excel: they work great on your machine. They are a great way to experiment, build, and explore data. But when you need to deploy the model to the general public, it's not feasible to do it with Jupyter notebooks. So we're going to take everything we have here, put it in pure Python, and train the model using continuous integration.
So this is the project. I will leave a link to the repository in the slides. This project has all the Python code we need to fine-tune the model, test it, and deploy it. The code is the same as what we found in the Jupyter notebook, with some small modifications.
The first thing we need is to download the training dataset. I'm doing that here in a terminal, and this is the first problem we encounter when applying DevOps practices to machine learning workflows: the data is big, about 800 megabytes, and we can't use something like Git to track it. In theory we can, because Git supports large files, but we would quickly reach the maximum amount of data allowed. So we need a different alternative.
To manage the data, and later to create the machine learning pipelines, we're using a tool called DVC. DVC is an open source tool. I don't work for DVC and this is not an endorsement; it's just a tool that I find useful. It's useful because it lets me track datasets in Git without actually having to upload the files into Git: it uses hashes and special pointer files to track what data goes into the model. DVC is available for macOS, Windows, and Linux, and you can install it as a command-line tool.
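For instance, one of the several install options (the S3 extra is only needed if you use S3 remote storage, as we do later):

```sh
# install DVC as a command-line tool; the [s3] extra adds S3 remote support
pip install "dvc[s3]"
```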
As you can see, we have the file here. This is a Git repository, so Visual Studio Code is going to mark this file as pending upload. The problem is, as I said, we don't want to upload this big file into the Git repository. So instead we're going to run dvc add on the file. Let's execute this.
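The command looks like this (the file name is an assumption based on what we see later in the pipeline):

```sh
# track the big dataset with DVC instead of Git
dvc add images.tar.gz
```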
This is going to do a number of things. It creates a new file with the .dvc extension, which contains the hash of the original file, its size, and the path to the original file with the data. It has also updated .gitignore, adding the data file to it, so the file is no longer going to be pushed to the repository; it will be ignored.
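The generated pointer file looks roughly like this (the hash and size are placeholders):

```yaml
# images.tar.gz.dvc, a small pointer file that is safe to commit
outs:
- md5: a1b2c3d4e5f6...   # hash of the real file
  size: 812345678
  path: images.tar.gz
```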
The other thing it does is create a cache directory inside the .dvc folder. This folder is not to be checked into the repository; it's also ignored. What DVC does is move the original file into this cache directory and then link it back to its original location. It will use either reflinks, hard links, or symlinks, depending on the file system on your computer. In the case of macOS it uses reflinks, meaning that both entries point to the same part of the disk, so the file is not duplicated. Now what we need to do is check in the .gitignore and the .dvc file. Once we push this, we are tracking in our repo which data we are using in our training.
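Committing the pointer is just regular Git (the file name is again an assumption):

```sh
# commit the small pointer file, not the data itself
git add images.tar.gz.dvc .gitignore
git commit -m "Track training dataset with DVC"
git push
```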
So let's pause and check how DVC works. It follows the Git syntax and workflow; it ties into the Git way of working. Each time you run dvc add, it hashes the file, moves it into the cache, and creates that .dvc file as a pointer to the original file. And when we do a dvc checkout, DVC pulls that file from the cache and links it into our working directory. So we can have different branches in our Git repository, each with different .dvc files pointing to different datasets. All the datasets end up stored in the cache, and with dvc checkout we pull the correct files from the cache every time.
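In practice that looks like this (the branch name is hypothetical):

```sh
# each branch carries its own .dvc pointer files
git checkout experiment-v2   # switch to a branch with different pointers
dvc checkout                 # relink the matching data out of .dvc/cache
```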
Another cool feature of DVC is machine learning pipelines.
Pipelines are like Makefiles for machine learning. They are versioned with Git, so the whole process to build and train the model is stored and tracked there. And all the results, all the intermediate files, all the models, all the transformed datasets are stored and cached. DVC keeps track of all the changes and reuses intermediate files as needed. This is the syntax: to add a stage we give it a name, which is arbitrary; the dependencies, which are the input files (they can be source code files or data files); and the outputs. We can also store metrics as a separate entity. And finally we have the command that runs; in this case it's a Python program that cleans up the input data. The stages are stored in a file called dvc.yaml, which can be checked into Git; it tracks all the stages, their inputs and outputs, and computes the dependency graph automatically.
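For example, a hypothetical cleanup stage like the one just described could be added like this (the file names are illustrative):

```sh
# add a pipeline stage: name, dependencies, outputs, and the command to run
dvc stage add -n clean \
  -d src/clean.py -d data/raw.csv \
  -o data/clean.csv \
  python src/clean.py data/raw.csv data/clean.csv
```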
Okay, let's see pipelines in action. Instead of tracking the image tarball, I'm going to track the output files. What I have here is a prepare script, which basically unpacks the tarball into separate images. I'll start by removing the tarball from the DVC cache: we remove the .dvc file which tracks it, and this updates .gitignore. Now the file is removed from the cache, and since we still don't want Git to track it, we add it back to .gitignore. Next, I'm going to add a stage to run the prepare script. The input is images.tar.gz, so I put that as a dependency here. The output is the data/images directory, which will contain the individual images, and we call the script that unpacks. This is the command that takes this input and creates these outputs.
This step added data/images to .gitignore, so the files inside that folder are not tracked by Git, and it created a new file called dvc.yaml. As we can see here, we have the name of the stage, the command we run, the inputs, and the outputs.
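The resulting dvc.yaml would look something like this (the script name and paths are assumptions):

```yaml
# dvc.yaml, the pipeline definition, tracked with Git
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - src/prepare.py
    - images.tar.gz
    outs:
    - data/images
```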
This dvc.yaml file is tracked with Git, so we should add it to the repository. To run the pipeline, we run dvc repro and it detects what is missing. We only have one stage, and DVC sees there are no images in the images folder, so it runs the script. The script unpacks the tarball, and once it's done, DVC caches every one of these images: they are stored in the DVC cache and linked back into my working directory. So now we are tracking these individual image files used for training. That does it. The other thing that happened is that DVC created a file called dvc.lock. This file saves the output of dvc repro, so we know the final state of running this script: we have the number of files and the total size. It also tracks the hash of the script, so if we modify the prepare script, DVC will rerun this stage to confirm that everything is okay.
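A sketch of what dvc.lock records after the run (hashes and sizes are placeholders):

```yaml
# dvc.lock, the recorded state of the last dvc repro
schema: '2.0'
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - path: src/prepare.py
      md5: 0f1e2d3c...          # hash of the script
    - path: images.tar.gz
      md5: a1b2c3d4...
      size: 812345678
    outs:
    - path: data/images
      md5: 9f8e7d6c....dir      # directory hash
      size: 790000000
      nfiles: 7390
```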
We can run dvc repro again, and this time it does nothing, because nothing has changed: the input script is the same, the output files are all the same, and we haven't changed anything. Now, what happens if we change the output, if we delete one of these files? If we run dvc repro again, it finds that some files are missing and pulls them from the cache: it automatically checks out the output from the last run, and the files are recreated. As you can see, we have the same files I had deleted; they were stored in the cache and are now relinked into my working directory.
Now let's add another stage. This time it's a train stage that fine-tunes a convolutional neural network. This is the same code we find in the Jupyter notebook: it pulls a pre-trained network and uses fine-tuning to categorize the input images as cats or dogs. The script also outputs a few graphics: the confusion matrix, the top losses, and the fine-tune results. These are all plots that evaluate the error of the model. In order to add this stage, we're going to call dvc stage add and call this stage train. As inputs we have the train script and the images in the images folder, and the outputs are two files that are the models. We can supply the plots as plain outputs, or with the plots keyword; the latter treats them differently, because it lets DVC know these are things we can compare across different iterations of our training. So if you have different trainings, you can compare the plots produced with different training data and parameters. Plots are usually images, and we can also use the metrics keyword to add files like CSV or TXT files and compare those benchmarks across different runs. And finally, we call the train script to take these inputs and create these outputs.
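Put together, the command might look like this (the paths and file names are assumptions):

```sh
# the train stage: models as outputs, plus plots and metrics to compare runs
dvc stage add -n train \
  -d src/train.py -d data/images \
  -o models/model.pkl -o models/model.pth \
  --plots metrics/confusion_matrix.png \
  --plots metrics/top_losses.png \
  -M metrics/summary.csv \
  python src/train.py
```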
We can see that dvc.yaml has been modified: the new stage is here with all its inputs and outputs. To run this stage, we run dvc repro. It skips the prepare stage because nothing has changed there. This process will take some time, so I will speed up the recording. So, the whole process took about 15 minutes. I'm running this on my laptop, which is not the best machine for the task, but hopefully you're using a more powerful machine. Let's check dvc.lock. We see new entries here due to the train stage: the outputs, which are the plots, and the models, located in the model directory. Because I marked these files as outputs in my stage, they are also ignored, so they won't be uploaded to GitHub. The same goes for the files in the metrics folder, the images that our training generated. Now, if I run dvc repro again, it skips both stages because nothing has changed. Let's try deleting some files.
We can delete one of the plots, and let's also delete the model files. If we run dvc repro again, it finds that these files are missing and pulls them from the cache. So here they are again; they are safe in our DVC cache.
To finish, we have the test file. The test file loads some images from Wikipedia, loads the model, and tries to run the prediction. Let's add the test stage. We're going to call it test. The inputs are the test file and the model files, and there are no outputs. The idea is that the test file returns an exit code different from zero when there's an error, and exits with a zero status code when it passes. So running dvc repro again will only run the test. We don't have any output, but we can check the status code, which is zero, as expected. The stages are shown here in dvc.yaml.
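The test stage might be added like this (file names assumed); with no declared outputs, DVC relies on the command's exit code:

```sh
# a test stage with no outputs: pass/fail comes from the exit code
dvc stage add -n test \
  -d src/test.py -d models/model.pkl \
  python src/test.py
```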
We can also visualize the stages by calling dvc dag. It creates a graph with all the stages and dependencies. And we can also find here in dvc.lock the inputs and the outputs, all hashed.
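The graph output is ASCII art, roughly like this:

```
$ dvc dag
+---------+
| prepare |
+---------+
     *
+-------+
| train |
+-------+
     *
+------+
| test |
+------+
```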
So we also want to go here to our repository. We're going to add dvc.lock, which is tracked in Git, dvc.yaml, and the .gitignore files. The files we deleted don't need to be checked in, and this one change that I made to the prepare script is superfluous, so we can undo it. So that's it: we have tracked our whole process. The data that went into the training script, the outputs, the models, and the results of the test are all tracked using metadata in our Git repository.
Now let's see how we can run this application. We have an application file here; we are using the Streamlit library for that, and it's a very easy way to quickly run a model in a browser. The st namespace is from Streamlit, and we are using different methods here: one to set a title, one to create a widget to upload images. This widget shows the image on screen, and we have a button to run the prediction. The button calls make_prediction, which loads the model and returns the probability. The model returns true or false: if it's true, it's a cat; if it's false, it's a dog. So it's going to show me that message. To run this model, we call streamlit run on the file. Now the application is running on my machine.
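A minimal sketch of such an app (the helper names and model path are assumptions, not the exact code from the repo):

```python
# app.py, a minimal Streamlit front end for the model, assuming fastai
import streamlit as st
from fastai.vision.all import load_learner, PILImage

# the label function used at training time must be importable for load_learner
def is_cat(fname): return fname[0].isupper()

st.title("Cat or dog?")

uploaded = st.file_uploader("Upload a picture", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    st.image(uploaded)  # show the uploaded image on screen
    if st.button("Predict"):
        learn = load_learner("models/model.pkl")  # load the fine-tuned model
        pred, _, probs = learn.predict(PILImage.create(uploaded))
        # the label function returned True for cats, False for dogs
        label = "cat" if pred == "True" else "dog"
        st.write(f"I'm {probs.max():.0%} certain that's a {label}.")
```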
Let's try it by uploading one picture of my cat here (she has just woken up), and let's try the prediction. It's 99% certain that's a cat, and I think that's right.
One other thing we may want to do is put this model inside a Docker container, and here we have the basic Dockerfile to do this. We start from a Python 3.10 container, add an application user so we don't run the application as root, copy the requirements and install them, and then copy what we need: basically the models and the source file. Then we run the application with Streamlit.
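A sketch of that Dockerfile (the base image tag and paths are assumptions):

```dockerfile
FROM python:3.10-slim

# install dependencies first so Docker can cache this layer
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# create a non-root user to run the application
RUN useradd --create-home appuser
USER appuser
WORKDIR /home/appuser

# copy the trained model and the app source
COPY --chown=appuser models/ models/
COPY --chown=appuser app.py .

CMD ["streamlit", "run", "app.py"]
```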
Now that this step is complete and we have committed all the files, what if we want to share this cache with my colleagues, with other team members? This is where remote storage comes in. DVC supports remote storage: we add a remote, in the same vein as Git, using a similar syntax, and then we can push the files into that remote storage, and other people can connect to it. That way we have a common cache for everyone on the team, and remember, it holds all the versions, all the iterations that everyone has produced during their work, all in one place. DVC supports several cloud providers by default, and you can also bring your own: if you have a server, you can connect via SSH or other protocols. In my case I will use AWS and S3 buckets. I have created an S3 bucket just for storing this example.
The syntax to add remotes is very similar. We type dvc remote add and a name; in this case I'm going to call it origin, just to keep the Git convention, but it can be anything. And then, since I'm using S3, I prepend the bucket name with s3://. Once we've added the remote, we run dvc remote default with the name of the remote, which I called origin, and this makes it the default. And now we can push the files.
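For example (the bucket name is a placeholder):

```sh
# configure an S3 bucket as the default DVC remote, then upload the cache
dvc remote add origin s3://my-mlops-example-bucket
dvc remote default origin
dvc push
```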
If we run dvc push, it connects to the S3 bucket, sees what's missing, and pushes the changes. In this case the bucket is empty, so it will push everything we have into the remote cache; it's starting to do that right now. Once we have everything in the remote, we can share our work with other people. They only need to add the remote to the repository and then run dvc pull, and this pulls all the changes into their local file system.
Here we can see the complete workflow. We have our code and our pointers to the files in our repository (GitHub, Bitbucket, GitLab, any Git provider) and we run git pull. This pulls everything that preserves state: the code, the pointers, the hashes, the .dvc files, the dvc.yaml. Then we run dvc pull. This connects to the remote storage and pulls all the big files stored there into our cache; any changes that were made are also synced with our local cache. Then we run dvc repro, which runs our training, fine-tuning, testing, everything we want; we can try different parameters. Then we commit all the changes. This captures any changes we made to the code and all the new references to the new outputs, models, plots, metrics, everything that is stored in our local cache, and git push stores all those references in Git. And then when we run dvc push, it actually pushes a copy of our cache into the remote storage.
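Condensed, the loop looks like this:

```sh
git pull       # code plus the DVC pointer files (dvc.yaml, dvc.lock, *.dvc)
dvc pull       # fetch the big files from remote storage into the local cache
dvc repro      # rerun whatever stages changed (prepare, train, test)
git add -A
git commit -m "New training run"
git push       # share the new pointers and pipeline state
dvc push       # share the new data, models, and metrics
```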
Now that we have everything in a remote repository and the code in Git, we can set up continuous integration. You can use any continuous integration product; I'm going to use Semaphore because I work for Semaphore and it's the tool I know best. So here we have our workflow editor, which lets us configure our commands. First we open the pipeline and select one of the available machines. And here are the commands that run before each of my jobs: we set the Python version, install DVC, install the Python dependencies, and pull everything from the cache; the checkout command pulls the code from Git. Then, if we go to the train step, we run dvc repro train. This shows the changes to the lock file in the logs, and then we push the new models into the DVC cache. The test job uses dvc repro test; this is the only command in that job. Remember that we put dvc pull among the common commands of the pipeline, the ones that run before any of the jobs. So basically this job pulls the models and runs the test stage.
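A sketch of that training pipeline in Semaphore's YAML (the machine type, versions, and paths are assumptions):

```yaml
# .semaphore/semaphore.yml, a minimal training pipeline
version: v1.0
name: Train and test model
agent:
  machine:
    type: e1-standard-4
    os_image: ubuntu2004
blocks:
  - name: Train
    task:
      prologue:
        commands:            # run before every job in the block
          - sem-version python 3.10
          - checkout         # pull the code from git
          - pip install -r requirements.txt
          - dvc pull         # pull data and models from the remote cache
      jobs:
        - name: Train model
          commands:
            - dvc repro train
            - dvc push       # push the new model into the remote cache
        - name: Run tests
          commands:
            - dvc repro test
```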
Then we have two continuous delivery pipelines: one for Docker, which builds the Docker image and pushes it to Docker Hub, and a second pipeline that deploys the model to a Hugging Face Space using Streamlit. This one checks out the code, pulls the cache with all the models, and runs a deploy script. We provide environment variables: one is the address of the Space, and the other is a private SSH key that we use to push the changes into the Hugging Face Space. Hugging Face uses Git and Git LFS to support large files, so basically this job joins everything into one repository and pushes the bundle over SSH into the Hugging Face repository. You can check the code in detail in the repository that I'm going to share in the slides; you can download it and you will find links to blog posts, to the source code, and to the pipeline, so you can replicate it in your CI/CD system.
So once I run the continuous integration pipeline and the continuous deployment pipeline, I have the application running on Hugging Face. This is, by the way, free: you can host your models on Hugging Face for free using different frameworks; I'm using Streamlit, which is supported by Hugging Face. Now let me upload the picture again just to make sure it's working the same as before. And yeah, this time it's even more certain it's a cat. So this is one way we can deploy quickly, and it's all automated: we don't need to deploy manually, we just push our changes to the Git repository and let the CI/CD system take over, train, and deploy for us.
So that's all I have. Thank you for watching this talk. I hope it helps you incorporate these tools and practices into your ML workflows. If you want to contact me, here's my contact information; I will be happy to talk to you. Thank you for watching, and have a nice conference. Thank you.