Conf42 Machine Learning 2024 - Online

MLOps: From Jupyter to Production


Abstract

Taking a model from Jupyter to a production application is not a trivial task. In this short talk, we cover the whole process, starting from a Jupyter Notebook and finishing with a model deployed on HuggingFace.co. We'll be using open-source and free services only.

Summary

  • Tomas Fernandez: how to take a model from Jupyter notebooks to production in a safe and automated way. Why use MLOps? First, because automation means there's less work for us to do. It also provides consistency, because we track everything that goes into the model.
  • So I'm going to run the script. The script unpacks the archive, and once it's unpacked, DVC caches every one of these images. They are stored in the DVC cache and then linked back into my working directory.
  • We're going to add a train stage that will fine-tune a convolutional neural network. The test file loads some images from Wikipedia, loads the model, and tries to run the prediction. All of this is tracked using metadata in our Git repository. The whole process took about 15 minutes.
  • We can use any continuous integration product. We just need to push our changes into the Git repository and let the CI/CD system take over, train, and deploy for us. Once I run the continuous integration pipeline and the continuous deployment pipeline, I have the application running on Hugging Face.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Tomas Fernandez. I'm a technical writer at Semaphore, and this talk is based on a workshop we did in Belgrade about machine learning, DevOps, and MLOps. Since we don't have as much time, I will assume you know the basics of machine learning and cover all the tools and practices you need to take a model from Jupyter notebooks and deploy it to production in a safe and automated way. So why use MLOps? First, because automation means there's less work for us to do. We can do more in less time. Automation also means we can scale our work: take the same workload and use bigger machines without additional effort. It also provides consistency, because we track everything that goes into the model. We can know exactly what data points were used for training and what scripts and parameters were used to train the model. And finally, we have traceability: we know exactly what datasets went into our training and fine-tuning.

Let's start with a quick overview of the machine learning model we are going to use. This is on Kaggle.com; I will leave a link in the slides so you can check out the code for yourself. We're using ResNet-34, a convolutional neural network, and the Oxford Pets dataset to fine-tune this model to recognize cats and dogs. I don't want to spend a lot of time here explaining the code, because you probably know this very well. The main problem I think Jupyter notebooks have is that they work like Excel: they work great on your machine, and they are a great way to experiment, build, and explore data. But when you need to deploy the model to the general public, it's not feasible to do it with Jupyter notebooks. So we're going to take everything we have here, put it in pure Python, and train the model using continuous integration.

So this is the project. I will leave a link to the repository in the slides. This project has all the Python code we need to fine-tune the model, test it, and deploy it. The code is the same we found in the Jupyter notebook, with some small modifications. The first thing we need is to download the training dataset. I'm doing that here in a terminal, and this is the first problem we encounter when using DevOps practices on machine learning workflows: the data is big, about 800 megabytes, and we can't use something like Git to track it. In theory we could, because Git has support for large files, but we would reach the maximum amount of data very easily. So we need a different alternative. To manage the data, and to later create the machine learning pipelines, we're using a tool called DVC.

DVC is an open-source tool. I don't work for DVC; this is not an endorsement, it's just a tool that I find useful. And it's useful because it lets me track the datasets in Git without actually having to upload the files into Git. It uses hashes and special pointer files to track what data goes into the model. DVC is available for macOS, Windows, and Linux, and you install it as a command-line tool. As you can see, we have the data file here. This is a Git repository, so Visual Studio Code is going to mark this file as pending to upload. The problem is, as I said, we don't want to upload this big file into the Git repository. So instead we're going to run dvc add on the file. We are going to execute this, and it's going to do a number of things. It's going to create a new file with the .dvc extension.
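As a rough sketch of those commands (the archive name images.tar.gz is an assumption for illustration, not taken from the repository):

```sh
# Track the ~800 MB archive with DVC instead of Git.
# "images.tar.gz" is an assumed file name.
dvc add images.tar.gz

# dvc add writes a small pointer file (images.tar.gz.dvc) and updates
# .gitignore; we commit those instead of the data itself.
git add images.tar.gz.dvc .gitignore
git commit -m "Track training data with DVC"
```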
And this file contains the hash of the original file, the size, and the path to the original data file. It also updated .gitignore and added the file there, so it's no longer going to be pushed to the repository; it's going to be ignored. The other thing it does is create a cache directory inside the .dvc folder. This folder is not checked into the repository either; it's also ignored. What DVC does is move the original file into this cache directory and then link it back to its original location. It's going to use either reflinks, hard links, or symlinks, depending on the file system on your computer. In the case of macOS, it's going to use reflinks, meaning both file entries point to the same part of the disk, so the file is not duplicated. Now what we need to do is check in the .gitignore and the .dvc file. Once we push this, we track in our repo which data we are using in our training.

So let's pause and check how DVC works. It follows the Git syntax and workflow; it ties into the Git way of working. Each time you run dvc add, it will hash the file, move it into the cache, and create that .dvc file as a pointer to the original file. And when we do a dvc checkout, DVC will pull that file from the cache and link it into our working directory. So we can have different branches in our Git repository, each with different .dvc files pointing to different datasets. All the datasets will be stored in the cache, and with dvc checkout we pull the correct files from the cache every time.

Another cool feature of DVC is machine learning pipelines. Pipelines are like Makefiles for machine learning. They are versioned with Git, so the whole process to build and train the model is stored and tracked there. And all the results, all the intermediate files, the models, and the transformed datasets are stored and cached. DVC keeps track of all the changes and reuses intermediate files as needed. This is the syntax to add a stage: we put a name, which is arbitrary; the dependencies, which are the input files (they can be source code files or data files); and the outputs. We can also store metrics as a separate entity. And finally we have the command that runs, in this case a Python program that cleans up the input data. The pipeline is stored in a file called dvc.yaml, which can be tracked with Git; it records all the stages with their inputs and outputs and builds the dependency graph automatically.

Okay, let's see pipelines in action. Instead of tracking the image tarball, I'm going to track the output of these files. What I have here is a prepare script which basically unpacks the tarball into separate images. I'll start by removing the tarball from the DVC cache: we remove the .dvc file which tracks the archive, and this updates .gitignore, so the file is no longer in the cache. And since we don't want to track this file directly anymore, I'm going to add a stage that runs the prepare script. The input is the images tarball, so I put that as a dependency here. The output is this directory, data/images, which will contain the individual images, and we call the script to unpack the archive. This is the command that will take this input and create these outputs.
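A rough sketch of that stage definition; the script and directory names are assumptions based on the description, not taken from the repository:

```sh
# Stop tracking the tarball directly; the prepare stage will own it instead.
dvc remove images.tar.gz.dvc

# Define the "prepare" stage: unpack the tarball into a directory of images.
# "prepare.py" and "data/images" are assumed names for illustration.
dvc stage add -n prepare \
  -d prepare.py -d images.tar.gz \
  -o data/images \
  python prepare.py
```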
So this step added images to .gitignore, so the files inside these folders are not tracked by Git, and it created a new file called dvc.yaml. As we can see here, we have the name of the stage, the command that we run, the inputs, and the output. This dvc.yaml file will be tracked with Git, so we should add it to the repository. To run the pipeline we run dvc repro, and it will detect what is missing. We only have one stage, and it sees there are no images in the images folder, so it runs the script. The script unpacks the archive, and once it's unpacked, DVC caches every one of these images, which are located here: they are stored in the DVC cache and linked back into my working directory. So now we are tracking these individual image files used for training.

The other thing that happened is that DVC created this file, dvc.lock. This file saves the output of dvc repro, so we know the final state of running the script: the number of files and the total size. It also tracks the hash of the script, so if we modify the prepare script, it will rerun this stage. To confirm that everything is okay, we can run dvc repro again, and this time it will not do anything, because nothing has changed: the input script is the same and the output files are all the same; we haven't changed them. Now, what happens if we change the output, if we delete one of these files? If we run dvc repro again, it will find that there are some missing files and pull them from the cache. It automatically checks out the output from the last run, and the files are recreated. As you can see, we have the same files I had deleted: they were stored in the cache and are now relinked into my working directory.

Now let's add another stage. This time we're going to add a train stage that will fine-tune a convolutional neural network. This is the same code we find in the Jupyter notebook: it pulls a pre-trained network and uses fine-tuning to categorize input images as cats or dogs. This script also outputs a few graphics: the confusion matrix, the top losses, and the fine-tune results. These are all plots that evaluate the error of the model. In order to add this stage, we call dvc stage add and call the stage train. As inputs we have the train script and the images in the images folder, and the outputs are two files that are the models. We can supply the plots as plain outputs or with the plots keyword; this treats the outputs differently, because it lets DVC know that these are things we can compare across different iterations of our training. So if you have different trainings, you can compare the plots across different training data and parameters. Plots are usually images, and we can also use the metrics keyword to add files like CSV or TXT files and compare those benchmarks across runs. And finally we call the train script to take these inputs and create these outputs. We can see that dvc.yaml is modified: a new stage is here with all the inputs and outputs. To run this stage we run dvc repro; it will skip the prepare stage because nothing has changed. This process will take some time, so I will speed up the recording. The whole process took about 15 minutes. I'm running this on my laptop, so it's not the best machine for this task, but hopefully you're using a more powerful machine.
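A hedged sketch of that train stage; the script, model, plot, and metric file names are assumptions chosen only for illustration:

```sh
# Define the "train" stage: fine-tune the network on the unpacked images.
# Plots and metrics are declared separately so DVC can compare them
# across runs; every path here is an assumed name.
dvc stage add -n train \
  -d train.py -d data/images \
  -o models/model.pkl \
  --plots metrics/confusion_matrix.png \
  --plots metrics/top_losses.png \
  -m metrics/summary.json \
  python train.py

# Reproduce the pipeline; the unchanged "prepare" stage is skipped.
dvc repro
```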
Let's check dvc.lock. We see new entries here due to the train stage: the outputs, which are the plots, and the models, which are located in the models directory. Because I marked these files as outputs in my stage, they are also ignored, so they won't be uploaded to GitHub. The same goes for the files in the metrics folder, the images that our training generated. Now if I run dvc repro again, it skips both stages because nothing has changed. Let's try deleting some files: we can delete one of these plots, and let's also delete the model files. If we run dvc repro again, it finds these files are missing and pulls them from the cache. So here they are again, and they are safe in our DVC cache.

To finish, we have the test file. The test file loads some images from Wikipedia, loads the model, and tries to run the prediction. Let's add the test stage. We're going to call it test. The inputs are the test file and the model files, and there are no outputs. The idea is that the test file exits with a non-zero code when there's an error and with a success code when it passes. So again, running dvc repro will only run the test; we don't have any output, but we can check the status code, which is zero, as expected. The stages are shown here in dvc.yaml. We can also visualize the stages by calling dvc dag, which creates a graph with all the stages and dependencies. And we can also find here in dvc.lock the inputs and the outputs, all hashed. So we go into our repository and add dvc.lock so it's tracked in Git, along with dvc.yaml; Git ignores the files we have deleted, so we don't need to check them in. And this one change that I made on the prepare script is superfluous, so we can undo it. So that's it: we have tracked our whole process. The data that went into the training script, the outputs, the models, and the results of the test are all tracked using metadata in our Git repository.

Now let's see how we can run this application. We have an application file here, and we are using the Streamlit library for that. This is a very easy way to quickly run a model in a browser. The st namespace is from Streamlit, and we are using a few of its methods here: one to set a title, a widget to upload images, a call to show the image on screen, and a button to run the prediction. The button calls make_prediction, which loads the model and returns the probability the model computes. The model returns true or false: if it's true, it's a cat; if it's false, it's a dog. So it's going to show me that message. To run this model we call streamlit run on the file. Now the application is running on my machine. Let's try it by uploading one picture of my cat here (she has just woken up) and run the prediction. It's 99% certain that's a cat, and I think that's right.

One other thing we may want to do is put this model inside a Docker container, and here we have a basic Dockerfile to do that. We start from a Python base image, add an application user so we don't run the application as root, copy the requirements and install them, and then copy what we need: the models and the source files. Then we run the application with Streamlit. Now that this step is complete and we have committed all the files, what if we want to share this cache with my colleagues, with other team members? This is where remote storage comes in.
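Before moving on, here is a minimal sketch of what the Streamlit app described above might look like. make_prediction() is a placeholder for the code that loads the fine-tuned model, and the file name, widget labels, and threshold are assumptions, not taken from the project:

```python
# app.py - minimal sketch of the Streamlit app described in the talk.
import streamlit as st

def make_prediction(image_bytes: bytes) -> float:
    """Placeholder: the real version loads the fine-tuned ResNet-34 from
    the models directory and returns the probability that this is a cat."""
    return 0.99  # dummy value so the sketch stays self-contained

st.title("Cat or dog?")

uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    st.image(uploaded)                  # show the uploaded image on screen
    if st.button("Run prediction"):     # button that triggers the prediction
        prob = make_prediction(uploaded.getvalue())
        label = "cat" if prob >= 0.5 else "dog"
        st.write(f"I am {prob:.0%} sure this is a {label}.")
```

With an app file like this, `streamlit run app.py` serves it locally, and the same file is what the Dockerfile copies into the container.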
DVC supports remote storage. We add a remote using a syntax similar to Git's, and then we can push the files to that remote storage so other people can connect to it. That way we have a common cache for everyone on the team, and remember, it holds all the versions, all the iterations that everyone has produced during their work, all in one place. DVC supports several cloud providers by default, and you can also bring your own: if you have a server, you can connect via SSH or other protocols. In my case I will use AWS and S3 buckets. I have created an S3 bucket just for storing this example. The syntax to add the remote is very similar to Git: dvc remote add, then a name (in this case I'm going to call it origin, just to keep the convention, but it can be anything), and then, since I'm using S3, the bucket name prefixed with s3://. Once we've added the remote, we run dvc remote default with the name of the remote, which I called origin, and this makes it the default. And now we can push the files. If we run dvc push, it connects to the S3 bucket, sees what's missing, and pushes the changes. In this case the bucket is empty, so it pushes everything we have into the remote cache. It's starting to do that right now. Once we have everything in our remote, we can share our work with other people. They only need to add the remote and then run dvc pull, and this pulls all the changes into their local file system.

Here we can see the complete workflow. We have our code and our pointers to the files in our repository (GitHub, Bitbucket, GitLab, any Git provider), and we run git pull. This pulls the code, the pointers, the hashes, the .dvc files, dvc.yaml, everything that preserves state. Then we run dvc pull. This connects to the remote storage and pulls all the big files stored there into our cache; any changes that were made are also synced with our local cache. Then we run dvc repro, which runs our training, fine-tuning, testing, everything we want; we can try different parameters. Then we commit all the changes. The commit contains any changes to the code and all the new references to the new outputs, models, plots, and metrics stored in our local cache, and git push stores all these references in Git. And when we run dvc push, it actually pushes a copy of our cache to the remote storage.

Now that we have everything in a remote repository and the code in Git, we can set up continuous integration. You can use any continuous integration product. I'm going to use Semaphore because I work for Semaphore and it's the tool I know best. Here we have our workflow editor, which lets us configure our commands. First we open the pipeline and select one of the machines that are available. And here are the commands that run before each of my jobs: we set the Python version, install DVC, install the Python dependencies, and pull everything from the cache; checkout pulls the code from Git. Then, if we go to the train step, we run dvc repro train, and the logs only show what changed. After that, we push the new models into the DVC cache. The test job will use dvc repro test; this will be the only command in that job.
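The CI jobs essentially repeat the same commands a team member would run locally against the shared remote. A rough sketch of that flow, assuming an S3 remote named origin (the bucket name is made up):

```sh
# One-time setup: point DVC at shared remote storage.
dvc remote add origin s3://my-mlops-bucket
dvc remote default origin

# Everyday loop, locally or inside a CI job:
git pull          # latest code, dvc.yaml and dvc.lock
dvc pull          # matching data, models and plots from the remote cache
dvc repro         # rerun only the stages whose inputs changed
git add dvc.lock && git commit -m "Update pipeline outputs" && git push
dvc push          # upload the new cached outputs to the shared remote
```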
Remember that we put dvc pull among the common commands in the pipeline; those commands run before any of the jobs. So basically this job pulls the models and runs the test stage. Then we have two continuous delivery pipelines. One is for Docker: it builds the Docker image and pushes it to Docker Hub. The second pipeline deploys the model to Hugging Face using Streamlit: it checks out the code, pulls the cache with all the models, and runs the deploy script. We provide environment variables: one is the address of the Space, and the other is a private SSH key that we use to push the changes to the Hugging Face Space. Hugging Face uses Git and Git LFS to support large files, so basically this job joins everything into one repository and pushes the bundle over SSH to the Hugging Face repository. You can check the code in detail in the repository I'm going to share in the slides; you will find links to blog posts, to the source code, and to the pipeline, so you can replicate it in your CI/CD system.

Once I run the continuous integration pipeline and the continuous deployment pipeline, I have the application running on Hugging Face. This is, by the way, for free: you can host your models on Hugging Face for free using different frameworks. I'm using Streamlit, which is supported by Hugging Face. Now let me upload the picture again, just to make sure it's working the same as before. And yes, this time it's even more confident that it's a cat. So this is one way we can deploy quickly, and it all runs automatically: we don't need to deploy manually, we just push our changes to the Git repository and let the CI/CD system take over, train, and deploy for us.

So that's all I have. Thank you for watching this talk. I hope it helps you incorporate these tools and practices into your ML workflows. If you want to contact me, here's my contact information; I'll be happy to talk to you. Thank you for watching and have a nice conference. Thank you.
...

Tomas Fernandez

Technical Editor @ Semaphore



