py_tf2_gpu_dock_mlflow

Ready-to-run Python/Tensorflow2/MLflow setup to train models on GPU: get new Python/Tensorflow models running quickly, training on a GPU and logging performance and the resulting model to MLflow, keeping everything in Docker via MLflow Projects.

💡 Note this repo is part of a trio that you might find useful together (but all are separate tools that can be used independently):

 

Summary

This Python/Tensorflow2 setup uses MLflow with GPU runs in a Docker container to train, evaluate, and log models to MLflow's model registry. Logged models can then be served via REST API, or downloaded from the registry into Python code or into their own new Docker image for further use.
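As a rough illustration of the "download from the registry into Python code" path, here is a minimal sketch. The model name, version, and tracking server address in the comments are hypothetical placeholders, not values defined by this repo; the `models:/<name>/<version>` URI scheme and `mlflow.pyfunc.load_model()` are standard MLflow.

```python
def registry_model_uri(name: str, version: int) -> str:
    """Build an MLflow model-registry URI of the form models:/<name>/<version>."""
    return f"models:/{name}/{version}"


def load_registered_model(name: str, version: int, tracking_uri: str):
    """Fetch a registered model from the registry as a generic pyfunc model."""
    # Imported lazily so the URI helper above works without MLflow installed.
    import mlflow
    import mlflow.pyfunc

    # Point the client at the MLflow server that holds the registry.
    mlflow.set_tracking_uri(tracking_uri)
    return mlflow.pyfunc.load_model(registry_model_uri(name, version))


# Example call (hypothetical name, version, and server address; use your own):
#   model = load_registered_model("malaria_classifier", 1, "http://localhost:5000")
#   predictions = model.predict(batch_of_images)
```

The pyfunc flavor is used here only because it is model-agnostic; whether it is the most convenient flavor depends on how the model was logged.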

The training computation itself is handled entirely in the container; for that, the host system only needs the Nvidia driver and Docker installed. However, kicking off the training currently still requires a Python environment with MLflow installed (and a clone of this repo, assuming you want to make your own changes to the model or data).

The training script includes function definitions for loading training data of different types (images from directory, tf-datasets, custom datafiles, etc.) and for the neural network specification, so those things are easily experimented with independently of the rest of the code. This repo's tools spawn a Tensorflow model training in a self-contained Docker container with GPU access, then log its results and resulting model into MLflow at the end, so one has a record of runs that can be compared.

In the default example implemented here, we use the malaria detection dataset from Tensorflow Datasets to train/test a VGG16-based image classification model to detect malaria parasite presence in thin blood smear images. A few options for alternate neural network definitions are included in the code (see the convolutions parameter, which is actually the "number of convolutional layers in the model").

malaria blood smear example images

How to install/run

Note this repo has been written and tested assuming it runs on Linux. It will almost surely not work out of the box on Windows.

Option #1: follow this lowdown to kick off a low-cost AWS GPU instance and use the aws_ec2_install.bash script to quickly set it up to run this py_tf2_gpu_dock_mlflow repo with the MLflow setup from the docker_mlflow_db repo (assuming you already have an AWS account).

Option #2: follow these more generalized instructions to prepare/setup what's needed to run the py_tf2_gpu_dock_mlflow training process, whether it's so you can use your own separate MLflow instance, or your own already-running server or EC2 instance, or whatever. Also, these more general instructions can give additional context to what steps are taken by the canned setup in Option #1.

 

Note that make run just kicks off the project_driver.bash script. It might be useful to know that once you've got the repo forked in your own account and updated to suit your needs, technically you don't even need to clone the repo anymore to run your trainings - you can reference your repo URL or your pre-made remote docker image in the mlflow run command at the top of the project_driver.bash script. The MLflow Projects documentation has more about that; just something to think about.
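To make the idea of pointing `mlflow run` at a repo URL concrete, here is a small sketch that assembles such a command in Python. It is not the contents of project_driver.bash: the repo URL, experiment name, and the `epochs` parameter are hypothetical placeholders (only `convolutions` is named in this README), and it assumes the MLflow CLI's `-P` (project parameter) and `-A`/`--docker-args` (argument forwarded to `docker run`) options.

```python
def build_mlflow_run_command(project_uri, params, experiment_name=None, use_gpu=True):
    """Assemble an `mlflow run` command line as a list of arguments."""
    # project_uri may be a local path ("." for a clone) or a Git repo URL.
    cmd = ["mlflow", "run", project_uri]
    if experiment_name:
        cmd += ["--experiment-name", experiment_name]
    if use_gpu:
        # -A forwards NAME=VALUE to `docker run`, i.e. `--gpus=all` here.
        cmd += ["-A", "gpus=all"]
    for name, value in params.items():
        # Each -P sets one parameter of the MLflow Project entry point.
        cmd += ["-P", f"{name}={value}"]
    return cmd


# Hypothetical fork URL and parameter values; substitute your own.
command = build_mlflow_run_command(
    "https://github.com/your-account/py_tf2_gpu_dock_mlflow",
    {"convolutions": 3, "epochs": 10},
    experiment_name="malaria",
)
```

The resulting list can be handed to `subprocess.run(command, check=True)`, or joined with spaces and pasted into a shell.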

Once the run is in progress, you should see metrics being logged in your MLflow instance, something like this. When you click on the metrics links in each run you can see plots over the epochs.

MLflow logged run example image

Upon clicking an individual Run Name to view its details, clicking on one of the metrics can show you timeseries plots of the performance-in-progress:

MLflow logged run example image

We can see, for example, how the transfer-learned VGG16 model does better than the other models tried above, and how it converges faster. It's not quite a fair comparison though, because the VGG16 run used transfer learning to perturb pre-trained (ImageNet) weights for this problem, whereas the other (albeit smaller) models were trained from scratch. You'll find in the define_network() function in train.py that some extra layers were added to the end of the VGG16 network; this was to allow exploring different variations in transfer learning and fine-tuning. Of course you can replace all that with whatever you wish.

Lastly: the make run macro runs the project_driver.bash shell script, but a Python script, project_driver.py, with mostly corresponding functionality is left in here too from my experimentation. However, an important note: as of this writing, while the Python entry point allows kicking off multiple concurrent runs, which is cool, it appears that GPU usage in Docker containers in MLflow Projects is only possible through the CLI (shell script) call to MLflow. That is, the mlflow shell command can now pass a --gpus=all argument through to Docker, but the Python mlflow.projects.run() method does not have an equivalent yet. That's strictly an issue with MLflow, not with Python, Tensorflow, or Docker.
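For context, kicking off several concurrent runs through the Python API could look roughly like the sketch below. This is not the contents of project_driver.py: the helper names and the `epochs` parameter are hypothetical, and it relies only on `mlflow.projects.run()` with `synchronous=False` and the `wait()` method of the returned run object. Per the caveat above, runs launched this way are assumed not to get GPU access.

```python
import itertools


def parameter_grid(options):
    """Expand {name: [values, ...]} into a list of {name: value} dicts, one per run."""
    names = list(options)
    value_lists = [options[name] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


def launch_concurrent_runs(project_uri, options, experiment_name):
    """Start one MLflow Project run per parameter combination, then wait for all."""
    # Imported lazily so parameter_grid() works without MLflow installed.
    import mlflow.projects

    submitted = [
        mlflow.projects.run(
            uri=project_uri,
            parameters=params,
            experiment_name=experiment_name,
            synchronous=False,  # return immediately instead of blocking per run
        )
        for params in parameter_grid(options)
    ]
    # wait() blocks until a run finishes and returns True if it succeeded.
    return [run.wait() for run in submitted]


# Example call (hypothetical values), from a clone of the repo:
#   launch_concurrent_runs(".", {"convolutions": [2, 3], "epochs": [10]}, "malaria")
```

Launching everything before waiting is what makes the runs overlap; calling `wait()` inside the first loop would serialize them.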

References

About this malaria detection problem:

About relevant Tensorflow details in particular:

About additional computational tools used here (Docker, MLflow, etc):

The code and setup were initially based on George Novack's 2020 article in Towards Data Science, "Create Reusable ML Modules with MLflow Projects & Docker" (thank you). I've pulled things together into a single repo meant to use as a template, added some new functionality (artifact logging, model registry, GPU capability), and generalized it a bit.