This template runs on the Neuro Platform.
To dive into problem solving, sign up on the Neuro Platform website, set up your local machine according to the instructions, and log in to the Neuro CLI:

```shell
neuro login
```
| Local directory | Description | Storage URI | Environment mount point |
|---|---|---|---|
| `data/` | Data | `storage:mlops_spark/data/` | `/mlops_spark/data/` |
| `modules/` | Python modules | `storage:mlops_spark/modules/` | `/mlops_spark/modules/` |
| `config/` | Configuration files | `storage:mlops_spark/config/` | `/mlops_spark/config/` |
| `notebooks/` | Jupyter notebooks | `storage:mlops_spark/notebooks/` | `/mlops_spark/notebooks/` |
| `results/` | Logs and results | `storage:mlops_spark/results/` | `/mlops_spark/results/` |
Follow the instructions below to set up the environment on Neuro and start a Jupyter development session.
```shell
neuro-flow build myimage
```
Command results:
- The `requirements.txt`, `apt.txt`, and `setup.cfg` files from the local project are uploaded to the platform storage.
- A new job is started in our base environment.
- Pip requirements from `requirements.txt` and apt applications from `apt.txt` are installed in the same environment.
- The updated environment is saved under a new project-dependent name to be used further on.
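For context, the image that `neuro-flow build` creates is described in the `images` section of `live.yml`. A minimal sketch of what that section might look like (the image reference and paths here are illustrative, not taken from this template):

```yaml
images:
  myimage:
    ref: image:mlops_spark:v1.0               # illustrative image reference
    dockerfile: $[[ flow.workspace ]]/Dockerfile
    context: $[[ flow.workspace ]]/
```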
```shell
neuro-flow run jupyter
```
Command results:
- The contents of the `modules` and `notebooks` directories are uploaded to the platform storage.
- A job with Jupyter is started, and its web interface is opened in a new window of your local web browser.
```shell
neuro-flow kill jupyter
```
Command results:
- The job with Jupyter Notebooks is terminated. The notebooks are saved on the platform storage. You may run `neuro-flow download notebooks` to download them to the local `notebooks/` directory.
If you're not using the default `neuromation/base` image, you may want to protect the main processes in your jobs from being killed when there's not enough memory for them. You can do this in two steps:
- Create an executable `oom_guard.sh` file with the following contents:

```shell
#!/bin/sh
# Raise the OOM score of every process except PID 1, so that the kernel's
# oom_killer prefers to kill those processes first and spares the main one.
for pid in $(ps x | awk 'NR>1 {print $1}' | xargs)
do
    if [ "$pid" != "1" ]
    then
        echo 1000 > /proc/"$pid"/oom_score_adj
    fi
done
```
The script above tells the `oom_killer` to avoid killing the process with PID 1 for as long as possible.
- Add the following lines to your `Dockerfile`:

```dockerfile
COPY oom_guard.sh /root/oom_guard.sh
RUN chmod +x /root/oom_guard.sh
RUN crontab -l 2>/dev/null | { cat; echo '* * * * * /root/oom_guard.sh'; } | crontab -
```
This will ensure the script from step 1 is executed every minute.
```shell
neuro-flow ps
```
Command results:
- The list of all available template jobs is printed along with their statuses.
On a local machine, run `neuro-flow run filebrowser` and open the job's URL on your mobile device or desktop. Through a simple file explorer interface, you can upload test images and perform various file operations.
On a local machine, run `neuro-flow mkvolumes`. This command creates storage folders for all defined volumes. You only need to run this once.

After the storage folders have been created, run `neuro-flow upload data` from the local machine as well. This command pushes local files stored in `./data` into the `storage:mlops_spark/data` volume mounted to your development environment's `/project/data`.
You can upload (or download) every folder for which the `local` parameter is specified in the `live.yml` file.
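For reference, a sketch of how such a volume might be declared in `live.yml` (the keys shown follow the neuro-flow volume syntax; the exact names and paths in this template may differ):

```yaml
volumes:
  data:
    remote: storage:mlops_spark/data  # storage URI
    mount: /project/data              # mount point inside the job
    local: data                       # local folder; enables upload/download
```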
Google Cloud SDK is pre-installed on all jobs produced from the base image.
Feel free to refer to the documentation explaining how to start using GCP with the template.
AWS CLI is pre-installed on all jobs produced from the base image.
Feel free to refer to the documentation explaining how to start using AWS with the template.
If you want to debug your code on GPU, you can run a sleeping job via `neuro-flow run remote_debug`, which will also open a shell to the job. You can also see the job's logs via `neuro-flow logs remote_debug`. The job forwards your local port 2211 to its port 22 for remote debugging.

You can find the instructions on remote debugging via PyCharm Pro in the documentation.

Please don't forget to kill your job via `neuro-flow kill remote_debug` so you don't waste your quota!
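A `remote_debug` job of this kind might be declared in `live.yml` roughly as follows (a hypothetical sketch; the `port_forward` attribute follows neuro-flow syntax, and the actual definition in this template may differ):

```yaml
remote_debug:
  image: $[[ images.myimage.ref ]]
  env:
    EXPOSE_SSH: "yes"        # enable the SSH server in the base image
  port_forward:
    - 2211:22                # local port 2211 -> job port 22
  bash: sleep infinity       # keep the job alive until it is killed
```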
The Neuro Platform offers easy integration with Weights & Biases, an experiment tracking tool for deep learning.
Here you can find documentation for using W&B for experiment tracking with the template.
You can also refer to instructions on using Weights & Biases in your code in the W&B documentation. There are also W&B example projects, as well as an example of a Neuro Project Template-based ML Recipe that uses W&B as part of its workflow.
```yaml
defaults:
  preset: gpu-small-p
```
There are several machine types available on the platform. Run `neuro config show` to see the full list. You can also override the default preset for each job:

```yaml
jobs:
  train:
    image: $[[ images.myimage.ref ]]
    preset: gpu-large
    ...
```
When jobs with an HTTP interface are executed (for example, Jupyter Notebooks or TensorBoard), this interface requires the user to be authenticated on the platform. However, if you want to share the link with someone who is not registered on the platform, you may disable the authentication by adding this argument to your job configuration:

```yaml
args:
  http_auth: "False"
```
Running `neuro-flow upload ALL` from a local machine will upload all of your code, configs, and notebooks to the storage so that these folders can be used by your jobs.
To tweak the training command, change the last line in this section of `live.yaml`:

```yaml
train:
  image: $[[ images.myimage.ref ]]
  detach: True
  life_span: 10d
  volumes:
    - $[[ volumes.data.ref_ro ]]
    - $[[ volumes.code.ref_ro ]]
    - $[[ volumes.config.ref_ro ]]
    - $[[ volumes.results.ref_rw ]]
  env:
    EXPOSE_SSH: "yes"
    PYTHONPATH: $[[ volumes.code.mount ]]
  bash: |
    cd $[[ flow.workspace ]]
    python -u $[[ volumes.code.mount ]]/train.py --data $[[ volumes.data.mount ]]
```
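For example, to pass an extra flag to the training script (the `--epochs` option here is purely illustrative and assumes `train.py` accepts it), the last line could become:

```yaml
bash: |
  cd $[[ flow.workspace ]]
  python -u $[[ volumes.code.mount ]]/train.py --data $[[ volumes.data.mount ]] --epochs 20
```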
After this, just run `neuro-flow run train`.
You can run multiple training experiments simultaneously:

```shell
neuro-flow run multitrain -s myidea-1
```

Note that this label becomes a suffix of the job name, which can only contain alphanumeric characters and hyphens (`-`), cannot end with a hyphen, and cannot be longer than 40 characters. You can use suffixed job names to access jobs: `neuro-flow status multitrain myidea-1`, `neuro-flow logs multitrain myidea-1`, and so on.
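The `multitrain` job itself is most likely marked as a multi job in `live.yml`; a hedged sketch of such a definition (the `multi` attribute follows neuro-flow syntax, but the actual job in this template may differ):

```yaml
multitrain:
  image: $[[ images.myimage.ref ]]
  multi: true   # allows several instances at once, distinguished by -s suffixes
  bash: |
    python -u $[[ volumes.code.mount ]]/train.py --data $[[ volumes.data.mount ]]
```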
Please don't forget to kill the jobs you started:

- `neuro-flow kill train` to kill the training job started via `neuro-flow run train`,
- `neuro-flow kill multitrain` to kill the training jobs started via `neuro-flow run multitrain`,
- `neuro-flow kill jupyter` to kill the job started via `neuro-flow run jupyter`,
- ...
- `neuro-flow kill ALL` to kill all jobs started in the current project.
The Neuro Platform supports hyperparameter tuning via Weights & Biases.
Please refer to the corresponding documentation for more information.