Batch AI provides managed infrastructure to help data scientists with cluster management, scheduling, scaling, and monitoring of AI jobs. Batch AI works on top of virtual machine scale sets and Docker. Batch AI can run training jobs in Docker containers or directly on the compute nodes.
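Whether a job runs in a container is controlled per job. As a hedged sketch based on the Batch AI job schema, the job configuration (the `job.json` used later in this walkthrough) can carry a `containerSettings` block naming the image; omit it and the job runs directly on the node. The image name below is only an illustration, not part of the original walkthrough:

```json
{
    "properties": {
        "containerSettings": {
            "imageSourceRegistry": {
                "image": "tensorflow/tensorflow:1.8.0-gpu"
            }
        }
    }
}
```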
- Cluster
- Jobs
- Azure File Share - stdout, stderr, may contain Python scripts
- Azure Blob Storage - Python scripts, data
You Only Look Once (YOLO) is a real-time object detection system. We will be running YOLOv3 on a single image with Batch AI. If you would like to run YOLO without a cluster, you can follow the steps on the YOLO site.
git clone https://github.com/pjreddie/darknet
cd darknet
make
wget https://pjreddie.com/media/files/yolov3.weights
./darknet detect cfg/yolov3.cfg yolov3.weights data/dog.jpg
YOLOv3 should output something like:
...
104 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs
105 conv 255 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 255 0.353 BFLOPs
106 detection
Loading weights from cfg/yolov3.weights...Done!
data/dog.jpg: Predicted in 24.016015 seconds.
dog: 99%
truck: 92%
bicycle: 99%
- Python train and test scripts define the parallel strategy used, not Batch AI. For example, CNTK uses an asynchronous data-parallel training strategy, while TensorFlow uses an asynchronous model-parallel training strategy
- Make sure `.sh` scripts have LF line endings - use `dos2unix` to fix them
- To enable faster communication between the nodes, it's necessary to use Intel MPI and have InfiniBand on the VM
- NC24r (works with Intel MPI and InfiniBand) quota is 1 core by default in any subscription, so make quota increase requests early
- There is no way to reset the SSH key for nodes
- Do not put `CMD` in the Dockerfile used by Batch AI. Since the container runs in detached mode, it will exit on `CMD` (see the Dockerfile sketch after this list)
- Error messages within the container are not very descriptive
- Clusters take a long time to provision and deallocate
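As an illustration of the `CMD` tip above, a minimal Dockerfile sketch that installs dependencies but leaves the entry point to Batch AI. The base image and packages are placeholders, not from the original walkthrough:

```dockerfile
# Hypothetical example image - adjust the base image and packages to your job
FROM ubuntu:16.04

# Install whatever your training script needs
RUN apt-get update && apt-get install -y python3 python3-pip

# Note: no CMD (or ENTRYPOINT) here - Batch AI supplies the job's command
# line itself, and a CMD would cause the detached container to exit early
```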
- Install Azure CLI 2.0 for WSL
- Batch AI Recipes
- Azure CLI Docs
- Swagger Docs for Batch AI
- Batch AI Environment Variables
- Setting up KeyVault
az account list -o table
az account set -s <subscription id>
az group create -n <rg name> -l eastus
az storage account create \
-n <storage account name> \
--sku Standard_LRS \
-l eastus \
-g <rg name>
az storage account keys list \
-n <storage account name> \
-g <rg name> \
--query "[0].value"
az storage share create \
-n <share name> \
--account-name <storage account name> \
--account-key <storage account key>
az storage directory create \
-s <share name> \
-n yolo \
--account-name <storage account name> \
--account-key <storage account key>
az storage file upload \
-s <share name> \
--source <python script> \
-p yolo \
--account-name <storage account name> \
--account-key <storage account key>
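The job will also need the YOLO config, weights, and test image at run time. If you serve them from the same share, they can be uploaded the same way; this is an illustrative sketch, not a step from the original walkthrough:

```bash
# Upload the weights downloaded earlier alongside the script
az storage file upload \
    -s <share name> \
    --source yolov3.weights \
    -p yolo \
    --account-name <storage account name> \
    --account-key <storage account key>
```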
Config parameters for `cluster.json` are defined by `ClusterCreateParameters` in the Batch AI Swagger docs.
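For illustration, a minimal `cluster.json` might look like the sketch below. This is an assumption based on the `ClusterCreateParameters` schema, not a file from the original walkthrough; check the Swagger docs for the exact property names:

```json
{
    "properties": {
        "vmSize": "STANDARD_NC6",
        "scaleSettings": {
            "manual": {
                "targetNodeCount": 2
            }
        },
        "userAccountSettings": {
            "adminUserName": "<user name>",
            "adminUserPassword": "<password>"
        }
    }
}
```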
az batchai cluster create \
-n <cluster name> \
-l eastus \
-g <rg name> \
-c cluster.json
az batchai cluster create \
-n <cluster name> \
-g <rg name> \
-l eastus \
--storage-account-name <storage account name> \
--storage-account-key <storage account key> \
-i UbuntuDSVM \
-s Standard_NC6 \
--min 2 \
--max 2 \
--afs-name <share name> \
--afs-mount-path external \
-u $USER \
-k ~/.ssh/id_rsa.pub \
-p <password>
az batchai cluster show \
-n <cluster name> \
-g <rg name> \
-o table
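Because clusters take a long time to provision and deallocate (see the tips above), it can be convenient to scale an idle cluster down to zero nodes instead of deleting it. A sketch, assuming the `az batchai cluster resize` command in Azure CLI 2.0:

```bash
# Scale the cluster down to 0 nodes while keeping its configuration
az batchai cluster resize \
    -n <cluster name> \
    -g <rg name> \
    -t 0
```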
- View `JobBaseProperties` in the Batch AI Swagger docs for the possible parameters to use in `job.json`.
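As an illustration, a `job.json` for the YOLO run might look like the sketch below. This is an assumption based on `JobBaseProperties` and the file-share layout created above; the node count, command line, and paths are placeholders, so consult the Swagger docs for the exact property names:

```json
{
    "properties": {
        "nodeCount": 1,
        "customToolkitSettings": {
            "commandLine": "python $AZ_BATCHAI_MOUNT_ROOT/external/yolo/<python script>"
        },
        "stdOutErrPathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/external"
    }
}
```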
az batchai job create \
-g <rg name> \
-l eastus \
-n <job name> \
-r <cluster name> \
-c job.json
az batchai job show \
-n <job name> \
-g <rg name> \
-o table
az batchai job stream-file \
-j <job name> \
-n stdout.txt \
-d stdouterr \
-g <rg name>
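The same command streams the error log; this assumes the default `stderr.txt` file name in the `stdouterr` output directory:

```bash
az batchai job stream-file \
    -j <job name> \
    -n stderr.txt \
    -d stdouterr \
    -g <rg name>
```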
az batchai cluster list-nodes \
-n <cluster name> \
-g <rg name>
ssh <ip> -p <port>
`$AZ_BATCHAI_MOUNT_ROOT` is an environment variable set by Batch AI for each job; its value depends on the image used for node creation. For example, on Ubuntu-based images it's equal to `/mnt/batch/tasks/shared/LS_root/mounts`. You can `cd` to this directory and view the Python scripts and logs.
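For example, once you have an SSH session on a node, you can inspect the mounted share. A minimal sketch; the `external` mount path and `yolo` directory come from the earlier setup steps:

```bash
# On a cluster node: the file share is mounted under the mount root
# at the path given by --afs-mount-path ("external" above)
cd $AZ_BATCHAI_MOUNT_ROOT/external
ls yolo
```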