---
title: Deployment
---
You can spin up an agent on any machine: on-premises or a cloud instance. When spinning up an agent, you assign it to service one or more queues. Utilize the machine by enqueuing tasks to a queue that the agent is servicing; the agent will pull and execute the tasks.
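As an illustration (assuming an agent is servicing a queue named `default`), an existing script can be enqueued from the command line with the `clearml-task` CLI; the project name, task name, and script path below are placeholders:

```bash
# Enqueue a script as a ClearML task into the "default" queue; the agent
# servicing that queue will pull the task and execute it remotely.
# Project, task name, and script path are placeholders.
clearml-task --project examples --name remote-run --script train.py --queue default
```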
:::tip Cross-platform execution
ClearML Agent is platform agnostic. When using ClearML Agent to execute experiments cross-platform, set platform-specific environment variables before launching the agent.

For example, to run an agent on an ARM device, set the core type environment variable before spinning up the agent:

```bash
export OPENBLAS_CORETYPE=ARMV8
clearml-agent daemon --queue <queue_name>
```
:::
To execute an agent listening to a queue, run:

```bash
clearml-agent daemon --queue <queue_name>
```
To execute an agent in the background, run:

```bash
clearml-agent daemon --queue <execution_queue_to_pull_from> --detached
```
To stop an agent running in the background, run:

```bash
clearml-agent daemon <arguments> --stop
```
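For example, if the background agent above was started with `--queue <execution_queue_to_pull_from> --detached`, it can be stopped by repeating the same arguments with `--stop` appended:

```bash
# Stop the previously started background agent (same arguments plus --stop)
clearml-agent daemon --queue <execution_queue_to_pull_from> --detached --stop
```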
To specify GPUs associated with the agent, add the `--gpus` flag.
:::info Docker Mode
Make sure to include the `--docker` flag, as GPU management through the agent is only supported in Docker Mode.
:::
To execute multiple agents on the same machine (usually assigning a GPU to each agent), run:

```bash
clearml-agent daemon --gpus 0 --queue default --docker
clearml-agent daemon --gpus 1 --queue default --docker
```
To allocate more than one GPU, provide a list of allocated GPUs:

```bash
clearml-agent daemon --gpus 0,1 --queue dual_gpu --docker
```
A single agent can listen to multiple queues. The priority is set by their order.

```bash
clearml-agent daemon --queue high_q low_q
```

This ensures the agent first tries to pull a Task from the `high_q` queue, and only if it is empty, the agent will try to pull from the `low_q` queue.
To make sure an agent pulls from all queues equally, add the `--order-fairness` flag:

```bash
clearml-agent daemon --queue group_a group_b --order-fairness
```

The agent will pull from the `group_a` queue, then from `group_b`, then back to `group_a`, and so on. This ensures that neither `group_a` nor `group_b` can starve the other of resources.
By default, ClearML Agent maps the host's `~/.ssh` into the container's `/root/.ssh` directory (configurable, see clearml.conf).
If you want to use existing auth sockets with ssh-agent, you can verify your host ssh-agent is working correctly with:

```bash
echo $SSH_AUTH_SOCK
```

You should see a path to a temporary file, something like this:

```
/tmp/ssh-<random>/agent.<random>
```
Then run your `clearml-agent` in Docker mode. The agent will automatically detect the `SSH_AUTH_SOCK` environment variable and mount the socket into any container it spins up.
You can also explicitly set the `SSH_AUTH_SOCK` environment variable when executing an agent. The command below will execute an agent in Docker mode and assign it to service a queue. The agent will have access to the SSH socket provided in the environment variable.

```bash
SSH_AUTH_SOCK=<file_socket> clearml-agent daemon --gpus <your config> --queue <your queue name> --docker
```
Agents can be deployed bare-metal or as Docker containers in a Kubernetes cluster. ClearML Agent adds the missing scheduling capabilities to Kubernetes, allows for more flexible automation from code, and gives access to all of ClearML Agent's features.
ClearML Agent is deployed onto a Kubernetes cluster through its Kubernetes Glue, which maps ClearML jobs directly to K8s jobs:

- Use the ClearML Agent Helm Chart to spin up an agent pod acting as a controller (see the example after this list). Alternatively (less recommended), run a K8s glue script on a K8s CPU node.
- The ClearML K8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on the provided YAML template).
- Inside each job pod, the `clearml-agent` will install the ClearML task's environment and run and monitor the experiment's process.
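For reference, a minimal sketch of installing the controller with Helm; the repository URL, chart name, and `values.yaml` contents are assumptions here, so check the ClearML Agent Helm Chart documentation for the exact names:

```bash
# Assumed Helm repository URL and chart name; verify against the
# ClearML Agent Helm Chart documentation before using.
helm repo add allegroai https://allegroai.github.io/clearml-helm-charts
helm repo update

# values.yaml is assumed to hold your ClearML server credentials and the
# queue(s) the agent controller should service.
helm install clearml-agent allegroai/clearml-agent -f values.yaml
```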
:::important Enterprise Feature
The ClearML Enterprise plan supports K8S servicing multiple ClearML queues, as well as providing a pod template for each queue for describing the resources for each pod to use.

For example, the following configures which resources to use for `example_queue_1` and `example_queue_2`:

```yaml
agentk8sglue:
  queues:
    example_queue_1:
      templateOverrides:
        nodeSelector:
          nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
        resources:
          limits:
            nvidia.com/gpu: 1
    example_queue_2:
      templateOverrides:
        nodeSelector:
          nvidia.com/gpu.product: A100-SXM4-40GB
        resources:
          limits:
            nvidia.com/gpu: 2
```
:::
:::important Enterprise Feature
Slurm Glue is available under the ClearML Enterprise plan.
:::
Agents can be deployed bare-metal or inside Singularity containers in Linux clusters managed with Slurm.

ClearML Agent Slurm Glue maps jobs to Slurm batch scripts: associate a ClearML queue with a batch script template, then when a Task is pushed into the queue, it will be converted and executed as an `sbatch` job according to the `sbatch` template specification attached to the queue.
- Install the Slurm Glue on a machine where you can run `sbatch`/`squeue` etc.:

  ```bash
  pip3 install -U --extra-index-url https://*****@*****.allegro.ai/repository/clearml_agent_slurm/simple clearml-agent-slurm
  ```
- Create a batch template. Make sure to set the `SBATCH` variables to the resources you want to attach to the queue. The script below sets up an agent to run bare-metal, creating a virtual environment per job. For example:

  ```bash
  #!/bin/bash
  # available template variables (default value separator ":")
  # ${CLEARML_QUEUE_NAME}
  # ${CLEARML_QUEUE_ID}
  # ${CLEARML_WORKER_ID}.
  # complex template variables (default value separator ":")
  # ${CLEARML_TASK.id}
  # ${CLEARML_TASK.name}
  # ${CLEARML_TASK.project.id}
  # ${CLEARML_TASK.hyperparams.properties.user_key.value}
  # example
  #SBATCH --job-name=clearml_task_${CLEARML_TASK.id}  # Job name DO NOT CHANGE
  #SBATCH --ntasks=1                                  # Run on a single CPU
  # #SBATCH --mem=1mb                                 # Job memory request
  # #SBATCH --time=00:05:00                           # Time limit hrs:min:sec
  #SBATCH --output=task-${CLEARML_TASK.id}-%j.log
  #SBATCH --partition debug
  #SBATCH --cpus-per-task=1
  #SBATCH --priority=5
  #SBATCH --nodes=${CLEARML_TASK.hyperparams.properties.num_nodes.value:1}

  ${CLEARML_PRE_SETUP}

  echo whoami $(whoami)

  ${CLEARML_AGENT_EXECUTE}

  ${CLEARML_POST_SETUP}
  ```

  Notice: If you are using Slurm with Singularity container support, replace `${CLEARML_AGENT_EXECUTE}` in the batch template with `singularity exec ${CLEARML_AGENT_EXECUTE}`. For additional required settings, see Slurm with Singularity.

  :::tip
  You can override the default values of a Slurm job template via the ClearML Web UI. The following command in the template sets the `nodes` value to be the ClearML Task's `num_nodes` user property:

  ```bash
  #SBATCH --nodes=${CLEARML_TASK.hyperparams.properties.num_nodes.value:1}
  ```

  This user property can be modified in the UI, in the task's CONFIGURATION > User Properties section, and when the task is executed, the new modified value will be used.
  :::
- Launch the ClearML Agent Slurm Glue and assign the Slurm configuration to a ClearML queue. For example, the following associates the `default` queue with the `slurm.example.template` script, so any jobs pushed to this queue will use the resources set by that script:

  ```bash
  clearml-agent-slurm --template-files slurm.example.template --queue default
  ```

  You can also pass multiple templates and queues. For example:

  ```bash
  clearml-agent-slurm --template-files slurm.template1 slurm.template2 --queue queue1 queue2
  ```
If you are running Slurm with Singularity container support, set the following:
- Make sure your `sbatch` template contains:

  ```bash
  singularity exec ${CLEARML_AGENT_EXECUTE}
  ```

  Additional Singularity arguments can be added, for example:

  ```bash
  singularity exec --uts ${CLEARML_AGENT_EXECUTE}
  ```
- Set the default Singularity container to use in your clearml.conf file:

  ```
  agent.default_docker.image="shub://repo/hello-world"
  ```

  or:

  ```
  agent.default_docker.image="docker://ubuntu"
  ```
- Add `--singularity-mode` to the command line, for example:

  ```bash
  clearml-agent-slurm --singularity-mode --template-files slurm.example_singularity.template --queue default
  ```
ClearML Agent can run on a Google Colab instance. This helps users leverage compute resources provided by Google Colab and send experiments for execution on it.
Check out this tutorial on how to run a ClearML Agent on Google Colab!
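As a rough sketch (not an official recipe), running an agent from a Colab notebook could look like the following, with each line prefixed by `!` in a notebook cell; the server URLs, credentials, and queue name are placeholders:

```bash
# Install the agent and export ClearML credentials (placeholders shown).
pip install clearml-agent

export CLEARML_API_HOST="https://api.clear.ml"
export CLEARML_WEB_HOST="https://app.clear.ml"
export CLEARML_FILES_HOST="https://files.clear.ml"
export CLEARML_API_ACCESS_KEY="<your_access_key>"
export CLEARML_API_SECRET_KEY="<your_secret_key>"

# Start servicing a queue from the Colab instance
clearml-agent daemon --queue default
```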
ClearML Agent can also execute specific tasks directly, without listening to a queue.
Execute a Task with a `clearml-agent` worker without a queue:

```bash
clearml-agent execute --id <task-id>
```
Clone the specified Task and execute the cloned Task with a `clearml-agent` worker without a queue:

```bash
clearml-agent execute --id <task-id> --clone
```
Execute a Task with a `clearml-agent` worker using a Docker container without a queue:

```bash
clearml-agent execute --id <task-id> --docker
```
Run a `clearml-agent` daemon in foreground mode, sending all output to the console:

```bash
clearml-agent daemon --queue default --foreground
```