# DRMAA2 JobTracker implementation for Google Batch
Experimental Google Batch support for DRMAA2os.
This project is intended to be embedded as a backend in https://github.com/dgruber/drmaa2os.
It is a basic DRMAA2 implementation for Google Batch in Go. The DRMAA2 JobTemplate is used for submitting Google Batch jobs, and the DRMAA2 JobInfo struct for retrieving the status of a job. The Google Batch job state model is converted to the DRMAA2 state model.
See the examples directory, which uses the interface directly.
DRMAA2 JobTemplate | Google Batch Job |
---|---|
RemoteCommand | Command to execute in the container, or a script / script path |
Args | In case of a container, the arguments of the command (if RemoteCommand is empty, the arguments of the entrypoint) |
CandidateMachines[0] | Machine type or when prefixed with "template:" it uses an instance template with that name |
JobCategory | Container image or |
JobName | JobID |
AccountingID | Sets a tag "accounting" |
MinSlots | Specifies the parallelism (how many tasks to run in parallel) |
MaxSlots | Specifies the number of tasks to run. For MPI set MinSlots = MaxSlots. |
MinPhysMemory | Memory in MB to request; set this to increase from the default to the full machine size |
ResourceLimits | Keys can be "cpumilli", "bootdiskmib", or "runtime" (a runtime limit like "30m" for 30 minutes) |
Override the "cpumilli" resource limit to get the full amount of resources when running just one task per machine (like 8000 for 8 cores)!
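To illustrate the field mapping above, here is a minimal sketch. The struct is a local stand-in that mirrors the DRMAA2 JobTemplate fields used by this backend (the real struct comes from github.com/dgruber/drmaa2interface); the machine type, image, and values are made up for the example.

```go
package main

import "fmt"

// JobTemplate is a local sketch mirroring the drmaa2interface.JobTemplate
// fields this backend evaluates (illustration only).
type JobTemplate struct {
	RemoteCommand     string
	Args              []string
	CandidateMachines []string
	JobCategory       string
	JobName           string
	AccountingID      string
	MinSlots          int64
	MaxSlots          int64
	MinPhysMemory     int64
	ResourceLimits    map[string]string
}

// exampleTemplate fills the fields according to the mapping table above.
func exampleTemplate() JobTemplate {
	return JobTemplate{
		RemoteCommand:     "/bin/sh",
		Args:              []string{"-c", "echo hello from task"},
		CandidateMachines: []string{"e2-standard-8"}, // or "template:myInstanceTemplate"
		JobCategory:       "busybox:latest",          // container image
		JobName:           "testjob",
		MinSlots:          2,    // parallelism: run 2 tasks in parallel
		MaxSlots:          2,    // total number of tasks (MinSlots == MaxSlots for MPI)
		MinPhysMemory:     1024, // MB of memory to request
		ResourceLimits: map[string]string{
			"cpumilli": "8000", // request all 8 cores when running one task per machine
			"runtime":  "30m",  // runtime limit
		},
	}
}

func main() {
	jt := exampleTemplate()
	fmt.Println(jt.JobName, jt.ResourceLimits["cpumilli"])
}
```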
For StageInFiles and StageOutFiles see below.
When a container is used, the following files are always mounted from the host:
```
"/etc/cloudbatch-taskgroup-hosts:/etc/cloudbatch-taskgroup-hosts",
"/etc/ssh:/etc/ssh",
"/root/.ssh:/root/.ssh",
```
For a container the following runtime options are set:
- "--network=host"
The default output path is Cloud Logging. If "OutputPath" is set, the logs policy is changed to LogsPolicy_PATH with the OutputPath as destination.
DRMAA2 JobTemplate Extension Key | DRMAA2 JobTemplate Extension Value |
---|---|
ExtensionProlog / "prolog" | String containing a prolog script executed at machine level before the job starts |
ExtensionEpilog / "epilog" | String containing an epilog script executed at machine level after the job ends successfully |
ExtensionSpot / "spot" | "true"/"t"/... when the machine should be a spot instance |
ExtensionAccelerators / "accelerators" | "Amount*Accelerator name" for the machine (like "1*nvidia-tesla-v100") |
ExtensionTasksPerNode / "tasks_per_node" | Amount of tasks per node |
ExtensionDockerOptions / "docker_options" | Override of docker run options in case a container image is used |
ExtensionGoogleSecretEnv / "secret_env" | Used for populating env variables from Google Secret Manager. Please use SetSecretEnvironmentVariables() |
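The extension keys above are plain string key/value pairs attached to the JobTemplate (in drmaa2interface they live in the template's extension map). A hedged sketch, with made-up values:

```go
package main

import "fmt"

// exampleExtensions sketches the extension key/value pairs from the table
// above as a plain string map (illustration only; the values are made up).
func exampleExtensions() map[string]string {
	return map[string]string{
		"spot":           "true",                   // use a spot machine
		"accelerators":   "1*nvidia-tesla-v100",    // "Amount*Accelerator name"
		"tasks_per_node": "2",                      // tasks per node
		"prolog":         "#!/bin/sh\necho prolog", // runs at machine level before the job
	}
}

func main() {
	ext := exampleExtensions()
	fmt.Println(ext["accelerators"])
}
```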
DRMAA2 JobInfo | Batch Job |
---|---|
Slots | Parallelism |
No way has been found yet to hold, suspend, or release a job. Terminating a job deletes it...
DRMAA2 State | Batch Job State |
---|---|
Done | JobStatus_SUCCEEDED |
Failed | JobStatus_FAILED |
Suspended | - |
Running | JobStatus_RUNNING, JobStatus_DELETION_IN_PROGRESS |
Queued | JobStatus_QUEUED, JobStatus_SCHEDULED |
Undetermined | JobStatus_STATE_UNSPECIFIED |
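The state table above can be sketched as a simple conversion function. This is an illustration using the Batch state names as strings; the real implementation maps the batchpb JobStatus enum values to drmaa2interface job states.

```go
package main

import "fmt"

// toDRMAA2State sketches the state conversion from the table above
// (Batch states given as strings for illustration).
func toDRMAA2State(batchState string) string {
	switch batchState {
	case "SUCCEEDED":
		return "Done"
	case "FAILED":
		return "Failed"
	case "RUNNING", "DELETION_IN_PROGRESS":
		return "Running"
	case "QUEUED", "SCHEDULED":
		return "Queued"
	default: // includes "STATE_UNSPECIFIED"
		return "Undetermined"
	}
}

func main() {
	fmt.Println(toDRMAA2State("SCHEDULED")) // Queued
}
```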
NFS (Google Filestore) and GCS are supported.
For NFS in containers, files as well as directories can be specified. In the file case, the containing directory is mounted on the host and from there the file is mounted into the container at the path given as key. In the directory case a leading "/" is required.
```go
StageInFiles: map[string]string{
	"/etc/script.sh": "nfs:10.20.30.40:/filestore/user/dir/script.sh",
	"/mnt/dir":       "nfs:10.20.30.40:/filestore/user/dir/",
	"/somedir":       "gs://benchmarkfiles", // mount a bucket into container or host
},
```
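The staging sources follow two prefixes: "nfs:server:/path" for an NFS share and "gs://bucket" for a GCS bucket. A simplified sketch of how such a value could be classified (the function name and return values are made up for illustration):

```go
package main

import (
	"fmt"
	"strings"
)

// classifyStageSource sketches how a StageInFiles value could be interpreted:
// "nfs:server:/path" selects an NFS share, "gs://bucket" a GCS bucket
// (simplified illustration of the prefixes described above).
func classifyStageSource(src string) (kind, rest string) {
	switch {
	case strings.HasPrefix(src, "nfs:"):
		return "nfs", strings.TrimPrefix(src, "nfs:")
	case strings.HasPrefix(src, "gs://"):
		return "gcs", strings.TrimPrefix(src, "gs://")
	default:
		return "unknown", src
	}
}

func main() {
	kind, rest := classifyStageSource("nfs:10.20.30.40:/filestore/user/dir/")
	fmt.Println(kind, rest) // nfs 10.20.30.40:/filestore/user/dir/
	kind, rest = classifyStageSource("gs://benchmarkfiles")
	fmt.Println(kind, rest) // gcs benchmarkfiles
}
```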
StageOutFiles creates a bucket if it does not exist before the job is submitted. If that fails, the job submission call fails. Currently only gs:// is evaluated in the StageOutFiles map.
```go
StageOutFiles: map[string]string{
	"/tmp/joboutput": "gs://outputbucket",
},
```
See examples directory.