This repository has been archived by the owner on Sep 18, 2019. It is now read-only.

BC4 Monitoring

Will Price edited this page Dec 15, 2017 · 4 revisions

Watching machine usage

The following command lists the jobs currently running in the gpu partition under the comsm0018 account, showing which machines are in use:

$ squeue -p gpu --states=running -A comsm0018
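Such a snapshot can also be summarised offline. The sketch below counts running jobs per user; the `printf` stands in for a real capture (the usernames are made up) such as `squeue -p gpu --states=running -A comsm0018 --noheader -o '%u'`:

```shell
# Count running jobs per user from a list of usernames,
# one per line, as squeue would emit with -o '%u'.
# Sample (fabricated) data in place of live squeue output:
printf 'ab123\nab123\ncd456\n' \
  | sort | uniq -c | sort -rn
```

The output is one line per user, most jobs first.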

Check gpu node states

List the state of nodes in the gpu partition:

$ sinfo -p gpu
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu          up 7-00:00:00      1  down* gpu06
gpu          up 7-00:00:00     31   idle gpu[01-05,07-32]
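To get just the number of usable nodes, the NODES column can be summed for the state of interest. A sketch, run here against the sample output above rather than a live `sinfo`:

```shell
# Sum the NODES column (4th field) for idle nodes.
# The printf reproduces the sample sinfo output above; on the
# cluster you would pipe `sinfo -p gpu` into awk directly.
printf 'PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST\ngpu          up 7-00:00:00      1  down* gpu06\ngpu          up 7-00:00:00     31   idle gpu[01-05,07-32]\n' \
  | awk 'NR > 1 && $5 == "idle" { n += $4 } END { print n }'
```

For the sample data this prints 31.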

Monitoring the queue backlog

$ squeue --Format 'username:10,account:12,priority,reservation:12,state:10,gres:10,minmemory:12,timeleft' --sort='P,b,m' --partition gpu
USER      ACCOUNT     PRIORITY            RESERVATION STATE     GRES      MIN_MEMORY  TIME_LEFT
el14718   betatest    0.00000023283064    (null)      RUNNING   gpu:1     1000M       1-21:23:36
el14718   betatest    0.00000023283064    (null)      RUNNING   gpu:1     1000M       1-21:23:36
el14718   betatest    0.00000023283064    (null)      RUNNING   gpu:1     1000M       1-21:23:36
el14718   betatest    0.00000023283064    (null)      RUNNING   gpu:1     1000M       1-21:23:06
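Since the same user can appear on many rows (as el14718 does above), it can be useful to total the GPUs requested per user. A sketch over fabricated USER/GRES pairs extracted from output like the above:

```shell
# Sum requested GPUs per user from 'USER gpu:N' pairs.
# Sample (fabricated) rows stand in for real squeue output.
printf 'el14718 gpu:1\nel14718 gpu:1\nab123 gpu:2\n' \
  | awk '{ sub(/gpu:/, "", $2); g[$1] += $2 }
         END { for (u in g) print u, g[u] }'
```

One line per user is printed with the total GPU count.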

Monitoring the queue of comsm0018 alone

$ squeue --Format 'username:10,account:12,priority,reservation:12,state:10,gres:10,minmemory:12,starttime,timeleft' --sort='P,b,m' --partition gpu --account=comsm0018
USER      ACCOUNT     PRIORITY            RESERVATION STATE     GRES      MIN_MEMORY  START_TIME          TIME_LEFT
nl11111   comsm0018   0.00001184595749    (null)      RUNNING   gpu:1     4G          2017-12-15T10:23:24 1:47:15
sb11111   comsm0018   0.00001185410656    (null)      RUNNING   gpu:1     100G        2017-12-15T10:11:36 1:35:27
nh11111   comsm0018   0.00001185038127    (null)      RUNNING   gpu:1     100G        2017-12-15T09:25:24 49:15
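Note that TIME_LEFT mixes formats: `1:47:15` is H:MM:SS, `49:15` is MM:SS, and `1-21:23:36` is D-HH:MM:SS. To sort or compare jobs by remaining walltime, such values can be converted to seconds; a sketch (the `to_secs` helper is ours, not part of Slurm):

```shell
# Convert a Slurm time string ([D-]HH:MM:SS, H:MM:SS, or MM:SS)
# to seconds. Splits on '-' and ':' and counts the fields.
to_secs() {
  echo "$1" | awk -F'[-:]' '{
    d = 0; h = 0; m = 0; s = 0
    if      (NF == 4) { d = $1; h = $2; m = $3; s = $4 }
    else if (NF == 3) { h = $1; m = $2; s = $3 }
    else              { m = $1; s = $2 }
    print d * 86400 + h * 3600 + m * 60 + s
  }'
}

to_secs 1-21:23:36   # prints 163416
to_secs 49:15        # prints 2955
```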

Count the number of GPUs currently in use

$ squeue --Format 'state,gres' -h  --partition gpu | grep 'RUNNING.*gpu' | sed 's/gpu://' | awk '{ s += $2 } END { print s }'
52
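The grep/sed/awk stage of that pipeline can be sanity-checked offline by feeding it fabricated `squeue --Format 'state,gres'` rows:

```shell
# Feed sample 'STATE GRES' rows through the same filter as above.
# The PENDING row is dropped by grep, so only running GPUs are
# summed: 1 + 2 + 1 = 4 here.
printf 'RUNNING   gpu:1\nRUNNING   gpu:2\nPENDING   gpu:4\nRUNNING   gpu:1\n' \
  | grep 'RUNNING.*gpu' | sed 's/gpu://' | awk '{ s += $2 } END { print s }'
```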

List nodes within the gpu partition

$ sinfo -p gpu
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu          up 7-00:00:00      1   comp gpu16
gpu          up 7-00:00:00     29    mix gpu[01,03-15,17-25,27-32]
gpu          up 7-00:00:00      2   down gpu[02,26]
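When nodes are spread over several states like this, a per-state tally is quicker to read than the raw table. A sketch, again run against the sample output rather than a live `sinfo`:

```shell
# Summarise node counts by state (fields 4 and 5 of sinfo output).
# The printf reproduces the sample output above.
printf 'PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST\ngpu          up 7-00:00:00      1   comp gpu16\ngpu          up 7-00:00:00     29    mix gpu[01,03-15,17-25,27-32]\ngpu          up 7-00:00:00      2   down gpu[02,26]\n' \
  | awk 'NR > 1 { n[$5] += $4 } END { for (s in n) print s, n[s] }'
```

This prints one line per state with the node count, e.g. `mix 29`.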