- Simple website to monitor a small SLURM cluster.
- Contains head node and worker node scripts that query SLURM and the system to obtain usage statistics.
- bin/slurm_task_tracker.py runs on each working node. Collects locally running jobids and system ps output. Outputs to /dev/shm/slurm_task_trackes_$(hostname -s).txt containing "raw JobID,username,account,job array id,elapsed seconds,time limit seconds,partition,cores allocated,ram allocated in bytes,hostname,jobname,current pcpu,peak pcpu,current rss in bytes,peak rss in bytes,resource tree of children pid,pcpu,cmd,ram".
- bin/slurm_pending_tasks.py runs on head node. Collects list of PENDING jobs. Stores in /dev/shm/slurm_pending_tasks.txt containing "raw JobID,username,account,job array id,elapsed seconds,time limit seconds,partition,cores allocated,ram allocated in bytes,hostname,jobname" only.
- bin/slurm_cluster_stats.py collects sinfo and sreport data for cluster-wide state. Stores in /dev/shm/slurm_cluster_stats.txt containing core-years data, node list, per-user usage stats.
- bin/slurm_report_usage_from.sh runs 'sreport -nP cluster AccountUtilizationByUser' from a set date and returns overall usage.
- bin/slurm_report_usagepercent_from.sh runs ''
- SLURM 17.2+ (squeue, sinfo, sreport)
- Python 2.7.13+ (os, datetime)
- Sufficient access to query $TMPDIR and $SCRATCH_DIR directory sizes.
- reading empty lines from historical data track
- CPU OverCommit handling.
- Rolling history code -- Not displayed yet
- Process usage error handling.
- Process usage error
- Condensed process names
- Memory-Over kill email.
- System username is now User's Firstname, First letter of last name.
- User usage now broken into CPU years, weeks, days, hours, minutes.
- All users displayed. Online are marked.
- Current percentage now partially transparent to visualize over-usage.
- Minor Formatting
- SHM_DIR, TMP_DIR as separate disk usage areas. They are now the same.
- Comma-separated values need commas removed from the data.
- Shared memory usage adds to memory usage.
- 1:10 chance to cancel any job that exceeds its memory allocation.
- Email alert when job terminates due to exceeding memory allocation.
- Query $SHM_DIR, $TMP_DIR and $SCRATCH_DIR disk space reporting.
- Display column for Disk usage. Doesn't include non-slurm areas.
- LDAP resolve error halting updates with alert to printing error log.
- numfmt requirement.
- Error logging
- Comments, glorious comments.
- Settings and config files.
- System call for hostname
- Readme
- CPU time in seconds to per-user listing.
- Parent-Child process map & return without additional system call.
- $SHM_DIR query from slurm_task_tracker.py as slurm doesn't care yet. CGROUPS required?
- pstree shell command requirement.
- Moved common functions to separate file.
- CSS div alignment issues.
- Combine slurm_report_usage_from and slurm_report_usagepercent_from into single sreport call.
- Adaptable Layout for larger clusters, multiple QoS and multiple clusters.
- Mobile friendly layout.
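
For anyone consuming the files in /dev/shm, here is a minimal sketch of reading and writing the 11-field record listed above for slurm_pending_tasks.txt. It is not the code in bin/. The names `PendingTask`, `sanitize_field`, `format_record` and `parse_record` are hypothetical, and the sketch assumes one comma-separated record per line with commas stripped from values before writing. Empty or malformed lines are skipped rather than raising.

```python
from collections import namedtuple

# Hypothetical field names, in the order listed for slurm_pending_tasks.txt.
PendingTask = namedtuple("PendingTask", [
    "raw_jobid", "username", "account", "job_array_id",
    "elapsed_seconds", "time_limit_seconds", "partition",
    "cores_allocated", "ram_allocated_bytes", "hostname", "jobname",
])


def sanitize_field(value):
    """Remove commas so a value cannot be mistaken for a field separator."""
    return str(value).replace(",", "")


def format_record(fields):
    """Join already-ordered field values into one comma-separated line."""
    return ",".join(sanitize_field(f) for f in fields)


def parse_record(line):
    """Return a PendingTask, or None for an empty or malformed line."""
    parts = line.rstrip("\n").split(",")
    if len(parts) != len(PendingTask._fields):
        return None
    return PendingTask(*parts)


if __name__ == "__main__":
    # Example values are made up; note the comma in the job name is dropped.
    line = format_record(["123", "jdoe", "lab", "123_4", 60, 3600,
                          "short", 2, 1024, "node1", "fit,model"])
    print(line)
    print(parse_record(line))
```

All parsed values stay as strings; convert the numeric fields (seconds, cores, bytes) with `int()` where needed.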