Beyond submitting jobs, the Slurm framework provides many other commands for interacting with the cluster management system.
These commands query detailed information about jobs in the execution queue and about the state of the compute nodes (available and committed resources, availability of GPU hardware, associated partitions, etc.), and they manipulate jobs in the queue (cancelling, pausing, sending signals to the running environment and many other actions).
Here are a few of our favourites:
squeue shows the status of submitted jobs. Usage: squeue -M <cluster>
More info in "man squeue".
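As a rough sketch of how squeue output can be narrowed down, the helper below lists only your own pending and running jobs on one cluster. The function name myjobs and the chosen columns are illustrative assumptions, not part of the cluster setup; -M, -u, -t and -o are standard squeue options.

```shell
# Hypothetical helper (not provided by Slurm): list your own pending and
# running jobs on one cluster.
# Usage: myjobs <cluster>
myjobs() {
    cluster="$1"
    # -M: cluster to query
    # -u: only jobs of this user
    # -t: only jobs in these states
    # -o: columns (job id, partition, job name, state, time used, nodes or pending reason)
    squeue -M "$cluster" -u "${USER:-$(id -un)}" -t PENDING,RUNNING \
        -o "%.18i %.9P %.20j %.2t %.10M %R"
}
```

Call it as: myjobs <cluster>. Dropping the -t option shows jobs in all active states.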
ssqueue is a shortcut for different output formats of squeue. Usage:
ssqueue [-M <cluster>]
[-k] [-r] [-l]
[-t <status_list>]
[-o <field_list>|ALL]
[-u <user[s]>]
[--fields]
[-w <nodes>]
[-A account[s]]
[-j job[s]]
scancel cancels a job. Usage: scancel <job id>
More info in "man scancel".
To keep a queued job from starting (e.g. to give another job a chance to run), run: scontrol hold <job id>
To release it: scontrol release <job id>
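When several jobs need to be held or released together, the two scontrol commands above can be wrapped in a small loop. This is only a sketch: the function name hold_or_release is hypothetical, and the job IDs in the usage line are made up.

```shell
# Hypothetical helper (not provided by Slurm): hold or release several jobs.
# Usage: hold_or_release hold 1001 1002
#        hold_or_release release 1001 1002
hold_or_release() {
    action="$1"
    shift
    # Accept only the two scontrol actions described above.
    case "$action" in
        hold|release) ;;
        *) echo "usage: hold_or_release hold|release <job id>..." >&2
           return 2 ;;
    esac
    # Apply the action to each job ID in turn.
    for job in "$@"; do
        scontrol "$action" "$job"
    done
}
```

Restricting the first argument to hold or release keeps a typo from being passed on to scontrol as some other subcommand.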
sacct shows statistics about previous jobs, e.g.: sacct
Long format: sacct -l
All users: sacct -a
Since 1/1/2013: sacct -S 2013-01-01
Or any combination of the options.
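To illustrate combining the options, the sketch below wraps the three flags listed above in one call. The function name jobs_since is a hypothetical choice for this example; the flags themselves are the ones shown above.

```shell
# Hypothetical helper (not provided by Slurm): long-format accounting data
# for all users, starting from a given date.
# Usage: jobs_since 2013-01-01
jobs_since() {
    # -a: all users, -l: long format, -S: start date (YYYY-MM-DD)
    sacct -a -l -S "$1"
}
```

On many clusters, viewing other users' records with -a requires suitable privileges, so the output may be limited to your own jobs.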
ssacct is a shortcut for different output formats of sacct. Usage: ssacct or ssacct --res
sinfo shows data about the cluster and the nodes. More info in "man sinfo".
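As a minimal sketch of a node-oriented sinfo query, the helper below prints one line per node with its partition, state, CPU count, memory and generic resources (which is where GPUs are listed). The function name node_table and the column selection are assumptions made for this example; -N, -p and -o are standard sinfo options.

```shell
# Hypothetical helper (not provided by Slurm): one line per node, optionally
# limited to a single partition.
# Usage: node_table [partition]
node_table() {
    # -N: node-oriented output (one line per node)
    # -p: restrict to one partition
    # -o: columns (node name, partition, state, CPUs, memory in MB, generic resources)
    if [ -n "${1:-}" ]; then
        sinfo -N -p "$1" -o "%N %P %t %c %m %G"
    else
        sinfo -N -o "%N %P %t %c %m %G"
    fi
}
```

A node that belongs to several partitions appears once per partition in this listing.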
ssinfo shows detailed data about each node. Usage: ssinfo
Shows data about running jobs (e.g. memory, time).
Shows general information about the available resources of the cluster (memory, GPUs, ...) and about the current usage by different users.
Shows the current usage and limits of an account.