Beyond submitting jobs, the Slurm framework provides several other commands for interacting with the cluster management system. These commands allow querying detailed information about jobs in the execution queue and about the state of the compute nodes (such as their available and allocated resources, GPU hardware, associated partitions, etc.), as well as manipulating jobs in the queue (cancelling, holding, sending signals to the running environment, and many other actions).
Below is a short list of a few such common commands.
sbatch is used to schedule a script to run as soon as resources become available.
Usage:
sbatch [options] <script>
Options (most are also relevant for srun):
-c n | Allocate n CPUs (per task). |
-t t | Total run time limit (e.g. "2:0:0" for 2 hours, or "2-0" for 2 days and 0 hours). |
--mem-per-cpu m | Allocate m MB of memory per CPU. |
--mem m | Allocate m MB of memory per node (--mem and --mem-per-cpu are mutually exclusive). |
--array=1-k%p | Run the script k times (with indices 1 to k). The array index of the current run is available inside the script via the SLURM_ARRAY_TASK_ID environment variable. The optional %p parameter limits the array to at most p simultaneously running jobs (usually it's nicer to the other users). |
--wrap cmd | Instead of giving a script to sbatch, run the command cmd. |
-M cluster | The cluster to run on. Can be a comma-separated list of clusters, in which case the one with the earliest expected job initiation time is chosen. |
-n n | Allocate resources for n tasks. Default is 1. Only relevant for parallel jobs, e.g. with MPI. |
--gres resource | Specify a generic resource to use. Currently only GPUs and video memory (vmem) are supported, e.g. gpu:2 for two GPUs. On clusters with several types of GPUs, a specific GPU type can be requested, e.g. gpu:m60:2 for 2 M60 GPUs, or a minimum amount of video memory, e.g. gpu:1,vmem:6g. |
More info in "man sbatch".
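For example, a minimal job script combining a few of these options might look like the following sketch (the script and file names and the resource amounts are placeholders, not cluster defaults; the same options can also be passed on the sbatch command line):
#!/bin/bash
#SBATCH -c 2
#SBATCH -t 2:0:0
#SBATCH --mem=4000
#SBATCH --gres=gpu:1
#SBATCH --array=1-10%3

# Each array task gets its own index via SLURM_ARRAY_TASK_ID
echo "Running array task ${SLURM_ARRAY_TASK_ID} on $(hostname)"
# my_script.py is a placeholder for your own program
python my_script.py "${SLURM_ARRAY_TASK_ID}"
Saved e.g. as my_job.sh, it is submitted with: sbatch my_job.sh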
squeue shows the status of submitted jobs.
Usage:
squeue -M <cluster>
More info in "man squeue".
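For example, to list only your own jobs on a specific cluster (using squeue's standard -u flag; the cluster name is a placeholder):
squeue -M <cluster> -u $USER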
ssqueue is a shortcut for a different output format of squeue.
Usage:
ssqueue [-M <cluster>]
[-k] [-r] [-l]
[-t <status_list>]
[-o <field_list>|ALL]
[-u <user[s]>]
[--fields]
[-w <nodes>]
[-A account[s]]
[-j job[s]]
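For example, based on the flags listed above, showing only a specific user's running jobs on a specific cluster might look like this (the exact accepted status values depend on the local ssqueue wrapper):
ssqueue -M <cluster> -u <user> -t RUNNING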
scancel cancels a job.
Usage:
scancel <job id>
More info in "man scancel".
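For example, to cancel a single job by its id, or all of your own jobs at once (the job id is a placeholder; -u is scancel's standard per-user flag):
scancel 123456
scancel -u $USER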
To hold a job, preventing it from starting (e.g. to give another job a chance to run), run:
scontrol hold <job id>
To release it:
scontrol release <job id>
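For example, a small sketch for holding all of your currently pending jobs at once, by feeding the pending job ids from squeue into scontrol (PD is the pending state, %i prints the job id):
squeue -h -u $USER -t PD -o %i | xargs -r scontrol hold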
To run commands interactively, use the srun command. This will block until there are resources available, and will redirect the input/output of the program to the executing shell. srun has most of the same parameters as sbatch.
If the input/output isn't working correctly (e.g. with shell jobs), adding the --pty flag usually solves the issue.
On some of the clusters, interactive jobs have some limitations compared to normal batch jobs.
More info in "man srun".
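For example, to open an interactive shell with 2 CPUs and 4000 MB of memory for 8 hours (the resource amounts are arbitrary):
srun -c 2 --mem=4000 -t 8:0:0 --pty bash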
sacct is used to view statistics about previous jobs.
For example:
sacct
Long format:
sacct -l
All users:
sacct -a
Since January 1st, 2013:
sacct -S 2013-01-01
Or any combination of the options.
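For example, combining the options above to show long-format accounting data for all users since the beginning of 2013, or selecting specific fields with sacct's standard --format option (these field names are standard sacct fields, not specific to this cluster):
sacct -a -l -S 2013-01-01
sacct -S 2013-01-01 --format=JobID,JobName,Elapsed,MaxRSS,State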
ssacct is a shortcut for different output formats of sacct.
Usage:
ssacct
or
ssacct --res
sinfo shows data about the cluster and its nodes.
More info in "man sinfo".
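For example, a per-partition summary of node counts, CPUs, memory and generic resources, using sinfo's standard format specifiers:
sinfo -o "%P %D %c %m %G"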
ssinfo shows detailed data about each node. Usage:
ssinfo
Shows data about running jobs (e.g. memory usage, elapsed time).
Shows general information about the available resources of the cluster (memory, GPUs, etc.) and about the current usage by the different users.
Shows the current usage and limits of an account.