Understanding the memory (RAM) and processor (CPU core) requirements of your jobs helps you use cluster resources more efficiently, which in turn saves money and gets work done sooner.
This wiki page demonstrates and explains tools for checking the processor (CPU core) and memory (RAM) utilization of your jobs.
Slurm includes a tool called seff that reports memory utilization and CPU efficiency for completed jobs. Its output for running and failed jobs is not reliable, so use it only for jobs that completed successfully.
You can also launch your program with /usr/bin/time in front of your command, so that the system tracks the program and reports statistics about its CPU and RAM usage when it exits.
Below are examples of how to measure CPU and RAM usage with Slurm seff and the Linux time command. The examples use both tools together, but either one on its own is enough to check a job's utilization.
The code used in the examples below is a Python script that multiplies two 2-dimensional matrices, each with 3000 rows and 3000 columns. The script was not written to run on multiple CPU cores, so it uses only one CPU core for the multiplication.
The sbatch script submitted for the non-parallel code:
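The script itself is not shown on this page. As a rough, hypothetical stand-in, a single-threaded multiplication in pure Python could look like the sketch below. The function names and the small default size are assumptions for illustration only, not the script that produced the numbers on this page. A pure-Python loop at 3000 x 3000 is likely to take considerably longer than the run times shown here.

```python
import random
import time


def make_matrix(n, seed):
    """Build an n x n matrix of pseudo-random floats (a list of rows)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n)] for _ in range(n)]


def multiply(a, b):
    """Multiply two matrices on a single core using plain Python loops."""
    # Transpose b once so that each of its columns can be read as a row.
    b_columns = list(zip(*b))
    return [
        [sum(x * y for x, y in zip(row, col)) for col in b_columns]
        for row in a
    ]


if __name__ == "__main__":
    n = 100  # hypothetical small size; the job on this page used 3000
    start = time.perf_counter()
    product = multiply(make_matrix(n, seed=1), make_matrix(n, seed=2))
    elapsed = time.perf_counter() - start
    print(f"multiplied two {n}x{n} matrices in {elapsed:.2f} s")
```

Because nothing in this sketch starts extra processes or threads, it can keep at most one core busy regardless of how many cores the job is allocated.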
#!/bin/bash
#SBATCH --job-name=res_util
#SBATCH --ntasks=1
#SBATCH --mem=7G
#SBATCH --time=01:00:00
/usr/bin/time -v python3 matrix_multiplication_non_parallel.py
In this job we request resources to run 1 task with 7 GB of memory for up to 1 hour.
Slurm's default is 1 core per task, so requesting 1 task makes Slurm allocate 1 core for this job.
Once the job has finished, run the seff tool with the job ID:
seff <jobid>
Here is the sample output of seff:
Job ID: 914474
Cluster: moriah
User/Group: yaronw/system
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:03:22
CPU Efficiency: 98.54% of 00:03:25 core-walltime
Job Wall-clock time: 00:03:25
Memory Utilized: 565.56 MB
Memory Efficiency: 7.89% of 7.00 GB
As you can see, the total run time was 3 minutes and 25 seconds.
CPU efficiency was 98.54%, which means that the one CPU core allocated to this job was busy for almost the entire run time.
Memory utilization was very low: only about 565 MB of the 7 GB requested for this job.
A good practice is to request 10-15% more memory than what was utilized. In this example requesting 700MB would have been enough.
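The headroom rule above is simple arithmetic. The following is a minimal sketch, assuming the measured peak is given in MB; the function name and the 15% default are hypothetical choices for illustration.

```python
import math


def recommended_mem_mb(peak_mb, headroom=0.15):
    """Return the measured peak plus a safety margin, rounded up to whole MB."""
    return math.ceil(peak_mb * (1 + headroom))


# Memory Utilized from the seff output above: 565.56 MB
print(recommended_mem_mb(565.56))        # 15% headroom
print(recommended_mem_mb(565.56, 0.10))  # 10% headroom
```

Both results are below 700 MB, which is consistent with the suggestion that requesting 700 MB would have been enough for this job.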
To see the output of the /usr/bin/time command, look in the job's output file:
Command being timed: "python3 matrix_multiplication_non_parallel.py"
User time (seconds): 202.27
System time (seconds): 0.07
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:22.79
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 615484
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 134
Minor (reclaiming a frame) page faults: 6013
Voluntary context switches: 845
Involuntary context switches: 77695
Swaps: 0
File system inputs: 25424
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
From this long output, these are the important lines for understanding resource utilization:
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:22.79
Maximum resident set size (kbytes): 615484
Maximum resident set size is the peak memory used by your code. In this example it is 615484 KB, which is approximately 601 MB.
Now let's run the same non-parallel code with 4 cores by replacing the --ntasks line in the submit script we used before and adding a --nodes line:
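If you check many jobs, you may prefer to extract this value from the job output file instead of reading it by eye. The sketch below assumes the file contains the `/usr/bin/time -v` line exactly as shown above; the function name is hypothetical.

```python
import re


def max_rss_mb(time_output):
    """Extract 'Maximum resident set size' from time -v output, in MB."""
    match = re.search(
        r"Maximum resident set size \(kbytes\):\s*(\d+)", time_output
    )
    if match is None:
        raise ValueError("no 'Maximum resident set size' line found")
    # The value is reported in kilobytes; divide by 1024 to get megabytes.
    return int(match.group(1)) / 1024


sample = """Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:22.79
Maximum resident set size (kbytes): 615484
"""
print(f"{max_rss_mb(sample):.0f} MB")
```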
#SBATCH --ntasks=4
#SBATCH --nodes=1
With this change Slurm will allocate 4 cores on one node.
seff output for this job:
Job ID: 914390
Cluster: moriah
User/Group: yaronw/system
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:03:19
CPU Efficiency: 24.88% of 00:13:20 core-walltime
Job Wall-clock time: 00:03:20
Memory Utilized: 570.37 MB
Memory Efficiency: 7.96% of 7.00 GB
As expected, the CPU efficiency dropped to 24.88%, which means that only one of the 4 allocated cores was actually used by this job.
Here is the relevant output from the /usr/bin/time command:
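The efficiency figures in both seff outputs can be reproduced by dividing the CPU time used by the core-walltime, which is the wall-clock time multiplied by the number of allocated cores. The sketch below assumes times in plain HH:MM:SS form (no day prefix); the function names are hypothetical.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_utilized, wall_clock, cores):
    """Percentage of the allocated core-walltime that was spent on CPU work."""
    core_walltime = to_seconds(wall_clock) * cores
    return 100 * to_seconds(cpu_utilized) / core_walltime


# 1-core job: 00:03:22 CPU time, 00:03:25 wall-clock time
print(f"{cpu_efficiency('00:03:22', '00:03:25', 1):.2f}%")
# 4-core job: 00:03:19 CPU time, 00:03:20 wall-clock time
print(f"{cpu_efficiency('00:03:19', '00:03:20', 4):.3f}%")
```

The CPU time is almost the same in both jobs. Only the denominator grew by a factor of four, which is why the efficiency fell to roughly a quarter.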
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:19.45
Maximum resident set size (kbytes): 615488
A CPU percentage of around 100% means that the job used only 1 core. If all 4 cores had been used, the value would have been closer to 400%.
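The percentage can be read as an average number of busy cores: CPU time (user plus system) divided by elapsed time. The sketch below is an illustration that assumes this is how the figure is derived, using the user, system, and elapsed values from the full time output of the 1-core job; the function name is hypothetical.

```python
def busy_cores(user_seconds, system_seconds, elapsed_seconds):
    """Average number of cores kept busy over the run."""
    return (user_seconds + system_seconds) / elapsed_seconds


# User time 202.27 s, System time 0.07 s, Elapsed 3:22.79 (202.79 s)
cores = busy_cores(202.27, 0.07, 202.79)
print(f"{cores:.2f} cores busy on average, about {int(cores * 100)}% CPU")
```

A program that kept 4 cores fully busy would show a CPU time roughly 4 times its elapsed time, giving a value near 4.0 here.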