Last updated: 2025-03-13
In this exercise, we will be taking our first steps with the cluster. We will:

- write a Slurm submission script and submit it with `sbatch`
- follow our jobs with `squeue` and `sacct`
- check how many resources a job actually used and adjust the reservation accordingly
- cancel a job with `scancel`
⁕ ⁕ ⁕ ⁕ ⁕ ⁕
It's the same as for Exercise 0{:target="_blank"}: you should be connected to the cluster and on the master node (i.e. `slurmlogin` should be written in your terminal prefix).
⁕ ⁕ ⁕ ⁕ ⁕ ⁕
Using an interactive session (`srun --pty bash`) is useful when you're setting up a new analysis pipeline and getting to know how to use the software within it. In the long run, it's better to use a submission script instead so that you free the resources as soon as your software has finished running. Also, this means you can switch off the connection to the cluster without affecting your job (think of it like posting a letter: switching your computer off won't affect the job that is running).
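For reference, this is roughly what the interactive workflow looks like (a minimal sketch; exiting the shell is what frees the node):

```bash
# open an interactive shell on a compute node
srun --pty bash
# ... test your commands interactively on the node ...
# leave the node and free the reserved resources
exit
```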
In this exercise, we will need to create & edit files. We suggest you use the command-line text editor `nano`, but feel free to use whatever editor you prefer. If you're working on the `/store/` space, remember that you also have access to it outside the cluster (see this and this page on the intranet).
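For example, to create or edit the script used below with nano (a small sketch; in nano, Ctrl+O saves and Ctrl+X exits):

```bash
nano slurm_script.sh
```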
The Slurm submission script is just a script written in bash (that's the language of the cluster, e.g. `cd` and `ls` are bash commands). You can put in a bash script whatever you would normally write in the terminal, with one command per line.
Let's create a script called `slurm_script.sh` that will print "hello world" on the screen:

```bash
#! /bin/bash
echo "hello world"
```
Next, we will submit the script to the scheduler, which will look for a free node to run it on. To submit your script, you can use the `sbatch` command:
```
john.doe@slurmlogin:/home/john.doe$ sbatch slurm_script.sh
Submitted batch job 123456
john.doe@slurmlogin:/home/john.doe$
```
Note: depending on resource availability, your job might need to queue or might start running directly. To see its status, you can use Slurm's built-in `squeue` command.

NB: `squeue` only shows jobs that are currently queuing or running (not finished ones), so you might not see your job if you're not quick enough!
```
john.doe@slurmlogin:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123456 common slurm_sc john.doe R 0:04 1 node24
```
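If lots of other jobs are listed, you can narrow the output down to your own jobs with the standard `-u` option (a small sketch using the `$USER` environment variable):

```bash
# only show jobs belonging to the current user
squeue -u "$USER"
```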
In order to see past jobs, you can use the `sacct` command with a few options, for example `sacct -X`:
```
john.doe@slurmlogin:~$ sacct -X
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
115896 bash ngs 1 CANCELLED+ 0:0
115960 snakemake+ ngs 8 FAILED 1:0
123456 bash common 1 COMPLETED 0:0
```
The "FAILED" state is sometimes a bit misleading: a job marked as "FAILED" may still have run properly. Check whether the expected output was generated before panicking ;-)
`sacct -X` will list all your jobs that are or have been running from midnight onwards. A few useful options are:

- `-X`: simplifies the output (very useful unless you are using steps within your jobs)
- `-S 2025-03-10`: defines the start date/time (e.g. 10th of March 2025 = show all jobs since then); the default is midnight of the current day
- `-a`: not used here, but you can add it to show the jobs of all users and not only yours
- `-j <jobid>`: only show the job with the given job id
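Putting a couple of these options together (the date and job id below are just the examples used on this page):

```bash
# simplified listing of your jobs since the 10th of March 2025
sacct -X -S 2025-03-10

# simplified listing restricted to one job
sacct -X -j 123456
```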
The SICS installed a set of scripts from slurm-tools that you can use to do the same as above. You don't have to remember everything, it's up to you to choose your favorite ones:

- `jobqueue`: get the list of jobs that are queuing or running
- `jobhist`: get the list of all jobs that have run, are running or are queuing

Example outputs:
```
john.doe@slurmlogin:~$ jobqueue
-----------------------------------------------------------------------------------------
Job ID User Job Name Partition State Elapsed Nodelist(Reason)
------------ ---------- ------------ ---------- -------- ----------- --------------------
123456 john.doe bash common RUNNING 0:02 node24
```

```
john.doe@slurmlogin:~$ jobhist
----------------------------------------------------------------------------------------------------
Job ID Startdate User Job Name Partition State Elapsed Nodes CPUs Memory
------------- ---------- ---------- ------------ ---------- ---------- ----------- ----- ---- ------
115896 2025-03-05 jane.doe bash ngs CANCELLED+ 2-21:36:56 1 1 1000Mc
115960 2025-03-08 jane.doe snakemake_p+ ngs FAILED 00:00:02 1 8 64Gn
123456 2025-03-09 john.doe bash common RUNNING 00:00:42 1 1 1000Mc
```
NB: Gn = Gb/node and Gc = Gb/cpu, same for M (Mb), K (Kb) and T (Tb)
The command `echo "hello world"` would normally print "hello world" on your screen… When running scripts remotely on the nodes, anything that is usually printed on the screen is saved in a file instead.

Have a look at your working directory: you should see an extra file in there. If you open it (with the `cat` command for example), you should see "hello world" in there.
```
john.doe@slurmlogin:/home/john.doe$ ls
slurm_script.sh slurm-123456.out
john.doe@slurmlogin:/home/john.doe$ cat slurm-123456.out
hello world
```
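By default this file is called `slurm-<jobid>.out`. If you prefer a custom name, you can pass the `--output` option described further down (a small sketch; the file name is hypothetical):

```bash
# write the job's standard output to hello.log instead of slurm-<jobid>.out
sbatch --output=hello.log slurm_script.sh
```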
Take home message

- interactive sessions are useful for testing (`srun --pty bash`)
- submit your script with the `sbatch` command
- use `squeue` to list all jobs that are queuing/running, or `sacct` to also list past jobs

🔗 Back to exercise page{:target="_blank"}
The cluster is a shared resource, so it's important to make sure that your jobs are submitted with a reasonable amount of requested resources. The default parameters are 2Gb of RAM, 1 CPU and a maximum running time of 2 hrs.
To know how much of the reserved resources your job actually used, you can use a combination of different Slurm commands. However, usage and outputs are not always very clear for beginner users. We suggest you use the `jobinfo <jobid>` command from slurm-tools (already installed on the I2BC cluster). For example, `jobinfo 123456` will output:
```
Job ID : 123456
Job name : bash
User : john.doe
Account :
Working directory : /data/work/I2BC/john.doe/testrun
Cluster : cluster
Partition : common
Nodes : 1
Nodelist : node24
Tasks : 1
CPUs : 1
GPUs : 0
State : COMPLETED
Exit code : 0:0
Submit time : 2025-03-09T09:07:56
Start time : 2025-03-09T09:07:56
End time : 2025-03-09T09:08:38
Wait time : 00:00:00
Reserved walltime : 00:00:00
Used walltime : 00:00:42 # Actual run time of job
Used CPU walltime : 00:00:42 # Used walltime x number of CPUs
Used CPU time : 00:00:00 # Total time that CPUs were actually used for
CPU efficiency : 0.18% # Used CPU time / Used CPU walltime
% User (computation) : 50.65%
% System (I/O) : 49.35%
Reserved memory : 1000M/core
Max memory used : 9.29M (estimate) # Maximum memory used
Memory efficiency : 0.93% # Max memory used / Reserved memory
Max disk write : 256.00K
Max disk read : 512.00K
```
How to read this output? Job ID, Job name, User, Partition, Nodes, Nodelist and CPUs are the same as before and quite self-explanatory. The interesting fields are:

- CPU efficiency = how efficiently you used the CPUs you've reserved; this should be as close to 100% as possible
- Memory efficiency = how efficiently you used the memory you've reserved; this should be as close to 100% as possible

So in our case, with 0.18% and 0.93%, our resource efficiency is quite poor…
To adjust the resources, you can add a few extra options when running `srun` or `sbatch`. Useful options in this case are:
| option | function |
|---|---|
| `--mem=xxM` / `--mem=xxG` | reserve the specified amount of RAM memory in Mb or Gb |
| `--cpus-per-task=x` | reserve the specified number of processors (CPUs) |
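The same options also work for interactive sessions started with `srun`; a quick sketch (the values are hypothetical):

```bash
# interactive session with 4 CPUs and 4Gb of RAM (hypothetical values)
srun --mem=4G --cpus-per-task=4 --pty bash
```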
You'll find more options in the "cheat sheet" tab on the intranet.
Let's adjust the resources of our previous job

Since our previous job only used very few resources, there is no sense in reserving 2Gb, so let's reduce it to 100Mb. We've used 0.18% of the reserved CPU, but 1 CPU is already the minimum, so we'll keep it that way.
There are 2 methods to specify these parameters to `sbatch`: either directly on the command line:

```bash
sbatch --mem=100M --cpus-per-task=1 slurm_script.sh
```

or within the script itself, using `#SBATCH` directives:

```bash
#! /bin/bash
#SBATCH --mem=100M
#SBATCH --cpus-per-task=1
echo "hello world"
```

and then resubmit with `sbatch slurm_script.sh`.
Note that the option syntax is the same in both cases; inside the script, each option is additionally prefixed with `#SBATCH` and each directive goes on a separate line.
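The same pattern extends to any other `sbatch` option; for example (the job name and time limit below are hypothetical additions, the other lines match the script above):

```bash
#! /bin/bash
# the first two options match the script above; the job name and the
# 10-minute time limit are hypothetical additions shown as further examples
#SBATCH --mem=100M
#SBATCH --cpus-per-task=1
#SBATCH --job-name="hello_world"
#SBATCH --time=00:10:00

echo "hello world"
```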
NB: increasing the number of processors (CPUs or threads) won't accelerate your job if the software you're using doesn't support parallelisation.
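If your software does support parallelisation, a common pattern is to pass the number of reserved CPUs to it via the `SLURM_CPUS_PER_TASK` environment variable that Slurm sets for the job (a sketch; the tool and its `--threads` flag are hypothetical placeholders):

```bash
#! /bin/bash
#SBATCH --cpus-per-task=4

# SLURM_CPUS_PER_TASK holds the number of CPUs reserved with --cpus-per-task;
# the tool name and its --threads option are placeholders for your own software
some_multithreaded_tool --threads "$SLURM_CPUS_PER_TASK" input.file
```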
Take home message

- `jobinfo` is a custom script that gives you information on what your job actually used
- you can adjust the reserved resources with options to the `sbatch` command, whether directly at execution or within the Slurm submission script (`srun` takes the same options as `sbatch`)
)⁕ ⁕ ⁕ ⁕ ⁕ ⁕
What if I changed my mind? How can I stop the job I just submitted??!

scancel

In addition to `sbatch`/`srun` (submit a job) and `squeue`/`sacct` (follow a job/check a job's resources), the third Slurm command you should know is `scancel`, which cancels a job. You can only cancel your own jobs; cancelling other people's jobs won't work.
Let’s see what jobs are running:
```
john.doe@slurmlogin:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123456 common bash john.doe R 0:04 1 node24
123457 common test luke.doe R 0:09 1 node25
```
Let’s now try to delete Luke’s job:
```
john.doe@slurmlogin:~$ scancel 123457
scancel: Unauthorized Request 123457
```
See? Nothing happened because I can only cancel my own jobs.
Let's first add a line to your previous `slurm_script.sh` to make sure it runs long enough for us to cancel it:

```bash
#! /bin/bash
#SBATCH --mem=100M
#SBATCH --cpus-per-task=1
echo "hello world"
sleep 2m # wait for 2 minutes
```
Then:

1. resubmit it with `sbatch slurm_script.sh`, and note its job id
2. check it with `squeue` (you can get the job id here too)
3. cancel it with `scancel 123456` (using your own job id)
4. check again with `squeue`

Your job should have changed status and then disappeared after a little while, as expected.
```
john.doe@slurmlogin:~$ sbatch slurm_script.sh
Submitted batch job 123456
john.doe@slurmlogin:~$
```
Job id is 123456 in this case.
`squeue`:

```
john.doe@slurmlogin:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123456 common bash john.doe R 0:04 1 node24
john.doe@slurmlogin:~$ scancel 123456
```

`squeue`:

```
john.doe@slurmlogin:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123456 common bash john.doe CG 0:09 1 node24
```
CG = Completing
After a while, it should be gone.
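Side note: should you ever need to cancel all of your jobs at once, `scancel` also accepts a user filter (a sketch; use with care):

```bash
# cancel every job belonging to the current user
scancel -u "$USER"
```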
Take home message

- use `scancel` to cancel a job using its unique job id

⁕ ⁕ ⁕ ⁕ ⁕ ⁕
- the main Slurm commands are `sbatch` & `srun` (submit), `squeue` (follow), `sacct` (stats) and `scancel` (cancel)
- all of them have manual (`man`) pages
- useful options for `sbatch` & `srun`:
| option | function |
|---|---|
| `--mem=<mem>` | to specify the amount of memory to reserve per node |
| `--cpus-per-task=<cpus>` | to specify the number of CPUs to reserve per task (default number of tasks = 1) |
| `--job-name="<jobname>"` | to specify a job name (no spaces or special characters please) |
| `--time=[DD-]HH:MM:SS` | to specify the maximum running time (default: 2hrs) |
| `--partition=common` | to specify the partition (= group of nodes) to submit your job to |
| `--output /path/to/output.log` | name of the file to save standard output to |
| `--error /path/to/error.log` | if specified, standard error is written to the separate file given in the option |
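To tie these together, here is what a submission script combining several of these options could look like (all values below are hypothetical placeholders):

```bash
#! /bin/bash
# all of the values below are hypothetical placeholders
#SBATCH --job-name="my_analysis"
#SBATCH --partition=common
#SBATCH --mem=4G
#SBATCH --cpus-per-task=2
#SBATCH --time=01:00:00
#SBATCH --output=/path/to/output.log
#SBATCH --error=/path/to/error.log

echo "hello world"
```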
More options in the "cheat sheet" tab on the intranet and the official Slurm documentation.
🔗 Back to exercise page{:target=“_blank”}