Exercise 1 - Going further

Authors: BIOI2 & I2BC members

Last updated: 2025-03-13

Objectives

In this Exercise, we will be taking our first steps with the cluster. We will:

  1. transcribe an analysis workflow into a Slurm submission script and submit it (Objective 1)
  2. analyse and adjust the resources our jobs consume (Objective 2)
  3. cancel a job we no longer need (Objective 3)

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Setup

It’s the same as for Exercise 0{:target="_blank"}: you should be connected to the cluster and on the master node (i.e. slurmlogin should appear in your terminal prompt).

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Objective 1 - Transcribing your analysis workflow into a Slurm submission script

Using an interactive session (srun --pty bash) is useful when you’re setting up a new analysis pipeline and getting to know how to use the software within it. In the long run, it’s better to use a submission script instead so that you free the resources as soon as your software has finished running. Also, this means you can switch off the connection to the cluster without affecting your job (think of it like posting a letter - switching your computer off won’t affect the job that is running).

In this exercise, we will need to create & edit files. We suggest you use the in-line text editor nano but feel free to use whatever editor you prefer. If you’re working on the /store/ space, remember that you also have access to it outside the cluster (see this and this page on the intranet).
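
For example, to create and edit the script from the next section with nano (Ctrl+O then Enter to save, Ctrl+X to exit):

nano slurm_script.sh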

My first submission script

The Slurm submission script is just a script written in bash (the language of the cluster’s command line, i.e. cd and ls are bash commands). You can put whatever you would normally type in the terminal into a bash script, with one command per line.

Let’s create a script called slurm_script.sh that will print “hello world” on the screen:

#! /bin/bash

echo "hello world"

Submitting the script

Next, we will submit the script to the scheduler, which will look for a free node to run it on.

To submit your script, you can use the sbatch command:

john.doe@slurmlogin:/home/john.doe$ sbatch slurm_script.sh
Submitted batch job 123456
john.doe@slurmlogin:/home/john.doe$

Note: the number in the output (123456 here) is your job’s unique id; you’ll need it to follow or cancel the job later on.
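
If you want to capture just the numeric job id (e.g. to reuse it in a script), sbatch has a --parsable option (a standard Slurm option, not specific to this cluster); a minimal sketch:

jobid=$(sbatch --parsable slurm_script.sh)   # --parsable prints only the job id
echo "submitted job ${jobid}"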

Following your job

Depending on resource availability, your job might start running directly or have to queue. To see its status, you can use Slurm’s built-in squeue command.

NB: squeue only shows jobs that are still queuing or running, so you might not see your job if you’re not quick enough!

Example output:
john.doe@slurmlogin:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            123456    common slurm_sc john.doe  R       0:04      1 node24
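
If the queue is busy, you can restrict squeue to your own jobs with the standard -u option (generic Slurm, not specific to this cluster):

squeue -u $USER    # only list your own jobs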

In order to see past jobs, you can use the sacct command with a few options:

sacct -X
Example output & explanation:
john.doe@slurmlogin:~$ sacct -X
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
115896             bash        ngs                     1 CANCELLED+      0:0
115960       snakemake+        ngs                     8     FAILED      1:0
123456             bash     common                     1  COMPLETED      0:0

The “FAILED” state is sometimes a bit misleading: a job marked as “FAILED” may still have run properly. Check if the expected output was generated before panicking ;-)

sacct -X will list all your jobs that are or have been running from midnight onwards. A few useful options are:
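
For example, a few generally available sacct options (standard Slurm options; see man sacct for the full list):

sacct -X -S 2025-03-01    # all your jobs since a given start date (-S / --starttime)
sacct -X -j 123456        # a single job selected by its id (-j / --jobs)
sacct -X --format=JobID,JobName,Partition,State,Elapsed,AllocCPUS    # choose which columns to display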

Other custom commands you can use on the I2BC cluster

The SICS installed a set of scripts from slurm-tools that cover the same needs as above: jobqueue (similar to squeue), jobhist (similar to sacct -X) and jobinfo (detailed resource usage of a single job, see Objective 2). You don’t have to remember everything; it’s up to you to choose your favorite ones:

Example outputs:

john.doe@slurmlogin:~$ jobqueue
-----------------------------------------------------------------------------------------
Job ID             User     Job Name  Partition    State     Elapsed     Nodelist(Reason)
------------ ---------- ------------ ---------- -------- ----------- --------------------
123456         john.doe        bash     common  RUNNING        0:02               node24
john.doe@slurmlogin:~$ jobhist
----------------------------------------------------------------------------------------------------
Job ID         Startdate       User     Job Name  Partition      State     Elapsed Nodes CPUs Memory
------------- ---------- ---------- ------------ ---------- ---------- ----------- ----- ---- ------
115896        2025-03-05   jane.doe         bash        ngs CANCELLED+  2-21:36:56     1    1 1000Mc
115960        2025-03-08   jane.doe snakemake_p+        ngs     FAILED    00:00:02     1    8   64Gn
123456        2025-03-09   john.doe         bash     common    RUNNING    00:00:42     1    1 1000Mc

NB: Gn = Gb per node and Gc = Gb per CPU; the same suffixes apply to M (Mb), K (Kb) and T (Tb)

Checking that your job worked

The command echo "hello world" should normally print “hello world” on your screen… When running scripts remotely on the nodes, anything that is usually printed on the screen is saved in a file instead.

Have a look at your working directory: you should see an extra file named slurm-<jobid>.out. If you open it (with the cat command for example), you should see “hello world” inside.

john.doe@slurmlogin:/home/john.doe$ ls
slurm_script.sh  slurm-123456.out
john.doe@slurmlogin:/home/john.doe$ cat slurm-123456.out
hello world
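
While a job is still running, you can watch its output file grow with the standard tail command (not Slurm-specific):

tail -f slurm-123456.out    # press Ctrl+C to stop following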

Take home message

  1. on a routine basis, using a Slurm submission script is better than using an interactive session (srun --pty bash)
  2. a submission script is just a bash script & to submit it, you use the sbatch command
  3. with a script, anything that would normally be printed on the screen ends up in a file (slurm-<jobid>.out by default)
  4. squeue to list all jobs that are queuing/running, or sacct to also list past jobs

🔗 Back to exercise page{:target="_blank"}

Objective 2 - Analysing & adjusting resource consumption

Analysing resource consumption

The cluster is a shared resource, so it’s important to submit your jobs with a reasonable resource request. The default parameters are 2Gb of RAM, 1 CPU and a maximum running time of 2 hrs.

To know how much of the reserved resources your job actually used, you can combine several Slurm commands, but their usage and outputs are not always clear for beginners. We suggest you use the jobinfo <jobid> command from slurm-tools (already installed on the I2BC cluster).

Example output and explanation:

jobinfo 123456 will output:

Job ID               : 123456
Job name             : bash
User                 : john.doe
Account              :
Working directory    : /data/work/I2BC/john.doe/testrun
Cluster              : cluster
Partition            : common
Nodes                : 1
Nodelist             : node24
Tasks                : 1
CPUs                 : 1
GPUs                 : 0
State                : COMPLETED
Exit code            : 0:0
Submit time          : 2025-03-09T09:07:56
Start time           : 2025-03-09T09:07:56
End time             : 2025-03-09T09:08:38
Wait time            :  00:00:00
Reserved walltime    :  00:00:00
Used walltime        :  00:00:42  # Actual run time of job
Used CPU walltime    :  00:00:42  # Used walltime x number of CPUs
Used CPU time        :  00:00:00  # Total time that CPUs were actually used for
CPU efficiency       :  0.18%     # Used CPU time / Used CPU walltime
% User (computation) : 50.65%
% System (I/O)       : 49.35%
Reserved memory      : 1000M/core
Max memory used      : 9.29M (estimate) # Maximum memory used
Memory efficiency    :  0.93%           # Max memory used / Reserved memory
Max disk write       : 256.00K
Max disk read        : 512.00K

How to read this output? Job ID, Job name, User, Partition, Nodes, Nodelist and CPUs are the same as before and fairly self-explanatory. The interesting fields are the annotated ones: CPU efficiency (used CPU time divided by used CPU walltime) and Memory efficiency (maximum memory used divided by reserved memory) tell you how much of what you reserved was actually used.

So in our case, with a CPU efficiency of 0.18% and a memory efficiency of 0.93%, our resource usage is quite poor…

Adjusting the resources you ask for

To adjust the resources, you can add a few extra options when running srun or sbatch. Useful options in this case are:

| option | function |
|--------|----------|
| --mem=xxM or --mem=xxG | reserve the specified amount of RAM memory in Mb or Gb |
| --cpus-per-task=x | reserve the specified number of processors (CPUs) |

You’ll find more options in the “cheat sheet” tab on the intranet

Let’s adjust the resources of our previous job

Since our previous job only used very few resources, there is no sense in reserving 2Gb; let’s reduce the request to 100Mb. We used only 0.18% of the reserved CPU, but 1 CPU is already the minimum, so we’ll keep it that way.

There are 2 methods to specify these parameters to sbatch:

  1. add these options to the command line at job submission
sbatch --mem=100M --cpus-per-task=1 slurm_script.sh
  2. OR add these options directly to your Slurm submission script
#! /bin/bash

#SBATCH --mem=100M
#SBATCH --cpus-per-task=1

echo "hello world"

then resubmit with sbatch slurm_script.sh.

Note that the option syntax is the same in both cases; in the script, each option is additionally prefixed with #SBATCH and written on its own line.

NB: It’s important to note that increasing the number of processors (CPUs or threads) won’t accelerate your job if the software you’re using doesn’t support parallelisation.
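
As an illustration only (blastn is just one example of a multi-threaded tool, and the file and database names below are made up), the number of threads usually has to be passed to the software itself, for example via the SLURM_CPUS_PER_TASK environment variable that Slurm sets when --cpus-per-task is specified:

#! /bin/bash

#SBATCH --mem=4G
#SBATCH --cpus-per-task=4

# Reserving 4 CPUs is not enough on its own: the tool must also be told to use them.
# blastn and the input/database names are purely illustrative placeholders.
blastn -num_threads "${SLURM_CPUS_PER_TASK}" -query input.fa -db mydb -out results.tsv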

Take home message

  1. it’s important to adjust the resources you ask for to the job you’re running
  2. jobinfo is a custom script that gives you information on what your job actually used
  3. you can adjust resources using options in the sbatch command, whether directly at execution or within the Slurm submission script (srun takes the same options as sbatch)
  4. any Slurm options added to the script should be preceded by #SBATCH and written at the top of the script
  5. Reserving 6 CPUs and 60Gb of memory doesn’t mean your software will actually use them! Check the software options first!

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Objective 3 - Cancelling a job

What if I changed my mind? How can I stop the job I just submitted??!

Getting familiar with scancel

In addition to sbatch/srun (submit a job) and squeue/sacct (follow a job / check a job’s resources), the third Slurm command you should know is scancel, which cancels a job. You can only cancel your own jobs; cancelling other people’s jobs won’t work.

Let’s see what jobs are running:

john.doe@slurmlogin:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            123456    common     bash john.doe  R       0:04      1 node24
            123457    common     test luke.doe  R       0:09      1 node25

Let’s now try to delete Luke’s job:

john.doe@slurmlogin:~$ scancel 123457
scancel: Unauthorized Request  123457

See? The job wasn’t cancelled, because you can only cancel your own jobs.

Delete one of your jobs

Let’s first add a line to your previous slurm_script.sh to make sure it runs long enough for us to cancel it:

#! /bin/bash

#SBATCH --mem=100M
#SBATCH --cpus-per-task=1

echo "hello world"
sleep 2m # wait for 2 minutes
  1. submit the script: sbatch slurm_script.sh, and note its job id
  2. check it’s running: squeue (you can get the job id here too)
  3. delete the job you just created: scancel 123456
  4. check again if it’s still running: squeue

Your job should have changed status and then disappeared after a little while, as expected.

Example outputs of the above steps:
  1. submit the script:
john.doe@slurmlogin:~$ sbatch slurm_script.sh
Submitted batch job 123456
john.doe@slurmlogin:~$

Job id is 123456 in this case.

  2. check it’s running: squeue
john.doe@slurmlogin:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            123456    common     bash john.doe  R       0:04      1 node24
  3. delete the job you just created:
john.doe@slurmlogin:~$ scancel 123456
  4. check again if it’s still running: squeue
john.doe@slurmlogin:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             123456   common     bash john.doe CG       0:09      1 node24

CG = Completing

After a while, it should be gone.
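
If you have several jobs to cancel at once, scancel also accepts a user filter (a standard Slurm option, to use with care):

scancel -u $USER    # cancels all of your queued and running jobs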

Take home message

  1. scancel to cancel a job using its unique job id
  2. you can only cancel your own jobs so no worries!

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Take home message

Commonly-used options for sbatch & srun

| option | function |
|--------|----------|
| --mem=<mem> | specify the amount of memory to reserve per node |
| --cpus-per-task=<cpus> | specify the number of CPUs to reserve per task (default number of tasks = 1) |
| --job-name="<jobname>" | specify a job name (no spaces or special characters please) |
| --time=[DD-]HH:MM:SS | specify the maximum running time (default: 2 hrs) |
| --partition=common | specify the partition (= group of nodes) to submit your job to |
| --output=/path/to/output.log | name of the file to save standard output to |
| --error=/path/to/error.log | if specified, standard error is written to this separate file |

More options in the “cheat sheet” tab on the intranet and in the official Slurm documentation
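
Putting the table together, here is a sketch of a submission script using several of these options (the job name, paths and resource values are placeholders to adapt to your own analysis):

#! /bin/bash

#SBATCH --job-name="my_analysis"        # placeholder job name
#SBATCH --partition=common
#SBATCH --mem=4G                        # memory per node
#SBATCH --cpus-per-task=2
#SBATCH --time=0-04:00:00               # 4 hours maximum
#SBATCH --output=/path/to/output.log    # standard output (placeholder path)
#SBATCH --error=/path/to/error.log      # standard error in a separate file (placeholder path)

echo "hello world"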

🔗 Back to exercise page{:target="_blank"}