Exercise 1 - Going further

Authors: BIOI2 & I2BC members

Last updated: 2025-03-13

Objectives

In this Exercise, we will be taking our first steps with the cluster. We will:

  1. transcribe an analysis workflow into a Slurm submission script and submit it (Objective 1)
  2. analyse and adjust the resources our jobs consume (Objective 2)
  3. cancel a job we no longer need (Objective 3)

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Setup

It’s the same as for Exercise 0{:target="_blank"}: you should be connected to the cluster and on the master node (i.e. slurmlogin should appear in your terminal prompt).

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Objective 1 - Transcribing your analysis workflow into a Slurm submission script

Using an interactive session (srun --pty bash) is useful when you’re setting up a new analysis pipeline and getting to know how to use the software within it. In the long run, it’s better to use a submission script instead so that you free the resources as soon as your software has finished running. Also, this means you can switch off the connection to the cluster without affecting your job (think of it like posting a letter - switching your computer off won’t affect the job that is running).

In this exercise, we will need to create & edit files. We suggest you use the in-line text editor nano but feel free to use whatever editor you prefer. If you’re working on the /store/ space, remember that you also have access to it outside the cluster (see this and this page on the intranet).
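
For example, to create and edit the script from the next section with nano (Ctrl+O then Enter to save, Ctrl+X to exit):

nano slurm_script.sh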

My first submission script

The Slurm submission script is just a script written in bash (the language of the cluster’s command line, i.e. cd and ls are bash commands). You can put whatever you would normally type in the terminal into a bash script, with one command per line.

Let’s create a script called slurm_script.sh that will print “hello world” on the screen:

#! /bin/bash

echo "hello world"

Submitting the script

Next, we will submit the script to the scheduler, which will look for a free node to run it on.

To submit your script, you can use the sbatch command:

john.doe@slurmlogin:/home/john.doe$ sbatch slurm_script.sh
Submitted batch job 123456
john.doe@slurmlogin:/home/john.doe$

Note: the number in the output (123456 here) is your job’s unique id; you’ll need it to follow or cancel the job later on.
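
If you want to capture just the numeric job id (e.g. to reuse it in a script), sbatch has a --parsable option (a standard Slurm option, not specific to this cluster); a minimal sketch:

jobid=$(sbatch --parsable slurm_script.sh)   # --parsable prints only the job id
echo "submitted job ${jobid}"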

Following your job

Depending on resource availability, your job might start running directly or have to queue. To see its status, you can use Slurm’s built-in squeue command.

NB: squeue only shows jobs that are still queuing or running, so you might not see your job if you’re not quick enough!

Example output:
john.doe@slurmlogin:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            123456    common slurm_sc john.doe  R       0:04      1 node24
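
If the queue is busy, you can restrict squeue to your own jobs with the standard -u option (generic Slurm, not specific to this cluster):

squeue -u $USER    # only list your own jobs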

In order to see past jobs, you can use the sacct command with a few options:

sacct -X
Example output & explanation:
john.doe@slurmlogin:~$ sacct -X
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
115896             bash        ngs                     1 CANCELLED+      0:0
115960       snakemake+        ngs                     8     FAILED      1:0
123456             bash     common                     1  COMPLETED      0:0

The “FAILED” state is sometimes a bit misleading: a job marked as “FAILED” may still have run properly. Check if the expected output was generated before panicking ;-)

sacct -X will list all your jobs that are or have been running from midnight onwards. A few useful options are:
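
For example, a few generally available sacct options (standard Slurm options; see man sacct for the full list):

sacct -X -S 2025-03-01    # all your jobs since a given start date (-S / --starttime)
sacct -X -j 123456        # a single job selected by its id (-j / --jobs)
sacct -X --format=JobID,JobName,Partition,State,Elapsed,AllocCPUS    # choose which columns to display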

Other custom commands you can use on the I2BC cluster

The SICS installed a set of scripts from slurm-tools that cover the same needs as above: jobqueue (similar to squeue), jobhist (similar to sacct -X) and jobinfo (detailed resource usage of a single job, see Objective 2). You don’t have to remember everything; it’s up to you to choose your favorite ones:

Example outputs:

john.doe@slurmlogin:~$ jobqueue
-----------------------------------------------------------------------------------------
Job ID             User     Job Name  Partition    State     Elapsed     Nodelist(Reason)
------------ ---------- ------------ ---------- -------- ----------- --------------------
123456         john.doe        bash     common  RUNNING        0:02               node24
john.doe@slurmlogin:~$ jobhist
----------------------------------------------------------------------------------------------------
Job ID         Startdate       User     Job Name  Partition      State     Elapsed Nodes CPUs Memory
------------- ---------- ---------- ------------ ---------- ---------- ----------- ----- ---- ------
115896        2025-03-05   jane.doe         bash        ngs CANCELLED+  2-21:36:56     1    1 1000Mc
115960        2025-03-08   jane.doe snakemake_p+        ngs     FAILED    00:00:02     1    8   64Gn
123456        2025-03-09   john.doe         bash     common    RUNNING    00:00:42     1    1 1000Mc

NB: Gn = Gb per node and Gc = Gb per CPU; the same suffixes apply to M (Mb), K (Kb) and T (Tb)

Checking that your job worked

The command echo "hello world" should normally print “hello world” on your screen… When running scripts remotely on the nodes, anything that is usually printed on the screen is saved in a file instead.

Have a look at your working directory: you should see an extra file named slurm-<jobid>.out. If you open it (with the cat command for example), you should see “hello world” inside.

john.doe@slurmlogin:/home/john.doe$ ls
slurm_script.sh  slurm-123456.out
john.doe@slurmlogin:/home/john.doe$ cat slurm-123456.out
hello world
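
While a job is still running, you can watch its output file grow with the standard tail command (not Slurm-specific):

tail -f slurm-123456.out    # press Ctrl+C to stop following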

Take home message

  1. on a routine basis, using a Slurm submission script is better than using an interactive session (srun --pty bash)
  2. a submission script is just a bash script & to submit it, you use the sbatch command
  3. with a script, anything that would normally be printed on the screen ends up in a file (slurm-<jobid>.out by default)
  4. squeue to list all jobs that are queuing/running, or sacct to also list past jobs

🔗 Back to exercise page{:target="_blank"}

Objective 2 - Analysing & adjusting resource consumption

Analysing resource consumption

The cluster is a shared resource, so it’s important to submit your jobs with a reasonable resource request. The default parameters are 2Gb of RAM, 1 CPU and a maximum running time of 2 hrs.

To know how much of the reserved resources your job actually used, you can combine several Slurm commands, but their usage and outputs are not always clear for beginners. We suggest you use the jobinfo <jobid> command from slurm-tools (already installed on the I2BC cluster).

Example output and explanation:

jobinfo 123456 will output:

Job ID               : 123456
Job name             : bash
User                 : john.doe
Account              :
Working directory    : /data/work/I2BC/john.doe/testrun
Cluster              : cluster
Partition            : common
Nodes                : 1
Nodelist             : node24
Tasks                : 1
CPUs                 : 1
GPUs                 : 0
State                : COMPLETED
Exit code            : 0:0
Submit time          : 2025-03-09T09:07:56
Start time           : 2025-03-09T09:07:56
End time             : 2025-03-09T09:08:38
Wait time            :  00:00:00
Reserved walltime    :  00:00:00
Used walltime        :  00:00:42  # Actual run time of job
Used CPU walltime    :  00:00:42  # Used walltime x number of CPUs
Used CPU time        :  00:00:00  # Total time that CPUs were actually used for
CPU efficiency       :  0.18%     # Used CPU time / Used CPU walltime
% User (computation) : 50.65%
% System (I/O)       : 49.35%
Reserved memory      : 1000M/core
Max memory used      : 9.29M (estimate) # Maximum memory used
Memory efficiency    :  0.93%           # Max memory used / Reserved memory
Max disk write       : 256.00K
Max disk read        : 512.00K

How to read this output? Job ID, Job name, User, Partition, Nodes, Nodelist and CPUs are the same as before and fairly self-explanatory. The interesting fields are the annotated ones: CPU efficiency (used CPU time divided by used CPU walltime) and Memory efficiency (maximum memory used divided by reserved memory) tell you how much of what you reserved was actually used.

So in our case, with a CPU efficiency of 0.18% and a memory efficiency of 0.93%, our resource usage is quite poor…

Adjusting the resources you ask for

To adjust the resources, you can add a few extra options when running srun or sbatch. Useful options in this case are:

| option | function |
|--------|----------|
| --mem=xxM or --mem=xxG | reserve the specified amount of RAM memory in Mb or Gb |
| --cpus-per-task=x | reserve the specified number of processors (CPUs) |

You’ll find more options in the “cheat sheet” tab on the intranet

Let’s adjust the resources of our previous job

Since our previous job only used very few resources, there is no sense in reserving 2Gb; let’s reduce the request to 100Mb. We used only 0.18% of the reserved CPU, but 1 CPU is already the minimum, so we’ll keep it that way.

There are 2 methods to specify these parameters to sbatch:

  1. add these options to the command line at job submission
sbatch --mem=100M --cpus-per-task=1 slurm_script.sh
  2. OR add these options directly to your Slurm submission script
#! /bin/bash

#SBATCH --mem=100M
#SBATCH --cpus-per-task=1

echo "hello world"

then resubmit with sbatch slurm_script.sh.

Note that the option syntax is the same in both cases; in the script, each option is additionally prefixed with #SBATCH and written on its own line.

NB: It’s important to note that increasing the number of processors (CPUs or threads) won’t accelerate your job if the software you’re using doesn’t support parallelisation.
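
As an illustration only (blastn is just one example of a multi-threaded tool, and the file and database names below are made up), the number of threads usually has to be passed to the software itself, for example via the SLURM_CPUS_PER_TASK environment variable that Slurm sets when --cpus-per-task is specified:

#! /bin/bash

#SBATCH --mem=4G
#SBATCH --cpus-per-task=4

# Reserving 4 CPUs is not enough on its own: the tool must also be told to use them.
# blastn and the input/database names are purely illustrative placeholders.
blastn -num_threads "${SLURM_CPUS_PER_TASK}" -query input.fa -db mydb -out results.tsv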

Take home message

  1. it’s important to adjust the resources you ask for to the job you’re running
  2. jobinfo is a custom script that gives you information on what your job actually used
  3. you can adjust resources using options in the sbatch command, whether directly at execution or within the Slurm submission script (srun takes the same options as sbatch)
  4. any Slurm options added to the script should be preceded by #SBATCH and written at the top of the script
  5. Reserving 6 CPUs and 60Gb of memory doesn’t mean your software will actually use them! Check the software options first!

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Objective 3 - Cancelling a job

What if I changed my mind? How can I stop the job I just submitted??!

Getting familiar with scancel

In addition to sbatch/srun (submit a job) and squeue/sacct (follow a job / check a job’s resources), the third Slurm command you should know is scancel, which cancels a job. You can only cancel your own jobs; cancelling other people’s jobs won’t work.

Let’s see what jobs are running:

john.doe@slurmlogin:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            123456    common     bash john.doe  R       0:04      1 node24
            123457    common     test luke.doe  R       0:09      1 node25

Let’s now try to delete Luke’s job:

john.doe@slurmlogin:~$ scancel 123457
scancel: Unauthorized Request  123457

See? The job wasn’t cancelled, because you can only cancel your own jobs.

Delete one of your jobs

Let’s first add a line to your previous slurm_script.sh to make sure it runs long enough for us to cancel it:

#! /bin/bash

#SBATCH --mem=100M
#SBATCH --cpus-per-task=1

echo "hello world"
sleep 2m # wait for 2 minutes
  1. submit the script: sbatch slurm_script.sh, and note its job id
  2. check it’s running: squeue (you can get the job id here too)
  3. delete the job you just created: scancel 123456
  4. check again if it’s still running: squeue

Your job should have changed status and then disappeared after a little while, as expected.

Example outputs of the above steps:
  1. submit the script:
john.doe@slurmlogin:~$ sbatch slurm_script.sh
Submitted batch job 123456
john.doe@slurmlogin:~$

Job id is 123456 in this case.

  2. check it’s running: squeue
john.doe@slurmlogin:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            123456    common     bash john.doe  R       0:04      1 node24
  3. delete the job you just created:
john.doe@slurmlogin:~$ scancel 123456
  4. check again if it’s still running: squeue
john.doe@slurmlogin:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             123456   common     bash john.doe CG       0:09      1 node24

CG = Completing

After a while, it should be gone.
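
If you have several jobs to cancel at once, scancel also accepts a user filter (a standard Slurm option, to use with care):

scancel -u $USER    # cancels all of your queued and running jobs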

Take home message

  1. scancel to cancel a job using its unique job id
  2. you can only cancel your own jobs so no worries!

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Take home message

Commonly-used options for sbatch & srun

| option | function |
|--------|----------|
| --mem=<mem> | specify the amount of memory to reserve per node |
| --cpus-per-task=<cpus> | specify the number of CPUs to reserve per task (default number of tasks = 1) |
| --job-name="<jobname>" | specify a job name (no spaces or special characters please) |
| --time=[DD-]HH:MM:SS | specify the maximum running time (default: 2 hrs) |
| --partition=common | specify the partition (= group of nodes) to submit your job to |
| --output=/path/to/output.log | name of the file to save standard output to |
| --error=/path/to/error.log | if specified, standard error is written to this separate file |

More options in the “cheat sheet” tab on the intranet and in the official Slurm documentation
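
Putting the table together, here is a sketch of a submission script using several of these options (the job name, paths and resource values are placeholders to adapt to your own analysis):

#! /bin/bash

#SBATCH --job-name="my_analysis"        # placeholder job name
#SBATCH --partition=common
#SBATCH --mem=4G                        # memory per node
#SBATCH --cpus-per-task=2
#SBATCH --time=0-04:00:00               # 4 hours maximum
#SBATCH --output=/path/to/output.log    # standard output (placeholder path)
#SBATCH --error=/path/to/error.log      # standard error in a separate file (placeholder path)

echo "hello world"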

🔗 Back to exercise page{:target="_blank"}