Last updated: 2025-03-13
This exercise is divided into very detailed steps. If you're already quite comfortable with scripting and schedulers, go directly to "Exercise 2, Case study C" (it's the same steps, but it will also introduce you to `for` loops and job arrays) and follow up with "Exercise 3", which will introduce you to using micromamba to install programmes on the cluster.
⁕ ⁕ ⁕ ⁕ ⁕ ⁕
We just sequenced the RNA of a sample. The sequencing platform gave us the raw output file in fastq format. We would like to have a first overview of the quality of the sequencing performed. In the following example, we will run the FastQC programme on this sequencing output. The FastQC programme analyses fastq files and outputs a quality control report in html format (more information on FastQC). It's a small programme that doesn't require a lot of resources and it's already installed on the I2BC cluster.
⁕ ⁕ ⁕ ⁕ ⁕ ⁕
In this case study, you will see, step by step, how to use a given software on the I2BC cluster: finding it among the installed software, understanding how to use it, running it within a Slurm job and optimising the resources to reserve.
⁕ ⁕ ⁕ ⁕ ⁕ ⁕
It's the same as for Exercise 0: you should be connected to the cluster and on the master node (i.e. `slurmlogin` should be written in your terminal prompt).
You will need the example files, available on Zenodo under this link, or on the Forge Logicielle under this link for those who are familiar with git.
We'll work in your home directory. Let's move to it and fetch our working files using `wget`:
john.doe@slurmlogin:~$ cd /home/john.doe
john.doe@slurmlogin:/home/john.doe$ wget "https://zenodo.org/records/15017630/files/cluster_usage_examples.tar.gz?download=1" -O cluster_usage_examples.tar.gz
john.doe@slurmlogin:/home/john.doe$ tar -zxf cluster_usage_examples.tar.gz
john.doe@slurmlogin:/home/john.doe$ ls cluster_usage_examples/
example_fastqc example_mafft example_tmalign
In the `example_fastqc` folder, you'll see a sequencing output in fastq format (the .gz extension suggests that it's also compressed) on which we'll run the FastQC programme.
john.doe@slurmlogin:/home/john.doe$ ls cluster_usage_examples/example_fastqc/
head1000_SRR9732589_1.fastq.gz
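If you're curious about the file's content, you can stream the first read to the screen without decompressing the file to disk (a quick sketch; each fastq record spans four lines: identifier, sequence, separator and quality scores):
john.doe@slurmlogin:/home/john.doe$ zcat cluster_usage_examples/example_fastqc/head1000_SRR9732589_1.fastq.gz | head -n 4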
⁕ ⁕ ⁕ ⁕ ⁕ ⁕
The FastQC programme executable is called `fastqc`. Try to find it using the `module` command.
The `module` command can be used from anywhere on the cluster. The main sub-commands are:
- `module avail`: to list all available software
- `module load/unload <software name>`: to load or unload specific software (for use)
- `module list`: to list currently loaded software
To get more details on options for these subcommands (e.g. to search for a specific name), you can use the `-h` option to get the help page.
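As an illustration of the load/list/unload cycle once you're on a node (a sketch; the exact `module list` output may vary with the Modules version installed):
john.doe@node01:/home/john.doe$ module load fastqc/fastqc_v0.11.5
john.doe@node01:/home/john.doe$ module list
Currently Loaded Modulefiles:
 1) fastqc/fastqc_v0.11.5
john.doe@node01:/home/john.doe$ module unload fastqc/fastqc_v0.11.5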
john.doe@slurmlogin:/home/john.doe$ module avail -C fastqc -i
------------------------------------- /usr/share/modules/modulefiles --------------------------------------
fastqc/fastqc_v0.10.1 fastqc/fastqc_v0.11.5 singularity/fastqc
In the above command, we used the `-C` option to specify a pattern to search for ("fastqc") and `-i` to make the search case-insensitive.
According to the output, all we have to do is use `module load fastqc/fastqc_v0.11.5` in order to load FastQC version 0.11.5 (we chose the most recent version here).
The `fastqc` executable
Let's investigate how to use the `fastqc` executable: how do we specify the inputs? What options or parameters can we use? What would the final command line look like to run `fastqc` on your input?
Hints:
- for programmes like `fastqc`, you can often get a help message with usage examples by using the `--help` or `-h` options
- remember to connect to a node first (`srun --pty bash`)
Programmes made available through `module` aren't accessible on the master node (slurmlogin), so let's first connect to a node and then load `fastqc` with `module`:
john.doe@slurmlogin:/home/john.doe$ srun --pty bash
john.doe@node01:/home/john.doe$ module load fastqc/fastqc_v0.11.5
The documentation of a programme is usually accessible with "`man your_programme`" or "`your_programme --help`". Let's see if we can access the help menu for `fastqc`:
john.doe@node01:/home/john.doe$ fastqc --help
FastQC - A high throughput sequence QC analysis tool
SYNOPSIS
fastqc seqfile1 seqfile2 .. seqfileN
fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
[-c contaminant file] seqfile1 .. seqfileN
[...]
Have a look at the SYNOPSIS section of the help message: it says that the basic usage of `fastqc` in our case would look like this:
fastqc cluster_usage_examples/example_fastqc/head1000_SRR9732589_1.fastq.gz
You can redirect the outputs to a specific folder with the `-o` option, but the folder you specify has to exist first, for example:
# create the output folder
john.doe@node01:/home/john.doe$ mkdir -p cluster_usage_examples/example_fastqc/results
# run fastqc
john.doe@node01:/home/john.doe$ fastqc -o cluster_usage_examples/example_fastqc/results cluster_usage_examples/example_fastqc/head1000_SRR9732589_1.fastq.gz
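You can then check that the report files appeared in the output folder (the file names below are what FastQC produces for this input):
john.doe@node01:/home/john.doe$ ls cluster_usage_examples/example_fastqc/results/
head1000_SRR9732589_1_fastqc.html head1000_SRR9732589_1_fastqc.zip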
That's it, we now know how to use `fastqc`. We no longer need to be connected to a node and can now free up the resources that we've blocked by disconnecting from it:
john.doe@node01:/home/john.doe$ exit 0
john.doe@slurmlogin:/home/john.doe$
Of note: as you can see, the terminal prompt changed again from node01 back to slurmlogin: we’ve returned to the master node (slurmlogin) of the cluster and the job we were running has terminated.
Let's move to your `example_fastqc` subdirectory first and write the `slurm_script.sh` file in there.
john.doe@slurmlogin:/home/john.doe$ cd cluster_usage_examples/example_fastqc/
We recommend using `nano` (but there are other possibilities such as `vi`, `vim` or `emacs` for example). To use nano:
- `nano slurm_script.sh` will create and open the `slurm_script.sh` file
- `^` = Ctrl (`Ctrl+G` to see the help message, `Ctrl+X` to exit); see the nano cheat sheet
- don't forget the `#SBATCH`-prefixed lines at the head to specify slurm options for submission, see the Slurm cheat sheet
The Slurm submission script is written like a common bash script (same language as the terminal). You write in this script all the commands (1 per line) that you would usually type in your terminal. The only particularity is the Slurm submission options, which you can add directly to the script, commonly at the beginning (each option should be preceded by `#SBATCH`):
#! /bin/bash
#SBATCH --job-name="my_jobname"
#SBATCH --partition=common
#SBATCH --cpus-per-task=1
### prefix start - create temporary directory for your job
export TMPDIR=$(mktemp -d)
### prefix end
module load fastqc/fastqc_v0.11.5
# This is a comment line - it will be ignored when the script is executed
cd /home/john.doe/cluster_usage_examples/example_fastqc/ #|
#| These are your shell commands
fastqc head1000_SRR9732589_1.fastq.gz #|
### suffix start - delete created temporary directory
rm -rf $TMPDIR
### suffix end
Explanation of the content:
- `#! /bin/bash`: this is the "shebang", it specifies the "language" of your script (in this case, the cluster understands that the syntax of this text file is "bash" and will execute it with the /bin/bash executable)
- `#SBATCH`: all lines starting with `#SBATCH` indicate to the Slurm job scheduler on the cluster that they contain information related to the job submission. This is where you specify the slurm options such as your job name with `--job-name` or the partition you want to submit your job to with `-p`. There are many more options you can specify, see the "cheat sheet" tab on the intranet
- `module load` will load the software you need (i.e. FastQC in this case)
- `cd /path/to/your/folder`: although Slurm places you in the same directory in which you've submitted the slurm script, it's a good habit to deliberately move to your working directory within the submission script to avoid any nasty surprises… By moving to the directory which contains your input, you won't need to specify the full path to the input, as you can see in the line of code that follows this statement
- prefix and suffix: these lines create and delete a temporary directory for your job, whose path is stored in `$TMPDIR`. This folder is sometimes overridden and dealt with automatically by the scheduler (automatic clean-up) but it's a good habit to do this yourself if you can (to avoid saturating the temporary disk when it's not dealt with automatically). Here, we use the `mktemp` command to create a temporary folder with a random name at the beginning of the script and we use `rm` to delete it at the end of the script
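Of note: if the programme you run lets you choose where its temporary files go, you can point it at this folder. `fastqc`, for instance, has a `-d`/`--dir` option for its temporary files (a sketch; check `fastqc --help` on your version to confirm the option):
fastqc -d $TMPDIR head1000_SRR9732589_1.fastq.gz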
When you exit the nano text editor, you should see the file created in your current directory:
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ ls
head1000_SRR9732589_1.fastq.gz slurm_script.sh
To submit a Slurm submission script, all you have to do is:
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ sbatch slurm_script.sh
Submitted batch job 287170
This will print your attributed job id on the screen (287170 in this case).
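If you want to capture the job id for later use (to feed it to `squeue`, introduced just below, or to `jobinfo`, used later on), `sbatch` has a `--parsable` option that prints only the id; a small sketch:
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ JOBID=$(sbatch --parsable slurm_script.sh)
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ squeue -j $JOBID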
You can follow the progression of your job with `squeue`:
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ squeue -j 287170
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
287170 common my_jobname john.doe R 00:00:41 1 node06
(You can omit the `-j` option to show all the currently running jobs.)
You can learn more about the options for `squeue` in the manual (type `man squeue`, navigate with the up/down arrow keys and exit by typing q).
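For example, to list only your own jobs rather than the whole queue, you can filter by user with the `-u` option (a quick sketch):
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ squeue -u john.doe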
Hints:
- check the `slurm-xxx.out` file (replace xxx with your job id) to see if there were any problems
What files do we expect to see? There should be 3 new files in total. Your job shouldn't take too long to finish; once it's done, you should be able to see the output files in your folder:
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ ls
head1000_SRR9732589_1.fastq.gz head1000_SRR9732589_1_fastqc.zip head1000_SRR9732589_1_fastqc.html
slurm-287170.out slurm_script.sh
The 3 new files are:
- `head1000_SRR9732589_1_fastqc.zip` and `head1000_SRR9732589_1_fastqc.html`: the FastQC output files (the quality control report and its zipped archive)
- `slurm-287170.out`: the log file generated by Slurm
Having issues? If you don’t have the output files, then there might be a problem in the execution somewhere. In that case, you can have a look at the log file from Slurm:
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ cat slurm-287170.out
Note that the log file is generated by default in the directory in which you ran the `sbatch` command. There is an option in `sbatch` with which you can change this behaviour.
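For example, the `--output` option controls the log file's name and location: `%j` expands to the job id and `%x` to the job name. Note that the target directory has to exist beforehand, Slurm won't create it for you. A hypothetical line to add to your script:
#SBATCH --output=/home/john.doe/logs/%x_%j.out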
Typical error messages are for example:
- `-bash: fastqc: command not found`: you probably forgot to load `fastqc` with `module` first
- `Specified output directory 'nonexistantdir' does not exist`: `fastqc` cannot find the output directory that was specified. It might be because you have to create it first or because it couldn't find it in the working directory (make sure to specify the full path to that folder and check that you don't have any typos)
- `Skipping 'nonexistant.fastq' which didn't exist, or couldn't be read`: `fastqc` cannot find the input that you gave it. It could be linked to a typo in the name or could be because `fastqc` didn't find it in your current working directory
Analyse your actual resource consumption: how much memory did you effectively use while running the job? How long did your job take to finish? How much CPU percentage did you use?
This is useful to know in order to adapt the resources you ask for in future jobs with similar processing (e.g. other FastQC submissions).
Hints:
- use the `jobinfo` command from the slurm-tools toolkit (/!\ not a native command of Slurm) for this:
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ jobinfo 287170
Job ID : 287170
Job name : my_jobname
User : john.doe
Account :
Working directory : /home/john.doe/cluster_usage_examples/example_fastqc
Cluster : cluster
Partition : common
Nodes : 1
Nodelist : node06
Tasks : 1
CPUs : 1
GPUs : 0
State : COMPLETED
Exit code : 0:0
Submit time : 2025-03-11T11:03:43
Start time : 2025-03-11T11:03:44
End time : 2025-03-11T11:03:51
Wait time : 00:00:01
Reserved walltime : 02:00:00
Used walltime : 00:00:07
Used CPU walltime : 00:00:07
Used CPU time : 00:00:04
CPU efficiency : 66.00%
% User (computation) : 94.89%
% System (I/O) : 5.11%
Reserved memory : 1000M/core
Max memory used : 2.85M (estimate)
Memory efficiency : 0.29%
Max disk write : 0.00
Max disk read : 0.00
The lines you should look at:
- `Used walltime` (00:00:07) versus `Reserved walltime` (02:00:00): the job finished in seconds, far below the 2 hours reserved
- `Max memory used` (2.85M) and `Memory efficiency` (0.29%) versus `Reserved memory` (1000M/core): only a tiny fraction of the reserved memory was used
- `CPU efficiency` (66.00%): the share of the reserved CPU time actually spent computing
Adjustments to make: reserve less walltime and less memory for similar jobs.
/!\ It's important to keep in mind that run times and resource usage also depend on the commands you run and on the size of your input.
Now try optimising your script to reserve no more than the resources you actually need to run `fastqc`. Your colleagues will be thankful ;-)
Hint: see the Slurm cheat sheet for a list of all options for `sbatch`.
Your script could look like this:
#! /bin/bash
#SBATCH --job-name="my_jobname"
#SBATCH --partition=common
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M
#SBATCH --time=00:10:00
### prefix start - create temporary directory for your job
export TMPDIR=$(mktemp -d)
### prefix end
module load fastqc/fastqc_v0.11.5
# This is a comment line - it will be ignored when the script is executed
cd /home/john.doe/cluster_usage_examples/example_fastqc/ #|
#| These are your shell commands
fastqc head1000_SRR9732589_1.fastq.gz #|
### suffix start - delete created temporary directory
rm -rf $TMPDIR
### suffix end
NB: only the `#SBATCH` lines were changed…
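To check that the job still completes within these tighter limits, you can resubmit and inspect it again with `jobinfo` (the job id below is illustrative):
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ sbatch slurm_script.sh
Submitted batch job 287171
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ jobinfo 287171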
Take home message
When discovering a new tool and wanting to use it on the cluster:
- look for it with the `module` command
- write a Slurm submission script with `module load` at the top and `#SBATCH`-prefixed lines (see the Slurm cheat sheet for a list of all options)
- submit the script with `sbatch`
- follow its progression with `squeue`
- analyse its resource usage with `jobinfo`
🔗 Back to exercise page