Exercise 2, case study A - QC analysis with FastQC

Authors: BIOI2 & I2BC members

Last updated: 2025-03-13

Foreword

This exercise is divided into very detailed steps. If you’re already quite comfortable with scripting and schedulers, go directly to “Exercise 2, Case study C” (it covers the same steps, but also introduces for loops and job arrays) and follow up with “Exercise 3”, which introduces micromamba for installing programmes on the cluster.

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Context

We just sequenced the RNA of a sample. The sequencing platform gave us the raw output file in fastq format. We would like a first overview of the quality of the sequencing. In the following example, we will run the FastQC programme on this sequencing output. FastQC analyses fastq files and outputs a quality control report in html format (more information on FastQC). It’s a small programme that doesn’t require a lot of resources, and it’s already installed on the I2BC cluster.

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Objectives

In this case study, you will walk through the complete step-by-step process of using a given piece of software on the I2BC cluster: looking for it among the installed software, understanding how to use it, running it within a Slurm job and optimising the resources to reserve.

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Setup

Connect to cluster

It’s the same as for Exercise 0: you should be connected to the cluster and on the master node (i.e. slurmlogin should be written in your terminal prefix).

Fetch example files

You will need the example files available on Zenodo under this link or, for those familiar with git, on the Forge Logicielle under this link.

We’ll work in your home directory. Let’s move to it and fetch our working files using wget:

john.doe@slurmlogin:~$ cd /home/john.doe
john.doe@slurmlogin:/home/john.doe$ wget "https://zenodo.org/records/15017630/files/cluster_usage_examples.tar.gz?download=1" -O cluster_usage_examples.tar.gz
john.doe@slurmlogin:/home/john.doe$ tar -zxf cluster_usage_examples.tar.gz
john.doe@slurmlogin:/home/john.doe$ ls cluster_usage_examples/
example_fastqc  example_mafft  example_tmalign

In the example_fastqc folder, you’ll see a sequencing output in fastq format (the .gz extension suggests that it’s also compressed) on which we’ll run the FastQC programme.

john.doe@slurmlogin:/home/john.doe$ ls cluster_usage_examples/example_fastqc/
head1000_SRR9732589_1.fastq.gz
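
If you’re curious, a fastq file stores each read as 4 lines (identifier, sequence, separator, quality scores). You can peek at the first read without decompressing the file on disk (zcat and head are standard tools):

john.doe@slurmlogin:/home/john.doe$ zcat cluster_usage_examples/example_fastqc/head1000_SRR9732589_1.fastq.gz | head -4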

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Step-by-step

Task 1: Locate the FastQC software

The FastQC programme executable is called fastqc. Try to find it using the module command.

Click to see hints

The module command can be used from anywhere on the cluster. The main sub-commands are:

  - module avail: list the software modules available on the cluster
  - module load: load a module, i.e. make its software available in your session
  - module unload: unload a previously loaded module
  - module list: list the modules currently loaded in your session

To get more details on options for these sub-commands (e.g. to search for a specific name), you can use the -h option to get the help page.
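
For example, to display the help page for the avail sub-command:

john.doe@slurmlogin:/home/john.doe$ module avail -h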

Click to see answer
john.doe@slurmlogin:/home/john.doe$ module avail -C fastqc -i
------------------------------------- /usr/share/modules/modulefiles --------------------------------------
fastqc/fastqc_v0.10.1 fastqc/fastqc_v0.11.5 singularity/fastqc

In the above command, we used the -C option to specify a pattern to search for (“fastqc”) and -i to make the search case-insensitive.

According to the output, all we have to do is use: module load fastqc/fastqc_v0.11.5 in order to load FastQC version 0.11.5 (we chose the most recent version here).
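
As a quick sanity check after loading, module list prints what is currently loaded (a sketch; note that, as Task 2 shows, modules have to be loaded on a compute node, not on slurmlogin):

john.doe@node01:/home/john.doe$ module load fastqc/fastqc_v0.11.5
john.doe@node01:/home/john.doe$ module list
Currently Loaded Modulefiles:
 1) fastqc/fastqc_v0.11.5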

Task 2: Determine how to use the fastqc executable

Let’s investigate how to use the fastqc executable: How do we specify the inputs? What options or parameters can we use? What would the final command line look like to run fastqc on your input?

Hints:

  - software listed by module is only available on the compute nodes, not on the master node (connect to one with srun)
  - most programmes print a usage message via man or a --help option

Click to see answer
  1. Software listed by module isn’t available on the master node (slurmlogin), so let’s first connect to a node and then load fastqc with module:
john.doe@slurmlogin:/home/john.doe$ srun --pty bash
john.doe@node01:/home/john.doe$ module load fastqc/fastqc_v0.11.5
  2. Most programmes come with help or usage messages that you can print on the screen using “man your_programme” or “your_programme --help”. Let’s see if we can access the help menu for fastqc:
john.doe@node01:/home/john.doe$ fastqc --help

            FastQC - A high throughput sequence QC analysis tool

SYNOPSIS

                    fastqc seqfile1 seqfile2 .. seqfileN

            fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
                  [-c contaminant file] seqfile1 .. seqfileN

                                   [...]

Have a look at the SYNOPSIS section: it tells us that the basic usage of fastqc in our case would look like this: “fastqc cluster_usage_examples/example_fastqc/head1000_SRR9732589_1.fastq.gz”

  3. Also, according to the help message, if we would like to specify the output folder, we can use the -o option, but the folder you specify has to exist first, for example:
# create the output folder
john.doe@node01:/home/john.doe$ mkdir -p cluster_usage_examples/example_fastqc/results

# run fastqc
john.doe@node01:/home/john.doe$ fastqc -o cluster_usage_examples/example_fastqc/results cluster_usage_examples/example_fastqc/head1000_SRR9732589_1.fastq.gz
  4. At this point, we know everything we need about fastqc and how to run it. We no longer need to be connected to a node and can free the resources we’ve been holding by disconnecting from it:
john.doe@node01:/home/john.doe$ exit 0
john.doe@slurmlogin:/home/john.doe$

Of note: as you can see, the terminal prompt changed again from node01 back to slurmlogin: we’ve returned to the master node (slurmlogin) of the cluster and the job we were running has terminated.

Task 3: Write your submission script

Let’s move to your example_fastqc subdirectory first and write the slurm_script.sh in there.

john.doe@slurmlogin:/home/john.doe$ cd cluster_usage_examples/example_fastqc/
Click to see hints

  - you can create and edit the file with the nano text editor: nano slurm_script.sh
  - see the Slurm cheat sheet for a list of all #SBATCH options

Click to see answer

The Slurm submission script is written like a common bash script (same language as the terminal). In this script, you write all the commands (one per line) that you would usually type in your terminal. The only particularity is the Slurm submission options, which you can add directly to this script, commonly at the beginning (each option on its own line, preceded by #SBATCH):


#! /bin/bash

#SBATCH --job-name="my_jobname"
#SBATCH --partition=common
#SBATCH --cpus-per-task=1

### prefix start - create temporary directory for your job
export TMPDIR=$(mktemp -d)
### prefix end


module load fastqc/fastqc_v0.11.5

# This is a comment line - it will be ignored when the script is executed

cd /home/john.doe/cluster_usage_examples/example_fastqc/       #|
                                                               #| These are your shell commands
fastqc head1000_SRR9732589_1.fastq.gz                          #|


### suffix start - delete created temporary directory
rm -rf $TMPDIR
### suffix end

Explanation of the content:

  - #! /bin/bash: the script should be interpreted with bash
  - the #SBATCH lines: the Slurm options, here the job’s name, the partition to run it on and the number of CPUs to reserve
  - the prefix/suffix blocks: create a temporary directory for the job and delete it at the end
  - the remaining lines: the shell commands to execute, i.e. loading the fastqc module, moving into the working directory and running fastqc

When you exit the nano text editor, you should see the file created in your current directory:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ ls
head1000_SRR9732589_1.fastq.gz   slurm_script.sh
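
Optionally, before submitting for real, you can sanity-check your script with sbatch’s --test-only option, which validates the script and estimates when the job would start without actually submitting it:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ sbatch --test-only slurm_script.sh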

Task 4: Submit your script to the cluster

Click to see answer

To submit a Slurm submission script, all you have to do is:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ sbatch slurm_script.sh
Submitted batch job 287170

This will print your assigned job id on the screen (287170 in this case).

You can follow the progression of your job with squeue:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ squeue -j 287170
             JOBID PARTITION       NAME     USER ST       TIME  NODES NODELIST(REASON)
            287170    common my_jobname john.doe  R   00:00:41      1 node06

(you can omit the -j option to show all the currently running jobs).

You can learn more about the options for squeue in the manual (type man squeue, navigate with the up/down arrow keys and exit by typing q).
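
For example, to list only your own jobs, you can filter by user with the standard -u option:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ squeue -u john.doe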

Task 5: Check if your job finished correctly

Hints:

  - list the contents of your folder with ls and compare with the files you expect fastqc and Slurm to produce

Click to see answer

What files do we expect to see? There should be 3 new files in total:

  - head1000_SRR9732589_1_fastqc.html: the FastQC quality report
  - head1000_SRR9732589_1_fastqc.zip: an archive containing the report and the underlying data
  - slurm-287170.out: the Slurm log file of your job

Your job shouldn’t take too long to finish, then you should be able to see the output files in your folder:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ ls
head1000_SRR9732589_1.fastq.gz   head1000_SRR9732589_1_fastqc.zip   head1000_SRR9732589_1_fastqc.html
slurm-287170.out                 slurm_script.sh
Click to see troubleshooting tips

Having issues? If you don’t have the output files, then there might be a problem in the execution somewhere. In that case, you can have a look at the log file from Slurm:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ cat slurm-287170.out

Note that the log file is generated by default in the directory in which you ran the sbatch command. There is an option in sbatch with which you can change this behaviour.
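
For example, adding a line like the following to your script would redirect the log (a sketch: --output is a standard sbatch option, %x and %j are Slurm filename patterns that expand to the job name and job id, and the target folder has to exist beforehand):

#SBATCH --output=logs/%x-%j.out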

Typical error messages are for example:

  - fastqc: command not found – the executable isn’t available in the job’s environment (e.g. you forgot the module load line)
  - No such file or directory – a path in your script is wrong or points to something that doesn’t exist

Task 6: Analyse your resource consumption to optimise your next run

Analyse your actual resource consumption: How much memory did you effectively use while running the job? How long did your job take to finish? How much CPU percentage did you use?

This is useful to know in order to adapt the resources you request in future jobs of a similar kind (e.g. other FastQC submissions).

Hints:

  - you can use the jobinfo command, giving it your job id as argument

Click to see answer
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ jobinfo 287170
Job ID               : 287170
Job name             : my_jobname
User                 : john.doe
Account              :
Working directory    : /home/john.doe/cluster_usage_examples/example_fastqc
Cluster              : cluster
Partition            : common
Nodes                : 1
Nodelist             : node06
Tasks                : 1
CPUs                 : 1
GPUs                 : 0
State                : COMPLETED
Exit code            : 0:0
Submit time          : 2025-03-11T11:03:43
Start time           : 2025-03-11T11:03:44
End time             : 2025-03-11T11:03:51
Wait time            :     00:00:01
Reserved walltime    :     02:00:00
Used walltime        :     00:00:07
Used CPU walltime    :     00:00:07
Used CPU time        :     00:00:04
CPU efficiency       : 66.00%
% User (computation) : 94.89%
% System (I/O)       :  5.11%
Reserved memory      : 1000M/core
Max memory used      : 2.85M (estimate)
Memory efficiency    :  0.29%
Max disk write       : 0.00
Max disk read        : 0.00

The lines you should look at:

  - Reserved walltime vs Used walltime: we reserved 2 hours but the job only ran for 7 seconds
  - Reserved memory vs Max memory used: we reserved 1000M per core but used less than 3M
  - CPU efficiency and Memory efficiency: how much of the reserved resources were actually used

Adjustments to make:

  - reduce the reserved walltime with --time (e.g. 10 minutes instead of the 2 hours reserved by default)
  - reduce the reserved memory with --mem (e.g. 100M instead of the 1000M per core reserved by default)

NB: it’s important to keep in mind that run times and resource usage also depend on the commands you run and on the size of your input.
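
If jobinfo is not available on your cluster, the standard Slurm accounting tool sacct reports similar figures (a sketch; JobID, JobName, Elapsed, MaxRSS and State are standard sacct format fields):

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ sacct -j 287170 --format=JobID,JobName,Elapsed,MaxRSS,State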

Task 7: Adjust the submission script with the previously-identified & required resources

Now try optimising your script to reserve no more than the resources you actually need to run fastqc. Your colleagues will be thankful ;-)

Hint: see Slurm cheat sheet for a list of all options for sbatch

Click to see answer

Your script could look like this:


#! /bin/bash

#SBATCH --job-name="my_jobname"
#SBATCH --partition=common
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M
#SBATCH --time=00:10:00

### prefix start - create temporary directory for your job
export TMPDIR=$(mktemp -d)
### prefix end


module load fastqc/fastqc_v0.11.5

# This is a comment line - it will be ignored when the script is executed

cd /home/john.doe/cluster_usage_examples/example_fastqc/       #|
                                                               #| These are your shell commands
fastqc head1000_SRR9732589_1.fastq.gz                          #|


### suffix start - delete created temporary directory
rm -rf $TMPDIR
### suffix end

NB: only the #SBATCH lines were changed…

Take home message

When discovering a new tool and wanting to use it on the cluster:

  1. search for your software with module
  2. write the commands that you want to run within a bash script, don’t forget module load at the top
  3. Slurm options can be specified within this script using #SBATCH-prefixed lines (see Slurm cheat sheet for a list of all options)
  4. submit the script with sbatch
  5. follow your job with squeue
  6. check if your job worked by searching for expected output files & looking at the slurm log
  7. analyse the resources used with jobinfo to optimise future runs
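
In command form, the whole workflow of this exercise boils down to (replace <jobid> with the id printed by sbatch):

# 1. search for the software
module avail -C fastqc -i
# 2-3. write slurm_script.sh (shell commands + #SBATCH lines), then
# 4. submit it
sbatch slurm_script.sh
# 5. follow the job
squeue -j <jobid>
# 6. check the expected output files and the Slurm log
ls
cat slurm-<jobid>.out
# 7. analyse the resources used
jobinfo <jobid>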

🔗 Back to exercise page