Exercise 2, case study A - QC analysis with FastQC

Authors: BIOI2 & I2BC members

Last updated: 2025-03-13

Foreword

This exercise is divided into very detailed steps. If you’re already quite comfortable with scripting and schedulers, go directly to “Exercise 2, Case study C” (it covers the same steps, but also introduces for loops and job arrays) and follow up with “Exercise 3”, which introduces micromamba for installing programmes on the cluster.

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Context

We just sequenced the RNA of a sample. The sequencing platform gave us the raw output file in fastq format. We would like a first overview of the quality of the sequencing. In the following example, we will run the FastQC programme on this sequencing output. FastQC analyses fastq files and outputs a quality control report in html format (more information on FastQC). It’s a small programme that doesn’t require a lot of resources, and it’s already installed on the I2BC cluster.

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Objectives

In this case study, you will walk through the complete step-by-step process of using a given piece of software on the I2BC cluster: looking for it among the installed software, understanding how to use it, running it within a Slurm job and optimising the resources to reserve.

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Setup

Connect to cluster

It’s the same as for Exercise 0: you should be connected to the cluster and on the master node (i.e. slurmlogin should be written in your terminal prefix).

Fetch example files

You will need the example files available on Zenodo under this link or, for those familiar with git, on the Forge Logicielle under this link.

We’ll work in your home directory. Let’s move to it and fetch our working files using wget:

john.doe@slurmlogin:~$ cd /home/john.doe
john.doe@slurmlogin:/home/john.doe$ wget "https://zenodo.org/records/15017630/files/cluster_usage_examples.tar.gz?download=1" -O cluster_usage_examples.tar.gz
john.doe@slurmlogin:/home/john.doe$ tar -zxf cluster_usage_examples.tar.gz
john.doe@slurmlogin:/home/john.doe$ ls cluster_usage_examples/
example_fastqc  example_mafft  example_tmalign

In the example_fastqc folder, you’ll see a sequencing output in fastq format (the .gz extension suggests that it’s also compressed) on which we’ll run the FastQC programme.

john.doe@slurmlogin:/home/john.doe$ ls cluster_usage_examples/example_fastqc/
head1000_SRR9732589_1.fastq.gz
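
If you’re curious, a fastq file stores each read as 4 lines (identifier, sequence, separator, quality scores). You can peek at the first read without decompressing the file on disk (zcat and head are standard tools):

john.doe@slurmlogin:/home/john.doe$ zcat cluster_usage_examples/example_fastqc/head1000_SRR9732589_1.fastq.gz | head -4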

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Step-by-step

Task 1: Locate the FastQC software

The FastQC programme executable is called fastqc. Try to find it using the module command.

Click to see hints

The module command can be used from anywhere on the cluster. The main sub-commands are:

  - module avail: list the software modules available on the cluster
  - module load: load a module, i.e. make its software available in your session
  - module unload: unload a previously loaded module
  - module list: list the modules currently loaded in your session

To get more details on options for these sub-commands (e.g. to search for a specific name), you can use the -h option to get the help page.
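
For example, to display the help page for the avail sub-command:

john.doe@slurmlogin:/home/john.doe$ module avail -h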

Click to see answer
john.doe@slurmlogin:/home/john.doe$ module avail -C fastqc -i
------------------------------------- /usr/share/modules/modulefiles --------------------------------------
fastqc/fastqc_v0.10.1 fastqc/fastqc_v0.11.5 singularity/fastqc

In the above command, we used the -C option to specify a pattern to search for (“fastqc”) and -i to make the search case-insensitive.

According to the output, all we have to do is use: module load fastqc/fastqc_v0.11.5 in order to load FastQC version 0.11.5 (we chose the most recent version here).
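
As a quick sanity check after loading, module list prints what is currently loaded (a sketch; note that, as Task 2 shows, modules have to be loaded on a compute node, not on slurmlogin):

john.doe@node01:/home/john.doe$ module load fastqc/fastqc_v0.11.5
john.doe@node01:/home/john.doe$ module list
Currently Loaded Modulefiles:
 1) fastqc/fastqc_v0.11.5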

Task 2: Determine how to use the fastqc executable

Let’s investigate how to use the fastqc executable: How do we specify the inputs? What options or parameters can we use? What would the final command line look like to run fastqc on your input?

Hints:

  - software listed by module is only available on the compute nodes, not on the master node (connect to one with srun)
  - most programmes print a usage message via man or a --help option

Click to see answer
  1. Software listed by module isn’t available on the master node (slurmlogin), so let’s first connect to a node and then load fastqc with module:
john.doe@slurmlogin:/home/john.doe$ srun --pty bash
john.doe@node01:/home/john.doe$ module load fastqc/fastqc_v0.11.5
  2. Most programmes come with help or usage messages that you can print on the screen using “man your_programme” or “your_programme --help”. Let’s see if we can access the help menu for fastqc:
john.doe@node01:/home/john.doe$ fastqc --help

            FastQC - A high throughput sequence QC analysis tool

SYNOPSIS

                    fastqc seqfile1 seqfile2 .. seqfileN

            fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
                  [-c contaminant file] seqfile1 .. seqfileN

                                   [...]

Have a look at the SYNOPSIS section: it tells us that the basic usage of fastqc in our case would look like this: “fastqc cluster_usage_examples/example_fastqc/head1000_SRR9732589_1.fastq.gz”

  3. Also, according to the help message, if we would like to specify the output folder, we can use the -o option, but the folder you specify has to exist first, for example:
# create the output folder
john.doe@node01:/home/john.doe$ mkdir -p cluster_usage_examples/example_fastqc/results

# run fastqc
john.doe@node01:/home/john.doe$ fastqc -o cluster_usage_examples/example_fastqc/results cluster_usage_examples/example_fastqc/head1000_SRR9732589_1.fastq.gz
  4. At this point, we know everything we need about fastqc and how to run it. We no longer need to be connected to a node and can free the resources we’ve been holding by disconnecting from it:
john.doe@node01:/home/john.doe$ exit 0
john.doe@slurmlogin:/home/john.doe$

Of note: as you can see, the terminal prompt changed again from node01 back to slurmlogin: we’ve returned to the master node (slurmlogin) of the cluster and the job we were running has terminated.

Task 3: Write your submission script

Let’s move to your example_fastqc subdirectory first and write the slurm_script.sh in there.

john.doe@slurmlogin:/home/john.doe$ cd cluster_usage_examples/example_fastqc/
Click to see hints

  - you can create and edit the file with the nano text editor: nano slurm_script.sh
  - see the Slurm cheat sheet for a list of all #SBATCH options

Click to see answer

The Slurm submission script is written like a common bash script (same language as the terminal). In this script, you write all the commands (one per line) that you would usually type in your terminal. The only particularity is the Slurm submission options, which you can add directly to this script, commonly at the beginning (each option on its own line, preceded by #SBATCH):


#! /bin/bash

#SBATCH --job-name="my_jobname"
#SBATCH --partition=common
#SBATCH --cpus-per-task=1

### prefix start - create temporary directory for your job
export TMPDIR=$(mktemp -d)
### prefix end


module load fastqc/fastqc_v0.11.5

# This is a comment line - it will be ignored when the script is executed

cd /home/john.doe/cluster_usage_examples/example_fastqc/       #|
                                                               #| These are your shell commands
fastqc head1000_SRR9732589_1.fastq.gz                          #|


### suffix start - delete created temporary directory
rm -rf $TMPDIR
### suffix end

Explanation of the content:

  - #! /bin/bash: the script should be interpreted with bash
  - the #SBATCH lines: the Slurm options, here the job’s name, the partition to run it on and the number of CPUs to reserve
  - the prefix/suffix blocks: create a temporary directory for the job and delete it at the end
  - the remaining lines: the shell commands to execute, i.e. loading the fastqc module, moving into the working directory and running fastqc

When you exit the nano text editor, you should see the file created in your current directory:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ ls
head1000_SRR9732589_1.fastq.gz   slurm_script.sh
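
Optionally, before submitting for real, you can sanity-check your script with sbatch’s --test-only option, which validates the script and estimates when the job would start without actually submitting it:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ sbatch --test-only slurm_script.sh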

Task 4: Submit your script to the cluster

Click to see answer

To submit a Slurm submission script, all you have to do is:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ sbatch slurm_script.sh
Submitted batch job 287170

This will print your assigned job id on the screen (287170 in this case).

You can follow the progression of your job with squeue:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ squeue -j 287170
             JOBID PARTITION       NAME     USER ST       TIME  NODES NODELIST(REASON)
            287170    common my_jobname john.doe  R   00:00:41      1 node06

(you can omit the -j option to show all the currently running jobs).

You can learn more about the options for squeue in the manual (type man squeue, navigate with the up/down arrow keys and exit by typing q).
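
For example, to list only your own jobs, you can filter by user with the standard -u option:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ squeue -u john.doe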

Task 5: Check if your job finished correctly

Hints:

  - list the contents of your folder with ls and compare with the files you expect fastqc and Slurm to produce

Click to see answer

What files do we expect to see? There should be 3 new files in total:

  - head1000_SRR9732589_1_fastqc.html: the FastQC quality report
  - head1000_SRR9732589_1_fastqc.zip: an archive containing the report and the underlying data
  - slurm-287170.out: the Slurm log file of your job

Your job shouldn’t take too long to finish, then you should be able to see the output files in your folder:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ ls
head1000_SRR9732589_1.fastq.gz   head1000_SRR9732589_1_fastqc.zip   head1000_SRR9732589_1_fastqc.html
slurm-287170.out                 slurm_script.sh
Click to see troubleshooting tips

Having issues? If you don’t have the output files, then there might be a problem in the execution somewhere. In that case, you can have a look at the log file from Slurm:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ cat slurm-287170.out

Note that the log file is generated by default in the directory in which you ran the sbatch command. There is an option in sbatch with which you can change this behaviour.
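
For example, adding a line like the following to your script would redirect the log (a sketch: --output is a standard sbatch option, %x and %j are Slurm filename patterns that expand to the job name and job id, and the target folder has to exist beforehand):

#SBATCH --output=logs/%x-%j.out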

Typical error messages are for example:

  - fastqc: command not found – the executable isn’t available in the job’s environment (e.g. you forgot the module load line)
  - No such file or directory – a path in your script is wrong or points to something that doesn’t exist

Task 6: Analyse your resource consumption to optimise your next run

Analyse your actual resource consumption: How much memory did you effectively use while running the job? How long did your job take to finish? How much CPU percentage did you use?

This is useful to know in order to adapt the resources you request in future jobs of a similar kind (e.g. other FastQC submissions).

Hints:

  - you can use the jobinfo command, giving it your job id as argument

Click to see answer
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ jobinfo 287170
Job ID               : 287170
Job name             : my_jobname
User                 : john.doe
Account              :
Working directory    : /home/john.doe/cluster_usage_examples/example_fastqc
Cluster              : cluster
Partition            : common
Nodes                : 1
Nodelist             : node06
Tasks                : 1
CPUs                 : 1
GPUs                 : 0
State                : COMPLETED
Exit code            : 0:0
Submit time          : 2025-03-11T11:03:43
Start time           : 2025-03-11T11:03:44
End time             : 2025-03-11T11:03:51
Wait time            :     00:00:01
Reserved walltime    :     02:00:00
Used walltime        :     00:00:07
Used CPU walltime    :     00:00:07
Used CPU time        :     00:00:04
CPU efficiency       : 66.00%
% User (computation) : 94.89%
% System (I/O)       :  5.11%
Reserved memory      : 1000M/core
Max memory used      : 2.85M (estimate)
Memory efficiency    :  0.29%
Max disk write       : 0.00
Max disk read        : 0.00

The lines you should look at:

  - Reserved walltime vs Used walltime: we reserved 2 hours but the job only ran for 7 seconds
  - Reserved memory vs Max memory used: we reserved 1000M per core but used less than 3M
  - CPU efficiency and Memory efficiency: how much of the reserved resources were actually used

Adjustments to make:

  - reduce the reserved walltime with --time (e.g. 10 minutes instead of the 2 hours reserved by default)
  - reduce the reserved memory with --mem (e.g. 100M instead of the 1000M per core reserved by default)

NB: it’s important to keep in mind that run times and resource usage also depend on the commands you run and on the size of your input.
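
If jobinfo is not available on your cluster, the standard Slurm accounting tool sacct reports similar figures (a sketch; JobID, JobName, Elapsed, MaxRSS and State are standard sacct format fields):

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_fastqc$ sacct -j 287170 --format=JobID,JobName,Elapsed,MaxRSS,State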

Task 7: Adjust the submission script with the previously-identified & required resources

Now try optimising your script to reserve no more than the resources you actually need to run fastqc. Your colleagues will be thankful ;-)

Hint: see Slurm cheat sheet for a list of all options for sbatch

Click to see answer

Your script could look like this:


#! /bin/bash

#SBATCH --job-name="my_jobname"
#SBATCH --partition=common
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M
#SBATCH --time=00:10:00

### prefix start - create temporary directory for your job
export TMPDIR=$(mktemp -d)
### prefix end


module load fastqc/fastqc_v0.11.5

# This is a comment line - it will be ignored when the script is executed

cd /home/john.doe/cluster_usage_examples/example_fastqc/       #|
                                                               #| These are your shell commands
fastqc head1000_SRR9732589_1.fastq.gz                          #|


### suffix start - delete created temporary directory
rm -rf $TMPDIR
### suffix end

NB: only the #SBATCH lines were changed…

Take home message

When discovering a new tool and wanting to use it on the cluster:

  1. search for your software with module
  2. write the commands that you want to run within a bash script, don’t forget module load at the top
  3. Slurm options can be specified within this script using #SBATCH-prefixed lines (see Slurm cheat sheet for a list of all options)
  4. submit the script with sbatch
  5. follow your job with squeue
  6. check if your job worked by searching for expected output files & looking at the slurm log
  7. analyse the resources used with jobinfo to optimise future runs
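
In command form, the whole workflow of this exercise boils down to (replace <jobid> with the id printed by sbatch):

# 1. search for the software
module avail -C fastqc -i
# 2-3. write slurm_script.sh (shell commands + #SBATCH lines), then
# 4. submit it
sbatch slurm_script.sh
# 5. follow the job
squeue -j <jobid>
# 6. check the expected output files and the Slurm log
ls
cat slurm-<jobid>.out
# 7. analyse the resources used
jobinfo <jobid>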

🔗 Back to exercise page