Getting started with the I2BC cluster
Case study 1 - QC analysis with FastQC
Instructions: This exercise is divided into 7 very detailed steps. If you’re already quite comfortable with scripting and schedulers, go directly to Exercise 3 (it’s the same steps, but it will also introduce you to for loops and job arrays) and follow up with Exercise 4 which will introduce you to using conda to install programmes on the cluster.
Context: We just sequenced the RNA of a sample. The sequencing platform gave us the raw output file in fastq format. We would like to have a first overview of the quality of the sequencing performed. In the following example, we will try to run the FastQC programme on this sequencing output. The FastQC programme analyses fastq files and outputs a quality control report in html format (more information on FastQC). It’s a small programme that doesn’t require a lot of resources and it’s already installed on the I2BC cluster.
https://forge.i2bc.paris-saclay.fr/git/partage-bioinfo/cluster_usage_examples.git
If you haven’t already got a session open on the Frontale (the master node) of the cluster, please do so as the rest of the steps are performed on the cluster. If you don’t know how to connect, don’t hesitate to refer to the previous section.
john.doe@cluster-i2bc:~$ cd /home/john.doe
john.doe@cluster-i2bc:/home/john.doe$ wget "https://zenodo.org/record/8340293/files/cluster_usage_examples.tar.gz"
john.doe@cluster-i2bc:/home/john.doe$ tar -zxf cluster_usage_examples.tar.gz
john.doe@cluster-i2bc:/home/john.doe$ ls cluster_usage_examples/
example_fastqc example_mafft example_tmalign
In the example_fastqc folder, you’ll see a sequencing output in fastq format (the .gz extension indicates that it’s also compressed) on which we’ll run the FastQC programme.
john.doe@cluster-i2bc:/home/john.doe$ ls cluster_usage_examples/example_fastqc/
head1000_SRR9732589_1.fastq.gz
We want to run fastqc, so let’s see if we can locate it.
Many programmes are already installed on the cluster. Some are on the Frontale (the master node) but most of them are on the nodes only.
- If you want to check if a programme is installed, you’ll have to connect to a node first in interactive mode:
john.doe@cluster-i2bc:/home/john.doe$ qsub -I
qsub: waiting for job 287169.pbsserver to start
qsub: job 287169.pbsserver ready
john.doe@node01:/home/john.doe$
  Of note:
  - with the qsub command, you are actually running a job on the cluster with a job identifier.
  - all jobs are dispatched to one of the available nodes of the cluster – in this case, we’re using node01 (NB: cluster-i2bc is the name of the Frontale).
- Once on the node, there are several places your executable could be:
  - (the easiest scenario) your executable could already be installed and saved in your system’s $PATH variable, in which case you should be able to just type the command name as is in the terminal.
  - your executable could be installed in the /opt folder at the root of the node, in which case you’ll have to look for it first.
  - there is also a module but not all programmes in /opt are listed in there.
Let’s check first if the fastqc executable is in the $PATH by directly typing the fastqc command in the terminal:
john.doe@node01:/home/john.doe$ fastqc
-bash: fastqc: command not found
Did this work? No… This error message is typical when you type a command that the shell doesn’t understand or cannot find, and means that the fastqc command doesn’t exist in your $PATH. Note: This doesn’t mean that the programme is not installed, it just means that the system doesn’t know where to find it, if it’s there.
Is fastqc in the /opt folder? Indeed, many programmes are installed on the cluster but weren’t added to the $PATH and are thus not directly accessible. The /opt folder is at the root of the system (it starts with a slash) and is where most programmes are installed on the I2BC cluster. Let’s see if we can find fastqc in there:
john.doe@node01:/home/john.doe$ find /opt -maxdepth 3 -name "fastqc" 2>/dev/null
/opt/FastQC/fastqc
/opt/fastqc_v0.10.1/FastQC/fastqc
/opt/fastqc_v0.10.1/fastqc
/opt/singularity/calls_fastqc_v0.11.9_c/fastqc
/opt/fastqc_v0.11.5/fastqc
/opt/fastqc_v0.11.5/FastQC/fastqc
Of note: the find command lists all files named fastqc that it finds up to 3 levels deep in the /opt directory.
Did we find anything? Yes! There are several fastqc versions installed in /opt, let’s use v0.11.5. To use the programme, we’ll have to specify its full path: /opt/fastqc_v0.11.5/fastqc
Of note: If there’s also nothing in /opt/, you can ask the SICS to install it for you or install a local version in your home directory. Using conda is also a good alternative.
Now let’s get to know the fastqc executable: How do we specify the inputs? What options or parameters can we use? What would the final command line look like to run fastqc on your input?
We can make the most of still being connected to a node on the cluster to run a few execution tests: most programmes come with help or usage messages that you can print on the screen using “man your_programme” or “your_programme --help”. Let’s see if we can access the help menu for fastqc:
john.doe@node01:/home/john.doe$ /opt/fastqc_v0.11.5/fastqc --help
FastQC - A high throughput sequence QC analysis tool
SYNOPSIS
fastqc seqfile1 seqfile2 .. seqfileN
fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
[-c contaminant file] seqfile1 .. seqfileN
[...]
So the basic usage of fastqc in our case would look like this (the full path to the executable, followed by the fastq file):
john.doe@node01:/home/john.doe$ /opt/fastqc_v0.11.5/fastqc /home/john.doe/cluster_usage_examples/example_fastqc/head1000_SRR9732589_1.fastq.gz
If we would like to specify the output folder, we can use the -o option but the folder you specify has to exist, for example:
john.doe@node01:/home/john.doe$ mkdir -p /home/john.doe/cluster_usage_examples/example_fastqc/results
john.doe@node01:/home/john.doe$ /opt/fastqc_v0.11.5/fastqc -o /home/john.doe/cluster_usage_examples/example_fastqc/results /home/john.doe/cluster_usage_examples/example_fastqc/head1000_SRR9732589_1.fastq.gz
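Long paths typed several times are easy to get wrong, so it can help to write each one once in a variable. The sketch below wraps the two commands above in a small function. The names run_fastqc, WORKDIR and FASTQC are chosen for this example only.

```shell
# run_fastqc EXE OUTDIR INPUT
# Create OUTDIR if needed, then run EXE on INPUT with "-o OUTDIR".
# NB: run_fastqc is a hypothetical helper written for this example.
run_fastqc() {
    exe=$1
    outdir=$2
    input=$3
    mkdir -p "$outdir" || return 1   # fastqc needs the folder to exist
    "$exe" -o "$outdir" "$input"
}

# Paths from the example above, each written only once.
WORKDIR=/home/john.doe/cluster_usage_examples/example_fastqc
FASTQC=/opt/fastqc_v0.11.5/fastqc

# Only attempt the run where the executable is present (i.e. on a node).
if [ -x "$FASTQC" ]; then
    run_fastqc "$FASTQC" "$WORKDIR/results" "$WORKDIR/head1000_SRR9732589_1.fastq.gz"
fi
```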
At this point, we know everything we need to know about fastqc and its execution. We no longer need to be connected to a node and can now free the resources that we’ve blocked by disconnecting from it:
john.doe@node01:/home/john.doe$ logout
qsub: job 287169.pbsserver completed
john.doe@cluster-i2bc:/home/john.doe$
Of note: as you can see, the terminal prompt prefix changed again from node01 back to cluster-i2bc: we’ve returned to the Frontale of the cluster and the job we were running has terminated.
Let’s move to the example_fastqc subdirectory first:
john.doe@cluster-i2bc:/home/john.doe$ ls cluster_usage_examples/
example_fastqc example_mafft example_tmalign
john.doe@cluster-i2bc:/home/john.doe$ cd cluster_usage_examples/example_fastqc/
Now let’s write a script called pbs_script.sh in there.
In this example, we will write pbs_script.sh using the in-line text editor nano (but there are other possibilities such as vi, vim or emacs for example):
john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_fastqc$ nano pbs_script.sh
This will create a file called pbs_script.sh in your current directory and will open up an in-line “window” to edit your file that looks like the screenshot below.
Screenshot of the nano editor
About the nano text editor:
It’s in-line, you navigate through it with your arrow keys and you have a certain number of functionalities (e.g. copy-paste, search etc.) that are accessible through keyboard shortcuts (that are also listed on the bottom of your screen, ^ stands for the Ctrl key).
The main shortcuts are: Ctrl+S to save (^S) and Ctrl+X to exit (^X).
See the nano cheat sheet and tutorial for more information and shortcuts.
The PBS submission script is written like a common bash script (same language as the terminal). Write in this script all the commands (1 per line) that you would usually type in your terminal. The only particularity is the PBS submission options, which you can add directly to this script, commonly at the beginning (each parameter should be preceded by #PBS like in the example below).
#! /bin/bash
#PBS -N my_jobname
#PBS -q common
#PBS -l ncpus=1
# This is a comment line - it will be ignored when the script is executed
cd /home/john.doe/cluster_usage_examples/example_fastqc/ #|
#| These are your shell commands
/opt/fastqc_v0.11.5/fastqc head1000_SRR9732589_1.fastq.gz #|
Explanation of the content:
- #! /bin/bash: this is the “shebang”, it specifies the “language” of your script (in this case, the cluster understands that the syntax of this text file is “bash” and will execute it with the /bin/bash executable).
- #PBS: All lines starting with #PBS indicate to the PBS job scheduler on the cluster that the following information is related to the job submission. This is where you specify the qsub options such as your job name with -N or the queue you want to submit your job to with -q. There are many more options you can specify, see the SICS webpage.
- cd /path/to/your/folder: by default, when you connect to the Frontale or the nodes, you land on your “home” directory (/home/john.doe). By moving to the directory which contains your input, you won’t need to specify the full path to the input, as you can see in the line of code that follows this statement.
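As an alternative to hard-coding the folder in the cd line, PBS sets the environment variable PBS_O_WORKDIR inside a job to the directory from which qsub was run. The sketch below is not part of the original script; it assumes you submit the job from the folder that contains your input, and it falls back to the current directory when the variable is not set (for example when you test the script by hand).

```shell
# Inside a PBS job, PBS_O_WORKDIR holds the directory from which qsub was run.
# Outside of PBS it is unset, so we fall back to the current directory.
job_dir=${PBS_O_WORKDIR:-$PWD}
cd "$job_dir" || exit 1
echo "Working in: $(pwd)"
```

With this in place, the same script can be reused in another folder without editing the path.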
john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_fastqc$ ls
head1000_SRR9732589_1.fastq.gz pbs_script.sh
To submit a PBS submission script, all you have to do is:
john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_fastqc$ qsub pbs_script.sh
287170.pbsserver
This will print your assigned job id on the screen (287170 in this case).
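Since qsub prints the job id on its standard output, it can be captured in a variable for later use. The following is a minimal sketch: job_number is a helper name invented here, and the submission itself only runs where a qsub command is available.

```shell
# job_number FULL_ID
# Strip the server suffix from a PBS job id such as "287170.pbsserver".
# NB: job_number is a hypothetical helper, not a PBS command.
job_number() {
    printf '%s\n' "${1%%.*}"
}

# On the Frontale, the id printed by qsub can be captured directly.
if command -v qsub >/dev/null 2>&1; then
    job_id=$(qsub pbs_script.sh)
    echo "Submitted job number $(job_number "$job_id")"
fi
```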
You can follow the progress of your job with qstat:
john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_fastqc$ qstat 287170.pbsserver
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
287170.pbsserver my_jobname john.doe 00:00:05 R common
Or to see all your jobs running and details on resources:
john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_fastqc$ qstat -u john.doe -w
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
287170.pbsserver john.doe common my_jobname 3090738 1 1 2gb 02:00 R 00:05
287172.pbsserver john.doe common my_job2 3090739 1 1 2gb 02:00 R 00:01
You can learn more about the options for qstat on the SICS website or in the manual (type man qstat, navigate with the up/down arrow keys and exit by typing q).
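If you only need the state letter of one job (the S column, R in the example above), it can be extracted from the default qstat output. This sketch assumes the six-column layout of the first qstat example, where the state is the fifth field; job_state is a name made up for this example.

```shell
# job_state JOB_ID
# Read default-format qstat output on stdin and print the state letter
# (S column) of JOB_ID. Assumes the state is the 5th whitespace-separated field.
job_state() {
    awk -v id="$1" '$1 == id { print $5 }'
}

# Example with the output shown above, fed in as text:
job_state 287170.pbsserver <<'EOF'
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
287170.pbsserver  my_jobname       john.doe          00:00:05 R common
EOF
```

On the cluster, the same function could be fed directly from qstat, for instance with qstat 287170.pbsserver | job_state 287170.pbsserver.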
- FastQC should generate two files: an html file with the visual summary of the quality assessment of your fastq file and a zip folder that contains individual png images and result files.
- the PBS scheduler should also generate two files with your jobname as prefix and the job identifier as suffix: one summarising the error log, the other the usual output log (it’s what’s usually printed on the screen that PBS captures in two separate files instead).
john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_fastqc$ ls
head1000_SRR9732589_1.fastq.gz head1000_SRR9732589_1_fastqc.zip head1000_SRR9732589_1_fastqc.html
my_jobname.e287170 my_jobname.o287170 pbs_script.sh
You will see the output files generated by fastqc but also the log files generated by the PBS job scheduler, to which the output and error messages that are normally printed on the screen are written (e=error, o=output).
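A quick way to confirm that both reports are there is to derive their names from the input name and test for them. The sketch below follows the naming seen in the listing above (the .fastq.gz ending replaced by _fastqc.html and _fastqc.zip); how other file extensions are handled is not covered here, and fastqc_report_prefix is a name invented for this example.

```shell
# fastqc_report_prefix INPUT
# Derive the report prefix from a ".fastq.gz" (or ".fastq") input name,
# following the naming seen in the listing above. Other extensions are
# left untouched (an assumption of this sketch).
fastqc_report_prefix() {
    base=${1##*/}        # drop any leading directories
    base=${base%.gz}     # drop the compression suffix
    base=${base%.fastq}  # drop the format suffix
    printf '%s_fastqc\n' "$base"
}

prefix=$(fastqc_report_prefix head1000_SRR9732589_1.fastq.gz)
for ext in html zip; do
    if [ -s "$prefix.$ext" ]; then
        echo "found:   $prefix.$ext"
    else
        echo "missing: $prefix.$ext"
    fi
done
```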
Having issues? If you don’t have the output files, then there might be a problem in the execution somewhere. In that case, you can have a look at the two log files from PBS, especially the error file (*.e*).
john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_fastqc$ cat my_jobname.e287170
Note that both log files are generated by default in the directory in which you ran the qsub command. There are options in qsub with which you can change this behaviour.
Typical error messages are for example:
- “-bash: fastqc: command not found”:
  This is typical for commands that bash doesn’t know or cannot find. In this case, it might be because you forgot to specify the full path to the fastqc executable, e.g. /opt/fastqc_v0.11.5/fastqc
- “Specified output directory 'nonexistantdir' does not exist”:
  As stated, fastqc cannot find the output directory that was specified. It might be because you have to create it first or because it couldn’t find it in the working directory (make sure to specify the full path to that folder and check that you don’t have any typos).
- “Skipping 'nonexistant.fastq' which didn't exist, or couldn't be read”:
  As stated, fastqc cannot find the input that you gave it. It could be linked to a typo in the name or could be because fastqc didn’t find it in your current working directory (keep in mind that by default, when you connect to the Frontale and the nodes, you land on your home directory and fastqc won’t find your inputs unless you move to the right directory or specify the full path to those files).
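The three errors above can be caught early with a few file tests. This is a sketch only: preflight is a hypothetical helper, and because most executables exist on the nodes only, it is meant to be called at the top of the submission script or in an interactive session on a node, not on the Frontale.

```shell
# preflight EXE INPUT [OUTDIR]
# Report the three problems listed above before the real command runs.
# Returns 0 when everything looks fine, 1 otherwise.
preflight() {
    exe=$1
    input=$2
    outdir=${3:-}
    status=0
    if [ ! -x "$exe" ]; then
        echo "executable not found or not executable: $exe" >&2
        status=1
    fi
    if [ ! -r "$input" ]; then
        echo "input missing or unreadable: $input" >&2
        status=1
    fi
    # The output directory is optional; only check it when one is given.
    if [ -n "$outdir" ] && [ ! -d "$outdir" ]; then
        echo "output directory does not exist: $outdir" >&2
        status=1
    fi
    return "$status"
}
```

In pbs_script.sh, it could be called right after the cd line, for instance as preflight /opt/fastqc_v0.11.5/fastqc head1000_SRR9732589_1.fastq.gz || exit 1, so that any message ends up in the error log.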
Knowing how many resources your job actually used is useful in order to adapt the resources you ask for in future jobs of a similar kind (e.g. other FastQC submissions). To see how many resources your job used, you can use qshow -j MY_JOB_ID or qstat -fxw -G MY_JOB_ID (both commands are equivalent):
john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_fastqc$ qshow -j 287170
Job Id: 287170.pbsserver
Job_Name = my_jobname
Job_Owner = john.doe@master.example.org
resources_used.cpupercent = 16
resources_used.cput = 00:00:04
resources_used.mem = 79684kb
resources_used.ncpus = 1
resources_used.vmem = 2630856kb
resources_used.walltime = 00:00:06
job_state = F
queue = common
[...]
Resource_List.mem = 2gb
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.place = pack
Resource_List.preempt_targets = QUEUE=lowprio
Resource_List.select = 1:mem=2gb:ncpus=1
Resource_List.walltime = 02:00:00
[...]
Answer?
- Memory: It’s the amount of RAM the job is allocated. We reserved 2 GB by default (Resource_List.mem) but only used about 80 MB (resources_used.mem). For next time, we could consider asking for less memory, for example 1 GB instead of 2 GB with -l mem=1gb. This will leave more memory available for others on the cluster.
- CPU percentage: It reflects how much of the CPU you used during your job (resources_used.cpupercent). For 1 CPU reserved, cpupercent can go from 0% (sub-optimal use) to 100% (optimal use). For N CPUs, it can go up to N x 100%, if all CPUs are working full time. It’s an approximate measure of how efficiently the tasks are distributed over the CPUs. In our case, we only used 16% of the allocated CPU but we can’t ask for less than 1 CPU so there’s nothing to be done.
- Wall time: It’s the maximum computation time given to a job. Beyond this time, your job will be killed, whatever its state. We reserved 2 hrs (Resource_List.walltime) but the job only took 6 seconds (resources_used.walltime). For next time, knowing that fastqc is very fast, we could set a wall time of 10 minutes for example with -l walltime=00:10:00.
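To compare used and reserved values more easily, the figures reported by qshow can be converted to plain numbers. The two helpers below are a sketch (their names are invented here) and assume that 1 MB corresponds to 1024 kB.

```shell
# to_seconds HH:MM:SS  -> number of seconds
to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# kb_to_mb NNNkb  -> megabytes with one decimal (1 MB taken as 1024 kB)
kb_to_mb() {
    printf '%s\n' "${1%kb}" | awk '{ printf "%.1f\n", $1 / 1024 }'
}

# Values copied from the qshow output above.
echo "walltime: $(to_seconds 00:00:06) s used of $(to_seconds 02:00:00) s reserved"
echo "memory:   $(kb_to_mb 79684kb) MB used of 2048 MB reserved"
```

Seen this way, the job used 6 of the 7200 reserved seconds and under 80 MB of the 2 GB reserved, which is why smaller requests are reasonable for similar jobs.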
pbs_script.sh could then look like this:
#! /bin/bash
#PBS -N my_jobname
#PBS -q common
#PBS -l ncpus=1
#PBS -l mem=1gb
#PBS -l walltime=00:10:00
# This is a comment line - it will be ignored when the script is executed
cd /home/john.doe/cluster_usage_examples/example_fastqc/ #|
#| These are your shell commands
/opt/fastqc_v0.11.5/fastqc head1000_SRR9732589_1.fastq.gz #|