Getting started with the I2BC cluster
Exercise 2 - Aligning sequences with MAFFT
Instructions: This exercise is identical to the previous one but with a different input and programme. You can use it to check what you’ve learned in Exercise 1 (try answering the questions without looking at the answers). If you were quite comfortable with Exercise 1, you can go directly to Exercise 3 which will introduce you to “for” loops and job arrays.
Context: We are studying the human ASF1A protein and would like to find conserved regions in its sequence. So we ran BLAST to search for homologs and downloaded the full-length sequences of all hits. In the following example, we will run the MAFFT programme on this set of sequences to align them all against each other (more information on MAFFT). MAFFT is a small programme that doesn’t require a lot of resources (for a relatively small set of sequences) and it’s already installed on the I2BC cluster.
The working files are also available from the I2BC Git forge: https://forge.i2bc.paris-saclay.fr/git/partage-bioinfo/cluster_usage_examples.git
If you haven’t already got a session open on the Frontale (the master node) of the cluster, please do so as the rest of the steps are performed on the cluster. If you don’t know how to connect, don’t hesitate to refer to the previous section.
We will work in your home directory (see this page for more information on the file spaces accessible from the cluster). Let’s move to it and fetch our working files:
john.doe@cluster-i2bc:~$ cd /home/john.doe
john.doe@cluster-i2bc:/home/john.doe$ wget "https://zenodo.org/record/8340293/files/cluster_usage_examples.tar.gz?download=1" -O cluster_usage_examples.tar.gz
john.doe@cluster-i2bc:/home/john.doe$ tar -zxf cluster_usage_examples.tar.gz
john.doe@cluster-i2bc:/home/john.doe$ ls cluster_usage_examples/
example_fastqc example_mafft example_tmalign
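If you’re unsure what a tarball contains, you can list its contents before (or instead of) extracting with tar’s -t flag. A small self-contained sketch (the demo_dir and demo.tar.gz names are made up for illustration):

```shell
# create a tiny gzipped tarball, then list its contents with -t (no extraction)
mkdir -p demo_dir && echo "data" > demo_dir/file.txt
tar -czf demo.tar.gz demo_dir
tar -ztf demo.tar.gz
```

The same idea with tar -ztf cluster_usage_examples.tar.gz would show you the example folders before you extract them.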
In the example_mafft folder, you’ll see a FASTA file with unaligned protein sequences on which we’ll run the mafft programme.
john.doe@cluster-i2bc:/home/john.doe$ ls cluster_usage_examples/example_mafft/
ASF1_HUMAN_FL_blastp_output.fasta
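A quick way to see how many sequences a FASTA file contains is to count the header lines, since each record starts with “>”. A minimal sketch using a throwaway demo file (demo.fasta is hypothetical, not part of the exercise data):

```shell
# each FASTA record starts with '>'; grep -c counts the matching lines
printf '>seq1\nMAK\n>seq2\nMSE\n' > demo.fasta
grep -c '^>' demo.fasta    # prints 2
```

Running grep -c '^>' on ASF1_HUMAN_FL_blastp_output.fasta will tell you how many homologs are going into the alignment.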
The MAFFT programme executable is called mafft; let’s see if we can find it in the modules:
john.doe@cluster-i2bc:/home/john.doe$ module avail -C mafft -i
------------------------------------- /usr/share/modules/modulefiles --------------------------------------
nodes/mafft-7.475
So all we have to do is use module load nodes/mafft-7.475 in order to load MAFFT.
Let’s investigate how to use the mafft executable: How do we specify the inputs? What options or parameters can we use? What would the final command line look like to run mafft on your input?
You can have a look at the documentation or you can experiment with the executable in an interactive session on one of the nodes:
john.doe@cluster-i2bc:/home/john.doe$ qsub -I
qsub: waiting for job 287169.pbsserver to start
qsub: job 287169.pbsserver ready
john.doe@node01:/home/john.doe$ module load nodes/mafft-7.475
Of note:
- with the qsub command, you are actually running a job on the cluster with a job identifier.
- all jobs are dispatched to one of the available nodes of the cluster – in this case, we’re using node01 (NB: cluster-i2bc is the name of the Frontale).
Most programmes come with help or usage messages that you can print on the screen using “man your_programme” or “your_programme --help”.
Let’s see if we can access the help menu for mafft:
john.doe@node01:/home/john.doe$ mafft --help
------------------------------------------------------------------------------
MAFFT v7.475 (2020/Nov/23)
https://mafft.cbrc.jp/alignment/software/
MBE 30:772-780 (2013), NAR 30:3059-3066 (2002)
------------------------------------------------------------------------------
High speed:
% mafft in > out
% mafft --retree 1 in > out (fast)
High accuracy (for <~200 sequences x <~2,000 aa/nt):
% mafft --maxiterate 1000 --localpair in > out (% linsi in > out is also ok)
% mafft --maxiterate 1000 --genafpair in > out (% einsi in > out)
% mafft --maxiterate 1000 --globalpair in > out (% ginsi in > out)
If unsure which option to use:
% mafft --auto in > out
--op # : Gap opening penalty, default: 1.53
--ep # : Offset (works like gap extension penalty), default: 0.0
--maxiterate # : Maximum number of iterative refinement, default: 0
--clustalout : Output: clustal format, default: fasta
--reorder : Outorder: aligned, default: input order
--quiet : Do not report progress
--thread # : Number of threads (if unsure, --thread -1)
--dash : Add structural information (Rozewicki et al, submitted)
So the basic usage of mafft in our case would look like this (first the executable, then the FASTA input file, then the redirection to the aligned output file):
john.doe@node01:/home/john.doe$ mafft /home/john.doe/cluster_usage_examples/example_mafft/ASF1_HUMAN_FL_blastp_output.fasta > /home/john.doe/cluster_usage_examples/example_mafft/ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta
Note: by default, mafft just prints the alignment output to the screen. In order to capture this printed output into a file, we use the redirection sign “>” followed by the name of the file we want to redirect the printed output to.
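Redirection is a general shell feature, not something specific to mafft. A small self-contained sketch with built-in commands (the file names are arbitrary): “>” captures standard output, while “2>” captures standard error, which is the stream programmes typically use for errors and progress messages – which is why MAFFT’s progress still appears on screen when you only redirect with “>”.

```shell
# '>' sends standard output (fd 1) to a file instead of the screen
echo "ALIGNED" > demo_out.txt
# '2>' sends standard error (fd 2) to a file; error/progress messages use this stream
ls no_such_file 2> demo_err.txt || true
cat demo_out.txt     # prints ALIGNED
```

To capture both streams from a real run, you could combine them: mafft in > out 2> progress.log.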
At this point, we know everything there is to know about mafft and its execution. We no longer need to be connected to a node and can free the resources we’ve blocked by disconnecting from it:
john.doe@node01:/home/john.doe$ logout
qsub: job 287169.pbsserver completed
john.doe@cluster-i2bc:/home/john.doe$
Of note: as you can see, the terminal prompt prefix changed again, from node01 back to cluster-i2bc: we’ve returned to the Frontale of the cluster and the job we were running has terminated.
It’s best to write the submission script in your example_mafft subdirectory, so let’s move there first:
john.doe@cluster-i2bc:/home/john.doe$ ls cluster_usage_examples/
example_fastqc example_mafft example_tmalign
john.doe@cluster-i2bc:/home/john.doe$ cd cluster_usage_examples/example_mafft/
Now let’s write a script called pbs_script.sh in there. In this example, we will write pbs_script.sh using the in-line text editor nano (but there are other possibilities such as vi, vim or emacs, for example):
john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_mafft$ nano pbs_script.sh
This will create a file called pbs_script.sh in your current directory and will open an in-line “window” to edit your file, as in the screenshot below.
[Screenshot of the nano editor]
About the nano text editor: it’s in-line, you navigate through it with your arrow keys, and a number of functionalities (e.g. copy-paste, search, etc.) are accessible through keyboard shortcuts, which are also listed at the bottom of your screen (^ stands for the Ctrl key). The main shortcuts are Ctrl+S to save (^S) and Ctrl+X to exit (^X).
See the nano cheat sheet and tutorial for more information and shortcuts.
The PBS submission script is written like a common bash script (the same language as the terminal). Write in this script all the commands (one per line) that you would usually type in your terminal. The only particularity is the PBS submission options, which you can add directly to this script, commonly at the beginning (each option should be preceded by #PBS, as in the example below).
#! /bin/bash
#PBS -N my_jobname
#PBS -q common
#PBS -l ncpus=1
module load nodes/mafft-7.475
# This is a comment line - it will be ignored when the script is executed
# Comment lines start with a "#" symbol and can be put anywhere you like
# You can also add a comment at the end of a line, as shown below
# The next two lines are your shell commands
cd /home/john.doe/cluster_usage_examples/example_mafft/
mafft ASF1_HUMAN_FL_blastp_output.fasta > ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta
Explanation of the content:
- #! /bin/bash: this is the “shebang”; it specifies the “language” of your script (in this case, the cluster understands that the syntax of this text file is bash and will execute it with the /bin/bash executable).
- #PBS: all lines starting with #PBS tell the PBS job scheduler on the cluster that the following information relates to the job submission. This is where you specify the qsub options, such as your job name with -N or the queue you want to submit your job to with -q. There are many more options you can specify; see the “cheat sheet” tab on the intranet.
- module load: used to load the software you require (i.e. MAFFT in this case).
- cd /path/to/your/folder: by default, when you connect to the Frontale or the nodes, you land in your home directory (/home/john.doe). By moving to the directory which contains your input, you won’t need to specify the full path to the input, as you can see in the line of code that follows this statement.
When you exit the nano text editor, you should see the file created in your current directory:
john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_mafft$ ls
ASF1_HUMAN_FL_blastp_output.fasta pbs_script.sh
To submit a PBS submission script, all you have to do is:
john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_mafft$ qsub pbs_script.sh
287170.pbsserver
This will print your attributed job ID on the screen (287170 in this case). You can follow the progression of your job with qstat:
john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_mafft$ qstat 287170.pbsserver
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
287170.pbsserver my_jobname john.doe 00:00:05 R common
Note: if you get a message saying your “Job has finished”, you can add the -x option to the qstat command; this extends the search to your past jobs as well as the ones that are currently running: qstat -x 287170.pbsserver.
Or to see all your jobs running and details on resources:
john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_mafft$ qstat -u john.doe -w
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
287170.pbsserver john.doe common my_jobname 3090738 1 1 2gb 02:00 R 00:05
287172.pbsserver john.doe common my_job2 3090739 1 1 2gb 02:00 R 00:01
You can learn more about the options for qstat on the SICS website or in the manual (type man qstat, navigate with the up/down arrow keys and exit by typing q).
What files do we expect to see? There should be 3 new files in total:
- MAFFT should generate one file: the output file you specified in your script, containing the aligned sequences (a multiple sequence alignment).
- the PBS scheduler should also generate two files, with your job name as prefix and the job identifier as suffix: one containing the error log, the other the usual output log (what is usually printed on the screen, which PBS captures in two separate files instead).
Your job shouldn’t take too long to finish, then you should be able to see the output files in your folder:
john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_mafft$ ls
ASF1_HUMAN_FL_blastp_output.fasta ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta
my_jobname.e287170 my_jobname.o287170
pbs_script.sh
You will see the output file generated by mafft, but also the log files generated by the PBS job scheduler, to which the output and error messages normally printed on the screen are written (e = error, o = output). Let’s have a quick look at the alignment:
john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_mafft$ head ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta
>my_query_seq Q9Y294 ASF1_HUMAN
------------------------MAK---------------VQVNNVVVLDNPSPFYNP
FQFEIT--------FECIEDLSE----DLEWKIIYVGSAESEEY--DQVLDSVLVGPVPA
GRHMFVFQADAPNPGLIPDADAVGVTVVLITCTYRGQEFIRVGYYVNNEYTETELRENPP
VKPDFSKLQRNILASNPRVTRFHINWEDNTEKLEDAE-SSNPNLQSLLSTDALPSA-SKG
WSTSENSLNVMLESHMDCM-----------------------------------------
----------
>KAI6071314.1 Histone chaperone ASF1A [Aix galericulata]
------------------------MAK---------------VQVNNVVVLDNPSPFYNP
FQFEIT--------FECIEDLSE----DLEWKIIYVGSAESEEY--DQVLDSVLVGPVPA
[...]
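As a quick sanity check: in a valid alignment, every sequence (gaps included) has the same length. A sketch with awk on a tiny inline “alignment” (demo_aln.fasta is made up; substitute the real output file in practice). If a single number is printed, the lengths are consistent:

```shell
# build a tiny two-sequence alignment for demonstration
printf '>a\nAB-C\n>b\nA-BC\n' > demo_aln.fasta
# print each sequence's length, then keep only the unique values
awk '/^>/{if(seq)print length(seq); seq=""} !/^>/{seq=seq $0} END{print length(seq)}' demo_aln.fasta | sort -u
```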
If you’re curious, you can also view your sequence alignment in a more graphical way through the EBI’s MView web server.
Having issues?
If you don’t have the output files, then there might be a problem in the execution somewhere. In that case, have a look at the two log files from PBS, especially the error file (*.e*).
john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_mafft$ cat my_jobname.e287170
Note that both log files are generated by default in the directory in which you ran the qsub command. There are options in qsub with which you can change this behaviour.
Typical error messages are, for example:
- -bash: mfft: command not found: this is typical for commands that bash doesn’t know or cannot find. In this case, it’s because we didn’t spell mafft correctly (mfft instead of mafft).
- /usr/bin/mafft: Cannot open ASF1_HUMAN_FL_blastp_output.fasta.: as stated, mafft cannot find the input that you gave it. It could be due to a typo in the name, or because mafft didn’t find it in your current working directory (keep in mind that by default, when you connect to the Frontale and the nodes, you land in your home directory, and mafft won’t find your inputs unless you move to the right directory with cd or specify the full path to those files).
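A habit that scales well: check whether the PBS error file is empty before digging further. A sketch using a simulated error log (demo.e000 is a made-up name standing in for a real file like my_jobname.e287170); [ -s file ] is true when the file exists and is non-empty:

```shell
# simulate a PBS error log containing a typical message
printf 'bash: mfft: command not found\n' > demo.e000
# -s: file exists and has a size greater than zero
if [ -s demo.e000 ]; then echo "job reported errors"; else echo "error log empty"; fi
```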
Analyse your actual resource consumption: How much memory did you effectively use while running the job? How long did your job take to finish? How much CPU percentage did you use?
This is useful to know in order to adapt the resources you ask for in future jobs with similar proceedings (e.g. other MAFFT submissions). To see how much resource your job used, you can use qshow -j MY_JOB_ID or qstat -fxw -G MY_JOB_ID (both commands are equivalent):
john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_mafft$ qshow -j 287170
Job Id: 287170.pbsserver
Job_Name = my_jobname
Job_Owner = john.doe@master.example.org
resources_used.cpupercent = 0
resources_used.cput = 00:00:01
resources_used.mem = 26096kb
resources_used.ncpus = 1
resources_used.vmem = 0kb
resources_used.walltime = 00:00:01
job_state = F
queue = common
[...]
Resource_List.mem = 2gb
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.place = pack
Resource_List.preempt_targets = QUEUE=lowprio
Resource_List.select = 1:mem=2gb:ncpus=1
Resource_List.walltime = 02:00:00
[...]
Answer?
- Memory: it’s the amount of RAM the job is allocated. We reserved 2 GB by default (Resource_List.mem) but only used about 26 MB (resources_used.mem). For next time, we could consider asking for less memory, for example 100 MB instead of the 2 GB, with -l mem=100mb. This will leave more memory available for others on the cluster.
- CPU percentage: it reflects how much of the CPU you used during your job (resources_used.cpupercent). For 1 CPU reserved, cpupercent can go from 0% (sub-optimal use) to 100% (optimal use). For N CPUs, it can go up to N x 100%, if all CPUs are working full time. It’s an approximate measure of how efficiently the tasks are distributed over the CPUs. In our case, we only used 0% of the allocated CPU (yes, that’s how little MAFFT uses – it’s also just an approximate calculation), but we can’t ask for less than 1 CPU so there’s nothing to be done.
- Wall time: it’s the maximum computation time given to a job. Beyond this time, your job will be killed, whatever its state. Setting a maximum limit can be useful in some cases, for example where a programme freezes after hitting an error. We reserved 2 hrs (Resource_List.walltime) but the job only took a second (resources_used.walltime). For next time, knowing that mafft is very fast, we could set a wall time of 10 minutes, for example with -l walltime=00:10:00.
Our adjusted job script pbs_script.sh could then look like this:
#! /bin/bash
#PBS -N my_jobname
#PBS -q common
#PBS -l ncpus=1
#PBS -l mem=100mb
#PBS -l walltime=00:10:00
module load nodes/mafft-7.475
# This is a comment line - it will be ignored when the script is executed
# The next two lines are your shell commands
cd /home/john.doe/cluster_usage_examples/example_mafft/
mafft ASF1_HUMAN_FL_blastp_output.fasta > ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta