Exercise 2, case study B - Aligning sequences with MAFFT

Authors: BIOI2 & I2BC members

Last updated: 2025-03-13

Foreword

This exercise is identical to the previous one but with a different input and programme. You can use it to check what you’ve learned in the first case study, Exercise 2A (try answering the questions without looking at the answers). If you were quite comfortable with Exercise 2A, you can go directly to Exercise 2C which will introduce you to for loops and job arrays.

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Context

We are studying the human ASF1A protein and would like to find conserved regions in its sequence. To do so, we ran BLAST to search for homologs and downloaded the full-length sequences of all hits. In the following example, we will run the MAFFT programme on this set of sequences to align them with one another (more information on MAFFT). MAFFT is a small programme that doesn’t require a lot of resources (for a relatively small set of sequences) and it’s already installed on the I2BC cluster.

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Objectives

In this case study, you will go through the complete step-by-step process of using a given piece of software on the I2BC cluster: looking for it among the installed software, understanding how to use it, running it within a Slurm job and optimising the resources to reserve.

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Setup

Connect to cluster

It’s the same as for Exercise 0: you should be connected to the cluster and on the master node (i.e. slurmlogin should be written in your terminal prefix).

Fetch example files

It’s the same as for the previous case study, Exercise 2A. You can skip this step if you’ve already done it.

If not:

You will need the example files available in Zenodo under this link, or on the Forge Logicielle under this link for those who are familiar with git.

We’ll work in your home directory. Let’s move to it and fetch our working files using wget:

john.doe@slurmlogin:~$ cd /home/john.doe
john.doe@slurmlogin:/home/john.doe$ wget "https://zenodo.org/records/15017630/files/cluster_usage_examples.tar.gz?download=1" -O cluster_usage_examples.tar.gz
john.doe@slurmlogin:/home/john.doe$ tar -zxf cluster_usage_examples.tar.gz
john.doe@slurmlogin:/home/john.doe$ ls cluster_usage_examples/
example_fastqc  example_mafft  example_tmalign

In the example_mafft folder, you’ll see a fasta file with unaligned protein sequences on which we’ll run the mafft programme.

john.doe@slurmlogin:/home/john.doe$ ls cluster_usage_examples/example_mafft/
ASF1_HUMAN_FL_blastp_output.fasta

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Step-by-step

Task 1: Locate the mafft software

The MAFFT programme’s executable is called mafft. Try to find it using the module command.

Hints:

The module command can be used from anywhere on the cluster. The main sub-commands are:

To get more details on options for these subcommands (e.g. to search for a specific name), you can use the -h option to get the help page.

Answer:
john.doe@slurmlogin:/home/john.doe$ module avail -C mafft -i
------------------------------------- /usr/share/modules/modulefiles --------------------------------------
nodes/mafft-7.475

In the above command, we used the -C option to specify a pattern to search for (“mafft”) and -i to make the search case-insensitive.

According to the output, all we have to do is use: module load nodes/mafft-7.475 in order to load mafft version 7.475.
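Once a module has been loaded, a quick way to confirm that its executable is reachable is to look it up on the PATH. The snippet below is only a minimal sketch: the helper name tool_status is hypothetical (it is not part of the cluster tooling), and mafft should only be reported as available on a compute node after the module load command above has been run.

```shell
# Hypothetical helper (not a cluster command): reports whether an
# executable can be found on the current PATH.
tool_status() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: available"
    else
        echo "$1: not found"
    fi
}

# bash is always there; mafft is only found once its module has been loaded
tool_status bash
tool_status mafft
```

If mafft is reported as not found, the module has most likely not been loaded in the current session.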

Task 2: Determine how to use the mafft executable

Let’s investigate how to use the mafft executable: How do we specify the inputs? What options or parameters can we use? What would the final command line look like to run mafft on your input?

Hints:

Answer:

  1. Software listed by module isn’t available on the master node (slurmlogin), so let’s first connect to a node and then load mafft with module:
john.doe@slurmlogin:/home/john.doe$ srun --pty bash
john.doe@node01:/home/john.doe$ module load nodes/mafft-7.475
  2. Most programmes come with help or usage messages that you can print on the screen using “man your_programme” or “your_programme --help”. Let’s see if we can access the help menu for mafft:
john.doe@node01:/home/john.doe$ mafft --help
------------------------------------------------------------------------------
  MAFFT v7.475 (2020/Nov/23)
  https://mafft.cbrc.jp/alignment/software/
  MBE 30:772-780 (2013), NAR 30:3059-3066 (2002)
------------------------------------------------------------------------------
High speed:
  % mafft in > out
  % mafft --retree 1 in > out (fast)

High accuracy (for <~200 sequences x <~2,000 aa/nt):
  % mafft --maxiterate 1000 --localpair  in > out (% linsi in > out is also ok)
  % mafft --maxiterate 1000 --genafpair  in > out (% einsi in > out)
  % mafft --maxiterate 1000 --globalpair in > out (% ginsi in > out)

If unsure which option to use:
  % mafft --auto in > out

--op # :         Gap opening penalty, default: 1.53
--ep # :         Offset (works like gap extension penalty), default: 0.0
--maxiterate # : Maximum number of iterative refinement, default: 0
--clustalout :   Output: clustal format, default: fasta
--reorder :      Outorder: aligned, default: input order
--quiet :        Do not report progress
--thread # :     Number of threads (if unsure, --thread -1)
--dash :         Add structural information (Rozewicki et al, submitted)

So the basic usage of mafft in our case would look like this (the executable first, then the fasta input file, then the aligned output file after the “>” sign): mafft cluster_usage_examples/example_mafft/ASF1_HUMAN_FL_blastp_output.fasta > cluster_usage_examples/example_mafft/ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta

Note: by default, mafft will just print the alignment output to the screen. In order to capture this printed output into a file, we use the redirection sign “>” followed by the name of the file we want to redirect the printed output to.

  3. At this point, we know everything we need to know about mafft and its execution. We no longer need to be connected to a node and can now free the resources that we’ve blocked by disconnecting from it:
john.doe@node01:/home/john.doe$ exit 0
john.doe@slurmlogin:/home/john.doe$

Of note: as you can see, the terminal prompt changed again from node01 back to slurmlogin: we’ve returned to the master node (slurmlogin) of the cluster and the job we were running has terminated.
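The redirection described in the note above can be tried out without mafft. The sketch below uses a stand-in function, fake_aligner (hypothetical, it is NOT mafft), which writes a result to standard output and a progress message to standard error. It illustrates that “>” only captures standard output, while “2>” can be used to keep progress messages in a separate file.

```shell
# Stand-in for a real aligner (hypothetical function, NOT mafft itself):
# the "result" goes to standard output, the progress message to standard error.
fake_aligner() {
    echo "progress: aligning $1" >&2
    echo ">seq1"
    echo "MAK--VQV"
}

# Temporary directory so that the example does not touch your own files
workdir=$(mktemp -d)

# ">" captures standard output only; "2>" captures standard error separately
fake_aligner input.fasta > "$workdir/aligned.fasta" 2> "$workdir/progress.log"

cat "$workdir/aligned.fasta"
cat "$workdir/progress.log"
```

Within a Slurm job, anything that is not redirected to a file ends up in the Slurm log file instead of on your screen.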

Task 3: Write your submission script

Let’s move to your example_mafft subdirectory first and write the slurm_script.sh in there.

john.doe@slurmlogin:/home/john.doe$ cd cluster_usage_examples/example_mafft/

Hints:

Answer:

The Slurm submission script is written like a common bash script (same language as the terminal). In this script, you write all the commands (1 per line) that you would usually type in your terminal. The only particularity is the Slurm submission options, which you can add directly to this script, commonly at the beginning (each option should be preceded by #SBATCH):


#! /bin/bash

#SBATCH --job-name="my_jobname"
#SBATCH --partition=common
#SBATCH --cpus-per-task=1

### prefix start - create temporary directory for your job
export TMPDIR=$(mktemp -d)
### prefix end


module load nodes/mafft-7.475

# This is a comment line - it will be ignored when the script is executed
# Comment lines start with a "#" symbol and can be put anywhere you like
# You can also add a comment at the end of a line, as shown below

cd /home/john.doe/cluster_usage_examples/example_mafft/                                 #|
                                                                                        #| These are your shell commands
mafft ASF1_HUMAN_FL_blastp_output.fasta > ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta   #|


### suffix start - delete created temporary directory
rm -rf $TMPDIR
### suffix end

Explanation of the content:

When you exit the nano text editor, you should see the file created in your current directory:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ ls
ASF1_HUMAN_FL_blastp_output.fasta   slurm_script.sh
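Before submitting, it can be worth checking that the script has no shell syntax errors. A minimal sketch, assuming bash is available: bash -n parses a script without executing any of its commands. The throw-away demo script below is made up for illustration; on the cluster you would run the check directly on slurm_script.sh.

```shell
# Create a small throw-away script to illustrate the check
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
#! /bin/bash
#SBATCH --job-name="my_jobname"
echo "hello from the job"
EOF

# bash -n reads and parses the script but does not run it
if bash -n "$demo_script"; then
    syntax_check="ok"
else
    syntax_check="syntax error"
fi
echo "Syntax check: $syntax_check"
```

Note that this only catches shell syntax problems (e.g. an unclosed quote). It does not validate the #SBATCH options, nor does it check that the programmes you call exist.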

Task 4: Submit your script to the cluster

Answer:

To submit a Slurm submission script, all you have to do is:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ sbatch slurm_script.sh
Submitted batch job 287170

This will print your attributed job id on the screen (287170 in this case).

You can follow the progression of your job with squeue:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ squeue -j 287170
             JOBID PARTITION       NAME     USER ST       TIME  NODES NODELIST(REASON)
            287170    common my_jobname john.doe  R   00:00:41      1 node06

(you can omit the -j option to show all the currently running jobs).

You can learn more about the options for squeue in the manual (type man squeue, navigate with the up/down arrow keys and exit by typing q).
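If you want to reuse the job id in later commands, you can extract it from the message printed by sbatch. The sketch below works on a copy of the message shown above, so it runs without Slurm; the variable names are arbitrary.

```shell
# Example line as printed by sbatch in the text above
sbatch_output="Submitted batch job 287170"

# Remove everything up to the last space, keeping only the job id
job_id="${sbatch_output##* }"
echo "Job ID: $job_id"

# On the cluster, it could then be used as: squeue -j "$job_id"
```

On the cluster, the first line would typically become sbatch_output=$(sbatch slurm_script.sh). sbatch also has a --parsable option, which should print the job id without the surrounding text.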

Task 5: Check if your job finished correctly

Hints:

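Answer:

What files do we expect to see? There should be 2 new files in total: the alignment output (ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta) and the Slurm log file (slurm-287170.out, named after your job id).

Your job shouldn’t take too long to finish. Once it has, you should be able to see the output files in your folder: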

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ ls
ASF1_HUMAN_FL_blastp_output.fasta     ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta
slurm-287170.out                      slurm_script.sh

Let’s have a quick look at the alignment:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ head ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta
>my_query_seq Q9Y294 ASF1_HUMAN
------------------------MAK---------------VQVNNVVVLDNPSPFYNP
FQFEIT--------FECIEDLSE----DLEWKIIYVGSAESEEY--DQVLDSVLVGPVPA
GRHMFVFQADAPNPGLIPDADAVGVTVVLITCTYRGQEFIRVGYYVNNEYTETELRENPP
VKPDFSKLQRNILASNPRVTRFHINWEDNTEKLEDAE-SSNPNLQSLLSTDALPSA-SKG
WSTSENSLNVMLESHMDCM-----------------------------------------
----------
>KAI6071314.1 Histone chaperone ASF1A [Aix galericulata]
------------------------MAK---------------VQVNNVVVLDNPSPFYNP
FQFEIT--------FECIEDLSE----DLEWKIIYVGSAESEEY--DQVLDSVLVGPVPA
[...]

If you’re curious, you can also view your sequence alignment in a more graphical way through the EBI’s MView web server.
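Another simple sanity check is to compare the number of sequences in the input and in the alignment: each fasta record starts with a “>” header line, so counting those lines counts the sequences. The sketch below uses two tiny made-up files so that it can be run anywhere; on the cluster you would point grep at the two ASF1_HUMAN_FL_blastp_output files instead.

```shell
workdir=$(mktemp -d)

# Toy stand-ins for the unaligned input and the aligned output (made-up sequences)
printf '>seqA\nMAKVQV\n>seqB\nMAKQV\n' > "$workdir/input.fasta"
printf '>seqA\nMAKVQV\n>seqB\nMAK-QV\n' > "$workdir/aligned.fasta"

# Count the header lines (those starting with ">") in each file
n_in=$(grep -c '^>' "$workdir/input.fasta")
n_out=$(grep -c '^>' "$workdir/aligned.fasta")
echo "input: $n_in sequences, alignment: $n_out sequences"

# The alignment should be non-empty and contain as many sequences as the input
if [ -s "$workdir/aligned.fasta" ] && [ "$n_in" -eq "$n_out" ]; then
    check="ok"
else
    check="mismatch"
fi
echo "check: $check"
```

A mismatch, or an empty alignment file, usually means the job stopped early, in which case the Slurm log file is the next place to look.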

Troubleshooting tips:

Having issues? If you don’t have the output files, then there might be a problem in the execution somewhere. In that case, you can have a look at the log file from Slurm:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ cat slurm-287170.out

Note that the log file is generated by default in the directory in which you ran the sbatch command. There is an option in sbatch with which you can change this behaviour.
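The sbatch option in question is --output (with --error as its counterpart for error messages, which otherwise go to the same file). As a hedged illustration, the line below could be added next to the other #SBATCH lines of the submission script; the file name mafft_%j.log is only an example, and %j is the placeholder that Slurm replaces with the job id.

```shell
# Example log file name (hypothetical); %j is replaced by the job id,
# so job 287170 would write its log to mafft_287170.log
#SBATCH --output=mafft_%j.log
```

If you point --output to a file in another directory, make sure that directory exists before submitting.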

Typical error messages are for example:

Task 6: Analyse your resource consumption to optimise your next run

Analyse your actual resource consumption: How much memory did you effectively use while running the job? How long did your job take to finish? How much CPU percentage did you use?

This is useful to know in order to adapt the resources you ask for in future jobs of a similar kind (e.g. other mafft submissions).

Hints:

Answer:
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ jobinfo 287170
Job ID               : 287170
Job name             : my_jobname
User                 : john.doe
Account              :
Working directory    : /home/john.doe/cluster_usage_examples/example_mafft
Cluster              : cluster
Partition            : common
Nodes                : 1
Nodelist             : node06
Tasks                : 1
CPUs                 : 1
GPUs                 : 0
State                : COMPLETED
Exit code            : 0:0
Submit time          : 2025-03-11T11:03:43
Start time           : 2025-03-11T11:03:44
End time             : 2025-03-11T11:03:51
Wait time            :     00:00:01
Reserved walltime    :     02:00:00
Used walltime        :     00:00:07
Used CPU walltime    :     00:00:07
Used CPU time        :     00:00:04
CPU efficiency       : 66.00%
% User (computation) : 94.89%
% System (I/O)       :  5.11%
Reserved memory      : 1000M/core
Max memory used      : 2.85M (estimate)
Memory efficiency    :  0.29%
Max disk write       : 0.00
Max disk read        : 0.00

The lines you should look at:

Adjustments to make:

Important: keep in mind that run times and resource usage also depend on the commands you run and on the size of your input.
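To put the reserved and used walltime from the jobinfo report side by side, the HH:MM:SS values can be converted to seconds. This is only a rough sketch: to_seconds is a hypothetical helper name, and it assumes plain HH:MM:SS durations (no days prefix such as 1-00:00:00).

```shell
# Convert a HH:MM:SS duration to seconds (hypothetical helper name).
# awk treats the fields as decimal numbers, so "07" is read as 7.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

reserved=$(to_seconds "02:00:00")   # "Reserved walltime" in the report above
used=$(to_seconds "00:00:07")       # "Used walltime" in the report above

echo "reserved: ${reserved}s, used: ${used}s"
echo "the reservation was roughly $(( reserved / used )) times longer than needed"
```

When choosing a new time limit, it is sensible to leave a comfortable margin above the measured run time, since a job that exceeds its reserved walltime is stopped by Slurm.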

Task 7: Adjust the submission script with the previously-identified & required resources

Now try optimising your script to reserve no more than the resources you actually need to run mafft. Your colleagues will be thankful ;-)

Hint: see Slurm cheat sheet for a list of all options for sbatch

Answer:

Your script could look like this:


#! /bin/bash

#SBATCH --job-name="my_jobname"
#SBATCH --partition=common
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M
#SBATCH --time=00:10:00

### prefix start - create temporary directory for your job
export TMPDIR=$(mktemp -d)
### prefix end


module load nodes/mafft-7.475

# This is a comment line - it will be ignored when the script is executed

cd /home/john.doe/cluster_usage_examples/example_mafft/                                 #|
                                                                                        #| These are your shell commands
mafft ASF1_HUMAN_FL_blastp_output.fasta > ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta   #|


### suffix start - delete created temporary directory
rm -rf $TMPDIR
### suffix end

NB: only the #SBATCH lines were changed…
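To review at a glance which resources a script asks for, you can list its #SBATCH lines with grep. The sketch below runs on a throw-away copy of the header shown above; on the cluster you would run grep directly on slurm_script.sh.

```shell
# Throw-away script reproducing the header of the submission script above
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
#! /bin/bash

#SBATCH --job-name="my_jobname"
#SBATCH --partition=common
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M
#SBATCH --time=00:10:00

echo "commands go here"
EOF

# Show only the lines that Slurm interprets as submission options
grep '^#SBATCH' "$demo_script"

# Count them
n_options=$(grep -c '^#SBATCH' "$demo_script")
echo "$n_options Slurm options found"
```

The same options can also be given on the command line (e.g. sbatch --mem=100M slurm_script.sh), in which case they take precedence over the #SBATCH lines in the script.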

Take home message

When discovering a new tool and wanting to use it on the cluster:

  1. search for your software with module
  2. write the commands that you want to run within a bash script, don’t forget module load at the top
  3. Slurm options can be specified within this script using #SBATCH-prefixed lines (see Slurm cheat sheet for a list of all options)
  4. submit the script with sbatch
  5. follow your job with squeue
  6. check if your job worked by searching for expected output files & looking at the slurm log
  7. analyse the resources used with jobinfo, so you can adjust future runs

⁕ ⁕ ⁕ ⁕ ⁕ ⁕