Exercise 2, case study B - Aligning sequences with MAFFT

Authors: BIOI2 & I2BC members

Last updated: 2025-03-13

Foreword

This exercise is identical to the previous one but with a different input and programme. You can use it to check what you’ve learned in the first case study, Exercise 2A (try answering the questions without looking at the answers). If you were quite comfortable with Exercise 2A, you can go directly to Exercise 2C which will introduce you to for loops and job arrays.

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Context

We are studying the human ASF1A protein and would like to find conserved regions in its sequence. To do so, we ran BLAST to search for homologs and downloaded the full-length sequences of all hits. In the following example, we will run the MAFFT programme on this set of sequences to build a multiple sequence alignment (more information on MAFFT). MAFFT is a small programme that doesn’t require a lot of resources (for a relatively small set of sequences) and it’s already installed on the I2BC cluster.

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Objectives

In this case study, you will see the complete step-by-step process of using a given piece of software on the I2BC cluster: looking for it among the installed software, understanding how to use it, running it within a Slurm job and optimising the resources to reserve.

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Setup

Connect to cluster

It’s the same as for Exercise 0: you should be connected to the cluster and on the master node (i.e. slurmlogin should be written in your terminal prefix).

Fetch example files

It’s the same as for the previous case study, Exercise 2A. You can skip this step if you’ve already done it.

If not:

You will need the example files available in Zenodo under this link, or on the Forge Logicielle under this link for those who are familiar with git.

We’ll work in your home directory. Let’s move to it and fetch our working files using wget:

john.doe@slurmlogin:~$ cd /home/john.doe
john.doe@slurmlogin:/home/john.doe$ wget "https://zenodo.org/records/15017630/files/cluster_usage_examples.tar.gz?download=1" -O cluster_usage_examples.tar.gz
john.doe@slurmlogin:/home/john.doe$ tar -zxf cluster_usage_examples.tar.gz
john.doe@slurmlogin:/home/john.doe$ ls cluster_usage_examples/
example_fastqc  example_mafft  example_tmalign

In the example_mafft folder, you’ll see a fasta file with unaligned protein sequences on which we’ll run the mafft programme.

john.doe@slurmlogin:/home/john.doe$ ls cluster_usage_examples/example_mafft/
ASF1_HUMAN_FL_blastp_output.fasta

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Step-by-step

Task 1: Locate the mafft software

The Mafft programme executable is called mafft. Try to find it using the module command.

Hints:

The module command can be used from anywhere on the cluster. The main sub-commands are:

module avail: to list all available software
module load/unload <software name>: to load specific software (for use)
module list: to list currently loaded software

To get more details on options for these subcommands (e.g. to search for a specific name), you can use the -h option to get the help page.

Answer:

john.doe@slurmlogin:/home/john.doe$ module avail -C mafft -i
------------------------------------- /usr/share/modules/modulefiles --------------------------------------
nodes/mafft-7.475

In the above command, we used the -C option to specify a pattern to search for (“mafft”) and -i to make the search case-insensitive.

According to the output, all we have to do is use: module load nodes/mafft-7.475 in order to load mafft version 7.475.

Task 2: Determine how to use the `mafft` executable

Let’s investigate how to use the mafft executable: How do we specify the inputs? What options or parameters can we use? What would the final command line look like to run mafft on your input?

Hints:

for “custom” software like mafft, you can often get a help message with usage examples by using the --help or -h option
you can experiment with the executable in an interactive job on one of the nodes (srun --pty bash)

Answer:

Software listed by module isn’t available on the master node (slurmlogin), so let’s first connect to a node and then load mafft with module:

john.doe@slurmlogin:/home/john.doe$ srun --pty bash
john.doe@node01:/home/john.doe$ module load nodes/mafft-7.475
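
Before reading the help page, it can be worth confirming that the module really made the executable available in your session. The snippet below is a minimal sketch: check_tool is a hypothetical helper name (not a cluster command) and it relies only on the shell built-in command -v.

```shell
# check_tool is a hypothetical helper, not part of Slurm or the module system.
# It reports whether an executable can be found in the current shell session.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 not found: load its module first (e.g. module load nodes/mafft-7.475)" >&2
        return 1
    fi
}

# On a node, after "module load", this should print the path to mafft.
check_tool mafft || echo "mafft is not available in this session"
```

If the second message appears, you are probably still on the master node or the module was not loaded.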

Most programmes come with help or usage messages that you can print on the screen using “man your_programme” or “your_programme --help”. Let’s see if we can access the help menu for mafft:

john.doe@node01:/home/john.doe$ mafft --help
------------------------------------------------------------------------------
  MAFFT v7.475 (2020/Nov/23)
  https://mafft.cbrc.jp/alignment/software/
  MBE 30:772-780 (2013), NAR 30:3059-3066 (2002)
------------------------------------------------------------------------------
High speed:
  % mafft in > out
  % mafft --retree 1 in > out (fast)

High accuracy (for <~200 sequences x <~2,000 aa/nt):
  % mafft --maxiterate 1000 --localpair  in > out (% linsi in > out is also ok)
  % mafft --maxiterate 1000 --genafpair  in > out (% einsi in > out)
  % mafft --maxiterate 1000 --globalpair in > out (% ginsi in > out)

If unsure which option to use:
  % mafft --auto in > out

--op # :         Gap opening penalty, default: 1.53
--ep # :         Offset (works like gap extension penalty), default: 0.0
--maxiterate # : Maximum number of iterative refinement, default: 0
--clustalout :   Output: clustal format, default: fasta
--reorder :      Outorder: aligned, default: input order
--quiet :        Do not report progress
--thread # :     Number of threads (if unsure, --thread -1)
--dash :         Add structural information (Rozewicki et al, submitted)

So the basic usage of mafft in our case would look like this (executable first, then the fasta input file, then the aligned output file after the “>” sign): “mafft cluster_usage_examples/example_mafft/ASF1_HUMAN_FL_blastp_output.fasta > cluster_usage_examples/example_mafft/ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta”

Note: by default, mafft will just print the alignment output to the screen. In order to capture this printed output into a file, we use the redirection sign “>” followed by the name of the file we want to redirect the printed output to.
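
To illustrate what the “>” sign does and does not capture, here is a small sketch that runs anywhere. It uses fake_aligner, a hypothetical stand-in for mafft, on the assumption that the real programme behaves similarly: the alignment goes to standard output and progress messages go to standard error.

```shell
workdir=$(mktemp -d)

# fake_aligner is a hypothetical stand-in for mafft so the example runs anywhere:
# it prints its "alignment" to standard output and a progress message to standard error.
fake_aligner() {
    echo "progress: aligning $1" >&2
    echo ">seq1"
    echo "MAK--VQV"
}

# ">" captures standard output only; "2>" captures standard error separately.
fake_aligner input.fasta > "$workdir/aln.fasta" 2> "$workdir/aln.log"

cat "$workdir/aln.fasta"   # the alignment only
cat "$workdir/aln.log"     # the progress message only
```

In a Slurm job, anything that is not redirected to a file of your own ends up in the Slurm log file instead of on the screen.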

At this point, we know everything we need to know about mafft and its execution. We no longer need to be connected to a node and can now release the resources that we’ve reserved by disconnecting from it:

john.doe@node01:/home/john.doe$ exit 0
john.doe@slurmlogin:/home/john.doe$

Of note: as you can see, the terminal prompt changed again from node01 back to slurmlogin: we’ve returned to the master node (slurmlogin) of the cluster and the job we were running has terminated.

Task 3: Write your submission script

Let’s move to your example_mafft subdirectory first and write the slurm_script.sh in there.

john.doe@slurmlogin:/home/john.doe$ cd cluster_usage_examples/example_mafft/

Hints:

you can use an in-line text editor like nano (but there are other possibilities such as vi, vim or emacs for example). To use nano:
- nano slurm_script.sh will create and open the slurm_script.sh file.
- inside nano, you can use the Ctrl+letter and Shift+letter combinations listed at the bottom of the nano window, e.g. to save and exit (^ = Ctrl, Ctrl+G to see the help message, Ctrl+X to exit), see nano cheat sheet.
a submission script is like a common bash script (terminal language) except that you use #SBATCH-prefixed lines at the head to specify slurm options for submission, see Slurm cheat sheet.

Answer:

The Slurm submission script is written like a common bash script (same language as the terminal). You write in this script all the commands (1 per line) that you would usually type in your terminal. The only particularity is the Slurm submission options, which you can add directly to this script, commonly at the beginning (each option should be preceded by #SBATCH):


#! /bin/bash

#SBATCH --job-name="my_jobname"
#SBATCH --partition=common
#SBATCH --cpus-per-task=1

### prefix start - create temporary directory for your job
export TMPDIR=$(mktemp -d)
### prefix end


module load nodes/mafft-7.475

# This is a comment line - it will be ignored when the script is executed
# Comment lines start with a "#" symbol and can be put anywhere you like
# You can also add a comment at the end of a line, as shown below

cd /home/john.doe/cluster_usage_examples/example_mafft/                                 #|
                                                                                        #| These are your shell commands
mafft ASF1_HUMAN_FL_blastp_output.fasta > ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta   #|


### suffix start - delete created temporary directory
rm -rf $TMPDIR
### suffix end

Explanation of the content:

#! /bin/bash: this is the “shebang”, it specifies the “language” of your script (in this case, the cluster understands that the syntax of this text file is “bash” and will execute it with the /bin/bash executable).
#SBATCH : All lines starting with #SBATCH tell the Slurm job scheduler on the cluster that what follows is information related to the job submission. This is where you specify the Slurm options such as your job name with --job-name or the partition you want to submit your job to with --partition (short form: -p). There are many more options you can specify, see the “cheat sheet” tab on the intranet.
module load will load the software you need (i.e. mafft in this case)
cd /path/to/your/folder: although Slurm starts your job in the directory from which you submitted the script, it’s a good habit to deliberately move to your working directory within the submission script to avoid any nasty surprises… By moving to the directory which contains your input, you won’t need to specify the full path to the input, as you can see in the line of code that follows this statement.
“prefix” and “suffix” blocks: by default, any software that is executed and that uses temporary files will save them to the system default folder specified with $TMPDIR. This folder is sometimes overridden and dealt with automatically by the scheduler (automatic clean-up) but it’s a good habit to do this yourself if you can (to avoid saturating the temporary disk when it’s not dealt with automatically). Here, we use the mktemp command to create a temporary folder with a random name at the beginning of the script and we use rm to delete it at the end of the script.
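
One limitation of the prefix/suffix pattern is that the final rm line is only reached if the script runs to its end. A possible variant, sketched here and not the cluster’s recommended template, registers the clean-up with trap so that it also runs when a command fails. run_with_tmpdir is a hypothetical helper name used for illustration.

```shell
# run_with_tmpdir is a hypothetical helper: it runs the given command with a
# private temporary directory and removes that directory afterwards, even if
# the command fails. The body is in ( ... ) so the trap stays local to a subshell.
run_with_tmpdir() (
    TMPDIR=$(mktemp -d)
    export TMPDIR
    trap 'rm -rf "$TMPDIR"' EXIT
    "$@"
    exit $?
)

# Example: the command sees TMPDIR while it runs; the directory is gone afterwards.
used_dir=$(run_with_tmpdir sh -c 'echo "$TMPDIR"')
if [ -d "$used_dir" ]; then echo "still there"; else echo "cleaned up"; fi
```

In a submission script, the same idea can be applied by placing the trap line right after the export TMPDIR line.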

When you exit the nano text editor, you should see the file created in your current directory:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ ls
ASF1_HUMAN_FL_blastp_output.fasta   slurm_script.sh

Task 4: Submit your script to the cluster

Answer:

To submit a Slurm submission script, all you have to do is:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ sbatch slurm_script.sh
Submitted batch job 287170

This will print your attributed job id on the screen (287170 in this case).
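
If you want to reuse the job id in later commands (squeue, jobinfo, the log file name), you can store it in a variable instead of copying it by hand. A minimal sketch, using the message shown above as a stand-in for a real submission; the variable names are arbitrary:

```shell
# Message as printed by sbatch above, stored in a variable so that the example
# runs without a cluster. On the cluster you could capture it with:
#   submit_msg=$(sbatch slurm_script.sh)
submit_msg="Submitted batch job 287170"

# Keep only what follows the last space, i.e. the job id.
job_id=${submit_msg##* }

echo "job id: $job_id"
echo "log file: slurm-${job_id}.out"
```

sbatch also offers a --parsable option that prints the job id without the surrounding text (see man sbatch).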

You can follow the progression of your job with squeue:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ squeue -j 287170
             JOBID PARTITION       NAME     USER ST       TIME  NODES NODELIST(REASON)
            287170    common my_jobname john.doe  R   00:00:41      1 node06

(you can omit the -j option to show all the jobs currently in the queue, running or pending).

You can learn more about the options for squeue in the manual (type man squeue, navigate with the up/down arrow keys and exit by typing q).
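
Rather than re-running squeue by hand, a small loop can wait until the job has left the queue. This is only a sketch: wait_for_job is a hypothetical helper, it assumes squeue’s -h (no header) and -j options, and the default polling interval of 30 seconds is an arbitrary choice meant to avoid querying the scheduler too often.

```shell
# wait_for_job is a hypothetical helper: it polls squeue until the given job
# no longer appears in the queue (finished, failed or cancelled).
# Usage: wait_for_job <job id> [seconds between checks]
wait_for_job() {
    job_id=$1
    interval=${2:-30}
    # With -h, squeue prints no header, so empty output means the job is gone.
    while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "job $job_id is no longer in the queue"
}
```

On the cluster you would call it as wait_for_job 287170 and then inspect the output files. Note that it does not tell you whether the job succeeded, only that it has ended.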

Task 5: Check if your job finished correctly

Hints:

check if the expected output files were generated (1 fasta file)
have a look at the slurm-xxx.out file (replace xxx with your job id) to see if there were any problems

Answer:

What files do we expect to see? There should be 2 new files in total:

mafft should generate one file: the output file that you specified in your script that will have the aligned sequences (multiple sequence alignment).
the Slurm scheduler should also generate 1 log file with your job id in the name, summarising the error log and the usual output log (it’s what’s usually printed on the screen that Slurm captures in this file instead).

Your job shouldn’t take too long to finish. Once it’s done, you should be able to see the output files in your folder:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ ls
ASF1_HUMAN_FL_blastp_output.fasta     ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta
slurm-287170.out                      slurm_script.sh

mafft output file: ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta
Slurm log file: slurm-287170.out

Let’s have a quick look at the alignment:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ head ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta
>my_query_seq Q9Y294 ASF1_HUMAN
------------------------MAK---------------VQVNNVVVLDNPSPFYNP
FQFEIT--------FECIEDLSE----DLEWKIIYVGSAESEEY--DQVLDSVLVGPVPA
GRHMFVFQADAPNPGLIPDADAVGVTVVLITCTYRGQEFIRVGYYVNNEYTETELRENPP
VKPDFSKLQRNILASNPRVTRFHINWEDNTEKLEDAE-SSNPNLQSLLSTDALPSA-SKG
WSTSENSLNVMLESHMDCM-----------------------------------------
----------
>KAI6071314.1 Histone chaperone ASF1A [Aix galericulata]
------------------------MAK---------------VQVNNVVVLDNPSPFYNP
FQFEIT--------FECIEDLSE----DLEWKIIYVGSAESEEY--DQVLDSVLVGPVPA
[...]

If you’re curious, you can also view your sequence alignment in a more graphical way through the EBI’s MView web server.
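
Beyond looking at the first lines, two quick sanity checks are possible from the terminal: counting the sequences and verifying that all aligned sequences have the same length (which should be the case in a multiple sequence alignment). The sketch below uses a toy alignment invented for the example so that it runs anywhere; on the cluster you would point it at ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta instead.

```shell
workdir=$(mktemp -d)

# Toy alignment made up for this example (two sequences wrapped over two lines).
cat > "$workdir/toy_aln.fasta" <<'EOF'
>seq1
MAK--VQV
NNVV
>seq2
MAKAAVQV
NN-V
EOF

# Each sequence starts with a ">" header line, so counting headers counts sequences.
n_seqs=$(grep -c '^>' "$workdir/toy_aln.fasta")

# Sum the line lengths of each sequence, then keep the distinct totals.
lengths=$(awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
               { len += length($0) }
               END { if (seen) print len }' "$workdir/toy_aln.fasta" | sort -u)

echo "sequences: $n_seqs"
echo "distinct aligned lengths: $lengths"
```

A single value on the last line means every sequence was padded to the same length; several values would suggest a truncated or corrupted file.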

Troubleshooting tips:

Having issues? If you don’t have the output files, then there might be a problem in the execution somewhere. In that case, you can have a look at the log file from Slurm:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ cat slurm-287170.out

Note that the log file is generated by default in the directory in which you ran the sbatch command. You can change this behaviour with the --output (-o) option of sbatch.

Typical error messages are for example:

-bash: mfft: command not found:
This is typical for commands that bash doesn’t know or cannot find. In this case, it’s because we didn’t spell mafft correctly (mfft instead of mafft)
/usr/bin/mafft: Cannot open ASF1_HUMAN_FL_blastp_output.fasta.:
As stated, mafft cannot find the input that you gave it. It could be linked to a typo in the name or could be because mafft didn’t find it in your current working directory (double-check you’re in the right directory or double-check your input paths).
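
To catch the second kind of error early, you could test the input file yourself before calling the programme. A minimal sketch, where require_input is a hypothetical helper name and the demo files are created only for the example:

```shell
# require_input is a hypothetical helper: it stops early with a clear message
# if a file is missing or empty, instead of letting the programme fail later.
require_input() {
    if [ ! -s "$1" ]; then
        echo "ERROR: input file '$1' is missing or empty (current directory: $(pwd))" >&2
        return 1
    fi
}

# In a submission script you would call it right before mafft, e.g.:
#   require_input ASF1_HUMAN_FL_blastp_output.fasta || exit 1
demo_dir=$(mktemp -d)
printf '>seq1\nMAK\n' > "$demo_dir/ok.fasta"
require_input "$demo_dir/ok.fasta" && echo "input looks fine"
require_input "$demo_dir/missing.fasta" || echo "caught the missing file"
```

Printing the current directory in the message helps to spot a wrong cd in the script.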

Task 6: Analyse your resource consumption to optimise your next run

Analyse your actual resource consumption: How much memory did you effectively use while running the job? How long did your job take to finish? How much CPU percentage did you use?

This is useful to know in order to adapt the resources you ask for in future jobs of a similar kind (e.g. other mafft submissions).

Hints:

you can use the jobinfo command from the slurm-tools toolkit for this (warning: it is not a native Slurm command)

Answer:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ jobinfo 287170
Job ID               : 287170
Job name             : my_jobname
User                 : john.doe
Account              :
Working directory    : /home/john.doe/cluster_usage_examples/example_mafft
Cluster              : cluster
Partition            : common
Nodes                : 1
Nodelist             : node06
Tasks                : 1
CPUs                 : 1
GPUs                 : 0
State                : COMPLETED
Exit code            : 0:0
Submit time          : 2025-03-11T11:03:43
Start time           : 2025-03-11T11:03:44
End time             : 2025-03-11T11:03:51
Wait time            :     00:00:01
Reserved walltime    :     02:00:00
Used walltime        :     00:00:07
Used CPU walltime    :     00:00:07
Used CPU time        :     00:00:04
CPU efficiency       : 66.00%
% User (computation) : 94.89%
% System (I/O)       :  5.11%
Reserved memory      : 1000M/core
Max memory used      : 2.85M (estimate)
Memory efficiency    :  0.29%
Max disk write       : 0.00
Max disk read        : 0.00

The lines you should look at:

For CPUs:
- “CPU efficiency” indicates how efficiently you used the CPUs you’ve reserved, this should be as close to 100% as possible
For memory:
- “Memory efficiency” indicates how efficiently you used the memory you’ve reserved, this should be as close to 100% as possible
- “Max memory used” is an estimate of the actual RAM you used
For walltime:
- “Used walltime” indicates how long your job ran
- “Used CPU walltime” is the run time multiplied by the number of reserved CPUs (i.e. approximately the time it would have taken on a single CPU)

Adjustments to make:

For CPUs: we only used 66% of the allocated CPU but we can’t ask for less than 1 CPU so there’s nothing to be done in this case.
For memory: we only used 0.29% (= 2.85 MB) of the reserved memory. A rule of thumb is to ask for at least 10% more than the estimated memory usage. Here, we’ll request 100 MB, which leaves a very comfortable margin.
For walltime: adjusting the walltime is not as important as the previous 2 (the job will finish as soon as the commands within the script have finished running) but it can be necessary for jobs that might take longer than the default 2hrs. In this case, our job only took 7s so we could reduce the walltime to 10mins for example.

Warning: it’s important to keep in mind that run times and resource usage also depend on the commands you run and on the size of your input.
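
The arithmetic behind these adjustments can be sketched in a few lines of shell. The values are copied from the jobinfo output above; the variable names and the rounding up to a whole number of MB are choices made for this example.

```shell
# Values copied from the jobinfo output above.
max_mem_used_mb=2.85      # "Max memory used"
used_walltime=00:00:07    # "Used walltime"

# Rule of thumb from the text: at least 10% above the measured peak,
# rounded up here to a whole number of MB.
min_mem_mb=$(awk -v used="$max_mem_used_mb" 'BEGIN {
    need = used * 1.10
    rounded = int(need)
    if (need > rounded) rounded++
    print rounded
}')

# Convert hh:mm:ss to seconds (awk avoids the shell reading "07" or "08" as octal).
used_seconds=$(echo "$used_walltime" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }')

echo "minimum memory to request: ${min_mem_mb}M"
echo "walltime used: ${used_seconds}s"
```

The result is a lower bound only: the 100 MB chosen above is deliberately much larger, which costs little and protects against larger inputs.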

Task 7: Adjust the submission script with the previously-identified & required resources

Now try optimising your script to reserve no more than the resources you actually need to run mafft. Your colleagues will be thankful ;-)

Hint: see Slurm cheat sheet for a list of all options for sbatch

Answer:

Your script could look like this:


#! /bin/bash

#SBATCH --job-name="my_jobname"
#SBATCH --partition=common
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M
#SBATCH --time=00:10:00

### prefix start - create temporary directory for your job
export TMPDIR=$(mktemp -d)
### prefix end


module load nodes/mafft-7.475

# This is a comment line - it will be ignored when the script is executed

cd /home/john.doe/cluster_usage_examples/example_mafft/                                 #|
                                                                                        #| These are your shell commands
mafft ASF1_HUMAN_FL_blastp_output.fasta > ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta   #|


### suffix start - delete created temporary directory
rm -rf $TMPDIR
### suffix end

NB: only the #SBATCH lines were changed…
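
Before resubmitting, you may want to check the edited script for typos without running it. The sketch below writes a condensed copy of the script above to a temporary location (so that it can be tried anywhere), checks its syntax with bash -n and lists the options that Slurm will read. The temporary path is an assumption of the example; on the cluster you would run the two checks directly on your slurm_script.sh.

```shell
check_dir=$(mktemp -d)

# Condensed copy of the adjusted script, written to a temporary location for the demo.
cat > "$check_dir/slurm_script.sh" <<'EOF'
#! /bin/bash

#SBATCH --job-name="my_jobname"
#SBATCH --partition=common
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M
#SBATCH --time=00:10:00

export TMPDIR=$(mktemp -d)
module load nodes/mafft-7.475
cd /home/john.doe/cluster_usage_examples/example_mafft/
mafft ASF1_HUMAN_FL_blastp_output.fasta > ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta
rm -rf $TMPDIR
EOF

# bash -n parses the script without executing any command in it.
bash -n "$check_dir/slurm_script.sh" && echo "syntax OK"

# List the options Slurm will read from the script.
grep '^#SBATCH' "$check_dir/slurm_script.sh"
n_options=$(grep -c '^#SBATCH' "$check_dir/slurm_script.sh")
echo "$n_options #SBATCH lines found"
```

bash -n only detects shell syntax errors; a misspelled Slurm option would still be reported by sbatch at submission time.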

⁕ ⁕ ⁕ ⁕ ⁕ ⁕
