Last updated: 2025-03-13
This exercise is identical to the previous one but with a different input and programme. You can use it to check what you’ve learned in the first case study, Exercise 2A (try answering the questions without looking at the answers). If you were quite comfortable with Exercise 2A, you can go directly to Exercise 2C, which will introduce you to for loops and job arrays.
⁕ ⁕ ⁕ ⁕ ⁕ ⁕
We are studying the human ASF1A protein and would like to find conserved regions in its sequence. So we ran BLAST to search for homologs and downloaded the full-length sequences of all hits. In the following example, we will try to run the MAFFT programme on this set of sequences to align them with each other (more information on MAFFT). MAFFT is a small programme that doesn’t require a lot of resources (for a relatively small set of sequences) and it’s already installed on the I2BC cluster.
⁕ ⁕ ⁕ ⁕ ⁕ ⁕
In this case study, you will see the complete step-by-step process of using a given piece of software on the I2BC cluster: looking for it among the installed software, understanding how to use it, running it within a Slurm job and optimising the resources to reserve.
⁕ ⁕ ⁕ ⁕ ⁕ ⁕
It’s the same as for Exercise 0: you should be connected to the cluster and on the master node (i.e. slurmlogin should be written in your terminal prefix).
It’s the same as for the previous case study, Exercise 2A. You can skip this step if you’ve already done it.
If not:
You will need the example files available in Zenodo under this link, or on the Forge Logicielle under this link for those who are familiar with git.
We’ll work in your home directory. Let’s move to it and fetch our working files using wget:
john.doe@slurmlogin:~$ cd /home/john.doe
john.doe@slurmlogin:/home/john.doe$ wget "https://zenodo.org/records/15017630/files/cluster_usage_examples.tar.gz?download=1" -O cluster_usage_examples.tar.gz
john.doe@slurmlogin:/home/john.doe$ tar -zxf cluster_usage_examples.tar.gz
john.doe@slurmlogin:/home/john.doe$ ls cluster_usage_examples/
example_fastqc example_mafft example_tmalign
In the example_mafft folder, you’ll see a fasta file with unaligned protein sequences on which we’ll run the mafft programme.
john.doe@slurmlogin:/home/john.doe$ ls cluster_usage_examples/example_mafft/
ASF1_HUMAN_FL_blastp_output.fasta
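If you’re curious about the input before going further, you can peek at the file with head and count the sequences with grep (commands shown without their output, since that depends on the file content):
john.doe@slurmlogin:/home/john.doe$ head cluster_usage_examples/example_mafft/ASF1_HUMAN_FL_blastp_output.fasta
john.doe@slurmlogin:/home/john.doe$ grep -c ">" cluster_usage_examples/example_mafft/ASF1_HUMAN_FL_blastp_output.fasta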
⁕ ⁕ ⁕ ⁕ ⁕ ⁕
The MAFFT programme executable is called mafft. Try to find it using the module command.
The module command can be used from anywhere on the cluster. The main sub-commands are:
module avail: to list all available software
module load/unload <software name>: to load or unload a specific software (for use)
module list: to list currently loaded software
To get more details on options for these subcommands (e.g. to search for a specific name), you can use the -h option to get the help page.
john.doe@slurmlogin:/home/john.doe$ module avail -C mafft -i
------------------------------------- /usr/share/modules/modulefiles --------------------------------------
nodes/mafft-7.475
In the above command, we used the -C option to specify a pattern to search for (“mafft”) and -i to make the search case-insensitive.
According to the output, all we have to do is use: module load nodes/mafft-7.475 in order to load mafft version 7.475.
The mafft executable
Let’s investigate how to use the mafft executable: How do we specify the inputs? What options or parameters can we use? What would the final command line look like to run mafft on your input?
Hints:
for command-line programmes like mafft, you can often get a help message with usage examples by using the --help or -h options
remember that programmes should be run on a compute node, not on the master node (srun --pty bash)
Programmes loaded with module aren’t available on the master node (slurmlogin), so let’s first connect to a node and then load mafft with module:
:john.doe@slurmlogin:/home/john.doe$ srun --pty bash
john.doe@node01:/home/john.doe$ module load nodes/mafft-7.475
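You can check that the module was loaded correctly with module list, and which should now point to the mafft executable (the exact path depends on the installation, so no output is shown here):
john.doe@node01:/home/john.doe$ module list
john.doe@node01:/home/john.doe$ which mafft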
Most programmes have a help page accessible with “man your_programme” or “your_programme --help”. Let’s see if we can access the help menu for mafft:
:john.doe@node01:/home/john.doe$ mafft --help
------------------------------------------------------------------------------
MAFFT v7.475 (2020/Nov/23)
https://mafft.cbrc.jp/alignment/software/
MBE 30:772-780 (2013), NAR 30:3059-3066 (2002)
------------------------------------------------------------------------------
High speed:
% mafft in > out
% mafft --retree 1 in > out (fast)
High accuracy (for <~200 sequences x <~2,000 aa/nt):
% mafft --maxiterate 1000 --localpair in > out (% linsi in > out is also ok)
% mafft --maxiterate 1000 --genafpair in > out (% einsi in > out)
% mafft --maxiterate 1000 --globalpair in > out (% ginsi in > out)
If unsure which option to use:
% mafft --auto in > out
--op # : Gap opening penalty, default: 1.53
--ep # : Offset (works like gap extension penalty), default: 0.0
--maxiterate # : Maximum number of iterative refinement, default: 0
--clustalout : Output: clustal format, default: fasta
--reorder : Outorder: aligned, default: input order
--quiet : Do not report progress
--thread # : Number of threads (if unsure, --thread -1)
--dash : Add structural information (Rozewicki et al, submitted)
So the basic usage of mafft in our case would look like this (the executable, then the fasta input file, then the redirection into the aligned output file):
mafft cluster_usage_examples/example_mafft/ASF1_HUMAN_FL_blastp_output.fasta > cluster_usage_examples/example_mafft/ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta
Note: by default, mafft will just print the alignment output to the screen. In order to capture this printed output into a file, we use the redirection sign “>” followed by the name of the file we want to redirect the printed output to.
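The same redirection works for any command. As a quick illustration with echo (a toy demonstration unrelated to mafft; the file name test.txt is just for this example):
john.doe@node01:/home/john.doe$ echo "hello" > test.txt
john.doe@node01:/home/john.doe$ cat test.txt
hello
Nothing was printed to the screen by the first command because the output went into test.txt instead.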
We now know enough about mafft and its execution. We no longer need to be connected to a node and can now free up the resources that we’ve blocked by disconnecting from it:
disconnecting from it:john.doe@node01:/home/john.doe$ exit 0
john.doe@slurmlogin:/home/john.doe$
Of note: as you can see, the terminal prompt changed again from node01 back to slurmlogin: we’ve returned to the master node (slurmlogin) of the cluster and the job we were running has terminated.
Let’s move to your example_mafft subdirectory first and write the slurm_script.sh in there.
john.doe@slurmlogin:/home/john.doe$ cd cluster_usage_examples/example_mafft/
We’ll use the nano text editor here (but there are other possibilities such as vi, vim or emacs for example). To use nano:
nano slurm_script.sh will create and open the slurm_script.sh file
^ = Ctrl (e.g. Ctrl+G to see the help message, Ctrl+X to exit), see the nano cheat sheet
add #SBATCH-prefixed lines at the head of the script to specify slurm options for submission, see the Slurm cheat sheet
The Slurm submission script is written like a common bash script (same language as the terminal). You write in this script all the commands (1 per line) that you would usually type in your terminal. The only particularity is the Slurm submission options, which you can add directly to this script, commonly at the beginning (each option should be preceded by #SBATCH):
#! /bin/bash
#SBATCH --job-name="my_jobname"
#SBATCH --partition=common
#SBATCH --cpus-per-task=1
### prefix start - create temporary directory for your job
export TMPDIR=$(mktemp -d)
### prefix end
module load nodes/mafft-7.475
# This is a comment line - it will be ignored when the script is executed
# Comment lines start with a "#" symbol and can be put anywhere you like
# You can also add a comment at the end of a line, as shown below
cd /home/john.doe/cluster_usage_examples/example_mafft/ #|
#| These are your shell commands
mafft ASF1_HUMAN_FL_blastp_output.fasta > ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta #|
### suffix start - delete created temporary directory
rm -rf $TMPDIR
### suffix end
Explanation of the content:
#! /bin/bash: this is the “shebang”, it specifies the “language” of your script (in this case, the cluster understands that the syntax of this text file is “bash” and will execute it with the /bin/bash executable).
#SBATCH: all lines starting with #SBATCH indicate to the Slurm job scheduler on the cluster that the following information is related to the job submission. This is where you specify the slurm options such as your job name with --job-name or the partition you want to submit your job to with -p. There are many more options you can specify, see the “cheat sheet” tab on the intranet.
module load will load the software you need (i.e. mafft in this case).
cd /path/to/your/folder: although Slurm places you in the same directory in which you’ve submitted the slurm script, it’s a good habit to deliberately move to your working directory within the submission script to avoid any nasty surprises… By moving to the directory which contains your input, you won’t need to specify the full path to the input, as you can see in the line of code that follows this statement.
The prefix and suffix blocks create and then delete a temporary directory stored in $TMPDIR. This folder is sometimes overridden and dealt with automatically by the scheduler (automatic clean up) but it’s a good habit to do this yourself if you can (to avoid saturating the temporary disk when it’s not dealt with automatically). Here, we use the mktemp command to create a temporary folder with a random name at the beginning of the script and we use rm to delete it at the end of the script.
When you exit the nano text editor, you should see the file created in your current directory:
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ ls
ASF1_HUMAN_FL_blastp_output.fasta slurm_script.sh
To submit a Slurm submission script, all you have to do is:
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ sbatch slurm_script.sh
Submitted batch job 287170
This will print your attributed job id on the screen (287170 in this case).
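As an aside (not needed for this exercise), sbatch also has a --parsable option that prints only the job id, which is handy if you want to capture it in a shell variable:
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ jobid=$(sbatch --parsable slurm_script.sh)
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ echo $jobid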
You can follow the progression of your job with squeue:
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ squeue -j 287170
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
287170 common my_jobname john.doe R 00:00:41 1 node06
(you can omit the -j option to show all the currently running jobs).
You can learn more about the options for squeue in the manual (type man squeue, navigate with the up/down arrow keys and exit by typing q).
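Another commonly useful filter is by user; for example, to list only your own jobs (same output format as in the example above):
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ squeue -u john.doe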
Hints:
you can check the slurm-xxx.out file (replace xxx with your job id) to see if there were any problems
What files do we expect to see? There should be 2 new files in total.
Your job shouldn’t take too long to finish; once it’s done, you should be able to see the output files in your folder:
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ ls
ASF1_HUMAN_FL_blastp_output.fasta ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta
slurm-287170.out slurm_script.sh
ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta: the alignment produced by mafft
slurm-287170.out: the log file generated by Slurm
Let’s have a quick look at the alignment:
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ head ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta
>my_query_seq Q9Y294 ASF1_HUMAN
------------------------MAK---------------VQVNNVVVLDNPSPFYNP
FQFEIT--------FECIEDLSE----DLEWKIIYVGSAESEEY--DQVLDSVLVGPVPA
GRHMFVFQADAPNPGLIPDADAVGVTVVLITCTYRGQEFIRVGYYVNNEYTETELRENPP
VKPDFSKLQRNILASNPRVTRFHINWEDNTEKLEDAE-SSNPNLQSLLSTDALPSA-SKG
WSTSENSLNVMLESHMDCM-----------------------------------------
----------
>KAI6071314.1 Histone chaperone ASF1A [Aix galericulata]
------------------------MAK---------------VQVNNVVVLDNPSPFYNP
FQFEIT--------FECIEDLSE----DLEWKIIYVGSAESEEY--DQVLDSVLVGPVPA
[...]
If you’re curious, you can also view your sequence alignment in a more graphical way through the EBI’s MView web server.
Having issues? If you don’t have the output files, then there might be a problem in the execution somewhere. In that case, you can have a look at the log file from Slurm:
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ cat slurm-287170.out
Note that the log file is generated by default in the directory in which you ran the sbatch command. There is an option in sbatch with which you can change this behaviour.
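For instance (the file name and path below are just an illustration), the --output option lets you choose where the log file goes; %j is replaced by the job id:
#SBATCH --output=/home/john.doe/cluster_usage_examples/example_mafft/mafft_%j.log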
Typical error messages are for example:
-bash: mfft: command not found: you didn’t write mafft correctly (mfft instead of mafft)
/usr/bin/mafft: Cannot open ASF1_HUMAN_FL_blastp_output.fasta.: mafft cannot find the input that you gave it. It could be linked to a typo in the name or could be because mafft didn’t find it in your current working directory (double-check you’re in the right directory or double-check your input paths).
Analyse your actual resource consumption: How much memory did you effectively use while running the job? How long did your job take to finish? How much CPU percentage did you use?
This is useful to know in order to adapt the resources you ask for in future jobs that run similar processes (e.g. other mafft submissions).
Hints:
you can use the jobinfo command from the slurm-tools toolkit (warning: not a native command of Slurm) for this
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ jobinfo 287170
Job ID : 287170
Job name : my_jobname
User : john.doe
Account :
Working directory : /home/john.doe/cluster_usage_examples/example_mafft
Cluster : cluster
Partition : common
Nodes : 1
Nodelist : node06
Tasks : 1
CPUs : 1
GPUs : 0
State : COMPLETED
Exit code : 0:0
Submit time : 2025-03-11T11:03:43
Start time : 2025-03-11T11:03:44
End time : 2025-03-11T11:03:51
Wait time : 00:00:01
Reserved walltime : 02:00:00
Used walltime : 00:00:07
Used CPU walltime : 00:00:07
Used CPU time : 00:00:04
CPU efficiency : 66.00%
% User (computation) : 94.89%
% System (I/O) : 5.11%
Reserved memory : 1000M/core
Max memory used : 2.85M (estimate)
Memory efficiency : 0.29%
Max disk write : 0.00
Max disk read : 0.00
The lines you should look at: Used walltime (vs Reserved walltime), Max memory used (vs Reserved memory), and the CPU and memory efficiency percentages.
Adjustments to make: here the job finished in a few seconds and used under 3M of memory, so the reservation (2 hours of walltime and 1000M of memory per core) is far more than needed; you can request much less next time.
Warning: it’s important to keep in mind that run times and resource usage also depend on the commands you run and on the size of your input.
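If jobinfo isn’t available on your system, native Slurm accounting commands can show similar (less nicely formatted) information, assuming accounting is enabled on the cluster; for example:
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ sacct -j 287170 --format=JobID,JobName,Elapsed,MaxRSS,State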
Now try optimising your script to reserve no more than the resources you actually need to run mafft. Your colleagues will be thankful ;-)
Hint: see the Slurm cheat sheet for a list of all options for sbatch
Your script could look like this:
#! /bin/bash
#SBATCH --job-name="my_jobname"
#SBATCH --partition=common
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M
#SBATCH --time=00:10:00
### prefix start - create temporary directory for your job
export TMPDIR=$(mktemp -d)
### prefix end
module load nodes/mafft-7.475
# This is a comment line - it will be ignored when the script is executed
cd /home/john.doe/cluster_usage_examples/example_mafft/ #|
#| These are your shell commands
mafft ASF1_HUMAN_FL_blastp_output.fasta > ASF1_HUMAN_FL_blastp_output_mafft_aln.fasta #|
### suffix start - delete created temporary directory
rm -rf $TMPDIR
### suffix end
NB: only the #SBATCH lines were changed…
Take home message
When discovering a new tool and wanting to use it on the cluster:
look for it among the installed software with module
write a submission script that loads it with module load at the top
specify the resources to reserve with #SBATCH-prefixed lines (see the Slurm cheat sheet for a list of all options)
submit the script with sbatch
follow the progression of your job with squeue
analyse your actual resource consumption with jobinfo and adjust future reservations accordingly
⁕ ⁕ ⁕ ⁕ ⁕ ⁕