BIOI2-Training – Cluster – Exercise 3 – BIOI2 – Integrative BIOInformatics platforme

Getting started with the I2BC cluster


Exercise 3 - Comparing protein structures with TM-align

Instructions: This exercise is identical to the previous ones, but we’ll be using a different input and programme. It also has a few extra steps at the end to introduce you to “for” loops and job arrays.

Context: We just ran AlphaFold to predict the structure of a protein (Protein transport protein SEC39) from its sequence. We downloaded several models and would like to see how different they are from the experimental structure. In the following example, we will try to run the TM-align programme to structurally align our protein structure models onto the experimental one and calculate the TM-score similarity measure between them (more about the TM-score). It’s a small programme that doesn’t require a lot of resources and it’s already installed on the I2BC cluster. In our example, we will first be aligning only two structures on each other, then we’ll have a look at solutions to align several in one go.

Note: Files that go with the examples mentioned in this training session are in https://forge.i2bc.paris-saclay.fr/redmine/projects/partage-bioinfo/repository/cluster_usage_examples. You can access them by cloning the repository: https://forge.i2bc.paris-saclay.fr/git/partage-bioinfo/cluster_usage_examples.git

Step 0 - Connect to the I2BC cluster

If you haven’t already got a session open on the Frontale (the master node) of the cluster, please do so as the rest of the steps are performed on the cluster. If you don’t know how to connect, don’t hesitate to refer to the previous section.

Step 1 - Fetch the input files

We will work in your home directory (see this page for more information on the file spaces accessible from the cluster). Let’s move to it and fetch our working files from the Forge website through command line using git:

john.doe@cluster-i2bc:~$ cd /home/john.doe
john.doe@cluster-i2bc:/home/john.doe$ git clone https://forge.i2bc.paris-saclay.fr/git/partage-bioinfo/cluster_usage_examples.git
john.doe@cluster-i2bc:/home/john.doe$ ls cluster_usage_examples/
example_fastqc  example_mafft  example_tmalign

In the example_tmalign folder, you’ll see a set of protein structure files in pdb format and containing the 3D coordinates of each atom in the protein. 8FTU.pdb corresponds to the coordinates of the experimental structure, the other files correspond to AlphaFold’s predictions ranked from 001 to 005.

john.doe@cluster-i2bc:/home/john.doe$ ls cluster_usage_examples/example_tmalign/
8FTU.pdb                                                       SEC39_unrelaxed_rank_003_alphafold2_ptm_model_3_seed_96009.pdb
SEC39_unrelaxed_rank_001_alphafold2_ptm_model_4_seed_96009.pdb SEC39_unrelaxed_rank_004_alphafold2_ptm_model_2_seed_96009.pdb
SEC39_unrelaxed_rank_002_alphafold2_ptm_model_5_seed_96009.pdb SEC39_unrelaxed_rank_005_alphafold2_ptm_model_1_seed_96009.pdb

Step 2 - Locate the executable

The TM-align programme executable is called TMalign (mind the capital letters). Let’s see if we can locate it.

Many programmes are already installed on the cluster. Some are on the Frontale (the master node) but most of them are on the nodes only.

If you want to check if a programme is installed, you’ll have to connect to a node first in interactive mode:
```
john.doe@cluster-i2bc:/home/john.doe$ qsub -I
qsub: waiting for job 287169.pbsserver to start
qsub: job 287169.pbsserver ready

john.doe@node01:/home/john.doe$ 
```
Of note:
- with the qsub command, you are actually running a job on the cluster with a job identifier.
- all jobs are dispatched to one of the available nodes of the cluster – in this case, we’re using node01 (NB: cluster-i2bc is the name of the Frontale).

Once on the node, there are several places your executable could be:
1. (the easiest scenario) your executable could already be installed and saved in your system’s $PATH variable, in which case, you should be able to just type the command name as is in the terminal.
2. your executable could be installed in the /opt folder at the root of the node, in which case you’ll have to look for it first.
3. there is also a module but not all programmes in /opt are listed in there.
Let’s check first if the TMalign executable is in the $PATH by directly typing the TMalign command in the terminal:
```
john.doe@node01:/home/john.doe$ TMalign
 Brief instruction for running TM-align program:
 (For detail: Zhang & Skolnick, Nucl. Acid. Res. 33: 2302-9, 2005)

 1. Align 'chain_1.pdb' and 'chain_2.pdb':
    >TMalign chain_1.pdb chain_2.pdb

               [...]
```
Did this work? Yes! The executable was found in the $PATH: run without any arguments, TMalign printed its usage instructions.

So, to use the programme, we just have to type: TMalign and add the arguments
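The lookup described in this step can also be scripted. The sketch below is only an illustration: `locate_programme` is a hypothetical helper name (it is not a cluster tool), and the search depth used in /opt is an arbitrary assumption.

```shell
# Hypothetical helper (not part of the cluster tools): report where a
# programme can be found on the machine you are currently connected to.
locate_programme() {
    name="$1"
    # 1. Is it in $PATH? "command -v" prints the location if so.
    if found=$(command -v "$name" 2>/dev/null); then
        echo "in PATH: $found"
        return 0
    fi
    # 2. Otherwise look in /opt (the search depth of 3 is an arbitrary choice).
    if [ -d /opt ]; then
        found=$(find /opt -maxdepth 3 -type f -name "$name" 2>/dev/null | head -n 1)
        if [ -n "$found" ]; then
            echo "in /opt: $found"
            return 0
        fi
    fi
    echo "not found: $name"
    return 1
}

# On a node this should report a location; elsewhere it may report "not found".
locate_programme TMalign || true
```

Remember that most programmes are only installed on the nodes, so the result may differ between the Frontale and a node.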

Step 3 - Determine how to use the executable

Let’s investigate how to use the TMalign executable: How do we specify the inputs? What options or parameters can we use? What would the final command line look like to run TMalign on your input?

We can make the most of still being connected to a node on the cluster to run a few execution tests: most programmes come with help or usage messages that you can print on the screen using “man your_programme” or “your_programme --help” or “your_programme -h” or sometimes just the executable command itself. Let’s see if we can access the help menu for TMalign:

john.doe@node01:/home/john.doe$ TMalign -h
 Brief instruction for running TM-align program:
 (For detail: Zhang & Skolnick, Nucl. Acid. Res. 33: 2302-9, 2005)

 1. Align 'chain_1.pdb' and 'chain_2.pdb':
    >TMalign chain_1.pdb chain_2.pdb

                  [...]

So the basic usage of TMalign in our case would look like this (the executable first, then both input pdb files, then the output file after the “>” sign):

john.doe@node01:/home/john.doe$ TMalign /home/john.doe/cluster_usage_examples/example_tmalign/8FTU.pdb /home/john.doe/cluster_usage_examples/example_tmalign/SEC39_unrelaxed_rank_001_alphafold2_ptm_model_4_seed_96009.pdb > /home/john.doe/cluster_usage_examples/example_tmalign/tmalign_exp_vs_rank1.txt

Note: by default, TMalign will just print the alignment and score information to the screen (no option to specify the output file). In order to capture this printed output into a file, we use the redirection sign “>” followed by the name of the file we want to redirect the printed output to.
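To see what the redirection does without needing TMalign, here is a small sketch that uses a stand-in function. `fake_align` is hypothetical and only mimics a programme that prints to the screen; the value it prints is made up. Anything written to standard output ends up in the file, whereas messages written to standard error do not.

```shell
# Stand-in for a programme that prints its results to the screen
# (hypothetical; it only mimics the way TMalign writes to standard output).
fake_align() {
    echo "TM-score= 0.5 (made-up value for illustration)"
    echo "warning: this line goes to standard error" >&2
}

workdir=$(mktemp -d)

# ">" sends standard output to the file; standard error still reaches the screen.
fake_align > "$workdir/result.txt"

cat "$workdir/result.txt"
```

In a PBS job, whatever still reaches the "screen" is what ends up in the job's log files.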

At this point, we know everything there is to know about TMalign and its execution. We no longer need to be connected to a node and can now liberate the resources that we’ve blocked by disconnecting from it:

john.doe@node01:/home/john.doe$ logout

qsub: job 287169.pbsserver completed
john.doe@cluster-i2bc:/home/john.doe$

Of note: as you can see, the terminal prompt prefix changed again from node01 back to cluster-i2bc: we’ve returned to the Frontale of the cluster and the job we were running has terminated.

Step 4 - Write your submission script

It is best to write the submission script in your example_tmalign subdirectory.

Let’s move to the example_tmalign subdirectory first:

john.doe@cluster-i2bc:/home/john.doe$ ls cluster_usage_examples/
example_fastqc  example_mafft  example_tmalign
john.doe@cluster-i2bc:/home/john.doe$ cd cluster_usage_examples/example_tmalign/

Now let’s write a script called pbs_script.sh in there.

In this example, we will write pbs_script.sh using the in-line text editor nano (but there are other possibilities such as vi, vim or emacs for example):

john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_tmalign$ nano pbs_script.sh

This will create a file called pbs_script.sh in your current directory and will open up an in-line “window” to edit your file that looks like the screenshot below.

[Screenshot of the nano editor]

About the nano text editor:
It’s in-line, you navigate through it with your arrow keys and you have a certain number of functionalities (e.g. copy-paste, search etc.) that are accessible through keyboard shortcuts (that are also listed on the bottom of your screen, ^ stands for the Ctrl key).
The main shortcuts are: Ctrl+S to save (^S) and Ctrl+X to exit (^X).
See the nano cheat sheet and tutorial for more information and shortcuts.

The PBS submission script is written like a common bash script (same language as the terminal). Write in this script all the commands (1 per line) that you would usually type in your terminal. The only particularity is the set of PBS submission options, which you can add directly to this script, usually at the beginning (each option should be preceded by #PBS, as in the example below).

#! /bin/bash

#PBS -N my_jobname     
#PBS -q common        
#PBS -l ncpus=1       

# This is a comment line - it will be ignored when the script is executed
# Comment lines start with a "#" symbol and can be put anywhere you like
# You can also add a comment at the end of a line, as shown below

cd /home/john.doe/cluster_usage_examples/example_tmalign/                                                    #|
                                                                                                             #| These are your shell commands
TMalign 8FTU.pdb SEC39_unrelaxed_rank_001_alphafold2_ptm_model_4_seed_96009.pdb > tmalign_exp_vs_rank1.txt   #|

Explanation of the content:

- #! /bin/bash: this is the “shebang”, it specifies the “language” of your script (in this case, the cluster understands that the syntax of this text file is “bash” and will execute it with the /bin/bash executable).
- #PBS: all lines starting with #PBS indicate to the PBS job scheduler on the cluster that what follows is information related to the job submission. This is where you specify the qsub options such as your job name with -N or the queue you want to submit your job to with -q. There are many more options you can specify, see the SICS webpage.
- cd /path/to/your/folder: by default, when you connect to the Frontale or the nodes, you land in your “home” directory (/home/john.doe). By moving to the directory which contains your input, you won’t need to specify the full path to the input, as you can see in the line of code that follows this statement.
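As an alternative to hard-coding the path, PBS sets the environment variable PBS_O_WORKDIR to the directory from which qsub was run. The sketch below assumes that you submit from the folder containing your inputs; the fallback to the current directory is only there so that the lines can be tried outside of a job.

```shell
# Inside a PBS job, PBS_O_WORKDIR holds the directory in which qsub was typed.
# Outside of a job the variable is unset, so we fall back to the current directory.
workdir="${PBS_O_WORKDIR:-$PWD}"

cd "$workdir" || exit 1
echo "Working in: $(pwd)"
```

With this approach the same script keeps working if the example folder is moved, as long as qsub is run from inside it.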

When you exit the nano text editor, you should see the file created in your current directory:

john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_tmalign$ ls
8FTU.pdb                                                       SEC39_unrelaxed_rank_003_alphafold2_ptm_model_3_seed_96009.pdb
SEC39_unrelaxed_rank_001_alphafold2_ptm_model_4_seed_96009.pdb SEC39_unrelaxed_rank_004_alphafold2_ptm_model_2_seed_96009.pdb
SEC39_unrelaxed_rank_002_alphafold2_ptm_model_5_seed_96009.pdb SEC39_unrelaxed_rank_005_alphafold2_ptm_model_1_seed_96009.pdb
pbs_script.sh

Step 5 - Submit your script to the cluster

To submit a pbs submission script, all you have to do is:

john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_tmalign$ qsub pbs_script.sh
287170.pbsserver

This will print your attributed job id on the screen (287170 in this case).
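If you want to reuse the job id in later commands, you can store what qsub prints in a variable. The sketch below uses the identifier from the example above as a fixed string so that it can be tried anywhere; on the cluster you would capture the output of qsub instead, as indicated in the first comment.

```shell
# On the cluster: job_id=$(qsub pbs_script.sh)
# Here we reuse the identifier printed in the example above.
job_id="287170.pbsserver"

# Strip everything from the first "." onwards to keep only the number.
job_number="${job_id%%.*}"

echo "Full identifier: $job_id"
echo "Job number:      $job_number"
```

The job number is also the suffix that PBS gives to the log files of the job.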

You can follow the progression of your job with qstat:

john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_tmalign$ qstat 287170.pbsserver
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
287170.pbsserver my_jobname       john.doe         00:00:05 R common

Note: if you get a message saying your “Job has finished”, you can add the -x option to the qstat command, this activates the search through your past jobs as well as the ones that are currently running: qstat -x 287170.pbsserver.

Or to see all your jobs running and details on resources:

john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_tmalign$ qstat -u john.doe -w
Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
287170.pbsserver               john.doe        common          my_jobname      3090738   1   1     2gb    02:00 R 00:05
287172.pbsserver               john.doe        common          my_job2         3090739   1   1     2gb    02:00 R 00:01

You can learn more about the options for qstat on the SICS website or in the manual (type man qstat, navigate with the up/down arrow keys and exit by typing q).

Step 6 - Check if your job finished correctly

What files do we expect to see? There should be 3 new files in total:

- TM-align should generate one file: the output file that you specified in your script, which contains the information on your structural alignment.
- the PBS scheduler should also generate two files with your job name as prefix and the job identifier as suffix: one containing the error log, the other the usual output log (what is usually printed on the screen is captured by PBS in two separate files instead).

Your job shouldn’t take too long to finish, then you should be able to see the output files in your folder:

john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_tmalign$ ls
8FTU.pdb                                                       SEC39_unrelaxed_rank_003_alphafold2_ptm_model_3_seed_96009.pdb
SEC39_unrelaxed_rank_001_alphafold2_ptm_model_4_seed_96009.pdb SEC39_unrelaxed_rank_004_alphafold2_ptm_model_2_seed_96009.pdb
SEC39_unrelaxed_rank_002_alphafold2_ptm_model_5_seed_96009.pdb SEC39_unrelaxed_rank_005_alphafold2_ptm_model_1_seed_96009.pdb
my_jobname.e287170                                             my_jobname.o287170                 
pbs_script.sh                                                  tmalign_exp_vs_rank1.txt

You will see the output file generated by TM-align but also the log files generated by the PBS job scheduler, to which the output and error messages that are normally printed on the screen are written (e=error, o=output).

You can also have a look at your output, in which we see that the model of rank #1 has a TM-score of 0.84 with the reference:

john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_tmalign$ cat tmalign_exp_vs_rank1.txt
 **************************************************************************
 *                        TM-align (Version 20190822)                     *
 * An algorithm for protein structure alignment and comparison            *
 * Based on statistics:                                                   *
 *       0.0 < TM-score < 0.30, random structural similarity              *
 *       0.5 < TM-score < 1.00, in about the same fold                    *
 * Reference: Y Zhang and J Skolnick, Nucl Acids Res 33, 2302-9 (2005)    *
 * Please email your comments and suggestions to: zhng@umich.edu          *
 **************************************************************************

Name of Chain_1: 8FTU.pdb                                          
Name of Chain_2: SEC39_unrelaxed_rank_001_alphafold2_ptm_model_4_se
Length of Chain_1:  627 residues
Length of Chain_2:  672 residues

Aligned length=  626, RMSD=   4.10, Seq_ID=n_identical/n_aligned= 0.954
TM-score= 0.83692 (if normalized by length of Chain_1)
TM-score= 0.78720 (if normalized by length of Chain_2)
(You should use TM-score normalized by length of the reference protein)

                               [...]

Having issues?
If you don’t have the output files, then there might be a problem in the execution somewhere. In that case, you can have a look at the two log files from PBS, especially the error file (*.e*).

john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_tmalign$ cat my_jobname.e287170

Note that both log files are generated by default in the directory in which you ran the qsub command. There are options in qsub with which you can change this behaviour.

Typical error messages are for example:

-bash: tmalign: command not found
This is typical for commands that bash doesn’t know or cannot find. In this case, it’s because we didn’t spell TMalign correctly (tmalign instead of TMalign).

At line 293 of file TMalign.f (unit = 10) Fortran runtime error: Cannot open file '8FTT.pdb': No such file or directory
As stated, TMalign cannot find the input that you gave it. In this case, it’s linked to a typo in the name (8FTT.pdb instead of 8FTU.pdb), but it could also have been because TMalign didn’t find your input in your current working directory. Keep in mind that by default, when you connect to the Frontale and the nodes, you land in your home directory, and TMalign won’t find your inputs unless you move to the right directory with cd or specify the full path to those files.
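A quick way to apply these checks is sketched below. `check_job` is a hypothetical helper, and the file names passed to it are the ones from this exercise; the test `[ -s file ]` is true when the file exists and is not empty. Bear in mind that an empty error log is a good sign but not a guarantee, since a programme may also fail silently.

```shell
# Hypothetical helper: check the result file and the PBS error log of a job.
# Usage: check_job RESULT_FILE ERROR_LOG
check_job() {
    result="$1"
    errlog="$2"
    status=0
    if [ ! -s "$result" ]; then
        echo "Problem: $result is missing or empty"
        status=1
    fi
    if [ -s "$errlog" ]; then
        echo "Problem: $errlog contains messages:"
        cat "$errlog"
        status=1
    fi
    [ "$status" -eq 0 ] && echo "Job looks fine"
    return "$status"
}

# Example call with the file names used in this exercise:
check_job tmalign_exp_vs_rank1.txt my_jobname.e287170 || true
```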

Step 7 - Analyse your resource consumption to optimise your next run

Analyse your actual resource consumption: How much memory did you effectively use while running the job? How long did your job take to finish? How much CPU percentage did you use?

This is useful to know in order to adapt the resources you ask for in future jobs of a similar kind (e.g. other TM-align submissions). To see how much resource your job used, you can use qshow -j MY_JOB_ID or qstat -fxw -G MY_JOB_ID (both commands are equivalent):

john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_tmalign$ qshow -j 287170
Job Id: 287170.pbsserver
    Job_Name = my_jobname
    Job_Owner = john.doe@master.example.org
    resources_used.cpupercent = 0
    resources_used.cput = 00:00:03
    resources_used.mem = 70748kb
    resources_used.ncpus = 1
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:04
    job_state = F
    queue = common
    [...]
    Resource_List.mem = 2gb
    Resource_List.ncpus = 1
    Resource_List.nodect = 1
    Resource_List.place = pack
    Resource_List.preempt_targets = QUEUE=lowprio
    Resource_List.select = 1:mem=2gb:ncpus=1
    Resource_List.walltime = 02:00:00
    [...]

Answer?

- Memory: this is the amount of RAM allocated to the job. We reserved 2Gb by default (Resource_List.mem) but only used about 70Mb (resources_used.mem). For next time, we could consider asking for less memory, for example 200Mb instead of 2Gb, with -l mem=200mb. This will leave more memory available for others on the cluster.
- CPU percentage: this reflects how much of the CPU you used during your job (resources_used.cpupercent). For 1 CPU reserved, cpupercent can go from 0% (sub-optimal use) to 100% (optimal use). For N CPUs, it can go up to N x 100% if all CPUs are working full time. It’s an approximate measure of how efficiently the tasks are distributed over the CPUs. In our case, we used 0% of the allocated CPU (yes, that’s how little TM-align uses; it’s also just an approximate calculation), but we can’t ask for less than 1 CPU so there’s nothing to be done.
- Wall time: this is the maximum computation time given to a job. Beyond this time, your job will be killed, whatever its state. Setting a maximum limit can be useful in some cases, for example when a programme freezes after encountering an error. We reserved 2 hrs (Resource_List.walltime) but the job only took 4 seconds (resources_used.walltime). For next time, knowing that TMalign is very fast, we could set a wall time of 10 minutes, for example with -l walltime=00:10:00.
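The comparison between what was reserved and what was used can be worked out with shell arithmetic. The sketch below copies the two memory values from the qshow output above; note that the shell only does integer arithmetic, so the results are rounded down.

```shell
# Values copied from the qshow output above
used_kb=70748                       # resources_used.mem = 70748kb
reserved_kb=$((2 * 1024 * 1024))    # Resource_List.mem = 2gb, expressed in kb

used_mb=$((used_kb / 1024))
percent_used=$((used_kb * 100 / reserved_kb))

echo "Used about ${used_mb}mb, i.e. ${percent_used}% of the reserved memory"
```

Asking for 200mb therefore still leaves a comfortable margin above what the job actually needed.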

Our adjusted job script pbs_script.sh could then look like this:

#! /bin/bash

#PBS -N my_jobname         
#PBS -q common              
#PBS -l ncpus=1            
#PBS -l mem=200mb            
#PBS -l walltime=00:10:00  

# This is a comment line - it will be ignored when the script is executed

cd /home/john.doe/cluster_usage_examples/example_tmalign/                                                    #|
                                                                                                             #| These are your shell commands
TMalign 8FTU.pdb SEC39_unrelaxed_rank_001_alphafold2_ptm_model_4_seed_96009.pdb > tmalign_exp_vs_rank1.txt   #|

Step 8 (Bonus) - What about the other structures?

Can you see a way to adapt your job submission script to run TM-align on several pairs of structures without manually typing every single command line?

What we would like to do is to run TM-align on each model versus the experimental reference structure:

TMalign 8FTU.pdb SEC39_unrelaxed_rank_001_alphafold2_ptm_model_4_seed_96009.pdb > tmalign_exp_vs_rank1.txt  
TMalign 8FTU.pdb SEC39_unrelaxed_rank_002_alphafold2_ptm_model_5_seed_96009.pdb > tmalign_exp_vs_rank2.txt  
TMalign 8FTU.pdb SEC39_unrelaxed_rank_003_alphafold2_ptm_model_3_seed_96009.pdb > tmalign_exp_vs_rank3.txt  
TMalign 8FTU.pdb SEC39_unrelaxed_rank_004_alphafold2_ptm_model_2_seed_96009.pdb > tmalign_exp_vs_rank4.txt  
TMalign 8FTU.pdb SEC39_unrelaxed_rank_005_alphafold2_ptm_model_1_seed_96009.pdb > tmalign_exp_vs_rank5.txt

To submit 1 vs all, there are 2 solutions (which will avoid you running each command line individually):

- use a “for” loop within a single PBS job script
- use a job array

(You could also write 1 PBS job script per structure pair, or write 1 line per structure pair within a single script, but that’s time consuming, especially if we had more than just 5 models to run.)

If you are comfortable with programming, try implementing the for loop on your own. Refer to Step 9 below if you need any tips. Job arrays will be detailed in Step 10.

Step 9 (Bonus) - Using a "for" loop

We can make use of the “*” in bash that replaces any (set of) character(s) and the for command in bash to run one structure versus all others sequentially.

Our adjusted job script pbs_script.sh could then look like this:

#! /bin/bash

#PBS -N my_jobname         
#PBS -q common             
#PBS -l ncpus=1           
#PBS -l mem=200mb          
#PBS -l walltime=00:10:00 

# This is a comment line - it will be ignored when the script is executed

cd /home/john.doe/cluster_usage_examples/example_tmalign/ 

for pdb in SEC39_unrelaxed_rank_* 
do
    TMalign 8FTU.pdb "$pdb" >> tma_1vall.txt  
done

Explanations:

- $pdb is a variable in bash that will successively take the name of each file starting with “SEC39_unrelaxed_rank_” in the folder (note that we could’ve called it something else if we wanted).
- Using >> is a way of redirecting the output to a file in bash without overwriting what’s already in the file if it exists. This means that, in this case, all outputs will be written successively to the same file.
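To convince yourself of how the loop and >> behave before submitting, you can try them on empty stand-in files. The sketch below works in a temporary folder and replaces TMalign with echo, so the file names and contents are purely illustrative.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Empty stand-in files (only the names matter here)
touch SEC39_unrelaxed_rank_001_x.pdb SEC39_unrelaxed_rank_002_x.pdb SEC39_unrelaxed_rank_003_x.pdb

for pdb in SEC39_unrelaxed_rank_*
do
    echo "would align 8FTU.pdb with $pdb" >> appended.txt     # ">>" adds a line at each iteration
    echo "would align 8FTU.pdb with $pdb" > overwritten.txt   # ">" keeps only the last iteration
done

wc -l appended.txt overwritten.txt
```

One consequence of >> is that re-running the job adds new results to the end of an existing file, so you may want to delete the old output first.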

Submit the script using qsub as previously. Once the job has finished, you can have a look at the output to see if AlphaFold’s ranking is coherent with the experimental structure (in terms of TM-score).

john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_tmalign$ grep "Name of Chain_2\|TM-score=" tma_1vall.txt
Name of Chain_2: SEC39_unrelaxed_rank_001_alphafold2_ptm_model_4_se
TM-score= 0.83692 (if normalized by length of Chain_1)
TM-score= 0.78720 (if normalized by length of Chain_2)
Name of Chain_2: SEC39_unrelaxed_rank_002_alphafold2_ptm_model_5_se
TM-score= 0.84478 (if normalized by length of Chain_1)
TM-score= 0.79436 (if normalized by length of Chain_2)
Name of Chain_2: SEC39_unrelaxed_rank_003_alphafold2_ptm_model_3_se
TM-score= 0.93824 (if normalized by length of Chain_1)
TM-score= 0.87825 (if normalized by length of Chain_2)
Name of Chain_2: SEC39_unrelaxed_rank_004_alphafold2_ptm_model_2_se
TM-score= 0.85504 (if normalized by length of Chain_1)
TM-score= 0.80356 (if normalized by length of Chain_2)
Name of Chain_2: SEC39_unrelaxed_rank_005_alphafold2_ptm_model_1_se
TM-score= 0.94959 (if normalized by length of Chain_1)
TM-score= 0.88831 (if normalized by length of Chain_2)

Explanation: Above, we used the grep command in order to only print the lines that interest us in the output file, but you could also just use cat tma_1vall.txt to show the whole lot if you’re not familiar with this command.

As you can see, the structure with the highest TM-score is actually the one that was ranked last. There could be many reasons for this apparent discrepancy, especially in this case where the structure is very oblong.
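If you would rather let the shell do the ranking, the name and score lines can be paired with awk and sorted. The sketch below is one possible way of doing it; it works on a copy of the grep output shown above (on the cluster you would read tma_1vall.txt directly) and keeps only the TM-score normalised by the length of Chain_1, the experimental reference.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
Name of Chain_2: SEC39_unrelaxed_rank_001_alphafold2_ptm_model_4_se
TM-score= 0.83692 (if normalized by length of Chain_1)
TM-score= 0.78720 (if normalized by length of Chain_2)
Name of Chain_2: SEC39_unrelaxed_rank_002_alphafold2_ptm_model_5_se
TM-score= 0.84478 (if normalized by length of Chain_1)
TM-score= 0.79436 (if normalized by length of Chain_2)
Name of Chain_2: SEC39_unrelaxed_rank_003_alphafold2_ptm_model_3_se
TM-score= 0.93824 (if normalized by length of Chain_1)
TM-score= 0.87825 (if normalized by length of Chain_2)
Name of Chain_2: SEC39_unrelaxed_rank_004_alphafold2_ptm_model_2_se
TM-score= 0.85504 (if normalized by length of Chain_1)
TM-score= 0.80356 (if normalized by length of Chain_2)
Name of Chain_2: SEC39_unrelaxed_rank_005_alphafold2_ptm_model_1_se
TM-score= 0.94959 (if normalized by length of Chain_1)
TM-score= 0.88831 (if normalized by length of Chain_2)
EOF

# Remember the latest model name, print it next to its Chain_1-normalised score,
# then sort numerically from the highest to the lowest score.
ranking=$(awk '/^Name of Chain_2:/ { name = $4 }
               /^TM-score=/ && index($0, "length of Chain_1") { print $2, name }' "$sample" \
          | LC_ALL=C sort -rn)

echo "$ranking"
best=$(echo "$ranking" | head -n 1)
```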

Step 10 (Bonus) - Using job arrays

What are job arrays?
In PBS, you can use job arrays. An array of jobs is a set of jobs that share the same parameters (e.g. number of CPUs, amount of memory etc.) but each of them works on different inputs. A job array runs as a collection of related yet separate basic jobs that might be distributed across multiple hosts and might run concurrently (instead of sequentially).
Why use job arrays?
- The advantage of job arrays in this case is their parallelism: we will be running TMalign on each pair of structures in parallel. This is particularly useful when your individual jobs take a while to run. In the case of TMalign, it’s not that much different from using a for loop.
- The disadvantage of using job arrays is the constraint on the input names, but this can be solved quite easily through work-arounds, as you’ll see below.

2 important parameters in job arrays:

- the PBS option “-J start-stop”: a range of integers
- the variable “$PBS_ARRAY_INDEX”, which corresponds to the job index and takes the values in the start-stop range defined above

For example, if you add:

| parameter value | this will run | value of `$PBS_ARRAY_INDEX` in these jobs |
|---|---|---|
| `#PBS -J 1-3` | 3 individual jobs | 1 in the first job, 2 in the second job, 3 in the third job |
| `#PBS -J 6-9` | 4 individual jobs | 6 in the first job, 7 in the second job, 8 in the third job, 9 in the fourth job |

All jobs within a job array run the same pbs submission script with the same PBS parameters (i.e. the same resources asked). It’s up to you to play with the $PBS_ARRAY_INDEX variable (=the job index) to avoid each job doing exactly the same thing.

The job script:
For example, in our case, we would like to run 5 TMalign commands and we would like each job in the job array to run TM-align on a different pair of structures. Given the input names, we could take advantage of their ranks to select different structures from one job to the other (1, 2, 3, 4 and 5) like this:
```
#! /bin/bash

#PBS -N my_jobname
#PBS -q common
#PBS -l ncpus=1
#PBS -l mem=200mb
#PBS -l walltime=00:10:00
#PBS -J 1-5

cd /home/john.doe/cluster_usage_examples/example_tmalign/ 

TMalign 8FTU.pdb SEC39_unrelaxed_rank_00"${PBS_ARRAY_INDEX}"_alphafold2_ptm_model_*_seed_96009.pdb >> tma_1vall_"${PBS_ARRAY_INDEX}".txt
```
Explanation:
- About the syntax of the second input for TM-align: Since the model number is variable and doesn’t always correspond to the rank, we replaced it with the * symbol which replaces any (set of) characters. The rank, however, can be directly set with the job index (using the $PBS_ARRAY_INDEX variable).
- The output: We use the job index ($PBS_ARRAY_INDEX) in order to individualise the outputs so they don’t overwrite each other.
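Because $PBS_ARRAY_INDEX is only defined inside a job array, it can help to set it by hand and check which file name the pattern resolves to before submitting. The sketch below does this in a temporary folder with empty stand-in files named like the first three models of this exercise.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Empty stand-in files with the same names as the models in this exercise
touch SEC39_unrelaxed_rank_001_alphafold2_ptm_model_4_seed_96009.pdb \
      SEC39_unrelaxed_rank_002_alphafold2_ptm_model_5_seed_96009.pdb \
      SEC39_unrelaxed_rank_003_alphafold2_ptm_model_3_seed_96009.pdb

# Pretend to be each job of the array in turn
for PBS_ARRAY_INDEX in 1 2 3
do
    resolved=$(ls SEC39_unrelaxed_rank_00"${PBS_ARRAY_INDEX}"_alphafold2_ptm_model_*_seed_96009.pdb)
    echo "index ${PBS_ARRAY_INDEX} -> ${resolved} (output: tma_1vall_${PBS_ARRAY_INDEX}.txt)"
done
```

Note that the hard-coded 00 prefix only matches ranks 1 to 9; for a larger array, the zero-padded rank could be built with printf '%03d' instead.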

Follow a job array with qstat:
If you submit this job script, you will see a [] after the job id outputted on the screen, this indicates that you’ve submitted a job array.

john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_tmalign$ qsub job_script.sh 
288550[].pbsserver

To follow your job, if you just use qstat MY_JOB_ID, you will only see one job with status “B”:

john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_tmalign$ qstat 288550[].pbsserver
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
288550[].pbsserv* my_jobname       john.doe                 0 B common

In order to see all the individual jobs that are running within your job array, you have to add the -t option to qstat:

john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_tmalign$ qstat -t 288550[].pbsserver
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
288550[].pbsserv* my_jobname       john.doe                 0 B common          
288550[2].pbsser* my_jobname       john.doe          00:00:02 R common          
288550[3].pbsser* my_jobname       john.doe          00:00:02 R common          
288550[4].pbsser* my_jobname       john.doe          00:00:02 X common          
288550[5].pbsser* my_jobname       john.doe          00:00:02 X common

Note: in order to delete a whole job array, don’t forget the [] e.g. qdel 288550[]. To delete one of the jobs in your job array, add the job index e.g. qdel 288550[2] to delete the job with index 2. This will change the status of job 2 to “X” instead of “R”. Status “X” in job arrays indicates that the job is finished (or cancelled), it’s not using any resources anymore but it will be shown on the screen as long as there are still some jobs of the job array that are running.

Have a look at the output:
Instead of having only 1 error and 1 output log generated by the PBS job scheduler, with job arrays, you have a pair for each job in the job array.

john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_tmalign$ ls my_jobname.*
my_jobname.e288550.1  my_jobname.e288550.2  my_jobname.e288550.3  my_jobname.e288550.4  my_jobname.e288550.5
my_jobname.o288550.1  my_jobname.o288550.2  my_jobname.o288550.3  my_jobname.o288550.4  my_jobname.o288550.5

Explanation: The number at the end of each file name corresponds to the job index (which is also the value of $PBS_ARRAY_INDEX in that job).

What to do if our inputs don’t conveniently have numbers in their names?
As you realised above, jobs within a job array can only be told apart by their job index ($PBS_ARRAY_INDEX variable), so you have to use it to individualise what each job does. There are several options:

1. you use it directly in the input name (but your input names have to be adapted to this) -> that’s what we did earlier
2. you list all your inputs in a file and read the i-th line
3. for more intricate scripts, you could use if/else or case statements to run completely different code according to $PBS_ARRAY_INDEX

Example for option 2 above:
Let’s create a file that lists all our models:

john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_tmalign$ ls SEC39_unrelaxed_rank_*.pdb > list_of_structures.txt

Your job submission script would then look like:

#! /bin/bash

#PBS -N my_jobname
#PBS -q common
#PBS -l ncpus=1
#PBS -l mem=200mb
#PBS -l walltime=00:10:00
#PBS -J 1-5

cd /home/john.doe/cluster_usage_examples/example_tmalign/ 

pdb=$(sed "${PBS_ARRAY_INDEX}q;d" list_of_structures.txt) # access the ith line in the list of structures, $pdb will take the value of that line i.e. the structure name
TMalign 8FTU.pdb "$pdb" >> tma_1vall_"${PBS_ARRAY_INDEX}".txt

Explanation: We’re taking advantage of the sed command to extract the i-th line of a file: “d” deletes every line instead of printing it, except line i, at which “q” prints the line and quits. “i” is the index of our job, i.e. $PBS_ARRAY_INDEX. Then, we run TMalign on the reference structure vs the one listed on the i-th line of your file.
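The sed expression can be tried outside of a job by setting the index by hand. The sketch below uses a stand-in list with made-up file names (model_a.pdb etc.) in a temporary folder; it also shows how the number of lines in the list gives the upper bound of the -J range.

```shell
workdir=$(mktemp -d)
list="$workdir/list_of_structures.txt"

# Stand-in list (on the cluster this file is produced by the ls command above)
printf '%s\n' model_a.pdb model_b.pdb model_c.pdb > "$list"

# The number of lines gives the upper bound N to use in "#PBS -J 1-N"
n_structures=$(wc -l < "$list" | tr -d ' ')
echo "Use: #PBS -J 1-${n_structures}"

# Pretend to be the job with index 2 and read the 2nd line of the list
PBS_ARRAY_INDEX=2
pdb=$(sed "${PBS_ARRAY_INDEX}q;d" "$list")
echo "Job ${PBS_ARRAY_INDEX} would process: ${pdb}"
```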

Example for option 3 above:
Your job submission script could look like:

#! /bin/bash

#PBS -N my_jobname
#PBS -q common
#PBS -l ncpus=1
#PBS -l mem=200mb
#PBS -l walltime=00:10:00
#PBS -J 1-5

cd /home/john.doe/cluster_usage_examples/example_tmalign/ 

case "$PBS_ARRAY_INDEX" in
   1)
	   TMalign 8FTU.pdb SEC39_unrelaxed_rank_001_alphafold2_ptm_model_4_seed_96009.pdb >> tma_1vall_"${PBS_ARRAY_INDEX}".txt
	   ;;
   2)
	   TMalign 8FTU.pdb SEC39_unrelaxed_rank_002_alphafold2_ptm_model_5_seed_96009.pdb >> tma_1vall_"${PBS_ARRAY_INDEX}".txt
	   ;;
   3)
	   TMalign 8FTU.pdb SEC39_unrelaxed_rank_003_alphafold2_ptm_model_3_seed_96009.pdb >> tma_1vall_"${PBS_ARRAY_INDEX}".txt
	   ;;
   4)
	   TMalign 8FTU.pdb SEC39_unrelaxed_rank_004_alphafold2_ptm_model_2_seed_96009.pdb >> tma_1vall_"${PBS_ARRAY_INDEX}".txt
	   ;;
   5)
	   TMalign 8FTU.pdb SEC39_unrelaxed_rank_005_alphafold2_ptm_model_1_seed_96009.pdb >> tma_1vall_"${PBS_ARRAY_INDEX}".txt
	   ;;
   *)
	   echo "Unknown job index"
	   ;;
esac

Explanation: In this script, we are using the “case” statement of bash: when the job index has a certain value, we only execute a certain (set of) lines in the submission script. As you might guess, a case statement isn’t well suited to our example. For instance, adding a few extra structures to the list would mean re-adjusting the script. Also, imagine you have more than just 5 structures to deal with… However, this sort of system can be useful in some situations, especially if you want to vary the commands or options from one job to the next.

In summary, when you submit a job array:
- you are submitting a set of jobs that all have the same root job id (e.g. 288550) and are differentiable with their index number (e.g. [2] => complete job identifier: 288550[2]).
- all jobs within a job array run the same pbs submission script with the same PBS parameters (i.e. the same resources asked), it’s up to you to use the $PBS_ARRAY_INDEX to individualise their tasks within this script. Note that this also means that if you run a job array with 4 sub-jobs, you’ll be using 4 times the resources that you’ve asked for in terms of memory and CPU.
- you have to add the -t parameter to qstat to see the individual jobs in the queue.
- on the I2BC cluster, you are limited in the number of jobs you can submit within a same job array (to avoid slowing down the whole system).