Exercise 2, case study C - Comparing protein structures with TM-align

This exercise follows the same steps as the previous two case studies (Exercises 2A & 2B) but with a different input and programme. It also has a few extra steps at the end to introduce you to for loops and job arrays.

Context

We just ran AlphaFold to predict the structure of a protein (Protein transport protein SEC39) from its sequence. We downloaded several models and would like to see how different they are from the experimental structure. In the following example, we will run the TM-align programme to structurally align our protein structure models onto the experimental one and calculate the TM-score similarity measure between them (more about the TM-score). It’s a small programme that doesn’t require a lot of resources and it’s already installed on the I2BC cluster. We will first align only two structures to each other, then look at ways to align several in one go.

Objectives

In this case study, you will go through every step of using a given piece of software on the I2BC cluster: looking for it among the installed software, understanding how to use it, running it within a Slurm job and optimising the resources to reserve.

Setup

Connect to cluster

It’s the same as for Exercise 0: you should be connected to the cluster and on the master node (i.e. slurmlogin should be written in your terminal prompt).

Fetch example files

It’s the same as for the previous case studies, Exercises 2A & 2B. You can skip this step if you’ve already done it.

You will need the example files available in Zenodo under this link, or on the Forge Logicielle under this link for those who are familiar with git.

We’ll work in your home directory. Let’s move to it and fetch our working files using wget:

In the example_tmalign folder, you’ll see a set of protein structure files in pdb format and containing the 3D coordinates of each atom in the protein. 8FTU.pdb corresponds to the coordinates of the experimental structure, the other files correspond to AlphaFold’s predictions ranked from 001 to 005.

Step-by-step

The TM-align programme executable is called tmalign. Try to find it using the module command.
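One possible way to search is sketched below. It assumes the cluster provides the usual `module` command (Environment Modules or Lmod); the helper name `find_module` is made up for this example. `module avail` often writes its listing to standard error, hence the `2>&1` before filtering with `grep`.

```bash
# Hypothetical helper: list the installed modules whose name contains a keyword.
find_module() {
    local keyword=$1
    if command -v module >/dev/null 2>&1; then
        # Merge stderr into stdout so that grep can see the module listing.
        module avail 2>&1 | grep -i "$keyword" || echo "no module matching '$keyword'"
    else
        echo "'module' is not available in this shell: run this on the cluster"
    fi
}

find_module tmalign
```

Once you know the exact module name, `module load` followed by that name makes the executable available in your session.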

Let’s investigate how to use the TMalign executable: How do we specify the inputs? What options or parameters can we use? What would the final command line look like to run TMalign on your input?
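As a rough sketch of what the final command could look like: TM-align takes the two structure files as positional arguments, superposes the first onto the second and prints its report to standard output, so the model goes first and the experimental structure second. The model file name and the output file name below are placeholders, not the actual names in the example folder. Running `TMalign -h` on the cluster should list the available options.

```bash
reference="8FTU.pdb"       # experimental structure from the example files
model="model_001.pdb"      # hypothetical name: replace with one of your model files
output="${model%.pdb}_vs_${reference%.pdb}.txt"   # made-up naming scheme for the report

if command -v TMalign >/dev/null 2>&1; then
    # Align the model onto the reference and keep the report in a text file.
    TMalign "$model" "$reference" > "$output"
else
    # Outside the cluster (or before "module load"), just show the command.
    echo "TMalign not found; would run: TMalign $model $reference > $output"
fi
```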

Let’s move to your example_tmalign subdirectory first and write the slurm_script.sh in there.
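A minimal sketch of such a script is given below, written to disk with a here-document so that you can copy and adapt it. Several things in it are assumptions: the module name `tmalign`, the model file name, the output file name and the resource values, which are deliberately generous first guesses. Your cluster may also require options that are not shown here (for example a partition).

```bash
# Run this from inside your example_tmalign directory.
cat > slurm_script.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=tmalign
#SBATCH --cpus-per-task=1        # first guess, to be refined later
#SBATCH --mem=1G                 # first guess, to be refined later
#SBATCH --time=00:10:00          # first guess, to be refined later
#SBATCH --output=slurm-%j.out    # %j is replaced by the job ID

# Assumed module name: use the one you found with the module command.
module load tmalign

# model_001.pdb is a placeholder for one of the AlphaFold model files.
TMalign model_001.pdb 8FTU.pdb > model_001_vs_8FTU.txt
EOF

# Check the syntax without running anything.
bash -n slurm_script.sh && echo "slurm_script.sh looks syntactically valid"
```

The script would then be submitted with `sbatch slurm_script.sh` and followed with `squeue`.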

Analyse your actual resource consumption: How much memory did you effectively use while running the job? How long did your job take to finish? How much CPU percentage did you use?

This is useful to know in order to adapt the resources you ask for in future, similar jobs (e.g. other TMalign submissions).
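To make the CPU percentage concrete, here is a small worked example. On the cluster, the numbers come from `jobinfo` or, on a standard Slurm installation, from the accounting command `sacct` (shown as a comment because it only works where Slurm is installed). The helper `cpu_percent` and the numbers passed to it are invented for illustration.

```bash
# On the cluster (not run here), Slurm's accounting command reports these values:
#   sacct -j <jobid> --format=JobID,Elapsed,TotalCPU,AllocCPUS,MaxRSS,ReqMem

# CPU usage (%) = 100 * CPU time / (elapsed time * reserved CPUs), all times in seconds.
# Integer arithmetic, so the result is rounded down; elapsed time must be > 0.
cpu_percent() {
    local cpu_seconds=$1 elapsed_seconds=$2 ncpus=$3
    echo $(( 100 * cpu_seconds / (elapsed_seconds * ncpus) ))
}

# Made-up numbers: 8 s of CPU time over 10 s of elapsed time on 1 reserved CPU.
cpu_percent 8 10 1    # prints 80
```

A value far below 100% with several CPUs reserved suggests the job could run with fewer CPUs; similarly, a peak memory (MaxRSS) far below the requested memory suggests asking for less.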

Now try optimising your script to reserve no more than the resources you actually need to run TMalign. Your colleagues will be thankful ;-)

Bonus - What about other structures?

Can you see a way to adapt your job submission script to run TM-align on several pairs of structures without manually typing every single command line?

What we would like to do is to run TM-align on each model versus the experimental reference structure:

To submit 1 vs all, there are (at least) 2 solutions (other than running each command line individually):

If you are comfortable with programming, try implementing the for loop on your own. Refer to Task 8 below if you need any tips. Job arrays will be detailed in Task 9.
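For the for loop, one possible sketch is shown below. It loops over every `.pdb` file in the current directory except the reference, so it does not depend on how the model files are named. It only prints the commands (a dry run); the function name `list_commands` and the output naming scheme are made up for this example.

```bash
# Print one TMalign command per model found in the current directory.
list_commands() {
    local reference=$1 model
    for model in *.pdb; do
        [ -e "$model" ] || continue               # no .pdb file at all: nothing to do
        [ "$model" = "$reference" ] && continue   # do not align the reference onto itself
        echo "TMalign $model $reference > ${model%.pdb}_vs_${reference%.pdb}.txt"
    done
}

list_commands 8FTU.pdb
```

In a submission script, the `echo` and its quotes would be dropped so that each command is actually run, one after the other, within the same job.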

First a little context:

What are job arrays?

In Slurm, you can use job arrays. A job array is a set of jobs that share the same parameters (e.g. number of CPUs, amount of memory etc.) but each work on a different input. A job array runs as a collection of related yet separate basic jobs that might be distributed across multiple hosts and might run concurrently (instead of sequentially).

Why use job arrays?

The advantage of job arrays in this case is their parallelism: TMalign runs on each pair of structures in parallel, whereas in a for loop the commands run sequentially within a single job. This is particularly useful when your individual commands take a while to run. Here, it’s not that much different from using a for loop because TMalign is fast and we only have 5 comparisons to do. On the other hand, the disadvantage of job arrays is the constraints they put on the input names, but this can be solved quite easily through work-arounds, as you’ll see below.

2 important parameters in job arrays:

- the Slurm option --array=start-stop: a range of integers
- the variable “$SLURM_ARRAY_TASK_ID”, which corresponds to the job index and takes the values in the start-stop range defined above

For example, if you add:

Slurm option	this will run	value of `$SLURM_ARRAY_TASK_ID` in these jobs
`#SBATCH --array=1-3`	3 individual jobs	1 in the first job, 2 in the second job, 3 in the third job
`#SBATCH --array=6-9`	4 individual jobs	6 in the first job, 7 in the second job, 8 in the third job, 9 in the fourth job

All jobs within a job array run the same Slurm submission script with the same Slurm parameters (i.e. the same resources asked). It’s up to you to play with the “$SLURM_ARRAY_TASK_ID” variable (=the job index) to avoid each job doing exactly the same thing.

In the TM-align context, “$SLURM_ARRAY_TASK_ID” could reflect the rank of our SEC39 models and would be used to specify a different input for each job within the job array.

Given this information, can you try adjusting your previous script to use job arrays instead?
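One way to approach this is sketched below: the task index is zero-padded with `printf` to rebuild the rank (1 becomes 001) and then inserted into the file name. The `model_NNN.pdb` naming scheme is a placeholder, so adapt it to the real file names. The snippet falls back to index 1 when `$SLURM_ARRAY_TASK_ID` is not set, so it can be tried outside Slurm, and it only prints the command instead of running it.

```bash
#!/bin/bash
#SBATCH --array=1-5                 # one task per model rank
#SBATCH --output=slurm-%A_%a.out    # %A = job array ID, %a = task index

# Fall back to 1 so the snippet can also be tried outside of Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Zero-pad the index to three digits: 1 -> 001, 2 -> 002, ...
rank=$(printf '%03d' "$task_id")
model="model_${rank}.pdb"           # hypothetical naming scheme
reference="8FTU.pdb"

# Dry run: remove the echo and its quotes to actually run TMalign.
echo "task $task_id: TMalign $model $reference > ${model%.pdb}_vs_8FTU.txt"
```

Each of the 5 tasks runs this same script with its own index, so each one picks a different model and writes to a different output file.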

Bonus on job arrays - Tricks on how to use $SLURM_ARRAY_TASK_ID even when you don’t have numbers in your input names

As you realised above, the particularity with jobs within a job array is that you have to try and individualise what they’re doing using only the job index ($SLURM_ARRAY_TASK_ID variable).
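A common work-around, sketched here under the assumption that you first write the input names to a text file (one per line), is to use the job index as a line number in that file. The helper name `pick_input` and the file names are invented for this example.

```bash
# Return the Nth line of a list file, N being the job index.
pick_input() {
    local list_file=$1 task_id=$2
    sed -n "${task_id}p" "$list_file"
}

# Small demonstration with a throwaway list of made-up file names.
list=$(mktemp)
printf '%s\n' alpha.pdb beta.pdb gamma.pdb > "$list"
pick_input "$list" 2    # prints beta.pdb
rm -f "$list"
```

In a submission script, this could become `model=$(pick_input models.txt "$SLURM_ARRAY_TASK_ID")`, with the array range set to the number of lines in the list (for instance `--array=1-3` for three inputs).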

Take home message

When discovering a new tool and wanting to use it on the cluster:

search for your software with module
write the commands that you want to run within a bash script, don’t forget module load at the top
Slurm options can be specified within this script using #SBATCH-prefixed lines (see Slurm cheat sheet for a list of all options)
submit the script with sbatch
follow your job with squeue
check if your job worked by searching for expected output files & looking at the slurm log
analyse the resources used with jobinfo, for future use

Exercise 2, case study C - Comparing protein structures with TM-align

Foreword

Context

Objectives

Setup

Connect to cluster

Fetch example files

Step-by-step

Task 1: Locate the TM-align software

Task 2: Determine how to use the `TMalign` executable

Task 3: Write your submission script

Task 4: Submit your script to the cluster

Task 5: Check if your job finished correctly

Task 6: Analyse your resource consumption to optimise your next run

Task 7: Adjust the submission script with the previously-identified & required resources

Bonus - What about other structures?

Task 8: Adapt your script to use a “for” loop to run all comparisons

Task 9: Adapt your script to use job arrays to run all comparisons

Task 10: Submit, follow & check the outputs of your job

Bonus on job arrays - Tricks on how to use `$SLURM_ARRAY_TASK_ID` even when you don’t have numbers in your input names
