BIOI2-Training – Cluster – Exercise 4 – BIOI2 – Integrative BIOInformatics platforme

Getting started with the I2BC cluster

Exercise 4 - Using conda

Instructions: In this exercise, you will learn how to install and use conda, a useful tool to manage programmes and their environments.

Context: We would like to use the seqkit tool to get statistics (number of sequences, average sequence length, sequence type, etc.) on the FASTA input file from Exercise 2. As you will quickly see, seqkit isn’t installed on the cluster, so we will install it ourselves using conda.

Note: Files that go with the examples mentioned in this training session are in https://forge.i2bc.paris-saclay.fr/redmine/projects/partage-bioinfo/repository/cluster_usage_examples. You can access them by cloning the repository: https://forge.i2bc.paris-saclay.fr/git/partage-bioinfo/cluster_usage_examples.git

Step 0 - Connect to the I2BC cluster

If you don’t already have a session open on the Frontale (the master node) of the cluster, please open one, as the remaining steps are performed on the cluster. If you don’t know how to connect, refer to the previous section.

Step 1 - Fetch the input files

We will work in your home directory (see this page for more information on the file spaces accessible from the cluster). Let’s move to it and fetch our working files from the Forge website on the command line using git:

john.doe@cluster-i2bc:~$ cd /home/john.doe
john.doe@cluster-i2bc:/home/john.doe$ git clone https://forge.i2bc.paris-saclay.fr/git/partage-bioinfo/cluster_usage_examples.git
john.doe@cluster-i2bc:/home/john.doe$ ls cluster_usage_examples/
example_fastqc  example_mafft  example_tmalign

The file we’ll be using is in the example_mafft folder. It’s a file in FASTA format containing full-length protein sequences homologous to the human ASF1 protein, as output by BLAST.

john.doe@cluster-i2bc:/home/john.doe$ ls cluster_usage_examples/example_mafft
ASF1_HUMAN_FL_blastp_output.fasta

Step 2 - Locate the executable

The SeqKit programme executable is called seqkit. Let’s see if we can locate it.

Many programmes are already installed on the cluster. Some are on the Frontale (the master node) but most of them are on the nodes only.

1. If you want to check if a programme is installed, you’ll have to connect to a node first in interactive mode:
```
john.doe@cluster-i2bc:/home/john.doe$ qsub -I
qsub: waiting for job 287169.pbsserver to start
qsub: job 287169.pbsserver ready

john.doe@node01:/home/john.doe$ 
```
  Of note:
  - with the qsub command, you are actually running a job on the cluster with a job identifier.
  - all jobs are dispatched to one of the available nodes of the cluster – in this case, we’re using node01 (NB: cluster-i2bc is the name of the Frontale).
Once on the node, there are several places your executable could be:
1. (the easiest scenario) your executable could already be installed and saved in your system’s $PATH variable, in which case, you should be able to just type the command name as is in the terminal.
2. your executable could be installed in the /opt folder at the root of the node, in which case you’ll have to look for it first.
3. there is also the module command, but not all programmes in /opt are listed there.
Try the above three options (refer to Exercise 1 if you’re having trouble with this). As you will see, seqkit isn’t installed on the cluster. We will be using Conda to install it ourselves.
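
The three checks above can be strung together in a small shell function. This is only a sketch: the function name find_tool and the search depth in /opt are arbitrary choices, and the third check assumes that the cluster’s module command behaves like the usual Environment Modules tool, whose `module avail` listing is written to stderr.

```shell
# find_tool NAME: look for an executable in $PATH, then under /opt,
# then in the module listing (hypothetical helper, not part of the cluster)
find_tool() {
    tool="$1"

    # 1. already in $PATH?
    if command -v "$tool" >/dev/null 2>&1; then
        echo "PATH: $(command -v "$tool")"
        return 0
    fi

    # 2. somewhere under /opt? (the depth limit of 4 is an arbitrary choice)
    hit=$(find /opt -maxdepth 4 \( -type f -o -type l \) -name "$tool" 2>/dev/null | head -n 1 || true)
    if [ -n "$hit" ]; then
        echo "/opt: $hit"
        return 0
    fi

    # 3. listed by the module command? (assumption: the listing goes to stderr)
    if command -v module >/dev/null 2>&1; then
        if module avail 2>&1 | grep -i -- "$tool" >/dev/null; then
            echo "module: $tool appears in 'module avail'"
            return 0
        fi
    fi

    echo "not found: $tool"
    return 1
}

find_tool seqkit || echo "seqkit will have to be installed another way"
```

Remember to run it on a node (after qsub -I), since most programmes are only installed there.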

Step 3 - Why conda? How to install it?

Conda is an open source package management system and environment management system. It helps you to quickly and easily install (without root privileges) and run programmes with their dependencies.

Conda is already installed on the I2BC cluster, but the version differs between the Frontale and the nodes, and neither is very up to date:

john.doe@cluster-i2bc:/home/john.doe$ /opt/anaconda/bin/conda --version
conda 4.9.2
john.doe@cluster-i2bc:/home/john.doe$ qsub -I
qsub: waiting for job 356617.pbsserver to start
qsub: job 356617.pbsserver ready

john.doe@node01:/home/john.doe$ /opt/miniconda2/bin/conda --version
conda 4.6.14

That is why we will be installing our own version of conda in our home directory on the I2BC cluster. In this exercise, we will be installing Miniconda, a free minimal installer for conda. We’ll be following these instructions:

john.doe@cluster-i2bc:/home/john.doe$ mkdir -p ~/miniconda3
john.doe@cluster-i2bc:/home/john.doe$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
john.doe@cluster-i2bc:/home/john.doe$ bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
john.doe@cluster-i2bc:/home/john.doe$ rm -rf ~/miniconda3/miniconda.sh

Optionally, you can also initialise conda so that it is automatically activated when you connect to the Frontale and the nodes, but this might slow down your session a bit every time you connect to the cluster or one of the nodes. If you would like it to activate automatically, you can run:

john.doe@cluster-i2bc:/home/john.doe$ ~/miniconda3/bin/conda init bash

Alternatively, you can activate it only when you need it, like so:

john.doe@cluster-i2bc:/home/john.doe$ source ~/miniconda3/bin/activate
(base) john.doe@cluster-i2bc:/home/john.doe$

As you can see, the prefix in your terminal changes upon activation of conda: you now have “(base)” indicating that you’re in the “base” environment of conda.

Of note: installing conda or miniconda only has to be done once.
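
Since the installation only has to be done once, it can be handy to check whether it is already there before repeating the steps. A minimal sketch, assuming the ~/miniconda3 location used above (the function name check_conda is made up for this example):

```shell
# check_conda DIR: report whether a conda executable exists under DIR
# (DIR defaults to the install location used in this exercise)
check_conda() {
    dir="${1:-$HOME/miniconda3}"
    if [ -x "$dir/bin/conda" ]; then
        echo "conda found in $dir: $("$dir/bin/conda" --version 2>&1)"
    else
        echo "no conda in $dir, run the installation steps first"
        return 1
    fi
}

check_conda "$HOME/miniconda3" || true
```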

Step 4 - How to install a programme with conda?

Conda works with environments. Once you activate conda, you’re in its base environment. It’s a good habit not to use the base environment but to create an environment that is specific to a project or to a tool (or set of tools) you want to use.

If you want to install a programme through conda, it has to be available from a conda channel. If you look up seqkit, you’ll find it in the bioconda channel (among others): https://anaconda.org/bioconda/seqkit. The package page also tells you how to install it (usually very straightforward).

However, it’s a good habit (aka FAIR practices) to install your programme in a dedicated environment and to create this environment using a configuration file. In our case, we will create a configuration file called condaEnv_seqkit.yml and its content will look like this:

name: seqkit_ce 
channels: 
   - conda-forge 
   - bioconda 
dependencies: 
   - bioconda::seqkit=2.3.1

Explanation: here, we’re going to create a new environment called seqkit_ce (the name is totally arbitrary). We would like to use the conda-forge and bioconda channels in conda for our installations (cf. channels section) and in this environment, we want seqkit version 2.3.1 from bioconda (cf. dependencies section).
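
If you prefer not to open a text editor, one possible way of writing this file from the command line is a here-document. This sketch simply reproduces the content shown above in the current directory:

```shell
# write the environment description to condaEnv_seqkit.yml in the current directory
cat > condaEnv_seqkit.yml <<'EOF'
name: seqkit_ce
channels:
  - conda-forge
  - bioconda
dependencies:
  - bioconda::seqkit=2.3.1
EOF

# print the file to check its content
cat condaEnv_seqkit.yml
```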

Next, all we have to do is to create the environment above using the following command (make sure you have conda activated first i.e. you should have “(base)” in your terminal prefix):

(base) john.doe@cluster-i2bc:/home/john.doe$ conda env create -f condaEnv_seqkit.yml

Once the environment is created, you can activate it like this:

(base) john.doe@cluster-i2bc:/home/john.doe$ conda activate seqkit_ce
(seqkit_ce) john.doe@cluster-i2bc:/home/john.doe$

Note: (base) is now replaced by (seqkit_ce) because you’re now in your seqkit_ce environment in which seqkit is installed:

(seqkit_ce) john.doe@cluster-i2bc:/home/john.doe$ seqkit version
seqkit v2.3.1

To exit the environment, you have to deactivate it using the conda deactivate command like this:

(seqkit_ce) john.doe@cluster-i2bc:/home/john.doe$ conda deactivate
(base) john.doe@cluster-i2bc:/home/john.doe$ conda deactivate
john.doe@cluster-i2bc:/home/john.doe$
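
In a script there is no terminal prefix to look at, but conda records the name of the active environment in the CONDA_DEFAULT_ENV variable. A small sketch (the helper name active_env is made up) that prints it, or "none" when no environment is active:

```shell
# active_env: print the name of the active conda environment, or "none"
active_env() {
    echo "${CONDA_DEFAULT_ENV:-none}"
}

echo "active conda environment: $(active_env)"
```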

Step 5 - Run seqkit in a job

The next steps are similar to the previous three exercises, namely:

1. have a look at the help of seqkit to see how to use it (we’ll be using it this way: seqkit stat your_file.fasta > stats.txt)
2. write a submission script
3. submit your script to the cluster
4. check if the job finished correctly and analyse your resource consumption for future runs
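
To get a feel for the kind of numbers you should expect in stats.txt, here is a rough awk sketch that counts sequences and residues in a FASTA file. It is not seqkit and does not reproduce its output format; the function name fasta_stats and the demo sequences are made up for this illustration.

```shell
# fasta_stats FILE: count sequences and residues in a FASTA file (illustration only)
fasta_stats() {
    awk '
        /^>/ { n++; next }                           # header line: one more sequence
        { gsub(/[ \t\r]/, ""); total += length($0) } # sequence line: add its length
        END {
            avg = (n > 0) ? total / n : 0
            printf "num_seqs=%d sum_len=%d avg_len=%.1f\n", n, total, avg
        }
    ' "$1"
}

# tiny made-up FASTA file, just to exercise the function
demo=$(mktemp)
cat > "$demo" <<'EOF'
>seq1
MKV
LLA
>seq2
MKVAA
EOF

fasta_stats "$demo"   # prints: num_seqs=2 sum_len=11 avg_len=5.5
rm -f "$demo"
```

Running it on ASF1_HUMAN_FL_blastp_output.fasta gives a quick cross-check of the sequence count and average length reported by seqkit.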

What is specific to this case is that you have to either create your environment (from the yml configuration file) or activate your already existing environment at the beginning of the script:

#! /bin/bash

#PBS -N my_jobname     
#PBS -q common        
#PBS -l ncpus=1       

# activate your environment
source ~/miniconda3/bin/activate # use this to activate conda if you haven't initialised it
conda activate seqkit_ce

# move to the working directory 
cd /home/john.doe/cluster_usage_examples/example_mafft/ 

# run seqkit
seqkit stat ASF1_HUMAN_FL_blastp_output.fasta > stats.txt
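
The script above assumes that the seqkit_ce environment already exists. The "create it if it is missing" variant could be sketched as follows. The helper names env_in_list and ensure_env are made up, and the check relies on the environment name being the first column of the `conda env list` output; the demonstration at the end runs on a made-up listing so that it works even without conda.

```shell
# env_in_list NAME: read the output of `conda env list` on stdin and
# succeed if an environment called NAME is listed
env_in_list() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# ensure_env NAME YML: create the environment from YML only if it is missing
# (needs conda to be activated first; defined here but not called)
ensure_env() {
    if conda env list | env_in_list "$1"; then
        echo "environment $1 already exists"
    else
        conda env create -f "$2"
    fi
}

# demonstration on a made-up listing (the paths are illustrative)
sample='# conda environments:
#
base                  *  /home/john.doe/miniconda3
seqkit_ce                /home/john.doe/miniconda3/envs/seqkit_ce'

if printf '%s\n' "$sample" | env_in_list seqkit_ce; then
    echo "seqkit_ce is listed"
fi
```

In a submission script, you would call something like ensure_env seqkit_ce /home/john.doe/condaEnv_seqkit.yml right after the source ~/miniconda3/bin/activate line and before conda activate seqkit_ce.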

Next, save the script (here as pbs_script.sh in the example_mafft folder) and submit your job to the cluster as usual:

john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_mafft$ qsub pbs_script.sh
287170.pbsserver
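
Once the job has finished, a quick sanity check is to confirm that stats.txt was written and to look at the job’s error file. The sketch below assumes the default PBS naming, where standard error ends up in a file called jobname.eJOBNUMBER in the submission directory; your cluster may be configured differently, and the function name check_job is made up.

```shell
# check_job JOBNAME JOBNUMBER RESULT: check that RESULT exists and is not empty,
# and show the PBS error file if it contains anything
check_job() {
    jobname="$1"
    jobnum="$2"
    result="$3"
    status=0

    if [ -s "$result" ]; then
        echo "result file $result: $(wc -l < "$result" | tr -d ' ') line(s)"
    else
        echo "result file $result is missing or empty"
        status=1
    fi

    # assumption: default PBS naming for the standard error file
    errfile="$jobname.e$jobnum"
    if [ -s "$errfile" ]; then
        echo "messages in $errfile:"
        cat "$errfile"
    fi

    return $status
}

check_job my_jobname 287170 stats.txt || true
```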

For a more detailed step-by-step explanation of conda and its usage combined with snakemake, have a look at the tutorial on the Wiki pages of the Forge I2BC.