Exercise 2 – BIOI2 – Integrative BIOInformatics platforme

Getting started with the I2BC cluster


Exercise 2 - Using conda

Instructions: In this exercise, you will learn how to install and use conda, a useful tool to manage programmes and their environments.

Context: We would like to use the seqkit tool to compute statistics (number of sequences, average sequence length, sequence type, etc.) for the fasta input file used in a previous exercise. As you will quickly see, seqkit isn’t installed on the cluster, so we will install it ourselves using conda.

Note: Files that go with the examples mentioned in this training session are in https://forge.i2bc.paris-saclay.fr/redmine/projects/partage-bioinfo/repository/cluster_usage_examples. You can access them by cloning the repository: https://forge.i2bc.paris-saclay.fr/git/partage-bioinfo/cluster_usage_examples.git

Step 0 - Connect to the I2BC cluster

If you haven’t already got a session open on the Frontale (the master node) of the cluster, please do so as the rest of the steps are performed on the cluster. If you don’t know how to connect, don’t hesitate to refer to the previous section.

Step 1 - Fetch the input files

We will work in your home directory (see this page for more information on the file spaces accessible from the cluster). Let’s move to it and fetch our working files from the Forge website from the command line using git:

john.doe@cluster-i2bc:~$ cd /home/john.doe
john.doe@cluster-i2bc:/home/john.doe$ git clone https://forge.i2bc.paris-saclay.fr/git/partage-bioinfo/cluster_usage_examples.git
john.doe@cluster-i2bc:/home/john.doe$ ls cluster_usage_examples/
example_fastqc  example_mafft  example_tmalign

The file we’ll be using is in the example_mafft folder. It’s a file in fasta format containing full-length protein sequences homologous to the human ASF1 protein, as output by BLAST.

john.doe@cluster-i2bc:/home/john.doe$ ls cluster_usage_examples/example_mafft
ASF1_HUMAN_FL_blastp_output.fasta

Step 2 - Locate the software

The SeqKit programme executable is called seqkit. Let’s see if we can find it among the modules:

john.doe@cluster-i2bc:/home/john.doe$ module avail -C seqkit -i
john.doe@cluster-i2bc:/home/john.doe$
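The module search above only covers software provided as modules. As a complementary check, the sketch below tests whether an executable is reachable on the current PATH. The helper name have_tool is hypothetical (it is not part of the cluster tooling), and sh is listed only as a control that should always be found.

```shell
# Return success if the named executable can be found on the current PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# "sh" is included only as a control that should always be found.
for tool in seqkit sh; do
    if have_tool "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found on PATH"
    fi
done
```

If both the module search and this check come back empty for seqkit, the programme has to be installed before it can be used.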

As you will see, seqkit isn’t installed on the cluster. In cases like these, it’s preferable to ask the SICS to install it for you (so that everyone can benefit from a global installation). However, in some cases, if you’re just trying out new stuff for example, you might want to install it locally first. We will be using Conda to do this.

Step 3 - Why conda/mamba? How to install it?

Conda and Mamba (a fast alternative to Conda) are package managers that work across several programming languages (Python, R, …). They allow you to create separate environments, each containing its own files, packages, package versions and dependencies. As such, they also allow you to control and switch package versions more easily (without root privileges) and in a traceable manner (cf. F.A.I.R. practices). Please visit the dedicated Forge page for more information on the “conda world”.

To install these tools, the easiest way is through the Miniforge distribution, which only downloads and installs the minimum required to function. Of note, the Anaconda installer should be avoided: since 2020, Anaconda’s terms of service have required a paid licence for many organisations. This also applies to the default Anaconda channels (pkgs/main, pkgs/r and pkgs/msys2, grouped under the vague term “defaults”).

Conda is already installed on the I2BC cluster, but the version differs between the Frontale and the nodes, and neither is very up to date:

john.doe@cluster-i2bc:/home/john.doe$ /opt/anaconda/bin/conda --version
conda 4.9.2
john.doe@cluster-i2bc:/home/john.doe$ /opt/miniconda2/bin/conda --version
conda 4.6.14

That is why we will be installing our own version of conda & mamba in our home directory on the I2BC cluster through Miniforge:

john.doe@cluster-i2bc:/home/john.doe$ wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh" -O ~/installer.sh
john.doe@cluster-i2bc:/home/john.doe$ bash ~/installer.sh -p ~/miniforge3
john.doe@cluster-i2bc:/home/john.doe$ rm ~/installer.sh

The installer will ask you:

to read and accept its licence and the licences of the individual packages that will be installed
to specify the location where binaries and environments will be saved, or to accept the default one (e.g. $HOME/miniforge3)
whether you want to update your shell profile to automatically initialise conda/mamba (default: no). If you answer “yes”, you can then run conda config --set auto_activate_base false to avoid automatically activating the base environment at startup
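The download address in the wget command of this step is assembled from the output of uname, so that the installer matches the operating system and processor architecture of the machine. The sketch below only prints the resulting address without downloading anything. The function name installer_url is hypothetical.

```shell
# Build the Miniforge installer URL for the machine this runs on.
installer_url() {
    os=$(uname)          # operating system name, e.g. Linux
    arch=$(uname -m)     # processor architecture, e.g. x86_64
    echo "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-${os}-${arch}.sh"
}

url=$(installer_url)
echo "Installer that would be downloaded: $url"
```

Printing the address first is a cheap way to check that you are about to fetch the right installer for the machine you are logged in to.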

To activate an environment, you just have to type conda/mamba activate followed by the environment name (if you don’t specify a name, it’ll activate the base environment by default):

john.doe@cluster-i2bc:/home/john.doe$ mamba activate
(base) john.doe@cluster-i2bc:/home/john.doe$

As you can see, the prefix in your terminal changes upon activation of an environment: you now have “(base)” indicating that you’re in the “base” environment of conda.
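Inside a script there is no prompt to look at, but conda also records the name of the active environment in the CONDA_DEFAULT_ENV variable. The sketch below reads it, assuming only that variable. The helper name active_env is hypothetical.

```shell
# Print the name of the active conda environment, or "none" if there is none.
active_env() {
    echo "${CONDA_DEFAULT_ENV:-none}"
}

echo "Active conda environment: $(active_env)"
```

A line like this near the top of a job script leaves a trace in the job output of which environment was actually in use.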

Of note: installing conda or mamba only has to be done once.

Step 4 - How to install a programme with conda?

Conda and Mamba work with environments. They both have base environments but it’s a good habit not to use this base environment and to create an environment that is specific to a project or to a tool (or set of tools) that you want to use instead.

If you want to install a programme through conda or mamba, it has to exist in a channel. You can search for the channels that provide it in the Anaconda repository or Pixi’s repository. If you look up seqkit, you’ll find it in the bioconda channel of conda (among others): https://anaconda.org/bioconda/seqkit. The website will also tell you how to install it (usually very straightforward).

However, it’s a good habit (aka FAIR practices) to install your programme in a dedicated environment and to create this environment using a configuration file. In our case, we will create a configuration file in YAML format called condaEnv_seqkit.yml and its content will look like this:

name: seqkit_ce 
channels: 
   - conda-forge 
   - bioconda 
dependencies: 
   - bioconda::seqkit=2.3.1

Explanation: here, we’re going to create a new environment called seqkit_ce (the name is totally arbitrary). We would like to use the conda-forge and bioconda channels in conda for our installations (cf. channels section) and in this environment, we want seqkit version 2.3.1 from bioconda (cf. dependencies section).
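If you prefer not to open a text editor, one possible way of creating the configuration file directly from the command line is a here-document, as sketched below. The file is written to the current directory, and the package is requested with conda’s channel::package notation.

```shell
env_file="condaEnv_seqkit.yml"

# The quoted 'EOF' keeps the shell from expanding anything inside the file body.
cat > "$env_file" <<'EOF'
name: seqkit_ce
channels:
  - conda-forge
  - bioconda
dependencies:
  - bioconda::seqkit=2.3.1
EOF

# Show the result so that it can be checked by eye.
cat "$env_file"
```

Keeping this file alongside your project (or under version control) is what makes the environment reproducible later on.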

Next, all we have to do is to create the environment above using the following command:

john.doe@cluster-i2bc:/home/john.doe$ mamba env create -f condaEnv_seqkit.yml

Once the environment is created, you can activate it like this:

john.doe@cluster-i2bc:/home/john.doe$ mamba activate seqkit_ce
(seqkit_ce) john.doe@cluster-i2bc:/home/john.doe$

Note: you have (seqkit_ce) because you’re now in your seqkit_ce environment in which seqkit is installed:

(seqkit_ce) john.doe@cluster-i2bc:/home/john.doe$ seqkit version
seqkit v2.3.1

To exit the environment, deactivate it with the conda/mamba deactivate command, like this:

(seqkit_ce) john.doe@cluster-i2bc:/home/john.doe$ mamba deactivate
john.doe@cluster-i2bc:/home/john.doe$

Step 5 - Run seqkit in a job

The next steps are similar to those of the previous exercises, namely:

have a look at the help of seqkit to see how to use it (we’ll be using it this way: seqkit stat your_file.fasta > stats.txt)
write a submission script
submit your script to the cluster
check if the job finished correctly and analyse your resource consumption for future runs

The particularity in this case is that, at the beginning of the script, you have to either create your environment (from the yml configuration file) or activate it if it already exists:

#! /bin/bash

#PBS -N my_jobname     
#PBS -q common        
#PBS -l ncpus=1       

# activate your environment
source ~/miniforge3/bin/activate # use this to activate conda if you haven't initialised it
mamba activate seqkit_ce

# move to the working directory 
cd /home/john.doe/cluster_usage_examples/example_mafft/ 

# run seqkit
seqkit stat ASF1_HUMAN_FL_blastp_output.fasta > stats.txt
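The “create or activate” choice mentioned above could be decided automatically, for instance as in the sketch below. It assumes that the first column of the conda env list output is the environment name and that condaEnv_seqkit.yml sits in the current directory. The helper name env_in_list is hypothetical, and the script only reports what should be done rather than doing it.

```shell
# Succeed if the environment named "$1" appears in `conda env list` output
# supplied on standard input (the environment name is the first column).
env_in_list() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

env_name="seqkit_ce"               # name taken from condaEnv_seqkit.yml
env_file="condaEnv_seqkit.yml"     # assumed to be in the current directory

if ! command -v conda >/dev/null 2>&1; then
    echo "conda is not on PATH; activate or install Miniforge first"
elif conda env list | env_in_list "$env_name"; then
    echo "environment $env_name already exists; it only needs activating"
else
    echo "environment $env_name is missing; run: conda env create -f $env_file"
fi
```

Creating the environment once, outside of the job, and only activating it inside the submission script avoids repeating the installation at every run.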

Next, save the script (here as pbs_script.sh) and submit your job to the cluster as usual:

john.doe@cluster-i2bc:/home/john.doe/cluster_usage_examples/example_mafft$ qsub pbs_script.sh
287170.pbsserver
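Once the job has finished, a quick sanity check on stats.txt is to recompute the main figures independently. The sketch below is not how seqkit works internally. It is a plain awk approximation, run here on a tiny made-up fasta file, that counts the sequences and computes their total and average length. The function name fasta_stats and the two sequences are hypothetical.

```shell
# Tiny made-up FASTA file (hypothetical sequences, for illustration only).
tmp_fasta=$(mktemp)
cat > "$tmp_fasta" <<'EOF'
>seq1
MKVL
AA
>seq2
MKV
EOF

# Count header lines and sum the residues found on all the other lines.
fasta_stats() {
    awk '
        /^>/ { n++; next }                               # one header per sequence
        { gsub(/[ \t\r]/, ""); total += length($0) }     # residues, whitespace ignored
        END {
            avg = (n > 0) ? total / n : 0
            printf "num_seqs=%d sum_len=%d avg_len=%.1f\n", n, total, avg
        }
    ' "$1"
}

stats_line=$(fasta_stats "$tmp_fasta")
echo "$stats_line"     # prints: num_seqs=2 sum_len=9 avg_len=4.5
rm -f "$tmp_fasta"
```

To compare with your job output, call fasta_stats on ASF1_HUMAN_FL_blastp_output.fasta and check the numbers against the corresponding columns of stats.txt.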

For a more detailed step-by-step explanation of conda and its usage combined with snakemake, have a look at the tutorial on the Wiki pages of the Forge I2BC.

Note that you can also opt to install micromamba instead. Installation instructions and an explanation of how Micromamba compares with Conda and Mamba can be found on the Forge I2BC website.