Exercise 3 - Using “conda” environments

Authors: BIOI2 & I2BC members

Last updated: 2025-03-13

Foreword

In this exercise, you will learn how to install and use “conda”, a useful tool to manage programmes and their environments.

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Context

We would like to use the seqkit tool to get the statistics (number of sequences, average sequence length, sequence type etc.) of the fasta file input from Case study A (Exercise 2A). As you will quickly see, seqkit isn’t installed on the cluster. Although it is advised to send an email to the SICS to ask for a quick installation, in some contexts, it might be useful to test the software first. We will be installing it ourselves on the cluster using “conda” (micromamba in our case).

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Objectives

In this case study, you will go through every step needed to use a given piece of software on the I2BC cluster when it isn’t already installed there.

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Setup

Connect to cluster

It’s the same as for Exercise 0: you should be connected to the cluster and on the master node (i.e. slurmlogin should be written in your terminal prefix).

Fetch example files

It’s the same as for the previous case study, Exercise 2A. You can skip this step if you’ve already done it.

If not:

You will need the example files available in Zenodo under this link, or on the Forge Logicielle under this link for those who are familiar with git.

We’ll work in your home directory. Let’s move to it and fetch our working files using wget:

john.doe@slurmlogin:~$ cd /home/john.doe
john.doe@slurmlogin:/home/john.doe$ wget "https://zenodo.org/records/15017630/files/cluster_usage_examples.tar.gz?download=1" -O cluster_usage_examples.tar.gz
john.doe@slurmlogin:/home/john.doe$ tar -zxf cluster_usage_examples.tar.gz
john.doe@slurmlogin:/home/john.doe$ ls cluster_usage_examples/
example_fastqc  example_mafft  example_tmalign

In the example_mafft folder, you’ll see a fasta file with unaligned protein sequences. This is the file we’ll analyse with seqkit.

john.doe@slurmlogin:/home/john.doe$ ls cluster_usage_examples/example_mafft/
ASF1_HUMAN_FL_blastp_output.fasta

⁕ ⁕ ⁕ ⁕ ⁕ ⁕

Step-by-step

Task 1: Locate the `seqkit` software

The SeqKit programme executable is called seqkit. Try to find it using the module command.

Hints:

The module command can be used from anywhere on the cluster. The main sub-commands are:

module avail: to list all available software
module load/unload <software name>: to load specific software (for use)
module list: to list currently loaded software

To get more details on options for these subcommands (e.g. to search for a specific name), you can use the -h option to get the help page.

Answer:

john.doe@slurmlogin:/home/john.doe$ module avail -C seqkit -i
john.doe@slurmlogin:/home/john.doe$

In the above command, we used the -C option to specify a pattern to search for (“seqkit”) and -i to make the search case-insensitive. The command prints nothing because no module matches the pattern.

As you will see, seqkit isn’t installed on the cluster.
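As a complementary check, you can also ask the shell whether an executable is already reachable through your PATH. The snippet below is only a sketch: check_tool is a hypothetical helper name, and it does not replace module avail since it only sees software that is already loaded or on your PATH.

```shell
# check_tool (hypothetical helper): report whether a command is on the PATH
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found"
    fi
}

# Before installation, this is expected to report that seqkit is missing
check_tool seqkit
```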

In cases like these, it’s preferable to ask the SICS to install it for you (so that everyone can benefit from a global installation). However, in some cases, if you’re just trying out new stuff for example, you might want to install it locally first. We will be using Conda to do this.

Why conda?

Conda is a package manager for software written in several different programming languages (Python, R, …). Several “versions” of it exist, including conda, miniconda, mamba and micromamba (in increasing order of efficiency and speed).

They all allow you to create separate environments, each containing their own files, packages/package versions, and their dependencies. As such, they also allow you to control and switch package versions more easily (and without root privileges) and in a traceable manner (cf. F.A.I.R. practices). Please visit the dedicated Forge page for more information on the “conda world”.

/!\ Please note that, since 2020, using the Anaconda installer for conda is only authorised under licence. The same applies to the default Anaconda channels (pkgs/main, pkgs/r and pkgs/msys2, grouped under the vague term “defaults”), whichever of the conda “versions” above you use!

Task 2: Install micromamba on the cluster in your home

micromamba is the most lightweight “version” of conda and also one of the most efficient at solving software dependencies. By default, micromamba doesn’t use Anaconda’s default channels but the conda-forge channel instead.

Steps:

Connect to a node (connection is faster)

john.doe@slurmlogin:/home/john.doe$ srun --pty bash
john.doe@node01:/home/john.doe$

Run the micromamba installation command (this could take a little while)

john.doe@node01:/home/john.doe$ "${SHELL}" <(curl -L micro.mamba.pm/install.sh)

It will ask you a few questions. You can accept the default paths it suggests by pressing “Enter” on your keyboard, and you can type “yes” when it asks you if you’d like to initialise your shell.

Restart your shell & check micromamba is installed

john.doe@node01:/home/john.doe$ source ~/.bashrc
john.doe@node01:/home/john.doe$ micromamba --version
2.0.4

Disconnect from the node

john.doe@node01:/home/john.doe$ exit 0
john.doe@slurmlogin:/home/john.doe$

Conditions to install software

The software has to exist in the repositories.

Software is made available through channels, which are in turn hosted on remote repositories. The default channel in micromamba is conda-forge, which has a dedicated repository that you can search through: conda-forge. Another interesting channel is bioconda, as it contains many bioinformatics-related tools. For a more exhaustive search, you can look directly at Pixi’s or Anaconda’s repository (which remain free for the open channels). Both group several channels, including conda-forge and bioconda. The website will also tell you how to install the software (usually very straightforward).

You first have to create an environment in which to install it.

NB: it’s a good habit to create separate environments for separate software or projects

Task 3: Install `seqkit` with micromamba

There are 2 ways of installing software within an environment: either directly through the command line or through a configuration file in YAML format. We’ll do the latter.

Steps:

Connect to a node (connection is faster)

john.doe@slurmlogin:/home/john.doe$ srun --pty bash
john.doe@node01:/home/john.doe$

Create a configuration file called condaEnv_seqkit.yml


name: seqkit_ce
channels:
   - bioconda
dependencies:
   - seqkit=2.3.1

Explanation: here, we’re going to create a new environment called seqkit_ce (the name is totally arbitrary). We would like to use the bioconda channel for our installation (cf. channels section) and in this environment, we want seqkit version 2.3.1 (cf. dependencies section). You can, of course, list several channels and/or several tools in the corresponding sections.
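If you prefer not to open a text editor, the same configuration file can be written from the terminal with a here-document. This is just one possible way of creating it; the content is identical to the file shown above. The commented line at the end sketches what the direct command-line route could look like, assuming you want the same environment name, channel and version.

```shell
# Write the environment description to condaEnv_seqkit.yml in the current directory
cat > condaEnv_seqkit.yml <<'EOF'
name: seqkit_ce
channels:
   - bioconda
dependencies:
   - seqkit=2.3.1
EOF

# Display the file to check its content
cat condaEnv_seqkit.yml

# Command-line alternative (not run here), skipping the YAML file altogether:
#   micromamba create -n seqkit_ce -c bioconda seqkit=2.3.1
```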

Create the environment

john.doe@node01:/home/john.doe$ micromamba env create -f condaEnv_seqkit.yml

Check installation was completed successfully

john.doe@node01:/home/john.doe$ micromamba activate seqkit_ce
(seqkit_ce) john.doe@node01:/home/john.doe$ seqkit version
seqkit v2.3.1

Note: you have (seqkit_ce) in front of your terminal prompt because you’re now in your seqkit_ce environment in which seqkit is installed.

Deactivate the environment when you’ve finished using seqkit

(seqkit_ce) john.doe@node01:/home/john.doe$ micromamba deactivate
john.doe@node01:/home/john.doe$

Task 4: Run `seqkit` within a job

The next steps are similar to those of the previous three case studies (Exercises 2A to 2C), namely:

have a look at the help of seqkit to see how to use it (we’ll be using it this way: seqkit stat your_file.fasta > stats.txt)
write a submission script
submit your script to the cluster
check if the job finished correctly and analyse your resource consumption for future runs

A quick tip: the micromamba command might not be in your PATH variable when you’re running an sbatch job. A workaround is to specify the full path to the micromamba executable. If you’ve used the default paths at installation, it’s probably at your equivalent of /home/john.doe/.local/bin/micromamba. If you’re not sure, type which micromamba in the terminal.
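The lookup described in this tip can be sketched as a small shell function. It is only an illustration: find_micromamba is a hypothetical helper name, and it assumes that the shell initialisation step exported a MAMBA_EXE variable pointing to the executable (recent micromamba versions do this in ~/.bashrc, but check your own file). If the variable is unset, the function falls back to the default installation path.

```shell
# find_micromamba (hypothetical helper): print the full path to the micromamba executable
find_micromamba() {
    # Try MAMBA_EXE first (assumed to be set by the shell init block),
    # then the default installation path in the home directory
    for candidate in "${MAMBA_EXE:-}" "$HOME/.local/bin/micromamba"; do
        if [ -n "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "micromamba executable not found" >&2
    return 1
}

find_micromamba || echo "Install micromamba first (see Task 2)"
```

The path it prints is the one to use in your submission script.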

Answer:

The only subtlety here is that you have to activate the micromamba environment before using seqkit. Your job script could look like this:

#! /bin/bash

#SBATCH --job-name="my_jobname"
#SBATCH --partition=common
#SBATCH --cpus-per-task=1

### prefix start - create temporary directory for your job
export TMPDIR=$(mktemp -d)
### prefix end


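# activate your environment
# (the shell hook defines the "micromamba activate" function in non-interactive scripts)
eval "$(/home/john.doe/.local/bin/micromamba shell hook --shell bash)"
micromamba activate seqkit_ce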

# move to the working directory
cd /home/john.doe/cluster_usage_examples/example_mafft/

# run seqkit
seqkit stat ASF1_HUMAN_FL_blastp_output.fasta > stats.txt


### suffix start - delete created temporary directory
rm -rf "$TMPDIR"
### suffix end
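The prefix and suffix lines of the script above create and delete a temporary directory. If the job stops half-way through, the suffix is never reached and the directory stays behind. One possible way around this, shown here as a sketch rather than as the cluster’s recommended practice, is to register the clean-up with trap so that it runs whenever the shell exits. with_job_tmpdir is a hypothetical helper name.

```shell
# with_job_tmpdir (hypothetical helper): run a command with a private TMPDIR
# that is deleted when the command ends, whether it succeeded or not.
# The body is in parentheses, so everything happens in a subshell.
with_job_tmpdir() (
    TMPDIR=$(mktemp -d) || exit 1
    export TMPDIR
    # Clean up on exit of the subshell
    trap 'rm -rf "$TMPDIR"' EXIT
    "$@"
    status=$?
    exit "$status"
)

# Example: the command sees the temporary directory through $TMPDIR
with_job_tmpdir sh -c 'echo "scratch directory: $TMPDIR"'
```

In a job script, the same trap line placed right after the mktemp call would play the role of the suffix.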

Next, you can submit your job to the cluster as usual:

john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ sbatch slurm_script.sh
Submitted batch job 287170
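To follow up on the job, you will need the job ID printed by sbatch. The sketch below extracts it from the confirmation line; job_id_from_sbatch is a hypothetical helper name, and the confirmation line is simulated with echo so that the snippet can be tried anywhere. The commented sacct line shows how the ID could then be used on the cluster to check the job state and resource consumption.

```shell
# job_id_from_sbatch (hypothetical helper): read sbatch's confirmation line
# on standard input and print the job ID (the 4th word)
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Simulated sbatch output; on the cluster you would pipe sbatch itself:
#   jobid=$(sbatch slurm_script.sh | job_id_from_sbatch)
jobid=$(echo "Submitted batch job 287170" | job_id_from_sbatch)
echo "$jobid"

# Once the job has finished (cluster only):
#   sacct -j "$jobid" --format=JobID,JobName,State,Elapsed,MaxRSS
```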

Task 5: Delete your environment

The environment you’ve created will stay in your home directory until you delete it. It’s up to you to remove it when you no longer need it.

To do so: micromamba env remove -n seqkit_ce

🔗 Back to exercise page
