Last updated: 2025-03-13
In this exercise, you will learn how to install and use “conda”, a useful tool to manage programmes and their environments.
⁕ ⁕ ⁕ ⁕ ⁕ ⁕
We would like to use the seqkit tool to get the statistics (number of sequences, average sequence length, sequence type, etc.) of the fasta input file from Case study A (Exercise 2A). As you will quickly see, seqkit isn’t installed on the cluster.
Although it is advised to send an email to the SICS to ask for a quick
installation, in some contexts, it might be useful to test the software
first. We will be installing it ourselves on the cluster using “conda”
(micromamba in our case).
⁕ ⁕ ⁕ ⁕ ⁕ ⁕
In this case study, you will go step by step through how to use a piece of software on the I2BC cluster when it isn’t already installed there.
⁕ ⁕ ⁕ ⁕ ⁕ ⁕
It’s the same as for Exercise 0: you should be connected to the cluster and on the master node (i.e. slurmlogin should be written in your terminal prefix).
It’s the same as for the previous case study, Exercise 2A. You can skip this step if you’ve already done it.
If not:
You will need the example files available in Zenodo under this link, or on the Forge Logicielle under this link for those who are familiar with git.
We’ll work in your home directory. Let’s move to it and fetch our working files using wget:
john.doe@slurmlogin:~$ cd /home/john.doe
john.doe@slurmlogin:/home/john.doe$ wget "https://zenodo.org/records/15017630/files/cluster_usage_examples.tar.gz?download=1" -O cluster_usage_examples.tar.gz
john.doe@slurmlogin:/home/john.doe$ tar -zxf cluster_usage_examples.tar.gz
john.doe@slurmlogin:/home/john.doe$ ls cluster_usage_examples/
example_fastqc example_mafft example_tmalign
In the example_mafft folder, you’ll see a fasta file with unaligned protein sequences. It was the input for the mafft programme in Exercise 2A; here, we’ll run seqkit on it.
john.doe@slurmlogin:/home/john.doe$ ls cluster_usage_examples/example_mafft/
ASF1_HUMAN_FL_blastp_output.fasta
⁕ ⁕ ⁕ ⁕ ⁕ ⁕
seqkit software
The SeqKit programme executable is called seqkit. Try to find it using the module command.
The module command can be used from anywhere on the cluster. The main sub-commands are:
module avail: to list all available software
module load/unload <software name>: to load specific software (for use)
module list: to list currently loaded software
To get more details on options for these subcommands (e.g. to search for a specific name), you can use the -h option to get the help page.
john.doe@slurmlogin:/home/john.doe$ module avail -C seqkit -i
john.doe@slurmlogin:/home/john.doe$
In the above command, we used the -C option to specify a pattern to search for (“seqkit”) and -i to make the search case-insensitive.
As you will see, seqkit isn’t installed on the cluster.
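As a complementary check, here is a minimal sketch that looks for an executable on your PATH, independently of the module system. It assumes a POSIX-compatible shell; check_tool is a hypothetical helper name, not something provided by the cluster.

```shell
# Report whether an executable can be found on the current PATH.
# check_tool is a hypothetical helper name, not part of the cluster setup.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH"
    fi
}

check_tool seqkit   # expected to be missing before the installation
check_tool ls       # a standard tool, for comparison
```

If seqkit is reported as missing both here and by module avail, it needs to be installed before use.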
In cases like these, it’s preferable to ask the SICS to install it for you (so that everyone can benefit from a global installation). However, in some cases, if you’re just trying out new stuff for example, you might want to install it locally first. We will be using Conda to do this.
Conda is a package manager for several different programming languages (Python, R, …). Several “versions” of it exist, including conda, miniconda, mamba and micromamba (in increasing order of efficiency and speed).
They all allow you to create separate environments, each containing their own files, packages/package versions, and their dependencies. As such, they also allow you to control and switch package versions more easily (and without root privileges) and in a traceable manner (cf. F.A.I.R. practices). Please visit the dedicated Forge page for more information on the “conda world”.
/!\ Please note that, as of 2020, using the Anaconda installer for conda is only authorised under license. The same applies to using the default Anaconda channels (pkgs/main, pkgs/r and pkgs/msys2, grouped under the vague term “defaults”) with any of the conda “versions” above!
micromamba is the most lightweight version of conda and also one of the most efficient at solving software dependencies. By default, micromamba doesn’t use Anaconda’s default channels but the conda-forge channel instead.
john.doe@slurmlogin:/home/john.doe$ srun --pty bash
john.doe@node01:/home/john.doe$
john.doe@node01:/home/john.doe$ "${SHELL}" <(curl -L micro.mamba.pm/install.sh)
It will ask you a few questions. You can leave the default paths it suggests by pressing on “Enter” on your keyboard and you can type “yes” when it asks you if you’d like to initialise your shell.
john.doe@node01:/home/john.doe$ source ~/.bashrc
john.doe@node01:/home/john.doe$ micromamba --version
2.0.4
john.doe@node01:/home/john.doe$ exit 0
john.doe@slurmlogin:/home/john.doe$
Software packages are distributed through channels, which are in turn hosted on remote repositories. The default channel in micromamba is conda-forge, which has a dedicated repository that you can search through: conda-forge; another interesting channel is bioconda, as it contains many bioinformatics-related tools. For a more exhaustive search, you can look directly at Pixi’s or Anaconda’s repository (which remain free for the open channels); these group several channels, including conda-forge and bioconda. The website will also tell you how to install a given package (usually very straightforward).
NB: it’s a good habit to create separate environments for separate software or projects
seqkit with micromamba
There are 2 ways of installing software within an environment: either directly through the command line or through a configuration file in YAML format. We’ll do the latter.
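For comparison, the command-line route could look like the sketch below. It is written as a dry run: the command is only assembled and printed, so nothing gets installed. The variable names are hypothetical; the environment name, channel and version are the ones used in this exercise.

```shell
# Dry run: build and print the one-line equivalent of the YAML-based install.
# ENV_NAME, CHANNEL, PACKAGE and CREATE_CMD are hypothetical variable names.
ENV_NAME="seqkit_ce"
CHANNEL="bioconda"
PACKAGE="seqkit=2.3.1"

CREATE_CMD="micromamba create --yes --name ${ENV_NAME} --channel ${CHANNEL} ${PACKAGE}"
echo "${CREATE_CMD}"
```

The YAML route has the advantage of leaving a file behind that documents exactly what was requested, which helps with traceability.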
john.doe@slurmlogin:/home/john.doe$ srun --pty bash
john.doe@node01:/home/john.doe$
condaEnv_seqkit.yml
name: seqkit_ce
channels:
- bioconda
dependencies:
- seqkit=2.3.1
Explanation: here, we’re going to create a new environment called seqkit_ce (the name is totally arbitrary). We would like to use the bioconda channel for our installation (cf. channels section) and in this environment, we want seqkit version 2.3.1 (cf. dependencies section). You can, of course, list several channels and/or several tools in the corresponding sections.
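If you prefer not to open a text editor, the file can also be written straight from the terminal. The sketch below uses a here-document and writes condaEnv_seqkit.yml to the current directory, with the same content as above.

```shell
# Write the environment description to condaEnv_seqkit.yml in the current
# directory. The quoted 'EOF' stops the shell from expanding anything in
# the file body.
cat > condaEnv_seqkit.yml <<'EOF'
name: seqkit_ce
channels:
  - bioconda
dependencies:
  - seqkit=2.3.1
EOF

# Display the result to check it before creating the environment.
cat condaEnv_seqkit.yml
```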
john.doe@node01:/home/john.doe$ micromamba env create -f condaEnv_seqkit.yml
john.doe@node01:/home/john.doe$ micromamba activate seqkit_ce
(seqkit_ce) john.doe@node01:/home/john.doe$ seqkit version
seqkit v2.3.0
Note: you have (seqkit_ce) in front of your terminal prompt because you’re now in your seqkit_ce environment, in which seqkit is installed.
seqkit
(seqkit_ce) john.doe@node01:/home/john.doe$ micromamba deactivate
john.doe@node01:/home/john.doe$
seqkit within a job
The next steps are similar to the previous three case studies (exercises 2, A to C), namely:
seqkit to see how to use it (we’ll be using it this way: seqkit stat your_file.fasta > stats.txt)
A quick tip: the micromamba command might not be in your PATH variable when you’re running an sbatch job. A workaround is to specify the full path to the micromamba executable. If you’ve used the default paths at installation, it’s probably in your equivalent of /home/john.doe/.local/bin/micromamba. If you’re not sure, type which micromamba in the terminal.
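The lookup described in this tip can be sketched as a small function. It first tries the MAMBA_EXE variable, which the block added to ~/.bashrc by micromamba’s shell initialisation exports, and then falls back to the installer’s default location. find_micromamba is a hypothetical helper name, and the fallback path is an assumption that only holds if you kept the default paths.

```shell
# Print the full path to the micromamba executable, or fail if none is found.
# find_micromamba is a hypothetical helper name.
find_micromamba() {
    # MAMBA_EXE is exported by the initialisation block in ~/.bashrc
    if [ -n "${MAMBA_EXE:-}" ] && [ -x "${MAMBA_EXE}" ]; then
        echo "${MAMBA_EXE}"
    # Assumed default install location; adjust if you chose another path
    elif [ -x "${HOME}/.local/bin/micromamba" ]; then
        echo "${HOME}/.local/bin/micromamba"
    else
        echo "micromamba executable not found" >&2
        return 1
    fi
}

if MICROMAMBA_PATH=$(find_micromamba); then
    echo "micromamba executable: ${MICROMAMBA_PATH}"
else
    echo "Install micromamba first, or adjust the fallback path."
fi
```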
The only subtlety here is that you have to activate the micromamba environment before using seqkit. Your job script could look like this:
#! /bin/bash
#SBATCH --job-name="my_jobname"
#SBATCH --partition=common
#SBATCH --cpus-per-task=1
### prefix start - create temporary directory for your job
export TMPDIR=$(mktemp -d)
### prefix end
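# activate your environment: load the micromamba shell hook first, because
# calling the executable directly cannot modify the shell of the job
eval "$(/home/john.doe/.local/bin/micromamba shell hook --shell bash)"
micromamba activate seqkit_ce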
# move to the working directory
cd /home/john.doe/cluster_usage_examples/example_mafft/
# run seqkit
seqkit stat ASF1_HUMAN_FL_blastp_output.fasta > stats.txt
### suffix start - delete created temporary directory
rm -rf "$TMPDIR"
### suffix end
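An alternative to activating the environment inside the job script is micromamba run, which executes a single command inside a named environment. The sketch below wraps it in a function; run_in_env and the MICROMAMBA variable are hypothetical names, and the default path assumes that the installer’s default location was kept.

```shell
# Path to the micromamba executable (assumed default install location).
MICROMAMBA="${MICROMAMBA:-${HOME}/.local/bin/micromamba}"

# run_in_env is a hypothetical wrapper: it runs a command inside the named
# environment through "micromamba run", without activating the environment.
run_in_env() {
    env_name="$1"
    shift
    "${MICROMAMBA}" run --name "${env_name}" "$@"
}

# Usage inside a job script (commented out so the sketch has no side effects):
# run_in_env seqkit_ce seqkit stat ASF1_HUMAN_FL_blastp_output.fasta > stats.txt
```

This keeps the job script independent of any shell initialisation, at the cost of prefixing every command that needs the environment.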
Next, you can submit your job to the cluster as usual:
john.doe@slurmlogin:/home/john.doe/cluster_usage_examples/example_mafft$ sbatch slurm_script.sh
Submitted batch job 287170
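Once the job has finished, stats.txt should report, among other things, the number of sequences and their average length. As a rough cross-check that doesn’t depend on seqkit, the sketch below computes the same kind of figures with awk on a tiny made-up fasta file. fasta_summary is a hypothetical helper name, and the field names simply mirror seqkit’s num_seqs, sum_len and avg_len columns.

```shell
# Create a tiny fasta file (made-up sequences, for illustration only).
TOY_FASTA=$(mktemp)
cat > "${TOY_FASTA}" <<'EOF'
>seq1
MKTAYIAK
QRQ
>seq2
MKV
EOF

# Header lines start with ">"; every other non-empty line adds its length
# to the running total. fasta_summary is a hypothetical helper name.
fasta_summary() {
    awk '
        /^>/ { n++; next }
        NF   { total += length($0) }
        END  {
            avg = (n > 0) ? total / n : 0
            printf "num_seqs=%d sum_len=%d avg_len=%.1f\n", n, total, avg
        }
    ' "$1"
}

fasta_summary "${TOY_FASTA}"   # prints: num_seqs=2 sum_len=14 avg_len=7.0
```

To run it on the real data, call fasta_summary on ASF1_HUMAN_FL_blastp_output.fasta and compare the figures with the content of stats.txt.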
The environment you’ve created will stay forever in your Home directory. It’s up to you to delete it when you don’t have any use for it anymore.
To do so: micromamba env remove -n seqkit_ce
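Before removing anything, it can help to see how much space each environment takes up. The sketch below assumes that environments live in an envs folder under micromamba’s root prefix, and that the root prefix is ~/micromamba when MAMBA_ROOT_PREFIX isn’t set (the installer’s suggested default, as far as we know). list_env_sizes is a hypothetical helper name.

```shell
# Print the disk usage of each environment found under <root prefix>/envs.
# list_env_sizes is a hypothetical helper name.
list_env_sizes() {
    envs_dir="$1/envs"
    if [ -d "${envs_dir}" ]; then
        du -sh "${envs_dir}"/*/ 2>/dev/null || echo "no environments in ${envs_dir}"
    else
        echo "no environments found under ${envs_dir}"
    fi
}

# MAMBA_ROOT_PREFIX is set by the shell initialisation of micromamba;
# the fallback path is an assumption.
list_env_sizes "${MAMBA_ROOT_PREFIX:-${HOME}/micromamba}"
```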
Take home message
The important steps that we’ve seen are:
micromamba is one of the more efficient and light-weight versions
micromamba env create -f condaEnv_seqkit.yml
micromamba activate seqkit_ce (the environment name will show up before your terminal’s prompt)
micromamba deactivate
micromamba env remove -n seqkit_ce