Getting started with Snakemake
In this practical exercise, we will continue building on the Snakefile from Exercises 1A and 1B, which runs FastQC and then MultiQC on a set of RNA-seq data, and adapt it to an HPC environment, namely the I2BC cluster.
Motivation
Up until now, our workflow has run on a single processor, so every job is executed sequentially. This takes time and is frustrating when you know that more resources are available on the cluster, since using more processors reduces computation time in most cases.
There are two ways of scaling up your workflow:
- run multiple jobs in parallel: if you have several inputs and a given rule can process each of them independently of the others (e.g. the fastqc rule in this exercise), then you can run all of these jobs simultaneously instead of sequentially (=> 1 processor per job)
- run steps multithreaded: if the tool used in your rule supports multithreading (e.g. it has an option such as --threads), you can run this rule on more than one processor (=> several processors per job)
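To make the trade-off between these two approaches concrete, here is a small Python sketch (plain Python, not Snakemake code). It assumes the scheduler never runs more threads in total than the number of processors it has been given, and that a job requesting more threads than that is capped to the available number. The function name and the value of 8 processors are purely illustrative.

```python
def concurrent_jobs(total_cores: int, threads_per_job: int) -> int:
    """Return how many jobs fit side by side within a fixed core budget."""
    if total_cores < 1 or threads_per_job < 1:
        raise ValueError("cores and threads must be positive integers")
    # A job asking for more threads than are available is capped to the budget.
    effective_threads = min(threads_per_job, total_cores)
    # Integer division: leftover cores that cannot host a full job stay idle.
    return total_cores // effective_threads


if __name__ == "__main__":
    total_cores = 8  # hypothetical allocation, not a value from the exercise
    for threads in (1, 2, 4, 8):
        jobs = concurrent_jobs(total_cores, threads)
        print(f"{threads} thread(s) per job -> {jobs} job(s) at once")
```

With a fixed number of processors, giving each job more threads leaves room for fewer simultaneous jobs, so the best split depends on how many independent inputs you have and how well the tool scales with threads.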
How this exercise is organised:
As in the previous exercises, each step addresses one objective, so we will go through several cycles of running snakemake, observing the results and improving the code. Each version of the code will be named ex1c_oX.smk, where X is the objective number.
Warning: keep in mind that we’ll be using commands that are specific to the I2BC cluster’s scheduler system (they might be different on other clusters). Also, if you’re not familiar with these commands, don’t hesitate to have a look at the course material of the I2BC cluster training.