Workflow = a set of instructions or operations used to complete a
process (e.g. fill, close & label a bottle):
input: empty bottle
step 1: fill bottle
step 2: put lid on bottle
step 3: stick on label
final output: filled, labelled & closed bottle
What is a workflow?
In bioinformatics, the bottles are data: a workflow analyses
“input data” to produce the final results, the “output data”.
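The bottle example can be sketched in Python as three small functions applied in order. This is only an illustration of the idea of a workflow; the `Bottle` class and the function names are made up for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Bottle:
    """Hypothetical stand-in for the data moving through the workflow."""
    filled: bool = False
    closed: bool = False
    labelled: bool = False


def fill(bottle: Bottle) -> Bottle:
    # step 1: fill bottle
    return replace(bottle, filled=True)


def close(bottle: Bottle) -> Bottle:
    # step 2: put lid on bottle
    return replace(bottle, closed=True)


def label(bottle: Bottle) -> Bottle:
    # step 3: stick on label
    return replace(bottle, labelled=True)


def run_workflow(bottle: Bottle) -> Bottle:
    # Apply every step in order; the output of one step is the input of the next.
    for step in (fill, close, label):
        bottle = step(bottle)
    return bottle


final_bottle = run_workflow(Bottle())  # input: empty bottle
print(final_bottle)
```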
Why use a workflow management system?
The pros:
- minimise the number of manual steps in an analysis
- simplify pipeline development, maintenance, and use, by dealing with:
  - task parallelisation & efficient use of resources
  - resuming of failed runs or steps
  - tracking of parameters and tool versions (= reproducibility)
- make your code:
  - less complex to read and more modular
  - more easily scalable to large sets of data
  - more transportable onto different systems (local PC, HPC or cloud)
The cons:
learning effort…
What workflow management systems?
Many workflow management systems exist & in many forms:
- command line (shell): parallelisation has to be scripted manually, which is not easy
- command line (rules): e.g., …
- graphic interface: e.g., Taverna, Kepler, …
Focus on Snakemake
Today, we’re going to learn how to use Snakemake. Its features:
- works on files (rather than streams, reading/writing from databases, or passing variables in memory)
- is based on Python (but knowing Python is not required)
- has features for defining the environment for each task (running a large number of small third-party tools is common in bioinformatics)
- scales easily from desktop to server, cluster, grid or cloud environments without modifying your initial script (i.e. develop on a laptop using a small subset of data, then run the real analysis on a cluster)
The principle behind Snakemake
Snakemake = Python (aka
“snake”, a programming language) + Make (a rule-based
automation tool)
Workflows are like LEGO bricks:
Workflows are made up of blocks; each block performs
a specific instruction (or set of instructions)
1 “block” = 1 rule:
- 1 rule = 1 instruction (ideally)
- inputs and outputs are one or multiple files
- at least 1 input and/or 1 output per rule
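To make the "1 block = 1 rule" idea concrete, here is a minimal Python sketch of a rule as a named block with input and output files. The `Rule` class and the `bottle.*` file names are hypothetical, and this is not how Snakemake represents rules internally; it only mirrors the three points above.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Hypothetical model of a rule: a name, input files and output files."""
    name: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # A rule with neither inputs nor outputs cannot be linked to anything.
        if not self.inputs and not self.outputs:
            raise ValueError(
                f"rule {self.name!r} needs at least 1 input and/or 1 output"
            )


# One instruction (fill the bottle), one input file, one output file.
fill = Rule("fill", inputs=["bottle.empty"], outputs=["bottle.filled"])
print(fill)
```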
Linking data flows
Rule order is not important…
execution order ≠ code order => Snakemake does a pick & mix of
the rules it needs at execution
…but matching file names is
key!
Rules are linked together by Snakemake using matching filenames in
their input and output directives.
At execution, Snakemake creates a DAG (directed
acyclic graph) that it follows to generate the final output of your
pipeline.
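The linking step can be illustrated with a small Python sketch. It assumes three hypothetical rules (`fill`, `close`, `label`) and made-up `bottle.*` file names, written in the "wrong" order on purpose; the rules are linked purely by matching output and input file names, and Python's standard `graphlib` module then derives an execution order. This only illustrates the principle and is not Snakemake's actual implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical rules, deliberately listed out of execution order.
rules = {
    "label": {"input": ["bottle.closed"], "output": ["bottle.labelled"]},
    "fill": {"input": ["bottle.empty"], "output": ["bottle.filled"]},
    "close": {"input": ["bottle.filled"], "output": ["bottle.closed"]},
}

# Map every output file name to the rule that produces it.
producer = {out: name for name, rule in rules.items() for out in rule["output"]}

# A rule depends on whichever rule produces one of its input files.
dag = {
    name: {producer[f] for f in rule["input"] if f in producer}
    for name, rule in rules.items()
}

# A topological sort gives an order in which dependencies always come first.
execution_order = list(TopologicalSorter(dag).static_order())
print(execution_order)  # ['fill', 'close', 'label']
```

If one of the file names did not match (say `close` read `bottle.full` instead of `bottle.filled`), the link between `fill` and `close` would simply disappear from the graph, which is why matching file names matters so much.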
A workflow example
Below is a workflow example using 2 tools sequentially to check the
quality of NGS data:
In this example, we have:
- 2 linked rules: fastQC and multiQC
- input RNA-seq files named *.fastq.gz
- intermediate files generated by FastQC named *.zip & *.html
- the final output named multiqc_report.html generated by MultiQC
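The file names in this example can be traced with a short Python sketch. The sample names are hypothetical, and the sketch assumes FastQC's usual `<sample>_fastqc.zip` / `<sample>_fastqc.html` output naming; it runs no tool and only shows which names link the two rules.

```python
# Hypothetical sample files; only the naming pattern matters here.
fastq_files = ["sampleA.fastq.gz", "sampleB.fastq.gz"]


def fastqc_outputs(fastq: str) -> list[str]:
    # Strip the .fastq.gz extension to get the sample name.
    sample = fastq.removesuffix(".fastq.gz")
    # Assumed FastQC naming: <sample>_fastqc.zip and <sample>_fastqc.html
    return [f"{sample}_fastqc.zip", f"{sample}_fastqc.html"]


# Intermediate files produced by the fastQC rule, one pair per input file.
intermediate = [out for fq in fastq_files for out in fastqc_outputs(fq)]

# The multiQC rule takes the zip files as input and writes a single report.
multiqc_inputs = [f for f in intermediate if f.endswith(".zip")]
final_output = "multiqc_report.html"

print(multiqc_inputs, "->", final_output)
```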
How Snakemake creates your workflow
Snakemake (Smk) steps:
- Smk creates the DAG from the Snakefile
- Smk sees that the final output multiqc_report.html doesn’t exist but knows it can create it with the multiQC rule
- multiQC needs zip files (which don’t exist) but the fastQC rule can generate them
- fastQC needs fastq.gz files
- fastq.gz files exist! Smk stops backtracking and goes on to execute the fastQC rule
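The backtracking described in these steps can be sketched in a few lines of Python. The rule table, the single sample name and the `plan` helper are hypothetical, and the sketch leaves out wildcards, timestamps and everything else Snakemake really handles; it only shows the "start from the final output and walk back until existing files are found" logic.

```python
# Hypothetical rules and file names modelled on the FastQC/MultiQC example.
rules = {
    "multiQC": {"input": ["sampleA_fastqc.zip"], "output": ["multiqc_report.html"]},
    "fastQC": {"input": ["sampleA.fastq.gz"], "output": ["sampleA_fastqc.zip"]},
}
existing_files = {"sampleA.fastq.gz"}  # what is already on disk


def plan(target: str, jobs: list[str]) -> list[str]:
    """Backtrack from `target` and append the rules to run, dependencies first."""
    if target in existing_files:
        return jobs  # the file exists: stop backtracking
    # Find the rule that can create the missing file.
    for name, rule in rules.items():
        if target in rule["output"]:
            break
    else:
        raise FileNotFoundError(f"no rule produces {target}")
    # First make sure every input of that rule is available...
    for needed in rule["input"]:
        plan(needed, jobs)
    # ...then schedule the rule itself (only once).
    if name not in jobs:
        jobs.append(name)
    return jobs


execution_plan = plan("multiqc_report.html", [])
print(execution_plan)  # ['fastQC', 'multiQC']
```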