Workflow = a set of instructions or operations used to complete a process (e.g. fill, close & label a bottle):
In bioinformatics, the “bottles” are data: a workflow analyses “input data” to produce the final results, the “output data”:
Many workflow management systems exist, in many forms:
Today, we’re going to learn how to use Snakemake. Its features:
Snakemake = Python (aka “snake”, a programming language) + Make (a rule-based automation tool)
Workflows are made up of blocks; each block performs a specific (set of) instruction(s)
- 1 rule = 1 instruction (ideally)
- inputs and outputs are one or multiple files
- at least 1 input and/or 1 output per rule
execution order ≠ code order => Snakemake does a pick & mix of the rules it needs at execution
Rules are linked together by Snakemake using matching filenames in their input and output directives.
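The filename matching described above can be sketched in plain Python. This is only an illustration of the idea, not Snakemake's actual implementation: the rule names (`align`, `merge`) and file names are hypothetical, and the rules are deliberately listed out of order to show that code order does not decide execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical rules, deliberately written "out of order"
rules = {
    "align": {"input": ["merged.fasta"], "output": ["aligned.fasta"]},
    "merge": {"input": ["p1.fasta", "p2.fasta"], "output": ["merged.fasta"]},
}

def link_rules(rules):
    """Map each rule to the set of rules that produce one of its inputs."""
    # Which rule creates which file?
    producer = {f: name for name, r in rules.items() for f in r["output"]}
    # A rule depends on another when one of its inputs is the other's output
    return {
        name: {producer[f] for f in r["input"] if f in producer}
        for name, r in rules.items()
    }

dependencies = link_rules(rules)
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['merge', 'align']
```

`merge` runs first because `align` needs `merged.fasta`, even though `align` is written first in the dictionary.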
At execution, Snakemake creates a DAG (directed acyclic graph) that it follows to generate the final output of your pipeline:
Below is a workflow example using 2 tools sequentially to check the quality of NGS data:
In this example, we have:
- 2 tools: fastQC and multiQC
- input files: *.fastq.gz
- intermediate files: *.zip & *.html
- final output: multiqc_report.html, generated by MultiQC

Snakemake (Smk) steps | running path |
---|---|
Smk creates the DAG from the snakefile | |
Smk sees that the final output multiqc_report.html doesn’t exist but knows it can create it with the multiQC rule | |
multiQC needs zip files (don’t exist) but the fastQC rule can generate them | |
fastQC needs fastq.gz files | |
fastq.gz files exist! Smk stops backtracking and goes on to execute the fastQC rule | |
Snakemake steps | running path |
---|---|
There are 3 sequence files so Smk launches 3 fastQC rules | |
After 3 executions of the fastQC rule, zip files exist and feed the multiQC rule | |
The final output (multiqc_report.html) is generated; the workflow has finished | |
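The two tables above (backtracking from the final output, then executing forward) can be mimicked with a small recursive function. This is a simplified sketch, not how Snakemake is implemented: the sample names (`s1`, `s2`, `s3`), the `_fastqc.zip` naming and the helpers `producer_of` and `plan` are assumptions made for the illustration.

```python
# Files already on disk (hypothetical sample names)
existing = {"s1.fastq.gz", "s2.fastq.gz", "s3.fastq.gz"}

def producer_of(target):
    """Return (rule name, input files) for the rule able to create `target`, or None."""
    if target == "multiqc_report.html":
        samples = sorted(f[: -len(".fastq.gz")] for f in existing if f.endswith(".fastq.gz"))
        return "multiQC", [f"{s}_fastqc.zip" for s in samples]
    if target.endswith("_fastqc.zip"):
        return "fastQC", [target[: -len("_fastqc.zip")] + ".fastq.gz"]
    return None

def plan(target, jobs=None):
    """Backtrack from `target` to existing files; return jobs in execution order."""
    jobs = [] if jobs is None else jobs
    if target in existing:
        return jobs  # the file exists: stop backtracking here
    found = producer_of(target)
    if found is None:
        raise FileNotFoundError(f"no rule can create {target}")
    rule, inputs = found
    for f in inputs:
        plan(f, jobs)  # resolve the inputs first
    jobs.append((rule, target))  # then schedule the rule itself
    return jobs

jobs = plan("multiqc_report.html")
for rule, output in jobs:
    print(rule, "->", output)
```

With 3 sequence files, the plan contains 3 fastQC jobs followed by 1 multiQC job, matching the walk-through above.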
Snakemake’s job is to make sure that everything is up-to-date, otherwise it (re-)runs the rules that need to be run…
Rules are run if:
Many default files constitute the “Snakemake system” & there are standards on how to organise them.
They are not all necessary for a basic pipeline execution.
The most important is the Snakefile: that’s where all the code is saved.
=> Rules usually have a unique name which defines them
=> input, output, shell etc. are called directives
=> input & output specify 1 or more input & output files
=> shell specifies what to do (shell commands in this case -> alternative directives exist)
=> {input} & {output} are placeholders replaced by input & output file names at execution

- shell directives should be given within quotes (', " or """ for multi-line code)
- other directives: params:, resources:, log:, etc. (we’ll see them later)

Example inspired by snakemake_examples/exercise0/Snakefile
- 2 rules: fusionFasta & mafft
- fusionFasta: 2 input files (p1 & p2) & 1 output file
- mafft: 1 input & 1 output file
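To make the {input} and {output} placeholders more concrete, here is a minimal Python sketch of the substitution. It only imitates the basic behaviour (several files are joined with spaces); Snakemake's own formatting is richer. The function `render_shell` and the file names are hypothetical.

```python
def render_shell(template, inputs, outputs):
    """Fill {input} and {output} with space-separated file names."""
    return template.format(input=" ".join(inputs), output=" ".join(outputs))

# A rule with 2 input files and 1 output file (hypothetical names)
cmd = render_shell(
    "cat {input} > {output}",
    inputs=["p1.fasta", "p2.fasta"],
    outputs=["merged.fasta"],
)
print(cmd)  # cat p1.fasta p2.fasta > merged.fasta
```

The shell command is written once in the rule; the actual file names are only filled in at execution.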
When Snakemake is installed (how to install):
- write your rules in a Snakefile
- use snakemake --cores 1 myOutputFile to run the pipeline and generate the myOutputFile output

e.g. (previous example): snakemake --cores 1 alignedSequences.fasta

Other options:
- -s, --snakefile mySmk
- -n, --dryrun
- -p, --printshellcmds
- -r, --reason
- -D
- -j 1 (cores: -c 1)

All Snakemake options can be found here.
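If you want to launch Snakemake from a Python script, the command line can be assembled as a list of arguments. The sketch below only builds and prints the command; the helper `snakemake_command` is hypothetical, and it uses only the options listed above (`--cores`, `--snakefile`, `--dryrun`, `--printshellcmds`).

```python
import shlex

def snakemake_command(target, cores=1, snakefile=None, dryrun=False, printshellcmds=False):
    """Build the argument list for a snakemake call (nothing is executed here)."""
    cmd = ["snakemake", "--cores", str(cores)]
    if snakefile is not None:
        cmd += ["--snakefile", snakefile]
    if dryrun:
        cmd.append("--dryrun")
    if printshellcmds:
        cmd.append("--printshellcmds")
    cmd.append(target)
    return cmd

cmd = snakemake_command("alignedSequences.fasta", dryrun=True, printshellcmds=True)
print(shlex.join(cmd))
# snakemake --cores 1 --dryrun --printshellcmds alignedSequences.fasta

# To really run it (assuming Snakemake is installed and a Snakefile is present):
# import subprocess; subprocess.run(cmd, check=True)
```

Combining a dry run with printed shell commands is a convenient way to check what would be executed before running anything.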
When you run Snakemake, you’ll get a full report printed on the screen of its progress:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count
------- -------
all 1
fastqc 3
multiqc 1
total 5
[...]
5 of 5 steps (100%) done
Complete log: .snakemake/log/2024-02-20T150605.574089.snakemake.log
When it’s finished, a .snakemake folder will appear in your working directory (use ls -a to see it).

So far, we’ve seen:
- how to write a rule (with input and output to specify input & output files):

      rule myRuleName:
          input: "myInputFile"
          output: "myOutputFile"
          shell: "echo {input} > {output}"

- how to run a pipeline with the snakemake --cores 1 command (+ other options available)

The aim of this presentation and its associated exercises is to get you started with Snakemake.
Several exercises of increasing complexity are proposed. The first two are our primary objectives, and the following ones are designed to accommodate learners with varying levels of Snakemake expertise.
Please complete the progression table: it will help us when you ask for help!
Now it’s up to you!
https://bioi2.i2bc.paris-saclay.fr/training/snakemake/exercises/