Introduction to Snakemake

On the schedule today:

Introduction to workflows
Introduction to Snakemake & the concept of rules
Snakemake & Snakefiles
Illustration with a 2-step workflow example

What is a workflow?

Workflow = a set of instructions or operations used to complete a process (e.g. fill, close & label a bottle):

workflow with bottles

input: empty bottle
step 1: fill bottle
step 2: put lid on bottle
step 3: stick on label
final output: filled, labelled & closed bottle

In bioinformatics, the bottles are data: “input data” are analysed to obtain the final results, the “output data”:

workflow without bottles

Why use a workflow management system?

The pros:

minimise the number of manual steps in an analysis
simplify pipeline development, maintenance, and use, by dealing with:
- task parallelisation & efficient use of resources
- resuming of failed runs or steps
- tracking of parameters and tool versions (=reproducibility)
make your code:
- less complex to read and more modular
- more easily scalable to large sets of data
- more transportable onto different systems (local PC, HPC or cloud)

The cons:

learning effort…

What workflow management systems?

Many workflow management systems exist & in many forms:

command line (shell): the parallelisation has to be scripted manually, which is not easy
command line (rules): e.g. Snakemake, Make, …
graphic interface: e.g. Taverna, Kepler, …

Focus on Snakemake

Today, we’re going to learn how to use Snakemake. Its features:

works on files (rather than streams, reading from/writing to databases, or passing variables in memory)
is based on Python (but knowing Python is not required)
has features for defining the environment for each task (running a large number of small third-party tools is common in bioinformatics)
scales easily from desktop to server, cluster, grid or cloud environments without modifying your initial script (i.e. develop on a laptop using a small subset of data, then run the real analysis on a cluster)

The principle behind Snakemake

Snakemake = Python (aka “snake”, a programming language) + Make (a rule-based automation tool)

Workflows are like legos:

Workflows are made up of blocks; each block performs a specific (set of) instruction(s)

workflow divided into rules

1 “block” = 1 rule:

- 1 rule = 1 instruction (ideally)
- inputs and outputs are one or multiple files
- at least 1 input and/or 1 output per rule

Linking data flows

Rule order is not important…

execution order ≠ code order => Snakemake does a pick & mix of the rules it needs at execution

…but matching file names is key!

Rules are linked together by Snakemake using matching filenames in their input and output directives.

2 rules linked together

At execution, Snakemake creates a DAG (directed acyclic graph), which it follows to generate the final output of your pipeline:

2-step pipeline

A workflow example

Below is a workflow example using 2 tools sequentially to check the quality of NGS data:

In this example, we have:

2 linked rules: fastQC and multiQC
input RNAseq files named *.fastq.gz
intermediate files generated by FastQC named *.zip & *.html
the final output named multiqc_report.html generated by MultiQC

How Snakemake creates your workflow (summary)

Snakemake (Smk) steps:

Smk creates the DAG from the Snakefile
Smk sees that the final output multiqc_report.html doesn’t exist but knows it can create it with the multiQC rule
multiQC needs zip files (which don’t exist yet) but the fastQC rule can generate them
fastQC needs fastq.gz files
fastq.gz files exist! Smk stops backtracking and executes the fastQC rule
There are 3 sequence files, so Smk launches the fastQC rule 3 times
After 3 executions of the fastQC rule, the zip files exist and feed the multiQC rule
The final output (multiqc_report.html) is generated: the workflow has finished
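
The backtracking described in these steps can be sketched in plain Python. This is only an illustration of the idea, not Snakemake's actual implementation: the sample names, the `*_fastqc.zip` naming and the helper functions (`multiqc_inputs`, `fastqc_inputs`, `plan_jobs`) are assumptions made for the example, and timestamps are ignored.

```python
import re

# Hypothetical sample names; the example workflow has 3 input files.
SAMPLES = ["sampleA", "sampleB", "sampleC"]


def multiqc_inputs(target):
    """Return the inputs the multiQC rule needs to build `target`, or None."""
    if target == "multiqc_report.html":
        return [f"{sample}_fastqc.zip" for sample in SAMPLES]
    return None


def fastqc_inputs(target):
    """Return the inputs the fastQC rule needs to build `target`, or None."""
    match = re.fullmatch(r"(.+)_fastqc\.zip", target)
    if match:
        return [f"{match.group(1)}.fastq.gz"]
    return None


# Rule order does not matter: the lookup is driven by file names.
RULES = {"multiQC": multiqc_inputs, "fastQC": fastqc_inputs}


def plan_jobs(target, existing_files, jobs=None):
    """Backtrack from `target` and list (rule, output) jobs in execution order."""
    if jobs is None:
        jobs = []
    if target in existing_files:
        return jobs  # the file exists: stop backtracking here
    for rule_name, inputs_for in RULES.items():
        inputs = inputs_for(target)
        if inputs is None:
            continue  # this rule cannot produce the target
        for needed_file in inputs:
            plan_jobs(needed_file, existing_files, jobs)
        if (rule_name, target) not in jobs:
            jobs.append((rule_name, target))
        return jobs
    raise FileNotFoundError(f"no rule produces {target!r} and it does not exist")


existing = {f"{sample}.fastq.gz" for sample in SAMPLES}
for rule_name, output in plan_jobs("multiqc_report.html", existing):
    print(rule_name, "->", output)
```

With only the three fastq.gz files present, the plan contains three fastQC jobs followed by one multiQC job, which mirrors the order described above.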

Rules are run when outputs are missing… but not only then

Snakemake’s job is to make sure that everything is up-to-date, otherwise it (re-)runs the rules that need to be run…

Rules are run if:

output doesn’t exist
output exists but is older than the input
changes detected in parameters, code or tool versions since last execution
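
The first two criteria can be illustrated with a few lines of Python based on file modification times. This is a simplified sketch, not Snakemake's own logic: the function name `needs_run` is hypothetical, and the third criterion (changes in parameters, code or tool versions) is not modelled here.

```python
import os


def needs_run(inputs, output):
    """Return True if `output` is missing or older than any of its inputs."""
    if not os.path.exists(output):
        return True  # criterion 1: the output doesn't exist
    output_mtime = os.path.getmtime(output)
    # criterion 2: at least one input was modified after the output was written
    return any(os.path.getmtime(path) > output_mtime for path in inputs)
```

For instance, `needs_run(["myInputFile"], "myOutputFile")` returns True as long as myOutputFile has not been created yet.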

The Snakemake world

Many default files constitute the “Snakemake system” & there are standards on how to organise them.

They are not all necessary for a basic pipeline execution.

The most important one is the Snakefile: that’s where all the code is saved.

Within the Snakefile…

The Snakefile is where rules are defined
The basic syntax of a rule is:

=> Rules usually have a unique name which defines them
=> input, output, shell etc. are called directives
=> input & output specify 1 or more input & output files
=> shell specifies what to do (shell commands in this case -> alternative directives exist)
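
To give an intuition of how the directives fit together, the sketch below mimics in plain Python how the `{input}` and `{output}` placeholders of a shell command could be filled with the file names of a rule. It is an approximation for illustration only: `render_shell` is a hypothetical helper, and Snakemake's own substitution handles many more cases.

```python
def render_shell(template, inputs, outputs):
    """Fill the {input} and {output} placeholders of a shell command template.

    Several files are joined with spaces, so they can be passed to one command.
    """
    return template.format(input=" ".join(inputs), output=" ".join(outputs))


print(render_shell("echo {input} > {output}", ["myInputFile"], ["myOutputFile"]))
# prints: echo myInputFile > myOutputFile
```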

* Warning *

code alignment (=indentations) is important
file names and shell commands should be given within quotes (', " or """ for multi-line code)
additional & optional directives exist, e.g.: params:, resources:, log:, etc. (we’ll see them later)

Example - pairwise protein sequence alignment

Example inspired from snakemake_examples/exercise0/Snakefile

- 2 rules: fusionFasta & mafft
- fusionFasta: 2 input files (p1 & p2) & 1 output file
- mafft: 1 input & 1 output file
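
The way these two rules get linked can be sketched in plain Python by comparing their input and output file names. The file names `p1.fasta`, `p2.fasta` and `fusedSequences.fasta` are assumptions made for this illustration, and `find_links` is a hypothetical helper, not part of Snakemake.

```python
# Each rule is described by the files it reads and the files it writes.
rules = {
    "fusionFasta": {
        "input": ["p1.fasta", "p2.fasta"],
        "output": ["fusedSequences.fasta"],
    },
    "mafft": {
        "input": ["fusedSequences.fasta"],
        "output": ["alignedSequences.fasta"],
    },
}


def find_links(rules):
    """Return (producer, consumer, shared_files) for rules linked by file names."""
    links = []
    for producer, produced in rules.items():
        for consumer, consumed in rules.items():
            shared = set(produced["output"]) & set(consumed["input"])
            if producer != consumer and shared:
                links.append((producer, consumer, sorted(shared)))
    return links


for producer, consumer, shared in find_links(rules):
    print(f"{producer} -> {consumer} via {', '.join(shared)}")
```

Renaming the output of fusionFasta without updating the input of mafft would break the link, which is why matching file names is key.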

How to run a Snakemake pipeline?

Once Snakemake is installed (see how to install):

move into the directory containing the Snakefile
type snakemake --cores 1 myOutputFile to run the pipeline and generate myOutputFile

e.g. for the previous example: snakemake --cores 1 alignedSequences.fasta

Other useful options

use a Snakefile with a non-default name: -s mySmk (--snakefile mySmk)
dry-run, i.e. do not execute anything, only display what would be done: -n (--dryrun)
print the shell commands that are run: -p (--printshellcmds)
print the reason for each rule execution: -r (--reason)
print a summary and status of the rules: -D (--detailed-summary)
limit the number of jobs run in parallel: -j 1 (cores: -c 1)
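
When Snakemake is launched from a script, the options above can be assembled programmatically. The sketch below only builds and prints the command line; `build_snakemake_command` is a hypothetical helper written for this illustration, and running the command requires Snakemake to be installed.

```python
import shlex


def build_snakemake_command(target, cores=1, snakefile=None,
                            dry_run=False, print_shell=False):
    """Assemble a snakemake command line as a list of arguments."""
    command = ["snakemake", "--cores", str(cores)]
    if snakefile is not None:
        command += ["--snakefile", snakefile]  # non-default Snakefile name
    if dry_run:
        command.append("--dryrun")  # show what would be done, run nothing
    if print_shell:
        command.append("--printshellcmds")  # show the shell commands
    command.append(target)
    return command


command = build_snakemake_command("alignedSequences.fasta",
                                  dry_run=True, print_shell=True)
print(shlex.join(command))
# To actually launch it (Snakemake must be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Starting with a dry-run is a cheap way to check which rules would be executed before committing any compute time.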

All Snakemake options are listed in the official Snakemake documentation.

Snakemake’s monologue & its hidden treasure chest

When you run Snakemake, a full progress report is printed on the screen:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job        count
-------  -------
all            1
fastqc         3
multiqc        1
total          5

[...]

5 of 5 steps (100%) done
Complete log: .snakemake/log/2024-02-20T150605.574089.snakemake.log

When it’s finished, a .snakemake folder will appear in your working directory:

it can be heavy (when using environments)
it can contain a lot of files (unsuited for some file systems)
it’s a hidden folder, so use ls -a to see it
don’t forget to remove it once you’re sure you’ve finished your analysis
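
Before deleting the folder, it can be useful to check how heavy it really is. The sketch below counts the files and sums their sizes; `folder_stats` is a hypothetical helper written for this illustration, and the removal step is left commented out on purpose.

```python
import os


def folder_stats(path):
    """Return (number_of_files, total_size_in_bytes) for everything under `path`."""
    n_files = 0
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.islink(file_path):
                continue  # skip symbolic links, count real files only
            n_files += 1
            total_bytes += os.path.getsize(file_path)
    return n_files, total_bytes


if os.path.isdir(".snakemake"):
    n_files, total_bytes = folder_stats(".snakemake")
    print(f".snakemake: {n_files} files, {total_bytes / 1e6:.1f} MB")
    # Once the analysis is definitely finished:
    # import shutil
    # shutil.rmtree(".snakemake")
```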

Where to get a Snakemake workflow?

from your colleagues ==> exercise 0
from GitHub, and in particular the Snakemake workflow catalog (the nf-core equivalent): https://snakemake.github.io/snakemake-workflow-catalog/ (up to 2k pipelines, 260 tested)
create from scratch ==> exercise 1
compose with snakemake wrappers
by using a Nextflow workflow! (integration via snakemake wrappers)

Conclusion

So far, we’ve seen:

Snakemake workflow = set of rules
Rules are written in Snakefiles
Snakemake links rules through common input/output files
Rules are defined by their name and contain directives (of which input and output to specify input & output files):

rule myRuleName:
    input: "myInputFile"
    output: "myOutputFile"
    shell: "echo {input} > {output}"

A Snakefile is run with the snakemake --cores 1 command (+ other options available)

Now it’s your turn!

The aim of this presentation and its associated exercises is to get you started with Snakemake.

Several exercises of increasing complexity are proposed. The first two are our primary objectives; the following ones are designed to accommodate learners with varying levels of Snakemake expertise.

Exercise 0: running an already-existing Snakemake workflow
Exercise 1A: create a “basic” snakefile from scratch
Exercise 1B: improving your first snakefile
Exercise 1C: scaling up your workflow to an HPC environment
Exercise 2: from bash script to a snakefile

Please complete the progression table: it will help us when you ask for help!

Now it’s up to you!

https://bioi2.i2bc.paris-saclay.fr/training/snakemake/exercises/
