Introduction to Snakemake

On the schedule today:

  • introduction to workflows
  • introduction to Snakemake & the concept of rules
  • Snakemake & Snakefiles
  • illustration with a 2-step workflow example

What is a workflow?

Workflow = a set of instructions or operations used to complete a process (e.g. fill, close & label a bottle):

  • input: empty bottle
  • step 1: fill bottle
  • step 2: put lid on bottle
  • step 3: stick on label
  • final output: filled, labelled & closed bottle

In bioinformatics, the bottles are data: a workflow analyses the “input data” to produce the final results, the “output data”.

Why use a workflow management system?

The pros:

  • minimise the number of manual steps in an analysis
  • simplify pipeline development, maintenance, and use, by dealing with:
    • task parallelisation & efficient use of resources
    • resuming of failed runs or steps
    • tracking of parameters and tool versions (=reproducibility)
  • make your code:
    • less complex to read and more modular
    • more easily scalable to large sets of data
    • more transportable onto different systems (local PC, HPC or cloud)

The cons:

  • learning effort…

Which workflow management systems?

Many workflow management systems exist, in many forms:

  • command line (shell): parallelisation has to be scripted manually, which is not easy
  • command line (rules): e.g. , , , …
  • graphical interface: e.g. Taverna, Kepler, …

Focus on Snakemake

Today, we’re going to learn how to use Snakemake. Its features:

  • works on files (rather than streams, reading/writing from databases, or passing variables in memory)
  • is based on Python (but knowing Python is not required)
  • has features for defining the environment for each task (running a large number of small third-party tools is common in bioinformatics)
  • scales up easily from desktop to server, cluster, grid or cloud environments without modifying your initial script (i.e. develop on a laptop using a small subset of data, then run the real analysis on a cluster)

The principle behind Snakemake

Snakemake = Python (aka “snake”, a programming language) + Make (a rule-based automation tool)

Workflows are like Lego:

Workflows are made up of blocks; each block performs a specific (set of) instruction(s)

1 “block” = 1 rule:

  • 1 rule = 1 instruction (ideally)
  • inputs and outputs are one or multiple files
  • at least 1 input and/or 1 output per rule

[Figure: single rule example]

Linking data flows

Rule order is not important…

Execution order ≠ code order: at execution, Snakemake picks only the rules it needs, in the order it needs them.

…but matching file names is key!

Rules are linked together by Snakemake using matching filenames in their input and output directives.
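The linking idea can be sketched in plain Python. This is only a toy model of the principle, not Snakemake's own code or syntax: the `rules` dictionary, the `link_rules` helper and the sample file names are all hypothetical. The rules are deliberately listed in the "wrong" order to show that code order does not matter.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical rules, listed "out of order": multiqc appears before fastqc.
rules = {
    "multiqc": {
        "input": ["sample1_fastqc.zip"],
        "output": ["multiqc_report.html"],
    },
    "fastqc": {
        "input": ["sample1.fastq.gz"],
        "output": ["sample1_fastqc.zip", "sample1_fastqc.html"],
    },
}


def link_rules(rules):
    """Map each rule to the set of rules whose outputs it consumes."""
    # Which rule produces which file?
    producer = {f: name for name, r in rules.items() for f in r["output"]}
    # A rule depends on another when one of its inputs matches that rule's output.
    return {
        name: {producer[f] for f in r["input"] if f in producer}
        for name, r in rules.items()
    }


dependencies = link_rules(rules)
# The execution order follows the file dependencies, not the order in the code.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(dependencies)
print(execution_order)
```

If the file name in the multiqc input did not match any fastqc output exactly, no link would be found, which is why matching file names is key.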

At execution, Snakemake creates a DAG (directed acyclic graph), which it follows to generate the final output of your pipeline.

A workflow example

Below is a workflow example using 2 tools sequentially to check the quality of NGS data:

[Figure: 2-step example outline]

In this example, we have:

  • 2 linked rules: fastQC and multiQC
  • input RNAseq files named *.fastq.gz
  • intermediate files generated by FastQC named *.zip & *.html
  • the final output named multiqc_report.html generated by MultiQC

[Figure: detailed illustration]
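To make the file names in this example concrete, here is a small Python sketch that derives the intermediate files from the input files. It rests on assumptions not stated in the slides: the sample names, the `data/` and `fastqc/` directories, and the `_fastqc` suffix on the intermediate files are illustrative, and `fastqc_outputs` is a hypothetical helper.

```python
from pathlib import PurePosixPath

FASTQ_SUFFIX = ".fastq.gz"


def fastqc_outputs(fastq_file, outdir="fastqc"):
    """Return the (zip, html) paths assumed to be produced for one input file."""
    name = PurePosixPath(fastq_file).name
    if not name.endswith(FASTQ_SUFFIX):
        raise ValueError(f"not a {FASTQ_SUFFIX} file: {fastq_file}")
    sample = name[: -len(FASTQ_SUFFIX)]  # strip the suffix to get the sample name
    # Assumption: intermediate files are named <sample>_fastqc.zip / .html
    return f"{outdir}/{sample}_fastqc.zip", f"{outdir}/{sample}_fastqc.html"


# Hypothetical input files
inputs = ["data/sampleA.fastq.gz", "data/sampleB.fastq.gz"]
intermediate = [path for f in inputs for path in fastqc_outputs(f)]
final_output = "multiqc_report.html"

print(intermediate)
print(final_output)
```

Each input file yields one zip and one html intermediate file, while the whole set of inputs leads to a single final report.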

How Snakemake creates your workflow

[Figure: 2-step example outline]

How Snakemake creates your workflow (summary)

Snakemake (Smk) steps (the running path of each step was shown on the 2-step example outline):

  • Smk creates the DAG from the Snakefile
  • Smk sees that the final output multiqc_report.html doesn’t exist but knows it can create it with the multiQC rule
  • multiQC needs zip files (which don’t exist), but the fastQC rule can generate them
  • fastQC needs fastq.gz files
  • fastq.gz files exist! Smk stops backtracking and executes the fastQC rule
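
The backtracking described above can be imitated with a short recursive function. This is a simplified sketch of the idea, not how Snakemake is implemented: the `plan` function, the `rules` dictionary, the `existing` set (standing in for files on disk) and the sample file names are all hypothetical.

```python
def plan(target, rules, existing, order=None):
    """Work backwards from a target file; return rule names in execution order."""
    if order is None:
        order = []
    if target in existing:
        return order  # the file already exists: stop backtracking here
    for name, rule in rules.items():
        if target in rule["output"]:
            # First make sure every input of this rule can be obtained...
            for needed in rule["input"]:
                plan(needed, rules, existing, order)
            # ...then schedule the rule itself (only once).
            if name not in order:
                order.append(name)
            return order
    raise ValueError(f"missing file and no rule produces it: {target!r}")


# Hypothetical 2-step workflow
rules = {
    "multiQC": {"input": ["sample1_fastqc.zip"], "output": ["multiqc_report.html"]},
    "fastQC": {
        "input": ["sample1.fastq.gz"],
        "output": ["sample1_fastqc.zip", "sample1_fastqc.html"],
    },
}
existing = {"sample1.fastq.gz"}  # only the raw data is present

steps = plan("multiqc_report.html", rules, existing)
print(steps)
```

If the zip file were already present in `existing`, the same call would return only the multiQC rule, which mirrors how a failed or partial run can be resumed without redoing finished steps.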