Exercise 1A 4 – BIOI2 – Integrative BIOInformatics platforme

Getting started with Snakemake

About this course | Before the session | About Snakemake | Course material | Exercises

Exercise 1A - create your first snakefile

objective > setup > o1 > o2 > o3 > o4 > o5 > o6 > recap

Objective 4

Create a new snakefile named ex1_o4.smk in which we add a new & dedicated target rule at the beginning of the Snakefile.

Motivation - how can I run my multiqc rule?

Your first thought might be to change the order of your rules in the Snakefile to make sure multiqc is first. There’s actually a way of telling Snakemake at execution what rule you want to be the target rule (which avoids editing the file), by just adding its name. For example, the following command specifies multiqc as the target:

				
					snakemake -s ex1_o3.smk --cores 1 -p multiqc

This solution works but it’s limited to sequential pipelines (i.e. the output of one is the direct input of the next). For less linear pipelines (e.g. output of rule1 is input for rules 2 and 3), this solution won’t work as only one rule can be the target rule.

As we now know, if the input files to the target rule don’t exist, Snakemake will first execute any of the other rules in the Snakefile that will help it generate these missing files. If we use this knowledge to our advantage, we can add a dedicated target rule at the top of our Snakefile which requires the outputs of key rules in our Snakefile. This will enable us to execute them all via a single line of code. We will be doing this below.

Where to start?

add a new rule: define a new rule (above the other rules in your Snakefile) containing only an input directive.
fill in the input directive: list all files that you expect as output in your pipeline (in this case, you could list all the outputs of fastqc and multiqc, or just the outputs of multiqc since our pipeline is completely linear).

more about target rules

Solution

Your code for ex1_o4.smk should look like this:

rule all:
  input: 
    "FastQC/SRR3099585_chr18_fastqc.zip", 
    "FastQC/SRR3099585_chr18_fastqc.html",
    "FastQC/SRR3099586_chr18_fastqc.zip", 
    "FastQC/SRR3099586_chr18_fastqc.html", 
    "multiqc_report.html", 
    "multiqc_data", 

rule fastqc:
  input: 
    "Data/SRR3099585_chr18.fastq.gz",
    "Data/SRR3099586_chr18.fastq.gz",
  output: 
    "FastQC/SRR3099585_chr18_fastqc.zip", 
    "FastQC/SRR3099585_chr18_fastqc.html",
    "FastQC/SRR3099586_chr18_fastqc.zip", 
    "FastQC/SRR3099586_chr18_fastqc.html", 
  shell: "fastqc --outdir FastQC {input}"

rule multiqc: 
  input: 
    "FastQC/SRR3099585_chr18_fastqc.zip", 
    "FastQC/SRR3099586_chr18_fastqc.zip", 
  output:
    "multiqc_report.html", 
    directory("multiqc_data") 
  shell: 
    "multiqc {input}"

Note that, technically, you could remove all FastQC html and zip files from the new “all” target rule and only keep the multiqc outputs since our pipeline is completely linear in this example (i.e. Snakemake is capable of retracing the sequence of rules it needs to run in order to get to the final MultiQC output).

Test the script

Next, let’s run the pipeline and see if it works:

				
					snakemake -s ex1_o4.smk --cores 1 -p

You should see something similar to the following output on your screen:

				
					Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job        count
-------  -------
all            1
multiqc        1
total          2

Select jobs to execute...
Execute 1 jobs...

[Tue Feb 20 14:17:43 2024]
localrule multiqc:
    input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip
    output: multiqc_report.html, multiqc_data
    jobid: 2
    reason: Missing output files: multiqc_data, multiqc_report.html
    resources: tmpdir=/var/tmp/pbs.743371.pbsserver

multiqc FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3099586_chr18_fastqc.zip
[WARNING]         multiqc : MultiQC Version v1.20 now available!
[INFO   ]         multiqc : This is MultiQC v1.9
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching   : /data/work/I2BC/chloe.quignot/snakemake_tutorial/FastQC/SRR3099585_chr18_fastqc.zip
[INFO   ]         multiqc : Searching   : /data/work/I2BC/chloe.quignot/snakemake_tutorial/FastQC/SRR3099586_chr18_fastqc.zip
[INFO   ]          fastqc : Found 2 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : multiqc_report.html
[INFO   ]         multiqc : Data        : multiqc_data
[INFO   ]         multiqc : MultiQC complete
[Tue Feb 20 14:18:01 2024]
Finished job 2.
1 of 2 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Tue Feb 20 14:18:01 2024]
localrule all:
    input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html, multiqc_report.html, multiqc_data
    jobid: 0
    reason: Input files updated by another job: multiqc_data, multiqc_report.html
    resources: tmpdir=/var/tmp/pbs.743371.pbsserver

[Tue Feb 20 14:18:01 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-02-20T141743.558426.snakemake.log

Observe the output

This time, multiqc is executed and you should be able to see its outputs in your working directory:

				
					john.doe@node06:/data/work/I2BC/john.doe/snakemake_tutorial/$ ls
Data  FastQC  ex1_o4.smk  multiqc_data  multiqc_report.html