Getting started with Snakemake
Objective 4
ex1_o4.smk
in which we add a new & dedicated target rule at the beginning of the Snakefile.Motivation - how can I run my multiqc rule?
Your first thought might be to change the order of your rules in the Snakefile to make sure multiqc is first. There’s actually a way of telling Snakemake at execution what rule you want to be the target rule (which avoids editing the file), by just adding its name. For example, the following command specifies multiqc as the target:
snakemake -s ex1_o3.smk --cores 1 -p multiqc
This solution works but it’s limited to sequential pipelines (i.e. the output of one is the direct input of the next). For less linear pipelines (e.g. output of rule1 is input for rules 2 and 3), this solution won’t work as only one rule can be the target rule.
As we now know, if the input files to the target rule don’t exist, Snakemake will first execute any of the other rules in the Snakefile that will help it generate these missing files. If we use this knowledge to our advantage, we can add a dedicated target rule at the top of our Snakefile which requires the outputs of key rules in our Snakefile. This will enable us to execute them all via a single line of code. We will be doing this below.
Where to start?
- add a new rule: define a new rule (above the other rules in your Snakefile) containing only an input directive.
- fill in the input directive: list all files that you expect as output in your pipeline (in this case, you could list all the outputs of fastqc and multiqc, or just the outputs of multiqc since our pipeline is completely linear).
Your code for ex1_o4.smk
should look like this:
rule all: input: "FastQC/SRR3099585_chr18_fastqc.zip", "FastQC/SRR3099585_chr18_fastqc.html", "FastQC/SRR3099586_chr18_fastqc.zip", "FastQC/SRR3099586_chr18_fastqc.html", "multiqc_report.html",
"multiqc_data",
rule fastqc: input: "Data/SRR3099585_chr18.fastq.gz", "Data/SRR3099586_chr18.fastq.gz", output: "FastQC/SRR3099585_chr18_fastqc.zip", "FastQC/SRR3099585_chr18_fastqc.html", "FastQC/SRR3099586_chr18_fastqc.zip", "FastQC/SRR3099586_chr18_fastqc.html", shell: "fastqc --outdir FastQC {input}" rule multiqc: input: "FastQC/SRR3099585_chr18_fastqc.zip", "FastQC/SRR3099586_chr18_fastqc.zip", output: "multiqc_report.html", directory("multiqc_data") shell: "multiqc {input}"
Note that, technically, you could remove all FastQC html and zip files from the new “all” target rule and only keep the multiqc outputs since our pipeline is completely linear in this example (i.e. Snakemake is capable of retracing the sequence of rules it needs to run in order to get to the final MultiQC output).
Test the script
Next, let’s run the pipeline and see if it works:
snakemake -s ex1_o4.smk --cores 1 -p
You should see something similar to the following output on your screen:
Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count
------- -------
all 1
multiqc 1
total 2
Select jobs to execute...
Execute 1 jobs...
[Tue Feb 20 14:17:43 2024]
localrule multiqc:
input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip
output: multiqc_report.html, multiqc_data
jobid: 2
reason: Missing output files: multiqc_data, multiqc_report.html
resources: tmpdir=/var/tmp/pbs.743371.pbsserver
multiqc FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3099586_chr18_fastqc.zip
[WARNING] multiqc : MultiQC Version v1.20 now available!
[INFO ] multiqc : This is MultiQC v1.9
[INFO ] multiqc : Template : default
[INFO ] multiqc : Searching : /data/work/I2BC/chloe.quignot/snakemake_tutorial/FastQC/SRR3099585_chr18_fastqc.zip
[INFO ] multiqc : Searching : /data/work/I2BC/chloe.quignot/snakemake_tutorial/FastQC/SRR3099586_chr18_fastqc.zip
[INFO ] fastqc : Found 2 reports
[INFO ] multiqc : Compressing plot data
[INFO ] multiqc : Report : multiqc_report.html
[INFO ] multiqc : Data : multiqc_data
[INFO ] multiqc : MultiQC complete
[Tue Feb 20 14:18:01 2024]
Finished job 2.
1 of 2 steps (50%) done
Select jobs to execute...
Execute 1 jobs...
[Tue Feb 20 14:18:01 2024]
localrule all:
input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html, multiqc_report.html, multiqc_data
jobid: 0
reason: Input files updated by another job: multiqc_data, multiqc_report.html
resources: tmpdir=/var/tmp/pbs.743371.pbsserver
[Tue Feb 20 14:18:01 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-02-20T141743.558426.snakemake.log
Observe the output
This time, multiqc is executed and you should be able to see its outputs in your working directory:
john.doe@node06:/data/work/I2BC/john.doe/snakemake_tutorial/$ ls
Data FastQC ex1_o4.smk multiqc_data multiqc_report.html