Exercise 1B 4 – BIOI2 – Integrative BIOInformatics platforme

Getting started with Snakemake


Exercise 1B - improving your snakefile


Objective 4

In this objective, we’ll learn how to control when Snakemake re-runs rules.

Motivation

Have you noticed how Snakemake sometimes decides to re-run everything (although output files already exist) and sometimes not?

Snakemake always justifies its choices in the output log. Sometimes it’s because output files are missing, other times it might be because the code has changed, and so on. For example:

[Tue Feb 20 15:35:31 2024]
localrule fastqc:
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
    jobid: 2
    reason: Code has changed since last execution
    wildcards: sample=SRR3105698_chr18
    resources: tmpdir=/var/tmp/pbs.743371.pbsserver

fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err

Or:

[Tue Feb 20 15:06:06 2024]
localrule fastqc:
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    jobid: 2
    reason: Missing output files: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    wildcards: sample=SRR3105698_chr18
    resources: tmpdir=/var/tmp/pbs.743371.pbsserver

fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz

This behaviour arises because Snakemake takes all provenance information into account when deciding which jobs to rerun, not solely whether the outputs exist. Thus, changes in parameters, code or software environment, as well as changes in the set of input files of a job, will make Snakemake rerun that job (and the jobs that depend on it), even if the output files already exist.

Of note: in previous versions of Snakemake (before v.7.8.0), rerunning jobs relied purely on file modification times, so Snakemake would only re-execute a job if its output files were missing or older than its input files.
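To make the modification-time criterion concrete, here is a minimal Python sketch of that kind of check. It is an illustration under simplifying assumptions, not Snakemake’s actual implementation: the function name needs_rerun_mtime and the file names are hypothetical, and timestamps are set by hand so the outcome does not depend on the clock.

```python
import os
import tempfile


def needs_rerun_mtime(inputs, outputs):
    """Decide whether a job should run, looking only at files and timestamps."""
    # A missing output always triggers the job.
    if any(not os.path.exists(path) for path in outputs):
        return True
    # Otherwise, run only if some input is newer than the oldest output.
    newest_input = max(os.path.getmtime(path) for path in inputs)
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return newest_input > oldest_output


# Usage with hypothetical file names and hand-set timestamps.
with tempfile.TemporaryDirectory() as tmp:
    fastq = os.path.join(tmp, "sample.fastq.gz")      # stands in for an input
    report = os.path.join(tmp, "sample_fastqc.zip")   # stands in for an output
    for path in (fastq, report):
        open(path, "w").close()

    os.utime(fastq, (1000, 1000))    # input last modified at t=1000
    os.utime(report, (2000, 2000))   # output written later, at t=2000
    rerun_when_fresh = needs_rerun_mtime([fastq], [report])   # False

    os.utime(fastq, (3000, 3000))    # input modified after the output
    rerun_when_stale = needs_rerun_mtime([fastq], [report])   # True
```

Note that editing the rule’s code or parameters changes neither timestamp, which is why a purely mtime-based check would not notice such changes.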

How to control re-running criteria?

Snakemake has a command line option for specifying your re-running criteria: --rerun-triggers. By default, all triggers are used (code, input, mtime, params, software-env), which guarantees that results are consistent with the workflow code and configuration. To revert to Snakemake’s behaviour before v.7.8.0, you can use --rerun-triggers mtime, which tells Snakemake to use only modification times when determining whether a job should be executed. For example:

snakemake -s ex1b_o3.smk -c 1 -p --rerun-triggers mtime
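One way to picture the effect of --rerun-triggers is as a filter over the kinds of change that are allowed to cause a rerun. The Python sketch below is a simplified mental model, not Snakemake’s internal logic: the function rerun_reasons and the detected dictionary are hypothetical, and only the five trigger names come from the option itself.

```python
# The five trigger names accepted by --rerun-triggers.
ALL_TRIGGERS = ("code", "input", "mtime", "params", "software-env")


def rerun_reasons(detected_changes, triggers=ALL_TRIGGERS):
    """Return the enabled triggers for which a change was detected."""
    return [name for name in triggers if detected_changes.get(name, False)]


# Hypothetical situation: only the shell command of the rule was edited.
detected = {"code": True, "input": False, "mtime": False,
            "params": False, "software-env": False}

default_reasons = rerun_reasons(detected)                 # ['code']
mtime_only_reasons = rerun_reasons(detected, ("mtime",))  # []
```

With all triggers enabled, the code change is reported as a reason to rerun; restricted to mtime, the same change yields no reason, so the job would be left alone.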

Force re-run

If you rerun your snakemake command line now, without changing anything in the code (with or without the --rerun-triggers mtime option), you should see a message from Snakemake telling you that nothing needs to be done:

Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).

However, if you’d like to force Snakemake to re-run the whole pipeline, you can do one of the following:

– delete all output folders and results before re-running the Snakemake command:

rm -rf FastQC multiqc*
snakemake -s ex1b_o3.smk -c 1 -p --configfile ex1.yml

– use Snakemake’s --forcerun (-R) or --forceall (-F) options when you run the Snakemake command. --forcerun re-runs a specific rule or file, which you have to specify on the command line. --forceall forces everything to be re-run. For example, to force a rule:

snakemake -s ex1b_o3.smk -c 1 -p -R fastqc --configfile ex1.yml

To force the creation of a specific file:

snakemake -s ex1b_o3.smk -c 1 -p -R FastQC/SRR3099585_chr18_fastqc.zip --configfile ex1.yml

To force everything:

snakemake -s ex1b_o3.smk -c 1 -p -F --configfile ex1.yml
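The three commands differ in how much of the workflow they select. The Python sketch below illustrates this on a toy job graph. It is a hedged approximation, not Snakemake’s scheduler: the job names, the multiqc job and the jobs_to_rerun helper are assumptions made for illustration, and the sketch assumes that jobs downstream of a forced job also run again because their inputs have been regenerated.

```python
from collections import deque

# Hypothetical job graph: each job maps to the jobs that consume its outputs.
DOWNSTREAM = {
    "fastqc_SRR3099585_chr18": ["multiqc"],
    "fastqc_SRR3105698_chr18": ["multiqc"],
    "multiqc": [],
}


def jobs_to_rerun(forced, downstream=DOWNSTREAM):
    """Return the forced jobs plus every job that depends on them."""
    selected = set(forced)
    queue = deque(forced)
    while queue:
        job = queue.popleft()
        for child in downstream[job]:
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected


# Forcing a single file: one fastqc job, then the job that uses its output.
one_file = jobs_to_rerun(["fastqc_SRR3099585_chr18"])

# Forcing the fastqc rule: every fastqc job, then the downstream job.
whole_rule = jobs_to_rerun([job for job in DOWNSTREAM if job.startswith("fastqc_")])

# Forcing everything: all jobs are selected from the start.
everything = jobs_to_rerun(list(DOWNSTREAM))
```

In this toy graph, forcing the fastqc rule and forcing everything select the same jobs; in a longer pipeline with steps upstream of fastqc, only the latter would include them.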