Getting started with Snakemake

Exercise 1B - improving your snakefile


Objective 1

Make Snakemake find your input files for you, using the glob_wildcards() function (similar to the glob module in Python).
Where to start?
  • Replace SAMPLES: instead of typing out the “variable” parts of your sample file names by hand with SAMPLES = ["SRR3099585_chr18","SRR3099586_chr18","SRR3099587_chr18"], try building the list with glob_wildcards().

Example using glob_wildcards()

Let’s imagine we have a folder that looks like this:
john.doe@PC1:~$ ls mydir/
batchA_sample1.txt   batchB_sample1.txt   batchB_sample3.txt 
batchA_sample2.txt   batchB_sample2.txt   
Defining the search pattern:
Look for common roots between your files. In this case, the search pattern for all batchA files could look like this:
samplelist, = glob_wildcards("mydir/batchA_sample{sampleid}.txt")
The above function will search mydir/ (relative to the directory you run Snakemake from) for files that fit this pattern and infer the possible values of your {sampleid} wildcard.
 
The output format:
Whatever the number of wildcards in your pattern, glob_wildcards() always returns a “tuple” (an immutable sequence) of lists, one list per wildcard, in the order the wildcards appear in the pattern. That is why the comma in “samplelist, =” above is important: it unpacks the single list out of the tuple.
In the example above, we only have 1 wildcard, so the result is a tuple containing a single list of all values found for {sampleid}: ["1", "2"].
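To make the return format concrete, here is a minimal pure-Python sketch that mimics the behaviour described above. The helper glob_wildcards_sketch() is a hypothetical stand-in written for this illustration, not Snakemake's own code, and it assumes the folder part of the pattern contains no wildcard. It recreates the example folder in a temporary directory so it can be run anywhere.

```python
import os
import re
import tempfile
from collections import namedtuple


def glob_wildcards_sketch(pattern):
    """Illustrative stand-in for glob_wildcards(); NOT Snakemake's own code."""
    names = re.findall(r"\{(\w+)\}", pattern)   # wildcard names, in pattern order
    literals = re.split(r"\{\w+\}", pattern)    # fixed text between the wildcards
    regex = re.escape(literals[0])
    for name, literal in zip(names, literals[1:]):
        regex += f"(?P<{name}>.+)" + re.escape(literal)
    folder = os.path.dirname(literals[0])       # assumption: no wildcard in the folder part
    Wildcards = namedtuple("Wildcards", names)
    result = Wildcards(*[[] for _ in names])    # one (initially empty) list per wildcard
    for filename in sorted(os.listdir(folder)):
        match = re.fullmatch(regex, f"{folder}/{filename}")
        if match:
            for name in names:
                getattr(result, name).append(match.group(name))
    return result


# Recreate the example folder in a temporary directory.
with tempfile.TemporaryDirectory() as mydir:
    for name in ["batchA_sample1.txt", "batchA_sample2.txt",
                 "batchB_sample1.txt", "batchB_sample2.txt", "batchB_sample3.txt"]:
        open(os.path.join(mydir, name), "w").close()

    result = glob_wildcards_sketch(mydir + "/batchA_sample{sampleid}.txt")
    samplelist, = result   # the trailing comma unpacks the one-element tuple

print(len(result))       # 1: a tuple holding a single list
print(samplelist)        # ['1', '2']
```

Without the trailing comma, samplelist would be the whole tuple rather than the list inside it, and passing it on to expand() would most likely not give the file names you expect.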
 
A search pattern with 2 wildcards:
To include batchB, we can use a second wildcard like this:
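batchlist, samplelist = glob_wildcards("mydir/batch{batchid}_sample{sampleid}.txt")
In this example, the output could look like this: ["A", "A", "B", "B", "B"] (for batchlist) and ["1", "2", "1", "2", "3"] (for samplelist). The lists come back in the order the wildcards appear in the pattern ({batchid} first, then {sampleid}), and the values at the same position in each list belong to the same file.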
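Because the two lists are aligned position by position, they have to be paired up rather than freely combined. The plain-Python sketch below illustrates the difference using the values quoted above; it is only an illustration of the idea. For reference, Snakemake's expand() combines all values by default and can be given zip as its second argument to pair them instead.

```python
from itertools import product

# Values quoted in the text for the two-wildcard pattern.
batchlist = ["A", "A", "B", "B", "B"]
samplelist = ["1", "2", "1", "2", "3"]

# Pairing position by position recovers exactly the files found in mydir/.
paired = [f"mydir/batch{b}_sample{s}.txt" for b, s in zip(batchlist, samplelist)]

# Combining every distinct batch with every distinct sample adds a file
# that does not exist in the example folder.
combined = [f"mydir/batch{b}_sample{s}.txt"
            for b, s in product(sorted(set(batchlist)), sorted(set(samplelist)))]
extra = sorted(set(combined) - set(paired))

print(paired)
print(extra)   # ['mydir/batchA_sample3.txt']
```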

Your code in ex1b_o1.smk should look like this:

SAMPLES, = glob_wildcards("Data/{sample}.fastq.gz")

rule all:
  input:
    expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
    "multiqc_report.html"

rule fastqc:
  input:
    "Data/{sample}.fastq.gz"
  output:
    "FastQC/{sample}_fastqc.zip",
    "FastQC/{sample}_fastqc.html"
  shell: "fastqc --outdir FastQC {input}"

rule multiqc:
  input:
    expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES)
  output:
    "multiqc_report.html",
    directory("multiqc_data")
  shell: "multiqc {input}"

As you might have noticed, we simplified the input of the “all” target rule to just 2 items: the html outputs of the fastqc rule and the html output of the multiqc rule. This is enough because each rule always produces all of its outputs together: fastqc generates the zip files alongside the html files, and multiqc generates the multiqc_data directory alongside the html report.
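To see which files the “all” rule ends up requesting, the sketch below reproduces the effect of the expand() call in plain Python for the three original samples. expand_sketch() is a hypothetical helper written for this illustration and only handles a single {sample} wildcard; it is not Snakemake's expand().

```python
# The three sample names listed manually in the previous version of the Snakefile.
SAMPLES = ["SRR3099585_chr18", "SRR3099586_chr18", "SRR3099587_chr18"]


def expand_sketch(pattern, sample):
    """Fill the {sample} placeholder once per value (illustration only)."""
    return [pattern.format(sample=s) for s in sample]


html_reports = expand_sketch("FastQC/{sample}_fastqc.html", sample=SAMPLES)
targets = html_reports + ["multiqc_report.html"]

for target in targets:
    print(target)
```

With three samples, this gives four targets: one FastQC html report per sample plus the MultiQC report.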

Test the script

Next, let’s check if your pipeline still works as it should using the following command:

				
snakemake -s ex1b_o1.smk -c 1 -p

NB: -c is the short form of --cores, -s of --snakefile and -p of --printshellcmds (which prints the shell command of each job).

You should see something similar to the following output on your screen.

				
Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job        count
-------  -------
all            1
fastqc         3
multiqc        1
total          5

Select jobs to execute...
Execute 1 jobs...

[Tue Feb 20 15:06:06 2024]
localrule fastqc:
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    jobid: 2
    reason: Missing output files: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    wildcards: sample=SRR3105698_chr18
    resources: tmpdir=/var/tmp/pbs.743371.pbsserver

fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz
Started analysis of SRR3105698_chr18.fastq.gz
Approx 5% complete for SRR3105698_chr18.fastq.gz
Approx 10% complete for SRR3105698_chr18.fastq.gz
Approx 15% complete for SRR3105698_chr18.fastq.gz
Approx 20% complete for SRR3105698_chr18.fastq.gz
Approx 25% complete for SRR3105698_chr18.fastq.gz
Approx 30% complete for SRR3105698_chr18.fastq.gz
Approx 35% complete for SRR3105698_chr18.fastq.gz
Approx 40% complete for SRR3105698_chr18.fastq.gz
Approx 45% complete for SRR3105698_chr18.fastq.gz
Approx 50% complete for SRR3105698_chr18.fastq.gz
Approx 55% complete for SRR3105698_chr18.fastq.gz
Approx 60% complete for SRR3105698_chr18.fastq.gz
Approx 65% complete for SRR3105698_chr18.fastq.gz
Approx 70% complete for SRR3105698_chr18.fastq.gz
Approx 75% complete for SRR3105698_chr18.fastq.gz
Approx 80% complete for SRR3105698_chr18.fastq.gz
Approx 85% complete for SRR3105698_chr18.fastq.gz
Approx 90% complete for SRR3105698_chr18.fastq.gz
Approx 95% complete for SRR3105698_chr18.fastq.gz
Analysis complete for SRR3105698_chr18.fastq.gz
[Tue Feb 20 15:06:13 2024]
Finished job 2.
1 of 5 steps (20%) done
Select jobs to execute...
Execute 1 jobs...

[Tue Feb 20 15:06:13 2024]
localrule fastqc:
    input: Data/SRR3105699_chr18.fastq.gz
    output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
    jobid: 6
    reason: Missing output files: FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.zip
    wildcards: sample=SRR3105699_chr18
    resources: tmpdir=/var/tmp/pbs.743371.pbsserver

fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz
Started analysis of SRR3105699_chr18.fastq.gz
Approx 5% complete for SRR3105699_chr18.fastq.gz
Approx 10% complete for SRR3105699_chr18.fastq.gz
Approx 15% complete for SRR3105699_chr18.fastq.gz
Approx 20% complete for SRR3105699_chr18.fastq.gz
Approx 25% complete for SRR3105699_chr18.fastq.gz
Approx 30% complete for SRR3105699_chr18.fastq.gz
Approx 35% complete for SRR3105699_chr18.fastq.gz
Approx 40% complete for SRR3105699_chr18.fastq.gz
Approx 45% complete for SRR3105699_chr18.fastq.gz
Approx 50% complete for SRR3105699_chr18.fastq.gz
Approx 55% complete for SRR3105699_chr18.fastq.gz
Approx 60% complete for SRR3105699_chr18.fastq.gz
Approx 65% complete for SRR3105699_chr18.fastq.gz
Approx 70% complete for SRR3105699_chr18.fastq.gz
Approx 75% complete for SRR3105699_chr18.fastq.gz
Approx 80% complete for SRR3105699_chr18.fastq.gz
Approx 85% complete for SRR3105699_chr18.fastq.gz
Approx 90% complete for SRR3105699_chr18.fastq.gz
Approx 95% complete for SRR3105699_chr18.fastq.gz
Analysis complete for SRR3105699_chr18.fastq.gz
[Tue Feb 20 15:06:20 2024]
Finished job 6.
2 of 5 steps (40%) done
Select jobs to execute...
Execute 1 jobs...

[Tue Feb 20 15:06:20 2024]
localrule fastqc:
    input: Data/SRR3105697_chr18.fastq.gz
    output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
    jobid: 5
    reason: Missing output files: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
    wildcards: sample=SRR3105697_chr18
    resources: tmpdir=/var/tmp/pbs.743371.pbsserver

fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz
Started analysis of SRR3105697_chr18.fastq.gz
Approx 5% complete for SRR3105697_chr18.fastq.gz
Approx 10% complete for SRR3105697_chr18.fastq.gz
Approx 15% complete for SRR3105697_chr18.fastq.gz
Approx 20% complete for SRR3105697_chr18.fastq.gz
Approx 25% complete for SRR3105697_chr18.fastq.gz
Approx 30% complete for SRR3105697_chr18.fastq.gz
Approx 35% complete for SRR3105697_chr18.fastq.gz
Approx 40% complete for SRR3105697_chr18.fastq.gz
Approx 45% complete for SRR3105697_chr18.fastq.gz
Approx 50% complete for SRR3105697_chr18.fastq.gz
Approx 55% complete for SRR3105697_chr18.fastq.gz
Approx 60% complete for SRR3105697_chr18.fastq.gz
Approx 65% complete for SRR3105697_chr18.fastq.gz
Approx 70% complete for SRR3105697_chr18.fastq.gz
Approx 75% complete for SRR3105697_chr18.fastq.gz
Approx 80% complete for SRR3105697_chr18.fastq.gz
Approx 85% complete for SRR3105697_chr18.fastq.gz
Approx 90% complete for SRR3105697_chr18.fastq.gz
Approx 95% complete for SRR3105697_chr18.fastq.gz
Analysis complete for SRR3105697_chr18.fastq.gz
[Tue Feb 20 15:06:27 2024]
Finished job 5.
3 of 5 steps (60%) done
Select jobs to execute...
Execute 1 jobs...

[Tue Feb 20 15:06:27 2024]
localrule multiqc:
    input: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip
    output: multiqc_report.html, multiqc_data
    jobid: 7
    reason: Input files updated by another job: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip; Set of input files has changed since last execution
    resources: tmpdir=/var/tmp/pbs.743371.pbsserver

multiqc FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3105698_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3105697_chr18_fastqc.zip FastQC/SRR3105699_chr18_fastqc.zip
[WARNING]         multiqc : MultiQC Version v1.20 now available!
[INFO   ]         multiqc : This is MultiQC v1.9
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching   : /data/work/I2BC/chloe.quignot/snakemake_tutorial/FastQC/SRR3099586_chr18_fastqc.zip
[INFO   ]         multiqc : Searching   : /data/work/I2BC/chloe.quignot/snakemake_tutorial/FastQC/SRR3105698_chr18_fastqc.zip
[INFO   ]         multiqc : Searching   : /data/work/I2BC/chloe.quignot/snakemake_tutorial/FastQC/SRR3099587_chr18_fastqc.zip
[INFO   ]         multiqc : Searching   : /data/work/I2BC/chloe.quignot/snakemake_tutorial/FastQC/SRR3099585_chr18_fastqc.zip
[INFO   ]         multiqc : Searching   : /data/work/I2BC/chloe.quignot/snakemake_tutorial/FastQC/SRR3105697_chr18_fastqc.zip
[INFO   ]         multiqc : Searching   : /data/work/I2BC/chloe.quignot/snakemake_tutorial/FastQC/SRR3105699_chr18_fastqc.zip
[INFO   ]          fastqc : Found 6 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : multiqc_report.html
[INFO   ]         multiqc : Data        : multiqc_data
[INFO   ]         multiqc : MultiQC complete
[Tue Feb 20 15:06:32 2024]
Finished job 7.
4 of 5 steps (80%) done
Select jobs to execute...
Execute 1 jobs...

[Tue Feb 20 15:06:32 2024]
localrule all:
    input: FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, multiqc_report.html
    jobid: 0
    reason: Input files updated by another job: FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, multiqc_report.html, FastQC/SRR3105698_chr18_fastqc.html
    resources: tmpdir=/var/tmp/pbs.743371.pbsserver

[Tue Feb 20 15:06:32 2024]
Finished job 0.
5 of 5 steps (100%) done
Complete log: .snakemake/log/2024-02-20T150605.574089.snakemake.log

				
			
Observe the output
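As you can see, Snakemake found 3 extra files in your Data/ directory, ran FastQC on them, and regenerated the MultiQC report to include them. The three samples that had already been processed were not rerun, which is why the job stats list only 3 fastqc jobs while MultiQC reports 6 FastQC reports.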
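The “Missing output files” reason in the log can be pictured with a small plain-Python sketch. It assumes that the html reports of the three original samples already exist from the earlier run, and it deliberately simplifies: Snakemake's real scheduling also looks at other criteria, such as file modification times.

```python
# The six samples found in Data/ according to the log above.
samples = ["SRR3099585_chr18", "SRR3099586_chr18", "SRR3099587_chr18",
           "SRR3105697_chr18", "SRR3105698_chr18", "SRR3105699_chr18"]

# Assumption: reports for the first three samples are left over from the earlier run.
already_done = {f"FastQC/{s}_fastqc.html" for s in samples[:3]}

# Only samples whose output is missing need a new fastqc job.
to_run = [s for s in samples if f"FastQC/{s}_fastqc.html" not in already_done]

print(to_run)   # ['SRR3105697_chr18', 'SRR3105698_chr18', 'SRR3105699_chr18']
```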