Getting started with Snakemake
Objective 1
Make Snakemake find your input files for you, using the glob_wildcards() function (similar to the glob module in Python).
Where to start?
- Replace SAMPLES: try using the glob_wildcards() syntax to create your list instead of manually listing all the “variable” parts of your sample file names with SAMPLES = ["SRR3099585_chr18","SRR3099586_chr18","SRR3099587_chr18"].
Example using glob_wildcards()
Let’s imagine we have a folder that looks like this:
john.doe@PC1:~$ ls mydir/
batchA_sample1.txt  batchB_sample1.txt  batchB_sample3.txt
batchA_sample2.txt  batchB_sample2.txt
Defining the search pattern:
Look for common roots between your files. In this case, the search pattern for all batchA files could look like this:
samplelist, = glob_wildcards("mydir/batchA_sample{sampleid}.txt")
The above function will search mydir/ (relative to the current working directory) for files fitting this pattern and infer the possible values for your {sampleid} wildcard.
The output format:
Whatever the number of wildcards in your pattern, glob_wildcards() will always output a “tuple” (= immutable list) of lists, one list per wildcard. That is why the comma in “samplelist, =” above is important: it unpacks the single list out of the tuple.
In the example above, we only have 1 wildcard, so it returns a tuple containing a single list of all possible values for {sampleid}: ["1", "2"].
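To make the return shape concrete, here is a minimal pure-Python sketch that mimics this behaviour with a regular expression. It is not Snakemake's implementation: the helper name sketch_glob_wildcards is hypothetical, the file names are passed in as a list instead of being read from disk, and each wildcard is assumed to appear only once in the pattern.

```python
import re


def sketch_glob_wildcards(pattern, filenames):
    """Mimic the return shape of glob_wildcards() for a list of file names.

    Hypothetical helper for illustration only; assumes each wildcard
    name appears only once in the pattern.
    """
    names = re.findall(r"\{(\w+)\}", pattern)   # wildcard names, in pattern order
    literals = re.split(r"\{\w+\}", pattern)    # fixed text around the wildcards

    # Build a regex: fixed text is escaped, each wildcard becomes a named group.
    regex = re.escape(literals[0])
    for name, literal in zip(names, literals[1:]):
        regex += f"(?P<{name}>.+)" + re.escape(literal)
    compiled = re.compile(regex)

    values = tuple([] for _ in names)           # one list per wildcard
    for filename in filenames:
        match = compiled.fullmatch(filename)
        if match:
            for column, name in zip(values, names):
                column.append(match.group(name))
    return values


# The content of mydir/ from the listing above.
files = [
    "mydir/batchA_sample1.txt",
    "mydir/batchA_sample2.txt",
    "mydir/batchB_sample1.txt",
    "mydir/batchB_sample2.txt",
    "mydir/batchB_sample3.txt",
]

result = sketch_glob_wildcards("mydir/batchA_sample{sampleid}.txt", files)
print(result)        # (['1', '2'],)  -> a tuple holding one list

samplelist, = sketch_glob_wildcards("mydir/batchA_sample{sampleid}.txt", files)
print(samplelist)    # ['1', '2']     -> the list itself, thanks to the comma
```

Without the trailing comma, the variable would hold the whole tuple (['1', '2'],) rather than the list of values.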
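A search pattern with 2 wildcards:
To include batchB, we can use a second wildcard. The lists are returned in the order in which the wildcards appear in the pattern, so the list for {batchid} comes first:
batchlist, samplelist = glob_wildcards("mydir/batch{batchid}_sample{sampleid}.txt")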
In this example, the output will look like this: ["A", "A", "B", "B", "B"] (for batchlist) and ["1", "2", "1", "2", "3"] (for samplelist).
Your code in ex1b_o1.smk should look like this:
SAMPLES, = glob_wildcards("Data/{sample}.fastq.gz")

rule all:
    input:
        expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
        "multiqc_report.html"

rule fastqc:
    input:
        "Data/{sample}.fastq.gz"
    output:
        "FastQC/{sample}_fastqc.zip",
        "FastQC/{sample}_fastqc.html"
    shell: "fastqc --outdir FastQC {input}"

rule multiqc:
    input:
        expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES)
    output:
        "multiqc_report.html",
        directory("multiqc_data")
    shell: "multiqc {input}"
As you might have noticed, we simplified the input of the “all” target rule to just two entries: the html outputs of the fastqc rule and the html report of the multiqc rule. This is sufficient because fastqc always generates the zip files together with the html outputs, and multiqc always generates the multiqc_data directory together with the html report.
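To illustrate what the expand() calls in the rules above evaluate to, the sketch below mimics them with plain string formatting. sketch_expand is a hypothetical stand-in written for this illustration, not Snakemake's own expand(), and the sample names are the three from the manual SAMPLES list shown at the start of this objective.

```python
from itertools import product


def sketch_expand(pattern, **wildcards):
    """Hypothetical stand-in for expand(): fill the pattern with every
    combination of the given wildcard values."""
    names = list(wildcards)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*wildcards.values())
    ]


# Sample names from the manual SAMPLES list shown earlier.
SAMPLES = ["SRR3099585_chr18", "SRR3099586_chr18", "SRR3099587_chr18"]

# Input of the "all" rule: one html file per sample plus the MultiQC report.
all_inputs = sketch_expand("FastQC/{sample}_fastqc.html", sample=SAMPLES) + ["multiqc_report.html"]

# Input of the "multiqc" rule: one zip file per sample.
multiqc_inputs = sketch_expand("FastQC/{sample}_fastqc.zip", sample=SAMPLES)

for path in all_inputs:
    print(path)
```

Each list grows automatically when SAMPLES grows, which is why the rules themselves do not need to change when new files are added to Data/.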
Test the script
Next, let’s check if your pipeline still works as it should using the following command:
snakemake -s ex1b_o1.smk -c 1 -p
NB: -c is the short format of --cores
You should see something similar to the following output on your screen.
Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job        count
-------  -------
all            1
fastqc         3
multiqc        1
total          5
Select jobs to execute...
Execute 1 jobs...
[Tue Feb 20 15:06:06 2024]
localrule fastqc:
input: Data/SRR3105698_chr18.fastq.gz
output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
jobid: 2
reason: Missing output files: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
wildcards: sample=SRR3105698_chr18
resources: tmpdir=/var/tmp/pbs.743371.pbsserver
fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz
Started analysis of SRR3105698_chr18.fastq.gz
Approx 5% complete for SRR3105698_chr18.fastq.gz
Approx 10% complete for SRR3105698_chr18.fastq.gz
Approx 15% complete for SRR3105698_chr18.fastq.gz
Approx 20% complete for SRR3105698_chr18.fastq.gz
Approx 25% complete for SRR3105698_chr18.fastq.gz
Approx 30% complete for SRR3105698_chr18.fastq.gz
Approx 35% complete for SRR3105698_chr18.fastq.gz
Approx 40% complete for SRR3105698_chr18.fastq.gz
Approx 45% complete for SRR3105698_chr18.fastq.gz
Approx 50% complete for SRR3105698_chr18.fastq.gz
Approx 55% complete for SRR3105698_chr18.fastq.gz
Approx 60% complete for SRR3105698_chr18.fastq.gz
Approx 65% complete for SRR3105698_chr18.fastq.gz
Approx 70% complete for SRR3105698_chr18.fastq.gz
Approx 75% complete for SRR3105698_chr18.fastq.gz
Approx 80% complete for SRR3105698_chr18.fastq.gz
Approx 85% complete for SRR3105698_chr18.fastq.gz
Approx 90% complete for SRR3105698_chr18.fastq.gz
Approx 95% complete for SRR3105698_chr18.fastq.gz
Analysis complete for SRR3105698_chr18.fastq.gz
[Tue Feb 20 15:06:13 2024]
Finished job 2.
1 of 5 steps (20%) done
Select jobs to execute...
Execute 1 jobs...
[Tue Feb 20 15:06:13 2024]
localrule fastqc:
input: Data/SRR3105699_chr18.fastq.gz
output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
jobid: 6
reason: Missing output files: FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.zip
wildcards: sample=SRR3105699_chr18
resources: tmpdir=/var/tmp/pbs.743371.pbsserver
fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz
Started analysis of SRR3105699_chr18.fastq.gz
Approx 5% complete for SRR3105699_chr18.fastq.gz
Approx 10% complete for SRR3105699_chr18.fastq.gz
Approx 15% complete for SRR3105699_chr18.fastq.gz
Approx 20% complete for SRR3105699_chr18.fastq.gz
Approx 25% complete for SRR3105699_chr18.fastq.gz
Approx 30% complete for SRR3105699_chr18.fastq.gz
Approx 35% complete for SRR3105699_chr18.fastq.gz
Approx 40% complete for SRR3105699_chr18.fastq.gz
Approx 45% complete for SRR3105699_chr18.fastq.gz
Approx 50% complete for SRR3105699_chr18.fastq.gz
Approx 55% complete for SRR3105699_chr18.fastq.gz
Approx 60% complete for SRR3105699_chr18.fastq.gz
Approx 65% complete for SRR3105699_chr18.fastq.gz
Approx 70% complete for SRR3105699_chr18.fastq.gz
Approx 75% complete for SRR3105699_chr18.fastq.gz
Approx 80% complete for SRR3105699_chr18.fastq.gz
Approx 85% complete for SRR3105699_chr18.fastq.gz
Approx 90% complete for SRR3105699_chr18.fastq.gz
Approx 95% complete for SRR3105699_chr18.fastq.gz
Analysis complete for SRR3105699_chr18.fastq.gz
[Tue Feb 20 15:06:20 2024]
Finished job 6.
2 of 5 steps (40%) done
Select jobs to execute...
Execute 1 jobs...
[Tue Feb 20 15:06:20 2024]
localrule fastqc:
input: Data/SRR3105697_chr18.fastq.gz
output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
jobid: 5
reason: Missing output files: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
wildcards: sample=SRR3105697_chr18
resources: tmpdir=/var/tmp/pbs.743371.pbsserver
fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz
Started analysis of SRR3105697_chr18.fastq.gz
Approx 5% complete for SRR3105697_chr18.fastq.gz
Approx 10% complete for SRR3105697_chr18.fastq.gz
Approx 15% complete for SRR3105697_chr18.fastq.gz
Approx 20% complete for SRR3105697_chr18.fastq.gz
Approx 25% complete for SRR3105697_chr18.fastq.gz
Approx 30% complete for SRR3105697_chr18.fastq.gz
Approx 35% complete for SRR3105697_chr18.fastq.gz
Approx 40% complete for SRR3105697_chr18.fastq.gz
Approx 45% complete for SRR3105697_chr18.fastq.gz
Approx 50% complete for SRR3105697_chr18.fastq.gz
Approx 55% complete for SRR3105697_chr18.fastq.gz
Approx 60% complete for SRR3105697_chr18.fastq.gz
Approx 65% complete for SRR3105697_chr18.fastq.gz
Approx 70% complete for SRR3105697_chr18.fastq.gz
Approx 75% complete for SRR3105697_chr18.fastq.gz
Approx 80% complete for SRR3105697_chr18.fastq.gz
Approx 85% complete for SRR3105697_chr18.fastq.gz
Approx 90% complete for SRR3105697_chr18.fastq.gz
Approx 95% complete for SRR3105697_chr18.fastq.gz
Analysis complete for SRR3105697_chr18.fastq.gz
[Tue Feb 20 15:06:27 2024]
Finished job 5.
3 of 5 steps (60%) done
Select jobs to execute...
Execute 1 jobs...
[Tue Feb 20 15:06:27 2024]
localrule multiqc:
input: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip
output: multiqc_report.html, multiqc_data
jobid: 7
reason: Input files updated by another job: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip; Set of input files has changed since last execution
resources: tmpdir=/var/tmp/pbs.743371.pbsserver
multiqc FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3105698_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3105697_chr18_fastqc.zip FastQC/SRR3105699_chr18_fastqc.zip
[WARNING] multiqc : MultiQC Version v1.20 now available!
[INFO ] multiqc : This is MultiQC v1.9
[INFO ] multiqc : Template : default
[INFO ] multiqc : Searching : /data/work/I2BC/chloe.quignot/snakemake_tutorial/FastQC/SRR3099586_chr18_fastqc.zip
[INFO ] multiqc : Searching : /data/work/I2BC/chloe.quignot/snakemake_tutorial/FastQC/SRR3105698_chr18_fastqc.zip
[INFO ] multiqc : Searching : /data/work/I2BC/chloe.quignot/snakemake_tutorial/FastQC/SRR3099587_chr18_fastqc.zip
[INFO ] multiqc : Searching : /data/work/I2BC/chloe.quignot/snakemake_tutorial/FastQC/SRR3099585_chr18_fastqc.zip
[INFO ] multiqc : Searching : /data/work/I2BC/chloe.quignot/snakemake_tutorial/FastQC/SRR3105697_chr18_fastqc.zip
[INFO ] multiqc : Searching : /data/work/I2BC/chloe.quignot/snakemake_tutorial/FastQC/SRR3105699_chr18_fastqc.zip
[INFO ] fastqc : Found 6 reports
[INFO ] multiqc : Compressing plot data
[INFO ] multiqc : Report : multiqc_report.html
[INFO ] multiqc : Data : multiqc_data
[INFO ] multiqc : MultiQC complete
[Tue Feb 20 15:06:32 2024]
Finished job 7.
4 of 5 steps (80%) done
Select jobs to execute...
Execute 1 jobs...
[Tue Feb 20 15:06:32 2024]
localrule all:
input: FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, multiqc_report.html
jobid: 0
reason: Input files updated by another job: FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, multiqc_report.html, FastQC/SRR3105698_chr18_fastqc.html
resources: tmpdir=/var/tmp/pbs.743371.pbsserver
[Tue Feb 20 15:06:32 2024]
Finished job 0.
5 of 5 steps (100%) done
Complete log: .snakemake/log/2024-02-20T150605.574089.snakemake.log
Observe the output
As you can see, Snakemake found 3 extra files in your Data/ directory to run FastQC on and regenerated the MultiQC report to include them.
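The “Missing output files” reason in the log can be pictured with the simplified sketch below. It only checks whether the expected output files exist, whereas Snakemake's real scheduling considers more than that (for example, the log also reports a changed set of input files for multiqc). The function name samples_to_run is hypothetical, and the set of existing files is an assumption matching the situation in the log: reports already present for the first three samples only.

```python
def samples_to_run(samples, existing_files):
    """Return the samples whose FastQC outputs are not all present.

    Hypothetical helper: a simplified picture of the
    "Missing output files" reason, not Snakemake's scheduler.
    """
    todo = []
    for sample in samples:
        outputs = [
            f"FastQC/{sample}_fastqc.zip",
            f"FastQC/{sample}_fastqc.html",
        ]
        if not all(path in existing_files for path in outputs):
            todo.append(sample)
    return todo


# The six samples that appear in the log above.
samples = [
    "SRR3099585_chr18", "SRR3099586_chr18", "SRR3099587_chr18",
    "SRR3105697_chr18", "SRR3105698_chr18", "SRR3105699_chr18",
]

# Assumption: FastQC reports already exist for the first three samples.
existing = set()
for sample in samples[:3]:
    existing.add(f"FastQC/{sample}_fastqc.zip")
    existing.add(f"FastQC/{sample}_fastqc.html")

todo = samples_to_run(samples, existing)
print(todo)  # ['SRR3105697_chr18', 'SRR3105698_chr18', 'SRR3105699_chr18']
```

Only the three new samples are selected, which matches the three fastqc jobs listed in the job stats.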