Getting started with Snakemake
Objective 2
ex1b_o2.smk
) to use a configuration file called myConfig.yml
that specifies the path of the input data directory, and replace the hard-coding values (Data/
) in the snakefile.Where to start?
- Create the configfile: Snakemake accepts both yaml and json formats.
- Import it in the Snakefile: you can read and load any variable in the configuration file from your Snakefile by using the following syntax:
config["myItem"]
(“myItem” being the name of your variable in the configfile). - At run time: so that Snakemake knows where to look, you have to add the
--configfile ex1.yml
to Snakemake’s command line option.
NB: Alternativelly, you can add the directiveconfigfile: ex1.yml
at the beginning of your snakefile instead
Your configuration file
Your ex1.yml
configfile should look like this (YAML syntax):
dataDir: "Data"
The snakefile:
Your code in ex1b_o2.smk
should look like this:
SAMPLES, = glob_wildcards(config["dataDir"]+"/{sample}.fastq.gz")
rule all:
input:
expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
"multiqc_report.html",
rule fastqc:
input:
config["dataDir"]+"/{sample}.fastq.gz",
output:
"FastQC/{sample}_fastqc.zip",
"FastQC/{sample}_fastqc.html"
shell: "fastqc --outdir FastQC {input}"
rule multiqc:
input:
expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES),
output:
"multiqc_report.html",
directory("multiqc_data")
shell: "multiqc {input}"
Test the script
Next, let’s check if your pipeline still works as it should using the following command:
snakemake -s ex1b_o2.smk -c 1 -p --configfile ex1.yml
You should see something similar to the following output on your screen.
Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).
Observe the output
To go further - more parameters
params
directive:You configfile (ex1.yml
) could like this:
dataDir: "Data" fastqcOutDir: "FastQC"
And your code in ex1b_o2b.smk
could look like this:
SAMPLES, = glob_wildcards(config["dataDir"]+"/{sample}.fastq.gz")
rule all:
input:
expand(config["fastqcOutDir"]+"/{sample}_fastqc.html", sample=SAMPLES),
"multiqc_report.html",
rule fastqc:
input:
config["dataDir"]+"/{sample}.fastq.gz",
output:
config["fastqcOutDir"]+"/{sample}_fastqc.zip",
config["fastqcOutDir"]+"/{sample}_fastqc.html",
params:
outdir=config["fastqcOutDir"],
shell: "fastqc --outdir {params.outdir} {input} "
rule multiqc:
input:
expand(config["fastqcOutDir"]+"/{sample}_fastqc.zip", sample = SAMPLES),
output:
"multiqc_report.html",
directory("multiqc_data"),
shell: "multiqc {input}"
How values of directives are used in "shell"
Let’s take a moment to compare the various “wildcards” that we’ve used within the shell calls and the way we specify the files in our directives. For example, in the fastqc rule:
rule fastqc:
input:
config["dataDir"]+"/{sample}.fastq.gz"
output:
config["fastqcOutDir"]+"/{sample}_fastqc.zip",
config["fastqcOutDir"]+"/{sample}_fastqc.html"
params:
outdir=config["fastqcOutDir"]
shell: "fastqc --outdir {params.outdir} {input} "
As you can see, you are free to add as many files to your directives as you wish: input
has only 1, output
has 2 and params
has only 1 (and we’ve named it “outdir”). Note that if you provide more than one file to a directive, these should be separated by a comma (see output
directive above for an example). If you are afraid of forgetting it, you can systematically add a comma at the end of your directive lines.
To use the files specified in your directives within the shell directive, you can use the wildcard syntax ({variable}
). In the case of our input directive, as we only have 1 file, we used the directive name directly ({input}
). However, as soon as you have several files, you can use Python’s list syntax. For example, if we had to explicitly mention our output files in the shell command, we would have specified them with {output[0]}
and {output[1]}
. We can also give names to each element if there are several to avoid using indexing (i.e. [0]
, [1]
etc.). For example, in the params
directive, we chose to name the file, so we use the syntax {directive.name}
instead.
Transfer your Python skills - bonus for Python addicts only 😉
And finally, it’s important to note that the language Snakemake uses is Python, so don’t hesitate to use Python’s modules within the snakefile if you’re familiar with the language. For example, you could use the os.path
module’s join()
function instead of concatenating your paths with the +
operator. You could also use Python’s string formats or triple quote strings too.
Here’s an example with a few Python “upgrades” ex1_o7c.smk
:
from os.path import join
SAMPLES, = glob_wildcards(join(config["dataDir"],"{sample}.fastq.gz"))
rule all:
input:
expand(join(config["fastqcOutDir"],"{sample}_fastqc.html"), sample=SAMPLES),
"multiqc_report.html"
rule fastqc:
input:
join(config["dataDir"],"{sample}.fastq.gz")
output:
join(config["fastqcOutDir"],"{sample}_fastqc.zip"),
join(config["fastqcOutDir"],"{sample}_fastqc.html")
shell:
f"""fastqc --outdir {config["fastqcOutDir"]} {{input}}"""
rule multiqc:
input:
expand(join(config["fastqcOutDir"],"{sample}_fastqc.zip"), sample = SAMPLES)
output:
"multiqc_report.html",
directory("multiqc_data")
shell: "multiqc {input}"
Note that if you use Python’s string formatting (e.g. f"mystring"
), you should replace braces in wildcard annotations with double braces (e.g. f"{{input}}"
).