Exercise 1B 2 – BIOI2 – Integrative BIOInformatics platforme

Getting started with Snakemake

About this course | Before the session | About Snakemake | Course material | Exercises

Exercise 1B - improving your snakefile

objective > setup > o1 > o2 > o3 > o4 > recap

Objective 2

Adapt your pipeline (ex1b_o2.smk) to use a configuration file called myConfig.yml that specifies the path of the input data directory, and replace the hard-coding values (Data/) in the snakefile.

Where to start?

Create the configfile: Snakemake accepts both yaml and json formats.
Import it in the Snakefile: you can read and load any variable in the configuration file from your Snakefile by using the following syntax: config["myItem"] (“myItem” being the name of your variable in the configfile).
At run time: so that Snakemake knows where to look, you have to add the --configfile ex1.yml to Snakemake’s command line option.

NB: Alternativelly, you can add the directive configfile: ex1.yml at the beginning of your snakefile instead

Why use a configuration file? In this file, you will place all hard-coding values of the snakefile (paths to files, core numbers, parameter values, etc). In this way, you have a generalised analysis pipeline that you can easily apply to new input data by simply changing the configfile rather than the snakefile directly.

Solution

Your configuration file

Your ex1.yml configfile should look like this (YAML syntax):

dataDir:
  "Data"

The snakefile:

Your code in ex1b_o2.smk should look like this:

SAMPLES, = glob_wildcards(config["dataDir"]+"/{sample}.fastq.gz")

rule all:
  input:
    expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
    "multiqc_report.html",

rule fastqc:
  input:
    config["dataDir"]+"/{sample}.fastq.gz",
  output:
    "FastQC/{sample}_fastqc.zip",
    "FastQC/{sample}_fastqc.html"
  shell: "fastqc --outdir FastQC {input}"

rule multiqc:
  input:
    expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES),
  output:
    "multiqc_report.html",
    directory("multiqc_data")
  shell: "multiqc {input}"

Test the script

Next, let’s check if your pipeline still works as it should using the following command:

				
					snakemake -s ex1b_o2.smk -c 1 -p --configfile ex1.yml

You should see something similar to the following output on your screen.

				
					Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).

Observe the output

As you can see, Snakemake doesn’t see anything that changed that might influence the outputs that were already generated and tells you that there’s “Nothing to be done“.

To go further - more parameters

The configfile can be used to specify as many parameters as you want, for example, the name of the output directory in the fastqc rule. In order to add parameters to the shell command line, you can use the params directive:

Solution 2.0

You configfile (ex1.yml) could like this:

dataDir:
  "Data"
fastqcOutDir:
  "FastQC"

And your code in ex1b_o2b.smk could look like this:

SAMPLES, = glob_wildcards(config["dataDir"]+"/{sample}.fastq.gz")

rule all:
  input:
    expand(config["fastqcOutDir"]+"/{sample}_fastqc.html", sample=SAMPLES),
    "multiqc_report.html",

rule fastqc:
  input:
    config["dataDir"]+"/{sample}.fastq.gz",
  output:
    config["fastqcOutDir"]+"/{sample}_fastqc.zip",
    config["fastqcOutDir"]+"/{sample}_fastqc.html",
  params:
    outdir=config["fastqcOutDir"],
  shell: "fastqc --outdir {params.outdir} {input} "

rule multiqc:
  input:
    expand(config["fastqcOutDir"]+"/{sample}_fastqc.zip", sample = SAMPLES),
  output:
    "multiqc_report.html",
    directory("multiqc_data"),
  shell: "multiqc {input}"

How values of directives are used in "shell"

Let’s take a moment to compare the various “wildcards” that we’ve used within the shell calls and the way we specify the files in our directives. For example, in the fastqc rule:

rule fastqc: 
  input: 
    config["dataDir"]+"/{sample}.fastq.gz" 
  output:  
    config["fastqcOutDir"]+"/{sample}_fastqc.zip",  
    config["fastqcOutDir"]+"/{sample}_fastqc.html"  
  params:
    outdir=config["fastqcOutDir"] 
  shell: "fastqc --outdir {params.outdir} {input} "

As you can see, you are free to add as many files to your directives as you wish: input has only 1, output has 2 and params has only 1 (and we’ve named it “outdir”). Note that if you provide more than one file to a directive, these should be separated by a comma (see output directive above for an example). If you are afraid of forgetting it, you can systematically add a comma at the end of your directive lines.

To use the files specified in your directives within the shell directive, you can use the wildcard syntax ({variable}). In the case of our input directive, as we only have 1 file, we used the directive name directly ({input}). However, as soon as you have several files, you can use Python’s list syntax. For example, if we had to explicitly mention our output files in the shell command, we would have specified them with {output[0]} and {output[1]}. We can also give names to each element if there are several to avoid using indexing (i.e. [0], [1] etc.). For example, in the params directive, we chose to name the file, so we use the syntax {directive.name} instead.

more about wildcards

Transfer your Python skills - bonus for Python addicts only 😉

And finally, it’s important to note that the language Snakemake uses is Python, so don’t hesitate to use Python’s modules within the snakefile if you’re familiar with the language. For example, you could use the os.path module’s join() function instead of concatenating your paths with the + operator. You could also use Python’s string formats or triple quote strings too.

Solution 3.0

Here’s an example with a few Python “upgrades” ex1_o7c.smk:

from os.path import join

SAMPLES, = glob_wildcards(join(config["dataDir"],"{sample}.fastq.gz"))

rule all:
  input:
    expand(join(config["fastqcOutDir"],"{sample}_fastqc.html"), sample=SAMPLES),
    "multiqc_report.html"

rule fastqc:
  input:
    join(config["dataDir"],"{sample}.fastq.gz")
  output:
    join(config["fastqcOutDir"],"{sample}_fastqc.zip"),
    join(config["fastqcOutDir"],"{sample}_fastqc.html")
  shell: 
    f"""fastqc --outdir {config["fastqcOutDir"]} {{input}}"""

rule multiqc:
  input:
    expand(join(config["fastqcOutDir"],"{sample}_fastqc.zip"), sample = SAMPLES)
  output:
    "multiqc_report.html",
    directory("multiqc_data")
  shell: "multiqc {input}"

Note that if you use Python’s string formatting (e.g. f"mystring"), you should replace braces in wildcard annotations with double braces (e.g. f"{{input}}").