Exercise 1A 1 – BIOI2 – Integrative BIOInformatics platforme

Getting started with Snakemake

About this course | Before the session | About Snakemake | Course material | Exercises

Exercise 1A - create your first snakefile

objective > setup > o1 > o2 > o3 > o4 > o5 > o6 > recap

Objective 1

Create a snakemake file named ex1_o1.smk including the first step of the RNAseq workflow (aka. execution of the QC analysis using the FastQC tool) on one of the RNA-seq files.

Where to start?

This training session is not about learning how to analyse RNA-seq data, so here are a few hints about FastQC usage:

Input file: SRR3099585_chr18.fastq.gz in the ${PWD}/Data directory
FastQC command: fastqc --outdir FastQCResultDirectory inputFileName
Expected output: the 2 result files (*_fastqc.zip and *_fastqc.html) will be located in your outdir and named using the base name of your input file (eg. SRR3099585_chr18_fastqc.zip)

Solution

Your Code for ex1_o1.smk should look like this:

rule fastqc:
  input: 
    "Data/SRR3099585_chr18.fastq.gz" 
  output: 
    "FastQC/SRR3099585_chr18_fastqc.zip", 
    "FastQC/SRR3099585_chr18_fastqc.html"
  shell: 
    "fastqc --outdir FastQC/ {input}"

Test the script

Next, let’s check if your pipeline actually works using the following command to run your snakefile:

				
					snakemake --snakefile ex1_o1.smk --cores 1

You should see something similar to the following output on your screen:

				
					Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job       count
------  -------
fastqc        1
total         1

Select jobs to execute...
Execute 1 jobs...

[Tue Feb 20 14:03:01 2024]
localrule fastqc:
    input: Data/SRR3099585_chr18.fastq.gz
    output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
    jobid: 0
    reason: Missing output files: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
    resources: tmpdir=/var/tmp/pbs.743371.pbsserver

Started analysis of SRR3099585_chr18.fastq.gz
Approx 5% complete for SRR3099585_chr18.fastq.gz
Approx 10% complete for SRR3099585_chr18.fastq.gz
Approx 15% complete for SRR3099585_chr18.fastq.gz
Approx 20% complete for SRR3099585_chr18.fastq.gz
Approx 25% complete for SRR3099585_chr18.fastq.gz
Approx 30% complete for SRR3099585_chr18.fastq.gz
Approx 35% complete for SRR3099585_chr18.fastq.gz
Approx 40% complete for SRR3099585_chr18.fastq.gz
Approx 45% complete for SRR3099585_chr18.fastq.gz
Approx 50% complete for SRR3099585_chr18.fastq.gz
Approx 55% complete for SRR3099585_chr18.fastq.gz
Approx 60% complete for SRR3099585_chr18.fastq.gz
Approx 65% complete for SRR3099585_chr18.fastq.gz
Approx 70% complete for SRR3099585_chr18.fastq.gz
Approx 75% complete for SRR3099585_chr18.fastq.gz
Approx 80% complete for SRR3099585_chr18.fastq.gz
Approx 85% complete for SRR3099585_chr18.fastq.gz
Approx 90% complete for SRR3099585_chr18.fastq.gz
Approx 95% complete for SRR3099585_chr18.fastq.gz
Analysis complete for SRR3099585_chr18.fastq.gz
[Tue Feb 20 14:03:09 2024]
Finished job 0.
1 of 1 steps (100%) done
Complete log: .snakemake/log/2024-02-20T140256.859420.snakemake.log

Observe the output

Have a look at the log that’s printed on your screen. The “Job stats” table in Snakemake’s output log will show you a summary of exactly what rules will be run and how many times (i.e. on how many input files). In this case, the fastqc rule will be run once.

				
					Job stats:
job       count
------  -------
fastqc        1
total         1

Now, let’s have a look at your working directory:

				
					john.doe@node06:/data/work/I2BC/john.doe/snakemake_tutorial$ ls -a 
.  ..  .snakemake  Data  FastQC  ex1_o1.smk

Snakemake saw that the output directory FastQC was missing and created it before running FastQC (and that’s a good thing since FastQC doesn’t do this naturally and would have created an error otherwise!).

Also, notice the existence of a hidden folder called .snakemake. This is were Snakemake stores temporary files relative to the jobs that were executed. By default, this folder is created in the current working directory (i.e. the one from which the Snakemake command was executed).