Getting started with Snakemake
Objective 1
Create a snakemake file named ex1_o1.smk
including the first step of the RNAseq workflow (aka. execution of the QC analysis using the FastQC tool) on one of the RNA-seq files.
Where to start?
This training session is not about learning how to analyse RNA-seq data, so here are a few hints about FastQC usage:
- Input file:
SRR3099585_chr18.fastq.gz
in the${PWD}/Data
directory - FastQC command:
fastqc --outdir FastQCResultDirectory inputFileName
- Expected output: the 2 result files (
*_fastqc.zip
and*_fastqc.html
) will be located in youroutdir
and named using the base name of your input file (eg.SRR3099585_chr18_fastqc.zip
)
Your Code for ex1_o1.smk
should look like this:
rule fastqc: input: "Data/SRR3099585_chr18.fastq.gz" output: "FastQC/SRR3099585_chr18_fastqc.zip", "FastQC/SRR3099585_chr18_fastqc.html" shell: "fastqc --outdir FastQC/ {input}"
Test the script
Next, let’s check if your pipeline actually works using the following command to run your snakefile:
snakemake --snakefile ex1_o1.smk --cores 1
You should see something similar to the following output on your screen:
Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count
------ -------
fastqc 1
total 1
Select jobs to execute...
Execute 1 jobs...
[Tue Feb 20 14:03:01 2024]
localrule fastqc:
input: Data/SRR3099585_chr18.fastq.gz
output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
jobid: 0
reason: Missing output files: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
resources: tmpdir=/var/tmp/pbs.743371.pbsserver
Started analysis of SRR3099585_chr18.fastq.gz
Approx 5% complete for SRR3099585_chr18.fastq.gz
Approx 10% complete for SRR3099585_chr18.fastq.gz
Approx 15% complete for SRR3099585_chr18.fastq.gz
Approx 20% complete for SRR3099585_chr18.fastq.gz
Approx 25% complete for SRR3099585_chr18.fastq.gz
Approx 30% complete for SRR3099585_chr18.fastq.gz
Approx 35% complete for SRR3099585_chr18.fastq.gz
Approx 40% complete for SRR3099585_chr18.fastq.gz
Approx 45% complete for SRR3099585_chr18.fastq.gz
Approx 50% complete for SRR3099585_chr18.fastq.gz
Approx 55% complete for SRR3099585_chr18.fastq.gz
Approx 60% complete for SRR3099585_chr18.fastq.gz
Approx 65% complete for SRR3099585_chr18.fastq.gz
Approx 70% complete for SRR3099585_chr18.fastq.gz
Approx 75% complete for SRR3099585_chr18.fastq.gz
Approx 80% complete for SRR3099585_chr18.fastq.gz
Approx 85% complete for SRR3099585_chr18.fastq.gz
Approx 90% complete for SRR3099585_chr18.fastq.gz
Approx 95% complete for SRR3099585_chr18.fastq.gz
Analysis complete for SRR3099585_chr18.fastq.gz
[Tue Feb 20 14:03:09 2024]
Finished job 0.
1 of 1 steps (100%) done
Complete log: .snakemake/log/2024-02-20T140256.859420.snakemake.log
Observe the output
Have a look at the log that’s printed on your screen. The “Job stats” table in Snakemake’s output log will show you a summary of exactly what rules will be run and how many times (i.e. on how many input files). In this case, the fastqc rule will be run once.
Job stats:
job count
------ -------
fastqc 1
total 1
Now, let’s have a look at your working directory:
john.doe@node06:/data/work/I2BC/john.doe/snakemake_tutorial$ ls -a
. .. .snakemake Data FastQC ex1_o1.smk
FastQC
was missing and created it before running FastQC (and that’s a good thing since FastQC doesn’t do this naturally and would have created an error otherwise!). .snakemake
. This is were Snakemake stores temporary files relative to the jobs that were executed. By default, this folder is created in the current working directory (i.e. the one from which the Snakemake command was executed).