Getting started with Snakemake
Objective 5:
Create the
ex1c_o5.smk
Snakefile, in which we will specify custom resources for each rule.
Where to start?
These resources can be added using the threads
(number of processors) and resources
(memory, walltime, etc.) directives, for example:
rule ruleName:
    input:
        "inputFile.txt"
    output:
        "outputFile.txt"
    envmodules:
        "nodes/mySoftware"
    threads: 1
    resources:
        mem="100Mb",
        time_min="00:05:00"
    shell:
        """
        mySoftware {input} > {output}
        """
Your code for ex1c_o5.smk
should look like this:
SAMPLES, = glob_wildcards(config["dataDir"]+"/{sample}.fastq.gz")

rule all:
    input:
        expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
        "multiqc_report.html"

rule fastqc:
    input:
        config["dataDir"]+"/{sample}.fastq.gz"
    output:
        "FastQC/{sample}_fastqc.zip",
        "FastQC/{sample}_fastqc.html"
    log:
        "Logs/{sample}_fastqc.std",
        "Logs/{sample}_fastqc.err"
    envmodules:
        "fastqc/fastqc_v0.11.5"
    threads: 1
    resources:
        mem="100Mb",
        time_min="00:05:00"
    shell:
        "fastqc --outdir FastQC {input} 1>{log[0]} 2>{log[1]}"

rule multiqc:
    input:
        expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES)
    output:
        "multiqc_report.html",
        directory("multiqc_data")
    log:
        std="Logs/multiqc.std",
        err="Logs/multiqc.err"
    envmodules:
        "nodes/multiqc-1.9"
    threads: 1
    resources:
        mem="1Gb",
        time_min="00:10:00"
    shell:
        "multiqc {input} 1>{log.std} 2>{log.err}"
Run your Snakefile
Let’s try running your Snakefile again:
snakemake -s ex1c_o5.smk --configfile ex1.yml -R fastqc --profile pbs
Your output should look like this:
Using profile pbs for setting default command line arguments.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 6
Job stats:
job count
------- -------
all 1
fastqc 6
multiqc 1
total 8
Select jobs to execute...
Execute 6 jobs...
[Wed Feb 21 23:41:09 2024]
rule fastqc:
input: Data/SRR3099586_chr18.fastq.gz
output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err
jobid: 1
reason: Forced execution
wildcards: sample=SRR3099586_chr18
resources: mem_mb=100, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=100Mb, time_min=00:05:00
fastqc --outdir FastQC Data/SRR3099586_chr18.fastq.gz 1>Logs/SRR3099586_chr18_fastqc.std 2>Logs/SRR3099586_chr18_fastqc.err
Submitted job 1 with external jobid '748821.pbsserver'.
[Wed Feb 21 23:41:09 2024]
rule fastqc:
input: Data/SRR3105699_chr18.fastq.gz
output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
log: Logs/SRR3105699_chr18_fastqc.std, Logs/SRR3105699_chr18_fastqc.err
jobid: 6
reason: Forced execution
wildcards: sample=SRR3105699_chr18
resources: mem_mb=100, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=100Mb, time_min=00:05:00
fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz 1>Logs/SRR3105699_chr18_fastqc.std 2>Logs/SRR3105699_chr18_fastqc.err
Submitted job 6 with external jobid '748822.pbsserver'.
[Wed Feb 21 23:41:09 2024]
rule fastqc:
input: Data/SRR3105698_chr18.fastq.gz
output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
jobid: 2
reason: Forced execution
wildcards: sample=SRR3105698_chr18
resources: mem_mb=100, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=100Mb, time_min=00:05:00
fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err
Submitted job 2 with external jobid '748823.pbsserver'.
[Wed Feb 21 23:41:09 2024]
rule fastqc:
input: Data/SRR3105697_chr18.fastq.gz
output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
log: Logs/SRR3105697_chr18_fastqc.std, Logs/SRR3105697_chr18_fastqc.err
jobid: 5
reason: Forced execution
wildcards: sample=SRR3105697_chr18
resources: mem_mb=100, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=100Mb, time_min=00:05:00
fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz 1>Logs/SRR3105697_chr18_fastqc.std 2>Logs/SRR3105697_chr18_fastqc.err
Submitted job 5 with external jobid '748824.pbsserver'.
[Wed Feb 21 23:41:09 2024]
rule fastqc:
input: Data/SRR3099587_chr18.fastq.gz
output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err
jobid: 3
reason: Forced execution
wildcards: sample=SRR3099587_chr18
resources: mem_mb=100, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=100Mb, time_min=00:05:00
fastqc --outdir FastQC Data/SRR3099587_chr18.fastq.gz 1>Logs/SRR3099587_chr18_fastqc.std 2>Logs/SRR3099587_chr18_fastqc.err
Submitted job 3 with external jobid '748825.pbsserver'.
[Wed Feb 21 23:41:09 2024]
rule fastqc:
input: Data/SRR3099585_chr18.fastq.gz
output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err
jobid: 4
reason: Forced execution
wildcards: sample=SRR3099585_chr18
resources: mem_mb=100, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=100Mb, time_min=00:05:00
fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz 1>Logs/SRR3099585_chr18_fastqc.std 2>Logs/SRR3099585_chr18_fastqc.err
Submitted job 4 with external jobid '748826.pbsserver'.
[Wed Feb 21 23:41:38 2024]
Finished job 1.
1 of 8 steps (12%) done
[Wed Feb 21 23:41:38 2024]
Finished job 6.
2 of 8 steps (25%) done
[Wed Feb 21 23:41:38 2024]
Finished job 2.
3 of 8 steps (38%) done
[Wed Feb 21 23:41:38 2024]
Finished job 5.
4 of 8 steps (50%) done
[Wed Feb 21 23:41:38 2024]
Finished job 3.
5 of 8 steps (62%) done
[Wed Feb 21 23:41:39 2024]
Finished job 4.
6 of 8 steps (75%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Feb 21 23:41:39 2024]
rule multiqc:
input: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip
output: multiqc_report.html, multiqc_data
log: Logs/multiqc.std, Logs/multiqc.err
jobid: 7
reason: Input files updated by another job: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip
resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=, threads=1, mem=1Gb, time_min=00:10:00
multiqc FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3105698_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3105697_chr18_fastqc.zip FastQC/SRR3105699_chr18_fastqc.zip 1>Logs/multiqc.std 2>Logs/multiqc.err
Submitted job 7 with external jobid '748827.pbsserver'.
[Wed Feb 21 23:41:49 2024]
Finished job 7.
7 of 8 steps (88%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Feb 21 23:41:49 2024]
localrule all:
input: FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, multiqc_report.html
jobid: 0
reason: Input files updated by another job: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, multiqc_report.html
resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/var/tmp/pbs.748722.pbsserver, threads=1, mem=1Gb
[Wed Feb 21 23:41:49 2024]
Finished job 0.
8 of 8 steps (100%) done
Complete log: .snakemake/log/2024-02-21T234108.721811.snakemake.log
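Each "Submitted job" line in the log above pairs a Snakemake job number with the job ID assigned by the cluster. As a minimal sketch, assuming the log text has been saved or copied into a string, the snippet below collects these pairs with a regular expression. The function name external_job_ids and the two-line excerpt are only illustrative; they are not part of Snakemake.

```python
import re

# Matches log lines such as:
#   Submitted job 1 with external jobid '748821.pbsserver'.
SUBMITTED = re.compile(r"Submitted job (\d+) with external jobid '([^']+)'")


def external_job_ids(log_text):
    """Map each Snakemake job number to the job ID assigned by the cluster."""
    return {int(m.group(1)): m.group(2) for m in SUBMITTED.finditer(log_text)}


# Two lines copied from the log above (illustrative excerpt only).
log_excerpt = """\
Submitted job 1 with external jobid '748821.pbsserver'.
Submitted job 6 with external jobid '748822.pbsserver'.
"""

for job, cluster_id in sorted(external_job_ids(log_excerpt).items()):
    print(job, cluster_id)
```

The same function could presumably be applied to the contents of the file reported after "Complete log:", provided that file contains the same "Submitted job" lines as the screen output.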
Observe the output
As you can see in the log, the fastqc and multiqc jobs were not run with the same resources (cf. the resources lines above, e.g.: resources: mem_mb=100, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, threads=1, mem=100Mb, time_min=00:05:00).
To find out how many resources your jobs actually used, you can use the cluster's qstat command: qstat -fxw -G <jobID>. The job IDs assigned by the cluster are also listed in the log printed on your screen (cf. the "Submitted job" lines above, e.g.: Submitted job 1 with external jobid '748821.pbsserver').
Let us compare the resources used in ex1c_o3.smk (default resources) and ex1c_o5.smk (customised resources):
# ex1c_o3.smk
Job Id: 1533104.pbsserver
session_id = 51161
Resource_List.mem = 1gb
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.place = pack
Resource_List.preempt_targets = QUEUE=lowprio
Resource_List.select = 1:mem=1gb:ncpus=1
Resource_List.walltime = 02:00:00
Job_Name = snakejob.fastqc.2.sh
Job_Owner = c.toffano-nioche@node06.example.org
resources_used.cpupercent = 0
resources_used.cput = 00:00:10
resources_used.mem = 102400kb
resources_used.ncpus = 1
resources_used.vmem = 2823656kb
resources_used.walltime = 00:00:17
job_state = F
queue = common
# ex1c_o5.smk
Job Id: 1533800.pbsserver
session_id = 51712
Resource_List.mem = 100mb
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.place = pack
Resource_List.preempt_targets = QUEUE=lowprio
Resource_List.select = 1:mem=100mb:ncpus=1
Resource_List.walltime = 00:05:00
Job_Name = snakejob.fastqc.2.sh
Job_Owner = c.toffano-nioche@node06.example.org
resources_used.cpupercent = 0
resources_used.cput = 00:00:10
resources_used.mem = 102400kb
resources_used.ncpus = 1
resources_used.vmem = 2823656kb
resources_used.walltime = 00:00:17
job_state = F
queue = common
Lines starting with Resource_List summarise the resources reserved for your job. In particular, lines 4, 5 and 10 of each listing (Resource_List.mem, Resource_List.ncpus and Resource_List.walltime) show the (RAM) memory, the number of processors and the walltime that were reserved.
Lines starting with resources_used summarise the resources that were actually used by your job. In particular, lines 13, 15 and 18 (resources_used.cpupercent, resources_used.mem and resources_used.walltime) show the percentage of processors used (between 0 and 100% x ncpus), the (RAM) memory and the actual time that was used.
For example, we can see that reserving 100mb is much better suited than reserving 1gb, since the memory actually used is about 100mb (102400kb).
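The comparison between reserved and used resources can also be done programmatically. The sketch below is only an illustration: it parses a few "key = value" lines copied from the ex1c_o5.smk listing above and reports which fraction of each reservation was used. The helper names (parse_qstat, to_kb, to_seconds) are hypothetical, and the size conversion assumes that the scheduler's kb, mb and gb suffixes are 1024-based, which you should check against your cluster's documentation.

```python
def parse_qstat(text):
    """Collect the 'key = value' lines of a qstat listing into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # lines without ' = ' (e.g. 'Job Id: ...') are skipped
            fields[key.strip()] = value.strip()
    return fields


def to_kb(size):
    """Convert '100mb', '1gb' or '102400kb' to kilobytes.

    Assumption: the suffixes are 1024-based (1mb = 1024kb).
    """
    units = {"kb": 1, "mb": 1024, "gb": 1024 ** 2}
    size = size.lower()
    return int(size[:-2]) * units[size[-2:]]


def to_seconds(hms):
    """Convert an HH:MM:SS duration to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Lines copied from the ex1c_o5.smk listing above.
qstat_o5 = """\
Job Id: 1533800.pbsserver
Resource_List.mem = 100mb
Resource_List.walltime = 00:05:00
resources_used.mem = 102400kb
resources_used.walltime = 00:00:17
"""

job = parse_qstat(qstat_o5)
mem_ratio = to_kb(job["resources_used.mem"]) / to_kb(job["Resource_List.mem"])
time_ratio = (to_seconds(job["resources_used.walltime"])
              / to_seconds(job["Resource_List.walltime"]))

print(f"memory used:   {mem_ratio:.0%} of the reservation")
print(f"walltime used: {time_ratio:.0%} of the reservation")
```

A ratio close to 100% means the reservation matches the job closely. In practice it is usually wise to keep some margin, since a job that exceeds its reservation may be killed by the scheduler.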