Getting started with Snakemake

Exercise 1C - optimising resource usage

objective > setup > o1 > o2 > o3 > o4 > o5 > recap

Objective 1:

Learn how to tell Snakemake that multiple processors are available locally using the --cores option.
Where to start?
Remember --cores/-c from your previous commands? This option tells Snakemake how many processors (/CPUs/cores/threads) it may use to run your jobs locally. It’s a mandatory option so if you run Snakemake without it, you’ll get an error message.
john.doe@node06:/data/work/I2BC/john.doe/snakemake_tutorial$ qstat -u $USER -wn1

pbsserver: 
                                                                                                   Req'd  Req'd   Elap
Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
747716.pbsserver               john.doe        common          STDIN             825631    1     1    2gb 02:00 R 00:18 node06/3
john.doe@node06:/data/work/I2BC/john.doe/snakemake_tutorial$ logout

qsub: job 747716.pbsserver completed
john.doe@cluster-i2bc:~$ qsub -I -q common -l ncpus=2
qsub: waiting for job 747738.pbsserver to start
qsub: job 747738.pbsserver ready

john.doe@node06:~$ qstat -u $USER -wn1

pbsserver: 
                                                                                                   Req'd  Req'd   Elap
Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
747738.pbsserver               john.doe        common          STDIN             825631    1     2    2gb 02:00 R 00:18 node06/3
john.doe@node06:~$ cd /data/work/I2BC/$USER/snakemake_tutorial
john.doe@node06:/data/work/I2BC/john.doe/snakemake_tutorial$ module load snakemake/snakemake-8.4.6 fastqc/fastqc_v0.11.5 nodes/multiqc-1.9
Run Snakemake on 2 local processors
Running Snakemake on 2 processors instead of just one is quite straightforward (we’ll add -R fastqc to force Snakemake to re-run everything so you can observe the changes):
				
					snakemake -s ex1b_o3.smk -p --configfile ex1.yml -R fastqc --cores 2
				
			
Your output should look like this:
				
					Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Job stats:
job        count
-------  -------
all            1
fastqc         6
multiqc        1
total          8

Select jobs to execute...

Execute 2 jobs...

[Wed Feb 21 13:37:38 2024]
localrule fastqc:
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
    jobid: 2
    reason: Forced execution
    wildcards: sample=SRR3105698_chr18
    resources: tmpdir=/var/tmp/pbs.747714.pbsserver


[Wed Feb 21 13:37:38 2024]
localrule fastqc:
    input: Data/SRR3105699_chr18.fastq.gz
    output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
    log: Logs/SRR3105699_chr18_fastqc.std, Logs/SRR3105699_chr18_fastqc.err
    jobid: 6
    reason: Forced execution
    wildcards: sample=SRR3105699_chr18
    resources: tmpdir=/var/tmp/pbs.747714.pbsserver

[Wed Feb 21 13:37:52 2024]
Finished job 6.
1 of 8 steps (12%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb 21 13:37:52 2024]
localrule fastqc:
    input: Data/SRR3099586_chr18.fastq.gz
    output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
    log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err
    jobid: 1
    reason: Forced execution
    wildcards: sample=SRR3099586_chr18
    resources: tmpdir=/var/tmp/pbs.747714.pbsserver

[Wed Feb 21 13:37:54 2024]
Finished job 2.
2 of 8 steps (25%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb 21 13:37:54 2024]
localrule fastqc:
    input: Data/SRR3105697_chr18.fastq.gz
    output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
    log: Logs/SRR3105697_chr18_fastqc.std, Logs/SRR3105697_chr18_fastqc.err
    jobid: 5
    reason: Forced execution
    wildcards: sample=SRR3105697_chr18
    resources: tmpdir=/var/tmp/pbs.747714.pbsserver

[Wed Feb 21 13:38:02 2024]
Finished job 1.
3 of 8 steps (38%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb 21 13:38:02 2024]
localrule fastqc:
    input: Data/SRR3099587_chr18.fastq.gz
    output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
    log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err
    jobid: 3
    reason: Forced execution
    wildcards: sample=SRR3099587_chr18
    resources: tmpdir=/var/tmp/pbs.747714.pbsserver

[Wed Feb 21 13:38:04 2024]
Finished job 5.
4 of 8 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb 21 13:38:04 2024]
localrule fastqc:
    input: Data/SRR3099585_chr18.fastq.gz
    output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
    log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err
    jobid: 4
    reason: Forced execution
    wildcards: sample=SRR3099585_chr18
    resources: tmpdir=/var/tmp/pbs.747714.pbsserver

[Wed Feb 21 13:38:14 2024]
Finished job 3.
5 of 8 steps (62%) done
[Wed Feb 21 13:38:15 2024]
Finished job 4.
6 of 8 steps (75%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb 21 13:38:15 2024]
localrule multiqc:
    input: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105
699_chr18_fastqc.zip
    output: multiqc_report.html, multiqc_data
    log: Logs/multiqc.std, Logs/multiqc.err
    jobid: 7
    reason: Input files updated by another job: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR310
5699_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip
    resources: tmpdir=/var/tmp/pbs.747714.pbsserver

[Wed Feb 21 13:38:37 2024]
Finished job 7.
7 of 8 steps (88%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb 21 13:38:37 2024]
localrule all:
    input: FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SR
R3105699_chr18_fastqc.html, multiqc_report.html
    jobid: 0
    reason: Input files updated by another job: FastQC/SRR3099587_chr18_fastqc.html, multiqc_report.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3099585_chr18_f
astqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html
    resources: tmpdir=/var/tmp/pbs.747714.pbsserver

[Wed Feb 21 13:38:37 2024]
Finished job 0.
8 of 8 steps (100%) done
Complete log: .snakemake/log/2024-02-21T133728.474300.snakemake.log
				
			
Observe the output
You can now see in the log that Snakemake registered the 2 processors: Provided cores: 2. It’s difficult to see if your jobs are actually running simultaniously just by looking at the log but you can clearly see that Snakemake runs the first 2 jobs at the same time: Execute 2 jobs...
Running Snakemake on 2 processors doesn’t reduce the computation time considerably in this case since we only have 6 input files to process and the tools that we are running are already quite fast to execute.
A little recap before moving on...
Local execution:
Everything we’ve seen up until now could technically also be run on a local computer, even what we’ve just seen above if you have several processors on your PC (and provided you have the necessary software installed).
There are better ways than --cores to parallelise:
We’ve also seen that using --cores N (N being the number of processors to use), enables us to distribute the workload and, thus, to accelerate the computation. However, when running on a cluster, this is not the most optimal solution for parallelising your pipeline.

Why? Let’s take what we just did as an example:

ex1C_cores
All 6 fastqc jobs can be evenly distributed over the 2 processors but the second rule (multiqc) is only using 1 of the 2 that we’ve reserved because there’s only one job for this rule. It’s not that bad when you’re using tools that run fast but becomes more problematic for software with much larger execution times…
What’s the solution? It would be more efficient to let the scheduler (OpenPBS in our case) deal with the distribution of jobs according to the resources available on the cluster. With this system, we’ll also be able to adapt the resources that are reserved to the amount each tool actually needs to run, thereby leaving unused resources free for others to use.
Scroll to Top