Getting started with Snakemake
Step 1 - connect to the I2BC cluster
As in the previous exercises, we will be working on the I2BC cluster, on which all the tools that we need are already installed. To connect to the cluster you will need your Multipass login information. Please refer to Step 4 on the I2BC cluster training page for more details on how to proceed.
Step 2 - prepare your working environment
Once on the cluster, connect to a node (that’s where all the tools are installed):
john.doe@cluster-i2bc:~$ qsub -I -q common
qsub: waiting for job 659602.pbsserver to start
qsub: job 659602.pbsserver ready
john.doe@node06:~$
Then load the appropriate tools using the module command (i.e. snakemake):
module load snakemake/snakemake-8.4.6
Double-check that you’ve loaded the modules correctly, for example, by checking their version:
john.doe@node06:~$ snakemake --version
8.4.6
And create & move to your chosen working space (if not already done in the previous exercises), for example:
WORKDIR="/data/work/I2BC/$USER/snakemake_tutorial"
mkdir -p $WORKDIR
cd $WORKDIR
Step 3 - fetch the example files
# download archive
wget "https://zenodo.org/record/3997237/files/FAIR_Bioinfo_data.tar.gz"
# extract files
tar -zxf FAIR_Bioinfo_data.tar.gz
# delete the archive
rm FAIR_Bioinfo_data.tar.gz
You should now see in your working space, a folder called Data
containing compressed fastq files (*.fastq.gz
) and O. tauri genome files (*.gff
and *.fna
):
john.doe@node06:/data/work/I2BC/john.doe/snakemake_tutorial$ ls Data/
O.tauri_annotation.gff SRR3099585_chr18.fastq.gz SRR3099587_chr18.fastq.gz SRR3105698_chr18.fastq.gz
O.tauri_genome.fna SRR3099586_chr18.fastq.gz SRR3105697_chr18.fastq.gz SRR3105699_chr18.fastq.gz
bowtie2
and featureCounts
steps. The bash script that we will be converting into a snakemake pipeline is the following:
## READ THE INPUT ARGUMENTS
while getopts g:a:d: flag; do
case $flag in
g) genome=$OPTARG
echo genome is $genome ;;
a) annots=$OPTARG
echo annotation is $annots ;;
d) rnadir=$OPTARG
echo RNAseq path is $rnadir ;;
:) echo "L'option $OPTARG requiert un argument "
exit 1 ;;
\?) echo "$OPTARG : option invalide "
exit 1 ;;
esac
done
shift $((OPTIND - 1)) # shift past the last flag or argument
echo samples are $*
# define resources in cpus for each step
ncpus_build=1
ncpus_map=1
ncpus_sort=1
ncpus_index=1
ncpus_count=1
## CREATE REFERENCE GENOME INDEX
if [ ! -d Bwt2_index ]; then
mkdir Bwt2_index
bowtie2-build --threads ${ncpus_build} ${genome} Bwt2_index/tauri > Bwt2_index/Bwt2_index.log 2>&1
fi
mkdir LOGs
## RUN QC, MAP READS, SORT BAMS & COUNT READS
nbs=0;
for sample in $* ; do
nbs=$(expr ${nbs} + 1)
echo traitement of sample ${sample}
# --------- quality control of reads
mkdir -p FastQC # fastqc doesnt create the output dir itself
fastqc --outdir FastQC ${rnadir}/${sample}.fastq.gz > FastQC/${sample}.log 2>&1
# ---------- reads mapping
mkdir -p SAMs
bowtie2 --threads ${ncpus_map} -x Bwt2_index/tauri -U ${rnadir}/${sample}.fastq.gz -S SAMs/${sample}.sam > LOGs/${sample}_bowtie2.log 2>&1
# ---------- selection and format modification
mkdir -p BAMs
samtools view -b SAMs/${sample}.sam -o BAMs/${sample}.bam
samtools sort --threads ${ncpus_sort} BAMs/${sample}.bam -o BAMs/${sample}_sort.bam
samtools index -@ ${ncpus_index} BAMs/${sample}_sort.bam
# ---------- counting of mapped reads by gene
mkdir -p Counts
featureCounts -T ${ncpus_count} -t gene -g ID -a ${annots} -s 2 -o Counts/${sample}_ftc.txt BAMs/${sample}_sort.bam > LOGs/${sample}_ftc.log 2>&1
done
## CONCATENATE & REFORMAT COUNT TABLES
paste Counts/*_ftc.txt > Counts/ftc_tmp.txt
awk -v nb=${nbs} -v col=7 'BEGIN {FS ="\t"}{ctmp=$1; for (i=col; i<=nb*col; i=i+col) {count=sprintf("%s\t%s", ctmp, $i); ctmp=count}; print count}' Counts/ftc_tmp.txt | sed '1d' > Counts/counts.txt