Exercise 2 setup – BIOI2 – Integrative BIOInformatics platforme

Getting started with Snakemake

About this course | Before the session | About Snakemake | Course material | Exercises

Exercise 2 - from bash to Snakemake

Step 1 - connect to the I2BC cluster

As in the previous exercises, we will be working on the I2BC cluster, on which all the tools that we need are already installed. To connect to the cluster you will need your Multipass login information. Please refer to Step 4 on the I2BC cluster training page for more details on how to proceed.

Step 2 - prepare your working environment

Once on the cluster, connect to a node (that’s where all the tools are installed):

				
					john.doe@slurmlogin:~$ srun --pty bash
john.doe@node06:~$

Then load the appropriate tools using the module command (i.e. snakemake):

				
					module load snakemake/snakemake-8.4.6

Double-check that you’ve loaded the modules correctly, for example, by checking their version:

				
					john.doe@node06:~$ snakemake --version
8.4.6

And create & move to your chosen working space (if not already done in the previous exercises), for example:

				
					WORKDIR="$HOME/snakemake-training/exercise-2"
mkdir -p $WORKDIR
cd $WORKDIR

Step 3 - fetch the example files

Example files are accessible on Zenodo as a tar.gz archive . To fetch and extract the files, you can use the following command lines (if not already done in Exercise 1):

				
					# download archive
wget "https://zenodo.org/record/3997237/files/FAIR_Bioinfo_data.tar.gz"
# extract files
tar -zxf FAIR_Bioinfo_data.tar.gz
# delete the archive
rm FAIR_Bioinfo_data.tar.gz

You should now see in your working space, a folder called Data containing compressed fastq files (*.fastq.gz) and O. tauri genome files (*.gff and *.fna):

				
					john.doe@node06:~/snakemake-training/exercise-2$ ls Data/
O.tauri_annotation.gff  SRR3099585_chr18.fastq.gz  SRR3099587_chr18.fastq.gz  SRR3105698_chr18.fastq.gz
O.tauri_genome.fna      SRR3099586_chr18.fastq.gz  SRR3105697_chr18.fastq.gz  SRR3105699_chr18.fastq.gz

Fastq files are the RNAseq input files for our pipeline and O. tauri genome and annotation files are the reference genome files needed in the bowtie2 and featureCounts steps.

The bash script that we will be converting into a snakemake pipeline is the following:

				
					## READ THE INPUT ARGUMENTS
while getopts g:a:d: flag; do
    case $flag in
        g)  genome=$OPTARG
            echo genome is $genome ;;
        a)  annots=$OPTARG
            echo annotation is $annots ;;
        d)  rnadir=$OPTARG
            echo RNAseq path is $rnadir ;;
        :)  echo "L'option $OPTARG requiert un argument "
            exit 1 ;;
        \?) echo "$OPTARG : option invalide "
            exit 1 ;;
    esac
done

shift $((OPTIND - 1)) # shift past the last flag or argument
echo samples are $*

# define resources in cpus for each step
ncpus_build=1
ncpus_map=1
ncpus_sort=1
ncpus_index=1
ncpus_count=1

## CREATE REFERENCE GENOME INDEX
if [ ! -d Bwt2_index ]; then
    mkdir Bwt2_index
    bowtie2-build --threads ${ncpus_build} ${genome} Bwt2_index/tauri > Bwt2_index/Bwt2_index.log 2>&1
fi

mkdir LOGs

## RUN QC, MAP READS, SORT BAMS & COUNT READS
nbs=0;
for sample in $* ; do
    nbs=$(expr ${nbs} + 1)
    echo traitement of sample ${sample}

    # --------- quality control of reads
    mkdir -p FastQC # fastqc doesnt create the output dir itself
    fastqc --outdir FastQC ${rnadir}/${sample}.fastq.gz > FastQC/${sample}.log 2>&1

    # ---------- reads mapping
    mkdir -p SAMs
    bowtie2 --threads ${ncpus_map} -x Bwt2_index/tauri -U ${rnadir}/${sample}.fastq.gz -S SAMs/${sample}.sam > LOGs/${sample}_bowtie2.log 2>&1

    # ---------- selection and format modification
    mkdir -p BAMs
    samtools view -b SAMs/${sample}.sam -o BAMs/${sample}.bam
    samtools sort --threads ${ncpus_sort} BAMs/${sample}.bam -o BAMs/${sample}_sort.bam
    samtools index -@ ${ncpus_index} BAMs/${sample}_sort.bam

    # ---------- counting of mapped reads by gene
    mkdir -p Counts
    featureCounts -T ${ncpus_count} -t gene -g ID -a ${annots} -s 2 -o Counts/${sample}_ftc.txt BAMs/${sample}_sort.bam > LOGs/${sample}_ftc.log 2>&1
done

## CONCATENATE & REFORMAT COUNT TABLES
paste Counts/*_ftc.txt > Counts/ftc_tmp.txt
awk -v nb=${nbs} -v col=7 'BEGIN {FS ="\t"}{ctmp=$1; for (i=col; i<=nb*col; i=i+col) {count=sprintf("%s\t%s", ctmp, $i); ctmp=count}; print count}' Counts/ftc_tmp.txt | sed '1d' > Counts/counts.txt