
How to deal with your RNA-seq data?

Emilie Drouineau, Rachel Legendre & the RNA-seq team

École de Bioinformatique AVIESAN-IFB-Inserm 2022

1 | Emilie Drouineau | Bioinformatics | 15/11/2022

EBAII niv 1 2022


Summary

  • 01 Bioinformatics: Quality control, Mapping, Counting
  • 02 Statistics: Experimental design, Exploratory data analysis
  • 03 Statistics: Normalization, modelling and troubleshooting
  • 04 Practice: Differential analysis with SARTools
  • 05 Advanced practice: Gene Sets Analysis methods
  • 06 Bioinformatics: Transcriptome de novo assembly


Bioinformatics

Introduction and prerequisites


Raw NGS data

Figure panels: Instrument, Flowcell, Intensities


Data storage: HiSeq 2500

  • Text file of 100 to 150 GB per lane
  • Let’s compare with War and Peace by Leo Tolstoy:
    • 1817 pages
    • 6 cm thick
    • 4 MB
  • 1 lane:
    • 25 000 times "War and Peace"
    • 45 million pages
    • 1.5 km (5 Eiffel Towers)
  • 8 lanes per flow cell => 1 TB of raw data / week / sequencer
  • Times 2 for paired-end


Data storage: NovaSeq 6000

  • Text file of 80 GB to 3 TB (in single flow cell mode)
  • Let’s compare with War and Peace by Leo Tolstoy:
    • 1817 pages
    • 6 cm thick
    • 4 MB
  • 1 run:
    • 750 000 times "War and Peace"
    • 1350 million pages
    • 45 km (138 Eiffel Towers)
  • Times 2 for dual flow cell mode
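
The book comparison on the two data storage slides can be checked with a few lines of arithmetic. This is only a sketch: it assumes the sizes are in bytes with decimal units (1 GB = 1000 MB), and the function and variable names are made up for the illustration.

```python
# Figures quoted on the slides for one copy of the book.
BOOK_PAGES = 1817
BOOK_THICKNESS_CM = 6
BOOK_SIZE_MB = 4


def in_books(data_size_mb):
    """Express a data volume as copies, pages and stack height of the book."""
    copies = data_size_mb / BOOK_SIZE_MB
    return {
        "copies": copies,
        "pages_millions": copies * BOOK_PAGES / 1e6,
        "stack_km": copies * BOOK_THICKNESS_CM / 100 / 1000,  # cm -> m -> km
    }


# Assumption: decimal units, so 100 GB = 100 000 MB and 3 TB = 3 000 000 MB.
hiseq_lane = in_books(100_000)      # lower bound for one HiSeq 2500 lane
novaseq_run = in_books(3_000_000)   # upper bound for one NovaSeq 6000 run

print(hiseq_lane)
print(novaseq_run)
```

With these inputs the lane corresponds to 25 000 copies and a 1.5 km stack, and the run to 750 000 copies and a 45 km stack, in line with the slide figures.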


RNA-seq applications

"Transcriptome analysis provides information about the identity and quantity of all RNA molecules in one cell or a population of cells."


RNA-seq: Why? How?

Ask the right questions before library preparation and sequencing:

Prokaryotes

I can’t find a ribo-depletion kit for my organism:

  • Design the oligos yourself

I want to identify antisense RNA:

  • Directional protocol (standard)

I’m interested in transposons:

  • Longer read sequencing
  • Paired-end sequencing

Eukaryotes

I want coding genes only:

  • PolyA strategy

I want non-coding genes also:

  • Ribo Depletion

I’m interested in small RNA profiling:

  • Use a dedicated protocol

I’m interested in isoforms:

  • Paired-end sequencing
  • Long read technologies

Regardless of your organism:

  • Genome complexity and the biological question determine the choice between paired-end and single-end sequencing, and the read length
  • Sequencing depth (multiplexing rate)
  • Prefer more biological replicates over greater sequencing depth
  • Use a stranded RNA-seq protocol to assign reads to a particular strand

For a successful experiment, it is imperative to involve bioinformaticians and biostatisticians before RNA extraction begins.


Prerequisites

RNA sample:

  • DNase treatment
  • Quantity (suited to the protocol)
  • Quality (RNA integrity number > 7)
  • Stored at -80 °C

Reference genome:

Complete genomic sequence in FASTA format

Annotation file:

All genome features (genes, CDS, introns, UTRs) in GFF/GTF format
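
A frequent practical problem with these two files is that the sequence names in the annotation do not match those in the FASTA (for example `Chr1` versus `chr1`). The sketch below shows one way such a mismatch could be detected. The toy FASTA and GFF content and the function names are invented for the example; real files would be read from disk.

```python
def fasta_sequence_names(fasta_text):
    """Return the sequence names (first word after '>') found in FASTA text."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            names.add(line[1:].split()[0])
    return names


def annotation_sequence_names(gff_text):
    """Return the seqid values (column 1) found in GFF/GTF text."""
    names = set()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        names.add(line.split("\t")[0])
    return names


# Toy data (hypothetical): the second gene uses 'chr2' instead of 'Chr2'.
toy_fasta = ">Chr1 toy sequence\nACGTACGTAC\n>Chr2\nGGGTTTAAAC\n"
toy_gff = (
    "##gff-version 3\n"
    "Chr1\ttoy\tgene\t2\t9\t.\t+\t.\tID=geneA\n"
    "chr2\ttoy\tgene\t1\t5\t.\t-\t.\tID=geneB\n"
)

missing = annotation_sequence_names(toy_gff) - fasta_sequence_names(toy_fasta)
print(sorted(missing))  # names used in the annotation but absent from the genome
```

An empty result suggests the two files use the same naming scheme. Anything listed would likely lead to features with no reads assigned at the counting step.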


Where to find the genome and the annotation?

Common databases, specific databases


Keep control of your data


FastQC: explore quality scores
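
The quality scores summarised in such reports come from the fourth line of each FASTQ record, where every character encodes a Phred score. The sketch below assumes the Phred+33 encoding (Sanger and recent Illumina output). The record is a made-up example.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality_line]


def error_probability(score):
    """Convert a Phred score Q into the base-call error probability 10^(-Q/10)."""
    return 10 ** (-score / 10)


# Hypothetical FASTQ record: header, sequence, separator, quality string.
record = ["@read1", "ACGTN", "+", "IIA5#"]

scores = phred_scores(record[3])
mean_score = sum(scores) / len(scores)

print(scores)                  # [40, 40, 32, 20, 2]
print(error_probability(20))   # Q20 corresponds to 1 error in 100 base calls
```

A per-position box plot of these decoded values across all reads is essentially what the "per base sequence quality" panel displays.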


Why are duplication levels systematically high in RNA-seq?
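
One commonly given explanation is that highly expressed transcripts are sampled so often that identical reads arise without any technical artefact. The toy calculation below illustrates this under strong simplifications: each read is represented by its (transcript, start position) origin as a stand-in for its sequence, and the gene names and numbers are invented.

```python
from collections import Counter


def duplication_rate(reads):
    """Fraction of reads that are extra copies of an already seen read."""
    distinct = len(Counter(reads))
    return 1 - distinct / len(reads)


# Short, highly expressed transcript: only 6 possible start positions,
# each sampled 10 times (60 reads in total).
high_expression = [("geneA", start) for start in range(6)] * 10

# Long, lowly expressed transcript: 6 reads, all from different positions.
low_expression = [("geneB", start) for start in range(0, 180, 30)]

rate_high = duplication_rate(high_expression)
rate_low = duplication_rate(low_expression)
rate_library = duplication_rate(high_expression + low_expression)

print(rate_high, rate_low, rate_library)
```

In this toy library a single abundant transcript is enough to push the overall duplication rate above 80%, although no read is a PCR artefact.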


How to screen for contamination?

Different levels:

  • Ribosomal RNA contamination from the same organism
    • Align reads against the ribosomal RNA sequences with a dedicated mapper

  • RNA contamination from another organism
    • Use dedicated tools such as fastq_screen or kraken

  • DNA contamination
    • DNase treatment can be ineffective, allowing DNA to make it through into the final library.

As soon as you visualise your reads against an annotated genome, the presence of DNA is usually fairly apparent as a consistent background of reads over the whole genome.


Bioinformatics

From mapping to counting


RNA-seq mapping specificity

  • Map to the genome or to the transcriptome?
    • the transcriptome is often not characterised well enough to serve as a suitable reference for RNA-seq
    • mapping to the genome gives more information about gene isoforms

  • Take into account reads that span exon-exon junctions

Cole Trapnell & Steven L. Salzberg. Nature Biotechnology 27, 455-457 (2009)
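
In SAM/BAM output, a read spanning an exon-exon junction is reported with an `N` operation in its CIGAR string, marking the reference region skipped over (the intron). The sketch below extracts those skipped regions from a mapping position and a CIGAR string. The positions are invented and the function name is illustrative.

```python
import re

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")
REFERENCE_CONSUMING = set("MDN=X")  # operations that advance along the reference


def splice_junctions(position, cigar):
    """Return (intron_start, intron_end) pairs, 1-based inclusive, for each N operation.

    `position` is the 1-based leftmost mapping position (the SAM POS field).
    """
    junctions = []
    reference = position
    for length, operation in CIGAR_OPERATION.findall(cigar):
        length = int(length)
        if operation == "N":
            junctions.append((reference, reference + length - 1))
        if operation in REFERENCE_CONSUMING:
            reference += length
    return junctions


print(splice_junctions(1001, "60M500N40M"))  # [(1061, 1560)]
print(splice_junctions(1001, "100M"))        # [] : read lies within one exon
```

A mapper that is not splice-aware cannot produce the first kind of alignment, so junction-spanning reads would be left unmapped or soft-clipped.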


Mapping timeline

From https://www.ebi.ac.uk/~nf/hts_mappers/


Choose the right mapper

Which mapper is the best?

Which mapper should I use, given my data and my analysis?

It depends on:

- Detection of splicing events: STAR, minimap2, HISAT2

- Read length:

Very short reads (< 50 bp): Bowtie1

Up to 1000 kb: BWA-SW, Bowtie2

Long reads: minimap2

- Allowing gaps in the alignment: STAR, BWA, Bowtie2

In common situations, choose a widely used and well-maintained mapper.


Known biases in RNA-seq

Intron coverage: if many reads align to introns, this is indicative of incomplete poly(A) enrichment or abundant presence of immature transcripts.

Intergenic reads: if a significant portion of reads is aligned outside of annotated gene sequences, this may suggest genomic DNA contamination (or abundant non-coding transcripts).

3' bias: over-representation of 3' portions of transcripts indicates RNA degradation.
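
As a rough illustration of the 3' bias, the coverage at the 3' end of a transcript can be compared with the coverage at its 5' end. The ratio below is a simplified, hypothetical metric, not the exact statistic reported by any particular QC tool, and the coverage profiles are invented.

```python
def three_prime_bias(coverage, window=20):
    """Ratio of mean coverage over the last `window` bases to the first `window` bases.

    `coverage` lists per-base read depth along a transcript, ordered 5' to 3'.
    """
    five_prime = sum(coverage[:window]) / window
    three_prime = sum(coverage[-window:]) / window
    return three_prime / five_prime if five_prime else float("inf")


# Toy profiles over a 100-base transcript.
uniform = [20] * 100                              # intact RNA: flat coverage
degraded = [2 + base // 5 for base in range(100)] # coverage rising towards the 3' end

bias_uniform = three_prime_bias(uniform)
bias_degraded = three_prime_bias(degraded)

print(bias_uniform, round(bias_degraded, 2))
```

A ratio close to 1 indicates even coverage. Values well above 1 across many transcripts would point towards degraded RNA.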


Mapping QC on RNA-seq

  • Percentage of reads mapped to the genome
    • Human/Mouse: 70 to 90%
    • Prokaryotes: more than 90%
  • Uniform read coverage over exons and on the mapped strand
  • Low rate of multi-mapped reads
  • Low rate of ribosomal RNA
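
The first and third criteria can be derived from the SAM records themselves. The sketch below is a minimal example that assumes single-end reads and a mapper that writes the optional `NH` tag (number of reported alignments for the read). The SAM lines are made up, and the function is not a replacement for established tools.

```python
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def mapping_summary(sam_lines):
    """Compute the percentage of mapped reads and of multi-mapped reads among mapped ones."""
    total = mapped = multi = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, through its primary record
        total += 1
        if flag & UNMAPPED:
            continue
        mapped += 1
        # Assumption: a missing NH tag means a single reported alignment.
        hits = next((int(tag[5:]) for tag in fields[11:] if tag.startswith("NH:i:")), 1)
        if hits > 1:
            multi += 1
    return {
        "total": total,
        "mapped_pct": 100 * mapped / total if total else 0.0,
        "multi_pct": 100 * multi / mapped if mapped else 0.0,
    }


# Hypothetical records: r2 maps to two places, r3 is unmapped.
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tChr1\t100\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "r2\t16\tChr1\t300\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "r2\t256\tChr2\t800\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r4\t0\tChr1\t500\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",
]

summary = mapping_summary(toy_sam)
print(summary)
```

For paired-end data each pair contributes two primary records, so the counts would need to be interpreted per read rather than per fragment.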


Mapping QC on RNA-seq

  • Common:
    • Samtools (flagstat)
    • Bamtools (stats)
    • Picard tools (CollectRnaSeqMetrics)
    • RSeQC
  • Human and mouse:
    • RNA-SeQC
    • Qualimap


Quantify the number of reads per gene

When counting reads, make sure you know how the program handles the following:

  • overlap size (full read vs. partial overlap)
  • multimapping reads
  • reads overlapping multiple genomic features of the same kind
  • reads overlapping introns

Two popular tools:

  • htseq-count
  • featureCounts
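
To see why these settings matter, here is a deliberately simplified counting sketch. It is not the algorithm of either tool: genes are single intervals on one chromosome, strand and introns are ignored, and the category names are only illustrative. All coordinates are invented.

```python
def count_reads(genes, reads, min_overlap=1):
    """Assign reads to genes.

    genes: {name: (start, end)}, reads: [(start, end, n_alignments)],
    coordinates 1-based inclusive. A read is counted only if it is uniquely
    mapped and overlaps exactly one gene by at least `min_overlap` bases.
    """
    counts = {name: 0 for name in genes}
    special = {"no_feature": 0, "ambiguous": 0, "multimapped": 0}
    for start, end, n_alignments in reads:
        if n_alignments > 1:
            special["multimapped"] += 1
            continue
        hits = [
            name
            for name, (gene_start, gene_end) in genes.items()
            if min(end, gene_end) - max(start, gene_start) + 1 >= min_overlap
        ]
        if not hits:
            special["no_feature"] += 1
        elif len(hits) > 1:
            special["ambiguous"] += 1
        else:
            counts[hits[0]] += 1
    return counts, special


# Two overlapping toy genes and five 50-base reads.
genes = {"geneA": (100, 500), "geneB": (450, 900)}
reads = [
    (120, 169, 1),  # inside geneA
    (440, 489, 1),  # fully inside geneA, partially inside geneB
    (600, 649, 1),  # inside geneB
    (950, 999, 1),  # outside both genes
    (200, 249, 3),  # multi-mapped
]

counts_partial, special_partial = count_reads(genes, reads, min_overlap=1)
counts_full, special_full = count_reads(genes, reads, min_overlap=50)
print(counts_partial, special_partial)
print(counts_full, special_full)
```

With partial overlap allowed, the second read is ambiguous and discarded. When the full read must overlap, it is assigned to geneA. The same alignments therefore give different count tables depending on the options chosen.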

Deschamps-Francoeur et al. 2020. doi:10.1016/j.csbj.2020.06.014


RNA-seq experiment

Organism: Arabidopsis thaliana, a model plant organism.

Genome and annotation available from TAIR10, the Arabidopsis database

Dataset: 3 biological replicates, paired-end sequencing.

Characterization of the function of the protein arginine methyltransferase AtPRMT5 during de novo shoot regeneration in Arabidopsis by knocking out AtPRMT5.


Practice

  • Change directory:

cd /shared/projects/<PROJECT>

  • Create a new directory:

mkdir TP_rnaseq

  • Copy the script template into the new directory:

cp /shared/projects/form_2022_32/coursLinux/rna-seq/01-Bioinfo/runme.sh TP_rnaseq

  • Follow the commands in runme.sh


Pipeline


Bioinformatics

Visualize your data


Visualize alignments

Which format?

  • BAM
  • BigWig, BedGraph (base-by-base scores)
  • BED, GFF (feature-by-feature data)
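
One detail worth keeping in mind when moving between the feature formats: GFF coordinates are 1-based and inclusive, whereas BED coordinates are 0-based with an exclusive end. The sketch below converts one GFF line into a six-column BED line. The feature is invented, and the BED score is simply set to 0 as a simplifying assumption.

```python
def gff_to_bed(gff_line):
    """Convert one GFF feature line into a BED6 line (chrom, start, end, name, score, strand)."""
    fields = gff_line.rstrip("\n").split("\t")
    seqid, _source, feature_type, start, end, _score, strand = fields[:7]
    attributes = fields[8] if len(fields) > 8 else ""
    name = feature_type  # fall back to the feature type if no ID is present
    for attribute in attributes.split(";"):
        if attribute.startswith("ID="):
            name = attribute[3:]
    # 1-based inclusive start becomes 0-based; the end value is unchanged
    # because BED ends are exclusive.
    return "\t".join([seqid, str(int(start) - 1), end, name, "0", strand])


# Hypothetical feature covering bases 101 to 200 (100 bases).
toy_gff_line = "Chr4\ttoy\tgene\t101\t200\t.\t+\t.\tID=toy_gene;Name=toy"
toy_bed_line = gff_to_bed(toy_gff_line)
print(toy_bed_line)
```

The feature length is preserved: 200 - 101 + 1 bases in GFF equals 200 - 100 bases in BED. Forgetting this shift is a common source of off-by-one errors in genome browsers.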

Which tools?

  • Browsers: IGV, Artemis, UCSC Genome Browser, SeqMonk…
  • Snapshots: deepTools, ngs.plot, ...

Go to AT4G31120
