1 of 30

Introduction to Deep Seq Data Analysis

https://docs.google.com/presentation/d/1csCR1KwuJo1DXdGgWkNJKxBsEmY6PJ-6l1OZmI_qldo/edit?usp=sharing

Christophe.antoniewski@sorbonne-universite.fr

http://artbio.fr

M2 “BMC” - UE Biotechnologie

October 12, 2020, 9:30–12:30

  1. Sequencing Technologies
  2. DNA libraries
  3. Applications
  4. A framework for sequencing projects
  5. Introduction to sequence dataset analysis using command lines
  6. Using Galaxy for sequence analyses

2 of 30

Illumina Sequencing

3 of 30

Pacific Biosciences SMRT

4 of 30

Oxford Nanopore MinION

5 of 30

Read Lengths, Error Rates and Throughputs

Technology

Method

Read Length

Error Rates (%)

Throughput

(GB/run)

Illumina

Synthesis

100-300 bp

0.1

200-600

Pacific Biosciences SMRT

Synthesis

10-100 kb

5-15

10-20

Oxford Nanopore MinION

Nanopore

Variable (up to 1000 kb)

5-20

5-10

6 of 30

20-30nt RNA gel purification

Small RNA library

(Biases)

Library “Bar coding”

7 of 30

8 of 30

ChIPseq library preparation

(Non Directional)

9 of 30

Deep sequencing applications

  • Whole Genome sequencing (WGS)
    • De novo sequence assembly for new genomes
    • Re-sequencing indel, mutation, snp identification and analysis
  • Exome sequencing
    • capture probe to sequence only exons
  • RADseq, ddRADseq
    • Sampling of genome to be sequenced
  • Meta-genome sequencing
    • Pathogen / symbiotic flora identification or analysis
  • ChIPseq (Chromatin Immunoprecipitation sequencing)
    • Epigenetic landscape, Transcription factor binding sites
  • ATACseq (Assay for Transposase-Accessible Chromatin)
    • DNA accessibility (chromatin compaction, transcription factors…)
  • 3C-seq (Chromosome conformation capture), 4C, 5C, ChIA-PET, Hi-C
    • mapping of long-range genomic interactions
  • Medip-seq
    • DNA methylation studies
  • RNA or small RNA sequencing
    • Transcriptome
    • Small or long non-coding RNA profiling
    • Single cell transcriptome analysis
    • RIP-seq (RNA immunoprecipitation sequencing)
    • Protein associated RNAs
    • CLIP-seq (Cross-linking immunoprecipitation sequencing)
    • Protein associated RNA duplexes

High throughput sequencing of DNA or RNA provides Qualitative (sequence) and Quantitative (number of reads) information

10 of 30

Biological question &

Experimental Design

Library Preparation

Sequencing

Quality Control

Alignment

Assembly

Visualization & Statistics

  • Normalization (library comparison)
  • Peak finding (Binding sites, Breakpoints, etc…)
  • Differential Calling (expression, variants, etc)

Sensibility & Technical biases

sample plan, detection requirements

number of replicates

Batch effects

single cell / bulk ?

Whole genome

Whole exome

Target enrichment

Size selection ?

Stranded/unstranded ?

Amplification ?

Single Cell Protocol ?

Length and number of reads

Single or paired ends

Adapter Clipping

Quality trimming

Contaminant and Sequencing Errors

Biases in GC contents

Bowtie

BWA…

Velvet, Oases

Trinity, SPADES…

R, matlab

& Open Source software tools (FAIR)

Analysis Flowchart: Everything’s connected

11 of 30

Sequencing data mining

  • Read Mapping, Counting & Visualization
    • Information Technologies & biotools
    • understanding of annotation formats

  • Quantitative profiling (RNAseq, single-cell RNAseq, chromatin IPs, …)
  • Sequence and Structure analyses (Variant, signatures, 3D conformations, …)
    • Requires Tools for Statistics (R, Python)
    • Math. Methods: Tests Univariate or Multivariate, Classification, Regression, Dimensional reduction, etc...

12 of 30

Minimal requirements for sequencing data analysis

  • Sequence files (fastq format)
  • Knowledge of basic file formats

Fastq, fasta, GFF, GTF, bed, Sam, Bam → Read the UCSC ressources

  • A computer with enough resources (Memory, CPU, Storage)
  • Command line approach
    • A Unix compliant Operating System + « know how »
    • Knowledge of a programming language (R, Python, Perl, Java, C++…)
  • Analysis environment and/or Workflow management
    • GALAXY
      • A Galaxy server
      • know how
    • Rstudio (R), Jupyter notebook (python), Nextflow, Snakemake, etc...

13 of 30

ssh Connect to the analysis server

~$ ls -alF # list files

~$ cp ../GKG-13.fastq.gz . # watch the points !

~$ gunzip GKG-13.fastq.gz # uncompress file

~$ ll

  • Open/Activate your ssh terminal (PuTTY, MacOS terminal, Chrome “Secure Shell” extension, Firefox “sshGate…” extension)
  • Type the command line from this document in your terminal

  • pwd (print working directory)
    • Path : /home/<user> ; .. and .
  • ls (list)
  • cp (copy)
  • alias ll='ls -alF'
  • gunzip (gzip)

bash commands

14 of 30

What is this fastq file containing ?

  • more
  • less
  • cat … and ctrl-c !

bash commands

...

~$ more GKG-13.fastq # read file by chunk (type “q” key to exit)

@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1 Header

TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA Sequence

+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1 Header

bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh Sequence Quality (ASCII encoded)

@HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1

TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC

+HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1

]B]VWaaaaaagggfggggggcggggegdgfgeggbab

@HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1

TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA

+HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1

aB^^afffffhhhhhhhhhhhhhhhhhhhhhhhchhhh

@HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1

TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC

+HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1

aBa^\ddeeehhhhhhhhhhhhhhhhghhhhhhhefff

@HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1

TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT

+HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1

aB\^^eeeeegcggfffffffcfffgcgcfffffR^^]

@HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1

GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC

+HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1

aBaaaeeeeehhhhhhhhhhhhfgfhhgfhhhhgga^^

15 of 30

How many sequence reads in my file ?

~$ wc -l GKG-13.fastq # count the number of lines in file

25703828 GKG-13.fastq

~$ grep -c -e "^@" GKG-13.fastq # search and count the regex in file

6425957

~$ python

in python interpreter:

>>> 25703828 / 4

6425957

>>> exit()

  • wc
  • grep
  • python

bash commands

16 of 30

Are my sequence reads containing the adapter ?

~$ cat GKG-13.fastq | grep CTGTAGG | wc -l

6355061

~$ grep -c "CTGTAGG" GKG-13.fastq

6355061

6 355 061 out of

6 425 957 sequences

… not bad (98.8%)

3’ adapter: CTGTAGGCACCATCAAT

~$ grep -c "ATCTCGT" GKG-13.fastq

308

A contrario

  • cat
  • grep
  • wc
  • | (pipe operation)

bash commands

17 of 30

Advanced combination of bash (unix) commands

~$ cat GKG-13.fastq | perl -ne 'print if /^[ATGCN]{22}CTGTAGG/' | wc -l

Outputs the content of a file, line by line

The output is passed to the input of the next command

perl interpreter is called with –ne options (loop & execute)

In line perl code

Regular expression

The output is passed to the input of the next command

wc with –l option counts the lines

1 675 469 22nt long reads with 3’ flanking CTGTAGG adapter sequence

  • cat
  • perl
  • wc

bash commands

18 of 30

Clipping adapter sequences

Unix Operating Systems already contain powerful native tools for sequence analyses

~$ cat GKG-13.fastq | perl -ne 'if (/^(.+CTGTAGG)/) {print "$1\n"}' | more

~$ cat GKG-13.fastq | perl -ne \

'if (/^([GATC]{18,})CTGTAGG/){$count++; print ">$count\n"; print "$1\n"}' \

> clipped_GKG13.fasta

~$ ll # leuleu to look at the output !

Final “clipping” command line

  • \ (continued line)
  • > (redirection)

bash commands

  1. copy the enlightened string (including returns) and paste into your terminal. Note: to copy, you may have to exit the “read slide mode of the google slide document. Terminate by pressing the enter key.
  2. type the enlightened string directly in your console. Be careful: each “\” is followed by a return-to-new-line by pressing the enter key.
  3. again, be sure to exit the “read slide mode” and copy paste the following line:�cat GKG-13.fastq | perl -ne 'if (/^([GATC]{18,})CTGTAGG/){$count++; print ">$count\n"; print "$1\n"}' > clipped_GKG13.fasta

3 way to get the command executed:

19 of 30

http://bowtie-bio.sourceforge.net/

Bowtie aligns reads on indexed genomes

Sequence Reads

(fastq or fasta)

Indexed Reference

(Genome)

Alignment report

20 of 30

~$ bowtie dmel/Dmel_r5.49 -f clipped_GKG13.fasta -v 1 -k 1 -p 4 --al droso_matched_GKG-13.fa --un unmatched_GKG13.fa -S > GKG13_bowtie_output.sam

A bowtie alignment (command lines)

dmel/Dmel_r5.49

-f clipped_GKG13.fasta

-v 1

-k 1

-p 6

--al droso_matched_GKG-13.fa

--un unmatched_GKG13.fa

-S

> GKG13_bowtie_output.sam

# reads processed: 5930851

# reads with at least one reported alignment: 4992296 (84.18%)

# reads that failed to align: 938555 (15.82%)

Reported 4992296 alignments to 1 output stream(s)

$ bowtie-build dmel-all-chromosome-r5.49.fasta Dmel_r5.49

$ cd ..

$ mkdir dmel && cp ../dmel-all-chromosome-r5.49.fasta.gz dmel/ && cd dmel

$ gunzip dmel-all-chromosome-r5.49.fasta.gz

  • mkdir
  • cp
  • cd
  • which

bash commands

  • bowtie-build (binary)
  • bowtie (binary)

21 of 30

Bowtie outputs

deepseq$ ls -laht

-rw-r--r-- 1 deepseq staff 351M Mar 24 17:46 GKG13_bowtie_output.tabulated

-rw-r--r-- 1 deepseq staff 156M Mar 24 17:46 droso_matched_GKG-13.fa

-rw-r--r-- 1 deepseq staff 28M Mar 24 17:46 unmatched_GKG13.fa

SAM alignment : $ more GKG13_bowtie_output.sam

Aligned reads: $ more droso_matched_GKG-13.fa

Unaligned reads: $ more unmatched_GKG13.fa

$ samtools view -Sb -@ 4 GKG13_bowtie_output.sam > GKG13_bowtie_output.bam # bam compression

$ samtools sort -@ 4 GKG13_bowtie_output.bam GKG13_bowtie_output.sort # bam sorting

$ ll

$ grep -c '>' droso_matched_GKG-13.fa

22 of 30

If we have time, we are going to repeat the same analysis in Galaxy

  • Visit https://mississippi.sorbonne-universite.fr
  • Register to the Galaxy server to get your own account
  • and follow the guide...

23 of 30

24 of 30

SAM - BAM

SAM stands for Sequence Alignment/Map format.

It is a TAB-delimited text format consisting of a

header section (optional but very useful) - Header lines start with `@'

alignment section.

Each alignment line has

11 mandatory fields for essential alignment information such as mapping position

A variable number of optional fields for flexible or aligner specific information

NM i Edit distance to the reference, including ambiguous bases but excluding clipping

MD Z String for mismatching positions. Regex : [0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)

XA:i:<N> Aligned read belongs to stratum <N> (-v mode)

25 of 30

Formats

Raw sequence: Fastq (quality), Fasta (w/o quality)

Aligned sequence:

Genome annotation:

GFF, GTF, BED, bigBed, bedGraph, WIG, bigWig, MAF, BED detail, Personal Genome SNP, VCF, broadPeak, narrowPeak, or PSL.

http://genome.ucsc.edu/FAQ/FAQformat.html

Sam

Bam

    • Sorted
    • Indexed
    • Compressed

26 of 30

GFF - GTF

GFF and GTF are data formats heavily used for storing annotation information

GFFs (general feature format) are meant to be used for any genomic feature

GTF (gene transfer format) is strictly used for genes.

  1. reference sequence name (chromosome1,refContig1,sequence1)
  2. source of annotation (pfam,blast2go,interpro,est)
  3. type of feature (gene,exon,start_codon,cds,mRNA,zinc_finger,conserved_region)
  4. 1-based, inclusive start coordinate (integer > 0)
  5. 1-based, inclusive end coordinate (integer > 0)
  6. score
  7. strand (+,-,.)
  8. frame (0,1,2)
  9. attribute: contains information on the feature itself

##gff-version 2

##created 11/11/11

chr1 PFAM gene 501 750 . + 0 ID=geneA;Name=geneA

chr1 PFAM exon 501 650 . + 2 ID=exonA1;Parent=geneA

chr1 PFAM exon 700 750 . + 2 ID=exonA2;Parent=geneA

  • feature type (column 3) is not controlled.
  • tag/value pair delimited by a '=' sign. Multiple attributes are delimited by a semi-colon.
  • Always have an ID attribute and make sure IDs are unique.
  • The value of the 'Parent' attribute indicates that the current feature is a sub-feature of the parent.

##gtf

##created 11/11/11

chr1 PFAM gene 501 750 . + 0 gene_id "geneA"; study_id "0012";

chr1 PFAM exon 501 650 . + 2 gene_id "geneA"; transcript_id "geneA.1"; exon_number "1";

chr1 PFAM exon 700 750 . + 2 gene_id "geneA"; transcript_id "geneA.1"; exon_number "2";

  • Depending on the software, certain feature types must be present.
  • Common featutre types are exon, transcript, cds, start_codon, end_codon.
  • The 9th column attribute fields must begin with 'gene_id' and 'transcript_id' attributes.
  • There is also no feature/sub-feature relationship in GTF files. If there are multiple exons, they are grouped together by having the same transcript_id. Multiple transcripts can have the same gene_id

GFF

GTF

27 of 30

Pileup Format

seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&

seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+

seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6

seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<

seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<

seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<

seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<

seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<

28 of 30

Experimental procedures affect downstream analyses

29 of 30

Illumina Sequencing

Prepare library of DNA fragments

Attach DNA to surface

Bridge amplification

Fragments become

double-stranded

Denature and repeat

Complete amplification

Sequencing cycles

30 of 30

Sequence Quality Control

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

FastQC, GUI version