1 of 30

Introduction to Deep Seq Data Analysis

https://docs.google.com/presentation/d/1csCR1KwuJo1DXdGgWkNJKxBsEmY6PJ-6l1OZmI_qldo/edit?usp=sharing

Christophe.antoniewski@sorbonne-universite.fr

http://artbio.fr

M2 “BMC” - UE Biotechnologie

October 12, 2020, 9:30–12:30

Sequencing Technologies
DNA libraries
Applications
A framework for sequencing projects
Introduction to sequence dataset analysis using command lines
Using Galaxy for sequence analyses

2 of 30

Illumina Sequencing

3 of 30

Pacific Biosciences SMRT

4 of 30

Oxford Nanopore MinION

5 of 30

Read Lengths, Error Rates and Throughputs

Technology	Method	Read Length	Error Rates (%)	Throughput (GB/run)
Illumina	Synthesis	100-300 bp	0.1	200-600
Pacific Biosciences SMRT	Synthesis	10-100 kb	5-15	10-20
Oxford Nanopore MinION	Nanopore	Variable (up to 1000 kb)	5-20	5-10

6 of 30

20-30nt RNA gel purification

Small RNA library

(Biases)

Library “Bar coding”

8 of 30

ChIPseq library preparation

(Non Directional)

9 of 30

Deep sequencing applications

Whole Genome sequencing (WGS)

De novo sequence assembly for new genomes
Re-sequencing indel, mutation, snp identification and analysis

Exome sequencing

capture probe to sequence only exons

RADseq, ddRADseq

Sampling of genome to be sequenced

Meta-genome sequencing

Pathogen / symbiotic flora identification or analysis

ChIPseq (Chromatin Immunoprecipitation sequencing)

Epigenetic landscape, Transcription factor binding sites

ATACseq (Assay for Transposase-Accessible Chromatin)

DNA accessibility (chromatin compaction, transcription factors…)

3C-seq (Chromosome conformation capture), 4C, 5C, ChIA-PET, Hi-C

mapping of long-range genomic interactions

Medip-seq

DNA methylation studies

RNA or small RNA sequencing

Transcriptome
Small or long non-coding RNA profiling
Single cell transcriptome analysis
RIP-seq (RNA immunoprecipitation sequencing)
Protein associated RNAs
CLIP-seq (Cross-linking immunoprecipitation sequencing)
Protein associated RNA duplexes

High throughput sequencing of DNA or RNA provides Qualitative (sequence) and Quantitative (number of reads) information

10 of 30

Biological question &

Experimental Design

Library Preparation

Sequencing

Quality Control

Alignment

Assembly

Visualization & Statistics

Normalization (library comparison)
Peak finding (Binding sites, Breakpoints, etc…)
Differential Calling (expression, variants, etc)

Sensibility & Technical biases

sample plan, detection requirements

number of replicates

Batch effects

single cell / bulk ?

Whole genome

Whole exome

Target enrichment

Size selection ?

Stranded/unstranded ?

Amplification ?

Single Cell Protocol ?

Length and number of reads

Single or paired ends

Adapter Clipping

Quality trimming

Contaminant and Sequencing Errors

Biases in GC contents

Bowtie

BWA…

Velvet, Oases

Trinity, SPADES…

R, matlab

& Open Source software tools (FAIR)

Analysis Flowchart: Everything’s connected

11 of 30

Sequencing data mining

Read Mapping, Counting & Visualization

Information Technologies & biotools
understanding of annotation formats

Quantitative profiling (RNAseq, single-cell RNAseq, chromatin IPs, …)
Sequence and Structure analyses (Variant, signatures, 3D conformations, …)

Requires Tools for Statistics (R, Python)
Math. Methods: Tests Univariate or Multivariate, Classification, Regression, Dimensional reduction, etc...

12 of 30

Minimal requirements for sequencing data analysis

Sequence files (fastq format)
Knowledge of basic file formats

Fastq, fasta, GFF, GTF, bed, Sam, Bam → Read the UCSC ressources

A computer with enough resources (Memory, CPU, Storage)
Command line approach

A Unix compliant Operating System + « know how »
Knowledge of a programming language (R, Python, Perl, Java, C++…)

Analysis environment and/or Workflow management

GALAXY

A Galaxy server
know how

Rstudio (R), Jupyter notebook (python), Nextflow, Snakemake, etc...

13 of 30

ssh Connect to the analysis server

~$ ls -alF # list files

~$ cp ../GKG-13.fastq.gz . # watch the points !

~$ gunzip GKG-13.fastq.gz # uncompress file

~$ ll

Open/Activate your ssh terminal (PuTTY, MacOS terminal, Chrome “Secure Shell” extension, Firefox “sshGate…” extension)
Type the command line from this document in your terminal

pwd (print working directory)

Path : /home/<user> ; .. and .

ls (list)
cp (copy)
alias ll='ls -alF'
gunzip (gzip)

bash commands

14 of 30

What is this fastq file containing ?

more
less
cat … and ctrl-c !

bash commands

…

...

~$ more GKG-13.fastq # read file by chunk (type “q” key to exit)

@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1 Header

TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA Sequence

+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1 Header

bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh Sequence Quality (ASCII encoded)

@HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1

TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC

+HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1

]B]VWaaaaaagggfggggggcggggegdgfgeggbab

@HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1

TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA

+HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1

aB^^afffffhhhhhhhhhhhhhhhhhhhhhhhchhhh

@HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1

TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC

+HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1

aBa^\ddeeehhhhhhhhhhhhhhhhghhhhhhhefff

@HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1

TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT

+HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1

aB\^^eeeeegcggfffffffcfffgcgcfffffR^^]

@HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1

GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC

+HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1

aBaaaeeeeehhhhhhhhhhhhfgfhhgfhhhhgga^^

15 of 30

How many sequence reads in my file ?

~$ wc -l GKG-13.fastq # count the number of lines in file

25703828 GKG-13.fastq

~$ grep -c -e "^@" GKG-13.fastq # search and count the regex in file

6425957

~$ python

in python interpreter:

>>> 25703828 / 4

6425957

>>> exit()

wc
grep
python

bash commands

16 of 30

Are my sequence reads containing the adapter ?

~$ cat GKG-13.fastq | grep CTGTAGG | wc -l

6355061

~$ grep -c "CTGTAGG" GKG-13.fastq

6355061

6 355 061 out of

6 425 957 sequences

… not bad (98.8%)

3’ adapter: CTGTAGGCACCATCAAT

~$ grep -c "ATCTCGT" GKG-13.fastq

308

A contrario

cat
grep
wc
| (pipe operation)

bash commands

17 of 30

Advanced combination of bash (unix) commands

~$ cat GKG-13.fastq | perl -ne 'print if /^[ATGCN]{22}CTGTAGG/' | wc -l

Outputs the content of a file, line by line

The output is passed to the input of the next command

perl interpreter is called with –ne options (loop & execute)

In line perl code

Regular expression

The output is passed to the input of the next command

wc with –l option counts the lines

1 675 469 22nt long reads with 3’ flanking CTGTAGG adapter sequence

cat
perl
wc

bash commands

18 of 30

Clipping adapter sequences

Unix Operating Systems already contain powerful native tools for sequence analyses

~$ cat GKG-13.fastq | perl -ne 'if (/^(.+CTGTAGG)/) {print "$1\n"}' | more

~$ cat GKG-13.fastq | perl -ne \

'if (/^([GATC]{18,})CTGTAGG/){$count++; print ">$count\n"; print "$1\n"}' \

> clipped_GKG13.fasta

~$ ll # leuleu to look at the output !

Final “clipping” command line

\ (continued line)
> (redirection)

bash commands

copy the enlightened string (including returns) and paste into your terminal. Note: to copy, you may have to exit the “read slide mode of the google slide document. Terminate by pressing the enter key.
type the enlightened string directly in your console. Be careful: each “\” is followed by a return-to-new-line by pressing the enter key.
again, be sure to exit the “read slide mode” and copy paste the following line:�cat GKG-13.fastq | perl -ne 'if (/^([GATC]{18,})CTGTAGG/){$count++; print ">$count\n"; print "$1\n"}' > clipped_GKG13.fasta

3 way to get the command executed:

19 of 30

http://bowtie-bio.sourceforge.net/

Bowtie aligns reads on indexed genomes

Sequence Reads

(fastq or fasta)

Indexed Reference

(Genome)

Alignment report

20 of 30

~$ bowtie dmel/Dmel_r5.49 -f clipped_GKG13.fasta -v 1 -k 1 -p 4 --al droso_matched_GKG-13.fa --un unmatched_GKG13.fa -S > GKG13_bowtie_output.sam

A bowtie alignment (command lines)

dmel/Dmel_r5.49

-f clipped_GKG13.fasta

-v 1

-k 1

-p 6

--al droso_matched_GKG-13.fa

--un unmatched_GKG13.fa

-S

> GKG13_bowtie_output.sam

# reads processed: 5930851

# reads with at least one reported alignment: 4992296 (84.18%)

# reads that failed to align: 938555 (15.82%)

Reported 4992296 alignments to 1 output stream(s)

$ bowtie-build dmel-all-chromosome-r5.49.fasta Dmel_r5.49

$ cd ..

$ mkdir dmel && cp ../dmel-all-chromosome-r5.49.fasta.gz dmel/ && cd dmel

$ gunzip dmel-all-chromosome-r5.49.fasta.gz

mkdir
cp
cd
which

bash commands

bowtie-build (binary)
bowtie (binary)

21 of 30

Bowtie outputs

deepseq$ ls -laht

-rw-r--r-- 1 deepseq staff 351M Mar 24 17:46 GKG13_bowtie_output.tabulated

-rw-r--r-- 1 deepseq staff 156M Mar 24 17:46 droso_matched_GKG-13.fa

-rw-r--r-- 1 deepseq staff 28M Mar 24 17:46 unmatched_GKG13.fa

SAM alignment : $ more GKG13_bowtie_output.sam

Aligned reads: $ more droso_matched_GKG-13.fa

Unaligned reads: $ more unmatched_GKG13.fa

$ samtools view -Sb -@ 4 GKG13_bowtie_output.sam > GKG13_bowtie_output.bam # bam compression

$ samtools sort -@ 4 GKG13_bowtie_output.bam GKG13_bowtie_output.sort # bam sorting

$ ll

$ grep -c '>' droso_matched_GKG-13.fa

22 of 30

If we have time, we are going to repeat the same analysis in Galaxy

Visit https://mississippi.sorbonne-universite.fr
Register to the Galaxy server to get your own account
and follow the guide...

24 of 30

SAM - BAM

SAM stands for Sequence Alignment/Map format.

It is a TAB-delimited text format consisting of a

header section (optional but very useful) - Header lines start with `@'

alignment section.

Each alignment line has

11 mandatory fields for essential alignment information such as mapping position

A variable number of optional fields for flexible or aligner specific information

http://genome.sph.umich.edu/wiki/SAM

http://samtools.github.io/hts-specs/SAMv1.pdf

NM i Edit distance to the reference, including ambiguous bases but excluding clipping

MD Z String for mismatching positions. Regex : [0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)

XA:i:<N> Aligned read belongs to stratum <N> (-v mode)

25 of 30

Formats

Raw sequence: Fastq (quality), Fasta (w/o quality)

Aligned sequence:

Genome annotation:

GFF, GTF, BED, bigBed, bedGraph, WIG, bigWig, MAF, BED detail, Personal Genome SNP, VCF, broadPeak, narrowPeak, or PSL.

http://genome.ucsc.edu/FAQ/FAQformat.html

Sam

Bam

Sorted
Indexed
Compressed

26 of 30

GFF - GTF

http://www.sanger.ac.uk/resources/software/gff/spec.html#t_2

GFF and GTF are data formats heavily used for storing annotation information

GFFs (general feature format) are meant to be used for any genomic feature

GTF (gene transfer format) is strictly used for genes.

reference sequence name (chromosome1,refContig1,sequence1)
source of annotation (pfam,blast2go,interpro,est)
type of feature (gene,exon,start_codon,cds,mRNA,zinc_finger,conserved_region)
1-based, inclusive start coordinate (integer > 0)
1-based, inclusive end coordinate (integer > 0)
score
strand (+,-,.)
frame (0,1,2)
attribute: contains information on the feature itself

##gff-version 2

##created 11/11/11

chr1 PFAM gene 501 750 . + 0 ID=geneA;Name=geneA

chr1 PFAM exon 501 650 . + 2 ID=exonA1;Parent=geneA

chr1 PFAM exon 700 750 . + 2 ID=exonA2;Parent=geneA

feature type (column 3) is not controlled.
tag/value pair delimited by a '=' sign. Multiple attributes are delimited by a semi-colon.
Always have an ID attribute and make sure IDs are unique.
The value of the 'Parent' attribute indicates that the current feature is a sub-feature of the parent.

##gtf

##created 11/11/11

chr1 PFAM gene 501 750 . + 0 gene_id "geneA"; study_id "0012";

chr1 PFAM exon 501 650 . + 2 gene_id "geneA"; transcript_id "geneA.1"; exon_number "1";

chr1 PFAM exon 700 750 . + 2 gene_id "geneA"; transcript_id "geneA.1"; exon_number "2";

Depending on the software, certain feature types must be present.
Common featutre types are exon, transcript, cds, start_codon, end_codon.
The 9th column attribute fields must begin with 'gene_id' and 'transcript_id' attributes.
There is also no feature/sub-feature relationship in GTF files. If there are multiple exons, they are grouped together by having the same transcript_id. Multiple transcripts can have the same gene_id

GFF

GTF

27 of 30

Pileup Format

http://samtools.sourceforge.net/pileup.shtml

seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&

seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+

seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6

seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<

seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<

seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<

seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<

seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<

28 of 30

Experimental procedures affect downstream analyses

29 of 30

Illumina Sequencing

Prepare library of DNA fragments

Attach DNA to surface

Bridge amplification

Fragments become

double-stranded

Denature and repeat

Complete amplification

Sequencing cycles

1 of 30

2 of 30

3 of 30

4 of 30

5 of 30

6 of 30

7 of 30

8 of 30

9 of 30

10 of 30

11 of 30

12 of 30

13 of 30

14 of 30

15 of 30

16 of 30

17 of 30

18 of 30

19 of 30

20 of 30

21 of 30

22 of 30

23 of 30

24 of 30

25 of 30

26 of 30

27 of 30

28 of 30

29 of 30

30 of 30