Introduction to Deep Seq Data Analysis
https://docs.google.com/presentation/d/1csCR1KwuJo1DXdGgWkNJKxBsEmY6PJ-6l1OZmI_qldo/edit?usp=sharing
M2 “BMC” - UE Biotechnologie
October 12, 2020, 9:30–12:30
Illumina Sequencing
Pacific Biosciences SMRT
Oxford Nanopore MinION
Read Lengths, Error Rates and Throughputs
Technology | Method | Read Length | Error Rates (%) | Throughput (GB/run) |
Illumina | Synthesis | 100-300 bp | 0.1 | 200-600 |
Pacific Biosciences SMRT | Synthesis | 10-100 kb | 5-15 | 10-20 |
Oxford Nanopore MinION | Nanopore | Variable (up to 1000 kb) | 5-20 | 5-10 |
20-30nt RNA gel purification
Small RNA library
(Biases)
Library “Bar coding”
ChIPseq library preparation
(Non Directional)
Deep sequencing applications
High throughput sequencing of DNA or RNA provides Qualitative (sequence) and Quantitative (number of reads) information
Biological question &
Experimental Design
Library Preparation
Sequencing
Quality Control
Alignment
Assembly
Visualization & Statistics
Sensibility & Technical biases
sample plan, detection requirements
number of replicates
Batch effects
single cell / bulk ?
Whole genome
Whole exome
Target enrichment
Size selection ?
Stranded/unstranded ?
Amplification ?
Single Cell Protocol ?
Length and number of reads
Single or paired ends
Adapter Clipping
Quality trimming
Contaminant and Sequencing Errors
Biases in GC contents
Bowtie
BWA…
Velvet, Oases
Trinity, SPADES…
R, matlab
& Open Source software tools (FAIR)
Analysis Flowchart: Everything’s connected
Sequencing data mining
Minimal requirements for sequencing data analysis
Fastq, fasta, GFF, GTF, bed, Sam, Bam → Read the UCSC ressources
ssh Connect to the analysis server
~$ ls -alF # list files
~$ cp ../GKG-13.fastq.gz . # watch the points !
~$ gunzip GKG-13.fastq.gz # uncompress file
~$ ll
bash commands
What is this fastq file containing ?
bash commands
…
…
...
~$ more GKG-13.fastq # read file by chunk (type “q” key to exit)
@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1 Header
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA Sequence
+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1 Header
bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh Sequence Quality (ASCII encoded)
@HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
]B]VWaaaaaagggfggggggcggggegdgfgeggbab
@HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
+HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
aB^^afffffhhhhhhhhhhhhhhhhhhhhhhhchhhh
@HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
aBa^\ddeeehhhhhhhhhhhhhhhhghhhhhhhefff
@HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT
+HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
aB\^^eeeeegcggfffffffcfffgcgcfffffR^^]
@HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC
+HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
aBaaaeeeeehhhhhhhhhhhhfgfhhgfhhhhgga^^
How many sequence reads in my file ?
~$ wc -l GKG-13.fastq # count the number of lines in file
25703828 GKG-13.fastq
~$ grep -c -e "^@" GKG-13.fastq # search and count the regex in file
6425957
~$ python
in python interpreter:
>>> 25703828 / 4
6425957
>>> exit()
bash commands
Are my sequence reads containing the adapter ?
~$ cat GKG-13.fastq | grep CTGTAGG | wc -l
6355061
~$ grep -c "CTGTAGG" GKG-13.fastq
6355061
6 355 061 out of
6 425 957 sequences
… not bad (98.8%)
3’ adapter: CTGTAGGCACCATCAAT
~$ grep -c "ATCTCGT" GKG-13.fastq
308
A contrario
bash commands
Advanced combination of bash (unix) commands
~$ cat GKG-13.fastq | perl -ne 'print if /^[ATGCN]{22}CTGTAGG/' | wc -l
Outputs the content of a file, line by line
The output is passed to the input of the next command
perl interpreter is called with –ne options (loop & execute)
In line perl code
Regular expression
The output is passed to the input of the next command
wc with –l option counts the lines
1 675 469 22nt long reads with 3’ flanking CTGTAGG adapter sequence
bash commands
Clipping adapter sequences
Unix Operating Systems already contain powerful native tools for sequence analyses
~$ cat GKG-13.fastq | perl -ne 'if (/^(.+CTGTAGG)/) {print "$1\n"}' | more
~$ cat GKG-13.fastq | perl -ne \
'if (/^([GATC]{18,})CTGTAGG/){$count++; print ">$count\n"; print "$1\n"}' \
> clipped_GKG13.fasta
~$ ll # leuleu to look at the output !
Final “clipping” command line
bash commands
3 way to get the command executed:
http://bowtie-bio.sourceforge.net/
Bowtie aligns reads on indexed genomes
Sequence Reads
(fastq or fasta)
Indexed Reference
(Genome)
Alignment report
~$ bowtie dmel/Dmel_r5.49 -f clipped_GKG13.fasta -v 1 -k 1 -p 4 --al droso_matched_GKG-13.fa --un unmatched_GKG13.fa -S > GKG13_bowtie_output.sam
A bowtie alignment (command lines)
dmel/Dmel_r5.49
-f clipped_GKG13.fasta
-v 1
-k 1
-p 6
--al droso_matched_GKG-13.fa
--un unmatched_GKG13.fa
-S
> GKG13_bowtie_output.sam
# reads processed: 5930851
# reads with at least one reported alignment: 4992296 (84.18%)
# reads that failed to align: 938555 (15.82%)
Reported 4992296 alignments to 1 output stream(s)
$ bowtie-build dmel-all-chromosome-r5.49.fasta Dmel_r5.49
$ cd ..
$ mkdir dmel && cp ../dmel-all-chromosome-r5.49.fasta.gz dmel/ && cd dmel
$ gunzip dmel-all-chromosome-r5.49.fasta.gz
bash commands
Bowtie outputs
deepseq$ ls -laht
-rw-r--r-- 1 deepseq staff 351M Mar 24 17:46 GKG13_bowtie_output.tabulated
-rw-r--r-- 1 deepseq staff 156M Mar 24 17:46 droso_matched_GKG-13.fa
-rw-r--r-- 1 deepseq staff 28M Mar 24 17:46 unmatched_GKG13.fa
SAM alignment : $ more GKG13_bowtie_output.sam
Aligned reads: $ more droso_matched_GKG-13.fa
Unaligned reads: $ more unmatched_GKG13.fa
$ samtools view -Sb -@ 4 GKG13_bowtie_output.sam > GKG13_bowtie_output.bam # bam compression
$ samtools sort -@ 4 GKG13_bowtie_output.bam GKG13_bowtie_output.sort # bam sorting
$ ll
$ grep -c '>' droso_matched_GKG-13.fa
If we have time, we are going to repeat the same analysis in Galaxy
SAM - BAM
SAM stands for Sequence Alignment/Map format.
It is a TAB-delimited text format consisting of a
header section (optional but very useful) - Header lines start with `@'
alignment section.
Each alignment line has
11 mandatory fields for essential alignment information such as mapping position
A variable number of optional fields for flexible or aligner specific information
NM i Edit distance to the reference, including ambiguous bases but excluding clipping
MD Z String for mismatching positions. Regex : [0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)
XA:i:<N> Aligned read belongs to stratum <N> (-v mode)
Formats
Raw sequence: Fastq (quality), Fasta (w/o quality)
Aligned sequence:
Genome annotation:
GFF, GTF, BED, bigBed, bedGraph, WIG, bigWig, MAF, BED detail, Personal Genome SNP, VCF, broadPeak, narrowPeak, or PSL.
http://genome.ucsc.edu/FAQ/FAQformat.html
Sam
Bam
GFF - GTF
GFF and GTF are data formats heavily used for storing annotation information
GFFs (general feature format) are meant to be used for any genomic feature
GTF (gene transfer format) is strictly used for genes.
##gff-version 2
##created 11/11/11
chr1 PFAM gene 501 750 . + 0 ID=geneA;Name=geneA
chr1 PFAM exon 501 650 . + 2 ID=exonA1;Parent=geneA
chr1 PFAM exon 700 750 . + 2 ID=exonA2;Parent=geneA
##gtf
##created 11/11/11
chr1 PFAM gene 501 750 . + 0 gene_id "geneA"; study_id "0012";
chr1 PFAM exon 501 650 . + 2 gene_id "geneA"; transcript_id "geneA.1"; exon_number "1";
chr1 PFAM exon 700 750 . + 2 gene_id "geneA"; transcript_id "geneA.1"; exon_number "2";
GFF
GTF
Pileup Format
seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
Experimental procedures affect downstream analyses
Illumina Sequencing
Prepare library of DNA fragments
Attach DNA to surface
Bridge amplification
Fragments become
double-stranded
Denature and repeat
Complete amplification
Sequencing cycles
Sequence Quality Control
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
FastQC, GUI version