Mapping, genome annotation and related databases
NGS read mapping
Reference genome sequence vs sequencing reads
Sequence alignment
Pairwise sequence alignment vs multiple sequence alignment
Global alignment vs local alignment
Global alignment
Local alignment
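To make the difference between the two modes concrete, here is a minimal sketch that scores the same short query against a longer sequence in both ways. It assumes a simple scoring scheme (match +1, mismatch -1, linear gap -1) that the slides do not specify; the function name `alignment_score` is hypothetical. Global scoring follows the Needleman-Wunsch recurrence and local scoring follows Smith-Waterman.

```python
def alignment_score(query, target, local, match=1, mismatch=-1, gap=-1):
    """Return the best alignment score by dynamic programming.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of substrings.
    The scoring values are illustrative assumptions.
    """
    cols = len(target) + 1
    # First row: global alignment pays for leading gaps, local does not.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            same = query[i - 1] == target[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
                best = max(best, score)
            curr[j] = score
        prev = curr
    return best if local else prev[-1]


reference = "ATGTTCGACCCTAGTCCCACCAAAGTCGATATAGGAGTGAAAACCACAGGGTGA"
query = "AGTCC"

print("local score :", alignment_score(query, reference, local=True))   # 5
print("global score:", alignment_score(query, reference, local=False))  # -44
```

Under these assumed scores the short query gets a perfect local score, while the global score is dominated by the gaps needed to span the whole longer sequence.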
E-value, similarity, identity and score
ATGTTCGACCCTAGTCCCACCAAAGTCGATATAGGAGTGAAAACCACAGGGTGA
AGTCC
Similarity vs identity
ILE
LEU
Ile and Leu are similar residues, but they are not identical
Score
Sometimes the alignment score also considers the length of the query sequence.
Which type of alignment does NGS read mapping belong to?
If local alignment is used for NGS mapping
Reference: ATGTTCGACCCTAGTCCCACCAAAGTCGATATAGGTAGTCCCACCACAGGGTGA
Read: AGATAGGACCTAGTCCCTGACAATAGTGAC
Fragments of the read can match several places in the reference, causing many false-positive hits
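The problem can be illustrated with a small sketch that uses exact k-mer matching as a simplified stand-in for short local matches (this is not a real aligner). The helper names `exact_hits` and `shared_kmers` and the choice k=5 are assumptions made for illustration; the sequences are the reference and read from this slide.

```python
def exact_hits(seed, reference):
    """Return every 0-based start position where seed occurs in reference."""
    hits, start = [], reference.find(seed)
    while start != -1:
        hits.append(start)
        start = reference.find(seed, start + 1)  # allow overlapping hits
    return hits


def shared_kmers(read, reference, k):
    """Map each k-mer of the read to its hit positions in the reference."""
    found = {}
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        hits = exact_hits(kmer, reference)
        if hits:
            found[kmer] = hits
    return found


reference = "ATGTTCGACCCTAGTCCCACCAAAGTCGATATAGGTAGTCCCACCACAGGGTGA"
read = "AGATAGGACCTAGTCCCTGACAATAGTGAC"

# The read as a whole does not occur in the reference...
print("full read:", exact_hits(read, reference))
# ...but short pieces of it do, some of them at more than one position.
for kmer, hits in shared_kmers(read, reference, k=5).items():
    print(kmer, hits)
```

Short fragments such as AGTCC hit the reference at two positions even though the read as a whole does not belong there, which is the kind of spurious hit a purely local strategy reports.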
Unique mapping
Mapping for paired-end reads
Long reads vs short reads
Mapping workflow
Adapter trimming
Poly-A tail trimming
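A minimal sketch of both trimming steps, assuming exact matching only. Real trimming tools also tolerate sequencing errors in the adapter, which this sketch does not. The function names, the `min_overlap` and `min_length` defaults, and the example reads are hypothetical; the adapter shown is the commonly cited start of the Illumina adapter sequence, used here only as an example.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter: a full copy inside the read, or a partial copy at its end."""
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Partial adapter: the read ends with a prefix of the adapter.
    longest = min(len(adapter) - 1, len(read))
    for length in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:length]):
            return read[:len(read) - length]
    return read


def trim_poly_a(read, min_length=5):
    """Remove a trailing run of A's if it is at least min_length long."""
    stripped = read.rstrip("A")
    return stripped if len(read) - len(stripped) >= min_length else read


adapter = "AGATCGGAAGAGC"  # example adapter prefix
print(trim_adapter("ACGTACGTTTGC" + adapter + "ACAC", adapter))  # full adapter
print(trim_adapter("ACGTACGTTTGC" + "AGATC", adapter))           # partial adapter
print(trim_poly_a("ACGTTGCAAAAAAA"))
```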
Quality trimming
Quality score | Probability of incorrect base call | Inferred base call accuracy
Q10 | 1 in 10 | 90%
Q20 | 1 in 100 | 99%
Q30 | 1 in 1000 | 99.9%
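The values in the table follow from the Phred definition Q = -10 * log10(P), where P is the probability that the base call is wrong. A short sketch that reproduces the table (the function names are hypothetical):

```python
import math


def error_probability(q):
    """Probability of an incorrect base call for Phred quality q."""
    return 10 ** (-q / 10)


def phred_score(p):
    """Phred quality for an error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    p = error_probability(q)
    print(f"Q{q}: 1 in {round(1 / p)} errors, accuracy {100 * (1 - p):.1f}%")
```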
Quality decrease; trimming threshold: Q20
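One simple way to apply a Q20 cutoff is to drop bases from the 3' end until the last remaining base reaches the threshold. This is only a sketch of the idea; production trimmers typically use windowed or running-sum rules instead. The function name and the example qualities are hypothetical.

```python
def quality_trim(sequence, qualities, threshold=20):
    """Drop 3' bases until the last remaining base has quality >= threshold."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < threshold:
        end -= 1
    return sequence[:end], qualities[:end]


seq = "ACGTACGT"
quals = [38, 37, 36, 35, 30, 22, 15, 8]  # quality drops toward the 3' end
print(quality_trim(seq, quals))
```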
Aligners
Mapping statistics
Unmapped reads
Reference: AGATAGGACCTAGTCCCTGACAATAGTGAC
Read:      AGATAGGGTGAC
Nowhere to align
Long deletion
Reference: AGATAGGACCTAGTCCCTGACAATAGTGAC
Read:      AGATAGG------------------GTGAC
Long insertion
Reference: AGAT-------------AGGACCTAGTCCCTGACAATAGTGAC
Read:      AGATGGAGTCCTAGGAGAGGACC
In these situations the alignment score does not pass the mapping criteria, so the reads are classified as unmapped. Some current tools can rescue such reads after the conventional mapping step.
Duplicate set
Gene quantification
RPKM
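RPKM stands for reads per kilobase of transcript per million mapped reads: RPKM = (reads mapped to the gene * 10^9) / (gene length in bp * total mapped reads). A minimal sketch of the calculation; the function name and the example numbers are hypothetical.

```python
def rpkm(gene_reads, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return gene_reads * 1e9 / (gene_length_bp * total_mapped_reads)


# 100 reads on a 2,000 bp gene in a library of 10 million mapped reads
print(rpkm(gene_reads=100, gene_length_bp=2000, total_mapped_reads=10_000_000))
```

Dividing by gene length and by library size makes values comparable between genes of different lengths and between samples of different sequencing depth.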
Commonly used file formats
Fasta and Fastq
Fasta
Fastq
Fastq format
The representation of quality scores in Fastq
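In a Fastq record each base has one quality character, and the Phred score is the character's ASCII code minus an offset. The sketch below assumes four-line records and the Phred+33 offset; the record itself and the function names are made up for illustration.

```python
def parse_fastq(text):
    """Yield (name, sequence, quality_string) from four-line Fastq records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, sequence, _plus, quality = lines[i:i + 4]
        yield header[1:], sequence, quality  # drop the leading '@'


def decode_quality(quality, offset=33):
    """Convert a quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


record = "@read1\nACGTN\n+\nII5?!\n"  # hypothetical record
for name, sequence, quality in parse_fastq(record):
    print(name, sequence, decode_quality(quality))
```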
SAM and BAM
SAM format
For the encoding of the FLAG, CIGAR, QUAL and optional fields, please see the following links:
https://en.wikipedia.org/wiki/SAM_(file_format)
https://en.wikipedia.org/wiki/Sequence_alignment#Representations
11 mandatory fields + optional fields
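As a rough illustration of how two of these fields are encoded: FLAG is a sum of bit values, and CIGAR is a run-length string of alignment operations. The bit values below are those defined for SAM; the short labels, the function names and the example values are my own choices.

```python
import re

# Bit value -> short label (labels are informal, not official names).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the properties whose bits are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar):
    """Number of reference bases covered (operations M, D, N, = and X)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


print(decode_flag(99))
print(parse_cigar("7M18D5M"), reference_span("7M18D5M"))
```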
Coverage file
Wiggle file (coverage file) and BAM file (alignment file)
Wig file format
Displays the values 11, 22, 33 as single-base features, on chromosome 3 at positions 400601, 400701 and 400801 respectively.
Displays the values 11, 22, 33 as 5-base features, on chromosome 3 at positions 400601-400605, 400701-400705 and 400801-400805 respectively.
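The two descriptions above correspond to variableStep declarations in wiggle format, without and with a span. The sketch below writes such a block and lists the bases each value covers; it assumes the chromosome is named chr3, and the function names are hypothetical.

```python
def variable_step_wig(chrom, positions, values, span=1):
    """Build a variableStep wiggle block as text (positions are 1-based)."""
    header = f"variableStep chrom={chrom}"
    if span != 1:
        header += f" span={span}"
    lines = [header] + [f"{pos} {val}" for pos, val in zip(positions, values)]
    return "\n".join(lines)


def covered_intervals(positions, span=1):
    """First and last base covered by each value."""
    return [(pos, pos + span - 1) for pos in positions]


positions = [400601, 400701, 400801]
values = [11, 22, 33]

print(variable_step_wig("chr3", positions, values))          # single-base features
print(variable_step_wig("chr3", positions, values, span=5))  # 5-base features
print(covered_intervals(positions, span=5))
```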
SNP files
VCF file
For details, see:
https://en.wikipedia.org/wiki/Variant_Call_Format
BED+BIM+FAM files
GFF file
Commonly used attributes
GenBank
BED file
GEO
https://www.ncbi.nlm.nih.gov/geo/
SRA
https://www.ncbi.nlm.nih.gov/sra
TCGA
https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga
RefSeq
https://www.ncbi.nlm.nih.gov/refseq/
RefSeq FTP
https://ftp.ncbi.nlm.nih.gov/refseq/