2 of 63

NGS read mapping

NGS will produce numerous short DNA/RNA reads. Precisely aligning these reads to the correct locations in reference genome sequence is crucial and strongly influence the interpretations of the experiments.

3 of 63

Reference genome sequence VS Sequencing reads

Reference genome is the complete sequence of a organism stored in databases. It contains all nucleotides with correct order.
Sequencing reads are the short fragmented DNAs/RNAs. They have to mapped on reference genome.

4 of 63

Sequence alignment

For mapping the reads to reference genome, sequence alignment is the necessary step.
Sequence alignment – comparing DNA, RNA, or protein sequences to identify regions of similarity
The similarity of sequence alignment usually corresponds to the functional, structural, or evolutionary relationships between the sequences.
Types of sequence alignment

Pairwise sequence alignment VS Multiple sequence alignment
Global sequence alignment VS Local sequence alignment

5 of 63

Pairwise sequence alignment VS Multiple sequence alignment

Pairwise sequence alignment

Align two sequences one time.
A query sequence for a searching in a database
Understanding the differences between two samples

Multiple sequence alignment

Perform an alignment for more than two sequences
Searching conserved regions
Understanding phylogenetics

7 of 63

Global alignment vs local alignment

Global alignment

End to end alignment
Input sequences are similar
Input sequences have equal size
Specie or sample Classification

Local alignment

Subsequence to subsequence alignment
Dissimilar sequences
Conserved region discovery
Motif searching

9 of 63

Global alignment

Local alignment

10 of 63

E-value, similarity, identity and score

E-value (expect value) is dependent on the length of the query sequence and the size of the database. The lower the E-value, the better the hit. For example, an alignment obtaining an E-value of 0.05 means that there is a 5 in 100 chance of occurring by chance alone.
Short identical sequence may have a high E-value and may be regarded as "false positive" hits.

ATGTTCGACCCTAGTCCCACCAAAGTCGATATAGGAGTGAAAACCACAGGGTGA

AGTCC

11 of 63

Similarity vs identity

Similarity is more often used for protein sequence alignment. If the mismatching sequences have similar chemical/physical properties, the similarity can still be high, ex. Hydrophobic with hydrophobic.
Identity means the two sequences should be exactly the same.

ILE

LEU

Similarity is good, but not identical

12 of 63

Score

Sometimes the alignment score also considers the length of the query sequence.

13 of 63

What does NGS read mapping belong?

Multiple sequence alignment or pairwise sequence alignment
Pairwise sequence alignment
Alignment is performed between a reference genome and a read
Local sequence alignment or global sequence alignment
Global sequence alignment (reads), local alignment (reference)
Reads are usually short, all the nucleotides are important. Moreover, the reads and reference genome are identical species. The sequence should be the same.

14 of 63

If using local alignment to NGS mapping

ATGTTCGACCCTAGTCCCACCAAAGTCGATATAGGTAGTCCCACCACAGGGTGA

Reference

AGATAGGACCTAGTCCCTGACAATAGTGAC

Read

AGATAGGACCTAGTCCCTGACAATAGTGAC

ATGTTCGACCCTAGTCCCACCAAAGTCGATATAGGTAGTCCCACCACAGGGTGA

AGATAGGACCTAGTCCCTGACAATAGTGAC

ATGTTCGACCCTAGTCCCACCAAAGTCGATATAGGTAGTCCCACCACAGGGTGA

Causing many false positive hits

15 of 63

Unique mapping

Uniquely mapped reads – For mapping purposes a read is either one single-ended read or a paired-end set of two reads which will be counted as one mapping. Uniquely mapped reads map to exactly one location within the reference genome.
For the multiple mapped reads, there are many approaches to deal with them.

16 of 63

Mapping for pair-end reads

The alignment for single-end reads is similar as the normal sequence alignment.
The alignment for pair-end reads are for mapping two reads in pair at the same time.

17 of 63

Long reads vs short reads

Long reads

The alignment can be more specific and increase the unique mapping.
More genetic information can be kept.
May produce more error during sequencing.

Short reads

The accuracy of sequencing normally higher.
Less nucleotides are removed due to quality trimming.
May produce more multiple mapped reads.

The commonly used read length is 200-300 nts.

18 of 63

Mapping workflow

19 of 63

Adapter trimming

In order to connect the DNA/RNA reads to the flow cell, adapter needs to be attached to the reads via artificial approach.
The adapter normally quite different from the reference genome for avoiding the misconnection.
Since the adapter does not belong to the reference genome, it cannot be aligned on it. Thus, adapter will influence the read alignment negatively. It needs to be removed from the DNA/RNA reads.

20 of 63

Adapter trimming

21 of 63

Poly-A tail trimming

Normally, the mRNAs of eukaryotic genome contain ploy-A tail in the 3’end.
Ploy-A tail does not exist in reference genome which is DNA sequence. Thus, poly-A tail needs to be removed.
Since most of the prokaryotic genomes have no poly-A tail, this step can be skip.

22 of 63

Quality trimming

The sequencing errors usually happen near the 3’end due to the sequencing quality decreasing.
If the reads containing too many low quality nucleotides (wrong), the mapping accuracy will be influenced.
Normally, before read mapping, quality trimming will be executed for removing the low quality nucleotides.

Q = -10 log₁₀(e)

Quality score	Probability of Incorrect Base Call	Inferred Base Call Accuracy
Q10	1 in 10	90%
Q20	1 in 100	99%
Q30	1 in 1000	99.9%

23 of 63

Quality trimming

Quality decrease

Q20

27 of 63

Mapping statistics

28 of 63

Unmapped reads

Unmapped reads refer to those reads that map nowhere on the reference genome.
Unmapped reads are often ignored or discarded without further analysis. However, there may be gold, or junk.
Surely, these reads may be contaminated or low quality, but they may reveal us the sample has long range insertion or deletion.

29 of 63

Unmapped reads

AGATAGGACCTAGTCCCTGACAATAGTGAC

Reference

AGATAGGGTGAC

Nowhere to align

AGATAGGACCTAGTCCCTGACAATAGTGAC

Reference

AGATAGG ------------------------------------- GTGAC

Long deletion

AGAT ---------------------------- AGGACCTAGTCCCTGACAATAGTGAC

Reference

Long insertion

AGATGGAGTCCTAGGAGAGGACC

The scores of these situations cannot pass the criteria. Thus, these types of reads will be classified as unmapped reads. Currently, some tools can rescue these reads after conventional step of mapping.

30 of 63

Duplicate set

Duplicate reads have arisen due to the use of PCR amplification (or other enrichment) during sample preparation. They may cause bias.
However, it is not easy to know the duplicated reads exist naturally or not, especially for RNA-Seq.
This is an optional step.

31 of 63

Gene quantification

One of the most important purposes of read mapping is to understand the expression values of the genes.
Based on the amounts of reads aligned on the genes, the quantification can be done.
In order to avoid the batch effects, the normalization is usually performed to standardize the read counts.
The commonly used normalization methods

RPKM (Reads Per Kilobase per Million)
FPKM (Fragments Per Kilobase per Million) for paired-end
TPM (Transcripts Per Million)

32 of 63

RPKM

If the length of a gene is longer, more reads can map on it.

33 of 63

The commonly file format

Sequencing raw data

Fastq or Fasta

Alignment file

SAM or BAM

Coverage file

Wig or Bigwig

SNP calling file

VCF or BED+FAM+BIM

Genome annotation file

34 of 63

Fasta and Fastq

Fasta

Fastq

35 of 63

Fastq format

A sequence identifier with information about the sequencing run and the cluster. Start with @.
The sequence (the base calls; A, C, T, G and N).
A separator, which is simply a plus (+) sign.
The base call quality scores. These are Phred encoded, using ASCII characters to represent the numerical quality scores.

36 of 63

The presentation of QC score in Fastq

37 of 63

SAM and BAM

After mapping, the results will be stored in SAM or BAM file.
SAM file is the human readable, but the size is quite large. The information can be viewed and also can be loaded in genome browser.
BAM file is the file compressed from SAM. It requires less storage, but is binary data which is not readable. However, it can be loaded in genome browser to view it.

38 of 63

SAM format

For the encode of FLAG, CIGAR, QUAL and optional fields, please

check the following links:

https://en.wikipedia.org/wiki/SAM_(file_format)

https://en.wikipedia.org/wiki/Sequence_alignment#Representations

+ optional fields

39 of 63

Coverage file

Coverage file contains the coverage values of each position without any other detailed information. Coverage file is usually applied for RNA-Seq data.
This type of files requires much less storage and easy to handle and share.
The commonly used formats are wiggle file (.wig) and BigWig file. Both of them can be loaded in genome browser, but no SNP or read count can be viewed.

40 of 63

Wiggle file and BAM file

Coverage file

Alignment file

41 of 63

Wig file format

Track – present the information of input file

type - must be wiggle_0
Valid optional parameters used by Ensembl are:

name - unique name to identify this track when parsing the file
description - Label to be displayed under the track in Region in Detail
priority - integer defining the order in which to display tracks, if multiple tracks are defined.
color - as RGB, hex or X11 named color
graphType - either 'bar' or 'points’

42 of 63

Wig file format

VariableStep

It designed for data with irregular intervals between data points, and is the more commonly used format. It begins with a declaration line, followed by two columns containing chromosome positions and data values.

chrom (required) - name of chromosome
span (optional, defaults to 1) - the number of bases that each data value should cover

43 of 63

Wig file format

FixedStep

fixedStep format is designed for data with regular intervals between data points, and is the more compact of the two wiggle formats. It begins with a declaration line, followed by a single column of data values.
The declaration line begins with the word fixedStep and is followed by space-separated key-value pairs:
chrom (required) - name of chromosome
start (required) - start point for the data values
step (required) - distance between data values
span (optional, defaults to 1) - the number of bases that each data value should cover

Displays the values 11, 22, 33 as single-base features, on chromosome 3 at positions 400601, 400701 and 400801 respectively.

Displays the values 11, 22, 33 as 5-base features, on chromosome 3 at positions 400601-400605, 400701-400705 and 400801-400805 respectively.

44 of 63

SNP files

VCF file contains all the detailed information of SNPs. It is readable but may need large storage.
BED, BIM and FAM files are another commonly used file combination for checking SNPs. Each of these files focuses on its specific aspects. They need less storage and good for share and store.

45 of 63

VCF file

For the details:

https://en.wikipedia.org/wiki/Variant_Call_Format

46 of 63

BED+BIM+FAM files

.bed

Primary representation of genotype calls at biallelic variants. Must be accompanied by .bim and .fam files.

.bim

Extended variant information file accompanying a .bed binary genotype table.
A text file with no header line, and one line per variant with the following six fields:

Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
Variant identifier
Position in morgans or centimorgans (safe to use dummy value of '0')
Base-pair coordinate (1-based; limited to 231-2)
Allele 1 (corresponding to clear bits in .bed; usually minor)
Allele 2 (corresponding to set bits in .bed; usually major)

.fam

Sample information file accompanying a .bed binary genotype table.
A text file with no header line, and one line per sample with the following six fields:

Family ID ('FID')
Within-family ID ('IID'; cannot be '0')
Within-family ID of father ('0' if father isn't in dataset)
Within-family ID of mother ('0' if mother isn't in dataset)
Sex code ('1' = male, '2' = female, '0' = unknown)
Phenotype value ('1' = control, '2' = case, '-9'/'0'--out/non-numeric = missing data if case/control)

49 of 63

Commonly used attributes

ID should be unique. It is used for identifying different features.

Name

The name of genomic features.

Parent

Normally, a gene may have multiple products. Parent is for connecting the gene with its products.

Gene

Gene name.

Note

It normally presents the information of the genomic features, ex: function.

50 of 63

Genbank

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.
It contains more detailed information including sequence.
All the information are easy to be understanded.

51 of 63

BED file

The BED (Browser Extensible Data) format is a text file format used to store genomic regions as coordinates and associated annotations.
This format was developed during the Human Genome Project and then adopted by other sequencing projects.
The required columns

Sequence ID
Starting point
Ending point

The optional columns

https://en.wikipedia.org/wiki/BED_(file_format)

1 of 63

2 of 63

3 of 63

4 of 63

5 of 63

6 of 63

7 of 63

8 of 63

9 of 63

10 of 63

11 of 63

12 of 63

13 of 63

14 of 63

15 of 63

16 of 63

17 of 63

18 of 63

19 of 63

20 of 63

21 of 63

22 of 63

23 of 63

24 of 63

25 of 63

26 of 63

27 of 63

28 of 63

29 of 63

30 of 63

31 of 63

32 of 63

33 of 63

34 of 63

35 of 63

36 of 63

37 of 63

38 of 63

39 of 63

40 of 63

41 of 63

42 of 63

43 of 63

44 of 63

45 of 63

46 of 63

47 of 63

48 of 63

49 of 63

50 of 63

51 of 63

52 of 63

53 of 63

54 of 63

55 of 63

56 of 63

57 of 63

58 of 63

59 of 63

60 of 63

61 of 63

62 of 63

63 of 63