1 of 63

Mapping, genome annotation and related databases

2 of 63

NGS read mapping

  • NGS will produce numerous short DNA/RNA reads. Precisely aligning these reads to the correct locations in reference genome sequence is crucial and strongly influence the interpretations of the experiments.

3 of 63

Reference genome sequence VS Sequencing reads

  • Reference genome is the complete sequence of a organism stored in databases. It contains all nucleotides with correct order.
  • Sequencing reads are the short fragmented DNAs/RNAs. They have to mapped on reference genome.

4 of 63

Sequence alignment

  • For mapping the reads to reference genome, sequence alignment is the necessary step.
  • Sequence alignment – comparing DNA, RNA, or protein sequences to identify regions of similarity
  • The similarity of sequence alignment usually corresponds to the functional, structural, or evolutionary relationships between the sequences.
  • Types of sequence alignment
    • Pairwise sequence alignment VS Multiple sequence alignment
    • Global sequence alignment VS Local sequence alignment

5 of 63

Pairwise sequence alignment VS Multiple sequence alignment

  • Pairwise sequence alignment
    • Align two sequences one time.
    • A query sequence for a searching in a database
    • Understanding the differences between two samples
  • Multiple sequence alignment
    • Perform an alignment for more than two sequences
    • Searching conserved regions
    • Understanding phylogenetics

6 of 63

7 of 63

Global alignment vs local alignment

  • Global alignment
    • End to end alignment
    • Input sequences are similar
    • Input sequences have equal size
    • Specie or sample Classification
  • Local alignment
    • Subsequence to subsequence alignment
    • Dissimilar sequences
    • Conserved region discovery
    • Motif searching

8 of 63

9 of 63

Global alignment

Local alignment

10 of 63

E-value, similarity, identity and score

  • E-value (expect value) is dependent on the length of the query sequence and the size of the database. The lower the E-value, the better the hit. For example, an alignment obtaining an E-value of 0.05 means that there is a 5 in 100 chance of occurring by chance alone.
  • Short identical sequence may have a high E-value and may be regarded as "false positive" hits.

ATGTTCGACCCTAGTCCCACCAAAGTCGATATAGGAGTGAAAACCACAGGGTGA

AGTCC

11 of 63

Similarity vs identity

  • Similarity is more often used for protein sequence alignment. If the mismatching sequences have similar chemical/physical properties, the similarity can still be high, ex. Hydrophobic with hydrophobic.
  • Identity means the two sequences should be exactly the same.

ILE

LEU

Similarity is good, but not identical

12 of 63

Score

Sometimes the alignment score also considers the length of the query sequence.

13 of 63

What does NGS read mapping belong?

  • Multiple sequence alignment or pairwise sequence alignment
  • Pairwise sequence alignment
  • Alignment is performed between a reference genome and a read
  • Local sequence alignment or global sequence alignment
  • Global sequence alignment (reads), local alignment (reference)
  • Reads are usually short, all the nucleotides are important. Moreover, the reads and reference genome are identical species. The sequence should be the same.

14 of 63

If using local alignment to NGS mapping

ATGTTCGACCCTAGTCCCACCAAAGTCGATATAGGTAGTCCCACCACAGGGTGA

Reference

AGATAGGACCTAGTCCCTGACAATAGTGAC

Read

AGATAGGACCTAGTCCCTGACAATAGTGAC

ATGTTCGACCCTAGTCCCACCAAAGTCGATATAGGTAGTCCCACCACAGGGTGA

AGATAGGACCTAGTCCCTGACAATAGTGAC

ATGTTCGACCCTAGTCCCACCAAAGTCGATATAGGTAGTCCCACCACAGGGTGA

Causing many false positive hits

15 of 63

Unique mapping

  • Uniquely mapped reads – For mapping purposes a read is either one single-ended read or a paired-end set of two reads which will be counted as one mapping. Uniquely mapped reads map to exactly one location within the reference genome.
  • For the multiple mapped reads, there are many approaches to deal with them.

16 of 63

Mapping for pair-end reads

  • The alignment for single-end reads is similar as the normal sequence alignment.
  • The alignment for pair-end reads are for mapping two reads in pair at the same time.

17 of 63

Long reads vs short reads

  • Long reads
    • The alignment can be more specific and increase the unique mapping.
    • More genetic information can be kept.
    • May produce more error during sequencing.
  • Short reads
    • The accuracy of sequencing normally higher.
    • Less nucleotides are removed due to quality trimming.
    • May produce more multiple mapped reads.
  • The commonly used read length is 200-300 nts.

18 of 63

Mapping workflow

19 of 63

Adapter trimming

  • In order to connect the DNA/RNA reads to the flow cell, adapter needs to be attached to the reads via artificial approach.
  • The adapter normally quite different from the reference genome for avoiding the misconnection.
  • Since the adapter does not belong to the reference genome, it cannot be aligned on it. Thus, adapter will influence the read alignment negatively. It needs to be removed from the DNA/RNA reads.

20 of 63

Adapter trimming

21 of 63

Poly-A tail trimming

  • Normally, the mRNAs of eukaryotic genome contain ploy-A tail in the 3’end.
  • Ploy-A tail does not exist in reference genome which is DNA sequence. Thus, poly-A tail needs to be removed.
  • Since most of the prokaryotic genomes have no poly-A tail, this step can be skip.

22 of 63

Quality trimming

  • The sequencing errors usually happen near the 3’end due to the sequencing quality decreasing.
  • If the reads containing too many low quality nucleotides (wrong), the mapping accuracy will be influenced.
  • Normally, before read mapping, quality trimming will be executed for removing the low quality nucleotides.
    • Q = -10 log10(e)

Quality score

Probability of Incorrect Base Call

Inferred Base Call Accuracy

Q10

1 in 10

90%

Q20

1 in 100

99%

Q30

1 in 1000

99.9%

23 of 63

Quality trimming

Quality decrease

Q20

24 of 63

Aligners

25 of 63

26 of 63

27 of 63

Mapping statistics

28 of 63

Unmapped reads

  • Unmapped reads refer to those reads that map nowhere on the reference genome.
  • Unmapped reads are often ignored or discarded without further analysis. However, there may be gold, or junk.
  • Surely, these reads may be contaminated or low quality, but they may reveal us the sample has long range insertion or deletion.

29 of 63

Unmapped reads

AGATAGGACCTAGTCCCTGACAATAGTGAC

Reference

AGATAGGGTGAC

Nowhere to align

AGATAGGACCTAGTCCCTGACAATAGTGAC

Reference

AGATAGG ------------------------------------- GTGAC

Long deletion

AGAT ---------------------------- AGGACCTAGTCCCTGACAATAGTGAC

Reference

Long insertion

AGATGGAGTCCTAGGAGAGGACC

AGATGGAGTCCTAGGAGAGGACC

The scores of these situations cannot pass the criteria. Thus, these types of reads will be classified as unmapped reads. Currently, some tools can rescue these reads after conventional step of mapping.

30 of 63

Duplicate set

  • Duplicate reads have arisen due to the use of PCR amplification (or other enrichment) during sample preparation. They may cause bias.
  • However, it is not easy to know the duplicated reads exist naturally or not, especially for RNA-Seq.
  • This is an optional step.

31 of 63

Gene quantification

  • One of the most important purposes of read mapping is to understand the expression values of the genes.
  • Based on the amounts of reads aligned on the genes, the quantification can be done.
  • In order to avoid the batch effects, the normalization is usually performed to standardize the read counts.
  • The commonly used normalization methods
    • RPKM (Reads Per Kilobase per Million)
    • FPKM (Fragments Per Kilobase per Million) for paired-end
    • TPM (Transcripts Per Million)

32 of 63

RPKM

  • If the length of a gene is longer, more reads can map on it.

33 of 63

The commonly file format

  • Sequencing raw data
    • Fastq or Fasta
  • Alignment file
    • SAM or BAM
  • Coverage file
    • Wig or Bigwig
  • SNP calling file
    • VCF or BED+FAM+BIM
  • Genome annotation file
    • GFF
    • BED

34 of 63

Fasta and Fastq

Fasta

Fastq

35 of 63

Fastq format

  • A sequence identifier with information about the sequencing run and the cluster. Start with @.
  • The sequence (the base calls; A, C, T, G and N).
  • A separator, which is simply a plus (+) sign.
  • The base call quality scores. These are Phred encoded, using ASCII characters to represent the numerical quality scores.

36 of 63

The presentation of QC score in Fastq

37 of 63

SAM and BAM

  • After mapping, the results will be stored in SAM or BAM file.
  • SAM file is the human readable, but the size is quite large. The information can be viewed and also can be loaded in genome browser.
  • BAM file is the file compressed from SAM. It requires less storage, but is binary data which is not readable. However, it can be loaded in genome browser to view it.

38 of 63

SAM format

For the encode of FLAG, CIGAR, QUAL and optional fields, please

check the following links:

https://en.wikipedia.org/wiki/SAM_(file_format)

https://en.wikipedia.org/wiki/Sequence_alignment#Representations

+ optional fields

39 of 63

Coverage file

  • Coverage file contains the coverage values of each position without any other detailed information. Coverage file is usually applied for RNA-Seq data.
  • This type of files requires much less storage and easy to handle and share.
  • The commonly used formats are wiggle file (.wig) and BigWig file. Both of them can be loaded in genome browser, but no SNP or read count can be viewed.

40 of 63

Wiggle file and BAM file

Coverage file

Alignment file

41 of 63

Wig file format

  • Track – present the information of input file
    • type - must be wiggle_0
    • Valid optional parameters used by Ensembl are:
      • name - unique name to identify this track when parsing the file
      • description - Label to be displayed under the track in Region in Detail
      • priority - integer defining the order in which to display tracks, if multiple tracks are defined.
      • color - as RGB, hex or X11 named color
      • graphType - either 'bar' or 'points’

42 of 63

Wig file format

  • VariableStep
    • It designed for data with irregular intervals between data points, and is the more commonly used format. It begins with a declaration line, followed by two columns containing chromosome positions and data values.
      • chrom (required) - name of chromosome
      • span (optional, defaults to 1) - the number of bases that each data value should cover

43 of 63

Wig file format

  • FixedStep
    • fixedStep format is designed for data with regular intervals between data points, and is the more compact of the two wiggle formats. It begins with a declaration line, followed by a single column of data values.
    • The declaration line begins with the word fixedStep and is followed by space-separated key-value pairs:
    • chrom (required) - name of chromosome
    • start (required) - start point for the data values
    • step (required) - distance between data values
    • span (optional, defaults to 1) - the number of bases that each data value should cover

Displays the values 11, 22, 33 as single-base features, on chromosome 3 at positions 400601, 400701 and 400801 respectively.

Displays the values 11, 22, 33 as 5-base features, on chromosome 3 at positions 400601-400605, 400701-400705 and 400801-400805 respectively.

44 of 63

SNP files

  • VCF file contains all the detailed information of SNPs. It is readable but may need large storage.
  • BED, BIM and FAM files are another commonly used file combination for checking SNPs. Each of these files focuses on its specific aspects. They need less storage and good for share and store.

45 of 63

VCF file

For the details:

https://en.wikipedia.org/wiki/Variant_Call_Format

46 of 63

BED+BIM+FAM files

  • .bed
    • Primary representation of genotype calls at biallelic variants. Must be accompanied by .bim and .fam files.
  • .bim
    • Extended variant information file accompanying a .bed binary genotype table.
    • A text file with no header line, and one line per variant with the following six fields:
      • Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
      • Variant identifier
      • Position in morgans or centimorgans (safe to use dummy value of '0')
      • Base-pair coordinate (1-based; limited to 231-2)
      • Allele 1 (corresponding to clear bits in .bed; usually minor)
      • Allele 2 (corresponding to set bits in .bed; usually major)
  • .fam
    • Sample information file accompanying a .bed binary genotype table.
    • A text file with no header line, and one line per sample with the following six fields:
      • Family ID ('FID')
      • Within-family ID ('IID'; cannot be '0')
      • Within-family ID of father ('0' if father isn't in dataset)
      • Within-family ID of mother ('0' if mother isn't in dataset)
      • Sex code ('1' = male, '2' = female, '0' = unknown)
      • Phenotype value ('1' = control, '2' = case, '-9'/'0'--out/non-numeric = missing data if case/control)

47 of 63

GFF file

48 of 63

GFF file

49 of 63

Commonly used attributes

  • ID
    • ID should be unique. It is used for identifying different features.
  • Name
    • The name of genomic features.
  • Parent
    • Normally, a gene may have multiple products. Parent is for connecting the gene with its products.
  • Gene
    • Gene name.
  • Note
    • It normally presents the information of the genomic features, ex: function.

50 of 63

Genbank

  • GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.
  • It contains more detailed information including sequence.
  • All the information are easy to be understanded.

51 of 63

BED file

  • The BED (Browser Extensible Data) format is a text file format used to store genomic regions as coordinates and associated annotations.
  • This format was developed during the Human Genome Project and then adopted by other sequencing projects.
  • The required columns
    • Sequence ID
    • Starting point
    • Ending point
  • The optional columns
    • https://en.wikipedia.org/wiki/BED_(file_format)

52 of 63

GEO

https://www.ncbi.nlm.nih.gov/geo/

53 of 63

GEO

54 of 63

SRA

https://www.ncbi.nlm.nih.gov/sra

55 of 63

SRA

56 of 63

TGCA

https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga

57 of 63

TGCA

58 of 63

RefSeq

https://www.ncbi.nlm.nih.gov/refseq/

59 of 63

RefSeq

60 of 63

RefSeq

61 of 63

RefSeq FTP

https://ftp.ncbi.nlm.nih.gov/refseq/

62 of 63

RefSeq FTP

63 of 63

RefSeq FTP