1 of 47

Minor in Bioinformatics

Мария Попцова

Lecture 1

Semester 2

2 of 47

Human genome

mitochondria

3 of 47

The Human Genome Project

  • First reference genome published in 2003
  • It took 13 years (1990 to 2003) and an international effort
  • Sanger sequencing
  • It didn’t represent the genome of one individual but was built using information from the DNA of several volunteers living near the laboratories involved in the project.
    • The identities of those who participated have never been revealed; even the participants themselves do not know if their DNA was used to produce the final published genome.

Read more: https://www.genome.gov/about-genomics/educational-resources/fact-sheets/human-genome-project

4 of 47

Read more: https://www.genome.gov/about-genomics/educational-resources/fact-sheets/human-genome-project

Most of the original human genome sequence came from volunteers living in Buffalo, New York. Researchers at the Roswell Park Cancer Institute, located in Buffalo, were experts at preparing the DNA in a form that could be used for sequencing the human genome.

5 of 47


6 of 47

Human Genome Project

  • Started in 1990
  • Finished: draft sequence in 2001, completed sequence in 2003
  • Sanger Sequencing

7 of 47

Sanger Sequencing

  • In 1977 Frederick Sanger published a sequencing method involving DNA synthesis in the presence of chain-terminating ddNTPs (dideoxynucleoside triphosphates), followed by electrophoresis.

  • For 30 years, this method was refined but never replaced.
    • From one at a time to 96 in parallel
    • Radiolabels to fluorescence

8 of 47

https://www.youtube.com/watch?v=KTstRrDTmWI

9 of 47

Nitrogenous Bases

10 of 47

Nucleosides

  • Base linked to a 2-deoxy-D-ribose at 1’ carbon

Nucleotides

  • Nucleosides with a phosphate at 5’ carbon

11 of 47

Phosphodiester Bond

  • DNA Polymerase

12 of 47

Sanger sequencing: chain termination or dideoxy method

[Figure: numbered steps 1 to 5 of the method, including gel electrophoresis]

13 of 47

Dideoxy (Sanger) Method

  • ddNTP: 2’,3’-dideoxynucleotide

  • No 3’ hydroxyl

  • Terminates chain when incorporated

  • Add enough ddNTP that termination happens at random, so that across all copies of the template every occurrence of the corresponding base ends some fragment

14 of 47

Dideoxy Method

  • Run four separate reactions each with different ddNTPs

  • Run on a gel in four separate lanes

  • Read the gel from the bottom up
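The read-out logic of the four-lane gel can be illustrated with a toy model. This is only a sketch, not a simulation of the chemistry: it assumes every position of the newly synthesized strand yields exactly one terminated fragment, and the function names and the example strand are hypothetical.

```python
def sanger_fragments(new_strand):
    """Return (length, lane) for every chain-terminated fragment.

    Position i of the synthesized strand gives one fragment of length i + 1
    that ends in the ddNTP matching the base there, so the fragment shows up
    in the lane of the reaction that contained that ddNTP.
    """
    return [(i + 1, base) for i, base in enumerate(new_strand)]


def read_gel(fragments):
    """Read the gel bottom-up: the shortest fragments migrate farthest."""
    return "".join(lane for _length, lane in sorted(fragments))


strand = "GATTACA"                # hypothetical newly synthesized strand
bands = sanger_fragments(strand)
bands.reverse()                   # the order in which bands are listed is irrelevant
print(read_gel(bands))            # GATTACA
```

Sorting by fragment length stands in for electrophoresis: the lane of the shortest band gives the first base, the next-shortest gives the second, and so on.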

15 of 47

A sequencing gel

This picture is an autoradiograph. The darkness of each band is proportional to the radioactivity from 32P-labeled adenosine in the synthesized DNA sample.

16 of 47

Automated Version of the Dideoxy Method

http://www.youtube.com/watch?v=JHv7IxxgxW4

17 of 47

18 of 47

19 of 47

Vol 318, Issue 5858, 21 December 2007

Equipped with faster, cheaper technologies for sequencing DNA and assessing variation in genomes on scales ranging from one to millions of bases, researchers are finding out how truly different we are from one another.

20 of 47

Single Nucleotide Polymorphisms (SNPs)

21 of 47

Human genetics: terminology

Slide courtesy of Sven Cichon

If allele G is associated with risk for disease, it is the risk allele.

That makes allele A the protective allele.

22 of 47

Slide courtesy of Sven Cichon

Human genetics: terminology

23 of 47

SNPs may / may not alter protein structure

24 of 47

SNPs act as gene markers

25 of 47

SNP Maps

26 of 47

Structural Variation

27 of 47

What is next-generation sequencing?

  • Modern methods have three key distinctions:

    • DNA molecules are immobilized on a solid support

    • DNA sequence is read as part of the DNA synthesis process

    • hundreds of thousands to billions of molecules are sequenced in parallel

28 of 47

29 of 47

How to find genetic variation with next generation sequencing

(Meyerson, Nat Rev Genet, 2010)

30 of 47

How many variants exist in each individual?

[Figure; population groups shown: European, Asian, South American/Hispanic, African]

Each individual carries 4-5 million variants relative to the reference genome

>99.9% of those variants are SNPs or short indels

(1000 Genomes Project Consortium, Nature, 2015)

31 of 47

Frequency of variants

  • Most detected variants in a population are rare
    • ~64 million have frequency < 0.5%
    • ~12 million have frequency between 0.5 and 5%
    • ~8 million have frequency >5%
  • Most variants within an individual are common
    • 1-4% of variants have frequency < 0.5%
    • ~96% of variants have frequency > 0.5%

(1000 Genomes Project Consortium, Nature, 2015)
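As a quick check of "most detected variants are rare", the three population-level counts above can be turned into shares of the total. This is a minimal sketch that uses only the rounded figures from this slide; the variable names are illustrative.

```python
# Approximate variant counts in millions, by population frequency (from the slide).
counts = {"<0.5%": 64, "0.5-5%": 12, ">5%": 8}

total = sum(counts.values())  # 84 million variants in these three bins
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)   # 84
print(shares)  # {'<0.5%': 76.2, '0.5-5%': 14.3, '>5%': 9.5}
```

Roughly three quarters of the catalogued variants fall in the rarest bin, while within a single genome the proportions are reversed.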

32 of 47

1000 Genomes Project

33 of 47

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

2015

34 of 47

100,000 Genomes Project

  • In late 2012, Prime Minister David Cameron announced the 100,000 Genomes Project.

  • Run by Genomics England, a company wholly owned and funded by the Department of Health & Social Care.

  • The project focused on patients with a rare disease and their families and patients with cancer.

  • The first samples for sequencing were taken from patients living in England, with discussions under way with Scotland, Wales and Northern Ireland about potential future involvement.

35 of 47

36 of 47

37 of 47

1 million

38 of 47

A typical genome

  • differs from the reference human genome at 4.1-5.0 million sites
  • contains an estimated 2,100 to 2,500 structural variants, affecting ∼20 million bases of sequence
    • ∼1,000 large deletions
    • ∼160 copy-number variants
    • ∼915 Alu insertions
    • ∼128 L1 insertions
    • ∼51 SVA insertions
    • ∼4 NUMTs
    • ∼10 inversions
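The itemized structural-variant counts can be checked against the quoted total range. The sketch below simply sums the approximate per-class figures from this slide; the dictionary keys are illustrative labels, not standard identifiers.

```python
# Approximate structural variants per typical genome, by class (from the slide).
sv_counts = {
    "large deletions": 1000,
    "copy-number variants": 160,
    "Alu insertions": 915,
    "L1 insertions": 128,
    "SVA insertions": 51,
    "NUMTs": 4,
    "inversions": 10,
}

total = sum(sv_counts.values())
print(total)                    # 2268
print(2100 <= total <= 2500)    # True: consistent with the quoted range
```

Large deletions and Alu insertions together account for most of the total.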

39 of 47

The majority of variants in the data set are rare

  • ∼64 million autosomal variants have a frequency <0.5%
  • ∼12 million have a frequency between 0.5% and 5%
  • Only ∼8 million have a frequency >5%

40 of 47

Putatively functional variation

  • a typical genome contained
    • 149–182 sites with protein truncating variants,
    • 10,000 to 12,000 sites with peptide-sequence-altering variants,
    • 459,000 to 565,000 variant sites overlapping known regulatory regions
      • untranslated regions (UTRs)
      • promoters
      • insulators
      • enhancers
      • transcription factor binding sites

41 of 47

Single Nucleotide Polymorphism

  • A single nucleotide polymorphism (SNP), pronounced “snip,” is a genetic variation in which a single nucleotide (i.e., A, T, C, or G) is altered and the change is inherited.
    • SNP: single DNA base variation found in >1% of the population
    • Mutation: single DNA base variation found in <1% of the population

SNP
  C T T A G C T T    94%
  C T T A G T T T    6%

Mutation
  C T T A G C T T    99.9%
  C T T A G T T T    0.1%

From Kun-Mao Chao, National Taiwan University
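The 1% convention above amounts to a simple threshold on the frequency of the less common base. A minimal sketch follows; the function name is hypothetical, and treating a frequency of exactly 1% as a mutation is an assumption, since the slide does not specify that boundary case.

```python
def classify_variant(minor_allele_frequency, threshold=0.01):
    """Label a single-base variant using the 1% population-frequency convention."""
    return "SNP" if minor_allele_frequency > threshold else "mutation"


print(classify_variant(0.06))    # SNP       (variant base seen in 6% of copies)
print(classify_variant(0.001))   # mutation  (variant base seen in 0.1% of copies)
```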

42 of 47

Mutations and SNPs

[Figure: lineages descending from a common ancestor along a time axis to the present; the observed genetic variations are labeled as mutations and SNPs]

From Kun-Mao Chao, National Taiwan University

43 of 47

Single Nucleotide Polymorphism

  • SNPs are the most frequent form among various genetic variations.
    • 90% of human genetic variations come from SNPs.
    • SNPs occur about every 300 to 600 base pairs.
    • Millions of SNPs have been identified (e.g., HapMap and Perlegen).
  • SNPs have become the preferred markers for association studies because of their high abundance and high-throughput SNP genotyping technologies.


From Kun-Mao Chao, National Taiwan University

44 of 47

Single Nucleotide Polymorphism

  • A SNP is usually assumed to be a binary variable.
    • The probability of repeat mutation at the same SNP locus is quite small.
    • Tri-allelic cases are usually attributed to genotyping errors.
  • The nucleotide at a SNP locus is called
    • the major allele (if its allele frequency is > 50%), or
    • the minor allele (if its allele frequency is < 50%).

A C T T A G C T T    94%    T: major allele
A C T T A G C T C    6%     C: minor allele

From Kun-Mao Chao, National Taiwan University
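Major and minor alleles follow directly from allele counts at a locus. The sketch below assumes a biallelic SNP and a hypothetical sample of 100 chromosomes matching the 94%/6% split shown above; the function names are illustrative.

```python
from collections import Counter


def allele_frequencies(alleles):
    """Map each observed allele to its relative frequency."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def major_minor(alleles):
    """Return (major, minor) for a biallelic locus; reject anything else."""
    freqs = allele_frequencies(alleles)
    if len(freqs) != 2:
        raise ValueError("expected exactly two alleles at a SNP locus")
    major, minor = sorted(freqs, key=freqs.get, reverse=True)
    return major, minor


sample = ["T"] * 94 + ["C"] * 6      # hypothetical sample of 100 chromosomes
print(allele_frequencies(sample))    # {'T': 0.94, 'C': 0.06}
print(major_minor(sample))           # ('T', 'C')
```

Rejecting loci with more than two alleles mirrors the slide's remark that tri-allelic observations are usually treated as genotyping errors. A 50/50 split has no well-defined major allele; here the allele seen first would be reported as major.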

45 of 47

Haplotypes

  • A haplotype is a set of linked SNPs on the same chromosome.
    • A haplotype can be simply considered as a binary string since each SNP is binary.

                         SNP1 SNP2 SNP3
-A C T T A G C T T-      C    A    T      Haplotype 1
-A A T T T G C T C-      A    T    C      Haplotype 2
-A C T T T G C T C-      C    T    C      Haplotype 3

From Kun-Mao Chao, National Taiwan University
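The three sequences on this slide can be used to show how haplotypes are read off at the variable positions and then written as binary strings. This is a sketch under stated assumptions: the sequences are already aligned, any column with more than one base counts as a SNP, and 0 encodes the more common allele among these three sequences (the slide does not fix an encoding). The function names are hypothetical.

```python
from collections import Counter

sequences = ["ACTTAGCTT", "AATTTGCTC", "ACTTTGCTC"]  # from the slide


def snp_positions(seqs):
    """0-based indices of alignment columns that contain more than one base."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]


def haplotypes(seqs):
    """Keep only the bases at SNP positions in each sequence."""
    positions = snp_positions(seqs)
    return ["".join(seq[i] for i in positions) for seq in seqs]


def to_binary(haps):
    """Encode each SNP as 0 (more common allele here) or 1 (less common)."""
    majors = [Counter(column).most_common(1)[0][0] for column in zip(*haps)]
    return ["".join("0" if a == m else "1" for a, m in zip(h, majors)) for h in haps]


haps = haplotypes(sequences)
print(snp_positions(sequences))  # [1, 4, 8]
print(haps)                      # ['CAT', 'ATC', 'CTC']
print(to_binary(haps))           # ['011', '100', '000']
```

With only three sequences the major allele at each SNP is just the base seen twice; in a real data set it would come from population allele frequencies.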

46 of 47

Haplotypes

A set of linked single-nucleotide polymorphism (SNP) alleles that tend to occur together.

47 of 47

Genotype