1 of 47

Minor in Bioinformatics

Maria Poptsova

Lecture 1

Semester 2

2 of 47

Human genome

mitochondria

3 of 47

The Human Genome Project

First reference genome published in 2003
It took 13 years (1990 to 2003) and an international effort
Sanger sequencing
It didn’t represent the genome of one individual but was built using information from the DNA of several volunteers living near the laboratories involved in the project.

The identities of those who participated have never been revealed; even the participants themselves do not know if their DNA was used to produce the final published genome.

Read more: https://www.genome.gov/about-genomics/educational-resources/fact-sheets/human-genome-project

4 of 47

Read more: https://www.genome.gov/about-genomics/educational-resources/fact-sheets/human-genome-project

Most of the original human genome sequence came from volunteers living in Buffalo, New York. Researchers at the Roswell Park Cancer Institute, located in Buffalo, were experts at preparing the DNA in a form that could be used for sequencing the human genome.

5 of 47

6 of 47

Human Genome Project

Started in 1990
Finished 2001 (draft) / 2003 (completed)
Sanger Sequencing

7 of 47

Sanger Sequencing

In 1977 Frederick Sanger published a sequencing method involving DNA synthesis in the presence of chain-terminating ddNTPs (dideoxynucleoside triphosphates), followed by electrophoresis.

For 30 years, this method was refined but never replaced.

From one at a time to 96 in parallel
Radiolabels to fluorescence

8 of 47

https://www.youtube.com/watch?v=KTstRrDTmWI

9 of 47

Nitrogenous Bases

10 of 47

Nucleosides

Base linked to a 2-deoxy-D-ribose at 1’ carbon

Nucleotides

Nucleosides with a phosphate at 5’ carbon

11 of 47

Phosphodiester Bond

DNA Polymerase

12 of 47

Sanger sequencing: chain termination or dideoxy method

Gel electrophoresis

13 of 47

Dideoxy (Sanger) Method

ddNTP: 2’,3’-dideoxynucleoside triphosphate

No 3’ hydroxyl

Terminates chain when incorporated

Add ddNTPs at a low enough concentration relative to dNTPs that termination is random and yields fragments ending at every occurrence of each base

14 of 47

Dideoxy Method

Run four separate reactions each with different ddNTPs

Run on a gel in four separate lanes

Read the gel from the bottom up
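The four-lane procedure above can be sketched as a small simulation. This is a simplified illustration, not a model of real chemistry: the template sequence is hypothetical, the primer is ignored, and the template is assumed to be written in the order the polymerase copies it. The function names `sanger_lanes` and `read_gel` are made up for this example.

```python
def sanger_lanes(template):
    """Return, for each ddNTP lane, the lengths of the terminated fragments.

    The synthesized strand is complementary to the template, so a fragment
    of length i ends with the complement of the i-th template base.
    """
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for length, base in enumerate(template, start=1):
        lanes[complement[base]].append(length)
    return lanes


def read_gel(lanes):
    """Read the gel from the bottom up: the shortest fragments run farthest."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


template = "TGCATTCG"  # hypothetical template, in the order it is copied
lanes = sanger_lanes(template)
print(lanes)            # fragment lengths per lane
print(read_gel(lanes))  # sequence of the synthesized strand: ACGTAAGC
```

Each lane holds only the fragments that end in one base, so sorting all bands by length across the four lanes recovers the synthesized strand one base at a time.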

15 of 47

A sequencing gel

This picture is an autoradiograph. The darkness of the bands is proportional to the radioactivity from ³²P-labeled adenosine incorporated into the synthesized DNA sample.

16 of 47

Automated Version of the Dideoxy Method

http://www.youtube.com/watch?v=JHv7IxxgxW4

19 of 47

Vol 318, Issue 5858, 21 December 2007

Equipped with faster, cheaper technologies for sequencing DNA and assessing variation in genomes on scales ranging from one to millions of bases, researchers are finding out how truly different we are from one another.

20 of 47

Single Nucleotide Polymorphisms (SNPs)

21 of 47

Human genetics: terminology

Slide courtesy of Sven Cichon

If allele G is associated with risk for disease, it is the risk allele.

That makes allele A the protective allele.

22 of 47

Slide courtesy of Sven Cichon

Human genetics: terminology

23 of 47

SNPs may / may not alter protein structure

24 of 47

SNPs act as gene markers

26 of 47

Structural Variation

27 of 47

What is next-generation sequencing?

Modern methods have three key distinctions:

DNA molecules are immobilized on a solid support

DNA sequence is read as part of the DNA synthesis process

Hundreds of thousands to billions of molecules are sequenced in parallel

29 of 47

How to find genetic variation with next generation sequencing

(Meyerson et al., Nat Rev Genet, 2010)

30 of 47

How many variants exist in each individual?

[Figure: variants per individual in four population groups: European, Asian, South American/Hispanic, African]

Each individual has 4-5 million variants that differ from the reference genome

>99.9% of those variants are SNPs or short indels

(1000 Genomes Project Consortium, Nature, 2015)

31 of 47

Frequency of variants

Most detected variants in a population are rare

~64 million have frequency < 0.5%
~12 million have frequency between 0.5 and 5%
~8 million have frequency >5%

Most variants within an individual are common

1-4% of variants have frequency < 0.5%
96-99% of variants have frequency > 0.5%

(1000 Genomes Project Consortium, Nature, 2015)
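As a quick check of the numbers above, the shares of each frequency class can be worked out directly. This sketch uses only the rounded counts quoted on the slide, so the percentages are approximate; the variable names are illustrative.

```python
# Variant counts in millions, by population frequency class (from the slide).
counts = {"<0.5%": 64, "0.5-5%": 12, ">5%": 8}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)   # 84 (million variants in the three classes)
print(shares)  # {'<0.5%': 76.2, '0.5-5%': 14.3, '>5%': 9.5}

# Within one individual: 1-4% of 4-5 million variants are rare (<0.5%).
rare_low = 4_000_000 * 1 // 100
rare_high = 5_000_000 * 4 // 100
print(rare_low, rare_high)  # 40000 200000
```

So roughly three quarters of the variants in the data set are rare, while a single genome carries only tens to hundreds of thousands of them.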

32 of 47

1000 Genomes Project

33 of 47

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

(1000 Genomes Project Consortium, Nature, 2015)

34 of 47

100,000 Genomes Project

In late 2012, Prime Minister David Cameron announced the 100,000 Genomes Project.

The project is run by Genomics England, a company wholly owned and funded by the Department of Health & Social Care.

The project focused on patients with a rare disease and their families and patients with cancer.

The first samples for sequencing were taken from patients living in England, while discussions took place with Scotland, Wales and Northern Ireland about potential future involvement.

38 of 47

A typical genome

differs from the reference human genome at 4.1-5.0 million sites
contains an estimated 2,100 to 2,500 structural variants, affecting ∼20 million bases of sequence

∼1,000 large deletions
∼160 copy-number variants
∼915 Alu insertions
∼128 L1 insertions
∼51 SVA insertions
∼4 NUMTs
∼10 inversions

39 of 47

The majority of variants in the data set are rare:

∼64 million autosomal variants have a frequency <0.5%,
∼12 million have a frequency between 0.5% and 5%,
and only ∼8 million have a frequency >5%

40 of 47

Putatively functional variation

A typical genome contained:

149–182 sites with protein truncating variants,
10,000 to 12,000 sites with peptide-sequence-altering variants,
459,000 to 565,000 variant sites overlapping known regulatory regions:

untranslated regions (UTRs)
promoters
insulators
enhancers
transcription factor binding sites

41 of 47

Single Nucleotide Polymorphism

A Single Nucleotide Polymorphism (SNP), pronounced “snip,” is a genetic variation in which a single nucleotide (A, T, C, or G) is altered and the change is passed on through heredity.

SNP: single DNA base variation found in >1% of the population
Mutation: single DNA base variation found in <1% of the population

SNP:
C T T A G C T T   (94%)
C T T A G T T T

Mutation:
C T T A G C T T   (99.9%)
C T T A G T T T   (0.1%)

From Kun-Mao Chao, National Taiwan University
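The 1% rule above can be written as a one-line check. This is a minimal sketch: the function name `classify_variant` is hypothetical, and because the slide does not say how to treat a frequency of exactly 1%, this version assigns it to "mutation" as an arbitrary choice.

```python
def classify_variant(frequency, threshold=0.01):
    """Label a single-base variant by its population frequency (0 to 1)."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    # Assumption: a frequency of exactly 1% is labeled "mutation".
    return "SNP" if frequency > threshold else "mutation"


print(classify_variant(0.001))  # 0.1% variant from the slide: mutation
print(classify_variant(0.06))   # a variant at 6%: SNP
```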

42 of 47

Mutations and SNPs

[Figure: lineages descending from a common ancestor along a time axis to the present; the observed genetic variations are marked as mutations or SNPs]

From Kun-Mao Chao, National Taiwan University

43 of 47

Single Nucleotide Polymorphism

SNPs are the most frequent form of genetic variation.

About 90% of human genetic variation comes from SNPs.
SNPs occur about every 300-600 base pairs.
Millions of SNPs have been identified (e.g., HapMap and Perlegen).

SNPs have become the preferred markers for association studies because of their high abundance and the availability of high-throughput SNP genotyping technologies.

From Kun-Mao Chao, National Taiwan University
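The spacing figure above implies a rough total number of SNP sites. The sketch below assumes a haploid genome size of about 3.1 billion base pairs, which is not stated on the slide, and the helper name `expected_snp_count` is made up for this example.

```python
GENOME_SIZE_BP = 3_100_000_000  # assumption: approximate haploid human genome


def expected_snp_count(genome_size_bp, spacing_bp):
    """Estimate the number of SNP sites given an average spacing in bp."""
    return genome_size_bp // spacing_bp


for spacing in (300, 600):
    print(spacing, expected_snp_count(GENOME_SIZE_BP, spacing))
# 300 10333333
# 600 5166666
```

Under these assumptions the estimate is about 5 to 10 million SNP sites, consistent with "millions of SNPs" on the slide.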

44 of 47

Single Nucleotide Polymorphism

A SNP is usually assumed to be a binary variable.

The probability of repeat mutation at the same SNP locus is quite small.
Tri-allelic cases are usually attributed to genotyping errors.

The nucleotide at a SNP locus is called

a major allele (if allele frequency > 50%), or
a minor allele (if allele frequency < 50%).

A C T T A G C T T   (94%)   T: Major allele
A C T T A G C T C           C: Minor allele

From Kun-Mao Chao, National Taiwan University
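The major/minor definition can be illustrated by counting alleles in a sample. This is a sketch under assumptions: the sample of 100 chromosomes is invented to match the 94% figure on the slide, and `major_minor` is a hypothetical helper. For an exact 50/50 split neither allele is major by the definition above, and this function simply returns them in the order they were first seen.

```python
from collections import Counter


def major_minor(alleles):
    """Return the major and minor allele of a biallelic SNP with frequencies."""
    counts = Counter(alleles)
    if len(counts) != 2:
        raise ValueError("expected exactly two alleles (biallelic SNP)")
    (major, major_n), (minor, minor_n) = counts.most_common(2)
    total = major_n + minor_n
    return {"major": (major, major_n / total), "minor": (minor, minor_n / total)}


# Hypothetical sample: allele at the last position in 100 chromosomes.
sample = ["T"] * 94 + ["C"] * 6
print(major_minor(sample))
# {'major': ('T', 0.94), 'minor': ('C', 0.06)}
```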

45 of 47

Haplotypes

A haplotype is a set of linked SNPs on the same chromosome.

A haplotype can be treated as a binary string, since each SNP is binary.

SNP₁, SNP₂ and SNP₃ are at positions 2, 5 and 9 of the sequences below.

-A C T T A G C T T-   Haplotype 1: C A T
-A A T T T G C T C-   Haplotype 2: A T C
-A C T T T G C T C-   Haplotype 3: C T C

From Kun-Mao Chao, National Taiwan University
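The haplotype example can be reproduced in a few lines: find the columns where the three sequences differ, then keep only those columns. The binary encoding is an assumption made for illustration: "0" means the same allele as the first sequence and "1" means the other allele. The function names are hypothetical.

```python
sequences = ["ACTTAGCTT", "AATTTGCTC", "ACTTTGCTC"]  # from the slide


def snp_positions(seqs):
    """Return 0-based positions where the aligned sequences differ."""
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]


def haplotypes(seqs):
    """Keep only the SNP columns of each sequence."""
    positions = snp_positions(seqs)
    return ["".join(s[i] for i in positions) for s in seqs]


def to_binary(seqs, reference=0):
    """Encode each haplotype as 0/1 relative to one chosen sequence."""
    positions = snp_positions(seqs)
    ref = seqs[reference]
    return ["".join("0" if s[i] == ref[i] else "1" for i in positions) for s in seqs]


print(snp_positions(sequences))  # [1, 4, 8]
print(haplotypes(sequences))     # ['CAT', 'ATC', 'CTC']
print(to_binary(sequences))      # ['000', '111', '011']
```

The binary form works only because each SNP is assumed to have exactly two alleles; a tri-allelic site would need a different encoding.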
