1 of 38

Introduction to the Human Genome

Applied Computational Genomics, Lecture 03

https://github.com/quinlan-lab/applied-computational-genomics

Aaron Quinlan

Departments of Human Genetics and Biomedical Informatics

USTAR Center for Genetic Discovery

University of Utah

quinlanlab.org

2 of 38

3 of 38

http://labiotech.eu/history-of-biotech-25-years-of-the-human-genome-project/

First genetic map

UU graduate review committee at Alta

1952 - Hershey/Chase

1953 - DNA structure

4 of 38

1978 - Alta meeting - use "markers" to map disease genes

The early concept of "linkage mapping"

Mark Skolnick (U. of Utah)

Kerry Kravitz

(grad student with Skolnick)

David Botstein

5 of 38

1980 - Idea: use RFLPs to build a human linkage map

Ray Gesteland, PhD and Ray White, PhD, Eccles Institute of Human Genetics Building Construction (1989)

6 of 38

http://labiotech.eu/history-of-biotech-25-years-of-the-human-genome-project/

1983 - Wexler and Gusella map the Huntington's disease gene to chromosome 4 using markers

7 of 38

http://labiotech.eu/history-of-biotech-25-years-of-the-human-genome-project/

1985 - PCR invented by Kary Mullis

8 of 38

How many genomes exist in a human cell?

One nuclear genome - *most of the time

  • *Red blood cells lack chromosomes
  • *Many liver cells are polyploid (more than two haploid genomes)

https://www.khanacademy.org/science/biology/structure-of-a-cell/prokaryotic-and-eukaryotic-cells/a/intro-to-eukaryotic-cells

Also many, many mitochondrial genomes

  • Liver cells have 1000-2000 mitochondria.

Corticospinal neurons can be several feet in length. Synapses have high ATP demands - many mitochondria!

9 of 38

The scale of DNA in our body is staggering.

  • A typical human is composed of roughly 40 trillion human cells (excluding trillions of bacterial cells in our gut).
  • If stretched out, each haploid genome (about 3.1 billion base pairs at roughly 0.34 nm per base pair) would be roughly 1 meter long.
  • So, each diploid cell has about 2 meters of DNA.
  • 40 trillion * 2 meters = 80 trillion meters.
  • 80 trillion meters / 1609.34 meters per mile = roughly 49.7 billion miles.
  • 49.7 billion miles / 92,960,000 miles to the sun = roughly 535 trips to the sun.
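
The back-of-the-envelope estimate can be re-derived from base pairs instead of starting from a per-genome length. This is a minimal sketch, not part of the original slides. It assumes about 0.34 nm of length per base pair (the usual figure for B-form DNA), takes the 3.096 billion bp haploid size quoted later in the deck, and treats every one of the 40 trillion cells as diploid (which ignores cells without a nucleus, such as red blood cells). Under these assumptions the haploid genome comes out near 1 meter, so the totals depend strongly on the per-genome length that is plugged in.

```python
# Inputs: only the cell count comes from this slide; the rest are labeled assumptions.
HAPLOID_BP = 3.096e9        # base pairs per haploid genome (figure quoted later in the deck)
RISE_PER_BP_M = 0.34e-9     # assumption: ~0.34 nm of length per base pair (B-form DNA)
CELLS = 40e12               # roughly 40 trillion human cells
METERS_PER_MILE = 1609.34
MILES_TO_SUN = 92_960_000   # mean Earth-sun distance in miles


def dna_length_summary(cells=CELLS, haploid_bp=HAPLOID_BP):
    """Return the stretched-out DNA length per genome, per cell, and per body."""
    haploid_m = haploid_bp * RISE_PER_BP_M   # one haploid genome, in meters
    per_cell_m = 2 * haploid_m               # assumption: every cell is diploid
    total_m = cells * per_cell_m
    total_miles = total_m / METERS_PER_MILE
    return {
        "haploid_m": haploid_m,
        "per_cell_m": per_cell_m,
        "total_m": total_m,
        "total_miles": total_miles,
        "trips_to_sun": total_miles / MILES_TO_SUN,  # one-way trips
    }


summary = dna_length_summary()
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

Changing `haploid_bp`, `cells`, or `RISE_PER_BP_M` shows how sensitive the "trips to the sun" figure is to each input.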

10 of 38

ATCGGGTACCATCCAATCATTACC

Humans are diploid.

ATCGGGAACCATCCAATCATTACC

Our genome is composed of a paternal and a maternal "haplotype". Together, they form our "genotype".
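
To make the haplotype/genotype distinction concrete, the two example sequences on this slide can be compared position by position. This is a small illustrative sketch: the names `hap_a` and `hap_b` are hypothetical, and the slide does not say which sequence is paternal and which is maternal.

```python
# The two haplotype sequences shown on the slide (parental origin not specified).
hap_a = "ATCGGGTACCATCCAATCATTACC"
hap_b = "ATCGGGAACCATCCAATCATTACC"


def genotype(hap1, hap2):
    """Pair up the alleles from two equal-length haplotypes, position by position."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must have the same length")
    return list(zip(hap1, hap2))


def heterozygous_sites(hap1, hap2):
    """Return (1-based position, allele on hap1, allele on hap2) where the alleles differ."""
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(genotype(hap1, hap2), start=1)
        if a != b
    ]


print(heterozygous_sites(hap_a, hap_b))  # → [(7, 'T', 'A')]
```

Every other position is homozygous: both haplotypes carry the same base there.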

11 of 38

Our genome: mini quiz

How many distinct chromosomes are there in the human genome?

24: the 22 autosomes (chromosomes 1-22) plus the two sex chromosomes (X and Y)

How many chromosomes exist in a (typical) haploid human genome?

23: the 22 autosomes (chromosomes 1-22) and one sex chromosome (X or Y)

How many chromosomes exist in a (typical) diploid human genome?

46: two haploid genomes - one from mother and one from father

12 of 38

The human genome

from a macro to micro scale

13 of 38

The human genome

from a macro to micro scale

https://micro.magnet.fsu.edu/cells/nucleus/images/chromatinstructurefigure1.jpg

14 of 38

The human genome - basic stats

http://uswest.ensembl.org/Homo_sapiens/Location/Genome

  • 3.096 billion base pairs (haploid)
  • 20,441 protein coding genes
  • 198,002 coding transcripts (isoforms of a gene that each encode a distinct protein product)

15 of 38

The human karyotype

Parental haploid copy 1

Parental haploid copy 2

Male

16 of 38

The basic structure of a chromosome

The role of the centromere.

Centromeres are required for chromosome separation during cell division. The centromeres are attachment points for microtubules, which are protein fibers that pull duplicate chromosomes toward opposite ends of the cell before it divides. This separation ensures that each daughter cell will have a full set of chromosomes.

Telomeres

http://learn.genetics.utah.edu/content/basics/readchromosomes/

17 of 38

Centromere positioning

http://learn.genetics.utah.edu/content/basics/readchromosomes/

Centromere position can be described three ways: metacentric, submetacentric or acrocentric.

In metacentric chromosomes, the centromere lies near the center of the chromosome. Human chromosomes 1, 3, 16, 19, and 20 are metacentric.

Submetacentric chromosomes have a centromere that is off-center, so that one chromosome arm is longer than the other. The short arm is designated "p" (for petite), and the long arm is designated "q" (because it follows the letter "p"). Human chromosomes 2, 4-12, 17, 18, and X are submetacentric.

In acrocentric chromosomes, the centromere is very near one end. Human chromosomes 13, 14, 15, 21, 22, and Y are acrocentric.

18 of 38

Chromosome Giemsa banding (G-banding)

  • Heterochromatic regions, which tend to be rich in adenine and thymine (AT-rich) and relatively gene-poor, stain more darkly with Giemsa, producing the dark bands in G-banding.
  • Less condensed ("open") chromatin, which tends to be GC-rich and more transcriptionally active, incorporates less Giemsa stain, producing the light bands in G-banding.
  • Cytogenetic bands are labeled p1, p2, p3, q1, q2, q3, etc., counting from the centromere out toward the telomeres. At higher resolutions, sub-bands can be seen within the bands.
  • For example, the locus for the CFTR (cystic fibrosis) gene is 7q31.2, which indicates it is on chromosome 7, q arm, region 3, band 1, and sub-band 2. (Say 7,q,3,1 dot 2)
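
The band notation can be split into its parts mechanically. This is a minimal sketch: the function name `parse_band` and the regular expression are illustrative, and the pattern assumes a single-digit region and a single-digit band, as in the 7q31.2 example. It is not a full parser for cytogenetic nomenclature.

```python
import re

# Chromosome (1-22, X, Y), arm (p or q), one-digit region, one-digit band,
# and an optional sub-band after the dot.
BAND_RE = re.compile(
    r"^(?P<chrom>1[0-9]|2[0-2]|[1-9]|X|Y)"
    r"(?P<arm>[pq])"
    r"(?P<region>\d)(?P<band>\d)"
    r"(?:\.(?P<sub>\d+))?$"
)


def parse_band(locus):
    """Split a locus such as '7q31.2' into chromosome, arm, region, band, and sub-band."""
    match = BAND_RE.match(locus)
    if match is None:
        raise ValueError(f"unrecognized cytogenetic band: {locus!r}")
    return {
        "chromosome": match.group("chrom"),
        "arm": match.group("arm"),
        "region": int(match.group("region")),
        "band": int(match.group("band")),
        "sub_band": match.group("sub"),  # kept as text; None when absent
    }


print(parse_band("7q31.2"))
```

For the CFTR locus this returns chromosome 7, arm q, region 3, band 1, sub-band 2, matching the spoken form "7, q, 3, 1 dot 2".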

https://en.wikipedia.org/wiki/G_banding, https://ghr.nlm.nih.gov/chromosome/1#ideogram

19 of 38

Sequencing a reference human genome. Not the human genome.

20 of 38

21 of 38

Why sequence a reference genome?

  • Determine the "complete" sequence of a human haploid genome.
    • Previously "snippets" of the genome were available.
  • Identify the sequence and location of every protein coding gene.
    • Recall that in 1913, Alfred Sturtevant determined that genes were arranged on chromosomes in a linear fashion, creating the first genetic map for Drosophila. He was an undergrad with TH Morgan.
  • Use as a "map" with which to track the location and frequency of genetic variation in the human genome.
  • Unravel the genetic architecture of inherited and somatic human diseases.
  • Understand genome and species evolution.

22 of 38

Sequencing the first human genome: Sanger method

Key points:

1) sequencing by synthesis (not degradation)

2) radioactive primers hybridize to DNA

3) polymerase + dNTPs + ddNTP terminators at low concentration

4) Add one ddNTP base per reaction/lane, visually interpret ladder

Strengths over chemical sequencing (Gilbert):

1) easier & faster

2) no nasty chemicals

23 of 38

How to sequence a human genome: Lee Hood automation

before

after

read lengths: ~500bp

24 of 38

Sanger sequencing: technological advances

1977: Fred Sanger

1 hardworking technician = 700 bases per day = 118,000 years to sequence the human genome

1985: ABI 370 (first automated sequencer)

5,000 bases per day = 16,000 years

1995: ABI 377 (bigger gels, better chemistry and optics, more sensitive dyes, faster computers)

19,000 bases per day = 4,400 years

1999: ABI 3700 (96 capillaries, 96-well plates, fluid-handling robots)

400,000 bases per day = 205 years
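
The "years to sequence" figures can be reproduced with one division. The slide does not state the total number of bases it assumes. The numbers are consistent with about 3 x 10^10 bases, that is, a roughly 3 Gb genome read with about 10-fold redundancy, so that value is used here as an assumption. Pairing each rate with an instrument (700 bases per day for manual sequencing, 5,000 for the ABI 370, and so on) is also inferred from the slide layout.

```python
# Assumption: ~3 Gb genome x ~10-fold redundancy = 3e10 bases in total.
TOTAL_BASES = 3e10

# Bases per day, as listed on the slide; instrument labels are inferred from the layout.
THROUGHPUT = {
    "1977 manual Sanger (one technician)": 700,
    "1985 ABI 370": 5_000,
    "1995 ABI 377": 19_000,
    "1999 ABI 3700": 400_000,
}


def years_to_sequence(bases_per_day, total_bases=TOTAL_BASES):
    """Years needed for one instrument (or technician) running every day of the year."""
    return total_bases / bases_per_day / 365


for label, rate in THROUGHPUT.items():
    print(f"{label}: about {years_to_sequence(rate):,.0f} years")
```

The results land within rounding of the slide's values (about 117,000, 16,000, 4,300, and 205 years), which supports the 10-fold assumption without proving it.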

25 of 38

Shotgun genome sequencing (Sanger, 1979)

1) Fragment the genome (or large BAC clones)

2) Clone 2-10kb fragments into plasmids; pick lots of colonies; purify DNA from each

3) Use a primer that anneals to the plasmid to sequence into the genomic DNA insert

4) Assemble the genome from overlapping “reads”

ATCGCCGTACTAGCGAGCTTGCGAT

GCTTGCGATAACGCTTCCGTCGAGCCGTAAATCGGCTCGAG

TCGGCTCGAGAAGCTGCTTGCGAAAGCTGT

ATCGCCGTACTAGCGAGCTTGCGATAACGCTTCCGTCGAGCCGTAAATCGGCTCGAGAAGCTGCTTGCGAAAGCTGT
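
The toy example above (three reads merging into one longer sequence) can be reproduced with a greedy overlap merge. This is a simplified sketch for illustration only. The function names and the 5-base minimum overlap are assumptions, and real assemblers must also handle sequencing errors, reverse complements, and repeats.

```python
def overlap(a, b, min_len=5):
    """Length of the longest suffix of `a` that equals a prefix of `b` (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=5):
    """Repeatedly merge the pair of sequences with the longest suffix-prefix overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left that overlaps; stop with several contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = [
    "ATCGCCGTACTAGCGAGCTTGCGAT",
    "GCTTGCGATAACGCTTCCGTCGAGCCGTAAATCGGCTCGAG",
    "TCGGCTCGAGAAGCTGCTTGCGAAAGCTGT",
]
contigs = greedy_assemble(reads)
print(contigs)
```

On these three reads the overlaps are 9 and 10 bases, and the merge yields a single contig identical to the assembled sequence shown on the slide.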

1977: Bacteriophage φX174 (5 kb)

1995: H. influenzae (1.8 Mb)

1996: Yeast (12 Mb)

2000: Drosophila (165 Mb)

2001: Human draft (3 Gb)

26 of 38

The competing human genome projects (this was war)

Public (Universities)

1990-2001 (2003)

3 billion dollars

Celera Corporation

1999-2001 (2003)

300 million dollars

ca. 150kb segments amplified in BACs

Note: "Much of the sequence (>70%) of the reference genome produced by the public HGP came from a single anonymous male donor from Buffalo, New York (RP11)."

https://en.wikipedia.org/wiki/Human_Genome_Project

27 of 38

The competing human genome projects (this was war)

28 of 38

A first map of the human genome

http://www.nature.com/nature/journal/v409/n6822/full/409860a0.html

29 of 38

A first map of the human genome ("build 1")

http://www.nature.com/nature/journal/v409/n6822/full/409860a0.html

30 of 38

CGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTCTTTTCGTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGTAAGTAGTGCTTGTGCTCATCTCCTTGGCTGTGATACGTGGCCGGCCCTCGCTCCAGCAGCTGGACCCCTACCTGCCGTCTGCTGCCATCGGAGCCCAAAGCCGGGCTGTGACTGCTCAGACCAGCCGGCTGGAGGGAGGGGCTCAGCAGGTCTGGCTTTGGCCCTGGGAGAGCAGGTGGAAGATCAGGCAGGCCATCGCTGCCACAGAACCCAGTGGATTGGCCTAGGTGGGATCTCTGAGCTCAACAAGCCCTCTCTGGGTGGTAGGTGCAGAGACGGGAGGGGCAGAGCCGCAGGCACAGCCAAGAGGGCTGAAGAAATGGTAGAACGGAGCAGCTGGTGATGTGTGGGCCCACCGGCCCCAGGCTCCTGTCTCCCCCCAGGTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGTGAGAGGAGAGTAGACAGTGAGTGGGAGTGGCGTCGCCCCTAGGGCTCTACGGGGCCGGCGTCTCCTGTCTCCTGGAGAGGCTTCGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCATCTGGAGCCCTGCTGCTTGCGGTGGCCTATAAAGCCTCCTAGTCTGGCTCCAAGGCCTGGCAGAGTCTTTCCCAGGGAAAGCTACAAGCAGCAAACAGTCTGCATGGGTCATCCCCTTCACTCCCAGCTCAGAGCCCAGGCCAGGGGCCCCCAAGAAAGGCTCTGGTGGAGAACCTGTGCATGAAGGCTGTCAACCAGTCCATAGGCAAGCCTGGCTGCCTCCAGCTGGGTCGACAGACAGGGGCTGGAGAAGGGGAGAAGAGGAAAGTGAGGTTGCCTGCCCTGTCTCCTACCTGAGGCTGAGGAAGGAGAAGGGGATGCACTGTTGGGGAGGCAGCTGTAACTCAAAGCCTTAGCCTCTGTTCCCACGAAGGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGGCACCCTGTCCTGGACACGCTGTTGGCCTGGATCTGAGCCCTGGTGGAGGTCAAAGCCACCTTTGGTTCTGCCATTGCTGCTGTGTGGAAGTTCACTCCTGCCTTTTCCTTTCCCTAGAGCCTCCACCACCCCGAGATCACATTTCTCACTGCCTTTTGTCTGCCCAGTTTCACCAGAAGTAGGCCTCTTCCTGACAGGCAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCCAGAGTGTTGCCAGGACCCAGGCACAGGCATTAGTGCCCGTTGGAGAAAACAGGGGAATCCCGAAGAAATGGTGGGTCCTGGCCATCCGTGAGATCTTCCCAGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGC
ACAGGCAGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTTGAAGCTGGTCTCCACACAGTGCTGGTTCCGTCACCCCCTCCCAAGGAAGTAGGTCTGAGCAGCTTGTCCTGGCTGTGTCCATGTCAGAGCAACGGCCCAAGTCTGGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTACGATTCCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGTCTGACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCCAGGTCCTTTCCCAGAGATGCCTGGAGGGAAAAGGCTGAGTGAGGGTGGTTGGTGGGAAACCCTGGTTCCCCCAGCCCCCGGAGACTTAAATACAGGAAGAAAAAGGCAGGACAGAATTACAAGGTGCTGGCCCAGGGCGGGCAGCGGCCCTGCCTCCTACCCTTGCGCCTCATGACCGGAGCCATAGCCCAGGCAGGAGGGCTGAGGACCTCTGGTGGCGGCCCAGGGCTTCCAGCATGTGCCCTAGGGGAAGCAGGGGCCAGCTGGCAAGAGCAGGGGGTGGGCAGAAAGCACCCGGTGGACTCAGGGCTGGAGGGGAGGAGGCGATCTTGCCCAAGGCCCTCCGACTGCAAGCTCCAGGGCCCGCTCACCTTGCTCCTGCTCCTTCTGCTGCTGCTTCTCCAGCTTTCGCTCCTTCATGCTGCGCAGCTTGGCCTTGCCGATGCCCCCAGCTTGGCGGATGGACTCTAGCAGAGTGGCCAGCCACCGGAGGGGTCAACCACTTCCC

31 of 38

Gene content

http://www.nature.com/nature/journal/v409/n6822/full/409860a0.html

"There appear to be about 30,000-40,000 protein-coding genes in the human genome -- only about twice as many as in worm or fly. However, the genes are more complex, with more alternative splicing generating a larger number of protein products." (Over time, this estimate has been revised down to approximately 20,000 protein-coding genes, which is roughly the number of genes in fly and worm.)

32 of 38

Only about 2% of the human genome encodes proteins.

https://genome.ucsc.edu

33 of 38

About half of the human genome consists of repeats

http://www.nature.com/nrg/journal/v10/n10/pdf/nrg2640.pdf

Retrotransposons use a "copy/paste" mechanism

DNA transposons use a "cut/paste" mechanism

McClintock's "jumping genes" in maize

34 of 38

About half of the human genome consists of repeats

http://www.nature.com/nrg/journal/v10/n10/pdf/nrg2640.pdf

Repetitive DNA not driven by retrotransposition (e.g., ATATATATATATATATAT…)
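
Simple repeats like the ATATAT… example can be spotted with a regular expression that uses a backreference. This is a minimal sketch that looks only for dinucleotide units repeated at least four times. The threshold and the function name `find_dinucleotide_repeats` are arbitrary choices for illustration, and dedicated repeat-finding tools do considerably more.

```python
import re

# A 2-base unit followed by at least 3 more copies of itself (4 or more copies in total).
DINUC_REPEAT = re.compile(r"([ACGT]{2})\1{3,}")


def find_dinucleotide_repeats(seq):
    """Return (0-based start, repeat unit, number of copies) for each repeat run found."""
    return [
        (m.start(), m.group(1), len(m.group(0)) // 2)
        for m in DINUC_REPEAT.finditer(seq.upper())
    ]


toy = "GG" + "AT" * 8 + "CC"   # made-up sequence with an (AT)8 run
print(find_dinucleotide_repeats(toy))  # → [(2, 'AT', 8)]
```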

35 of 38

GC content varies dramatically in the genome

http://www.nature.com/nrg/journal/v10/n10/pdf/nrg2640.pdf

Region from chromosome 1

GC content

Each point is 20kb

Each point is 2kb

Each point is 200 bp

Why are there no points here?
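
GC content per window is simple to compute, and doing so hints at one way a plot can end up with missing points. In this sketch a window made up entirely of `N` placeholder bases (as used for unsequenced gaps in a reference) has no defined GC fraction and is reported as `None`. Whether that explains the empty stretch in this particular figure is an assumption. The toy sequence and 10-base window are made up for illustration.

```python
def gc_content(seq):
    """GC fraction among called bases (A, C, G, T); None if the window has none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    called = gc + at
    return gc / called if called else None


def windowed_gc(seq, window):
    """GC fraction in consecutive, non-overlapping windows (the last one may be shorter)."""
    return [gc_content(seq[i:i + window]) for i in range(0, len(seq), window)]


toy = "GCGCGCGCAT" + "N" * 10 + "ATATATATGC"
print(windowed_gc(toy, 10))  # → [0.8, None, 0.2]
```

Smaller windows give a noisier profile, which matches the pattern across the 20 kb, 2 kb, and 200 bp panels.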

36 of 38

CpG islands - clusters of CG dinucleotides

(The "p" represents the phosphate bond between adjacent nucleotides on the same strand. It is needed to distinguish a CpG dinucleotide from the hydrogen-bonded C-G base pair formed across complementary DNA strands.)

http://www.nature.com/nrg/journal/v10/n10/pdf/nrg2640.pdf

http://missinglink.ucsf.edu/lm/genes_and_genomes/methylation.html

ATGTCGTAATCTCGAA

m

Methylated cytosine

Unmethylated cytosine
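
Counting CpG dinucleotides on one strand takes only a few lines. The sketch below also computes the observed/expected CpG ratio, (number of CpG x length) / (number of C x number of G), which is a commonly used heuristic for flagging CpG islands. That formula is not from these slides, and the short example sequence is far too small to be classified as an island.

```python
def cpg_stats(seq):
    """Count CpG dinucleotides and compute GC fraction and observed/expected CpG ratio."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # C immediately followed by G on the same strand
    return {
        "length": n,
        "cpg_count": cpg,
        "gc_fraction": (c + g) / n if n else 0.0,
        # observed CpG count divided by the count expected from base composition alone
        "obs_exp_cpg": (cpg * n) / (c * g) if c and g else 0.0,
    }


stats = cpg_stats("ATGTCGTAATCTCGAA")  # sequence shown on the slide
print(stats)
```

The slide's sequence contains two CpG sites (in ...GT**CG**TA... and ...CT**CG**AA...), each of which is a candidate position for cytosine methylation.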

37 of 38

CpG island content throughout the genome

http://www.nature.com/nrg/journal/v10/n10/pdf/nrg2640.pdf

Chromosome 19 is the most gene dense chromosome in the human genome

38 of 38

The human reference genome continues to change.

  • Ongoing efforts to fill "gaps" and properly/thoroughly represent complex structures and loci in the genome (e.g., Major Histocompatibility Complex)
  • Each improvement leads to a new genome "build". Currently on build 38.
  • Experimental and computational methods provide new genome annotations
    • New gene models, transcription factor binding sites, and loci where human individuals differ (i.e., polymorphisms)
  • Therefore, the human reference genome is by no means "complete"!
  • How does the same genome yield such phenotypic diversity across tissue types?
  • How does the genome evolve within an individual (tissues) and among a population?