1 of 39

Hae Kyung Im, PhD

Principal Component Analysis and Population Structure

April 11, 2022

2 of 39

Genotype Matrix, a Treasure Trove

3 of 39

Principal Components Reveal Demographic History

J. Novembre et al., “Genes mirror geography within Europe,” Nature, vol. 456, no. 7218, pp. 98–101, Aug. 2008.

4 of 39

Could Population Structure Bias GWAS Results?

5 of 39

Spurious Association Due to Population Structure

Figure: case and control groups.

6 of 39

Spurious Association Due to Population Structure

Figure: case and control groups annotated with minor allele frequencies (maf). Labels shown: maf=50%, maf=25%, maf=50%, maf=25%, maf=40%, maf=35%.
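One possible reading of the maf labels on this slide, offered as an assumption rather than the author's stated setup: two subpopulations have minor allele frequencies of 50% and 25%, and the two study groups draw from them in different proportions. The mixing proportions below (60/40 and 40/60) are hypothetical; they are chosen because they reproduce the pooled values of 40% and 35%. The allele frequency then differs between groups even though the variant has no effect within either subpopulation.

```python
# Minor allele frequencies of two subpopulations (values taken from the slide labels).
maf_pop1, maf_pop2 = 0.50, 0.25


def pooled_maf(frac_pop1):
    """MAF of a sample that draws a fraction frac_pop1 of its members from population 1."""
    return frac_pop1 * maf_pop1 + (1 - frac_pop1) * maf_pop2


# Hypothetical mixing proportions (not stated in the source).
maf_group_a = pooled_maf(0.6)  # 60% from population 1, 40% from population 2
maf_group_b = pooled_maf(0.4)  # 40% from population 1, 60% from population 2

print(f"group A maf = {maf_group_a:.2f}, group B maf = {maf_group_b:.2f}")
```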

7 of 39

How to Correct for Population Structure?

1. Correcting with genomic control (Devlin and Roeder 1999)

2. Inferring the latent sub-populations (Pritchard et al. 2000): fit the association in each population separately and combine

3. Adjusting for principal components (Patterson et al. 2006, Novembre et al. 2008, Price et al. 2010)

4. Mixed effects modeling (EMMAX, Kang et al. 2010)
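As a minimal sketch of option 1, genomic control is commonly described as dividing each 1-degree-of-freedom chi-squared statistic by an inflation factor lambda, the ratio of the observed median statistic to the median expected under the null (about 0.455). The function name and the input statistics below are made up for illustration, and flooring lambda at 1 is a common convention assumed here.

```python
from statistics import median

# Median of a chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chi2_stats):
    """Return (lambda, corrected statistics) for 1-df chi-squared association statistics."""
    lam = median(chi2_stats) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)  # assumed convention: never inflate statistics when lambda < 1
    return lam, [s / lam for s in chi2_stats]


# Made-up statistics whose median is roughly twice the null median.
chi2_stats = [0.2, 0.5, 0.91, 1.3, 4.0]
lam, corrected = genomic_control(chi2_stats)
print(f"lambda = {lam:.3f}")
```

After correction the median statistic matches the null median again; the ranking of variants is unchanged.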

8 of 39

Principal Component Analysis

9 of 39

Principal Component Analysis (SVD)

$$\text{DATA}_{n \times M} = u_1 d_1 v_1' + u_2 d_2 v_2' + \cdots$$
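The decomposition on this slide can be checked numerically. The sketch below uses NumPy's `np.linalg.svd` on a small random matrix (toy data, not genotypes) and rebuilds the matrix as a sum of rank-one terms u_i d_i v_i'. The helper `rank_k` is a name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # toy n x M data matrix

# Thin SVD: columns of U are u_i, d holds d_i (largest first), rows of Vt are v_i'.
U, d, Vt = np.linalg.svd(X, full_matrices=False)


def rank_k(k):
    """Sum of the first k rank-one terms u_i d_i v_i'."""
    return sum(d[i] * np.outer(U[:, i], Vt[i]) for i in range(k))


full = rank_k(len(d))  # using every term recovers X
errors = [np.linalg.norm(X - rank_k(k)) for k in range(1, len(d) + 1)]
print(np.allclose(X, full), np.round(errors, 3))
```

Each added term can only shrink the reconstruction error, which is why the first few components are used as a low-dimensional summary.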

10 of 39

Geometric Interpretation of Singular Value Decomposition

Figure: geometric interpretation of the SVD (panels labeled X and D).

11 of 39

Example Population Structure

12 of 39

HapMap Project

An international project to create a haplotype map of the human genome

13 of 39

1000 Genomes Project

Auton, A., Altshuler, D. M., Durbin, R. M., Chakravarti, A., Clark, A. G., Donnelly, P., et al. (2015). A global reference for human genetic variation. Nature, 526(7571), 68–74. http://doi.org/10.1038/nature15393

14 of 39

HapMap Phase 3 Populations

15 of 39

HapMap Phase 3 Populations

16 of 39

Population Structure in HapMap

https://hakyimlab.github.io/hgen471/L6-population-structure.html

17 of 39

Population Structure in HapMap

18 of 39

Population Structure in HapMap

19 of 39

Population Structure in HapMap

20 of 39

PCA in UK Biobank

21 of 39

GWAS in Multi-ancestry Samples

22 of 39

Example: Growth Phenotype by Population

H. K. Im et al., “Mixed effects modeling of proliferation rates in cell-based models: consequence for pharmacogenomics and cancer,” PLoS Genetics, 2012.

23 of 39

Example: Growth Phenotype by Population

24 of 39

Population Differences Lead to Inflation of Small P-values

25 of 39

Population Differences Lead to Inflation of Small P-values

26 of 39

Population Differences Lead to Inflation of Small P-values

27 of 39

What happens if we add principal components as covariates in the regression?
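A small simulation can illustrate the question. Everything below is a hypothetical setup, not the growth data from the lecture: two populations with different allele frequencies, a phenotype whose mean differs by population, and no true SNP effect. The SNP coefficient is estimated by least squares with and without the first principal component as a covariate. The function name `snp_effect` and all parameter values are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_pop, n_snps = 200, 100
pop = np.repeat([0, 1], n_per_pop)  # population label per individual

# Allele frequencies differ by population at every SNP (assumed for illustration).
freq = np.vstack([rng.uniform(0.1, 0.5, n_snps), rng.uniform(0.5, 0.9, n_snps)])
freq[0, 0], freq[1, 0] = 0.2, 0.8  # fix the tested SNP's frequencies
G = rng.binomial(2, freq[pop]).astype(float)  # genotype matrix, individuals x SNPs

# Phenotype depends on population only; no SNP has a causal effect.
y = 2.0 * pop + rng.normal(size=pop.size)

# First principal component from the SVD of the column-centered genotype matrix.
U, d, Vt = np.linalg.svd(G - G.mean(axis=0), full_matrices=False)
pc1 = U[:, 0]


def snp_effect(snp, covariates):
    """Least-squares coefficient of snp in y ~ intercept + snp + covariates."""
    X = np.column_stack([np.ones_like(snp), snp] + covariates)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


beta_unadjusted = snp_effect(G[:, 0], [])
beta_adjusted = snp_effect(G[:, 0], [pc1])
print(f"unadjusted: {beta_unadjusted:.3f}, adjusted for PC1: {beta_adjusted:.3f}")
```

In this setup the unadjusted coefficient reflects the population difference in phenotype, while the PC-adjusted coefficient should sit near zero.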

28 of 39

Growth GWAS Adjusted with PCs

29 of 39

Heritability

30 of 39

Types of Heritability

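  • Broad sense heritability: var(G)/var(P), where G is the total genetic value and P is the phenotype
  • Narrow sense heritability: var(A)/var(P), where A is the additive genetic value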
  • Chip heritability: additive component captured by the chip/imputation
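A worked example with made-up variance components may help separate the two ratios. It assumes the phenotype splits into independent additive, non-additive (dominance), and environmental parts, so the variances add. All numbers and variable names are hypothetical.

```python
# Hypothetical variance components (assumed independent, so variances add).
var_additive = 0.3      # additive genetic variance
var_dominance = 0.1     # non-additive genetic variance
var_environment = 0.6   # environmental variance

var_genetic = var_additive + var_dominance
var_phenotype = var_genetic + var_environment

broad_sense = var_genetic / var_phenotype    # total genetic share of phenotypic variance
narrow_sense = var_additive / var_phenotype  # additive genetic share only

print(f"broad sense = {broad_sense:.2f}, narrow sense = {narrow_sense:.2f}")
```

Narrow sense heritability can never exceed broad sense heritability, since the additive variance is one part of the total genetic variance.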

31 of 39

Review

Matrix Algebra

32 of 39

Matrix Algebra

Addition

Scalar Multiplication

Transposition
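The three operations can be tried on small matrices. This is a minimal NumPy sketch with made-up values.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[10, 20], [30, 40]])

total = A + B       # addition: element by element, shapes must match
scaled = 3 * A      # scalar multiplication: every entry times 3
transposed = A.T    # transposition: rows become columns

print(total, scaled, transposed, sep="\n")
```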

33 of 39

Matrix Multiplication

Figure credit: User:Bilou, derived from Matrix multiplication diagram.svg, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15175268
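The rule the diagram depicts, entry (i, j) of the product is the dot product of row i of the first matrix with column j of the second, can be written out directly. The function `matmul` below is an illustrative helper using plain lists, not a library routine.

```python
def matmul(A, B):
    """Multiply an n x m matrix A by an m x p matrix B, both given as lists of rows."""
    n, m, p = len(A), len(B), len(B[0])
    if any(len(row) != m for row in A):
        raise ValueError("number of columns of A must equal number of rows of B")
    # Entry (i, j) is the dot product of row i of A and column j of B.
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)] for i in range(n)]


C = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C)
```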

34 of 39

Matrix Multiplication

35 of 39

Matrix Form of System of Linear Equations

https://en.wikipedia.org/wiki/Matrix_(mathematics)
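As a small illustration, the made-up system 2x + y = 5, x + 3y = 10 can be written as A b = c and solved with NumPy's `np.linalg.solve`. The names A, b, and c are chosen here for the example.

```python
import numpy as np

# Coefficient matrix and right-hand side of:
#   2x +  y = 5
#    x + 3y = 10
A = np.array([[2.0, 1.0], [1.0, 3.0]])
c = np.array([5.0, 10.0])

b = np.linalg.solve(A, c)  # solves A b = c without forming the inverse explicitly
print(b)
```

Substituting back, A @ b reproduces c, which is a quick check on any solution.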

36 of 39

Derive Linear Regression Solution with Matrix Notation
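One standard route to the solution is sketched below. The symbols are assumptions introduced here: y is the n-vector of outcomes, X the n x p design matrix, and beta the coefficient vector. The last step assumes X'X is invertible.

```latex
\begin{aligned}
\mathrm{RSS}(\beta) &= (y - X\beta)'(y - X\beta) \\
\frac{\partial\,\mathrm{RSS}}{\partial \beta} &= -2X'(y - X\beta) = 0 \\
X'X\beta &= X'y \\
\hat{\beta} &= (X'X)^{-1}X'y
\end{aligned}
```

The third line is the system of normal equations; it has the matrix form of a system of linear equations in beta.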

37 of 39

Hardy Weinberg Equilibrium

38 of 39

Hardy Weinberg Equilibrium

  • Random mating
  • Genes inherited from the father are independent of those inherited from the mother
  • If p = minor allele frequency of a biallelic SNP
    • Suppose the variants seen in the population are A and C
    • Allele frequency of A is 20% and C is 80%
    • Minor allele frequency p = ?
    • p(AA) = p^2
    • p(AC) = 2*p*(1-p)
    • p(CC) = (1-p)^2
  • Tested using a chi-squared test
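The bullets above can be worked through numerically. In the sketch below the genotype counts are hypothetical, chosen so that the A allele frequency is 20% (A is then the minor allele, so p = 0.2). The test is a chi-squared goodness-of-fit test with 1 degree of freedom, whose p-value is computed with `math.erfc`; the function name `hwe_chi2` is introduced here for illustration.

```python
import math


def hwe_chi2(n_aa, n_ac, n_cc):
    """Chi-squared test of Hardy-Weinberg proportions from genotype counts."""
    n = n_aa + n_ac + n_cc
    p = (2 * n_aa + n_ac) / (2 * n)  # frequency of allele A
    expected = [p**2 * n, 2 * p * (1 - p) * n, (1 - p) ** 2 * n]
    observed = [n_aa, n_ac, n_cc]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p, expected, chi2, p_value


# Hypothetical counts for 200 individuals.
p, expected, chi2, p_value = hwe_chi2(n_aa=10, n_ac=60, n_cc=130)
print(f"p = {p:.2f}, expected = {[round(e, 1) for e in expected]}")
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.3f}")
```

With p = 0.2 the expected genotype proportions are 4%, 32%, and 64%, matching p^2, 2*p*(1-p), and (1-p)^2.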

39 of 39

  • The Hardy-Weinberg equilibrium is a principle stating that the genetic variation in a population will remain constant from one generation to the next in the absence of disturbing factors. When mating is random in a large population with no disruptive circumstances, the law predicts that both genotype and allele frequencies will remain constant because they are in equilibrium.
  • The Hardy-Weinberg equilibrium can be disturbed by a number of forces, including mutations, natural selection, nonrandom mating, genetic drift, and gene flow. For instance, mutations disrupt the equilibrium of allele frequencies by introducing new alleles into a population. Similarly, natural selection and nonrandom mating disrupt the Hardy-Weinberg equilibrium because they result in changes in gene frequencies. This occurs because certain alleles help or harm the reproductive success of the organisms that carry them. Another factor that can upset this equilibrium is genetic drift, which occurs when allele frequencies grow higher or lower by chance and typically takes place in small populations. Gene flow, which occurs when breeding between two populations transfers new alleles into a population, can also alter the Hardy-Weinberg equilibrium.
  • Because all of these disruptive forces commonly occur in nature, the Hardy-Weinberg equilibrium rarely applies in reality. Therefore, the Hardy-Weinberg equilibrium describes an idealized state, and genetic variations in nature can be measured as changes from this equilibrium state.

https://www.nature.com/scitable/definition/hardy-weinberg-equilibrium-122/