1 of 27

Welcome (back) to Stat 494: Statistical Genetics!

While we wait to get started...

  • Sit anywhere you want (just avoid the back two corner tables, please!)
  • Catch up on recent messages on Slack
  • Plan your Capstone Days schedule!
  • Review your notes and prepare for today's Journal Club discussion

2 of 27

Goals for today

  • Journal Club 5: application of PCA to genetic data
    • Infer genetic similarity (& geographic location?)
    • Avoid spurious associations caused by "population structure"
  • Intro to PCA
    • Overview / Stat 253 review
    • Intuition: what do top PCs tend to capture when applied to genetic data?
  • Lab 4 work time
  • Next time:
    • Connecting PCA to questions of causal inference: confounding, collider variables (DAGs)

3 of 27

Journal Club #5

4 of 27

Topic: PCA

An early (and very well-known!) application of Principal Component Analysis to genetic data

Discussion Leaders: Lucas, Noah, Tam

5 of 27

Journal Club Debrief

Submit and upvote questions here →


6 of 27

Stretch Break!

7 of 27

Journal Club Debrief

Key Points:

  • When PCA is applied to genetic data, top PCs often reflect genetic similarity / ancestry / geographic location / "population structure"
    • Why?? More soon!
  • Genetic variation exists along a continuum, not discrete clusters!

8 of 27

Journal Club Debrief

Key Points (continued):

  • When phenotypes are associated with geography or other aspects of "population structure," spurious associations may arise in GWAS (at SNPs that are likewise associated with geography / population structure)
  • Including PCs as covariates in GWAS models can help fix this
    • Why?? More soon!
  • Simulation studies are a powerful and ubiquitous tool in statistics research!!

(Image source: Supplementary Material)
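
The confounding story can itself be shown with a tiny simulation (a Python/numpy sketch; the lab uses R). Every number below is invented for illustration: the phenotype depends only on population membership, yet a naive regression finds an "effect" at a SNP whose allele frequency differs between populations, and adjusting for structure (here the known population label stands in for a PC) removes it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Population membership confounds the SNP-phenotype relationship:
# allele frequency AND phenotype mean both differ by population
pop = np.repeat([0.0, 1.0], n // 2)
snp = rng.binomial(2, np.where(pop == 0, 0.2, 0.8)).astype(float)
y = 1.0 * pop + rng.normal(size=n)  # the SNP has NO real effect on y

# Naive GWAS-style regression: y ~ snp (spurious association)
b_naive = np.polyfit(snp, y, 1)[0]

# Structure-adjusted regression: y ~ snp + pop, via least squares
X = np.column_stack([np.ones(n), snp, pop])
b_adj = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(f"naive slope: {b_naive:.3f}, adjusted slope: {b_adj:.3f}")
```

In a real GWAS we do not observe population labels; the article's key move is that top PCs computed from the genotypes can play this adjustment role.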

9 of 27

Journal Club Debrief

However…

  • Sensitive to pre-processing and may capture other data "artifacts" / features (e.g., related individuals, linkage disequilibrium)
  • Can be difficult to decide how many PCs are needed to fully capture this "structure"
  • This is something I've been thinking a lot about for the past 5+ years! →

10 of 27

Journal Club Debrief

However…

  • Sensitive to pre-processing and may capture other data "artifacts" / features (e.g., related individuals, linkage disequilibrium)
  • Can be difficult to decide how many PCs are needed to fully capture this "structure"
  • This is something I've been thinking a lot about for the past 5+ years! →
  • In light of last week's journal club discussion, what other comments, questions, or concerns do you have about this article?

11 of 27

Journal Club Debrief

Connection to upcoming content:

"the PCA-based methods used here are based on genotypic patterns of variation and do not take advantage of signatures of population structure that are contained in patterns of haplotype variation"

p. 100, last paragraph before Methods Summary, when discussing limitations and areas for future work

12 of 27

Genetic Ancestry

  • Local ancestry: at a specific position along the genome
    • I inherited this part of my genome from an ancestor in category X
  • Global ancestry: across the whole genome
    • I inherited this proportion of my genome from an ancestor in category X
    • PCA captures global ancestry

13 of 27

Journal Club Debrief - what's next?

Journal Club #6:

  • Date: March (!!) 4
  • Readings: application of global and local ancestry inference to 23andMe data
  • Leaders: Nick, Ronan, Sam
  • Instructions: see the course website
    • Remember to share slides with me (leaders) and submit at least one question to Slido (everyone else) before class

14 of 27

Understanding PCA

15 of 27

Principal component analysis

  • PCA is a widely used technique for dimension reduction
  • The goal of dimension reduction is to find a way to represent the information within our data using fewer variables
  • With PCA, we are specifically looking for the linear transformation of our original data that explains the most variability (variability = information) and produces a new set of uncorrelated (orthogonal) features
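
As a concrete toy illustration (a numpy sketch rather than the R we use in lab): three observed variables that are all noisy copies of one hidden factor can be compressed to essentially one new variable.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: 3 observed variables, each a noisy copy of 1 hidden factor
z = rng.normal(size=200)
X = np.column_stack([z + 0.1 * rng.normal(size=200) for _ in range(3)])

# PCA = eigendecomposition of the covariance matrix of the centered data
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
evals = evals[::-1]  # eigh sorts ascending; flip to descending

# Fraction of total variability explained by each PC
var_explained = evals / evals.sum()
print(var_explained)  # PC1 alone captures almost all of the information
```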

16 of 27

Principal component analysis

Source: ISLR Figure 6.14

17 of 27

Principal component analysis

Source: ISLR Figure 6.15

18 of 27

Principal component analysis

  • PCA finds new variables (called principal components, or PCs) that are linear combinations of our original ones:

PC_1 = a_11 x_1 + a_12 x_2 + … + a_1p x_p

PC_2 = a_21 x_1 + a_22 x_2 + … + a_2p x_p

⋮

PC_p = a_p1 x_1 + a_p2 x_2 + … + a_pp x_p

19 of 27

STAT 253 review (or preview)

Discuss at your table:

  • What are scores?
  • What are loadings?
  • Does the order of the PCs have any particular meaning?

Check out these Stat 253 materials if you need a refresher:

https://kegrinde.github.io/stat253_coursenotes/

(Unit 6 > 19 Principal Component Analysis)

PC_1 = a_11 x_1 + a_12 x_2 + … + a_1p x_p

PC_2 = a_21 x_1 + a_22 x_2 + … + a_2p x_p

⋮

PC_p = a_p1 x_1 + a_p2 x_2 + … + a_pp x_p

20 of 27

Don't peek at the next slide

21 of 27

PCA vocab

  • The principal components are the new variables

PC_1 = a_11 x_1 + a_12 x_2 + … + a_1p x_p

PC_2 = a_21 x_1 + a_22 x_2 + … + a_2p x_p

⋮

PC_p = a_p1 x_1 + a_p2 x_2 + … + a_pp x_p

  • The scores are the values that these new variables take

PC_1i = a_11 x_1i + a_12 x_2i + … + a_1p x_pi

  • The loadings are the weights/coefficients a_11, a_12, …, a_pp. They describe the contribution of each of the original variables to the new PCs
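
In code (a numpy sketch with simulated data; in R, `prcomp()` returns the same pieces as `$rotation` for loadings and `$x` for scores):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))  # correlated toy data
Xc = X - X.mean(axis=0)

# Loadings: eigenvectors of the covariance matrix (one column per PC)
evals, loadings = np.linalg.eigh(np.cov(Xc, rowvar=False))
loadings = loadings[:, ::-1]  # reorder so PC1 comes first

# Scores: the values each observation takes on the new variables
scores = Xc @ loadings

# Each loading vector has unit length, and scores on different PCs
# are uncorrelated -- exactly the properties described above
print(np.linalg.norm(loadings[:, 0]))
print(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1])
```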

22 of 27

Principal component analysis

  • PC1 is the linear combination of our original variables that explains the most variability in our data: considering all possible linear combinations

𝜙_1 x_1 + 𝜙_2 x_2 + … + 𝜙_p x_p ,

PC_1 = a_11 x_1 + a_12 x_2 + … + a_1p x_p is the one with the highest variance (subject to the constraint that 𝜙_1² + 𝜙_2² + … + 𝜙_p² = 1)

23 of 27

Principal component analysis

  • PC1 is the linear combination of our original variables that explains the most variability in our data: considering all possible linear combinations

𝜙_1 x_1 + 𝜙_2 x_2 + … + 𝜙_p x_p ,

PC_1 = a_11 x_1 + a_12 x_2 + … + a_1p x_p is the one with the highest variance (subject to the constraint that 𝜙_1² + 𝜙_2² + … + 𝜙_p² = 1)

  • PC2 is then the linear combination of our original variables that has the next largest variance, subject to the constraint of being uncorrelated with (i.e., perpendicular or orthogonal to) PC1

24 of 27

Principal component analysis

  • PC1 is the linear combination of our original variables that explains the most variability in our data: considering all possible linear combinations

𝜙_1 x_1 + 𝜙_2 x_2 + … + 𝜙_p x_p ,

PC_1 = a_11 x_1 + a_12 x_2 + … + a_1p x_p is the one with the highest variance (subject to the constraint that 𝜙_1² + 𝜙_2² + … + 𝜙_p² = 1)

  • PC2 is then the linear combination of our original variables that has the next largest variance, subject to the constraint of being uncorrelated with (i.e., perpendicular or orthogonal to) PC1
  • etc.
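
These defining properties are easy to check numerically (a Python/numpy sketch with made-up data): no unit-norm linear combination achieves a higher variance than PC1, and PC2 scores are uncorrelated with PC1 scores.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
Xc = X - X.mean(axis=0)
S = np.cov(Xc, rowvar=False)

# PC loadings = eigenvectors of S; eigenvalues = variances of the PCs
evals, evecs = np.linalg.eigh(S)
var_pc1 = evals[-1]  # eigh sorts ascending, so the last one belongs to PC1

# No random unit-norm combination phi beats the variance of PC1
for _ in range(1000):
    phi = rng.normal(size=5)
    phi /= np.linalg.norm(phi)  # enforce phi_1^2 + ... + phi_p^2 = 1
    assert phi @ S @ phi <= var_pc1 + 1e-9

# PC1 and PC2 scores are uncorrelated
scores = Xc @ evecs[:, ::-1]
print(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1])
```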

25 of 27

Lab 4: principal component analysis

  • Find Lab 4 on the course website
  • Copy-paste the code into a new QMD in RStudio
  • Add your name to the YAML header!
  • Work through the questions, discussing at your table as you go
    • Part 1: Building (our PCA) Intuition ← goal: finish this part before our next class
    • Part 2: Impact on GWAS
    • Part 3: Application to Real Genetic Data (Learn a New R Package!)

26 of 27

What's Next?

27 of 27

What's Next?

  • Reminder: no class on Thursday!
    • Attend (or give) 3 capstone talks instead
    • Bring your Capstone Reflection to class next Tuesday
  • Prep for Journal Club 6
    • Remember to submit at least one question to the Slido before class
  • Continue brainstorming project topics:
    • Will send out a Project Preferences Survey next week
  • Coming soon: midterm learning reflection & one-on-one conferences