1 of 27

Welcome (back) to Stat 494: Statistical Genetics!

While we wait to get started...

  • Sit anywhere you want (just avoid the back two corner tables, please!)
  • Catch up on recent messages on Slack
  • Plan your Capstone Days schedule!
  • Review your notes and prepare for today's Journal Club discussion

2 of 27

Goals for today

  • Journal Club 5: application of PCA to genetic data
    • Infer genetic similarity (& geographic location?)
    • Avoid spurious associations caused by "population structure"
  • Intro to PCA
    • Overview / Stat 253 review
    • Intuition: what do top PCs tend to capture when applied to genetic data?
  • Lab 4 work time
  • Next time:
    • Connecting PCA to questions of causal inference: confounding, collider variables (DAGs)

3 of 27

Journal Club #5

4 of 27

Topic: PCA

An early (and very well-known!) application of Principal Component Analysis to genetic data

Discussion Leaders: Lucas, Noah, Tam

5 of 27

Journal Club Debrief

Submit and upvote questions here →


6 of 27

Stretch Break!

7 of 27

Journal Club Debrief

Key Points:

  • When PCA is applied to genetic data, top PCs often reflect genetic similarity / ancestry / geographic location / "population structure"
    • Why?? More soon!
  • Genetic variation exists along a continuum, not discrete clusters!

8 of 27

Journal Club Debrief

Key Points (continued):

  • When phenotypes are associated with geography or other aspects of "population structure," spurious associations may arise in GWAS (at SNPs that are likewise associated with geography / population structure)
  • Including PCs as covariates in GWAS models can help fix this
    • Why?? More soon!
  • Simulation studies are a powerful and ubiquitous tool in statistics research!!

(Image source: Supplementary Material)
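
The confounding story can itself be shown with a tiny simulation (a Python/numpy sketch; the lab uses R). Every number below is invented for illustration: the phenotype depends only on population membership, yet a naive regression finds an "effect" at a SNP whose allele frequency differs between populations, and adjusting for structure (here the known population label stands in for a PC) removes it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Population membership confounds the SNP-phenotype relationship:
# allele frequency AND phenotype mean both differ by population
pop = np.repeat([0.0, 1.0], n // 2)
snp = rng.binomial(2, np.where(pop == 0, 0.2, 0.8)).astype(float)
y = 1.0 * pop + rng.normal(size=n)  # the SNP has NO real effect on y

# Naive GWAS-style regression: y ~ snp (spurious association)
b_naive = np.polyfit(snp, y, 1)[0]

# Structure-adjusted regression: y ~ snp + pop, via least squares
X = np.column_stack([np.ones(n), snp, pop])
b_adj = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(f"naive slope: {b_naive:.3f}, adjusted slope: {b_adj:.3f}")
```

In a real GWAS we do not observe population labels; the article's key move is that top PCs computed from the genotypes can play this adjustment role.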

9 of 27

Journal Club Debrief

However…

  • Sensitive to pre-processing and may capture other data "artifacts" / features (e.g., related individuals, linkage disequilibrium)
  • Can be difficult to decide how many PCs are needed to fully capture this "structure"
  • This is something I've been thinking a lot about for the past 5+ years! →

10 of 27

Journal Club Debrief

However…

  • Sensitive to pre-processing and may capture other data "artifacts" / features (e.g., related individuals, linkage disequilibrium)
  • Can be difficult to decide how many PCs are needed to fully capture this "structure"
  • This is something I've been thinking a lot about for the past 5+ years! →
  • In light of last week's journal club discussion, what other comments, questions, or concerns do you have about this article?

11 of 27

Journal Club Debrief

Connection to upcoming content:

"the PCA-based methods used here are based on genotypic patterns of variation and do not take advantage of signatures of population structure that are contained in patterns of haplotype variation"

p. 100, last paragraph before Methods Summary, when discussing limitations and areas for future work

12 of 27

Genetic Ancestry

  • Local ancestry: at a specific position along the genome
    • I inherited this part of my genome from an ancestor in category X
  • Global ancestry: across the whole genome
    • I inherited this proportion of my genome from an ancestor in category X
    • PCA captures global ancestry

13 of 27

Journal Club Debrief - what's next?

Journal Club #6:

  • Date: March (!!) 4
  • Readings: application of global and local ancestry inference to 23andMe data
  • Leaders: Nick, Ronan, Sam
  • Instructions: see the course website
    • Remember to share slides with me (leaders) and submit at least one question to Slido (everyone else) before class

14 of 27

Understanding PCA

15 of 27

Principal component analysis

  • PCA is a widely used technique for dimension reduction
  • The goal of dimension reduction is to find a way to represent the information within our data using fewer variables
  • With PCA, we are specifically looking for the linear transformation of our original data that explains the most variability (variability = information) and produces a new set of uncorrelated (orthogonal) features
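
As a concrete toy illustration (a numpy sketch rather than the R we use in lab): three observed variables that are all noisy copies of one hidden factor can be compressed to essentially one new variable.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: 3 observed variables, each a noisy copy of 1 hidden factor
z = rng.normal(size=200)
X = np.column_stack([z + 0.1 * rng.normal(size=200) for _ in range(3)])

# PCA = eigendecomposition of the covariance matrix of the centered data
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
evals = evals[::-1]  # eigh sorts ascending; flip to descending

# Fraction of total variability explained by each PC
var_explained = evals / evals.sum()
print(var_explained)  # PC1 alone captures almost all of the information
```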

16 of 27

Principal component analysis

Source: ISLR Figure 6.14

17 of 27

Principal component analysis

Source: ISLR Figure 6.15

18 of 27

Principal component analysis

  • PCA finds new variables (called principal components, or PCs) that are linear combinations of our original ones:

PC_1 = a_11 x_1 + a_12 x_2 + … + a_1p x_p

PC_2 = a_21 x_1 + a_22 x_2 + … + a_2p x_p

⋮

PC_p = a_p1 x_1 + a_p2 x_2 + … + a_pp x_p

19 of 27

STAT 253 review (or preview)

Discuss at your table:

  • What are scores?
  • What are loadings?
  • Does the order of the PCs have any particular meaning?

Check out these Stat 253 materials if you need a refresher:

https://kegrinde.github.io/stat253_coursenotes/

(Unit 6 > 19 Principal Component Analysis)

PC_1 = a_11 x_1 + a_12 x_2 + … + a_1p x_p

PC_2 = a_21 x_1 + a_22 x_2 + … + a_2p x_p

⋮

PC_p = a_p1 x_1 + a_p2 x_2 + … + a_pp x_p

20 of 27

Don't peek at the next slide

21 of 27

PCA vocab

  • The principal components are the new variables

PC_1 = a_11 x_1 + a_12 x_2 + … + a_1p x_p

PC_2 = a_21 x_1 + a_22 x_2 + … + a_2p x_p

⋮

PC_p = a_p1 x_1 + a_p2 x_2 + … + a_pp x_p

  • The scores are the values that these new variables take

PC_1i = a_11 x_1i + a_12 x_2i + … + a_1p x_pi

  • The loadings are the weights/coefficients a_11, a_12, …, a_pp. They describe the contribution of each of the original variables to the new PCs
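
In code (a numpy sketch with simulated data; in R, `prcomp()` returns the same pieces as `$rotation` for loadings and `$x` for scores):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))  # correlated toy data
Xc = X - X.mean(axis=0)

# Loadings: eigenvectors of the covariance matrix (one column per PC)
evals, loadings = np.linalg.eigh(np.cov(Xc, rowvar=False))
loadings = loadings[:, ::-1]  # reorder so PC1 comes first

# Scores: the values each observation takes on the new variables
scores = Xc @ loadings

# Each loading vector has unit length, and scores on different PCs
# are uncorrelated -- exactly the properties described above
print(np.linalg.norm(loadings[:, 0]))
print(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1])
```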

22 of 27

Principal component analysis

  • PC1 is the linear combination of our original variables that explains the most variability in our data: considering all possible linear combinations

𝜙_1 x_1 + 𝜙_2 x_2 + … + 𝜙_p x_p ,

PC_1 = a_11 x_1 + a_12 x_2 + … + a_1p x_p is the one with the highest variance (subject to the constraint that 𝜙_1² + 𝜙_2² + … + 𝜙_p² = 1)

23 of 27

Principal component analysis

  • PC1 is the linear combination of our original variables that explains the most variability in our data: considering all possible linear combinations

𝜙_1 x_1 + 𝜙_2 x_2 + … + 𝜙_p x_p ,

PC_1 = a_11 x_1 + a_12 x_2 + … + a_1p x_p is the one with the highest variance (subject to the constraint that 𝜙_1² + 𝜙_2² + … + 𝜙_p² = 1)

  • PC2 is then the linear combination of our original variables that has the next largest variance, subject to the constraint of being uncorrelated with (i.e., perpendicular or orthogonal to) PC1

24 of 27

Principal component analysis

  • PC1 is the linear combination of our original variables that explains the most variability in our data: considering all possible linear combinations

𝜙_1 x_1 + 𝜙_2 x_2 + … + 𝜙_p x_p ,

PC_1 = a_11 x_1 + a_12 x_2 + … + a_1p x_p is the one with the highest variance (subject to the constraint that 𝜙_1² + 𝜙_2² + … + 𝜙_p² = 1)

  • PC2 is then the linear combination of our original variables that has the next largest variance, subject to the constraint of being uncorrelated with (i.e., perpendicular or orthogonal to) PC1
  • etc.
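
These defining properties are easy to check numerically (a Python/numpy sketch with made-up data): no unit-norm linear combination achieves a higher variance than PC1, and PC2 scores are uncorrelated with PC1 scores.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
Xc = X - X.mean(axis=0)
S = np.cov(Xc, rowvar=False)

# PC loadings = eigenvectors of S; eigenvalues = variances of the PCs
evals, evecs = np.linalg.eigh(S)
var_pc1 = evals[-1]  # eigh sorts ascending, so the last one belongs to PC1

# No random unit-norm combination phi beats the variance of PC1
for _ in range(1000):
    phi = rng.normal(size=5)
    phi /= np.linalg.norm(phi)  # enforce phi_1^2 + ... + phi_p^2 = 1
    assert phi @ S @ phi <= var_pc1 + 1e-9

# PC1 and PC2 scores are uncorrelated
scores = Xc @ evecs[:, ::-1]
print(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1])
```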

25 of 27

Lab 4: principal component analysis

  • Find Lab 4 on the course website
  • Copy-paste the code into a new QMD in RStudio
  • Add your name to the YAML header!
  • Work through the questions, discussing at your table as you go
    • Part 1: Building (our PCA) Intuition ← goal: finish this part before our next class
    • Part 2: Impact on GWAS
    • Part 3: Application to Real Genetic Data (Learn a New R Package!)

26 of 27

What's Next?

27 of 27

What's Next?

  • Reminder: no class on Thursday!
    • Attend (or give) 3 capstone talks instead
    • Bring your Capstone Reflection to class next Tuesday
  • Prep for Journal Club 6
    • Remember to submit at least one question to the Slido before class
  • Continue brainstorming project topics:
    • Will send out a Project Preferences Survey next week
  • Coming soon: midterm learning reflection & one-on-one conferences