1 of 21

Welcome (back) to Stat 494: Statistical Genetics!

While we wait to get started...

  • Sit anywhere you want (just avoid the back two corner tables, please!)
  • Catch up on recent messages on Slack
    • Please respond to my career exploration poll!
    • Reminder: attend at least two department seminars and complete the Speaker Reflection form by the end of the semester
  • Turn in your Capstone Reflection
  • Review your notes and prepare for today's Journal Club discussion

2 of 21

Goals for today

  • Journal Club 6: application of ancestry inference to 23andMe data
    • Student presentation
    • Instructor debrief
  • Project "speed dating"
  • Lab 4:
    • Part 1 debrief
    • Part 2 work time
    • If time: Part 2 debrief
      • Connecting ancestry inference / PCA to questions of causal inference (confounding, colliders, DAGs)

3 of 21

Journal Club #6

4 of 21

Topic: Genetic Ancestry Inference

  • Application of genetic ancestry inference to individuals in the United States
  • An early paper using 23andMe customer data (see full publication list here)
  • One of the first papers we've seen that focuses on admixed individuals

Discussion Leaders: Nick, Ronan, Sam

5 of 21

Journal Club Debrief

Submit and upvote questions here →


6 of 21

Stretch Break!

7 of 21

Journal Club Debrief

Key Points:

  • SVMs are a useful tool for inferring genetic ancestry similarity
    • See ISLR for more details
  • Genetic ancestry is highly variable
    • Within self-reported race/ethnicity groups
    • Across the US
    • → can't predict race/ethnicity from genetic ancestry
  • … and complex
    • → simple models (e.g., assuming K = 2 for African Americans) may not suffice

Figure 5B. The proportion of individuals that self-report as European American, Latino, and African American for each 2% bin of African ancestry and Native American ancestry. The proportion for each 2% bin is shown as a pie chart, with slices colored in proportion to the absolute numbers of individuals from each self-reported identity that carry those levels of genome-wide ancestry. Pie charts are omitted for bins where there were no individuals with those corresponding levels of Native American and African ancestry.

8 of 21

Journal Club Debrief

Connections to previous topics:

  • Discussion and justification of choice of population descriptors
  • Expanding the focus beyond Europe:
    • African Americans, Latinos, European Americans (acknowledging admixture)
    • Sub-continental ancestry
  • However…
    • groups still analyzed separately
    • individuals who self-identify as mixed race excluded
    • accuracy of statistical methods impacted by data availability (or lack thereof)

p. 38 (Methods)

p. 39 (Methods)

9 of 21

Statistical methods for inferring genetic ancestry

  • Machine learning tools (clustering + classification)
    • PCA (e.g., smartpca, SNPRelate)
    • Random forests (e.g., RFMix)
    • Support vector machines (e.g., 23andMe)
  • Model-based, biologically-informed approaches
    • Bayesian methods (e.g., STRUCTURE)
    • Maximum likelihood estimation (e.g., ADMIXTURE)
    • Hidden Markov Models (e.g., STRUCTURE 2.0)
    • Li and Stephens population genetic model (e.g., HAPMIX, Chromopainter, FLARE)

Supervised (requires labels or outcome values) → we need some people with known ancestry so we can train the model

Unsupervised (doesn't require labels or outcome values)

10 of 21

Journal Club Debrief - What's Next?

Journal Club #7:

  • Date: March 27 (Thursday after spring break)
  • Readings: review of statistical methods to handle "population structure" (the definition of which is now expanded to include "cryptic relatedness")
  • Leaders: Bowman, Cameron, Charles
  • Instructions: see the course website
    • Remember to share slides with me (leaders) and submit at least one question to slido (everyone else) before class

11 of 21

What's Next?

12 of 21

Upcoming Project Checkpoints

Due Dates:

  • Checkpoint 1 (Project Preferences Survey): this Friday, March 7
  • Checkpoint 2 (Topic Proposal): next Friday, March 14

To help with this, let's do a quick round of project "speed dating"...

13 of 21

Project "Speed Dating" - Round 1

Discuss:

  1. Introductions
  2. What you're hoping to get out of the project (skills, concepts, etc.)
  3. Any specific topic ideas you have

14 of 21

Project "Speed Dating" - Round 2

Move to another table.

Make sure there are at least 2 people you didn't talk to during the last round.

Discuss:

  • Introductions
  • What you're hoping to get out of the project (skills, concepts, etc.)
  • Any specific topic ideas you have

15 of 21

Project "Speed Dating" - Round 3

We'll do this one more time in class on Thursday.

(And then I'll give you time in class to complete the Project Preferences Survey.)

When you get to class on Thursday, sit with at least 2 people you didn't talk to today!

16 of 21

What's Next?

  • Due this Friday (March 7):
    • Project Preferences Survey
    • Lab 4
  • We will not meet as a class next week! Instead:
    • Meet with me one-on-one (Monday–Thursday)
      • Sign up for a time here: https://calendar.app.google/wFNJ9MgbFPfKToSM9
      • Submit your Midterm Learning Reflection at least 48 hours prior
    • Meet with your project group to work on the Topic Proposal (due Friday, March 14)
      • I'll let you know your groups no later than Monday, March 10
    • Work on, and submit, Content Summary 2 (due Friday, March 14)

17 of 21

Understanding PCA

18 of 21

Lab 4: check-in

Part 1:

  • How did you calculate MAF?
    • HINT: if your results are not between 0 and 1, something needs updating!
  • Which SNPs differ in frequency across populations?
  • Which SNPs contribute most highly to PC1?
  • How many PCs do we need to capture "population structure" in this sample?
    • Which plots help us answer this question?
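For reference while you check your Part 1 answers, here is a minimal Python sketch of the MAF and PCA calculations. It is not the lab's code: the genotype matrix is simulated, and the names `geno`, `minor_allele_frequency`, and `top_snps` are hypothetical. It assumes genotypes are coded as 0/1/2 allele counts with no missing values. Note that once we take the rarer allele, MAF should land between 0 and 0.5.

```python
import numpy as np

rng = np.random.default_rng(494)

# Hypothetical toy data: two populations, 50 people each, 200 SNPs.
n_per_pop, n_snps = 50, 200
freq_pop1 = rng.uniform(0.05, 0.95, size=n_snps)
# Perturb the frequencies for population 2, keeping them inside (0, 1).
freq_pop2 = np.clip(freq_pop1 + rng.normal(0, 0.15, size=n_snps), 0.02, 0.98)

# Rows are people, columns are SNPs; entries are allele counts (0, 1, or 2).
geno = np.vstack([
    rng.binomial(2, freq_pop1, size=(n_per_pop, n_snps)),
    rng.binomial(2, freq_pop2, size=(n_per_pop, n_snps)),
])


def minor_allele_frequency(genotypes):
    """MAF per SNP from a (people x SNPs) matrix of 0/1/2 allele counts."""
    # Each person carries two alleles, so divide the mean count by 2.
    allele_freq = genotypes.mean(axis=0) / 2
    # The minor allele is whichever of the two alleles is rarer.
    return np.minimum(allele_freq, 1 - allele_freq)


maf = minor_allele_frequency(geno)

# PCA via SVD of the column-centered genotypes (monomorphic SNPs dropped).
keep = maf > 0
centered = geno[:, keep] - geno[:, keep].mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

scores = u * s                          # PC scores, one row per person
var_explained = s**2 / np.sum(s**2)     # proportion of variance per PC
# Indices (among the kept SNPs) with the largest absolute PC1 loadings.
top_snps = np.argsort(np.abs(vt[0]))[::-1][:5]

print("MAF range:", maf.min(), maf.max())
print("Variance explained by first 5 PCs:", var_explained[:5])
print("SNPs contributing most to PC1:", top_snps)
```

Plotting `var_explained` against the PC number gives a scree plot, and plotting the first two columns of `scores` against each other gives the usual PC1 vs. PC2 picture.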

19 of 21

Lab 4: work time

Continue working on Parts 2 and 3!

NOTE:

  • We'll discuss Omitted Variable Bias together in class on Thursday.
  • You can skip this section for now (and leave it blank when you submit).

20 of 21

Challenge: genetic data are observational

Genetic ancestry is a potential confounding variable in GWAS:

  • Ancestry → genotype: the allele frequency of the SNP we're testing differs across ancestral populations
  • Ancestry → trait: environmental factors, or causal SNPs elsewhere in the genome, differ across ancestry groups
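A short sketch of why both conditions matter, under simplifying assumptions that are not on the slide: the trait is linear in the genotype $x_j$ and a single ancestry variable $\pi$, and the error $\varepsilon$ is uncorrelated with both. The symbols $\alpha$, $\beta_j$, $\gamma$, and $\varepsilon$ are notation introduced for this sketch.

```latex
\begin{aligned}
\text{True model:}\quad
  & y = \alpha + \beta_j x_j + \gamma \pi + \varepsilon \\
\text{Slope if we omit } \pi\text{:}\quad
  & \frac{\operatorname{Cov}(x_j, y)}{\operatorname{Var}(x_j)}
    = \beta_j + \gamma \, \frac{\operatorname{Cov}(x_j, \pi)}{\operatorname{Var}(x_j)}
\end{aligned}
```

The extra term vanishes only if ancestry is unrelated to the trait ($\gamma = 0$) or unrelated to the genotype ($\operatorname{Cov}(x_j, \pi) = 0$). When allele frequencies and the trait both differ by ancestry, the unadjusted slope is biased.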

21 of 21

Challenge: genetic data are observational

If we know that genetic ancestry is a potential confounding variable, then we should adjust for it in our GWAS models:

$E[y \mid x_j, \boldsymbol{\pi}] = \alpha + \beta_j x_j + \boldsymbol{\gamma}\boldsymbol{\pi}$, where

  • $y$ is the trait
  • $x_j$ is the number of minor alleles at position $j$
  • $\boldsymbol{\pi}$ is the genetic ancestry (inferred using PCA or other techniques!)
  • repeat for all positions $j = 1, \dots, p$
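To see what the adjustment does, here is a small simulation sketch in Python. Everything in it is an illustrative assumption: the sample size, the allele frequencies, the effect sizes, and the use of a known 0/1 population label as the ancestry covariate (in practice this would be PC scores or estimated ancestry proportions). The helper `ols` is hypothetical and simply wraps `np.linalg.lstsq`. The SNP is simulated to have no effect on the trait.

```python
import numpy as np

rng = np.random.default_rng(2025)
n = 2000

# Hypothetical ancestry covariate: 0/1 population membership.
ancestry = rng.integers(0, 2, size=n)

# The allele frequency of the tested SNP differs across populations...
allele_freq = np.where(ancestry == 1, 0.6, 0.2)
x_j = rng.binomial(2, allele_freq)      # minor allele counts (0, 1, or 2)

# ...and the trait differs across populations for reasons unrelated to
# this SNP: the true beta_j is 0 and the ancestry effect is 2.
y = 1.0 + 0.0 * x_j + 2.0 * ancestry + rng.normal(0, 1, size=n)


def ols(response, *covariates):
    """Least-squares coefficients; an intercept column is added first."""
    design = np.column_stack([np.ones(len(response)), *covariates])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


# Coefficient on x_j without and with the ancestry covariate in the model.
beta_unadjusted = ols(y, x_j)[1]
beta_adjusted = ols(y, x_j, ancestry)[1]

print("beta_j without ancestry adjustment:", beta_unadjusted)
print("beta_j with ancestry adjustment:   ", beta_adjusted)
```

Under these assumptions the unadjusted estimate should sit well away from zero even though the SNP has no effect, while the adjusted estimate should be close to zero.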