1 of 16

Genome wide association studies

Saket Choudhary

saketc@iitb.ac.in

Introduction to computational multi-omics

DH 607

Lecture 22 || Wednesday, 23rd October 2024

2 of 16

What is a SNP?

SNP = Position in the genome where some individuals have one nucleotide (say G) and other individuals have a different nucleotide (say T)

Number of SNPs is estimated to be around 3-4 × 10⁸
Theoretically each SNP has 4 ‘alleles’ but most SNPs exist as two alleles
SNPs originate from point mutations → one nucleotide is converted into another → the change is not corrected by the DNA repair machinery
If SNPs arise in the reproductive cells → offspring inherits the mutation → after many generations → SNP becomes established in the population
Most SNPs are ‘biallelic’ → they have two possible alleles
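
The definition above can be illustrated with a minimal sketch that scans already-aligned sequences for positions where individuals differ. The sequences and the helper name `find_snps` are made up for illustration; real SNP discovery works on sequencing reads or array data, not on clean toy strings.

```python
from collections import Counter

def find_snps(sequences):
    """Return {position: Counter of alleles} for aligned columns that vary."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    snps = {}
    for pos in range(length):
        alleles = Counter(seq[pos] for seq in sequences)
        if len(alleles) > 1:  # more than one nucleotide seen at this position
            snps[pos] = alleles
    return snps

# Toy, made-up fragments from four individuals (hypothetical data)
individuals = ["ACGTGA", "ACGTTA", "ACGTGA", "ACGTTA"]
snps = find_snps(individuals)
for pos, alleles in snps.items():
    kind = "biallelic" if len(alleles) == 2 else "multiallelic"
    print(pos, dict(alleles), kind)
```

Here the only variable column is position 4 (0-based), where two individuals carry G and two carry T, so it is reported as biallelic.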

Genomes - Brown

3 of 16

How to profile mutations?

4 of 16

How SNP arrays work

Genomes - Brown

Oligonucleotide hybridization analysis

Oligonucleotide = short (<50nt) single stranded DNA
If conditions are right → the oligonucleotide hybridizes to the DNA target only when every base pairs correctly
If there is a SNP (a single mismatch) in the target → hybridization is prevented
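
A rough sketch of the all-or-nothing idea above, assuming an idealized probe that binds only when its reverse complement occurs exactly in the target. The sequences and the function `hybridizes` are hypothetical; real hybridization is a matter of temperature and salt conditions, not exact string matching.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def hybridizes(probe, target):
    """Idealized rule: bind only if the probe pairs perfectly with the target."""
    return reverse_complement(probe) in target

# Two made-up target alleles that differ at a single position (G vs T)
target_g = "TTACGGATCGA"
target_t = "TTACGTATCGA"

# Probe designed against the G allele
probe_for_g = reverse_complement("ACGGATC")

print(hybridizes(probe_for_g, target_g))  # perfect match
print(hybridizes(probe_for_g, target_t))  # single mismatch at the SNP
```

With a probe for each allele, the pattern of which probes bind tells us which allele the sample carries.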

5 of 16

Genetic testing is becoming ‘accessible’

https://23andme.com/

https://mapmygenome.in/

Most genetic testing kits use Affymetrix or Illumina SNP arrays

6 of 16

How SNP arrays work

LaFramboise 2009

Goal: Determine the SNP at the A/C locus of the given DNA fragment

Affymetrix:

25-mer probes for both alleles
DNA binds to both probes but is bound more efficiently when all 25 base pairs match (brighter yellow) than when there is a mismatch at the SNP (dimmer yellow)

Illumina:

Each Illumina bead carries an attached 50-mer sequence complementary to the sequence adjacent to the SNP site.
Single-base extension (T or G) that is complementary to the allele carried by the DNA (A or C, respectively) then binds and results in the appropriately-colored signal (red or green, respectively)

For both platforms, we require algorithms to convert the raw signal into SNP genotype calls
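
As a toy illustration of signal-to-genotype conversion, the sketch below calls a genotype at the A/C locus from two allele-specific intensities using a simple contrast with fixed cutoffs. The function name, the cutoffs (`threshold`, `min_total`) and the intensity values are assumptions made for this example; production callers are considerably more sophisticated and typically model intensities across many samples.

```python
def call_genotype(intensity_a, intensity_c, threshold=0.5, min_total=100.0):
    """Call AA / AC / CC from two allele-specific intensities (toy rule)."""
    total = intensity_a + intensity_c
    if total < min_total:
        return "NoCall"  # signal too weak to trust
    # Contrast is near +1 for AA, near -1 for CC, near 0 for heterozygotes
    contrast = (intensity_a - intensity_c) / total
    if contrast > threshold:
        return "AA"
    if contrast < -threshold:
        return "CC"
    return "AC"

# Hypothetical (A-channel, C-channel) intensities for four samples
samples = {"s1": (1000, 50), "s2": (60, 900), "s3": (500, 450), "s4": (20, 30)}
calls = {name: call_genotype(a, c) for name, (a, c) in samples.items()}
print(calls)
```

The "NoCall" outcome is what later shows up as a missing genotype in the quality control and imputation steps.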

7 of 16

SNP arrays

Genomes - Brown

8 of 16

How to associate mutations with phenotypes?

9 of 16

Uffelmann et al. 2021

Genome wide association studies

Collect genotypic data using microarrays for hundreds and thousands of ‘diverse’ individuals
Perform quality control steps at the wet lab and dry lab sides
Impute missing genotypes leveraging information about reference population
Run an association test for each genomic variant
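
The dry lab quality control step can be sketched as per-variant filters on call rate and minor allele frequency (MAF). Genotypes are assumed to be coded as 0/1/2 copies of one allele with `None` for missing calls; the thresholds and variant names below are illustrative assumptions, not values from the slides.

```python
def call_rate(genotypes):
    """Fraction of individuals with a non-missing genotype."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def minor_allele_frequency(genotypes):
    """MAF from 0/1/2 allele counts, ignoring missing calls."""
    observed = [g for g in genotypes if g is not None]
    if not observed:
        return 0.0
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1 - freq)

def passes_qc(genotypes, min_call_rate=0.95, min_maf=0.01):
    """Keep a variant only if it is well genotyped and not (nearly) monomorphic."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf)

# Hypothetical variants genotyped in ten individuals
variants = {
    "rs_toy_1": [0, 1, 2, 1, 0, 1, 0, 2, 1, 0],
    "rs_toy_2": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],           # monomorphic
    "rs_toy_3": [0, 1, None, None, 2, 1, 0, None, 1, 0],  # many missing calls
}
kept = [name for name, g in variants.items() if passes_qc(g)]
print(kept)
```

Only the first toy variant survives: the second has no variation to test and the third has too many missing calls.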

10 of 16

Uffelmann et al. 2021

Test for association

Balding

Fits a linear model for every variant, where the x axis is genotype and the y axis is a phenotype
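
A minimal sketch of the per-variant test, assuming an additive model phenotype = intercept + beta × genotype + noise with genotype coded as 0/1/2. Two simplifications are assumptions of this example: no covariates are included, and the p-value uses a normal approximation to the t-distribution, which is reasonable only for large samples. The data are made up.

```python
import math

def association_test(genotypes, phenotypes):
    """Ordinary least squares of phenotype on genotype for one variant."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("variant is monomorphic; slope is undefined")
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    beta = sxy / sxx                       # effect size per allele copy
    intercept = mean_y - beta * mean_g
    rss = sum((y - intercept - beta * g) ** 2
              for g, y in zip(genotypes, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)    # standard error of the slope
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return beta, se, p_value

# Hypothetical genotypes and quantitative phenotype for ten individuals
genotypes = [0, 0, 1, 1, 2, 2, 0, 1, 2, 1]
phenotypes = [1.0, 1.2, 1.9, 2.1, 3.2, 2.8, 0.9, 2.0, 3.0, 2.2]
beta, se, p_value = association_test(genotypes, phenotypes)
print(f"beta={beta:.3f} se={se:.3f} p={p_value:.3g}")
```

In a GWAS this function would be called once per variant, which is why the resulting p-values need a stringent multiple-testing threshold.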

Manhattan Plot shows significance of each variant’s association with a phenotype

Expected vs observed p-values
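
The quantities behind both plots can be computed without any plotting library. The sketch below derives Manhattan plot heights (-log10 p), the points of the expected vs observed plot, and the genomic inflation factor. The 5 × 10⁻⁸ threshold is the conventional genome-wide significance cutoff rather than something stated on the slide, and the p-values are synthetic.

```python
import math
from statistics import NormalDist, median

GENOME_WIDE_SIGNIFICANCE = 5e-8   # conventional cutoff (assumption here)
CHI2_1DF_MEDIAN = 0.4549364       # median of a chi-square with 1 df

def manhattan_heights(p_values):
    """y-axis values of a Manhattan plot."""
    return [-math.log10(p) for p in p_values]

def qq_points(p_values):
    """(expected, observed) -log10 p pairs under a uniform null."""
    n = len(p_values)
    observed = sorted(p_values)
    expected = [(i + 0.5) / n for i in range(n)]
    return [(-math.log10(e), -math.log10(o)) for e, o in zip(expected, observed)]

def genomic_inflation(p_values):
    """Lambda GC: median observed chi-square over its null median."""
    chi2 = [NormalDist().inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN

# Synthetic data: 1000 perfectly uniform null p-values plus one strong hit
null_p = [(i + 0.5) / 1000 for i in range(1000)]
p_values = null_p + [1e-9]
hits = [p for p in p_values if p < GENOME_WIDE_SIGNIFICANCE]
print(len(hits), round(genomic_inflation(null_p), 3))
```

Points that lie on the diagonal of the expected vs observed plot behave like the null; a lambda well above 1 suggests inflation, for example from population structure.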

11 of 16

Tam et al. 2019

12 of 16

Functional follow up of GWAS

Uffelmann et al. 2021

To prioritize likely causal variants, statistical fine-mapping is applied to identify a set of variants that are likely to include the causal variant as well as the most likely causal variant
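
One simple way to sketch statistical fine-mapping is shown below. It assumes exactly one causal variant in the locus, equal prior probability for every variant, and Wakefield-style approximate Bayes factors computed from z-scores and standard errors. The prior standard deviation of 0.2, the function name `fine_map` and the summary statistics are assumptions for illustration; this is not necessarily the method used in the cited review.

```python
import math

def fine_map(z_scores, standard_errors, prior_sd=0.2, coverage=0.95):
    """Posterior probabilities and credible set under a single-causal-variant model."""
    prior_var = prior_sd ** 2
    log_abf = {}
    for variant, z in z_scores.items():
        shrink = prior_var / (standard_errors[variant] ** 2 + prior_var)
        # log of the approximate Bayes factor in favour of association
        log_abf[variant] = 0.5 * math.log(1 - shrink) + 0.5 * z * z * shrink
    top = max(log_abf.values())            # subtract the max for numerical stability
    weights = {v: math.exp(l - top) for v, l in log_abf.items()}
    total = sum(weights.values())
    posterior = {v: w / total for v, w in weights.items()}
    # Add variants in decreasing posterior order until coverage is reached
    credible, cumulative = [], 0.0
    for variant in sorted(posterior, key=posterior.get, reverse=True):
        credible.append(variant)
        cumulative += posterior[variant]
        if cumulative >= coverage:
            break
    return posterior, credible

# Hypothetical summary statistics for four variants in one locus
z_scores = {"v1": 6.0, "v2": 5.5, "v3": 2.0, "v4": 0.5}
standard_errors = {v: 0.05 for v in z_scores}
posterior, credible = fine_map(z_scores, standard_errors)
print(credible, {v: round(p, 3) for v, p in posterior.items()})
```

In this toy locus the most likely causal variant is `v1`, but it alone does not reach 95% posterior probability, so the credible set also includes `v2`.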

13 of 16

Functional follow up of GWAS

Uffelmann et al. 2021

General idea: Perturb the loci and measure the phenotype

Integrate functional annotations of the genome with GWAS data to identify epigenetic mechanisms that may be perturbed by the causal variant (enhancers/promoters or other functional elements)
Massively parallel reporter assays can be used to measure whether alleles differ in their ability to drive gene expression or other molecular activity for each variant
eQTL = Genetic variants that affect gene expression = expression Quantitative Trait Loci
Target gene for a GWAS locus can be prioritized by mapping eQTLs and their co-localization to identify loci where the causal variant from GWAS is also a causal variant affecting gene expression
Pathways can be identified by enrichment analysis of genes from the previous step
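
The final enrichment step can be sketched with a one-sided hypergeometric test, one common choice among several. It asks how surprising the overlap between the prioritized genes and a pathway is if genes were drawn at random from the background. All counts and the function name are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn without replacement."""
    total = comb(n_background, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 20,000 background genes, a 100-gene pathway,
# 50 GWAS-prioritized genes, 5 of which fall in the pathway
p_enrich = hypergeom_enrichment_p(20000, 100, 50, 5)
print(f"{p_enrich:.3g}")
```

Because many pathways are usually tested, the resulting p-values would still need multiple-testing correction.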

14 of 16

Polygenic risk score

Uffelmann et al. 2021

Step 1: GWAS summary statistics are obtained, which detail the effect of each single-nucleotide polymorphism (SNP) on the phenotype of interest
Step 2: genotype data for a set of individuals are referenced against GWAS summary statistics.
Step 3: polygenic risk scores (PRSs) can be calculated for each individual by summing up the effect sizes of all risk alleles for each individual.
Step 4: linear regression analysis is performed on the calculated PRS to assess the effect of the PRS on the outcome measure.
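
Steps 1 to 3 can be sketched as a weighted sum: each individual's count of the effect allele (0, 1 or 2) at a SNP is multiplied by that SNP's GWAS effect size, and the products are added up. The SNP names, effect sizes and genotypes below are made up, and SNPs missing from the summary statistics are simply skipped, which is an assumption of this example.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Sum of effect size x effect-allele count over SNPs present in both inputs."""
    return sum(
        effect_sizes[snp] * dosage
        for snp, dosage in dosages.items()
        if snp in effect_sizes
    )

# Step 1: hypothetical GWAS summary statistics (effect size per effect allele)
effect_sizes = {"rs_toy_1": 0.20, "rs_toy_2": -0.10, "rs_toy_3": 0.05}

# Step 2: hypothetical genotype data, as effect-allele counts per individual
individuals = {
    "ind_A": {"rs_toy_1": 2, "rs_toy_2": 0, "rs_toy_3": 1},
    "ind_B": {"rs_toy_1": 0, "rs_toy_2": 2, "rs_toy_3": 1},
}

# Step 3: one score per individual
scores = {name: polygenic_risk_score(d, effect_sizes)
          for name, d in individuals.items()}
print(scores)
```

Step 4 would then regress the outcome on these scores, in the same way a single variant is tested in the association step.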

Diseases are often associated with multiple (poly) genes. How can we assign risk scores to individuals based on their SNP profile?
