1 of 16

Genome-wide association studies

Saket Choudhary

saketc@iitb.ac.in

Introduction to computational multi-omics

DH 607

Lecture 22 || Wednesday, 23rd October 2024

2 of 16

What is a SNP?

SNP = Position in the genome where some individuals have one nucleotide (say G) and other individuals have a different nucleotide (say T)

  • Estimated to number around 3-4 × 10^8
  • Theoretically each SNP has 4 ‘alleles’ but most SNPs exist as two alleles
  • SNPs originate because of point mutations → converts one nucleotide into another → is not repaired by the repairing machinery
  • If SNPs arise in the reproductive cells → offspring inherits the mutation → after many generations → SNP becomes established in the population
  • Most SNPs are ‘biallelic’ → they have two possible alleles

3 of 16

How to profile mutations?

4 of 16

How SNP arrays work

Oligonucleotide hybridization analysis

  • Oligonucleotide = short (<50 nt) single-stranded DNA
  • If conditions are right → hybridizes to the complementary DNA target
  • If the target carries a SNP (a mismatch with the probe) → hybridization is prevented
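
The hybridization rule above can be illustrated with a toy model. This is only a sketch: it assumes that "hybridizes" means "pairs perfectly at every position", and the 9-nt sequences and function names are hypothetical. Real hybridization depends on temperature, salt concentration and probe length.

```python
# Toy model of allele-specific oligonucleotide hybridization.
# Assumption: a probe "hybridizes" only if it is complementary to the
# target at every position (max_mismatches = 0 by default).

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def count_mismatches(probe: str, target: str) -> int:
    """Count positions where the probe fails to pair with the target."""
    if len(probe) != len(target):
        raise ValueError("probe and target must have equal length")
    perfect_probe = reverse_complement(target)
    return sum(p != q for p, q in zip(probe, perfect_probe))


def hybridizes(probe: str, target: str, max_mismatches: int = 0) -> bool:
    """True if the probe pairs with the target within the mismatch budget."""
    return count_mismatches(probe, target) <= max_mismatches


# Hypothetical targets that differ only at the central SNP (G vs T).
target_g = "ACCTGGATC"
target_t = "ACCTTGATC"
probe_for_g = reverse_complement(target_g)  # probe designed for the G allele

print(hybridizes(probe_for_g, target_g))  # perfect match
print(hybridizes(probe_for_g, target_t))  # one mismatch at the SNP
```

Raising max_mismatches mimics less stringent conditions, under which a mismatched probe may still bind.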

5 of 16

Genetic testing is becoming ‘accessible’

Most genetic testing kits use Affymetrix or Illumina SNP arrays

6 of 16

How SNP arrays work

Goal: Determine the SNP at the A/C locus of the given DNA fragment

Affymetrix:

  • 25-mer probes for both alleles
  • DNA binds to both probes, but binds more efficiently when all 25 base pairs match (brighter yellow) than when there is a mismatch at the SNP (dimmer yellow)

Illumina:

  • Each Illumina bead is attached to a 50-mer sequence complementary to the sequence adjacent to the SNP site.
  • A single-base extension (T or G), complementary to the allele carried by the DNA (A or C, respectively), then binds and produces the correspondingly colored signal (red or green, respectively)

For both platforms, we require algorithms to convert raw signal into SNPs
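
As a rough illustration of what such an algorithm does, the sketch below calls a genotype from two allele-specific intensities using fixed thresholds. The function name, the thresholds and the intensity values are all illustrative assumptions, not vendor defaults. Production callers typically cluster intensities across many samples rather than thresholding one sample at a time.

```python
def call_genotype(intensity_a: float, intensity_b: float,
                  min_total: float = 200.0,
                  hom_cutoff: float = 0.2) -> str:
    """Call AA / AB / BB from two allele-specific signal intensities.

    Both thresholds are illustrative assumptions, not vendor defaults.
    Returns "NC" (no call) when the signal is weak or ambiguous.
    """
    total = intensity_a + intensity_b
    if total < min_total:
        return "NC"  # signal too weak to trust
    b_fraction = intensity_b / total  # share of signal from allele B
    if b_fraction <= hom_cutoff:
        return "AA"
    if b_fraction >= 1.0 - hom_cutoff:
        return "BB"
    if 0.5 - hom_cutoff < b_fraction < 0.5 + hom_cutoff:
        return "AB"
    return "NC"  # falls between the homozygous and heterozygous zones


# Hypothetical (allele A, allele B) intensity pairs for five samples.
samples = [(1800, 150), (900, 1000), (100, 1700), (50, 60), (1400, 500)]
for a, b in samples:
    print(a, b, call_genotype(a, b))
```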

7 of 16

SNP arrays

8 of 16

How to associate mutations with phenotypes?

9 of 16

Genome-wide association studies

  • Collect genotypic data using microarrays for hundreds and thousands of ‘diverse’ individuals
  • Perform quality control steps at the wet lab and dry lab sides
  • Impute missing genotypes leveraging information about reference population
  • Run an association test for each genomic variant
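
On the dry lab side, two common per-variant quality control filters are call rate and minor allele frequency (MAF). The sketch below assumes genotypes are coded as 0/1/2 copies of the alternate allele with None for a missing call. The cutoffs (0.95 and 0.01) and the variant names are illustrative assumptions that each study sets for itself.

```python
from typing import Dict, List, Optional

Genotype = Optional[int]  # 0, 1 or 2 alternate-allele copies; None = missing


def call_rate(genotypes: List[Genotype]) -> float:
    """Fraction of samples with a non-missing genotype."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def minor_allele_frequency(genotypes: List[Genotype]) -> float:
    """Frequency of the rarer allele among called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))  # two alleles per sample
    return min(alt_freq, 1.0 - alt_freq)


def passes_qc(genotypes: List[Genotype],
              min_call_rate: float = 0.95,
              min_maf: float = 0.01) -> bool:
    """Apply both filters; the cutoffs are assumptions, not fixed standards."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf)


# Hypothetical genotype vectors for three variants across ten samples.
variants: Dict[str, List[Genotype]] = {
    "snp_ok": [0, 1, 2, 1, 0, 1, 0, 2, 1, 0],
    "snp_missing": [0, None, None, 1, 0, 1, 0, 2, 1, 0],
    "snp_monomorphic": [0] * 10,
}
kept = [name for name, g in variants.items() if passes_qc(g)]
print(kept)
```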

10 of 16

Test for association

Fits a linear model for every variant, where the x axis is genotype and the y axis is a phenotype

Manhattan Plot shows significance of each variant’s association with a phenotype

Expected vs observed p-values
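
A minimal sketch of the per-variant test, assuming genotypes are coded as 0/1/2 allele dosages and the phenotype is quantitative. To stay within the standard library, the p-value uses a normal approximation to the t statistic, which is only reasonable for large samples. Real analyses also adjust for covariates such as ancestry. The function names and data are hypothetical.

```python
import math
from statistics import mean


def linear_association(genotypes, phenotypes):
    """Regress phenotype on genotype dosage; return (beta, se, p_value).

    The two-sided p-value uses a normal approximation (an assumption that
    holds only for large sample sizes).
    """
    n = len(genotypes)
    x_bar, y_bar = mean(genotypes), mean(phenotypes)
    sxx = sum((x - x_bar) ** 2 for x in genotypes)
    if n < 3 or sxx == 0:
        raise ValueError("need at least 3 samples and a polymorphic variant")
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(genotypes, phenotypes))
    beta = sxy / sxx                      # slope = effect per allele copy
    alpha = y_bar - beta * x_bar          # intercept
    rss = sum((y - (alpha + beta * x)) ** 2
              for x, y in zip(genotypes, phenotypes))
    if rss == 0:
        raise ValueError("perfect fit: standard error is zero")
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of the slope
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return beta, se, p_value


def expected_pvalues(m):
    """Expected sorted p-values under the null for a QQ plot.

    Uses (i - 0.5) / m, one common plotting convention.
    """
    return [(i - 0.5) / m for i in range(1, m + 1)]


# Hypothetical data for one variant in six individuals.
genotypes = [0, 1, 2, 0, 1, 2]
phenotypes = [1.0, 2.1, 2.9, 1.2, 1.9, 3.1]
beta, se, p = linear_association(genotypes, phenotypes)
print(beta, se, p)
print(-math.log10(p))       # y-axis value on a Manhattan plot
print(expected_pvalues(4))  # x-axis values of a QQ plot, before -log10
```

Repeating linear_association for every variant gives the p-values that a Manhattan plot displays and that a QQ plot compares against expected_pvalues.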

11 of 16

12 of 16

Functional follow up of GWAS

To prioritize likely causal variants, statistical fine-mapping is applied to identify a set of variants that is likely to include the causal variant, as well as the single most likely causal variant

13 of 16

Functional follow up of GWAS

General idea: Perturb the loci and measure the phenotype

  • Integrate functional annotations of the genome with GWAS data to identify epigenetic mechanisms that may be perturbed by the causal variant (enhancers/promoters or other functional elements)
  • Massively parallel reporter assays can be used to measure whether alleles differ in their ability to drive gene expression or other molecular activity for each variant
  • eQTL = Genetic variants that affect gene expression = expression Quantitative Trait Loci
  • The target gene for a GWAS locus can be prioritized by mapping eQTLs and testing for colocalization, i.e. identifying loci where the causal variant from GWAS is also a causal variant affecting gene expression
  • Pathways can be identified by enrichment analysis of genes from the previous step
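
One common way to run the enrichment analysis in the last bullet is a one-sided hypergeometric test. This is a sketch of that choice, not necessarily the method used in the lecture. The gene counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background: int, n_pathway: int,
                           n_selected: int, n_overlap: int) -> float:
    """P(overlap >= n_overlap) when n_selected genes are drawn at random.

    n_background: all genes considered
    n_pathway:    background genes annotated to the pathway
    n_selected:   genes prioritized from GWAS loci
    n_overlap:    prioritized genes that fall in the pathway
    """
    max_overlap = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 genes, a 150-gene pathway,
# 40 GWAS-prioritized genes, 5 of which belong to the pathway.
p_enrich = hypergeom_enrichment_p(20000, 150, 40, 5)
print(p_enrich)
```

When many pathways are tested, the resulting p-values need multiple-testing correction.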

14 of 16

Polygenic risk score

  • Step 1: GWAS summary statistics are obtained, which detail the effect of each single-nucleotide polymorphism (SNP) on the phenotype of interest
  • Step 2: genotype data for a set of individuals are referenced against GWAS summary statistics.
  • Step 3: polygenic risk scores (PRSs) can be calculated for each individual by summing up the effect sizes of all risk alleles for each individual.
  • Step 4: linear regression analysis is performed on the calculated PRS to assess the effect of the PRS on the outcome measure.
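
Steps 2 and 3 can be sketched as a weighted sum. The variant IDs, effect sizes and dosages below are hypothetical, and the sketch assumes that dosages count copies of the same effect allele that the GWAS reports. Step 4 (regressing the outcome on the score) is not shown.

```python
from typing import Dict

# Step 1 (assumed input): per-allele effect sizes from GWAS summary
# statistics. Variant IDs and values are hypothetical.
effect_sizes: Dict[str, float] = {"rs_a": 0.12, "rs_b": -0.05, "rs_c": 0.30}


def polygenic_risk_score(dosages: Dict[str, int],
                         effects: Dict[str, float]) -> float:
    """Sum effect size x effect-allele dosage over shared variants.

    Variants genotyped in the individual but absent from the summary
    statistics (or vice versa) are ignored.
    """
    shared = dosages.keys() & effects.keys()  # Step 2: match variants
    return sum(effects[snp] * dosages[snp] for snp in shared)  # Step 3


# Hypothetical individuals: copies (0/1/2) of the effect allele per variant.
individuals = {
    "person_1": {"rs_a": 2, "rs_b": 1, "rs_c": 0, "rs_d": 1},
    "person_2": {"rs_a": 0, "rs_b": 0, "rs_c": 2},
}
scores = {name: polygenic_risk_score(d, effect_sizes)
          for name, d in individuals.items()}
print(scores)
```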

Diseases are often associated with multiple (poly) genes. How can we assign risk scores to individuals based on their SNP profile?

15 of 16

GWAS: The dark side and the bright side

16 of 16

Questions