1 of 76

Let’s go back to the start

Saket Choudhary

saketc@iitb.ac.in

Introduction to computational multi-omics

DH 607

Lecture 24 || Wednesday, 30th October 2024

2 of 76

Logistics

  • 7 Assignments in total (Lowest 2 will be dropped)
  • End semester examination:
    • Friday, 22nd November 13:30 - 16:30
    • In Class (Venue: IC1)
    • Reports and posters (pdf) due on: 22nd November (Friday)
    • Poster submission: 25th November (Monday)
    • Simple quiz based on material taught after midsem (Interpreting plots, suggesting analyses, diagnosing issues)
  • No classes on November 1st, 6th, 9th

3 of 76

3

dfdf

Welcome to last class of DH607!

“Somewhere, something incredible is waiting to be known”

Carl Sagan

American astronomer and planetary scientist

But we just got started…

4 of 76

4

dfdf

What was the course about?

Goal 1: Give you a flavour of science

    • Science: Essence of science is “inquiry”: concrete descriptions of what we observe; theories about what drives those observations
    • Engineering: “Design”: expands the scope of human plans results

Goal 2: Equip you with fundamental analytical framework to answer your own questions (broadly in genomics)

5 of 76

5

dfdf

What was the course about?

6 of 76

Course vignettes

(from Lecture 1)

7 of 76

dfdf

Structure of DNA

Building blocks of DNA

A pairs with T; G pairs with C; Double helix

DNA as a template for its own duplication

8 of 76

What is ‘genomics’?

Dr. Thomas Roderick

‘Genomics’

  • Branch of molecular biology that studies structure, function, evolution, mapping and editing of ‘genomes’
  • First coined by Dr. Thomas Roderick (Jackson Laboratory) as a name for yet-to-be-published journal in 1986
  • Genetics vs Genomics:
    • Genetics = study of specific and limited number of genes or part of genes with known function
    • Genomics = study of “all” genes

"I propose the expression Genom for the haploid chromosome set, which, together with the pertinent protoplams specifies the material foundation of the species" – Hans Winkler

Why ‘-ome’?

  • Hypothesis 1: fusion of two words, gene and chromosome
  • Hypothesis 2: Suffix ‘-ome’ = variant of the greek word -oma, had already been recruited into botany: biome, rhizome, phyllome to express many elements of complex biological systems

‘Genome’

  • Genome = entire set of DNA instructions found in a cell
  • Coined in 1920 by Hans Winkler, a German botanist

9 of 76

Lectures 01-04

10 of 76

dfdf

P-values

11 of 76

dfdf

Sequence probabilities

12 of 76

dfdf

Looking at sequences probabilistically

13 of 76

Lectures 05-08

DNA-sequencing

14 of 76

dfdf

Aligning sequences

Biological problem: How similar are two DNA/RNA/Protein sequences

Solution:

  • Dynamic programming [CS]
  • Sequence statistics [MATH]

15 of 76

dfdf

The alignment problem

16 of 76

dfdf

Global alignment

17 of 76

dfdf

Local alignment

18 of 76

dfdf

Searching for sequences in large scale databases

Biological problem: Given a biological sequences, what is the likely function of this sequence as compared to all known biological sequences that are close to it

Solution:

  • Dynamic programming [CS]
  • Sequence statistics [STATS]

19 of 76

dfdf

Querying large databases: BLAST

20 of 76

dfdf

The ‘eureka’ moment of PCR: Copying segments of DNA

Kary Mullis, ‘Inventor’ of PCR

  • A technique so simple that it is easy to ignore its importance
  • Series of denaturing and synthesis steps to perform non-linear amplification of DNA

21 of 76

dfdf

What is PCR?

Goal: Make multiple copies of a given DNA molecule OR ‘Amplify’ DNA

Recipe:

  • Mix target DNA (as low as a single molecule) with Taq DNA polymerase
  • Two oligonucleotide primers
  • Supply of nucleotides
  • Primers attach to the ends and hence need to be ‘designed’ for a given DNA sequence

Mechanistic steps:

  • At 94°C, hydrogen bonds of the double strand are broken; Target DNA gets denatured; Taq is thermostable
  • Temperature reduced to 50-60°C which allows some rejoining of single strands but also enables primers to attach
  • Temperature raised to 72°C where Taq polmyerase is most efficient

22 of 76

dfdf

Generations of sequencing

First generation

(Sanger sequencing)

1977

Microarrays

1981

Second generation DNA sequencing

~ 2000s

Third generation

~ 2010s

23 of 76

dfdf

Genomics and Sequencing by synthesis

Biological question: How to determine the DNA sequence of all molecules in a (tissue) sample in a high-throughput fashion?

Solution: Bridge amplification [BIO]

24 of 76

dfdf

Second generation sequencing - Using fluorescently labelled deoxynucleotides

Key Idea: At each step the modified polymerase incorporates one and only one fluorescently labelled deoxynucleotides

25 of 76

dfdf

Third gen: Real time single molecule sequencing using Pacbio: 1kb to 20kb

Key idea: Avoid delay at detection step

  • Use fluorescent markers but not blocking
  • Template is circularized and happens in flowcells with millions of wells
  • ‘Optically observe’ polymerase mediated synthesis
  • Zero mode waveguide (ZMW) = a hole with width less than half the wavelength of light
  • ZMW limits the fluorescent excitation to tiny volume encompassing the polymerase and its template

ZMW: Hole with double stranded DNA + polymerse

Limitations: Higher error rate: 11-12% vs 0.1% for Illumina (short read sequencing)

26 of 76

dfdf

Third gen: Real time single molecule sequencing using Nanopore: 10kb-100kb

Oxford Nanopore MinION

Key idea:

  • Passing a single stranded DNA through a narrow channel will create a pattern of flow of ions
  • Pores are nanometer-scale wide (helix spans 2 nanometer): nanpore
  • Each nanopore has its own electroted connected to a sensor that measures the electric current flowing as the DNA passes through the nanopore
  • Bases determined using neural network (CNN) based predictors

Limitations: Higher error rate (~14% but seems to have improved over the past year)

27 of 76

dfdf

The after effects of human genome project: Using human genome as a reference

The assembled genome is used as a ‘reference’ to ‘map’ newly sequenced DNA fragments

28 of 76

dfdf

Aligning sequences with reduced memory footprint

Biological problem: How to align short sequences to a large reference genome without blowing up the computer memory?

Solution:

  • Smart hashing [CS]
  • Lossless transformation [CS]
  • Suffix trees [CS]

29 of 76

dfdf

Burrows Wheeler Transform

  • First column is sorted lexicographically
  • Last column is called the Burrows-Wheeler transform or BWT of Genome
  • Notice the BWT of panamabananas is smnpbnnaaaaa$a → runs of a’s are grouped together. Why?

30 of 76

Lectures 09-15

RNA-seq and differential expression

31 of 76

dfdf

Transcriptomics: Sequencing the ‘transcriptome’

The need for sequencing transcriptome:

  • DNA is same across cells but the gene expression pattern is different
  • Changes in the DNA might not necessarily reflect in the expression phenotype

32 of 76

dfdf

Bulk RNA-seq: Unbiased and high-throughput profiling of transcriptome

33 of 76

dfdf

Mapping reads to transcripts

Biological problem: What are the expression levels (mRNA) of the gene?

Solution:

  • Smart hashing [CS]
  • Graph based pseudomapping [CS]

34 of 76

dfdf

Pre-processing -omics datasets

Analytical questions:

  • Is there enough signal in my data?
  • Do all samples have the right signal?
  • Are there ‘batch-effects’ in my data?

Techniques:

  • Linear and non-linear dimensionality reduction PCA/SVD/tSNE/UMAP [STATS]
  • Clustering [STATS]

35 of 76

Principal Component Analysis - The recipe

  • Start with a data matrix M.
  • Center M by subtracting the column means (each column is a feature)
  • Perform SVD of M → M = UΣVT
    • U and V are orthonormal
    • Σ is a diagonal matrix of singular values
    • V is made of eigenvectors that diagonalize the covariance matrix MTM.
  • Truncate Vk to retain the first k columns
    • Mk = UkΣVkT is a good low rank (k) approximation of M.
  • “Project” the original matrix M onto Vk: MVk
    • This projection has two properties:
      • It maximises the variance of projected points
      • It results in minimum reconstruction error if the original matrix is to be reconstructed

36 of 76

dfdf

PCA: the optimization

37 of 76

dfdf

Interpreting PCA

PCA of the morphology of shoulder blade captures differences between humans and other primates.

PC1 = Orientation of spine to the shoulder blade

PC2 = Difference in borders of the shoulder blade

38 of 76

dfdf

How to reverse PCA?

Mk = UkΣVkT

MV1

MV1VT1

PCA reconstruction=PC scores⋅EigenvectorsT + Mean

39 of 76

dfdf

Statistical models for handling omics data

Biological question: Are differences between the two samples (cases/controls) biological or technical?

Solution:

  • Model biological and technical noise [STATS]
  • Multiple hypothesis testing [STATS]

40 of 76

dfdf

Choosing a test when you have two conditions

Parameteric tests = Have a parameter → used when the distribution is known

Non-parameteric tests → used when the distribution is unknown

Genes across two conditions are unpaired so we will use unpaired tests for most part of the course

41 of 76

dfdf

P-value revisited

Area = α/2

Area = α/2

Distribution of T under H0

Significant

findings

Null findings

Significant

findings

T1-α/2

Tα/2

P-value

Tobs

P-value = Probability of sampling a test statistic at least as extreme as the observed test statistic if the null hypothesis is true

We “reject” the null hypothesis (H0) if the pvalue is below the threshold (𝝰)

Under the null, p-value follows a uniform distribution (Proof: on to the board)

42 of 76

dfdf

Type I,II errors and Power

  • Type I error:
    • Probability that the test incorrectly rejects the null hypothesis (H0) when the null H0 is true
    • Often denoted by 𝞪
  • Type II error:
    • Probability that the test incorrectly fails to reject the null hypothesis (H0) when H0 is false
    • Often denoted by β
  • Power:
    • Probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true
    • Commonly denoted by 1- β where β is the probability of making a Type II error by incorrectly failing to reject the null hypothesis.
    • As β increases, the power of a test decreases.

43 of 76

dfdf

Type I,II errors and Power

False-positive

False-

negative

Distribution of T under H0

False-positive

Distribution of T under HA

Power

False-

negative

The false-positive rate is the probability of incorrectly rejecting H0.

The false-negative rate is the probability of incorrectly accepting H0.

Power = 1 – false-negative rate = probability of correctly rejecting H0.

Tα/2

T1-α/2

44 of 76

dfdf

Types of error

45 of 76

dfdf

Error rates for multiple hypothesis

Null

Hypothesis

True

Alternative

True

Not significant

Significant

Test Statistic

True Negative (U)

False Positive (V)

False Negative (T)

True Positive (S)

W

R

G0

G1

Family-wise error rate (FWER)

= probability of at least 1 false-positive

= Pr(V>0)

= significance level

False-discovery rate (FDR)

= expected proportion of false-positives among the rejected hypotheses

By convention, V/R≡0 if R=0.

46 of 76

dfdf

FWER: Family wise error rate

  • Suppose we are testing for a total of m genes, resulting in multiple null hypotheses H1, H2, …, Hm. The corresponding pvalues are p1,p2,...,pm.
  • Among the m genes (hypotheses) being tested, suppose m0 are true → m0 are actually DE between conditions (but we do not know m0)
  • The familywise error rate (FWER) is the probability of rejecting at least one of the true hypothesis m0 or alternatively FWER is the probability of having at least one false positive
  • We want to “control” the FWER: We do this by choosing a threshold for rejecting each null hypothesis so that in aggregate the FWER is less than some threshold α.

47 of 76

dfdf

Bonferroni correction

  • We can “control” the FWER by rejecting all null hypotheses for which �

  • How does this control FWER?

  • This is referred to as Bonferroni correction
  • And the corresponding pvalues are referred to as “adjusted p-values”

48 of 76

dfdf

FDR: False discovery rate

  • FDR = Number of false discoveries is the number of type I errors (false positives) among the rejections of the null hypothesis.
  • FDR, intuitively is the expected number of “false discoveries”
  • If V = number of type I errors, R = number of rejected null hypothesis, FDR is given by

49 of 76

dfdf

Controlling the FDR using Benjamini Hochberg

  • Given the p-values for m hypothesis tests, sort the p-values: �p(1) , p(2) , … p(m).
  • For a given threshold α, find the largest value k such that �
  • Reject all the null hypotheses H(i) for i=1,...,k.
  • The corresponding p-values are referred to as “BH adjusted pvalues”

50 of 76

dfdf

Controlling the FDR using qvalues

  • Given the p-values for m hypothesis tests, sort the p-values: p(1) , p(2) , … p(m).
  • For each p-value set the q-value to be �
  • Rejecting the null hypothesis with q-values less than α ensures the expected false discovery rate is α.

51 of 76

Lectures 16-19

Single-cell RNA-seq

52 of 76

Single-cell omics

Biological question: How similar or different are the ‘profiles’ of two given cells?

Solution:

  • Technological advancement [BIO]
  • Separating technical noise from biological signal [STATS]
  • Clustering [STATS]
  • Differential expression [STATS]
  • Handling large scale data [CS]

53 of 76

Single-cell analysis workflow

scRNA-seq counts (UMIs) matrix

Input of any scRNA-seq workflow:

Image credits:

Azenta.com

Question: How should we ‘normalize’ counts matrix to adjust for non-biological variation?

Counts matrix needs to be ‘normalized’ before any downstream analysis to account for difference in total molecules sequenced

54 of 76

dfdf

Data: Human PBMC Smart-seq3, 3k cells from Hagemann-Jensen et al., NBT 2020

Many genes exhibit technical variation

Challenge: Deeper sequencing results in higher gene counts - how can we adjust for this effect?

55 of 76

Single-cell RNA-seq enables comparison of the transcriptomes of individual cells

Target tissue

Single cell RNA-seq (scRNA-seq)

Solution: scRNA-seq enables molecular measurement of RNA in single cells

56 of 76

56

Why Single cell?

Cells with varying gene expression pattern

Bulk RNA-seq

Missing co-variation pattern

Single-cell RNA-seq

Reveals co-variation pattern

57 of 76

Single-cell omics: Active playground for statistical methods development

Number of scRNA-seq tools v/s datasets

See a more comprehensive version of the transistor plot here

58 of 76

58

Negative Binomial distribution for modeling UMIs

Negative Binomial model for modeling scRNA-seq counts:

Mean = 𝜇

Variance = 𝜇 + 𝜇2/𝜃

Inverse-dispersion parameter = 𝜃

Negative binomial model allows capturing the heterogeneity observed in gene counts

59 of 76

dfdf

Single cell multi-omics

Biological question: How are two different types of molecules related within each cell

60 of 76

Lectures 20-23

CRISPR, GWAS and Epigenomics

61 of 76

dfdf

Genome editing

Genome editing using a programmable nuclease (usually Cas9 protein) is currently the most efficient way to inactivate a specific gene

Principle: Direct a nuclease to a specific site where it makes a double strand cut

  • Cut stimulates a natural repair process nonhomologous end-joining (NHEJ) in eukaryotes → join the DNA strands together
  • Cut position is specified by the 20-nucleotide guide RNA (gRNA), which must be designed to base pair to a target site immediately upstream of a 5’–NGG–3’ or 5’–NAG–3’
  • NHEJ is error prone → will result in a short insertion or deletion at the repair site → often disrupts the open reading frame (ORF) → lesser or no protein from this gene

62 of 76

dfdf

How does Cas9 work in eukaryotes?

63 of 76

dfdf

The faults in our DNA

Sickle cell disease: Two mutations to sickle

64 of 76

dfdf

The faults in our DNA

What made treating sickle cell anaemia possible?

  • Discovery of CRISPR-Cas9
  • Advancements in genomic technologies
  • Statistical methods for genomics

65 of 76

dfdf

Gene wide association studies

Key idea

  • Genotype individuals from multiple cohorts (could be observational)
  • Associate genotypes with traits (e.g. correlate height of individual with a single nucleotide polymorphism [SNP] at a particular locus)
  • Traits could be anything: diseases, disorders, physical characteristics.

66 of 76

dfdf

Where did original Indians come from?

Biological question: Do two individuals have shared ancestry?

Solution:

  • SNP arrays [BIO]
  • Statistical genetics [STATS]

67 of 76

dfdf

Statistical models for deciphering gene regulation

Biological question: How do transcription factors (enhancers/promoters_ influence gene expression?

Solution:

  • Technological advancement [BIO]
  • Statistical models for modeling DNA fragments [STATS]

68 of 76

What is epigenomics?

Epigenetics(omics) = Study of changes in DNA that do not involve alterations in the DNA sequence.

For our current focus: It would mean study of chemical modifications to the chromatin

Can we ‘sequence’ it?

69 of 76

Tales of: Histones, Chromatin, Nucleosomes

Histones = Arginine/Lysine rich proteins

5 families: H1/H5 (linker), H2,H3 and H4

Core histones form an octamer called nucleosome → 147bp of DNA wraps around nucleosome

Histone tails contains amino acids

that can undergo chemical modification

Chromatin = A complex of DNA and histone proteins. The basic unit of chromatin is the nucleosome

Nucleosome = Two H2A-H2B dimer + One H3-H4 tetramer

70 of 76

Active and Repressed regions are enriched for distinct modifications

  • Repressed regions are characterized by regularly spaced nucleosome enriched for DNA methylation and histone modifications such as H3K27Me3 → Repression arises from inability of TFs to bind enhancers and promoters
  • Active regions are enzymatically accessible and enriched for acetylated histone modifications such as H3K27Ac

71 of 76

Where from here?

Other courses, grad school, industry

72 of 76

One simple advice for approaching anyone: SHOW Don’t TELL

73 of 76

dfdf

Other courses

KCDH

  • *Computational genomics of ageing (Autumn 2025)
  • *Introduction to data science for genomics (Fall 2025)

Coursera

74 of 76

dfdf

Grad Schools

  • Programs are increasingly becoming cross disciplinary
  • You can do computational biology in any of these departments (non-exhaustive):
    • Chemical Engineering
    • Biosciences/Genomic sciences
    • (Applied) Mathematics
    • Statistics
    • Genetics
    • Chemistry
    • Physics
    • Computer Science
    • Electrical engineering
    • Environmental Sciences
  • While applying for grad schools, more than the ranking, you should look at whether the school has the right ecosystem for you to flourish
  • PhD is for picking up the right skills (you are still training) - so please approach it with an open mind

75 of 76

dfdf

Industry

76 of 76

dfdf

Circling back: Questions?