1 of 76

Let’s go back to the start

Saket Choudhary

saketc@iitb.ac.in

Introduction to computational multi-omics

DH 607

Lecture 24 || Wednesday, 30th October 2024

2 of 76

Logistics

7 Assignments in total (Lowest 2 will be dropped)
End semester examination:

Friday, 22nd November 13:30 - 16:30
In Class (Venue: IC1)
Reports and posters (pdf) due on: 22nd November (Friday)
Poster submission: 25th November (Monday)
Simple quiz based on material taught after midsem (Interpreting plots, suggesting analyses, diagnosing issues)

No classes on November 1st, 6th, 9th

3 of 76

Welcome to the last class of DH 607!

“Somewhere, something incredible is waiting to be known”

Carl Sagan

American astronomer and planetary scientist

But we just got started…

4 of 76

What was the course about?

Goal 1: Give you a flavour of science

Science: the essence of science is “inquiry”: concrete descriptions of what we observe, and theories about what drives those observations
Engineering: the essence of engineering is “design”: it expands the scope of human plans and results

Goal 2: Equip you with a fundamental analytical framework to answer your own questions (broadly in genomics)

Source

5 of 76

What was the course about?

Eric Drexler, Radical Abundance

Source

6 of 76

Course vignettes

(from Lecture 1)

7 of 76

Structure of DNA

Building blocks of DNA

A pairs with T; G pairs with C; Double helix

DNA as a template for its own duplication

Molecular biology of the gene

8 of 76

What is ‘genomics’?

Dr. Thomas Roderick

‘Genomics’

Branch of molecular biology that studies the structure, function, evolution, mapping and editing of ‘genomes’
Coined by Dr. Thomas Roderick (Jackson Laboratory) in 1986 as the name for a yet-to-be-published journal
Genetics vs Genomics:

Genetics = study of a specific and limited number of genes, or parts of genes, with known function
Genomics = study of “all” genes

Kuska 1998; Yadav 2007; Brien 2022; Cristescu 2019; Goldman et al. 2016

"I propose the expression Genom for the haploid chromosome set, which, together with the pertinent protoplasm, specifies the material foundation of the species" – Hans Winkler

Why ‘-ome’?

Hypothesis 1: fusion of two words, gene and chromosome
Hypothesis 2: The suffix ‘-ome’, a variant of the Greek -oma, had already been recruited into botany (biome, rhizome, phyllome) to express the many elements of complex biological systems

‘Genome’

Genome = entire set of DNA instructions found in a cell
Coined in 1920 by Hans Winkler, a German botanist

9 of 76

Lectures 01-04

10 of 76

P-values

11 of 76

Sequence probabilities

12 of 76

Looking at sequences probabilistically

https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/

13 of 76

Lectures 05-08

DNA-sequencing

14 of 76

Aligning sequences

Biological problem: How similar are two DNA/RNA/protein sequences?

Solution:

Dynamic programming [CS]
Sequence statistics [MATH]

15 of 76

The alignment problem

16 of 76

Global alignment
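
A minimal sketch of how dynamic programming solves global alignment (the Needleman-Wunsch recurrence). The function name and the scoring scheme (match = +1, mismatch = -1, gap = -1) are assumptions chosen for illustration, not values taken from the lecture; only the optimal score is returned, without traceback.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b under a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]


print(global_alignment_score("GATTACA", "GCATGCG"))
```

The table has (len(a) + 1) × (len(b) + 1) cells, so time and memory grow with the product of the two lengths; this is what makes the plain recurrence impractical for aligning reads against a whole genome.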

17 of 76

Local alignment

18 of 76

Searching for sequences in large scale databases

Biological problem: Given a biological sequence, what is its likely function, judged by comparison with all known biological sequences that are close to it?

Solution:

Dynamic programming [CS]
Sequence statistics [STATS]

https://blast.ncbi.nlm.nih.gov/Blast.cgi

19 of 76

Querying large databases: BLAST

20 of 76

The ‘eureka’ moment of PCR: Copying segments of DNA

Kary Mullis, ‘Inventor’ of PCR

Making PCR, Rabino 1996

Mullis et al. 1986

A technique so simple that it is easy to ignore its importance
A series of denaturation and synthesis steps that amplifies DNA exponentially (non-linearly)

Genomes, Brown

21 of 76

What is PCR?

Genomes, Brown

Goal: Make multiple copies of a given DNA molecule OR ‘Amplify’ DNA

Recipe:

Mix target DNA (as low as a single molecule) with Taq DNA polymerase
Two oligonucleotide primers
Supply of nucleotides
Primers attach to the ends and hence need to be ‘designed’ for a given DNA sequence

Mechanistic steps:

At 94°C, hydrogen bonds of the double strand are broken; Target DNA gets denatured; Taq is thermostable
Temperature reduced to 50-60°C which allows some rejoining of single strands but also enables primers to attach
Temperature raised to 72°C where Taq polymerase is most efficient
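
Why the amplification is non-linear can be seen with a little arithmetic: if every cycle of denaturation, annealing and synthesis copies each molecule once, the number of copies doubles per cycle. The sketch below assumes an idealised model; the function name and the per-cycle `efficiency` parameter are illustrative and not part of the lecture.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Expected copies after `cycles` rounds: N0 * (1 + efficiency) ** cycles.

    efficiency = 1.0 means every molecule is copied in every cycle (perfect doubling).
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


for n in (10, 20, 30):
    print(n, "cycles:", pcr_copies(1, n), "copies from a single molecule")
```

Under perfect doubling, 30 cycles turn one molecule into 2^30 (about a billion) copies, which is why a single starting molecule can be enough.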

22 of 76

Generations of sequencing

First generation

(Sanger sequencing)

1977

Microarrays

1981

Second generation DNA sequencing

~ 2000s

Third generation

~ 2010s

23 of 76

Genomics and Sequencing by synthesis

Shendure et al. 2017

Biological question: How to determine the DNA sequence of all molecules in a (tissue) sample in a high-throughput fashion?

Solution: Bridge amplification [BIO]

24 of 76

Second generation sequencing - Using fluorescently labelled deoxynucleotides

Key Idea: At each step the modified polymerase incorporates one and only one fluorescently labelled deoxynucleotide

Langmead

25 of 76

Third gen: Real time single molecule sequencing using PacBio: 1kb to 20kb

https://www.pacb.com/

Key idea: Avoid delay at detection step

Use fluorescent markers, but without blocking
Template is circularized; sequencing happens in flow cells with millions of wells
‘Optically observe’ polymerase-mediated synthesis
Zero mode waveguide (ZMW) = a hole with width less than half the wavelength of light
ZMW limits the fluorescent excitation to a tiny volume encompassing the polymerase and its template

ZMW: Hole with double stranded DNA + polymerase

Limitations: Higher error rate: 11-12% vs 0.1% for Illumina (short read sequencing)

26 of 76

Third gen: Real time single molecule sequencing using Nanopore: 10kb-100kb

Shendure et al. 2017

Nanoporetech.com

Genomes - Brown

Oxford Nanopore MinION

Key idea:

Passing single-stranded DNA through a narrow channel creates a characteristic pattern in the flow of ions
Pores are nanometer-scale wide (the DNA helix spans 2 nanometers), hence ‘nanopore’
Each nanopore has its own electrode connected to a sensor that measures the electric current flowing as the DNA passes through the nanopore
Bases are determined using neural network (CNN) based predictors

Limitations: Higher error rate (~14% but seems to have improved over the past year)

27 of 76

The after effects of human genome project: Using human genome as a reference

NHGRI

The assembled genome is used as a ‘reference’ to ‘map’ newly sequenced DNA fragments

28 of 76

Aligning sequences with reduced memory footprint

Biological problem: How to align short sequences to a large reference genome without blowing up the computer memory?

Solution:

Smart hashing [CS]
Lossless transformation [CS]
Suffix trees [CS]

Ferragina et al., 2005

29 of 76

Burrows Wheeler Transform

First column is sorted lexicographically
Last column is called the Burrows-Wheeler transform or BWT of Genome
Notice the BWT of panamabananas is smnpbnnaaaaa$a → runs of a’s are grouped together. Why?
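
The transform can be reproduced with a naive sketch: build every cyclic rotation of the text (with a terminal `$`), sort the rotations lexicographically, and read off the last column. The function name is illustrative, and sorting full rotations is only suitable for short strings; genome-scale tools build the transform from suffix arrays instead.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted cyclic rotations (illustration only)."""
    if not text.endswith("$"):
        text += "$"  # '$' sorts before all letters and marks the end of the text
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # First column of `rotations` is sorted; the last column is the BWT
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("panamabananas"))  # smnpbnnaaaaa$a
```

Each character in the last column is the one that precedes the sorted rotation in the original text. Rotations starting with "na" or "ma" sit next to each other after sorting and are all preceded by "a", which is where the run of a's comes from.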

30 of 76

Lectures 09-15

RNA-seq and differential expression

31 of 76

Transcriptomics: Sequencing the ‘transcriptome’

Genomes - Brown

The need for sequencing transcriptome:

DNA is the same across cells, but the gene expression pattern is different
Changes in the DNA might not necessarily be reflected in the expression phenotype

32 of 76

Bulk RNA-seq: Unbiased and high-throughput profiling of transcriptome

Griffith et al., PLOS Comp Bio. (2015)

Mortazavi et al., Nature (2008)

33 of 76

Mapping reads to transcripts

Haas and Zody 2010

Biological problem: What are the expression levels (mRNA) of the gene?

Solution:

Smart hashing [CS]
Graph based pseudomapping [CS]

34 of 76

Pre-processing -omics datasets

Analytical questions:

Is there enough signal in my data?
Do all samples have the right signal?
Are there ‘batch-effects’ in my data?

Techniques:

Linear and non-linear dimensionality reduction PCA/SVD/tSNE/UMAP [STATS]
Clustering [STATS]

35 of 76

Principal Component Analysis - The recipe

Start with a data matrix M.
Center M by subtracting the column means (each column is a feature)
Perform SVD of the centered M → M = UΣV^T

U and V have orthonormal columns
Σ is a diagonal matrix of singular values
The columns of V are the eigenvectors of M^TM, which is proportional to the covariance matrix of the centered data

Truncate V to its first k columns to get V_k

M_k = U_kΣ_kV_k^T is the best rank-k approximation of M

“Project” the centered matrix M onto V_k: MV_k

This projection has two properties:

It maximises the variance of projected points
It results in minimum reconstruction error if the original matrix is to be reconstructed
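
The recipe can be sketched in a few lines of NumPy. The function name, the simulated matrix and the choice k = 2 are assumptions for illustration; rows are taken to be samples and columns features, as in the recipe.

```python
import numpy as np


def pca_svd(M: np.ndarray, k: int):
    """PCA of M (samples x features) via SVD, keeping the first k components."""
    mean = M.mean(axis=0)
    Mc = M - mean                                  # center each feature (column)
    U, s, Vt = np.linalg.svd(Mc, full_matrices=False)
    V_k = Vt[:k].T                                 # first k right singular vectors
    scores = Mc @ V_k                              # projection of the data onto V_k
    reconstruction = scores @ V_k.T + mean         # rank-k approximation, mean added back
    explained_variance = s[:k] ** 2 / (M.shape[0] - 1)
    return scores, V_k, reconstruction, explained_variance


rng = np.random.default_rng(0)
M = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # simulated correlated features
scores, V_k, M_hat, var = pca_svd(M, k=2)
print("variance captured by PC1, PC2:", var)
print("reconstruction error (Frobenius):", np.linalg.norm(M - M_hat))
```

The variance of each column of `scores` equals the corresponding squared singular value divided by n − 1, and keeping all components (k equal to the number of features) reconstructs M exactly.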

36 of 76

PCA: the optimization

Source

37 of 76

Interpreting PCA

Young et al. 2015

PCA of the morphology of shoulder blade captures differences between humans and other primates.

PC1 = Orientation of spine to the shoulder blade

PC2 = Difference in borders of the shoulder blade

38 of 76

How to reverse PCA?

Source

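Rank-k approximation: M_k = U_kΣ_kV_k^T

Scores on the first principal component: MV₁

Reconstruction from the first principal component: MV₁V₁^T

PCA reconstruction = PC scores ⋅ Eigenvectors^T + Mean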

39 of 76

Statistical models for handling omics data

Source

Love et al. (2014)

Biological question: Are differences between the two samples (cases/controls) biological or technical?

Solution:

Model biological and technical noise [STATS]
Multiple hypothesis testing [STATS]

40 of 76

Choosing a test when you have two conditions

Parametric tests = assume a distribution described by parameters → used when the distribution is known

Non-parametric tests → used when the distribution is unknown

Genes across two conditions are unpaired, so we will use unpaired tests for most of the course

41 of 76

P-value revisited

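[Figure: distribution of T under H₀ with critical values T_α/2 and T_1-α/2; each tail (area = α/2) is labelled “Significant findings”, the central region “Null findings”, and the p-value is marked at the observed statistic T_obs]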

P-value = Probability of sampling a test statistic at least as extreme as the observed test statistic if the null hypothesis is true

We “reject” the null hypothesis (H₀) if the p-value is below the threshold (𝝰)

Under the null, the p-value follows a uniform distribution (proof: on the board)
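
The uniformity claim can also be checked by simulation. The sketch below is an illustration under stated assumptions, not the board proof: data are drawn from a standard normal (so H₀: mean = 0 is true), and a two-sided z-test with known standard deviation is used. Function names and simulation sizes are arbitrary.

```python
import math
import random


def two_sided_z_pvalue(sample, sigma=1.0):
    """Two-sided p-value for H0: mean = 0 when the standard deviation is known."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    # Pr(|Z| >= |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_null_pvalues(n_tests=5000, n=20, seed=0):
    """Repeat the test many times on data generated under the null."""
    rng = random.Random(seed)
    return [
        two_sided_z_pvalue([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(n_tests)
    ]


pvals = simulate_null_pvalues()
for u in (0.05, 0.25, 0.5):
    frac = sum(p <= u for p in pvals) / len(pvals)
    print(f"Pr(p <= {u}) ≈ {frac:.3f}")
```

If the p-values are uniform, the fraction falling below any cutoff u should be close to u; in particular about 5% of true nulls are rejected at 𝝰 = 0.05.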

42 of 76

Type I,II errors and Power

Type I error:

Probability that the test incorrectly rejects the null hypothesis (H₀) when the null H₀ is true
Often denoted by 𝞪

Type II error:

Probability that the test incorrectly fails to reject the null hypothesis (H₀) when H₀ is false
Often denoted by β

Power:

Probability that the test correctly rejects the null hypothesis (H₀) when the alternative hypothesis (H₁) is true
Commonly denoted by 1- β where β is the probability of making a Type II error by incorrectly failing to reject the null hypothesis.
As β increases, the power of a test decreases.

43 of 76

Type I,II errors and Power

[Figure: distributions of T under H₀ and under H_A, with critical values T_α/2 and T_1-α/2 and regions labelled False-positive, False-negative and Power]

The false-positive rate is the probability of incorrectly rejecting H₀.

The false-negative rate is the probability of incorrectly failing to reject H₀.

Power = 1 – false-negative rate = probability of correctly rejecting H₀.

44 of 76

Types of error

Paul Ellis, 2010

Source

45 of 76

Error rates for multiple hypothesis

Outcomes of m tests, by the truth and by the test statistic:

Null hypothesis true: Not significant → True Negative (U); Significant → False Positive (V)
Alternative true: Not significant → False Negative (T); Significant → True Positive (S)

R = V + S = number of rejected (significant) hypotheses

Family-wise error rate (FWER)

= probability of at least 1 false-positive

= Pr(V>0)

= significance level

False-discovery rate (FDR)

= expected proportion of false-positives among the rejected hypotheses

= E[V/R]

By convention, V/R≡0 if R=0.

46 of 76

FWER: Family wise error rate

Suppose we are testing a total of m genes, resulting in multiple null hypotheses H₁, H₂, …, H_m. The corresponding p-values are p₁, p₂, ..., p_m.
Among the m genes (hypotheses) being tested, suppose m₀ null hypotheses are true → these m₀ genes are actually not DE between conditions (but we do not know m₀)
The familywise error rate (FWER) is the probability of rejecting at least one of the m₀ true null hypotheses, or equivalently the probability of having at least one false positive
We want to “control” the FWER: we do this by choosing a threshold for rejecting each null hypothesis so that in aggregate the FWER is at most some threshold α.

47 of 76

Bonferroni correction

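We can “control” the FWER by rejecting all null hypotheses for which p_i ≤ α/m

How does this control FWER? By the union bound over the m₀ true null hypotheses: FWER = Pr(at least one true null has p_i ≤ α/m) ≤ m₀ ⋅ (α/m) ≤ α

This is referred to as Bonferroni correction
The corresponding p-values, min(1, m ⋅ p_i), are referred to as “adjusted p-values”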

Bonferroni 1936
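
A minimal sketch of the Bonferroni correction, assuming the standard rule (reject H_i when p_i ≤ α/m, equivalently when the adjusted p-value min(1, m ⋅ p_i) is at most α). The function name and the example p-values are made up for illustration.

```python
def bonferroni(pvalues, alpha=0.05):
    """Return (adjusted p-values, rejection decisions) under Bonferroni correction."""
    m = len(pvalues)
    adjusted = [min(1.0, m * p) for p in pvalues]   # capped at 1
    rejected = [p <= alpha / m for p in pvalues]    # per-test threshold alpha / m
    return adjusted, rejected


pvals = [0.001, 0.02, 0.04, 0.5]
adjusted, rejected = bonferroni(pvals)
for p, a, r in zip(pvals, adjusted, rejected):
    print(f"p = {p:<6} adjusted = {a:<6.3f} reject = {r}")
```

With m = 4 tests the per-test threshold drops from 0.05 to 0.0125, so only the smallest p-value survives; with thousands of genes the threshold becomes very stringent, which motivates FDR-based alternatives.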

48 of 76

FDR: False discovery rate

Number of false discoveries = number of type I errors (false positives) among the rejections of the null hypothesis.
FDR, intuitively, is the expected proportion of “false discoveries” among all discoveries
If V = number of type I errors and R = number of rejected null hypotheses, FDR is given by FDR = E[V/R] (with V/R ≡ 0 if R = 0)

49 of 76

Controlling the FDR using Benjamini Hochberg

Given the p-values for m hypothesis tests, sort the p-values: p₍₁₎ ≤ p₍₂₎ ≤ … ≤ p_(m).
For a given threshold α, find the largest value k such that p_(k) ≤ (k/m) ⋅ α
Reject all the null hypotheses H_(i) for i=1,...,k.
The corresponding adjusted values, the minimum over j ≥ i of (m/j) ⋅ p_(j), are referred to as “BH adjusted p-values”

Benjamini and Hochberg, 1995.
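
A minimal sketch of the Benjamini-Hochberg step-up procedure, assuming the standard rule: sort the p-values, find the largest rank k with p_(k) ≤ (k/m) ⋅ α, and reject the k smallest. The function name and example values are illustrative.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return (BH adjusted p-values, rejection decisions), in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by increasing p-value

    # Step-up: largest rank k whose p-value is below its own threshold (k/m) * alpha
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k = rank
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True

    # Adjusted p-values: running minimum of (m / rank) * p_(rank), from the largest rank down
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted, rejected


pvals = [0.01, 0.039, 0.02, 0.005, 0.5]
adjusted, rejected = benjamini_hochberg(pvals)
print(adjusted)
print(rejected)
```

Note that every hypothesis ranked below k is rejected even if its own p-value misses its own threshold; that is what "largest k" buys over checking each rank separately.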

50 of 76

Controlling the FDR using qvalues

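Given the p-values for m hypothesis tests, sort the p-values: p₍₁₎ ≤ p₍₂₎ ≤ … ≤ p_(m).
For each p-value set the q-value to be q_(i) = minimum over j ≥ i of π̂₀ ⋅ m ⋅ p_(j) / j, where π̂₀ is the estimated proportion of true null hypotheses (π̂₀ = 1 gives the BH adjusted p-values)
Rejecting the null hypotheses with q-values less than α controls the expected false discovery rate at α.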

Storey, 2002.

51 of 76

Lectures 16-19

Single-cell RNA-seq

52 of 76

Single-cell omics

Yao et al., Nature (2021)

Biological question: How similar or different are the ‘profiles’ of two given cells?

Solution:

Technological advancement [BIO]
Separating technical noise from biological signal [STATS]
Clustering [STATS]
Differential expression [STATS]
Handling large scale data [CS]

53 of 76

Single-cell analysis workflow

scRNA-seq counts (UMIs) matrix

Input of any scRNA-seq workflow:

Image credits:

Azenta.com

Question: How should we ‘normalize’ counts matrix to adjust for non-biological variation?

Counts matrix needs to be ‘normalized’ before any downstream analysis to account for differences in the total number of molecules sequenced per cell
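
One common baseline for this adjustment is library-size normalization followed by a log transform. The sketch below is only that baseline, not necessarily the method recommended in the lecture; the function name, the cells × genes orientation and the scale factor of 10,000 are assumptions.

```python
import numpy as np


def log_normalize(counts: np.ndarray, scale: float = 1e4) -> np.ndarray:
    """Library-size normalize a cells x genes UMI matrix, then apply log(1 + x)."""
    totals = counts.sum(axis=1, keepdims=True)       # total UMIs per cell
    if np.any(totals == 0):
        raise ValueError("filter out cells with zero total counts first")
    return np.log1p(counts / totals * scale)


# Two toy cells with the same composition, the second sequenced 10x deeper
counts = np.array([[1, 1, 2],
                   [10, 10, 20]])
print(log_normalize(counts))
```

After normalization the two toy cells have identical profiles, since they differ only in sequencing depth.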

54 of 76

Data: Human PBMC Smart-seq3, 3k cells from Hagemann-Jensen et al., NBT 2020

Many genes exhibit technical variation

Challenge: Deeper sequencing results in higher gene counts - how can we adjust for this effect?

55 of 76

Single-cell RNA-seq enables comparison of the transcriptomes of individual cells

Target tissue

Single cell RNA-seq (scRNA-seq)

Solution: scRNA-seq enables molecular measurement of RNA in single cells

Tang et al. Nature Methods (2009)

Islam et al. Genome Research (2011)

56 of 76

Why Single cell?

Cells with varying gene expression pattern

Bulk RNA-seq

Missing co-variation pattern

Single-cell RNA-seq

Reveals co-variation pattern

57 of 76

Single-cell omics: Active playground for statistical methods development

Number of scRNA-seq tools vs datasets

See a more comprehensive version of the transistor plot here

Data source: https://www.nxn.se/single-cell-studies and https://www.scrna-tools.org/

58 of 76

Negative Binomial distribution for modeling UMIs

Negative Binomial model for modeling scRNA-seq counts:

Mean = 𝜇

Variance = 𝜇 + 𝜇²/𝜃

Inverse-dispersion parameter = 𝜃

Negative binomial model allows capturing the heterogeneity observed in gene counts
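
The mean-variance relationship can be checked by simulation. NumPy parameterises the negative binomial by (n, p) rather than (𝜇, 𝜃); the sketch assumes the conversion n = 𝜃 and p = 𝜃/(𝜃 + 𝜇), which gives mean 𝜇 and variance 𝜇 + 𝜇²/𝜃. The function name and the values 𝜇 = 5, 𝜃 = 2 are illustrative.

```python
import numpy as np


def sample_nb(mu: float, theta: float, size: int, rng: np.random.Generator) -> np.ndarray:
    """Draw negative binomial counts with mean mu and inverse-dispersion theta."""
    p = theta / (theta + mu)          # NumPy's success probability
    return rng.negative_binomial(n=theta, p=p, size=size)


rng = np.random.default_rng(0)
mu, theta = 5.0, 2.0
x = sample_nb(mu, theta, size=200_000, rng=rng)
print("sample mean:", x.mean(), "expected:", mu)
print("sample variance:", x.var(), "expected:", mu + mu**2 / theta)
```

As 𝜃 grows the extra term 𝜇²/𝜃 vanishes and the model approaches a Poisson, whose variance equals its mean; small 𝜃 corresponds to strong overdispersion.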

59 of 76

Single cell multi-omics

Stuart and Satija, 2019

Biological question: How are two different types of molecules related within each cell?

60 of 76

Lectures 20-23

CRISPR, GWAS and Epigenomics

61 of 76

Genome editing

Genomes - Brown

Genome editing using a programmable nuclease (usually Cas9 protein) is currently the most efficient way to inactivate a specific gene

Principle: Direct a nuclease to a specific site where it makes a double-strand cut

The cut stimulates a natural repair process, nonhomologous end-joining (NHEJ), which in eukaryotes joins the DNA strands back together
Cut position is specified by the 20-nucleotide guide RNA (gRNA), which must be designed to base pair to a target site immediately upstream of a 5’–NGG–3’ or 5’–NAG–3’ sequence
NHEJ is error prone → can result in a short insertion or deletion at the repair site → often disrupts the open reading frame (ORF) → less or no protein from this gene
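
The design constraint on the guide can be turned into a simple search: find every 20-nucleotide stretch that is immediately followed by NGG or NAG. The sketch below is a toy scan of the given strand only (no reverse complement, no off-target scoring); the function name and the example sequences are made up.

```python
import re


def find_cas9_targets(dna: str, guide_len: int = 20):
    """Return (start, target, motif) for sites followed by a 5'-NGG-3' or 5'-NAG-3' motif."""
    dna = dna.upper()
    # Lookahead so that overlapping candidate sites are all reported
    pattern = re.compile(rf"(?=([ACGT]{{{guide_len}}})([ACGT][AG]G))")
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(dna)]


toy = "A" * 20 + "TGG"
print(find_cas9_targets(toy))
```

Each returned `target` is a candidate sequence for the gRNA to pair with, and `motif` is the adjacent three-base sequence (commonly called the PAM).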

62 of 76

How does Cas9 work in eukaryotes?

Source

63 of 76

The faults in our DNA

https://archive.ph/Jkn1L

Kato et al., 2018

Sickle cell disease: Two mutations to sickle

64 of 76

The faults in our DNA

What made treating sickle cell anaemia possible?

Discovery of CRISPR-Cas9
Advancements in genomic technologies
Statistical methods for genomics

https://www.nature.com/articles/s41467-021-25298-9

https://www.nature.com/articles/549S28a

65 of 76

Genome-wide association studies (GWAS)

Uffelmann et al. 2021

Visscher et al. 2017

Key idea

Genotype individuals from multiple cohorts (could be observational)
Associate genotypes with traits (e.g. correlate height of individual with a single nucleotide polymorphism [SNP] at a particular locus)
Traits could be anything: diseases, disorders, physical characteristics.
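
For a single SNP and a quantitative trait, the association step can be sketched as a linear regression of the trait on the genotype coded as 0, 1 or 2 copies of an allele. Everything below is simulated and illustrative: the allele frequency, the effect size of 1.5 cm per allele and the function name are assumptions, and a real analysis would add covariates (e.g. ancestry) and repeat the test across all SNPs with multiple-testing correction.

```python
import numpy as np


def snp_association(genotypes: np.ndarray, trait: np.ndarray):
    """Fit trait ~ intercept + beta * genotype; return (beta, standard error, t statistic)."""
    g = genotypes.astype(float)
    X = np.column_stack([np.ones_like(g), g])           # design matrix with intercept
    coef, *_ = np.linalg.lstsq(X, trait, rcond=None)    # ordinary least squares
    resid = trait - X @ coef
    sigma2 = resid @ resid / (len(trait) - 2)           # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)               # covariance of the estimates
    se = float(np.sqrt(cov[1, 1]))
    return float(coef[1]), se, float(coef[1]) / se


rng = np.random.default_rng(1)
n = 2000
genotypes = rng.binomial(2, 0.3, size=n)                       # 0/1/2 allele copies
height = 170 + 1.5 * genotypes + rng.normal(0, 6, size=n)      # simulated trait (cm)
beta, se, t = snp_association(genotypes, height)
print(f"estimated effect per allele: {beta:.2f} cm (SE {se:.2f}, t = {t:.1f})")
```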

66 of 76

Where did the original Indians come from?

Biological question: Do two individuals have shared ancestry?

LaFramboise 2009

https://genomeofindia.substack.com/p/genome-10-udgam-of-india-a-genetic

Moorjani et al. 2013

Solution:

SNP arrays [BIO]
Statistical genetics [STATS]

Kerdoncuff et al. 2023

67 of 76

Statistical models for deciphering gene regulation

Source

Biological question: How do transcription factors (enhancers/promoters) influence gene expression?

Solution:

Technological advancement [BIO]
Statistical models for modeling DNA fragments [STATS]

68 of 76

What is epigenomics?

Epigenetics (epigenomics) = Study of changes to DNA that do not involve alterations in the DNA sequence.

For our current focus: It would mean study of chemical modifications to the chromatin

Can we ‘sequence’ it?

69 of 76

Tales of: Histones, Chromatin, Nucleosomes

Source

Histones = Arginine/Lysine rich proteins

5 families: H1/H5 (linker), H2A, H2B, H3 and H4

Core histones form an octamer → ~147bp of DNA wraps around the octamer to form a nucleosome

Histone tails contain amino acids that can undergo chemical modification

Chromatin = A complex of DNA and histone proteins. The basic unit of chromatin is the nucleosome

Nucleosome core = Two H2A-H2B dimers + One H3-H4 tetramer

70 of 76

Active and Repressed regions are enriched for distinct modifications

Carter and Zhao

Repressed regions are characterized by regularly spaced nucleosomes enriched for DNA methylation and histone modifications such as H3K27me3 → Repression arises from the inability of TFs to bind enhancers and promoters
Active regions are enzymatically accessible and enriched for acetylated histone modifications such as H3K27ac

71 of 76

Where from here?

Other courses, grad school, industry

72 of 76

One simple piece of advice for approaching anyone: SHOW, don’t TELL

73 of 76

Other courses

KCDH

*Computational genomics of ageing (Autumn 2025)
*Introduction to data science for genomics (Fall 2025)

Coursera

74 of 76

Grad Schools

Programs are increasingly becoming cross-disciplinary
You can do computational biology in any of these departments (non-exhaustive):

Chemical Engineering
Biosciences/Genomic sciences
(Applied) Mathematics
Statistics
Genetics
Chemistry
Physics
Computer Science
Electrical engineering
Environmental Sciences

While applying for grad schools, more than the ranking, you should look at whether the school has the right ecosystem for you to flourish
A PhD is for picking up the right skills (you are still training), so please approach it with an open mind
