Let’s go back to the start
Saket Choudhary
Introduction to computational multi-omics
DH 607
Lecture 24 || Wednesday, 30th October 2024
Logistics
Welcome to the last class of DH 607!
“Somewhere, something incredible is waiting to be known”
Carl Sagan
American astronomer and planetary scientist
But we just got started…
What was the course about?
Goal 1: Give you a flavour of science
Goal 2: Equip you with a fundamental analytical framework to answer your own questions (broadly in genomics)
What was the course about?
Course vignettes
(from Lecture 1)
Structure of DNA
Building blocks of DNA
A pairs with T; G pairs with C; Double helix
DNA as a template for its own duplication
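A minimal sketch of how the pairing rules make each strand a template for the other. The function name `reverse_complement` and the example sequence are illustrative, not from the lecture.

```python
# Watson-Crick pairing rules: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written 5'->3'."""
    # The two strands run antiparallel, so we reverse before complementing.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


template = "ATGCGT"  # hypothetical example sequence
partner = reverse_complement(template)
print(template, partner)
```

Applying the function twice returns the original strand, which is why either strand is enough to rebuild the duplex.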
What is ‘genomics’?
Dr. Thomas Roderick
‘Genomics’
"I propose the expression Genom for the haploid chromosome set, which, together with the pertinent protoplasm, specifies the material foundation of the species" – Hans Winkler
Why ‘-ome’?
‘Genome’
Lectures 01-04
P-values
Sequence probabilities
Looking at sequences probabilistically
Lectures 05-08
DNA-sequencing
Aligning sequences
Biological problem: How similar are two DNA/RNA/protein sequences?
Solution:
The alignment problem
Global alignment
Local alignment
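A minimal sketch of the dynamic-programming idea behind both flavours of alignment: Needleman-Wunsch for global and Smith-Waterman for local. The scoring scheme (match +1, mismatch -1, linear gap -1) is an assumption chosen for illustration, and only the optimal score is returned, without a traceback.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Optimal alignment score of sequences a and b by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                # Local alignment: a negative prefix is dropped (restart at 0).
                cell = max(cell, 0)
            score[i][j] = cell
            best = max(best, cell)
    # Local: best cell anywhere. Global: bottom-right cell.
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "AGT"))
print(alignment_score("TTTGATTACATTT", "CCGATTACACC", local=True))
```

The only differences between the two modes are the initialisation of the first row and column, the floor at zero, and where the final score is read from.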
Searching for sequences in large scale databases
Biological problem: Given a biological sequence, what is its likely function, judged against all known biological sequences that are close to it?
Solution:
Querying large databases: BLAST
The ‘eureka’ moment of PCR: Copying segments of DNA
Kary Mullis, ‘Inventor’ of PCR
What is PCR?
Goal: Make multiple copies of a given DNA molecule OR ‘Amplify’ DNA
Recipe:
Mechanistic steps:
Generations of sequencing
Second generation DNA sequencing
~ 2000s
Third generation
~ 2010s
Genomics and Sequencing by synthesis
Biological question: How to determine the DNA sequence of all molecules in a (tissue) sample in a high-throughput fashion?
Solution: Bridge amplification [BIO]
Second generation sequencing - Using fluorescently labelled deoxynucleotides
Key idea: At each step the modified polymerase incorporates one and only one fluorescently labelled deoxynucleotide
Third gen: Real-time single-molecule sequencing using PacBio: 1kb to 20kb
Key idea: Avoid delay at detection step
ZMW (zero-mode waveguide): Hole with double-stranded DNA + polymerase
Limitations: Higher error rate: 11-12% vs 0.1% for Illumina (short read sequencing)
Third gen: Real time single molecule sequencing using Nanopore: 10kb-100kb
Oxford Nanopore MinION
Key idea:
Limitations: Higher error rate (~14% but seems to have improved over the past year)
The after-effects of the Human Genome Project: Using the human genome as a reference
The assembled genome is used as a ‘reference’ to ‘map’ newly sequenced DNA fragments
Aligning sequences with reduced memory footprint
Biological problem: How to align short sequences to a large reference genome without blowing up the computer memory?
Solution:
Burrows Wheeler Transform
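A minimal sketch of the transform itself, assuming a `$` sentinel that sorts before every other character. This naive version sorts all rotations and is only meant to show what the transform computes; real short-read aligners build it through suffix-array based indexes, which this sketch does not attempt.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    text = text + sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Recover the original text by repeatedly prepending and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The row ending in the sentinel is the original text plus sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


print(bwt("banana"))
print(inverse_bwt(bwt("GATTACA")))
```

The transform is reversible and tends to group identical characters together, which is what makes a compressed, searchable index of a large reference genome possible.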
Lectures 09-15
RNA-seq and differential expression
Transcriptomics: Sequencing the ‘transcriptome’
The need for sequencing transcriptome:
Bulk RNA-seq: Unbiased and high-throughput profiling of transcriptome
Mapping reads to transcripts
Biological problem: What are the expression levels (mRNA) of the gene?
Solution:
Pre-processing -omics datasets
Analytical questions:
Techniques:
Principal Component Analysis - The recipe
PCA: the optimization
Interpreting PCA
PCA of the morphology of shoulder blade captures differences between humans and other primates.
PC1 = Orientation of spine to the shoulder blade
PC2 = Difference in borders of the shoulder blade
How to reverse PCA?
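$M_k = U_k \Sigma_k V_k^T$ (rank-$k$ approximation of the centered data matrix $M$)
$M V_1$ (PC scores: projection of $M$ onto the first principal component)
$M V_1 V_1^T$ (reconstruction of $M$ from the first principal component)
PCA reconstruction = PC scores ⋅ Eigenvectors^T + Mean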
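A minimal sketch of the reconstruction formula using NumPy's SVD. The function name `pca_reconstruct` and the random test matrix are illustrative assumptions, not the lecture's data.

```python
import numpy as np


def pca_reconstruct(X: np.ndarray, k: int) -> np.ndarray:
    """Reconstruct X (samples x features) from its top-k principal components."""
    mean = X.mean(axis=0)
    M = X - mean                                   # centre each feature
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    Vk = Vt[:k].T                                  # eigenvectors, shape (features, k)
    scores = M @ Vk                                # PC scores, shape (samples, k)
    return scores @ Vk.T + mean                    # scores . eigenvectors^T + mean


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                       # hypothetical data
for k in (1, 3, 5):
    error = np.linalg.norm(X - pca_reconstruct(X, k))
    print(k, round(float(error), 3))
```

Using all components recovers the data exactly (up to floating-point error), and the reconstruction error can only shrink as more components are kept.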
Statistical models for handling omics data
Biological question: Are differences between the two samples (cases/controls) biological or technical?
Solution:
Choosing a test when you have two conditions
Parametric tests = Assume a distribution described by parameters → used when the distribution is known
Non-parametric tests → used when the distribution is unknown
Genes across two conditions are unpaired, so we will use unpaired tests for most of the course
P-value revisited
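[Figure: distribution of T under H0. The two tails beyond the critical values T_{α/2} and T_{1-α/2} each have area α/2 and contain the significant findings; null findings lie in between. The p-value is the tail area beyond the observed statistic T_obs.]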
P-value = Probability of sampling a test statistic at least as extreme as the observed test statistic if the null hypothesis is true
We “reject” the null hypothesis (H0) if the p-value is below the threshold (α)
Under the null, the p-value follows a uniform distribution (proof: on the board)
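A minimal simulation sketch of the uniformity claim, assuming a two-sided z-test whose statistic is standard normal under H0. The test choice, sample size, and seed are assumptions for illustration.

```python
import math
import random


def two_sided_p_value(z: float) -> float:
    """P(|Z| >= |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_null_p_values(n_tests: int, seed: int = 0) -> list:
    """Draw test statistics under H0 and convert each to a p-value."""
    rng = random.Random(seed)
    return [two_sided_p_value(rng.gauss(0.0, 1.0)) for _ in range(n_tests)]


p_values = simulate_null_p_values(20000)
alpha = 0.05
fraction_rejected = sum(p < alpha for p in p_values) / len(p_values)
fraction_below_half = sum(p < 0.5 for p in p_values) / len(p_values)
print(fraction_rejected, fraction_below_half)
```

If the p-values are uniform on [0, 1], roughly a fraction α of them falls below any threshold α, which is why α is also the false-positive rate of a single test.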
Type I,II errors and Power
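[Figure: distributions of T under H0 and under HA with the critical values T_{α/2} and T_{1-α/2}. The tails of the H0 distribution beyond the critical values are false positives; the part of the HA distribution between the critical values is false negatives; the part of the HA distribution beyond them is power.]
The false-positive rate is the probability of incorrectly rejecting H0.
The false-negative rate is the probability of failing to reject H0 when HA is true.
Power = 1 - false-negative rate = probability of correctly rejecting H0.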
Types of error
Error rates for multiple hypotheses
                         Test statistic
                         Not significant       Significant           Total
Null hypothesis true     True Negative (U)     False Positive (V)    G0
Alternative true         False Negative (T)    True Positive (S)     G1
Total                    W                     R
Family-wise error rate (FWER)
= probability of at least one false positive
= Pr(V > 0)
= the significance level α when only a single hypothesis is tested
False-discovery rate (FDR)
= expected proportion of false positives among the rejected hypotheses
= E[V/R]
By convention, V/R ≡ 0 if R = 0.
FWER: Family-wise error rate
Bonferroni correction
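A minimal sketch of why a correction is needed and what Bonferroni does. The FWER formula below assumes m independent tests with all null hypotheses true; the p-values are made up for illustration.

```python
def fwer_independent(alpha: float, m: int) -> float:
    """Pr(at least one false positive) for m independent true nulls tested at level alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for each test whose p-value is at most alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


for m in (1, 10, 100):
    print(m, round(fwer_independent(0.05, m), 3))

example = [0.001, 0.02, 0.04, 0.3]   # hypothetical p-values
print(bonferroni_reject(example))
```

Testing each hypothesis at α/m keeps the FWER at or below α, at the cost of power when m is large, as it is with thousands of genes.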
FDR: False discovery rate
Controlling the FDR using Benjamini-Hochberg
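A minimal sketch of the Benjamini-Hochberg step-up procedure: sort the m p-values, find the largest rank k with p_(k) ≤ (k/m)·q, and reject the k smallest. The function name and the example p-values are illustrative.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where H0 is rejected at the given FDR level."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank                               # keep the largest passing rank
    rejected = [False] * m
    for idx in order[:max_rank]:                          # reject everything up to that rank
        rejected[idx] = True
    return rejected


example = [0.04, 0.03, 0.02, 0.045]   # hypothetical p-values
print(benjamini_hochberg(example))
```

In this example no p-value passes its own threshold until the last rank, yet all four are rejected, because the procedure rejects everything up to the largest passing rank. Bonferroni at α = 0.05 would reject none of them.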
Controlling the FDR using q-values
Lectures 16-19
Single-cell RNA-seq
Single-cell omics
Biological question: How similar or different are the ‘profiles’ of two given cells?
Solution:
Single-cell analysis workflow
scRNA-seq counts (UMIs) matrix
Input of any scRNA-seq workflow:
Image credits:
Azenta.com
Question: How should we ‘normalize’ counts matrix to adjust for non-biological variation?
Counts matrix needs to be ‘normalized’ before any downstream analysis to account for difference in total molecules sequenced
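A minimal sketch of one common normalization, assuming library-size scaling to 10,000 counts per cell followed by log1p. This is only one of several approaches and may not be the method used in the course; the function name and toy matrix are illustrative.

```python
import math


def log_normalize(counts, scale=10_000):
    """counts: list of cells, each a list of per-gene UMI counts."""
    normalized = []
    for cell in counts:
        total = sum(cell)                       # total molecules sequenced in this cell
        if total == 0:
            normalized.append([0.0] * len(cell))
            continue
        # Scale to a common depth, then log-transform (log1p keeps zeros at zero).
        normalized.append([math.log1p(c / total * scale) for c in cell])
    return normalized


# Two hypothetical cells with the same composition but a 10x difference in depth.
toy_counts = [[1, 3], [10, 30]]
norm = log_normalize(toy_counts)
print(norm)
```

After scaling, the two cells get the same values, so the difference in total molecules sequenced no longer looks like a difference in expression.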
Data: Human PBMC Smart-seq3, 3k cells from Hagemann-Jensen et al., NBT 2020
Many genes exhibit technical variation
Challenge: Deeper sequencing results in higher gene counts - how can we adjust for this effect?
Single-cell RNA-seq enables comparison of the transcriptomes of individual cells
Target tissue
Single cell RNA-seq (scRNA-seq)
Solution: scRNA-seq enables molecular measurement of RNA in single cells
Why Single cell?
Cells with varying gene expression pattern
Bulk RNA-seq
Missing co-variation pattern
Single-cell RNA-seq
Reveals co-variation pattern
Single-cell omics: Active playground for statistical methods development
Number of scRNA-seq tools vs. datasets
See a more comprehensive version of the transistor plot here
Data source: https://www.nxn.se/single-cell-studies and https://www.scrna-tools.org/
Negative Binomial distribution for modeling UMIs
Negative Binomial model for modeling scRNA-seq counts:
Mean = $\mu$
Variance = $\mu + \mu^2/\theta$
Inverse-dispersion parameter = $\theta$
Negative binomial model allows capturing the heterogeneity observed in gene counts
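A minimal sketch of the mean-variance relationship under this parameterization, compared with a Poisson model where the variance equals the mean. The numbers are illustrative.

```python
def nb_variance(mu: float, theta: float) -> float:
    """Variance of a negative binomial with mean mu and inverse dispersion theta."""
    return mu + mu ** 2 / theta


mu = 10.0
for theta in (1.0, 5.0, 100.0, 1e9):
    # The Poisson variance would be mu; the extra mu^2/theta term is overdispersion.
    print(theta, nb_variance(mu, theta))
```

Small θ means strong overdispersion. As θ grows, the extra term vanishes and the model approaches a Poisson.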
Single cell multi-omics
Biological question: How are two different types of molecules related within each cell?
Lectures 20-23
CRISPR, GWAS and Epigenomics
Genome editing
Genome editing using a programmable nuclease (usually Cas9 protein) is currently the most efficient way to inactivate a specific gene
Principle: Direct a nuclease to a specific site where it makes a double strand cut
How does Cas9 work in eukaryotes?
The faults in our DNA
Sickle cell disease: Two mutations to sickle
The faults in our DNA
What made treating sickle cell anaemia possible?
Genome-wide association studies
Key idea
Where did the original Indians come from?
Biological question: Do two individuals have shared ancestry?
Solution:
Statistical models for deciphering gene regulation
Biological question: How do transcription factors (enhancers/promoters) influence gene expression?
Solution:
What is epigenomics?
Epigenetics (epigenomics) = Study of changes to DNA and chromatin that do not involve alterations in the DNA sequence.
For our current focus: It would mean study of chemical modifications to the chromatin
Can we ‘sequence’ it?
Tales of: Histones, Chromatin, Nucleosomes
Histones = Arginine/Lysine rich proteins
5 families: H1/H5 (linker), H2A, H2B, H3 and H4
Core histones form an octamer → ~147 bp of DNA wraps around it to form the nucleosome
Histone tails contain amino acids that can undergo chemical modification
Chromatin = A complex of DNA and histone proteins. The basic unit of chromatin is the nucleosome
Nucleosome core = Two H2A-H2B dimers + one (H3-H4)2 tetramer
Active and Repressed regions are enriched for distinct modifications
Where from here?
Other courses, grad school, industry
One simple piece of advice for approaching anyone: SHOW, don’t TELL
Other courses
KCDH
Grad Schools
Industry
Circling back: Questions?