Introduction to the course and the faults in our DNA
Saket Choudhary
Introduction to computational multi-omics
DH 607
Lecture 01 || Monday, 28th July 2025
Welcome to DH607!
“Somewhere, something incredible is waiting to be known”
Carl Sagan
American astronomer and planetary scientist
A brief timeline (of the course)
Logistics - Grading
Final grades: RG (Relative grading)
Why so many assignments? Why the project too?
Project ideas
Broad Themes:
LLM Policy
You are allowed to use Large Language Models (LLMs) like ChatGPT, Claude, etc. as learning aids, but you must:
For exams, LLMs will not be permitted.
Be aware that LLMs can make mistakes and should not be considered infallible.
If the TAs detect use of an LLM without attribution → Your assignment will not be graded and will be given a zero → The onus will be on you to come and explain your answer.
For assignments there will be no partial marking for coding questions. For theory questions, you will be expected to explain your answers and reasoning.
We will randomly select a few students to explain their answers during office hours. If you are unable to explain your work, you will receive a zero for that question.
Logistics - Office hour(s)
Please use Piazza (https://piazza.com/iit_bombay/fall2025/dh607/) for all course-related queries - anonymous questions are open!
Prefer email only for personal requests - if you have a question, someone else probably has a similar one.
All course material: https://saket-choudhary.me/DH607
TAs and office hours
Logistics - Reading material
** Course topics developed in collaboration with other colleagues who work in this area and people from industry
Logistics - Programming/Coding
Logistics - Tentative syllabus
** Subject to change as we dive deeper
!!! Collaboration policy and Academic Integrity !!!
“I declare that I will adhere to all principles of academic honesty and integrity throughout my stay in the Institute. I will not seek or give unauthorized assistance in tests, quizzes, examinations or assignments. I will not misrepresent, fabricate or falsify any idea/data/fact/source in my project submissions. I understand that any violation of the above will be cause for disciplinary action as per the rules and regulations of the Institute.”
!!! Collaboration policy and Academic Integrity !!!
DTRT - Do the right thing
Don’t game the system
Don’t try to bend the rules
UG Rule Book: https://acad.iitb.ac.in/files/UG_RULE_BOOK.pdf
PG Rule Book(s): https://acad.iitb.ac.in/academics/rules/pg
Code of conduct
1. Be on time (classes/assignments/projects)
2. Be respectful on Piazza (and all other platforms).
3. Maintain a convivial & collegial atmosphere
Questions and clarifications are always welcome during/after the class or during office hours.
What is the course about?
Computational Multi-omics is a “fancy” term for:
Each subtopic on the left could fill a full-semester course on its own.
Computational biology is hard (I am biased).
When studying hard things, we often lose sight of why we started in the first place.
So, we will instead take a “reverse Bollywood approach” - climax first, details later.
What is the course about?
Goal 1: Give you a flavour of science
Goal 2: Equip you with a fundamental analytical framework to answer your own questions (broadly in genomics)
What is the course about?
Democratizing large- (and small-scale) omics data analysis!
What is this course not about?
Role of computation/statistics/computer science/mathematics/engineering in molecular biology
“Computational biology lets you see the big picture
Another way computers have reshaped biology is by introducing statistics and data analysis methods. A good example is understanding how mutational processes shape genomes [3]. Mutational processes—be it cigarette smoke, sunlight, or defects in homologous recombination—are not visible in individual mutations but only in their global patterns. How often is a C turned into a T? How does this frequency vary depending on the neighbours of the mutated base? How much of this frequency is explained by other features of the genome, like replication timing? Answering these questions helps us to understand basic properties of the mutational processes active in cells, and it is only possible by statistical techniques that identify patterns and correlations”
What role do you have as an engineering student to shape the next frontier of biology?
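The questions in the quote above ("How often is a C turned into a T? How does this frequency vary depending on the neighbours of the mutated base?") amount to counting substitutions by type and flanking bases. A minimal Python sketch follows; the function name `count_substitution_contexts`, the toy reference, and the toy mutation list are hypothetical and chosen only for illustration. It does not collapse complementary strands, which a real mutational-signature analysis would typically do.

```python
from collections import Counter


def count_substitution_contexts(reference, mutations):
    """Count single-base substitutions by type and flanking bases.

    reference: reference DNA sequence as a string.
    mutations: iterable of (position, alt_base) pairs, 0-based positions.
    Returns a Counter keyed by strings such as "A[C>T]G".
    """
    counts = Counter()
    for pos, alt in mutations:
        # A context needs both a left and a right neighbour.
        if pos <= 0 or pos >= len(reference) - 1:
            continue
        ref = reference[pos]
        if ref == alt:
            continue  # not a substitution
        left, right = reference[pos - 1], reference[pos + 1]
        counts[f"{left}[{ref}>{alt}]{right}"] += 1
    return counts


# Toy data (hypothetical): three C>T changes and one G>A change.
reference = "ACGTTCGACG"
mutations = [(1, "T"), (5, "T"), (8, "T"), (2, "A")]
context_counts = count_substitution_contexts(reference, mutations)
for context, n in sorted(context_counts.items()):
    print(context, n)
```

With many mutations, comparing these counts across contexts is what reveals a pattern that no individual mutation shows.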
Why you might want to take this course:
Expectations
Questions?
Goals for today
Who cares?
A simple problem from World War II
Section of plane | Bullet holes per square foot |
Engine | 1.11 |
Fuselage | 1.73 |
Fuel system | 1.85 |
Rest of the plane | 1.50 |
Where should you put the armour?
Usual answer: Put armour where the bullet holes are most numerous (Fuel system)
First principles thinking: Armour goes on the engine!
Course vignettes
The faults in our DNA
Sickle cell disease: Two mutations to sickle
The faults in our DNA
What made treating sickle cell anaemia possible?
The faults in our DNA
Case for model organisms: Effects of mutations in the Kit gene are reproducible across species
The future of therapies is (almost) here…
World's First Patient Treated with Personalized CRISPR Gene Editing
Little KJ Muldoon's blood had high levels of ammonia, a compound that is particularly toxic to the brain
“Hope, despair and CRISPR” - A story from India
Uditi Saraf
Aligning sequences
Biological problem: How similar are two DNA/RNA/Protein sequences?
Solution:
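One classical way to quantify sequence similarity, offered here as an illustration rather than as the course's own solution, is global alignment by dynamic programming (the Needleman-Wunsch algorithm). The sketch below computes only the optimal alignment score. The function name `global_alignment_score` and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b.

    Uses the Needleman-Wunsch recurrence with a linear gap penalty and
    keeps only two rows of the score table in memory.
    """
    # Row 0: aligning an empty prefix of `a` against prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a prefix of `a` against an empty prefix of `b`.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

Higher scores indicate more similar sequences under the chosen scoring scheme; recovering the alignment itself would additionally require storing the full table and tracing back.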
Searching for sequences in large scale databases
Biological problem: Given a biological sequence, what is its likely function, judged against all known biological sequences that are close to it?
Solution:
Genomics and Sequencing by synthesis
Biological question: How to determine the DNA sequence of all molecules in a (tissue) sample in a high-throughput fashion?
Solution: Bridge amplification [BIO]
Aligning sequences with reduced memory footprint
Biological problem: How to align short sequences to a large reference genome without exhausting the computer's memory?
Solution:
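One widely used idea for this problem, shown here as an illustration and not necessarily the approach the course adopts, is to index the reference with the Burrows-Wheeler transform (BWT) and find exact matches by backward search. The sketch below is a toy: the function names are hypothetical, the suffix array is built naively, and the occurrence table is stored in full, so it demonstrates the search logic rather than the memory savings that real aligners obtain through sampling and compression.

```python
def bwt(text):
    """Burrows-Wheeler transform of text, which must end with a unique '$'."""
    # Naive suffix array: sort suffix start positions by the suffix itself.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix.
    return "".join(text[i - 1] for i in suffix_array)


def count_occurrences(pattern, bwt_string):
    """Count exact occurrences of pattern using backward search on the BWT."""
    alphabet = sorted(set(bwt_string))
    # first_row[c]: number of characters in the text smaller than c.
    first_row, total = {}, 0
    for c in alphabet:
        first_row[c] = total
        total += bwt_string.count(c)
    # occ[c][i]: number of occurrences of c in bwt_string[:i].
    occ = {c: [0] for c in alphabet}
    for ch in bwt_string:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    # Narrow the range of sorted suffixes that start with the pattern suffix.
    lo, hi = 0, len(bwt_string)
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + occ[ch][lo]
        hi = first_row[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("ACGACGT$")
print(count_occurrences("ACG", index))
print(count_occurrences("TT", index))
```

Each search step touches only the index, never the original text, which is what lets compressed versions of this structure stay small even for large genomes.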
Transcriptomics: Sequencing the ‘transcriptome’
The need for sequencing transcriptome:
Mapping reads to transcripts
Biological problem: What are the expression levels (mRNA) of the gene?
Solution:
Pre-processing -omics datasets
Analytical questions:
Techniques:
Statistical models for handling omics data
Biological question: Are differences between the two samples (cases/controls) biological or technical?
Solution:
Single-cell omics
Biological question: How similar or different are the ‘profiles’ of two given cells?
Solution:
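One simple way to compare two cells, given here as an illustrative assumption rather than the method the course prescribes, is the cosine similarity between their expression profiles (vectors of per-gene counts over the same genes in the same order). The function name `cosine_similarity` and the toy counts are hypothetical.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two equal-length expression profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genes")
    dot = sum(x * y for x, y in zip(profile_a, profile_b))
    norm_a = math.sqrt(sum(x * x for x in profile_a))
    norm_b = math.sqrt(sum(y * y for y in profile_b))
    # A cell with no counts has no direction; report zero similarity.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy counts for three genes in three cells (hypothetical values).
cell_1 = [10, 0, 5]
cell_2 = [20, 0, 10]  # same proportions as cell_1, twice the depth
cell_3 = [0, 8, 0]    # expresses a different gene entirely
print(cosine_similarity(cell_1, cell_2))
print(cosine_similarity(cell_1, cell_3))
```

Cosine similarity ignores overall sequencing depth, which is why `cell_1` and `cell_2` compare as near-identical despite different totals; other measures (for example correlation or Euclidean distance on normalized data) make different trade-offs.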
How similar are the transcriptomes across species?
Biological question: How can we characterize similar cell states across thousands to millions of single cells from different species?
Techniques:
Single-cell omics: Active playground for statistical methods development
Number of scRNA-seq tools vs. datasets
See a more comprehensive version of the transistor plot here
Data source: https://www.nxn.se/single-cell-studies and https://www.scrna-tools.org/
Statistical models for understanding biological causal circuits
Biological question: How can we predict and manipulate the behaviour of cells?
Solution:
Single cell multi-omics
Biological question: How are two different types of molecules related within each cell?
Statistical models for deciphering gene regulation
Biological question: How do transcription factors (enhancers/promoters) influence gene expression?
Solution:
What is ageing?
Ageing ≅ Accumulation of chemical damage to our cells and molecules over time
Statistical models for understanding ageing at molecular level
Genome-wide association studies
Key idea
Where did the original Indians come from?
Biological question: Do two individuals have shared ancestry?
Solution:
Deep learning applications
Goal: Learn the sequence grammar that determines gene expression
Questions?