1 of 55

Introduction to the course and The faults in our DNA

Saket Choudhary

saketc@iitb.ac.in

Introduction to computational multi-omics

DH 607

Lecture 01 || Monday, 28th July 2025

2 of 55

Welcome to DH607!

“Somewhere, something incredible is waiting to be known”

Carl Sagan

American astronomer and planetary scientist

3 of 55

A brief timeline (of the course)

IITB ChemE 2009-2014 → Computational Biology and Bioinformatics (USC)
I joined KCDH in March 2024
Sent a proposal for a new course in April 2024
UGPGPC approved course in May 2024
First run of course in Autumn 2025 with 57 people

4 of 55

Logistics - Grading

Assignments: 30% (Best n-1 out of n)

Due on Fridays 5pm via Gradescope
Weightage: 30/(n-1)% each
Late submission policy: 10% penalty per day, up to a maximum of 6 days
One submission per student (Attribute if you discussed with someone or used LLMs)
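
The assignment arithmetic above can be sketched as follows. This is an illustration, not the official grading script. It assumes scores are fractions in [0, 1], that each late day removes 10% of the earned score, and that a submission more than 6 days late earns zero; the slide does not spell out these details. The function names are hypothetical.

```python
def late_adjusted(score, days_late, penalty_per_day=0.10, max_late_days=6):
    """Apply the late penalty to a score in [0, 1].

    Assumption: the penalty is a fraction of the earned score, and anything
    later than max_late_days is treated as zero.
    """
    if days_late > max_late_days:
        return 0.0
    return score * (1.0 - penalty_per_day * days_late)


def assignment_component(scores, total_weight=30.0):
    """Best n-1 out of n assignments, each worth total_weight / (n - 1) percent."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two assignments to drop one")
    best = sorted(scores, reverse=True)[: n - 1]  # drop the lowest score
    weight_each = total_weight / (n - 1)
    return sum(s * weight_each for s in best)


if __name__ == "__main__":
    scores = [late_adjusted(0.9, 0), late_adjusted(0.8, 2), 1.0, 0.0]
    print(round(assignment_component(scores), 2))
```

With four assignments, each of the best three counts for 10% of the final grade, so one missed assignment does not reduce the 30% component.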

Mid-sem: 22.5%

Closed book and offline (no collaboration)

End-sem: 12.5%

Closed book and offline (no collaboration)

Course project: 25%

Groups of at most 3 members
Topics: Late August

Surprise Quizzes: 10%

Final grades: RG (Relative grading)

5 of 55

Why so many assignments? Why the project too?

You might not remember:

this course
the instructor’s name
most things (anything) taught in the course

But you will (hopefully) remember that cool project (and hopefully a cool assignment) you worked with your teammates/friends on in your X^th year at college
Huge potential to make an impact → put out a preprint, release your first open-source bioinformatics package, or anything else…

6 of 55

Project ideas

Broad Themes:

Methods benchmarking and development
Re-analysis of published datasets
Application of statistical methods (“AI/ML”) for solving problems in genomics
Data integration across modalities
Your theme - as long as it aligns with broad objectives of the course

7 of 55

LLM Policy

You are allowed to use Large Language Models (LLMs) like ChatGPT, Claude, etc. as learning aids, but you must:

Clearly document when and how you used an LLM in your submission
Ensure you understand the solutions provided by the LLM
Be prepared to explain your work during office hours or exams
Not rely solely on LLM-generated code without understanding

For exams, LLMs will not be permitted.

Be aware that LLMs can make mistakes and should not be considered infallible.

If the TAs detect use of LLM without attribution → Your assignment will not be graded and will be given a zero → The onus will be on you to come and explain your answer.

For assignments there will be no partial marking for coding questions. For theory questions, you will be expected to explain your answers and reasoning.

We will randomly select a few students to explain their answers during office hours. If you are unable to explain your work, you will receive a zero for that question.

8 of 55

Logistics - Office hour(s)

Lecture: Mondays and Thursdays, 3:30pm – 4:55pm, ESE 113, Energy Science and Engineering Building (GMaps coordinates)
Instructor Office: B-22, KCDH, KReSIT Basement
Instructor Office Hours: Wednesdays, 4:00 - 5:00pm or by appointment
For appointments outside office hours: https://cal.com/saketkc/
Contact: saketc@iitb.ac.in | Ext: 3785 (+91 22 2159 3785)

Please use Piazza (https://piazza.com/iit_bombay/fall2025/dh607/) for all course related queries - anonymous questions are open!

Use email preferably only for personal requests - if you have a question, someone else might also have a similar one.

All course material: https://saket-choudhary.me/DH607

9 of 55

TAs and office hours

Shubham Thakur

shubham.thakur@iitb.ac.in

Mondays 2:00 PM - 3:30 PM

(B-20 ASL Lab, KRESIT Basement)

Souparna Bhowmik

25d1623@iitb.ac.in

Grading TA

Gaurav Devendra Jain

210040050@iitb.ac.in

Grading TA

10 of 55

Logistics - Reading material

Material is based on a mix of topics** across genomics
No one textbook
Slides will be text + (digitally) handwritten
Lecture will contain sufficient references + key papers

** Course topics developed in collaboration with other colleagues who work in this area and people from industry

11 of 55

Logistics - Programming/Coding

We will cover some preliminary coding exercises in the hands-on class
Please bring your laptops for hands-on sessions (will be announced)
But if this is your first exposure to programming, please use the programming resources to familiarize yourself
R installation instructions and a brief tutorial are available on the course website

12 of 55

Logistics - Tentative syllabus

https://saket-choudhary.me/DH607/syllabus.html

** Subject to change as we dive deeper

13 of 55

!!! Collaboration policy and Academic Integrity !!!

You are expected to work on your own for the most part of the course.
For assignment problems, if you get stuck, you are welcome to discuss with other students (in person, or online on Piazza). However, the solutions must be your own work. If you discussed with someone, please mention their name and what you received help with in your submission. If you do not attribute and we find similarities in the final submissions - this will automatically count as plagiarism!
Mid-semester exam (closed book). No collaboration is allowed.
Write/speak what you understand. If you write something, it is assumed you understand it - and hence are open to being quizzed on it
Simply: DTRT - Do the right thing

See Policy

“I declare that I will adhere to all principles of academic honesty and integrity throughout my stay in the Institute. I will not seek or give unauthorized assistance in tests, quizzes, examinations or assignments. I will not misrepresent, fabricate or falsify any idea/data/fact/source in my project submissions. I understand that any violation of the above will be cause for disciplinary action as per the rules and regulations of the Institute.”

14 of 55

!!! Collaboration policy and Academic Integrity !!!

DTRT - Do the right thing

Don’t game the system

Don’t try to bend the rules

UG Rule Book: https://acad.iitb.ac.in/files/UG_RULE_BOOK.pdf

PG Rule Book(s): https://acad.iitb.ac.in/academics/rules/pg

See Policy

15 of 55

!!! Collaboration policy and Academic Integrity !!!

I will co-operate with the Institutes authorities in maintaining discipline, academic standards and good order in the Campus.
I declare that I will adhere to all Principles of academic honesty and integrity throughout my stay in the Institute, I will not seek or give unauthorized assistance in test, quizzes, examinations or assignments. I will not misrepresent, fabricate or falsify any idea/ data/ facts/ source in my project submissions. I understand that any violations of the above will be cause for disciplinary action as per the rules and regulations of the Institute.

See Policy

16 of 55

Code of conduct

1. Be on time (classes/assignments/projects)

2. Be respectful on Piazza (and all other platforms).

3. Maintain a convivial & collegial atmosphere

Questions and clarifications are always welcome during/after the class or during office hours.

17 of 55

What is the course about?

Computational Multi-omics is a “fancy” term for:

Biology
Mathematics:

Linear algebra
Discrete mathematics (combinatorics)
Calculus

Probability and Statistics:

Probability Theory
Applied and theoretical statistics

Computer Science:

Data structures and algorithms
Programming and software engineering

Each subtopic above can be studied in isolation as a full-semester course.

Computational biology is hard (I am biased).

In the course of studying hard things, we often lose sight of why we started in the first place.

So, we will instead take a “reverse bollywood approach” - climax first, details later.

https://x.com/alfiyastic/status/1817932569515638980

18 of 55

What is the course about?

Goal 1: Give you a flavour of science

Science: Essence of science is “inquiry”: concrete descriptions of what we observe; theories about what drives those observations
Engineering: Essence of engineering is “design”: expands the scope of human plans and results

Goal 2: Equip you with fundamental analytical framework to answer your own questions (broadly in genomics)

Source

19 of 55

What is the course about?

Eric Drexler, Radical Abundance

Source

20 of 55

What is the course about?

Fundamental principles of how genes are regulated
Techniques for profiling the various modalities (DNA/RNA/Epigenetic marks) in a high-throughput fashion (DNA-seq/RNA-seq/*-seq)
Statistical principles for analysing large scale data
Analytical techniques for large scale (multi-omics) datasets
Framework to identify factors underlying human diseases

Democratizing large- (and small-scale) omics data analysis!

21 of 55

What is this course not about?

Using online tools to do bioinformatics
I am looking for answers to bioinformatic questions that go unanswered on forums
How to use <my_favorite_tool> for processing genomics data
What is the <best_tool> for my workflow
How should I run <someone’s_favorite_tool> on <my_dataset>
I am getting this <error> while installing <my_favorite_tool>
I want to learn <R/Python/C++>
I want to learn cell biology or ALL of genomics or biochemistry

22 of 55

Role of computation/statistics/computer science/mathematics/engineering in molecular biology

“Computational biology lets you see the big picture

Another way computers have reshaped biology is by introducing statistics and data analysis methods. A good example is understanding how mutational processes shape genomes [3]. Mutational processes—be it cigarette smoke, sunlight, or defects in homologous recombination—are not visible in individual mutations but only in their global patterns. How often is a C turned into a T? How does this frequency vary depending on the neighbours of the mutated base? How much of this frequency is explained by other features of the genome, like replication timing? Answering these questions helps us to understand basic properties of the mutational processes active in cells, and it is only possible by statistical techniques that identify patterns and correlations”
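
The questions in the quote ("How often is a C turned into a T? How does this frequency vary depending on the neighbours?") amount to counting substitutions by their flanking bases. A minimal sketch follows, assuming mutations are given as hypothetical (0-based position, alternate base) pairs against a reference string. Real mutational-signature analyses also fold strand-equivalent changes into 96 classes, which is omitted here.

```python
from collections import Counter


def context_counts(reference, mutations):
    """Count substitutions by trinucleotide context, e.g. 'A[C>T]G'.

    reference: DNA string; mutations: iterable of (position, alt_base).
    Mutations at the first or last base are skipped because they lack
    a neighbour on one side.
    """
    counts = Counter()
    for pos, alt in mutations:
        if pos <= 0 or pos >= len(reference) - 1:
            continue
        left, ref, right = reference[pos - 1], reference[pos], reference[pos + 1]
        if ref == alt:
            continue  # not a substitution
        counts[f"{left}[{ref}>{alt}]{right}"] += 1
    return counts


if __name__ == "__main__":
    ref = "ACGTCGA"
    muts = [(1, "T"), (4, "T")]  # two C>T changes in different contexts
    for context, n in sorted(context_counts(ref, muts).items()):
        print(context, n)
```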

What role do you have as an engineering student to shape the next frontier of biology?

Source

23 of 55

Why you might want to take this course:

Blend of mathematics, statistics, computer science, software engineering and biology
First-principles approach to biological problems
Genomics is ubiquitous abroad:

Academic labs
Industry: Genentech, 10XGenomics, Illumina, 23andMe, Qiagen, Thermo Fisher, Mammoth …

Genomics is becoming ubiquitous in India

Academic labs
Industry: Strand genomics, MedGenome, MapMyGenome, Precision health, …

Learn how to analyse (“do hands-on data science”) large-scale datasets

24 of 55

Expectations

Some prior exposure to biology is great but not necessary
Some prior exposure to mathematics - linear algebra and probability preferably
Put more effort and time if the material is extremely new to you
Ask questions
Do not speak/write/use things (particularly tools) you do not understand
Do not use brute-force to make things work somehow; but do explore (more on this later)

25 of 55

Questions?

26 of 55

Goals for today

A short(est) introduction to molecular biology
What is ‘Genomics’?
20,000 feet picture of the course

Who cares?

How do cells operate at the molecular level?
What goes wrong in the cell machinery during a disease?
Where did we come from?
…

27 of 55

A simple problem from World War II

Section of plane	Bullet holes per square foot
Engine	1.11
Fuselage	1.73
Fuel system	1.85
Rest of the plane	1.50

Where should you put the armour?

https://www.jstor.org/stable/2288257

28 of 55

A simple problem from World War II

Section of plane	Bullet holes per square foot
Engine	1.11
Fuselage	1.73
Fuel system	1.85
Rest of the plane	1.50

Where should you put the armour?

Usual answer: Put armour where the bullet holes are most dense (Fuel system)

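First-principles thinking: Armour goes on the engine! The counts come only from planes that returned, so the section with the fewest holes is the one where a hit was most likely to bring the plane down (survivorship bias).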

https://www.jstor.org/stable/2288257
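
The armour puzzle can be reproduced with a small simulation. The sketch below assumes hits land uniformly per square foot and that each hit downs the plane with a section-specific probability. The areas and probabilities are invented for illustration and are not taken from the cited paper.

```python
import random

# Hypothetical section areas (sq ft) and per-hit loss probabilities.
AREA_SQFT = {"Engine": 50, "Fuselage": 200, "Fuel system": 50, "Rest of the plane": 300}
P_LOST_PER_HIT = {"Engine": 0.20, "Fuselage": 0.02, "Fuel system": 0.05, "Rest of the plane": 0.02}


def holes_per_sqft_on_survivors(n_planes=20000, hits_per_plane=10, seed=0):
    """Simulate uniform hits and report hole density only on planes that return."""
    rng = random.Random(seed)
    sections = list(AREA_SQFT)
    weights = [AREA_SQFT[s] for s in sections]  # uniform hits per unit area
    holes = dict.fromkeys(sections, 0)
    survivors = 0
    for _ in range(n_planes):
        hits = rng.choices(sections, weights=weights, k=hits_per_plane)
        # The plane returns only if none of its hits is fatal.
        if all(rng.random() >= P_LOST_PER_HIT[s] for s in hits):
            survivors += 1
            for s in hits:
                holes[s] += 1
    return {s: holes[s] / (AREA_SQFT[s] * survivors) for s in sections}


if __name__ == "__main__":
    for section, density in holes_per_sqft_on_survivors().items():
        print(f"{section:18s} {density:.4f}")
```

Every section is hit at the same rate per square foot, yet the engine shows the fewest holes among returning planes because engine hits are the ones that prevent a return.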

29 of 55

Course vignettes

30 of 55

The faults in our DNA

https://archive.ph/Jkn1L

Kato et al., 2018

Sickle cell disease: Two mutations to sickle

31 of 55

The faults in our DNA

What made treating sickle cell anaemia possible?

Discovery of CRISPR-Cas9
Advancements in genomic technologies
Statistical methods for genomics

https://www.nature.com/articles/s41467-021-25298-9

https://www.nature.com/articles/549S28a

32 of 55

The faults in our DNA

Case for model organisms: Effects of mutations in the Kit gene are reproducible across species

Molecular biology of the cell

33 of 55

The future of therapies is (almost) here…

Source

World's First Patient Treated with Personalized CRISPR Gene Editing

Source

Little KJ Muldoon

Baby Muldoon had inherited two mutations, one from each parent → did not produce the normal form of a crucial enzyme called carbamoyl phosphate synthetase 1 (CPS1)

When the body breaks down protein, it produces nitrogen

Mutations in CPS1 compromised his ability to process the nitrogen →

blood had high levels of ammonia, a compound that is particularly toxic to the brain

Solution: Liver transplant → Months before he becomes eligible + risk of brain damage

Final treatment: Personalised CRISPR therapy

34 of 55

“Hope, despair and CRISPR” - A story from India

Source

Uditi Saraf

Uditi was diagnosed with epilepsy at the age of 9, but the seizures worsened as she grew older

Uditi’s genome had a single-base change in the gene that codes for a protein called neuroserpin that caused tangled polymers to form in her brain cells

Plan was to design a targeted therapy for her but it could not happen in time (for various reasons)

Bottom line: There is a huge opportunity for computational (and molecular) biology to make impact

35 of 55

Aligning sequences

Biological problem: How similar are two DNA/RNA/Protein sequences?

Solution:

Dynamic programming [CS]
Sequence statistics [MATH]
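
As a concrete instance of dynamic programming for alignment, here is a minimal sketch of the Needleman-Wunsch global alignment score. The scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not values prescribed by the course.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score for aligning all of a against all of b."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

Only two rows of the table are kept, so memory grows with the shorter dimension while time stays proportional to len(a) * len(b). Recovering the alignment itself would require the full table or a traceback scheme.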

36 of 55

Searching for sequences in large scale databases

Biological problem: Given a biological sequence, what is its likely function, judged against all known biological sequences that are close to it?

Solution:

Dynamic programming [CS]
Sequence statistics [STATS]

https://blast.ncbi.nlm.nih.gov/Blast.cgi
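
Database search asks for the best-matching region rather than an end-to-end alignment. A minimal sketch of the Smith-Waterman local alignment score is given below; the scoring values are illustrative assumptions. BLAST itself relies on heuristics for speed, so this shows the underlying dynamic programme, not BLAST's implementation.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTTACGTTT", "GGACGGG"))
```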

37 of 55

Genomics and Sequencing by synthesis

Shendure et al. 2017

Biological question: How to determine the DNA sequence of all molecules in a (tissue) sample in a high-throughput fashion?

Solution: Bridge amplification [BIO]

38 of 55

Aligning sequences with reduced memory footprint

Biological problem: How to align short sequences to a large reference genome without blowing up the computer memory?

Solution:

Smart hashing [CS]
Lossless transformation [CS]
Suffix trees [CS]

Ferragina et al., 2005
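
The "lossless transformation" is read here as the Burrows-Wheeler transform (BWT), the reversible permutation behind the compressed index in the cited work; that reading is an assumption. The sketch below builds the BWT naively by sorting rotations and counts pattern occurrences by backward search. It is meant for short strings only, since practical indexes use suffix arrays and sampled rank tables.

```python
from collections import Counter


def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (quadratic memory; toy use only)."""
    s = text + sentinel  # sentinel must sort before every other character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    counts = Counter(bwt_str)
    first_row = {}  # first_row[c] = number of characters smaller than c
    total = 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]
    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        # Rank queries are done by slicing here; real indexes precompute them.
        lo = first_row[ch] + bwt_str[:lo].count(ch)
        hi = first_row[ch] + bwt_str[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, count_occurrences(transformed, "ana"))
```

The search touches the pattern one character at a time and never scans the original text, which is what keeps the memory footprint small when the text is a whole genome.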

39 of 55

Transcriptomics: Sequencing the ‘transcriptome’

Genomes - Brown

The need for sequencing transcriptome:

DNA is the same across cells, but gene expression patterns differ
Changes in the DNA are not necessarily reflected in the expression phenotype

40 of 55

Mapping reads to transcripts

Haas and Zody 2010

Biological problem: What are the expression levels (mRNA) of the gene?

Solution:

Smart hashing [CS]
Graph based pseudomapping [CS]
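
The hashing idea can be sketched as a k-mer index: each k-mer maps to the set of transcripts containing it, and a read is assigned to the transcripts shared by all of its indexed k-mers. This is a simplified illustration with hypothetical transcript names and sequences. It ignores the graph structure, strand, and sequencing errors that real pseudoaligners handle.

```python
def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index


def pseudoalign(read, index, k):
    """Return transcripts compatible with the read (intersection over its k-mers).

    Design choice for this sketch: k-mers absent from the index are skipped.
    """
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if hits is None:
            continue
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


if __name__ == "__main__":
    transcripts = {"t1": "ACGTACGTGG", "t2": "ACGTACCTTA"}  # hypothetical sequences
    index = build_kmer_index(transcripts, k=4)
    print(sorted(pseudoalign("TACGTG", index, k=4)))
```

Counting reads per compatible transcript set is the starting point for estimating expression levels; resolving reads shared by several transcripts needs a statistical model on top.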

41 of 55

Pre-processing -omics datasets

Analytical questions:

Is there enough signal in my data?
Do all samples have the right signal?
Are there ‘batch-effects’ in my data?

Techniques:

Linear and non-linear dimensionality reduction PCA/SVD/tSNE/UMAP [STATS]
Clustering [STATS]
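
As an example of linear dimensionality reduction, the sketch below finds the first principal component by power iteration on the sample covariance matrix, using only the standard library. In practice one would call an SVD routine from a numerical library; the function name and the toy data are illustrative assumptions.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Leading eigenvector of the sample covariance matrix (unit length)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration; assumes the start vector is not orthogonal to the answer.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # data have no variance
        v = [x / norm for x in w]
    return v


if __name__ == "__main__":
    samples = [(x, 2 * x) for x in (-2, -1, 0, 1, 2)]  # points on the line y = 2x
    print([round(x, 4) for x in first_principal_component(samples)])
```

Projecting samples onto the first few components and colouring them by batch or condition is a common way to check for batch effects before any modelling.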

42 of 55

Statistical models for handling omics data

Source

Love et al. (2014)

Biological question: Are differences between the two samples (cases/controls) biological or technical?

Solution:

Model biological and technical noise [STATS]
Multiple hypothesis testing [STATS]
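
Testing thousands of genes at once requires a correction for multiple hypotheses. One widely used choice is the Benjamini-Hochberg procedure for controlling the false discovery rate, sketched below. The function name is illustrative and the p-values in the example are made up.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Genes whose adjusted p-value falls below a chosen threshold (0.05 or 0.1 are common) are reported as differentially expressed, with the threshold bounding the expected fraction of false discoveries.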

43 of 55

Single-cell omics

Yao et al., Nature (2021)

Biological question: How similar or different are the ‘profiles’ of two given cells?

Solution:

Technological advancement [BIO]
Separating technical noise from biological signal [STATS]
Clustering [STATS]
Differential expression [STATS]
Handling large scale data [CS]
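
One simple source of technical noise is sequencing depth: two cells with the same composition can have very different raw counts. The sketch below applies a common log-normalisation (scale to a fixed total, then log1p) before comparing two cells by Pearson correlation. The scale factor of 10,000 and the toy counts are assumptions, and this is only one of several normalisation schemes in use.

```python
import math


def log_normalize(counts, scale=10_000):
    """Scale counts to a fixed library size, then apply log(1 + x)."""
    total = sum(counts)
    return [math.log1p(c / total * scale) for c in counts]


def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    cell_a = [10, 0, 5, 85]
    cell_b = [20, 0, 10, 170]  # same composition, twice the depth
    print(round(pearson(log_normalize(cell_a), log_normalize(cell_b)), 6))
```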

44 of 55

How similar are the transcriptomes across species?

Biological question: Characterize similar cell states across thousands and millions of single-cells across species

Techniques:

Canonical correlation analysis [STATS]
Dimensionality reduction [STATS]

45 of 55

Single-cell omics: Active playground for statistical methods development

Number of scRNA-seq tools v/s datasets

See a more comprehensive version of the transistor plot here

Data source: https://www.nxn.se/single-cell-studies and https://www.scrna-tools.org/

46 of 55

Statistical models for understanding biological causal circuits

Biological question: How can we predict and manipulate the behaviour of cells?

Solution:

Technological advancement [BIO]
Statistical models for causality analysis [STATS]

Source

47 of 55

Single cell multi-omics

Stuart and Satija, 2019

Biological question: How are two different types of molecules related within each cell?

48 of 55

Statistical models for deciphering gene regulation

Source

Biological question: How do transcription factors (enhancers/promoters) influence gene expression?

Solution:

Technological advancement [BIO]
Statistical models for modeling DNA fragments [STATS]

49 of 55

What is ageing?

Ageing ≅ Accumulation of chemical damage to our cells and molecules over time

Source

50 of 55

Statistical models for understanding ageing at molecular level

Source

51 of 55

Statistical models for understanding spatial variation

Source

52 of 55

Genome-wide association studies

Uffelmann et al. 2021

Visscher et al. 2017

Key idea

Genotype individuals from multiple cohorts (could be observational)
Associate genotypes with traits (e.g. correlate height of individual with a single nucleotide polymorphism [SNP] at a particular locus)
Traits could be anything: diseases, disorders, physical characteristics.
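
Associating a genotype with a quantitative trait can be sketched as a simple linear regression of the trait on the allele count (0, 1 or 2 copies of the alternate allele) at one SNP. The data below are made up. Real studies repeat this across millions of SNPs, adjust for covariates and population structure, and correct for multiple testing.

```python
def allele_effect(genotypes, trait):
    """Ordinary least squares of trait on allele count; returns (slope, intercept)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_t = sum(trait) / n
    cov = sum((g - mean_g) * (t - mean_t) for g, t in zip(genotypes, trait))
    var = sum((g - mean_g) ** 2 for g in genotypes)
    if var == 0:
        raise ValueError("SNP is monomorphic in this sample")
    slope = cov / var  # trait change per extra copy of the allele
    return slope, mean_t - slope * mean_g


if __name__ == "__main__":
    genotypes = [0, 1, 2, 0, 1, 2]
    height_cm = [160, 165, 170, 160, 165, 170]  # invented values
    print(allele_effect(genotypes, height_cm))
```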

53 of 55

Where did the original Indians come from?

Biological question: Do two individuals have shared ancestry?

LaFramboise 2009

https://genomeofindia.substack.com/p/genome-10-udgam-of-india-a-genetic

Moorjani et al. 2013

Solution:

SNP arrays [BIO]
Statistical genetics [STATS]

Kerdoncuff et al. 2023

54 of 55

Deep learning applications

Sokolova et al. 2024

Goal: Learn the sequence grammar that determines gene expression
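
Sequence-based models typically receive DNA as a one-hot matrix with one row per base. The sketch below shows that encoding only; the model on top is out of scope here. The A, C, G, T column order and the all-zero row for ambiguous bases are conventions assumed for illustration.

```python
BASES = "ACGT"  # assumed column order


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T).

    Bases outside ACGT (for example N) become an all-zero row.
    """
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == b else 0 for b in BASES])
    return rows


if __name__ == "__main__":
    for base, row in zip("ACGN", one_hot("ACGN")):
        print(base, row)
```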
