1 of 55

Introduction to the course and

The faults in our DNA

Saket Choudhary

saketc@iitb.ac.in

Introduction to computational multi-omics

DH 607

Lecture 01 || Monday, 28th July 2025

2 of 55

Welcome to DH607!

“Somewhere, something incredible is waiting to be known”

Carl Sagan

American astronomer and planetary scientist

3 of 55

A brief timeline (of the course)

  1. IITB ChemE 2009-2014 → Computational Biology and Bioinformatics (USC)
  2. I joined KCDH in March 2024
  3. Sent a proposal for a new course in April 2024
  4. UGPGPC approved the course in May 2024
  5. First run of course in Autumn 2025 with 57 people

4 of 55

Logistics - Grading

  • Assignments: 30% (Best n-1 out of n)
    • Due on Fridays 5pm via Gradescope
    • Weightage: 30/(n-1)% each
    • Late submission policy: 10% penalty per day up to a maximum of 6 days
    • One submission per student (Attribute if you discussed with someone or used LLMs)
  • Mid-sem: 22.5%
    • Closed book and offline (no collaboration)
  • End-sem: 12.5%
    • Closed book and offline (no collaboration)
  • Course project: 25%
    • Groups of at most 3 members
    • Topics: Late August
  • Surprise Quizzes: 10%

Final grades: RG (Relative grading)
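
The "best n-1 out of n" weighting above can be sketched in a few lines. This is only an illustration of the arithmetic: the function name and the example scores are made up, scores are assumed to be fractions between 0 and 1, and the late-submission penalty is deliberately not modelled because the slide does not spell out how it is applied.

```python
def assignment_component(scores, total_weight=30.0):
    """Return the assignment contribution (in % of the final grade).

    scores: fractional scores in [0, 1], one per assignment (n >= 2).
    The lowest score is dropped and the remaining n-1 assignments are
    weighted equally, i.e. total_weight / (n - 1) percent each.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two assignments to drop one")
    kept = sorted(scores, reverse=True)[: n - 1]  # best n-1 of n
    weight_each = total_weight / (n - 1)
    return sum(s * weight_each for s in kept)


# Hypothetical student with four assignments: the 0.5 is dropped.
example = assignment_component([0.9, 0.5, 1.0, 0.8])
```

With four assignments each one kept is worth 10% of the final grade, so one weak assignment does not hurt the total.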

5 of 55

Why so many assignments? Why the project too?

  • You might not remember:
    • this course
    • the instructor’s name
    • most things (anything) taught in the course
  • But you will (hopefully) remember that cool project (and hopefully a cool assignment) you worked with your teammates/friends on in your Xth year at college
  • Huge potential to make an impact → put out a preprint, release your first open-source bioinformatics package, or anything else

6 of 55

Project ideas

Broad Themes:

  • Methods benchmarking and development
  • *Re-analysis of published datasets
  • Application of statistical methods (“AI/ML”) for solving problems in genomics
  • Data integration across modalities
  • Your theme - as long as it aligns with broad objectives of the course

7 of 55

LLM Policy

You are allowed to use Large Language Models (LLMs) like ChatGPT, Claude, etc. as learning aids, but you must:

  1. Clearly document when and how you used an LLM in your submission
  2. Ensure you understand the solutions provided by the LLM
  3. Be prepared to explain your work during office hours or exams
  4. Not rely solely on LLM-generated code without understanding

For exams, LLMs will not be permitted.

Be aware that LLMs can make mistakes and should not be considered infallible.

If the TAs detect use of an LLM without attribution → your assignment will not be graded and will be given a zero → the onus will be on you to come and explain your answer.

For assignments there will be no partial marking for coding questions. For theory questions, you will be expected to explain your answers and reasoning.

We will randomly select a few students to explain their answers during office hours. If you are unable to explain your work, you will receive a zero for that question.

8 of 55

Logistics - Office hour(s)

  • Lecture: Mondays and Thursdays, 3:30pm – 4:55pm, ESE 113, Energy Science and Engineering Building (GMaps coordinates)
  • Instructor Office: B-22, KCDH, KReSIT Basement
  • Instructor Office Hours: Wednesdays, 4:00 - 5:00pm or by appointment
  • For appointments outside office hours: https://cal.com/saketkc/
  • Contact: saketc@iitb.ac.in | Ext: 3785 (+91 22 2159 3785)

Please use Piazza (https://piazza.com/iit_bombay/fall2025/dh607/) for all course related queries - anonymous questions are open!

Use email preferably only for personal requests - if you have a question, someone else might also have a similar one.

9 of 55

TAs and office hours

  • Shubham Thakur (shubham.thakur@iitb.ac.in)
    • Office hours: Mondays 2:00 PM - 3:30 PM (B-20 ASL Lab, KReSIT Basement)
  • Souparna Bhowmik (25d1623@iitb.ac.in)
    • Grading TA
  • Gaurav Devendra Jain (210040050@iitb.ac.in)
    • Grading TA

10 of 55

Logistics - Reading material

  • Material is based on a mix of topics** across genomics
  • No one textbook
  • Slides will be text + (digitally) handwritten
  • Lecture will contain sufficient references + key papers

** Course topics developed in collaboration with other colleagues who work in this area and people from industry

11 of 55

Logistics - Programming/Coding

  • We will cover some preliminary coding exercises in the hands-on classes
  • Please bring your laptops for hands-on sessions (will be announced)
  • If this is your first exposure to programming, please use the programming resources to familiarize yourself
  • R installation instructions and a brief tutorial are available on the website

12 of 55

Logistics - Tentative syllabus

** Subject to change as we dive deeper

13 of 55

!!! Collaboration policy and Academic Integrity !!!

  • You are expected to work on your own for most of the course.
  • For assignment problems, if you get stuck, you are welcome to discuss them with other students (in person, or online on Piazza). However, the solutions must be your own work. If you discussed a problem with someone, please mention their name and what you received help with in your submission. If you do not attribute and we find similarities in the final submissions, this will automatically count as plagiarism!
  • Mid-semester exam (closed book). No collaboration is allowed.
  • Write/speak what you understand. If you write something, it is assumed you understand it, and hence you are open to being quizzed on it.
  • Simply: DTRT - Do the right thing

“I declare that I will adhere to all principles of academic honesty and integrity throughout my stay in the Institute. I will not seek or give unauthorized assistance in tests, quizzes, examinations or assignments. I will not misrepresent, fabricate or falsify any idea/data/fact/source in my project submissions. I understand that any violation of the above will be cause for disciplinary action as per the rules and regulations of the Institute.”

14 of 55

!!! Collaboration policy and Academic Integrity !!!

DTRT - Do the right thing

Don’t game the system

Don’t try to bend the rules

UG Rule Book: https://acad.iitb.ac.in/files/UG_RULE_BOOK.pdf

PG Rule Book(s): https://acad.iitb.ac.in/academics/rules/pg

15 of 55

!!! Collaboration policy and Academic Integrity !!!

  • I will co-operate with the Institute's authorities in maintaining discipline, academic standards and good order in the Campus.
  • I declare that I will adhere to all principles of academic honesty and integrity throughout my stay in the Institute. I will not seek or give unauthorized assistance in tests, quizzes, examinations or assignments. I will not misrepresent, fabricate or falsify any idea/data/fact/source in my project submissions. I understand that any violation of the above will be cause for disciplinary action as per the rules and regulations of the Institute.

16 of 55

Code of conduct

1. Be on time (classes/assignments/projects)

2. Be respectful on Piazza (and all other platforms).

3. Maintain a convivial & collegial atmosphere

Questions and clarifications are always welcome during/after the class or during office hours.

17 of 55

What is the course about?

Computational Multi-omics is a “fancy” term for:

  • Biology
  • Mathematics:
    • Linear algebra
    • Discrete mathematics (combinatorics)
    • Calculus
  • Probability and Statistics:
    • Probability Theory
    • Applied and theoretical statistics
  • Computer Science:
    • Data structures and algorithms
    • Programming and software engineering

Each subtopic above can be studied as a full-semester course in isolation.

Computational biology is hard (I am biased).

While studying hard things, we often lose sight of why we even started.

So, we will instead take a “reverse Bollywood approach”: climax first, details later.

18 of 55

What is the course about?

Goal 1: Give you a flavour of science

  • Science: the essence of science is “inquiry”: concrete descriptions of what we observe; theories about what drives those observations
  • Engineering: the essence of engineering is “design”: it expands the scope of human plans and results

Goal 2: Equip you with a fundamental analytical framework to answer your own questions (broadly in genomics)

19 of 55

What is the course about?

20 of 55

What is the course about?

  • Fundamental principles of how genes are regulated
  • Techniques for profiling the various modalities (DNA/RNA/Epigenetic marks) in a high-throughput fashion (DNA-seq/RNA-seq/*-seq)
  • Statistical principles for analysing large scale data
  • Analytical techniques for large scale (multi-omics) datasets
  • Framework to identify factors underlying human diseases

Democratizing large-scale (and small-scale) omics data analysis!

21 of 55

What is this course not about?

  • Using online tools to do bioinformatics
  • I am looking for answers to bioinformatic questions that go unanswered on forums
  • How to use <my_favorite_tool> for processing genomics data
  • What is the <best_tool> for my workflow
  • How should I run <someone’s_favorite_tool> on <my_dataset>
  • I am getting this <error> while installing <my_favorite_tool>
  • I want to learn <R/Python/C++>
  • I want to learn cell biology or ALL of genomics or biochemistry

22 of 55

Role of computation/statistics/computer science/mathematics/engineering in molecular biology

“Computational biology lets you see the big picture

Another way computers have reshaped biology is by introducing statistics and data analysis methods. A good example is understanding how mutational processes shape genomes [3]. Mutational processes—be it cigarette smoke, sunlight, or defects in homologous recombination—are not visible in individual mutations but only in their global patterns. How often is a C turned into a T? How does this frequency vary depending on the neighbours of the mutated base? How much of this frequency is explained by other features of the genome, like replication timing? Answering these questions helps us to understand basic properties of the mutational processes active in cells, and it is only possible by statistical techniques that identify patterns and correlations”

What role do you have as an engineering student to shape the next frontier of biology?
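
The questions in the quote ("How often is a C turned into a T? How does this frequency vary depending on the neighbours?") come down to counting. A minimal sketch, assuming a toy reference string and a made-up list of substitutions; the function name and data are hypothetical, and unlike standard mutational-signature analyses it does not collapse complementary strands:

```python
from collections import Counter


def mutation_spectrum(reference, mutations):
    """Count substitutions by type and by flanking bases.

    reference: string of A/C/G/T.
    mutations: iterable of (position, alt_base), 0-based positions.
    Returns (type_counts, context_counts), where a context key looks
    like "A[C>T]G": left neighbour, change, right neighbour.
    """
    type_counts = Counter()
    context_counts = Counter()
    for pos, alt in mutations:
        ref = reference[pos]
        if ref == alt:
            continue  # not a substitution
        change = f"{ref}>{alt}"
        type_counts[change] += 1
        # Neighbours exist only for interior positions.
        if 0 < pos < len(reference) - 1:
            left, right = reference[pos - 1], reference[pos + 1]
            context_counts[f"{left}[{change}]{right}"] += 1
    return type_counts, context_counts


# Toy data, invented for illustration only.
reference = "ACGTCCGATCGA"
toy_mutations = [(1, "T"), (4, "T"), (5, "T"), (9, "A"), (2, "A")]
types, contexts = mutation_spectrum(reference, toy_mutations)
```

No single mutation says anything about the process behind it; the pattern only appears once the counts are tallied over many mutations.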

23 of 55

Why you might want to take this course:

  • Blend of mathematics, statistics, computer science, software engineering and biology
  • First-principles approach to biological problems
  • Genomics is ubiquitous abroad:
  • Genomics is becoming ubiquitous in India
  • Learn how to analyse (“do hands-on data science”) large-scale datasets

24 of 55

Expectations

  • Some prior exposure to biology is great but not necessary
  • Some prior exposure to mathematics - linear algebra and probability preferably
  • Put in more effort and time if the material is very new to you
  • Ask questions
  • Do not speak/write/use things (particularly tools) you do not understand
  • Do not use brute-force to make things work somehow; but do explore (more on this later)

25 of 55

Questions?

26 of 55

Goals for today

  1. A short(est) introduction to molecular biology
  2. What is ‘Genomics’?
  3. A 20,000-foot view of the course

Who cares?

  • How do cells operate at the molecular level?
  • What goes wrong in the cell machinery during a disease?
  • Where did we come from?

27 of 55

A simple problem from World War II

Section of plane      Bullet holes per square foot
Engine                1.11
Fuselage              1.73
Fuel system           1.85
Rest of the plane     1.50

Where should you put the armour?

28 of 55

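Usual answer: Put armour where the bullet holes are densest (fuel system)

First principles thinking: Armour goes on the engine! The counts come only from planes that made it back; planes hit in the engine mostly did not return, so the engine looks deceptively undamaged (survivorship bias).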
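
A small simulation makes the selection effect concrete. This is a sketch under invented assumptions: the section areas, the per-hit loss probabilities, and the function name are all hypothetical and are not historical data. Hits land uniformly over the plane, but an engine hit is assumed to be far more likely to bring the plane down.

```python
import random

# Hypothetical numbers chosen only for illustration.
AREA = {"engine": 1.0, "fuselage": 3.0, "fuel system": 1.0, "rest": 5.0}
# Assumed chance that a single hit in this section downs the plane.
LOSS_PER_HIT = {"engine": 0.6, "fuselage": 0.1, "fuel system": 0.1, "rest": 0.1}


def simulate_returning_planes(n_planes=20000, hits_per_plane=8, seed=0):
    """Return (true_density, seen_density) of hits per unit area."""
    rng = random.Random(seed)
    sections = list(AREA)
    weights = [AREA[s] for s in sections]  # hits land uniformly by area
    all_hits = dict.fromkeys(sections, 0)
    survivor_hits = dict.fromkeys(sections, 0)
    survivors = 0
    for _ in range(n_planes):
        hits = rng.choices(sections, weights=weights, k=hits_per_plane)
        # The plane returns only if it survives every individual hit.
        survived = all(rng.random() > LOSS_PER_HIT[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
        if survived:
            survivors += 1
            for s in hits:
                survivor_hits[s] += 1
    true_density = {s: all_hits[s] / (n_planes * AREA[s]) for s in sections}
    seen_density = {s: survivor_hits[s] / (survivors * AREA[s]) for s in sections}
    return true_density, seen_density


true_density, seen_density = simulate_returning_planes()
```

Across all planes the hit density is the same in every section, but among the returning planes the engine shows the fewest holes per unit area, which is the same qualitative pattern as in the table.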

29 of 55

Course vignettes

30 of 55

The faults in our DNA

Sickle cell disease: Two mutations to sickle

31 of 55

The faults in our DNA

What made treating sickle cell anaemia possible?

  • Discovery of CRISPR-Cas9
  • Advancements in genomic technologies
  • Statistical methods for genomics

32 of 55

The faults in our DNA

Case for model organisms: Effects of mutations in the Kit gene are reproducible across species

33 of 55

The future of therapies is (almost) here…

World's First Patient Treated with Personalized CRISPR Gene Editing

Little KJ Muldoon

  • Baby Muldoon had inherited two mutations, one from each parent → did not produce the normal form of a crucial enzyme called carbamoyl phosphate synthetase 1 (CPS1)

  • When the body breaks down protein, it produces nitrogen

  • Mutations in CPS1 compromised his ability to process the nitrogen → blood had high levels of ammonia, a compound that is particularly toxic to the brain

  • Solution: liver transplant → months before he would become eligible + risk of brain damage

  • Final treatment: Personalised CRISPR therapy

34 of 55

“Hope, despair and CRISPR” - A story from India

Uditi Saraf

  • Uditi was diagnosed with epilepsy at the age of 9, but her seizures worsened as she grew older

  • Uditi’s genome had a single-base change in the gene that codes for a protein called neuroserpin that caused tangled polymers to form in her brain cells

  • Plan was to design a targeted therapy for her but it could not happen in time (for various reasons)

  • Bottom line: There is a huge opportunity for computational (and molecular) biology to make an impact

35 of 55

Aligning sequences

Biological problem: How similar are two DNA/RNA/protein sequences?

Solution:

  • Dynamic programming [CS]
  • Sequence statistics [MATH]
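
As a preview of the dynamic-programming bullet, here is a minimal sketch of a global alignment score in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions for illustration; real tools use substitution matrices and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


identical = global_alignment_score("GATTACA", "GATTACA")
one_deletion = global_alignment_score("ACGT", "ACT")
```

Only two rows of the table are kept at a time, so memory grows with the shorter dimension rather than with the full table; recovering the alignment itself (not just the score) would need the full table or a traceback scheme.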

36 of 55

Searching for sequences in large scale databases

Biological problem: Given a biological sequence, what is its likely function, judged against all known biological sequences that are close to it?

Solution:

  • Dynamic programming [CS]
  • Sequence statistics [STATS]
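
Both bullets can be sketched together: a local alignment score (Smith-Waterman style dynamic programming) and a crude significance estimate obtained by shuffling the query. The scoring values, function names, and the shuffling null are assumptions for illustration; production search tools rely on heuristics and analytical score statistics rather than brute-force shuffling.

```python
import random


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below 0: a local alignment can restart anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def empirical_p_value(query, target, n_shuffles=200, seed=0):
    """Fraction of shuffled queries scoring at least as well as the real one."""
    rng = random.Random(seed)
    observed = local_alignment_score(query, target)
    letters = list(query)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if local_alignment_score("".join(letters), target) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_shuffles + 1)


shared_core = local_alignment_score("TTTGATTACATTT", "CCCGATTACACCC")
p = empirical_p_value("TTTGATTACATTT", "CCCGATTACACCC")
```

The score answers "how similar?", while the shuffled-query comparison asks whether that similarity is more than sequence composition alone would produce.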

37 of 55

Genomics and Sequencing by synthesis

Biological question: How to determine the DNA sequence of all molecules in a (tissue) sample in a high-throughput fashion?

Solution: Bridge amplification [BIO]

38 of 55

Aligning sequences with reduced memory footprint

Biological problem: How to align short sequences to a large reference genome without blowing up the computer memory?

Solution:

  • Smart hashing [CS]
  • Lossless transformation [CS]
  • Suffix trees [CS]
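
One well-known lossless transformation used in read aligners is the Burrows-Wheeler transform (BWT); treating the "lossless transformation" bullet as the BWT is an assumption here. The sketch below builds it naively from sorted rotations and inverts it, purely to show that no information is lost. Function names are hypothetical, and real indexes are built from suffix arrays, not like this.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (toy, quadratic memory)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Rebuild the text by repeatedly prepending the BWT column and sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The original text is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


encoded = bwt("ACAACG")
decoded = inverse_bwt(encoded)
```

The transform tends to group identical characters together, which is what makes compressed, searchable genome indexes possible.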

39 of 55

Transcriptomics: Sequencing the ‘transcriptome’

The need for sequencing the transcriptome:

  • DNA is the same across cells but the gene expression pattern is different
  • Changes in the DNA are not necessarily reflected in the expression phenotype

40 of 55

Mapping reads to transcripts

Biological problem: What are the expression levels (mRNA) of each gene?

Solution:

  • Smart hashing [CS]
  • Graph based pseudomapping [CS]
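
The hashing idea can be sketched with a k-mer index: each k-mer points to the transcripts that contain it, and a read is assigned to the transcripts shared by all of its k-mers. This is a heavily simplified stand-in for graph-based pseudomapping, not the implementation of any particular tool; the transcript names, sequences, and function names are invented.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names containing it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all k-mers in the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            break  # no transcript explains every k-mer seen so far
    return compatible or set()


# Two toy transcripts that share a prefix and differ near the end.
transcripts = {
    "tx1": "ACGTACGGATCCA",
    "tx2": "ACGTACGGTTTCA",
}
index = build_kmer_index(transcripts, k=5)
shared = compatible_transcripts("ACGTACGG", index, k=5)
unique = compatible_transcripts("CGGATCC", index, k=5)
```

Reads from the shared prefix stay ambiguous between the two transcripts, which is why expression quantification needs a statistical model on top of the lookup.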

41 of 55

Pre-processing -omics datasets

Analytical questions:

  • Is there enough signal in my data?
  • Do all samples have the right signal?
  • Are there ‘batch-effects’ in my data?

Techniques:

  • Linear and non-linear dimensionality reduction (PCA/SVD/t-SNE/UMAP) [STATS]
  • Clustering [STATS]
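
To give a feel for the linear case, here is a dependency-free sketch that finds the first principal component by power iteration. In practice one would call a library SVD routine; the function name and the toy matrix (six samples, three features, two obvious groups) are made up for illustration.

```python
import math
import random


def first_principal_component(rows, n_iter=200, seed=0):
    """Leading PC of a samples x features matrix via power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(p)]
    for _ in range(n_iter):
        # Multiply by X^T X without forming the covariance matrix.
        proj = [sum(x * w for x, w in zip(row, v)) for row in centred]
        v = [sum(proj[i] * centred[i][j] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    # Sample scores: projection of each centred sample onto the component.
    scores = [sum(x * w for x, w in zip(row, v)) for row in centred]
    return v, scores


# Toy data: the first three samples form one group, the last three another.
toy = [
    [1.0, 1.0, 0.0], [2.0, 2.0, 0.1], [1.5, 1.4, 0.0],
    [8.0, 8.0, 0.1], [9.0, 9.1, 0.0], [8.5, 8.6, 0.1],
]
loadings, scores = first_principal_component(toy)
```

The two groups land on opposite sides of zero along the first component; the same kind of plot is used to check whether samples separate by biology or by batch.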

42 of 55

Statistical models for handling omics data

Biological question: Are differences between the two samples (cases/controls) biological or technical?

Solution:

  • Model biological and technical noise [STATS]
  • Multiple hypothesis testing [STATS]
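
When thousands of genes are tested at once, some small p-values appear by chance alone. One common correction is the Benjamini-Hochberg procedure, which controls the false discovery rate; choosing it here is an assumption, since the slide does not name a method. A minimal sketch with a hypothetical function name and made-up p-values:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
calls = benjamini_hochberg(toy_p)
```

Five of these ten p-values are below 0.05 on their own, but only the two smallest survive the correction.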

43 of 55

Single-cell omics

Biological question: How similar or different are the ‘profiles’ of two given cells?

Solution:

  • Technological advancement [BIO]
  • Separating technical noise from biological signal [STATS]
  • Clustering [STATS]
  • Differential expression [STATS]
  • Handling large scale data [CS]

44 of 55

How similar are the transcriptomes across species?

Biological question: How can we characterize similar cell states across thousands to millions of single cells from different species?

Techniques:

  • Canonical correlation analysis [STATS]
  • Dimensionality reduction [STATS]

45 of 55

Single-cell omics: Active playground for statistical methods development

Number of scRNA-seq tools vs. datasets

See a more comprehensive version of the transistor plot here

46 of 55

Statistical models for understanding biological causal circuits

Biological question: How can we predict and manipulate the behaviour of cells?

Solution:

  • Technological advancement [BIO]
  • Statistical models for causality analysis [STATS]

47 of 55

Single cell multi-omics

Biological question: How are two different types of molecules related within each cell?

48 of 55

Statistical models for deciphering gene regulation

Biological question: How do transcription factors (enhancers/promoters) influence gene expression?

Solution:

  • Technological advancement [BIO]
  • Statistical models for modeling DNA fragments [STATS]

49 of 55

What is ageing?

Ageing ≅ accumulation of chemical damage to our cells and molecules over time

50 of 55

Statistical models for understanding ageing at molecular level

51 of 55

Statistical models for understanding spatial variation

52 of 55

Genome-wide association studies

Key idea

  • Genotype individuals from multiple cohorts (could be observational)
  • Associate genotypes with traits (e.g. correlate height of individual with a single nucleotide polymorphism [SNP] at a particular locus)
  • Traits could be anything: diseases, disorders, physical characteristics.
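
The key idea above, for a single SNP, is a regression of the trait on the number of alternate alleles (0, 1 or 2) each person carries. The sketch below uses a made-up six-person cohort and a hypothetical function name; a real study repeats this test across a very large number of SNPs, adjusts for covariates such as ancestry, and corrects for multiple testing.

```python
import math


def genotype_trait_association(genotypes, trait):
    """Least-squares slope and Pearson r of trait on genotype dosage."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_t = sum(trait) / n
    sxy = sum((g - mean_g) * (t - mean_t) for g, t in zip(genotypes, trait))
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((t - mean_t) ** 2 for t in trait)
    slope = sxy / sxx                # trait units per alternate allele
    r = sxy / math.sqrt(sxx * syy)   # strength of the linear association
    return slope, r


# Made-up toy cohort: alternate-allele counts and heights in cm.
genotypes = [0, 0, 1, 1, 2, 2]
heights = [160.0, 162.0, 165.0, 167.0, 170.0, 172.0]
slope, r = genotype_trait_association(genotypes, heights)
```

In this toy cohort each extra copy of the allele goes with about 5 cm of additional height; association alone does not show that the SNP causes the difference.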

53 of 55

Where did the original Indians come from?

Biological question: Do two individuals have shared ancestry?

Solution:

  • SNP arrays [BIO]
  • Statistical genetics [STATS]

54 of 55

Deep learning applications

Goal: Learn the sequence grammar that determines gene expression

55 of 55

Questions?