1 of 42

Large language models

Nick Galioto

Gilbert S. Omenn Department of Computational Medicine and Bioinformatics, University of Michigan

MATH-BIOINF-STAT 547: Mathematics of Data

February 26, 2026

2 of 42

My background

2

Introduction

ngalioto@umich.edu

  • 2014–2018: B.E., Mechanical Engineering
  • 2018–2024: M.S., PhD, Postdoc, Aerospace Engineering
  • 2024–Present: Postdoc, Bioinformatics

Computational modeling and time-series forecasting

AI for Science

3 of 42

Outline

  • Cell reprogramming
  • Geneformer
  • Introduction to Hi-C data
  • ARCH3D: Architecture and pre-training
  • Results
  • Conclusions and future work

4 of 42

  • Introduction
  • Cell reprogramming
  • Geneformer
  • High-throughput chromosome conformation capture (Hi-C)
  • ARCH3D: Architecture and pre-training
  • Results
  • Conclusions

5 of 42

Foundation models

One model, many tasks (Bommasani et al.)

Main idea: Reduce data down to its minimal “features”

6 of 42

Transformers

Input: The cat is [MASK] some food

Language encoder

Output: The cat is eating some food

Self-supervised learning forces the model to use context to learn relationships between data
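A minimal sketch of how masked training pairs like the one above can be built. The function name `mask_tokens` and the masking probability are illustrative assumptions, not details from the slides.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob, rng):
    """Hide each token with probability mask_prob; return (inputs, targets)."""
    inputs, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)     # the encoder sees only the placeholder
            targets.append(token)   # and must recover this token from context
        else:
            inputs.append(token)
            targets.append(None)    # unmasked positions carry no loss
    return inputs, targets

sentence = "The cat is eating some food".split()
inputs, targets = mask_tokens(sentence, mask_prob=0.3, rng=random.Random(0))
print(inputs)
print(targets)
```

No labels are needed: the targets come from the data itself, which is what makes the task self-supervised.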

7 of 42

Geneformer: A foundation model for single-cell transcriptomics

8 of 42

Geneformer architecture

9 of 42

Rank value encoding

Rank value encoding: Order genes by normalized expression

[Figure: each gene ID (892, 76, 5,498, 2,064, 311, 11,900, 8,757, 13,249, ..., 931) is paired with its expression and assigned a rank from 1 to ~20,000; the rank-ordered sequence is then cut to ranks 1 through 2,048.]

Tokenization: Every gene ID corresponds to a learnable vector
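A minimal sketch of rank value encoding followed by token lookup. The per-gene normalization factors, the dropping of unexpressed genes, and the random embedding table are assumptions made for illustration; in a trained model the vectors would be learned.

```python
import random

def rank_value_encode(expression, norm_factor, max_len=2048):
    """Order gene IDs by normalized expression (highest first), then truncate."""
    normalized = {g: count / norm_factor[g]
                  for g, count in expression.items() if count > 0}
    ranked = sorted(normalized, key=normalized.get, reverse=True)
    return ranked[:max_len]

def make_embedding_table(gene_ids, dim, rng):
    """One vector per gene ID (random stand-ins for learnable parameters)."""
    return {g: [rng.gauss(0.0, 1.0) for _ in range(dim)] for g in gene_ids}

# Hypothetical counts and normalization factors for four gene IDs.
expression = {892: 10, 76: 50, 5498: 5, 311: 0}
norm_factor = {892: 1.0, 76: 10.0, 5498: 0.25, 311: 2.0}

tokens = rank_value_encode(expression, norm_factor)
table = make_embedding_table(expression, dim=4, rng=random.Random(0))
vectors = [table[g] for g in tokens]   # model input: one vector per ranked gene
print(tokens)
```

Note that the highest raw count (gene 76) does not rank first: normalization changes the ordering.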

10 of 42

Perturbation experiment: Modified mRNA transiently alters gene expression

RNA interference

modified mRNA

11 of 42

In silico perturbation with Geneformer

  1. Start from the fibroblast rank value encoding
  2. Add transcription factors (TFs) to the top
  3. Truncate to 2,048
  4. Pass to Geneformer
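The perturbation itself can be sketched as a list operation on the rank value encoding. The function name and the toy gene labels are hypothetical; the final call to Geneformer is omitted.

```python
def perturb_encoding(rank_encoding, tfs, max_len=2048):
    """Move the chosen TFs to the top of the ranking, then truncate."""
    tf_set = set(tfs)
    # Drop any TF already present so that it is not duplicated lower down.
    rest = [gene for gene in rank_encoding if gene not in tf_set]
    return (list(tfs) + rest)[:max_len]

fibroblast = ["COL1A1", "FN1", "GATA2", "VIM"]          # toy ranking
perturbed = perturb_encoding(fibroblast, ["GATA2", "FOS"], max_len=4)
print(perturbed)
```

Because the TFs are simply prepended, their relative order carries no meaning here, which matches the caveat that TF order is not considered.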

12 of 42

Caveats

  • All transcription factors are added to the top
  • The order of transcription factors is not considered
  • Only models the first step of cell state transition

Relevant measure: the directionality of the shift

13 of 42

Experimental setup

Candidate transcription factors

GATA2, GFI1B, FOS, STAT5A, REL, FOSB, IKZF1, RUNX3, MEF2C, ETV6

 

 

The distance metric:

 

 

 

10 choose 5 = 252 total recipes!
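The recipe count can be checked by enumerating every 5-factor subset of the 10 candidates. This is a small sketch using the standard library.

```python
from itertools import combinations
from math import comb

CANDIDATE_TFS = ["GATA2", "GFI1B", "FOS", "STAT5A", "REL",
                 "FOSB", "IKZF1", "RUNX3", "MEF2C", "ETV6"]

# Each recipe is an unordered set of 5 transcription factors.
recipes = list(combinations(CANDIDATE_TFS, 5))
print(len(recipes), comb(len(CANDIDATE_TFS), 5))
```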

14 of 42

Results: some good, some not so good

Perturbation                           Value
STAT5A, REL, IKZF1, MEF2C, ETV6        -2.24
STAT5A, FOSB, IKZF1, MEF2C, ETV6       -1.89
FOS, STAT5A, IKZF1, MEF2C, ETV6        -1.87
GFI1B, STAT5A, IKZF1, MEF2C, ETV6      -1.86
FOS, STAT5A, REL, MEF2C, ETV6          -1.83
GATA2, GFI1B, FOS, IKZF1, RUNX3         1.49
GATA2, GFI1B, REL, FOSB, RUNX3          1.52
GATA2, FOS, REL, FOSB, RUNX3            1.52
GATA2, GFI1B, FOS, REL, RUNX3           1.53
GATA2, GFI1B, FOS, FOSB, RUNX3          1.89

Centroid of unperturbed cells

15 of 42

The cell is a dynamical system

16 of 42

Genome structure regulates cell identity

  • Active genes are located in areas of loosely packed chromatin (euchromatin)
  • Topologically associating domains (TADs) insulate sections of the genome from each other
  • Enhancers are brought into proximity of promoters through chromatin looping

[Figure labels: DNA helix; chromosomes; chromatin domains (TADs); nucleus; gene (OFF) in heterochromatin; gene (ON) in euchromatin; loops bring distal elements into spatial proximity]

Adapted from: Misteli, Tom. "The self-organizing genome: principles of genome architecture and function." Cell 183.1 (2020): 28-45.

17 of 42

Existing method: Data-guided control (DGC)

  • State is represented using RNA-seq data, grouped into TADs
  • Selection of TFs is modeled as a control policy

  • Limitation: Cannot account for changes in TAD structure

 

Target cell state

Estimated cell state

Ronquist, Scott, et al. "Algorithm for cellular reprogramming." Proceedings of the National Academy of Sciences 114.45 (2017): 11832-11837.

Optimal TF policy

18 of 42

Foundation models show promise in producing multi-purpose representations of biological data

DNA Sequence

  • GenSLM
  • AlphaGenome
  • Evo2

Transcriptomic

  • Geneformer
  • scGPT
  • scBERT

Protein sequence

  • AlphaFold
  • ESM-2, 3

ATAC-seq + DNA

  • EPCOT
  • GET

Spatial transcriptomics

  • scGPT-spatial
  • Nicheformer

Genome structure remains underexplored!

19 of 42

AI-powered state representation

Transcriptomic foundation model

Chromosome conformation foundation model

Fusion of structure and function

Contribution of function

Contribution of structure

20 of 42

  • Introduction
  • Cell reprogramming
  • Geneformer
  • High-throughput chromosome conformation capture (Hi-C)
  • ARCH3D: Architecture and pre-training
  • Results
  • Conclusions

21 of 42

Hi-C records the number of times two loci come into contact

  • Each entry in the contact matrix is known as a pixel
  • Each pixel can be interpreted as a contact frequency

Intra-chromosomal contacts are more frequent than inter-chromosomal
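A minimal sketch of how a contact matrix is assembled from pairwise contacts on a single chromosome. The input format (pairs of base-pair positions) and the function name are assumptions for illustration.

```python
def contact_matrix(contacts, n_bins, resolution):
    """Count contacts between fixed-width genomic bins (symmetric matrix)."""
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in contacts:
        i, j = pos_a // resolution, pos_b // resolution
        matrix[i][j] += 1          # each entry of the matrix is one pixel
        if i != j:
            matrix[j][i] += 1      # contacts are unordered, so mirror them
    return matrix

# Three hypothetical contacts binned at 5 kb resolution.
contacts = [(1_000, 12_000), (2_500, 11_000), (7_000, 8_000)]
matrix = contact_matrix(contacts, n_bins=3, resolution=5_000)
for row in matrix:
    print(row)
```

Dividing each count by the total number of contacts would turn the pixels into contact frequencies.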

22 of 42

Original Hi-C paper: Chromatin accessibility shows high correlation with A/B compartmentalization

Contact probability follows a power-law scaling with genomic distance

Lieberman-Aiden, Erez, et al. "Comprehensive mapping of long-range interactions reveals folding principles of the human genome." Science 326.5950 (2009): 289-293.

Dividing every diagonal by its average (observed/expected) mitigates the diagonal dominance
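The observed/expected normalization can be sketched directly: every pixel at the same genomic separation is divided by the mean of its diagonal. The function name and the toy matrix are illustrative.

```python
def observed_over_expected(matrix):
    """Divide each diagonal of a symmetric contact matrix by its mean."""
    n = len(matrix)
    oe = [[0.0] * n for _ in range(n)]
    for d in range(n):                                   # d = bin separation
        diagonal = [matrix[i][i + d] for i in range(n - d)]
        expected = sum(diagonal) / len(diagonal)
        if expected == 0:
            continue                                     # leave empty diagonals at 0
        for i in range(n - d):
            value = matrix[i][i + d] / expected
            oe[i][i + d] = value
            oe[i + d][i] = value                         # keep the matrix symmetric
    return oe

observed = [[8, 4, 1],
            [4, 8, 2],
            [1, 2, 8]]
oe = observed_over_expected(observed)
for row in oe:
    print([round(v, 3) for v in row])
```

After normalization the main diagonal no longer dominates, so distance-independent structure is easier to see.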

The first eigenvector correlates with chromatin accessibility

23 of 42

Hi-C: Resolution and coverage

[Figure: Hi-C contact maps compared at high vs. low resolution and at low vs. high coverage]

  • Low coverage cannot capture fine-scale structures (e.g., loops)
  • However, a low-coverage experiment can represent low-resolution Hi-C with accuracy similar to that of a high-coverage experiment
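One way to see why low coverage suffices at low resolution: coarsening a contact matrix sums the pixels in each block, so sparse counts are pooled into larger bins. The sketch below uses a hypothetical helper name and toy counts.

```python
def coarsen(matrix, factor):
    """Lower the resolution by summing factor x factor blocks of pixels."""
    n = len(matrix) // factor
    return [[sum(matrix[i * factor + a][j * factor + b]
                 for a in range(factor) for b in range(factor))
             for j in range(n)]
            for i in range(n)]

# Toy 4 x 4 map at 5 kb resolution, coarsened to 2 x 2 at 10 kb.
fine = [[1, 2, 0, 1],
        [2, 3, 1, 0],
        [0, 1, 4, 2],
        [1, 0, 2, 5]]
coarse = coarsen(fine, factor=2)
print(coarse)
```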

24 of 42

  • Introduction
  • Cell reprogramming
  • Geneformer
  • High-throughput chromosome conformation capture (Hi-C)
  • ARCH3D: Architecture and pre-training
  • Results
  • Conclusions

25 of 42

Pre-training corpus

Consortia:

  • 4DNucleome
  • ENCODE

Experiments:

  • In-situ Hi-C
  • Dilution Hi-C
  • DNase Hi-C

Preprocessing:

  • KR normalization
  • Observed/expected

481 total experiments (> 10M contacts)

26 of 42

Tokenization scheme

  • Represents genomic loci, not patches
  • Permits loci of any length (multiples of 5 kb)
  • Retains high resolution along columns

[Figure: 20 kb locus → 5 kb bins → column averaging → 20 kb input vector]

Locus lengths:

  • 5 kb
  • 10 kb
  • 25 kb
  • 50 kb
  • 100 kb
  • 250 kb
  • 500 kb
  • 1 Mb
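A sketch of one reading of the column-averaging step: the rows of the 5 kb bins that make up a locus are averaged within each column, so the locus becomes a single input vector that still has one entry per 5 kb column. This interpretation, the function name, and the toy matrix are assumptions.

```python
def locus_input_vector(matrix_5kb, start_bin, n_bins):
    """Average the rows of the 5 kb bins in a locus, column by column."""
    rows = matrix_5kb[start_bin:start_bin + n_bins]
    n_cols = len(rows[0])
    return [sum(row[c] for row in rows) / len(rows) for c in range(n_cols)]

matrix_5kb = [[4, 2, 0, 2],
              [2, 4, 2, 0],
              [0, 2, 4, 2],
              [2, 0, 2, 4]]

# A 10 kb locus covering the first two 5 kb bins.
vector = locus_input_vector(matrix_5kb, start_bin=0, n_bins=2)
print(vector)
```

Under this reading, the vector length does not depend on the locus length, which is what would let one model accept every locus length in the list above.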

27 of 42

Biology-informed encodings provide the model with positional information

[Figure: chromosomal encodings (chromosomes 1, 2, 3, 4, ..., 22) and base pair encodings for a genomic locus, combined into the final positional encoding]

Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
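A sketch of sinusoidal encodings in the style of Vaswani et al., applied once to the base-pair position and once to the chromosome index. Summing the two, indexing positions in 5 kb bins, and the function names are all assumptions; the slide's own formula is not reproduced here.

```python
import math

def sinusoidal_encoding(position, dim, base=10000.0):
    """PE[2k] = sin(position / base^(2k/dim)), PE[2k+1] = cos(same angle)."""
    encoding = []
    for i in range(0, dim, 2):
        angle = position / base ** (i / dim)
        encoding.append(math.sin(angle))
        if i + 1 < dim:
            encoding.append(math.cos(angle))
    return encoding

def locus_positional_encoding(chrom_index, start_bp, dim, bin_size=5000):
    """Assumed combination: base pair encoding plus chromosomal encoding."""
    base_pair = sinusoidal_encoding(start_bp // bin_size, dim)
    chromosome = sinusoidal_encoding(chrom_index, dim)
    return [b + c for b, c in zip(base_pair, chromosome)]

encoding = locus_positional_encoding(chrom_index=1, start_bp=100_000, dim=8)
print([round(v, 3) for v in encoding])
```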

28 of 42

ARCH3D architecture

 

Multi-headed self-attention

Layer norm

Feedforward neural network

Layer norm

 

x24

  • Model dimension: 1,024
  • Layers: 24
  • Attention heads: 16
  • Feedforward dimension: 4,096
  • Activation: ReLU

Linear layer

 

 

Locus embeddings

Transformer

 

Positional encoding

Base pair

Chromosome

 

 

 

 

Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." Proceedings of the 2019 conference of the NAACL: human language technologies, volume 1 (long and short papers). 2019.
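A rough parameter count for the encoder stack, using the hyperparameters on this slide (model dimension 1,024, 24 layers, 16 heads, feedforward dimension 4,096). It assumes a standard Transformer encoder layer with biases and two layer norms, and it ignores the input linear layer and the task head, so it is an estimate rather than the model's reported size.

```python
def encoder_layer_params(d_model, d_ff):
    """Parameters in one standard encoder layer (weights plus biases)."""
    attention = 4 * (d_model * d_model + d_model)   # Q, K, V, output projections
    feedforward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                 # scale and shift, twice
    return attention + feedforward + layer_norms

D_MODEL, N_LAYERS, N_HEADS, D_FF = 1024, 24, 16, 4096

per_layer = encoder_layer_params(D_MODEL, D_FF)
total = N_LAYERS * per_layer
head_dim = D_MODEL // N_HEADS   # width of each attention head

print(f"{per_layer:,} per layer, {total:,} in the stack, head dim {head_dim}")
```

Under these assumptions the stack has roughly 300 million parameters, the same scale as BERT-large, which shares these four hyperparameters.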

29 of 42

Task head architecture

 

 

Locus embeddings are transformed to pixel embeddings through pairwise addition

Locus embeddings → pairwise addition → pixel embeddings → task head → predicted pixels

Task head: 1024 → Linear + ReLU → 2048 → Linear + ReLU → 2048 → Linear + ReLU → 1024 → Linear → 1
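A minimal sketch of pairwise addition followed by a small multilayer perceptron. The weights are random, and the toy layer widths (8, 16, 16, 8, 1) stand in for the much larger ones on the slide; all names are hypothetical.

```python
import random

def pairwise_add(locus_embeddings):
    """Pixel embedding (i, j) is the sum of locus embeddings i and j."""
    n = len(locus_embeddings)
    return [[[a + b for a, b in zip(locus_embeddings[i], locus_embeddings[j])]
             for j in range(n)] for i in range(n)]

def make_mlp(dims, rng):
    """Random (weights, biases) for each consecutive pair of layer widths."""
    return [([[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_out)],
             [0.0] * d_out)
            for d_in, d_out in zip(dims, dims[1:])]

def mlp_forward(layers, x):
    """Linear + ReLU for every layer except the last, which is linear only."""
    for k, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if k < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

rng = random.Random(0)
loci = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(3)]
pixels = pairwise_add(loci)
head = make_mlp([8, 16, 16, 8, 1], rng)
predicted = [[mlp_forward(head, pixels[i][j])[0] for j in range(3)] for i in range(3)]
print(predicted)
```

Because addition is symmetric, pixel (i, j) and pixel (j, i) receive the same embedding and therefore the same prediction, matching the symmetry of a contact map.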

30 of 42

Pre-training task: Masked locus modeling

Hi-C data → [MASK] → data input + base pair encodings + chromosome encodings → ARCH3D → locus embeddings → pixel embeddings → task head → predicted pixels

Mean squared error loss between predicted pixels and target pixels
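A sketch of the masked locus objective on a toy matrix: one locus's input vector is hidden, and the loss is the mean squared error over that locus's pixels. Masking with a constant value and scoring only the masked locus are assumptions made for illustration.

```python
def mask_locus(matrix, locus, mask_value=0.0):
    """Replace the input vector (row) of one locus with a mask value."""
    masked = [row[:] for row in matrix]
    masked[locus] = [mask_value] * len(matrix[locus])
    return masked

def masked_mse(predicted, target, locus):
    """Mean squared error over the pixels of the masked locus only."""
    errors = [(p - t) ** 2 for p, t in zip(predicted[locus], target[locus])]
    return sum(errors) / len(errors)

target = [[1.0, 2.0],
          [2.0, 1.0]]
model_input = mask_locus(target, locus=1)

# Stand-in for the model output; a real run would come from ARCH3D.
predicted = [[1.0, 2.0],
             [1.0, 3.0]]
loss = masked_mse(predicted, target, locus=1)
print(model_input, loss)
```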

31 of 42

Training approach

University of Michigan Lighthouse HPC Cluster

17 nodes, each with:

  • 8 NVIDIA H100 GPUs (80 GB VRAM)
  • 1 TB RAM
  • 96 cores

Stage  GPUs     RAM (TB)   Epochs  Time (h)  GPU hours
1      8        1.0        6,000   504       4,032
2      16 / 32  3.2 / 4.0  5,700   384       9,600

Stage 1: 194 Hi-C experiments; Stage 2: 481 Hi-C experiments

Optimizer: Adam

Learning rate schedule:

  1. Linear warmup to 1e-5 over 500 steps
  2. Constant 1e-5 for 3,000 steps
  3. Cosine anneal to 1e-6 over 2,000 steps
  4. Repeat steps 2–3

In the final run, the learning rate was held at 10% of the maximum.
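The schedule can be sketched as a function of the step index. It assumes that "repeat" means jumping back to the constant phase after each cosine anneal; the function name is hypothetical, and the final-run setting is not modeled.

```python
import math

def learning_rate(step, max_lr=1e-5, min_lr=1e-6,
                  warmup=500, hold=3000, anneal=2000):
    """Linear warmup, then repeating cycles of constant hold and cosine anneal."""
    if step < warmup:
        return max_lr * step / warmup
    t = (step - warmup) % (hold + anneal)     # position inside the current cycle
    if t < hold:
        return max_lr
    progress = (t - hold) / anneal            # 0 at anneal start, approaches 1
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 250, 500, 3499, 4500, 5499, 5500):
    print(step, learning_rate(step))
```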

Stage 1

Stage 2

32 of 42

  • Introduction
  • Cell reprogramming
  • Geneformer
  • High-throughput chromosome conformation capture (Hi-C)
  • ARCH3D: Architecture and pre-training
  • Results
  • Conclusions

33 of 42

Positioning of embeddings reflects genomic structure

Main observations:

  1. Diagonal blocks show shortest average distance, similar to chromosome territories
  2. More differentiated cell lines show greater distance between inter-chromosomal embeddings
  3. Embeddings from smaller chromosomes are positioned closer together than embeddings from larger chromosomes, mirroring experimental data

Average distance between embeddings within and across chromosomes

Average inter-chromosomal contact probabilities

Lieberman-Aiden, Erez, et al. "Comprehensive mapping of long-range interactions reveals folding principles of the human genome." Science 326.5950 (2009): 289-293.

Embedding distances

Data
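A sketch of how the block-averaged embedding distances could be computed. Euclidean distance, the function name, and the toy two-dimensional embeddings are assumptions for illustration.

```python
import math

def mean_block_distances(embeddings, chroms):
    """Average pairwise distance for every (chromosome, chromosome) block."""
    labels = sorted(set(chroms))
    totals = {(a, b): [0.0, 0] for a in labels for b in labels}
    for i, vec_i in enumerate(embeddings):
        for j, vec_j in enumerate(embeddings):
            if i == j:
                continue                      # skip self-distances
            block = totals[(chroms[i], chroms[j])]
            block[0] += math.dist(vec_i, vec_j)
            block[1] += 1
    return {key: (total / count if count else float("nan"))
            for key, (total, count) in totals.items()}

embeddings = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
chroms = ["chr1", "chr1", "chr2", "chr2"]
blocks = mean_block_distances(embeddings, chroms)
print(blocks)
```

In this toy case the diagonal blocks (within a chromosome) have the shortest average distance, the pattern described in the first observation.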

34 of 42

Higher coverage reveals finer features

Hi-C data can be sparse, especially at higher resolutions

Can we use ARCH3D to infer missing pixels?

35 of 42

ARCH3D infers genome-wide contact maps from sparse data

Intrachromosomal contacts

Interchromosomal contacts

36 of 42

Evidence suggests genes cluster into transcription factories

  • Gene transcription is localized to a small number of sites known as “transcription factories”
  • Genes within a transcription factory are co-regulated
  • Pore-C records multi-way interactions using long-read sequencing

Dotson, Gabrielle A., et al. "Deciphering multi-way interactions in the human genome." Nature Communications 13.1 (2022): 5498.

37 of 42

Pore-C creates a hypergraph

Multi-way interactions (Pore-C) → clique expansion → pairwise interactions (virtual Hi-C)

Clique expansion gives an approximation of Hi-C referred to as “virtual Hi-C”
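A minimal sketch of clique expansion: every hyperedge contributes one pairwise contact for each pair of loci it contains. The function name and the toy hyperedges are illustrative.

```python
from collections import Counter
from itertools import combinations

def clique_expansion(hyperedges):
    """Count pairwise contacts implied by a list of multi-way contacts."""
    pairs = Counter()
    for hyperedge in hyperedges:
        # A k-way contact becomes a clique of k * (k - 1) / 2 pairwise contacts.
        for a, b in combinations(sorted(set(hyperedge)), 2):
            pairs[(a, b)] += 1
    return pairs

hyperedges = [(1, 2, 3), (2, 3, 4, 5)]
virtual_hic = clique_expansion(hyperedges)
print(virtual_hic)
```

The expansion loses information: the pair counts cannot tell whether loci 2 and 3 met in the same multi-way contact or in two separate ones.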

Surana, Amit, Can Chen, and Indika Rajapakse. "Hypergraph similarity measures." IEEE Transactions on Network Science and Engineering 10.2 (2022): 658-674.

Experimental procedure

Multi-correlation methods detect hyperedges from pairwise data:

https://www.overleaf.com/project/636c007729c3d3245fec1e8d

38 of 42

Training data

  • 100,000 training samples from each cell line
    • GM12878 (0.24%)
    • BJ Fibroblast (15.59%)
    • IR Fibroblast (38.54%)
  • Train for 50 epochs

Percentage Non-Homologous

Cell type      Hyperedges  3-way  4-way  5-way  Total
GM12878        Unique      37.06  73.3   85.78  74.44
GM12878        Total       31.23  72.36  85.74  71.88
BJ Fibroblast  Unique      70.49  94.37  97.31  90.02
BJ Fibroblast  Total       69.48  94.37  97.31  89.7
IR Fibroblast  Unique      74.24  96.18  97.39  88.57
IR Fibroblast  Total       73.26  96.17  97.39  88.14
Aggregate      Unique      38.24  73.83  85.95  74.76
Aggregate      Total       32.37  72.9   85.92  72.23

Over 70% of hyperedges span multiple chromosomes

Data from Dotson et al.

39 of 42

Hyperedge prediction training scheme

Hyperedge

Not hyperedge

Transformer Encoder

Linear

Virtual Hi-C

ARCH3D

Average

❄️ Frozen encoder

🔥 Trainable task head

Select only the embeddings of loci in the candidate hyperedge

Probability of hyperedge
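A simplified sketch of the scoring path: select the embeddings of the loci in the candidate, average them, and map the average to a probability with a linear layer. The sigmoid output, the weights, and the function name are assumptions, and the Transformer encoder block from the diagram is left out.

```python
import math

def hyperedge_probability(locus_embeddings, candidate, weights, bias):
    """Average the candidate's locus embeddings, then apply linear + sigmoid."""
    selected = [locus_embeddings[locus] for locus in candidate]
    mean = [sum(vec[k] for vec in selected) / len(selected)
            for k in range(len(weights))]
    logit = sum(w * m for w, m in zip(weights, mean)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Toy embeddings from the frozen encoder, keyed by locus index.
locus_embeddings = {0: [1.0, 0.0], 1: [3.0, 0.0], 2: [5.0, 1.0]}
probability = hyperedge_probability(locus_embeddings, candidate=(0, 1),
                                    weights=[1.0, 0.0], bias=-2.0)
print(probability)
```

Averaging makes the score independent of the order of loci and of how many loci the candidate contains.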

A negative sample is dynamically generated for every positive sample:

  • Replace a random selection of nodes in the positive sample with randomly chosen nodes from the same chromosome
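A sketch of the negative sampling step. Rejecting candidates that repeat a locus or that match a known hyperedge, and retrying up to a fixed number of times, are added assumptions; the data structures and names are hypothetical.

```python
import random

def negative_sample(positive, loci_by_chrom, chrom_of, known, rng, max_tries=100):
    """Swap a random subset of nodes for random loci on the same chromosome."""
    for _ in range(max_tries):
        nodes = list(positive)
        n_replace = rng.randint(1, len(nodes))
        for idx in rng.sample(range(len(nodes)), n_replace):
            pool = loci_by_chrom[chrom_of[nodes[idx]]]
            nodes[idx] = rng.choice(pool)          # stays on the same chromosome
        candidate = frozenset(nodes)
        # Keep only candidates with distinct loci that are not real hyperedges.
        if len(candidate) == len(positive) and candidate not in known:
            return tuple(sorted(candidate))
    return None

loci_by_chrom = {"chr1": list(range(0, 50)), "chr2": list(range(50, 100))}
chrom_of = {locus: chrom for chrom, loci in loci_by_chrom.items() for locus in loci}
positive = (1, 2, 60)
known = {frozenset(positive)}

negative = negative_sample(positive, loci_by_chrom, chrom_of, known, random.Random(0))
print(negative)
```

Generating negatives on the fly means the model sees a fresh negative for each positive in every epoch.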

Zhang, Ruochi, and Jian Ma. "MATCHA: probing multi-way chromatin interaction with hypergraph representation learning." Cell systems 10.5 (2020): 397-407.

40 of 42

ARCH3D gives state-of-the-art results

41 of 42

  • Introduction
  • Cell reprogramming
  • Geneformer
  • High-throughput chromosome conformation capture (Hi-C)
  • ARCH3D: Architecture and pre-training
  • Results
  • Conclusions

42 of 42

Conclusions

  • ARCH3D produces embeddings that capture important features of genome structure
  • ARCH3D infers long-range (including interchromosomal) interactions from sparse data
  • ARCH3D predicts hyperedges from pairwise data

Funding

  • DARPA
    • TwinCell Blueprint: Foundation for AI-Assisted Cell Reprogramming
  • AFOSR
    • Data-guided Learning and Control of Higher Order Structures