Large language models
Nick Galioto
Gilbert S. Omenn Department of Computational Medicine and Bioinformatics, University of Michigan
MATH-BIOINF-STAT 547: Mathematics of Data
February 26, 2026
My background
Introduction
ngalioto@umich.edu
B.E., Mechanical Engineering (2014–2018)
M.S., PhD, Postdoc, Aerospace Engineering (2018–2024)
Postdoc, Bioinformatics (2024–Present)
Computational modeling and time-series forecasting
AI for Science
Outline
Introduction
Cell reprogramming
Geneformer
High-throughput chromosome conformation capture (Hi-C)
ARCH3D: Architecture and pre-training
Results
Conclusions
Foundation models
Geneformer
One model, many tasks (Bommasani, Rishi, et al.)
Main idea: Reduce data down to its minimal “features”
Transformers
Input: The cat is [MASK] some food
Language encoder
Output: The cat is eating some food
Self-supervised learning forces the model to use context to learn relationships between data
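The masking step behind this example can be sketched in a few lines. This is an illustrative sketch only, not the training code of any particular model: the function name `mask_tokens`, the masking probability, and the whitespace tokenizer are assumptions made for the demonstration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Hide a random subset of tokens (hypothetical helper).

    Returns the masked sequence and a dict mapping each masked index
    to the original token the model would be asked to recover.
    """
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            targets[i] = tok
    return masked, targets

sentence = "The cat is eating some food".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3, seed=0)
```

The model only sees `masked`; the training signal comes from `targets`, so no human labels are needed.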
Geneformer: A foundation model for single-cell transcriptomics
Geneformer architecture
Rank value encoding
Rank value encoding: Order genes by normalized expression
Tokenization: Every gene ID corresponds to a learnable vector
[Figure: expression values for ~20,000 genes are ranked and shortened to 2,048 entries, and each gene is replaced by its token]
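A minimal sketch of rank value encoding, under stated assumptions: each gene's expression is divided by a per-gene normalization factor (for example, a corpus-wide median; the values here are made up), undetected genes are dropped, and the remaining genes are sorted from highest to lowest normalized expression. The function name `rank_value_encode` and the token IDs are illustrative, not taken from the Geneformer code base.

```python
import numpy as np

def rank_value_encode(expression, norm_factors, token_ids, max_len=2048):
    """Return token IDs ordered by normalized expression, highest first."""
    expression = np.asarray(expression, dtype=float)
    normalized = expression / np.asarray(norm_factors, dtype=float)
    detected = np.flatnonzero(expression > 0)  # assumption: drop unexpressed genes
    # Stable sort so ties keep their original gene order.
    order = detected[np.argsort(-normalized[detected], kind="stable")]
    return np.asarray(token_ids)[order][:max_len]

expr = [0.0, 12.0, 3.0, 8.0]       # toy expression values for four genes
norm = [1.0, 4.0, 1.0, 2.0]        # hypothetical per-gene normalization factors
tokens = [892, 76, 5498, 2064]     # illustrative token IDs
encoded = rank_value_encode(expr, norm, tokens)
```

Only the order of the genes survives the encoding; the expression magnitudes themselves are discarded.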
Perturbation experiment: Modified mRNA transiently alters gene expression
RNA interference
modified mRNA
In silico perturbation with Geneformer
Inputs: Fibroblast rank value encoding, Transcription factors
Add TFs to the top
Truncate to 2,048
Pass to Geneformer
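The perturbation steps can be sketched as a simple list operation. This is a sketch under assumptions: the helper name `overexpress` is hypothetical, and removing a transcription factor from its old position before moving it to the top is a choice made here to avoid duplicate tokens.

```python
def overexpress(rank_encoding, tf_tokens, max_len=2048):
    """Place transcription-factor tokens at the top of a rank value encoding."""
    tf_set = set(tf_tokens)
    # Assumption: drop TFs from their original ranks so each token appears once.
    rest = [t for t in rank_encoding if t not in tf_set]
    return (list(tf_tokens) + rest)[:max_len]

fibroblast = [10, 11, 12, 13, 14]   # toy rank value encoding (token IDs)
perturbed = overexpress(fibroblast, tf_tokens=[13, 99], max_len=5)
```

The perturbed sequence is what would then be passed to the model in place of the original encoding.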
Caveats
Relevant measure: the directionality of the shift
Experimental setup
Candidate transcription factors: GATA2, GFI1B, FOS, STAT5A, REL, FOSB, IKZF1, RUNX3, MEF2C, ETV6
The distance metric:
10 choose 5 = 252 total recipes!
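The recipe count can be checked by enumerating every five-factor subset of the ten candidates. The variable names are illustrative.

```python
from itertools import combinations
from math import comb

CANDIDATE_TFS = ["GATA2", "GFI1B", "FOS", "STAT5A", "REL",
                 "FOSB", "IKZF1", "RUNX3", "MEF2C", "ETV6"]

# Every unordered choice of 5 transcription factors is one recipe.
recipes = list(combinations(CANDIDATE_TFS, 5))
n_recipes = len(recipes)
```

Each tuple in `recipes` is one candidate perturbation to score.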
Results: some good, some not so good
Perturbation | Distance metric |
STAT5A, REL, IKZF1, MEF2C, ETV6 | -2.24 |
STAT5A, FOSB, IKZF1, MEF2C, ETV6 | -1.89 |
FOS, STAT5A, IKZF1, MEF2C, ETV6 | -1.87 |
GFI1B, STAT5A, IKZF1, MEF2C, ETV6 | -1.86 |
FOS, STAT5A, REL, MEF2C, ETV6 | -1.83 |
... | ... |
GATA2, GFI1B, FOS, IKZF1, RUNX3 | 1.49 |
GATA2, GFI1B, REL, FOSB, RUNX3 | 1.52 |
GATA2, FOS, REL, FOSB, RUNX3 | 1.52 |
GATA2, GFI1B, FOS, REL, RUNX3 | 1.53 |
GATA2, GFI1B, FOS, FOSB, RUNX3 | 1.89 |
Centroid of unperturbed cells
The cell is a dynamical system
Cell reprogramming
Genome structure regulates cell identity
[Figure: levels of genome organization (DNA helix, chromatin domains (TADs), chromosomes, nucleus); Gene (OFF) in heterochromatin, Gene (ON) in euchromatin]
Loops bring distal elements into spatial proximity
Adapted from: Misteli, Tom. "The self-organizing genome: principles of genome architecture and function." Cell 183.1 (2020): 28-45.
Existing method: Data-guided control (DGC)
Target cell state
Estimated cell state
Ronquist, Scott, et al. "Algorithm for cellular reprogramming." Proceedings of the National Academy of Sciences 114.45 (2017): 11832-11837.
Optimal TF policy
Foundation models show promise in producing multi-purpose representations of biological data
DNA Sequence
Transcriptomic
Protein sequence
ATAC-seq + DNA
Spatial transcriptomics
Genome structure remains underexplored!
AI-powered state representation
Transcriptomic foundation model
Chromosome conformation foundation model
Fusion of structure and function
Contribution of function
Contribution of structure
Hi-C records the number of times two loci come into contact
High-throughput chromosome conformation capture (Hi-C)
Intra-chromosomal contacts are more frequent than inter-chromosomal
Original Hi-C paper: Chromatin accessibility shows high correlation with A/B compartmentalization
Contact probability follows a power-law scaling
Lieberman-Aiden, Erez, et al. "Comprehensive mapping of long-range interactions reveals folding principles of the human genome." Science 326.5950 (2009): 289-293.
Dividing every diagonal by its average (observed/expected) mitigates the diagonal dominance
The first eigenvector of the correlation matrix correlates with chromatin accessibility
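The observed/expected normalization and the compartment signal can be sketched with NumPy. This is a minimal sketch on synthetic data, assuming a symmetric contact matrix for a single chromosome; the function names are illustrative, and real pipelines add steps (matrix balancing, filtering of empty bins) that are omitted here.

```python
import numpy as np

def observed_over_expected(contacts):
    """Divide every diagonal of a symmetric contact matrix by its mean."""
    contacts = np.asarray(contacts, dtype=float)
    n = contacts.shape[0]
    oe = np.zeros_like(contacts)
    for d in range(n):
        idx = np.arange(n - d)
        mean = contacts[idx, idx + d].mean()
        if mean > 0:
            oe[idx, idx + d] = contacts[idx, idx + d] / mean
            oe[idx + d, idx] = oe[idx, idx + d]  # mirror to the lower triangle
    return oe

def first_eigenvector(oe):
    """Leading eigenvector of the row-correlation matrix of O/E."""
    corr = np.corrcoef(oe)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    return eigvecs[:, -1]

# Synthetic contacts whose expected count decays with genomic distance.
rng = np.random.default_rng(0)
n = 40
dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
raw = rng.poisson(100.0 / (1.0 + dist))
contacts = np.triu(raw) + np.triu(raw, 1).T
oe = observed_over_expected(contacts)
pc1 = first_eigenvector(oe)
```

On real data, the sign pattern of `pc1` is what gets compared against accessibility tracks; the overall sign of an eigenvector is arbitrary and has to be fixed by convention.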
Hi-C: Resolution and coverage
High resolution
Low resolution
Low coverage
High coverage
Pre-training corpus
ARCH3D: Architecture and pre-training
Consortia:
Experiments:
Preprocessing:
481 total experiments (> 10M contacts)
Tokenization scheme
[Figure: tokenization of a 20 kb locus from 5 kb bins, with column averaging producing the 20 kb input vector]
Locus lengths:
Biology-informed encodings provide the model with positional information
[Figure: chromosome labels 1, 2, 3, 4 and 22]
Final positional encoding:
chr
Base pair encodings
Chromosomal encodings
Genomic locus:
Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
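A sinusoidal encoding in the style of Vaswani et al. gives one way to picture these positional encodings. Everything specific here is an assumption for illustration: the slide does not state how ARCH3D builds or combines its encodings, so the use of the locus index as the position, the sinusoidal form for the chromosome, and the summation of the two parts are all hypothetical.

```python
import numpy as np

def sinusoidal_encoding(position, dim=1024, base=10000.0):
    """PE[2i] = sin(pos / base^(2i/dim)), PE[2i+1] = cos(pos / base^(2i/dim))."""
    i = np.arange(dim // 2)
    angles = position / base ** (2.0 * i / dim)
    enc = np.empty(dim)
    enc[0::2] = np.sin(angles)
    enc[1::2] = np.cos(angles)
    return enc

def locus_positional_encoding(chrom, start_bp, locus_len=20_000, dim=1024):
    """Hypothetical combination: base pair encoding plus chromosomal encoding."""
    bp_enc = sinusoidal_encoding(start_bp // locus_len, dim)  # position along the chromosome
    chr_enc = sinusoidal_encoding(chrom, dim)                 # which chromosome
    return bp_enc + chr_enc

pe = locus_positional_encoding(chrom=1, start_bp=40_000)
```

With two separate components, loci at the same coordinate on different chromosomes still receive different encodings.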
ARCH3D architecture
Multi-headed self-attention
Layer norm
Feedforward neural network
Layer norm
x24
Model dimension | Layers | Attention heads | Feedforward dimension | Activation |
1,024 | 24 | 16 | 4,096 | ReLU |
Linear layer
Locus embeddings
Transformer
Positional encoding
Base pair
Chromosome
Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
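The hyperparameters in the table imply a rough size for the transformer stack. The count below is a back-of-the-envelope sketch that assumes a standard encoder block (four attention projections with biases, a two-layer feedforward network, two layer norms); it covers only the 24 blocks and leaves out the input linear layer, the positional encodings, and the task head.

```python
def encoder_layer_params(d_model, d_ff):
    """Parameter count of one standard transformer encoder block (assumed layout)."""
    attention = 4 * (d_model * d_model + d_model)            # Q, K, V, output projections
    feedforward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                          # scale and shift, twice
    return attention + feedforward + layer_norms

D_MODEL, LAYERS, HEADS, D_FF = 1024, 24, 16, 4096  # values from the table
per_layer = encoder_layer_params(D_MODEL, D_FF)
total = LAYERS * per_layer
head_dim = D_MODEL // HEADS
```

Under these assumptions the stack holds about 302 million parameters, with 64 dimensions per attention head.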
Task head architecture
Locus embeddings are transformed to pixel embeddings through pairwise addition
[Figure: Locus embeddings + Locus embeddings → Pixel embeddings → Task head → Predicted pixels]
Task head: Linear + ReLU (1024 → 2048), Linear + ReLU (2048 → 2048), Linear + ReLU (2048 → 1024), Linear (1024 → 1)
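The pairwise addition and the task head can be sketched with NumPy. This is a sketch with random, untrained weights: the layer widths follow the slide (1024, 2048, 2048, 1024, 1), while the initialization and the function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_head(dims=(1024, 2048, 2048, 1024, 1)):
    """Random weights for the task head (illustrative initialization)."""
    return [(rng.normal(0.0, 0.02, (m, n)), np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]

def pixel_embeddings(locus_emb):
    """Pixel (i, j) is the sum of the embeddings of loci i and j."""
    return locus_emb[:, None, :] + locus_emb[None, :, :]

def task_head(pixels, params):
    """Three Linear + ReLU layers followed by a final Linear layer."""
    h = pixels
    for k, (W, b) in enumerate(params):
        h = h @ W + b
        if k < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h[..., 0]

locus_emb = rng.normal(size=(6, 1024))   # six toy locus embeddings
pixels = pixel_embeddings(locus_emb)
predicted = task_head(pixels, init_head())
```

Because addition is commutative, pixel (i, j) and pixel (j, i) share an embedding, so the predicted contact map is symmetric by construction.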
Pre-training task: Masked locus modeling
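[Figure: Hi-C data input with masked loci ([MASK]), together with base pair encodings and chromosome encodings, passes through ARCH3D to give locus embeddings, then pixel embeddings, then the task head; predicted pixels are compared with target pixels using a mean squared error loss]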
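The masked locus objective can be sketched as follows. This is a toy sketch, not the ARCH3D training loop: the masking fraction, the use of a zero vector as the mask, and scoring only the masked loci are assumptions, and `model` stands in for the full encoder plus task head.

```python
import numpy as np

def masked_locus_loss(hic, model, mask_frac=0.15, rng=None):
    """Mask random loci, predict the map, and score the masked rows with MSE."""
    if rng is None:
        rng = np.random.default_rng()
    n = hic.shape[0]
    n_mask = max(1, int(round(mask_frac * n)))
    masked = rng.choice(n, size=n_mask, replace=False)
    inputs = hic.copy()
    inputs[masked, :] = 0.0            # stand-in for a learned [MASK] vector
    pred = model(inputs)
    loss = np.mean((pred[masked, :] - hic[masked, :]) ** 2)
    return loss, masked

rng = np.random.default_rng(0)
hic = rng.random((8, 8))
hic = (hic + hic.T) / 2                # toy symmetric contact map
loss_identity, masked = masked_locus_loss(hic, model=lambda x: x, rng=rng)
loss_oracle, _ = masked_locus_loss(hic, model=lambda x: hic, rng=rng)
```

A model that merely copies its input is penalized on the masked loci, while one that reconstructs them from context drives the loss toward zero.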
Training approach
University of Michigan Lighthouse HPC Cluster
17 nodes, each with:
Stage | GPUs | RAM (TB) | Epochs | Time (h) | GPU hours |
1 | 8 | 1.0 | 6,000 | 504 | 4,032 |
2 | 16 / 32 | 3.2 / 4.0 | 5,700 | 384 | 9,600 |
Stage 1: 194 Hi-C experiments; Stage 2: 481 Hi-C experiments
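The GPU-hour column can be checked with simple arithmetic. For stage 2, the split of wall-clock time between the 16-GPU and 32-GPU configurations is not given, so the values computed below are only the split that would be consistent with the table, assuming every hour ran on exactly one of those two configurations.

```python
# Stage 1: one configuration, so GPU hours = GPUs x wall-clock hours.
stage1_gpu_hours = 8 * 504

# Stage 2 (assumed model): h16 + h32 = 384 and 16*h16 + 32*h32 = 9600.
total_hours, total_gpu_hours = 384, 9600
h32 = (total_gpu_hours - 16 * total_hours) / (32 - 16)
h16 = total_hours - h32
```

Under that assumption, stage 2 would correspond to 168 hours on 16 GPUs followed or preceded by 216 hours on 32 GPUs.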
Optimizer: Adam
Learning rate schedule:
In final run, learning rate held at 10% of max.
Stage 1
Stage 2
Positioning of embeddings reflects genomic structure
Results
Main observations:
Average distance between embeddings within and across chromosomes
Average inter-chromosomal contact probabilities
Lieberman-Aiden, Erez, et al. "Comprehensive mapping of long-range interactions reveals folding principles of the human genome." Science 326.5950 (2009): 289-293.
Embedding distances
Data
Higher coverage reveals finer features
Hi-C data can be sparse, especially at higher resolutions
Can we use ARCH3D to infer missing pixels?
ARCH3D infers genome-wide contact maps from sparse data
Intrachromosomal contacts
Interchromosomal contacts
Evidence suggests genes cluster into transcription factories
Dotson, Gabrielle A., et al. "Deciphering multi-way interactions in the human genome." Nature Communications 13.1 (2022): 5498.
Pore-C creates a hypergraph
Multi-way interactions (Pore-C)
Pairwise interactions (Virtual Hi-C)
Clique expansion
Clique expansion gives an approximation of Hi-C referred to as “virtual Hi-C”
Surana, Amit, Can Chen, and Indika Rajapakse. "Hypergraph similarity measures." IEEE Transactions on Network Science and Engineering 10.2 (2022): 658-674.
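Clique expansion replaces each hyperedge with all pairs of its loci. The sketch below counts one contact per pair per hyperedge; that unit weighting and the function name are assumptions, and other weightings are possible.

```python
from collections import Counter
from itertools import combinations

def clique_expansion(hyperedges):
    """Turn multi-way contacts into pairwise contact counts."""
    counts = Counter()
    for edge in hyperedges:
        # Every pair of loci within one hyperedge becomes one pairwise contact.
        for i, j in combinations(sorted(set(edge)), 2):
            counts[(i, j)] += 1
    return counts

# Three toy multi-way contacts over loci 0..4.
virtual_hic = clique_expansion([(0, 1, 2), (1, 2), (2, 3, 4, 1)])
```

The expansion is lossy: once the counts are summed, it is no longer possible to tell which pairs came from the same multi-way contact.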
Experimental procedure
Multi-correlation methods detect hyperedges from pairwise data:
Training data
Percentage Non-Homologous
Cell type | | 3-way | 4-way | 5-way | Total |
GM12878 | Unique | 37.06 | 73.3 | 85.78 | 74.44 |
GM12878 | Total | 31.23 | 72.36 | 85.74 | 71.88 |
BJ Fibroblast | Unique | 70.49 | 94.37 | 97.31 | 90.02 |
BJ Fibroblast | Total | 69.48 | 94.37 | 97.31 | 89.7 |
IR Fibroblast | Unique | 74.24 | 96.18 | 97.39 | 88.57 |
IR Fibroblast | Total | 73.26 | 96.17 | 97.39 | 88.14 |
Aggregate | Unique | 38.24 | 73.83 | 85.95 | 74.76 |
Aggregate | Total | 32.37 | 72.9 | 85.92 | 72.23 |
Over 70% of hyperedges span multiple chromosomes
Data from Dotson et al.
Hyperedge prediction training scheme
[Figure: Virtual Hi-C is passed to ARCH3D (frozen encoder ❄️); a trainable task head 🔥 (Transformer Encoder, Average, Linear) outputs the probability of hyperedge (Hyperedge / Not hyperedge)]
Select only the embeddings of loci in the candidate hyperedge
Negative sample dynamically generated for every positive sample
Zhang, Ruochi, and Jian Ma. "MATCHA: probing multi-way chromatin interaction with hypergraph representation learning." Cell systems 10.5 (2020): 397-407.
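A stripped-down version of the hyperedge scoring step can be sketched with NumPy. This is a simplified sketch: it keeps the select, average, and linear steps but omits the transformer encoder inside the task head, and the negative sampler (swap one locus for a random one) is an assumption about how a negative might be generated, not the published scheme.

```python
import numpy as np

def hyperedge_probability(locus_emb, candidate, w, b):
    """Average the candidate loci embeddings, then apply a linear layer and sigmoid."""
    pooled = locus_emb[list(candidate)].mean(axis=0)
    logit = pooled @ w + b
    return 1.0 / (1.0 + np.exp(-logit))

def negative_sample(positive, n_loci, known, rng):
    """Hypothetical sampler: replace one locus until the result is not a known hyperedge."""
    while True:
        neg = list(positive)
        neg[rng.integers(len(neg))] = int(rng.integers(n_loci))
        neg = tuple(sorted(neg))
        if len(set(neg)) == len(neg) and neg not in known:
            return neg

rng = np.random.default_rng(0)
n_loci, dim = 50, 1024
locus_emb = rng.normal(size=(n_loci, dim))   # stands in for frozen encoder output
w, b = np.zeros(dim), 0.0                    # untrained head
positive = (3, 17, 42)
known = {positive}
negative = negative_sample(positive, n_loci, known, rng)
p_pos = hyperedge_probability(locus_emb, positive, w, b)
```

Only `w` and `b` would be updated during training in this sketch; the locus embeddings stay fixed, which mirrors the frozen encoder.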
ARCH3D gives state-of-the-art results
Conclusions
Funding