Persistent Topological Features in Large Language Models
Yuri Gardinazzi
Workshop: Interpretability in LLMs using Geometrical and Statistical Methods
27/5/2025
With Karthik Viswanathan, Giada Panerai, Alessio Ansuini, Alberto Cazzaniga & Matteo Biagetti (ICML 2025)
What’s going on inside Large Language Models?
Figure: the input prompt "welcome to the geom. workshop" is passed to an LLM, which produces an output.
What’s going on inside Large Language Models?
Step 1: Tokenization
Figure labels: input prompt tokens ("welcome", "to", "the", "geom.", "workshop"), LLM, last token representation, output.
What’s going on inside Large Language Models?
input embedding
hello
Transformer layers
Model | # Transformer layers | Hidden Dimension |
Llama 2 7B | 32 | 4096 |
Llama 3 8B | 32 | 4096 |
Mistral 7B | 32 | 4096 |
Pythia 6.9B | 32 | 4096 |
output layers
What’s going on inside Large Language Models?
hello
…
Last token
Layers of internal representations
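The point clouds analysed in the rest of the talk are built from last-token representations, one cloud per layer. A minimal sketch of that bookkeeping follows. It assumes the hidden states have already been extracted and stored as nested lists; the name `hidden_states` and its layout are hypothetical, not taken from the slides.

```python
def last_token_point_clouds(hidden_states):
    """Collect one point cloud per layer from last-token vectors.

    Assumed layout (hypothetical): hidden_states[p][l][t] is the vector
    of token t at layer l for prompt p. Prompts may differ in length.
    Returns clouds[l] = list of last-token vectors, one per prompt.
    """
    num_layers = len(hidden_states[0])
    return [
        [prompt[layer][-1] for prompt in hidden_states]
        for layer in range(num_layers)
    ]


# Toy data: 2 prompts, 2 layers, hidden dimension 2.
prompt_a = [
    [[0, 0], [1, 1], [2, 2]],  # layer 0, three tokens
    [[0, 1], [1, 2], [2, 3]],  # layer 1, three tokens
]
prompt_b = [
    [[5, 5], [6, 6]],  # layer 0, two tokens
    [[5, 6], [6, 7]],  # layer 1, two tokens
]
clouds = last_token_point_clouds([prompt_a, prompt_b])
```

Each entry of `clouds` is then a set of points in the hidden space (dimension 4096 for the models in the table above), with one point per prompt.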
What’s going on inside Large Language Models?
…
Hypothesis: The distribution of prompts is related to the inner workings of LLMs.
Goal: Describe global features of LLMs that are consistent across different models.
Strategy: Analyse internal representations with Topological Data Analysis.
Topological Data Analysis comes to the rescue!
Connected components
Loops
Voids
We can look at the shape of data
Persistent Homology: Vietoris-Rips filtration
Figure: four successive stages (1 to 4) of the filtration as the radius grows.
Connected components
Loops
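To make the connected-components part of the filtration concrete, here is a minimal sketch that computes only the 0-dimensional persistence pairs of a Vietoris-Rips filtration with a union-find structure. It is an illustration under stated assumptions, not the pipeline used in the talk: the filtration scale is taken to be the edge length itself (some conventions use half of it), loops and voids are not computed, and the toy point cloud is made up.

```python
import math
from itertools import combinations


def rips_h0_persistence(points):
    """Return (birth, death) pairs for connected components (H0).

    Every point is born at scale 0. When an edge first joins two
    separate components, one of them dies at that edge length.
    One component survives forever (death = inf).
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Edges sorted by length: the order in which they enter the filtration.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    pairs = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # merge: one component dies here
            pairs.append((0.0, length))
    pairs.append((0.0, math.inf))  # the component that never dies
    return pairs


# Hypothetical toy data: two tight clusters far apart.
cloud = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
diagram = rips_h0_persistence(cloud)
```

On this toy cloud two components die at scale 1 (points merging inside each cluster) and one dies at scale 9 (the two clusters merging), so the long-lived pair is the one that reflects the two-cluster shape of the data.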
What if your point cloud evolves in time?
ZigZag persistence
Connected components
Loops
Carlsson, Gunnar, and Vin De Silva. "Zigzag persistence." Foundations of computational mathematics 10 (2010): 367-405.
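Zigzag persistence needs a sequence of complexes connected by inclusions that may point in either direction. One common way to get such a sequence from a time-ordered family of complexes is to interleave each consecutive pair with their union. The sketch below only builds that sequence; it does not compute the persistence itself, and the slides do not say whether the authors use unions or intersections as the intermediate step, so the union is an assumption here.

```python
def zigzag_sequence(complexes):
    """Interleave consecutive complexes with their union.

    Input: a non-empty list of simplicial complexes, each a set of
    simplices given as sorted vertex tuples.
    Output: K_0, K_0 union K_1, K_1, K_1 union K_2, K_2, ...
    Every complex at an even position includes into its neighbours
    at odd positions, which gives the alternating arrow directions.
    """
    sequence = []
    for current, following in zip(complexes, complexes[1:]):
        sequence.append(set(current))
        sequence.append(set(current) | set(following))
    sequence.append(set(complexes[-1]))
    return sequence


# Hypothetical toy data: an edge (0, 1) at step 0 is replaced by (1, 2) at step 1.
k0 = {(0,), (1,), (0, 1)}
k1 = {(0,), (1,), (2,), (1, 2)}
zigzag = zigzag_sequence([k0, k1])
```

A feature that is present in `k0`, in the union, and in `k1` is tracked as a single interval across the two steps, which is what allows topological features to be followed through an evolving point cloud.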
ZigZag persistence on transformers
Filtration: kNN graph
Nonlinear transformations from layer to layer make the choice of a radius for Vietoris-Rips nontrivial.
Three edges joining three mutually adjacent vertices are filled in as a triangle
Six edges joining four mutually adjacent vertices are filled in as a tetrahedron
…
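The construction above can be sketched as follows: build a k-nearest-neighbour graph, then fill in every set of mutually adjacent vertices as a simplex. The symmetrisation rule (an edge exists if either endpoint lists the other among its k nearest) is an assumption, as is the toy point cloud; the brute-force enumeration is only suitable for small examples.

```python
import math
from itertools import combinations


def knn_graph(points, k):
    """Undirected kNN graph: i and j are joined if either one is
    among the k nearest neighbours of the other (assumed rule)."""
    n = len(points)
    edges = set()
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in others[:k]:
            edges.add(frozenset((i, j)))
    return edges


def clique_simplices(n, edges, max_dim=3):
    """Simplices of the clique complex up to dimension max_dim.

    A set of vertices spans a simplex exactly when every pair of
    them is joined by an edge: dimension 2 gives triangles,
    dimension 3 gives tetrahedra.
    """
    simplices = {0: [(i,) for i in range(n)]}
    for dim in range(1, max_dim + 1):
        simplices[dim] = [
            verts
            for verts in combinations(range(n), dim + 1)
            if all(frozenset(pair) in edges for pair in combinations(verts, 2))
        ]
    return simplices


# Hypothetical toy data: three nearby points and one far away.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
graph = knn_graph(cloud, k=2)
complex_ = clique_simplices(len(cloud), graph)
```

Because the graph depends only on neighbour ranks and not on absolute distances, the same k can be used at every layer even when the scale of the representations changes from layer to layer.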
Toy Example: “Let’s do some calendar math. Four months from [MONTH]”
Results: Effective Persistence Image
Results: Birth Relative Frequency
Rate at which new p-dimensional holes are created
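A minimal sketch of this quantity, given zigzag intervals as (birth layer, death layer) pairs for a fixed dimension p. The slide does not state the normalisation, so dividing by the total number of features is an assumption, and the interval list is made-up toy data.

```python
from collections import Counter


def birth_relative_frequency(intervals, num_layers):
    """Fraction of all p-dimensional features born at each layer.

    intervals: list of (birth_layer, death_layer) pairs.
    Normalising by the total number of features is an assumption.
    """
    total = len(intervals)
    if total == 0:
        return [0.0] * num_layers
    births = Counter(birth for birth, _ in intervals)
    return [births.get(layer, 0) / total for layer in range(num_layers)]


# Hypothetical toy intervals over 4 layers.
toy_intervals = [(0, 2), (0, 1), (1, 3), (3, 4)]
frequency = birth_relative_frequency(toy_intervals, num_layers=4)
```

Layers with a high value are those where the point cloud is being rearranged most, since many new holes appear there.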
Results: Inter-Layer Persistence (1/2)
Fraction of loops alive at layer L1 that are still alive at layer L2 (and were alive the whole path)
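The definition above can be sketched directly on a list of zigzag intervals. The sketch assumes half-open intervals [birth, death) in layer indices, so a feature is alive at layer l when birth <= l < death; that convention and the toy intervals are assumptions. Since each interval is contiguous, a feature alive at both layers was alive along the whole path between them.

```python
def inter_layer_persistence(intervals, l1, l2):
    """Fraction of features alive at layer l1 that are also alive at l2.

    intervals: list of (birth, death) pairs, read as half-open
    [birth, death) in layer indices (assumed convention).
    Works for l2 earlier or later than l1.
    """
    lo, hi = min(l1, l2), max(l1, l2)
    alive_at_l1 = [(b, d) for b, d in intervals if b <= l1 < d]
    if not alive_at_l1:
        return 0.0
    # An interval covering both lo and hi covers every layer in between.
    survivors = [(b, d) for b, d in alive_at_l1 if b <= lo and hi < d]
    return len(survivors) / len(alive_at_l1)


# Hypothetical toy intervals.
toy_intervals = [(0, 5), (1, 3), (2, 6), (4, 6)]
forward = inter_layer_persistence(toy_intervals, 2, 4)
backward = inter_layer_persistence(toy_intervals, 2, 0)
```

Here three features are alive at layer 2; two of them reach layer 4 and only one was already alive at layer 0.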
Probability that features alive at a certain layer are still alive in earlier or later layers.
Low α: more weight to short-lived features
High α: more weight to long-lived features
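One way to realise the role of α is to weight each feature by its lifetime raised to the power α before taking the fraction. This is only a guess at the form of the weighting: the slides do not give the formula, so the weight (death - birth) ** alpha, the half-open interval convention and the toy data are all assumptions.

```python
def weighted_inter_layer_persistence(intervals, l1, l2, alpha):
    """Lifetime-weighted fraction of features alive at l1 and at l2.

    Each feature gets weight (death - birth) ** alpha (assumed form).
    alpha = 0 weights all features equally; larger alpha lets
    long-lived features dominate.
    """
    lo, hi = min(l1, l2), max(l1, l2)
    total = 0.0
    surviving = 0.0
    for birth, death in intervals:
        if birth <= l1 < death:  # alive at l1, half-open interval
            weight = (death - birth) ** alpha
            total += weight
            if birth <= lo and hi < death:  # also alive at l2
                surviving += weight
    return surviving / total if total else 0.0


# Hypothetical toy intervals.
toy_intervals = [(0, 5), (1, 3), (2, 6), (4, 6)]
unweighted = weighted_inter_layer_persistence(toy_intervals, 2, 4, alpha=0)
weighted = weighted_inter_layer_persistence(toy_intervals, 2, 4, alpha=1)
```

With α = 0 the short-lived feature (1, 3) counts as much as the others; with α = 1 its contribution shrinks, so the value moves toward the behaviour of the long-lived features.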
Results: Inter-Layer Persistence (2/2)
Results: Relation to performance
Results: Layer Pruning
Other works: Gromov et al. (2024), Men et al. (2024)
Layers are pruned by cutting the block of layers whose Inter-Layer Persistence lies within 10% of its maximum value.
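The selection rule can be sketched as follows, under one reading of the criterion: given one non-negative Inter-Layer Persistence score per layer, pick the longest contiguous block whose scores are at least 90% of the maximum. How the per-layer score is obtained and how ties between blocks are resolved is not stated on the slide, so `scores`, the tolerance handling and the toy values are assumptions.

```python
def select_block_to_prune(scores, tolerance=0.10):
    """Indices of the longest contiguous run of layers whose score
    is at least (1 - tolerance) times the maximum score.

    scores: one non-negative value per layer (assumed input).
    On ties in length, the earliest run is returned.
    """
    threshold = (1.0 - tolerance) * max(scores)
    best_start, best_end = 0, 0  # half-open [start, end)
    start = None
    # A trailing sentinel closes a run that reaches the last layer.
    for i, score in enumerate(list(scores) + [float("-inf")]):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best_end - best_start:
                best_start, best_end = start, i
            start = None
    return list(range(best_start, best_end))


# Hypothetical per-layer scores for an 8-layer model.
toy_scores = [0.2, 0.5, 0.92, 1.0, 0.95, 0.6, 0.91, 0.3]
block = select_block_to_prune(toy_scores)
```

In this toy example layers 2 to 4 form the block; layer 6 also passes the threshold but is isolated, so it is not part of the longest run.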
Conclusions
yuri.gardinazzi@areasciencepark.it
Thank you!
Support: Inter-Layer Persistence
Power-weighted Inter-Layer Persistence.
Probability that features alive at a certain layer are still alive in earlier or later layers.
Betti number: number of p-dimensional holes
Inter-Layer Persistence
Birth Relative Frequency
Persistence
Bigger models
Different kNN: Inter-Layer Persistence
Sliding Window