Day 2
Dimensionality Reduction, Coevolution, and Contact Prediction
What is your favorite movie or TV show?
Conservation Review
Conservation can be displayed as a “weblogo”
MSA
Weblogo
What are we missing in our conservation/weblogo analysis?
Why might “Phylogeny” be important?
Putting it all together
Subgroups of sequences might have different conservation patterns!
Phylogeny: Distance method Review
Distance matrix
Data matrix
Delta
Alpha
Beta
Epsilon
Phylogenetics
| 1 | 2 | 3 | 4 | 5 | 6 |
Alpha | | | | | | |
Beta | | | | | | |
Gamma | | | | | | |
Delta | | | | | | |
Epsilon | | | | | | |
Features
Samples
UPGMA/NJ
        | Alpha | Beta | Gamma | Delta | Epsilon |
Alpha   |   0   |   4  |   3   |   2   |    2    |
Beta    |   4   |   0  |   3   |   6   |    2    |
Gamma   |   3   |   3  |   0   |   3   |    5    |
Delta   |   2   |   6  |   3   |   0   |    4    |
Epsilon |   2   |   2  |   5   |   4   |    0    |
Why might we want to do this?
https://pubmed.ncbi.nlm.nih.gov/10381871/
Species tree
Gene tree
image adapted from evolution-textbook.org
gene family α
gene family β
Distance matrix
        | Alpha | Beta | Gamma | Delta | Epsilon |
Alpha   |   0   |   4  |   3   |   2   |    2    |
Beta    |   4   |   0  |   3   |   6   |    2    |
Gamma   |   3   |   3  |   0   |   3   |    5    |
Delta   |   2   |   6  |   3   |   0   |    4    |
Epsilon |   2   |   2  |   5   |   4   |    0    |
Distance matrix
Dimensionality reduction
| 1 | 2 | 3 | 4 | 5 | 6 |
Alpha | | | | | | |
Beta | | | | | | |
Gamma | | | | | | |
Delta | | | | | | |
Epsilon | | | | | | |
6 Features
Samples
| X | Y |
Alpha | ? | ? |
Beta | ? | ? |
Gamma | ? | ? |
Delta | ? | ? |
Epsilon | ? | ? |
2 Features
X
Y
Plot
Multidimensional scaling
6 Features
| X | Y |
Alpha | ? | ? |
Beta | ? | ? |
Gamma | ? | ? |
Delta | ? | ? |
Epsilon | ? | ? |
2 Features
X
Y
Plot
| 1 | 2 | 3 | 4 | 5 | 6 |
Alpha | | | | | | |
Beta | | | | | | |
Gamma | | | | | | |
Delta | | | | | | |
Epsilon | | | | | | |
Samples
0 | 4 | 3 | 2 | 2 |
4 | 0 | 3 | 6 | 2 |
3 | 3 | 0 | 3 | 5 |
2 | 6 | 3 | 0 | 4 |
2 | 2 | 5 | 4 | 0 |
MDS (Multidimensional scaling): find a low-dimensional embedding that minimizes the difference in distances.
A
dm(A)
B
dm(B)
Minimize the difference
(dm(A) - dm(B))²
LET’S TRY!
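The MDS objective above can be sketched in a few lines of numpy. This uses classical (Torgerson) MDS, a closed-form variant, on the toy 5×5 distance matrix from the slides; the function name is mine:

```python
import numpy as np

def classical_mds(dm, k=2):
    """Classical (Torgerson) MDS: embed a distance matrix in k dimensions."""
    n = dm.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (dm ** 2) @ J          # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)        # eigendecomposition (ascending order)
    idx = np.argsort(vals)[::-1][:k]      # keep the top-k eigenvalues
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# Toy distance matrix for Alpha..Epsilon from the slides
dm = np.array([
    [0, 4, 3, 2, 2],
    [4, 0, 3, 6, 2],
    [3, 3, 0, 3, 5],
    [2, 6, 3, 0, 4],
    [2, 2, 5, 4, 0],
], dtype=float)

xy = classical_mds(dm, k=2)  # one (X, Y) coordinate per sample
print(xy.shape)              # (5, 2)
```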
Dimensionality reduction
n (samples)
d (features)
k
n (samples)
A
B
Linear transformation
A
B
W
Linear transformation
A
B
A’
W
WT
W
WT
What do those “lines” mean?
W
WT
What does this “line” mean?
A
B
M
A * M = B
Simple math!
One node, one line: input 2, weight 3, output 6: 2 * 3 = 6
What should "?" be? Inputs 2 and 1, weights 3 and ?, output 8:
(2 * 3) + (1 * ?) = 8, so ? = 2 and (2 * 3) + (1 * 2) = 8
Multiple lines and nodes: inputs (2, 1), one weight per line:
(2 * 3) + (1 * 2) = 8
(2 * 4) + (1 * 1) = 9
W
WT
Minimize difference
(find W, so that A ≈ A’)
PCA (Principal component analysis): W = top-k eigenvectors of the covariance matrix of A
Transform (rotate)
A
B
W
Transform (rotate) and plot the data!
A
B
W
Y
X
MDS - multidimensional scaling
A
dm(A)
B
dm(B)
A
B
A
PCA - principal component analysis
PC2
PC1
MDS2
MDS1
PCA “rotates” the data to find axes that maximize the variance.
MDS finds a lower-dimensional manifold that preserves the distances.
LET'S TRY
https://www.pnas.org/doi/10.1073/pnas.1711913115
https://onlinelibrary.wiley.com/doi/10.1002/pro.4422
https://github.com/gelnesr/SVD-of-MSAs
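The PCA recipe above (W = top-k eigenvectors of the covariance matrix of A) can be sketched directly in numpy; the toy data here is assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 6))            # assumed toy data: 100 samples, 6 features

Ac = A - A.mean(axis=0)                  # center each feature
C = np.cov(Ac, rowvar=False)             # 6 x 6 covariance matrix
vals, vecs = np.linalg.eigh(C)           # eigendecomposition (ascending order)
W = vecs[:, np.argsort(vals)[::-1][:2]]  # W = top-2 eigenvectors (6 x 2)

B = Ac @ W                               # rotate/project: each sample -> (X, Y)
print(B.shape)                           # (100, 2): ready to plot
```

Because the eigenvectors of a covariance matrix are orthonormal, `B` is literally a rotation (plus truncation) of the centered data.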
Part 2
Coevolution
Phylogenetics
Conservation
What can we learn from MSA(s)?
Review - Conservation
A
B
C
D
E
0
1
2
3
4
5
A
Sequence
Multiple Sequence Alignment
Conservation
BLAST
Review - Phylogeny
A
B
C
D
E
0
1
2
3
4
5
Phylogeny
0 | 4 | 3 | 2 | 2 |
4 | 0 | 3 | 6 | 2 |
3 | 3 | 0 | 3 | 5 |
2 | 6 | 3 | 0 | 4 |
2 | 2 | 5 | 4 | 0 |
Distance matrix
NJ/UPGMA
A
Sequence
Multiple Sequence Alignment
BLAST
Review - Dimensionality reduction
A
B
C
D
E
0
1
2
3
4
5
Phylogeny
0 | 4 | 3 | 2 | 2 |
4 | 0 | 3 | 6 | 2 |
3 | 3 | 0 | 3 | 5 |
2 | 6 | 3 | 0 | 4 |
2 | 2 | 5 | 4 | 0 |
Distance matrix
Dimensionality Reduction
NJ/UPGMA
MDS
PCA
A
Sequence
Multiple Sequence Alignment
BLAST
Today - Coevolution!
A
B
C
D
E
0
1
2
3
4
5
A
Sequence
Multiple Sequence Alignment
BLAST
Which positions appear to covary?
A
B
C
D
E
0
1
2
3
4
5
Which positions appear to covary?
A
B
C
D
E
0
2
1
3
4
5
Which positions appear to covary?
A
B
C
D
E
0
2
3
4
1
5
Positions 0 and 1: P(■■) = ?
Position-specific scoring matrix (PSSM)
Two positions (0 and 1), each with per-letter probabilities:
P1 = [0.5, 0.5], P2 = [0.5, 0.5]
P(■■) = P1[■] x P2[■] = 0.25
Equivalently, with log-scale values V1 = [-0.7, -0.7] and V2 = [-0.7, -0.7]:
P(■■) = exp(V1[■]+V2[■]) ≈ 0.25
Markov Random Fields for Sequences (Alan S. Lapedes)
Position-specific scoring matrix (PSSM): P(■■) = exp(V1[■]+V2[■]) = 0.25
Adding a 2-body coupling W12 between positions 0 and 1:
W12 = [[0, 1], [1, 0]]
P(■■) = exp(V1[■]+V2[■]+W12[■][■]) / Z ≈ 0.36
Z = Σ over all four two-letter states of exp(V1[■]+V2[■]+W12[■][■])
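The ≈0.36 number can be checked numerically. A small numpy sketch with the slide's values (V1 = V2 = [-0.7, -0.7], W12 = [[0, 1], [1, 0]]):

```python
import numpy as np

# Two positions, two letters; values taken from the slide
V1 = np.array([-0.7, -0.7])
V2 = np.array([-0.7, -0.7])
W12 = np.array([[0.0, 1.0],
                [1.0, 0.0]])   # coupling rewards the two "mismatched" pairs

# Unnormalized log-probability of every two-letter state, then normalize by Z
E = V1[:, None] + V2[None, :] + W12
P = np.exp(E) / np.exp(E).sum()

# Coupled pairs come out ≈ 0.37 (the slide's ≈ 0.36, rounding aside),
# uncoupled pairs ≈ 0.13
print(P.round(2))
```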
Markov Random Field (MRF)
1-body terms (vi) to encode conservation
2-body terms (wij) to encode coevolution
Markov Random Fields
For each sequence and position, maximize the pseudo-likelihood (PLL):
conservation term (v) + coevolution term (w) + reg(w, v)
pll - Besag et al. 1975, 1977
GREMLIN - Balakrishnan et al. 2009; GREMLIN - Balakrishnan et al. 2011; GREMLIN_v2 - Kamisetty et al. 2013; plmDCA - Ekeberg et al. 2013
GREMLIN (Generative REgularized ModeLs of proteINs)
Let’s pretend there are three positions (0, 1, 2), with conservation and coevolution terms to learn.
x = one_hot(MSA)
x
0
1
2
x' = x@w
w
x
0
1
2
[w]eights = coevolution
x'
x' = x@w + b
b
x
0
1
2
[w]eights = coevolution
[b]ias = conservation
w
x' = softmax(x@w + b)
softmax(x) = exp(x)/sum(exp(x))
To make sure probabilities at each position sum to 1.0
b
x
x'
0
1
2
w
Fully connected layer (Dense layer): P(x1,x2,x3) ≈ P(x1|x1,x2,x3) * P(x2|x1,x2,x3) * P(x3|x1,x2,x3)
x' = softmax(x@w + b)
loss = -x*log(x')
loss
b
x
x'
0
1
2
w
Minimize difference using categorical cross-entropy.
Problem: the model can learn an identity mapping (self-connections).
Remove self-connections (Pseudo-likelihood): P(x1,x2,x3) ≈ P(x1|x2,x3) * P(x2|x1,x3) * P(x3|x1,x2)
x' = softmax(x@w + b)
loss = -x*log(x')
loss
b
x
x'
0
1
2
w
Minimize difference using categorical cross-entropy
Balakrishnan, S., Kamisetty, H., Carbonell, J.G. and Langmead, C.J., 2009. Structure Learning for Generative Models of Protein Fold Families.
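The pseudo-likelihood setup above can be sketched as a single numpy forward pass (toy shapes and data are made up; the gradient-based fitting of w and b is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, A = 8, 3, 4                      # toy: 8 sequences, 3 positions, 4 letters
msa = rng.integers(0, A, size=(N, L))  # hypothetical MSA as integer codes

x = np.eye(A)[msa].reshape(N, L * A)   # x = one_hot(MSA), flattened

w = rng.normal(0.0, 0.01, size=(L * A, L * A))
w = (w + w.T) / 2.0                    # couplings are symmetric
b = np.zeros(L * A)
for i in range(L):                     # remove self-connections (pseudo-likelihood)
    w[i * A:(i + 1) * A, i * A:(i + 1) * A] = 0.0

logits = (x @ w + b).reshape(N, L, A)
xp = np.exp(logits)
xp = xp / xp.sum(axis=-1, keepdims=True)         # softmax per position

loss = -(np.eye(A)[msa] * np.log(xp)).sum() / N  # categorical cross-entropy
print(loss)
```

Training would repeat this forward pass while updating `w` and `b` by gradient descent, plus the regularizer `reg(w, v)` from the objective above.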
For each sequence and position, maximize the pseudo-likelihood (PLL):
conservation term (v) + coevolution term (w) + reg(w, v)
pll - Besag et al. 1975, 1977
GREMLIN - Balakrishnan et al. 2009; GREMLIN - Balakrishnan et al. 2011; GREMLIN_v2 - Kamisetty et al. 2013; plmDCA - Ekeberg et al. 2013
0
1
2
GREMLIN (Generative REgularized ModeLs of proteINs)
x' = softmax(x@w + b)
b
x
x'
0
1
2
w
x' = x@w
w
x
0
1
2
[w]eights = coevolution
x'
w
x
0
1
2
[w]eights = coevolution
loss
x'
x' = x@w
loss = (x' - x)²
w = I (analytical solution: identity matrix)
w
x
0
1
2
[w]eights = coevolution
loss
x'
x' = x@w
loss = (x' - x)²/N - 2·Tr(w)
w = cov(x)⁻¹ + I
Direct Coupling Analysis
(analytical solution: inverse covariance)
Dauparas, J., Wang, H., Swartz, A., Koo, P., Nitzan, M. and Ovchinnikov, S., 2019. Unified framework for modeling multivariate distributions in biological sequences. arXiv.
Morcos, F., … Weigt, M., 2011. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. PNAS.
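A minimal numpy sketch of the inverse-covariance idea behind DCA. Toy Gaussian data stands in for a one-hot MSA, and the ridge term and sign convention here are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 9))   # stand-in for a one-hot encoded MSA (50 x L*A)
xc = x - x.mean(axis=0)

# Couplings from the inverse covariance matrix; a small ridge term
# keeps the inversion numerically stable on finite data.
C = np.cov(xc, rowvar=False) + 0.1 * np.eye(9)
w = -np.linalg.inv(C)
print(w.shape)   # (9, 9)
```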
w
x
0
1
2
x'
x' = x@w@wT
loss = (x' - x)²
w = PCA(x).components_
Autoencoder (analytical solution: principal component analysis)
loss
wT
w
x
0
1
2
x'
x' = x@w@wT
loss = (x' - x)²
w = PCA(x).components_
Autoencoder (analytical solution: principal component analysis)
loss
wT
PC1
PC2
0
1
2
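The linear autoencoder above has a closed-form solution. A numpy sketch on assumed toy data, computing w from the covariance eigenvectors rather than a fitted PCA object:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 6))   # assumed toy data
xc = x - x.mean(axis=0)

# Analytical solution of the linear autoencoder x' = x@w@wT:
# w = top-2 principal axes (eigenvectors of the covariance matrix)
vals, vecs = np.linalg.eigh(np.cov(xc, rowvar=False))
w = vecs[:, np.argsort(vals)[::-1][:2]]   # (6, 2)

z = xc @ w                                # encode: (PC1, PC2) per sample
xr = z @ w.T                              # decode back to 6 features
loss = ((xr - xc) ** 2).mean()
print(loss)   # small but nonzero: 4 principal components were discarded
```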
No connections (Conservation)
P(x1,x2,x3) ≈ P(x1) * P(x2) * P(x3)
loss
b
x
x'
0
1
2
x' = softmax(b)
loss = -x*log(x')
x' = softmax(x@w + b)
loss = -x*log(x') + λb² + λw²
loss
GREMLIN (Generative REgularized ModeLs of proteINs)
b
x
x'
0
1
2
w
(W)eights of MRF: for positions 0, 1, 2, W is a position-by-position grid of per-letter coupling blocks
(W)eights of MRF can be “reduced” to a Contact map
Take a norm over each per-letter block of W to collapse it into a position-by-position contact map.
Low-rank correction is often required: decompose the map into a sparse matrix plus a low-rank matrix.
APC = raw - AP, where AP = ∑(i,:) x ∑(:,j) / ∑(:,:)
Dunn et al. 2008
Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction.
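Both reduction steps, block norm and average product correction, fit in a short numpy sketch (the coupling tensor W here is hypothetical random data):

```python
import numpy as np

rng = np.random.default_rng(0)
L, A = 10, 4                              # hypothetical: 10 positions, 4 letters
W = rng.normal(size=(L, A, L, A))
W = (W + W.transpose(2, 3, 0, 1)) / 2.0   # symmetric coupling tensor

# Reduce W to an L x L map: norm over each A x A block
raw = np.sqrt((W ** 2).sum(axis=(1, 3)))
np.fill_diagonal(raw, 0.0)

# Average Product Correction: AP(i,j) = sum(i,:) * sum(:,j) / sum(:,:)
ap = raw.sum(axis=1, keepdims=True) * raw.sum(axis=0, keepdims=True) / raw.sum()
apc = raw - ap
print(apc.shape)   # (10, 10) contact map
```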
Contact map
50S ribosomal protein L6
N
C
C
XRAY Contacts
How to read a contact map
50S ribosomal protein L6
N
C
C
XRAY Contacts
Overlay of predicted contacts on real contacts
N
C
C
50S ribosomal protein L6
XRAY Contacts
Predicted Contacts
LET'S TRY
A
B
C
D
E
0
1
2
3
4
5
Sequences
Positions
Co-evolution
0
1
2
3
4
5
0
1
2
3
4
5
0
1
2
3
4
5
Tomorrow
V1
V2
V3
V4
W14
X1
X2
X3
X4
MRF/Potts
(markov random fields)
MG
(multivariate gaussian)
AE
(autoencoders)
PSSM
(position-specific scoring matrix)
BERT/Transformer
AlphaFold/RoseTTAFold
All these models can be expressed as “Autoencoders”
PSSM
MRF
MG
AE
BERT
Bonus
Find a reduced dimensionality that preserves only local distances (or neighbors)?
Why might we want to do this?
SM = Gaussian(DM): turn the Distance Matrix into a Similarity Matrix
A
Data matrix (8 samples, features x, y, z):
  |  x   |  y   |  z   |
1 | -0.4 |  1.0 |  0.3 |
2 | -0.6 | -0.6 | -0.3 |
3 | -0.2 |  0.5 |  0.5 |
4 | -0.7 |  0.9 |  0.0 |
5 |  0.9 | -0.8 |  0.9 |
6 | -1.0 |  0.7 | -0.7 |
7 | -0.1 |  0.3 |  0.3 |
8 |  0.8 |  0.5 | -0.6 |
dm (8 x 8 distance matrix over samples 1-8):
0.0 | 1.7 | 0.5 | 0.4 | 2.3 | 1.1 | 0.8 | 1.5 |
1.7 | 0.0 | 1.4 | 1.6 | 1.9 | 1.4 | 1.2 | 1.8 |
0.5 | 1.4 | 0.0 | 0.8 | 1.8 | 1.3 | 0.3 | 1.5 |
0.4 | 1.6 | 0.8 | 0.0 | 2.5 | 0.8 | 1.0 | 1.7 |
2.3 | 1.9 | 1.8 | 2.5 | 0.0 | 2.8 | 1.6 | 2.0 |
1.1 | 1.4 | 1.3 | 0.8 | 2.8 | 0.0 | 1.4 | 1.8 |
0.8 | 1.2 | 0.3 | 1.0 | 1.6 | 1.4 | 0.0 | 1.3 |
1.5 | 1.8 | 1.5 | 1.7 | 2.0 | 1.8 | 1.3 | 0.0 |
sm (8 x 8 similarity matrix over samples 1-8):
smij = exp(-dmij² / 2σ²)
1.0 | 0.0 | 0.6 | 0.7 | 0.0 | 0.1 | 0.3 | 0.0 |
0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 |
0.6 | 0.0 | 1.0 | 0.3 | 0.0 | 0.0 | 0.8 | 0.0 |
0.7 | 0.0 | 0.3 | 1.0 | 0.0 | 0.3 | 0.1 | 0.0 |
0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
0.1 | 0.0 | 0.0 | 0.3 | 0.0 | 1.0 | 0.0 | 0.0 |
0.3 | 0.1 | 0.8 | 0.1 | 0.0 | 0.0 | 1.0 | 0.0 |
0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Dij = √((xi-xj)² + (yi-yj)² + (zi-zj)²)
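The distance-to-similarity conversion above in numpy (toy 3D points; σ = 0.5 is an assumed bandwidth):

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(8, 3))             # 8 samples with features x, y, z (toy)

diff = pts[:, None, :] - pts[None, :, :]  # pairwise coordinate differences
dm = np.sqrt((diff ** 2).sum(axis=-1))    # D_ij = sqrt of summed squared diffs

sigma = 0.5                               # assumed Gaussian bandwidth
sm = np.exp(-dm ** 2 / (2 * sigma ** 2))  # sm_ij = exp(-dm_ij^2 / 2σ^2)
print(sm.diagonal())                      # all 1.0: self-similarity is maximal
```

Small σ zeroes out similarities between distant points, which is exactly the "only preserve local distances" behavior the bonus slide asks about.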
A
dm
B
dm
A
dm
B
dm
sm
sm
MDS
UMAP/TSNE
A
B
A
PCA
TSNE (t-distributed Stochastic Neighbor Embedding)
UMAP (Uniform Manifold Approximation and Projection)
MDS
UMAP/TSNE
Phylogeny
PCA
Encode integers by which prime factors they share
8 million integers
78628D to 2D
https://johnhw.github.io/umap_primes/index.md.html
CHECK IT OUT
https://johnhw.github.io/umap_primes/three_umap.html
Cool interactive resources
w
x
0
1
2
x'
x' = softmax(x@w)@wT
loss = (x' - x)²
w = KMeans(x).means_
Categorical Autoencoder (solution: k-means)
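The k-means reading of this categorical autoencoder, sketched with Lloyd's algorithm on toy blobs. (Note that `KMeans(x).means_` is slide shorthand; scikit-learn's actual attribute is `cluster_centers_`.)

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated blobs in 3D
x = np.vstack([rng.normal(0.0, 0.1, (20, 3)),
               rng.normal(3.0, 0.1, (20, 3))])

# k-means as a categorical autoencoder: encode each point as a hard
# cluster assignment, decode it as the cluster mean, minimize (x' - x)^2.
w = x[[0, 20]].copy()                     # init one center per blob (assumed)
for _ in range(10):                       # Lloyd's iterations
    d = ((x[:, None, :] - w[None, :, :]) ** 2).sum(axis=-1)
    z = d.argmin(axis=1)                  # "hard softmax": nearest center
    w = np.array([x[z == k].mean(axis=0) for k in range(2)])

x_rec = w[z]                              # decode: each point -> its cluster mean
print(((x_rec - x) ** 2).mean())          # small reconstruction error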
loss
wT
softmax(x) = exp(x)/sum(exp(x))
0
1
2
Variational Autoencoder
loss
x
Z
μ
σ
Only connect to previous positions (Autoregressive model)
P(x1,x2,x3) ≈ P(x1) * P(x2|x1) * P(x3|x1,x2)
b
x'
x' = softmax(x@w + b)
loss = -x*log(x')
loss
x
0
1
2
w
Trinquier, J., Uguzzoni, G., Pagnani, A., Zamponi, F. and Weigt, M., 2021. Efficient generative modeling of protein sequences using simple autoregressive models. Nature Communications, 12(1), pp.1-11.
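The "only connect to previous positions" constraint can be imposed with a block mask on w; toy shapes, and the `np.kron` mask construction is my choice:

```python
import numpy as np

rng = np.random.default_rng(0)
L, A = 3, 4                               # toy: 3 positions, 4 letters
w = rng.normal(0.0, 0.1, size=(L * A, L * A))

# Autoregressive mask: output position j may only read input positions i < j
mask = np.kron(np.triu(np.ones((L, L)), k=1), np.ones((A, A)))
w = w * mask

msa = rng.integers(0, A, size=(5, L))
x = np.eye(A)[msa].reshape(5, L * A)      # one-hot, position-major blocks
logits = (x @ w).reshape(5, L, A)

print(logits[:, 0].std())   # 0.0: position 0 has no incoming connections
```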