1 of 97

Day 2

Dimensionality Reduction: Coevolution and Contact Prediction

2 of 97

What is your favorite movie or TV show?

3 of 97

Conservation Review

4 of 97

Conservation can be displayed as a “weblogo”

MSA

Weblogo

5 of 97

What are we missing in our conservation/weblogo analysis?

  1. Coevolution
  2. Phylogeny

Why might “Phylogeny” be important?

  1. Different groups of sequences might have different evolutionary constraints

6 of 97

Putting it all together

7 of 97

Subgroups of sequences might have different conservation patterns!

8 of 97

Phylogeny: Distance method Review

9 of 97

Data matrix (samples Alpha–Epsilon × features 1–6) → Distance matrix → UPGMA/NJ → Phylogenetics

Distance matrix:

         Alpha  Beta  Gamma  Delta  Epsilon
Alpha      0     4     3      2      2
Beta       4     0     3      6      2
Gamma      3     3     0      3      5
Delta      2     6     3      0      4
Epsilon    2     2     5      4      0

10 of 97

11 of 97

12 of 97

Why might we want to do this?

13 of 97

14 of 97

https://pubmed.ncbi.nlm.nih.gov/10381871/

15 of 97

Species tree

Gene tree

image adapted from evolution-textbook.org

gene family α

gene family β

16 of 97

Distance matrix:

         Alpha  Beta  Gamma  Delta  Epsilon
Alpha      0     4     3      2      2
Beta       4     0     3      6      2
Gamma      3     3     0      3      5
Delta      2     6     3      0      4
Epsilon    2     2     5      4      0

17 of 97

Distance matrix (repeated from the previous slide)

18 of 97

Dimensionality reduction

Data matrix (samples Alpha–Epsilon × 6 features) → reduce to 2 features (X, Y) → Plot

          X   Y
Alpha     ?   ?
Beta      ?   ?
Gamma     ?   ?
Delta     ?   ?
Epsilon   ?   ?

19 of 97

Multidimensional scaling

Data matrix (samples Alpha–Epsilon × 6 features) → 5 × 5 distance matrix → find 2 features (X, Y) per sample whose pairwise distances reproduce that distance matrix → Plot

20 of 97

MDS (Multidimensional scaling): find a low-dimensional embedding that minimizes the difference in distances.

A → dm(A), B → dm(B)

Minimize the difference: (dm(A) − dm(B))²
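The slide's objective can be minimized by gradient descent; a minimal sketch instead uses classical (Torgerson) MDS, which solves the metric version of this problem in closed form via double centering. The 5×5 distance matrix is the one from the earlier slides.

```python
import numpy as np

# Distance matrix from the slides (Alpha..Epsilon)
dm = np.array([[0, 4, 3, 2, 2],
               [4, 0, 3, 6, 2],
               [3, 3, 0, 3, 5],
               [2, 6, 3, 0, 4],
               [2, 2, 5, 4, 0]], dtype=float)

n = dm.shape[0]
J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
B = -0.5 * J @ (dm ** 2) @ J              # double-centered squared distances
evals, evecs = np.linalg.eigh(B)          # eigenvalues in ascending order
top = np.argsort(evals)[::-1][:2]         # indices of the top-2 eigenvalues
xy = evecs[:, top] * np.sqrt(np.maximum(evals[top], 0))
print(xy)                                 # (5, 2): X, Y coordinates to plot
```

Each row of `xy` is one sample's (X, Y) position; plotting it gives the 2-D map of the five sequences.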

21 of 97

LET’S TRY!

22 of 97

23 of 97

Dimensionality reduction

A: n (samples) × d (features) → B: n (samples) × k, with k < d

24 of 97

Linear transformation

B = A @ W

25 of 97

Linear transformation

B = A @ W, A' = B @ WT = A @ W @ WT

26 of 97

27 of 97

W

WT

28 of 97

What do those “lines” mean?

W

WT

29 of 97

What does this “line” mean?

A

B

M

A * M = B

30 of 97

Simple math!

Input 2, weight 3 → output 6: 2 * 3 = 6

31 of 97

What should the missing weight be?

Inputs 2 and 1, output 8, one known weight 3:

(2 * 3) + (1 * ?) = 8

32 of 97

The missing weight is 2:

(2 * 3) + (1 * 2) = 8

33 of 97

Multiple lines and nodes

Inputs 2 and 1, two output nodes:

(2 * 3) + (1 * 2) = 8
(2 * 4) + (1 * 1) = 9

34 of 97

Minimize the difference (find W so that A ≈ A')

PCA (Principal component analysis): W = top-k eigenvectors of the covariance matrix of A
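A minimal numpy sketch of this definition, on toy Gaussian data (the 100×6 matrix is an assumption for illustration): take the top-k eigenvectors of the covariance matrix as W, project, and reconstruct.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 6))              # toy data: n=100 samples, d=6 features
A = A - A.mean(axis=0)                     # center each feature
C = np.cov(A, rowvar=False)                # d x d covariance matrix
evals, evecs = np.linalg.eigh(C)           # eigenvalues in ascending order
W = evecs[:, np.argsort(evals)[::-1][:2]]  # top-2 eigenvectors: d x k
B = A @ W                                  # project (rotate): n x 2
A_recon = B @ W.T                          # A' = A @ W @ W.T, rank-2 reconstruction
```

`W` has orthonormal columns, so `W @ W.T` projects onto the top-variance plane, which is what "minimize the difference between A and A'" picks out.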

35 of 97

Transform (rotate)

A

B

W

36 of 97

Transform (rotate) and plot the data!

A

B

W

Y

X

37 of 97

MDS - multidimensional scaling

A

dm(A)

B

dm(B)

A

B

A

PCA - principal component analysis

PC2

PC1

MDS2

MDS1

PCA “rotates” the data to find axes that maximize the variance.

MDS finds a lower-dimensional manifold that preserves the distances.

38 of 97

LET'S TRY

https://www.pnas.org/doi/10.1073/pnas.1711913115

https://onlinelibrary.wiley.com/doi/10.1002/pro.4422

https://github.com/gelnesr/SVD-of-MSAs

39 of 97

Part 2

40 of 97

Coevolution

Phylogenetics

Conservation

What can we learn from MSA(s)?

41 of 97

Review - Conservation

Sequence → BLAST → Multiple Sequence Alignment (sequences A–E × positions 0–5) → Conservation

42 of 97

Review - Phylogeny

Sequence → BLAST → Multiple Sequence Alignment (sequences A–E × positions 0–5) → Distance matrix → NJ/UPGMA → Phylogeny

43 of 97

Review - Dimensionality reduction

Sequence → BLAST → Multiple Sequence Alignment → Distance matrix
Distance matrix → NJ/UPGMA → Phylogeny
Distance matrix → MDS/PCA → Dimensionality Reduction

44 of 97

Today - Coevolution!

Sequence → BLAST → Multiple Sequence Alignment (sequences A–E × positions 0–5)

45 of 97

Which positions appear to covary?

(MSA: sequences A–E × positions 0–5)

46 of 97

Which positions appear to covary?

(MSA with columns reordered: 0, 2, 1, 3, 4, 5)

47 of 97

Which positions appear to covary?

(MSA with columns reordered: 0, 2, 3, 4, 1, 5)

48 of 97

Positions 0 and 1: P(a, b) = ?

49 of 97

Position-specific scoring matrix (PSSM)

P1 = [0.5, 0.5], P2 = [0.5, 0.5]

P(a, b) = P1[a] × P2[b] = 0.25

50 of 97

Position-specific scoring matrix (PSSM)

V1 = [−0.7, −0.7], V2 = [−0.7, −0.7]

P(a, b) = exp(V1[a] + V2[b]) = 0.25

51 of 97

Markov Random Fields

for Sequences

Alan S. Lapedes

52 of 97

Position-specific scoring matrix (PSSM):

V1 = [−0.7, −0.7], V2 = [−0.7, −0.7]

P(a, b) = exp(V1[a] + V2[b]) = 0.25

Adding a pairwise coupling:

W12 = [[0, 1],
       [1, 0]]

P(a, b) = exp(V1[a] + V2[b] + W12[a][b]) / Z = 0.36

Z = Σ over all pairs (a', b') of exp(V1[a'] + V2[b'] + W12[a'][b'])

Markov Random Field (MRF)

1-body terms (vi) to encode conservation

2-body terms (wij) to encode coevolution
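Plugging the slide's numbers into numpy (two positions, two characters) confirms the 0.36: the coupled pairs gain probability at the expense of the uncoupled ones.

```python
import numpy as np

V1 = np.array([-0.7, -0.7])        # 1-body terms, position 1
V2 = np.array([-0.7, -0.7])        # 1-body terms, position 2
W12 = np.array([[0.0, 1.0],        # 2-body coupling between the two positions
                [1.0, 0.0]])

E = np.exp(V1[:, None] + V2[None, :] + W12)  # unnormalized weight of each pair (a, b)
Z = E.sum()                                   # partition function: sum over all 4 pairs
P = E / Z
print(P)   # coupled pairs get ~0.36, uncoupled pairs ~0.13
```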

53 of 97

Markov Random Fields

Cons:

  • Likelihood calculations are VERY slow
  • 20^Length combinations for proteins
    • For a 100 aa protein that's ~1.3×10^130
    • For scale: there are ~1.0×10^82 atoms in the UNIVERSE!
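A one-line sanity check of the 20^Length figure:

```python
# Number of possible sequences for a 100-residue protein: 20^100
n_seqs = 20 ** 100
print(f"{float(n_seqs):.1e}")   # ~1.3e130, dwarfing the ~1e82 atoms in the universe
```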

54 of 97

Sequence

Position

conservation

coevolution

+ reg(w,v)

pll - Besag et al 1975, 1977

pll - Besag et al. 1975, 1977
GREMLIN - Balakrishnan et al. 2009
GREMLIN - Balakrishnan et al. 2011
GREMLIN_v2 - Kamisetty et al. 2013
plmDCA - Ekeberg et al. 2013

GREMLIN (Generative REgularized ModeLs of proteINs)

55 of 97

Let’s pretend there are:

  • 3 positions
  • 3 characters:

0

1

2

conservation

coevolution

56 of 97

0

1

2

57 of 97

x = one_hot(MSA)

x

0

1

2
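A minimal sketch of `one_hot(MSA)` with the toy setup from these slides (3 positions, 3 characters); the 4-sequence MSA is an assumption for illustration.

```python
import numpy as np

# Toy MSA: 4 sequences x 3 positions over a 3-letter alphabet (0, 1, 2)
msa = np.array([[0, 1, 2],
                [0, 2, 1],
                [1, 1, 2],
                [0, 1, 2]])
A = 3                               # alphabet size
x = np.eye(A)[msa]                  # (4, 3, 3): one-hot vector per position
x = x.reshape(msa.shape[0], -1)     # flatten to (4, 9) so it can be fed to x @ w
```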

58 of 97

x' = x@w

w

x

0

1

2

[w]eights = coevolution

x'

59 of 97

x' = x@w + b

b

x

0

1

2

[w]eights = coevolution

[b]ias = conservation

w

60 of 97

x' = softmax(x@w + b)

softmax(x) = exp(x)/sum(exp(x))

This makes sure the probabilities at each position sum to 1.0.

b

x

x'

0

1

2

w

61 of 97

Fully connected layer (Dense layer)
P(x1,x2,x3) ≈ P(x1|x1,x2,x3) * P(x2|x1,x2,x3) * P(x3|x1,x2,x3)

x' = softmax(x@w + b)
loss = -x*log(x')

loss

b

x

x'

0

1

2

w

Minimize the difference using categorical cross-entropy.

Problem: the model can learn a trivial identity mapping through self-connections.

62 of 97

Remove self-connections (Pseudo-likelihood)
P(x1,x2,x3) ≈ P(x1|x2,x3) * P(x2|x1,x3) * P(x3|x1,x2)

x' = softmax(x@w + b)
loss = -x*log(x')

loss

b

x

x'

0

1

2

w

Minimize the difference using categorical cross-entropy.

Balakrishnan, S., Kamisetty, H., Carbonell, J.G. and Langmead, C.J., 2009. Structure Learning for Generative Models of Protein Fold Families.
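A minimal numpy sketch of this pseudo-likelihood setup, with the toy alphabet from these slides. The synthetic MSA (positions 1 and 2 forced to covary), learning rate, and iteration count are assumptions; GREMLIN's actual optimizer and regularization differ.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, A = 200, 3, 3                          # sequences, positions, alphabet size
msa = rng.integers(0, A, size=(N, L))
msa[:, 2] = msa[:, 1]                        # make positions 1 and 2 covary
x = np.eye(A)[msa].reshape(N, L * A)         # one-hot, (N, L*A)

w = np.zeros((L * A, L * A))
b = np.zeros(L * A)
mask = 1 - np.kron(np.eye(L), np.ones((A, A)))   # zero out self-connections

for _ in range(200):                         # plain gradient descent on the loss
    logits = (x @ (w * mask) + b).reshape(N, L, A)
    p = np.exp(logits - logits.max(-1, keepdims=True))
    p = p / p.sum(-1, keepdims=True)         # softmax per position
    g = (p.reshape(N, L * A) - x) / N        # d(cross-entropy)/d(logits)
    gw = x.T @ g
    w = w - 0.5 * (gw + gw.T) / 2 * mask     # keep w symmetric and masked
    b = b - 0.5 * g.sum(axis=0)

# Coupling strength between positions i, j: norm of the A x A block of w
wn = np.linalg.norm((w * mask).reshape(L, A, L, A), axis=(1, 3))
print(wn)   # the (1, 2) block should dominate
```

Because positions 1 and 2 are copies of each other, the learned couplings between them grow large, while couplings involving the independent position 0 stay near zero.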

63 of 97

Sequence

Position

conservation

coevolution

maximize PLL

+ reg(w,v)

pll - Besag et al 1975, 1977

GREMLIN - Balakrishnan et al. 2009
GREMLIN - Balakrishnan et al. 2011
GREMLIN_v2 - Kamisetty et al. 2013
plmDCA - Ekeberg et al. 2013

0

1

2

GREMLIN (Generative REgularized ModeLs of proteINs)

64 of 97

x' = softmax(x@w + b)

b

x

x'

0

1

2

w

65 of 97

x' = x@w

w

x

0

1

2

[w]eights = coevolution

x'

66 of 97

w

x

0

1

2

[w]eights = coevolution

loss

x'

x' = x@w
loss = (x' - x)²

Analytical solution: w = I (the identity matrix)

67 of 97

w

x

0

1

2

[w]eights = coevolution

loss

x'

x' = x@w
loss = (x' - x)²/N - 2Tr(w)

Direct Coupling Analysis (analytical solution: w = cov(x)⁻¹ + I, the inverse covariance)

Dauparas, J., Wang, H., Swartz, A., Koo, P., Nitzan, M. and Ovchinnikov, S., 2019. Unified framework for modeling multivariate distributions in biological sequences. arXiv.
Morcos, F., … Weigt, M., 2011. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. PNAS.
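The inverse-covariance solution can be sketched directly in numpy on the same toy MSA (positions 1 and 2 forced to covary; the ridge term added for invertibility is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, A = 500, 3, 3
msa = rng.integers(0, A, size=(N, L))
msa[:, 2] = msa[:, 1]                        # make positions 1 and 2 covary
x = np.eye(A)[msa].reshape(N, L * A)
x = x - x.mean(axis=0)                       # center the one-hot features

C = x.T @ x / N + 0.1 * np.eye(L * A)        # covariance + small ridge so it inverts
w = np.linalg.inv(C)                         # couplings ~ inverse covariance
                                             # (the +I in the slide only shifts the
                                             # diagonal blocks, which we discard)
wn = np.linalg.norm(w.reshape(L, A, L, A), axis=(1, 3))   # per-pair strength
np.fill_diagonal(wn, 0)
print(wn)   # the (1, 2) entry should dominate
```

This is the one-shot alternative to the iterative pseudo-likelihood fit: no training loop, just a matrix inverse.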

68 of 97

w

x

0

1

2

x'

x' = x@w@wT
loss = (x' - x)²

Autoencoder (analytical solution: principal component analysis, w = PCA(x).components_)

loss

wT

69 of 97

w

x

0

1

2

x'

x' = x@w@wT
loss = (x' - x)²

Autoencoder (analytical solution: principal component analysis, w = PCA(x).components_)

loss

wT

PC1

PC2

0

1

2

70 of 97

No connections (Conservation)

P(x1,x2,x3) ≈ P(x1) * P(x2) * P(x3)

loss

b

x

x'

0

1

2

x' = softmax(b)
loss = -x*log(x')

71 of 97

x' = softmax(x@w + b)
loss = -x*log(x') + λb² + λw²

loss

GREMLIN (Generative REgularized ModeLs of proteINs)

b

x

x'

0

1

2

w

72 of 97

(W)eights of MRF

(w visualized as a heatmap: one block for every pair of positions 0–2)

73 of 97

(W)eights of MRF can be “reduced” to a Contact map

(take the norm of each block of w → one value per pair of positions 0–2)

74 of 97

Low rank correction is often required: decompose w into a sparse matrix plus a low-rank matrix.

75 of 97

∑(i,:) — row sum; ∑(:,j) — column sum; ∑(:,:) — total sum

76 of 97

APC = raw − AP, where AP = ∑(i,:) × ∑(:,j) / ∑(:,:)

Dunn et al. 2008: Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction.
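The average product correction is a one-liner in numpy; the 3×3 raw coupling map here is a made-up toy.

```python
import numpy as np

# Toy raw coupling map (L x L, symmetric): positions 0 and 1 strongly coupled
raw = np.array([[0.0, 1.0, 0.2],
                [1.0, 0.0, 0.3],
                [0.2, 0.3, 0.0]])

# AP(i,j) = row_sum(i) * col_sum(j) / total_sum, then subtract it from raw
ap = raw.sum(axis=1, keepdims=True) * raw.sum(axis=0, keepdims=True) / raw.sum()
apc = raw - ap   # average-product-corrected contact map
```

Positions with uniformly high couplings (often entropy or phylogeny artifacts) get a large correction subtracted, so only pair-specific signal survives.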

77 of 97

Contact map

50S ribosomal protein L6

N

C

C

XRAY Contacts

78 of 97

How to read a contact map

50S ribosomal protein L6

N

C

C

XRAY Contacts

79 of 97

Overlay of predicted contacts on real contacts

N

C

C

50S ribosomal protein L6

XRAY Contacts

Predicted Contacts

80 of 97

LET'S TRY

(MSA: sequences A–E × positions 0–5 → position × position co-evolution matrix)

81 of 97

Tomorrow

82 of 97

V1

V2

V3

V4

W14

X1

X2

X3

X4

MRF/Potts

(markov random fields)

MG

(multivariate gaussian)

AE

(autoencoders)

PSSM

(position-specific scoring matrix)

BERT/Transformer

AlphaFold/RoseTTAFold

83 of 97

All these models can be expressed as “Autoencoders”

PSSM

MRF

MG

AE

BERT

84 of 97

Bonus

85 of 97

Find a reduced-dimensional embedding that preserves only local distances (or neighbors)?

86 of 97

Why might we want to do this?

87 of 97

SM = Gaussian(DM)

Similarity Matrix

Distance Matrix

88 of 97

Similarity Matrix

Data matrix A (samples 1–8 × features x, y, z):

       x     y     z
1    -0.4   1.0   0.3
2    -0.6  -0.6  -0.3
3    -0.2   0.5   0.5
4    -0.7   0.9   0.0
5     0.9  -0.8   0.9
6    -1.0   0.7  -0.7
7    -0.1   0.3   0.3
8     0.8   0.5  -0.6

Distance matrix dm (samples × samples), dmij = √((xi−xj)² + (yi−yj)² + (zi−zj)²):

       1    2    3    4    5    6    7    8
1    0.0  1.7  0.5  0.4  2.3  1.1  0.8  1.5
2    1.7  0.0  1.4  1.6  1.9  1.4  1.2  1.8
3    0.5  1.4  0.0  0.8  1.8  1.3  0.3  1.5
4    0.4  1.6  0.8  0.0  2.5  0.8  1.0  1.7
5    2.3  1.9  1.8  2.5  0.0  2.8  1.6  2.0
6    1.1  1.4  1.3  0.8  2.8  0.0  1.4  1.8
7    0.8  1.2  0.3  1.0  1.6  1.4  0.0  1.3
8    1.5  1.8  1.5  1.7  2.0  1.8  1.3  0.0

Similarity matrix sm (samples × samples), smij = exp(−dmij² / 2σ²):

       1    2    3    4    5    6    7    8
1    1.0  0.0  0.6  0.7  0.0  0.1  0.3  0.0
2    0.0  1.0  0.0  0.0  0.0  0.0  0.1  0.0
3    0.6  0.0  1.0  0.3  0.0  0.0  0.8  0.0
4    0.7  0.0  0.3  1.0  0.0  0.3  0.1  0.0
5    0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
6    0.1  0.0  0.0  0.3  0.0  1.0  0.0  0.0
7    0.3  0.1  0.8  0.1  0.0  0.0  1.0  0.0
8    0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0
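A minimal sketch of the distance → Gaussian similarity step, using the slide's 8×3 data matrix; the bandwidth σ = 0.5 is an assumption that approximately reproduces the slide's similarity values.

```python
import numpy as np

# 8 samples x 3 features (x, y, z) from the slide
A = np.array([[-0.4,  1.0,  0.3],
              [-0.6, -0.6, -0.3],
              [-0.2,  0.5,  0.5],
              [-0.7,  0.9,  0.0],
              [ 0.9, -0.8,  0.9],
              [-1.0,  0.7, -0.7],
              [-0.1,  0.3,  0.3],
              [ 0.8,  0.5, -0.6]])

diff = A[:, None, :] - A[None, :, :]
dm = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise Euclidean distance matrix
sigma = 0.5                                 # assumed Gaussian bandwidth
sm = np.exp(-dm ** 2 / (2 * sigma ** 2))    # Gaussian similarity matrix
print(np.round(sm, 1))
```

A small σ drives all but the closest pairs to ~0, which is exactly the "keep only local neighbors" behavior that TSNE/UMAP build on.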

89 of 97

A

dm

B

dm

A

dm

B

dm

sm

sm

MDS

UMAP/TSNE

A

B

A

PCA

TSNE (t-distributed Stochastic Neighbor Embedding)

UMAP (Uniform Manifold Approximation and Projection)

90 of 97

MDS

UMAP/TSNE

Phylogeny

PCA

91 of 97

Encode integers by which prime-factors they share

92 of 97

8 million integers

78628D to 2D

https://johnhw.github.io/umap_primes/index.md.html

93 of 97

CHECK IT OUT

https://johnhw.github.io/umap_primes/three_umap.html

94 of 97

Cool interactive resources

  1. PCA
     https://setosa.io/ev/principal-component-analysis/
  2. TSNE
     https://distill.pub/2016/misread-tsne/
  3. UMAP
     https://pair-code.github.io/understanding-umap/
     https://johnhw.github.io/umap_primes/index.md.html

95 of 97

w

x

0

1

2

x'

x' = softmax(x@w)@wT
loss = (x' - x)²

Categorical Autoencoder (solution: w = KMeans(x).means_)

loss

wT

softmax(x) = exp(x)/sum(exp(x))

96 of 97

0

1

2

Variational Autoencoder

loss

x

Z

μ

σ

97 of 97

Only connect to previous positions (Autoregressive model)

P(x1,x2,x3) ≈ P(x1) * P(x2|x1) * P(x3|x1,x2)

b

x'

x' = softmax(x@w + b)
loss = -x*log(x')

loss

x

0

1

2

w

Trinquier, J., Uguzzoni, G., Pagnani, A., Zamponi, F. and Weigt, M., 2021. Efficient generative modeling of protein sequences using simple autoregressive models. Nature Communications, 12(1), pp.1-11.
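A minimal sketch of the "only connect to previous positions" idea: a block-triangular mask on w so that position 1 sees only position 0, position 2 sees positions 0 and 1, and position 0 sees nothing but the bias. The random weights and example sequence are assumptions for illustration.

```python
import numpy as np

L, A = 3, 3
# Block mask: input position i feeds output position j only when i < j
pos_mask = np.triu(np.ones((L, L)), k=1)
mask = np.kron(pos_mask, np.ones((A, A)))      # expand to (L*A, L*A)

rng = np.random.default_rng(0)
w = rng.normal(size=(L * A, L * A)) * mask     # masked (autoregressive) weights
b = rng.normal(size=L * A)

x = np.eye(A)[[0, 2, 1]].reshape(1, -1)        # one one-hot encoded sequence
logits = (x @ w + b).reshape(L, A)
p = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
# p[0] depends only on b; p[1] on position 0; p[2] on positions 0 and 1
```

Because the factorization is exact (unlike the pseudo-likelihood), sampling positions left to right from p gives a proper generative model.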