
Fast end-to-end learning

on protein surfaces

(Sverrisson et al., CVPR 2021)

Petr Kouba

CMP Reading Group 2021


Contributions

  • Novel protein representation + novel convolutional layer
  • Tested on two benchmarks on protein interaction prediction
  • Achieves SOTA while being 100-1000× faster than the previous comparable approach


Focus of the talk and relation to computer vision

  • Best paper finalist at CVPR 2021
  • The paper is quite technical
  • Deals with a biological application
  • In general, the technique could be used for learning on surfaces of 3D point clouds


Contents

  1. Introduction to proteins and important problems in structural biology
  2. Protein representations for neural networks
  3. Proposed representation and architecture


Intro to proteins: Protein structure

  • Primary - sequence of amino acids (residues) given by peptide bonds, forming the (poly)peptide
  • Secondary - substructures induced by hydrogen bonds between residues at different places in the sequence
  • Tertiary - overall 3D arrangement of the polypeptide in space
  • Quaternary - association of several protein chains

Fig.1: Protein structure illustration

[lubrizolcdmo.com]

Intro to proteins

  • The 3D structure of a protein is important for its function
  • Important problems in structural biology:
    1. Structure prediction (AlphaFold 2)
    2. Protein design (inverse to 1.)
    3. Prediction of surface interaction
      • Important for drug design

Fig.2: Important problems in structural biology

Protein representations for Neural Networks

  1. Primary structure can be represented as a word over the alphabet of amino acids
  2. Graphs (edges based on spatial proximity or chemical bonds)
  3. Point clouds
  4. Surfaces (meshes)

relevant for the paper

Fig.3: Illustrations of protein representations [Laine et al.]

Surface representation

  • The shape and chemical properties of the surface are important for protein interactions

  • The surface can be approximated by a precomputed mesh - slow

=> Use a point cloud and sample the surface on the fly

Fig. 4: Protein pockets illustration [J. Hebda]

Fast end-to-end learning on protein surfaces

  • Idea:
    • Use point cloud on the input - no precomputed mesh
    • Approximate the surface on the fly
    • Define a convolutional operator on this surface
  • Input formally:
    • {a1,...,aA} ⊂ ℝ³ … atomic positions, A = # atoms in the protein
    • {t1,...,tA} ⊂ {0,1}⁶ … one-hot encoded atomic types from the set [C,H,O,N,S,Se]
  • Surface representation:
    • {x1,...,xN} ⊂ ℝ³ … positions of points sampled from the surface, N = # sampled points
    • {n1,...,nN} ⊂ ℝ³ … unit surface normals at the sampled points
    • {f1,...,fN} ⊂ ℝᵏ … feature vectors, k ∈ {1,...,16}

[Diagram: surface generation → convolutional architecture. Learn the features at the surface points and classify each point as binding or non-binding]

Tasks

  • Binding site identification
    • Classification of the surface of a single protein into interaction and non-interaction sites
    • The binding site is identified without knowledge of the interaction partner
  • Interaction prediction
    • Given two proteins, take two surface patches (one from each protein) and predict whether they are likely to come in contact in a protein complex

[Binding site identification: where does it generally bind? Interaction prediction: what binds to the red part?]


Binding site identification architecture

Interaction prediction architecture

Let’s see details

Surface generation

  • Define a smooth distance function, whose level sets (isosurfaces) approximate distances to the atomic centers

SDF(x) = -σ(x)・log( Σk=1...A exp(-‖x - ak‖ / σk) ) … smooth distance function

σk … atomic radius of atom k
σ(x) … average atomic radius in the neighborhood of x

  • Sampling surface points:
    1. Generate random 3D points from Gaussians centered around the atoms (20 points per atom)
    2. Minimize the loss w.r.t. the coordinates of the sampled points, so that they descend onto the surface

E(x1,...,xQ) = 0.5・Σi=1...Q (SDF(xi) - r)² … loss function, fixed r = 1.05 Å
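The two steps above can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the finite-difference gradient, the learning rate, and using the mean atomic radius for σ(x) are simplifying assumptions.

```python
import numpy as np

def smooth_distance(x, atoms, radii):
    """SDF(x) = -sigma(x) * log(sum_k exp(-||x - a_k|| / sigma_k)).

    Simplification: sigma(x) is taken as the global mean atomic radius
    instead of the neighborhood average used in the paper.
    """
    d = np.linalg.norm(x[:, None, :] - atoms[None, :, :], axis=-1)  # (Q, A)
    sigma_x = radii.mean()
    return -sigma_x * np.log(np.exp(-d / radii[None, :]).sum(axis=1))

def sample_surface(atoms, radii, r=1.05, steps=200, lr=0.05, pts_per_atom=20, seed=0):
    """Sample the level set SDF(x) = r by gradient descent on
    E = 0.5 * sum_i (SDF(x_i) - r)^2, with a numerical SDF gradient."""
    rng = np.random.default_rng(seed)
    # Step 1: random points from Gaussians centered on the atoms
    x = np.repeat(atoms, pts_per_atom, axis=0) \
        + rng.normal(scale=1.0, size=(len(atoms) * pts_per_atom, 3))
    eps = 1e-3
    for _ in range(steps):
        res = smooth_distance(x, atoms, radii) - r        # (Q,)
        grad = np.zeros_like(x)
        for dim in range(3):                              # central differences
            h = np.zeros(3); h[dim] = eps
            grad[:, dim] = (smooth_distance(x + h, atoms, radii)
                            - smooth_distance(x - h, atoms, radii)) / (2 * eps)
        x -= lr * res[:, None] * grad                     # dE/dx_i = (SDF - r) * dSDF/dx_i
    return x
```

For a single atom of radius 1 at the origin, SDF(x) reduces to ‖x‖, so the sampled points should converge onto a sphere of radius 1.05 Å.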

Fig.5: SDF Illustration

Fig.6: Sampling

Fig.7: Descent

Surface generation

  • Sampling surface points (continued):
    3. Remove points trapped inside the protein
    4. Sub-sample using cubic bins of side length 1 Å, keeping 1 point per bin => uniform sampling density
    5. To get the normals {nj}j=1...N, compute normalized gradients of the SDF at the sampled points {xj}j=1...N
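The cubic-bin sub-sampling step can be sketched with NumPy. Assumptions: the points are already cleaned, and keeping the first point seen in each bin is an arbitrary tie-break.

```python
import numpy as np

def subsample_grid(points, bin_size=1.0):
    """Keep one surface point per cubic bin of side bin_size (in Angstrom),
    so the retained points have roughly uniform density."""
    bins = np.floor(points / bin_size).astype(np.int64)    # integer bin index per point
    _, first = np.unique(bins, axis=0, return_index=True)  # first point seen in each bin
    return points[np.sort(first)]
```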

Fig.8: Cleaning

Fig.9: Sub-sampling

Fig.10: Normals


Binding site identification architecture

Interaction prediction architecture

Let’s see details

Obtaining point features

  • 6 chemical features per point
  • 10 geometrical features per point

Fig.11: Feature learning network

Chemical features

  1. Pass the atom types ti through an MLP (keeping the dimensions), for all atoms i ∈ {1,...,A}

  2. For each surface point xj, find its 16 nearest-neighbor atoms, with coordinates aj,k for j=1,...,N, k=1,...,16

  3. Construct the vectors [tj,k, 1/‖xj - aj,k‖] ∈ ℝ⁷, for j=1,...,N, k=1,...,16

  4. Process according to the remaining part of the diagram
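A minimal NumPy sketch of steps 2-3, building the (N, 16, 7) input array fed to the MLPs. The function name is mine, and the brute-force nearest-neighbor search is an assumption for clarity (fine for small N and A).

```python
import numpy as np

def chemical_inputs(surface_pts, atom_pos, atom_types, k=16):
    """For each surface point, gather its k nearest atoms and build the
    7-dimensional inputs [one-hot atom type, 1/distance].

    surface_pts: (N, 3), atom_pos: (A, 3), atom_types: (A, 6) one-hot.
    Returns an (N, k, 7) array.
    """
    d = np.linalg.norm(surface_pts[:, None, :] - atom_pos[None, :, :], axis=-1)  # (N, A)
    nn = np.argsort(d, axis=1)[:, :k]                  # indices of the k nearest atoms
    inv_d = 1.0 / np.take_along_axis(d, nn, axis=1)    # (N, k) inverse distances
    return np.concatenate([atom_types[nn], inv_d[..., None]], axis=-1)
```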

Fig.12: Chemical feature estimation

Geometric features

  • Smooth the vector field {nj}j=1...N using a Gaussian kernel with variance σ²:

ni = Normalize( Σj=1...N nj・exp(-‖xi - xj‖² / σ²) )

  • Compute local curvatures (10 scalar values per point)
    1. Using Gaussian windows with different variances and the vector field {nj}j=1...N
    2. Refer to [Yueqi et al.] for details of the method
    3. Takeaway: there is a closed-form formula (it did not have to be learned)
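The normal-smoothing formula written out with dense NumPy. A sketch only: the dense N×N distance matrix is practical for small N, and σ is a free parameter here.

```python
import numpy as np

def smooth_normals(points, normals, sigma=1.0):
    """Gaussian smoothing of the normal field:
    n_i <- Normalize(sum_j n_j * exp(-||x_i - x_j||^2 / sigma^2))."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (N, N) squared distances
    w = np.exp(-d2 / sigma ** 2)                                   # Gaussian weights
    s = w @ normals                                                # weighted sum of normals
    return s / np.linalg.norm(s, axis=1, keepdims=True)
```

With a large σ, two nearby points with normals [1,0,0] and [0,1,0] both end up with the averaged direction [1,1,0]/√2.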


Binding site identification architecture

Interaction prediction architecture

Let’s see details

Convolutional architecture

  • Estimate a local coordinate system (ni,ui,vi) for each surface point xi
      • Compute mutually orthogonal tangent vectors ui,vi for each surface point xi, defined up to a rotation within the tangent plane
      • Fix the ambiguity by orienting ui along the gradient of a trained potential
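A standard way to build such a tangent basis from a unit normal. This sketch leaves the in-plane rotation arbitrary; the learned-potential orientation step is not reproduced here.

```python
import numpy as np

def tangent_frame(n):
    """Return orthonormal tangent vectors (u, v) for a unit normal n.
    The rotation within the tangent plane is arbitrary in this sketch."""
    # Pick a helper axis that is not parallel to n
    a = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(n, a)
    u /= np.linalg.norm(u)       # first tangent direction
    v = np.cross(n, u)           # second tangent direction, orthogonal to both
    return u, v
```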

[Diagram labels: Gaussian window (explained on the next slide); xj in local coordinates of xi]

Convolutional architecture

  • ‘Quasi-geodesic convolution’
    • Filter acting on local geometric and chemical properties of the surface

fi′ = Σj=1...N Conv(xi, xj, fj) … trainable weight put on the relationship between xi and xj

    • Not influenced by atoms deep inside the protein; brings rotational and translational invariance
    • To define the convolution, first define the ‘quasi-geodesic’ distance dij (along the surface):

dij = ‖xi - xj‖・(2 -〈ni,nj〉)

    • Conv(xi, xj, fj) = w(dij)・Filter(pij) ⊙ fj, where w is a Gaussian window on dij and Filter is an MLP applied element-wise to pij, i.e. xj in the local coordinates of xi
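A toy version of one such convolution step. Assumptions are labeled in the code: Filter(pij) is replaced by a single shared linear map and the window width σ is a free parameter, so this only illustrates the weighting by dij, not the paper's full MLP filter on local coordinates.

```python
import numpy as np

def quasi_geodesic_conv(points, normals, feats, weights, sigma=1.0):
    """One simplified quasi-geodesic convolution step.

    d_ij = ||x_i - x_j|| * (2 - <n_i, n_j>) down-weights points that are
    close in 3D but lie on differently oriented surface patches.
    Simplification: Filter(p_ij) is a fixed linear map `weights` shared
    over all pairs, instead of the paper's MLP on local coordinates.
    """
    diff = points[:, None, :] - points[None, :, :]
    eucl = np.linalg.norm(diff, axis=-1)          # Euclidean distances (N, N)
    cos = normals @ normals.T                     # <n_i, n_j> for unit normals
    d = eucl * (2.0 - cos)                        # quasi-geodesic distance
    w = np.exp(-d ** 2 / (2 * sigma ** 2))        # Gaussian window on d_ij
    return w @ (feats @ weights.T)                # weighted sum of filtered features
```

For two points at distance 1 with parallel normals, dij = 1・(2 - 1) = 1, so the cross-term is weighted by exp(-1/(2σ²)).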

Fig.13: ‘quasi-geodesic’ distance illustration

[Diagram labels: Gaussian window; MLP; xj in local coordinates of xi; element-wise]

Convolutional architecture

  • Convolutional block repeated 1-3×

[Diagram: convolutional block; local coordinates estimation]

Quality of features

  • The chemical feature extractor can be used to regress the Poisson-Boltzmann electrostatic potential (a potential derived within mean-field theory)
  • Chemical/geometric features ablation study:
    • Local curvatures do not significantly improve the performance


Results

  • They achieve comparable or better results than the previous SOTA mesh-based approach (MaSIF [Gainza et al.])
  • 100-1000× faster than MaSIF
  • The convolutional operator is benchmarked against PointNet++ [Qi et al.] and DGCNN [Wang et al.]
    • SOTA convolutional layers for point clouds
    • On the two benchmarks, the quasi-geodesic convolution wins
