
Fast end-to-end learning

on protein surfaces

(Sverrisson et al., CVPR 2021)

Petr Kouba

CMP Reading Group 2021


Contributions

  • Novel protein representation + novel convolutional layer
  • Tested on two benchmarks on protein interaction prediction
  • Achieves SOTA while being 100-1000× faster than the previous comparable approach


Focus of the talk and relation to computer vision

  • Best paper finalist at CVPR 2021
  • The paper is quite technical
  • Deals with a biological application
  • In general, the technique could be used for learning on surfaces of 3D point clouds


Contents

  1. Introduction to proteins and important problems in structural biology
  2. Protein representations for neural networks
  3. Proposed representation and architecture


Intro to proteins: Protein structure

  • Primary - sequence of amino acids (residues) given by peptide bonds, forming the (poly)peptide
  • Secondary - substructures induced by hydrogen bonds between residues at different places in the sequence
  • Tertiary - overall 3D arrangement of the polypeptide in space
  • Quaternary - association of several protein chains

Fig.1: Protein structure illustration

[lubrizolcdmo.com]

Intro to proteins

  • The 3D structure of a protein is important for its function
  • Important problems in structural biology:
    1. Structure prediction (AlphaFold 2)
    2. Protein design (inverse to 1.)
    3. Prediction of surface interaction
      • Important for drug design

Fig.2: Important problems in structural biology

Protein representations for Neural Networks

  1. Primary structure can be represented as a word over the alphabet of amino acids
  2. Graphs (edges based on spatial proximity or chemical bonds)
  3. Point clouds
  4. Surfaces (meshes)

relevant for the paper

Fig.3: Illustrations of protein representations [Laine et al.]

Surface representation

  • The shape and chemical properties of the surface are important for protein interactions

  • The surface can be approximated by a precomputed mesh - slow

=> Use a point cloud and sample the surface on the fly

Fig. 4: Protein pockets illustration [J. Hebda]

Fast end-to-end learning on protein surfaces

  • Idea:
    • Use point cloud on the input - no precomputed mesh
    • Approximate the surface on the fly
    • Define a convolutional operator on this surface
  • Input formally:
    • {a1,...,aA} ⊂ ℝ³ … atomic positions, A = # atoms in the protein
    • {t1,...,tA} ⊂ {0,1}⁶ … one-hot encoded atomic types from the set [C,H,O,N,S,Se]
  • Surface representation:
    • {x1,...,xN} ⊂ ℝ³ … positions of points sampled from the surface, N = # sampled points
    • {n1,...,nN} ⊂ ℝ³ … unit surface normals at the sampled points
    • {f1,...,fN} ⊂ ℝᵏ … feature vectors, k ∈ {1,...,16}

[Diagram: surface generation → convolutional architecture. Learn the features at the surface points and classify each point as binding or non-binding]

Tasks

  • Binding site identification
    • Classification of the surface of a single protein into interaction and non-interaction sites
    • The binding site is identified without knowledge of the interaction partner
  • Interaction prediction
    • Given two proteins, take two surface patches (one from each protein) and predict whether they are likely to come in contact in a protein complex

[Binding site identification: where does it generally bind? Interaction prediction: what binds to the red part?]


Binding site identification architecture

Interaction prediction architecture

Let’s see details

Surface generation

  • Define a smooth distance function, whose level sets (isosurfaces) approximate distances to the atomic centers

SDF(x) = -σ(x)・log( Σk=1...A exp(-‖x - ak‖ / σk) ) … smooth distance function

σk … atomic radius of atom k
σ(x) … average atomic radius in the neighborhood of x

  • Sampling surface points:
    1. Generate random 3D points from Gaussians centered around the atoms (20 points per atom)
    2. Minimize the loss w.r.t. the coordinates of the sampled points, so that they descend onto the surface

E(x1,...,xQ) = 0.5・Σi=1...Q (SDF(xi) - r)² … loss function, fixed r = 1.05 Å
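The two steps above can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the finite-difference gradient, the learning rate, and using the mean atomic radius for σ(x) are simplifying assumptions.

```python
import numpy as np

def smooth_distance(x, atoms, radii):
    """SDF(x) = -sigma(x) * log(sum_k exp(-||x - a_k|| / sigma_k)).

    Simplification: sigma(x) is taken as the global mean atomic radius
    instead of the neighborhood average used in the paper.
    """
    d = np.linalg.norm(x[:, None, :] - atoms[None, :, :], axis=-1)  # (Q, A)
    sigma_x = radii.mean()
    return -sigma_x * np.log(np.exp(-d / radii[None, :]).sum(axis=1))

def sample_surface(atoms, radii, r=1.05, steps=200, lr=0.05, pts_per_atom=20, seed=0):
    """Sample the level set SDF(x) = r by gradient descent on
    E = 0.5 * sum_i (SDF(x_i) - r)^2, with a numerical SDF gradient."""
    rng = np.random.default_rng(seed)
    # Step 1: random points from Gaussians centered on the atoms
    x = np.repeat(atoms, pts_per_atom, axis=0) \
        + rng.normal(scale=1.0, size=(len(atoms) * pts_per_atom, 3))
    eps = 1e-3
    for _ in range(steps):
        res = smooth_distance(x, atoms, radii) - r        # (Q,)
        grad = np.zeros_like(x)
        for dim in range(3):                              # central differences
            h = np.zeros(3); h[dim] = eps
            grad[:, dim] = (smooth_distance(x + h, atoms, radii)
                            - smooth_distance(x - h, atoms, radii)) / (2 * eps)
        x -= lr * res[:, None] * grad                     # dE/dx_i = (SDF - r) * dSDF/dx_i
    return x
```

For a single atom of radius 1 at the origin, SDF(x) reduces to ‖x‖, so the sampled points should converge onto a sphere of radius 1.05 Å.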

Fig.5: SDF Illustration

Fig.6: Sampling

Fig.7: Descent

Surface generation

  • Sampling surface points (continued):
    3. Remove points trapped inside the protein
    4. Sub-sample using cubic bins of side length 1 Å, keeping 1 point per bin => uniform sampling density
    5. To get the normals {nj}j=1...N, compute normalized gradients of the SDF at the sampled points {xj}j=1...N
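The cubic-bin sub-sampling step can be sketched with NumPy. Assumptions: the points are already cleaned, and keeping the first point seen in each bin is an arbitrary tie-break.

```python
import numpy as np

def subsample_grid(points, bin_size=1.0):
    """Keep one surface point per cubic bin of side bin_size (in Angstrom),
    so the retained points have roughly uniform density."""
    bins = np.floor(points / bin_size).astype(np.int64)    # integer bin index per point
    _, first = np.unique(bins, axis=0, return_index=True)  # first point seen in each bin
    return points[np.sort(first)]
```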

Fig.8: Cleaning

Fig.9: Sub-sampling

Fig.10: Normals


Binding site identification architecture

Interaction prediction architecture

Let’s see details

Obtaining point features

  • 6 chemical features per point
  • 10 geometrical features per point

Fig.11: Feature learning network

Chemical features

  1. Pass the atom types ti through an MLP (keeping the dimensions), for all atoms i ∈ {1,...,A}

  2. For each surface point xj, find its 16 nearest-neighbor atoms, with coordinates aj,k for j=1,...,N, k=1,...,16

  3. Construct the vectors [tj,k, 1/‖xj - aj,k‖] ∈ ℝ⁷, for j=1,...,N, k=1,...,16

  4. Process according to the remaining part of the diagram
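A minimal NumPy sketch of steps 2-3, building the (N, 16, 7) input array fed to the MLPs. The function name is mine, and the brute-force nearest-neighbor search is an assumption for clarity (fine for small N and A).

```python
import numpy as np

def chemical_inputs(surface_pts, atom_pos, atom_types, k=16):
    """For each surface point, gather its k nearest atoms and build the
    7-dimensional inputs [one-hot atom type, 1/distance].

    surface_pts: (N, 3), atom_pos: (A, 3), atom_types: (A, 6) one-hot.
    Returns an (N, k, 7) array.
    """
    d = np.linalg.norm(surface_pts[:, None, :] - atom_pos[None, :, :], axis=-1)  # (N, A)
    nn = np.argsort(d, axis=1)[:, :k]                  # indices of the k nearest atoms
    inv_d = 1.0 / np.take_along_axis(d, nn, axis=1)    # (N, k) inverse distances
    return np.concatenate([atom_types[nn], inv_d[..., None]], axis=-1)
```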

Fig.12: Chemical feature estimation

Geometric features

  • Smooth the vector field {nj}j=1...N using a Gaussian kernel with variance σ²:

ni = Normalize( Σj=1...N nj・exp(-‖xi - xj‖² / σ²) )

  • Compute local curvatures (10 scalar values per point)
    1. Using Gaussian windows with different variances and the vector field {nj}j=1...N
    2. Refer to [Yueqi et al.] for details of the method
    3. Takeaway: there is a closed-form formula (it did not have to be learned)
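The normal-smoothing formula written out with dense NumPy. A sketch only: the dense N×N distance matrix is practical for small N, and σ is a free parameter here.

```python
import numpy as np

def smooth_normals(points, normals, sigma=1.0):
    """Gaussian smoothing of the normal field:
    n_i <- Normalize(sum_j n_j * exp(-||x_i - x_j||^2 / sigma^2))."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (N, N) squared distances
    w = np.exp(-d2 / sigma ** 2)                                   # Gaussian weights
    s = w @ normals                                                # weighted sum of normals
    return s / np.linalg.norm(s, axis=1, keepdims=True)
```

With a large σ, two nearby points with normals [1,0,0] and [0,1,0] both end up with the averaged direction [1,1,0]/√2.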


Binding site identification architecture

Interaction prediction architecture

Let’s see details

Convolutional architecture

  • Estimate a local coordinate system (ni,ui,vi) for each surface point xi
      • Compute mutually orthogonal tangent vectors ui,vi for each surface point xi, defined up to a rotation within the tangent plane
      • Fix the ambiguity by orienting ui along the gradient of a trained potential
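A standard way to build such a tangent basis from a unit normal. This sketch leaves the in-plane rotation arbitrary; the learned-potential orientation step is not reproduced here.

```python
import numpy as np

def tangent_frame(n):
    """Return orthonormal tangent vectors (u, v) for a unit normal n.
    The rotation within the tangent plane is arbitrary in this sketch."""
    # Pick a helper axis that is not parallel to n
    a = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(n, a)
    u /= np.linalg.norm(u)       # first tangent direction
    v = np.cross(n, u)           # second tangent direction, orthogonal to both
    return u, v
```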

[Diagram labels: Gaussian window (explained on the next slide); xj in local coordinates of xi]

Convolutional architecture

  • ‘Quasi-geodesic convolution’
    • Filter acting on local geometric and chemical properties of the surface

fi′ = Σj=1...N Conv(xi, xj, fj) … trainable weight put on the relationship between xi and xj

    • Not influenced by atoms deep inside the protein; brings rotational and translational invariance
    • To define the convolution, first define the ‘quasi-geodesic’ distance dij (along the surface):

dij = ‖xi - xj‖・(2 -〈ni,nj〉)

    • Conv(xi, xj, fj) = w(dij)・Filter(pij) ⊙ fj, where w is a Gaussian window on dij and Filter is an MLP applied element-wise to pij, i.e. xj in the local coordinates of xi
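A toy version of one such convolution step. Assumptions are labeled in the code: Filter(pij) is replaced by a single shared linear map and the window width σ is a free parameter, so this only illustrates the weighting by dij, not the paper's full MLP filter on local coordinates.

```python
import numpy as np

def quasi_geodesic_conv(points, normals, feats, weights, sigma=1.0):
    """One simplified quasi-geodesic convolution step.

    d_ij = ||x_i - x_j|| * (2 - <n_i, n_j>) down-weights points that are
    close in 3D but lie on differently oriented surface patches.
    Simplification: Filter(p_ij) is a fixed linear map `weights` shared
    over all pairs, instead of the paper's MLP on local coordinates.
    """
    diff = points[:, None, :] - points[None, :, :]
    eucl = np.linalg.norm(diff, axis=-1)          # Euclidean distances (N, N)
    cos = normals @ normals.T                     # <n_i, n_j> for unit normals
    d = eucl * (2.0 - cos)                        # quasi-geodesic distance
    w = np.exp(-d ** 2 / (2 * sigma ** 2))        # Gaussian window on d_ij
    return w @ (feats @ weights.T)                # weighted sum of filtered features
```

For two points at distance 1 with parallel normals, dij = 1・(2 - 1) = 1, so the cross-term is weighted by exp(-1/(2σ²)).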

Fig.13: ‘quasi-geodesic’ distance illustration

[Diagram labels: Gaussian window; MLP; xj in local coordinates of xi; element-wise]

Convolutional architecture

  • Convolutional block repeated 1-3×

[Diagram: convolutional block; local coordinates estimation]

Quality of features

  • The chemical feature extractor can be used to regress the Poisson-Boltzmann electrostatic potential (a potential derived within mean-field theory)
  • Chemical/geometric features ablation study:
    • Local curvatures do not significantly improve the performance


Results

  • They achieve comparable or better results than the previous SOTA mesh-based approach (MaSIF [Gainza et al.])
  • 100-1000× faster than MaSIF
  • The convolutional operator is benchmarked against PointNet++ [Qi et al.] and DGCNN [Wang et al.]
    • SOTA convolutional layers for point clouds
    • On the two benchmarks, the quasi-geodesic convolution wins
