1 of 25

Point Transformer

πŸ†: Eric Xie

πŸ§‘πŸ»β€βš–οΈ: John Zoscak

πŸ‘©πŸΌβ€πŸš€: Siddharth Lakkoju

πŸ§‘πŸ½β€πŸ’»: Michael Fatemi

2 of 25

πŸ†: Overview

  • Problem Paradigm:
    • 3D point clouds are unordered, irregularly scattered sets of points, so classical convolutions (which assume a regular grid) cannot be applied directly
    • Once we have a 3D point cloud, what is the best way to accomplish downstream tasks?

3 of 25

πŸ†: Background

  • Projection-based networks
    • Project our points into various image planes before using standard models for analysis
    • Geometric information is collapsed, and the choice of projection plane can cause complications
  • Voxel-based networks
    • Huge computation costs from cubic growth
    • Still losing geometric information from quantization
  • Point-based networks
    • We could also tune networks to work specifically with point clouds
    • Examples: connecting points into a graph, using permutation-invariant operators, or designing convolutions specific to 3D point sets
    • These work fairly well: no quantization, no loss of information
    • We can do better with Transformers

4 of 25

πŸ†: Rationale: Why Transformers?

  • Self-attention is very suitable for point clouds
    • Already functions as a set operator
    • Positional information is treated as an attribute of elements to be processed as part of the set
  • Past works using attention mechanisms for point cloud analysis have a few issues
    • Global attention: quadratic cost in the number of points, impractical for large scenes
    • Scalar dot-product attention: a single weight per point pair, shared across all feature channels
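The scalar-vs-vector distinction can be sketched in a few lines of NumPy (an illustration only, not the paper's exact layer: the MLP Ξ³ and the positional encoding Ξ΄ are omitted). Scalar dot-product attention produces one weight per (query, key) pair; vector attention produces a separate weight per channel.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scalar_attention(q, k, v):
    # one scalar weight per (query, key) pair, shared by every feature channel
    w = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (n, n)
    return w @ v                                           # (n, c)

def vector_attention(q, k, v):
    # one weight per (query, key, channel): attention modulates channels separately
    w = softmax(q[:, None, :] - k[None, :, :], axis=1)     # (n, n, c)
    return (w * v[None, :, :]).sum(axis=1)                 # (n, c)

n, c = 5, 4
q, k, v = np.random.default_rng(0).normal(size=(3, n, c))
print(scalar_attention(q, k, v).shape, vector_attention(q, k, v).shape)  # (5, 4) (5, 4)
```

The subtraction relation q βˆ’ k used here is the one the paper favors for point clouds.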

5 of 25

πŸ†: Methodology: Point Transformer Architecture

6 of 25

πŸ†: Results: Semantic Segmentation

  • Highest overall pointwise accuracy (OA), mean of classwise accuracy (mAcc), and mean classwise intersection over union (mIoU)

7 of 25

πŸ†: Results: Object Part Segmentation

  • Best performance in instance mIoU
    • Did not use loss-balancing during training, which can boost category mIoU

8 of 25

πŸ†: Results: Shape Classification

  • Highest performance on shape classification
  • Significantly outperformed projection and voxel-based methods
  • Slightly outperformed other point-based methods

9 of 25

πŸ†: Results: Ablation Studies

  • Small neighborhoods might not provide enough context to the model
  • Large neighborhoods might provide excessive noise to the self-attention layers

10 of 25

πŸ†: Results: Ablation Studies

  • Importance of relative positional encoding in both the attention generation and feature transformation branches
  • Importance of vector attention over scalar/no attention

11 of 25

πŸ§‘πŸ»β€βš–οΈ: Paper Summary from Critic

You are a critic. Your goal is to showcase weaknesses of the paper. Address the following questions – add a slide for each bullet point. You should be fair, even if negative. Not all the parts of the paper need to have weaknesses; e.g. a paper might have a great positioning in related work or great motivation but weaknesses in the method.

  • What is this paper about and what problem does it tackle? Why is the problem important?
  • What is your critique of the paper?
    • Is it the motivation (see intro section)
    • Is it the positioning among prior work (see related work section)
    • Is it the approach (see method section)
  • Are the experiments sufficient? (see experiments section)
  • What are the limitations?

12 of 25

πŸ§‘πŸ»β€βš–οΈ: Background

  • This paper applies concepts of self-attention from NLP, machine translation, and 2D image recognition to 3D vision
  • Casts transformer and attention techniques to 3D via vector attention and positional encoding
  • Uses a Point Transformer block that integrates self-attention with linear projections on feature vectors and their positions; this acts as the network's primary feature aggregator
  • Applies the point transformer to two network architectures for semantic segmentation and object classification

13 of 25

πŸ§‘πŸ»β€βš–οΈ: Background

  • What is this research trying to do?
    • Better feature understanding and spatial reasoning when processing 3D data; injecting attention/transformers into a cohesive architecture for better whole-scene understanding
    • Background paper: β€œAttention is all you need”
  • In 2D vision, ViTs (Vision Transformers) showed promise; this work continues that line for 3D point clouds
  • Self-attention is a set operator, allowing scene understanding despite order invariance

14 of 25

πŸ§‘πŸ»β€βš–οΈ: Potential Limitations

  • Domain-/object-specific performance in semantic segmentation
    • Underperforms other methods on classes such as chairs and windows

15 of 25

πŸ§‘πŸ»β€βš–οΈ: Potential Limitations

  • Computational Intensity
    • The model uses local neighborhoods (k-nearest neighbors) for self-attention, which should give roughly linear memory and compute scaling
    • The model accepts input clouds of arbitrary size
    • Yet measured inference time grows faster than linearly with the number of input points
    • No systematic study of inference time vs. input cloud size is provided

Input Points         10k      20k      40k      80k
Inference Time       44ms     86ms     222ms    719ms
Memory Consumption   1702M    2064M    2800M    4266M
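A quick sanity check on the figures reported on this slide: fitting a local scaling exponent e (time ∝ n^e) between consecutive doublings shows growth that is superlinear and worsening, though polynomial rather than literally exponential over this range.

```python
import math

points  = [10_000, 20_000, 40_000, 80_000]   # input cloud sizes from this slide
time_ms = [44, 86, 222, 719]                 # reported inference times

pairs = list(zip(points, time_ms))
for (n1, t1), (n2, t2) in zip(pairs, pairs[1:]):
    # solve t2/t1 = (n2/n1)**e for the local exponent e
    e = math.log(t2 / t1) / math.log(n2 / n1)
    print(f"{n1 // 1000}k -> {n2 // 1000}k points: time ~ n^{e:.2f}")
```

The exponent rises from roughly n^1.0 to roughly n^1.7 across the measured range, which supports the scaling concern without implying exponential blow-up.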

16 of 25

πŸ§‘πŸ»β€βš–οΈ: Potential Limitations

  • No ablation studies to verify order invariance
  • Untested Experiments:
    • Densities/sparsity of input clouds
    • Relative scaling to less granular scans of larger spaces (how do we know the Point Transformer works well at scene scales not seen in the training data?)
    • Primarily tested on clean, ground-truth data; does the model hold up on noisier detections?
    • Example: radar data (useful in low-visibility environments) is much sparser and has coarser granularity

17 of 25

πŸ§‘πŸ»β€βš–οΈ: Scalability

  • Possible Hyperparameter Dependence on Scene Density and Size:
    • Hyperparameters (like neighborhood size k=16) affect segmentation, yet the intuition behind their choice is not discussed
      • Segmentation from different kNN initialization (random / uniform initialization?)
    • Purely local spatial self-attention vs. some form of global attention (floors / ceilings / walls shared across rooms?)

[Figure: S3DIS (Point Transformer training data) vs. Matterport data]

18 of 25

πŸ‘©πŸΌβ€πŸš€: Potential Research Directions

You are a pioneer. Your goal is to think how the paper being discussed could be used to accelerate other findings, help in other disciplines (e.g. robotics, science), and be combined with other techniques you have seen to create a novel result worthy of a solid publication.

  • Think of two or three novel applications of the work and present them
  • Tell us how you would go about pursuing these ideas to showcase their efficacy

19 of 25

πŸ‘©πŸΌβ€πŸš€: Potential Research Directions

Key Strengths

  • Strong generalization ability (versatile architecture)
    • Segmentation, classification, part segmentation, etc.
  • SOTA 3D point cloud performance
  • Efficiency and scalability
    • Use of local attention (as opposed to global) allows for processing large-scale 3D scenes with millions of points

20 of 25

πŸ‘©πŸΌβ€πŸš€: Medical Imaging

  • Medical imaging (CT, MRI, etc.) generates 3D data representable as point clouds.
  • 3D organ segmentation, tumor detection, medical diagnosis, etc.
    • Oncology
    • Neurosurgery
  • Train Point Transformers on open-source medical datasets for organ and tumor segmentation.

21 of 25

πŸ‘©πŸΌβ€πŸš€: AV Scene Understanding

  • AV point clouds are large-scale
  • AV in unstructured environments without other localization modalities (GPS)
  • Smaller scale AV (Drones, Disaster relief robots, etc.)
    • Enabled by efficient model
  • Datasets: SemanticKITTI or custom drone-generated point clouds.

22 of 25

πŸ§‘πŸ½β€πŸ’»: Paper Summary from Entrepreneur

You are an entrepreneur. This means you are constantly on the lookout for cool new ideas and to build new products (which will hopefully make profit!). Your goal is to think how the paper being discussed could be used to build a new product – remember the product does not have to be β€œnovel” but it should have high chances of working well and robustly.

  • Think of one or two products derived from the work
  • Tell us how you would go about building a demo to showcase each idea – this is the demo you would show for your seed or round A funding.

23 of 25

New Product Idea: Warehouse Robot

  • Warehouses and factories may require sorting a large number of scattered objects belonging to different classes.
  • Currently, human hands are highly effective at manipulating objects, but such dexterity is very hard to capture in hand-written policies
  • A Point Transformer could identify β€œaffordance” locations: optimal gripping locations for a robotic arm
    • On a mug, selects the handle
    • On a box, selects parallel planes
    • On a coil of rope, selects opposite points inside the coil
    • … etc …
  • Treat as classification problem: {finger 1, finger 2, no contact}

β€œopen the drawer”

24 of 25

New Product Idea: Warehouse Robot

Specific use cases:

  • Assembling kits
  • Packing boxes
  • Grasping deformable objects, such as fruit

Existing company: Covariant

  • Co-founded by Pieter Abbeel
  • $222m valuation

25 of 25

Other Potential Target Markets

  • Autonomous vehicles
    • Classify road signs, curbs, etc. from point clouds
  • Batch asset collection for simulators and video games
    • Segment objects of different types for β€œbatch” scans
  • Monitoring and acting in foggy, dark, snowy environments