1 of 25

Point Transformer

πŸ†: Eric Xie

πŸ§‘πŸ»β€βš–οΈ: John Zoscak

πŸ‘©πŸΌβ€πŸš€: Siddharth Lakkoju

πŸ§‘πŸ½β€πŸ’»: Michael Fatemi

2 of 25

πŸ†: Overview

  • Problem Paradigm:
    • 3D point clouds are unordered, irregularly scattered sets of points, so classical convolutions (which assume a regular grid) cannot be applied directly
    • Once we have a 3D point cloud, what is the best way to accomplish downstream tasks?

3 of 25

πŸ†: Background

  • Projection-based networks
    • Project our points into various image planes before using standard models for analysis
    • Geometric information is collapsed, and the choice of projection plane can cause complications
  • Voxel-based networks
    • Huge computation costs from cubic growth
    • Still losing geometric information from quantization
  • Point-based networks
    • We could also tune networks to work specifically with point clouds
    • Examples: connecting points into a graph, using permutation-invariant operators, or designing convolutions specific to 3D point sets
    • These work fairly well: no quantization, no loss of information
    • We can do better with Transformers

4 of 25

πŸ†: Rationale: Why Transformers?

  • Self-attention is very suitable for point clouds
    • Already functions as a set operator
    • Positional information is treated as an attribute of elements to be processed as part of the set
  • Past works using attention mechanisms for point cloud analysis have a few issues
    • Global attention: quadratic cost in the number of points, impractical for large scenes
    • Scalar dot-product attention: a single weight per point pair, shared across all feature channels
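The scalar-vs-vector distinction can be sketched in a few lines of NumPy (an illustration only, not the paper's exact layer: the MLP Ξ³ and the positional encoding Ξ΄ are omitted). Scalar dot-product attention produces one weight per (query, key) pair; vector attention produces a separate weight per channel.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scalar_attention(q, k, v):
    # one scalar weight per (query, key) pair, shared by every feature channel
    w = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (n, n)
    return w @ v                                           # (n, c)

def vector_attention(q, k, v):
    # one weight per (query, key, channel): attention modulates channels separately
    w = softmax(q[:, None, :] - k[None, :, :], axis=1)     # (n, n, c)
    return (w * v[None, :, :]).sum(axis=1)                 # (n, c)

n, c = 5, 4
q, k, v = np.random.default_rng(0).normal(size=(3, n, c))
print(scalar_attention(q, k, v).shape, vector_attention(q, k, v).shape)  # (5, 4) (5, 4)
```

The subtraction relation q βˆ’ k used here is the one the paper favors for point clouds.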

5 of 25

πŸ†: Methodology: Point Transformer Architecture

6 of 25

πŸ†: Results: Semantic Segmentation

  • Highest overall pointwise accuracy (OA), mean of classwise accuracy (mAcc), and mean classwise intersection over union (mIoU)

7 of 25

πŸ†: Results: Object Part Segmentation

  • Best performance in instance mIoU
    • Did not use loss-balancing during training, which can boost category mIoU

8 of 25

πŸ†: Results: Shape Classification

  • Highest performance on shape classification
  • Significantly outperformed projection and voxel-based methods
  • Slightly outperformed other point-based methods

9 of 25

πŸ†: Results: Ablation Studies

  • Small neighborhoods might not provide enough context to the model
  • Large neighborhoods might provide excessive noise to the self-attention layers

10 of 25

πŸ†: Results: Ablation Studies

  • Importance of relative positional encoding in both the attention generation and feature transformation branches
  • Importance of vector attention over scalar/no attention

11 of 25

πŸ§‘πŸ»β€βš–οΈ: Paper Summary from Critic

You are a critic. Your goal is to showcase weaknesses of the paper. Address the following questions – add a slide for each bullet point. You should be fair, even if negative. Not all the parts of the paper need to have weaknesses; e.g. a paper might have a great positioning in related work or great motivation but weaknesses in the method.

  • What is this paper about and what problem does it tackle? Why is the problem important?
  • What is your critique of the paper?
    • Is it the motivation (see intro section)
    • Is it the positioning among prior work (see related work section)
    • Is it the approach (see method section)
  • Are the experiments sufficient? (see experiments section)
  • What are the limitations?

12 of 25

πŸ§‘πŸ»β€βš–οΈ: Background

  • This paper applies concepts of self-attention from NLP, machine translation, and 2D image recognition to 3D vision
  • Casts transformer and attention techniques to 3D via vector attention and positional encoding
  • Uses a Point Transformer block that integrates self-attention with linear projections on feature vectors and their positions; this acts as the network's primary feature aggregator
  • Applies the point transformer to two network architectures for semantic segmentation and object classification

13 of 25

πŸ§‘πŸ»β€βš–οΈ: Background

  • What is this research trying to do?
    • Better feature understanding and spatial reasoning when processing 3D data; injecting attention/transformers into a cohesive architecture for better whole-scene understanding
    • Background paper: β€œAttention is all you need”
  • In 2D vision, ViTs (Vision Transformers) showed promise; this work continues that line for 3D point clouds
  • Self-attention is a set operator, allowing scene understanding despite order invariance

14 of 25

πŸ§‘πŸ»β€βš–οΈ: Potential Limitations

  • Domain-/object-specific performance in semantic segmentation
    • Underperforms other methods on classes such as chairs and windows

15 of 25

πŸ§‘πŸ»β€βš–οΈ: Potential Limitations

  • Computational Intensity
    • The model uses local neighborhoods (k-nearest neighbors) for self-attention, which should give roughly linear memory and compute scaling
    • The model accepts input clouds of arbitrary size
    • Yet measured inference time grows faster than linearly with the number of input points
    • No systematic study of inference time vs. input cloud size is provided

Input Points         10k      20k      40k      80k
Inference Time       44ms     86ms     222ms    719ms
Memory Consumption   1702M    2064M    2800M    4266M
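A quick sanity check on the figures reported on this slide: fitting a local scaling exponent e (time ∝ n^e) between consecutive doublings shows growth that is superlinear and worsening, though polynomial rather than literally exponential over this range.

```python
import math

points  = [10_000, 20_000, 40_000, 80_000]   # input cloud sizes from this slide
time_ms = [44, 86, 222, 719]                 # reported inference times

pairs = list(zip(points, time_ms))
for (n1, t1), (n2, t2) in zip(pairs, pairs[1:]):
    # solve t2/t1 = (n2/n1)**e for the local exponent e
    e = math.log(t2 / t1) / math.log(n2 / n1)
    print(f"{n1 // 1000}k -> {n2 // 1000}k points: time ~ n^{e:.2f}")
```

The exponent rises from roughly n^1.0 to roughly n^1.7 across the measured range, which supports the scaling concern without implying exponential blow-up.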

16 of 25

πŸ§‘πŸ»β€βš–οΈ: Potential Limitations

  • No ablation studies to verify order invariance
  • Untested Experiments:
    • Densities/sparsity of input clouds
    • Relative scaling to less granular scans of larger spaces (how do we know the Point Transformer works well at scene scales not seen in the training data?)
    • Primarily tested on clean, ground-truth data; does the model hold up on noisier detections?
    • Example: radar data (useful in low-visibility environments) is much sparser and has coarser granularity

17 of 25

πŸ§‘πŸ»β€βš–οΈ: Scalability

  • Possible Hyperparameter Dependence on Scene Density and Size:
    • Hyperparameters (like neighborhood size k=16) affect segmentation, yet the intuition behind their choice is not discussed
      • Segmentation from different kNN initialization (random / uniform initialization?)
    • Purely local spatial self-attention vs. some form of global attention (floors / ceilings / walls shared across rooms?)

[Figure: S3DIS (Point Transformer training data) vs. Matterport data]

18 of 25

πŸ‘©πŸΌβ€πŸš€: Potential Research Directions

You are a pioneer. Your goal is to think how the paper being discussed could be used to accelerate other findings, help in other disciplines (e.g. robotics, science), and be combined with other techniques you have seen to create a novel result worthy of a solid publication.

  • Think of two or three novel applications of the work and present them
  • Tell us how you would go about pursuing these ideas to showcase their efficacy

19 of 25

πŸ‘©πŸΌβ€πŸš€: Potential Research Directions

Key Strengths

  • Strong generalization ability (versatile architecture)
    • Segmentation, classification, part segmentation, etc.
  • SOTA 3D point cloud performance
  • Efficiency and scalability
    • Use of local attention (as opposed to global) allows for processing large-scale 3D scenes with millions of points

20 of 25

πŸ‘©πŸΌβ€πŸš€: Medical Imaging

  • Medical imaging (CT, MRI, etc.) generates 3D data representable as point clouds.
  • 3D organ segmentation, tumor detection, medical diagnosis, etc.
    • Oncology
    • Neurosurgery
  • Train Point Transformers on open-source medical datasets for organ and tumor segmentation.

21 of 25

πŸ‘©πŸΌβ€πŸš€: AV Scene Understanding

  • AV point clouds are large-scale
  • AV in unstructured environments without other localization modalities (GPS)
  • Smaller scale AV (Drones, Disaster relief robots, etc.)
    • Enabled by efficient model
  • Datasets: SemanticKITTI or custom drone-generated point clouds.

22 of 25

πŸ§‘πŸ½β€πŸ’»: Paper Summary from Entrepreneur

You are an entrepreneur. This means you are constantly on the lookout for cool new ideas and to build new products (which will hopefully make profit!). Your goal is to think how the paper being discussed could be used to build a new product – remember the product does not have to be β€œnovel” but it should have high chances of working well and robustly.

  • Think of one or two products derived from the work
  • Tell us how you would go about building a demo to showcase each idea – this is the demo you would show for your seed or round A funding.

23 of 25

New Product Idea: Warehouse Robot

  • Warehouses and factories may require sorting a large number of scattered objects belonging to different classes.
  • Currently, human hands are highly effective at manipulating objects, but such dexterity is very hard to capture in hand-written policies
  • A Point Transformer could identify β€œaffordance” locations: optimal gripping locations for a robotic arm
    • On a mug, selects the handle
    • On a box, selects parallel planes
    • On a coil of rope, selects opposite points inside the coil
    • … etc …
  • Treat as classification problem: {finger 1, finger 2, no contact}

β€œopen the drawer”

24 of 25

New Product Idea: Warehouse Robot

Specific use cases:

  • Assembling kits
  • Packing boxes
  • Grasping deformable objects, such as fruit

Existing company: Covariant

  • Co-founded by Pieter Abbeel
  • $222m valuation

25 of 25

Other Potential Target Markets

  • Autonomous vehicles
    • Classify road signs, curbs, etc. from point clouds
  • Batch asset collection for simulators and video games
    • Segment objects of different types for β€œbatch” scans
  • Monitoring and acting in foggy, dark, snowy environments