1 of 38

Bridging the gap to real-world object-centric learning

CVPR’2024 Object-Centric Representation For Computer Vision Tutorial

Tianjun Xiao

Senior Applied Scientist, AWS AI

© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.

2 of 38

Catalog

Retrospect on the motivation of object-centric learning

The gap to real-world vision data

Bridging the gap by reconstructing beyond RGB pixel space

Bridging the gap by upgrading encoder

Bridging the gap by upgrading decoder

Brief introduction to the object-centric learning framework

3 of 38

Retrospect on the motivation of object-centric learning

Structured visual representation

  • Compositionality, systematic generalization, causal representation learning

Self-supervised learning

  • Scalability, learning by watching video

End to end differentiable architecture

  • Versatile, downstream friendly

“On the Binding Problem in Artificial Neural Networks”, Greff et al., 2020

5 of 38

The gap to real-world vision data: slot attention

Encoder: CNN

Learning objective: reconstructing RGB pixels

Decoder: CNN mixture decoder, slots are decoded separately and merged at the final output

“Object-Centric Learning with Slot Attention”, Locatello and Kipf et al., NeurIPS 2020
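
The "decoded separately and merged at the final output" step can be sketched as follows. This is a minimal NumPy illustration, assuming each slot has already been decoded into an RGB image plus one mask-logit channel; the function name `composite_slots` and the random inputs are hypothetical stand-ins, not code from the paper.

```python
import numpy as np

def composite_slots(rgb, alpha_logits):
    """Merge per-slot decodings into one image.

    rgb:          (K, H, W, 3) per-slot RGB predictions
    alpha_logits: (K, H, W, 1) per-slot unnormalized mask logits
    """
    # Softmax over the slot axis so the masks sum to 1 at every pixel.
    shifted = alpha_logits - alpha_logits.max(axis=0, keepdims=True)
    masks = np.exp(shifted)
    masks = masks / masks.sum(axis=0, keepdims=True)
    # Mask-weighted sum of the per-slot images gives the final output.
    image = (masks * rgb).sum(axis=0)
    return image, masks

# Random stand-ins for the outputs of a per-slot CNN decoder (assumption).
rng = np.random.default_rng(0)
K, H, W = 4, 8, 8
rgb = rng.uniform(size=(K, H, W, 3))
alpha_logits = rng.normal(size=(K, H, W, 1))

image, masks = composite_slots(rgb, alpha_logits)

# RGB reconstruction objective: mean squared error against the input image.
target = rng.uniform(size=(H, W, 3))
loss = np.mean((image - target) ** 2)
```

The per-pixel masks double as the predicted object segmentation, which is why a reconstruction-only objective can yield object discovery.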

6 of 38

The gap to real-world vision data

Applicable datasets: CLEVR, Multi-dSprites, Tetrominoes

“Object-Centric Learning with Slot Attention”, Locatello and Kipf et al., NeurIPS 2020

7 of 38

The gap to real-world vision data

“Bridging the Gap to Real-World Object-Centric Learning”, Seitzer et al., ICLR 2023

“Object Scene Representation Transformer”, Sajjadi et al., NeurIPS 2022

“Object-Centric Slot Diffusion”, Jiang et al., NeurIPS 2023

9 of 38

Bridging the gap by reconstructing beyond RGB pixel

  • RGB pixel-level reconstruction for complex scenes is too challenging for the mixture decoder.

  • Instead, we can reconstruct signals that:
    • Represent objectness
    • Live in a less heterogeneous space
    • Can be obtained in a self-supervised or otherwise scalable way

10 of 38

Bridging the gap by reconstructing optical flow

Encoder: CNN

Learning objective: reconstructing optical flow

Decoder: CNN mixture decoder, slots are decoded separately and merged at the final output

“Conditional Object-Centric Learning from Video”, Kipf et al., ICLR 2022

11 of 38

Bridging the gap by reconstructing optical flow: SAVi

Applicable datasets: MOVi, MOVi++

  • Objects are scanned from the real world.

  • Scenes are synthesized using a physics engine.

  • Video, optical flow, depth, and segmentation masks are generated at the same time.

“Kubric: a scalable dataset generator”, Greff et al., 2022

12 of 38

Bridging the gap by reconstructing optical flow

Encoder: CNN (flow as input)

Learning objective: reconstructing optical flow

Decoder: CNN mixture decoder

Applicable datasets: DAVIS2016, SegTrackv2, FBMS59, MoCA

“Self-supervised Video Object Segmentation by Motion Grouping”, Yang et al., ICCV 2021

13 of 38

Bridging the gap by reconstructing depth

Encoder: CNN

Learning objective: reconstructing depth from LiDAR

Decoder: CNN mixture decoder

Applicable datasets: Waymo Open dataset

“SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos”, Elsayed et al., NeurIPS 2022
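
Depth projected from LiDAR is sparse, so a depth reconstruction loss is usually evaluated only where a measurement exists. The sketch below illustrates that idea with a masked mean squared error. The function name, the plain L2 form, and the toy arrays are assumptions for illustration; the exact target transform and loss used in the paper may differ.

```python
import numpy as np

def masked_depth_loss(pred_depth, lidar_depth, valid):
    """Mean squared error over pixels that have a LiDAR return.

    pred_depth, lidar_depth: (H, W) arrays
    valid:                   (H, W) boolean mask, True where LiDAR depth exists
    """
    valid = valid.astype(bool)
    if not valid.any():
        # No supervision available in this frame.
        return 0.0
    diff = pred_depth[valid] - lidar_depth[valid]
    return float(np.mean(diff ** 2))

# Toy example: a sparse grid of valid LiDAR pixels (hypothetical data).
rng = np.random.default_rng(0)
lidar = rng.uniform(1.0, 50.0, size=(4, 6))
valid = np.zeros((4, 6), dtype=bool)
valid[::2, ::3] = True
pred = lidar + 0.5  # prediction that is off by a constant 0.5 everywhere

loss = masked_depth_loss(pred, lidar, valid)
```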

15 of 38

Bridging the gap by upgrading encoder

  • Upgrading is not as simple as swapping in a larger encoder.
    • The deeper and larger the encoder, the larger the gap between the final features on which slot attention operates and the raw RGB signal.

  • Do we also have to use a larger decoder to match the encoder?
    • A decoder is harder to design if we target realistic-looking images or videos.

  • Reconstruct in the hidden feature space!

16 of 38

Bridging the gap by upgrading encoder: DINOSAUR

Encoder: ViT, ResNet

Learning objective: reconstructing DINO or MAE features

Decoder: MLP Mixture Decoder or Transformer Decoder

“Bridging the Gap to Real-World Object-Centric Learning”, Seitzer et al., ICLR 2023
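
A rough sketch of feature-space reconstruction with an MLP mixture decoder is given below. It assumes the common design in which every slot is broadcast to all patch positions, a shared MLP predicts a feature vector plus one mask logit per patch, and the masks are normalized across slots. All names, sizes, and the random weights are hypothetical; `target_feats` stands in for features from a frozen self-supervised encoder.

```python
import numpy as np

def mlp_mixture_decode(slots, pos_emb, w1, b1, w2, b2):
    """Decode slots into patch features.

    slots:   (K, D_slot) slot vectors
    pos_emb: (N, D_slot) one positional embedding per patch
    Returns reconstructed features (N, D_feat) and masks (K, N).
    """
    # Broadcast each slot to every patch position.
    x = slots[:, None, :] + pos_emb[None, :, :]        # (K, N, D_slot)
    h = np.maximum(x @ w1 + b1, 0.0)                   # shared two-layer MLP
    out = h @ w2 + b2                                  # (K, N, D_feat + 1)
    feats, logits = out[..., :-1], out[..., -1]
    # Softmax over slots: each patch is explained by a mixture of slots.
    logits = logits - logits.max(axis=0, keepdims=True)
    masks = np.exp(logits)
    masks = masks / masks.sum(axis=0, keepdims=True)
    recon = (masks[..., None] * feats).sum(axis=0)     # (N, D_feat)
    return recon, masks

rng = np.random.default_rng(0)
K, N, D_SLOT, HIDDEN, D_FEAT = 5, 16, 32, 64, 48
slots = rng.normal(size=(K, D_SLOT))
pos_emb = rng.normal(size=(N, D_SLOT))
w1 = rng.normal(size=(D_SLOT, HIDDEN)) * 0.1
b1 = np.zeros(HIDDEN)
w2 = rng.normal(size=(HIDDEN, D_FEAT + 1)) * 0.1
b2 = np.zeros(D_FEAT + 1)

# Stand-in for frozen encoder features (assumption); no gradient flows here.
target_feats = rng.normal(size=(N, D_FEAT))

recon, masks = mlp_mixture_decode(slots, pos_emb, w1, b1, w2, b2)
feature_loss = np.mean((recon - target_feats) ** 2)
```

Because the target is a patch-level feature map rather than pixels, the decoder can stay small even when the encoder is large.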

17 of 38

Bridging the gap by upgrading encoder: DINOSAUR

Applicable datasets: MOVi++, PASCAL VOC 2012, COCO

“Bridging the Gap to Real-World Object-Centric Learning”, Seitzer et al., ICLR 2023

18 of 38

Bridging the gap by upgrading encoder: DINOSAUR

Metrics: FG-ARI, mBO, CorLoc, mIoU

“Bridging the Gap to Real-World Object-Centric Learning”, Seitzer et al., ICLR 2023
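
Two of the metrics above can be sketched in a few lines of NumPy. FG-ARI is the adjusted Rand index computed on foreground pixels only, and mBO averages, over ground-truth objects, the best IoU achieved by any predicted mask. The function names, the convention that label 0 is background, and the toy segmentations are assumptions for illustration; benchmark implementations may handle edge cases differently.

```python
import numpy as np

def adjusted_rand_index(true_ids, pred_ids):
    """Adjusted Rand index between two flat integer label arrays."""
    _, t = np.unique(true_ids, return_inverse=True)
    _, p = np.unique(pred_ids, return_inverse=True)
    # Contingency table: cont[i, j] = pixels with true label i and predicted label j.
    cont = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(cont, (t, p), 1)
    comb2 = lambda x: x * (x - 1) / 2.0
    sum_ij = comb2(cont).sum()
    sum_a = comb2(cont.sum(axis=1)).sum()
    sum_b = comb2(cont.sum(axis=0)).sum()
    total = comb2(cont.sum())
    if total == 0:
        return 1.0
    expected = sum_a * sum_b / total
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0
    return float((sum_ij - expected) / (max_index - expected))

def fg_ari(true_seg, pred_seg, background_id=0):
    """ARI restricted to pixels whose ground-truth label is not background."""
    true_flat, pred_flat = true_seg.ravel(), pred_seg.ravel()
    fg = true_flat != background_id
    return adjusted_rand_index(true_flat[fg], pred_flat[fg])

def mean_best_overlap(true_seg, pred_seg, background_id=0):
    """Average over ground-truth objects of the best IoU with any predicted mask."""
    best = []
    for obj in np.unique(true_seg):
        if obj == background_id:
            continue
        gt = true_seg == obj
        ious = [(gt & (pred_seg == k)).sum() / (gt | (pred_seg == k)).sum()
                for k in np.unique(pred_seg)]
        best.append(max(ious))
    return float(np.mean(best))

# Toy 4x4 scene with two objects; slot ids in `pred` are arbitrary.
true = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [2, 2, 0, 0],
                 [2, 2, 0, 0]])
pred = np.array([[5, 5, 7, 7],
                 [5, 5, 7, 7],
                 [9, 9, 5, 5],
                 [9, 9, 5, 5]])
# A degenerate prediction in which one slot swallows both objects.
merged = np.where(true > 0, 7, 5)

perfect_scores = (fg_ari(true, pred), mean_best_overlap(true, pred))
merged_scores = (fg_ari(true, merged), mean_best_overlap(true, merged))
```

Both metrics are invariant to how slots are numbered, which matters because slot order is arbitrary. FG-ARI ignores background pixels entirely, so it does not penalize masks that leak into the background, whereas mBO does.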

19 of 38

Bridging the gap by upgrading encoder: Adaptive SA

Encoder: ViT, ResNet

Learning objective: reconstructing DINO or MAE features

Decoder: MLP Mixture Decoder or Transformer Decoder

“Adaptive Slot Attention: Object Discovery with Dynamic Slot Number”, Fan et al., CVPR 2024

20 of 38

Bridging the gap by upgrading encoder

  • Not just another unsupervised object segmentation model

  • Structured scene representation aiming for systematic/compositional generalization

“To Understand Language is to Understand Generalization”, Eric Jang

22 of 38

Slot-decoding dilemma

  • To make a slot focus on the pixel area corresponding to an object, a weak decoder is usually required so that object concepts emerge in the slots.
    • A weaker decoder prevents one slot from modeling multiple objects, which introduces a statistical regularity.

  • A weaker decoder can also blur the details of the generated image.

  • Solutions:
    • Introduce more 3D inductive bias to keep the decoder moderately sized while still providing high-quality image generation
    • Decouple the decoding process

“Illiterate DALL-E Learns to Compose”, Singh et al., ICLR 2022

23 of 38

Upgrading decoder utilizing 3D inductive bias: ObSuRF

Encoder: CNN

Learning objective: Novel view synthesis

Decoder: NeRF decoder

Applicable datasets: Multiview CLEVR

“Decomposing 3D Scenes into Objects via Unsupervised Volume Segmentation”, Stelzner et al., 2021

24 of 38

Upgrading decoder utilizing 3D inductive bias: OSRT

Encoder: CNN + Transformer (SRT)

Learning objective: Novel view synthesis

Decoder: Slot Mixer

Applicable datasets: CLEVR-3D, MultiShapeNet

“Object Scene Representation Transformer”, Sajjadi et al., NeurIPS 2022

25 of 38

Upgrading decoder by decoupling decoding: SLATE/STEVE

Encoder: VQ encoder

Learning objective: Reconstructing VQ-Code

Decoder: Autoregressive transformer

Applicable datasets: MOVi++, Traffic and Aquarium

“Illiterate DALL-E Learns to Compose”, Singh et al., ICLR 2022

“Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos”, Singh et al., NeurIPS 2022
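
Reconstructing VQ codes turns the objective into classification over a discrete codebook. The sketch below shows the per-token cross-entropy that such a setup would minimize, assuming the image has already been tokenized into integer code indices and that a slot-conditioned autoregressive transformer has produced one logit vector per token. The function name, sizes, and random inputs are hypothetical.

```python
import numpy as np

def token_cross_entropy(logits, codes):
    """Average cross-entropy between predicted logits and target VQ codes.

    logits: (T, V) decoder outputs, one row per image token
    codes:  (T,)   integer codebook indices from the VQ encoder
    """
    # Numerically stable log-softmax over the vocabulary axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to the correct code at each position.
    return float(-log_probs[np.arange(codes.shape[0]), codes].mean())

rng = np.random.default_rng(0)
T, V = 16, 32                      # tokens per image, codebook size (assumed)
codes = rng.integers(0, V, size=T) # stand-in for VQ encoder output
logits = rng.normal(size=(T, V))   # stand-in for transformer output

loss = token_cross_entropy(logits, codes)
# With uniform predictions the loss equals log(V), a useful sanity check.
uniform_loss = token_cross_entropy(np.zeros((T, V)), codes)
```

In this decoupled design the slots only need to predict tokens; turning tokens back into pixels is left to the separately trained VQ decoder.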

26 of 38

Upgrading decoder by decoupling decoding: LSD

Encoder: Transformer

Learning objective: predicting diffusion noise

Decoder: pre-trained Stable Diffusion

Encoder: pre-trained auto-encoder (AE)

Learning objective: predicting diffusion noise

Decoder: pre-trained Stable Diffusion model

Applicable datasets: MOVi++, FFHQ

“Object-Centric Slot Diffusion”, Jiang et al., NeurIPS 2023
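
The noise-prediction objective can be written in the standard form used for conditional latent diffusion models. The notation is an assumption for illustration and may not match the paper: $z_0$ is the image latent from the pre-trained auto-encoder, $S$ is the set of slots used as conditioning, $\bar{\alpha}_t$ is the cumulative noise schedule, and $\epsilon_\theta$ is the denoising network.

```latex
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L}(\theta)
  = \mathbb{E}_{z_0,\, \epsilon,\, t}
    \left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, S) \right\rVert_2^2 \right]
```

The slots receive gradients only through the conditioning path of $\epsilon_\theta$, so the heavy image generation work stays in the diffusion decoder.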

27 of 38

Upgrading decoder by decoupling decoding: LSD

Slot-based Image Editing

“Object-Centric Slot Diffusion”, Jiang et al., NeurIPS 2023

28 of 38

Upgrading decoder utilizing pre-trained decoder: DORSAL

Encoder: CNN + Transformer

Learning objective: predicting diffusion noise

Decoder: Multiview U-Net

Applicable datasets: MSN, Street View

“DORSAL: Diffusion for Object-Centric Representations of Scenes”, Jabri et al., ICLR 2024

29 of 38

Upgrading decoder utilizing pre-trained decoder: DORSAL

“DORSAL: Diffusion for Object-Centric Representations of Scenes”, Jabri et al., ICLR 2024

Slot-based Image Editing

30 of 38

Other related important works

Disentangled representation

Co-train with image-text pairs

Videos

Downstream tasks

31 of 38

Co-train with image-text pairs

“Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision”, Xu et al., CVPR 2023

32 of 38

Disentangled representation

“Neural Systematic Binder”, Singh et al., ICLR 2023

More Fine-grained Slot-based Image Editing

35 of 38

The object-centric learning framework (OCLF)

36 of 38

The object-centric learning framework (OCLF)

Encoder: CNN, Transformer

Decoder: Mixture decoder, Autoregressive transformer

Learning objectives: Reconstruction on RGB, DINO, VQ, Optical flow

Tricks: Slot initialization, latent duplicate suppression, etc.

37 of 38

The object-centric learning framework (OCLF)

Model Zoo: Slot attention, SAVi, SLATE, DINOSAUR, OC-MOT, Adaptive SA

38 of 38

Thank You!
