1 of 38

Bridging the gap to real-world object-centric learning

CVPR’2024 Object-Centric Representation For Computer Vision Tutorial

Tianjun Xiao

Senior Applied Scientist, AWS AI

CVPR’2024 OBJECT-CENTRIC REPRESENTATION FOR COMPUTER VISION TUTORIAL

2 of 38

Catalog

Retrospect on the motivation of object-centric learning

The gap to real-world vision data

Bridging the gap by reconstructing beyond RGB pixel space

Bridging the gap by upgrading encoder

Bridging the gap by upgrading decoder

Brief introduction to the object-centric learning framework

3 of 38

Retrospect on the motivation of object-centric learning

Structured visual representation

Compositionality, systematic generalization, causal representation learning

Self-supervised learning

Scalability, learning by watching video

End to end differentiable architecture

Versatile, downstream friendly

“On the Binding Problem in Artificial Neural Networks”, Greff et al., 2020

4 of 38

Catalog

Retrospect on the motivation of object-centric learning

The gap to real-world vision data

Bridging the gap by reconstructing beyond RGB pixel space

Bridging the gap by upgrading encoder

Bridging the gap by upgrading decoder

Brief introduction to the object-centric learning framework

5 of 38

The gap to real-world vision data: slot attention

Encoder: CNN

Learning objective: reconstructing RGB pixels

Decoder: CNN mixture decoder, slots are decoded separately and merged at the final output

“Object-Centric Learning with Slot Attention”, Locatello and Kipf et al., NeurIPS 2020

The Slot Attention model, published at NeurIPS 2020, is a representative early work in object-centric learning. Slot attention is an iterative, competitive attention mechanism that can be viewed as a differentiable interface between a distributed representation and a set representation, whose elements we call slots.
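
To make the competition concrete, here is a minimal sketch of the iterative update in plain Python. It is a simplification under stated assumptions: the learned query/key/value projections, layer normalization, GRU update, and residual MLP of the published model are all omitted, and the function names are illustrative only. What it keeps is the core idea: the softmax runs over slots, so slots compete for each input feature, and each slot is then updated to the weighted mean of the inputs it won.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def slot_attention_step(slots, inputs, eps=1e-8):
    """One simplified iteration: K slots compete for N input features."""
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    # attn[n][k]: softmax over SLOTS for each input, which creates competition.
    attn = [softmax([scale * dot(x, s) for s in slots]) for x in inputs]
    new_slots = []
    for k in range(len(slots)):
        # Normalize over inputs so each slot becomes a weighted mean.
        weights = [attn[n][k] + eps for n in range(len(inputs))]
        total = sum(weights)
        new_slots.append([
            sum(w * x[d] for w, x in zip(weights, inputs)) / total
            for d in range(dim)
        ])
    return new_slots, attn


def slot_attention(inputs, num_slots, num_iters=3, seed=0):
    """Randomly initialize slots, then refine them for a few iterations."""
    rng = random.Random(seed)
    dim = len(inputs[0])
    slots = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_slots)]
    attn = None
    for _ in range(num_iters):
        slots, attn = slot_attention_step(slots, inputs)
    return slots, attn
```

Calling `slot_attention(features, num_slots=2)` on a list of feature vectors returns the refined slots and the last attention map, whose rows each sum to one across slots.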

Three components support the algorithm: the encoder, the learning objective, and the decoder.

In the original Slot Attention paper, the encoder is a small CNN. The learning objective is to reconstruct the image's RGB pixels from the slots.

The decoder first broadcasts each slot into a spatial tensor, then uses deconvolution (transposed convolution) operations to upsample it to image resolution. Along with the RGB pixels, it also produces an alpha channel. Each slot is decoded separately, and the slots use their alpha channels to compete to explain the image.
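
The merging step of the mixture decoder can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: the per-slot RGB values and alpha logits are taken as given (in the real model a CNN produces them), images are flattened to a list of pixels, and the names `composite` and `reconstruction_loss` are hypothetical. The alpha logits are normalized with a softmax across slots at each pixel, and the image is the alpha-weighted sum of the per-slot RGB predictions.

```python
import math


def composite(slot_rgb, slot_alpha_logits):
    """Merge per-slot decodings into one image.

    slot_rgb[k][p] is an (r, g, b) tuple for slot k at pixel p.
    slot_alpha_logits[k][p] is the unnormalized alpha for slot k at pixel p.
    Returns (image, masks), where masks[k][p] sums to 1 over k.
    """
    num_slots = len(slot_rgb)
    num_pixels = len(slot_rgb[0])
    masks = [[0.0] * num_pixels for _ in range(num_slots)]
    image = []
    for p in range(num_pixels):
        logits = [slot_alpha_logits[k][p] for k in range(num_slots)]
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        pixel = [0.0, 0.0, 0.0]
        for k in range(num_slots):
            # Softmax over slots: slots compete to explain this pixel.
            masks[k][p] = exps[k] / total
            for c in range(3):
                pixel[c] += masks[k][p] * slot_rgb[k][p][c]
        image.append(tuple(pixel))
    return image, masks


def reconstruction_loss(image, target):
    """Mean squared error between the composited image and the target."""
    count = 3 * len(image)
    return sum(
        (a - b) ** 2
        for pred_px, tgt_px in zip(image, target)
        for a, b in zip(pred_px, tgt_px)
    ) / count
```

The returned masks are what is usually visualized as the per-slot segmentation.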

6 of 38

The gap to real-world vision data

Applicable datasets: CLEVR, Multi-dSprites, Tetrominoes

“Object-Centric Learning with Slot Attention”, Locatello and Kipf et al., NeurIPS 2020

7 of 38

The gap to real-world vision data

“BRIDGING THE GAP TO REAL-WORLD OBJECT-CENTRIC LEARNING”, Seitzer et al., ICLR 2023

“Object Scene Representation Transformer”, Sajjadi et al., NeurIPS 2022

“Object-Centric Slot Diffusion”, Jiang et al., NeurIPS 2023

8 of 38

Catalog

Retrospect on the motivation of object-centric learning

The gap to real-world vision data

Bridging the gap by reconstructing beyond RGB pixel space

Bridging the gap by upgrading encoder

Bridging the gap by upgrading decoder

Brief introduction to the object-centric learning framework

9 of 38

Bridging the gap by reconstructing beyond RGB pixel space

RGB pixel-level reconstruction for complex scenes is too challenging for the mixture decoder.

Instead, we can reconstruct signals that:

Represent objectness
Live in a less heterogeneous space
Can be obtained in a self-supervised or otherwise scalable way

10 of 38

Bridging the gap by reconstructing optical flow

Encoder: CNN

Learning objective: reconstructing optical flow

Decoder: CNN mixture decoder, slots are decoded separately and merged at the final output

“Conditional Object-Centric Learning from Video”, Kipf et al., ICLR 2022

11 of 38

Bridging the gap by reconstructing optical flow: SAVi

Applicable datasets: MOVi, MOVi++

Objects scanned from the real world.

Synthesized using a physics engine.

Generates video, optical flow, depth, and segmentation masks at the same time.

“Kubric: a scalable dataset generator”, Greff et al., 2022

12 of 38

Bridging the gap by reconstructing optical flow

Encoder: CNN (flow as input)

Learning objective: reconstructing optical flow

Decoder: CNN mixture decoder

Applicable datasets: DAVIS2016, SegTrackv2, FBMS59, MoCA

“Self-supervised Video Object Segmentation by Motion Grouping”, Yang et al., ICCV 2021

13 of 38

Bridging the gap by reconstructing depth

Encoder: CNN

Learning objective: reconstructing depth from LiDAR

Decoder: CNN mixture decoder

Applicable datasets: Waymo Open dataset

“SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos”, Elsayed et al., NeurIPS 2022

14 of 38

Catalog

Retrospect on the motivation of object-centric learning

The gap to real-world vision data

Bridging the gap by reconstructing beyond RGB pixel space

Bridging the gap by upgrading encoder

Bridging the gap by upgrading decoder

Brief introduction to the object-centric learning framework

15 of 38

Bridging the gap by upgrading encoder

Upgrading is not as simple as swapping in a new encoder.

The deeper and larger the encoder, the larger the gap between the final features that slot attention operates on and the raw RGB signal.

Do we also have to use a larger decoder to match the encoder?

The decoder is harder to design if we are targeting realistic-looking images or videos.

Reconstruct in the hidden feature space!

16 of 38

Bridging the gap by upgrading encoder: DINOSAUR

Encoder: ViT, ResNet

Learning objective: reconstructing DINO features or MAE features

Decoder: MLP Mixture Decoder or Transformer Decoder

“BRIDGING THE GAP TO REAL-WORLD OBJECT-CENTRIC LEARNING”, Seitzer et al., ICLR 2023
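
As a rough sketch of the feature-reconstruction objective with the MLP mixture decoder, the loss can be written as below. The notation is an assumption introduced for illustration: x is the image, h is the matrix of N patch features (dimension D) from a frozen self-supervised network such as DINO, and for each of the K slots the decoder outputs a feature reconstruction and an alpha map over patches.

```latex
% Notation assumed for illustration; not copied from the slides.
% \hat{y}_k, \alpha_k: per-slot feature reconstruction and alpha map.
\begin{aligned}
h &= \operatorname{stopgrad}\big(f_{\text{target}}(x)\big) \in \mathbb{R}^{N \times D}, \\
m_k &= \frac{\exp(\alpha_k)}{\sum_{j=1}^{K} \exp(\alpha_j)}, \qquad
y = \sum_{k=1}^{K} m_k \odot \hat{y}_k, \\
\mathcal{L}_{\text{rec}} &= \lVert y - h \rVert_2^2 .
\end{aligned}
```

The structure matches the RGB mixture decoder, except that the target is a feature map rather than pixels, so the decoder can stay small.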

17 of 38

Bridging the gap by upgrading encoder: DINOSAUR

Applicable datasets: MOVi++, PASCAL VOC 2012, COCO

“BRIDGING THE GAP TO REAL-WORLD OBJECT-CENTRIC LEARNING”, Seitzer et al., ICLR 2023

18 of 38

Bridging the gap by upgrading encoder: DINOSAUR

Metrics: FG-ARI, mBO, CorLoc, mIoU

“BRIDGING THE GAP TO REAL-WORLD OBJECT-CENTRIC LEARNING”, Seitzer et al., ICLR 2023
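
To give a feel for one of these metrics, here is a minimal sketch of mean best overlap (mBO) for a single image. It assumes masks are represented as sets of pixel indices, and the function names are hypothetical. For each ground-truth mask it takes the predicted mask with the highest IoU and averages those best scores. Benchmark implementations may differ in details such as how scores are averaged across a dataset.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of pixel indices."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def mean_best_overlap(gt_masks, pred_masks):
    """Average, over ground-truth masks, of the best IoU with any prediction."""
    if not gt_masks:
        return 0.0
    best_scores = [
        max((iou(gt, pred) for pred in pred_masks), default=0.0)
        for gt in gt_masks
    ]
    return sum(best_scores) / len(best_scores)
```

Unlike FG-ARI, this score penalizes predicted masks that cover too much or too little of an object, since it is based on IoU.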

19 of 38

Bridging the gap by upgrading encoder: Adaptive SA

Encoder: ViT, ResNet

Learning objective: reconstructing DINO features or MAE features

Decoder: MLP Mixture Decoder or Transformer Decoder

“Adaptive Slot Attention: Object Discovery with Dynamic Slot Number”, Fan et al., CVPR 2024

20 of 38

Bridging the gap by upgrading encoder

Not just another unsupervised object segmentation model

Structured scene representation aiming for systematic/compositional generalization

“To Understand Language is to Understand Generalization”, Eric Jang

OK, at this point people may ask: is an object-centric model just another unsupervised detection/segmentation model?

Well, we want to emphasize that our goal is not pixel-level segmentation accuracy, but a good structured scene representation, where each object has its own token and representation.

This structured scene representation is aimed at systematic, compositional generalization.

To illustrate compositional generalization, we borrow one slide from Eric Jang's blog. There are different types of generalization:

Systematicity: recombine constituents that have not been seen together during training.

Productivity: test sequences are longer than the ones seen during training. For vision, we'll see that when we construct a scene with more objects than seen in training, we can still get a valid representation.

Substitutivity: the meaning of an expression is unchanged if a component is replaced with something of the same meaning. For vision, it means that if we replace an object in a scene, we still get a valid representation.

Finally, localism: the meaning of local parts is unchanged by the global context.

In the next section, on the decoder, we'll start to see how object-centric representation makes a difference.

And in later talks, we'll also see how it works well on downstream tasks.

21 of 38

Catalog

Retrospect on the motivation of object-centric learning

The gap to real-world vision data

Bridging the gap by reconstructing beyond RGB pixel space

Bridging the gap by upgrading encoder

Bridging the gap by upgrading decoder

Brief introduction to the object-centric learning framework

22 of 38

Slot-decoding dilemma

To make a slot focus on the pixel area corresponding to an object, a weak decoder is usually required so that object concepts emerge in the slots.

A weaker decoder prevents one slot from modeling multiple objects and introduces a statistical regularity.

A weaker decoder can also blur the details of the generated image.

Solutions:

Introduce more 3D inductive bias to keep the decoder at a moderate size while still providing high-quality image generation
Decouple the decoding process

“ILLITERATE DALL-E LEARNS TO COMPOSE”, Singh et al., ICLR 2022

23 of 38

Upgrading decoder utilizing 3D inductive bias: ObSuRF

Encoder: CNN

Learning objective: Novel view synthesis

Decoder: NeRF decoder

Applicable datasets: Multiview CLEVR

“Decomposing 3D Scenes into Objects via Unsupervised Volume Segmentation”, Stelzner et al., 2021

24 of 38

Upgrading decoder utilizing 3D inductive bias: OSRT

Encoder: CNN + Transformer (SRT)

Learning objective: Novel view synthesis

Decoder: Slot Mixer

Applicable datasets: CLEVR-3D, MultiShapeNet

“Object Scene Representation Transformer”, Sajjadi et al., NeurIPS 2022

25 of 38

Upgrading decoder by decoupling decoding: SLATE/STEVE

Encoder: VQ encoder

Learning objective: Reconstructing VQ-Code

Decoder: Autoregressive transformer

Applicable datasets: MOVi++, Traffic and Aquarium

“ILLITERATE DALL-E LEARNS TO COMPOSE”, Singh et al., ICLR 2022

“Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos”, Singh et al., NeurIPS 2022
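
As a hedged sketch of what "reconstructing VQ-Code" with an autoregressive transformer means, the objective can be written as a next-token cross-entropy conditioned on the slots. The symbols are assumptions introduced here: z_1, ..., z_N are the discrete codes of the image patches from the VQ encoder, S is the set of slots, and p_θ is the transformer decoder's predicted distribution over the codebook.

```latex
% Notation assumed for illustration; not copied from the slides.
\mathcal{L}_{\text{AR}} = -\sum_{i=1}^{N} \log p_\theta\big(z_i \mid z_{<i},\, S\big)
```

Because the decoder sees all slots at once instead of decoding each slot separately, it can be much more expressive than a mixture decoder while the slots remain the only path for image-specific information.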

26 of 38

Upgrading decoder by decoupling decoding: LSD

Encoder: Transformer

Learning objective: predict diffusion noise

Decoder: pre-trained stable diffusion

Encoder: pre-trained auto-encoder (AE)

Learning objective: predicting diffusion noise

Decoder: pre-trained Stable Diffusion model

Applicable datasets: MOVi++, FFHQ

“Object-Centric Slot Diffusion”, Jiang et al., NeurIPS 2023
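
The "predicting diffusion noise" objective can be sketched with the standard denoising formulation, conditioned on the slots. The symbols are assumptions for illustration: z_0 is the latent of the image from the pre-trained auto-encoder, \bar{\alpha}_t is the cumulative noise schedule at step t, S is the set of slots, and ε_θ is the denoising network that receives the slots as conditioning.

```latex
% Notation assumed for illustration; not copied from the slides.
\begin{aligned}
z_t &= \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I), \\
\mathcal{L}_{\text{diff}} &= \mathbb{E}_{z_0,\, \epsilon,\, t}\,
\big\lVert \epsilon - \epsilon_\theta(z_t, t, S) \big\rVert_2^2 .
\end{aligned}
```

The slots are trained only through this conditioning path, which is what decouples the slot encoder from the heavy generative decoder.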

27 of 38

Upgrading decoder by decoupling decoding: LSD

Slot-based Image Editing

“Object-Centric Slot Diffusion”, Jiang et al., NeurIPS 2023

28 of 38

Upgrading decoder utilizing pre-trained decoder: DORSAL

Encoder: CNN + Transformer

Learning objective: predicting diffusion noise

Decoder: Multiview U-Net

Applicable datasets: MSN, Street View

“DORSAL: DIFFUSION FOR OBJECT-CENTRIC REPRESENTATIONS OF SCENES”, Jabri et al., ICLR 2024

29 of 38

Upgrading decoder utilizing pre-trained decoder: DORSAL

“DORSAL: DIFFUSION FOR OBJECT-CENTRIC REPRESENTATIONS OF SCENES”, Jabri et al., ICLR 2024

Slot-based Image Editing

30 of 38

Other related important works

Disentangled representation

Co-train with image-text pairs

Videos

Downstream tasks

31 of 38

Co-train with image-text pairs

“Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision”, Xu et al., CVPR 2023

32 of 38

Disentangled representation

“NEURAL SYSTEMATIC BINDER”, Singh et al., ICLR 2023

More Fine-grained Slot-based Image Editing

33 of 38

Other related important works

Disentangled representation

Co-train with image-text pairs

Videos

Downstream tasks

34 of 38

Catalog

Retrospect on the motivation of object-centric learning

The gap to real-world vision data

Bridging the gap by reconstructing beyond RGB pixel space

Bridging the gap by upgrading encoder

Bridging the gap by upgrading decoder

Brief introduction to the object-centric learning framework

35 of 38

The object-centric learning framework (OCLF)

https://github.com/amazon-science/object-centric-learning-framework/tree/main/ocl

36 of 38

The object-centric learning framework (OCLF)

Encoder: CNN, Transformer

Decoder: Mixture decoder, Autoregressive transformer

Learning objectives: Reconstruction on RGB, DINO, VQ, Optical flow

Tricks: Slot initialization, Latent duplicate suppression, etc.

https://github.com/amazon-science/object-centric-learning-framework/tree/main/ocl

37 of 38

The object-centric learning framework (OCLF)

Model Zoo: Slot attention, SAVi, SLATE, DINOSAUR, OC-MOT, Adaptive SA

https://github.com/amazon-science/object-centric-learning-framework/tree/main/ocl

38 of 38

Thank You!
