Bridging the gap to real-world object-centric learning
CVPR’2024 Object-Centric Representation For Computer Vision Tutorial
Tianjun Xiao
Senior Applied Scientist, AWS AI
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
CVPR’2024 OBJECT-CENTRIC REPRESENTATION FOR COMPUTER VISION TUTORIAL
Catalog
Retrospect on the motivation of object-centric learning
The gap to real-world vision data
Bridging the gap by reconstructing beyond RGB pixel space
Bridging the gap by upgrading encoder
Bridging the gap by upgrading decoder
Brief introduction to the object-centric learning framework
Retrospect on the motivation of object-centric learning
Structured visual representation
Self-supervised learning
End to end differentiable architecture
“On the Binding Problem in Artificial Neural Networks”, Greff et al., 2020
The gap to real-world vision data: slot attention
Encoder: CNN
Learning objective: reconstructing RGB pixels
Decoder: CNN mixture decoder, slots are decoded separately and merged at the final output
“Object-Centric Learning with Slot Attention”, Locatello and Kipf et al., NeurIPS 2020
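The mixture decoding described on this slide (each slot decoded separately, then merged at the output) can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: `decode_slot` is a hypothetical stand-in for the CNN decoder (here a single linear map), and all shapes and names are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_slot(slot, weights, height, width):
    """Hypothetical per-slot decoder: a linear map standing in for the CNN.

    Returns an RGB image (H, W, 3) and unnormalized mask logits (H, W, 1).
    """
    out = (slot @ weights).reshape(height, width, 4)
    return out[..., :3], out[..., 3:]

def mixture_decode(slots, weights, height, width):
    """Decode every slot on its own, then merge with softmax-normalized masks."""
    rgbs, logits = zip(*(decode_slot(s, weights, height, width) for s in slots))
    rgbs = np.stack(rgbs)              # (K, H, W, 3)
    logits = np.stack(logits)          # (K, H, W, 1)
    masks = softmax(logits, axis=0)    # slots compete for each pixel
    recon = (masks * rgbs).sum(axis=0) # (H, W, 3) merged reconstruction
    return recon, masks

rng = np.random.default_rng(0)
K, D, H, W = 4, 16, 8, 8               # assumed sizes for the example
slots = rng.normal(size=(K, D))
weights = rng.normal(size=(D, H * W * 4)) * 0.1
recon, masks = mixture_decode(slots, weights, H, W)

# RGB reconstruction objective against a placeholder target image.
target = rng.uniform(size=(H, W, 3))
loss = np.mean((recon - target) ** 2)
```

The per-pixel masks produced by the softmax are what is usually read out as the unsupervised segmentation.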
The gap to real-world vision data
Applicable datasets: CLEVR, Multi-dSprites, Tetrominoes
“Object-Centric Learning with Slot Attention”, Locatello and Kipf et al., NeurIPS 2020
The gap to real-world vision data
“Bridging the Gap to Real-World Object-Centric Learning”, Seitzer et al., ICLR 2023
“Object Scene Representation Transformer”, Sajjadi et al., NeurIPS 2022
“Object-Centric Slot Diffusion”, Jiang et al., NeurIPS 2023
Bridging the gap by reconstructing beyond RGB pixel space
Bridging the gap by reconstructing optical flow
Encoder: CNN
Learning objective: reconstructing optical flow
Decoder: CNN mixture decoder, slots are decoded separately and merged at the final output
“Conditional Object-Centric Learning from Video”, Kipf et al., ICLR 2022
Bridging the gap by reconstructing optical flow: SAVi
Applicable datasets: MOVi, MOVi++
“Kubric: a scalable dataset generator”, Greff et al., 2022
Bridging the gap by reconstructing optical flow
Encoder: CNN (flow as input)
Learning objective: reconstructing optical flow
Decoder: CNN mixture decoder
Applicable datasets: DAVIS2016, SegTrackv2, FBMS59, MoCA
“Self-supervised Video Object Segmentation by Motion Grouping”, Yang et al., ICCV 2021
Bridging the gap by reconstructing depth
Encoder: CNN
Learning objective: reconstructing depth from LiDAR
Decoder: CNN mixture decoder
Applicable datasets: Waymo Open dataset
“SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos”, Elsayed et al., NeurIPS 2022
Bridging the gap by upgrading encoder
Bridging the gap by upgrading encoder: DINOSAUR
Encoder: ViT, ResNet
Learning objective: reconstructing DINO or MAE features
Decoder: MLP Mixture Decoder or Transformer Decoder
“Bridging the Gap to Real-World Object-Centric Learning”, Seitzer et al., ICLR 2023
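A rough sketch of the feature-reconstruction setup on this slide, using the MLP mixture decoder variant: each slot is broadcast over the patch grid, a shared MLP predicts patch features plus a mask logit, and the loss is taken against frozen self-supervised features instead of pixels. The shapes, the additive positional embedding, the two-layer MLP, and the random stand-in for the frozen encoder features are all assumptions made for illustration; only the forward pass is shown.

```python
import numpy as np

def mlp_mixture_decode(slots, pos_emb, w1, b1, w2, b2):
    """Decode K slots into N patch features with a shared two-layer MLP.

    slots:   (K, D) slot vectors
    pos_emb: (N, D) positional embedding per patch (assumed to be additive)
    Returns reconstructed features (N, F) and per-slot patch masks (K, N).
    """
    x = slots[:, None, :] + pos_emb[None, :, :]       # (K, N, D)
    h = np.maximum(x @ w1 + b1, 0.0)                  # ReLU hidden layer
    out = h @ w2 + b2                                 # (K, N, F + 1)
    feats, logits = out[..., :-1], out[..., -1]
    logits = logits - logits.max(axis=0, keepdims=True)
    masks = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    recon = (masks[..., None] * feats).sum(axis=0)    # (N, F)
    return recon, masks

rng = np.random.default_rng(0)
K, N, D, HID, F = 5, 196, 32, 64, 48                  # assumed sizes
slots = rng.normal(size=(K, D))
pos_emb = rng.normal(size=(N, D))
w1, b1 = rng.normal(size=(D, HID)) * 0.1, np.zeros(HID)
w2, b2 = rng.normal(size=(HID, F + 1)) * 0.1, np.zeros(F + 1)

# Placeholder for patch features from a frozen pre-trained encoder.
# In training these targets would receive no gradient.
target_feats = rng.normal(size=(N, F))

recon, masks = mlp_mixture_decode(slots, pos_emb, w1, b1, w2, b2)
loss = np.mean((recon - target_feats) ** 2)           # feature-space MSE
```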
Bridging the gap by upgrading encoder: DINOSAUR
Applicable datasets: MOVi++, PASCAL VOC 2012, COCO
“Bridging the Gap to Real-World Object-Centric Learning”, Seitzer et al., ICLR 2023
Bridging the gap by upgrading encoder: DINOSAUR
Metrics: FG-ARI, mBO, CorLoc, mIoU
“Bridging the Gap to Real-World Object-Centric Learning”, Seitzer et al., ICLR 2023
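Two of the metrics listed on this slide can be sketched in a few lines. The sketch below follows the commonly used definitions: mBO assigns each ground-truth mask the predicted mask with the highest IoU and averages those IoUs, and FG-ARI is the adjusted Rand index computed on foreground pixels only. The label-map encoding (background id 0) and all function names are assumptions for this example; CorLoc and mIoU are not covered.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def mean_best_overlap(gt_labels, pred_labels, background=0):
    """mBO: best-matching predicted mask per ground-truth object, averaged."""
    gt_ids = [i for i in np.unique(gt_labels) if i != background]
    pred_masks = [pred_labels == j for j in np.unique(pred_labels)]
    best = [max(iou(gt_labels == i, p) for p in pred_masks) for i in gt_ids]
    return float(np.mean(best))

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two 1-D label arrays, via the contingency table."""
    _, a = np.unique(labels_a, return_inverse=True)
    _, b = np.unique(labels_b, return_inverse=True)
    cont = np.zeros((a.max() + 1, b.max() + 1), dtype=np.int64)
    np.add.at(cont, (a, b), 1)
    comb2 = lambda x: x * (x - 1) / 2.0
    n = cont.sum()
    if n < 2:
        return 1.0
    sum_ij = comb2(cont).sum()
    sum_a = comb2(cont.sum(axis=1)).sum()
    sum_b = comb2(cont.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(n)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:          # degenerate case: trivial partitions
        return 1.0
    return float((sum_ij - expected) / (max_index - expected))

def fg_ari(gt_labels, pred_labels, background=0):
    """FG-ARI: ARI restricted to pixels that are foreground in the ground truth."""
    fg = gt_labels != background
    return adjusted_rand_index(gt_labels[fg], pred_labels[fg])

# Toy 4x4 scene: object 1 fills the left half, object 2 the top-right
# quadrant, and the bottom-right quadrant is background (id 0).
gt = np.zeros((4, 4), dtype=int)
gt[:, :2] = 1
gt[:2, 2:] = 2
# Prediction: one slot per image half, so the second slot spills into background.
pred = np.zeros((4, 4), dtype=int)
pred[:, 2:] = 1

mbo = mean_best_overlap(gt, pred)
ari = fg_ari(gt, pred)
```

In this toy scene FG-ARI is 1.0 while mBO is 0.75: FG-ARI ignores the slot that leaks into the background, whereas mBO penalizes it through the lower IoU.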
Bridging the gap by upgrading encoder: Adaptive SA
Encoder: ViT, ResNet
Learning objective: reconstructing DINO or MAE features
Decoder: MLP Mixture Decoder or Transformer Decoder
“Adaptive Slot Attention: Object Discovery with Dynamic Slot Number”, Fan et al., CVPR 2024
Bridging the gap by upgrading encoder
“To Understand Language is to Understand Generalization”, Eric Jang
Slot-decoding dilemma
“Illiterate DALL-E Learns to Compose”, Singh et al., ICLR 2022
Upgrading decoder utilizing 3D inductive bias: ObSuRF
Encoder: CNN
Learning objective: Novel view synthesis
Decoder: NeRF decoder
Applicable datasets: Multiview CLEVR
“Decomposing 3D Scenes into Objects via Unsupervised Volume Segmentation”, Stelzner et al., 2021
Upgrading decoder utilizing 3D inductive bias: OSRT
Encoder: CNN + Transformer (SRT)
Learning objective: Novel view synthesis
Decoder: Slot Mixer
Applicable datasets: CLEVR-3D, MultiShapeNet
“Object Scene Representation Transformer”, Sajjadi et al., NeurIPS 2022
Upgrading decoder by decoupling decoding: SLATE/STEVE
Encoder: VQ encoder
Learning objective: Reconstructing VQ-Code
Decoder: Autoregressive transformer
Applicable datasets: MOVi++, Traffic and Aquarium
“Illiterate DALL-E Learns to Compose”, Singh et al., ICLR 2022
“Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos”, Singh et al., NeurIPS 2022
Upgrading decoder by decoupling decoding: LSD
Encoder: Transformer; pre-trained auto-encoder (AE)
Learning objective: predicting diffusion noise
Decoder: pre-trained Stable Diffusion model
Applicable datasets: MOVi++, FFHQ
“Object-Centric Slot Diffusion”, Jiang et al., NeurIPS 2023
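The noise-prediction objective named on this slide can be sketched as follows, under stated assumptions: a standard DDPM-style forward process with a linear beta schedule, a latent `x0` standing in for the output of the pre-trained auto-encoder, and `toy_denoiser`, a hypothetical placeholder for the slot-conditioned denoising network. None of this is the Stable Diffusion API or the authors' code; it only illustrates the shape of the loss.

```python
import numpy as np

def make_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alphas_cumprod, noise):
    """Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

def toy_denoiser(x_t, t, slots, w_cond):
    """Hypothetical stand-in for the slot-conditioned denoiser.

    A real model would attend to the slots and use the timestep t;
    this toy version ignores t and adds a pooled slot projection.
    """
    cond = slots.mean(axis=0) @ w_cond     # (C,), broadcast over H and W
    return 0.1 * x_t + cond

def noise_prediction_loss(x0, slots, w_cond, alphas_cumprod, rng):
    """Sample a timestep and noise, then score the predicted noise with MSE."""
    t = int(rng.integers(0, len(alphas_cumprod)))
    noise = rng.normal(size=x0.shape)
    x_t = q_sample(x0, t, alphas_cumprod, noise)
    pred = toy_denoiser(x_t, t, slots, w_cond)
    return float(np.mean((noise - pred) ** 2))

rng = np.random.default_rng(0)
alphas_cumprod = make_alphas_cumprod()
x0 = rng.normal(size=(8, 8, 4))            # placeholder AE latent (H, W, C)
slots = rng.normal(size=(6, 16))           # K slots of dimension D
w_cond = rng.normal(size=(16, 4)) * 0.1
loss = noise_prediction_loss(x0, slots, w_cond, alphas_cumprod, rng)
```

With a frozen pre-trained decoder, the slots are the only route by which image content can reach the denoiser, so this loss is what pushes them to encode the scene.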
Upgrading decoder by decoupling decoding: LSD
Slot-based Image Editing
“Object-Centric Slot Diffusion”, Jiang et al., NeurIPS 2023
Upgrading decoder utilizing pre-trained decoder: DORSAL
Encoder: CNN + Transformer
Learning objective: predicting diffusion noise
Decoder: Multiview U-Net
Applicable datasets: MSN, Street View
“DORSal: Diffusion for Object-Centric Representations of Scenes”, Jabri et al., ICLR 2024
Upgrading decoder utilizing pre-trained decoder: DORSAL
“DORSal: Diffusion for Object-Centric Representations of Scenes”, Jabri et al., ICLR 2024
Slot-based Image Editing
Other related important works
Disentangled representation
Co-train with image-text pairs
Videos
Downstream tasks
Co-train with image-text pairs
“Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision”, Xu et al., CVPR 2023
Disentangled representation
“Neural Systematic Binder”, Singh et al., ICLR 2023
More Fine-grained Slot-based Image Editing
The object-centric learning framework (OCLF)
The object-centric learning framework (OCLF)
Encoder: CNN, Transformer
Decoder: Mixture decoder, Autoregressive transformer
Learning objectives: Reconstruction on RGB, DINO, VQ, Optical flow
Tricks: Slot initialization, Latent duplicate suppression, etc.
The object-centric learning framework (OCLF)
Model Zoo: Slot attention, SAVi, SLATE, DINOSAUR, OC-MOT, Adaptive SA
Thank You!