1 of 30

2 of 30

Video segmentation…

3 of 30

Is the optical flow algorithm doing all the work?

4 of 30

… in a fully self-supervised manner

5 of 30

Outline

Method

The slot attention module
Optical flow
Applying slot attention to optical flow
Losses
Full pipeline

Results

Quantitative results
Qualitative results
Ablations

6 of 30

Method

7 of 30

Slot attention module[1]

[1] Object-Centric Learning with Slot Attention, F. Locatello, D. Weissenborn, T. Unterthiner, A. Mahendran, G. Heigold, J. Uszkoreit, A. Dosovitskiy, T. Kipf

8 of 30

Slot attention module[1]

Maps from a set of N input features to a set of K slots

Soft clustering assignment

Guarantees:

Permutation invariance with respect to the input features
Permutation equivariance with respect to the slots

Slots randomly sampled from a Gaussian distribution

K fixed during training
Can generalize to higher K during testing

Iterative process (3-5 iterations)
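
The Gaussian initialization is what makes a different K possible at test time: every slot is drawn from one shared distribution, so no parameter is tied to a particular slot. A minimal sketch, assuming a diagonal Gaussian; `mu`, `sigma`, and `sample_slots` are hypothetical names and the values are placeholders, not taken from [1]:

```python
import random

def sample_slots(num_slots, mu, sigma, rng):
    """Draw num_slots slot vectors from one shared diagonal Gaussian."""
    return [[rng.gauss(m, s) for m, s in zip(mu, sigma)]
            for _ in range(num_slots)]

rng = random.Random(0)
mu = [0.0, 0.0, 0.0, 0.0]     # hypothetical learned mean (placeholder values)
sigma = [1.0, 1.0, 1.0, 1.0]  # hypothetical learned std (placeholder values)

train_slots = sample_slots(4, mu, sigma, rng)  # K = 4 during training
test_slots = sample_slots(7, mu, sigma, rng)   # a larger K reuses the same mu, sigma
```

Because the slots differ only through the random draw, permuting them permutes the outputs in the same way.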


Hard clustering

Soft clustering

9 of 30

Slot attention iteration

Input appearance tokens
Input slot tokens
Output slot tokens

Slot attention iteration

10 of 30

One iteration of slot attention

Maps from a set of N input features to a set of K slots

Input appearance tokens
Input slot tokens
Output slot tokens

Slot attention iteration: “Cross Attention”, then GRU, then MLP

11 of 30

Slot update using “Cross Attention”

Input appearance tokens
Input slot tokens
Output slot update

Normalization over all slots

Weighted mean instead of weighted sum
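
The two points above can be sketched in a few lines. This is a simplified illustration, not the implementation from [1]: the learned query/key/value projections are left out (treated as identity), and `attention_update` is a hypothetical name.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention_update(inputs, slots, eps=1e-8):
    """One attention step. inputs: N x D list, slots: K x D list.

    Assumption: learned projections are omitted (identity).
    Returns (updates, attn) with updates K x D and attn N x K.
    """
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    # Softmax over the slot axis: the slots compete for each input token.
    attn = [softmax([scale * sum(a * b for a, b in zip(x, s)) for s in slots])
            for x in inputs]
    updates = []
    for k in range(len(slots)):
        weights = [row[k] for row in attn]
        norm = sum(weights) + eps  # dividing by this turns the sum into a mean
        updates.append([sum(w * x[d] for w, x in zip(weights, inputs)) / norm
                        for d in range(dim)])
    return updates, attn

inputs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # N = 3 toy tokens
slots = [[5.0, 0.0], [0.0, 5.0]]               # K = 2 toy slots
updates, attn = attention_update(inputs, slots)
```

Each row of `attn` sums to 1 across the slots, and each slot update is a convex combination of the input tokens it attends to.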

“Cross Attention”

12 of 30

Slot update using “Cross Attention”

Input appearance tokens
Input slot tokens
Output slot update

“Cross Attention”

13 of 30

Recurrent Unit for the output slot
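
A minimal sketch of a GRU step, with the attention update as input and the previous slot as hidden state. Assumptions: biases are omitted, the convention is h' = (1 - z) * h + z * candidate (libraries differ on this), and the `params` dictionary keys are hypothetical names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def gru_cell(update, slot, params):
    """One GRU step: `update` is the input, `slot` is the hidden state.

    params holds six D x D matrices: Wz, Uz, Wr, Ur, Wh, Uh (no biases).
    """
    z = [sigmoid(a + b) for a, b in
         zip(matvec(params["Wz"], update), matvec(params["Uz"], slot))]
    r = [sigmoid(a + b) for a, b in
         zip(matvec(params["Wr"], update), matvec(params["Ur"], slot))]
    gated = [ri * hi for ri, hi in zip(r, slot)]  # reset gate applied to the slot
    candidate = [math.tanh(a + b) for a, b in
                 zip(matvec(params["Wh"], update), matvec(params["Uh"], gated))]
    # The update gate z interpolates between the old slot and the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, slot, candidate)]

zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
new_slot = gru_cell([1.0, 1.0], [2.0, -4.0], params)
```

With all-zero weights the gates sit at 0.5 and the candidate is 0, so the new slot is half the old one; trained weights decide how much of the attention update is written into each slot.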

14 of 30

Reconstruction Loss

Decoder: from each slot/layer i, it reconstructs:

the RGB image of the slot
the slot mask

Final reconstruction image

Broadcasting:

Inputs the slots
Outputs the broadcasted slots
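
The broadcasting and compositing steps can be sketched as follows. Assumptions: the decoder network between the two steps is omitted, positions are appended as normalized (y, x) coordinates (one common choice, not necessarily the one in [1]), masks are normalized with a softmax across slots, and the reconstruction loss is a mean squared error. All function names are hypothetical.

```python
import math

def spatial_broadcast(slot, height, width):
    """Tile one slot vector over an H x W grid, appending (y, x) in [0, 1]."""
    grid = []
    for y in range(height):
        for x in range(width):
            pos_y = y / (height - 1) if height > 1 else 0.0
            pos_x = x / (width - 1) if width > 1 else 0.0
            grid.append(list(slot) + [pos_y, pos_x])
    return grid  # H*W rows, each of length D + 2

def composite(layer_rgb, mask_logits):
    """layer_rgb: K x P x 3, mask_logits: K x P, with P flattened pixels."""
    num_layers, num_pixels = len(layer_rgb), len(layer_rgb[0])
    image = []
    for p in range(num_pixels):
        logits = [mask_logits[k][p] for k in range(num_layers)]
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        weights = [e / total for e in exps]  # per-pixel masks sum to 1 over slots
        image.append([sum(weights[k] * layer_rgb[k][p][c]
                          for k in range(num_layers)) for c in range(3)])
    return image

def mse(image, target):
    diffs = [(a - b) ** 2 for pa, pb in zip(image, target) for a, b in zip(pa, pb)]
    return sum(diffs) / len(diffs)

grid = spatial_broadcast([0.5, -0.5], height=2, width=2)
layer_rgb = [[[1.0, 0.0, 0.0]] * 2, [[0.0, 0.0, 1.0]] * 2]  # red and blue layers
mask_logits = [[0.0, 10.0], [0.0, -10.0]]
image = composite(layer_rgb, mask_logits)
loss = mse(image, image)
```

In the toy example the first pixel is an even blend of the two layers and the second is almost entirely the first layer.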

15 of 30

Slot attention module results over multiple iterations[1]

16 of 30

Challenges

Challenges:

Slot attention works on synthetic data where objects are primitive shapes with simple textures
Real-world textures are more challenging, especially in the background

Idea:

Use optical flow images instead of RGB images!
Objects in images may not be textureless, but their motions typically are.
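
To illustrate why motion is "textureless", here is a sketch of one common way to turn a flow vector into a color (direction as hue, magnitude as saturation). This is an assumption for illustration; the slides do not say which flow visualization is used, and `flow_to_rgb` and `max_magnitude` are hypothetical names.

```python
import colorsys
import math

def flow_to_rgb(u, v, max_magnitude):
    """Map one flow vector (u, v) to an RGB triple with values in [0, 1]."""
    hue = (math.atan2(v, u) + math.pi) / (2.0 * math.pi)       # direction
    saturation = min(math.hypot(u, v) / max_magnitude, 1.0)  # speed
    return colorsys.hsv_to_rgb(hue, saturation, 1.0)

# A static background pixel and three pixels of one rigidly moving object.
static_pixel = flow_to_rgb(0.0, 0.0, max_magnitude=10.0)
object_pixels = [flow_to_rgb(3.0, 4.0, max_magnitude=10.0) for _ in range(3)]
```

Every pixel of the moving object gets the same color regardless of its appearance, while the static pixel stays white.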

17 of 30

New Pipeline

Optical flow image as input
Only 2 slots/queries/layers, foreground and background
The slots are learnable query vectors rather than Gaussian samples

18 of 30

Reconstruction

From each slot/layer i, it reconstructs:

the layer reconstruction
the layer mask

19 of 30

Additional losses

Encourage the masks to be binary

Encourage temporal consistency
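
The slide's formulas are not in the text, so the following is only a sketch of one plausible form for each loss. Assumptions: a pixel-wise entropy term for binarization, and a squared difference between the masks predicted for the same frame from two different flow fields, taking the minimum over the two slot orderings. Function names are hypothetical.

```python
import math

def entropy_loss(masks, eps=1e-8):
    """masks: K x P soft masks that sum to 1 over K at each pixel."""
    num_pixels = len(masks[0])
    total = 0.0
    for p in range(num_pixels):
        total -= sum(m[p] * math.log(m[p] + eps) for m in masks)
    return total / num_pixels

def consistency_loss(fg_a, fg_b):
    """fg_a, fg_b: foreground masks (length P) for the same frame,
    predicted from two different flow fields."""
    same = sum((a - b) ** 2 for a, b in zip(fg_a, fg_b))
    # With two layers, the swapped ordering compares against 1 - b.
    swapped = sum((a - (1.0 - b)) ** 2 for a, b in zip(fg_a, fg_b))
    return min(same, swapped) / len(fg_a)

binary = entropy_loss([[1.0, 0.0], [0.0, 1.0]])     # close to 0
uncertain = entropy_loss([[0.5], [0.5]])            # close to log(2)
flipped = consistency_loss([1.0, 0.0, 1.0], [0.0, 1.0, 0.0])
mismatch = consistency_loss([1.0, 0.0], [1.0, 1.0])
```

The entropy term is near zero for binary masks and largest at 0.5, and the consistency term does not penalize a pure swap of foreground and background.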

20 of 30

Final loss
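
The formula on this slide is not in the text. A plausible form, assuming the final loss is a weighted sum of the reconstruction loss and the two additional losses; the weights $\lambda_r$, $\lambda_e$, $\lambda_c$ and the subscript names are hypothetical:

```latex
\mathcal{L} = \lambda_{r}\,\mathcal{L}_{\text{recon}}
            + \lambda_{e}\,\mathcal{L}_{\text{entropy}}
            + \lambda_{c}\,\mathcal{L}_{\text{consistency}}
```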

21 of 30

Full pipeline

22 of 30

Results

23 of 30

24 of 30

25 of 30

Ablation on the loss

26 of 30

Ablation on the optical flow

27 of 30

Results

28 of 30

Results on MoCA

29 of 30

Limitations

Only 2 layers, foreground and background

Can it generalize to several moving objects, and can it cleanly separate camera motion from object motion?
A follow-up paper applying slot attention to videos uses K > 2

Relies only on motion

What if optical flow is noisy or low quality?
Use RGB signal
Jointly optimize flow and segmentation

Does it properly learn motion centric representations for other applications?

30 of 30

Other ablations

[2] Group normalization, Y. Wu, K. He