1 of 30

2 of 30

Video segmentation…

3 of 30

Is the optical flow algorithm doing all the work?

4 of 30

… in a fully self-supervised manner

5 of 30

Outline

  1. Method
    1. The slot attention module
    2. Optical flow
    3. Applying slot attention to optical flow
    4. Losses
    5. Full pipeline

  2. Results
    1. Quantitative results
    2. Qualitative results
    3. Ablations

6 of 30

Method

7 of 30

Slot attention module[1]

8 of 30

Slot attention module[1]

Maps from a set of N input features to a set of K slots

  • Soft clustering assignment

  • Guarantees:
    • Permutation invariance with respect to the input features
    • Permutation equivariance with respect to the slots

  • Slots randomly sampled from a Gaussian distribution
    • K fixed during training
    • Can generalize to higher K during testing

  • Iterative process (3-5 iterations)
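The slot initialization described above can be sketched as follows. This is a minimal illustration, assuming one mean and one standard deviation per slot dimension, shared by all slots; the names `sample_slots`, `mu`, and `sigma` and the numeric values are hypothetical. Because the Gaussian parameters do not depend on K, the same parameters can be used to draw a different number of slots at test time.

```python
import random

def sample_slots(num_slots, mu, sigma, rng):
    """Draw num_slots vectors; dimension d of every slot is sampled from N(mu[d], sigma[d])."""
    return [[rng.gauss(m, s) for m, s in zip(mu, sigma)]
            for _ in range(num_slots)]

# Hypothetical parameters for a 4-dimensional slot (learned in a real model).
mu = [0.0, 0.5, -0.5, 1.0]
sigma = [1.0, 0.1, 0.1, 2.0]

rng = random.Random(0)
train_slots = sample_slots(3, mu, sigma, rng)  # K = 3 during training
test_slots = sample_slots(7, mu, sigma, rng)   # K = 7 at test time, same mu and sigma
```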

[Figure: hard clustering vs. soft clustering]

9 of 30

Slot attention iteration

  • Input appearance tokens
  • Input slot tokens
  • Output slot tokens


10 of 30

1 iteration of slot attention

Maps from a set of N input features to a set of K slots

  • Input appearance tokens
  • Input slot tokens
  • Output slot tokens

[Figure: one slot attention iteration, with blocks “Cross Attention”, GRU, and MLP]

11 of 30

Slot update using “Cross Attention”

  • Input appearance tokens
  • Input slot tokens
  • Output slot update

  • Normalization over all slots

  • Weighted mean instead of weighted sum
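The two bullets on normalization can be made concrete with a small sketch. It assumes the queries (from slots) and the keys and values (from input tokens) have already been linearly projected; the function name and the toy vectors are illustrative only. The softmax runs over the slot axis, so slots compete for each input token, and each slot's weights are then renormalized over the inputs so the update is a weighted mean rather than a weighted sum.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def slot_attention_update(queries, keys, values):
    """queries: K slot vectors; keys, values: N input vectors (already projected)."""
    num_slots, num_inputs = len(queries), len(keys)
    scale = 1.0 / math.sqrt(len(queries[0]))

    # attn[i][j]: softmax over slots i, computed separately for each input j.
    attn = [[0.0] * num_inputs for _ in range(num_slots)]
    for j in range(num_inputs):
        logits = [scale * dot(queries[i], keys[j]) for i in range(num_slots)]
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        for i in range(num_slots):
            attn[i][j] = exps[i] / total

    # Weighted mean: renormalize each slot's weights so they sum to 1 over the inputs.
    updates = []
    for i in range(num_slots):
        row_sum = sum(attn[i]) + 1e-8
        weights = [a / row_sum for a in attn[i]]
        updates.append([sum(weights[j] * values[j][d] for j in range(num_inputs))
                        for d in range(len(values[0]))])
    return attn, updates

# Toy example: 2 slots, 3 input tokens; tokens 0 and 2 resemble slot 0, token 1 resembles slot 1.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
attn, updates = slot_attention_update(queries, keys, values)
```

With a standard cross-attention softmax over the inputs, every slot could attend to the same token; normalizing over slots instead pushes each token toward a single slot.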


12 of 30

Slot update using “Cross Attention”

  • Input appearance tokens
  • Input slot tokens
  • Output slot update


13 of 30

Recurrent Unit for the output slot
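As a reminder of what the recurrent unit computes, here is one common convention for a GRU update followed by the residual MLP. All symbols are assumptions introduced for illustration: $u_i$ is the attention update for slot $i$, $s_i^{t-1}$ is the slot before this iteration, and biases are omitted. Other GRU conventions swap the roles of $z_i$ and $1 - z_i$.

```latex
\[
\begin{aligned}
z_i &= \sigma\left(W_z u_i + U_z s_i^{t-1}\right) \\
r_i &= \sigma\left(W_r u_i + U_r s_i^{t-1}\right) \\
\tilde{s}_i &= \tanh\left(W_h u_i + U_h \left(r_i \odot s_i^{t-1}\right)\right) \\
s_i' &= (1 - z_i) \odot s_i^{t-1} + z_i \odot \tilde{s}_i \\
s_i^{t} &= s_i' + \mathrm{MLP}\left(\mathrm{LayerNorm}(s_i')\right)
\end{aligned}
\]
```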

14 of 30

Reconstruction Loss

Decoder: from each slot/layer i, it reconstructs:

  • the RGB image of the slot
  • the slot mask

Final reconstruction image
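A minimal sketch of how the per-slot outputs can be combined into the final image, shown for a single pixel. It assumes the slot masks are normalized with a softmax across slots and used as mixture weights, and that the reconstruction loss is a mean squared error; the function names are illustrative.

```python
import math

def composite(slot_rgb, slot_mask_logits):
    """Combine K per-slot reconstructions for one pixel.

    slot_rgb: list of K (r, g, b) tuples.
    slot_mask_logits: K unnormalized mask scores for the same pixel.
    """
    top = max(slot_mask_logits)
    exps = [math.exp(l - top) for l in slot_mask_logits]
    total = sum(exps)
    masks = [e / total for e in exps]  # softmax across slots, sums to 1
    pixel = tuple(sum(m * rgb[c] for m, rgb in zip(masks, slot_rgb))
                  for c in range(3))
    return pixel, masks

def mse(pred, target):
    """Mean squared error between two equally sized sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Slot 0 predicts red and claims the pixel; slot 1 predicts blue.
pixel, masks = composite([(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)], [5.0, -5.0])
loss = mse(pixel, (1.0, 0.0, 0.0))
```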

Broadcasting:

  • Input: the slots
  • Output: the broadcast slots (each slot tiled over the spatial grid)
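The broadcasting step can be sketched as tiling one slot vector over an H x W grid and attaching a position to every cell, so a shared decoder can produce different outputs at different locations. This sketch appends normalized (y, x) coordinates, which is one common variant and an assumption here; adding a learned positional embedding instead is another option.

```python
def spatial_broadcast(slot, height, width):
    """Return a height x width grid; each cell is the slot vector plus (y, x) in [-1, 1]."""
    def coord(i, n):
        return 0.0 if n == 1 else -1.0 + 2.0 * i / (n - 1)
    return [[list(slot) + [coord(y, height), coord(x, width)]
             for x in range(width)]
            for y in range(height)]

# A 2-dimensional slot broadcast to a 2 x 3 grid gives 4 features per cell.
grid = spatial_broadcast([0.3, -1.2], height=2, width=3)
```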

15 of 30

Slot attention module results over multiple iterations

16 of 30

Challenges

Challenges:

  • Slot attention works on synthetic data where objects are primitive shapes with simple textures
  • Harder with real-life textures, especially in the background

Idea:

  • Use optical flow images instead of RGB images!
  • Objects in images may not be textureless, but their motions typically are.
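To illustrate what an "optical flow image" can look like, here is a sketch that color-codes a single flow vector. It assumes the widely used convention of mapping direction to hue and magnitude to saturation; whether the presented method uses exactly this encoding is an assumption. Under it, a rigidly moving object produces a nearly uniform color patch, which is the sense in which motion is "textureless".

```python
import colorsys
import math

def flow_to_rgb(u, v, max_magnitude):
    """Color-code one flow vector (u, v): direction -> hue, magnitude -> saturation."""
    magnitude = math.hypot(u, v)
    hue = (math.atan2(v, u) / (2.0 * math.pi)) % 1.0
    saturation = min(magnitude / max_magnitude, 1.0) if max_magnitude > 0 else 0.0
    return colorsys.hsv_to_rgb(hue, saturation, 1.0)

static = flow_to_rgb(0.0, 0.0, max_magnitude=10.0)   # no motion maps to white
moving = flow_to_rgb(10.0, 0.0, max_magnitude=10.0)  # fast rightward motion maps to saturated red
```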

17 of 30

New Pipeline

  • Optical flow image as input
  • Only 2 slots/queries/layers, foreground and background
  • The slots are not Gaussian-initialized but learnable query vectors

18 of 30

Reconstruction

From each slot/layer i, it reconstructs:

  • the layer reconstruction
  • the layer mask
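One way to write the layered reconstruction of the input flow, using notation assumed here for illustration: $\hat{I}_i$ is the reconstruction from layer $i$, $\alpha_i$ is its mask with $\alpha_0 + \alpha_1 = 1$ at every pixel, $I$ is the input flow image, and $\Omega$ is the set of pixels.

```latex
\[
\hat{I} = \sum_{i \in \{0, 1\}} \alpha_i \odot \hat{I}_i,
\qquad
\mathcal{L}_{\text{recon}} = \frac{1}{|\Omega|} \sum_{p \in \Omega} \left\lVert I(p) - \hat{I}(p) \right\rVert^2
\]
```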

19 of 30

Additional losses

Encourage the masks to be binary

Encourage temporal consistency
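Both regularizers can be sketched on flattened masks. The assumptions: the binary-mask term is the per-pixel binary entropy of the foreground probability, and the consistency term is the mean squared difference between two masks predicted for the same frame from two different flow inputs. Function names and toy masks are illustrative.

```python
import math

def entropy_loss(fg_mask, eps=1e-8):
    """Mean binary entropy of foreground probabilities; close to 0 when the mask is binary."""
    total = 0.0
    for a in fg_mask:
        total -= a * math.log(a + eps) + (1.0 - a) * math.log(1.0 - a + eps)
    return total / len(fg_mask)

def consistency_loss(mask_a, mask_b):
    """Mean squared difference between two masks predicted for the same frame."""
    return sum((a - b) ** 2 for a, b in zip(mask_a, mask_b)) / len(mask_a)

soft = [0.5, 0.5, 0.5, 0.5]  # maximally uncertain mask
hard = [1.0, 0.0, 1.0, 0.0]  # binary mask

uncertain_penalty = entropy_loss(soft)      # about log(2)
binary_penalty = entropy_loss(hard)         # about 0
mismatch = consistency_loss(soft, hard)     # 0.25
```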

20 of 30

Final loss
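A plausible form of the combined objective is a weighted sum of the reconstruction loss, the entropy (binary mask) loss, and the temporal consistency loss. The weights $\lambda_r$, $\lambda_e$, $\lambda_c$ and the subscript names are assumptions introduced for illustration, not values from the slides.

```latex
\[
\mathcal{L}_{\text{total}} = \lambda_r \, \mathcal{L}_{\text{recon}} + \lambda_e \, \mathcal{L}_{\text{entr}} + \lambda_c \, \mathcal{L}_{\text{cons}}
\]
```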

21 of 30

Full pipeline

22 of 30

Results

23 of 30

24 of 30

25 of 30

Ablation on the loss

26 of 30

Ablation on the optical flow

27 of 30

Results

28 of 30

Results on MoCA

29 of 30

Limitations

  • Only 2 layers: foreground and background
    • Can it generalize to more moving objects, and can it cleanly separate camera motion from object motion?
    • A follow-up paper on slot attention applied to videos uses K > 2

  • Relies only on motion
    • What if the optical flow is noisy or low quality?
    • Could also use the RGB signal
    • Could jointly optimize flow and segmentation

  • Does it properly learn motion-centric representations for other applications?

30 of 30

Other ablations