
Improving Unsupervised Label Propagation for Pose Tracking and Video Object Segmentation
Urs Waldmann, Jannik Bamberger, Ole Johannsen, Oliver Deussen, and Bastian Goldlücke

Contributions

Pipeline

    • A pipeline for label propagation built by carefully evaluating and selecting best practices.

    • A joint tracking and keypoint propagation framework.
      • Enables, e.g., pigeon keypoint tracking.
      • Enables pose tracking for objects that are small relative to the frame size.

    • A fully unsupervised label propagation pipeline for Video Object Segmentation (VOS).
      • We initialize the first frame with the self-attention layer from DINO [1]; a sketch of this step follows below.
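
A minimal sketch of this first-frame initialization, assuming the publicly released DINO ViT-S/8 checkpoint from torch.hub and a simple cumulative-attention threshold; the function name first_frame_mask and the threshold value are illustrative, and the exact procedure in the paper may differ.

```python
import torch
import torch.nn.functional as F

# Publicly released DINO [1] ViT-S/8 checkpoint; the model name and the
# get_last_selfattention() helper follow the facebookresearch/dino repo.
model = torch.hub.load("facebookresearch/dino:main", "dino_vits8")
model.eval()

def first_frame_mask(img, keep=0.6, patch=8):
    """img: (1, 3, H, W) ImageNet-normalized tensor, H and W divisible by `patch`."""
    h, w = img.shape[2] // patch, img.shape[3] // patch
    with torch.no_grad():
        attn = model.get_last_selfattention(img)       # (1, heads, N, N)
    cls_attn = attn[0, :, 0, 1:]                       # CLS token -> patch tokens
    cls_attn = cls_attn / cls_attn.sum(dim=-1, keepdim=True)
    # Keep the most-attended patches that cover `keep` of the attention mass.
    vals, idx = cls_attn.sort(dim=-1, descending=True)
    keep_sorted = vals.cumsum(dim=-1) < keep
    per_head = torch.zeros_like(cls_attn).scatter_(1, idx, keep_sorted.float())
    mask = (per_head.mean(0) > 0.5).reshape(h, w)      # majority vote over heads
    # Upsample the patch-level mask to image resolution as the first-frame label.
    return F.interpolate(mask[None, None].float(), scale_factor=patch, mode="nearest")[0, 0]
```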

Results

    • We outperform state-of-the-art pipelines for pose tracking on the JHMDB [3] dataset by 6.5% in terms of PCK@0.1 (the metric is sketched below).

    • Our joint pipeline obtains strong performance for pigeon keypoint tracking (81.0% PCK@0.1 and 97.5% PCK@0.2 on our real-world dataset).

    • In Video Object Segmentation, we outperform other unsupervised methods that do not rely on motion segmentation, when non-interactive post-processing is disregarded.
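
For reference, PCK@α counts a predicted keypoint as correct if it lies within α times a reference scale of the ground-truth location. A minimal sketch, assuming the reference scale is a per-instance quantity such as the bounding-box size (normalization conventions vary between benchmarks).

```python
import numpy as np

def pck(pred, gt, ref_scale, alpha=0.1):
    """
    pred, gt:  (N, K, 2) arrays of predicted / ground-truth keypoints (x, y).
    ref_scale: (N,) per-instance reference size, e.g. the bounding-box scale
               (the normalization convention depends on the benchmark).
    Returns the fraction of keypoints within alpha * ref_scale of the ground truth.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)             # (N, K) pixel errors
    correct = dist <= alpha * ref_scale[:, None]          # broadcast per instance
    return correct.mean()

# Toy usage: PCK@0.1 and PCK@0.2 on random data.
pred = np.random.rand(4, 15, 2) * 100
gt = pred + np.random.randn(4, 15, 2) * 3
scale = np.full(4, 80.0)
print(pck(pred, gt, scale, 0.1), pck(pred, gt, scale, 0.2))
```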

Pipeline

    • Affinity Top-K and Normalization. We select more than one similar location and aggregate the corresponding label function values. We use k=20 for pose tracking and k=5 for VOS. Normalization is performed after applying top-k, using a softmax for UVC [4] and a simple normalization for DINO [1].

    • Local Affinity. Assumptions about the maximum displacement of an object between frames allow us to consider only a fixed-size neighborhood around each location for propagation. We use a local affinity of 12 in our applications and case studies.

    • Context Frames. Using more than one reference frame proves beneficial; we jointly propagate labels from all context frames at once. As context frames we use the first frame and the 20 directly preceding frames (see the sketch after this list).
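
A minimal sketch of the propagation step described by the three items above: local affinities, top-k selection, normalization, and joint aggregation over all context frames. The L2-normalized dot-product similarity, the tensor shapes, and the function name propagate_labels are illustrative assumptions; the feature backbones (DINO [1], UVC [4]) and the exact normalizations follow the paper and may differ in detail.

```python
import torch
import torch.nn.functional as F

def propagate_labels(ctx_feats, ctx_labels, query_feat, k=20, radius=12):
    """
    ctx_feats:  list of (C, H, W) feature maps of the context frames
                (e.g. the first frame plus the 20 directly preceding frames).
    ctx_labels: list of (L, H, W) label maps (keypoint heatmaps or masks).
    query_feat: (C, H, W) features of the current frame.
    Returns the propagated (L, H, W) label map for the current frame.
    """
    C, H, W = query_feat.shape
    q = F.normalize(query_feat.reshape(C, -1), dim=0)           # (C, HW)

    # Local affinity: only reference positions within `radius` of a query position.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()
    dy = (pos[:, :1] - pos[:, :1].t()).abs()
    dx = (pos[:, 1:] - pos[:, 1:].t()).abs()
    local = (dy <= radius) & (dx <= radius)                     # (HW, HW)

    # Joint propagation: collect affinities from all context frames at once.
    affs, labs = [], []
    for f, l in zip(ctx_feats, ctx_labels):
        r = F.normalize(f.reshape(C, -1), dim=0)                # (C, HW)
        aff = q.t() @ r                                         # (HW_query, HW_ref)
        affs.append(aff.masked_fill(~local, float("-inf")))
        labs.append(l.reshape(l.shape[0], -1))                  # (L, HW_ref)
    aff = torch.cat(affs, dim=1)                                # (HW_query, T*HW_ref)
    lab = torch.cat(labs, dim=1)                                # (L, T*HW_ref)

    # Top-k over all context locations, then normalize the kept weights.
    # Softmax normalization as for UVC [4]; for DINO [1] features a simple
    # sum-normalization of the kept affinities is used instead.
    vals, idx = aff.topk(k, dim=1)
    weights = F.softmax(vals, dim=1)
    picked = lab[:, idx]                                        # (L, HW_query, k)
    out = (picked * weights.unsqueeze(0)).sum(dim=-1)           # (L, HW_query)
    return out.reshape(-1, H, W)
```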

Applications and Case Studies

Joint Tracking and Keypoint Propagation

Pigeon Keypoint Tracking

Unsupervised Zero-Shot VOS

This work was supported by the DFG under Germany’s Excellence Strategy - EXC 2117 - 422037984.

Presented at the German Conference on Pattern Recognition (GCPR), Konstanz, September 2022.

References

[1] M. Caron et al.: Emerging properties in self-supervised vision transformers. In ICCV, 2021.

[2] A. Jabri et al.: Space-time correspondence as a contrastive random walk. In NeurIPS, 2020.

[3] H. Jhuang et al.: Towards understanding action recognition. In ICCV, 2013.

[4] X. Li et al.: Joint-task self-supervised learning for temporal correspondence. In NeurIPS, 2019.

[5] C. Yang et al.: Self-supervised video object segmentation by motion grouping. In ICCV, 2021.

[6] L. Yang et al.: Efficient video object segmentation via network modulation. In CVPR, 2018.

[7] Y. Yang et al.: Unsupervised moving object detection via contextual information separation. In CVPR, 2019.

Pigeon keypoint tracking on our real-world dataset

Real-World Dataset   PCK@0.1   PCK@0.2
Core Pipeline        7.5%      25.1%
Joint Pipeline       81.0%     97.5%

Pose tracking on JHMDB [3]

Method           PCK@0.1   PCK@0.2
STC [2]          59.3%     84.9%
Ours             63.9%     82.8%
Ours + Trk       65.8%     84.2%
Supervised [6]   68.7%     92.1%

Unsupervised Zero-Shot VOS

Method                   Online   Post-Proc.   Motion Only   IoU
MoGr., unsup. flow [5]                                        53.2%
CIS [7]                                                       59.2%
Ours                                                          61.6%
MoGr., sup. flow [5]                                          68.3%
CIS + post-proc. [7]                                          71.5%