1 of 26

Visual Geometry Group

Presented by: Zahid Hasan

MPSC lab, IS Department, UMBC

2 of 26

Objective

  • State-of-the-art computer vision research
  • Latest techniques and methodologies
  • Open problems in vision research
  • Future research directions
  • Integration with our work, or new research ideas

3 of 26

People

  • Professor Andrew Zisserman
  • Professor Andrea Vedaldi
  • Research Fellows
  • Research Students
  • Extensive collaborations
  • ….

4 of 26

Research Overview

  • Self-supervised learning
  • Audio-visual learning
  • Understanding and training CNNs
  • Search and retrieval of images and video
  • Video-based recognition and understanding
  • Counting, detection, reading and tracking
  • Art recognition and search
  • Miscellaneous

5 of 26

Self-supervised learning

  • Self-supervised learning of audio-visual objects from video [1]
    • Sound source separation and speaker tracking
    • Correcting misaligned audio and video
    • Trained on unlabelled data (see the sketch below)

Figure: Audio-visual object detection [1]
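
A minimal sketch of the contrastive correspondence objective that underlies much of this audio-visual work (my simplification, not the exact model in [1]; the encoders that produce the embeddings are assumed):

    import torch
    import torch.nn.functional as F

    def av_contrastive_loss(video_emb, audio_emb, temperature=0.07):
        """video_emb, audio_emb: (batch, dim) clip embeddings."""
        video_emb = F.normalize(video_emb, dim=1)
        audio_emb = F.normalize(audio_emb, dim=1)
        logits = video_emb @ audio_emb.t() / temperature  # (B, B) similarities
        targets = torch.arange(video_emb.size(0))         # matched pairs on the diagonal
        # symmetric InfoNCE: video-to-audio and audio-to-video
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    v, a = torch.randn(8, 128), torch.randn(8, 128)       # toy stand-in embeddings
    print(av_contrastive_loss(v, a))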

6 of 26

SSL

  • Memory-augmented dense predictive coding (DPC) for video representation learning [2]
    • Learns representations of activity
    • Pretraining
    • Evaluated on four downstream tasks (see the sketch below)

Figure: Memory-augmented predictive coding from the input video [2]
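
A minimal sketch of the predictive-coding idea (heavily simplified; the memory bank and dense spatial predictions of [2] are omitted): aggregate past clip features, predict the future embedding, and score it contrastively against the other clips in the batch.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PredictiveCoder(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            self.rnn = nn.GRU(dim, dim, batch_first=True)  # aggregates past clip features
            self.predict = nn.Linear(dim, dim)             # predicts the future embedding

        def forward(self, past, future):
            """past: (B, T, dim) clip features; future: (B, dim) true next-clip features."""
            _, h = self.rnn(past)
            pred = self.predict(h[-1])                     # (B, dim) predicted future
            logits = F.normalize(pred, dim=1) @ F.normalize(future, dim=1).t()
            targets = torch.arange(past.size(0))           # positives on the diagonal
            return F.cross_entropy(logits / 0.07, targets)

    model = PredictiveCoder()
    print(model(torch.randn(4, 5, 256), torch.randn(4, 256)))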

7 of 26

SSL

  • Unsupervised learning of object landmarks through conditional image generation [3]
    • Image generation under a geometric landmark constraint (sketch below)
  • Unsupervised learning of landmarks by descriptor vector exchange (DVE) [4]
    • Unsupervised landmarks from high-dimensional descriptors

Figure: Landmark detection from images [3]
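
The geometric constraint works through a landmark bottleneck: a network predicts K heatmaps, and a differentiable soft-argmax reduces each one to (x, y) coordinates that condition the image generator. A minimal sketch of that soft-argmax step (the surrounding networks are assumed):

    import torch

    def soft_argmax(heatmaps):
        """heatmaps: (B, K, H, W) -> landmark coordinates (B, K, 2) in [0, 1]."""
        b, k, h, w = heatmaps.shape
        probs = heatmaps.reshape(b, k, -1).softmax(dim=-1).reshape(b, k, h, w)
        ys = torch.linspace(0, 1, h).view(1, 1, h, 1)
        xs = torch.linspace(0, 1, w).view(1, 1, 1, w)
        y = (probs * ys).sum(dim=(2, 3))  # expected row of each landmark
        x = (probs * xs).sum(dim=(2, 3))  # expected column of each landmark
        return torch.stack([x, y], dim=-1)

    print(soft_argmax(torch.randn(2, 10, 64, 64)).shape)  # torch.Size([2, 10, 2])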

8 of 26

SSL

  • Self-supervised learning from watching faces
    • X2Face: Controlling face generation [5]
    • FAb-Net: Facial attributes embedding [6]

Figure: FAb-Net [6]

9 of 26

SSL

  • Self-supervised learning for video correspondence flow [7]
    • Coins the term "correspondence flow" (see the sketch below)

Figure: Results on the DAVIS dataset [7]
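
A minimal sketch of the core mechanic, propagating labels through a learned frame-to-frame affinity (a simplification of [7]; the per-pixel features come from an assumed trained encoder):

    import torch
    import torch.nn.functional as F

    def propagate(ref_feat, tgt_feat, ref_labels, temperature=0.07):
        """ref_feat, tgt_feat: (C, N) per-pixel features; ref_labels: (K, N) masks."""
        ref_feat = F.normalize(ref_feat, dim=0)
        tgt_feat = F.normalize(tgt_feat, dim=0)
        affinity = (ref_feat.t() @ tgt_feat) / temperature  # (N_ref, N_tgt)
        weights = affinity.softmax(dim=0)  # each target pixel attends to reference pixels
        return ref_labels @ weights        # (K, N_tgt) propagated labels

    out = propagate(torch.randn(64, 100), torch.randn(64, 100), torch.rand(2, 100))
    print(out.shape)  # torch.Size([2, 100])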

10 of 26

Audio-visual learning

  • Labelling unlabelled videos with multi-modal self-supervision [8]
    • Combines video and audio!
    • Unsupervised clustering (a naive baseline is sketched below)
    • Correspondence between audio and video

Figure: Processing audio and video simultaneously for clustering [8]
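
The paper derives cluster labels jointly from both modalities; as a deliberately naive stand-in (not the method in [8]), one can simply cluster concatenated audio and video embeddings:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    video_emb = rng.normal(size=(500, 128))  # stand-in video embeddings
    audio_emb = rng.normal(size=(500, 128))  # stand-in audio embeddings
    joint = np.concatenate([video_emb, audio_emb], axis=1)
    labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(joint)
    print(np.bincount(labels))  # cluster sizes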

11 of 26

Audio-visual learning

  • Disentangled speech embeddings using cross-modal self-supervision [9]
    • Trained on unlabelled video with voice-only data
    • VoxCeleb dataset

Figure: Training strategy for audio-visual representation [9]

12 of 26

Audio-visual learning

  • Seeing voices and hearing faces: cross-modal biometric matching [10]
    • Matching a voice to a face and vice versa (see the sketch below)

Figure: Task description [10]
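
A minimal sketch of the forced-choice matching task: given a voice embedding and two candidate face embeddings, one of which shares the identity, pick the face with the higher cosine similarity. The pretrained encoders that produce the embeddings are assumed.

    import torch
    import torch.nn.functional as F

    def match(voice_emb, face_a, face_b):
        """All inputs: (dim,) embeddings from assumed pretrained encoders."""
        sim_a = F.cosine_similarity(voice_emb, face_a, dim=0)
        sim_b = F.cosine_similarity(voice_emb, face_b, dim=0)
        return 'A' if sim_a > sim_b else 'B'

    print(match(torch.randn(256), torch.randn(256), torch.randn(256)))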

13 of 26

Audio-visual learning

  • Speech2Action: cross-modal supervision for action recognition [11]
    • Learns from the co-occurrence of actions and speech in video

14 of 26

Audio-visual learning

  • Utterance-level aggregation for speaker recognition in the wild [12]
    • Speaker recognition on the VoxCeleb dataset (a simplified pooling sketch follows)
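
The paper aggregates frame-level features with GhostVLAD; a simpler, hedged stand-in for that step is attentive pooling over a variable-length utterance:

    import torch
    import torch.nn as nn

    class AttentivePool(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            self.score = nn.Linear(dim, 1)  # per-frame importance score

        def forward(self, frames):
            """frames: (B, T, dim) frame-level speaker features."""
            weights = self.score(frames).softmax(dim=1)  # (B, T, 1) normalised over time
            return (weights * frames).sum(dim=1)         # (B, dim) utterance embedding

    pool = AttentivePool()
    print(pool(torch.randn(2, 300, 512)).shape)  # torch.Size([2, 512])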

15 of 26

Understanding CNN

  • Understanding deep image representations by inverting them [13]
    • Reconstructing the original image from its deep representation
  • Deep image prior [14]
    • Image reconstruction, inpainting, and super-resolution (see the sketch below)
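
A minimal sketch of the deep image prior idea: fit a randomly initialised CNN to a single corrupted image and stop early; the network's structural bias does the restoration. The tiny network below is only a stand-in for the hourglass architecture in [14].

    import torch
    import torch.nn as nn

    net = nn.Sequential(  # tiny stand-in for the paper's hourglass network
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
    )
    z = torch.randn(1, 32, 64, 64)                   # fixed random input code
    corrupted = torch.rand(1, 3, 64, 64)             # stand-in corrupted image
    mask = (torch.rand(1, 3, 64, 64) > 0.5).float()  # known-pixel mask (inpainting)

    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for step in range(200):                          # early stopping is the prior
        opt.zero_grad()
        loss = ((net(z) - corrupted) ** 2 * mask).mean()  # fit observed pixels only
        loss.backward()
        opt.step()
    restored = net(z).detach()                       # network output is the restored image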

16 of 26

Visual detection

  • Amplifying key cues for human-object interaction detection [19]
    • Encoding of semantic information
    • Predicting future interactions
    • Gated memory and fusion of information

17 of 26

Visual detection and tracking

  • Detect to track and track to detect [20]
    • RPN on top of video features
    • RoI tracking across frames (a simplified linking sketch follows)
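
A minimal sketch of the linking step only (greedy IoU matching across frames; the paper's learned tracking regressor is not shown):

    def iou(a, b):
        """a, b: boxes as (x1, y1, x2, y2)."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    def link(prev_boxes, curr_boxes, thresh=0.5):
        """Greedily continue each track into the current frame by best IoU."""
        links, used = [], set()
        for i, p in enumerate(prev_boxes):
            best = max(((iou(p, c), j) for j, c in enumerate(curr_boxes)
                        if j not in used), default=(0.0, -1))
            if best[0] >= thresh:
                links.append((i, best[1]))
                used.add(best[1])
        return links

    print(link([(0, 0, 10, 10)], [(1, 1, 11, 11), (50, 50, 60, 60)]))  # [(0, 0)]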

18 of 26

Miscellaneous

  • Vertebrae Detection and Labelling in Spine MRI [15]

19 of 26

Miscellaneous

  • Semi-supervised learning with scarce annotations [16]
    • Transfer learning
    • Fitting both labelled and unlabelled data (a generic pseudo-labelling sketch follows)
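
One generic way to fit both kinds of data is pseudo-labelling (a common semi-supervised recipe, shown here with scikit-learn purely as an illustration, not the exact method of [16]):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    x_lab, y_lab = rng.normal(size=(50, 16)), rng.integers(0, 3, 50)
    x_unlab = rng.normal(size=(500, 16))                # unlabelled pool

    clf = LogisticRegression(max_iter=1000).fit(x_lab, y_lab)
    probs = clf.predict_proba(x_unlab)
    keep = probs.max(axis=1) > 0.9                      # confident predictions only
    x_aug = np.concatenate([x_lab, x_unlab[keep]])
    y_aug = np.concatenate([y_lab, probs[keep].argmax(axis=1)])
    clf = LogisticRegression(max_iter=1000).fit(x_aug, y_aug)  # retrain with pseudo-labels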

20 of 26

Miscellaneous

  • Constrained video face clustering using 1NN relations [17]
    • Friends dataset
    • Clustering and representation learning (see the sketch below)
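
A minimal sketch of clustering from 1NN relations (a simplification of the constraints in [17]): treat each track and its first nearest neighbour as must-link, then take connected components.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    rng = np.random.default_rng(0)
    emb = rng.normal(size=(20, 64))                # stand-in face-track embeddings
    d = ((emb[:, None] - emb[None]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                          # 1NN index for each track

    rows = np.arange(len(emb))
    adj = csr_matrix((np.ones(len(emb)), (rows, nn)), shape=(len(emb),) * 2)
    n_clusters, labels = connected_components(adj, directed=False)
    print(n_clusters, labels)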

21 of 26

Miscellaneous

  • Learning to discover novel visual categories via deep transfer clustering [18]
    • Extends deep embedded clustering to a transfer setting
    • Estimates the number of classes in the unlabelled data (see the sketch below)
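
The class-count estimate can be illustrated with a generic stand-in (not the authors' procedure): sweep k in k-means and keep the k with the best silhouette score.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(loc=c, size=(100, 32)) for c in (0, 5, 10)])

    scores = {}
    for k in range(2, 8):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
        scores[k] = silhouette_score(x, labels)
    print(max(scores, key=scores.get))  # estimated number of classes (3 for this toy data)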

22 of 26

Conclusion

  • Formulation of vision problems across multiple domains
  • Focus on self-supervised and semi-supervised training
  • Engineering inside the network
  • Inspiration drawn from cross-modal sensors
  • Interesting recent works and their follow-ups
  • Research jobs

23 of 26

Thank you

:)

Questions?

24 of 26

Reference

  • [1] Afouras, Triantafyllos, Andrew Owens, Joon Son Chung, and Andrew Zisserman. "Self-Supervised Learning of Audio-Visual Objects from Video." arXiv preprint arXiv:2008.04237 (2020).
  • [2] Han, Tengda, Weidi Xie, and Andrew Zisserman. "Memory-augmented dense predictive coding for video representation learning." arXiv preprint arXiv:2008.01065 (2020).
  • [3] Jakab, Tomas, Ankush Gupta, Hakan Bilen, and Andrea Vedaldi. "Unsupervised learning of object landmarks through conditional image generation." In Advances in neural information processing systems, pp. 4016-4027. 2018.
  • [4] Thewlis, James, Samuel Albanie, Hakan Bilen, and Andrea Vedaldi. "Unsupervised learning of landmarks by descriptor vector exchange." In Proceedings of the IEEE International Conference on Computer Vision, pp. 6361-6371. 2019.
  • [5] Wiles, Olivia, A. Sophia Koepke, and Andrew Zisserman. "X2face: A network for controlling face generation using images, audio, and pose codes." In Proceedings of the European conference on computer vision (ECCV), pp. 670-686. 2018.
  • [6] Wiles, Olivia, A. Sophia Koepke, and Andrew Zisserman. "Self-supervised learning of a facial attribute embedding from video." arXiv preprint arXiv:1808.06882 (2018).
  • [7] Lai, Zihang, and Weidi Xie. "Self-supervised learning for video correspondence flow." arXiv preprint arXiv:1905.00875 (2019).
  • [8] Asano, Yuki M., Mandela Patrick, Christian Rupprecht, and Andrea Vedaldi. "Labelling unlabelled videos from scratch with multi-modal self-supervision." arXiv preprint arXiv:2006.13662 (2020).
  • [9] Nagrani, Arsha, Joon Son Chung, Samuel Albanie, and Andrew Zisserman. "Disentangled speech embeddings using cross-modal self-supervision." In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6829-6833. IEEE, 2020.

25 of 26

References

  • [10] Nagrani, Arsha, Samuel Albanie, and Andrew Zisserman. "Seeing voices and hearing faces: Cross-modal biometric matching." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8427-8436. 2018.
  • [11] Nagrani, Arsha, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, and Andrew Zisserman. "Speech2Action: Cross-modal Supervision for Action Recognition." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10317-10326. 2020.
  • [12] Xie, Weidi, Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. "Utterance-level aggregation for speaker recognition in the wild." In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5791-5795. IEEE, 2019.
  • [13] Mahendran, Aravindh, and Andrea Vedaldi. "Understanding deep image representations by inverting them." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5188-5196. 2015.
  • [14] Ulyanov, Dmitry, Andrea Vedaldi, and Victor Lempitsky. "Deep image prior." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446-9454. 2018.
  • [15] Windsor, Rhydian, Amir Jamaludin, Timor Kadir, and Andrew Zisserman. "A Convolutional Approach to Vertebrae Detection and Labelling in Whole Spine MRI." In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 712-722. Springer, Cham, 2020.
  • [16] Rebuffi, Sylvestre-Alvise, Sebastien Ehrhardt, Kai Han, Andrea Vedaldi, and Andrew Zisserman. "Semi-supervised learning with scarce annotations." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 762-763. 2020.
  • [17] Kalogeiton, Vicky, and Andrew Zisserman. "Constrained video face clustering using 1NN relations." (2020).
  • [18] Han, Kai, Andrea Vedaldi, and Andrew Zisserman. "Learning to discover novel visual categories via deep transfer clustering." In Proceedings of the IEEE International Conference on Computer Vision, pp. 8401-8409. 2019.

26 of 26

References

  • [19] Liu, Yang, Qingchao Chen, and Andrew Zisserman. "Amplifying key cues for human-object-interaction detection." In European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science. Springer, 2020.
  • [20] Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Detect to track and track to detect." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3038-3046. 2017.