1 of 26

Visual Geometry Group

Presented by: Zahid Hasan

MPSC lab, IS Department, UMBC

2 of 26

Objective

  • State-of-the-art computer vision research
  • Latest techniques and methodologies
  • Open problems in vision research
  • Future research directions
  • Integration with our work, or new research ideas

3 of 26

People

  • Professor Andrew Zisserman
  • Professor Andrea Vedaldi
  • Research Fellows
  • Research Students
  • Extensive collaborations
  • ….

4 of 26

Research Overview

  • Self-supervised learning
  • Audio-visual learning
  • Understanding and training CNNs
  • Search and retrieval of images and video
  • Video-based recognition and understanding
  • Counting, detection, reading and tracking
  • Art recognition and search
  • Miscellaneous

5 of 26

Self-supervised learning

  • Self-supervised learning of audio-visual objects from video [1]
    • Sound source separation and speaker tracking
    • Correcting misaligned audio and video
    • Trained on unlabelled data (see the sketch below)

Figure: Audio-visual object detection [1]
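
A minimal sketch of the contrastive correspondence objective that underlies much of this audio-visual work (my simplification, not the exact model in [1]; the encoders that produce the embeddings are assumed):

    import torch
    import torch.nn.functional as F

    def av_contrastive_loss(video_emb, audio_emb, temperature=0.07):
        """video_emb, audio_emb: (batch, dim) clip embeddings."""
        video_emb = F.normalize(video_emb, dim=1)
        audio_emb = F.normalize(audio_emb, dim=1)
        logits = video_emb @ audio_emb.t() / temperature  # (B, B) similarities
        targets = torch.arange(video_emb.size(0))         # matched pairs on the diagonal
        # symmetric InfoNCE: video-to-audio and audio-to-video
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    v, a = torch.randn(8, 128), torch.randn(8, 128)       # toy stand-in embeddings
    print(av_contrastive_loss(v, a))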

6 of 26

SSL

  • Memory-augmented dense predictive coding (DPC) for video representation learning [2]
    • Learns representations of activity
    • Pretraining
    • Evaluated on four downstream tasks (see the sketch below)

Figure: Memory-augmented predictive coding from the input video [2]
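
A minimal sketch of the predictive-coding idea (heavily simplified; the memory bank and dense spatial predictions of [2] are omitted): aggregate past clip features, predict the future embedding, and score it contrastively against the other clips in the batch.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PredictiveCoder(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            self.rnn = nn.GRU(dim, dim, batch_first=True)  # aggregates past clip features
            self.predict = nn.Linear(dim, dim)             # predicts the future embedding

        def forward(self, past, future):
            """past: (B, T, dim) clip features; future: (B, dim) true next-clip features."""
            _, h = self.rnn(past)
            pred = self.predict(h[-1])                     # (B, dim) predicted future
            logits = F.normalize(pred, dim=1) @ F.normalize(future, dim=1).t()
            targets = torch.arange(past.size(0))           # positives on the diagonal
            return F.cross_entropy(logits / 0.07, targets)

    model = PredictiveCoder()
    print(model(torch.randn(4, 5, 256), torch.randn(4, 256)))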

7 of 26

SSL

  • Unsupervised learning of object landmarks through conditional image generation [3]
    • Image generation under a geometric landmark constraint (sketch below)
  • Unsupervised learning of landmarks by descriptor vector exchange (DVE) [4]
    • Unsupervised landmarks from high-dimensional descriptors

Figure: Landmark detection from images [3]
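
The geometric constraint works through a landmark bottleneck: a network predicts K heatmaps, and a differentiable soft-argmax reduces each one to (x, y) coordinates that condition the image generator. A minimal sketch of that soft-argmax step (the surrounding networks are assumed):

    import torch

    def soft_argmax(heatmaps):
        """heatmaps: (B, K, H, W) -> landmark coordinates (B, K, 2) in [0, 1]."""
        b, k, h, w = heatmaps.shape
        probs = heatmaps.reshape(b, k, -1).softmax(dim=-1).reshape(b, k, h, w)
        ys = torch.linspace(0, 1, h).view(1, 1, h, 1)
        xs = torch.linspace(0, 1, w).view(1, 1, 1, w)
        y = (probs * ys).sum(dim=(2, 3))  # expected row of each landmark
        x = (probs * xs).sum(dim=(2, 3))  # expected column of each landmark
        return torch.stack([x, y], dim=-1)

    print(soft_argmax(torch.randn(2, 10, 64, 64)).shape)  # torch.Size([2, 10, 2])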

8 of 26

SSL

  • Self-supervised learning from watching faces
    • X2Face: Controlling face generation [5]
    • FAb-Net: Facial attributes embedding [6]

Figure: FAb-Net [6]

9 of 26

SSL

  • Self-supervised learning for video correspondence flow [7]
    • Coins the term "correspondence flow" (see the sketch below)

Figure: Results on the DAVIS dataset [7]
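
A minimal sketch of the core mechanic, propagating labels through a learned frame-to-frame affinity (a simplification of [7]; the per-pixel features come from an assumed trained encoder):

    import torch
    import torch.nn.functional as F

    def propagate(ref_feat, tgt_feat, ref_labels, temperature=0.07):
        """ref_feat, tgt_feat: (C, N) per-pixel features; ref_labels: (K, N) masks."""
        ref_feat = F.normalize(ref_feat, dim=0)
        tgt_feat = F.normalize(tgt_feat, dim=0)
        affinity = (ref_feat.t() @ tgt_feat) / temperature  # (N_ref, N_tgt)
        weights = affinity.softmax(dim=0)  # each target pixel attends to reference pixels
        return ref_labels @ weights        # (K, N_tgt) propagated labels

    out = propagate(torch.randn(64, 100), torch.randn(64, 100), torch.rand(2, 100))
    print(out.shape)  # torch.Size([2, 100])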

10 of 26

Audio-visual learning

  • Labelling unlabelled videos with multi-modal self-supervision [8]
    • Combines video and audio!
    • Unsupervised clustering (a naive baseline is sketched below)
    • Correspondence between audio and video

Figure: Processing audio and video simultaneously for clustering [8]
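
The paper derives cluster labels jointly from both modalities; as a deliberately naive stand-in (not the method in [8]), one can simply cluster concatenated audio and video embeddings:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    video_emb = rng.normal(size=(500, 128))  # stand-in video embeddings
    audio_emb = rng.normal(size=(500, 128))  # stand-in audio embeddings
    joint = np.concatenate([video_emb, audio_emb], axis=1)
    labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(joint)
    print(np.bincount(labels))  # cluster sizes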

11 of 26

Audio-visual learning

  • Disentangled speech embeddings using cross-modal self-supervision [9]
    • Trained on unlabelled video with voice-only data
    • VoxCeleb dataset

Figure: Training strategy for audio-visual representation [9]

12 of 26

Audio-visual learning

  • Seeing voices and hearing faces: cross-modal biometric matching [10]
    • Matching a voice to a face and vice versa (see the sketch below)

Figure: Task description [10]
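
A minimal sketch of the forced-choice matching task: given a voice embedding and two candidate face embeddings, one of which shares the identity, pick the face with the higher cosine similarity. The pretrained encoders that produce the embeddings are assumed.

    import torch
    import torch.nn.functional as F

    def match(voice_emb, face_a, face_b):
        """All inputs: (dim,) embeddings from assumed pretrained encoders."""
        sim_a = F.cosine_similarity(voice_emb, face_a, dim=0)
        sim_b = F.cosine_similarity(voice_emb, face_b, dim=0)
        return 'A' if sim_a > sim_b else 'B'

    print(match(torch.randn(256), torch.randn(256), torch.randn(256)))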

13 of 26

Audio-visual learning

  • Speech2Action: cross-modal supervision for action recognition [11]
    • Learns from the co-occurrence of actions and speech in video

14 of 26

Audio-visual learning

  • Utterance-level aggregation for speaker recognition in the wild [12]
    • Speaker recognition on the VoxCeleb dataset (a simplified pooling sketch follows)
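
The paper aggregates frame-level features with GhostVLAD; a simpler, hedged stand-in for that step is attentive pooling over a variable-length utterance:

    import torch
    import torch.nn as nn

    class AttentivePool(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            self.score = nn.Linear(dim, 1)  # per-frame importance score

        def forward(self, frames):
            """frames: (B, T, dim) frame-level speaker features."""
            weights = self.score(frames).softmax(dim=1)  # (B, T, 1) normalised over time
            return (weights * frames).sum(dim=1)         # (B, dim) utterance embedding

    pool = AttentivePool()
    print(pool(torch.randn(2, 300, 512)).shape)  # torch.Size([2, 512])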

15 of 26

Understanding CNN

  • Understanding deep image representations by inverting them [13]
    • Reconstructing the original image from its deep representation
  • Deep image prior [14]
    • Image reconstruction, inpainting, and super-resolution (see the sketch below)
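
A minimal sketch of the deep image prior idea: fit a randomly initialised CNN to a single corrupted image and stop early; the network's structural bias does the restoration. The tiny network below is only a stand-in for the hourglass architecture in [14].

    import torch
    import torch.nn as nn

    net = nn.Sequential(  # tiny stand-in for the paper's hourglass network
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
    )
    z = torch.randn(1, 32, 64, 64)                   # fixed random input code
    corrupted = torch.rand(1, 3, 64, 64)             # stand-in corrupted image
    mask = (torch.rand(1, 3, 64, 64) > 0.5).float()  # known-pixel mask (inpainting)

    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for step in range(200):                          # early stopping is the prior
        opt.zero_grad()
        loss = ((net(z) - corrupted) ** 2 * mask).mean()  # fit observed pixels only
        loss.backward()
        opt.step()
    restored = net(z).detach()                       # network output is the restored image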

16 of 26

Visual detection

  • Amplifying key cues for human-object interaction detection [19]
    • Encoding of semantic information
    • Predicting future interactions
    • Gated memory and fusion of information

17 of 26

Visual detection and tracking

  • Detect to track and track to detect [20]
    • RPN on top of video features
    • RoI tracking across frames (a simplified linking sketch follows)
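
A minimal sketch of the linking step only (greedy IoU matching across frames; the paper's learned tracking regressor is not shown):

    def iou(a, b):
        """a, b: boxes as (x1, y1, x2, y2)."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    def link(prev_boxes, curr_boxes, thresh=0.5):
        """Greedily continue each track into the current frame by best IoU."""
        links, used = [], set()
        for i, p in enumerate(prev_boxes):
            best = max(((iou(p, c), j) for j, c in enumerate(curr_boxes)
                        if j not in used), default=(0.0, -1))
            if best[0] >= thresh:
                links.append((i, best[1]))
                used.add(best[1])
        return links

    print(link([(0, 0, 10, 10)], [(1, 1, 11, 11), (50, 50, 60, 60)]))  # [(0, 0)]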

18 of 26

Miscellaneous

  • Vertebrae Detection and Labelling in Spine MRI [15]

19 of 26

Miscellaneous

  • Semi-supervised learning with scarce annotations [16]
    • Transfer learning
    • Fitting both labelled and unlabelled data (a generic pseudo-labelling sketch follows)
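
One generic way to fit both kinds of data is pseudo-labelling (a common semi-supervised recipe, shown here with scikit-learn purely as an illustration, not the exact method of [16]):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    x_lab, y_lab = rng.normal(size=(50, 16)), rng.integers(0, 3, 50)
    x_unlab = rng.normal(size=(500, 16))                # unlabelled pool

    clf = LogisticRegression(max_iter=1000).fit(x_lab, y_lab)
    probs = clf.predict_proba(x_unlab)
    keep = probs.max(axis=1) > 0.9                      # confident predictions only
    x_aug = np.concatenate([x_lab, x_unlab[keep]])
    y_aug = np.concatenate([y_lab, probs[keep].argmax(axis=1)])
    clf = LogisticRegression(max_iter=1000).fit(x_aug, y_aug)  # retrain with pseudo-labels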

20 of 26

Miscellaneous

  • Constrained video face clustering using 1NN relations [17]
    • Friends dataset
    • Clustering and representation learning (see the sketch below)
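
A minimal sketch of clustering from 1NN relations (a simplification of the constraints in [17]): treat each track and its first nearest neighbour as must-link, then take connected components.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    rng = np.random.default_rng(0)
    emb = rng.normal(size=(20, 64))                # stand-in face-track embeddings
    d = ((emb[:, None] - emb[None]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                          # 1NN index for each track

    rows = np.arange(len(emb))
    adj = csr_matrix((np.ones(len(emb)), (rows, nn)), shape=(len(emb),) * 2)
    n_clusters, labels = connected_components(adj, directed=False)
    print(n_clusters, labels)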

21 of 26

Miscellaneous

  • Learning to discover novel visual categories via deep transfer clustering [18]
    • Extends deep embedded clustering to a transfer setting
    • Estimates the number of classes in the unlabelled data (see the sketch below)
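
The class-count estimate can be illustrated with a generic stand-in (not the authors' procedure): sweep k in k-means and keep the k with the best silhouette score.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(loc=c, size=(100, 32)) for c in (0, 5, 10)])

    scores = {}
    for k in range(2, 8):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
        scores[k] = silhouette_score(x, labels)
    print(max(scores, key=scores.get))  # estimated number of classes (3 for this toy data)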

22 of 26

Conclusion

  • Formulation of vision problems across multiple domains
  • Focus on self-supervised and semi-supervised training
  • Engineering inside the network
  • Inspiration drawn from cross-modal sensors
  • Interesting recent works and their follow-ups
  • Research jobs

23 of 26

Thank you

:)

Questions?

24 of 26

Reference

  • [1] Afouras, Triantafyllos, Andrew Owens, Joon Son Chung, and Andrew Zisserman. "Self-Supervised Learning of Audio-Visual Objects from Video." arXiv preprint arXiv:2008.04237 (2020).
  • [2] Han, Tengda, Weidi Xie, and Andrew Zisserman. "Memory-augmented dense predictive coding for video representation learning." arXiv preprint arXiv:2008.01065 (2020).
  • [3] Jakab, Tomas, Ankush Gupta, Hakan Bilen, and Andrea Vedaldi. "Unsupervised learning of object landmarks through conditional image generation." In Advances in neural information processing systems, pp. 4016-4027. 2018.
  • [4] Thewlis, James, Samuel Albanie, Hakan Bilen, and Andrea Vedaldi. "Unsupervised learning of landmarks by descriptor vector exchange." In Proceedings of the IEEE International Conference on Computer Vision, pp. 6361-6371. 2019.
  • [5] Wiles, Olivia, A. Sophia Koepke, and Andrew Zisserman. "X2face: A network for controlling face generation using images, audio, and pose codes." In Proceedings of the European conference on computer vision (ECCV), pp. 670-686. 2018.
  • [6] Wiles, Olivia, A. Sophia Koepke, and Andrew Zisserman. "Self-supervised learning of a facial attribute embedding from video." arXiv preprint arXiv:1808.06882 (2018).
  • [7] Lai, Zihang, and Weidi Xie. "Self-supervised learning for video correspondence flow." arXiv preprint arXiv:1905.00875 (2019).
  • [8] Asano, Yuki M., Mandela Patrick, Christian Rupprecht, and Andrea Vedaldi. "Labelling unlabelled videos from scratch with multi-modal self-supervision." arXiv preprint arXiv:2006.13662 (2020).
  • [9] Nagrani, Arsha, Joon Son Chung, Samuel Albanie, and Andrew Zisserman. "Disentangled speech embeddings using cross-modal self-supervision." In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6829-6833. IEEE, 2020.

25 of 26

References

  • [10] Nagrani, Arsha, Samuel Albanie, and Andrew Zisserman. "Seeing voices and hearing faces: Cross-modal biometric matching." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8427-8436. 2018.
  • [11] Nagrani, Arsha, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, and Andrew Zisserman. "Speech2Action: Cross-modal Supervision for Action Recognition." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10317-10326. 2020.
  • [12] Xie, Weidi, Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. "Utterance-level aggregation for speaker recognition in the wild." In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5791-5795. IEEE, 2019.
  • [13] Mahendran, Aravindh, and Andrea Vedaldi. "Understanding deep image representations by inverting them." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5188-5196. 2015.
  • [14] Ulyanov, Dmitry, Andrea Vedaldi, and Victor Lempitsky. "Deep image prior." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446-9454. 2018.
  • [15] Windsor, Rhydian, Amir Jamaludin, Timor Kadir, and Andrew Zisserman. "A Convolutional Approach to Vertebrae Detection and Labelling in Whole Spine MRI." In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 712-722. Springer, Cham, 2020.
  • [16] Rebuffi, Sylvestre-Alvise, Sebastien Ehrhardt, Kai Han, Andrea Vedaldi, and Andrew Zisserman. "Semi-supervised learning with scarce annotations." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 762-763. 2020.
  • [17] Kalogeiton, Vicky, and Andrew Zisserman. "Constrained video face clustering using 1NN relations." (2020).
  • [18] Han, Kai, Andrea Vedaldi, and Andrew Zisserman. "Learning to discover novel visual categories via deep transfer clustering." In Proceedings of the IEEE International Conference on Computer Vision, pp. 8401-8409. 2019.

26 of 26

References

  • [19] Liu, Yang, Qingchao Chen, and Andrew Zisserman. "Amplifying key cues for human-object-interaction detection." In European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science. Springer, 2020.
  • [20] Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Detect to track and track to detect." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3038-3046. 2017.