The talk will describe self-supervised learning from videos with sound and will be divided into two parts. The first part will describe self-supervised learning from the visual stream alone (no sound), and will show that it is possible to learn powerful embeddings for tasks such as facial attribute prediction and human action recognition. This requires defining a proxy loss, so that a deep network trained with this loss has to solve the task of interest. The second part will explore multi-modal self-supervised learning from video and audio. We investigate two proxy loss functions, synchronization and correspondence, to link the modalities. We show that, in addition to training networks to encode images and audio, we get for free a number of functionalities including: active speaker identification; audio-visual speech enhancement; and localizing objects by their sound.
Andrew Zisserman is one of the principal architects of modern computer vision. His work in the 1980s on surface reconstruction with discontinuities is widely cited. He is best known for his leading role during the 1990s in establishing the computational theory of multiple view reconstruction and in developing practical algorithms that remain in wide use today. This culminated in the publication, in 2000, of his book with Richard Hartley, already regarded as a standard text. His laboratory in Oxford is internationally renowned, and its work is currently shedding new light on the problems of object detection and recognition.