1 of 20

CityGuessr: City-Level Video Geo-Localization on a Global Scale

Parth Parag Kulkarni1, Dr. Gaurav Kumar Nayak2, Dr. Mubarak Shah1

1Center for Research in Computer Vision, University of Central Florida

2Mehta Family School of DS & AI, Indian Institute of Technology Roorkee, India

2 of 20

Preview

  • We formulate a novel problem of worldwide video geolocalization

  • We introduce the first global-scale video dataset, ‘CityGuessr68k’ (68,269 videos, 166 cities)

  • We propose a transformer-based architecture with two primary components:
    • a Self-Cross Attention module for incorporating scene information
    • a TextLabel Alignment strategy for distilling knowledge from text labels in feature space

  • We report performance results on the CityGuessr68k and Mapillary (MSLS) [1] datasets

3 of 20

CityGuessr68k Dataset

  • ∼68,000 first-person driving and walking videos from 166 cities around the world

  • Annotated with hierarchical location labels (city, state/province, country, continent)

  • Serves as the primary benchmarking dataset for this task

4 of 20

Problem Statement

“Given an input video, determine which city in the world the video was recorded in”

  • Consequently, determine the state/province, country, and continent as well

5 of 20

Model

6 of 20

Model – Encoder Backbone and Classifiers

  • VideoMAE [2] encoder backbone

  • Pretrained on Kinetics-400 [3]

  • Outputs a 384-dimensional feature vector

  • The feature is passed into a separate classifier for each hierarchy

  • All classifier outputs are used to compute the geolocalization loss
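The steps above can be sketched in NumPy: one shared backbone feature feeds a linear classifier head per hierarchy, and the per-hierarchy cross-entropy terms are summed into a single geolocalization loss. Everything here is illustrative — the head structure, the class counts other than the 166 cities, and the loss weighting are hypothetical, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Class counts per hierarchy; only the 166 cities come from the dataset,
# the other counts are placeholders for the sketch.
NUM_CLASSES = {"city": 166, "state": 100, "country": 50, "continent": 6}
FEAT_DIM = 384  # VideoMAE backbone output size

# One linear classifier head per hierarchy (random weights, untrained).
heads = {h: rng.normal(0, 0.02, (FEAT_DIM, n)) for h, n in NUM_CLASSES.items()}

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def geolocalization_loss(feature, labels):
    """Sum of cross-entropy losses over all four hierarchy heads."""
    loss = 0.0
    for h, W in heads.items():
        probs = softmax(feature @ W)          # class distribution for hierarchy h
        loss += -np.log(probs[labels[h]] + 1e-12)
    return loss

feature = rng.normal(size=FEAT_DIM)           # stand-in for the VideoMAE feature
labels = {"city": 3, "state": 7, "country": 2, "continent": 1}
loss = geolocalization_loss(feature, labels)
```

Summing unweighted cross-entropy terms is the simplest choice; in practice the hierarchy losses could be weighted differently.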

7 of 20

Model

8 of 20

Model – Scene Recognition

  • Scene recognition serves as an auxiliary task

  • For scene identification, we fuse the knowledge from all 4 hierarchies

  • Tokens of each hierarchy undergo:
    • self-attention with themselves
    • cross-attention with the tokens of the other hierarchies

  • The output is used to compute the scene loss
  • We use soft scene labels to capture scene knowledge from all frames of a video
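A minimal NumPy sketch of the self-plus-cross attention pattern described above: each hierarchy token attends to itself and, separately, to the tokens of the other hierarchies, and the two results are fused. The single-head attention without learned projections and the simple additive fusion are simplifying assumptions, not the module's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 384  # token dimension (matches the backbone feature size)

def attention(q, k, v):
    """Scaled dot-product attention (single head, no projections, for brevity)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

# One token per hierarchy.
tokens = {h: rng.normal(size=(1, D)) for h in ["city", "state", "country", "continent"]}

def self_cross_attention(tokens):
    out = {}
    for h, t in tokens.items():
        self_out = attention(t, t, t)                  # self-attention with itself
        others = np.vstack([u for g, u in tokens.items() if g != h])
        cross_out = attention(t, others, others)       # cross-attention with the rest
        out[h] = self_out + cross_out                  # fuse (simple sum here)
    return out

fused = self_cross_attention(tokens)
```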

9 of 20

Model

10 of 20

Model – TextLabel Alignment Strategy

  • Associating a name with a picture/video of a location helps humans localize

  • We distill knowledge from the text labels of all hierarchies into the model’s features
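One simple way such an alignment could be realized is a cosine-similarity loss that pulls a model feature toward the embedding of its text label. This is a hedged sketch: the random "text embedding" stands in for whatever text encoder the method actually uses, and the specific loss form is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 384

def cosine_alignment_loss(video_feat, text_feat):
    """1 - cosine similarity: zero when the two features point the same way."""
    denom = np.linalg.norm(video_feat) * np.linalg.norm(text_feat) + 1e-12
    return 1.0 - (video_feat @ text_feat) / denom

video_feat = rng.normal(size=D)   # model feature for one hierarchy
text_feat = rng.normal(size=D)    # stand-in embedding of a text label, e.g. a city name
loss = cosine_alignment_loss(video_feat, text_feat)
```

Minimizing this term over training would push the visual feature space toward the structure of the label embeddings, one loss term per hierarchy.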

11 of 20

Inference – Hierarchical Evaluation

  • Predictions for fine-grained hierarchies can be improved with the assistance of coarser hierarchies

  • Independent:
    • Predict every hierarchy independently

  • Codependent:
    • Predict only the finest hierarchy
    • Then backtrace the coarser hierarchy predictions

(Hierarchy diagram: Country → State/Province → City)
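Codependent evaluation can be sketched as a chain of lookups: only the city is predicted by the model, and the coarser labels follow deterministically from the hierarchy. The lookup tables and city names below are illustrative placeholders, not the dataset's actual label maps.

```python
# Hypothetical hierarchy lookup tables; entries are illustrative only.
CITY_TO_STATE = {"Orlando": "Florida", "Mumbai": "Maharashtra"}
STATE_TO_COUNTRY = {"Florida": "USA", "Maharashtra": "India"}
COUNTRY_TO_CONTINENT = {"USA": "North America", "India": "Asia"}

def codependent_predict(city_pred):
    """Backtrace coarser labels from the finest (city) prediction."""
    state = CITY_TO_STATE[city_pred]
    country = STATE_TO_COUNTRY[state]
    continent = COUNTRY_TO_CONTINENT[country]
    return {"city": city_pred, "state": state,
            "country": country, "continent": continent}

pred = codependent_predict("Mumbai")
```

Backtracing guarantees the four predictions are mutually consistent, which independent per-hierarchy prediction does not.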

12 of 20

Experiments, Results and Discussion

13 of 20

Image vs. Video

14 of 20

Impact of Proposed Models

15 of 20

Independent vs. Codependent Hierarchical Evaluation

16 of 20

Comparison with State-of-the-art

17 of 20

Performance on Mapillary (MSLS)

18 of 20

Examples of Localizations – City Correct

19 of 20

Conclusion

  • We formulated a novel problem of worldwide video geolocalization

  • We introduced a new global-scale video dataset, CityGuessr68k, containing 68,269 videos from 166 cities

  • We also proposed a baseline approach consisting of
    • a Self-Cross Attention module for incorporating an auxiliary task of scene recognition
    • a TextLabel Alignment strategy to distill knowledge from location labels in feature space

  • We demonstrated the efficacy of our method on our dataset as well as on the Mapillary (MSLS) [1] dataset

  • Future direction
    • Explore the generalizability of the combination of the Self-Cross Attention module and TextLabel Alignment to other hierarchical video classification tasks

20 of 20

References

[1] Warburg, F., Hauberg, S., Lopez-Antequera, M., Gargallo, P., Kuang, Y., Civera, J.: Mapillary street-level sequences: A dataset for lifelong place recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2626–2635 (2020)

[2] Tong, Z., Song, Y., Wang, J., Wang, L.: VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems 35, 10078–10093 (2022)

[3] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)