CityGuessr: City-Level Video Geo-Localization on a Global Scale
Parth Parag Kulkarni¹, Dr. Gaurav Kumar Nayak², Dr. Mubarak Shah¹
¹ Center for Research in Computer Vision, University of Central Florida
² Mehta Family School of DS & AI, Indian Institute of Technology Roorkee, India
Preview
CityGuessr68k Dataset
Problem Statement
“Given an input video, determine which city in the world it was recorded in.”
Model
Model – Encoder Backbone and Classifiers
Model – Scene Recognition
Model – TextLabel Alignment Strategy
Inference – Hierarchical Evaluation
Country
State/Province
City
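The three levels above (country, state/province, city) can be combined at inference time. The following is a minimal sketch, not the authors' implementation: it assumes one classifier per level that outputs a probability per class, and it assumes the combined score of a city is its own probability multiplied by the probabilities of its parent state/province and country. The hierarchy table `PARENTS`, the function names, and the example probabilities are all hypothetical.

```python
# Hypothetical toy hierarchy: city -> (state/province, country).
PARENTS = {
    "Orlando": ("Florida", "United States"),
    "Miami": ("Florida", "United States"),
    "Mumbai": ("Maharashtra", "India"),
}


def combined_city_scores(p_city, p_state, p_country, parents=PARENTS):
    """Multiply each city's probability by its parents' probabilities.

    Assumption: this product rule is one plausible way to make the
    levels depend on each other; the slides do not specify the rule.
    """
    scores = {}
    for city, prob in p_city.items():
        state, country = parents[city]
        scores[city] = prob * p_state[state] * p_country[country]
    return scores


def predict_hierarchy(p_city, p_state, p_country, parents=PARENTS):
    """Pick the best-scoring city and read its parents off the hierarchy."""
    scores = combined_city_scores(p_city, p_state, p_country, parents)
    city = max(scores, key=scores.get)
    state, country = parents[city]
    return city, state, country


# Made-up classifier outputs for a single video.
p_city = {"Orlando": 0.40, "Miami": 0.15, "Mumbai": 0.45}
p_state = {"Florida": 0.70, "Maharashtra": 0.30}
p_country = {"United States": 0.70, "India": 0.30}

independent_city = max(p_city, key=p_city.get)
combined_prediction = predict_hierarchy(p_city, p_state, p_country)
```

In this made-up example the city classifier alone slightly prefers Mumbai, while the combined score prefers Orlando because the state and country classifiers both favor its parents. Reading the coarser labels off the chosen city also keeps the three predictions consistent with one another.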
Experiments, Results and Discussion
Image vs. Video
Impact of Proposed Models
Independent vs. Codependent Hierarchical Evaluation
Comparison with State-of-the-Art
Performance on Mapillary (MSLS)
Examples of Localizations – city correct
Conclusion
References
[1] Warburg, F., Hauberg, S., Lopez-Antequera, M., Gargallo, P., Kuang, Y., Civera, J.: Mapillary street-level sequences: A dataset for lifelong place recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2626–2635 (2020)
[2] Tong, Z., Song, Y., Wang, J., Wang, L.: VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems 35, 10078–10093 (2022)
[3] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)