

Large Scale Camera Array Calibration via SfM

Sanjana Gunna, Gaini Kussainova, Wei Pu, Shubham Garg

OVERVIEW

  • Mugsy: a high-end multi-view capture system
  • Captures synchronized multi-view videos of facial expressions exhibited by subjects sitting inside the dome
  • Applications: facial reconstruction, virtual human generation (photo-realistic avatars)

Figure: Mugsy v1 and Mugsy v2 (images from the Multiface dataset)

Dataset Statistics

  • The Multiface dataset contains an average of 12,200 frames per subject (v1), captured at 30 fps; each frame has 40 (v1) or 60 (v2) different camera views.


INTRODUCTION

  • Calibrating the cameras of a dome setup (40-150 cameras) with physical calibration objects is time-consuming!

OUR METHOD

  • Build an auto-calibration system capable of determining the internal camera parameters directly from multiple uncalibrated images.
  • Integrate it into a Structure-from-Motion (SfM) pipeline to achieve better calibration precision than computing calibration data with a special calibration object. A minimal pipeline sketch follows.
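As a rough illustration of such a pipeline, the sketch below uses pycolmap (COLMAP's Python bindings) rather than our internal tooling; the paths are placeholders, and exhaustive matching is an assumption that fits an array of this size. COLMAP refines the intrinsics during bundle adjustment when they are not held fixed, which is the self-calibration behaviour described above.

```python
# Minimal auto-calibration sketch with pycolmap (paths are hypothetical).
import pathlib
import pycolmap

database_path = "mugsy.db"
image_dir = "frames/"
output_dir = pathlib.Path("sparse/")
output_dir.mkdir(exist_ok=True)

# 1. Detect local features (SIFT by default) in every view.
pycolmap.extract_features(database_path, image_dir)

# 2. Exhaustive pairwise matching; feasible for a 40-150 camera array.
pycolmap.match_exhaustive(database_path)

# 3. Incremental SfM: estimates poses and sparse structure while refining
#    the intrinsics (fx, fy, cx, cy) in bundle adjustment (self-calibration).
maps = pycolmap.incremental_mapping(database_path, image_dir, output_dir)

# Inspect the self-calibrated intrinsics of the first reconstruction.
for camera_id, camera in maps[0].cameras.items():
    print(camera_id, camera.params)  # e.g. [fx, fy, cx, cy] for PINHOLE
```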

References:

  1. Source of images: Ha et al. “Deltille Grids for Geometric Camera Calibration.” ICCV 2017. https://openaccess.thecvf.com/content_ICCV_2017/papers/Ha_Deltille_Grids_for_ICCV_2017_paper.pdf
  2. Wuu et al. “Multiface: A Dataset for Neural Face Rendering.” arXiv:2207.11243.
  3. Lindenberger, Philipp, et al. “Pixel-Perfect Structure-from-Motion with Featuremetric Refinement.” ICCV 2021: 5967-5977.
FUTURE DIRECTIONS

  • Non-Rigid SfM: Multiface captures are video clips (see Dataset Statistics); running non-rigid SfM over these clips might help improve performance over the rigid SfM baselines.
  • Better Features: the pre-trained deep learning models we used are predominantly trained on datasets limited to places, architecture, etc. Fine-tuning such models on facial data could capture better, more semantically meaningful facial keypoints and improve the baselines.

RESULTS

Data Initialization

  • Fixed extrinsics (for all the camera views)
  • Noisy intrinsics (Gaussian noise added to the ground-truth intrinsics; sketched below)
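As a concrete illustration of this initialization, the sketch below perturbs a ground-truth calibration matrix with Gaussian noise; the noise scale and the example intrinsics values are assumptions, not the values used in our experiments.

```python
import numpy as np

def make_noisy_intrinsics(K_gt, sigma=0.05, rng=None):
    """Perturb fx, fy, cx, cy of a 3x3 calibration matrix with Gaussian
    noise proportional to each entry's magnitude (sigma is an assumed scale)."""
    rng = rng or np.random.default_rng()
    K = K_gt.astype(float).copy()
    for i, j in [(0, 0), (1, 1), (0, 2), (1, 2)]:  # fx, fy, cx, cy
        K[i, j] += rng.normal(0.0, sigma * abs(K[i, j]))
    return K

# Illustrative ground-truth intrinsics (not Mugsy's actual calibration).
K_gt = np.array([[7000.0,    0.0, 1334.0],
                 [   0.0, 7000.0, 2048.0],
                 [   0.0,    0.0,    1.0]])
K_init = make_noisy_intrinsics(K_gt)  # starting point for refinement
```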


  • From our experiments, we observed that the estimated parameters depend heavily on both the quality and the quantity of the features.

  • SIFT feature matches alone are not good enough (a baseline matching sketch follows this list).

  • Deep learning-based features, such as SuperPoint + SuperGlue or facial landmarks (Meta detector), could improve performance.

  • Analyzing the sensitivity of the parameters (focal lengths [fx, fy] vs. principal point [cx, cy]) can guide us in improving the accuracy of the predictions.
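For reference, the SIFT baseline above corresponds to standard OpenCV matching with Lowe's ratio test, roughly as sketched below; the paths, feature cap, and ratio threshold are illustrative placeholders. Learned matchers such as SuperPoint + SuperGlue would replace this stage.

```python
import cv2

def sift_matches(path_a, path_b, ratio=0.75):
    """Match two views with SIFT + Lowe's ratio test (hypothetical paths)."""
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create(nfeatures=5000)  # cap matching the 5k SIFT figure below
    kp_a, desc_a = sift.detectAndCompute(img_a, None)
    kp_b, desc_b = sift.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    # Keep a match only if it clearly beats the second-best candidate.
    good = [m for m, n in matcher.knnMatch(desc_a, desc_b, k=2)
            if m.distance < ratio * n.distance]
    return kp_a, kp_b, good
```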

Table 1: Analysis of the sensitivity of parameters

| Experiment    | L1 (fx) | L1 (fy) | L1 (cx) | L1 (cy) | Reprojection Error |
|---------------|---------|---------|---------|---------|--------------------|
| Refine cx, cy | -       | -       | 6.349   | 6.458   | 3.093              |
| Refine fx, fy | 3.764   | 2.927   | -       | -       | 0.180              |
| Refine all    | 89.441  | 80.637  | 4.015   | 5.278   | 5.961              |
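The L1 columns in Tables 1 and 2 are absolute errors of the refined intrinsics against ground truth; one way such metrics might be computed (averaging over cameras) is sketched below. The dict-of-3x3-matrices layout is a hypothetical structure for illustration.

```python
import numpy as np

def l1_intrinsics(estimated, ground_truth):
    """Mean absolute error of fx, fy, cx, cy over all cameras.
    Both arguments are hypothetical {camera_id: 3x3 K matrix} dicts."""
    index = {"fx": (0, 0), "fy": (1, 1), "cx": (0, 2), "cy": (1, 2)}
    return {
        name: float(np.mean([abs(estimated[c][i, j] - ground_truth[c][i, j])
                             for c in estimated]))
        for name, (i, j) in index.items()
    }
```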

Table 2: Analysis of different features

| Features               | L1 (fx) | L1 (fy) | L1 (cx) | L1 (cy) | Reprojection Error |
|------------------------|---------|---------|---------|---------|--------------------|
| SIFT                   | 89.441  | 80.637  | 4.015   | 5.278   | 5.961              |
| SuperPoint + SuperGlue | 21.713  | 16.751  | 7.379   | 17.878  | 8.254              |
| Facial Keypoints       | 207.606 | 173.701 | 20.845  | 12.344  | 20.525             |
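The reprojection-error column can be reproduced by projecting each triangulated point into the views that observe it and averaging the pixel residuals; below is a minimal sketch under hypothetical data structures (the observation list and camera dict are assumptions for illustration).

```python
import numpy as np

def reprojection_error(points3d, observations, cameras):
    """observations: list of (point_id, camera_id, xy_detected);
    cameras: {camera_id: (K, R, t)} with world-to-camera extrinsics."""
    residuals = []
    for point_id, cam_id, xy in observations:
        K, R, t = cameras[cam_id]
        X_cam = R @ points3d[point_id] + t   # world -> camera
        uv = (K @ (X_cam / X_cam[2]))[:2]    # perspective projection to pixels
        residuals.append(np.linalg.norm(uv - xy))
    return float(np.mean(residuals))
```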

Figure: 3D reconstruction using various features (SIFT, Facial Keypoints, Facial Keypoints + SIFT)

Effect of using “Feature-metric Refinement” for keypoint adjustment

  • An overall decrease in the reprojection error, irrespective of the kind of features used

Number of features per method: SIFT: 5k; SuperPoint: 2k; Facial Keypoints: 200

Feature-metric Refinement: Keypoint Adjustment

  • Optimizes a feature-metric error based on dense features predicted by a neural network
  • Achieves high accuracy and precision across multiple views during the reconstruction process
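The keypoint-adjustment stage might be sketched as follows: sample dense CNN features at each tentative keypoint of a track and nudge the keypoints so the track's features agree. The feature-map/track layout, optimizer settings, and mean-as-reference choice are simplifications assumed for illustration, not the exact procedure of [3].

```python
import torch
import torch.nn.functional as F

def sample_feature(fmap, xy):
    """Bilinearly sample a C x H x W feature map at pixel coordinates xy."""
    _, H, W = fmap.shape
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects.
    grid = torch.stack([2 * xy[0] / (W - 1) - 1,
                        2 * xy[1] / (H - 1) - 1]).view(1, 1, 1, 2)
    return F.grid_sample(fmap[None], grid, align_corners=True).view(-1)

def adjust_track(feature_maps, keypoints, steps=50, lr=0.1):
    """feature_maps: one C x H x W tensor per view; keypoints: initial (x, y)
    of one track in each view. Returns feature-metrically refined coordinates."""
    xys = [torch.tensor(xy, dtype=torch.float32, requires_grad=True)
           for xy in keypoints]
    opt = torch.optim.Adam(xys, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        feats = [sample_feature(f, xy) for f, xy in zip(feature_maps, xys)]
        ref = torch.stack(feats).mean(dim=0).detach()  # track reference descriptor
        loss = sum((f - ref).square().sum() for f in feats)
        loss.backward()
        opt.step()
    return [xy.detach().numpy() for xy in xys]
```

In Pixel-Perfect SfM the reference is a robust track representative and the same feature-metric cost later drives a featuremetric bundle adjustment; this sketch keeps only the keypoint-adjustment stage.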