IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY,
VOL. 30, NO. 12, DECEMBER 2020
Object Detection-Based
Video Retargeting
With Spatial–Temporal Consistency
Seung Joon Lee, Siyeong Lee, Sung In Cho, Member, IEEE, and Suk-Ju Kang, Member, IEEE
2021-10-20 · 陳顗汝
01 INTRODUCTION
02 PROPOSED METHOD
03 EXPERIMENTAL RESULTS
04 CONCLUSION
INTRODUCTION
Video retargeting is the task of converting an image sequence with a given resolution into a new image sequence with a different resolution and aspect ratio.
Distinguishing between areas to be distorted and areas not to be distorted in the video is a key factor in the performance of the video retargeting method.
Additionally, the temporal consistency between consecutive frame images should be considered.
INTRODUCTION
Previous studies on image and video retargeting are divided into two groups:
Content-aware methods: saliency is estimated in many cases to select regions of interest (RoIs); these aim to minimize the distortion of semantically important parts while maintaining the structural characteristics of the image.
Content-unaware methods: simple operations such as uniform scaling or cropping that do not involve any semantic analysis of the image.
This study proposes a novel video retargeting method using deep learning-based object detection.
INTRODUCTION
The bounding box from the object detection has the advantage of accurately distinguishing between objects and non-objects.
Within the retargeted image, the contents of the areas enclosed by the bounding boxes are kept intact and the remaining background areas are resized, so that the content of the RoIs is preserved without deformation.
Therefore, the retargeted image is obtained by preserving the RoI areas and resizing the background as uniformly as possible.
The contributions of the proposed method are:
PROPOSED METHOD
1D RoIs are obtained by projecting the bounding boxes in the vertical direction.
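The projection step can be sketched as follows. This is a minimal illustration, not the paper's implementation; the `(x0, y0, x1, y1)` box layout and the function name are assumptions.

```python
import numpy as np

def project_boxes_to_1d(boxes, width):
    """Project 2D bounding boxes onto the horizontal axis (a vertical
    projection) to obtain a 1D RoI mask: mask[x] is True if column x
    falls inside any box. Boxes are (x0, y0, x1, y1), x1 exclusive."""
    mask = np.zeros(width, dtype=bool)
    for x0, y0, x1, y1 in boxes:
        mask[max(0, x0):min(width, x1)] = True
    return mask
```

A column covered by any detected object is thereby marked as RoI, and all remaining columns are treated as background.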
PROPOSED METHOD
The scaling ratio map S and the scaling ratio α are then calculated from the 1D RoIs.
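One common way to realize such a map, sketched below under the assumption that RoI columns keep scale 1 while background columns share the remaining target width uniformly; the paper's exact formula may differ.

```python
import numpy as np

def scaling_ratio_map(roi_mask, target_width):
    """Sketch: RoI columns keep scale 1.0; every background column is
    scaled by a single ratio alpha so the total width matches the
    target. Returns the per-column map S and alpha."""
    w = roi_mask.size
    n_roi = int(roi_mask.sum())
    n_bg = w - n_roi
    # alpha: uniform scale applied to every background column
    alpha = (target_width - n_roi) / n_bg
    s = np.where(roi_mask, 1.0, alpha)
    return s, alpha
```

For example, stretching a 10-column frame with 4 RoI columns to 16 columns gives α = (16 − 4) / 6 = 2.0 for the background.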
PROPOSED METHOD
The new index τ of each column in the grid map, to which the contents of the original image are relocated, is then determined from the scaling ratios.
However, there may be hole columns which contain nothing. In this case, the resultant image is acquired using the directional interpolation to fill in the hole regions with appropriate content.
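A simple version of the relocation and hole detection, assuming τ is derived from the cumulative sum of the scaling ratios (a common content-aware warping scheme; the paper's exact definition may differ):

```python
import numpy as np

def relocate_columns(s):
    """Map each source column to a new index tau: the left edge of
    column i in the target grid is the cumulative width of all
    columns before it."""
    tau = np.floor(np.cumsum(s) - s).astype(int)
    return tau

def find_holes(tau, target_width):
    """Target columns that no source column maps to are holes, to be
    filled afterwards by directional interpolation."""
    filled = np.zeros(target_width, dtype=bool)
    filled[tau[tau < target_width]] = True
    return np.flatnonzero(~filled)
```

With per-column scales [2.0, 1.0, 2.0] and a 5-column target, the source columns land at indices [0, 2, 3], leaving columns 1 and 4 as holes.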
However, when each frame is retargeted independently, the following temporal-consistency problems may be encountered:
PROPOSED METHOD
The correlation between sequential frames is inferred using the Siamese network framework.
First, scene change points are detected: a change point is declared between consecutive frames whenever the L2 norm of their difference exceeds a threshold.
A set of frame images between scene change points is considered as a set of scene images.
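The scene-splitting step can be sketched as follows; the threshold value and function name are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def split_scenes(frames, threshold):
    """Detect scene change points where the L2 norm of the difference
    between consecutive frames exceeds `threshold`, then split the
    frame list into sets of scene images."""
    cuts = [0]
    for t in range(1, len(frames)):
        diff = frames[t].astype(float) - frames[t - 1].astype(float)
        if np.linalg.norm(diff) > threshold:
            cuts.append(t)
    cuts.append(len(frames))
    return [frames[a:b] for a, b in zip(cuts, cuts[1:])]
```

All temporal smoothing that follows is then applied only within a single scene, never across a cut.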
PROPOSED METHOD
The bounding box image Bt−1 of the previous frame image Ft−1 and the entire image Ft of the current frame are inputs to the same network ψ.
Cross correlation is performed between the feature maps extracted by ψ to locate the object Objt−1, enclosed by the bounding box Bt−1 of the previous frame, within the current frame image Ft, thereby predicting the position of Objt.
The new position and size of Objt are represented by a relative variation vector with respect to Objt−1, and the updated bounding box Bt is used in turn for the prediction of Objt+1.
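A toy version of the cross-correlation step, with plain 2D arrays standing in for the deep feature maps that a Siamese tracker such as this actually uses:

```python
import numpy as np

def cross_correlate(template_feat, search_feat):
    """Slide the template feature map (from B_{t-1}) over the search
    feature map (from F_t) and return the (row, col) offset with the
    maximum correlation score, i.e., the predicted object location."""
    th, tw = template_feat.shape
    sh, sw = search_feat.shape
    best, pos = -np.inf, (0, 0)
    for i in range(sh - th + 1):
        for j in range(sw - tw + 1):
            score = float(np.sum(template_feat * search_feat[i:i + th, j:j + tw]))
            if score > best:
                best, pos = score, (i, j)
    return pos
```

In practice this exhaustive loop is replaced by a single convolution of the two feature maps, which is what makes the Siamese formulation fast.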
PROPOSED METHOD
Small frame-to-frame differences in the size and position of the bounding boxes appear as jitter during continuous playback, slightly amplifying the temporal inconsistency of the video.
This problem is solved by applying an exponential average filter to the 1D RoIs of the frames within the same scene.
It was experimentally confirmed that the most natural retargeted video is obtained when γ = 0.10.
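The filter can be sketched as the standard exponential moving average below; the paper's exact recurrence may place γ differently, and γ = 0.10 follows the reported best setting.

```python
def smooth_rois(roi_sequence, gamma=0.10):
    """Exponential average filter over per-frame 1D RoI values within
    one scene: smoothed_t = gamma * current_t + (1 - gamma) * smoothed_{t-1}.
    A small gamma weights the accumulated history heavily, which
    suppresses bounding-box jitter between consecutive frames."""
    smoothed = [roi_sequence[0]]
    for cur in roi_sequence[1:]:
        smoothed.append(gamma * cur + (1 - gamma) * smoothed[-1])
    return smoothed
```

For instance, a sudden RoI jump from 0 to 10 is damped to 1.0, then 1.9, and so on, rather than appearing instantly.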
PROPOSED METHOD
Sometimes, due to the inaccuracy of the object detection and the object tracking, the RoIs can be distorted in the retargeted video.
To prevent this, the Intersection over Union (IoU) between the currently inferred RoIs and the accumulated RoIs is measured and reflected in the determination of γ.
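For 1D RoIs this amounts to an interval IoU; the rule below for shrinking γ when the new detection disagrees with the accumulated RoIs is a hypothetical stand-in for the paper's exact policy.

```python
def interval_iou(a, b):
    """Intersection over Union of two 1D RoI intervals (x0, x1)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def adaptive_gamma(iou, gamma=0.10):
    """Hypothetical rule: trust the new detection less when it disagrees
    with the accumulated RoIs (low IoU) by scaling gamma down."""
    return gamma * iou
```

A detection that barely overlaps the accumulated RoIs thus contributes almost nothing to the filtered result, limiting the damage from a spurious box.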
PROPOSED METHOD
If the RoIs are too large, the background area must be stretched more to satisfy the target aspect ratio; in this case, hole columns occur in the background region of the retargeted frame.
To address this problem, the proposed method adaptively sets the scaling ratio according to the object size, considering the aspect ratios of the input and output images.
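One way such an adaptive rule can work is sketched below; the bound on background stretch (2×) and the closed-form fallback are assumptions for illustration, not the paper's formula.

```python
def adaptive_roi_scale(roi_width, src_width, target_width):
    """Sketch: if keeping the RoIs at scale 1.0 would force the
    background to over-stretch (creating holes), choose an RoI scale
    so the background stretch stays at the assumed bound instead."""
    max_bg_stretch = 2.0
    bg = src_width - roi_width
    # background stretch required when RoIs keep scale 1.0
    needed = (target_width - roi_width) / bg if bg > 0 else float("inf")
    if needed <= max_bg_stretch:
        return 1.0  # RoIs preserved intact
    # solve target = s * roi + max_bg_stretch * bg for the RoI scale s
    return (target_width - max_bg_stretch * bg) / roi_width
```

Small objects are kept fully intact, while very wide objects absorb part of the stretch so the background ratio never exceeds the bound.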
EXPERIMENTAL RESULTS
Implementation
A YOLO v3 model based on darknet was used as the object detection network.
In the darknet implementation, the pooling layers run very fast.
The YOLO v3 model was pretrained on the PASCAL VOC dataset (VOC_devkit) for the experiments.
For the object tracking network, five layers of a pre-trained VGG-16 network were extracted and used as the encoding network.
Two copies of this network were used in a Siamese structure to enable object tracking through cross correlation.
EXPERIMENTAL RESULTS
To evaluate the proposed method, 30-fps Full HD videos with a 16:9 aspect ratio were retargeted to a 21:9 aspect ratio, a widely used conversion nowadays.
EXPERIMENTAL RESULTS
The proposed method adapted the aspect ratio with the RoIs well preserved.
Additionally, the new location of the RoIs in the retargeted image maintained the structural characteristics of the input video.
CONCLUSION
However, in some cases, object motion slightly distorted the surrounding structures or logos.