1 of 23

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY,

VOL. 30, NO. 12, DECEMBER 2020

Object Detection-Based

Video Retargeting

With Spatial–Temporal Consistency

Seung Joon Lee, Siyeong Lee , Sung In Cho , Member, IEEE,

and Suk-Ju Kang , Member, IEEE

20211020 陳顗汝

2 of 23

01 INTRODUCTION

02 PROPOSED METHOD

03 EXPERIMENTAL RESULTS

04 CONCLUSION

3 of 23

INTRODUCTION

Video retargeting converts an image sequence of a given resolution into a new image sequence with a different resolution and aspect ratio.

Distinguishing the regions that may be distorted from those that must be preserved is a key factor in the performance of a video retargeting method.

Additionally, the temporal consistency between consecutive frame images should be considered.

4 of 23

INTRODUCTION

Previous studies related to image and video retargeting are divided into:

  • Simple resizing methods (no semantic analysis of the image):
    • linear scaling
    • cropping
    • letterbox inserting
  • Energy optimization model-based methods: saliency is estimated in many cases to select regions of interest (RoIs)
  • Deep neural network (DNN)-based methods: these aim to minimize the distortion of semantically important parts while maintaining the structural characteristics of the image

5 of 23

INTRODUCTION

The saliency-based RoI estimation used by these methods has the following limitations:

  1. Merely calculating RoIs from pixel-wise low-level features of the image does not consider the cognitive characteristics of viewers.

  2. RoIs and non-RoIs cannot be cleanly dichotomized, because the saliency inference operation produces an ambiguous, continuous distribution.

  3. Temporal relationships between consecutive frames are not reflected in the saliency calculation.


7 of 23

This study proposes a novel video retargeting method using deep learning-based object detection.

INTRODUCTION

8 of 23

INTRODUCTION

The bounding box from the object detection has the advantage of accurately distinguishing between objects and non-objects.

Within the retargeted image, the contents of the area surrounded by the bounding box are kept intact and the remaining background areas are resized to preserve the content of the RoIs without deformation.

Therefore, the retargeted image is obtained by keeping the RoI areas intact and resizing the background as uniformly as possible.

9 of 23

The contributions of the proposed method are:

  1. The object detection-based video retargeting method determines RoIs directly, and the inferred RoIs can be clearly distinguished from the surrounding background area (non-RoIs).

  2. Computational complexity is reduced by using the Siamese-network DNN architecture for tracking, which is lighter than the object detection network.

INTRODUCTION

10 of 23

PROPOSED METHOD

1D RoIs are obtained by projecting the bounding boxes in the vertical direction.
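This projection can be sketched as follows; this is a minimal illustration under assumed conventions (the function name `project_boxes_to_1d` and the `(x_min, y_min, x_max, y_max)` box format are not from the paper):

```python
import numpy as np

def project_boxes_to_1d(boxes, width):
    """Project 2D bounding boxes onto the horizontal axis to obtain 1D RoIs.

    boxes: iterable of (x_min, y_min, x_max, y_max) pixel coordinates.
    Returns a binary mask of length `width`: 1 marks RoI columns.
    """
    mask = np.zeros(width, dtype=np.uint8)
    for x_min, _, x_max, _ in boxes:
        mask[max(0, x_min):min(width, x_max)] = 1
    return mask

# Two overlapping boxes produce one merged 1D RoI over columns 10..29.
roi = project_boxes_to_1d([(10, 5, 20, 15), (18, 2, 30, 12)], width=40)
```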

11 of 23

PROPOSED METHOD

The scaling ratio map S and the scaling ratio α are calculated as follows:
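The equation itself is not reproduced in this text. A hedged reconstruction, assuming RoI columns keep a scale of 1 while all background columns share the remaining target width with a uniform ratio α:

```python
import numpy as np

def scaling_ratio_map(roi_mask, target_width):
    """Per-column scaling ratios, assuming RoI columns stay at scale 1 and
    every background column is scaled by the same alpha so that the total
    output width equals target_width."""
    w_in = roi_mask.size
    w_roi = int(roi_mask.sum())
    alpha = (target_width - w_roi) / (w_in - w_roi)  # background scaling ratio
    S = np.where(roi_mask == 1, 1.0, alpha)          # scaling ratio map
    return S, alpha
```

For a 16:9 to 21:9 conversion at fixed height, α > 1, i.e. the background is stretched while the RoIs keep their original width.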

12 of 23

PROPOSED METHOD

The new index τ for each column in the grid map, to which the contents of the original image are relocated, is determined as follows:

However, there may be hole columns that contain nothing. In this case, the resultant image is acquired by using directional interpolation to fill the hole regions with appropriate content.
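The relocation step can be sketched as accumulating the per-column scaling ratios and rounding to the output pixel grid; this is an assumed formulation, not the paper's exact equation:

```python
import numpy as np

def relocate_columns(S):
    """Map each source column x to its new index tau(x).

    tau(x) = round(S[0] + ... + S[x-1]), i.e. an exclusive cumulative sum of
    the scaling ratio map, rounded to integer output columns."""
    tau = np.round(np.cumsum(S) - S).astype(int)
    return tau
```

With background ratios above 1, consecutive τ values can skip output indices; the skipped positions are exactly the hole columns that the directional interpolation fills.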

13 of 23

The following problems may be encountered :

  1. Because the position and size of the bounding boxes extracted from consecutive frames vary, temporal distortion may occur when the retargeted frames are played back continuously.

  2. There is no exception handling when the object detection network fails to produce a bounding box; for example, an object detected in the previous frame may be missed in the next frame.

  3. Processing speed is an important consideration for video retargeting, and running the object detection network on every frame is inefficient.

PROPOSED METHOD

14 of 23

The correlation between sequential frames is inferred using the Siamese network framework.

First, scene change points are detected by computing the L2 norm of the difference between consecutive frames of the input video and marking the positions where it exceeds a specific threshold.

A set of frame images between scene change points is considered as a set of scene images.
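The scene segmentation described above can be sketched as follows (the threshold value and function name are assumptions for illustration):

```python
import numpy as np

def split_into_scenes(frames, threshold):
    """Detect scene change points where the L2 norm of the difference between
    consecutive frames exceeds `threshold`; return [start, end) frame ranges,
    each range being one set of scene images."""
    cuts = [0]
    for t in range(1, len(frames)):
        diff = np.linalg.norm(frames[t].astype(np.float64) -
                              frames[t - 1].astype(np.float64))
        if diff > threshold:
            cuts.append(t)
    cuts.append(len(frames))
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]
```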

PROPOSED METHOD

15 of 23

The bounding box image Bt−1 of the previous frame image Ft−1 and the entire image Ft of the current frame are inputs to the same network ψ.

A cross correlation is performed on the feature maps extracted by the network ψ to predict the position of Objt, that is, where the object Objt−1 enclosed by the bounding box Bt−1 of the previous frame appears in the current frame image Ft.

The new position and size of Objt are represented by a relative variation vector with respect to Objt−1, and the updated bounding box Bt is then used in the same way to predict Objt+1.

PROPOSED METHOD

16 of 23

Small frame-to-frame differences in the size and position of the bounding box appear as jitter during continuous playback, which slightly amplifies the perceived inconsistency of the video.

This problem is solved by using an exponential average filter in 1D RoIs for frames in the same scene.

The exponential average filtering is as follows.

It was experimentally confirmed that the most natural retargeted video is obtained when γ is 0.10.
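The filter equation is not reproduced in this text. A standard exponential average form consistent with the description, where γ = 0.10 gives heavy weight to the accumulated history (the exact form used in the paper is an assumption here):

```python
import numpy as np

def smooth_rois(roi_sequence, gamma=0.10):
    """Exponential average filtering of per-frame 1D RoIs within one scene.

    Assumed form: r_hat_t = gamma * r_t + (1 - gamma) * r_hat_{t-1},
    so a small gamma suppresses frame-to-frame jitter."""
    smoothed = [np.asarray(roi_sequence[0], dtype=np.float64)]
    for roi in roi_sequence[1:]:
        smoothed.append(gamma * np.asarray(roi, dtype=np.float64)
                        + (1.0 - gamma) * smoothed[-1])
    return smoothed
```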

PROPOSED METHOD

17 of 23

Sometimes, due to the inaccuracy of the object detection and the object tracking, the RoIs can be distorted in the retargeted video.

To prevent this, Intersection over Union (IOU) between the currently inferred RoIs and the accumulated RoIs is measured and reflected in determining γ .
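One hypothetical way to realize this (the exact mapping from IoU to γ is not given in this text; `adaptive_gamma` below is an illustrative assumption) is to scale γ by the agreement between the current and accumulated RoIs:

```python
def iou_1d(a, b):
    """Intersection over union of two 1D column intervals given as (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def adaptive_gamma(current_roi, accumulated_roi, base_gamma=0.10):
    """Hypothetical rule: weight the current detection by its agreement with
    the accumulated RoIs, so inconsistent detections influence the filter less."""
    return base_gamma * iou_1d(current_roi, accumulated_roi)
```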

PROPOSED METHOD

18 of 23

If the RoIs are too large, the background area must be stretched more to satisfy the target aspect ratio; in this case, hole columns occur in the background region of the retargeted frame image.

To address this problem, the proposed method sets the ratio to allow adaptive scaling according to the size of the object, considering aspect ratios for the input and output images.
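A hypothetical sketch of such adaptive scaling (the cap `alpha_max` and the split between background ratio α and RoI ratio β are assumptions, not the paper's rule): cap the background stretch and let the RoIs absorb the remaining width change.

```python
def adaptive_scales(w_in, w_out, w_roi, alpha_max=2.0):
    """Hypothetical adaptive scaling: if preserving the RoIs rigidly would push
    the background ratio above alpha_max, cap it at alpha_max and scale the
    RoIs by beta so that the total output width is still w_out."""
    alpha = (w_out - w_roi) / (w_in - w_roi)
    if alpha <= alpha_max:
        return alpha, 1.0  # RoIs kept at their original scale
    alpha = alpha_max
    beta = (w_out - alpha * (w_in - w_roi)) / w_roi
    return alpha, beta
```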

PROPOSED METHOD

19 of 23

EXPERIMENTAL RESULTS

Implementation

A YOLO v3 model based on darknet was used as the object detection network.

In the darknet implementation, the pooling layers run remarkably fast.

The YOLO v3 model used in the experiments was pretrained on the VOC_devkit dataset.

For the object tracking network, five layers of the pre-trained VGG-16 network were extracted and used as an encoding network.

Two copies of this network were used in the Siamese structure to enable object tracking through cross correlation.

20 of 23

EXPERIMENTAL RESULTS

To evaluate the proposed method, 30 fps Full HD videos with a 16:9 aspect ratio were retargeted to a 21:9 ratio, a conversion that is widely used these days.

21 of 23

EXPERIMENTAL RESULTS

The proposed method adapted the aspect ratio with the RoIs well preserved.

Additionally, the new location of the RoIs in the retargeted image maintained the structural characteristics of the input video.

22 of 23

EXPERIMENTAL RESULTS

23 of 23

CONCLUSION

  • DNN-based object detection and object tracking yielded an effective video retargeting algorithm that does not require heavy post-processing.

  • To reduce the computational complexity, a suitable architecture using a low-complexity object tracking network was created.

  • All retargeting tasks were processed in scene units, and the retargeted frames within each scene achieve the new target aspect ratio with minimal spatial and temporal artifacts.

  • Qualitative and quantitative experimental results demonstrated that the proposed method is more stable than existing methods in spatial-temporal terms.

However, in some cases object motion slightly distorted the surrounding structures or logos.