1 of 20

CrowdSR: Enabling High-Quality Video Ingest in Crowdsourced Livecast via Super-Resolution

Zhenxiao Luo, Zelong Wang, Jinyu Chen, Miao Hu, Yipeng Zhou, Tom Z. J. Fu, Di Wu

3 of 20

Introduction

Crowdsourced Livecast

increasingly attractive
younger generations

Twitch TV

93 billion minutes per month
3.18 million viewers per month
9.71 million per month

4 of 20

Introduction

Livecast System

broadcaster
ingest server
content distribution network
viewer

5 of 20

Introduction

Neural-enhanced techniques

NAS
LiveNAS

Existing problem

seldom consider the collaboration among broadcasters

6 of 20

System Design

Challenge

different upstream bandwidth
different device capabilities

Motivation

provide 1080p video when the uploaded video is 540p
solve the lack of high-resolution video samples for SR model training

The main challenge in a livecast system is the significant difference among broadcasters in upstream bandwidth and device capabilities.

These factors restrict the maximum quality of the video stream that a broadcaster can upload to the ingest server.

For example, as shown in the figure, suppose that broadcaster 𝐴 uses a low-end smartphone that can produce a live video stream at 540p at best. Viewers in 𝐴’s channel then cannot easily watch the stream at 1080p quality.

The major motivation of CrowdSR is to solve the lack of high-resolution video samples for SR model training.

Previous studies assume that a broadcaster can generate high-resolution videos but may not have sufficient bandwidth to upload the video streams.

In this paper, we consider a more general scenario, in which a broadcaster can only generate low-resolution videos.

The question we want to answer is: given that the highest video resolution broadcaster 𝐴 can upload is 540p, can we design a scheme that enables viewers in 𝐴’s channel to watch videos in 1080p?

7 of 20

System Design

Framework Overview

Patch Selector
Online Trainer
SR Processor

8 of 20

System Design

Patch Selector

Reference patch selector
Training patch selector

9 of 20

System Design

Reference patch selector

divide a frame from target video into patches
calculate the mean squared error (MSE) against the patches of the previous frame
choose top k patches as reference patches
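
The three steps above can be sketched in plain Python. This is a minimal illustration, not the CrowdSR implementation: it assumes grayscale frames stored as nested lists, compares each patch with the patch at the same position in the previous frame, and ranks patches by largest MSE. The slide does not say whether "top k" means largest or smallest MSE, so the ranking direction is an assumption. The names `split_into_patches`, `mse`, and `select_reference_patches` are hypothetical.

```python
def split_into_patches(frame, size):
    """Map (row, col) patch index -> patch pixels; ragged edges are dropped."""
    rows, cols = len(frame) // size, len(frame[0]) // size
    return {
        (r, c): [row[c * size:(c + 1) * size]
                 for row in frame[r * size:(r + 1) * size]]
        for r in range(rows) for c in range(cols)
    }


def mse(a, b):
    """Mean squared error between two equally sized patches."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def select_reference_patches(curr, prev, size, k):
    """Return the k patch positions with the largest MSE against prev.

    Assumption: 'top k' means the patches that changed the most.
    """
    curr_p = split_into_patches(curr, size)
    prev_p = split_into_patches(prev, size)
    scored = [(mse(curr_p[pos], prev_p[pos]), pos) for pos in curr_p]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [pos for _, pos in scored[:k]]


# Toy 4x4 frames (hypothetical data): two patches change between frames.
prev_frame = [[0.0] * 4 for _ in range(4)]
curr_frame = [row[:] for row in prev_frame]
curr_frame[0][2] = 8.0   # large change inside patch (0, 1)
curr_frame[3][0] = 2.0   # smaller change inside patch (1, 0)
print(select_reference_patches(curr_frame, prev_frame, size=2, k=2))
# prints [(0, 1), (1, 0)]
```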

10 of 20

System Design

Training patch selector

divide a frame from similar videos into patches
calculate pHash value between reference patches and candidate patches
choose the top 𝑚 patches as training samples
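
A simplified sketch of the pHash-based selection follows. It is an illustration under stated assumptions, not the authors' code: patches are small square grayscale arrays (a typical pHash first resizes to 32x32, which is skipped here), the hash keeps a 4x4 block of low-frequency DCT coefficients thresholded at their median, and a candidate is scored by its smallest Hamming distance to any reference patch. All function names and the toy patch generator are hypothetical.

```python
import math
from statistics import median


def dct_low_freq(patch, keep):
    """Top-left keep x keep coefficients of an unnormalized 2D DCT-II."""
    n = len(patch)
    basis = [[math.cos(math.pi * (2 * i + 1) * f / (2 * n)) for i in range(n)]
             for f in range(keep)]
    # Transform along rows first, then along columns.
    rows = [[sum(patch[y][x] * basis[u][x] for x in range(n))
             for u in range(keep)] for y in range(n)]
    return [[sum(rows[y][u] * basis[v][y] for y in range(n))
             for u in range(keep)] for v in range(keep)]


def phash(patch, keep=4):
    """Simplified perceptual hash: one bit per low-frequency coefficient."""
    coeffs = [c for row in dct_low_freq(patch, keep) for c in row]
    med = median(coeffs[1:])  # skip the DC term, which mostly encodes brightness
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > med else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def select_training_patches(reference_patches, candidates, m):
    """Return indices of the m candidates closest to any reference patch."""
    ref_hashes = [phash(p) for p in reference_patches]
    scored = []
    for idx, cand in enumerate(candidates):
        h = phash(cand)
        scored.append((min(hamming(h, r) for r in ref_hashes), idx))
    scored.sort()
    return [idx for _, idx in scored[:m]]


def make_patch(seed, n=8):
    """Deterministic toy patch generator (hypothetical test data)."""
    return [[float((seed * (x + 1) * 7 + (y + 1) * 13 + x * y * seed) % 32)
             for x in range(n)] for y in range(n)]


reference = [make_patch(1)]
candidates = [make_patch(5),
              [[2 * v for v in row] for row in make_patch(1)],  # same content, higher contrast
              make_patch(9)]
print(select_training_patches(reference, candidates, m=2))
```

Because the hash thresholds the coefficients at their median, scaling a patch's contrast leaves its hash unchanged, which is the kind of robustness that makes a perceptual hash preferable to raw pixel comparison when matching patches across different broadcasters' videos.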

11 of 20

System Design

Online Trainer

load pre-trained general model
fine-tune using patches from the patch selector
update model periodically

12 of 20

System Design

SR Processor

EDSR
enhance video quality
deliver high-resolution frame

13 of 20

Implementation

Main process

Receive video
aiortc, aiohttp

Online Trainer Process

Dataset update
Model training
PyTorch

SR process

Model update
Video quality enhancement
PyTorch, OpenCV

14 of 20

Evaluation

Datasets

Douyu and Bilibili
1080p, 30fps

Metrics

Peak Signal-to-Noise Ratio (PSNR)
Structural Similarity Index (SSIM)

Baselines

Bicubic interpolation (BI)
General SR
Specialized SR

For evaluation, we build our video dataset by capturing the live streams distributed by two leading crowdsourced livecast platforms in China, namely, Douyu and Bilibili.

We retrieve video streams from different categories; the default resolution is 1080p and the frame rate is 30 FPS.

We use two metrics to evaluate the enhancement of video quality by CrowdSR, namely, PSNR and SSIM.

We record the frames during the livecast and calculate the corresponding values of PSNR and SSIM.
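
For reference, PSNR for a single frame can be computed as in the sketch below. It assumes 8-bit grayscale frames stored as nested lists, with the original 1080p frame as the reference. The function name `psnr` and the toy frames are illustrative only. SSIM is more involved and is not covered here.

```python
import math


def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    diffs = [(r - e) ** 2
             for row_r, row_e in zip(reference, enhanced)
             for r, e in zip(row_r, row_e)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(max_value ** 2 / mse)


# Toy 2x2 frames (hypothetical data): every pixel is off by one level.
original = [[100, 100], [100, 100]]
upscaled = [[101, 99], [101, 99]]
print(round(psnr(original, upscaled), 2))  # prints 48.13
```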

We use three other methods as baselines for comparison in our experiments:

BI, which simply up-scales a low-resolution frame to a high-resolution frame with bicubic interpolation.

GeneralSR, which uses a general SR model trained on the DIV2K benchmark dataset.

SpecializedSR, which is trained over high-quality original videos and serves as the baseline to provide the best SR performance for a video.

15 of 20

Evaluation

Server Specification

Intel Xeon Silver 4210R
2 × NVIDIA RTX 3090

Training Parameter

Learning rate, 0.001
Batch size, 64
Epoch number, 100
Dataset size, 64000
Optimizer, Adam
Loss function, L1 loss
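
As a quick check on what these settings imply, the number of optimizer updates can be worked out as below. This assumes that the dataset size counts training patches and that every patch is used once per epoch; the slide does not state either point.

```python
dataset_size = 64_000   # training samples (assumed to be patches)
batch_size = 64
epochs = 100

steps_per_epoch = dataset_size // batch_size   # 1000 updates per epoch
total_steps = steps_per_epoch * epochs         # 100000 updates in total
print(steps_per_epoch, total_steps)            # prints 1000 100000
```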

16 of 20

Evaluation

Average PSNR

Baseline (BI) is 29.48 dB
0.42-1.09 dB improvement

Average SSIM

Baseline (BI) is 0.881
0.006-0.014 improvement

17 of 20

Evaluation

PSNR change over time

sample every second
better than BI and GeneralSR
SpecializedSR is upper bound

18 of 20

Evaluation

Inference Latency

28 ms on average
96% of frames take less than 33 ms
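
The 33 ms threshold presumably corresponds to the frame interval of a 30 FPS stream, the frame rate used in the dataset. The sketch below shows that arithmetic and how the share of frames within the budget could be computed. The latency samples are hypothetical values, not measurements from the evaluation.

```python
def frame_budget_ms(fps):
    """Time available per frame for real-time processing, in milliseconds."""
    return 1000.0 / fps


def share_within_budget(latencies_ms, budget_ms):
    """Fraction of frames whose inference latency is below the budget."""
    return sum(1 for lat in latencies_ms if lat < budget_ms) / len(latencies_ms)


budget = frame_budget_ms(30)              # about 33.3 ms per frame
sample = [25.0, 27.5, 28.0, 31.0, 36.0]   # hypothetical per-frame latencies
print(round(budget, 1), share_within_budget(sample, budget))  # prints 33.3 0.8
```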

19 of 20

Evaluation

GPU usage

most of the computation happens in the training step
the inference step is not as computationally intensive
takes about 4300 MB and 3800 MB of GPU memory

20 of 20

Conclusion

CrowdSR is a novel video ingest framework

Utilizes super-resolution techniques
Utilizes similar broadcasters’ videos as training samples
Utilizes online training to optimize performance

CrowdSR is an effective video ingest framework

Achieves real-time video quality enhancement
Improves PSNR by 0.42-1.09 dB
Improves SSIM by 0.006-0.014