1 of 7

Txt2Vid: Ultra-Low Bitrate Compression of Talking-Head Videos via Text

Pulkit Tandon

Stanford University

Workshop on Video Analytics 2022

2 of 7

Videos, videos, everywhere…

  • Video streaming is ~80% of today’s internet traffic
  • COVID-19 led to an even bigger boom
      • Video conferencing tools such as Zoom saw a ~10X increase in usage
  • Limited access to bandwidth worldwide
      • Typical audio-video calls can take ~100 Kbps to a few Mbps

References:
Cisco, “Cisco visual networking index: global mobile data traffic forecast update, 2017–2022,” accessed 2021. [Online]. Available: https://s3.amazonaws.com/media.mediapost.com/uploads/CiscoForecast.pdf
N. Pandey, A. Pal et al., “Impact of digital surge during Covid-19 pandemic: A viewpoint on research and practice,” International Journal of Information Management, vol. 55, p. 102171, 2020.
Cisco, “Cisco Annual Internet Report (2018–2023) White Paper,” accessed 2021. [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/executive-perspectives/annual-internet-report/white-paper-c11-741490.html
M. Candela, V. Luconi, and A. Vecchio, “Impact of the covid-19 pandemic on the internet latency: A large-scale study,” Computer Networks, vol. 182, p. 107495, 2020.
G. S. Ford, “Covid-19 and broadband speeds: A multi-country analysis,” Available at SSRN 3689044, 2020.

3 of 7

Alice: Hello Bob, how are you doing?

Bob: Hello Alice, I am doing great. What about you?

Alice: Your stream is freezing, Bob!

Bob: OK, let me try switching off the video. Can you hear me better now?

Alice: Much better, wish we could also see each other while talking though.

4 of 7

Can we compress AV content generated via webcams to text and recover videos with similar QoE compared to standard codecs in a low bitrate regime?

YES!

~100,000 bps AV stream: H.264 (95 Kbps) + AAC (5 Kbps)

~100 bps text stream decoded using Txt2Vid

~100-1000X compression at iso-quality against AVC + AAC

Subjective Study (~240 participants)
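
The headline ratio follows directly from the bitrates quoted on this slide. A minimal sketch of that arithmetic is below; the function and constant names are illustrative only and are not taken from the Txt2Vid code.

```python
def compression_ratio(baseline_bps: float, compressed_bps: float) -> float:
    """Return how many times smaller the compressed stream is than the baseline."""
    if compressed_bps <= 0:
        raise ValueError("compressed_bps must be positive")
    return baseline_bps / compressed_bps


# Bitrates as quoted on the slide.
H264_VIDEO_BPS = 95_000   # H.264 video at 95 Kbps
AAC_AUDIO_BPS = 5_000     # AAC audio at 5 Kbps
TXT2VID_TEXT_BPS = 100    # approximate Txt2Vid text stream

baseline_bps = H264_VIDEO_BPS + AAC_AUDIO_BPS
ratio = compression_ratio(baseline_bps, TXT2VID_TEXT_BPS)
print(f"{baseline_bps} bps / {TXT2VID_TEXT_BPS} bps = {ratio:.0f}X")
```

This reproduces the upper end of the ~100-1000X range; the lower end corresponds to comparing against a lower-bitrate baseline stream.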

5 of 7

Conventional Approach

  • Sender → Encoder → Transmission Package → Decoder → Receiver
  • ~10-100 kbps Video Stream
  • ~1-5 kbps Audio Stream

Txt2Vid Approach

  • One-Time: Driving Video, User ID
  • Typical Operation: Sender → Speech-to-Text → Transmission Package → Text-to-Speech → Video Generation (Lip-Sync) → Receiver
  • ~100 bps Text Stream, e.g. “Hello, how is it going?”, with User ID
  • Generative ML models at decoder

Visit Poster Session to learn more!
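
The data flow in the Txt2Vid approach can be sketched in a few lines. This is a toy illustration under stated assumptions, not the actual Txt2Vid implementation: `TextPacket`, its newline-separated wire format, the three model functions, and the 2-second utterance duration are all hypothetical, and the models are replaced by stubs that only return descriptive strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextPacket:
    """Hypothetical per-utterance payload: who is speaking and what they said."""
    user_id: str
    text: str

    def to_bytes(self) -> bytes:
        # Assumed wire format for illustration: "<user_id>\n<text>" in UTF-8.
        return f"{self.user_id}\n{self.text}".encode("utf-8")

    @classmethod
    def from_bytes(cls, payload: bytes) -> "TextPacket":
        user_id, text = payload.decode("utf-8").split("\n", 1)
        return cls(user_id=user_id, text=text)


def speech_to_text(transcript: str) -> str:
    """Stub for the sender-side speech-to-text model (transcript is given)."""
    return transcript.strip()


def text_to_speech(packet: TextPacket) -> str:
    """Stub for the receiver-side text-to-speech model."""
    return f"speech[{packet.user_id}]: {packet.text}"


def generate_video(speech: str, driving_video: str) -> str:
    """Stub for lip-synced video generation from speech and a driving video."""
    return f"video({driving_video}) lip-synced to {speech}"


def bitrate_bps(payload: bytes, duration_s: float) -> float:
    """Average bitrate of a payload spread over the time it took to speak."""
    return len(payload) * 8 / duration_s


# One-time setup: the receiver already holds a driving video per user ID.
driving_videos = {"alice": "alice_driving.mp4"}

# Typical operation: only the text packet crosses the network.
sent = TextPacket("alice", speech_to_text("Hello, how is it going?")).to_bytes()
received = TextPacket.from_bytes(sent)
video = generate_video(text_to_speech(received), driving_videos[received.user_id])

print(len(sent), "bytes on the wire")
print(f"{bitrate_bps(sent, duration_s=2.0):.0f} bps, assuming a 2 s utterance")
```

The point of the sketch is that the sender transmits only a short text payload, while all audio and video synthesis happens at the receiver using material shared once in advance.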

6 of 7

Lots of Potential Applications

7 of 7