1 of 136

Introduction to digital video technology

2 of 136

Feedback

Please take notes about possible improvements.

Feedback is welcome: better naming, shortening content Y, expanding content X, etc.

Don’t let others suffer as you will.

3 of 136

Basic Terminology

4 of 136

What is an image?

[Figure: an image shown as a 2D matrix of pixels and as a 3D array with one plane per channel (R, G, B); each value is a color intensity.]

5 of 136

What is picture element (pixel) ?

[Figure: a single pixel of the image, made of its R, G and B intensity values.]

6 of 136

What is bit (color) depth?

8R+8G+8B = 24 bits

*it gives you 2^24 different colors
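A minimal sketch of the arithmetic above, assuming every channel uses the same number of bits. The function name `color_count` is only illustrative.

```python
def color_count(bits_per_channel: int, channels: int = 3) -> int:
    """Number of distinct colors representable with the given bit depth."""
    total_bits = bits_per_channel * channels
    return 2 ** total_bits


print(color_count(8))  # 16777216, i.e. 2^24 for 8R + 8G + 8B
```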

[Figure: R, G and B planes; each intensity value is in the range 0-255.]

7 of 136

Color depth

24 bpp

10 bpp

8 bpp

8 of 136

Color depth

9 of 136

All colors from RGB

X

10 of 136

All colors from RGB

https://lumeniquessl.com/2012/03/01/12-in-12-for-2012-the-flicker-indicator-machine/

11 of 136

All colors from RGB

https://lightingstudio.wordpress.com/2012/03/27/week5-light-object-shadow-contrast/

12 of 136

What is resolution?

4

width

height

13 of 136

What is display aspect ratio (DAR)?

16:9 (1.7777777778), e.g. 1280/720 (1.7777777778)

4:3 (1.3333333333), e.g. 1024/768 (1.3333333333)
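As a small sketch, the ratios above can be obtained by reducing width/height to lowest terms. This assumes square pixels (PAR 1:1); the helper name is illustrative.

```python
from fractions import Fraction


def display_aspect_ratio(width: int, height: int) -> str:
    """Reduce width/height to lowest terms, assuming square pixels."""
    ratio = Fraction(width, height)
    return f"{ratio.numerator}:{ratio.denominator}"


print(display_aspect_ratio(1280, 720))  # 16:9
print(display_aspect_ratio(1024, 768))  # 4:3
```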

14 of 136

What is pixel aspect ratio (PAR)?

PAR 1:1

PAR 2:1

15 of 136

DVD display aspect 4:3, pixel aspect: 10:11

Source https://xiph.org/video/vid1.shtml

16 of 136

What is a video?

4D

time

30 frames per sec (FPS)

framerate

a single frame

17 of 136

Interlaced | progressive

18 of 136

What are 480p, 1080i, 1080p formats?

[number][letter]

The number is the resolution's height; the letter is the scan type: p means progressive and i means interlaced.
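The naming rule can be sketched as a tiny parser. The function name and the returned tuple shape are assumptions made for illustration.

```python
import re


def parse_video_format(name: str) -> tuple[int, str]:
    """Split a name such as '1080i' into (height, scan type)."""
    match = re.fullmatch(r"(\d+)([pi])", name)
    if match is None:
        raise ValueError(f"not a [number][letter] format: {name!r}")
    height = int(match.group(1))
    scan = "progressive" if match.group(2) == "p" else "interlaced"
    return height, scan


print(parse_video_format("480p"))   # (480, 'progressive')
print(parse_video_format("1080i"))  # (1080, 'interlaced')
```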

19 of 136

What is bitrate?

30 FPS

WIDTH * HEIGHT * BITS_PER_PIXEL * FPS

4 * 4 * 24 * 30

11,520 bits per second
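The formula can be written as a one-line function. This is a sketch for uncompressed video only; the name `raw_bitrate` is illustrative.

```python
def raw_bitrate(width: int, height: int, bits_per_pixel: int, fps: int) -> int:
    """Bits per second needed to send every pixel of every frame uncompressed."""
    return width * height * bits_per_pixel * fps


print(raw_bitrate(4, 4, 24, 30))  # 11520
```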

20 of 136

Constant bitrate (CBR)?

1.2Mbps

time

21 of 136

Variable bitrate (VBR)?

1.2Mbps

time

2.4Mbps

200Kbps

22 of 136

Average bitrate (ABR)?

1.2Mbps

time

2.4Mbps

200Kbps

min

max

400Kbps

1.8Mbps

can be seen as constrained VBR

23 of 136

Space needed to store 1h of video at 720p 30fps

* without any compression technique at all.

WIDTH * HEIGHT * BITS_PER_PIXEL * FPS

1280 * 720 * 24 * 30

663,552,000 bits per second (663.552 Mbps)

2,388,787,200,000 bits per hour (about 278 GiB)
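A sketch of the conversion from bitrate to storage for one hour. It reports both decimal gigabytes (10^9 bytes) and binary gibibytes (2^30 bytes), since the 278 figure corresponds to the binary unit; the function name is illustrative.

```python
def storage_for(bitrate_bps: int, seconds: int) -> tuple[int, float, float]:
    """Return (total bits, size in GB, size in GiB) for a constant bitrate."""
    total_bits = bitrate_bps * seconds
    total_bytes = total_bits / 8
    return total_bits, total_bytes / 10**9, total_bytes / 2**30


bits, gb, gib = storage_for(1280 * 720 * 24 * 30, 3600)
print(bits)           # 2388787200000
print(round(gb, 1))   # 298.6
print(round(gib, 1))  # 278.1
```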

24 of 136

Review

image, pixel, bit depth, resolution, display aspect ratio, pixel aspect ratio, video, frame rate, interlaced, progressive, bitrate, CBR, VBR, ABR

25 of 136

From the world to the bits

26 of 136

How are images captured? CCD sensor

27 of 136

How are images captured? CMOS sensor (APS)

Uses less power

Transmits data faster than CCD

Cheaper

Most common in cell phone cameras and web cameras

28 of 136

How are images captured?

Color filter: Bayer array

Filters the 3 primary colors

Sensor

29 of 136

Bayer Demosaicing

30 of 136

Bayer Demosaicing

31 of 136

Bayer Demosaicing

32 of 136

Redundancy Removal

33 of 136

What can we do?

compress repetitions within the frame

exploit our vision

reduce repetitions in time

34 of 136

Exploiting our vision

35 of 136

Color models

36 of 136

Colors

37 of 136

Our eyes - an oversimplification

38 of 136

Our eyes - an oversimplification

39 of 136

We see luma (brightness) better than color

40 of 136

Color space YUV (YCbCr, YPbPr)

Y (luma)

U (chroma blue)

V (chroma red)

41 of 136

42 of 136

From RGB to YCbCr

Y = 0.299R + 0.587G + 0.114B

C_b = 0.564(B - Y) | C_r = 0.713(R - Y)

From YCbCr to RGB

R = Y + 1.402C_r | B = Y + 1.772C_b | G = Y - 0.344C_b - 0.714C_r

*ITU-R BT.601-7
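The two sets of equations can be sketched directly in code. This assumes floating-point values with chroma centered on zero (no 128 offset, no clamping or rounding to 8 bits), which is a simplification; the function names are illustrative.

```python
def rgb_to_ycbcr(r: float, g: float, b: float) -> tuple[float, float, float]:
    """Apply the forward equations with the coefficients listed above."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.564 * (b - y)
    cr = 0.713 * (r - y)
    return y, cb, cr


def ycbcr_to_rgb(y: float, cb: float, cr: float) -> tuple[float, float, float]:
    """Apply the inverse equations; the result is approximate because the
    published coefficients are rounded."""
    r = y + 1.402 * cr
    g = y - 0.344 * cb - 0.714 * cr
    b = y + 1.772 * cb
    return r, g, b


y, cb, cr = rgb_to_ycbcr(255, 0, 0)
print(ycbcr_to_rgb(y, cb, cr))  # close to (255, 0, 0)
```

A pure gray pixel (R = G = B) ends up with both chroma values at roughly zero, which is why the chroma planes of a grayscale picture carry almost no information.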

43 of 136

From RGB to YCbCr

It depends on the recommendation (standard) published by each group.

Standards	Y (Luma)	Chroma B	Chroma R
SDTV with BT.601 (Rec. 601)	Y=0.299R+0.587G+0.114B	U=0.492(B-Y)	V=0.877(R-Y)
HDTV with BT.709 (Rec. 709)	Y=0.2126R+0.7152G+0.0722B	...	...
UHDTV with BT.2020 (Rec. 2020)	Y=0.2627R+0.6780G+0.0593B	...	...

44 of 136

groups? ISO/IEC, ITU-R, JVT/JCT, AOM, MPEG-LA

45 of 136

recommendations? Rec. 601, Rec. 709, Rec. 2020

Recommendation	Resolutions	Frame Rate	Bit Depth	Chroma Sub
BT.601 (SDTV)	525i 625i	50 60	8	YCbCr 4:2:2
BT.709 (HDTV)	1080p 1080i	50 60 30 24	8 10	* YCbCr 4:2:2
BT.2020 (UHDTV)	2160p (3840×2160) 4320p (7680×4320)	120, 100, 60, 50, 30, 24	10 12	4:4:4, 4:2:2, and 4:2:0

Rec. 601

Rec. 709

Rec. 2020

46 of 136

Chroma subsampling YUV 4:4:4 4:2:2 4:2:0

Y (luma)

U (chroma blue)

V (chroma red)

47 of 136

Chroma subsampling YUV 4:4:4 4:2:2 4:2:0

Y (luma)

U

V

48 of 136

Chroma subsampling YUV 4:4:4 4:2:2 4:2:0

49 of 136

Chroma subsampling 4:2:0

50 of 136

Chroma subsampling YUV420

1280

720

180

320

51 of 136

Chroma subsampling

52 of 136

Space needed to store 1h of video at 720p 30fps

without chroma subsampling (24 bits per pixel)

WIDTH * HEIGHT * BITS_PER_PIXEL * FPS

1280 * 720 * 24 * 30

663,552,000 bits per second (663.552 Mbps)

2,388,787,200,000 bits per hour (about 278 GiB)

with chroma subsampling YUV420 (12 bits per pixel)

WIDTH * HEIGHT * BITS_PER_PIXEL * FPS

1280 * 720 * 12 * 30

331,776,000 bits per second (331.776 Mbps)

1,194,393,600,000 bits per hour (about 139 GiB)
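Where the 12 bits per pixel come from can be sketched as follows, assuming 8 bits per sample: luma is kept for every pixel, while each of the two chroma planes keeps only a fraction of its samples. The table and function names are illustrative.

```python
from fractions import Fraction

# Fraction of chroma samples kept per chroma plane, relative to luma.
CHROMA_FRACTION = {
    "4:4:4": Fraction(1, 1),  # no subsampling
    "4:2:2": Fraction(1, 2),  # half the horizontal chroma resolution
    "4:2:0": Fraction(1, 4),  # half horizontal and half vertical
}


def bits_per_pixel(scheme: str, bit_depth: int = 8) -> int:
    """Average bits per pixel: one luma sample plus two subsampled chroma samples."""
    kept = CHROMA_FRACTION[scheme]
    return int(bit_depth * (1 + 2 * kept))


for scheme in CHROMA_FRACTION:
    print(scheme, bits_per_pixel(scheme))  # 24, 16 and 12 bits per pixel

print(1280 * 720 * bits_per_pixel("4:2:0") * 30)  # 331776000
```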

53 of 136

Correlations in time

54 of 136

Frame types

One way to tackle the repetition between frames is to classify frames into types

An I‑frame is an 'Intra-coded picture', in effect a fully specified picture, like a conventional static image file. P‑frames and B‑frames hold only part of the image information, so they need less space to store than an I‑frame and thus improve video compression rates.

A P‑frame ('Predicted picture') holds only the changes in the image from the previous frame. For example, in a scene where a car moves across a stationary background, only the car's movements need to be encoded. The encoder does not need to store the unchanging background pixels in the P‑frame, thus saving space. P‑frames are also known as delta‑frames.

A B‑frame ('Bi-predictive picture') saves even more space by using differences between the current frame and both the preceding and following frames to specify its content.

55 of 136

Temporal redundancy (inter prediction)

original frames

I-frame

P-frame

I-frame

F1

F0

F2

F3

F4

encoded frames

56 of 136

Temporal redundancy

original frames

|||||||||| (103Kb)

||| (2Kb)

|||||||||| (103Kb)

F1

F0

F2

F3

F4

57 of 136

Temporal redundancy with motion estimation

I-frame

P-frame

F1

F0

motion estimation (motion vector) applied to previous frame = predicted frame

real frame - predicted frame = residual (prediction error)
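A toy sketch of the idea, assuming an exhaustive block search with the sum of absolute differences (SAD) as the matching cost. Real encoders use much smarter searches; the frame contents, block size and function names below are made up for illustration.

```python
def get_block(frame, top, left, size):
    """Cut a size x size block out of a frame stored as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def estimate_motion(previous, current, top, left, size, search=2):
    """Find the offset (dy, dx) into the previous frame that best matches
    the current block at (top, left)."""
    target = get_block(current, top, left, size)
    height, width = len(previous), len(previous[0])
    best_cost, best_vector = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            t, l = top + dy, left + dx
            if t < 0 or l < 0 or t + size > height or l + size > width:
                continue  # candidate block falls outside the frame
            cost = sad(target, get_block(previous, t, l, size))
            if best_cost is None or cost < best_cost:
                best_cost, best_vector = cost, (dy, dx)
    return best_vector


def residual(previous, current, top, left, size, vector):
    """Real block minus the block predicted from the previous frame."""
    predicted = get_block(previous, top + vector[0], left + vector[1], size)
    real = get_block(current, top, left, size)
    return [[r - p for r, p in zip(real_row, pred_row)]
            for real_row, pred_row in zip(real, predicted)]


def make_frame(top, left, size=6, value=200):
    """A black frame with a bright 2x2 square at (top, left)."""
    frame = [[0] * size for _ in range(size)]
    for r in range(top, top + 2):
        for c in range(left, left + 2):
            frame[r][c] = value
    return frame


previous = make_frame(1, 1)
current = make_frame(2, 3)  # the square moved 1 down and 2 right
vector = estimate_motion(previous, current, top=2, left=3, size=2)
print(vector)                                         # (-1, -2)
print(residual(previous, current, 2, 3, 2, vector))   # [[0, 0], [0, 0]]
```

Only the motion vector and the (here all-zero) residual need to be stored for this block, instead of its pixels.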

58 of 136

Temporal redundancy (B frames)

original frames

I-frame

P-frame

B-frame

I-frame

F1

F0

F2

F3

F4

59 of 136

Correlations in space

60 of 136

Lots of similarities

61 of 136

Lots of similarities

62 of 136

Spatial redundancy (intra prediction)

unknown values

100	100	100	200
100	???	???	???
100	???	???	???
100	???	???	???

predicted values (direction of the prediction: top to bottom)

100	100	100	200
100	100	100	200
100	100	100	200
100	100	100	200

real values

100	100	100	200
100	100	100	200
100	100	100	200
100	100	120	210

difference (highly compressible)

100	100	100	200
100	0	0	0
100	0	0	0
100	0	20	10
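The example above can be reproduced with a small sketch of vertical intra prediction: the already known row above the block is copied downwards, and only the difference to the real values is kept. The variable and function names are illustrative, and a real codec would choose among several prediction directions.

```python
def predict_vertical(top_row, height):
    """Predict every row of the block as a copy of the known row above it."""
    return [list(top_row) for _ in range(height)]


def difference(real, predicted):
    """Residual: real values minus predicted values."""
    return [[r - p for r, p in zip(real_row, pred_row)]
            for real_row, pred_row in zip(real, predicted)]


top_row = [100, 100, 200]      # known neighbors above the unknown 3x3 block
real_block = [
    [100, 100, 200],
    [100, 100, 200],
    [100, 120, 210],
]

predicted = predict_vertical(top_row, height=3)
print(difference(real_block, predicted))
# [[0, 0, 0], [0, 0, 0], [0, 20, 10]]
```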

63 of 136

Spatial redundancy (intra prediction) H264

64 of 136

CODEC - enCOder / DECoder

65 of 136

CODEC

“A codec is a device or computer program for encoding or decoding a digital data stream or signal.”

66 of 136

CODEC (VP9, H265) vs Container (.WEBM,.MP4)

Source https://xiph.org/video/vid1.shtml

67 of 136

Container vs CODEC

Containers

OGG
MP4
WMA
AVI
MKV, WebM
TS
MOV

CODEC

H264 / AVC
H265 / HEVC
MPEG-4
VP9
AV1
Theora
Daala

68 of 136

69 of 136

History

70 of 136

Patents all around

“Transform Coding of Image Difference Signals”

US patent 3679821, filed April 1970 and issued July 1972

“Motion vector estimation in television images”

US 4864393

“Block transform and quantization for image and video coding”

US 6882685

“Method and apparatus for binarization and arithmetic coding of a data value”

US 6900748

https://www.vcodex.com/video-compression-patents/

71 of 136

Patents all around (joke)

72 of 136

Alliance for Open Media - AV1

VP10

Thor

Daala

H.265 licensing has historically been extremely expensive.

Microsoft would have had to pay hundreds of millions of dollars for H.265

HEVC Advance has changed the licensing policy, but it may be too late.

The Alliance for Open Media is founded by leading Internet companies focused on developing next-generation media formats, codecs and technologies.

Day one founding members are Amazon, Cisco, Google, Intel Corporation, Microsoft, Mozilla and Netflix.

Google, Microsoft, Amazon, Netflix, Cisco, Mozilla and others are developing a royalty-free alternative (under the name "Alliance for Open Media") so that online video can never be held hostage again; together they hold the browsers and mobile platforms.

Although the alliance was formed in September 2015, it only announced the AV1 codec on April 5, 2016.

73 of 136

Alliance for Open Media - AV1

Interoperable and open;
Optimized for the Internet;
Scalable to any modern device at any bandwidth;
Designed with a low computational footprint and optimized for hardware;
Capable of consistent, highest-quality, real-time video delivery; and
Flexible for both commercial and non-commercial content, including user-generated content.

74 of 136

Hybrid motion compensated CODEC

picture partitioning -> predictions -> transform -> quantization -> entropy coding

redundancy removal (predictions, transform): dct, dwt, intra-prediction, inter-prediction, motion estimation / compensation

entropy reduction (quantization): linear, logarithm

lossless compression (entropy coding): huffman, lzw ...

75 of 136

CODEC


76 of 136

Frame partitioning

slices

77 of 136

Fixed vs Variable block size

78 of 136

CODEC


79 of 136

Motion estimation Inter|Intra-prediction

I-frame

P-frame

B-frame

I-frame

direction of the prediction

100	100	100	200
100	100	100	200
100	100	100	200
100	100	100	200

80 of 136

CODEC


81 of 136

Transform

Double [3] f(x): x + x => [ 6]

Plus10 [3] f(x): x + 10 => [ 13]

Divide2 [3] f(x): x / 2 => [1.5]

82 of 136

Transform (DCT)

83 of 136

https://www.iem.thm.de/telekom-labor/zinke/mk/mpeg2beg/whatisit.htm

84 of 136

Transform (DCT[123], DWT, KLT, FFT, lapped…)

DEMO https://github.com/leandromoreira/digital_video_introduction/blob/master/dct_experiences.ipynb

Most of the signal information tends to be concentrated in a few low-frequency components of the DCT.

H264 uses a very simple transform on 4x4 and 8x8 blocks: an approximation of the DCT that needs only simple operations such as additions and subtractions, so it is faster.

The Fourier transform works with complex numbers, while the DCT works with real numbers (computationally faster). Convolution in the spatial domain is multiplication in the frequency domain.

In the DCT the signal is decomposed into a sum of cosines, as opposed to the Discrete Fourier transform (DFT), where the signal is decomposed into a sum of sines and cosines.

To compress with the DCT, keep the low frequencies and discard the high frequencies: we keep the DCT coefficients that are within a certain distance from the upper left corner of a block, and set the remaining values to zero.
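A minimal 1D sketch of that idea, assuming the orthonormal DCT-II and its inverse written naively with the standard library (video codecs apply a 2D transform on blocks and use fast integer approximations). The sample values and the helper names are made up for illustration.

```python
import math


def dct(samples):
    """Orthonormal DCT-II of a list of samples."""
    n = len(samples)
    coefficients = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(samples))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        coefficients.append(scale * total)
    return coefficients


def idct(coefficients):
    """Inverse of dct(): rebuild the samples from the coefficients."""
    n = len(coefficients)
    samples = []
    for i in range(n):
        total = coefficients[0] * math.sqrt(1 / n)
        total += sum(c * math.sqrt(2 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                     for k, c in enumerate(coefficients) if k > 0)
        samples.append(total)
    return samples


def keep_low_frequencies(coefficients, keep):
    """Zero every coefficient except the first `keep` (lowest frequencies)."""
    return [c if k < keep else 0.0 for k, c in enumerate(coefficients)]


samples = [100, 102, 104, 106, 108, 110, 112, 114]   # a smooth ramp
coefficients = dct(samples)
approximation = idct(keep_low_frequencies(coefficients, keep=2))
print([round(c, 1) for c in coefficients])
print([round(x) for x in approximation])
```

For a smooth signal like this ramp, most of the energy sits in the first two coefficients, so the reconstruction from only those two stays close to the original samples.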

https://people.xiph.org/~xiphmont/demo/daala/demo1.shtml

Although it looks hard (show formula) it’s easier to implement than you think

Transform - show demo DCT (and gist)

https://gist.github.com/leandromoreira/9bb7b519173ba5158b5b4b213c46d8fa

I think the genius is to come up with it.

85 of 136

CODEC


86 of 136

Quantization over DCT (uniform, linear, logarithm)

120	40	1	0
45	3	0	0
-5	0	0	1
0	0	-2	0

Qstep (10)

12	4	0	0
4	0	0	0
0	0	0	0
0	0	0	0

120	40	0	0
40	0	0	0
0	0	0	0
0	0	0	0

Qstep (10)

12	4	0	0
5	0	0	0
0	0	0	0
0	0	0	0
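A sketch of uniform quantization with a single step size. It assumes truncation toward zero, which reproduces the first quantized block above (45 / 10 becomes 4); the last block on the slide shows 5 in that position, which suggests rounding instead. Function names are illustrative.

```python
def quantize(block, qstep):
    """Divide each coefficient by the step and truncate toward zero (lossy)."""
    return [[int(c / qstep) for c in row] for row in block]


def rescale(block, qstep):
    """Decoder side: multiply back by the step; the lost detail is gone."""
    return [[c * qstep for c in row] for row in block]


coefficients = [
    [120, 40, 1, 0],
    [45, 3, 0, 0],
    [-5, 0, 0, 1],
    [0, 0, -2, 0],
]

quantized = quantize(coefficients, 10)
print(quantized)               # [[12, 4, 0, 0], [4, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(rescale(quantized, 10))  # [[120, 40, 0, 0], [40, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```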

87 of 136

CODEC


88 of 136

Entropy coding quantized DCT

000010001110

010111101101

*CAVLC example

frequent symbols table

-1,1

trailing zeroes

...

zig-zag scan

(2D to 1D)

lossless compress

coded block
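The zig-zag scan step can be sketched as below: the 2D block is read along its anti-diagonals so that the low-frequency coefficients come first and the zeros pile up at the end. This is only the scan plus a trailing-zero count, not an implementation of CAVLC; the function names are illustrative.

```python
def zigzag(block):
    """Read a square block along its anti-diagonals, alternating direction."""
    n = len(block)
    scanned = []
    for s in range(2 * n - 1):              # s = row + column of the diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:
            rows = reversed(rows)           # even diagonals run bottom-left to top-right
        scanned.extend(block[r][s - r] for r in rows)
    return scanned


def split_trailing_zeros(sequence):
    """Return the values before the final run of zeros and the length of that run."""
    end = len(sequence)
    while end > 0 and sequence[end - 1] == 0:
        end -= 1
    return sequence[:end], len(sequence) - end


quantized = [
    [12, 4, 0, 0],
    [4, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

scanned = zigzag(quantized)
print(scanned[:4])                    # [12, 4, 4, 0]
print(split_trailing_zeros(scanned))  # ([12, 4, 4], 13)
```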

89 of 136

CODEC


90 of 136

Space needed to store 1h of video at 720p 30fps

with chroma subsampling YUV420 (12 bits per pixel)

WIDTH * HEIGHT * BITS_PER_PIXEL * FPS

1280 * 720 * 12 * 30

331,776,000 bits per second (331.776 Mbps)

1,194,393,600,000 bits per hour (about 139 GiB)

with H264 (chroma subsampling, motion estimation, intra prediction, CABAC…)

WIDTH * HEIGHT * BITS_PER_PIXEL * FPS

1280 * 720 * 0.031 * 30

857,088 bits per second (about 837 Kbps)

3,085,516,800 bits per hour (about 367.82 MiB)

91 of 136

CODEC and patents

picture partitioning

predictions

transform

quantization

entropy coding

US 4864393

US 6882685

* US 6900748

*it seems to be expired by 2000s

92 of 136

Bitstream format

Network Abstraction Layer (NAL): the NAL unit header (1 byte in H264, 2 bytes in H265) carries the unit type; the rest is the payload.

NAL units can be categorized as VCL (video coding layer) and non-VCL. The NAL was created so that video can be stored and transmitted everywhere (Internet, satellite, etc.).

NAL units can be carried in a packet-oriented or a byte-stream-oriented way.

In the byte stream format, each NAL unit is prefixed by a specific pattern of three bytes called a start code prefix.

In the packet-oriented format, the coded data is carried in packets that are framed by the system transport protocol, without the start code prefix.

The parameters are also sent as NAL units (SPS, PPS):

SPS (sequence parameter set): applies to a series of consecutive coded video pictures (such as resolution…)

PPS (picture parameter set): applies to the decoding of one or more individual pictures (such as the entropy coding mode…)

Each VCL NAL unit contains an identifier that refers to the content of the relevant PPS and each PPS contains an identifier that refers to the content of the relevant SPS.
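A rough sketch of reading an H264 byte stream: split on the three-byte start code prefix (0x000001) and take the unit type from the low 5 bits of the one-byte header. The sample bytes are made up, only a few type values are listed, and emulation prevention bytes inside the payload are ignored, so treat this as an illustration rather than a parser.

```python
START_CODE = b"\x00\x00\x01"

# A few H264 nal_unit_type values; the full table is in the specification.
NAL_TYPE_NAMES = {1: "non-IDR slice", 5: "IDR slice", 7: "SPS", 8: "PPS"}


def split_nal_units(stream: bytes) -> list[bytes]:
    """Split a byte stream into NAL units, dropping the start code prefixes."""
    units = []
    start = stream.find(START_CODE)
    while start != -1:
        begin = start + len(START_CODE)
        following = stream.find(START_CODE, begin)
        end = len(stream) if following == -1 else following
        # Zero bytes right before the next prefix belong to a 4-byte start code.
        unit = stream[begin:end].rstrip(b"\x00")
        if unit:
            units.append(unit)
        start = following
    return units


def h264_nal_type(unit: bytes) -> int:
    """The unit type is stored in the low 5 bits of the first header byte."""
    return unit[0] & 0x1F


# Made-up stream: an SPS, a PPS and an IDR slice with dummy payload bytes.
stream = (b"\x00\x00\x00\x01\x67\x42\x00\x1f"
          b"\x00\x00\x00\x01\x68\xce\x3c\x80"
          b"\x00\x00\x01\x65\x88\x84")

for unit in split_nal_units(stream):
    nal_type = h264_nal_type(unit)
    print(nal_type, NAL_TYPE_NAMES.get(nal_type, "other"))
# 7 SPS
# 8 PPS
# 5 IDR slice
```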

93 of 136

Bitstream format

94 of 136

Hybrid motion compensated encoding

95 of 136

Hybrid motion compensated decoding

96 of 136

H264 vs H265

97 of 136

HEVC @ 2Mbps

AVC @ 4Mbps

98 of 136

AVC @ 400kbps

HEVC @ 400kbps

99 of 136

H265 takes advantage of more powerful CPUs. It brings larger and more adaptable partitions (which make it work even better at higher resolutions), 35 intra prediction modes, progressive-only coding and more advanced entropy coding (CABAC). Every area was improved in order to meet the rule of thumb of being 50% better than the previous CODEC.

H265 also introduced tiles as another way of partitioning frames, which enables parallel computation.

On the other hand, it is even more expensive than H264 in terms of royalties, and companies such as Microsoft, Google, Netflix, Mozilla and others are moving to AOM (AV1) and skipping H265, or using WebM/VP9, which works in most browsers.

H264 vs H265 - show ivpa with both codecs

HEVC replaces macroblocks, which were used in previous video standards, with CTUs, which can use larger block structures of up to 64×64.

100 of 136

Video streaming

101 of 136

General Video Delivery Architecture

ingest

origin

CDN (frontend, caching)

encoder

102 of 136

Content distribution

103 of 136

Progressive download

full_video.mp4

time

Range: bytes=0-299

HTTP 206

Range: bytes=300-499

Range: bytes=500-999

HTTP 206

104 of 136

Adaptive bitrate streaming

manifest

time

2G

HTTP 200

2G

wifi

HTTP 200

s480p_01.mp4

s480p_02.mp4

wifi

HTTP 200

s1080p_03.mp4
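How a player might pick the next segment can be sketched as follows. The bitrate ladder, the 0.8 safety factor and the measured bandwidths are all hypothetical values chosen for illustration; only the segment naming follows the slide.

```python
# Hypothetical bitrate ladder: (rendition name, bits per second).
LADDER = [("s480p", 1_200_000), ("s720p", 2_500_000), ("s1080p", 5_000_000)]


def pick_rendition(bandwidth_bps, ladder=LADDER, safety=0.8):
    """Choose the best rendition that fits within a fraction of the measured bandwidth."""
    budget = bandwidth_bps * safety
    affordable = [item for item in ladder if item[1] <= budget]
    if not affordable:
        # Nothing fits: fall back to the lowest bitrate available.
        return min(ladder, key=lambda item: item[1])[0]
    return max(affordable, key=lambda item: item[1])[0]


def segment_name(rendition, index):
    """Build a segment file name such as s480p_01.mp4."""
    return f"{rendition}_{index:02d}.mp4"


measured = [300_000, 300_000, 20_000_000]   # 2G, 2G, then wifi
for index, bandwidth in enumerate(measured, start=1):
    print(segment_name(pick_rendition(bandwidth), index))
# s480p_01.mp4
# s480p_02.mp4
# s1080p_03.mp4
```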

105 of 136

Adaptive bitrate streaming (hls)

source: https://www.encoding.com/http-live-streaming-hls/

106 of 136

Content protection

107 of 136

Token + CORS + TLS (https)

API

token=CAFE

video=3& cookie

CDN

video=3

HTTP 403

CDN

video=3&t=CAFE

HTTP 200

time

108 of 136

DRM (widevine, playready, fairplay)

new_video.mp4

encoding

DRM servers: Apple, Microsoft, Google

CDN

dash_encrypted_new_video.mp4

109 of 136

Encoding parameters: the whys

110 of 136

CBR vs VBR

Live: CBR. The biggest problem is bandwidth and VBR might cause lots of rebuffering; latency is critical and small hiccups are acceptable.

VOD progressive download: “constrained” VBR, min 50% of TARGET, max 200%.

VOD adaptive streaming: “constrained” VBR, min 85-100% of TARGET, max 125-150%.
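The constraints can be turned into concrete limits with a small helper. The 1.2 Mbps target and the choice of 85% / 150% within the quoted ranges are assumptions for illustration.

```python
def constrained_vbr(target_bps, min_ratio, max_ratio):
    """Return the (minimum, maximum) bitrate allowed around a target."""
    return round(target_bps * min_ratio), round(target_bps * max_ratio)


target = 1_200_000  # hypothetical target bitrate in bits per second

print(constrained_vbr(target, 0.50, 2.00))  # progressive download: (600000, 2400000)
print(constrained_vbr(target, 0.85, 1.50))  # adaptive streaming: (1020000, 1800000)
```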

111 of 136

112 of 136

Profiles for iOS-like >= High 3.1

https://developer.apple.com/library/content/technotes/tn2224/_index.html

113 of 136

Profiles for Android-like >= High 3.1

114 of 136

Frame types and GOP

P- and B-frames are lighter, but decoding them requires looking up other frames, backward or forward.

Set the keyframe (I-frame) interval to 2, 3, 4 or 5 seconds; otherwise, in VBR, you’re just wasting resources.

Align your keyframes with your chunk size: the chunk duration should be a multiple of the keyframe interval. E.g. for chunks of 6s, insert an I-frame every 1s or 2s.

Turn off “keyframe scene detection”.

Say yes to B-frames; the “magic number” is between 3 and 4.

115 of 136

B-Frame magic

116 of 136

Bits per pixel

What is the resolution for 2.5Mbps?

Let’s try:

Height: 393, Width: 720, Pixels: 282960, Bitrate: 2500 Kbps, FPS: 30

BitsPerPixel = Bitrate / (Pixels * FPS)

2500000 / (282960 * 30) = 0.2945 (approximately)
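The formula can be checked with a short sketch. It takes the bitrate in bits per second (2.5 Mbps = 2,500,000) and divides by the number of pixels shown per second; the function name is illustrative.

```python
def bits_per_pixel(bitrate_bps: int, width: int, height: int, fps: int) -> float:
    """Average number of bits available for each pixel of each frame."""
    return bitrate_bps / (width * height * fps)


print(round(bits_per_pixel(2_500_000, 720, 393, 30), 4))  # 0.2945
```

Note the parentheses: dividing by the pixel count and then multiplying by the frame rate gives a different, incorrect number.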

117 of 136

lossless compression - entropy encoding

CABAC - more efficient, CPU intensive (battery and so on) [main, high profile]

CAVLC - less efficient, CPU less intensive

118 of 136

Apple TN2224

119 of 136

Bonus: audio codec

120 of 136

Analog audio conversion

121 of 136

Sampling (8,11, 32, 44.1, 48, 50, 88, 96, 192... kHz)

122 of 136

Bit depth (16, 24, 32, 64... bits) quantization

123 of 136

Channels 2

124 of 136

Channels 16.2

125 of 136

PCM encoder

126 of 136

AAC CODEC block

127 of 136

References

128 of 136

Links
