1 of 38

VGGnet: Video Graphic Generation network

2 of 38

Team

Cohort 9: 박찬혁

Cohort 11: 최가윤

Cohort 12: 박승호

Cohort 12: 유선재

Cohort 12: 제갈건

3 of 38

Project goal

Short video generation

Quality
Multimodality
Time

4 of 38

Expected output

Text

Audio

Image

Video

5 of 38

Show-1
ImageBind
Binding Network
Experimental results

6 of 38

Show-1 (Baseline)

SoTA in text-to-video generation

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

Toad practicing karate

A burning lamborghini on the road.

Giant octopus invades new york city.

7 of 38

Show-1 (Baseline)

Pixel based diffusion + Latent based diffusion

Pixel-based: high time and cost, good text-video alignment

Latent-based: low time and cost, poor text-video alignment

Improved performance + inference memory reduced from 72G (existing models) to 15G

8 of 38

Show-1 (Baseline)

1. Generate keyframes from the text (8 frames, fps=2)

2. Interpolate between keyframes (fps=7.5)

Image source: Make-A-Video

9 of 38

Show-1 (Baseline)

3. First super-resolution (64*40 -> 256*160)

4. Second super-resolution (256*160 -> 576*320)

Image source: Make-A-Video
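The four stages above can be sketched as a small shape calculation. This is only an illustration, not the Show-1 code: the helper `interpolated_frame_count` and the choice of 3 inserted frames per gap are assumptions, picked because they turn 8 keyframes into the 29 frames shown on the stage-by-stage slides. The resolutions are taken from steps 3 and 4.

```python
def interpolated_frame_count(keyframes: int, inserted_per_gap: int) -> int:
    """Frames after inserting new frames between each adjacent keyframe pair."""
    return keyframes + (keyframes - 1) * inserted_per_gap


KEYFRAMES = 8
# Assumption: 3 frames are inserted per gap (8 + 7 * 3 = 29).
FRAMES = interpolated_frame_count(KEYFRAMES, inserted_per_gap=3)

# (stage name, frames, width, height)
stages = [
    ("keyframes", KEYFRAMES, 64, 40),
    ("interpolation", FRAMES, 64, 40),
    ("super-resolution 1", FRAMES, 256, 160),
    ("super-resolution 2", FRAMES, 576, 320),
]

for name, frames, width, height in stages:
    print(f"{name:>20}: {frames} x {width} x {height}")

# Spatial scale factors of the two super-resolution stages (width, height).
sr1_scale = (256 / 64, 160 / 40)
sr2_scale = (576 / 256, 320 / 160)
print("SR1 scale:", sr1_scale)
print("SR2 scale:", sr2_scale)
```

The first super-resolution stage scales both axes by 4, while the second scales width and height by slightly different factors.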

10 of 38

Show-1 (Baseline)

Base (8*64*40)

Interpolation (29*64*40)

SR1 (29*256*160)

SR2 (29*576*320)

“A burning lamborghini driving on rainbow.”

11 of 38

Show-1 (Baseline)

Base (8*64*40)

Interpolation (29*64*40)

SR1 (29*256*160)

SR2 (29*576*320)

“sleeping shrews in small bed.”

12 of 38

ImageBind (Baseline)

Text
Audio
Image
HeatMap
Depth
IMU

Represent six modalities in a single embedding space!

13 of 38

ImageBind (Baseline)

Represent six modalities in a single embedding space!

Text

Audio

Image

[1, 1024]

[77, 1024]

[1, 1024]

Linear

14 of 38

Our Task

T5 Embedding

CLIP Embedding

[ 77 * 4096 ]

[ 77 * 1024 ]

ImageBind

[ 1 * 1024 ]

15 of 38

Our Task

Option 1

Swap the embedding model and retrain each layer

T5 Embedding

CLIP Embedding

[ 77 * 4096 ]

[ 77 * 1024 ]

ImageBind

[ 1 * 1024 ]

16 of 38

Our Task

T5 Embedding

CLIP Embedding

[ 77 * 4096 ]

[ 77 * 1024 ]

ImageBind

[ 1 * 1024 ]

Option 1

Swap the embedding model and retrain each layer

Requires retraining all four diffusion models

The first model alone took 72 hours of training on 48 A100s…

17 of 38

Our Task

T5 Embedding

CLIP Embedding

[ 77 * 4096 ]

[ 77 * 1024 ]

ImageBind

[ 1 * 1024 ]

Option 2

Map the ImageBind embedding space

onto the T5 and CLIP embedding spaces

Mapping

=

18 of 38

Mapping Network

T5 Embedding

ImageBind

Mapping

[ 1 * 1024 ]

[ 77 * 4096 ]

Loss

[ 77 * 4096 ]

❄

Text

Train so that the output of ImageBind + mapping model matches the T5 embedding output

2.4 million samples

Data

embedding-embedding

dataset construction

19 of 38

Mapping Network

T5 Embedding

ImageBind

Mapping

[ 1 * 1024 ]

[ 77 * 4096 ]

Loss

[ 77 * 4096 ]

❄

Text

Train so that the output of ImageBind + mapping model matches the T5 embedding output

Audio

3,000 samples

Data

embedding-embedding

dataset construction

20 of 38

Mapping Network

T5 Embedding

ImageBind

Mapping

[ 1 * 1024 ]

[ 77 * 4096 ]

Loss

[ 77 * 4096 ]

❄

Text

Train so that the output of ImageBind + mapping model matches the T5 embedding output

Image

27,000 samples

Data

embedding-embedding

dataset construction

21 of 38

Mapping Network

T5 Embedding

[ 77 * 4096 ]

ImageBind

[ 1 * 1024 ]

Mapping

Linear
1DConv
Transformer
Residual
…
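To give a feel for why the choice of mapping architecture matters, the sketch below counts the parameters of the simplest candidate: one fully connected layer from the ImageBind vector to the flattened T5 target. The helper `linear_param_count` is hypothetical and only does arithmetic; it is not the network used in the project.

```python
def linear_param_count(in_features: int, out_features: int, bias: bool = True) -> int:
    """Weights (and optional biases) of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


IMAGEBIND_DIM = 1024          # [1 * 1024]
T5_TOKENS, T5_DIM = 77, 4096  # [77 * 4096]

# Flatten the T5 target so one linear layer can predict every token at once.
flat_target = T5_TOKENS * T5_DIM
params = linear_param_count(IMAGEBIND_DIM, flat_target)

print(f"flattened target size: {flat_target:,}")
print(f"single linear layer:   {params:,} parameters")
```

Under these assumptions a single dense layer already needs roughly 323 million parameters, which is one reason to consider the other candidates in the list.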

22 of 38

Mapping Network

ImageBind-LLM: Multi-modality Instruction Tuning

Although its purpose and usage differ considerably from our task,

our experiments indicated it is better than a simple linear model in training speed and cost

23 of 38

Mapping Network

Dog barking

Birdsong

Car horn

Outputs are nearly identical even when the inputs differ

24 of 38

Mapping Network

Siren

Burning firewood

Truck

The same behavior appears even after changing the model architecture and reducing the batch size

Birdsong

25 of 38

Mapping Network

Is the model converging to one particular value that keeps the loss reasonably small for every input…?
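A toy calculation shows why such a collapse is plausible. It assumes a mean-squared-error loss, which the slides do not state explicitly. If a model ignores its input and emits one constant vector, the constant that minimizes the average MSE is the mean of the targets. The 2-dimensional vectors below are made-up stand-ins for T5 embeddings.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


# Toy targets standing in for T5 embeddings (hypothetical values).
targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def average_loss(constant_output):
    """Average loss when the model emits the same vector for every input."""
    return sum(mse(constant_output, t) for t in targets) / len(targets)


# Element-wise mean of the targets.
mean_target = [sum(column) / len(targets) for column in zip(*targets)]

collapsed = average_loss(mean_target)
alternatives = [average_loss(c) for c in ([0.0, 0.0], [1.0, 1.0], [0.5, 0.5])]

print("loss at the target mean:", collapsed)
print("loss at other constants:", alternatives)
```

The mean gives the lowest loss of all constant outputs, so an optimizer can settle there without ever using the input.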

26 of 38

Mapping Network

[ 77 * 4096 ]

The predicted T5 values must differ from one another so that clearly distinct

ImageBind outputs can be reconstructed from them.

Mutual Information

T5 Embedding

ImageBind

[ 1 * 1024 ]

Problem: the predicted T5 values always come out similar, regardless of the ImageBind values.

Prediction

Reprojection

27 of 38

Mapping Network

Mapping

Value used in practice

Loss

T5 Embedding

❄

ImageBind

❄

Text

Reprojection

Loss

Value used only during training

[ 77 * 4096 ]

[ 1 * 1024 ]

Train so that different inputs produce different outputs

Mutual Information
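The two-loss setup on this slide can be sketched as follows. This is a toy under stated assumptions, not the project's training code: both terms are taken to be mean squared error, the weight of 1.0 is arbitrary, and the functions `collapsed_mapping`, `faithful_mapping`, and `reprojection` are hypothetical stand-ins for the learned networks. It shows that a mapping which ignores its input is penalized by the reprojection term, because the original input can no longer be recovered from its output.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def combined_loss(x, y, mapping, reprojection, weight=1.0):
    """Prediction loss plus a weighted reprojection (reconstruction) loss."""
    y_hat = mapping(x)            # value used in practice
    x_hat = reprojection(y_hat)   # value used only during training
    return mse(y_hat, y) + weight * mse(x_hat, x)


# Toy (input, target) pairs standing in for (ImageBind, T5) embeddings.
pairs = [([0.0], [0.0, 0.0]), ([1.0], [1.0, 1.0])]


def collapsed_mapping(x):
    return [0.5, 0.5]             # ignores the input entirely


def faithful_mapping(x):
    return [x[0], x[0]]           # output depends on the input


def reprojection(y_hat):
    return [(y_hat[0] + y_hat[1]) / 2]


collapsed_total = sum(
    combined_loss(x, y, collapsed_mapping, reprojection) for x, y in pairs
)
faithful_total = sum(
    combined_loss(x, y, faithful_mapping, reprojection) for x, y in pairs
)
print("collapsed mapping:", collapsed_total)
print("faithful mapping: ", faithful_total)
```

In this toy the collapsed mapping pays both the prediction and the reprojection penalty, while the input-dependent mapping reaches zero loss.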

28 of 38

Mapping Network

From early in training, different inputs produce clearly different outputs

Mutual Information

29 of 38

Mapping Network

Mutual Information

After many attempts…

30 of 38

Mapping Network

Mutual Information

Data covering text, audio, and image

Trained for 130 epochs on about 30,000 samples

31 of 38

Mapping Network

Mutual Information

Final training results - Audio

32 of 38

Mapping Network

Mutual Information

Final training results - Image

33 of 38

Mapping Network

Final training results - Text

Mutual Information

Happy family using laptop on bed at home

Beautiful lake aerial view

Beautiful young woman runs up

time lapse view cityscap

34 of 38

Discussion

1. Large dimensionality gap

T5 Embedding

[ 77 * 4096 ]

ImageBind

[ 1 * 1024 ]

Mapping

Token-wise embeddings cannot be taken into account

The size difference between the two embeddings is too large (308x)
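The 308x figure follows directly from the two shapes on this slide, as this short check shows:

```python
from math import prod

imagebind_shape = (1, 1024)   # one vector per input
t5_shape = (77, 4096)         # one vector per token

# Ratio of total element counts between the target and the source embedding.
ratio = prod(t5_shape) / prod(imagebind_shape)
print(ratio)  # 308.0
```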

35 of 38

Discussion

2. Limitations of ImageBind itself

Embeddings are close enough for retrieval,

but not accurate enough for one-to-one matching with text.

36 of 38

Discussion

3. Difficulty of embedding-space mapping

A model that maps one space onto the other perfectly would need nearly all the data used to build the embedding models

37 of 38

Discussion

If only there had been a video generation model with lower-dimensional embeddings

If only we had been able to collect data with as diverse a distribution as possible

38 of 38

VGGnet: Video Graphic Generation network