1 of 59

Disentanglement and Composition for AGI

Xingyi Yang

National University of Singapore

xyang@u.nus.edu

Sep 30 2024

Joint work with Dr. Jingwen Ye, Prof. Xinchao Wang and Prof. Shuicheng Yan

2 of 59

How do we define Artificial General Intelligence?

3 of 59

Our Narrow Scope

Program that can sense,

feel, reason, plan …

Pass the Turing Test

Matches or surpasses human capabilities

Model that generalizes

Model memorizes enough data

4 of 59

Our Narrow Scope

Program that can sense,

feel, reason, plan …

Pass the Turing Test

Matches or surpasses human capabilities

Model that generalizes

Model memorizes enough data

Great

But

Boring

5 of 59

Generalization

Learn in some cases, and apply it in unseen cases.

- Composition is a good way!

6 of 59

Milan Cathedral is Unique for its Compositionality

7 of 59


Generalization

8 of 59

Outline

  • Preliminary: What is AGI and why D&C for AGI?
  • Trends in AGI models are D&C
    • Trend 1: OpenAI o1 🌟
    • Trend 2: Unified Models 🌟
  • AGI through Compositionality
    • Compositional Models: Train models that are composable.
    • Compositional Strategy: Given trained models & tools, we compose expertise.
    • Compositional Data: Build data that enables learning composable skills.

9 of 59

In fact, what we really need is generalization

Generalizing Outside Training Data

10 of 59

What do we know about generalization?

  • Traditional generalization bounds tell us:
    • More data is good
    • Simpler functions prevent overfitting
    • (Test distribution == Train distribution)

PAC bound: model fits the data well + simpler model + more data 😭
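A standard finite-hypothesis-class PAC bound (a textbook form, assumed here rather than taken from the slide) makes those three annotations explicit: with probability at least 1 − δ,

R(h) \;\le\; \hat{R}(h) \;+\; \sqrt{\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{2m}}

Here \hat{R}(h) is the empirical risk (model fits the data well), \ln|\mathcal{H}| measures hypothesis-class complexity (simpler model), and m is the number of training samples (more data).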

11 of 59

Why is D&C good for generalization?

1. Less Required Data: Factorize a complex distribution into factors

 

Compositional Generative Modeling: A Single Model is Not All You Need [ICML 2024]
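A sketch of why factorization reduces data needs (notation mine, not from the slide): if the distribution factorizes into independent factors, each factor is estimated separately instead of estimating the full joint.

p(x_1, \dots, x_K) \;=\; \prod_{k=1}^{K} p_k(x_k)

With each x_k taking |\mathcal{X}| values, the joint table has |\mathcal{X}|^K entries, while the factorized model needs only K \cdot |\mathcal{X}|, so far fewer samples suffice.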

12 of 59

Why is D&C good for generalization?

2. Simpler models at each component

Verifiable and Compositional Reinforcement Learning Systems [ICAPS 2022]

13 of 59

Why is D&C good for generalization?

3. Flexibility to test on distributions that were not trained on.

e.g. Caption = Detection + Object-to-Paragraph

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face [NeurIPS 2023]
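A minimal sketch of this kind of composition in the HuggingGPT spirit; detect_objects and objects_to_paragraph are hypothetical stand-ins for two pretrained models, not real APIs:

def detect_objects(image):
    """Hypothetical pretrained detector: returns object labels found in the image."""
    return ["dog", "frisbee", "park"]  # stub output for illustration

def objects_to_paragraph(objects):
    """Hypothetical object-to-paragraph model: turns labels into a description."""
    return "A scene containing " + ", ".join(objects) + "."

def caption(image):
    # Caption = Detection + Object-to-Paragraph:
    # a new capability composed from two models never trained jointly for captioning.
    return objects_to_paragraph(detect_objects(image))

print(caption(image=None))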

14 of 59

Trends in AGI are, in fact, D&C

15 of 59

16 of 59

Trend 1: OpenAI o1

Much more powerful than a pure LLM.

17 of 59

What is OpenAI o1?


18 of 59

Process-supervision (Train)

Process-supervision (supervision at intermediate steps) > RLHF (one final supervision at the end)

19 of 59

Chain-of-thought (Test)

20 of 59

Process-supervision (Train) + Chain-of-thought (Test)

21 of 59

o1 improves because it factorizes problems

22 of 59

Trend 2: Unified Models

Independent Models

Singular Models

Hybrid Models

Two dominant GenAI pipelines, hard to unify.

Any possible unification?

23 of 59

LLMs (ARs) learn to factorize the distribution

24 of 59

LLMs (ARs) learn to factorize the distribution
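The factorization in question is the standard autoregressive chain rule over tokens (written out here to fill in the slide's equation):

p(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right)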

 

25 of 59

(Less Known) Diffusion Models are Frequency AR

https://sander.ai/2024/09/02/spectral-autoregression.html

26 of 59

(Less Known) Diffusion Models are Frequency AR

https://sander.ai/2024/09/02/spectral-autoregression.html

27 of 59

(Less Known) Diffusion Models are Frequency AR

Diffusion Model made Slim (CVPR 2023)

Under mild assumptions:

Every step in a diffusion model is a Wiener filter

28 of 59

AR and Diffusion are unified because of D&C

  • Both models decompose data into smaller components.
  • A unified model seeks to modularize/decompose/factorize.

29 of 59

AGI through Compositionality

30 of 59

Compositional Model

Definition: Multiple models collaborate to produce an output.

Core: Sharing of tasks & composition of expertise

[Diagram] Singular model: Input → Model → Output. Horizontal composition: Input → {Model 1, Model 2, Model 3} in parallel → Output. Vertical composition: Input → Model 1 → Model 2 → Output.

31 of 59

Horizontal Compositional Model

 

 

32 of 59

CSC321: Neural Networks, Lecture 24: Products of Experts (Geoffrey Hinton)

33 of 59

Horizontal Compositional Model: Mixture

Examples:

  • Mixture of Gaussians
  • Mixture of Experts
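Both examples share the standard mixture form (generic notation, not copied from the slide): a weighted sum of component distributions.

p(x) \;=\; \sum_{k=1}^{K} \pi_k \, p_k(x), \qquad \pi_k \ge 0, \;\; \sum_{k} \pi_k = 1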

 

34 of 59

Horizontal: Mixture-of-Experts

  1. The router selects experts for the input.
  2. The input goes through the selected experts.
  3. The outputs are weighted and summed (see the sketch below).

Pro: Efficient training and inference

Adaptive Mixture of Local Experts [Neural Computation 1991]
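A minimal NumPy sketch of the routing described above: a router scores experts, the top-k experts process the input, and their outputs are combined with the renormalized router weights. Names and shapes are illustrative, not taken from any specific MoE implementation.

import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """x: (d,) input; router_w: (num_experts, d); experts: list of callables."""
    logits = router_w @ x                        # 1. router scores each expert
    top = np.argsort(logits)[-k:]                #    keep only the top-k experts (sparsity)
    weights = np.exp(logits[top])
    weights /= weights.sum()                     #    softmax over the selected experts
    outputs = [experts[i](x) for i in top]       # 2. input goes through the selected experts
    return sum(w * o for w, o in zip(weights, outputs))  # 3. weighted sum of expert outputs

# toy usage: 4 linear "experts" on an 8-dim input
rng = np.random.default_rng(0)
d, num_experts = 8, 4
experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(num_experts)]
router_w = rng.normal(size=(num_experts, d))
y = moe_forward(rng.normal(size=d), router_w, experts)
print(y.shape)  # (8,)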

35 of 59

Horizontal: MoE and Transformers

  • The FFN is replaced with an MoE layer.
  • Sparsity & load balancing.

36 of 59

Horizontal Compositional Model: Product

Energy-based Model: parameterize the distribution as an energy function

 

Composition as a sum of energies

 

Examples: Contrastive Energy Model, Score Denoising (Diffusion Model), Products of Experts
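In standard EBM notation (a generic sketch of the two equations above, not a specific paper's exact form): each expert defines a distribution through its energy, and adding energies multiplies the distributions.

p_k(x) \;\propto\; \exp\!\big(-E_k(x)\big), \qquad
p_{\text{product}}(x) \;\propto\; \exp\!\Big(-\sum_{k} E_k(x)\Big) \;\propto\; \prod_{k} p_k(x)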

37 of 59

Horizontal Compositional Model: Product

Case 1: Energy-based Model: composition as an energy operation

Compositional Visual Generation with Energy Based Models [NeurIPS 2020]

38 of 59

Horizontal Compositional Model: Product

Case 2: Diffusion Model (Classifier-Free Guidance)

Compositional Visual Generation with Composable Diffusion Models [ECCV 2022]
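The composed noise prediction used in this line of work looks roughly like the following (weighting conventions vary slightly across papers, so take this as a sketch): each concept c_i contributes a classifier-free-guidance term on top of the unconditional prediction.

\hat{\epsilon}(x_t, \{c_i\}) \;=\; \epsilon_\theta(x_t) \;+\; \sum_{i} w_i \big(\epsilon_\theta(x_t \mid c_i) - \epsilon_\theta(x_t)\big)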

39 of 59

Vertical Compositional Model

Factorize a full task into a sequence of subtasks, similar to a probabilistic graphical model (PGM)
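A generic two-stage instance of this factorization (notation mine): intermediate variables z carry the result of each subtask, as edges in a PGM.

p(y \mid x) \;=\; \int p(y \mid z_2)\, p(z_2 \mid z_1)\, p(z_1 \mid x)\; dz_1\, dz_2

In the text-to-image pipelines on the next slide, z_1 would be the text embedding and z_2 the image latent.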

 

40 of 59

Vertical Compositional Model: Text-to-Image

DALLE-2

Stable-Diffusion [CVPR 2022]

Text Encoder (Frozen) -> Diffusion Model -> Autoencoder (Frozen)

41 of 59

Vertical Compositional Model: MLLM

LLaVA [NeurIPS 2023]

MiniGPT-4

Domain (Image) Encoder (Frozen) -> LLM (Frozen)

42 of 59

Vertical Compositional Model: MLLM

Unified-IO

V1 [ICLR 2023]

V2 [CVPR 2024]

Domain Encoder -> Domain Decoder

43 of 59

Compositional Strategy

Given trained models/tools, how can we compose them to perform new tasks?

44 of 59

Compositional Prompting Techniques

45 of 59

Neural-symbolic Agent

  • Execute the neural network(s) using symbolic language rules to obtain the final output.
  • Often combined with an LLM.

46 of 59

Case 1: Visual Programming

The language model generates code-like logic and asks tools/models to execute it.
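A minimal sketch of the visual-programming pattern: the LLM emits a short program over a registry of tools, and a simple interpreter executes it. The tool names and the program text are hypothetical, not from the actual VisProg system.

# Hypothetical tool registry; in a real system each entry wraps a pretrained model.
TOOLS = {
    "detect": lambda image, query: [f"{query}@box0"],   # stub detector
    "count":  lambda boxes: len(boxes),                 # stub counter
}

# Program "generated by the LLM" for the question "How many dogs are in the image?"
program = [
    ("boxes", "detect", {"image": "IMAGE", "query": "dog"}),
    ("answer", "count", {"boxes": "boxes"}),
]

def execute(program, env):
    # Execute each step symbolically: look up the tool, resolve arguments
    # from the environment, and store the result under the output name.
    for out_name, tool, args in program:
        resolved = {k: env.get(v, v) for k, v in args.items()}
        env[out_name] = TOOLS[tool](**resolved)
    return env["answer"]

print(execute(program, {"IMAGE": "image.jpg"}))  # -> 1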

47 of 59

Case 2: LLM Agent

Compose an LLM with memory, tools, and actions.
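A bare-bones sketch of the agent loop this composition implies: the LLM proposes an action, a tool executes it, and the result is appended to memory before the next step. llm_propose_action is a hypothetical placeholder for an actual model call.

def llm_propose_action(goal, memory):
    """Hypothetical LLM call: pick the next tool and its input given the goal and memory."""
    return ("search", goal) if not memory else ("finish", memory[-1])

TOOLS = {"search": lambda q: f"top result for '{q}'"}  # stub tool

def run_agent(goal, max_steps=5):
    memory = []                                   # composition with memory
    for _ in range(max_steps):
        action, arg = llm_propose_action(goal, memory)
        if action == "finish":                    # the LLM decides when to stop
            return arg
        memory.append(TOOLS[action](arg))         # composition with tools/actions
    return memory[-1]

print(run_agent("latest results on compositional generalization"))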

48 of 59

Compositional Data

  • Collecting compositional data
  • Mixing diverse datasets for model training
  • Mixing prompts/targets in joint training
  • The collected data should best approximate the compositional nature of the world.

49 of 59

Collecting Compositional Data

  1. Collect real compositional data.
  2. Synthetic data (simulation).
  3. Augmented data (order swapping, negative meaning).

Relation Rectification [CVPR 2024]

CLEVR [CVPR 2017]

50 of 59

Augment multi-modality data to be compositional

Augmenting data from one modality benefits the other modality

What If We Recaption Billions of Web Images with LLaMA-3? [Arxiv 2024]

51 of 59

Mixing diverse datasets for model training

  • More diverse dataset
  • Higher Performance

SlimPajama-DC: Understanding Data Combinations for LLM Training

52 of 59

Mixing prompt/target training

Single prompt gets multiple output (backprop for min-loss)

Construct multiple prompt (box, points, e.g.) for a single mask when training

Segment Anything [ICCV 2023]
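A small sketch of the min-loss idea (illustrative, not SAM's actual training code): the model predicts several candidate masks for one ambiguous prompt, and only the candidate with the lowest loss receives the gradient.

import numpy as np

def min_loss_over_candidates(pred_masks, gt_mask):
    """pred_masks: (C, H, W) candidate predictions for one prompt; gt_mask: (H, W)."""
    # per-candidate loss, e.g. mean squared error against the single ground-truth mask
    losses = ((pred_masks - gt_mask[None]) ** 2).mean(axis=(1, 2))
    best = int(losses.argmin())          # backprop only through the best candidate
    return losses[best], best

# toy example: 3 candidate masks for one ambiguous prompt
rng = np.random.default_rng(0)
preds = rng.random((3, 4, 4))
gt = (rng.random((4, 4)) > 0.5).astype(float)
loss, idx = min_loss_over_candidates(preds, gt)
print(f"train on candidate {idx} with loss {loss:.3f}")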

53 of 59

Are we any closer to AGI?

  • Test compositional generalization: the ability to learn the meaning of a new word and then apply it in other language contexts.

54 of 59

LMs do not understand compositionality

Measuring Compositional Generalization: A Comprehensive Method on Realistic Data [ICLR2020]

55 of 59

VLMs do not understand compositionality

When and why vision-language models behave like bags-of-words, and what to do about it? [ICLR2023]

CREPE: Can Vision-Language Foundation Models Reason Compositionally? [CVPR 2023]

Performance after switching word order is close to random guessing.

Understanding is like a bag-of-words.

56 of 59

Are we any closer to an AGI that composes?

  • Not soon. Mostly memorization, but an extremely large memory is good.
  • Principles are clear, but there is no unified solution.

57 of 59

Key Take-aways

  • D&C is good for generalization.
  • The best models' success is closely related to D&C.
  • Some solutions exist, but none are finalized.

Non End-to-End Learning?

Distributed Learning?

58 of 59

Thanks for Listening

59 of 59