Disentanglement and Composition for AGI
Xingyi Yang
National University of Singapore
xyang@u.nus.edu
Sep 30 2024
Joint work with Dr. Jingwen Ye, Prof. Xinchao Wang and Prof. Shuicheng Yan
How do we define Artificial General Intelligence?
Our Narrow Scope
Program that can sense,
feel, reason, plan …
Passes the Turing Test
Matches or surpasses human capabilities
Model that generalizes
Model that memorizes enough data
Great, but boring.
Generalization
Learn in some cases, and apply it in unseen cases.
Composition is a good way to achieve this!
Milan Cathedral is unique because of compositionality
Generalization
Outline
In fact, what we really need is generalization
Generalizing Outside Training Data
What do we know about generalization?
PAC bound: generalization improves when
the model fits the data well,
the model is simpler,
and there is more data.
😭
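For reference, one standard finite-hypothesis-class form of the PAC bound (a textbook statement, not taken from the slides) makes these three levers explicit:

```latex
% With probability at least 1-\delta over an i.i.d. training set of size m,
% for every hypothesis h in a finite class \mathcal{H} (up to constants):
R(h) \;\le\; \hat{R}(h) \;+\; \sqrt{\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{2m}}
% Small \hat{R}(h): fit the data well.  Small |\mathcal{H}|: simpler model.  Large m: more data.
```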
Why is C&D good for generalization?
1. Less required data: factorize a complex distribution into simpler factors
Compositional Generative Modeling: A Single Model is Not All You Need [ICML 2024]
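As an illustrative example (not from the slides), factorizing a joint distribution over binary variables into factors with small parent sets shrinks the number of parameters that must be estimated, and hence the data required:

```latex
% A joint over n binary variables has 2^n - 1 free parameters to estimate.
% Factorized with small parent sets |\mathrm{pa}(x_i)| \le k, it needs only O(n\,2^k):
p(x_1,\dots,x_n) \;=\; \prod_{i=1}^{n} p\!\left(x_i \mid \mathrm{pa}(x_i)\right)
```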
Why is C&D good for generalization?
2. Simpler models for each component
Verifiable and Compositional Reinforcement Learning Systems [ICAPS 2022]
Why is C&D good for generalization?
3. Flexibility to test on distributions that were not trained on.
e.g., Caption = Detection + Object-to-Paragraph
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face [NeurIPS 2023]
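A minimal sketch of this kind of composition, with invented stand-in models (`detect_objects` and `objects_to_paragraph` are hypothetical names, not the HuggingGPT API):

```python
# Toy sketch: compose two independently trained models into a new capability
# that neither was trained for end-to-end.

def detect_objects(image):
    # Stand-in for any pretrained detector: image -> list of object labels.
    return ["dog", "frisbee", "park"]

def objects_to_paragraph(labels):
    # Stand-in for any pretrained object-to-text model: labels -> paragraph.
    return "A " + ", a ".join(labels[:-1]) + " and a " + labels[-1] + " appear in the scene."

def caption(image):
    """Caption = Detection + Object-to-Paragraph, composed only at test time."""
    return objects_to_paragraph(detect_objects(image))

print(caption(image=None))
```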
Trends in AGI are, in fact, C&D
Trend 1: OpenAI o1
Much more powerful than a plain LLM.
What is OpenAI o1?
Process-supervision (Train)
RLHF with process supervision:
supervision at intermediate steps > one final supervision at the end.
Chain-of-thought (Test)
Process-supervision (Train) + Chain-of-thought (Test)
o1 improves because it factorizes problems into intermediate steps
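A toy sketch of the contrast, with invented reward functions (nothing here is OpenAI's actual training setup): outcome supervision scores only the final answer, while process supervision scores every intermediate step of the chain of thought.

```python
# Toy contrast between outcome supervision and process supervision.
# All functions are invented placeholders.

steps = ["parse the question", "set up the equation", "solve for x", "final answer: 4"]

def judge_final(answer):
    # Outcome supervision: one scalar signal for the whole trajectory.
    return 1.0 if "4" in answer else 0.0

def judge_step(step):
    # Process supervision: a signal for every intermediate reasoning step.
    return 1.0 if ("equation" in step or "solve" in step or "4" in step) else 0.5

outcome_reward = judge_final(steps[-1])             # single, sparse signal
process_rewards = [judge_step(s) for s in steps]    # dense, per-step signals

print(outcome_reward, process_rewards)
```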
Trend 2: Unified Models
Independent Models
Singular Models
Hybrid Models
Two dominant GenAI pipelines,
hard to unify.
Is there any possible unification?
LLMs (autoregressive models) are trained to factorize the data distribution
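Concretely, the next-token objective corresponds to the chain-rule factorization (the standard formulation, not specific to any one model):

```latex
p_\theta(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_{<t}\right)
```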
(Less known) Diffusion models are frequency-domain autoregressive models
https://sander.ai/2024/09/02/spectral-autoregression.html
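A rough sketch of the argument from the linked post (my paraphrase, under the usual Gaussian diffusion assumptions): natural images have a power spectrum that decays with frequency while the added noise is white, so high frequencies drop below the noise floor first and are regenerated last.

```latex
% Forward diffusion (variance-preserving form):
x_t \;=\; \sqrt{\bar{\alpha}_t}\, x_0 \;+\; \sqrt{1-\bar{\alpha}_t}\,\varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, I).
% With an image spectrum S_{x_0}(f) \propto 1/f^2 and a flat noise spectrum,
% the per-frequency signal-to-noise ratio
\mathrm{SNR}_t(f) \;=\; \frac{\bar{\alpha}_t\, S_{x_0}(f)}{(1-\bar{\alpha}_t)\,\sigma^2}
% vanishes at high f first; reverse sampling therefore restores low frequencies
% before high ones, i.e., approximate autoregression over frequency bands.
```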
Diffusion Model made Slim (CVPR 2023)
Under mild assumptions:
Every step in a diffusion model is a Wiener filter
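For reference, the classical Wiener filter for a signal with power spectrum S(f) observed in additive noise with power spectrum N(f) (textbook form; the paper's exact assumptions may differ):

```latex
\hat{X}(f) \;=\; \frac{S(f)}{S(f) + N(f)}\; Y(f)
```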
AR and diffusion are unified through the lens of C&D
AGI through Compositionality
Compositional Model
Definition: multiple models collaborate to produce an output.
Core: sharing of tasks & composition of expertise
[Diagram] Singular model: Input → single model → Output.
Horizontal composition: Input → Model 1 / Model 2 / Model 3 (in parallel) → Output.
Vertical composition: Input → Model 1 → Model 2 → Output.
Horizontal Compositional Model
Horizontal Compositional Model: Mixture
Example: Mixture-of-Experts (MoE)
Pro: efficient training and inference
Adaptive Mixtures of Local Experts [Neural Computation 1991]
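A minimal numpy sketch of the mixture idea (a generic gated mixture, not any specific MoE implementation): a gate produces weights over experts and the output is their weighted combination, optionally keeping only the top-k experts for efficiency.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 8, 4

# Each "expert" is a tiny linear map; the gate is another linear map + softmax.
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))

def moe(x, top_k=2):
    logits = x @ gate_w
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Sparse routing: keep only the top-k experts (the source of the efficiency gain).
    keep = np.argsort(weights)[-top_k:]
    out = np.zeros(d)
    for i in keep:
        out += weights[i] * (x @ experts[i])
    return out / weights[keep].sum()

print(moe(rng.normal(size=d)))
```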
Horizontal: MoE and Transformers
Horizontal Compositional Model: Product
Energy-based model: parameterize the distribution via an energy function.
Composition takes the form of a sum of energies (i.e., a product of distributions).
Examples: contrastive energy models, score denoising (diffusion models), products of experts
Horizontal Compositional Model: Product
Case 1: energy-based models: composition as an operation on energies
Compositional Visual Generation with Energy Based Models [NeurIPS 2020]
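In the standard EBM formulation (a generic statement, not the paper's exact notation), conjunction of concepts corresponds to summing energies, which multiplies the distributions:

```latex
p_i(x) \;\propto\; e^{-E_i(x)}
\quad\Longrightarrow\quad
p_{\text{AND}}(x) \;\propto\; \prod_i p_i(x) \;=\; e^{-\sum_i E_i(x)}
```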
Horizontal Compositional Model: Product
Case 2: Diffusion Model (Classifier-Free Guidance)
Compositional Visual Generation with Composable Diffusion Models [ECCV 2022]
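The common classifier-free-guidance style of composition (written here in its usual generic form; coefficients and notation vary by paper) combines the conditional and unconditional score estimates:

```latex
\hat{\varepsilon}(x_t) \;=\; \varepsilon_\theta(x_t, \varnothing)
\;+\; \sum_{i} w_i \left[\, \varepsilon_\theta(x_t, c_i) - \varepsilon_\theta(x_t, \varnothing) \,\right]
```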
Vertical Compositional Model
Factorize a full task into a sequence of subtasks, similar to a probabilistic graphical model (PGM)
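Schematically (a generic two-stage chain, as in prior–decoder pipelines; not any specific paper's equation), the output distribution factorizes through an intermediate representation z:

```latex
p(\text{output} \mid \text{input})
\;=\; \int p(\text{output} \mid z)\; p(z \mid \text{input})\; \mathrm{d}z
```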
Vertical Compositional Model: Text-to-Image
DALLE-2
Stable Diffusion [CVPR 2022]
Text encoder (frozen) -> Diffusion model -> Autoencoder (frozen)
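A hedged sketch of such a vertical pipeline with frozen outer stages (all class and method names below are invented placeholders, not the Stable Diffusion or diffusers API):

```python
# Invented placeholder components for a frozen-encoder -> diffusion -> frozen-decoder pipeline.

class FrozenTextEncoder:
    def encode(self, prompt):
        return [float(len(w)) for w in prompt.split()]   # toy text embedding

class LatentDiffusion:
    def sample(self, text_emb, steps=4):
        z = [0.0] * len(text_emb)
        for _ in range(steps):                           # toy "denoising" loop
            z = [zi + 0.25 * ti for zi, ti in zip(z, text_emb)]
        return z

class FrozenAutoencoder:
    def decode(self, z):
        return [round(zi, 2) for zi in z]                # toy "image"

def text_to_image(prompt):
    text_emb = FrozenTextEncoder().encode(prompt)        # stage 1: frozen
    latent = LatentDiffusion().sample(text_emb)          # stage 2: trained
    return FrozenAutoencoder().decode(latent)            # stage 3: frozen

print(text_to_image("a cat riding a bicycle"))
```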
Vertical Compositional Model: MLLM
LLaVA [NeurIPS 2023]
MiniGPT-4
Domain (image) encoder (frozen) -> LLM (frozen)
Vertical Compositional Model: MLLM
Unified-IO
V1 [ICLR 2023]
V2 [CVPR 2024]
Domain Encoder -> Domain Decoder
Compositional Strategy
Given some trained model/tools, how can we compose them to perform new tasks?
Compositional Prompting Techniques
Neural-symbolic Agent
Case 1: Visual Programming
The language model generates code-like logic and asks tools/models to execute it (see the sketch below).
Case 2: LLM Agent
Compose an LLM with memory/tools/actions
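A toy sketch in the visual-programming spirit (the program string, tool names, and dispatcher below are all invented for illustration, not the VisProg or HuggingGPT interface): the LLM emits a small symbolic program and a simple interpreter routes each step to a tool.

```python
# Toy neuro-symbolic executor: an (imagined) LLM output is a tiny program,
# and each step is dispatched to a registered tool/model.

TOOLS = {
    "detect": lambda image, label: [f"{label}_box_1", f"{label}_box_2"],  # stand-in detector
    "count":  lambda boxes: len(boxes),                                   # stand-in counter
}

llm_program = [
    ("BOXES",  "detect", ("image.jpg", "dog")),
    ("ANSWER", "count",  ("BOXES",)),
]

def execute(program):
    env = {}
    for var, tool, args in program:
        resolved = [env.get(a, a) for a in args]   # substitute earlier results
        env[var] = TOOLS[tool](*resolved)
    return env["ANSWER"]

print(execute(llm_program))   # -> 2
```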
Compositional Data
Collecting Compositional Data
3. Augmented data (order swapping, negated meaning)
Relation Rectification [CVPR 2024]
CLEVR [CVPR 2017]
Augment multi-modal data to be compositional.
Augmenting data from one modality benefits the other modality.
What If We Recaption Billions of Web Images with LLaMA-3? [Arxiv 2024]
Mixing diverse dataset for model training
SlimPajama-DC: Understanding Data Combinations for LLM Training
Mixing prompts/targets during training
A single prompt yields multiple outputs (backpropagate only the minimum loss).
Construct multiple prompts (box, points, etc.) for a single mask during training.
Segment Anything [ICCV 2023]
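A minimal sketch of the min-loss trick (generic toy Python, not the actual SAM training code): the model predicts several candidate masks per prompt, and only the best-matching one would receive gradient.

```python
import numpy as np

def iou_loss(pred, target):
    inter = np.minimum(pred, target).sum()
    union = np.maximum(pred, target).sum() + 1e-8
    return 1.0 - inter / union

# Three candidate masks predicted for one ambiguous prompt (toy 4x4 masks).
rng = np.random.default_rng(0)
candidates = [rng.random((4, 4)) for _ in range(3)]
target = (rng.random((4, 4)) > 0.5).astype(float)

losses = [iou_loss(c, target) for c in candidates]
best = int(np.argmin(losses))
# In a real framework, only losses[best] would be backpropagated;
# here we just report which candidate the gradient would flow through.
print(best, losses[best])
```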
Are we any closer to AGI?
LMs do not generalize compositionally
Measuring Compositional Generalization: A Comprehensive Method on Realistic Data [ICLR2020]
VLMs do not understand compositionality
When and why vision-language models behave like bags-of-words, and what to do about it? [ICLR2023]
CREPE: Can Vision-Language Foundation Models Reason Compositionally? [CVPR 2023]
Performance after switching word order is close to random guessing:
the models understand text like a bag of words.
Are we any closer to an AGI that composes?
Key Take-aways
Non End-to-End Learning?
Distributed Learning?
Thanks for Listening
Web & Slides: https://trend-in-disen-and-compo.github.io/