The missing rungs on the ladder to general AI
ICCV2023 VLAR workshop
Francois Chollet
Flashback: February 2023…
"By 2028, the US presidential race might no longer be run by humans"
Yuval Harari (author)
49% owner of OpenAI
"Let's be honest, most people no longer need to work"
Dror Poleg (NYT columnist)
"AGI at the level of a generally well-educated human is coming in 2 to 3 years"
Dario Amodei (CEO of Anthropic)
"If you're not using AI, you're falling behind. Here's how to 100x your productivity by using ChatGPT"
GPT-4 scores much better than most human students on virtually any human benchmark…
…or does it?
Back in the real world:
LLMs seem to suffer from some limitations
Inability to adapt to small deviations from memorized patterns
(later patched via RLHF)
Extreme sensitivity to phrasing and tendency to break on rephrasing
Optimistic view: "All you need is prompt engineering! Your models are actually more capable than you think, you're just holding them wrong!"
Hard formulation: "For any LLM, for any query that seems to work, there exists an equivalent rephrasing of the query (readily understandable by a human) that will break"
Lack of ability to pick up any task not featured in training data
Fetching a previously learned function that matches a task != solving a new problem on the fly
Hypothesis: LLM performance depends purely on task familiarity,
not at all on task complexity
Not a hard problem, but neither Bard nor GPT-3.5 nor GPT-4 seems to solve it, even after dozens of additional examples
Weak, patchy generalization
GPT-4: 59% accuracy on 3-digit number multiplication
Improvements rely on armies of data collection contractors, resulting in "pointwise fixes"
Your failed queries will magically start working after 1-2 weeks,
but will break again if you change names / variables
Over 20,000 humans work full time to create training data for LLMs (sustainable?)
So which is it?
Key quantities for conceptualizing intelligent systems
Static skills:
repository of memorized programs
Narrow operational area of programs used (low abstraction)
Data-hungry program acquisition / synthesis
Fluid intelligence:
synthesize new programs on the fly
Broad operational area of programs used (high abstraction)
Information-efficient program acquisition / synthesis
The key quantities: fluidity, operational area, information-efficiency
Generalization (not task-specific skill)
is the central problem in AI
Generalization: ability to handle situations (or tasks) that differ from previously encountered situations
Dealing with uncertainty
Dealing with novelty
Displaying autonomy
System-centric generalization: ability to adapt to situations not previously encountered by the system/agent itself
Developer-aware generalization: ability to adapt to situations that could not be anticipated by the creators of the system (unknown unknowns), including by the creators of the training data
Forget about human exams.
GENERALIZATION is the key question, and it is not taken into account at all in the way human exams are designed.
To make progress towards AI,
we need to measure & maximize generalization
The nature of generalization
Conversion ratio between past experience (or developer-imparted priors) and potential operating area
i.e.
Rate of operationalization of experience w.r.t. future behavior
Lower-intelligence system: lower information conversion ratio
Higher-intelligence system: higher information conversion ratio
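One schematic way to write this conversion ratio (an informal gloss, not a formal definition):

intelligence ∝ (skill demonstrated across the operational area) / (priors + experience)

For a fixed budget of priors and experience, the more skill, over the broader an area, a system can extract from that budget, the higher its intelligence.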
Measuring generalization power:
controlling for experience and priors
Measuring generalization must control for experiences and priors
To compare AI and natural intelligence: rely on human-like priors (Core Knowledge)
The Abstraction & Reasoning Corpus
ARC examples
LLM performance on ARC
SotA LLMs: ~5-10% test accuracy via direct prompting
~30% for basic program synthesis, >80% for humans
Common issues: …
We must investigate where that 5-10% comes from
If from (crude) reasoning: could be scaled up towards generality
If from memorized patterns: perpetual game of RLHF whack-a-mole
Abstraction
is the key to generalization
Central question: are LLMs capable of abstraction?
Can they reason?
Mind you: not a binary question!
The nature of abstraction:
reusing programs & representations
Hypothesis: the complexity & variability of any domain is the result of the repetition, composition, transformation, instantiation of a small number of "kernels of structure"
Abstraction = mining past experiences for reusable kernels of structure (identified by spotting similarities & analogies)
Intelligence = high sensitivity to similarities and isomorphisms, ability to recast old patterns into a new context
The Kaleidoscope hypothesis
Abstraction
is a spectrum
From pointwise factoids…
…to organized knowledge that works in many situations…
…to generalizable models that work on any situation in a domain…
…to the ability to produce new models to adapt to a new problem…
…to the ability to produce new models efficiently
def two_plus_two():
    return 4

def two_plus_three():
    return 5

def two_plus_ten():
    return 12
From factoids…
To organized knowledge...
def two_plus_x(x):
    if x == 0:
        return 2
    elif x == 1:
        return 3
    elif x == 2:
        return 4
    elif x == 3:
        return 7  # oops haha
    elif ...
This is abstraction! The program is abstract with respect to x!
To generalizable models…
def addition(x, y):
    # x ^ y is the carry-less sum; (x & y) << 1 is the carry, which shifts left
    # on each recursive call until it reaches 0 (for non-negative integers).
    if y == 0:
        return x
    return addition(x ^ y, (x & y) << 1)
Even more abstract!
Always returns the right result, even for never-before-seen numbers!
Maximally high operational area
To abstraction generation (aka fluid intelligence)...
def find_model(examples):
    ...  # ??
    return model_function
To information-efficient abstraction generation
(holy grail / AGI)
def find_model(very_few_examples):
    ...  # ??
    return model_function
Pointwise factoids
Organized knowledge: LLMs are mostly at this stage
Generalizable models: if we solve most LLM limitations (hallucinations, brittleness, patchy generalization…) we get here
On-the-fly model synthesis: a massive jump to get here, from program fetching to actual programming
Maximally efficient model synthesis: AGI
The two poles of abstraction: type 1 vs type 2
Type 1, prototype-centric (value-centric) abstraction: perception, intuition, approximation, "abstraction" in art, ...
Type 2, program-centric abstraction: reasoning, planning, rigor, "abstraction" in CS, ...
How do LLMs implement abstraction? A word2vec analogy
Transformers are a great type 1 abstraction machine.
But how do we get to type 2?
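A minimal sketch of the word2vec analogy mentioned above (the toy 4-dimensional embeddings and vocabulary are invented purely for illustration): in a value-centric abstraction machine, a relationship between concepts becomes a direction in embedding space, so an "analogy" reduces to vector arithmetic plus a nearest-neighbor lookup.

import numpy as np

# Toy embeddings, invented for illustration only.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.1, 0.9, 0.0, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

def nearest(vector, exclude=()):
    # Return the word whose embedding is most cosine-similar to `vector`.
    best_word, best_score = None, -float("inf")
    for word, emb in embeddings.items():
        if word in exclude:
            continue
        score = float(vector @ emb) / (np.linalg.norm(vector) * np.linalg.norm(emb))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# "king" - "man" + "woman" lands near "queen": the analogy is interpolation
# in value space (type 1), not an explicit program (type 2).
analogy = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(nearest(analogy, exclude=("king", "man", "woman")))  # -> queen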
Discrete program search:
how to learn to reason
Program synthesis from input/output pairs: overview
[Figure: "Modern machine learning" vs. "Program synthesis from I/O pairs", the latter illustrated as a graph of discrete operators (XOR, OR, AND)]
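To make the contrast concrete, here is a minimal sketch of program synthesis from I/O pairs by brute-force discrete search (the three-operator DSL and the target task are my own toy illustration, not taken from the figure): enumerate small compositions of operators and keep any program that reproduces every example.

from itertools import product

# Tiny illustrative DSL of boolean operators.
OPS = {
    "AND": lambda a, b: a & b,
    "OR":  lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

# Hidden target behavior, (a XOR b) OR c, observed only through I/O examples.
examples = [((a, b, c), (a ^ b) | c) for a, b, c in product([0, 1], repeat=3)]

def synthesize(examples):
    # Enumerate depth-2 programs of the form op2(op1(a, b), c).
    for name1, op1 in OPS.items():
        for name2, op2 in OPS.items():
            program = lambda a, b, c, op1=op1, op2=op2: op2(op1(a, b), c)
            if all(program(*inputs) == output for inputs, output in examples):
                return f"{name2}({name1}(a, b), c)"
    return None

print(synthesize(examples))  # -> OR(XOR(a, b), c)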
ARC successes
1st place: J. S. Wind
2nd place: A. de Miquel Bleier et al.
Current SotA: ~30%
Leveraging program-centric abstraction in program synthesis
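One way to read this concretely (my own toy illustration, not the talk's implementation): compare previously synthesized programs, spot an exact shared subprogram, and promote it to a new DSL primitive so future searches can reuse it instead of rediscovering it.

from collections import Counter

# Programs represented as nested tuples: (op, arg1, ...) or a variable name.
program_a = ("OR",  ("XOR", "a", "b"), "c")
program_b = ("AND", ("XOR", "a", "b"), ("NOT", "c"))

def subprograms(tree):
    # Yield every compound subexpression of a program tree.
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from subprograms(child)

counts = Counter()
for prog in (program_a, program_b):
    counts.update(set(subprograms(prog)))

# Any subprogram shared across solutions is a candidate reusable abstraction.
shared = [sub for sub, n in counts.items() if n > 1]
print(shared)  # -> [('XOR', 'a', 'b')]: promote this to a new DSL primitive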
The road ahead:
bridging both worlds
Task-level vs global timescales:
where we can leverage Deep Learning
[Figure: a space of tasks (Task 1, Task 2, Task 3) grounded in a shared space of objects / concepts]
Perception, common sense, general knowledge:
lots of data, interpolation possible:
use deep learning (offline)
Individual tasks: few examples, no interpolation: use discrete search and symbolic tools (on the fly)
Meta-level regularities: semi-continuous structure, reasonable amount of data: leverage abstract subroutine reuse and use deep learning to provide intuition over the program search space
Merging Deep Learning and program synthesis
Using Deep Learning components side by side with algorithmic components
Using Deep Learning to guide program search
An early example: using an LLM to guide a Python program search process
2x performance on ARC!
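A minimal sketch of that pattern (not the actual pipeline behind that result; llm_propose_programs is a hypothetical stand-in for whatever model API is used): the LLM supplies candidate programs, and a symbolic outer loop keeps only those that reproduce every demonstration pair.

def llm_propose_programs(task_description, n):
    # Hypothetical stand-in: return n candidate program strings, each defining solve(grid).
    raise NotImplementedError

def search_with_llm_guidance(demonstrations, task_description, n_candidates=100):
    # Keep candidate programs consistent with all demonstration (input, output) pairs.
    survivors = []
    for source in llm_propose_programs(task_description, n_candidates):
        namespace = {}
        try:
            exec(source, namespace)        # type 1: intuition proposes candidates
            solve = namespace["solve"]
            if all(solve(inp) == out for inp, out in demonstrations):
                survivors.append(source)   # type 2: execution verifies them
        except Exception:
            continue                       # discard candidates that crash or don't fit
    return survivors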
“In three to eight years we will have a machine with the general intelligence of an average human being”
- Marvin Minsky (1970)
“Human-level AI is harder than it seemed in 1955”
- John McCarthy (2006)