1 of 47

The missing rungs on the ladder to general AI

ICCV2023 VLAR workshop

Francois Chollet

2 of 47

Flashback: February 2023…

3 of 47

"By 2028, the US presidential race might no longer be run by humans"

Yuval Harari (author)

49% owner of OpenAI

"Let's be honest, most people no longer need to work"

Dror Poleg (NYT columnist)

"AGI at the level of a generally well-educated human is coming in 2 to 3 years"

Dario Amodei (CEO of Anthropic)

"If you're not using AI, you're falling behind. Here's how to 100x your productivity by using ChatGPT"

4 of 47

GPT-4 scores much better than most human students on virtually any human benchmark…

5 of 47

…or does it?

6 of 47

Back in the real world:

LLMs suffer from persistent limitations:

  • Hallucinations, unreliability
  • Inability to adapt to small deviations from memorized patterns
  • Extreme sensitivity to phrasing, tendency to break on rephrasing
  • Inability to solve trivial problems that are unfamiliar
  • Weak, patchy generalization

7 of 47

Inability to adapt to small deviations from memorized patterns

(later patched via RLHF)

8 of 47

Extreme sensitivity to phrasing and tendency to break on rephrasing

Optimistic view: "All you need is prompt engineering! Your models are actually more capable than you think, you're just holding them wrong!"

Hard formulation: "For any LLM, for any query that seems to work, there exists an equivalent rephrasing of the query (readily understandable by a human) that will break"

9 of 47

Lack of ability to pick up any task not featured in training data

Fetching a previously learned function that matches a task != solving a new problem on the fly

Hypothesis: LLM performance depends purely on task familiarity,

not at all on task complexity

Not a hard problem, but neither Bard nor GPT-3.5 nor GPT-4 seems to solve it, even after dozens of additional examples

10 of 47

Weak, patchy generalization

GPT-4: 59% accuracy on 3-digit number multiplication
11 of 47

Improvements rely on armies of data collection contractors, resulting in "pointwise fixes"

Your failed queries will magically start working after 1-2 weeks,

but will break again if you change names / variables

Over 20,000 humans work full-time creating training data for LLMs (sustainable?)

12 of 47

So which is it?

13 of 47

Key quantities for conceptualizing intelligent systems

Static skills: a repository of memorized programs

  • Narrow operational area of the programs used (low abstraction)
  • Data-hungry program acquisition / synthesis

Fluid intelligence: synthesize new programs on the fly

  • Broad operational area of the programs used (high abstraction)
  • Information-efficient program acquisition / synthesis

[Diagram axes: Fluidity · Operational area · Information-efficiency]

14 of 47

Generalization (not task-specific skill)

is the central problem in AI

Generalization: ability to handle situations (or tasks) that differ from previously encountered situations

Dealing with uncertainty

Dealing with novelty

Displaying autonomy

System-centric generalization: ability to adapt to situations not previously encountered by the system/agent itself

Developer-aware generalization: ability to adapt to situations that could not be anticipated by the creators of the system (unknown unknowns), including by the creators of the training data

15 of 47

Forget about human exams.

GENERALIZATION is the key question, and it wasn't at all taken into account in the way human exams are designed.

16 of 47

To make progress towards AI,

we need to measure & maximize generalization

17 of 47

The nature of generalization

Conversion ratio between past experience (or developer-imparted priors) and potential operating area

i.e.

Rate of operationalization of experience w.r.t. future behavior

Lower-intelligence system:

lower information conversion ratio

Higher-intelligence system:

higher information conversion ratio
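One deliberately simplified way to write this conversion ratio (Chollet's "On the Measure of Intelligence" gives the full information-theoretic version; this is only a sketch of its shape):

```latex
I_{\text{system}} \;\propto\; \frac{\text{breadth of operating area attained}}{\text{priors} + \text{experience}}
```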

18 of 47

19 of 47

Measuring generalization power:

controlling for experience and priors

Measuring generalization must control for experience and priors

To compare AI and natural intelligence: rely on human-like priors (Core Knowledge)

20 of 47

The Abstraction & Reasoning Corpus

  • Similar to Raven matrices
  • 1000 unique tasks (400 training, 400 eval, 200 test) unknown in advance
  • Input/output pair program synthesis benchmark
    • or psychometric intelligence test
    • or AI benchmark
  • Control for experience: few-shot program learning
    • 2 to 5 training pairs per task, 1 to 3 test pairs
    • Median: 3 training pairs, 1 test pair
  • Control for priors: ARC is grounded purely in Core Knowledge (4 systems)
  • No acquired human knowledge
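For concreteness, real ARC tasks are distributed as JSON objects with "train" and "test" lists of input/output grids, whose cells are integers 0–9 (colors). A minimal sketch in that shape, with an invented toy task (swap colors 1 and 2) and the hand-written program a solver would have to induce from the train pairs alone:

```python
# A toy task in ARC's JSON structure: grids are lists of rows,
# cell values 0-9 are colors. This particular task (swap colors 1 and 2)
# is invented for illustration, not taken from the real corpus.
toy_task = {
    "train": [
        {"input": [[1, 2], [2, 1]], "output": [[2, 1], [1, 2]]},
        {"input": [[0, 1], [2, 0]], "output": [[0, 2], [1, 0]]},
    ],
    "test": [
        {"input": [[1, 1], [2, 0]], "output": [[2, 2], [1, 0]]},
    ],
}

def solve(grid):
    """The 'program' a solver must induce from the train pairs alone."""
    swap = {1: 2, 2: 1}
    return [[swap.get(c, c) for c in row] for row in grid]

# The induced program is validated on the train pairs,
# then judged on the held-out test pair(s).
assert all(solve(p["input"]) == p["output"] for p in toy_task["train"])
```

Note how little experience the benchmark grants: two or three demonstration pairs, no gradient descent, no curriculum.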

21 of 47

ARC examples

22 of 47

LLM performance on ARC

Common issues:

  • Train set contains "curriculum" tasks (many papers only report train set accuracy or bundle it with evaluation set accuracy!)
  • JSON encoded solutions are included in most LLMs' training data (via GitHub)
  • Many papers only test on a favorably selected subset of the data
  • ARC not 100% up to its own goals (didn't anticipate all of the Internet to be in the training set of ARC-solving models!)

SotA LLMs: ~5-10% test accuracy via direct prompting

~30% for basic program synthesis; >80% for humans

23 of 47

We must investigate where that 5-10% comes from

If from (crude) reasoning: could be scaled up towards generality

If from memorized patterns: perpetual game of RLHF whack-a-mole

24 of 47

Abstraction

is the key to generalization

Central question: are LLMs capable of abstraction?

Can they reason?

Mind you: not a binary question!

25 of 47

The nature of abstraction:

reusing programs & representations

Hypothesis: the complexity & variability of any domain is the result of the repetition, composition, transformation, instantiation of a small number of "kernels of structure"

Abstraction = mining past experiences for reusable kernels of structure (identified by spotting similarities & analogies)

Intelligence = high sensitivity to similarities and isomorphisms, ability to recast old patterns into a new context

The Kaleidoscope hypothesis

26 of 47

Abstraction

is a spectrum

From pointwise factoids

…to organized knowledge that works in many situations…

…to generalizable models that work on any situation in a domain…

…to the ability to produce new models to adapt to a new problem…

…to the ability to produce new models efficiently

27 of 47

def two_plus_two():
    return 4

def two_plus_three():
    return 5

def two_plus_ten():
    return 12

From factoids

28 of 47

To organized knowledge...

def two_plus_x(x):
    if x == 0:
        return 2
    elif x == 1:
        return 3
    elif x == 2:
        return 4
    elif x == 3:
        return 7  # oops haha
    elif ...

This is abstraction! The program is abstract for x!

29 of 47

To generalizable models

def addition(x, y):
    if y == 0:
        return x
    return addition(x ^ y, (x & y) << 1)

Even more abstract!

Always return the right result even for never-seen-before digits!

Maximally high operational area

30 of 47

To abstraction generation (aka fluid intelligence)...

def find_model(examples):
    ...  # ??
    return model_function

31 of 47

To information-efficient abstraction generation

(holy grail / AGI)

def find_model(very_few_examples):
    ...  # ??
    return model_function

32 of 47

  • Pointwise factoids
  • Organized knowledge ← LLMs are mostly at this stage
  • Generalizable models ← if we solve most LLM limitations (hallucinations, brittleness, patchy generalization…) we get here
  • On-the-fly model synthesis ← massive jump to get here: from program fetching to actual programming
  • Maximally efficient model synthesis = AGI

33 of 47

The two poles of abstraction: type 1 vs type 2

Prototype-centric (value-centric) abstraction

  • Set of prototypes + distance function
    • Example: classify face vs. non-face using abstract features
  • Abstract wrt details not present in the prototypes
  • Obtained by clustering concrete samples into prototypes
    • This is a value analogy!

Program-centric abstraction

  • Graph of (usually discrete) operators where input nodes can take different values within a type
    • Example: function that sorts a list
  • Abstract wrt input nodes values
  • Obtained by merging specialized functions under a new abstract signature
    • This is a program analogy!
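The prototype-centric pole can be sketched in a few lines, with invented toy feature vectors: prototypes are averaged from concrete samples, and classification is nearest-prototype under a distance function.

```python
# Minimal sketch of value-centric (type 1) abstraction:
# cluster concrete samples into prototypes, classify by distance.
def prototype(samples):
    """Average concrete samples (feature vectors) into one prototype."""
    n = len(samples)
    return [sum(xs) / n for xs in zip(*samples)]

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(x, prototypes):
    """Assign x to the label of the nearest prototype."""
    return min(prototypes, key=lambda label: distance(x, prototypes[label]))

# Toy "face" / "non-face" feature vectors, invented for illustration.
faces = [[0.9, 0.8], [1.0, 0.7]]
non_faces = [[0.1, 0.2], [0.0, 0.1]]
protos = {"face": prototype(faces), "non_face": prototype(non_faces)}
```

The prototype is abstract with respect to the details that averaging washed out; nothing in it is a discrete, typed program.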

34 of 47

The two poles of abstraction

Prototype-centric (value-centric) abstraction

  • Set of prototypes + distance function
    • Example: classify face vs. non-face using abstract features
  • Abstract wrt details not present in the prototypes
  • Obtained by combining concrete samples into prototypes
    • This is a value analogy!

Program-centric abstraction

  • Graph of (usually discrete) operators where input nodes can take different values within a type
    • Example: function that sorts a list
  • Abstract wrt input nodes values
  • Obtained by merging specialized functions under a new abstract signature
    • This is a program analogy!

Prototype-centric (type 1) side: Perception, Intuition, Approximation, "abstraction" in art, ...

Program-centric (type 2) side: Reasoning, Planning, Rigor, "abstraction" in CS, ...

35 of 47

How do LLMs implement abstraction? A word2vec analogy

  • word2vec: Learn a token vector space (sphere) by maximizing the dot product between co-occurring vectors
    • Ends up "emergently" learning semantic vector functions, e.g. gender vector, plural vector
      • y = plural(x)
      • y = pet_version(x) ("dog is to wolf as cat is to tiger")
      • etc.
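The "magic vector" behavior can be illustrated with tiny hand-crafted embeddings (invented for illustration; real word2vec vectors are learned from co-occurrence statistics):

```python
# Toy 2-d "embeddings", hand-crafted so that one axis encodes royalty
# and the other encodes gender. Real word2vec spaces are learned,
# high-dimensional, and only approximately linear like this.
E = {
    "king":  [1.0, 1.0],
    "queen": [1.0, 0.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, 0.0],
}

def nearest(v, vocab):
    """Closest vocabulary word to vector v (squared Euclidean distance)."""
    return min(vocab, key=lambda w: sum((a - b) ** 2 for a, b in zip(v, vocab[w])))

def analogy(a, b, c):
    """Apply the semantic vector function a - b + c, e.g. king - man + woman."""
    v = [x - y + z for x, y, z in zip(E[a], E[b], E[c])]
    return nearest(v, E)
```

Here `analogy("king", "man", "woman")` lands on "queen": subtracting "man" and adding "woman" is exactly the emergent gender vector the slide describes.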

36 of 47

  • LLMs: Same! But in a deep, multistage way
    • Vectors with high dot product (=close in latent space) are pulled progressively closer together via self-attention
      • It's kind of Hebbian learning in latent space!
      • Or information gravity!
    • Latent space is guaranteed to be continuous and interpolative, by construction
      • Next sphere of tokens = linear interpolation of existing tokens
    • Ends up learning complex "vector functions"
      • y = write_this_like_shakespeare(x)
      • These are higher-level versions of word2vec magic vectors!
  • It's a canonical type 1 abstraction
    • Group similar items together according to continuous distance function
    • Generalization via latent space interpolation

37 of 47

Transformers are a great type 1 abstraction machine.

But how do we get to type 2?

38 of 47

Discrete program search:

how to learn to reason

39 of 47

Program synthesis from input/output pairs: overview

Modern machine learning

  • Model: differentiable parametric function
  • Learning engine: SGD
  • Feedback signal: loss function
  • Key challenge: data (obtaining a dense sampling of the problem space)

Program synthesis from I/O pairs

  • Model: graph of operators from a DSL
  • Learning engine: combinatorial search
  • Feedback signal: correctness check
  • Key challenge: combinatorial explosion

[Diagram: candidate program as a graph of Boolean operators (XOR, OR, AND gates)]
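The right-hand column can be made concrete with a minimal sketch: a tiny invented DSL of integer ops, exhaustive enumeration as the learning engine, and a correctness check over I/O pairs as the feedback signal.

```python
from itertools import product

# A tiny invented DSL of unary integer ops; real systems (e.g. the ARC
# winners below) use domain-specific grid operators instead.
DSL = {
    "inc":    lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
}

def synthesize(io_pairs, max_depth=3):
    """Exhaustively enumerate op compositions; return the first program
    (a tuple of op names, applied left to right) consistent with all pairs."""
    for depth in range(1, max_depth + 1):
        for prog in product(DSL, repeat=depth):
            def run(x, prog=prog):
                for op in prog:
                    x = DSL[op](x)
                return x
            if all(run(i) == o for i, o in io_pairs):
                return prog
    return None  # combinatorial explosion is the key challenge at scale
```

For example, `synthesize([(2, 9), (3, 16)])` recovers `("inc", "square")`, i.e. f(x) = (x+1)². The search space grows as |DSL|^depth, which is exactly the combinatorial explosion named above.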

40 of 47

ARC successes

1st place: J. S. Wind

  • Exhaustive search over combinations of up to 4 ops from a DSL of 142 ops
  • Lots of hardcoding of priors in the DSL (heuristics)
  • 20.6% of test tasks solved in 3 trials or less

2nd place: A. de Miquel Bleier et al.

  • Genetic algorithm combining ops from a DSL
  • 18.7% of test tasks solved

Current SotA: ~30%

  • Combination of past PS approaches

41 of 47

Leveraging program-centric abstraction in program synthesis

  • Solve N problems via search over DSL
  • Identify subgraph isomorphisms in solution programs
    • Within a single task-specific program
    • Across programs
  • Abstract them into reusable functions
  • Store functions in repository shared across tasks
  • Similar to pretrained feature reuse in deep learning, but more robust
  • Similar to Software Engineering!
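A minimal sketch of this reuse loop on the simplest program shape, linear op sequences (op names invented): spot sub-sequences repeated across solution programs and promote them to candidate shared functions.

```python
from collections import Counter

# Minimal sketch of "subgraph" reuse on linear programs: find op
# sub-sequences repeated across task solutions and promote them to
# named, reusable DSL functions. Op names are invented for illustration.
def mine_abstractions(programs, size=2, min_count=2):
    counts = Counter(
        tuple(prog[i:i + size])
        for prog in programs
        for i in range(len(prog) - size + 1)
    )
    return [sub for sub, c in counts.most_common() if c >= min_count]

solutions = [
    ["rotate", "recolor", "crop"],
    ["flip", "rotate", "recolor"],
    ["rotate", "recolor", "tile"],
]
# ("rotate", "recolor") recurs across all three solutions -> a candidate
# new function to add to the repository shared across tasks.
shared = mine_abstractions(solutions)
```

Real solution programs are graphs rather than sequences, so the matching step becomes subgraph isomorphism, but the promote-and-reuse loop is the same.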

42 of 47

The road ahead:

bridging both worlds

43 of 47

Task-level vs global timescales:

where we can leverage Deep Learning

[Diagram: the world → a space of objects / concepts → a space of tasks (Task 1, Task 2, Task 3)]

  • Perception, common sense, general knowledge: lots of data, interpolation possible → use deep learning, offline (type 1)
  • Individual tasks: few examples, no interpolation → use discrete search and symbolic tools, on the fly (type 2)
  • Meta-level regularities: semi-continuous structure, reasonable amount of data → leverage abstract subroutine reuse, and use deep learning to provide intuition over the program search space (type 1)

44 of 47

Merging Deep Learning and program synthesis

Using Deep Learning components side by side with algorithmic components

  • Use DL to parse world into discrete objects (perception)
  • Add DL models to DSL (trained across tasks)
  • Tool / retrieval enhanced DL models

Using Deep Learning to guide program search

  • Learn intuitive mapping between discrete task concepts and program "sketches"
  • Program embeddings for fuzzy characterization of program behavior
    • Score intermediate (incomplete) programs
  • Intuition over branching decisions
    • Fight combinatorial explosion
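A minimal sketch of intuition-guided search: best-first expansion of partial programs, where a hand-written stub stands in for the learned scorer described above. Everything here (DSL, score function) is invented for illustration.

```python
import heapq
from itertools import count

# Sketch of "intuition over branching decisions": best-first search over
# programs, expanding the highest-scoring partial program first.
DSL = {"inc": lambda x: x + 1, "double": lambda x: x * 2}

def score(prog, io_pairs):
    """Fraction of I/O pairs already solved -- a hand-written stand-in
    for the learned (neural) scorer of partial programs."""
    def run(x):
        for op in prog:
            x = DSL[op](x)
        return x
    return sum(run(i) == o for i, o in io_pairs) / len(io_pairs)

def guided_search(io_pairs, max_len=4):
    tie = count()  # tiebreaker so heapq never compares op tuples
    frontier = [(0.0, next(tie), ())]
    while frontier:
        neg, _, prog = heapq.heappop(frontier)
        if -neg == 1.0 and prog:
            return prog  # all pairs solved
        if len(prog) < max_len:
            for op in DSL:
                child = prog + (op,)
                heapq.heappush(frontier, (-score(child, io_pairs),
                                          next(tie), child))
    return None
```

With a scorer that actually carries intuition about the task, most branches are never expanded: that is the point of putting type 1 machinery inside the type 2 loop.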

45 of 47

An early example: using an LLM to guide a Python program search process

2x performance on ARC!

46 of 47

In three to eight years we will have a machine with the general intelligence of an average human being

- Marvin Minsky (1970)

47 of 47

Human-level AI is harder

than it seemed in 1955

- John McCarthy (2006)