1 of 47

The missing rungs on the ladder to general AI

ICCV2023 VLAR workshop

Francois Chollet

2 of 47

Flashback: February 2023…

3 of 47

"By 2028, the US presidential race might no longer be run by humans"

Yuval Harari (author)

49% owner of OpenAI

"Let's be honest, most people no longer need to work"

Dror Poleg (NYT columnist)

"AGI at the level of a generally well-educated human is coming in 2 to 3 years"

Dario Amodei (CEO of Anthropic)

"If you're not using AI, you're falling behind. Here's how to 100x your productivity by using ChatGPT"

4 of 47

GPT-4 scores much better than most human students on virtually any human benchmark…

5 of 47

…or does it?

6 of 47

Back in the real world:

LLMs suffer from persistent limitations:

  • Hallucinations, unreliability
  • Inability to adapt to small deviations from memorized patterns
  • Extreme sensitivity to phrasing, tendency to break on rephrasing
  • Inability to solve trivial problems that are unfamiliar
  • Weak, patchy generalization

7 of 47

Inability to adapt to small deviations from memorized patterns

(later patched via RLHF)

8 of 47

Extreme sensitivity to phrasing and tendency to break on rephrasing

Optimistic view: "All you need is prompt engineering! Your models are actually more capable than you think, you're just holding them wrong!"

Hard formulation: "For any LLM, for any query that seems to work, there exists an equivalent rephrasing of the query (readily understandable by a human) that will break"

9 of 47

Lack of ability to pick up any task not featured in training data

Fetching a previously learned function that matches a task != solving a new problem on the fly

Hypothesis: LLM performance depends purely on task familiarity,

not at all on task complexity

Not a hard problem, but neither Bard nor GPT-3.5 nor GPT-4 seems to solve it, even after dozens of additional examples

10 of 47

Weak, patchy generalization

GPT-4: 59% accuracy on 3-digit number multiplication
11 of 47

Improvements rely on armies of data collection contractors, resulting in "pointwise fixes"

Your failed queries will magically start working after 1-2 weeks,

but will break again if you change names / variables

Over 20,000 humans work full-time creating training data for LLMs (sustainable?)

12 of 47

So which is it?

13 of 47

Key quantities for conceptualizing intelligent systems

Static skills: a repository of memorized programs

  • Narrow operational area of the programs used (low abstraction)
  • Data-hungry program acquisition / synthesis

Fluid intelligence: synthesize new programs on the fly

  • Broad operational area of the programs used (high abstraction)
  • Information-efficient program acquisition / synthesis

[Diagram axes: Fluidity · Operational area · Information-efficiency]

14 of 47

Generalization (not task-specific skill)

is the central problem in AI

Generalization: ability to handle situations (or tasks) that differ from previously encountered situations

Dealing with uncertainty

Dealing with novelty

Displaying autonomy

System-centric generalization: ability to adapt to situations not previously encountered by the system/agent itself

Developer-aware generalization: ability to adapt to situations that could not be anticipated by the creators of the system (unknown unknowns), including by the creators of the training data

15 of 47

Forget about human exams.

GENERALIZATION is the key question, and it wasn't at all taken into account in the way human exams are designed.

16 of 47

To make progress towards AI,

we need to measure & maximize generalization

17 of 47

The nature of generalization

Conversion ratio between past experience (or developer-imparted priors) and potential operating area

i.e.

Rate of operationalization of experience w.r.t. future behavior

Lower-intelligence system:

lower information conversion ratio

Higher-intelligence system:

higher information conversion ratio
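One deliberately simplified way to write this conversion ratio (Chollet's "On the Measure of Intelligence" gives the full information-theoretic version; this is only a sketch of its shape):

```latex
I_{\text{system}} \;\propto\; \frac{\text{breadth of operating area attained}}{\text{priors} + \text{experience}}
```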

18 of 47

19 of 47

Measuring generalization power:

controlling for experience and priors

Measuring generalization must control for experience and priors

To compare AI and natural intelligence: rely on human-like priors (Core Knowledge)

20 of 47

The Abstraction & Reasoning Corpus

  • Similar to Raven matrices
  • 1000 unique tasks (400 training, 400 eval, 200 test) unknown in advance
  • Input/output pair program synthesis benchmark
    • or psychometric intelligence test
    • or AI benchmark
  • Control for experience: few-shot program learning
    • 2 to 5 training pairs per task, 1 to 3 test pairs
    • Median: 3 training pairs, 1 test pair
  • Control for priors: ARC is grounded purely in Core Knowledge (4 systems)
  • No acquired human knowledge
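For concreteness, real ARC tasks are distributed as JSON objects with "train" and "test" lists of input/output grids, whose cells are integers 0–9 (colors). A minimal sketch in that shape, with an invented toy task (swap colors 1 and 2) and the hand-written program a solver would have to induce from the train pairs alone:

```python
# A toy task in ARC's JSON structure: grids are lists of rows,
# cell values 0-9 are colors. This particular task (swap colors 1 and 2)
# is invented for illustration, not taken from the real corpus.
toy_task = {
    "train": [
        {"input": [[1, 2], [2, 1]], "output": [[2, 1], [1, 2]]},
        {"input": [[0, 1], [2, 0]], "output": [[0, 2], [1, 0]]},
    ],
    "test": [
        {"input": [[1, 1], [2, 0]], "output": [[2, 2], [1, 0]]},
    ],
}

def solve(grid):
    """The 'program' a solver must induce from the train pairs alone."""
    swap = {1: 2, 2: 1}
    return [[swap.get(c, c) for c in row] for row in grid]

# The induced program is validated on the train pairs,
# then judged on the held-out test pair(s).
assert all(solve(p["input"]) == p["output"] for p in toy_task["train"])
```

Note how little experience the benchmark grants: two or three demonstration pairs, no gradient descent, no curriculum.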

21 of 47

ARC examples

22 of 47

LLM performance on ARC

Common issues:

  • Train set contains "curriculum" tasks (many papers only report train set accuracy or bundle it with evaluation set accuracy!)
  • JSON encoded solutions are included in most LLMs' training data (via GitHub)
  • Many papers only test on a favorably selected subset of the data
  • ARC not 100% up to its own goals (didn't anticipate all of the Internet to be in the training set of ARC-solving models!)

SotA LLMs: ~5-10% test accuracy via direct prompting

~30% for basic program synthesis; >80% for humans

23 of 47

We must investigate where that 5-10% comes from

If from (crude) reasoning: could be scaled up towards generality

If from memorized patterns: perpetual game of RLHF whack-a-mole

24 of 47

Abstraction

is the key to generalization

Central question: are LLMs capable of abstraction?

Can they reason?

Mind you: not a binary question!

25 of 47

The nature of abstraction:

reusing programs & representations

Hypothesis: the complexity & variability of any domain is the result of the repetition, composition, transformation, instantiation of a small number of "kernels of structure"

Abstraction = mining past experiences for reusable kernels of structure (identified by spotting similarities & analogies)

Intelligence = high sensitivity to similarities and isomorphisms, ability to recast old patterns into a new context

The Kaleidoscope hypothesis

26 of 47

Abstraction

is a spectrum

From pointwise factoids

…to organized knowledge that works in many situations…

…to generalizable models that work on any situation in a domain…

…to the ability to produce new models to adapt to a new problem…

…to the ability to produce new models efficiently

27 of 47

def two_plus_two():
    return 4

def two_plus_three():
    return 5

def two_plus_ten():
    return 12

From factoids

28 of 47

To organized knowledge...

def two_plus_x(x):
    if x == 0:
        return 2
    elif x == 1:
        return 3
    elif x == 2:
        return 4
    elif x == 3:
        return 7  # oops haha
    elif ...

This is abstraction! The program is abstract for x!

29 of 47

To generalizable models

def addition(x, y):
    if y == 0:
        return x
    return addition(x ^ y, (x & y) << 1)

Even more abstract!

Always return the right result even for never-seen-before digits!

Maximally high operational area

30 of 47

To abstraction generation (aka fluid intelligence)...

def find_model(examples):
    ...  # ??
    return model_function

31 of 47

To information-efficient abstraction generation

(holy grail / AGI)

def find_model(very_few_examples):
    ...  # ??
    return model_function

32 of 47

  • Pointwise factoids
  • Organized knowledge ← LLMs are mostly at this stage
  • Generalizable models ← if we solve most LLM limitations (hallucinations, brittleness, patchy generalization…) we get here
  • On-the-fly model synthesis ← massive jump to get here: from program fetching to actual programming
  • Maximally efficient model synthesis = AGI

33 of 47

The two poles of abstraction: type 1 vs type 2

Prototype-centric (value-centric) abstraction

  • Set of prototypes + distance function
    • Example: classify face vs. non-face using abstract features
  • Abstract wrt details not present in the prototypes
  • Obtained by clustering concrete samples into prototypes
    • This is a value analogy!

Program-centric abstraction

  • Graph of (usually discrete) operators where input nodes can take different values within a type
    • Example: function that sorts a list
  • Abstract wrt input nodes values
  • Obtained by merging specialized functions under a new abstract signature
    • This is a program analogy!
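The prototype-centric pole can be sketched in a few lines, with invented toy feature vectors: prototypes are averaged from concrete samples, and classification is nearest-prototype under a distance function.

```python
# Minimal sketch of value-centric (type 1) abstraction:
# cluster concrete samples into prototypes, classify by distance.
def prototype(samples):
    """Average concrete samples (feature vectors) into one prototype."""
    n = len(samples)
    return [sum(xs) / n for xs in zip(*samples)]

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(x, prototypes):
    """Assign x to the label of the nearest prototype."""
    return min(prototypes, key=lambda label: distance(x, prototypes[label]))

# Toy "face" / "non-face" feature vectors, invented for illustration.
faces = [[0.9, 0.8], [1.0, 0.7]]
non_faces = [[0.1, 0.2], [0.0, 0.1]]
protos = {"face": prototype(faces), "non_face": prototype(non_faces)}
```

The prototype is abstract with respect to the details that averaging washed out; nothing in it is a discrete, typed program.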

34 of 47

The two poles of abstraction

Prototype-centric (value-centric) abstraction

  • Set of prototypes + distance function
    • Example: classify face vs. non-face using abstract features
  • Abstract wrt details not present in the prototypes
  • Obtained by combining concrete samples into prototypes
    • This is a value analogy!

Program-centric abstraction

  • Graph of (usually discrete) operators where input nodes can take different values within a type
    • Example: function that sorts a list
  • Abstract wrt input nodes values
  • Obtained by merging specialized functions under a new abstract signature
    • This is a program analogy!

Prototype-centric (type 1) side: Perception, Intuition, Approximation, "abstraction" in art, ...

Program-centric (type 2) side: Reasoning, Planning, Rigor, "abstraction" in CS, ...

35 of 47

How do LLMs implement abstraction? A word2vec analogy

  • word2vec: Learn a token vector space (sphere) by maximizing the dot product between co-occurring vectors
    • Ends up "emergently" learning semantic vector functions, e.g. gender vector, plural vector
      • y = plural(x)
      • y = pet_version(x) ("dog is to wolf as cat is to tiger")
      • etc.
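The "magic vector" behavior can be illustrated with tiny hand-crafted embeddings (invented for illustration; real word2vec vectors are learned from co-occurrence statistics):

```python
# Toy 2-d "embeddings", hand-crafted so that one axis encodes royalty
# and the other encodes gender. Real word2vec spaces are learned,
# high-dimensional, and only approximately linear like this.
E = {
    "king":  [1.0, 1.0],
    "queen": [1.0, 0.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, 0.0],
}

def nearest(v, vocab):
    """Closest vocabulary word to vector v (squared Euclidean distance)."""
    return min(vocab, key=lambda w: sum((a - b) ** 2 for a, b in zip(v, vocab[w])))

def analogy(a, b, c):
    """Apply the semantic vector function a - b + c, e.g. king - man + woman."""
    v = [x - y + z for x, y, z in zip(E[a], E[b], E[c])]
    return nearest(v, E)
```

Here `analogy("king", "man", "woman")` lands on "queen": subtracting "man" and adding "woman" is exactly the emergent gender vector the slide describes.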

36 of 47

  • LLMs: Same! But in a deep, multistage way
    • Vectors with high dot product (=close in latent space) are pulled progressively closer together via self-attention
      • It's kind of Hebbian learning in latent space!
      • Or information gravity!
    • Latent space is guaranteed to be continuous and interpolative, by construction
      • Next sphere of tokens = linear interpolation of existing tokens
    • Ends up learning complex "vector functions"
      • y = write_this_like_shakespeare(x)
      • These are higher-level versions of word2vec magic vectors!
  • It's a canonical type 1 abstraction
    • Group similar items together according to continuous distance function
    • Generalization via latent space interpolation

37 of 47

Transformers are a great type 1 abstraction machine.

But how do we get to type 2?

38 of 47

Discrete program search:

how to learn to reason

39 of 47

Program synthesis from input/output pairs: overview

Modern machine learning

  • Model: differentiable parametric function
  • Learning engine: SGD
  • Feedback signal: loss function
  • Key challenge: data (obtaining a dense sampling of the problem space)

Program synthesis from I/O pairs

  • Model: graph of operators from a DSL
  • Learning engine: combinatorial search
  • Feedback signal: correctness check
  • Key challenge: combinatorial explosion

[Diagram: candidate program as a graph of Boolean operators (XOR, OR, AND gates)]
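The right-hand column can be made concrete with a minimal sketch: a tiny invented DSL of integer ops, exhaustive enumeration as the learning engine, and a correctness check over I/O pairs as the feedback signal.

```python
from itertools import product

# A tiny invented DSL of unary integer ops; real systems (e.g. the ARC
# winners below) use domain-specific grid operators instead.
DSL = {
    "inc":    lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
}

def synthesize(io_pairs, max_depth=3):
    """Exhaustively enumerate op compositions; return the first program
    (a tuple of op names, applied left to right) consistent with all pairs."""
    for depth in range(1, max_depth + 1):
        for prog in product(DSL, repeat=depth):
            def run(x, prog=prog):
                for op in prog:
                    x = DSL[op](x)
                return x
            if all(run(i) == o for i, o in io_pairs):
                return prog
    return None  # combinatorial explosion is the key challenge at scale
```

For example, `synthesize([(2, 9), (3, 16)])` recovers `("inc", "square")`, i.e. f(x) = (x+1)². The search space grows as |DSL|^depth, which is exactly the combinatorial explosion named above.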

40 of 47

ARC successes

1st place: J. S. Wind

  • Exhaustive search over combinations of up to 4 ops from a DSL of 142 ops
  • Lots of hardcoding of priors in the DSL (heuristics)
  • 20.6% of test tasks solved in 3 trials or less

2nd place: A. de Miquel Bleier et al.

  • Genetic algorithm combining ops from a DSL
  • 18.7% of test tasks solved

Current SotA: ~30%

  • Combination of past PS approaches

41 of 47

Leveraging program-centric abstraction in program synthesis

  • Solve N problems via search over DSL
  • Identify subgraph isomorphisms in solution programs
    • Within a single task-specific program
    • Across programs
  • Abstract them into reusable functions
  • Store functions in repository shared across tasks
  • Similar to pretrained feature reuse in deep learning, but more robust
  • Similar to Software Engineering!
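A minimal sketch of this reuse loop on the simplest program shape, linear op sequences (op names invented): spot sub-sequences repeated across solution programs and promote them to candidate shared functions.

```python
from collections import Counter

# Minimal sketch of "subgraph" reuse on linear programs: find op
# sub-sequences repeated across task solutions and promote them to
# named, reusable DSL functions. Op names are invented for illustration.
def mine_abstractions(programs, size=2, min_count=2):
    counts = Counter(
        tuple(prog[i:i + size])
        for prog in programs
        for i in range(len(prog) - size + 1)
    )
    return [sub for sub, c in counts.most_common() if c >= min_count]

solutions = [
    ["rotate", "recolor", "crop"],
    ["flip", "rotate", "recolor"],
    ["rotate", "recolor", "tile"],
]
# ("rotate", "recolor") recurs across all three solutions -> a candidate
# new function to add to the repository shared across tasks.
shared = mine_abstractions(solutions)
```

Real solution programs are graphs rather than sequences, so the matching step becomes subgraph isomorphism, but the promote-and-reuse loop is the same.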

42 of 47

The road ahead:

bridging both worlds

43 of 47

Task-level vs global timescales:

where we can leverage Deep Learning

[Diagram: the world → a space of objects / concepts → a space of tasks (Task 1, Task 2, Task 3)]

  • Perception, common sense, general knowledge: lots of data, interpolation possible → use deep learning, offline (type 1)
  • Individual tasks: few examples, no interpolation → use discrete search and symbolic tools, on the fly (type 2)
  • Meta-level regularities: semi-continuous structure, reasonable amount of data → leverage abstract subroutine reuse, and use deep learning to provide intuition over the program search space (type 1)

44 of 47

Merging Deep Learning and program synthesis

Using Deep Learning components side by side with algorithmic components

  • Use DL to parse world into discrete objects (perception)
  • Add DL models to DSL (trained across tasks)
  • Tool / retrieval enhanced DL models

Using Deep Learning to guide program search

  • Learn intuitive mapping between discrete task concepts and program "sketches"
  • Program embeddings for fuzzy characterization of program behavior
    • Score intermediate (incomplete) programs
  • Intuition over branching decisions
    • Fight combinatorial explosion
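A minimal sketch of intuition-guided search: best-first expansion of partial programs, where a hand-written stub stands in for the learned scorer described above. Everything here (DSL, score function) is invented for illustration.

```python
import heapq
from itertools import count

# Sketch of "intuition over branching decisions": best-first search over
# programs, expanding the highest-scoring partial program first.
DSL = {"inc": lambda x: x + 1, "double": lambda x: x * 2}

def score(prog, io_pairs):
    """Fraction of I/O pairs already solved -- a hand-written stand-in
    for the learned (neural) scorer of partial programs."""
    def run(x):
        for op in prog:
            x = DSL[op](x)
        return x
    return sum(run(i) == o for i, o in io_pairs) / len(io_pairs)

def guided_search(io_pairs, max_len=4):
    tie = count()  # tiebreaker so heapq never compares op tuples
    frontier = [(0.0, next(tie), ())]
    while frontier:
        neg, _, prog = heapq.heappop(frontier)
        if -neg == 1.0 and prog:
            return prog  # all pairs solved
        if len(prog) < max_len:
            for op in DSL:
                child = prog + (op,)
                heapq.heappush(frontier, (-score(child, io_pairs),
                                          next(tie), child))
    return None
```

With a scorer that actually carries intuition about the task, most branches are never expanded: that is the point of putting type 1 machinery inside the type 2 loop.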

45 of 47

An early example: using an LLM to guide a Python program search process

2x performance on ARC!

46 of 47

In three to eight years we will have a machine with the general intelligence of an average human being

- Marvin Minsky (1970)

47 of 47

Human-level AI is harder

than it seemed in 1955

- John McCarthy (2006)