1 of 35

What could be the Data-structures of the Mind?

Rina Panigrahy

Google

2 of 35

Mind-World interaction.

Algorithmic understanding, not physiological.

[Figure: Mind-World interaction. The world/environment produces phenomena and (sensory) inputs; the mind (a tabula rasa?) returns predictions/actions. What are its data structures: a deep network, one or many modules, concepts, a memory table?]

3 of 35

  • You meet someone over coffee for a chat
    • Who was the person?
    • What was spoken?
    • Roughly how many other people were in the room?
    • Were most people wearing dark shirts?
  • How is all this organized, indexed, and retrieved?
    • If I later meet the person again, how would I retrieve this meeting and its details?
    • Is there a knowledge graph?
  • Algorithmic understanding, not physiological.

4 of 35

Recursive Sketches

with Badih Ghazi, Joshua Wang, ICML 2019

  • Close your eyes: you can sketch the object approximately.
    • The sketch may be lossy
    • But you can still get the high-level details
  • Graceful erasure: how does the memory fade?
    • Perhaps recursively
    • Higher-layer embeddings have higher weight
  • Assume a logically modular network
    • Different modules do different tasks

5 of 35

Modular Network

  • Assume a modular network
    • Outputs with attributes
    • Hinton's capsules, Mumford, Zhu
    • Perhaps obtained by cuts?
  • Heterogeneous:
    • Different algorithms
  • New modules
    • evolve over time
  • How to represent a complex input in memory?

[Figure: example modular network. Line and Circle modules (with attributes Thickness, Color) feed part modules (Eye, Nose, Hair, Human Head, Human Torso, Human Hand, Tiger Head), which feed object modules (Human, Tiger).]

6 of 35

Want ‘sketch’

  • Sketch of an image or of an event (e.g., the meeting)
  • Sketch: a compressed summary of the input and the network outputs
  • Can recover the input object and its attributes
  • Similarity preserving

[Figure: the same modular network as on the previous slide.]

7 of 35

  • The sketch can be decoded by looking along a specific subspace.
  • Type is encoded in the direction.

Assume one layer of modules. Use a random subspace embedding.

[Figure: modules M1, M2, ..., MN, all in one layer, with outputs x1, x2, ..., xN. Only a few modules fire; their outputs are combined into a sketch (sparse, mostly 0's).]

8 of 35

  • The sketch can be decoded by looking along a specific subspace.
  • sketch-size >> module-output-dim * sparsity
  • Type is encoded in the direction.

Assume one layer of modules.

  • A random rotation Ri for each module
  • Sparse combination of sparse vectors: sparse coding / dictionary learning

[Figure: modules M1, ..., MN with outputs x1, ..., xN; the sketch is y = R1 x1 + R2 x2 + ... + RN xN.]

9 of 35

Dictionary learning: y = r1 x1 + ... + rN xN, where x is sparse.

Here: y = R1 x1 + ... + RN xN.

Most xi are 0; the others are sparse.

Multilayer: make it recursive.

Recovery: assume one layer of modules (see the numpy sketch below).

  • Sparse recovery: we can recover the sparse combination of vectors via sparse coding / dictionary learning
  • A random rotation Ri for each module

[Figure: y = R1 x1 + R2 x2 + ... + RN xN.]
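A minimal numpy sketch of this encode/decode step, under assumptions that are mine rather than the paper's: the per-module maps R_i are tall random Gaussian matrices (so different modules occupy nearly orthogonal subspaces of the sketch), and decoding projects the sketch back onto one module's subspace instead of running full dictionary learning.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 4096   # sketch dimension (sketch-size >> module-output-dim * sparsity)
d = 32     # per-module output dimension
N = 50     # number of modules; only a few fire

# Hypothetical per-module random subspace embeddings R_i (D x d):
# entries ~ N(0, 1/D), so R_i^T R_i ~ I and R_i^T R_j ~ 0 for i != j.
R = [rng.normal(0, 1 / np.sqrt(D), size=(D, d)) for _ in range(N)]

# Module outputs x_i: most are zero vectors, the firing ones are sparse.
x = [np.zeros(d) for _ in range(N)]
for i in (3, 17, 42):                       # the few modules that fire
    support = rng.choice(d, size=4, replace=False)
    x[i][support] = rng.normal(size=4)

# Sketch: y = R_1 x_1 + ... + R_N x_N
y = sum(Ri @ xi for Ri, xi in zip(R, x))

# Approximate decoding: look along module 17's subspace.
x17_hat = R[17].T @ y
rel_err = np.linalg.norm(x17_hat - x[17]) / np.linalg.norm(x[17])
print(f"relative recovery error: {rel_err:.2f}")   # small, since the subspaces barely overlap
```

Exact recovery of which modules fired, and of their sparse outputs, would use a sparse-recovery / dictionary-learning solver as stated above; the projection is just the cheap approximate version.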

10 of 35

  • Sketches preserve similarity: if sketch A references B, then A and B are similar.
  • Different types can be encoded in different directions/subspaces/clusters (illustrated below).
  • Subclusters for subtypes.
  • A new cluster means a new concept: grow a new module.
  • Similar to [Hyperdimensional Computing, Kanerva 09] and [Vector Symbolic Architectures, Frady et al. 21].

[Figure: points in sketch space ("Tiger", "Person", "Person X", "Topic Y", "Meeting with Person X on topic Y") grouped into clusters/subspaces such as animals, words, and furniture.]
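A toy illustration, not from the paper, of types encoded in directions: each type owns a random direction in sketch space, an item's sketch lies near its type's direction, and the type is decoded by picking the most aligned direction.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 1024   # sketch dimension (arbitrary choice)

def rand_unit(dim):
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Hypothetical setup: each type/cluster owns a random unit direction.
types = ["animal", "word", "furniture"]
type_dir = {t: rand_unit(D) for t in types}

def make_sketch(t):
    # An item's sketch = its type direction + item-specific detail.
    return type_dir[t] + 0.5 * rand_unit(D)

def read_type(sketch):
    # Decode the type by finding the closest type direction.
    return max(types, key=lambda t: sketch @ type_dir[t])

print(read_type(make_sketch("animal")))      # -> animal
print(read_type(make_sketch("furniture")))   # -> furniture
```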

11 of 35

Memory and Modularity

  • The sketch memory and the modular network work together and inform each other
  • A new cluster creates a new module

[Figure: a sketch repository / knowledge graph alongside the modular network (Line, Circle with attributes Thickness, Color; Eye, Nose, Hair, Head, Torso, Hand parts; Human and Tiger objects).]

12 of 35

Recursive sketching: Benefits (provable)

  • Similarity preserving
    • Find past sketches containing the same objects as the current event.
    • Gives a knowledge graph of sketches
  • Graceful erasure, attribute recovery (similar to [Resonator Networks, 20])
    • Theorem: as memory fades, reconstruction degrades gradually
    • Theorem: can recover top-level module outputs from a small sketch
  • Interpretability
    • The sketch shows how the label was obtained.
    • Histogramming / counting / mean / variance (moments through the non-linearity)
  • Type information (implicit in the subspace)
    • Protocol-independent communication across modules. Object oriented.
  • There is an architecture for continual learning [paper-link] that
    • detects a new concept as a new cluster in sketch space
    • forms a growing body of modules.

13 of 35

Sketch-based Modular Continual Learning Architecture (paper)

[Figure: world-mind loop. Phenomena (and new phenomena) from the environment arrive as input; a routing module (the "OS") forms a sketch, combines it with context, and hashes it to a bucket in an LSH table (the modules); the output goes back to the environment. Buckets contain programs, pointers to frequently co-occurring sketches, expected rewards, etc. The hash function F pays attention to selected fields, for example facial features instead of clothes.]

14 of 35

New module, concept, knowledge graph

  • A new concept
    • is just a frequently occurring cluster in sketch space
    • corresponds to one (or a few) LSH buckets
  • A new module
    • is allocated when a new bucket becomes popular
    • tries to predict the label/action/missing info for that concept
  • Knowledge graph
    • Co-occurring / co-referencing buckets
    • Compound modules

[Figure: sketch + context hashed to a bucket by the routing module (OS) over the LSH table of modules, inside the mind. Compare with [Sparse Distributed Memories, Kanerva 02].]

15 of 35

Traversal over a graph of concepts: Pathways

  • Chains of memory: the sequence of steps in cooking a dish is stored as associations.

  • Sequence of thoughts: static or dynamic (cooking vs. math) links in the graph are chased.

  • Connection to Transformers: the sketch is transformed by an FF layer at each node, routed via hashing; sketches may be combined using attention.

  • Cascaded lookups into memory become like a decision tree: a graph over deep learned modules.

A node is an LSH bucket (a region in sketch space) and carries a small module. There may be multiple parallel pathways (see the sketch below).
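An illustrative numpy sketch of such a pathway, with all parameters my own assumptions: each node is an LSH bucket holding a small feed-forward module, and a chain of thoughts repeatedly hashes the current sketch to a node and applies that node's module.

```python
import numpy as np

rng = np.random.default_rng(3)
D, n_bits = 256, 8                                  # sketch dim, hash bits

hyperplanes = rng.normal(size=(n_bits, D))          # random-hyperplane LSH

def node_of(sketch):
    return ((hyperplanes @ sketch) > 0).tobytes()   # node = LSH bucket id

# Each node lazily gets a small module (here just a random linear map + ReLU).
modules = {}
def module_for(node):
    if node not in modules:
        modules[node] = rng.normal(0, 1 / np.sqrt(D), size=(D, D))
    return modules[node]

def pathway(sketch, steps=5):
    """Chase links: hash the sketch to a node, apply its module, repeat."""
    visited = []
    for _ in range(steps):
        node = node_of(sketch)
        visited.append(node)
        sketch = np.maximum(module_for(node) @ sketch, 0.0)
    return visited, sketch

visited, final_sketch = pathway(rng.normal(size=D))
print(f"{len(set(visited))} distinct nodes visited")
```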

16 of 35

Neural Memory (with Xin Wang, Manzil Zaheer, AISTATS 2020)

  • Expandable
  • Adds capacity without increasing FLOPs much

[Figure: a neural network coupled to a memory.]

17 of 35

  • Use a locality-sensitive hash (LSH) table to index the sketches coming out of a layer
  • Similar sketches likely go to the same bucket
  • Co-occurring sketches can point to each other (knowledge graph)

Simplified implementation.

Memory architecture: LSH table (see the sketch below).
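A minimal random-hyperplane LSH table in numpy, assuming cosine-style similarity between sketches; the bucket payloads here stand in for the programs, pointers, and rewards mentioned earlier.

```python
import numpy as np

class LSHMemory:
    """Toy LSH memory: similar sketches hash to the same bucket with good probability."""

    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.hyperplanes = rng.normal(size=(n_bits, dim))
        self.table = {}                      # bucket id -> list of (sketch, payload)

    def _bucket(self, sketch):
        return ((self.hyperplanes @ sketch) > 0).tobytes()

    def insert(self, sketch, payload):
        self.table.setdefault(self._bucket(sketch), []).append((sketch, payload))

    def lookup(self, sketch):
        return self.table.get(self._bucket(sketch), [])

# Usage: a lightly perturbed version of a stored sketch usually finds the same bucket.
rng = np.random.default_rng(1)
mem = LSHMemory(dim=256)
s = rng.normal(size=256)
mem.insert(s, "meeting with person X on topic Y")
hits = mem.lookup(s + 0.02 * rng.normal(size=256))
print([payload for _, payload in hits])      # usually: ['meeting with person X on topic Y']
```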

18 of 35

Compact BERT models: memory helps the model attend to contexts.

Experiments

19 of 35

Long Tail Image Classification: memory helps networks memorize few-shot examples.

Experiments

20 of 35

Connection to Switch Transformers: neural memory where the hash function becomes the routing function

21 of 35

Switch Transformer: a tiny FF layer in each bucket (a rank-k matrix).

[Figure: a Transformer block (Attn, FF) where the FF layer is backed by an LSH sketch memory holding a rank-k matrix per bucket.]

  • Hashing becomes the routing function (see the sketch below)
  • Embedding tables: accessing large "embedding-like" tables at the input and at higher layers
  • Should also include hand-curated entities and knowledge graphs in memory: Entities as Experts, REALM
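A rough numpy sketch of the idea (my own simplification, not the Switch Transformer implementation): the hidden state is hashed to a bucket, and that bucket's tiny rank-k feed-forward map is applied, so hashing plays the role of the learned router.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, n_bits = 64, 4, 6                       # hidden dim, rank, hash bits (2^6 buckets)
n_buckets = 2 ** n_bits

hyperplanes = rng.normal(size=(n_bits, d))
# Per-bucket rank-k factors: W_b = U_b @ V_b has rank k, so each bucket is tiny.
U = rng.normal(0, 1 / np.sqrt(d), size=(n_buckets, d, k))
V = rng.normal(0, 1 / np.sqrt(k), size=(n_buckets, k, d))

def bucket(h):
    bits = (hyperplanes @ h) > 0
    return int(bits @ (1 << np.arange(n_bits)))   # bucket index in [0, n_buckets)

def routed_ff(h):
    b = bucket(h)                                 # hashing acts as the routing function
    return np.maximum(h @ U[b], 0.0) @ V[b]       # this bucket's tiny rank-k FF layer

h = rng.normal(size=d)
print(routed_ff(h).shape)                         # (64,)
```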

22 of 35

Neural Memory is a form of factorization: a wide layer with sparse activations (illustrated below).

[Figure: a wide dense layer vs. a wide layer with sparse activations mapping input to output; the sparse version acts as a lookup operation (B) into a memory table (C).]
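A small numpy check of this factorization view (the framing and sizes are mine): if the wide hidden activation is k-sparse, multiplying it by the second matrix is exactly a lookup that sums k rows of a memory table.

```python
import numpy as np

rng = np.random.default_rng(5)
d_in, width, d_out, k = 32, 4096, 32, 4        # wide hidden layer, k-sparse activation

B = rng.normal(0, 1 / np.sqrt(d_in), size=(width, d_in))    # input -> wide scores
C = rng.normal(0, 1 / np.sqrt(width), size=(width, d_out))  # rows of C = memory table

x = rng.normal(size=d_in)
scores = B @ x
top = np.argpartition(scores, -k)[-k:]         # keep only the k largest activations

# Dense view: sparsified wide activation times C.
h = np.zeros(width)
h[top] = scores[top]
dense_out = h @ C

# Memory view: look up the k selected rows of the table and combine them.
memory_out = scores[top] @ C[top]

print(np.allclose(dense_out, memory_out))      # True: the two views coincide
```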

23 of 35

Conclusion

24 of 35

Canonical memory-based problems

[Figure: key-value lookup and fuzzy key-value lookup, each mapping a key space to a value space.]

25 of 35

  • key-value: random keys from {0, 1}^50, random values from {0, 1}^50.
  • fuzzy key-value: same as key-value, but with random normal noise N(0, 1/50) added to the keys.
  • set coherence: keys are sets (e.g., a bag of words) of cardinality 5, fuzzy keys are random subsets (of the key) of cardinality 3, values from {0, 1}^50.

Synthetic Experiments

[Figure: the three synthetic tasks: key-value, fuzzy key-value, set coherence.]
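A data-generation sketch for these synthetic tasks, under my reading of the bullets above (exact sizes, vocabulary, and where the noise is applied are assumptions).

```python
import numpy as np

rng = np.random.default_rng(6)
n, dim = 1000, 50

# key-value: random binary keys and values.
keys = rng.integers(0, 2, size=(n, dim)).astype(float)
values = rng.integers(0, 2, size=(n, dim)).astype(float)

# fuzzy key-value: queries are stored keys plus small Gaussian noise.
fuzzy_queries = keys + rng.normal(0, 1 / dim, size=keys.shape)

# set coherence: keys are 5-element sets over a vocabulary; fuzzy keys are
# random 3-element subsets of the stored key.
vocab = 10_000
set_keys = [rng.choice(vocab, size=5, replace=False) for _ in range(n)]
fuzzy_set_keys = [rng.choice(key, size=3, replace=False) for key in set_keys]
```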

26 of 35

Recursive sketching formula.

27 of 35

Use block random matrices

28 of 35

Recursive sketching: Floating modules

These are like reusable functions that are not necessarily “hardwired”

[Figure: examples of floating modules: a curve analyzer, a counting unit, a clustering unit.]

29 of 35

Video: slithering snake, walking man, flying butterfly

  • These videos are periodic
    • In sketch space:
    • periodic loops
  • Different shapes and orientations
    • Different phases
    • Exactly the job of a curve analyzer or curve sketcher

30 of 35

Very Deep Random networks may be Cryptographically hard

With A. Das, S. Gollapudi, R. Kumar.

31 of 35

[Figure: a teacher deep network f (a black box) and a student deep network f̂, each mapping inputs x ∈ R^d to outputs y ∈ R.]

  • Natural functions: modelled as deep networks
  • We had better be able to learn black-box teacher networks

32 of 35

Plot of learnability with random (Gaussian) inputs, as a function of depth. We prove:

  • Networks of small depth can be learned
  • For larger depths, learnability drops exponentially
  • At large depths, the network possibly becomes a cryptographic hash function

[Plot legend: ReLU and Sign activations, width = 100.]

33 of 35

Random Deep Network is like a Random Function

  • Correlated inputs => random-looking outputs
  • The angle θ between two correlated vectors increases as you go up the layers (simulated below).
  • Theorem: the angle becomes exponentially close to orthogonal:
  • |E[dot product]| = exp(-depth)
  • Theorem: Statistical Query hardness exp(depth)
  • Maybe natural functions are not random deep networks. Then what?

[Figure: two correlated inputs x, y; the angle θ between their representations grows at each layer.]
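A quick numpy simulation of the decorrelation claim, with my own choice of width, depth, and the sign activation (the plot legend above also mentions Sign): two correlated inputs are pushed through the same random network, and their correlation shrinks roughly geometrically with depth.

```python
import numpy as np

rng = np.random.default_rng(7)
width, depth = 2000, 12

def correlated_pair(dim, rho):
    """Two unit vectors with correlation about rho."""
    x = rng.normal(size=dim)
    y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=dim)
    return x / np.linalg.norm(x), y / np.linalg.norm(y)

x, y = correlated_pair(width, rho=0.9)
for layer in range(1, depth + 1):
    W = rng.normal(size=(width, width)) / np.sqrt(width)   # fresh random layer
    x, y = np.sign(W @ x), np.sign(W @ y)                  # sign activation
    corr = (x @ y) / width
    if layer % 3 == 0:
        # Decays toward the ~1/sqrt(width) sampling-noise floor.
        print(f"depth {layer:2d}: correlation {corr:+.3f}")
```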

34 of 35

Modular teacher networks may be easier to learn (sparse cuts help)

With Surbhi Goel 2018

35 of 35

Sparse Cuts make it easier to learn a Teacher Network

  • If there is a layer of high thresholds, the teacher network may be easier to learn.
  • This is like a sparse cut in the data flow (see the sketch below)
    • It reduces to learning each part of the cut separately
  • We can prove this formally (only) when the cut is at the lowest layer.

[Figure: a network on input x with a sparse cut (sparse flow of information): only a few outputs of the high-threshold layer fire.]
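A tiny numpy illustration, with parameters chosen by me, of why a high-threshold layer acts as a sparse cut: with the threshold a few standard deviations above the mean pre-activation, only a small fraction of units fire on a typical input.

```python
import numpy as np

rng = np.random.default_rng(8)
d, width, threshold = 100, 1000, 2.5           # threshold in units of pre-activation std

W = rng.normal(0, 1 / np.sqrt(d), size=(width, d))
x = rng.normal(size=d)

pre = W @ x                                     # pre-activations are roughly N(0, 1)
fired = np.maximum(pre - threshold, 0.0)        # ReLU with a high threshold (bias -2.5)

print(f"fraction of units firing: {np.mean(fired > 0):.3f}")   # ~0.006: a sparse cut
```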