1 of 35

What could be the Data-structures of the Mind?

Rina Panigrahy

Google

2 of 35

Mind-World interaction.

Algorithmic understanding, not physiological.

[Figure: Mind-World interaction. The world/environment produces phenomena and (sensory) inputs; the mind (a tabula rasa?) returns predictions/actions. What are its data structures: a deep network, one or many modules, concepts, a memory table?]

3 of 35

  • You meet someone over coffee for a chat
    • Who was the person?
    • What was spoken?
    • Roughly how many other people were in the room?
    • Were most people wearing dark shirts?
  • How is all this organized, indexed, and retrieved?
    • If I later meet the person again, how would I retrieve this meeting and its details?
    • Is there a knowledge graph?
  • Algorithmic understanding, not physiological.

4 of 35

Recursive Sketches

with Badih Ghazi, Joshua Wang, ICML 2019

  • Close your eyes: you can sketch the object approximately.
    • The sketch may be lossy
    • But you can still get the high-level details
  • Graceful erasure: how does the memory fade?
    • Perhaps recursively
    • Higher-layer embeddings have higher weight
  • Assume a logically modular network
    • Different modules do different tasks

5 of 35

Modular Network

  • Assume a modular network
    • Outputs with attributes
    • Hinton's capsules, Mumford, Zhu
    • Perhaps obtained by cuts?
  • Heterogeneous:
    • Different algorithms
  • New modules
    • evolve over time
  • How to represent a complex input in memory?

[Figure: example modular network. Line and Circle modules (with attributes Thickness, Color) feed part modules (Eye, Nose, Hair, Human Head, Human Torso, Human Hand, Tiger Head), which feed object modules (Human, Tiger).]

6 of 35

Want ‘sketch’

  • Sketch of an image or of an event (e.g., the meeting)
  • Sketch: a compressed summary of the input and the network outputs
  • Can recover the input object and its attributes
  • Similarity preserving

[Figure: the same modular network as on the previous slide.]

7 of 35

  • The sketch can be decoded by looking along a specific subspace.
  • Type is encoded in the direction.

Assume one layer of modules. Use a random subspace embedding.

[Figure: modules M1, M2, ..., MN, all in one layer, with outputs x1, x2, ..., xN. Only a few modules fire; their outputs are combined into a sketch (sparse, mostly 0's).]

8 of 35

  • The sketch can be decoded by looking along a specific subspace.
  • sketch-size >> module-output-dim * sparsity
  • Type is encoded in the direction.

Assume one layer of modules.

  • A random rotation Ri for each module
  • Sparse combination of sparse vectors: sparse coding / dictionary learning

[Figure: modules M1, ..., MN with outputs x1, ..., xN; the sketch is y = R1 x1 + R2 x2 + ... + RN xN.]

9 of 35

Dictionary learning: y = r1 x1 + ... + rN xN, where x is sparse.

Here: y = R1 x1 + ... + RN xN.

Most xi are 0; the others are sparse.

Multilayer: make it recursive.

Recovery: assume one layer of modules (see the numpy sketch below).

  • Sparse recovery: we can recover the sparse combination of vectors via sparse coding / dictionary learning
  • A random rotation Ri for each module

[Figure: y = R1 x1 + R2 x2 + ... + RN xN.]
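A minimal numpy sketch of this encode/decode step, under assumptions that are mine rather than the paper's: the per-module maps R_i are tall random Gaussian matrices (so different modules occupy nearly orthogonal subspaces of the sketch), and decoding projects the sketch back onto one module's subspace instead of running full dictionary learning.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 4096   # sketch dimension (sketch-size >> module-output-dim * sparsity)
d = 32     # per-module output dimension
N = 50     # number of modules; only a few fire

# Hypothetical per-module random subspace embeddings R_i (D x d):
# entries ~ N(0, 1/D), so R_i^T R_i ~ I and R_i^T R_j ~ 0 for i != j.
R = [rng.normal(0, 1 / np.sqrt(D), size=(D, d)) for _ in range(N)]

# Module outputs x_i: most are zero vectors, the firing ones are sparse.
x = [np.zeros(d) for _ in range(N)]
for i in (3, 17, 42):                       # the few modules that fire
    support = rng.choice(d, size=4, replace=False)
    x[i][support] = rng.normal(size=4)

# Sketch: y = R_1 x_1 + ... + R_N x_N
y = sum(Ri @ xi for Ri, xi in zip(R, x))

# Approximate decoding: look along module 17's subspace.
x17_hat = R[17].T @ y
rel_err = np.linalg.norm(x17_hat - x[17]) / np.linalg.norm(x[17])
print(f"relative recovery error: {rel_err:.2f}")   # small, since the subspaces barely overlap
```

Exact recovery of which modules fired, and of their sparse outputs, would use a sparse-recovery / dictionary-learning solver as stated above; the projection is just the cheap approximate version.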

10 of 35

  • Sketches preserve similarity: if sketch A references B, then A and B are similar.
  • Different types can be encoded in different directions/subspaces/clusters (illustrated below).
  • Subclusters for subtypes.
  • A new cluster means a new concept: grow a new module.
  • Similar to [Hyperdimensional Computing, Kanerva 09] and [Vector Symbolic Architectures, Frady et al. 21].

[Figure: points in sketch space ("Tiger", "Person", "Person X", "Topic Y", "Meeting with Person X on topic Y") grouped into clusters/subspaces such as animals, words, and furniture.]
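A toy illustration, not from the paper, of types encoded in directions: each type owns a random direction in sketch space, an item's sketch lies near its type's direction, and the type is decoded by picking the most aligned direction.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 1024   # sketch dimension (arbitrary choice)

def rand_unit(dim):
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Hypothetical setup: each type/cluster owns a random unit direction.
types = ["animal", "word", "furniture"]
type_dir = {t: rand_unit(D) for t in types}

def make_sketch(t):
    # An item's sketch = its type direction + item-specific detail.
    return type_dir[t] + 0.5 * rand_unit(D)

def read_type(sketch):
    # Decode the type by finding the closest type direction.
    return max(types, key=lambda t: sketch @ type_dir[t])

print(read_type(make_sketch("animal")))      # -> animal
print(read_type(make_sketch("furniture")))   # -> furniture
```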

11 of 35

Memory and Modularity

  • The sketch memory and the modular network work together and inform each other
  • A new cluster creates a new module

[Figure: a sketch repository / knowledge graph alongside the modular network (Line, Circle with attributes Thickness, Color; Eye, Nose, Hair, Head, Torso, Hand parts; Human and Tiger objects).]

12 of 35

Recursive sketching: Benefits (provable)

  • Similarity preserving
    • Find past sketches containing the same objects as the current event.
    • Gives a knowledge graph of sketches
  • Graceful erasure, attribute recovery (similar to [Resonator Networks, 20])
    • Theorem: as memory fades, reconstruction degrades gradually
    • Theorem: can recover top-level module outputs from a small sketch
  • Interpretability
    • The sketch shows how the label was obtained.
    • Histogramming / counting / mean / variance (moments through the non-linearity)
  • Type information (implicit in the subspace)
    • Protocol-independent communication across modules. Object oriented.
  • There is an architecture for continual learning [paper-link] that
    • detects a new concept as a new cluster in sketch space
    • forms a growing body of modules.

13 of 35

Sketch-based Modular Continual Learning Architecture (paper)

[Figure: world-mind loop. Phenomena (and new phenomena) from the environment arrive as input; a routing module (the "OS") forms a sketch, combines it with context, and hashes it to a bucket in an LSH table (the modules); the output goes back to the environment. Buckets contain programs, pointers to frequently co-occurring sketches, expected rewards, etc. The hash function F pays attention to selected fields, for example facial features instead of clothes.]

14 of 35

New module, concept, knowledge graph

  • A new concept
    • is just a frequently occurring cluster in sketch space
    • corresponds to one (or a few) LSH buckets
  • A new module
    • is allocated when a new bucket becomes popular
    • tries to predict the label/action/missing info for that concept
  • Knowledge graph
    • Co-occurring / co-referencing buckets
    • Compound modules

[Figure: sketch + context hashed to a bucket by the routing module (OS) over the LSH table of modules, inside the mind. Compare with [Sparse Distributed Memories, Kanerva 02].]

15 of 35

Traversal over a graph of concepts: Pathways

  • Chains of memory: the sequence of steps in cooking a dish is stored as associations.

  • Sequence of thoughts: static or dynamic (cooking vs. math) links in the graph are chased.

  • Connection to Transformers: the sketch is transformed by an FF layer at each node, routed via hashing; sketches may be combined using attention.

  • Cascaded lookups into memory become like a decision tree: a graph over deep learned modules.

A node is an LSH bucket (a region in sketch space) and carries a small module. There may be multiple parallel pathways (see the sketch below).
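An illustrative numpy sketch of such a pathway, with all parameters my own assumptions: each node is an LSH bucket holding a small feed-forward module, and a chain of thoughts repeatedly hashes the current sketch to a node and applies that node's module.

```python
import numpy as np

rng = np.random.default_rng(3)
D, n_bits = 256, 8                                  # sketch dim, hash bits

hyperplanes = rng.normal(size=(n_bits, D))          # random-hyperplane LSH

def node_of(sketch):
    return ((hyperplanes @ sketch) > 0).tobytes()   # node = LSH bucket id

# Each node lazily gets a small module (here just a random linear map + ReLU).
modules = {}
def module_for(node):
    if node not in modules:
        modules[node] = rng.normal(0, 1 / np.sqrt(D), size=(D, D))
    return modules[node]

def pathway(sketch, steps=5):
    """Chase links: hash the sketch to a node, apply its module, repeat."""
    visited = []
    for _ in range(steps):
        node = node_of(sketch)
        visited.append(node)
        sketch = np.maximum(module_for(node) @ sketch, 0.0)
    return visited, sketch

visited, final_sketch = pathway(rng.normal(size=D))
print(f"{len(set(visited))} distinct nodes visited")
```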

16 of 35

Neural Memory (with Xin Wang, Manzil Zaheer, AISTATS 2020)

  • Expandable
  • Adds capacity without increasing FLOPs much

[Figure: a neural network coupled to a memory.]

17 of 35

  • Use a locality-sensitive hash (LSH) table to index the sketches coming out of a layer
  • Similar sketches likely go to the same bucket
  • Co-occurring sketches can point to each other (knowledge graph)

Simplified implementation.

Memory architecture: LSH table (see the sketch below).
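A minimal random-hyperplane LSH table in numpy, assuming cosine-style similarity between sketches; the bucket payloads here stand in for the programs, pointers, and rewards mentioned earlier.

```python
import numpy as np

class LSHMemory:
    """Toy LSH memory: similar sketches hash to the same bucket with good probability."""

    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.hyperplanes = rng.normal(size=(n_bits, dim))
        self.table = {}                      # bucket id -> list of (sketch, payload)

    def _bucket(self, sketch):
        return ((self.hyperplanes @ sketch) > 0).tobytes()

    def insert(self, sketch, payload):
        self.table.setdefault(self._bucket(sketch), []).append((sketch, payload))

    def lookup(self, sketch):
        return self.table.get(self._bucket(sketch), [])

# Usage: a lightly perturbed version of a stored sketch usually finds the same bucket.
rng = np.random.default_rng(1)
mem = LSHMemory(dim=256)
s = rng.normal(size=256)
mem.insert(s, "meeting with person X on topic Y")
hits = mem.lookup(s + 0.02 * rng.normal(size=256))
print([payload for _, payload in hits])      # usually: ['meeting with person X on topic Y']
```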

18 of 35

Compact BERT models: memory helps the model attend to contexts.

Experiments

19 of 35

Long Tail Image Classification: memory helps networks memorize few-shot examples.

Experiments

20 of 35

Connection to Switch Transformers: neural memory where the hash function becomes the routing function

21 of 35

Switch Transformer: a tiny FF layer in each bucket (a rank-k matrix).

[Figure: a Transformer block (Attn, FF) where the FF layer is backed by an LSH sketch memory holding a rank-k matrix per bucket.]

  • Hashing becomes the routing function (see the sketch below)
  • Embedding tables: accessing large "embedding-like" tables at the input and at higher layers
  • Should also include hand-curated entities and knowledge graphs in memory: Entities as Experts, REALM
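A rough numpy sketch of the idea (my own simplification, not the Switch Transformer implementation): the hidden state is hashed to a bucket, and that bucket's tiny rank-k feed-forward map is applied, so hashing plays the role of the learned router.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, n_bits = 64, 4, 6                       # hidden dim, rank, hash bits (2^6 buckets)
n_buckets = 2 ** n_bits

hyperplanes = rng.normal(size=(n_bits, d))
# Per-bucket rank-k factors: W_b = U_b @ V_b has rank k, so each bucket is tiny.
U = rng.normal(0, 1 / np.sqrt(d), size=(n_buckets, d, k))
V = rng.normal(0, 1 / np.sqrt(k), size=(n_buckets, k, d))

def bucket(h):
    bits = (hyperplanes @ h) > 0
    return int(bits @ (1 << np.arange(n_bits)))   # bucket index in [0, n_buckets)

def routed_ff(h):
    b = bucket(h)                                 # hashing acts as the routing function
    return np.maximum(h @ U[b], 0.0) @ V[b]       # this bucket's tiny rank-k FF layer

h = rng.normal(size=d)
print(routed_ff(h).shape)                         # (64,)
```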

22 of 35

Neural Memory is a form of factorization: a wide layer with sparse activations (illustrated below).

[Figure: a wide dense layer vs. a wide layer with sparse activations mapping input to output; the sparse version acts as a lookup operation (B) into a memory table (C).]
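A small numpy check of this factorization view (the framing and sizes are mine): if the wide hidden activation is k-sparse, multiplying it by the second matrix is exactly a lookup that sums k rows of a memory table.

```python
import numpy as np

rng = np.random.default_rng(5)
d_in, width, d_out, k = 32, 4096, 32, 4        # wide hidden layer, k-sparse activation

B = rng.normal(0, 1 / np.sqrt(d_in), size=(width, d_in))    # input -> wide scores
C = rng.normal(0, 1 / np.sqrt(width), size=(width, d_out))  # rows of C = memory table

x = rng.normal(size=d_in)
scores = B @ x
top = np.argpartition(scores, -k)[-k:]         # keep only the k largest activations

# Dense view: sparsified wide activation times C.
h = np.zeros(width)
h[top] = scores[top]
dense_out = h @ C

# Memory view: look up the k selected rows of the table and combine them.
memory_out = scores[top] @ C[top]

print(np.allclose(dense_out, memory_out))      # True: the two views coincide
```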

23 of 35

Conclusion

24 of 35

Canonical memory-based problems

[Figure: key-value lookup and fuzzy key-value lookup, each mapping a key space to a value space.]

25 of 35

  • key-value: random keys from {0, 1}^50, random values from {0, 1}^50.
  • fuzzy key-value: same as key-value, but with random normal noise N(0, 1/50) added to the keys.
  • set coherence: keys are sets (e.g., a bag of words) of cardinality 5, fuzzy keys are random subsets (of the key) of cardinality 3, values from {0, 1}^50.

Synthetic Experiments

[Figure: the three synthetic tasks: key-value, fuzzy key-value, set coherence.]
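A data-generation sketch for these synthetic tasks, under my reading of the bullets above (exact sizes, vocabulary, and where the noise is applied are assumptions).

```python
import numpy as np

rng = np.random.default_rng(6)
n, dim = 1000, 50

# key-value: random binary keys and values.
keys = rng.integers(0, 2, size=(n, dim)).astype(float)
values = rng.integers(0, 2, size=(n, dim)).astype(float)

# fuzzy key-value: queries are stored keys plus small Gaussian noise.
fuzzy_queries = keys + rng.normal(0, 1 / dim, size=keys.shape)

# set coherence: keys are 5-element sets over a vocabulary; fuzzy keys are
# random 3-element subsets of the stored key.
vocab = 10_000
set_keys = [rng.choice(vocab, size=5, replace=False) for _ in range(n)]
fuzzy_set_keys = [rng.choice(key, size=3, replace=False) for key in set_keys]
```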

26 of 35

Recursive sketching formula.

27 of 35

Use block random matrices

28 of 35

Recursive sketching: Floating modules

These are like reusable functions that are not necessarily “hardwired”

[Figure: examples of floating modules: a curve analyzer, a counting unit, a clustering unit.]

29 of 35

Video: slithering snake, walking man, flying butterfly

  • These videos are periodic
    • In sketch space:
    • periodic loops
  • Different shapes and orientations
    • Different phases
    • Exactly the job of a curve analyzer or curve sketcher

30 of 35

Very Deep Random networks may be Cryptographically hard

With A. Das, S. Gollapudi, R. Kumar.

31 of 35

[Figure: a teacher deep network f (a black box) and a student deep network f̂, each mapping inputs x ∈ R^d to outputs y ∈ R.]

  • Natural functions: modelled as deep networks
  • We had better be able to learn black-box teacher networks

32 of 35

Plot of learnability with random (Gaussian) inputs, as a function of depth. We prove:

  • Networks of small depth can be learned
  • For larger depths, learnability drops exponentially
  • At large depths, the network possibly becomes a cryptographic hash function

[Plot legend: ReLU and Sign activations, width = 100.]

33 of 35

Random Deep Network is like a Random Function

  • Correlated inputs => random-looking outputs
  • The angle θ between two correlated vectors increases as you go up the layers (simulated below).
  • Theorem: the angle becomes exponentially close to orthogonal:
  • |E[dot product]| = exp(-depth)
  • Theorem: Statistical Query hardness exp(depth)
  • Maybe natural functions are not random deep networks. Then what?

[Figure: two correlated inputs x, y; the angle θ between their representations grows at each layer.]
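A quick numpy simulation of the decorrelation claim, with my own choice of width, depth, and the sign activation (the plot legend above also mentions Sign): two correlated inputs are pushed through the same random network, and their correlation shrinks roughly geometrically with depth.

```python
import numpy as np

rng = np.random.default_rng(7)
width, depth = 2000, 12

def correlated_pair(dim, rho):
    """Two unit vectors with correlation about rho."""
    x = rng.normal(size=dim)
    y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=dim)
    return x / np.linalg.norm(x), y / np.linalg.norm(y)

x, y = correlated_pair(width, rho=0.9)
for layer in range(1, depth + 1):
    W = rng.normal(size=(width, width)) / np.sqrt(width)   # fresh random layer
    x, y = np.sign(W @ x), np.sign(W @ y)                  # sign activation
    corr = (x @ y) / width
    if layer % 3 == 0:
        # Decays toward the ~1/sqrt(width) sampling-noise floor.
        print(f"depth {layer:2d}: correlation {corr:+.3f}")
```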

34 of 35

Modular teacher networks may be easier to learn (sparse cuts help)

With Surbhi Goel 2018

35 of 35

Sparse Cuts make it easier to learn a Teacher Network

  • If there is a layer of high thresholds, the teacher network may be easier to learn.
  • This is like a sparse cut in the data flow (see the sketch below)
    • It reduces to learning each part of the cut separately
  • We can prove this formally (only) when the cut is at the lowest layer.

[Figure: a network on input x with a sparse cut (sparse flow of information): only a few outputs of the high-threshold layer fire.]
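A tiny numpy illustration, with parameters chosen by me, of why a high-threshold layer acts as a sparse cut: with the threshold a few standard deviations above the mean pre-activation, only a small fraction of units fire on a typical input.

```python
import numpy as np

rng = np.random.default_rng(8)
d, width, threshold = 100, 1000, 2.5           # threshold in units of pre-activation std

W = rng.normal(0, 1 / np.sqrt(d), size=(width, d))
x = rng.normal(size=d)

pre = W @ x                                     # pre-activations are roughly N(0, 1)
fired = np.maximum(pre - threshold, 0.0)        # ReLU with a high threshold (bias -2.5)

print(f"fraction of units firing: {np.mean(fired > 0):.3f}")   # ~0.006: a sparse cut
```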