1 of 21

ZeroC: A Neuro-Symbolic Model for Zero-shot Concept Recognition and Acquisition at Inference Time

NeurIPS 2022

Tailin Wu¹, Megan Tjandrasuwita², Zhengxuan Wu¹, Xuelin Yang¹, Kevin Liu¹, Rok Sosic¹, Jure Leskovec¹

¹ Stanford University

² MIT

2 of 21

Motivation

Humans have the remarkable ability to recognize and acquire novel visual concepts in a zero-shot manner

Prior knowledge: suppose we humans have only learned the concept of “line” and the relations of “parallel” and “perpendicular”:

“Line” (concept)

“Parallel” (relation)

“Perpendicular” (relation)

Given: Symbolic structure of a new concept

E.g., when told that a “rectangle” consists of two pairs of “lines”, where the lines within each pair are “parallel” and the lines between the pairs are “perpendicular”

Zero-shot recognition:

“rectangle”

Zero-shot recognize novel (hierarchical) concepts:

3 of 21

Motivation

Humans have the remarkable ability to recognize and acquire novel visual concepts in a zero-shot manner

Zero-shot acquire novel (hierarchical) concepts:

Prior knowledge: suppose we humans have only learned the concept of “line” and the relations of “parallel” and “perpendicular”:

“Line” (concept)

“Parallel” (relation)

“Perpendicular” (relation)

Zero-shot acquire: Symbolic structure of a new concept

A “rectangle” consists of two pairs of “lines”, the lines within the pairs are “parallel,” and the lines between the pairs are “perpendicular”

Given a single demonstration:

“rectangle”

4 of 21

Problem definition and significance:

How can we endow machine learning (ML) models with the capability of zero-shot recognition and acquisition of hierarchical visual concepts?

Having such capability will allow ML models to tackle more complex tasks at inference time, without further training on those specific tasks.

Why is it hard:

Machine learning models typically generalize only to examples drawn from the same or a similar distribution as the training data. Here we would like the model to generalize to more complex, hierarchical concepts that it has not seen before.

Prior methods:

Only address part of the problem:

Visual compositionality: [1-2] address factors of variation (e.g. color, position, smiling) without hierarchical structures; [3] addresses composition of transformation.
Concept or relation learning [4-7]: does not generalize to hierarchical concepts.
Zero-shot learning [8-10]: only generalizes to new combinations of features (constituent concepts) while neglecting relation structures.

[1] Du et al. NeurIPS 2020

[2] Higgins et al. ICLR 2018

[3] Andreas et al. CVPR 2016

[4] Snell et al. NeurIPS 2017

[5] Mao et al. ICLR 2019

[6] Kipf et al. ICLR 2018

[7] Shanahan et al. ICML 2020

[8] Romera et al. ICML 2015

[9] Bucher et al. ICCV 2017

[10] Schonfeld et al. CVPR 2019

5 of 21

Our contribution:

In this work, we introduce Zero-shot Concept Recognition and Acquisition (ZeroC) to address this problem.

ZeroC represents concepts as graphs of constituent concept models (as nodes) and their relations (as edges). It allows a one-to-one mapping between a symbolic graph structure of a concept and its corresponding recognition model.
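The graph representation described above can be sketched as a small data structure. This is only an illustrative sketch: the `ConceptGraph` class, its field names, and the `rectangle_graph` helper are hypothetical and not taken from the ZeroC code. The "rectangle" structure follows the description in the Motivation slides (two pairs of lines, parallel within a pair, perpendicular between pairs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptGraph:
    """Symbolic structure of a concept: constituent concepts as nodes, relations as edges."""
    name: str
    nodes: tuple  # constituent concept names, one per node
    edges: tuple  # (i, j, relation_name) triples over node indices


# "parallel-line": two "line" nodes joined by a single "parallel" edge.
parallel_line = ConceptGraph(
    name="parallel-line",
    nodes=("line", "line"),
    edges=((0, 1, "parallel"),),
)


def rectangle_graph():
    """Two pairs of lines, (0, 1) and (2, 3): parallel within pairs, perpendicular between."""
    nodes = ("line",) * 4
    edges = [(0, 1, "parallel"), (2, 3, "parallel")]
    edges += [(i, j, "perpendicular") for i in (0, 1) for j in (2, 3)]
    return ConceptGraph("rectangle", nodes, tuple(edges))


rectangle = rectangle_graph()
```

Because the graph is plain symbolic data, it can be passed around ("communicated") independently of the neural models attached to its nodes and edges.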

For the first time, it allows acquiring new concepts, communicating their graph structures, and applying them to classification and detection tasks (even across domains) at inference time.

6 of 21

Illustration of concept:

[Figure: an observed image of a line alongside its mental representation, the mask of the line and the name “line”. A concept combines an observation, a mask, and a concept name. The concept graph of “line” (elementary concept) is shown with its concept probability model.]

7 of 21

Illustration of concept:

[Figure: an observed image of a parallel-line alongside its mental representation, the mask of the parallel-line and the name “parallel-line”. A concept combines an observation, a mask, and a concept name. The concept graph of “parallel-line” (hierarchical concept) consists of two “line” nodes (elementary concept) joined by a “parallel” (relation) edge, and is shown with its concept probability model.]

8 of 21

Question: How to compose the probability function of a hierarchical concept?

[Figure: the concept “line” with its observation, mask, and concept name]

How do we construct the concept probability model for a hierarchical concept (e.g. “parallel-line”), using the constituent probability models?

Here the f are non-negative functions


9 of 21

Energy-based models

The probability function can be written in terms of an energy-based model (EBM), which maps the input to a scalar value called the “energy”.

The benefit of using an EBM is that multiplying probabilities translates to adding the energy terms:
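The equation on the original slide is not reproduced here. As a sketch, assuming the standard Boltzmann form of an EBM (the symbols $E_i$ and $Z_i$ are introduced for illustration), the product of two probability functions corresponds to the sum of their energies:

```latex
p_i(x) = \frac{e^{-E_i(x)}}{Z_i}, \qquad
p_1(x)\,p_2(x)
  = \frac{e^{-E_1(x)}}{Z_1}\cdot\frac{e^{-E_2(x)}}{Z_2}
  = \frac{e^{-\left(E_1(x) + E_2(x)\right)}}{Z_1 Z_2}
  \;\propto\; e^{-\left(E_1(x) + E_2(x)\right)}
```

The normalization constants $Z_1 Z_2$ do not depend on $x$, so the composed model can be treated as a single EBM with energy $E_1(x) + E_2(x)$.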

Du, Yilun, et al. Compositional Visual Generation with Energy Based Models, NeurIPS 2020: 6637-6647.

Sampling: start with a random input, then do gradient descent with noise on the energy to reach a low-energy input.
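Noisy gradient descent on an energy is commonly implemented as Langevin dynamics. The following is a minimal sketch on a toy quadratic energy, not the sampler used in ZeroC: the function name `langevin_sample`, the step size, and the toy energy are all assumptions chosen for illustration.

```python
import numpy as np


def langevin_sample(grad_energy, x0, step_size=0.01, n_steps=1000, rng=None):
    """Noisy gradient descent: x <- x - (eps / 2) * grad E(x) + sqrt(eps) * noise."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * step_size * grad_energy(x) + np.sqrt(step_size) * noise
    return x


# Toy energy E(x) = 0.5 * ||x - mu||^2, whose low-energy region lies around mu.
mu = np.array([2.0, -1.0])


def energy(x):
    return 0.5 * np.sum((x - mu) ** 2, axis=-1)


def grad_energy(x):
    return x - mu


rng = np.random.default_rng(0)
x_init = 5.0 * rng.standard_normal((200, 2))  # 200 random starting points
x_final = langevin_sample(grad_energy, x_init, rng=rng)

mean_energy_before = energy(x_init).mean()
mean_energy_after = energy(x_final).mean()
```

The chains start far from `mu` and drift toward the low-energy region; the injected noise keeps them spread around `mu` rather than collapsing onto a single point.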

10 of 21

My contribution: energy-based model for concept

[Figure: the concept “line” with its mask, observation, and concept name]

E(x, m, c): the energy is low only if x, m, and c are consistent.

Example task: detecting a concept:

Given image x and concept name c, infer the mask m

Solution: diffuse to an m that minimizes the energy.

random initial mask

diffuse to the correct mask corresponding to the concept

11 of 21

My contribution: ZeroC (Zero-shot concept recognition & acquisition)

Hierarchical concept model as a composition of constituent concepts and relations

Key innovation: Hierarchical Composition Rule (e.g. “parallel-line”)

Concept graph for “parallel-line”:

“line”

“parallel”

“parallel-line”

one-to-one correspondence
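The one-to-one correspondence between a concept graph and its model can be sketched in code. This sketch assumes, following the EBM slide, that composing constituent models amounts to adding their energies; the exact rule from the slide's equation is not reproduced here. The function `compose_energy` and the toy energies (which treat each "mask" as a line orientation in degrees) are hypothetical stand-ins for learned EBMs.

```python
import math


def compose_energy(nodes, edges, concept_energy, relation_energy):
    """Build the energy of a hierarchical concept from its graph.

    nodes: constituent concept names, one per node.
    edges: (i, j, relation_name) triples over node indices.
    The composed energy adds one concept term per node and one relation term per edge.
    """
    def energy(x, masks):
        node_terms = sum(concept_energy[c](x, masks[i]) for i, c in enumerate(nodes))
        edge_terms = sum(relation_energy[r](x, masks[i], masks[j]) for i, j, r in edges)
        return node_terms + edge_terms
    return energy


# Toy stand-ins: each "mask" is just a line orientation in degrees, and x is unused.
concept_energy = {"line": lambda x, m: 0.0}
relation_energy = {"parallel": lambda x, m1, m2: math.sin(math.radians(m1 - m2)) ** 2}

parallel_line_energy = compose_energy(
    nodes=("line", "line"),
    edges=((0, 1, "parallel"),),
    concept_energy=concept_energy,
    relation_energy=relation_energy,
)

low = parallel_line_energy(None, [30.0, 30.0])    # two parallel lines
high = parallel_line_energy(None, [30.0, 120.0])  # two perpendicular lines
```

Each node and edge of the graph contributes exactly one term, so changing the graph changes the composed model in a predictable way without any retraining.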

12 of 21

ZeroC: Zero-shot Concept Recognition and Acquisition

Training:

Given: data tuples of (x, m, c) or (x, m_1, m_2, r)

Learn: energy-based model E(x, m, c) or E(x, m_1, m_2, r)

x: input

m: mask

c: concept name

r: relation name

We augment the state-of-the-art EBM training objective [1] with three more regularizations (from first principles) to learn:

[1] Du, Yilun, et al. "Improved contrastive divergence training of energy based models." ICML 2021

makes sure positive examples have similar energy

ensures consistency in concept acquisition

encourages “connected” masks

13 of 21

ZeroC: Zero-shot Concept Recognition and Acquisition

Inference: (1) Zero-shot concept recognition

E.g. for the concept of “Fshape”:

Given: graph structure of a hierarchical concept

Compose: ZeroC first composes an EBM based on the given graph:

Detection: (infer the mask given image x and concept name c):

14 of 21

ZeroC: Zero-shot Concept Recognition and Acquisition

Inference: (1) Zero-shot concept recognition

E.g. for the concept of “Eshape”:

Given: graph structure of a hierarchical concept

Classification:

correct!
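One plausible way to turn composed energies into a classifier is to pick the candidate concept whose graph gives the lowest energy. The sketch below illustrates only that argmin step and makes strong simplifying assumptions: mask inference is skipped, each line is summarized by its orientation in degrees, and the relation energies and the candidate name "perpendicular-line" are hypothetical.

```python
import math

# Hypothetical relation energies on line orientations (degrees); low means "consistent".
RELATION_ENERGY = {
    "parallel": lambda a, b: math.sin(math.radians(a - b)) ** 2,
    "perpendicular": lambda a, b: math.cos(math.radians(a - b)) ** 2,
}


def graph_energy(edges, angles):
    """Sum the relation energies over all edges of a candidate concept graph."""
    return sum(RELATION_ENERGY[r](angles[i], angles[j]) for i, j, r in edges)


def classify(candidates, angles):
    """Return the candidate concept name whose graph has the lowest energy."""
    return min(candidates, key=lambda name: graph_energy(candidates[name], angles))


candidates = {
    "parallel-line": [(0, 1, "parallel")],
    "perpendicular-line": [(0, 1, "perpendicular")],
}

label_a = classify(candidates, [10.0, 12.0])  # nearly parallel lines
label_b = classify(candidates, [0.0, 88.0])   # nearly perpendicular lines
```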

15 of 21

ZeroC: Zero-shot Concept Recognition and Acquisition

Inference: (2) Zero-shot concept acquisition

Difficult because it is an NP-hard subgraph isomorphism task

16 of 21

Experiment 1: zero-shot recognition

Training dataset (HDConcept: elementary concepts and relations):

Training on the concepts “Eshape” and “rectangle” and the relations “inside”, “non-overlap”, and “outside”:

17 of 21

Experiment 1: zero-shot recognition

Test dataset (HDConcept: hierarchical concepts):

Test on hierarchical concepts (e.g. Concept1) that consist of “Eshape” and “rectangle” combined in a certain way. E.g.:

18 of 21

Experiment 1: zero-shot recognition

ZeroC can zero-shot recognize hierarchical concepts with reasonable accuracy
ZeroC outperforms the strong zero-shot learning baseline of CADA-VAE
Ablation: The different components are necessary

19 of 21

Experiment 2: zero-shot acquisition

Example task:

2D to 3D transfer of concepts without training:

*We use a stringent subgraph isomorphism accuracy, which is 1 only if the inferred graph is isomorphic to the ground truth.

For example, with 10 nodes and edges in total, an individual node/edge accuracy of 0.8 results in an overall accuracy of only 0.8¹⁰ ≈ 0.107
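The arithmetic behind this figure can be checked directly. The sketch assumes each node/edge prediction is correct independently with the same probability; the function name is hypothetical.

```python
def exact_match_accuracy(per_element_accuracy, n_elements):
    """Probability that all n independent node/edge predictions are correct."""
    return per_element_accuracy ** n_elements


overall = exact_match_accuracy(0.8, 10)
print(round(overall, 3))  # 0.107
```

Under this metric, even a fairly accurate per-element predictor scores low, so the reported graph-level accuracies understate per-node and per-edge quality.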

20 of 21

Experiment 3: CLEVR dataset:

ZeroC outperforms the strong CADA-VAE baseline and is able to classify the hierarchical concepts reasonably well.

21 of 21

Summary:

In this work, we introduce Zero-shot Concept Recognition and Acquisition (ZeroC), a neuro-symbolic architecture that can recognize and acquire novel concepts in a zero-shot way.

It is able to perform:

Zero-shot recognition: recognize more complex concepts at inference, without further training
Zero-shot acquisition: discover the internal structure of more complex concepts at inference, and transfer the knowledge across domains.

For more, see our paper and code at http://snap.stanford.edu/zeroc/