1 of 77

Commonsense

and how far have we come in giving our NLU systems common sense?

Oct 2022


Nasrin Mostafazadeh

Co-founder at Verneek (stealth)

@nasrinmmm

nasrin@verneek.com

2 of 77

Verneek is a deep-tech AI company in NYC, with the mission of “enabling anyone to make better & faster decisions”. Verneek’s proprietary AI technologies power truly intuitive modalities of interaction on top of heterogeneous data sources.

We are hiring across all roles!

3 of 77

State of AI, ~17 years ago

Robotics vs. NLP

  • The monkey ate the banana because it was hungry.
    • Question: What does “it” refer to? The monkey or the banana?


Answering this basic question was deemed very challenging for the AI systems of the time!

Correct answer: Monkey

4 of 77

State of AI, now!

Robotics vs. NLP

Stanford CoreNLP Coreference Resolver

(March 2022)

And it’s still challenging!!

  • The monkey ate the banana because it was hungry.
    • What does “it” refer to? The monkey or the banana?
    • Correct answer: Monkey

GPT-3


… still doesn’t work!
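To make this concrete, here is a minimal sketch of one way to probe a pretrained language model on the example, by substituting each candidate referent and comparing sentence plausibility (assuming the Hugging Face transformers package; GPT-2 stands in for GPT-3 here, and this is an illustration, not the exact probe behind the slide):

```python
# Probe an LM on the classic coreference example: substitute each candidate
# referent and pick the sentence with the lower average token loss.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_nll(text: str) -> float:
    """Average per-token negative log-likelihood under the LM (lower = more plausible)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

candidates = {
    "monkey": "The monkey ate the banana because the monkey was hungry.",
    "banana": "The monkey ate the banana because the banana was hungry.",
}
print(min(candidates, key=lambda r: avg_nll(candidates[r])))  # hopefully "monkey"!
```

Such likelihood probes are brittle, which is exactly the point of the slide.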

5 of 77

Why is NLU so Hard?

People outside of the field often don’t understand what is even “AI”-related about human language, let alone why it’s the holy grail of AI!


6 of 77

… because of the Dual Problem of Language Ambiguity & Meaning Variability


The same expression can mean different things (ambiguity).

The same meaning can be expressed in many ways (variability).

7 of 77

… because tackling the dual problem requires enormous amounts of world, common sense, and linguistic knowledge

  • The monkey ate the banana because it was hungry.


We use our knowledge about entities and their attributes, selectional restrictions of verbs, and beyond…

Moravec’s Paradox

Skills that appear effortless to humans tend to be difficult to reverse-engineer, but skills that require effort may not necessarily be difficult to engineer at all.

8 of 77

So NLU is hard because of …

the knowledge acquisition bottleneck!


9 of 77

The Spectrum of Knowledge:

from Word Knowledge to World Knowledge

Lexical Knowledge → Common Sense Knowledge → World Knowledge

10 of 77

What is Common Sense Knowledge?

Well, the definition of common sense is not common sense!


11 of 77

My definition of common sense knowledge

The most fundamental and general knowledge about the world, shared by most people, including a 5-year-old kid. This includes:

  • Class Attributes
    • Most people have two eyes.
    • There are no blue apples.
    • Monkeys eat bananas.
  • Events and their causal chains
    • If you kill someone, that person will be dead as a result.
    • If a person drops a glass on a hard floor, it may break.
    • If you fall from your bike, you may hurt yourself.
      • If you hurt yourself, you may cry as a result.
  • Core theories about the world
    • TRANSPORT requires the OBJECT to be IN the VEHICLE
    • BEFORE an OBJECT is TRANSPORTED by a VEHICLE from PLACE1 to PLACE2, the OBJECT and VEHICLE are at PLACE1
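These core theories read like the preconditions and effects of classic planning operators. Here is a minimal, purely illustrative sketch (the state encoding and the transport function are my own, not from any particular system):

```python
# The TRANSPORT mini-theory as a STRIPS-style operator: preconditions
# are checked, then effects update the world state.
def transport(state, obj, vehicle, place1, place2):
    # Preconditions: the OBJECT and the VEHICLE are at PLACE1,
    # and the OBJECT is IN the VEHICLE.
    assert state["at"][obj] == place1
    assert state["at"][vehicle] == place1
    assert state["in"].get(obj) == vehicle
    # Effects: afterwards, the OBJECT and the VEHICLE are at PLACE2.
    state["at"][obj] = place2
    state["at"][vehicle] = place2
    return state

state = {"at": {"crate": "dock", "truck": "dock"}, "in": {"crate": "truck"}}
transport(state, "crate", "truck", "dock", "warehouse")
print(state["at"]["crate"])  # -> "warehouse"
```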


12 of 77

According to my definition, here are the loose boundaries of common sense knowledge vs. world/domain knowledge.

It is:

  • about kinds, not individuals.
    • Countries have capital cities, not that Edinburgh is the capital of Scotland.
  • enduringly true, rather than true at the moment.
    • People enjoy listening to music, not that Barack Obama enjoys listening to Taylor Swift.
  • applicable to the daily life of everyone, not to less familiar contexts.
    • Students learn at school, not that students at NYU learned that a cloning vector replicates within a cell.


13 of 77

Commonsense Reasoning

The basic reasoning abilities for connecting the dots and applying common sense knowledge in everyday contexts!


14 of 77

What does Having Common Sense Mean?

Common Sense Knowledge + Commonsense Reasoning Capabilities


“A program has common sense if it automatically deduces for itself a sufficiently wide class of immediate consequences of anything it is told and what it already knows.” --McCarthy (1959)

“AI has common sense if it doesn’t make dumb mistakes.” --Someone on Twitter! (2021)

15 of 77

How do we tackle the problem of common sense knowledge acquisition?


16 of 77

Approaches for Acquiring Common Sense Knowledge

  1. Knowledge engineering by manually authoring knowledge

“Let’s roll up our sleeves and get on with it!” But it’s a daunting task…

    • Use expensive experts!
    • High quality, but super slow and unscalable.
    • Examples: WordNet, Cyc


17 of 77

Enter:

GOFAI


18 of 77

CYC: The Common Sense Knowledge Base

  • Douglas B. Lenat and MCC (1984)
  • GOFAI at a massive scale, where they construct “Common Sense” knowledge manually
  • Contains millions of rules, and hundreds of thousands of terms and facts about everyday life
  • All written in CycL, their Knowledge Representation language


19 of 77


Dark times came after commonsense reasoning failed to deliver on the investments of the 80s!

(sources: Towards Data Science; Yejin Choi)

20 of 77

…when I started my PhD in Commonsense Reasoning in 2012, still no one really cared about the subject…


21 of 77

We have had much better approaches for acquiring common sense knowledge since the 2000s:

  • Knowledge engineering by manually authoring knowledge

“Let’s roll up our sleeves and get on with it!” But it’s a daunting task…

    • Use expensive experts!
    • High quality, but super slow and unscalable.
    • Examples: WordNet, Cyc
  • Automatic extraction
    • Cheaper, faster, but not as accurate.
    • Prior Examples: KnowItAll, TextRunner, KNEXT, NELL, DIRT
  • Community-powered/Crowdsourcing
    • Use large numbers of non-experts online!
    • Cheaper, faster, scalable, but prone to noise.
    • Prior Examples: OpenMind, ConceptNet, Wikipedia


  • We have more computing power
  • We have large corpora
  • We have crowdsourcing at scale
  • We can build better automatic extractors
  • We have better ways of knowledge representation

22 of 77

Overview of existing common sense resources (a timeline):

  • Cyc (Lenat et al., 1984)
  • Open Mind Common Sense (Singh, 2002)
  • ConceptNet (Liu & Singh, 2004)
  • OpenCyc (Lenat, 2004)
  • ResearchCyc (Lenat, 2006)
  • NELL (Carlson et al., 2010)
  • OpenCyc 4.0 (Lenat, 2012)
  • Web Child (Tandon et al., 2014)
  • NELL (Mitchell et al., 2015)
  • ConceptNet 5.5 (Speer et al., 2017)
  • Web Child 2.0 (Tandon et al., 2017)
  • ATOMIC (Sap et al., 2019)
  • … today

23 of 77

  • ConceptNet (Havasi, Speer, Liu, Singh, Eslick, et al., 2007) is a semantic network that captures common sense knowledge in the form of concepts and the relationships among them.
  • Originally based on knowledge from the Open Mind Common Sense (OMCS) project at the MIT Media Lab (started in 1999), which crowdsourced facts and stories from Netizens!!
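ConceptNet’s edges are easy to inspect through its public REST API. A minimal sketch (assuming the requests package; the endpoint is the documented api.conceptnet.io one, and the example edge in the comment is hypothetical):

```python
# Fetch commonsense edges about "monkey" from the public ConceptNet API.
import requests

resp = requests.get("http://api.conceptnet.io/c/en/monkey", params={"limit": 20})
for edge in resp.json().get("edges", []):
    rel = edge["rel"]["label"]          # e.g. "IsA", "CapableOf", "Desires"
    start = edge["start"]["label"]
    end = edge["end"]["label"]
    print(f"{start} --{rel}--> {end}")  # e.g. "a monkey --Desires--> a banana"
```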


24 of 77

Benchmarks with a small amount of direct supervision…

  • Winograd Schema Challenge (WSC): 273 examples (Levesque, 2011)
  • Choice of Plausible Alternatives (COPA): 500 dev, 500 test (Roemmele et al., 2011)

25 of 77

‘John didn’t see Brian’s car coming, because he was dizzy.’ Who was dizzy?

‘John didn’t see Brian’s car coming, because he had his lights off.’ Who had his lights off?

The Winograd Schema commonsense reasoning challenge, proposed by Levesque (2011).

26 of 77


Around 2016, we saw a renewed interest in commonsense reasoning...

...with deep learning folks acknowledging it as the

Holy Grail of AI!

27 of 77

The paradigm shift in NLU, since 2015…

  • 2015-2017:
    • What happened: New SOTA established on various NLP/NLU benchmarks.
    • Recipe: Encode the input text using BiLSTMs, decode with attention! (Chris Manning)
    • Shortcomings: Could not tackle reading comprehension tasks that (supposedly) required vast amounts of background knowledge, required reasoning, or had long established contexts.

28 of 77

(Schank and Abelson, 1977; Dyer, 1983; Charniak, 1972; Turner, 1994; Schubert and Hwang, 2000, …)

Story Understanding · Storytelling · Story Generation · Narrative Intelligence · Script Learning · Narrative Structure Learning · Episodic Knowledge Acquisition

Story Understanding has been one of the oldest ambitions of AI, and one of its most challenging tasks!

29 of 77


Story Understanding

Story Generation

  • Biggest challenge in story understanding and story generation: commonsense knowledge for the interpretation of narratives.
  • Narrative is a major cognitive tool humans use for holding meaningful communications (Dahlstrom, 2014; AC, 2002), serving a variety of purposes.
  • Stories are inherently filled with complex chains of events, with various causal and temporal networks.

30 of 77

Story Cloze Test (Mostafazadeh et al., 2016): a commonsense reasoning benchmark

Context:

Jim found an old disposable camera in the bottom of his junk drawer. He began snapping away at everything around him. The counter clicked down to one final photo. The gravity of the situation began to dawn on Jim.


Two alternative endings:

Jim took time to decide what he would take a picture of.

Jim took 20 more photos.

A challenging commonsense reasoning task, where SOTA remained at ~65% for many months after the release of the dataset.
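For concreteness, here is a minimal sketch of the kind of naive language-model baseline the task invites: pick the ending that is more likely given the context (assuming the Hugging Face transformers package; GPT-2 is illustrative and is not one of the systems evaluated on the benchmark):

```python
# Score each candidate ending by its average negative log-likelihood
# appended to the story context; lower is "more plausible" to the LM.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_nll(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

context = ("Jim found an old disposable camera in the bottom of his junk drawer. "
           "He began snapping away at everything around him. The counter clicked "
           "down to one final photo. The gravity of the situation began to dawn on Jim.")
endings = ["Jim took time to decide what he would take a picture of.",
           "Jim took 20 more photos."]
print(min(endings, key=lambda e: avg_nll(context + " " + e)))
```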

31 of 77

Story Cloze Test has only ~1,500 direct-supervision data points


We intentionally did not provide a large training set with direct supervision, with the goal of pushing commonsense reasoning forward and preventing the task from becoming yet another pattern-recognition/memorization task.

32 of 77

Things got interesting in 2017-2018!

  • Late 2017-2018:
    • What happened: The dawn of “Attention Is All You Need” (Vaswani et al., 2017), introducing transformers. Brand-new SOTA established on various, supposedly more complex, reading comprehension tasks.
    • Recipe: Fine-tune large pretrained transformer-based models on downstream tasks (even with small supervised data)!
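A minimal sketch of that recipe with today’s tooling (assuming the Hugging Face transformers and datasets packages; RTE stands in here as a small supervised downstream task, not a dataset from this talk):

```python
# The 2018 recipe: take a large pretrained transformer and fine-tune it
# on a small supervised downstream task.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

data = load_dataset("glue", "rte")  # ~2.5K training examples: "small" supervision
def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length", max_length=128)
data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
)
trainer.train()
```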


33 of 77

GPT-1, 2018 (Radford et al., 2018)


These results were on the Story Cloze Test v1, where there had been some stylistic biases (Sap et al., 2017).

We tested a host of models on the new debiased Story Cloze Test v1.5 test set (Sharma et al., 2018).

The GPT-1 model (Radford et al., 2018) was then the only model that still held its rather high performance!

So, do these models actually have narrative understanding? Are they actually learning to transfer various lexical, conceptual, commonsense, and world knowledge?

34 of 77

GPT-3, 2020...


35 of 77


36 of 77

We have come a long way…

We have to keep moving the goalpost!


37 of 77

So we’ve come a rather long way in the last few years in giving AI systems commonsense reasoning capabilities, with lots of exciting progress.

We need to keep working on the following issues, which we are still grappling with…


38 of 77

Issue: Our amazing models often make glaringly stupid mistakes; they are brittle! This makes it hard to deploy these models in real-world products.


39 of 77


40 of 77

Other Issues...

  • We don’t yet know the implications of establishing SOTA on various benchmarks. Are we making any real progress? Do these models work outside of our narrow lab settings, in the real world?
  • A handful of top industry players get to pay the costs of building ever-larger (and not-green) models, and play. Where are we going with this paradigm? (Schwartz et al., 2020)

… And we still don’t have an AI system that has the common sense of perhaps even a dog (?), let alone a 5-year-old kid.


Startups are a perfect setting for developing novel AI models that actually have to work in the real messy world!

41 of 77

Moving the Goalpost on Natural Language Understanding


42 of 77

When humans, even young children, read, they make countless implicit commonsense inferences that frame their understanding of the unfolding narrative!

Peppa was riding her bike.

A car turned in front of her. Peppa turned her bike sharply.

She fell off of her bike.

Peppa skinned her knee.

(adapted from the ROCStories corpus)


43 of 77

While reading, humans construct a coherent representation of what happened and why, combining information from the text with relevant background knowledge.


44 of 77

Humans can construct the causal chain that describes how the sequence of events led to a particular outcome!


A car turned in front of Peppa
  causes
Peppa turned her bike sharply
  causes
Peppa fell off of her bike
  causes
Peppa skinned her knee
  causes
(likely) she asks for help!

45 of 77

Humans can also describe how characters’ different states, such as emotions and location, changed throughout the story.


Peppa went from feeling (likely) happy to feeling in pain after falling.

Peppa was on her bike throughout riding it. Then after falling, Peppa was on the ground.

46 of 77

Though humans build such mental models of situations with ease (Zwaan et al., 1995), AI systems for tasks such as reading comprehension and dialogue remain far from exhibiting similar commonsense reasoning capabilities!


Why?

  • Two major bottlenecks in AI research:
    1. Difficulty in acquiring (often-implicit) commonsense knowledge at scale.
    2. Difficulty in incorporating knowledge into state-of-the-art AI systems.

47 of 77

GLUCOSE: GeneraLized and COntextualized Story Explanations!

(Mostafazadeh et al., 2020)


48 of 77

The GLUCOSE Task

  • Given a short story S and a selected sentence X in the story, GLUCOSE captures ten dimensions of causal explanation related to X.
  • These dimensions, inspired by human cognitive psychology, cover often-implicit causes and effects of X, including events, location, possession, and other attributes.
  • GLUCOSE encodes commonsense knowledge in the form of semi-structured inference rules (mini-theories about the world), each grounded in a specific story.
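As an illustration, here is one possible in-code representation of such a rule (a sketch only; the released GLUCOSE data has its own format, and the example strings are adapted from the Peppa story that follows):

```python
# A GLUCOSE-style rule pairs a story-grounded specific statement with a
# generalized mini-theory, both in "antecedent connective consequent" form.
from dataclasses import dataclass

CONNECTIVES = (">Causes/Enables>", ">Enables>", ">Results in>", ">Motivates>")

@dataclass
class GlucoseRule:
    dimension: int  # 1-10, e.g. 1 = an event that directly causes/enables X
    specific: str   # grounded in the story
    general: str    # the generalized mini-theory

    def parts(self, text: str):
        """Split a rule string into (antecedent, connective, consequent)."""
        for conn in CONNECTIVES:
            if conn in text:
                antecedent, consequent = text.split(conn, 1)
                return antecedent.strip(), conn, consequent.strip()
        raise ValueError("no known connective found")

rule = GlucoseRule(
    dimension=1,
    specific="A car turned in front of Peppa >Causes/Enables> Peppa turned her bike sharply",
    general="SomethingA turns in front of SomeoneA >Causes/Enables> SomeoneA swerves",
)
print(rule.parts(rule.general))
```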


49 of 77

GLUCOSE framework through an example:

Peppa was riding her bike. A car turned in front of her. Peppa turned her bike sharply. She fell off of her bike. Peppa skinned her knee.


Dim #1: Is there an event that directly causes or enables X?

Dim #2: Is there an emotion or basic human drive that motivates X?

Dim #3: Is there a location state that enables X?

Generalized: General rules provide general mini-theories about the world!

Contextualized: Specific statements exemplify how a general rule could be grounded in a particular context

Semi-structured Inference Rule = antecedent + connective + consequent

50 of 77

GLUCOSE framework through an example:

Peppa was riding her bike. A car turned in front of her. Peppa turned her bike sharply. She fell off of her bike. Peppa skinned her knee.


GLUCOSE captures mini causal theories about the world focused around events, states (location, possession, emotion, etc), motivations, and naive human psychology.


Dim #4: Is there a possession state that enables X?

Dim #5: Are there any other attributes enabling X?

GLUCOSE offers a unique perspective on commonsense reasoning: it presents often-implicit commonsense knowledge in the form of semi-structured general inference rules that are also grounded in the context of a specific story.

51 of 77

How to address the problem of implicit knowledge acquisition at scale?

Filling in the GLUCOSE dimensions is a demanding task, requiring annotators to grasp the concepts of causality and generalization and to write semi-structured inference rules.


52 of 77

An effective multi-stage crowdsourcing platform

After many rounds of pilot studies, we successfully designed an effective platform for collecting GLUCOSE data that is cognitively accessible to laypeople!


GLUCOSE Qualification UI

GLUCOSE Main UI

GLUCOSE Review Dashboard

53 of 77

Statistics and Examples


  • # total annotations: ~670K
  • # total unique stories: 4,881
  • # workers who participated: 371

Various implicit and script-like mini-theories:

  • SomeoneA gives SomeoneB SomethingA >Results in> SomeoneB possess(es) SomethingA
  • SomeoneA is careless >Enables> SomeoneA forgets SomethingA SomewhereA
  • SomeoneA forgets SomethingA SomewhereA >Results in> SomethingA is SomewhereA

[Chart: number of rules collected for each of the GLUCOSE dimensions]

54 of 77

GLUCOSE captures extensive commonsense knowledge that is unavailable in the existing resources


Ceiling overlap between GLUCOSE and other resources, based on a best-effort mapping of relations:

GLUCOSE Dim                       1      2      5      6      7      10
ConceptNet (Speer et al., 2017)   1.2%   0.3%   0%     1.9%   0%     0%
ATOMIC (Sap et al., 2019)         7.8%   1.2%   2.9%   5.3%   1.8%   4.9%

55 of 77

Atomic: inferential knowledge in natural language form

56 of 77

Atomic: 880,000 triples for AI systems to reason about causes and effects of everyday situations

[Figure: the example event “X repels Y’s attack”, annotated along nine inference dimensions, grouped as Causes vs. Effects, Static vs. Dynamic, Involuntary vs. Voluntary, and Theme vs. Agent]

300,000 event nodes to date

880,000 if-Event-then-* knowledge triples
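A minimal sketch of what such triples look like as data (the relation names follow Atomic’s nine dimensions; the inferences for this event are illustrative guesses, not actual Atomic entries):

```python
# Atomic-style if-Event-then-* triples for the slide's running example.
atomic_triples = [
    ("X repels Y's attack", "xIntent", "to protect themselves"),       # cause
    ("X repels Y's attack", "xNeed",   "to have seen the attack coming"),
    ("X repels Y's attack", "xAttr",   "strong"),                      # static attribute
    ("X repels Y's attack", "xEffect", "X catches their breath"),      # effect on agent
    ("X repels Y's attack", "xReact",  "relieved"),                    # involuntary
    ("X repels Y's attack", "xWant",   "to call for help"),            # voluntary
    ("X repels Y's attack", "oReact",  "frustrated"),                  # theme (other)
    ("X repels Y's attack", "oEffect", "Y falls back"),
    ("X repels Y's attack", "oWant",   "to try again"),
]

# One of the groupings highlighted on the slides: causes vs. effects,
# with the stative xAttr dimension set aside.
CAUSE_RELATIONS = {"xIntent", "xNeed"}
STATIVE_RELATIONS = {"xAttr"}
causes = [t for t in atomic_triples if t[1] in CAUSE_RELATIONS]
effects = [t for t in atomic_triples
           if t[1] not in CAUSE_RELATIONS | STATIVE_RELATIONS]
print(len(causes), "cause triples;", len(effects), "effect triples")
```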

64 of 77

Atomic: knowledge of cause and effect

  • Humans have theory of mind, allowing us to
    • make inferences about people’s mental states
    • understand likely events that precede and follow (Moore, 2013)
  • AI systems struggle with inferential reasoning
    • only find complex correlational patterns in data
    • limited to the domain they are trained on

(Pearl; Davis and Marcus 2015; Lake et al. 2017; Marcus 2018)

Theory of Mind

65 of 77

How to incorporate commonsense knowledge into the state-of-the-art AI systems?


66 of 77

GLUCOSE Empirical Evaluation Task: a testbed for evaluating models that can dynamically produce GLUCOSE-like inferences on novel input

  • Task: Given a story S, a selected sentence X, and a dimension d, predict the GLUCOSE specific and general rules (see the sketch after this list).
  • Test Set: We carefully curated a doubly vetted test set, based on previously unseen stories and on which our most reliable annotators had high agreement.
    • Our vetting process resulted in a test set of 500 GLUCOSE story/sentence pairs, each with 1-5 dimensions answered.
  • Evaluation Metrics:
    • Human: We designed a specialized UI for collecting calibrated human ratings.
    • Automatic: We used Corpus-level BLEU.
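As referenced above, here is a minimal sketch of generating such rules with an encoder-decoder model, in the spirit of the paper’s T5 baseline (assuming the Hugging Face transformers package; the prompt format is illustrative, and the checkpoint would first need fine-tuning on the GLUCOSE data):

```python
# Generate a GLUCOSE-like rule for (story S, sentence X, dimension d)
# with a seq2seq model.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")  # fine-tune on GLUCOSE first!

story = ("Peppa was riding her bike. A car turned in front of her. "
         "Peppa turned her bike sharply. She fell off of her bike. "
         "Peppa skinned her knee.")
prompt = f"dimension: 1 sentence: She fell off of her bike. story: {story}"

ids = tokenizer(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_length=64, num_beams=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```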


67 of 77

We designed a specialized Human Evaluation UI for collecting reliable, reproducible, and calibrated ratings!


68 of 77

Automatic Evaluation of natural language generations

  • A majority of commonsense reasoning frameworks have been in multiple-choice form, as opposed to natural language generation, due to ease of evaluation.
    • Multiple-choice tests are inherently easier to game!
  • Automatic evaluation for tasks involving natural language generation with diverse possibilities has been a major bottleneck for research.
  • BLEU’s ease of replicability has made it a popular automated metric, but its correlation with human judgment has proven weak in various tasks.


69 of 77

Automatic Evaluation of natural language generations in GLUCOSE

  • We found a very strong pairwise correlation between human ratings and SacreBLEU corpus-level scores on our test set.
    • Spearman ρ = 0.891, Pearson r = 0.855, and Kendall’s τ = 0.705, all with p-value < 0.001.
  • This is accomplished through various design choices in GLUCOSE:
    1. GLUCOSE semi-structured inference rules are designed to be evaluable, where the structure naturally limits the format of the generated rules
    2. We curated our test set to eliminate cases with a wide range of correct responses where humans cannot agree, making the limited number of gold references sufficient for automatic evaluation
    3. We designed a systematic human evaluation process that can collect calibrated ratings from judges who are well educated about what constitutes a correct GLUCOSE rule.
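A minimal sketch of that evaluation loop (assuming the sacrebleu and scipy packages; every score and string below is made up for illustration):

```python
# Corpus-level SacreBLEU against gold references, plus the correlation
# between per-system BLEU scores and per-system human ratings.
import sacrebleu
from scipy.stats import spearmanr

system_outputs = ["SomeoneA is in a kitchen >Enables> SomeoneA cooks SomethingA"]
references = [["SomeoneA is in a kitchen >Enables> SomeoneA prepares SomethingA"]]
print(sacrebleu.corpus_bleu(system_outputs, references).score)

bleu_scores = [62.1, 48.3, 30.5]   # hypothetical corpus BLEU per system
human_scores = [2.8, 2.4, 1.9]     # hypothetical human ratings per system
rho, p = spearmanr(bleu_scores, human_scores)
print(rho, p)
```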


The GLUCOSE task has a systematic evaluation that is fast and easily replicable!

Strong correlation between human and automatic metric!!

70 of 77

Example Predictions, Dimension 3: a location enabling X.

  • Input:
    • Karen made a pan of lasagna. She brought it to the party. Nobody wanted to eat lasagna. Karen ate it for a week. She became tired of lasagna.


GPT-2:

  • She was in front of a TV >Enables> Karen made a pan of lasagna.

Enc-Dec (Raffel et al., 2019):

  • Karen is in the kitchen >Enables> Karen makes a pan of lasagna
  • SomeoneA is in a kitchen >Enables> SomeoneA cooks SomethingA

Human:

  • Karen is in the kitchen >Enables> Karen made a pan of lasagna
  • SomeoneA is in a kitchen >Enables> SomeoneA prepares SomethingA (that is a dish)

Avg ratings (s = specific rule, g = general rule): s 2.8/3, g 2.6/3; s 2.6/3, g 2.3/3

71 of 77

Example Predictions, Dimension 6: an event that X Causes/Enables.

  • Input:
    • Karen made a pan of lasagna. She brought it to the party. Nobody wanted to eat lasagna. Karen ate it for a week. She became tired of lasagna.


Enc-Dec:

  • Karen makes a pan of lasagna >Causes/Enables> Karen eats it for a week
  • SomeoneA makes SomethingA (that is food) >Causes/Enables> SomeoneA eats SomethingA

Human:

  • Karen makes a pan of lasagna >Causes/Enables> Karen brought it to the party
  • SomeoneA prepares SomethingA (that is a dish) >Causes/Enables> SomeoneA takes SomethingA to SomethingB (that is an event)

72 of 77


73 of 77

Conclusion

We proved the following hypothesis:


A promising new recipe for giving machines common sense is to use high-quality commonsense knowledge such as GLUCOSE for training neural models that have pre-existing lexical and conceptual knowledge.

Classic commonsense knowledge bases have been static; new commonsense knowledge bases should be dynamic:

In value, a static commonsense knowledge base with GLUCOSE mini-theories authored by humans < a GLUCOSE-trained model that can dynamically generate GLUCOSE dimensions for any novel input.

All the data and models can be found under: https://github.com/ElementalCognition/glucose/

74 of 77

Are our commonsense reasoning systems simply hallucinating?!


Perhaps...

“Our conscious reality is just a hallucination that we collectively agree upon.” --Anil Seth

Actually, as humans, we also don’t just passively perceive the world; we actively generate it and make best guesses based on our prior experiences!

So our AI reasoning systems also simply make their best guesses given the perceptual evidence, guesses which will always be defeasible in light of further evidence.

75 of 77


To conclude:

We have come a long way in giving our AI systems narrative understanding, but we are just getting close to the foothills...

[Image: a mountain, with “where we are” at the foothills and “where we need to get” at the peak]

76 of 77


“To build truly intelligent machines, teach them cause & effect” --Judea Pearl

Truly intelligent AI systems need to learn to build causal explanations

77 of 77

Any Questions?
