1 of 50

Relation Extraction

What is relation extraction?

Dan Jurafsky

2 of 50

Extracting relations from text

  • Company report: “International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…”
  • Extracted Complex Relation:

Company-Founding:
    Company: IBM
    Location: New York
    Date: June 16, 1911
    Original-Name: Computing-Tabulating-Recording Co.

  • But we will focus on the simpler task of extracting relation triples

Founding-year(IBM, 1911)

Founding-location(IBM, New York)


3 of 50

Extracting Relation Triples from Text

The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is an American private research university located in Stanford, California … near Palo Alto, California… Leland Stanford…founded the university in 1891

Stanford EQ Leland Stanford Junior University

Stanford LOC-IN California

Stanford IS-A research university

Stanford LOC-NEAR Palo Alto

Stanford FOUNDED-IN 1891

Stanford FOUNDER Leland Stanford


4 of 50

Why Relation Extraction?

  • Create new structured knowledge bases, useful for any app
  • Augment current knowledge bases
    • Adding words to WordNet thesaurus, facts to FreeBase or DBPedia
  • Support question answering
    • The granddaughter of which actor starred in the movie “E.T.”?

(acted-in ?x “E.T.”) (is-a ?y actor) (granddaughter-of ?x ?y)
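
A minimal sketch (not from the slides) of answering this conjunctive query over a toy triple store; the triples below are illustrative:

    # Minimal sketch: answer the conjunctive query over a toy triple store.
    triples = {
        ("Drew Barrymore", "acted-in", "E.T."),
        ("John Barrymore", "is-a", "actor"),
        ("Drew Barrymore", "granddaughter-of", "John Barrymore"),
    }

    def match(rel, x=None, y=None):
        """Return (subject, object) pairs matching a triple pattern."""
        return [(s, o) for (s, r, o) in triples
                if r == rel and x in (None, s) and y in (None, o)]

    # (acted-in ?x "E.T.") (is-a ?y actor) (granddaughter-of ?x ?y)
    answers = [y for (x, _) in match("acted-in", y="E.T.")
               for (_, y) in match("granddaughter-of", x=x)
               if (y, "is-a", "actor") in triples]
    print(answers)  # ['John Barrymore']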

  • But which relations should we extract?


5 of 50

Automated Content Extraction (ACE)

17 relations from the 2008 “Relation Extraction Task”


6 of 50

Automated Content Extraction (ACE)

  • Physical-Located PER-GPE

He was in Tennessee

  • Part-Whole-Subsidiary ORG-ORG

XYZ, the parent company of ABC

  • Person-Social-Family PER-PER

John’s wife Yoko

  • Org-AFF-Founder PER-ORG

Steve Jobs, co-founder of Apple


7 of 50

UMLS: Unified Medical Language System

  • 134 entity types, 54 relations

Injury disrupts Physiological Function

Bodily Location location-of Biologic Function

Anatomical Structure part-of Organism

Pharmacologic Substance causes Pathologic Function

Pharmacologic Substance treats Pathologic Function


8 of 50

Extracting UMLS relations from a sentence

Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes

🡻

Echocardiography, Doppler DIAGNOSES Acquired stenosis


9 of 50

Databases of Wikipedia Relations

Relations extracted from the infobox:

    Stanford  state  California
    Stanford  motto  “Die Luft der Freiheit weht”

[Figure: Stanford’s Wikipedia infobox]


10 of 50

Relation databases that draw from Wikipedia

  • Resource Description Framework (RDF) triples

subject predicate object

Golden Gate Park location San Francisco

dbpedia:Golden_Gate_Park dbpedia-owl:location dbpedia:San_Francisco

  • DBPedia: 1 billion RDF triples, 385 million from English Wikipedia
  • Frequent Freebase relations:

people/person/nationality, location/location/contains

people/person/profession, people/person/place-of-birth

biology/organism_higher_classification, film/film/genre
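
As a concrete aside (not in the original slides), triples like the dbpedia-owl:location example above can be fetched from DBpedia's public SPARQL endpoint; this sketch assumes network access and the Python requests library:

    # Sketch: fetch the dbpedia-owl:location triple for Golden Gate Park
    # from DBpedia's public SPARQL endpoint. Results follow the standard
    # SPARQL 1.1 JSON results format.
    import requests

    query = """
    SELECT ?loc WHERE {
      <http://dbpedia.org/resource/Golden_Gate_Park>
          <http://dbpedia.org/ontology/location> ?loc .
    }
    """
    resp = requests.get(
        "https://dbpedia.org/sparql",
        params={"query": query, "format": "application/sparql-results+json"},
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["loc"]["value"])  # e.g. http://dbpedia.org/resource/San_Francisco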


11 of 50

Ontological relations

  • IS-A (hypernym): subsumption between classes
    • Giraffe IS-A ruminant IS-A ungulate IS-A mammal IS-A vertebrate IS-A animal

  • Instance-of: relation between individual and class
    • San Francisco instance-of city

Examples from the WordNet Thesaurus


12 of 50

How to build relation extractors

  1. Hand-written patterns
  2. Supervised machine learning
  3. Semi-supervised and unsupervised
    • Bootstrapping (using seeds)
    • Distant supervision
    • Unsupervised learning from the web


13 of 50

Relation Extraction

What is relation extraction?


14 of 50

Relation Extraction

Using patterns to extract relations


15 of 50

Rules for extracting IS-A relation

Early intuition from Hearst (1992)

    • Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use
  • What does Gelidium mean?
  • How do you know?


17 of 50

Hearst’s Patterns for extracting IS-A relations

(Hearst, 1992): Automatic Acquisition of Hyponyms

“Y such as X ((, X)* (, and|or) X)”

“such Y as X”

“X or other Y”

“X and other Y”

“Y including X”

“Y, especially X”
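
A rough sketch of how one such pattern could be implemented; this is an illustrative regex for the simplest single-hyponym case, not Hearst's actual system:

    # Illustrative regex for the Hearst pattern "Y such as X".
    # Real systems match over POS-tagged or chunked text; a plain-word
    # regex only makes the idea concrete.
    import re

    PATTERN = re.compile(
        r"(?P<hypernym>\w+(?: \w+)?)"   # Y: one or two words
        r",? such as "
        r"(?P<hyponym>[A-Z]\w+)"        # X: a capitalized term
    )

    text = ("Agar is a substance prepared from a mixture of red algae, "
            "such as Gelidium.")
    for m in PATTERN.finditer(text):
        print(f"IS-A({m.group('hyponym')}, {m.group('hypernym')})")
    # IS-A(Gelidium, red algae)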


18 of 50

Hearst’s Patterns for extracting IS-A relations

Hearst pattern     Example occurrences
X and other Y      ...temples, treasuries, and other important civic buildings.
X or other Y       Bruises, wounds, broken bones or other injuries...
Y such as X        The bow lute, such as the Bambara ndang...
such Y as X        ...such authors as Herrick, Goldsmith, and Shakespeare.
Y including X      ...common-law countries, including Canada and England...
Y, especially X    European countries, especially France, England, and Spain...


19 of 50

Extracting Richer Relations Using Rules

  • Intuition: relations often hold between specific entities
    • located-in (ORGANIZATION, LOCATION)
    • founded (PERSON, ORGANIZATION)
    • cures (DRUG, DISEASE)
  • Start with Named Entity tags to help extract relation!


20 of 50

Named Entities aren’t quite enough. Which relations hold between 2 entities?

Drug → Disease: Cure? Prevent? Cause?


21 of 50

What relations hold between 2 entities?

PERSON → ORGANIZATION: Founder? Investor? Member? Employee? President?


22 of 50

Extracting Richer Relations Using Rules and Named Entities

Who holds what office in what organization?

PERSON, POSITION of ORG

      • George Marshall, Secretary of State of the United States

PERSON (named|appointed|chose|etc.) PERSON Prep? POSITION

      • Truman appointed Marshall Secretary of State

PERSON [be]? (named|appointed|etc.) Prep? ORG POSITION

      • George Marshall was named US Secretary of State
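
A rough sketch (not from the slides) of applying the first rule over text whose entity mentions are already tagged; the bracket tag format here is an invented convention:

    # Minimal sketch: a "PERSON, POSITION of ORG" rule over text in which
    # an NE tagger has already wrapped mentions, e.g. [PER George Marshall].
    # The bracket format is a made-up convention for this illustration.
    import re

    RULE = re.compile(
        r"\[PER (?P<person>[^\]]+)\], "
        r"(?P<position>[\w ]+?) of "
        r"\[ORG (?P<org>[^\]]+)\]"
    )

    tagged = "[PER George Marshall], Secretary of State of [ORG the United States]"
    m = RULE.search(tagged)
    if m:
        print(m.group("person"), "|", m.group("position"), "|", m.group("org"))
    # George Marshall | Secretary of State | the United States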


23 of 50

Hand-built patterns for relations

  • Plus:
    • Human patterns tend to be high-precision
    • Can be tailored to specific domains
  • Minus:
    • Human patterns are often low-recall
    • A lot of work to think of all possible patterns!
    • Don’t want to have to do this for every relation!
    • We’d like better accuracy


24 of 50

Relation Extraction

Using patterns to extract relations


25 of 50

Relation Extraction

Supervised relation extraction


26 of 50

Supervised machine learning for relations

  • Choose a set of relations we’d like to extract
  • Choose a set of relevant named entities
  • Find and label data
    • Choose a representative corpus
    • Label the named entities in the corpus
    • Hand-label the relations between these entities
    • Break into training, development, and test
  • Train a classifier on the training set


27 of 50

How to do classification in supervised relation extraction

  1. Find all pairs of named entities (usually in same sentence)
  2. Decide if 2 entities are related
  3. If yes, classify the relation
  4. Why the extra step?
    • Faster classification training by eliminating most pairs
    • Can use distinct feature-sets appropriate for each task.
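
A minimal sketch of this two-stage design, using scikit-learn's LogisticRegression (a MaxEnt-style classifier) as a placeholder; featurize() is a stub for the feature templates described below:

    # Minimal sketch of two-stage supervised relation extraction:
    # stage 1 filters unrelated entity pairs, stage 2 names the relation.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def featurize(pair):
        # Stub: real features are headwords, bags of words, NE types, etc.
        return {"heads": pair["m1_head"] + "-" + pair["m2_head"]}

    def train(pairs, related, relations):
        detector = make_pipeline(DictVectorizer(), LogisticRegression())
        detector.fit([featurize(p) for p in pairs], related)       # yes/no
        pos = [(p, r) for p, yes, r in zip(pairs, related, relations) if yes]
        classifier = make_pipeline(DictVectorizer(), LogisticRegression())
        classifier.fit([featurize(p) for p, _ in pos], [r for _, r in pos])
        return detector, classifier

    def extract(pair, detector, classifier):
        feats = featurize(pair)
        if detector.predict([feats])[0]:            # stage 1: related at all?
            return classifier.predict([feats])[0]   # stage 2: which relation?
        return "NIL"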


28 of 50

Automated Content Extraction (ACE)

17 sub-relations of 6 relations from the 2008 “Relation Extraction Task”


29 of 50

Relation Extraction

Classify the relation between two entities in a sentence

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.

Candidate labels: SUBSIDIARY, FAMILY, EMPLOYMENT, FOUNDER, CITIZEN, INVENTOR, or NIL


30 of 50

Word Features for Relation Extraction

  • Headwords of M1 and M2, and combination

Airlines Wagner Airlines-Wagner

  • Bag of words and bigrams in M1 and M2

{American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}

  • Words or bigrams in particular positions left and right of M1/M2

M2: -1 spokesman

M2: +1 said

  • Bag of words or bigrams between the two entities

{a, AMR, of, immediately, matched, move, spokesman, the, unit}

American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said
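
A rough sketch (not from the slides) of computing these word features for the running example; tokenization and mention spans are hard-coded for illustration:

    # Minimal sketch of the word features above for the running example.
    # Mention spans are given by hand; a real system gets them from NER.
    tokens = ("American Airlines , a unit of AMR , immediately matched "
              "the move , spokesman Tim Wagner said").split()
    m1, m2 = (0, 2), (14, 16)  # token spans of the two mentions

    def word_features(tokens, m1, m2):
        feats = {}
        head1, head2 = tokens[m1[1] - 1], tokens[m2[1] - 1]  # rightmost words
        feats["heads"] = f"{head1}-{head2}"                  # Airlines-Wagner
        feats["m2_prev"] = tokens[m2[0] - 1]                 # spokesman
        feats["m2_next"] = tokens[m2[1]] if m2[1] < len(tokens) else "<END>"
        feats["between"] = sorted(set(tokens[m1[1]:m2[0]]))  # bag of words between
        return feats

    print(word_features(tokens, m1, m2))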


31 of 50

Named Entity Type and Mention Level Features for Relation Extraction

  • Named-entity types
    • M1: ORG
    • M2: PERSON
  • Concatenation of the two named-entity types
    • ORG-PERSON
  • Entity Level of M1 and M2 (NAME, NOMINAL, PRONOUN)
    • M1: NAME [it or he would be PRONOUN]
    • M2: NAME [the company would be NOMINAL]

American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said


32 of 50

Parse Features for Relation Extraction

  • Base syntactic chunk sequence from one to the other

NP NP PP VP NP NP

  • Constituent path through the tree from one to the other

NP 🡹 NP 🡹 S 🡹 S 🡻 NP

  • Dependency path

Airlines matched Wagner said

American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said
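
A rough sketch of extracting the dependency path between the two mention heads; it assumes spaCy and its en_core_web_sm model are installed, and is not the feature extractor used in the original work:

    # Rough sketch: dependency path between two mention heads with spaCy.
    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("American Airlines, a unit of AMR, immediately matched "
              "the move, spokesman Tim Wagner said")

    def path_to_root(token):
        """Tokens from `token` up to the parse root (the root is its own head)."""
        path = [token]
        while path[-1].head.i != path[-1].i:
            path.append(path[-1].head)
        return path

    up = path_to_root(next(t for t in doc if t.text == "Airlines"))
    down = path_to_root(next(t for t in doc if t.text == "Wagner"))
    down_ids = {t.i for t in down}
    k = next(j for j, t in enumerate(up) if t.i in down_ids)  # lowest common ancestor
    m = next(j for j, t in enumerate(down) if t.i == up[k].i)
    path = up[:k + 1] + down[:m][::-1]
    print(" ".join(t.text for t in path))  # e.g. "Airlines matched said Wagner"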


33 of 50

Gazetteer and trigger word features for relation extraction

  • Trigger list for family: kinship terms
    • parent, wife, husband, grandparent, etc. [from WordNet]
  • Gazetteer:
    • Lists of useful geo or geopolitical words
      • Country name list
      • Other sub-entities



35 of 50

Classifiers for supervised methods

  • Now you can use any classifier you like
    • MaxEnt
    • Naïve Bayes
    • SVM
    • ...
  • Train it on the training set, tune on the dev set, test on the test set


36 of 50

Evaluation of Supervised Relation Extraction

  • Compute P/R/F1 for each relation
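
For concreteness (not from the slides), per-relation P, R, and F1 can be computed from gold and predicted labels like this; the labels are invented:

    # Minimal sketch: per-relation P/R/F1 over (gold, predicted) label pairs.
    from collections import Counter

    def prf1(gold, pred):
        tp, fp, fn = Counter(), Counter(), Counter()
        for g, p in zip(gold, pred):
            if g == p and g != "NIL":
                tp[g] += 1
            else:
                if p != "NIL":
                    fp[p] += 1
                if g != "NIL":
                    fn[g] += 1
        for rel in sorted(set(tp) | set(fp) | set(fn)):
            prec = tp[rel] / (tp[rel] + fp[rel]) if tp[rel] + fp[rel] else 0.0
            rec = tp[rel] / (tp[rel] + fn[rel]) if tp[rel] + fn[rel] else 0.0
            f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
            print(f"{rel}: P={prec:.2f} R={rec:.2f} F1={f1:.2f}")

    prf1(gold=["FOUNDER", "EMPLOYMENT", "NIL", "FOUNDER"],
         pred=["FOUNDER", "NIL", "EMPLOYMENT", "CITIZEN"])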


37 of 50

Summary: Supervised Relation Extraction

+ Can get high accuracies with enough hand-labeled training data, if the test set is similar enough to the training set

- Labeling a large training set is expensive

- Supervised models are brittle, don’t generalize well to different genres


38 of 50

Relation Extraction

Supervised relation extraction


39 of 50

Relation Extraction

Semi-supervised and unsupervised relation extraction


40 of 50

Seed-based or bootstrapping approaches to relation extraction

  • No training set? Maybe you have:
    • A few seed tuples or
    • A few high-precision patterns
  • Can you use those seeds to do something useful?
    • Bootstrapping: use the seeds to directly learn to populate a relation


41 of 50

Relation Bootstrapping (Hearst 1992)

  • Gather a set of seed pairs that have relation R
  • Iterate:
    1. Find sentences with these pairs
    2. Look at the context between or around the pair and generalize the context to create patterns
    3. Use the patterns to grep for more pairs


42 of 50

Bootstrapping

  • Seed tuple: <Mark Twain, Elmira>
    • Grep (Google) for the environments of the seed tuple:

      “Mark Twain is buried in Elmira, NY.”  →  X is buried in Y
      “The grave of Mark Twain is in Elmira”  →  The grave of X is in Y
      “Elmira is Mark Twain’s final resting place”  →  Y is X’s final resting place

  • Use those patterns to grep for new tuples
  • Iterate
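
A minimal sketch of this loop; the context generalization here is deliberately naive (whole-sentence templates), unlike real bootstrapping systems:

    # Minimal sketch of relation bootstrapping over raw sentences.
    # Patterns are whole-sentence templates with X/Y slots; real systems
    # generalize contexts more carefully and score patterns for reliability.
    import re

    def harvest(sentences, seeds, max_iters=3):
        pairs, patterns = set(seeds), set()
        for _ in range(max_iters):
            for x, y in list(pairs):              # 1. find sentences with the pair
                for s in sentences:
                    if x in s and y in s:         # 2. generalize the context
                        patterns.add(s.replace(x, "X").replace(y, "Y"))
            for pat in patterns:                  # 3. grep for new pairs
                regex = (re.escape(pat).replace("X", "(?P<x>.+?)")
                                       .replace("Y", "(?P<y>.+?)"))
                for s in sentences:
                    m = re.fullmatch(regex, s)
                    if m:
                        pairs.add((m.group("x"), m.group("y")))
        return pairs

    sentences = ["Mark Twain is buried in Elmira, NY.",
                 "Robert Frost is buried in Bennington, NY."]
    print(harvest(sentences, {("Mark Twain", "Elmira, NY.")}))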


43 of 50

DIPRE: Extract <author, book> pairs

  • Start with 5 seeds:

      Author                  Book
      Isaac Asimov            The Robots of Dawn
      David Brin              Startide Rising
      James Gleick            Chaos: Making a New Science
      Charles Dickens         Great Expectations
      William Shakespeare     The Comedy of Errors

  • Find instances:

      The Comedy of Errors, by William Shakespeare, was
      The Comedy of Errors, by William Shakespeare, is
      The Comedy of Errors, one of William Shakespeare's earliest attempts
      The Comedy of Errors, one of William Shakespeare's most

  • Extract patterns (group by middle, take longest common prefix/suffix):

      ?x , by ?y ,
      ?x , one of ?y 's

  • Now iterate, finding new seeds that match the patterns

Brin, Sergey. 1998. Extracting Patterns and Relations from the World Wide Web.
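
A rough sketch (not from the paper) of DIPRE-style pattern generation: group occurrence contexts by their middle string, then take the longest common prefix and suffix:

    # Minimal sketch of DIPRE-style pattern generation. Each occurrence is
    # (prefix, middle, suffix) around a <title, author> pair; occurrences
    # with the same middle are merged by longest common prefix/suffix.
    from collections import defaultdict
    from os.path import commonprefix

    def make_patterns(occurrences):
        groups = defaultdict(list)
        for prefix, middle, suffix in occurrences:
            groups[middle].append((prefix, suffix))
        patterns = []
        for middle, ctxs in groups.items():
            pre = commonprefix([p[::-1] for p, _ in ctxs])[::-1]  # shared tail of prefixes
            suf = commonprefix([s for _, s in ctxs])              # shared head of suffixes
            patterns.append(f"{pre}?x{middle}?y{suf}")
        return patterns

    occurrences = [
        ("", ", by ", ", was"),    # The Comedy of Errors, by William Shakespeare, was
        ("", ", by ", ", is"),
        ("", ", one of ", "'s earliest attempts"),
        ("", ", one of ", "'s most"),
    ]
    print(make_patterns(occurrences))  # ['?x, by ?y, ', "?x, one of ?y's "]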


44 of 50

Snowball

  • Similar iterative algorithm
  • Group instances with similar prefix, middle, and suffix; extract patterns
    • But require that X and Y be named entities
    • And compute a confidence for each pattern

  Seed tuples:

      Organization    Location of Headquarters
      Microsoft       Redmond
      Exxon           Irving
      IBM             Armonk

  Example patterns with confidences:

      ORGANIZATION {'s, in, headquarters} LOCATION    .69
      ORGANIZATION {in, based} LOCATION               .75

E. Agichtein and L. Gravano. 2000. Snowball: Extracting Relations from Large Plain-Text Collections. ICDL.
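
As a hedged illustration, Snowball scores a pattern by how often its extractions agree with known tuples; this sketch follows the positive/(positive + negative) form of pattern confidence from the paper:

    # Minimal sketch of Snowball-style pattern confidence:
    # conf(P) = positive / (positive + negative), where a match is positive
    # if the extracted location agrees with the known tuple for that org.
    def pattern_confidence(extractions, known):
        """extractions: (org, loc) pairs produced by one pattern;
        known: dict org -> gold headquarters location."""
        positive = sum(1 for org, loc in extractions if known.get(org) == loc)
        negative = sum(1 for org, loc in extractions
                       if org in known and known[org] != loc)
        return positive / (positive + negative) if positive + negative else 0.0

    known = {"Microsoft": "Redmond", "Exxon": "Irving", "IBM": "Armonk"}
    hits = [("Microsoft", "Redmond"), ("IBM", "Armonk"), ("Exxon", "Houston")]
    print(round(pattern_confidence(hits, known), 2))  # 2 positive, 1 negative -> 0.67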


45 of 50

Distant Supervision

  • Combine bootstrapping with supervised learning
    • Instead of 5 seeds,
      • Use a large database to get huge # of seed examples
    • Create lots of features from all these examples
    • Combine in a supervised classifier

Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17.

Fei Wu and Daniel S. Weld. 2007. Autonomously Semantifying Wikipedia. CIKM 2007.

Mintz, Bills, Snow, Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL 2009.


46 of 50

Distant supervision paradigm

  • Like supervised classification:
      • Uses a classifier with lots of features
      • Supervised by detailed hand-created knowledge
      • Doesn’t require iteratively expanding patterns
  • Like unsupervised classification:
      • Uses very large amounts of unlabeled data
      • Not sensitive to genre issues in training corpus


47 of 50

Distantly supervised learning of relation extraction patterns

  1. For each relation
  2. For each tuple in a big database
  3. Find sentences in a large corpus with both entities
  4. Extract frequent features (parse, words, etc.)
  5. Train a supervised classifier using thousands of patterns

Example for the Born-In relation:

  Seed tuples:  <Edwin Hubble, Marshfield>, <Albert Einstein, Ulm>
  Sentences:    Hubble was born in Marshfield
                Einstein, born (1879), Ulm
                Hubble’s birthplace in Marshfield
  Patterns:     PER was born in LOC
                PER, born (XXXX), LOC
                PER’s birthplace in LOC
  Classifier:   P(born-in | f1, f2, f3, …, f70000)
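
A minimal sketch of the alignment step that produces the training data; the database and corpus are toy stand-ins, and the single "middle" feature is a placeholder for the thousands of real features:

    # Minimal sketch of distant supervision: align database tuples with
    # corpus sentences mentioning both entities; every match becomes a
    # training example for a supervised classifier.
    def align(database, corpus):
        """database: {relation: [(e1, e2), ...]}; corpus: list of sentences."""
        examples = []
        for relation, tuples in database.items():
            for e1, e2 in tuples:
                for sent in corpus:
                    if e1 in sent and e2 in sent:
                        middle = sent.split(e1)[-1].split(e2)[0].strip()
                        examples.append(({"middle": middle}, relation))
        return examples  # (features, label) pairs

    database = {"born-in": [("Edwin Hubble", "Marshfield"), ("Albert Einstein", "Ulm")]}
    corpus = ["Edwin Hubble was born in Marshfield",
              "Albert Einstein, born (1879), Ulm"]
    for feats, label in align(database, corpus):
        print(label, feats)
    # born-in {'middle': 'was born in'}
    # born-in {'middle': ', born (1879),'}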


48 of 50

Unsupervised relation extraction

  • Open Information Extraction:
    • extract relations from the web with no training data, no list of relations

  1. Use parsed data to train a “trustworthy tuple” classifier
  2. Single-pass extract all relations between NPs, keep if trustworthy
  3. Assessor ranks relations based on text redundancy

(FCI, specializes in, software development)

(Tesla, invented, coil transformer)

M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the web. IJCAI.


49 of 50

Evaluation of Semi-supervised and Unsupervised Relation Extraction

  • Since it extracts totally new relations from the web
    • There is no gold set of correct instances of relations!
      • Can’t compute precision (don’t know which ones are correct)
      • Can’t compute recall (don’t know which ones were missed)
  • Instead, we can approximate precision (only)
    • Draw a random sample of relations from output, check precision manually

  • Can also compute precision at different levels of recall.
    • Precision for top 1000 new relations, top 10,000 new relations, top 100,000
    • In each case taking a random sample of that set
  • But no way to evaluate recall
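
For concreteness (not from the slides), approximating precision from a random sample of the top-k extractions; the human judgments are faked as a dict:

    # Minimal sketch: approximate precision of the top-k extracted relations
    # by manually judging a random sample (judgments faked here as a dict).
    import random

    def precision_at_k(ranked_relations, k, judge, sample_size=3, seed=0):
        random.seed(seed)  # reproducible sample
        sample = random.sample(ranked_relations[:k], min(sample_size, k))
        return sum(judge[r] for r in sample) / len(sample)

    ranked = ["rel_a", "rel_b", "rel_c", "rel_d"]             # sorted by confidence
    judge = {"rel_a": 1, "rel_b": 1, "rel_c": 0, "rel_d": 1}  # 1 = human says correct
    print(precision_at_k(ranked, k=4, judge=judge))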


50 of 50

Relation Extraction

Semi-supervised and unsupervised relation extraction
