LC-QuAD

Large-scale Complex Question Answering Dataset

Priyansh Trivedi¹, Gaurav Maheshwari¹, Mohnish Dubey¹, Jens Lehmann¹,²

¹ University of Bonn, Bonn, Germany

² Fraunhofer IAIS, St. Augustin, Germany

ISWC 2017, Vienna

Outline

Question Answering

Motivation

Dataset(s)

LC-QuAD

Process of Creating LC-QuAD

Takeaways

Future Directions

Question Answering over KG

Understand the intent of a factual question, and return the implicit KB resource.

Typically treated as a translation problem from natural language to a formal language.

The field has seen major advances in the past five years.

Motivation

Challenges, once set, can be overcome.

In other words ...

Datasets precede Research

In 2013, Berant et al. released WebQuestions.

Current State of the Art: 69% (Liang et al., 2016)

The Question Answering over Linked Data (QALD) challenge has been held for the past 8 years.

Over 38 submissions.

Other Incentives

Dearth of large QA datasets over DBpedia.

Traditional dataset generation methods are time-consuming and do not scale.

Dataset(s)

Dataset                                Size        Logical Forms  Complex Questions  Target KB
Free917 (Cai et al., 2013)             917         Yes            Yes                Freebase
WebQuestions (Berant et al., 2013)     5 810       No             Yes                Freebase
SimpleQuestions (Bordes et al., 2015)  108 442     No             No                 Freebase
30M Factoid (Serban et al., 2016)      30 000 000  No             No                 Freebase
QALD (Unger et al., 2016)              450         Yes            Yes                DBpedia
LC-QuAD                                5 000       Yes            Yes                DBpedia

*unless unintentionally made complex

LC-QuAD

✔ has complex questions

✔ has SPARQL queries

✔ has boolean and aggregate based queries

✔ is supervised (gold standard)

✔ is extensible

✔ is awesome 😄

Dataset Creation Process

Traditionally

Natural Language Questions are collected/created.

Thereafter, their logical form is manually created.

Name someone influenced by J. R. R. Tolkien, who won the Hugo Award?

SELECT DISTINCT ?uri WHERE {
    ?uri dbo:influencedBy dbr:J.R.R._Tolkien .
    ?uri dbo:award dbr:Hugo_Award
}

Prerequisites

✔ understanding KG Schema

✔ understanding target formal language (SPARQL)

✔ no room for errors

✔ understand NL

Inverting the Process

Name someone influenced by J. R. R. Tolkien, who won the Hugo Award?

SELECT DISTINCT ?uri WHERE {
    ?uri dbo:influencedBy dbr:J.R.R._Tolkien .
    ?uri dbo:award dbr:Hugo_Award
}

Prerequisites

understanding KG Schema

✔ understanding target formal language (SPARQL)

no room for errors

✔ understand NL

Upon Further Simplification

Name someone influenced by J. R. R. Tolkien, who won the Hugo Award?

What is person whose influenced by is J. R. R. Tolkien, and award is the Hugo Award.

SELECT DISTINCT ?uri WHERE {
    ?uri dbo:influencedBy dbr:J.R.R._Tolkien .
    ?uri dbo:award dbr:Hugo_Award
}

The intermediate question is generated automatically from the SPARQL query.

Prerequisites

understanding KG Schema

understanding target formal language (SPARQL)

no room for errors.

✔ understand NL

Increases the speed of creating questions.

Reduces domain expertise required.

Can afford slight errors.

Allowing us to scale up!

Remaining Challenges

Automatically create SPARQL queries.

Convert SPARQL queries to intermediary NLQs.

Automatically create SPARQL Queries

Create SPARQL Templates.

SELECT DISTINCT ?uri WHERE {
    ?uri e_to_e_out1 e_out1 .
    ?uri e_to_e_out2 e_out2
}
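A minimal sketch of how such a template could be instantiated, assuming plain string substitution. The placeholder names come from the slide; wrapping them in angle brackets, the function name `fill_template`, and the bindings are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: fill the two-triple template by string substitution.
# Placeholders are wrapped in <...> (an assumption) so that replacing
# "e_out1" cannot accidentally corrupt "e_to_e_out1".
TEMPLATE = (
    "SELECT DISTINCT ?uri WHERE { "
    "?uri <e_to_e_out1> <e_out1> . "
    "?uri <e_to_e_out2> <e_out2> }"
)


def fill_template(template, bindings):
    """Replace each <placeholder> in the template with its bound term."""
    query = template
    for placeholder, term in bindings.items():
        query = query.replace(f"<{placeholder}>", term)
    return query


query = fill_template(
    TEMPLATE,
    {
        "e_to_e_out1": "dbo:influencedBy",
        "e_out1": "dbr:J.R.R._Tolkien",
        "e_to_e_out2": "dbo:award",
        "e_out2": "dbr:Hugo_Award",
    },
)
print(query)
```

The bindings reuse the Tolkien / Hugo Award example from the earlier slides, so the printed query should correspond to the one shown there.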

Manually select entities as answers to our queries.

Example: Stephen King

Collect the 2-hop subgraph around these entities.

Superimpose the SPARQL triple patterns on this subgraph.
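The steps above (pick a seed entity, collect its 2-hop subgraph, match the triple pattern against it) can be sketched on a toy in-memory graph. The triples below are illustrative stand-ins rather than actual DBpedia query results, and the helper names `two_hop_subgraph` and `match_two_outgoing` are hypothetical; the real pipeline would query DBpedia instead.

```python
from itertools import combinations

# Toy triples (assumed for illustration only).
TRIPLES = [
    ("dbr:Stephen_King", "dbo:influencedBy", "dbr:J.R.R._Tolkien"),
    ("dbr:Stephen_King", "dbo:award", "dbr:Hugo_Award"),
    ("dbr:Stephen_King", "dbo:genre", "dbr:Horror_fiction"),
    ("dbr:J.R.R._Tolkien", "dbo:birthPlace", "dbr:Bloemfontein"),
    ("dbr:Bloemfontein", "dbo:country", "dbr:South_Africa"),  # 3 hops away
]


def two_hop_subgraph(triples, seed):
    """Collect triples within two hops of the seed, in either direction."""
    frontier = {seed}
    collected = set()
    for _ in range(2):
        hop = {t for t in triples if t[0] in frontier or t[2] in frontier}
        collected |= hop
        frontier = {t[0] for t in hop} | {t[2] for t in hop}
    return collected


def match_two_outgoing(subgraph, seed):
    """Match '?uri p1 e1 . ?uri p2 e2' with the seed as the answer ?uri."""
    outgoing = sorted(t for t in subgraph if t[0] == seed)
    return [
        f"SELECT DISTINCT ?uri WHERE {{ ?uri {p1} {e1} . ?uri {p2} {e2} }}"
        for (_, p1, e1), (_, p2, e2) in combinations(outgoing, 2)
    ]


subgraph = two_hop_subgraph(TRIPLES, "dbr:Stephen_King")
queries = match_two_outgoing(subgraph, "dbr:Stephen_King")
for q in queries:
    print(q)
```

Every pair of outgoing edges of the seed yields one candidate query, which hints at why the number of generated queries per subgraph grows quickly.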

SPARQL Queries Generated

SELECT DISTINCT ?uri WHERE {
    ?uri dbo:influencedBy dbr:J.R.R._Tolkien .
    ?uri dbo:award dbr:Hugo_Award
}

Creating intermediary NLQs

Whose influenced by is J. R. R. Tolkien, and award is the Hugo Award.

SELECT DISTINCT ?uri WHERE {
    ?uri dbo:influencedBy dbr:J.R.R._Tolkien .
    ?uri dbo:award dbr:Hugo_Award
}

Question Templates (Normalized Natural Question Templates, NNQT)

Whose e_to_e_out1 is e_out1, and e_to_e_out2 is the e_out2.

SELECT DISTINCT ?uri WHERE {
    ?uri e_to_e_out1 e_out1 .
    ?uri e_to_e_out2 e_out2
}

Template Instances

Whose influenced by is J. R. R. Tolkien, and award is the Hugo Award.

SELECT DISTINCT ?uri WHERE {
    ?uri dbo:influencedBy dbr:J.R.R._Tolkien .
    ?uri dbo:award dbr:Hugo_Award
}
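A rough sketch of how a question template might be turned into the instance above. It assumes entity labels are looked up from the knowledge graph (stubbed here as a small dictionary) and that predicate labels are derived by splitting camelCase; both choices, the angle-bracket placeholders, and all function names are illustrative assumptions.

```python
import re

NNQT = "Whose <e_to_e_out1> is <e_out1>, and <e_to_e_out2> is the <e_out2>."

# Stand-in for labels that would come from the knowledge graph (assumed values).
ENTITY_LABELS = {
    "dbr:J.R.R._Tolkien": "J. R. R. Tolkien",
    "dbr:Hugo_Award": "Hugo Award",
}


def predicate_label(predicate):
    """Turn a prefixed camelCase predicate into lowercase words."""
    local_name = predicate.split(":", 1)[1]
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", local_name).lower()


def fill_nnqt(template, bindings):
    """Replace each <placeholder> with a human-readable label."""
    question = template
    for placeholder, term in bindings.items():
        label = ENTITY_LABELS.get(term) or predicate_label(term)
        question = question.replace(f"<{placeholder}>", label)
    return question


question = fill_nnqt(
    NNQT,
    {
        "e_to_e_out1": "dbo:influencedBy",
        "e_out1": "dbr:J.R.R._Tolkien",
        "e_to_e_out2": "dbo:award",
        "e_out2": "dbr:Hugo_Award",
    },
)
print(question)
```

The output is deliberately ungrammatical: it is the intermediate form that a human later rewrites into a natural question.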

Summary

SELECT DISTINCT ?uri WHERE {
    ?uri dbo:influencedBy dbr:J.R.R._Tolkien .
    ?uri dbo:award dbr:Hugo_Award
}

↓ Automatic

Whose influenced by is J. R. R. Tolkien, and award is the Hugo Award.

↓ Manual

Name someone influenced by J. R. R. Tolkien, who won the Hugo Award?

Manual Work

Two-Step Process:

  • The first human pass corrects the grammar of the NNQTs, and sometimes paraphrases the questions.
  • The second pass verifies each question and fixes missing spaces, typos, and similar errors.

Discussion

Dataset Characteristics

Questions: 5,000

SPARQL Templates: 33

Simple Questions: 18%

Avg. Question Size: 12.29 words

DBpedia Version: 04-2016

Downloads*: 150+

*as of 16th October, 2017

Controlling Size and Variety

Problems:

  • Too many queries are generated per subgraph.
  • Some predicates are disproportionately frequent (e.g. dbp:birthplace).
  • Metadata triples.

Remedies:

  • Filter triples based on a predicate whitelist.
  • Stochastically prune the subgraph.
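One way to picture the whitelist filter and the stochastic pruning, as a sketch only: the whitelist contents, the keep probability, and the toy triples are all assumptions made for illustration, not the settings used to build the dataset.

```python
import random

# Toy triples (illustrative; "<page-id>" is a placeholder, not a real value).
TRIPLES = [
    ("dbr:Stephen_King", "dbo:influencedBy", "dbr:J.R.R._Tolkien"),
    ("dbr:Stephen_King", "dbo:award", "dbr:Hugo_Award"),
    ("dbr:Stephen_King", "dbp:birthplace", "dbr:Maine"),
    ("dbr:Stephen_King", "dbo:wikiPageID", "<page-id>"),  # metadata triple
]

# Assumed whitelist: only these predicates may appear in generated queries.
PREDICATE_WHITELIST = {"dbo:influencedBy", "dbo:award", "dbp:birthplace"}


def filter_by_whitelist(triples, whitelist):
    """Keep only triples whose predicate is whitelisted."""
    return [t for t in triples if t[1] in whitelist]


def stochastic_prune(triples, keep_probability, rng):
    """Keep each triple independently with the given probability."""
    return [t for t in triples if rng.random() < keep_probability]


filtered = filter_by_whitelist(TRIPLES, PREDICATE_WHITELIST)
pruned = stochastic_prune(filtered, keep_probability=0.5, rng=random.Random(0))
print(len(TRIPLES), len(filtered), len(pruned))
```

Passing an explicit `random.Random` instance keeps the pruning reproducible, which is convenient when the generated dataset needs to be regenerated.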

Limitations

No literals are included in the questions.

No UNION or OPTIONAL queries.

No conditional aggregates.

No out-of-scope questions.

Future Directions

Creating baselines.

Automatic grammar correction.

Complex SPARQL templates.

Keeping up with DBpedia versions.

References, Citations

Cai, Qingqing, and Alexander Yates. "Large-scale Semantic Parsing via Schema Matching and Lexicon Extension." ACL (2013).

Berant, Jonathan, et al. "Semantic Parsing on Freebase from Question-Answer Pairs." EMNLP (2013).

Bordes, Antoine, et al. "Large-scale Simple Question Answering with Memory Networks." arXiv preprint arXiv:1506.02075 (2015).

Serban, Iulian Vlad, et al. "Generating Factoid Questions with Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus." arXiv preprint arXiv:1603.06807 (2016).

Unger, Christina, Axel-Cyrille Ngonga Ngomo, and Elena Cabrio. "6th Open Challenge on Question Answering over Linked Data (QALD-6)." Semantic Web Evaluation Challenge. Springer International Publishing (2016).

Liang, Chen, et al. "Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision." arXiv preprint arXiv:1611.00020 (2016).

This presentation uses licensed works by generous individuals/organizations, namely:

  • Harald Nezbeda (Vienna, Slide 2)
  • Icons8.com (multiple slides)
  • Pinguino’s Flickr Account (Slide 29, 30)

Questions?

See what I did there?

See for yourself.

LC-QuAD Website

lc-quad.sda.tech