1 of 50

LC-QuAD

Large-scale Complex Question Answering Dataset

Priyansh Trivedi¹, Gaurav Maheshwari¹, Mohnish Dubey¹, Jens Lehmann¹,²

¹University of Bonn, Bonn, Germany

²Fraunhofer IAIS, St. Augustin, Germany

ISWC 2017, Vienna

2 of 50

Outline

Question Answering

Motivation

Dataset(s)

LC-QuAD

Process of Creating LC-QuAD

Takeaways

Future Directions

3 of 50

Question Answering over KG

Understand the intent of a factual question, and return the implicit KB resource.

Typically treated as a translation problem from natural language to formal language.

Has seen major advances in the past five years.

4 of 50

Motivation

5 of 50

Challenges, once set, can then be overcome.

In other words ...

6 of 50

Datasets precede Research

In 2013, Berant et al. released WebQuestions.

Current State of the Art: 69% (Liang et al., 2016)

The Question Answering over Linked Data (QALD) challenge has been held over the past 8 years.

Over 38 submissions.

7 of 50

Other Incentives

Dearth of large QA datasets over DBpedia.

Traditional dataset generation methods are time-consuming and do not scale.

8 of 50

Dataset(s)

Dataset	Size	Logical Forms	Complex Questions	Target KB
Free917 (Cai et al., 2013)	917	Yes	Yes	Freebase
WebQuestions (Berant et al., 2013)	5 810	No	Yes	Freebase
SimpleQuestions (Bordes et al., 2015)	108 442	No	No	Freebase
30M Factoid (Serban et al., 2016)	30 000 000	No	No	Freebase
QALD (Unger et al., 2016)	450	Yes	Yes	DBpedia

*unless unintentionally made complex

9 of 50

Dataset(s)

Dataset	Size	Logical Forms	Complex Questions	Target KB
Free917 (Cai et al., 2013)	917	Yes	Yes	Freebase
WebQuestions (Berant et al., 2013)	5 810	No	Yes	Freebase
SimpleQuestions (Bordes et al., 2015)	108 442	No	No	Freebase
30M Factoid (Serban et al., 2016)	30 000 000	No	No	Freebase
QALD (Unger et al., 2016)	450	Yes	Yes	DBpedia
LC-QuAD	5 000	Yes	Yes	DBpedia

17 of 50

LC-QuAD

✔ has complex questions

✔ has SPARQL queries

✔ has boolean and aggregate based queries

✔ is supervised (gold standard)

✔ is extensible

✔ is awesome 😄

18 of 50

Dataset Creation Process

19 of 50

Traditionally

Natural Language Questions are collected/created.

Thereafter, their logical form is manually created.

20 of 50

Traditionally

Name someone influenced by J. R. R. Tolkien, who won the Hugo Award?

SELECT DISTINCT ?uri WHERE {

?uri dbo:influencedBy dbr:J.R.R._Tolkien.

?uri dbo:award dbr:Hugo_Award

}

21 of 50

Traditionally

Name someone influenced by J. R. R. Tolkien, who won the Hugo Award?

SELECT DISTINCT ?uri WHERE {

?uri dbo:influencedBy dbr:J.R.R._Tolkien.

?uri dbo:award dbr:Hugo_Award

}

Prerequisites

✔ understanding KG Schema

✔ understanding target formal language (SPARQL)

✔ no room for errors

✔ understand NL

22 of 50

Inverting the Process

Name someone influenced by J. R. R. Tolkien, who won the Hugo Award?

SELECT DISTINCT ?uri WHERE {

?uri dbo:influencedBy dbr:J.R.R._Tolkien.

?uri dbo:award dbr:Hugo_Award

}

23 of 50

Inverting the Process

Name someone influenced by J. R. R. Tolkien, who won the Hugo Award?

SELECT DISTINCT ?uri WHERE {

?uri dbo:influencedBy dbr:J.R.R._Tolkien.

?uri dbo:award dbr:Hugo_Award

}

Prerequisites

✔ understanding KG Schema

✔ understanding target formal language (SPARQL)

✔ no room for errors

✔ understand NL

24 of 50

Upon Further Simplification

Name someone influenced by J. R. R. Tolkien, who won the Hugo Award?

What is person whose influenced by is J. R. R. Tolkien, and award is the Hugo Award.

SELECT DISTINCT ?uri WHERE {

?uri dbo:influencedBy dbr:J.R.R._Tolkien.

?uri dbo:award dbr:Hugo_Award }

automatic

25 of 50

Upon Further Simplification

Name someone influenced by J. R. R. Tolkien, who won the Hugo Award?

What is person whose influenced by is J. R. R. Tolkien, and award is the Hugo Award.

SELECT DISTINCT ?uri WHERE {

?uri dbo:influencedBy dbr:J.R.R._Tolkien.

?uri dbo:award dbr:Hugo_Award }

Prerequisites

✔ understanding KG Schema

✔ understanding target formal language (SPARQL)

✔ no room for errors.

✔ understand NL

26 of 50

Upon Further Simplification

Increases the speed of creating questions.

Reduces domain expertise required.

Can afford slight errors.

Allowing us to scale up!

Prerequisites

✔ understanding KG Schema

✔ understanding target formal language (SPARQL)

✔ no room for errors.

✔ understand NL

27 of 50

Remaining Challenges

Automatically create SPARQL queries.

Convert SPARQL queries to intermediary NLQs.

28 of 50

Automatically create SPARQL Queries

Create SPARQL Templates.

SELECT DISTINCT ?uri WHERE {

?uri e_to_e_out1 e_out1. ?uri e_to_e_out2 e_out2

}
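A minimal sketch of how such a template could be filled in, assuming the placeholders are marked with angle brackets so that `e_out1` is not mistaken for part of `e_to_e_out1`. The `fill_template` helper and the bracket convention are illustrative assumptions, not part of the LC-QuAD tooling.

```python
# Two-triple template from the slide; <...> marks a slot (an assumed convention).
TEMPLATE = (
    "SELECT DISTINCT ?uri WHERE { "
    "?uri <e_to_e_out1> <e_out1> . "
    "?uri <e_to_e_out2> <e_out2> }"
)


def fill_template(template, bindings):
    """Replace every <slot> in the template with the term bound to it."""
    query = template
    for slot, term in bindings.items():
        query = query.replace(f"<{slot}>", term)
    return query


query = fill_template(TEMPLATE, {
    "e_to_e_out1": "dbo:influencedBy",
    "e_out1": "dbr:J.R.R._Tolkien",
    "e_to_e_out2": "dbo:award",
    "e_out2": "dbr:Hugo_Award",
})
print(query)
```

Filling the slots this way yields the running Tolkien / Hugo Award query used throughout the deck.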

29 of 50

Automatically create SPARQL Queries

Create SPARQL Templates.

30 of 50

Automatically create SPARQL Queries

Manually select entities as answers to our queries.

Stephen King

31 of 50

Automatically create SPARQL Queries

Collect the 2-hop subgraph around these entities.
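One way to picture this step is a breadth-first walk of depth two over a list of triples. The sketch below uses a small in-memory list as a stand-in for DBpedia; the toy triples and the `two_hop_subgraph` helper are assumptions made for illustration.

```python
from collections import defaultdict

# Toy triples standing in for DBpedia (illustrative only).
TRIPLES = [
    ("dbr:Stephen_King", "dbo:influencedBy", "dbr:J.R.R._Tolkien"),
    ("dbr:Stephen_King", "dbo:award", "dbr:Hugo_Award"),
    ("dbr:J.R.R._Tolkien", "dbo:birthPlace", "dbr:Bloemfontein"),
    ("dbr:Bloemfontein", "dbo:country", "dbr:South_Africa"),
]


def two_hop_subgraph(seed, triples):
    """Return all triples within two hops of `seed`, ignoring edge direction."""
    touching = defaultdict(set)
    for s, p, o in triples:
        touching[s].add((s, p, o))
        touching[o].add((s, p, o))

    frontier, seen, subgraph = {seed}, {seed}, set()
    for _ in range(2):  # expand the frontier exactly two times
        next_frontier = set()
        for node in frontier:
            for s, p, o in touching[node]:
                subgraph.add((s, p, o))
                for other in (s, o):
                    if other not in seen:
                        seen.add(other)
                        next_frontier.add(other)
        frontier = next_frontier
    return subgraph


subgraph = two_hop_subgraph("dbr:Stephen_King", TRIPLES)
print(len(subgraph))
```

The triple about Bloemfontein's country is three hops from the seed, so it stays outside the collected subgraph.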

32 of 50

Automatically create SPARQL Queries

Juxtapose the SPARQL triple pattern on this subgraph.
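As a rough sketch of this step for the two-triple template: treat the seed entity as `?uri` and emit one query for every pair of its outgoing edges. The toy subgraph and the `instantiate` helper are hypothetical and only cover this one template shape.

```python
from itertools import combinations

# Toy subgraph around the seed entity (illustrative only).
SUBGRAPH = [
    ("dbr:Stephen_King", "dbo:influencedBy", "dbr:J.R.R._Tolkien"),
    ("dbr:Stephen_King", "dbo:award", "dbr:Hugo_Award"),
    ("dbr:Stephen_King", "dbo:genre", "dbr:Horror_fiction"),
]


def instantiate(seed, subgraph):
    """Map ?uri onto `seed` and yield one query per pair of its outgoing edges."""
    outgoing = sorted((p, o) for s, p, o in subgraph if s == seed)
    for (p1, o1), (p2, o2) in combinations(outgoing, 2):
        yield (
            "SELECT DISTINCT ?uri WHERE { "
            f"?uri {p1} {o1} . ?uri {p2} {o2} }}"
        )


queries = list(instantiate("dbr:Stephen_King", SUBGRAPH))
for q in queries:
    print(q)
```

Even three outgoing edges already give three queries, which hints at why the number of generated queries per subgraph has to be controlled.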

33 of 50

SPARQL Queries Generated

SELECT DISTINCT ?uri WHERE {

?uri dbo:influencedBy dbr:J.R.R._Tolkien.

?uri dbo:award dbr:Hugo_Award

}

34 of 50

Remaining Challenges

Automatically create SPARQL queries.

Convert SPARQL queries to intermediary NLQs.

35 of 50

Creating intermediary NLQs

Whose influenced by is J. R. R. Tolkien, and award is the Hugo Award.

SELECT DISTINCT ?uri WHERE {

?uri dbo:influencedBy dbr:J.R.R._Tolkien.

?uri dbo:award dbr:Hugo_Award

}

36 of 50

Question Templates (NNQT)

Whose e_to_e_out1 is e_out1, and e_to_e_out2 is the e_out2.

SELECT DISTINCT ?uri WHERE {

?uri e_to_e_out1 e_out1. ?uri e_to_e_out2 e_out2

}
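A small sketch of how an NNQT could be filled from the same bindings as the SPARQL template. It assumes surface forms are derived from the URI local names; real labels (for example "J. R. R. Tolkien") would more likely come from the knowledge graph's label property. The helper names are hypothetical.

```python
import re

# Question template matching the two-triple SPARQL template on the slide.
NNQT_TEMPLATE = "Whose {p1} is {o1}, and {p2} is the {o2}."


def predicate_label(term):
    """Turn a predicate like dbo:influencedBy into 'influenced by'."""
    local = term.split(":", 1)[1]
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", local).lower()


def entity_label(term):
    """Turn an entity like dbr:Hugo_Award into 'Hugo Award'."""
    return term.split(":", 1)[1].replace("_", " ")


def to_nnqt(p1, o1, p2, o2):
    """Fill the NNQT with labels for the two predicates and two entities."""
    return NNQT_TEMPLATE.format(
        p1=predicate_label(p1), o1=entity_label(o1),
        p2=predicate_label(p2), o2=entity_label(o2),
    )


nnqt = to_nnqt("dbo:influencedBy", "dbr:J.R.R._Tolkien",
               "dbo:award", "dbr:Hugo_Award")
print(nnqt)
```

The output is deliberately ungrammatical; correcting it is the job of the manual step.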

37 of 50

Template Instances

Whose influenced by is J. R. R. Tolkien, and award is the Hugo Award.

SELECT DISTINCT ?uri WHERE {

?uri dbo:influencedBy dbr:J.R.R._Tolkien.

?uri dbo:award dbr:Hugo_Award }

38 of 50

Summary

Whose influenced by is J. R. R. Tolkien, and award is the Hugo Award.

SELECT DISTINCT ?uri WHERE {

?uri dbo:influencedBy dbr:J.R.R._Tolkien.

?uri dbo:award dbr:Hugo_Award }

Name someone influenced by J. R. R. Tolkien, who won the Hugo Award?

39 of 50

Summary

Name someone influenced by J. R. R. Tolkien, who won the Hugo Award?

Whose influenced by is J. R. R. Tolkien, and award is the Hugo Award.

SELECT DISTINCT ?uri WHERE {

?uri dbo:influencedBy dbr:J.R.R._Tolkien.

?uri dbo:award dbr:Hugo_Award }

Automatic

40 of 50

Summary

Name someone influenced by J. R. R. Tolkien, who won the Hugo Award?

Whose influenced by is J. R. R. Tolkien, and award is the Hugo Award.

SELECT DISTINCT ?uri WHERE {

?uri dbo:influencedBy dbr:J.R.R._Tolkien.

?uri dbo:award dbr:Hugo_Award }

Automatic

Manual

41 of 50

Manual Work

Two Step Process:

First, a human corrects the grammar of the NNQT, and sometimes paraphrases the question.
Second, the question is verified, and any missing spaces, typos, etc. are fixed.

42 of 50

(Note: Outdated fact here 😭 )

43 of 50

Discussion

44 of 50

Dataset Characteristics

5,000 Questions

33 SPARQL Templates

18% Simple Questions

12.29 words Avg. Question Size

04-2016 DBpedia Version

150+ Downloads*

*as of 16th October, 2017

45 of 50

Controlling Size and Variety

Too many queries generated per subgraph.

Some predicates occur disproportionately often (e.g. dbp:birthplace).

Metadata triples.

Filters based on predicate whitelist.

Stochastically prune the subgraph.
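The two remedies above could be combined roughly as follows: drop triples whose predicate is not whitelisted, then keep each remaining triple with a fixed probability. The whitelist contents, the `keep_prob` parameter and the `prune` helper are assumptions for illustration; the actual filtering rules are not given on the slide.

```python
import random

# Hypothetical whitelist; the real list is not shown in the slides.
PREDICATE_WHITELIST = {"dbo:influencedBy", "dbo:award"}

# Toy subgraph including one metadata triple (illustrative only).
SUBGRAPH = [
    ("dbr:Stephen_King", "dbo:influencedBy", "dbr:J.R.R._Tolkien"),
    ("dbr:Stephen_King", "dbo:award", "dbr:Hugo_Award"),
    ("dbr:Stephen_King", "dbo:wikiPageID", '"placeholder"'),
]


def prune(subgraph, whitelist, keep_prob, rng):
    """Keep whitelisted triples only, each with probability `keep_prob`."""
    allowed = [t for t in subgraph if t[1] in whitelist]
    return [t for t in allowed if rng.random() < keep_prob]


rng = random.Random(0)  # seeded so that a run can be reproduced
kept_all = prune(SUBGRAPH, PREDICATE_WHITELIST, 1.0, rng)
kept_some = prune(SUBGRAPH, PREDICATE_WHITELIST, 0.5, rng)
print(len(kept_all), len(kept_some))
```

With `keep_prob` set to 1.0 only the whitelist filter applies; lower values thin out the subgraph before templates are matched against it.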

46 of 50

Limitations

No literals are included in the questions.

No UNION, OPTIONAL queries.

No conditional aggregates.

No out-of-scope questions.

47 of 50

Future Directions

Creating baselines.

Automatic grammar correction.

Complex SPARQL templates.

Keeping up with DBpedia versions.

48 of 50

References, Citations

Cai, Qingqing, and Alexander Yates. "Large-scale Semantic Parsing via Schema Matching and Lexicon Extension." ACL (2013).

Berant, Jonathan, et al. "Semantic Parsing on Freebase from Question-Answer Pairs." EMNLP (2013).

Bordes, Antoine, et al. "Large-scale Simple Question Answering with Memory Networks." arXiv preprint arXiv:1506.02075 (2015).

Serban, Iulian Vlad, et al. "Generating Factoid Questions with Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus." arXiv preprint arXiv:1603.06807 (2016).

Unger, Christina, Axel-Cyrille Ngonga Ngomo, and Elena Cabrio. "6th Open Challenge on Question Answering over Linked Data (QALD-6)." Semantic Web Evaluation Challenge. Springer International Publishing (2016).

Liang, Chen, et al. "Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision." arXiv preprint arXiv:1611.00020 (2016).

This presentation uses licensed works by generous individuals/organizations, namely:

Harald Nezbeda (Vienna, Slide 2)
Icons8.com (multiple slides)
Pinguino’s Flickr Account (Slide 29, 30)
