1 of 55

Talk at IBM Almaden, July 21, 2016

TextDB: Declarative and Scalable Text Analytics on Large Data Sets

Chen Li

UC Irvine

1

2 of 55

Text is everywhere

2

3 of 55

Text-Processing Tools

3

Relational DBMS

Search Engines

IBM SystemT

4 of 55

Our Vision: text-centric data-management system

  • Support text storage
  • Provide basic primitives for text processing
  • Declarative language based on operators
  • Efficient by using indexes, query processing, and optimization

It’s NOT:

  • a full solution for all text-processing tasks

4

5 of 55

Requirements

5

Declarative

Efficient

6 of 55

TextDB System Architecture

6

Similarity

Matcher

Regex

Matcher

Query Processor

Dictionary

Matcher

Keyword Matcher

Declarative Query Language (TextQL)

Document Store (Lucene)

Inverted Index, Engine

Application 1

System T

Library

Application 2

...

Query

Rewriter

Stanford

NLP

7 of 55

Design Decisions

  • DB-style operators
  • Using Lucene
    • Widely adopted as a successful search engine
    • Support inverted index and keyword queries, required for many text-processing tasks
    • Expose various text analyzers
    • “Standing on the shoulders of giants”
    • Live with its limitations
  • Current focus: information extraction

7

8 of 55

Talk Overview

  • Overview of operators
  • Regex operator
  • Performance numbers
  • Open problems

8

9 of 55

Operator interface

open()

getNextTuple()

close()

9

10 of 55

Operators:

  • Keyword matcher
  • Dictionary matcher
  • Fuzzy matcher
  • Stanford NLP
  • Query rewriter
  • Regex matcher

10

11 of 55

Keyword Matcher

  • Example: Find documents related to “Zika Olympics Rio”.
    • Conjunctive query of keywords (Index-based)
    • Phrase matching (Index-based)
    • Substring matching (scan)

11

12 of 55

Keyword Matcher

12

PHRASE_INDEXBASED

SUBSTRING_SCANBASED

13 of 55

Keyword Matcher: using Lucene Query

13

14 of 55

Dictionary Matcher

14

Zika

Fever

X-ray

Bacteria

…...

Dictionary

Matcher

Conjunction_Indexbased

Phrase_Indexbased

Substring_Scanbased

Keyword

Matcher

Phrase_Indexbased

Substring_Scanbased

Conjunction_Indexbased

Iterative call

getNextTuple()

matching results

Dictionary

15 of 55

Fuzzy Token Matcher

  • Given a query of N tokens, find all documents which have at least T tokens.
  • Example: Find documents that have 4 out of 6 keywords “Zika Olympics Rio games diseases virus”.

15

16 of 55

Fuzzy Token Matcher: a use case :-)

  • Find speeches similar to Melania Trump's RNC speech

16

17 of 55

Fuzzy Token Matcher: a use case :-)

17

18 of 55

Fuzzy Token Matcher

18

Set up

Get matching results

19 of 55

Fuzzy Token Matcher: Lucene Query

19

20 of 55

Stanford NLP

Operator

20

  • Tokenizer (tokenize)
  • Sentence splitter (ssplit)
  • Part-of-speech (POS) tagger (pos)
  • Named entity recognizer (NER) (ner)
  • Parser (parse)
  • Coreference resolution system (dcoref)

Goal:

  • Supporting NLP in TextDB
  • Testing feasibility of integrating other 3rd-party packages

21 of 55

Operator: Stanford NLP, Name Entity Recognition (NER)

21

  • 7 Named Entity classes:
    • Number
    • Location
    • Person
    • Organization
    • Money
    • Percent
    • Date
    • Time
  • 4 types of Part of Speech entity:
    • Adjective
    • Adverb
    • Noun
    • Verb

22 of 55

Query Rewriter (fixing missing spaces in query)

Deal with errors in queries

Example:

  • Input: “Zika OlympicsRio
  • Output: “Zika Olympics Rio

22

23 of 55

Talk Overview

  • Overview of operators
  • Regex matcher operator
  • Performance numbers
  • Open problems

23

24 of 55

Regex Matcher

Algorithm by Russ Cox:

  • Build trigram inverted index
  • Translate regex to a boolean query of trigrams
  • Use index to find candidate documents
  • Run regex matching engine on candidates

24

25 of 55

Regex Matcher

25

Lucene

Trigram Inverted Index

zika\s*(virus|fever)

((zik AND ika AND vir AND iru AND rus)

OR

(zik AND ika AND fev AND eve AND ver))

Candidate Documents

Regex Engine

Matching Results

Regex Translator

Algorithm by Russ Cox.

26 of 55

Five Basic Regex Expressions

  • Concatenation ab
  • Alternation a|b
  • zero or one a?
  • zero or more a*
  • one or more a+

26

27 of 55

Function “trigrams()

trigrams(ab) = ANY (i.e., matching everything)

trigrams(abcd) = (abc AND bcd)

trigrams({abcd, wxyz}) = trigrams(abcd) OR trigrams(wxyz) = ((abc AND bcd) OR (wxy AND xyz))

27

28 of 55

Example: data*(bcd|pqr)

Parse Tree

28

29 of 55

data*(bcd|pqr)

29

dat

emptyable

false

exact

dat

prefix

dat

suffix

dat

match

trigrams(dat) = dat

30 of 55

data*(bcd|pqr)

30

a*

emptyable

true

exact

unknown

prefix

“”

suffix

“”

match

ANY

31 of 55

Example

data*(bcd|pqr)

31

dat

a*

data*

emptyable

false

true

false

exact

dat

unknown

unknown

prefix

dat

“”

dat

suffix

dat

“”

{dat, “”}

match

dat

ANY

dat

32 of 55

data*(bcd|pqr)

32

bcd

pqr

bcd|pqr

emptyable

false

false

false

exact

bcd

pqr

{bcd, pqr}

prefix

bcd

pqr

{bcd, pqr}

suffix

bcd

pqr

{bcd, pqr}

match

bcd

pqr

bcd OR pqr

33 of 55

data*(bcd|pqr)

33

data*

bcd|pqr

data*(bcd|pqr)

emptyable

false

false

false

exact

unknown

{bcd, pqr}

unknown

prefix

dat

{bcd, pqr}

dat

suffix

{dat, “”}

{bcd, pqr}

{datbcd, datpqr, bcd, pqr}

match

dat

bcd OR pqr

dat AND (bcd OR pqr)

34 of 55

Rewriting expression

match(expr) -> match(expr) AND trigrams(exact)

AND trigrams(prefix) AND trigrams(suffix)

data*(bcd|pqr)

34

data*(bcd|pqr)

emptyable

false

exact

unknown

prefix

dat

suffix

{datbcd, datpqr, bcd, pqr}

match

dat AND (bcd OR pqr)

dat AND (bcd OR pqr OR (dat AND atb AND tbc AND bcd) OR (dat AND atp AND tpq AND pqr))

35 of 55

Query -> DNF

data*(bcd|pqr)

35

36 of 55

Simplification using absorption law: a OR (a AND b) = a

36

37 of 55

Talk Overview

  • Overview of operators
    • Keyword operator
    • Fuzzy operator
  • Regex matcher operator
  • Performance numbers
  • Open problems

37

38 of 55

Data Set: MEDLINE articles

{ � "pmid":"19866847",� "affiliation":"Surgeon, U. S. A.",� "article_title":"ON THE APPEARANCE ......",� "authors":"W Reed",� "journal_issue":"2-5 Sep 1, 1897",� "journal_title":"The Journal of experimental medicine",� "keywords":"",� "mesh_headings":"",� "abstract":"1. The claim of L. Pfeiffer that small granular amoeboid bodies are present in the blood of vaccinated children and calves",� "zipf_score":0.019866847�}

38

39 of 55

Machine Setting

Machine setting: Macbook Air (mid-2013), Intel Core i7 (4650U), SSD hard drive, 8GB memory.

39

40 of 55

Indexing Overhead (standard analyzer)

40

Record #

Data Size

Time (s)

Index Size

10K

13.4MB

5.6

22MB

100K

140.9MB

31.82

236MB

1M

1.53GB

341.81

2.56GB

41 of 55

Indexing Overhead (trigram analyzer)

41

Record #

Data Size

Time (s)

Index Size

10K

13.4MB

10.62

83MB

100K

140.9MB

111.61

853MB

1M

1.53GB

1324.12

9.38GB

42 of 55

Keyword Matcher: queries

Randomly selected from documents

interplay

iontophoresis

inhibitor

result

choice

plasma adrenals

light interpretation

mechanical interphalangeal

joint's superficial arrangement

increase any rapidly

maximum until growth

42

43 of 55

Keyword Matcher: conjunctive queries

43

44 of 55

Keyword Matcher: Phrase-Index-Based

44

Record #

Avg Result #

Avg Time

10K

76

0.006

100K

824

0.064

1M

9,014

1.548

Machine setting: Macbook Air (mid-2013), Intel .Core i7 (4650U), SSD hard drive, 8GB memory

45 of 55

Keyword Matcher: Phrase-Index-Based

45

46 of 55

Fuzzy Token Matcher

  • “medicine medical smooth osteochondromatosis Baytussin pseudoephedrine”
  • “x-ray wound abnormal monophosphate Bapwon”
  • “blood pressure Congestion microphthalmia SanaDermRx PhenylAde”
  • “Soluble vs. insoluble fiber hydroxyurea bisabolol”
  • “surgeon cancer Micrainin Contragesic ertapenem idarubicin”

Words are randomly selected from the domain of medical keywords.

46

47 of 55

Fuzzy Token Matcher

47

1M records

Run time (s)

Avg result #

threshold = 0.35

0.863

8,737

threshold = 0.50

0.010

78

threshold = 0.65

0.009

75

threshold = 0.80

0.001

0.8

48 of 55

Stanford NLP Operator

“On 23 Jun 2016, the Colorado Department of Agriculture, State Veterinarian's Office, was notified by the USDA National Veterinary Services Laboratory (NVSL) that a non-racing horse presently located at Arapahoe Park in Aurora, [Colorado], tested positive for equine infectious anemia (EIA). Confirmatory tests are currently being run. Arapahoe Park is currently under a hold order that restricts movement of horses until an initial investigation is completed by the Colorado Department of Agriculture (CDA). The affected horse has been in Colorado less than 60 days and came from an out-of-state track. It appears that the horse was infected prior to coming to Colorado and previously tested negative for the disease in May of 2015.”

48

49 of 55

Stanford NLP Operator

49

Record #

All Named Entity time (s)

Part of Speech time (s)

5K

320.445

27.027

10K

576.864

46.908

Machine setting: Lenovo Yoga 2 Pro, 2.6 Ghz Intel core i7, 8GM RAM

50 of 55

Regex Matcher: 1M records

50

TextDB

Query

Time (s)

Result #

mosquitos?

0.669

3,337

market(ing)?

1.827

5,353

v[ir]{2}[us]{2}

9.87

91,491

medic(ine|al|ation|are|aid)?

24.737

70,048

[A-Z][aeiou|AEIOU][A-Za-z]*

71.748

5,629,558

51 of 55

Talk Overview

  • Overview of operators
  • Regex matcher
  • Performance numbers
  • Open problems

51

52 of 55

Open problems

  • More operators:
    • Join (on going)
    • Taking a document as input (e.g., finding similar documents)
  • More efficient regex queries
    • Using positional info
    • Deal with those “un-supported” expressions
  • Query language: TextSQL
  • Query optimization
    • Selectivity estimation
    • Cost estimation
  • Storing annotation for future queries
  • Go beyond information extraction

52

53 of 55

TextDB Summary

53

Similarity

Matcher

Regex

Matcher

Query Processor

Dictionary

Matcher

Keyword Matcher

Declarative Query Language (TextQL)

Document Store (Lucene)

Inverted Index, Engine

Application 1

System T

Library

Application 2

...

Query

Rewriter

Stanford

NLP

54 of 55

Acknowledgements

54

Chen Li

ZhenFeng Qi

Qing Tang

Hailey Pan

Zuozhi Wang

Rajesh Yarlagadda

Prakul Agarwal

Sandeep R. Madugula

Shuying

Varun Bharill

Sudeep Meduri

Parag Saraogi

Shiladitya Sen

Kishore Narendran

Akshay Jain

Jinggang Diao

Flavio Bayer

Feng Hong

Yang Jiao

Jianfeng Jia

Sripad

55 of 55

TextDB on github

55