1 of 55

Talk at IBM Almaden, July 21, 2016

TextDB: Declarative and Scalable Text Analytics on Large Data Sets

Chen Li

UC Irvine

2 of 55

Text is everywhere

3 of 55

Text-Processing Tools

Relational DBMS

Search Engines

IBM SystemT

4 of 55

Our Vision: text-centric data-management system

Support text storage
Provide basic primitives for text processing
Declarative language based on operators
Efficient by using indexes, query processing, and optimization

It’s NOT:

a full solution for all text-processing tasks

5 of 55

Requirements

Declarative

Efficient

6 of 55

TextDB System Architecture

Similarity

Matcher

Regex

Matcher

Query Processor

Dictionary

Matcher

Keyword Matcher

Declarative Query Language (TextQL)

Document Store (Lucene)

Inverted Index, Engine

Application 1

System T

Library

Application 2

...

Query

Rewriter

Stanford

NLP

7 of 55

Design Decisions

DB-style operators
Using Lucene

Widely adopted as a successful search engine
Support inverted index and keyword queries, required for many text-processing tasks
Expose various text analyzers
“Standing on the shoulders of giants”
Live with its limitations

Current focus: information extraction

8 of 55

Talk Overview

Overview of operators
Regex operator
Performance numbers
Open problems

9 of 55

Operator interface

open()

getNextTuple()

close()

10 of 55

Operators:

Keyword matcher
Dictionary matcher
Fuzzy matcher
Stanford NLP
Query rewriter
Regex matcher

11 of 55

Keyword Matcher

Example: Find documents related to “Zika Olympics Rio”.

Conjunctive query of keywords (Index-based)
Phrase matching (Index-based)
Substring matching (scan)

12 of 55

Keyword Matcher

PHRASE_INDEXBASED

SUBSTRING_SCANBASED

13 of 55

Keyword Matcher: using Lucene Query

14 of 55

Dictionary Matcher

Zika
Fever
X-ray
Bacteria
…...

Dictionary

Matcher

Conjunction_Indexbased
Phrase_Indexbased
Substring_Scanbased

Keyword

Matcher

Phrase_Indexbased

Substring_Scanbased

Conjunction_Indexbased

Iterative call

getNextTuple()

matching results

Dictionary

15 of 55

Fuzzy Token Matcher

Given a query of N tokens, find all documents which have at least T tokens.
Example: Find documents that have 4 out of 6 keywords “Zika Olympics Rio games diseases virus”.

16 of 55

Fuzzy Token Matcher: a use case :-)

Find speeches similar to Melania Trump's RNC speech

17 of 55

Fuzzy Token Matcher: a use case :-)

18 of 55

Fuzzy Token Matcher

Set up

Get matching results

19 of 55

Fuzzy Token Matcher: Lucene Query

20 of 55

Stanford NLP

Operator

Tokenizer (tokenize)
Sentence splitter (ssplit)
Part-of-speech (POS) tagger (pos)
Named entity recognizer (NER) (ner)
Parser (parse)
Coreference resolution system (dcoref)

Goal:

Supporting NLP in TextDB
Testing feasibility of integrating other 3rd-party packages

21 of 55

Operator: Stanford NLP, Name Entity Recognition (NER)

7 Named Entity classes:

Number
Location
Person
Organization
Money
Percent
Date
Time

4 types of Part of Speech entity:

Adjective
Adverb
Noun
Verb

22 of 55

Query Rewriter (fixing missing spaces in query)

Deal with errors in queries

Example:

Input: “Zika OlympicsRio”
Output: “Zika Olympics Rio”

23 of 55

Talk Overview

Overview of operators
Regex matcher operator
Performance numbers
Open problems

24 of 55

Regex Matcher

Algorithm by Russ Cox:

Build trigram inverted index
Translate regex to a boolean query of trigrams
Use index to find candidate documents
Run regex matching engine on candidates

25 of 55

Regex Matcher

Lucene

Trigram Inverted Index

zika\s*(virus|fever)

((zik AND ika AND vir AND iru AND rus)

(zik AND ika AND fev AND eve AND ver))

Candidate Documents

Regex Engine

Matching Results

Regex Translator

Algorithm by Russ Cox.

26 of 55

Five Basic Regex Expressions

Concatenation ab
Alternation a|b
zero or one a?
zero or more a*
one or more a+

27 of 55

Function “trigrams()”

trigrams(ab) = ANY (i.e., matching everything)

trigrams(abcd) = (abc AND bcd)

trigrams({abcd, wxyz}) = trigrams(abcd) OR trigrams(wxyz) = ((abc AND bcd) OR (wxy AND xyz))

28 of 55

Example: data*(bcd|pqr)

Parse Tree

29 of 55

data*(bcd|pqr)

	dat
emptyable	false
exact	dat
prefix	dat
suffix	dat
match	trigrams(dat) = dat

30 of 55

data*(bcd|pqr)

	a*
emptyable	true
exact	unknown
prefix	“”
suffix	“”
match	ANY

31 of 55

Example

data*(bcd|pqr)

	dat	a*	data*
emptyable	false	true	false
exact	dat	unknown	unknown
prefix	dat	“”	dat
suffix	dat	“”	{dat, “”}
match	dat	ANY	dat

32 of 55

data*(bcd|pqr)

	bcd	pqr	bcd\|pqr
emptyable	false	false	false
exact	bcd	pqr	{bcd, pqr}
prefix	bcd	pqr	{bcd, pqr}
suffix	bcd	pqr	{bcd, pqr}
match	bcd	pqr	bcd OR pqr

33 of 55

data*(bcd|pqr)

	data*	bcd\|pqr	data*(bcd\|pqr)
emptyable	false	false	false
exact	unknown	{bcd, pqr}	unknown
prefix	dat	{bcd, pqr}	dat
suffix	{dat, “”}	{bcd, pqr}	{datbcd, datpqr, bcd, pqr}
match	dat	bcd OR pqr	dat AND (bcd OR pqr)

34 of 55

Rewriting expression

match(expr) -> match(expr) AND trigrams(exact)

AND trigrams(prefix) AND trigrams(suffix)

data*(bcd|pqr)

	data*(bcd\|pqr)
emptyable	false
exact	unknown
prefix	dat
suffix	{datbcd, datpqr, bcd, pqr}
match	dat AND (bcd OR pqr)

dat AND (bcd OR pqr OR (dat AND atb AND tbc AND bcd) OR (dat AND atp AND tpq AND pqr))

35 of 55

Query -> DNF

data*(bcd|pqr)

36 of 55

Simplification using absorption law: a OR (a AND b) = a

37 of 55

Talk Overview

Overview of operators

Keyword operator
Fuzzy operator

Regex matcher operator
Performance numbers
Open problems

38 of 55

Data Set: MEDLINE articles

{ � "pmid":"19866847",� "affiliation":"Surgeon, U. S. A.",� "article_title":"ON THE APPEARANCE ......",� "authors":"W Reed",� "journal_issue":"2-5 Sep 1, 1897",� "journal_title":"The Journal of experimental medicine",� "keywords":"",� "mesh_headings":"",� "abstract":"1. The claim of L. Pfeiffer that small granular amoeboid bodies are present in the blood of vaccinated children and calves",� "zipf_score":0.019866847�}

39 of 55

Machine Setting

Machine setting: Macbook Air (mid-2013), Intel Core i7 (4650U), SSD hard drive, 8GB memory.

40 of 55

Indexing Overhead (standard analyzer)

Record #	Data Size	Time (s)	Index Size
10K	13.4MB	5.6	22MB
100K	140.9MB	31.82	236MB
1M	1.53GB	341.81	2.56GB

41 of 55

Indexing Overhead (trigram analyzer)

Record #	Data Size	Time (s)	Index Size
10K	13.4MB	10.62	83MB
100K	140.9MB	111.61	853MB
1M	1.53GB	1324.12	9.38GB

42 of 55

Keyword Matcher: queries

Randomly selected from documents

interplay

iontophoresis

inhibitor

result

choice

plasma adrenals

light interpretation

mechanical interphalangeal

joint's superficial arrangement

increase any rapidly

maximum until growth

43 of 55

Keyword Matcher: conjunctive queries

44 of 55

Keyword Matcher: Phrase-Index-Based

Record #	Avg Result #	Avg Time
10K	76	0.006
100K	824	0.064
1M	9,014	1.548

Machine setting: Macbook Air (mid-2013), Intel .Core i7 (4650U), SSD hard drive, 8GB memory

45 of 55

Keyword Matcher: Phrase-Index-Based

46 of 55

Fuzzy Token Matcher

“medicine medical smooth osteochondromatosis Baytussin pseudoephedrine”
“x-ray wound abnormal monophosphate Bapwon”
“blood pressure Congestion microphthalmia SanaDermRx PhenylAde”
“Soluble vs. insoluble fiber hydroxyurea bisabolol”
“surgeon cancer Micrainin Contragesic ertapenem idarubicin”

Words are randomly selected from the domain of medical keywords.

47 of 55

Fuzzy Token Matcher

1M records	Run time (s)	Avg result #
threshold = 0.35	0.863	8,737
threshold = 0.50	0.010	78
threshold = 0.65	0.009	75
threshold = 0.80	0.001	0.8

48 of 55

Stanford NLP Operator

“On 23 Jun 2016, the Colorado Department of Agriculture, State Veterinarian's Office, was notified by the USDA National Veterinary Services Laboratory (NVSL) that a non-racing horse presently located at Arapahoe Park in Aurora, [Colorado], tested positive for equine infectious anemia (EIA). Confirmatory tests are currently being run. Arapahoe Park is currently under a hold order that restricts movement of horses until an initial investigation is completed by the Colorado Department of Agriculture (CDA). The affected horse has been in Colorado less than 60 days and came from an out-of-state track. It appears that the horse was infected prior to coming to Colorado and previously tested negative for the disease in May of 2015.”

49 of 55

Stanford NLP Operator

Record #	All Named Entity time (s)	Part of Speech time (s)
5K	320.445	27.027
10K	576.864	46.908

Machine setting: Lenovo Yoga 2 Pro, 2.6 Ghz Intel core i7, 8GM RAM

50 of 55

Regex Matcher: 1M records

	TextDB
Query	Time (s)	Result #
mosquitos?	0.669	3,337
market(ing)?	1.827	5,353
v[ir]{2}[us]{2}	9.87	91,491
medic(ine\|al\|ation\|are\|aid)?	24.737	70,048
[A-Z][aeiou\|AEIOU][A-Za-z]*	71.748	5,629,558

51 of 55

Talk Overview

Overview of operators
Regex matcher
Performance numbers
Open problems

52 of 55

Open problems

More operators:

Join (on going)
Taking a document as input (e.g., finding similar documents)

More efficient regex queries

Using positional info
Deal with those “un-supported” expressions

Query language: TextSQL
Query optimization

Selectivity estimation
Cost estimation

Storing annotation for future queries
Go beyond information extraction

53 of 55

TextDB Summary

Similarity

Matcher

Regex

Matcher

Query Processor

Dictionary

Matcher

Keyword Matcher

Declarative Query Language (TextQL)

Document Store (Lucene)

Inverted Index, Engine

Application 1

System T

Library

Application 2

...

Query

Rewriter

Stanford

NLP

54 of 55

Acknowledgements

Chen Li

ZhenFeng Qi

Qing Tang

Hailey Pan

Zuozhi Wang

Rajesh Yarlagadda

Prakul Agarwal

Sandeep R. Madugula

Shuying

Varun Bharill

Sudeep Meduri

Parag Saraogi

Shiladitya Sen

Kishore Narendran

Akshay Jain

Jinggang Diao

Flavio Bayer

Feng Hong

Yang Jiao

Jianfeng Jia

Sripad

1 of 55

2 of 55

3 of 55

4 of 55

5 of 55

6 of 55

7 of 55

8 of 55

9 of 55

10 of 55

11 of 55

12 of 55

13 of 55

14 of 55

15 of 55

16 of 55

17 of 55

18 of 55

19 of 55

20 of 55

21 of 55

22 of 55

23 of 55

24 of 55

25 of 55

26 of 55

27 of 55

28 of 55

29 of 55

30 of 55

31 of 55

32 of 55

33 of 55

34 of 55

35 of 55

36 of 55

37 of 55

38 of 55

39 of 55

40 of 55

41 of 55

42 of 55

43 of 55

44 of 55

45 of 55

46 of 55

47 of 55

48 of 55

49 of 55

50 of 55

51 of 55

52 of 55

53 of 55

54 of 55

55 of 55