Talk at IBM Almaden, July 21, 2016
TextDB: Declarative and Scalable Text Analytics on Large Data Sets
Chen Li
UC Irvine
1
Text is everywhere
2
Text-Processing Tools
3
Relational DBMS
Search Engines
IBM SystemT
Our Vision: text-centric data-management system
It’s NOT:
4
Requirements
5
Declarative
Efficient
TextDB System Architecture
6
Diagram components:
Applications (Application 1, Application 2, ...)
Declarative Query Language (TextQL)
Query Processor and Query Rewriter
Keyword Matcher, Dictionary Matcher, Regex Matcher, Similarity Matcher
Stanford NLP, System T Library
Document Store (Lucene): Inverted Index, Engine
Design Decisions
7
Talk Overview
8
Operator interface
open()
getNextTuple()
close()
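A minimal sketch of how a pull-based operator with these three calls might look. The class names `ListSource` and `SubstringFilter`, the helper `drain`, and the dict-shaped tuples are assumptions made for illustration; they are not taken from TextDB. The method `get_next_tuple` mirrors getNextTuple() in Python naming.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class Operator(ABC):
    """Pull-based operator: open(), get_next_tuple() until None, then close()."""

    @abstractmethod
    def open(self) -> None: ...

    @abstractmethod
    def get_next_tuple(self) -> Optional[Dict[str, str]]: ...

    @abstractmethod
    def close(self) -> None: ...


class ListSource(Operator):
    """Hypothetical source operator that scans an in-memory list of tuples."""

    def __init__(self, rows: List[Dict[str, str]]):
        self._rows = list(rows)
        self._pos: Optional[int] = None

    def open(self) -> None:
        self._pos = 0

    def get_next_tuple(self) -> Optional[Dict[str, str]]:
        if self._pos is None:
            raise RuntimeError("operator is not open")
        if self._pos >= len(self._rows):
            return None  # input exhausted
        row = self._rows[self._pos]
        self._pos += 1
        return row

    def close(self) -> None:
        self._pos = None


class SubstringFilter(Operator):
    """Hypothetical operator that keeps tuples whose field contains a substring."""

    def __init__(self, child: Operator, field: str, needle: str):
        self._child, self._field, self._needle = child, field, needle

    def open(self) -> None:
        self._child.open()

    def get_next_tuple(self) -> Optional[Dict[str, str]]:
        # Pull from the child until a tuple matches or the child is exhausted.
        while (row := self._child.get_next_tuple()) is not None:
            if self._needle in row[self._field]:
                return row
        return None

    def close(self) -> None:
        self._child.close()


def drain(op: Operator) -> List[Dict[str, str]]:
    """Run an operator to completion and collect its output tuples."""
    op.open()
    try:
        out = []
        while (row := op.get_next_tuple()) is not None:
            out.append(row)
        return out
    finally:
        op.close()
```

Because every operator exposes the same three calls, operators can be stacked without knowing what sits below them.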
9
Operators:
10
Keyword Matcher
11
Keyword Matcher
12
PHRASE_INDEXBASED
SUBSTRING_SCANBASED
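The slide gives only the mode names, so the sketch below is a guess at what they suggest: a phrase match answered from a positional inverted index over tokens, and a substring match answered by scanning the raw text. The tokenizer, the index layout, and all function names are assumptions for illustration, not TextDB or Lucene code.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase alphanumeric tokens (an assumed, simplified analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(docs):
    """Map token -> doc id -> list of token positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, tok in enumerate(tokenize(text)):
            index[tok][doc_id].append(pos)
    return index


def phrase_index_based(index, phrase):
    """Doc ids in which the phrase's tokens occur consecutively."""
    toks = tokenize(phrase)
    if not toks or any(t not in index for t in toks):
        return []
    # Candidate documents contain every token of the phrase.
    candidates = set(index[toks[0]])
    for t in toks[1:]:
        candidates &= set(index[t])
    hits = []
    for d in sorted(candidates):
        starts = index[toks[0]][d]
        if any(all(s + i in index[t][d] for i, t in enumerate(toks)) for s in starts):
            hits.append(d)
    return hits


def substring_scan_based(docs, keyword):
    """Doc ids whose raw text contains the keyword as a substring."""
    kw = keyword.lower()
    return [i for i, text in enumerate(docs) if kw in text.lower()]
```

Under these assumptions the two modes differ on partial words: a scan finds "vir" inside "antiviral", while the token index has no entry for "vir".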
Keyword Matcher: using Lucene Query
13
Dictionary Matcher
14
Dictionary: Zika, Fever, X-ray, Bacteria, ...
Dictionary Matcher modes: Conjunction_Indexbased, Phrase_Indexbased, Substring_Scanbased
Keyword Matcher modes: Conjunction_Indexbased, Phrase_Indexbased, Substring_Scanbased
The Dictionary Matcher iteratively calls the Keyword Matcher's getNextTuple() to produce the matching results.
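The relationship between the two matchers can be sketched as a loop over dictionary entries that delegates each entry to a keyword matcher and pulls its results one at a time. This is a minimal sketch, assuming a substring-scan keyword matcher; the function names and the (entry, doc id) result shape are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def substring_keyword_matcher(docs: List[str], keyword: str) -> Iterator[int]:
    """Assumed keyword matcher: yields ids of documents containing the keyword."""
    kw = keyword.lower()
    for doc_id, text in enumerate(docs):
        if kw in text.lower():
            yield doc_id


def dictionary_matcher(
    docs: List[str],
    dictionary: Iterable[str],
    keyword_matcher: Callable[[List[str], str], Iterator[int]] = substring_keyword_matcher,
) -> Iterator[Tuple[str, int]]:
    """For each dictionary entry, pull matches from the keyword matcher one by one."""
    for entry in dictionary:
        for doc_id in keyword_matcher(docs, entry):
            yield entry, doc_id
```

Passing a different `keyword_matcher` would correspond to choosing a different matching mode for the whole dictionary.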
Fuzzy Token Matcher
15
Fuzzy Token Matcher: a use case :-)
16
Fuzzy Token Matcher: a use case :-)
17
Fuzzy Token Matcher
18
Set up
Get matching results
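The slides do not define the threshold, so the sketch below rests on an assumption: the threshold is the fraction of query tokens that must appear in a document. The tokenizer and the function name `fuzzy_token_match` are also hypothetical.

```python
import math
import re


def tokenize(text):
    """Set of lowercase alphanumeric tokens (assumed analyzer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def fuzzy_token_match(docs, query, threshold):
    """Doc ids sharing at least ceil(threshold * #query tokens) tokens with the query.

    Assumption: threshold is a ratio in (0, 1]; at least one token must match.
    """
    query_tokens = tokenize(query)
    needed = max(1, math.ceil(threshold * len(query_tokens)))
    return [
        doc_id
        for doc_id, text in enumerate(docs)
        if len(query_tokens & tokenize(text)) >= needed
    ]
```

Under this reading, raising the threshold can only shrink the result set, which is the direction the result counts move in the experiments later in the talk.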
Fuzzy Token Matcher: Lucene Query
19
Stanford NLP
Operator
20
Goal:
Operator: Stanford NLP, Named Entity Recognition (NER)
21
Query Rewriter (fixing missing spaces in query)
Deal with errors in queries
Example:
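One common way to restore missing spaces is dictionary-based segmentation. The sketch below is not the talk's example or algorithm: the vocabulary, the query strings, and the choice to prefer the segmentation with the fewest words are all assumptions for illustration.

```python
from functools import lru_cache


def segment(query, vocabulary):
    """Split a query with missing spaces into vocabulary words.

    Returns the segmentation with the fewest words, or None if the query
    cannot be fully covered by vocabulary words.
    """
    vocab = frozenset(w.lower() for w in vocabulary)
    q = query.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Best segmentation of q[start:], as a tuple of words, or None.
        if start == len(q):
            return ()
        result = None
        for end in range(start + 1, len(q) + 1):
            word = q[start:end]
            if word in vocab:
                rest = best(end)
                if rest is not None:
                    candidate = (word,) + rest
                    if result is None or len(candidate) < len(result):
                        result = candidate
        return result

    out = best(0)
    return None if out is None else list(out)
```

A rewriter built this way would pass the segmented words on as the corrected query and leave the query untouched when no segmentation exists.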
22
Talk Overview
23
Regex Matcher
Algorithm by Russ Cox:
24
Regex Matcher
25
Regex: zika\s*(virus|fever)
Regex Translator (algorithm by Russ Cox) produces the trigram query:
((zik AND ika AND vir AND iru AND rus)
OR
(zik AND ika AND fev AND eve AND ver))
Lucene Trigram Inverted Index -> Candidate Documents -> Regex Engine -> Matching Results
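The filter-then-verify flow can be sketched with an in-memory trigram index standing in for Lucene. The regex and the trigram query are the ones shown on the slide; the sample documents, the index layout, and the function names are assumptions for illustration.

```python
import re
from collections import defaultdict


def trigram_set(text):
    """All distinct 3-character substrings of the text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_trigram_index(docs):
    """Map trigram -> set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for gram in trigram_set(text):
            index[gram].add(doc_id)
    return index


def candidates(index, dnf, n_docs):
    """Evaluate an OR of AND-lists of trigrams against the index."""
    result = set()
    for conjunction in dnf:
        ids = set(range(n_docs))
        for gram in conjunction:
            ids &= index.get(gram, set())
        result |= ids
    return result


def regex_match(docs, pattern, dnf):
    """Filter with the trigram query, then verify each candidate with the regex."""
    index = build_trigram_index(docs)
    rx = re.compile(pattern)
    return sorted(d for d in candidates(index, dnf, len(docs)) if rx.search(docs[d]))


# Regex and trigram query from the slide; the documents are made up.
PATTERN = r"zika\s*(virus|fever)"
QUERY = [["zik", "ika", "vir", "iru", "rus"], ["zik", "ika", "fev", "eve", "ver"]]
```

The index only narrows the search: a document such as "zikvirus ika" contains every required trigram but does not match the regex, so the final regex pass is still needed.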
Five Basic Regex Expressions
26
Function “trigrams()”
trigrams(ab) = ANY (i.e., matching everything)
trigrams(abcd) = (abc AND bcd)
trigrams({abcd, wxyz}) = trigrams(abcd) OR trigrams(wxyz) = ((abc AND bcd) OR (wxy AND xyz))
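The three cases above can be sketched as a small function. The representation is an assumption made for illustration: a query is an OR of AND-sets of trigrams, and ANY is the query whose single AND-set is empty.

```python
# A query is a frozenset of conjunctions; each conjunction is a frozenset of trigrams.
# ANY has one empty conjunction, which every document satisfies.
ANY = frozenset({frozenset()})


def trigrams(s):
    """trigrams() for a single string or for a collection of strings."""
    if isinstance(s, str):
        if len(s) < 3:
            return ANY  # too short to contain a trigram
        return frozenset({frozenset(s[i:i + 3] for i in range(len(s) - 2))})
    result = frozenset()
    for member in s:
        q = trigrams(member)
        if q == ANY:
            # Logical consequence rather than a slide statement: X OR ANY = ANY.
            return ANY
        result |= q
    return result
```

For example, `trigrams("abcd")` yields one conjunction holding abc and bcd, and `trigrams({"abcd", "wxyz"})` yields two conjunctions joined by OR.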
27
Example: data*(bcd|pqr)
Parse Tree
28
data*(bcd|pqr)
29
| dat |
emptyable | false |
exact | dat |
prefix | dat |
suffix | dat |
match | trigrams(dat) = dat |
data*(bcd|pqr)
30
| a* |
emptyable | true |
exact | unknown |
prefix | “” |
suffix | “” |
match | ANY |
Example
data*(bcd|pqr)
31
| dat | a* | data* |
emptyable | false | true | false |
exact | dat | unknown | unknown |
prefix | dat | “” | dat |
suffix | dat | “” | {dat, “”} |
match | dat | ANY | dat |
data*(bcd|pqr)
32
| bcd | pqr | bcd|pqr |
emptyable | false | false | false |
exact | bcd | pqr | {bcd, pqr} |
prefix | bcd | pqr | {bcd, pqr} |
suffix | bcd | pqr | {bcd, pqr} |
match | bcd | pqr | bcd OR pqr |
data*(bcd|pqr)
33
| data* | bcd|pqr | data*(bcd|pqr) |
emptyable | false | false | false |
exact | unknown | {bcd, pqr} | unknown |
prefix | dat | {bcd, pqr} | dat |
suffix | {dat, “”} | {bcd, pqr} | {datbcd, datpqr, bcd, pqr} |
match | dat | bcd OR pqr | dat AND (bcd OR pqr) |
Rewriting expression
match(expr) -> match(expr) AND trigrams(exact)
AND trigrams(prefix) AND trigrams(suffix)
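The tables for data*(bcd|pqr) can be reproduced with a small sketch of the five properties and the rewriting rule. The combination rules for concatenation and alternation below are a simplified reading of Russ Cox's trigram-index write-up, checked only against the values on these slides; the names `Info`, `literal`, `star`, `alternate`, `concat`, and `final_query` are hypothetical, and `None` stands for "unknown".

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

ANY = frozenset({frozenset()})  # OR of AND-sets; one empty AND-set matches everything


def q_and(a, b):
    return frozenset(x | y for x in a for y in b)


def q_or(a, b):
    u = a | b
    return ANY if frozenset() in u else u


def trigrams(strings):
    q = frozenset()
    for s in strings:
        if len(s) < 3:
            return ANY
        q |= {frozenset(s[i:i + 3] for i in range(len(s) - 2))}
    return q


def cross(a, b):
    return frozenset(x + y for x in a for y in b)


@dataclass(frozen=True)
class Info:
    emptyable: bool
    exact: Optional[FrozenSet[str]]  # None means "unknown"
    prefix: FrozenSet[str]
    suffix: FrozenSet[str]
    match: FrozenSet[FrozenSet[str]]


def literal(s):
    """Non-empty literal string such as dat."""
    one = frozenset({s})
    return Info(False, one, one, one, trigrams(one))


def star(_e):
    empty = frozenset({""})
    return Info(True, None, empty, empty, ANY)


def alternate(a, b):
    known = a.exact is not None and b.exact is not None
    return Info(a.emptyable or b.emptyable, a.exact | b.exact if known else None,
                a.prefix | b.prefix, a.suffix | b.suffix, q_or(a.match, b.match))


def concat(a, b):
    known = a.exact is not None and b.exact is not None
    if a.exact is not None:
        prefix = cross(a.exact, b.prefix)
    else:
        prefix = a.prefix | b.prefix if a.emptyable else a.prefix
    if b.exact is not None:
        suffix = cross(a.suffix, b.exact)
    else:
        suffix = b.suffix | a.suffix if b.emptyable else b.suffix
    return Info(a.emptyable and b.emptyable, cross(a.exact, b.exact) if known else None,
                prefix, suffix, q_and(a.match, b.match))


def final_query(info):
    """match AND trigrams(exact) AND trigrams(prefix) AND trigrams(suffix)."""
    q = info.match
    for strings in (info.exact, info.prefix, info.suffix):
        if strings is not None:  # an unknown exact set adds no constraint
            q = q_and(q, trigrams(strings))
    return q


data_star = concat(literal("dat"), star(literal("a")))
whole = concat(data_star, alternate(literal("bcd"), literal("pqr")))
```

Here `whole` carries the same prefix, suffix, and match values as the last column of the table, and `final_query(whole)` applies the rewriting rule to it.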
data*(bcd|pqr)
34
| data*(bcd|pqr) |
emptyable | false |
exact | unknown |
prefix | dat |
suffix | {datbcd, datpqr, bcd, pqr} |
match | dat AND (bcd OR pqr) |
dat AND (bcd OR pqr OR (dat AND atb AND tbc AND bcd) OR (dat AND atp AND tpq AND pqr))
Query -> DNF
data*(bcd|pqr)
35
Simplification using absorption law: a OR (a AND b) = a
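The absorption step can be sketched on a query in disjunctive normal form, here represented as a collection of AND-sets of trigrams. The representation and the function name `absorb` are assumptions for illustration.

```python
def absorb(dnf):
    """Drop every conjunction that strictly contains another conjunction.

    This applies a OR (a AND b) = a: the larger conjunction is implied by
    the smaller one, so removing it does not change the query's result.
    """
    conjunctions = {frozenset(c) for c in dnf}
    return {c for c in conjunctions if not any(other < c for other in conjunctions)}


# DNF of: dat AND (bcd OR pqr OR (dat AND atb AND tbc AND bcd) OR (dat AND atp AND tpq AND pqr))
example = [
    {"dat", "bcd"},
    {"dat", "pqr"},
    {"dat", "atb", "tbc", "bcd"},
    {"dat", "atp", "tpq", "pqr"},
]
simplified = absorb(example)
```

For this example the two four-trigram conjunctions are absorbed, leaving (dat AND bcd) OR (dat AND pqr).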
36
Talk Overview
37
Data Set: MEDLINE articles
{
  "pmid": "19866847",
  "affiliation": "Surgeon, U. S. A.",
  "article_title": "ON THE APPEARANCE ......",
  "authors": "W Reed",
  "journal_issue": "2-5 Sep 1, 1897",
  "journal_title": "The Journal of experimental medicine",
  "keywords": "",
  "mesh_headings": "",
  "abstract": "1. The claim of L. Pfeiffer that small granular amoeboid bodies are present in the blood of vaccinated children and calves",
  "zipf_score": 0.019866847
}
38
Machine Setting
Machine setting: MacBook Air (mid-2013), Intel Core i7 (4650U), SSD, 8 GB memory.
39
Indexing Overhead (standard analyzer)
40
Record # | Data Size | Time (s) | Index Size |
10K | 13.4MB | 5.6 | 22MB |
100K | 140.9MB | 31.82 | 236MB |
1M | 1.53GB | 341.81 | 2.56GB |
Indexing Overhead (trigram analyzer)
41
Record # | Data Size | Time (s) | Index Size |
10K | 13.4MB | 10.62 | 83MB |
100K | 140.9MB | 111.61 | 853MB |
1M | 1.53GB | 1324.12 | 9.38GB |
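The two indexing tables can be compared row by row with a few lines of arithmetic. The numbers below are copied from the tables; the dictionary layout and the function name `overhead` are assumptions for illustration. Index sizes are in the unit each row uses (MB for 10K and 100K, GB for 1M), which cancels in the ratio.

```python
# (indexing time in seconds, index size) per record count, copied from the tables
STANDARD = {"10K": (5.6, 22.0), "100K": (31.82, 236.0), "1M": (341.81, 2.56)}
TRIGRAM = {"10K": (10.62, 83.0), "100K": (111.61, 853.0), "1M": (1324.12, 9.38)}


def overhead():
    """Trigram-to-standard ratios of indexing time and index size per data set."""
    ratios = {}
    for n, (std_time, std_size) in STANDARD.items():
        tri_time, tri_size = TRIGRAM[n]
        ratios[n] = (tri_time / std_time, tri_size / std_size)
    return ratios


for n, (time_ratio, size_ratio) in overhead().items():
    print(f"{n}: time x{time_ratio:.2f}, index size x{size_ratio:.2f}")
```

On these figures the trigram index is between 3.6 and 3.8 times larger than the standard index at every data size, and takes between roughly 1.9 and 3.9 times as long to build.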
Keyword Matcher: queries
Randomly selected from documents
interplay
iontophoresis
inhibitor
result
choice
plasma adrenals
light interpretation
mechanical interphalangeal
joint's superficial arrangement
increase any rapidly
maximum until growth
42
Keyword Matcher: conjunctive queries
43
Keyword Matcher: Phrase-Index-Based
44
Record # | Avg Result # | Avg Time |
10K | 76 | 0.006 |
100K | 824 | 0.064 |
1M | 9,014 | 1.548 |
Keyword Matcher: Phrase-Index-Based
45
Fuzzy Token Matcher
Words are randomly selected from the domain of medical keywords.
46
Fuzzy Token Matcher
47
1M records | Run time (s) | Avg result # |
threshold = 0.35 | 0.863 | 8,737 |
threshold = 0.50 | 0.010 | 78 |
threshold = 0.65 | 0.009 | 75 |
threshold = 0.80 | 0.001 | 0.8 |
Stanford NLP Operator
“On 23 Jun 2016, the Colorado Department of Agriculture, State Veterinarian's Office, was notified by the USDA National Veterinary Services Laboratory (NVSL) that a non-racing horse presently located at Arapahoe Park in Aurora, [Colorado], tested positive for equine infectious anemia (EIA). Confirmatory tests are currently being run. Arapahoe Park is currently under a hold order that restricts movement of horses until an initial investigation is completed by the Colorado Department of Agriculture (CDA). The affected horse has been in Colorado less than 60 days and came from an out-of-state track. It appears that the horse was infected prior to coming to Colorado and previously tested negative for the disease in May of 2015.”
48
Stanford NLP Operator
49
Record # | All Named Entity time (s) | Part of Speech time (s) |
5K | 320.445 | 27.027 |
10K | 576.864 | 46.908 |
Machine setting: Lenovo Yoga 2 Pro, 2.6 GHz Intel Core i7, 8 GB RAM
Regex Matcher: 1M records
50
| TextDB | |
Query | Time (s) | Result # |
mosquitos? | 0.669 | 3,337 |
market(ing)? | 1.827 | 5,353 |
v[ir]{2}[us]{2} | 9.87 | 91,491 |
medic(ine|al|ation|are|aid)? | 24.737 | 70,048 |
[A-Z][aeiou|AEIOU][A-Za-z]* | 71.748 | 5,629,558 |
Talk Overview
51
Open problems
52
TextDB Summary
53
Applications (Application 1, Application 2, ...)
Declarative Query Language (TextQL)
Query Processor and Query Rewriter
Keyword Matcher, Dictionary Matcher, Regex Matcher, Similarity Matcher
Stanford NLP, System T Library
Document Store (Lucene): Inverted Index, Engine
Acknowledgements
54
Chen Li
ZhenFeng Qi
Qing Tang
Hailey Pan
Zuozhi Wang
Rajesh Yarlagadda
Prakul Agarwal
Sandeep R. Madugula
Shuying
Varun Bharill
Sudeep Meduri
Parag Saraogi
Shiladitya Sen
Kishore Narendran
Akshay Jain
Jinggang Diao
Flavio Bayer
Feng Hong
Yang Jiao
Jianfeng Jia
Sripad
TextDB on GitHub
55