1 of 35

Automata Invasion

Michael McCandless

Robert Muir

Lucene Revolution

May 9, 2012

2 of 35

Agenda

  • Who we are
  • Overview of finite state automata/transducers
  • Three Lucene implementations
    • Automaton
    • FST
    • TokenStream
  • Applications in Lucene

3 of 35

Who we are

  • Lucene/Solr committer, PMC member
    • 6 years = old hat!
  • Tika committer, PMC member
  • Co-author Lucene in Action, 2nd ed.
  • http://blog.mikemccandless.com
  • Sponsored by IBM (thank you!)

4 of 35

Who we are

  • Lucene/Solr committer, PMC member
    • 3 years = young hat, yet: less hair!
  • Lucid Imagination employee
  • Background in internationalization
  • Tangled up in finite state
    • NLP tasks
    • One of my first contributions to Lucene
  • http://twitter.com/rcmuir

5 of 35

Agenda

  • Who we are
  • Overview of finite state automata/transducers
  • Three Lucene implementations
    • Automaton
    • FST
    • TokenStream
  • Applications in Lucene

6 of 35

Finite State Automata (FSA)

  • Set<int[]>:
    • mop, moth, pop, slop, sloth, stop, top
  • Start node, final node
  • Append arc labels as you traverse
  • Traverse all paths to get all strings
  • Determinize, minimize, union, intersect, ...

7 of 35

Finite State Transducer (FST)

  • Adds optional output to each arc
    • Accumulate output as you traverse
  • Map<int[],T>
    • mop: 0, moth: 1, pop: 2, slop: 3, sloth: 4, stop: 5, top: 6
  • T is pluggable (defined by output algebra)
  • Deep theory, many algorithms...

8 of 35

Agenda

  • Who we are
  • Overview of finite state automata/transducers
  • Three Lucene implementations
    • Automaton
    • FST
    • TokenStream
  • Applications in Lucene

9 of 35

Lucene's FSA Implementation

  • org.apache.lucene.util.automaton.*
    • Poached from Anders Møller's Brics http://www.brics.dk/automaton
  • Build up any FSA one node/arc at a time
    • Automaton, Transition, State
    • Transition has min/max label
  • Many standard algorithms
    • Minimize
    • Determinize
    • Intersect
    • Union
    • Regexp -> Automaton
  • RunAutomaton for stepping

10 of 35

Lucene's FST Implementation

  • org.apache.lucene.util.fst.*
  • FST encoded as a byte[]
  • Write-once API
    • From Mihov & Maurel paper
    • Build minimal, acyclic FST from pre-sorted inputs
    • Fast (linear time with input size), low memory
    • Optional two-pass packing can shrink by ~25%
  • Traverse FST one arc at a time
    • Decode byte[] during lookup
  • SortedMap<int[],T>: arcs are sorted by label
    • getByOutput also possible if outputs are sorted
  • http://s.apache.org/LuceneFSTs

11 of 35

Lucene's FST Implementation

Downside: Generics Policeman does not approve!

@UweSays "I looked at the code, it was ununderstandable why this thing was generified"

12 of 35

Lucene's TokenStreams are FSAs!

  • Streaming FSA, one arc at a time

  • New PositionLengthAttribute in 3.6.0
    • How many positions does the token "span"
    • TokenStreamToAutomaton
    • Indexer ignores it (sausage!)
    • http://s.apache.org/TokenGraphs

13 of 35

Lucene's TokenStreams are FSAs!

  • Fixing all analyzers to behave with graphs?

14 of 35

Agenda

  • Who we are
  • Overview of finite state automata/transducers
  • Three Lucene implementations
    • Automaton
    • FST
    • TokenStream
  • Applications in Lucene

15 of 35

AutomatonQuery

  • Automaton specifies which terms match
  • New Terms.intersect(Automaton) method
    • Visits all matching terms & docs
  • Example
    • RegexpQuery
    • WildcardQuery
    • FuzzyQuery
  • Other possible uses
    • Stemming at query time via expansion

16 of 35

How to implement algorithm from 67-page paper

Hands-On

17 of 35

18 of 35

FuzzyQuery

  • Lev1(dogs)
  • 100X faster!
  • Schulz and Mihov�algo is hairy
  • Help from Jean-Philippe�Barrette-LaPierre
  • Includes transpositions
  • UTF32toUTF8 conversion
  • http://s.apache.org/FuzzyQuery

http://s.apache.org/FuzzyQuery

19 of 35

DirectSpellChecker

  • Use LevT automata to find respellings
  • Decent performance
    • ~47 QPS on Wikipedia, single thread
  • No side-car index required!

"I'm confident that in three weeks, I'll be done."

20 of 35

JapaneseTokenizer (Kuromoji)

  • Donated by Atilika Inc. (アティリカ株式会社)
  • Hard problem! (no whitespace, mixed Kanji, Hiragana, Katakana, compounds)
  • Dictionaries are stored as FST
    • System dictionary and user dictionaries
    • 11.8X smaller than double array trie
  • Viterbi search finds least-cost segmentation
  • Token graph for compound words
    • ショッピングセンター (shopping center),
    • ショッピング(shopping), センター (center)

21 of 35

Query Suggesters

  • E.g: w e a suggests weather
  • Compile all suggestible queries + weights into FST(s)
  • Two FST based suggesters
    • FSTSuggester quantizes weights into buckets
    • WFSTSuggester doesn't
  • Find all paths after user's prefix
  • Fast: ~240K lookups/sec
  • Active work in progress
    • Fuzzy, analyzing, ngram

22 of 35

WFSTSuggester example

wacky|1

waffle|4

wealthy|3

weather|10

weaver|7

"I don't think this will work but I can't provide a counterexample right now"

23 of 35

SynonymFilter

  • Apply at index time or query time or both
  • First version used recursive maps
  • New version (in 3.4.0) uses FST
    • 5X faster filter time
    • 14X faster build time
    • 59X less RAM
  • Multi-token synonyms mess up graph
  • Cannot consume token graph
  • http://s.apache.org/TokenGraphs

24 of 35

BlockTree Terms Dictionary

25 of 35

BlockTree Terms Dictionary

  • Terms dict maps terms to metadata/postings
    • SortedMap<Term,Postings>
    • Pluggable (per codec)
  • Term blocks on disk; FST index in memory
    • Variable number of terms per block (vs 3.x)
  • Variant of a burst trie (Heinz, Zobel, Williams)
    • Terms assigned to blocks by shared prefix, e.g. �http://www.*
  • Fast intersect(Automaton)
  • Fast "term can't exist"

26 of 35

BlockTree Example

able

above

apple

perfect

preface

prefecture

prefix

previous

profit

programmer

project

zoo

27 of 35

BlockTree Example

a*

perfect

preface

prefecture

prefix

previous

profit

programmer

project

zoo

ble

bove

pple

28 of 35

BlockTree Example

a*

p*

zoo

ble

bove

pple

erfect

reface

refecture

refix

revious

rofit

rogrammer

roject

29 of 35

BlockTree Example

a*

p*

zoo

ble

bove

pple

erfect

re*

ro*

face

fecture

fix

vious

fit

grammer

ject

30 of 35

BlockTree Example

a*

p*

zoo

ble

bove

pple

erfect

re*

ro*

face

fecture

fix

vious

fit

grammer

ject

31 of 35

MemoryPostingsFormat

32 of 35

MemoryPostingsFormat

  • Postings format is pluggable per-field
  • Store all terms + postings in an FST
    • FST is saved to disk
  • Great match for primary key fields, date filters
    • 2.8X faster PK lookup
  • http://s.apache.org/MemPostingFormat

33 of 35

Future FSA/T Usages

  • MappingCharFilter (in progress...)
  • FieldCache/DocValues (prototype patch)
  • FSTQuery?
  • More suggesters
  • Top-N most frequent terms for approx distributed IDF
  • .... patches welcome!

34 of 35

Conclusions

  • FSA/Ts are a good match for Lucene
  • Lucene has three wildly different implementations
  • FSA/Ts are now used in a number of places...
  • ... and more coming

35 of 35

More Information

  • Finite State Automata in Lucene
    • Dawid Weiss: Lucene Revolution 2011
    • http://slidesha.re/vKtpVA
  • FST API code samples
    • http://s.apache.org/pR