Getting Started with Python and Xapian
HackNY Office Hours -- 04/06/2013
Matthew Story
Director, Axial Corps of Engineers
About Me
Getting Started with Python and Xapian
HackNY -- Spring 2013
Python and Xapian
What is Xapian?
Xapian is a Keyword Indexer and Search library
What Xapian is Not
A Search Engine Appliance
Written in Java
The Database
Making a DB is easy ...
$ # make a home for the DB
$ sudo install -o matt -g matt -d /var/xdb
$ python
>>> import xapian as _x
>>> # open if exists, else create and open
>>> sonnet_db = _x.WritableDatabase(
...     '/var/xdb/sonnets.db',
...     _x.DB_CREATE_OR_OPEN)
The Database
A Xapian Database is just a directory ...
$ tree /var/xdb/sonnets.db
/var/xdb/sonnets.db/
├── flintlock
├── iamchert
├── postlist.baseA
├── postlist.DB
├── record.baseA
├── record.DB
├── termlist.baseA
└── termlist.DB
The Database
Things to know about the chert DB
Indexing
Indexing:
Parsing a block of text for individual
keywords.
Example:
Text: shall i compare thee
Terms: [ 'shall', 'i', 'compare', 'thee' ]
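To make the example above concrete, here is a toy tokenizer in plain Python. It is only a sketch of the idea, not Xapian's term generator; the function name toy_terms is made up for illustration.

```python
import re

def toy_terms(text):
    """Split text into lowercase word terms.

    Toy illustration only: Xapian's TermGenerator does considerably
    more (stemming, positions, term frequencies).
    """
    return re.findall(r"[a-z0-9']+", text.lower())

print(toy_terms('shall i compare thee'))
# prints ['shall', 'i', 'compare', 'thee']
```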
Stemming
Stemming:
Reducing inflected or derived words to their
root.
Example:
Query: write
Matches: writing, writes, written
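A rough way to picture stemming is suffix stripping. The sketch below is a deliberately crude toy (toy_stem and its three rules are invented here); it is not the algorithm behind Xapian's English stemmer, and irregular forms such as "written" are out of reach for rules this simple.

```python
def toy_stem(word):
    """Strip a few common English suffixes (toy rules, illustration only)."""
    word = word.lower()
    # (suffix, replacement) pairs, tried in order; first match wins
    rules = (('ing', 'e'), ('es', 'e'), ('s', ''))
    for suffix, replacement in rules:
        # require at least 3 characters to remain before the suffix
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)] + replacement
    return word

# words whose stem equals the stem of the query match the query
query = 'write'
words = ['writing', 'writes', 'sonnet']
matches = [w for w in words if toy_stem(w) == toy_stem(query)]
print(matches)
# prints ['writing', 'writes']
```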
Indexing and Stemming
Xapian provides an indexer (TermGenerator) with stemming support:
import xapian as _x
# set up an indexer with English stemming
indexer = _x.TermGenerator()
indexer.set_stemmer(_x.Stem("english"))
# attach a document, then index text into it
x_doc = _x.Document()
indexer.set_document(x_doc)
indexer.index_text('shall i compare thee')
Term Prefixes
The indexer lowercases every term it generates. This leaves uppercase prefixes free to mark different dimensions/facets:
# index the author, prefixed by 'A'
# (the second argument, 1, is the within-document frequency increment)
indexer.index_text('William Shakespeare',
                   1, 'A')
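Why prefixes keep facets apart can be shown without Xapian at all. The helper below (prefixed_terms, a made-up name) just lowercases words and prepends a prefix, which is a simplification of what the indexer does: no stemming and no positions.

```python
def prefixed_terms(text, prefix=''):
    """Lowercase each word and prepend an (uppercase) prefix. Toy sketch."""
    return [prefix + word for word in text.lower().split()]

body_terms = prefixed_terms('shall i compare thee')
author_terms = prefixed_terms('William Shakespeare', 'A')
print(author_terms)
# prints ['Awilliam', 'Ashakespeare']

# Body terms are all lowercase and author terms start with 'A',
# so a body word can never be mistaken for an author term.
assert not set(body_terms) & set(author_terms)
```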
Term Prefix Convention
Some prefixes have meaning by convention:
A -- Author
Q -- ID
S -- Title
...
http://xapian.org/docs/omega/termprefixes.html
X-Prefixes
'X' is reserved by convention for custom term prefixes, so your terms don't collide with current or future standard prefixes:
# add the number of lines in the poem
# as a term
x_doc.add_term('XLINES%s' % 14)
Indexing Demo
Querying
Xapian uses the Query object both to build individual queries and to combine them:
import xapian as _x
# Query all sonnets with 14 lines
# (a single-term Query is built by passing the term to the constructor)
x_query = _x.Query('XLINES%s' % 14)
Parsing Queries
To stem a query string and support Google-style advanced search syntax, Xapian provides the QueryParser class:
qp = _x.QueryParser()
stemmer = _x.Stem("english")
qp.set_stemmer(stemmer)
# x_db is an already-open xapian Database
qp.set_database(x_db)
qp.set_stemming_strategy(_x.QueryParser.STEM_SOME)
# FLAG_DEFAULT includes FLAG_BOOLEAN, which is needed for 'AND'
x_query2 = qp.parse_query(
    'Shall AND Summer', _x.QueryParser.FLAG_DEFAULT)
Compound Queries
Query objects can be combined with boolean operators such as OP_AND:
joined_query = _x.Query(
    _x.Query.OP_AND, x_query, x_query2)
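Conceptually, an AND query keeps only the documents that contain every term. The sketch below models that with a toy inverted index and set intersection; it says nothing about how Xapian evaluates queries internally, and the document ids and term lists are invented sample data.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def query_and(index, *terms):
    """Return ids of documents that contain all of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# made-up documents: id -> terms (including an XLINES custom term)
docs = {
    1: ['shall', 'summer', 'XLINES14'],
    2: ['shall', 'XLINES15'],
    3: ['sun', 'XLINES14'],
}
index = build_index(docs)
print(query_and(index, 'XLINES14', 'shall', 'summer'))
# prints {1}
```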
Query Demo
Thanks
matt.story@axial.net
github.com/matthewstory
Axial Corps of Engineers
www.axial.net/about/careers
github.com/axialmarket
axialcorps.wordpress.com