Getting Started with Python and Xapian

HackNY Office Hours -- 04/06/2013

Matthew Story

Director, Axial Corps of Engineers

HackNY -- Spring 2013

About Me

  • Programming since 1998, professionally since 2005, with Python since 2008.
  • B.A. Philosophy, University of Chicago
  • Head of Engineering at Axial since 2012.
  • Strong bias towards UNIX
    • FreeBSD contributions (xargs, find, libc)
    • File-System is my storage backend of choice
    • Write A LOT of CLI programs/filters
    • Write A LOT of network daemons (HTTP, TCP and UDP based).
  • I Collect Beer and Vinyl

Python and Xapian

  • What is Xapian?
  • The Database
    • WritableDatabase
    • Database
  • Intro to terms
    • Indexing
    • Stemming
    • XPREFIX terms
  • Intro to Querying
    • Querying terms
    • Query Parser

What is Xapian?

Xapian is a Keyword Indexer and Search library

  • Written in C++
    • Available via apt, rpm or ports
    • In base for Ubuntu/Debian Linux (used by apt)
  • Open Source
    • license: GPL v.2 (NB: not GPL 3, yay!)
    • xapian.org/download
  • Actively developed
    • current stable version: 1.2.14 (released: 3/14/2013)
  • Python bindings via SWIG
    • source: http://xapian.org/docs/bindings/
    • apt: sudo apt-get install python-xapian

What Xapian is Not

A Search Engine Appliance

  • Not a server (like SOLR/ElasticSearch)
  • Limited Replication Support
  • More flexibility / Programmable Interface
    • Xapian::MatchDecider
    • Xapian::MatchSpy
    • Weight (custom weighting schemes)

What Xapian is Not

Written in Java

  • Extremely small footprint (~30MB all-in)
  • Very few dependencies

The Database

Making a DB is easy ...

$ # make a home for the DB
$ sudo install -o matt -g matt -d /var/xdb
$ python
>>> import xapian as _x
>>> # open if exists, else create and open
>>> sonnet_db = _x.WritableDatabase(
...     '/var/xdb/sonnets.db',
...     _x.DB_CREATE_OR_OPEN)

The Database

A Xapian Database is just a directory ...

$ tree /var/xdb/sonnets.db
/var/xdb/sonnets.db/
├── flintlock
├── iamchert
├── postlist.baseA
├── postlist.DB
├── record.baseA
├── record.DB
├── termlist.baseA
└── termlist.DB

The Database

Things to know about the chert DB

  • Single Writer / Multiple Reader
    • flintlock file is locked with fcntl(2)
  • WritableDatabase is NOT threadsafe
    • kludge warning: exec(2) to hold lock
    • Xapian::Database is threadsafe
  • A reader (Database) must be re-opened after the writer modifies the DB.
    • otherwise reads can raise DatabaseModifiedError
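
One way to live with the re-open rule is to retry a read after calling reopen() on the reader. The following is a minimal sketch and not from the original slides: read_with_reopen and its retries parameter are hypothetical names, and the fallback DatabaseModifiedError class is an assumption that only lets the sketch load on a machine without the bindings.

```python
try:
    from xapian import DatabaseModifiedError
except ImportError:
    # Stand-in (assumption) so the sketch loads without the bindings installed.
    class DatabaseModifiedError(Exception):
        pass


def read_with_reopen(db, read, retries=3):
    """Call read(db); if the reader is stale, reopen() it and try again."""
    for _ in range(retries):
        try:
            return read(db)
        except DatabaseModifiedError:
            # The writer committed since this reader was opened:
            # move the reader to the latest revision and retry.
            db.reopen()
    return read(db)  # final attempt; any error now propagates to the caller
```

With the bindings installed, a call might look like read_with_reopen(_x.Database('/var/xdb/sonnets.db'), lambda db: db.get_doccount()).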

Indexing

Indexing: Parsing a block of text for individual keywords.

Example:

Text: shall i compare thee

Terms: [ 'shall', 'i', 'compare', 'thee' ]
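
The example above can be reproduced with a few lines of plain Python. This is only a simplified model of term extraction (the hypothetical extract_terms helper is not part of Xapian, and a real indexer does more, such as recording positions):

```python
import re


def extract_terms(text):
    """Lowercase the text and split it into word-like terms."""
    return re.findall(r"\w+", text.lower())


print(extract_terms('shall i compare thee'))
# prints ['shall', 'i', 'compare', 'thee']
```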

Stemming

Stemming: Reducing inflected or derived words to their root.

Example:

Query: write

Matches: writing, writes, written
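
Matching by stem means stemming both the query and the candidate words, then comparing the results. The sketch below illustrates the idea with a deliberately crude suffix stripper; toy_stem and matching_words are hypothetical names, and toy_stem is not the algorithm Xapian uses. Any callable that maps a word to its stem could be passed in its place.

```python
def toy_stem(word):
    """Strip a few common English suffixes (illustration only)."""
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def matching_words(query, words, stem=toy_stem):
    """Return the words whose stem equals the stem of the query."""
    target = stem(query.lower())
    return [w for w in words if stem(w.lower()) == target]


print(matching_words('write', ['writing', 'writes', 'written', 'reader']))
# prints ['writing', 'writes']
```

Note that this toy version does not conflate the irregular form "written"; whether a given stemmer does depends on its rules.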

Indexing and Stemming

Xapian provides an indexer with stemming support:

import xapian as _x

# set up an indexer with English stemming
indexer = _x.TermGenerator()
indexer.set_stemmer(_x.Stem("english"))

# attach a document, then index its text into terms
x_doc = _x.Document()
indexer.set_document(x_doc)
indexer.index_text('shall i compare thee')

Term Prefixes

All indexed terms are lowercase. This allows us to use uppercase prefixes to define different dimensions/facets:

# index the author, prefixed by 'A'
indexer.index_text('William Shakespeare',
                   1, 'A')
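
To make the effect of a prefix concrete, here is a rough plain-Python model of the terms such a call produces. The prefixed_terms helper is hypothetical and ignores details a real indexer handles, such as word positions and any additional stemmed forms.

```python
import re


def prefixed_terms(text, prefix=''):
    """Lowercase and split the text, then put the prefix on every term."""
    return [prefix + word for word in re.findall(r"\w+", text.lower())]


print(prefixed_terms('William Shakespeare', 'A'))
# prints ['Awilliam', 'Ashakespeare']
```

Because the term itself is lowercase, the uppercase 'A' cleanly marks where the prefix ends.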

Term Prefix Convention

Some prefixes have meaning by convention:

A -- Author
Q -- ID
S -- Title
...

http://xapian.org/docs/omega/termprefixes.html

X-Prefixes

'X' is reserved by convention for custom term-prefixes, so you don't collide with once and future prefixes:

# add the number of lines in the poem
# as a term
x_doc.add_term('XLINES%s' % 14)

Indexing Demo

To play the demo, clone the sonnetsdemo repo, and follow the index-sonnets.py instructions.

Querying

Xapian uses the Query object to both build individual queries, and combine them:

import xapian as _x

# Query all sonnets with 14 lines
x_query = _x.Query('XLINES%s' % 14)

Parsing Queries

To stem a Query string, and support Google-style advanced searching, xapian provides the QueryParser class:

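# x_db is an open _x.Database (read-only handle)
qp = _x.QueryParser()
stemmer = _x.Stem("english")
qp.set_stemmer(stemmer)
qp.set_database(x_db)
qp.set_stemming_strategy(_x.QueryParser.STEM_SOME)

# FLAG_DEFAULT enables boolean operators such as AND
x_query2 = qp.parse_query(
    'Shall AND Summer', _x.QueryParser.FLAG_DEFAULT)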

Compound Queries

Query objects can be combined with boolean operators such as OP_AND:

joined_query = _x.Query(
    _x.Query.OP_AND, x_query, x_query2)

Query Demo

To play the demo, clone the sonnetsdemo repo, and follow the query-sonnets.py instructions.

Thanks

matt.story@axial.net

github.com/matthewstory

Axial Corps of Engineers

www.axial.net/about/careers

github.com/axialmarket

axialcorps.wordpress.com
