More permanent stuff at http://www.dusbabek.org/~garyd

06 February 2009

Full-text Indexing on the Cheap

After Christmas I decided I wanted to build an mp3 blog aggregator that would be all my own. I've decided to chronicle the process here on my blog.

I have been a big fan of Lucene for years, although I now prefer Solr (it's still Lucene) for new projects. The main reason I've stuck with Lucene is that most projects I've worked on have been Java-related, and limited hardware resources haven't been an issue.

All that changed with the latest incarnation of Tagfriendly. I develop it on a Linux VM, which has all the hardware resources of my workstation (more than enough to handle Lucene), but I deploy Tagfriendly to a VPS, where my VM is one of many sharing the same physical hardware. To put it simply: I am hardware-constrained, currently stuck with 128MB RAM and 5GB of disk space. To give you an idea: apt-get update craps out if I'm running lighttpd+paster. Yeah, it's like that.

While I can grow to a bigger VPS if I need to, I figured it would be good to build Tagfriendly to be lean on resources. Bottom line: I couldn't consider using Lucene as the search engine. So I started to look at other options.

The first two I considered were tsearch2 and sphinx.

tsearch2 kind of surprised me. The last time I looked into it was several years ago, when it was a Postgres contrib module--not part of the standard release. Even though the idea of having a full-text search structured like a SQL query was alluring, I didn't consider tsearch2 for long. I knew I wanted to be able to throw just about anything into the index and rely on a "type" field to restrict results to particular data (e.g., mp3s or blog entries).
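To make the "type" field idea concrete, here is a minimal sketch using a toy in-memory inverted index. This is not tsearch2, Sphinx, or Xapian code; the TinyIndex class, its method names, and the sample documents are all made up for illustration.

```python
from collections import defaultdict


class TinyIndex:
    """Toy in-memory inverted index where every document carries a type."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids
        self.docs = {}                    # doc id -> (doc_type, text)

    def add(self, doc_id, doc_type, text):
        """Store the document and index each whitespace-separated term."""
        self.docs[doc_id] = (doc_type, text)
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query, doc_type=None):
        """Return sorted ids of docs containing every query term.

        If doc_type is given, only documents of that type are returned.
        """
        terms = query.lower().split()
        if not terms:
            return []
        # Intersect the posting sets: every query term must be present.
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        if doc_type is not None:
            hits = {d for d in hits if self.docs[d][0] == doc_type}
        return sorted(hits)


# Hypothetical sample data: mp3s and blog entries share one index.
index = TinyIndex()
index.add(1, "mp3", "Morrissey Suedehead live")
index.add(2, "blog", "Review of the new Morrissey album")
index.add(3, "mp3", "The Smiths This Charming Man")

everything = index.search("morrissey")
only_mp3s = index.search("morrissey", doc_type="mp3")
```

The point is that one index holds everything, and the type restriction is just an extra filter applied at query time rather than a separate table or index per kind of data.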

That left me with Sphinx. I came across it while I was googling for Lucene alternatives. It has Python bindings, is written in C, and claimed to be very lightweight. Seemed like a good fit. One downside I noticed is that once a Sphinx index is created and initially filled, it can't be added to. That means I would have to regenerate the index from scratch every time I need to update it. Ouch. (Maybe this isn't true in newer releases.) I kept Sphinx on the list, but kept looking.

A day or two later, before I had written any indexing code, a friend pointed me to Xapian, which has a lot of the characteristics of Sphinx, except that Xapian is more mature (and its indexes can be updated). I settled on using Xappy, which is a Python wrapper around the basic Python bindings provided by Xapian. Xappy simplifies the fielded aspect of managing a Xapian index (makes it feel more like Lucene). After playing with both the standard Python Xapian bindings and Xappy, I recommend using Xappy for simple indexing projects. The learning curve for the traditional Xapian API is pretty steep.

A few hours after finding Xappy (spread out over a few nights of work), I managed to produce a simple index page in pylons and an indexing daemon for Tagfriendly. The index only contains about 1500 documents at this point, so searching is still quite snappy. We'll see how it goes as the document count grows.

With an index in place, I am free to move on to some of the really interesting features and ideas I want to explore with Tagfriendly. For example, I am excited to implement a feature where users can create RSS feeds based on custom search criteria (e.g. includes Morrissey, but not The Smiths).
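Here is a rough sketch of how such a feed might be put together, assuming plain substring matching stands in for a real index query. Nothing here is Tagfriendly or Xappy code; the matches and build_feed functions, the entry dictionaries, and the example.com URLs are hypothetical.

```python
import xml.etree.ElementTree as ET


def matches(text, include, exclude):
    """True if text contains every include phrase and no exclude phrase."""
    lowered = text.lower()
    return (all(p.lower() in lowered for p in include)
            and not any(p.lower() in lowered for p in exclude))


def build_feed(title, link, entries, include=(), exclude=()):
    """Build an RSS 2.0 document from the entries that match the criteria."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Custom search feed"
    for entry in entries:
        searchable = entry["title"] + " " + entry["summary"]
        if matches(searchable, include, exclude):
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = entry["title"]
            ET.SubElement(item, "link").text = entry["link"]
            ET.SubElement(item, "description").text = entry["summary"]
    return ET.tostring(rss, encoding="unicode")


# Hypothetical entries, standing in for documents pulled from the index.
entries = [
    {"title": "Morrissey - Suedehead",
     "link": "http://example.com/1", "summary": "An mp3 post."},
    {"title": "Morrissey with The Smiths - This Charming Man",
     "link": "http://example.com/2", "summary": "Another mp3 post."},
    {"title": "The Cure - Lovesong",
     "link": "http://example.com/3", "summary": "Yet another mp3 post."},
]

feed = build_feed("Morrissey, minus The Smiths", "http://example.com/feed",
                  entries, include=["Morrissey"], exclude=["The Smiths"])
```

In a real version the include/exclude criteria would be turned into a query against the search index instead of being checked entry by entry, but the shape of the feed would stay the same.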
