More permanent stuff at http://www.dusbabek.org/~garyd

29 September 2010

RESTful Cassandra

A lot of people, when first learning about Cassandra, wonder why there isn't an easier (say, RESTful) way to perform operations.  I did.  It didn't take someone very long to point out that it mainly has to do with performance.  Cassandra spends a significant amount of resources marshaling data, and Thrift currently does that very efficiently.

So I put away my RESTlessness.
I've heard more people lately clamoring for the feature, so I gave some thought to how I'd go about it.  One approach would be to wrap Thrift.  That would be nice from a coupling standpoint, but I think performance would be pretty crappy.  After all, it just adds another layer of marshaling; nobody needs that.
I eventually arrived at the decision that an HTTP Cassandra Daemon/Server pair similar to the existing Avro and Thrift versions would do the trick.  It would basically be a straight port, with a few minor caveats.  One big thing is that HTTP uses a new connection for each request, so storing Cassandra session information in thread-locals goes out the window.  This means that authentication needs to be abandoned, performed with every request, or backed by HTTP sessions.  Punt.
Today, while half-listening to some lectures at ICOODB, I decided to see how hard it would be to throw something together.  I ended up with two classes containing stripped-down implementations of get and set.  I pushed the whole thing to GitHub if anybody is interested.  I used the built-in Sun HTTP server because I didn't want any extra dependencies, and building services on top of it is pretty straightforward.  Results are returned in JSON format that should match what you would see if you used sstable2json to export data.
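To give a feel for the shape of it, here is a minimal sketch of a get handler on the JDK's built-in HTTP server.  This is not the code from the repo: the render() helper is a hypothetical stand-in for reading columns out of Cassandra and serializing them in an sstable2json-like shape, and it just returns a canned row.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class RestCassandraSketch {

    // Hypothetical serializer: in the real thing this would come from reading
    // columns via Cassandra's internals; here it fakes an sstable2json-like row.
    static String render(String rowKey, String column, String value) {
        return "{\"" + rowKey + "\": [[\"" + column + "\", \"" + value + "\"]]}";
    }

    // Register a /get context on the JDK's built-in HTTP server and start it.
    // start() returns immediately; the dispatcher runs on its own thread.
    public static HttpServer startServer(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/get", (HttpExchange ex) -> {
            // A real handler would parse the URI into keyspace/column family/
            // row/etc. and query Cassandra; this one returns a canned row.
            byte[] body = render("1", "10", "aaa").getBytes(StandardCharsets.UTF_8);
            ex.getResponseHeaders().set("Content-Type", "application/json");
            ex.sendResponseHeaders(200, body.length);
            try (OutputStream os = ex.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

The nice part about com.sun.net.httpserver is exactly what I said above: zero extra dependencies, and a context plus a handler is all the plumbing you need.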

This is clearly a proof of concept, but I think it demonstrates that the idea is sound and could be implemented fairly quickly.  Maintenance would be another story.  One problem with maintaining the Cassandra Avro bindings is that they regularly fall out of sync with what is possible using Thrift.  An HTTP Cassandra wrapper would suffer the same fate without an active champion.  I'm interested, but I'm not *that* interested.

Anyway, have fun.

-- DETAILS --
The following URI formats are expected:
/get/keyspace/column_family/row_id/super_column/column_start/column_end/consistency_level.  If you don't want to pass a component, leave it blank.  Empty strings are interpreted as null where appropriate.

Here is an example from my tests:  http://127.0.0.1:9160/get/Keyspace1/Standard1/1//10/11/ONE

The main thing to notice here is that the super column is empty (see the double slash?).  If you haven't realized it by now, I've gone ahead with the assumption that your keys and column names are strings.  This isn't good enough.  All the details we need to become type-aware are available in the comparator for the column family.  As a shortcut for now, you can append "?asString" to the end of the URI to have all byte[] values converted to strings.  Without it, they are displayed as hex.
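The empty-segment-means-null convention is easy to see in code.  Here is an illustrative parser for the /get URI scheme above (again, not the actual code from the repo): split the path into its seven components and map empty segments to null, so "1//10" means "row 1, no super column, start at 10".

```java
public class GetUriParser {

    // Expected order: keyspace, column_family, row_id, super_column,
    // column_start, column_end, consistency_level.
    public static String[] parse(String path) {
        String[] raw = path.split("/", -1);   // -1 keeps trailing empty segments
        String[] parts = new String[7];
        // raw[0] is "" (from the leading slash), raw[1] is the verb ("get").
        for (int i = 0; i < parts.length && i + 2 < raw.length; i++) {
            parts[i] = raw[i + 2].isEmpty() ? null : raw[i + 2];
        }
        return parts;
    }
}
```

Run it on the example URI from my tests and you get [Keyspace1, Standard1, 1, null, 10, 11, ONE], with the null slotting in where the double slash was.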

Updating works the same way: /set/keyspace/column_family/row_key/super_column/column/value/consistency_level

e.g.: http://127.0.0.1:9160/set/Keyspace1/Standard1/1//11/aaa/ONE


UPDATE: I went ahead and created two additional implementations that use Jetty (bare and with servlets).  This generated a bit more code, but opens the way to getting sophisticated with sessions.

1 comment:

rich said...

Nice idea.

> This means that authentication needs to be abandoned, done with every request, or we need to use sessions. Punt.

I wonder if delegating the HTTP to something like orbited / mongrel2 would help? It could take care of which connection is where.

After all, session creation isn't expensive, it's the session maintenance that is.