Thursday, September 23, 2010

Geting started with CouchDB

I'm investigating using CouchDB for a data mining application. CouchDB is a schema-less document-oriented database that stores JSON documents and uses JavaScript as a query language. You write queries in the form of map-reduce. Applications connect to the database over a ReSTful HTTP API. So, Couch is a creature of the web in a lot of ways.

What I have in mind (eventually) is sharding a collection of documents between several instances of CouchDB each running on their own nodes. Then, I want to run distributed map-reduce queries over the whole collection of documents. But, I'm just a beginner, so we're going to start off with the basics. The CouchDB wiki has a ton of getting started material.

Couchdb's installation instructions cover several options for installing on Mac OS X, as well as other OS's. I used MacPorts.

sudo port selfupdate
sudo port install couchdb

Did I remember to update my port definitions the first time through? Of f-ing course not. Port tries to be helpful, but it's a little late sometimes with the warnings. Anyway, now that it's installed, let's start it up. I came across CouchDB on Mac OS 10.5 via MacPorts which tells you how to start CouchDB using Apple's launchctl.

sudo launchctl load /opt/local/Library/LaunchDaemons/org.apache.couchdb.plist
sudo launchctl start org.apache.couchdb

To verify that it's up and running, type:

curl http://localhost:5984/

...which should return something like:

{"couchdb":"Welcome","version":"1.0.1"}

Futon, the web based management tool for CouchDB can be browsed to at http://localhost:5984/_utils/.

Being a nerd, I tried to run Futon's test suite. After they failed, I found this: The tests run only(!) in a separate browser and that browser needs to be Firefox. Maybe that's been dealt with by now.

Let's create a test database and add some bogus records like these:

{
   "_id": "3f8e4c80b3e591f9f53243bfc8158abf",
   "_rev": "1-896ed7982ecffb9729a4c79eac9ef08a",
   "description": "This is a bogus description of a test document in a couchdb database.",
   "foo": true,
   "bogosity": 99.87526349
}

{
   "_id": "f02148a1a2655e0ed25e61e8cee71695",
   "_rev": "1-a34ffd2bf0ef6c5530f78ac5fbd586de",
   "foo": true,
   "bogosity": 94.162327,
   "flapdoodle": "Blither blather bonk. Blah blabber jabber jigaboo splat. Pickle plop dribble quibble."
}

{
   "_id": "9c24d1219b651bfeb044a0162857f8ab",
   "_rev": "1-5dd2f82c03f7af2ad24e726ea1c26ed4",
   "foo": false,
   "bogosity": 88.334,
   "description": "Another bogus document in CouchDB."
}

When I first looked at CouchDB, I thought Views were more or less equivalent to SQL queries. That's not really true in some ways, but I'll get to that later. For now, let's try a couple in Futon. First, we'll just use a map function, no reducer. Let's filter our docs by bogosity. We want really bogus documents.

Map Function

function(doc) {
  if (doc.bogosity > 95.0)
    emit(null, doc);
}

Now, let's throw in a reducer. This mapper emits the bogosity value for all docs. The reducer takes their sum.

Map Function

function(doc) {
  emit(null, doc.bogosity);
}

Reduce Function

function (key, values, rereduce) {
  return sum(values);
}

It's a fun little exercise to try and take the average. That's tricky because, for example, ave(ave(a,b), ave(c)) is not necessarily the same as ave(a,b,c). That's important because the reducer needs to be free to operate on subsets of the keys emitted from the mapper, then combine the values. The wiki doc Introduction to CouchDB views explains the requirements on the map and reduce functions. There's a great interactive emulator and tutorial on CouchDB and map-reduce that will get you a bit further writing views.

One fun fact about CouchDB's views is that they're stored in CouchDB as design documents, which are just regular JSON like everything else. This is in contrast to SQL where a query is a completely different thing from the data. (OK, yes, I've heard of stored procs.)

That's the basics. At this point, a couple questions arise:

  • How do you do parameterized queries? For example, what if I wanted to let a user specify a cut-off for bogosity at run time?
  • How do I more fully get my head around these map-reduce "queries"?
  • Can CouchDB do distributed map-reduce like Hadoop?

There's more to design documents than views. Both _show and _list functions let you transform documents. List functions use cursor-like iterator that enables on-the-fly filtering and aggregating as well. Apparently, there are plans for _update and _filter functions as well. I'll have to do some more reading and hacking and leave those for later.

Links

No comments:

Post a Comment