Showing posts with label NoSQL. Show all posts

Saturday, May 07, 2011

Rails 3 MongoDB recipe

Here's a quick and dirty recipe for getting started with MongoDB and Rails 3 using mongo_mapper. We'll get set up and test out a ReSTful web service.

I installed MongoDB with MacPorts.

sudo port selfupdate
sudo port install mongodb

Start a rails project. You may want to leave out ActiveRecord using the switch --skip-active-record as suggested by the Getting Started guide. That's optional. You might want to leave it in, if you intend to use a relational DB alongside Mongo.

  rails new <myproject>

Add to Gemfile

source 'http://gemcutter.org'
gem 'mongrel'
gem 'mongo_mapper'
gem 'bson'
gem 'bson_ext'
gem 'SystemTimer'
gem 'rails3-generators'
gem 'json'

Bundler happily installs all this stuff.

bundle install

Create config/initializers/mongodb.rb

MongoMapper.connection = Mongo::Connection.new('localhost', 27017)
MongoMapper.database = "#myapp-#{Rails.env}"

if defined?(PhusionPassenger)
   PhusionPassenger.on_event(:starting_worker_process) do |forked|
     MongoMapper.connection.connect if forked
   end
end

Start the MongoDB server, giving it a place to put data files:

mkdir <path>/data
mongod --dbpath <path>/data

Create an entity to be stored in MongoDB. I like to use a database of Nerds as my test application.

rails generate scaffold Nerd name:string description:string iq:integer --orm=mongo_mapper 

I originally gave description type "text", but that didn't work for me, producing, "NameError in NerdsController#show, uninitialized constant Nerd::Text". So, I edited the model file to look like this:

class Nerd
  include MongoMapper::Document         
  key :name, String
  key :description, String
  key :iq, Integer
end

Getting mongodb objects as XML doesn't work?

I saw an error: undefined method `each' with mongo_mapper (0.8.6), which was fixed by upgrading to (0.9.0)

Started GET "/nerds/4dc5c41a1ff2367744000004.xml" for 127.0.0.1 at Sat May 07 15:41:54 -0700 2011
  Processing by NerdsController#show as XML
  Parameters: {"id"=>"4dc5c41a1ff2367744000004"}
Completed 200 OK in 9ms (Views: 2.6ms | ActiveRecord: 0.0ms)
Sat May 07 15:41:54 -0700 2011: Read error: #<NoMethodError: undefined method `each' for #<Nerd:0x1040bfb80>>

JSON web services in Rails

You'll likely want to serve up and accept JSON in HTTP requests. For some crazy reason, Rails doesn't generate code for JSON in the controller, only HTML and XML. You have to add a line to the respond_to code block, like this:

class NerdsController < ApplicationController
  # GET /nerds
  # GET /nerds.xml
  def index
    @nerds = Nerd.all

    respond_to do |format|
      format.html # index.html.erb
      format.xml  { render :xml => @nerds }
      format.json { render :json => @nerds }
    end
  end

  # GET /nerds/1
  # GET /nerds/1.xml
  def show
    @nerd = Nerd.find(params[:id])

    respond_to do |format|
      format.html # show.html.erb
      format.xml  { render :xml => @nerd }
      format.json { render :json => @nerd }
    end
  end
  ...
end

Use curl to test that this works. First, get HTML, then ask for JSON using the accept header.

curl --request GET -H "accept: application/json" http://localhost:3000/nerds/4dc5c41a1ff2367744000004

OK, now we can serve JSON, how about receiving it in POST requests? Another couple quick additions and we're in business.

# POST /nerds
# POST /nerds.xml
def create
  @nerd = Nerd.new(params[:nerd])

  respond_to do |format|
    if @nerd.save
      format.html { redirect_to(@nerd, :notice => 'Nerd was successfully created.') }
      format.xml  { render :xml => @nerd, :status => :created, :location => @nerd }
      format.json  { render :json => @nerd, :status => :created, :location => @nerd }
    else
      format.html { render :action => "new" }
      format.xml  { render :xml => @nerd.errors, :status => :unprocessable_entity }
      format.json  { render :json => @nerd.errors, :status => :unprocessable_entity }
    end
  end
end

curl --request POST -H "Content-Type: application/json" -H "Accept: application/json" --data '{"nerd":{"name":"Donald Knuth", "iq":199, "description":"Writes hard books."}}' http://localhost:3000/nerds

If you have an entity named "nerd" it's pretty reasonable to expect the XML representation to look something like <nerd>...</nerd>. By analogy, Rails expects your JSON to look like {"nerd":{...}}, which I'm not sure I like. I coded around the issue, which might be unwise, by making the controller's create method look like this:

# POST /nerds
# POST /nerds.xml
def create

  input = params[:nerd] || request.body.read
  if request.content_type == 'application/json'
    @nerd = Nerd.new(JSON.parse(input))
  else
    @nerd = Nerd.new(input)
  end

  respond_to do |format|
    if @nerd.save
      format.html { redirect_to(@nerd, :notice => 'Nerd was successfully created.') }
      format.xml  { render :xml => @nerd, :status => :created, :location => @nerd }
      format.json  { render :json => @nerd, :status => :created, :location => @nerd }
    else
      format.html { render :action => "new" }
      format.xml  { render :xml => @nerd.errors, :status => :unprocessable_entity }
      format.json  { render :json => @nerd.errors, :status => :unprocessable_entity }
    end
  end
end

...which you can test with curl like so:

curl --request POST -H "content-type: application/json" --data '{"name":"Bozo", "iq":178, "description":"A very smart clown."}' http://localhost:3000/nerds

So, there you have it - Rails 3 and MongoDB playing ReSTfully together.

Warning: Current release versions of MongoDB have a 4MB maximum document size. This makes some sense as documents are often serialized and deserialized in memory. Apparently, this is being raised in later versions. Fully streaming APIs would certainly help, but that brings up the question of how much of the XML dog-pile of technologies will (or should) be replicated in JSON. Guess we'll see.

Sunday, April 10, 2011

Thinking about CRUD has damaged your karma

I'm spending some time trying out MongoDB. Mongo is a NoSQL database that stores documents in a binary variant of JSON called BSON.

Mongo is often compared to CouchDB. But, aside from the fact that they both store JSON documents, their approaches are quite different.

Mongo stores documents in collections, which are vaguely like tables in a SQL database. Mongo queries are partial JSON documents matched against existing documents in the database. Couch builds views with map-reduce and is, in general, a little more conceptually heavy, while Mongo is more straightforward, with direct analogs to most SQL features.
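
To make the "partial document" idea concrete, here's a minimal sketch in plain JavaScript of equality-only matching. It's a toy, not how Mongo is implemented: the matches helper and the sample nerds array are made up for illustration, and real queries also handle operators, nested fields and indexes. In the mongo console the rough equivalent would be a call like db.nerds.find({occupation: "clown"}).

```javascript
// Toy illustration of query-by-example: a query is a partial document,
// and a stored document matches if it agrees on every field in the query.
// (Hypothetical helper; equality on top-level fields only.)
function matches(query, doc) {
  return Object.keys(query).every(function (key) {
    return doc[key] === query[key];
  });
}

// Made-up sample collection.
const nerds = [
  { name: "Bozo", occupation: "clown", iq: 178 },
  { name: "Bim", occupation: "clown", iq: 120 },
  { name: "Bender", occupation: "bending", iq: 140 }
];

// Keep only the documents that match the partial document.
const clowns = nerds.filter(function (doc) {
  return matches({ occupation: "clown" }, doc);
});
console.log(clowns.map(function (doc) { return doc.name; }));
```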

One feature I like is the mongo console. It's a full JavaScript interpreter, which means you can easily script bulk updates and maintenance tasks.

The MongoDB site has a quickstart guide for popular OS's and a tutorial that will get you started using the console, as well as specific guides for loads of client languages.

You have not yet reached enlightenment...

From within Ruby, we can communicate with MongoDB with the mongo, bson and bson_ext gems. One fun way to learn about using MongoDB from Ruby is through the nicely kooky MongoDB_Koans, a series of unit tests with small omissions or bugs. You fix the bugs and make the tests pass, while the test harness gently urges "Please meditate on the following code...".

Saturday, October 02, 2010

CouchDB and R

Here are some quick crib notes on getting R talking to CouchDB using Couch's ReSTful HTTP API. We'll do it in two different ways. First, we'll construct HTTP calls with RCurl, then move on to the R4CouchDB package for a higher level interface. I'll assume you've already gotten started with CouchDB and are familiar with the basic ReST actions: GET PUT POST and DELETE.

First install RCurl and RJSONIO. You'll have to download the tar.gz's if you're on a Mac. For the second part, we'll need to install R4CouchDB, which depends on the previous two. I checked it out from GitHub and used R CMD INSTALL.

ReST with RCurl

Ping server

getURL("http://localhost:5984/")
[1] "{\"couchdb\":\"Welcome\",\"version\":\"1.0.1\"}\n"

That's nice, but we want to get the result back as a real R data structure. Try this:

welcome <- fromJSON(getURL("http://localhost:5984/"))
welcome$version
[1] "1.0.1"

Sweet!

PUT

One way to add a new record is with HTTP PUT.

bozo = list(name="Bozo", occupation="clown", shoe.size=100)
getURL("http://localhost:5984/testing123/bozo",
       customrequest="PUT",
       httpheader=c('Content-Type'='application/json'),
       postfields=toJSON(bozo))
[1] "{\"ok\":true,\"id\":\"bozo\",\"rev\":\"1-70f5f59bf227d2d715c214b82330c9e5\"}\n"

Notice that RCurl has no high level PUT method, so you have to fake it using the customrequest parameter. I'd never have figured that out without an example from R4CouchDB's source. The API of libcurl is odd, I have to say, and RCurl mostly just reflects it right into R.

If you don't like the idea of sending a put request with a get function, you could use RCurl's curlPerform. Trouble is, curlPerform returns an integer status code rather than the response body. You're supposed to provide an R function to collect the response body text. Not really worth the bother, unless you're getting into some of the advanced tricks described in the paper, R as a Web Client - the RCurl package.

bim <-  list(
  name="Bim", 
  occupation="clown",
  tricks=c("juggling", "pratfalls", "mocking Bolsheviks"))
reader = basicTextGatherer()
curlPerform(
  url = "http://localhost:5984/testing123/bim",
  httpheader = c('Content-Type'='application/json'),
  customrequest = "PUT",
  postfields = toJSON(bim),
  writefunction = reader$update
)
reader$value()

GET

Now that there's something in there, how do we get it back? That's super easy.

bozo2 <- fromJSON(getURL("http://localhost:5984/testing123/bozo"))
bozo2
$`_id`
[1] "bozo"

$`_rev`
[1] "1-646331b58ee010e8df39b5874b196c02"

$name
[1] "Bozo"

$occupation
[1] "clown"

$shoe.size
[1] 100

PUT again for updating

Updating is done by using PUT on an existing document. For example, let's give Bozo some mad skillz. Change bozo2 however you like, then PUT it back:

getURL(
  "http://localhost:5984/testing123/bozo",
  customrequest="PUT",
  httpheader=c('Content-Type'='application/json'),
  postfields=toJSON(bozo2))
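
PUTting bozo2 back works because it still carries the _rev that came with the GET. CouchDB only accepts an update when the revision you send matches the document's current one; otherwise it answers with a 409 conflict. Here's a toy in-memory model of that rule in JavaScript. The ToyStore object and its counter-style revisions are inventions for illustration; real revisions look like "1-70f5f5..." and the real check happens on the server.

```javascript
// Toy model of CouchDB's revision check (not real CouchDB code).
// Revisions here are plain strings like "1", "2"; real ones include a hash.
function ToyStore() {
  this.docs = {};
}

// Mimics PUT /db/id: succeeds only if doc._rev matches the stored revision.
ToyStore.prototype.put = function (id, doc) {
  const current = this.docs[id];
  if (current && doc._rev !== current._rev) {
    return { error: "conflict", status: 409 };
  }
  const generation = current ? parseInt(current._rev, 10) + 1 : 1;
  const stored = Object.assign({}, doc, { _id: id, _rev: String(generation) });
  this.docs[id] = stored;
  return { ok: true, id: id, rev: stored._rev, status: 201 };
};

// Mimics GET /db/id: returns a copy that includes _id and _rev.
ToyStore.prototype.get = function (id) {
  return Object.assign({}, this.docs[id]);
};

const store = new ToyStore();
const created = store.put("bozo", { name: "Bozo", occupation: "clown" });

// Update without a revision: rejected.
const rejected = store.put("bozo", { name: "Bozo", occupation: "assassin" });

// Fetch first, so the document carries _rev, then update: accepted.
const fetched = store.get("bozo");
fetched.occupation = "assassin";
const accepted = store.put("bozo", fetched);

console.log(created.status, rejected.status, accepted.status);
```

The same rule is why DELETE needs the revision in the query string.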

POST

If you POST to the database, you're adding a document and letting CouchDB assign its _id field.

bender = list(
  name='Bender',
  occupation='bending',
  species='robot')
response <- fromJSON(getURL(
  'http://localhost:5984/testing123/',
  customrequest='POST',
  httpheader=c('Content-Type'='application/json'),
  postfields=toJSON(bender)))
response
$ok
[1] TRUE

$id
[1] "2700b1428455d2d822f855e5fc0013fb"

$rev
[1] "1-d6ab7a690acd3204e0839e1aac01ec7a"

DELETE

For DELETE, you pass the doc's revision number in the query string. Sorry, Bender.

response <- fromJSON(getURL("http://localhost:5984/testing123/2700b1428455d2d822f855e5fc0013fb?rev=1-d6ab7a690acd3204e0839e1aac01ec7a",
  customrequest="DELETE"))

CRUD with R4CouchDB

R4CouchDB provides a layer on top of the techniques we've just described.

R4CouchDB uses a slightly strange idiom. You pass a cdb object, really just a list of parameters, into every R4CouchDB call and every call returns that object again, maybe modified. Results are returned in cdb$res. Maybe they did this because R uses pass by value. Here's how you would initialize the object.

cdb <- cdbIni()
cdb$serverName <- "localhost"
cdb$port <- 5984
cdb$DBName="testing123"

Create

fake.data <- list(
  state='WA',
  population=6664195,
  state.bird='Lady GaGa')
cdb$dataList <- fake.data
cdb$id <- 'fake.data'  ## optional, otherwise an ID is generated
cdb <- cdbAddDoc(cdb)

cdb$res
$ok
[1] TRUE

$id
[1] "fake.data"

$rev
[1] "1-14bc025a194e310e79ac20127507185f"

Read

cdb$id <- 'bozo'
cdb <- cdbGetDoc(cdb)

bozo <- cdb$res
bozo
$`_id`
[1] "bozo"
... etc.

Update

First we take the document id and rev from the existing document. Then, save our revised document back to the DB.

cdb$id <- bozo$`_id`
cdb$rev <- bozo$`_rev`
bozo = list(
  name="Bozo",
  occupation="assassin",
  shoe.size=100,
  skills=c(
    'pranks',
    'honking nose',
    'kung fu',
    'high explosives',
    'sniper',
    'lock picking',
    'safe cracking'))
cdb$dataList <- bozo
cdb <- cdbUpdateDoc(cdb)

Delete

Shortly thereafter, Bozo mysteriously disappeared.

cdb$id <- 'bozo'  ## bozo was replaced above, so it no longer has an _id field
cdb <- cdbDeleteDoc(cdb)

More on ReST and CouchDB

  • One issue you'll probably run into is that, unfortunately, JSON left out NaN and Infinity. And, of course, only R knows about NAs.
  • One-off ReST calls are easy using curl from the command line, as described in REST-esting with cURL.
  • I flailed about quite a bit trying to figure out the best way to do HTTP with R.
  • I originally thought R4CouchDB was part of a Google summer of code project to support NoSQL DBs in R. Dirk Eddelbuettel clarified that R4CouchDB was developed independently. In any case, the schema-less approach fits nicely with R's philosophy of exploratory data analysis.
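
The NaN and Infinity point is easy to see from JavaScript itself, where JSON came from: the built-in JSON.stringify quietly turns both into null, so a round trip loses information, and JSON.parse refuses a bare NaN outright. A quick sketch (the measurements object is made up):

```javascript
// Made-up values of the kind a statistics package might produce.
const measurements = { mean: NaN, upper: Infinity, lower: -Infinity, n: 3 };

// Non-finite numbers are serialized as null.
const encoded = JSON.stringify(measurements);
console.log(encoded);  // {"mean":null,"upper":null,"lower":null,"n":3}

// After decoding, there's no telling which null was NaN and which was Infinity.
const decoded = JSON.parse(encoded);

// Going the other way, NaN is not valid JSON and parsing it throws.
let parseFailed = false;
try {
  JSON.parse('{"mean": NaN}');
} catch (e) {
  parseFailed = e instanceof SyntaxError;
}
console.log(decoded.mean, parseFailed);
```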

Thursday, September 23, 2010

Getting started with CouchDB

I'm investigating using CouchDB for a data mining application. CouchDB is a schema-less document-oriented database that stores JSON documents and uses JavaScript as a query language. You write queries in the form of map-reduce. Applications connect to the database over a ReSTful HTTP API. So, Couch is a creature of the web in a lot of ways.

What I have in mind (eventually) is sharding a collection of documents between several instances of CouchDB each running on their own nodes. Then, I want to run distributed map-reduce queries over the whole collection of documents. But, I'm just a beginner, so we're going to start off with the basics. The CouchDB wiki has a ton of getting started material.

CouchDB's installation instructions cover several options for installing on Mac OS X, as well as other OS's. I used MacPorts.

sudo port selfupdate
sudo port install couchdb

Did I remember to update my port definitions the first time through? Of f-ing course not. Port tries to be helpful, but it's a little late sometimes with the warnings. Anyway, now that it's installed, let's start it up. I came across CouchDB on Mac OS 10.5 via MacPorts which tells you how to start CouchDB using Apple's launchctl.

sudo launchctl load /opt/local/Library/LaunchDaemons/org.apache.couchdb.plist
sudo launchctl start org.apache.couchdb

To verify that it's up and running, type:

curl http://localhost:5984/

...which should return something like:

{"couchdb":"Welcome","version":"1.0.1"}

Futon, the web-based management tool for CouchDB, can be browsed to at http://localhost:5984/_utils/.

Being a nerd, I tried to run Futon's test suite. After they failed, I found this: The tests run only(!) in a separate browser and that browser needs to be Firefox. Maybe that's been dealt with by now.

Let's create a test database and add some bogus records like these:

{
   "_id": "3f8e4c80b3e591f9f53243bfc8158abf",
   "_rev": "1-896ed7982ecffb9729a4c79eac9ef08a",
   "description": "This is a bogus description of a test document in a couchdb database.",
   "foo": true,
   "bogosity": 99.87526349
}

{
   "_id": "f02148a1a2655e0ed25e61e8cee71695",
   "_rev": "1-a34ffd2bf0ef6c5530f78ac5fbd586de",
   "foo": true,
   "bogosity": 94.162327,
   "flapdoodle": "Blither blather bonk. Blah blabber jabber jibber splat. Pickle plop dribble quibble."
}

{
   "_id": "9c24d1219b651bfeb044a0162857f8ab",
   "_rev": "1-5dd2f82c03f7af2ad24e726ea1c26ed4",
   "foo": false,
   "bogosity": 88.334,
   "description": "Another bogus document in CouchDB."
}

When I first looked at CouchDB, I thought Views were more or less equivalent to SQL queries. That's not really true in some ways, but I'll get to that later. For now, let's try a couple in Futon. First, we'll just use a map function, no reducer. Let's filter our docs by bogosity. We want really bogus documents.

Map Function

function(doc) {
  if (doc.bogosity > 95.0)
    emit(null, doc);
}

Now, let's throw in a reducer. This mapper emits the bogosity value for all docs. The reducer takes their sum.

Map Function

function(doc) {
  emit(null, doc.bogosity);
}

Reduce Function

function (key, values, rereduce) {
  return sum(values);
}
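
If you'd like to poke at this pair of functions outside Futon, here's a rough emulation in plain JavaScript. The runView harness and the stub emit and sum are stand-ins I've written for illustration (CouchDB supplies its own emit and sum inside a view), and the real engine may call reduce several times on subsets rather than once on everything.

```javascript
// Trimmed copies of the three test documents above.
const docs = [
  { _id: "3f8e4c80b3e591f9f53243bfc8158abf", bogosity: 99.87526349 },
  { _id: "f02148a1a2655e0ed25e61e8cee71695", bogosity: 94.162327 },
  { _id: "9c24d1219b651bfeb044a0162857f8ab", bogosity: 88.334 }
];

// Stand-ins for helpers CouchDB supplies inside a view.
let rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }
function sum(values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// The map and reduce functions from above, unchanged.
const map = function (doc) {
  emit(null, doc.bogosity);
};
const reduce = function (key, values, rereduce) {
  return sum(values);
};

// Hypothetical harness: map every doc, then reduce all values in one call.
// (Real CouchDB passes [key, docid] pairs as the first argument.)
function runView(mapFn, reduceFn, documents) {
  rows = [];
  documents.forEach(function (doc) { mapFn(doc); });
  const keys = rows.map(function (row) { return row.key; });
  const values = rows.map(function (row) { return row.value; });
  return reduceFn(keys, values, false);
}

const total = runView(map, reduce, docs);
console.log(total);  // roughly 282.37
```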

It's a fun little exercise to try and take the average. That's tricky because, for example, ave(ave(a,b), ave(c)) is not necessarily the same as ave(a,b,c). That's important because the reducer needs to be free to operate on subsets of the keys emitted from the mapper, then combine the values. The wiki doc Introduction to CouchDB views explains the requirements on the map and reduce functions. There's a great interactive emulator and tutorial on CouchDB and map-reduce that will get you a bit further writing views.
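
One way to attack the averaging exercise, sketched under the assumption that reduce may be called on arbitrary subsets and then again (with rereduce set to true) on its own outputs: have the reducer carry a sum and a count instead of an average, and divide only at the very end. This is my own sketch rather than something lifted from the CouchDB docs, and it runs as plain JavaScript outside of a view. The names reduceStats and ave are made up.

```javascript
// Reducer that returns {sum, count} so partial results can be combined.
function reduceStats(keys, values, rereduce) {
  if (rereduce) {
    // values are outputs of earlier reduceStats calls
    return values.reduce(function (acc, v) {
      return { sum: acc.sum + v.sum, count: acc.count + v.count };
    }, { sum: 0, count: 0 });
  }
  // values are raw numbers emitted by the map function
  return {
    sum: values.reduce(function (a, b) { return a + b; }, 0),
    count: values.length
  };
}

// Plain arithmetic mean, for comparison.
function ave(values) {
  return values.reduce(function (a, b) { return a + b; }, 0) / values.length;
}

// Split the values 1, 2, 6 into two subsets, as the view engine might.
const naive = ave([ave([1, 2]), ave([6])]);  // average of averages

const partialA = reduceStats(null, [1, 2], false);
const partialB = reduceStats(null, [6], false);
const combined = reduceStats(null, [partialA, partialB], true);
const correct = combined.sum / combined.count;

console.log(naive, correct);  // 3.75 3
```

The final division has to happen in whatever code reads the view result, since the reducer itself never sees "the end".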

One fun fact about CouchDB's views is that they're stored in CouchDB as design documents, which are just regular JSON like everything else. This is in contrast to SQL where a query is a completely different thing from the data. (OK, yes, I've heard of stored procs.)

That's the basics. At this point, a couple questions arise:

  • How do you do parameterized queries? For example, what if I wanted to let a user specify a cut-off for bogosity at run time?
  • How do I more fully get my head around these map-reduce "queries"?
  • Can CouchDB do distributed map-reduce like Hadoop?
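
On the first question, the usual approach, as far as I can tell, is to emit the value you want to filter on as the key, then pass startkey (and optionally endkey) in the view's query string, something like ?startkey=95. Here's a rough emulation in plain JavaScript. The queryView function and the stub emit are hypothetical stand-ins for what CouchDB does on the server, and the documents are trimmed copies of the test records above.

```javascript
// Trimmed copies of the three test documents.
const testDocs = [
  { _id: "3f8e4c80b3e591f9f53243bfc8158abf", bogosity: 99.87526349 },
  { _id: "f02148a1a2655e0ed25e61e8cee71695", bogosity: 94.162327 },
  { _id: "9c24d1219b651bfeb044a0162857f8ab", bogosity: 88.334 }
];

// Stub for the emit() that CouchDB provides inside a view.
let emitted = [];
function emit(key, value) { emitted.push({ key: key, value: value }); }

// Map function as it might appear in a view: key on bogosity.
function mapByBogosity(doc) {
  emit(doc.bogosity, doc._id);
}

// Hypothetical stand-in for querying the view with ?startkey=...:
// build the index, sort by key, keep rows whose key is >= startkey.
function queryView(mapFn, documents, startkey) {
  emitted = [];
  documents.forEach(function (doc) { mapFn(doc); });
  return emitted
    .sort(function (a, b) { return a.key - b.key; })
    .filter(function (row) { return row.key >= startkey; });
}

// The cut-off is now chosen at query time, not baked into the map function.
const reallyBogus = queryView(mapByBogosity, testDocs, 95.0);
const fairlyBogus = queryView(mapByBogosity, testDocs, 90.0);
console.log(reallyBogus.length, fairlyBogus.length);  // 1 2
```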

There's more to design documents than views. Both _show and _list functions let you transform documents. List functions use a cursor-like iterator that enables on-the-fly filtering and aggregating as well. Apparently, there are plans for _update and _filter functions too. I'll have to do some more reading and hacking and leave those for later.

Links