Monday, December 05, 2011

International Open Data Hackathon

This past Saturday, I hung out at the Seattle branch of the International Open Data Hackathon. The event was hosted at the Pioneer Square office of Socrata, a small company that helps governments provide public open data.

A pair of data analysts from Tableau were showing off a visualization for the Washington Post's FactChecker blog called Comparing Job Creation Records. Tableau pays these folks to play with data and make cool visualizations that make their software look good. One does politics and the other does pop culture. A nice gig, if you can get it!

A pair of devs from Microsoft's Open Data Protocol (OData) team also showed up. OData looks to be a well-thought-out set of tools for REST data services. If I understand correctly, it seems to have grown up around pushing relational data over Atom feeds. It lets you define typed entities and associations between them, then do CRUD operations on them. You might call it RESTful enterprise application integration.

Socrata's OpenData portal has all kinds of neat stuff, from White House staff salaries to radiation contamination measurements to investors who were bilked by Bernie Madoff. 13,710 datasets in all. They're available for download as well as through a nice REST/JSON API. Socrata's platform runs explore.data.gov and data.seattle.gov, among others.

For example, if you've got reasonably fat pipes, and want to know about building permits in Seattle, fire up R and enter this:

> permits.url <- 'http://data.seattle.gov/api/views/mags-97de/rows.csv'
> p <- read.csv(permits.url)
> head(p)

Socrata follows the Rails-ish convention of letting you indicate the return format like a file extension. In this case, we're asking for .csv, 'cause R parses it so easily. You can also get JSON, XML, RDF and several other formats.
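Here's a minimal sketch of that convention. The helper name rows.url and its arguments are made up for illustration; only the URL pattern comes from the permits example above, and I'm assuming other formats follow the same pattern with a different extension.

```r
# Hypothetical helper: build a Socrata rows URL for a given view ID and format.
# The pattern is copied from the permits URL above; the function is not part
# of any Socrata client library.
rows.url <- function(view.id, format = 'csv', host = 'data.seattle.gov') {
  sprintf('http://%s/api/views/%s/rows.%s', host, view.id, format)
}

csv.url  <- rows.url('mags-97de')          # same URL as the permits example
json.url <- rows.url('mags-97de', 'json')  # assumed JSON variant of the same view
```

Passing csv.url to read.csv should be equivalent to the permits.url example above.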

Let's say you want to know what Seattleites are paying for kitchen remodels. Holy crap, it's appalling how boring and middle-aged I've gotten. Someone shoot me!

> cost <- as.numeric(gsub('\\$(.*)', '\\1', p$Value))
> a <- cost[ grepl('kitchen', p$Description) & p$Category=="SINGLE FAMILY / DUPLEX" & cost > 0 & cost < 200000 ]
> hist(a, xlab='cost $', main='Distribution of kitchen remodels in Seattle', col=sample(gray.colors(10),10))

You saw it here first, folks!

> a <- cost[ grepl('kitchen', p$Description) & p$Category=="SINGLE FAMILY / DUPLEX" & cost > 0]
> summary(a)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    500   18000   35000   46290   59770  420000

I dunno whose 420 thousand dollar kitchen that is, but if I find out, I'm coming over for dinner!
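If you want to poke at the parse-and-filter step without downloading the whole dataset, here's a rough sketch on a toy data frame. The rows below are invented for illustration and are not taken from data.seattle.gov. The parse.dollars helper is also my own; it strips commas as well as dollar signs, on the untested assumption that some values might carry thousands separators.

```r
# Toy stand-in for the permits data frame. These rows are made up.
toy <- data.frame(
  Description = c('kitchen remodel', 'new deck',
                  'kitchen and bath remodel', 'kitchen remodel'),
  Category    = c('SINGLE FAMILY / DUPLEX', 'SINGLE FAMILY / DUPLEX',
                  'COMMERCIAL', 'SINGLE FAMILY / DUPLEX'),
  Value       = c('$35,000.00', '$8000.00', '$120000.00', ''),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop dollar signs and commas, then convert to numeric.
parse.dollars <- function(x) as.numeric(gsub('[$,]', '', as.character(x)))

cost <- parse.dollars(toy$Value)  # the empty string becomes NA

# which() keeps only rows where the condition is TRUE, so a missing cost
# is dropped instead of showing up as an NA in the result.
keep <- which(grepl('kitchen', toy$Description) &
              toy$Category == 'SINGLE FAMILY / DUPLEX' &
              cost > 0)
kitchen.costs <- cost[keep]
```

Only the first toy row passes all three conditions: the deck isn't a kitchen, the third row is commercial, and the last one has no value.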

Socrata's API offers a JSON-based way of defining queries. Several datasets are updated in near real time. There's gotta be loads of cool stuff to be done with this data. Let's hope the government sees the value in cheap and innovative ideas like these and continues funding data.gov.