I wanted to get R talking to CouchDB. CouchDB is a NoSQL database that stores JSON documents and exposes a RESTful API over HTTP. So, I needed to issue the basic HTTP requests (GET, POST, PUT, and DELETE) from within R. Specifically, to get started, I wanted to add documents to the database using PUT.
There's a CRAN package called httpRequest, which I thought would do the trick. This wound up being a dead end. There's a better way. Skip to the RCurl section unless you want to snicker at my hapless flailing.
Stuff that's totally beside the point
As Edison once said, "Failures? Not at all. We've learned several thousand things that won't work."
The httpRequest package is very incomplete, which is fair enough for a package at version 0.0.8. They implement only basic get and post and multipart post. Both post methods seem to expect name/value pairs in the body of the POST, whereas accessing web services typically requires XML or JSON in the request body. And, if I'm interpreting the HTTP spec right, these methods mishandle termination of response bodies.
Given this shaky foundation to start with, I implemented my own PUT function. While I eventually got it working for my specific purpose, I don't recommend going that route. HTTP, especially 1.1, is a complex protocol and implementing it is tricky. As I said, I believe the httpRequest methods, which send HTTP/1.1 in their request headers, get it wrong.
Specifically, they read the HTTP response with a loop like one of the following:
repeat{
  ss <- read.socket(fp,loop=FALSE)
  output <- paste(output,ss,sep="")
  if(regexpr("\r\n0\r\n\r\n",ss)>-1) break()
  if (ss == "") break()
}

repeat{
  ss <- rawToChar(readBin(scon, "raw", 2048))
  output <- paste(output,ss,sep="")
  if(regexpr("\r\n0\r\n\r\n",ss)>-1) break()
  if(ss == "") break()
  #if(proc.time()[3] > start+timeout) break()
}
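Notice that they're counting on a blank line, a zero followed by a blank line, or the server closing the connection to signal the end of the response body. The zero looks like the final, zero-length chunk of chunked transfer coding, which only shows up when the server sends Transfer-Encoding: chunked, and since the check runs against each read separately, nothing guarantees the marker won't be broken up across two reads. Looking through RFC2616 we find this description of an HTTP message: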
generic-message = start-line
                  *(message-header CRLF)
                  CRLF
                  [ message-body ]
While the headers section ends with a blank line, the message body is not required to end in anything in particular. The part of the spec that refers to message length lists 5 ways that a message may be terminated, 4 of which are not "server closes connection". None of them are "a blank line". HTTP 1.1 was specifically designed this way so web browsers could download a page and all its images using the same open connection.
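To make those five cases concrete, here is a minimal sketch that picks the framing rule for a response, following the order given in RFC 2616 section 4.4. The function name body_framing and its arguments are hypothetical, made up for illustration, and headers is assumed to be a named character vector with lower-case names.

```r
# Decide how the end of an HTTP/1.1 response body is signalled, checking the
# cases in the order RFC 2616 section 4.4 lists them.
# body_framing() is a hypothetical helper, not part of any package.
body_framing <- function(status, headers, request_method = "GET") {
  has <- function(name) name %in% names(headers)

  # 1. Responses that never carry a body: replies to HEAD, 1xx, 204, 304
  if (request_method == "HEAD" || status %in% c(204L, 304L) ||
      (status >= 100L && status < 200L)) {
    return("no body")
  }

  # 2. Any Transfer-Encoding other than "identity" means chunked framing
  if (has("transfer-encoding") &&
      tolower(headers[["transfer-encoding"]]) != "identity") {
    return("chunked")
  }

  # 3. Content-Length gives the body length in bytes
  if (has("content-length")) {
    return("content-length")
  }

  # 4. multipart/byteranges bodies delimit themselves
  if (has("content-type") &&
      grepl("^multipart/byteranges", headers[["content-type"]],
            ignore.case = TRUE)) {
    return("multipart/byteranges")
  }

  # 5. Otherwise the server closing the connection ends the body
  "connection close"
}

print(body_framing(200L, c("content-length" = "42")))
print(body_framing(200L, c("transfer-encoding" = "chunked",
                           "content-length" = "42")))
```

Only the last case is the one a read-until-close loop handles, which is why such a loop can hang on a server that keeps the connection open.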
For my PUT implementation, I fell back to HTTP 1.0, where I could at least count on the connection closing at the end of the response. Even then, socket operations in R are confusing, at least for a clueless newbie such as myself.
One set of socket operations consists of: make.socket, read.socket/write.socket and close.socket. Of these functions, the R Data Import/Export guide states, "For new projects it is suggested that socket connections are used instead."
OK, socket connections, then. Now we're looking at: socketConnection, readLines, and writeLines. Actually, tons of IO methods in R can accept connections: readBin/writeBin, readChar/writeChar, cat, scan and the read.table methods among others.
At one point, I was trying to use the Content-Length header to properly determine the length of the response body. I would read the header lines using readLines, parse those to find Content-Length, then I tried reading the response body with readChar. By the name, I got the impression that readChar was like readLines but one character at a time. According to some helpful tips I got on the r-help mailing list, this is not the case. Apparently, readChar is for binary mode connections, which seems odd to me. I didn't chase this down any further, so I still don't know how you would properly use Content-Length with the R socket functions.
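For what it's worth, here is a minimal sketch of one way Content-Length could be honored: treat the whole response as bytes and use readBin throughout, since Content-Length is a byte count. The function read_http_response is hypothetical, and the sketch is exercised against a rawConnection standing in for the network. With a real server it assumes a socketConnection opened in binary, blocking mode (for example open="r+b", blocking=TRUE), which I have not tested here.

```r
# Read one HTTP response from a binary-mode connection, using Content-Length
# to decide how many body bytes to read.
# read_http_response() is a hypothetical helper, not part of any package.
read_http_response <- function(con) {
  # Read the status line and headers a byte at a time, up to CRLF CRLF,
  # so that nothing belonging to the body is consumed by accident
  terminator <- charToRaw("\r\n\r\n")
  head_bytes <- raw(0)
  repeat {
    b <- readBin(con, what = "raw", n = 1L)
    if (length(b) == 0L) stop("connection closed before the headers ended")
    head_bytes <- c(head_bytes, b)
    n <- length(head_bytes)
    if (n >= 4L && identical(head_bytes[(n - 3L):n], terminator)) break
  }

  head_lines <- strsplit(rawToChar(head_bytes), "\r\n", fixed = TRUE)[[1]]
  head_lines <- head_lines[nzchar(head_lines)]

  # Split each header at its first colon; header names are case-insensitive
  header_lines <- head_lines[-1]
  colon <- regexpr(":", header_lines, fixed = TRUE)
  headers <- trimws(substring(header_lines, colon + 1L))
  names(headers) <- tolower(trimws(substring(header_lines, 1L, colon - 1L)))

  # Read exactly Content-Length bytes; a read may return fewer than asked for
  body <- raw(0)
  if ("content-length" %in% names(headers)) {
    remaining <- as.integer(headers[["content-length"]])
    while (remaining > 0L) {
      chunk <- readBin(con, what = "raw", n = remaining)
      if (length(chunk) == 0L) break  # connection closed early
      body <- c(body, chunk)
      remaining <- remaining - length(chunk)
    }
  }

  list(status = head_lines[1], headers = headers, body = rawToChar(body))
}

# A canned response with extra bytes after the body, as on a kept-alive
# connection where the next response follows immediately
fake <- charToRaw(paste0(
  "HTTP/1.1 201 Created\r\n",
  "Content-Type: application/json\r\n",
  "Content-Length: 11\r\n",
  "\r\n",
  "{\"ok\":true}",
  "bytes belonging to the next response"))
con <- rawConnection(fake)
resp <- read_http_response(con)
close(con)
print(resp$body)
```

The body stops after the eleven bytes the header promised, leaving the rest of the stream for the next read. Chunked responses would need separate handling that this sketch does not attempt.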
Falling back to HTTP 1.0, we can just call readLines 'til the server closes the connection. In an amazing, but not recommended, feat of beating a dead horse until you actually get somewhere, I finally came up with the following code, with a couple variations commented out:
http.put <- function(host, path, data.to.send, content.type="application/json", port=80, verbose=FALSE) {

  if(missing(path)) path <- "/"
  if(missing(host)) stop("No host URL provided")
  if(missing(data.to.send)) stop("No data to send provided")

  # Content-Length is a byte count, not a character count
  content.length <- nchar(data.to.send, type="bytes")

  header <- NULL
  header <- c(header,paste("PUT ", path, " HTTP/1.0\r\n", sep=""))
  header <- c(header,"Accept: */*\r\n")
  header <- c(header,paste("Content-Length: ", content.length, "\r\n", sep=""))
  header <- c(header,paste("Content-Type: ", content.type, "\r\n", sep=""))
  request <- paste(c(header, "\r\n", data.to.send), sep="", collapse="")

  if (verbose) {
    cat("Sending HTTP PUT request to ", host, ":", port, "\n", sep="")
    cat(request, "\n")
  }

  con <- socketConnection(host=host, port=port, open="w+", blocking=TRUE, encoding="UTF-8")
  on.exit(close(con))

  # sep="" stops writeLines from appending a newline after the body,
  # which would make the request longer than Content-Length says
  writeLines(request, con, sep="")

  response <- list()

  # read whole HTTP response and parse afterwards
  # lines <- readLines(con)
  # write(lines, stderr())
  # flush(stderr())
  #
  # # parse response and construct a response 'object'
  # response$status = lines[1]
  # first.blank.line = which(lines=="")[1]
  # if (!is.na(first.blank.line)) {
  #   header.kvs = strsplit(lines[2:(first.blank.line-1)], ":\\s*")
  #   response$headers <- sapply(header.kvs, function(x) x[2])
  #   names(response$headers) <- sapply(header.kvs, function(x) x[1])
  # }
  # response$body = paste(lines[(first.blank.line+1):length(lines)])

  # the status line comes first
  response$status <- readLines(con, n=1)
  if (verbose) {
    write(response$status, stderr())
    flush(stderr())
  }

  # then the headers, one per line, up to the first blank line
  response$headers <- character(0)
  repeat{
    ss <- readLines(con, n=1)
    if (verbose) {
      write(ss, stderr())
      flush(stderr())
    }
    # stop at the blank line, or if the server closed the connection early
    if (length(ss) == 0 || ss == "") break
    # split at the first colon only, since header values may contain colons
    key.value <- regmatches(ss, regexpr(":\\s*", ss), invert=TRUE)
    response$headers[key.value[[1]][1]] <- key.value[[1]][2]
  }

  # with HTTP/1.0, everything up to the close of the connection is the body
  response$body = readLines(con)
  if (verbose) {
    write(response$body, stderr())
    flush(stderr())
  }

  # doesn't work. something to do with encoding?
  # readChar is for binary connections??
  # if (any(names(response$headers)=='Content-Length')) {
  #   content.length <- as.integer(response$headers['Content-Length'])
  #   response$body <- readChar(con, nchars=content.length)
  # }

  return(response)
}
After all that suffering, which was undoubtedly good for my character, I found an easier way.
RCurl
Duncan Temple Lang's RCurl is an R wrapper for libcurl, which provides robust support for HTTP 1.1. The paper "R as a Web Client - the RCurl package" lays out a strong case that wrapping an existing C library is a better way to get good HTTP support into R. RCurl works well and seems capable of everything needed to communicate with web services of all kinds. The API, mostly inherited from libcurl, is dense and a little confusing. Even given the docs and paper for RCurl and the docs for libcurl, I don't think I would have figured out PUT.
Luckily, at that point I found R4CouchDB, an R package built on RCurl and RJSONIO. R4CouchDB is part of a Google Summer of Code effort, NoSQL interface for R, through which high-level APIs were developed for several NoSQL DBs. Finally, I had stumbled across the answer to my problem.
I'm mainly documenting my misadventures here. In the next installment, CouchDB and R, we'll see what actually worked. In the meantime, is there a conclusion from all this fumbling?
My point, if I have one
HTTP is so universal that a high-quality implementation should be a given for any language. HTTP-based APIs are being used by databases, message queues, and cloud computing services. And let's not forget plain old-fashioned web services. Mining and analyzing these data sources is something lots of people are going to want to do in R.
Others have stumbled over similar issues. There are threads on r-help about hanging socket reads, R with CouchDB, and getting R to talk over Stomp.
RCurl gets us pretty close. It could use high-level methods for PUT and DELETE and a high-level POST amenable to web-service use cases. More importantly, this stuff needs to be easier to find without sending the clueless noob running down blind alleys. RCurl is greatly superior to httpRequest, but that's not obvious without trying it or looking at the source. At minimum, it would be great to add a section on HTTP and web services with RCurl to the R Data Import/Export guide. And finally, take it from the fool: trying to roll your own HTTP (1.1 especially) is a fool's errand.