Saturday, December 08, 2012

The online education tidal wave

The Economist says American universities represent declining value for money.

There are plenty of issues and inefficiencies in the current university system: rising fees and student debt, high drop-out rates, bundling, and a mismatch between skills and jobs.

"Expenditures on instruction have risen more slowly than in any other category of spending, even as student numbers have risen."

A university is a special bundle of education, accreditation, branding, research, sports, campus lifestyle and amenities, and much else. But, as universities get more expensive and state funding declines, the opportunity grows for cheaper solutions to slide in underneath. In response, universities will likely start reducing time to graduation and offering more à la carte options.

The plight of the declining middle class factors in. While a degree leads to a substantial increase in earnings, that's not because degree-holders are getting ahead over time; it's that high-school graduates are falling behind. With grade inflation and increasing competition, a degree means less than it once did. Some, like Peter Thiel, think you should skip college and go to a start-up.

On the other hand, Arnold Kling, author of Many-to-One vs. One-to-Many: An Opinionated Guide to Educational Technology, is skeptical about the value of online education.

In Kling on Education and the Internet, he argues that two factors are missing from massive online courses, as currently formulated:

  1. Personalization: what would you do if teaching was one-to-one?
  2. Feedback: "teaching equals feedback"
"What would I [as a teacher] do if there were just one student in the room? You wouldn't lecture to them. You would probably give them lots of things to think about, problems to work. And as they worked the problems you would coach them directly..."

At the graduate level, the learning skills to select material and reflect on progress are expected. The do-it-yourself scholar can personalize their own experience. At the K-12 level, where innovation is most needed, most students are still building those learning skills. It's hard to imagine grade school working very well without the individual attention that good teachers can provide.

Education isn't limited to vocational training. The Economist forgets to mention things like personal enrichment and citizenship. The social/signaling aspects of universities won't be easily replicated online. To some extent, universities are a mechanism for buying your way into a certain social class.

While online courses might be in something of a bubble, there's little doubt that The Economist is right in seeing a "tornado of change in education". For higher education, right now is one of those wake-up moments, reminiscent of Bill Gates's famous 1995 memo alerting Microsoft to the "Internet Tidal Wave".

What do you think? Is hacking education the way to go? Inevitable? Will something important be lost?

Thursday, December 06, 2012

R in the Cloud

I've been having some great fun parallelizing R code on Amazon's cloud. Now that things are chugging away nicely, it's time to document my foibles so I can remember not to fall into the same pits of despair again.

The goal was to perform lots of trials of a randomized statistical simulation. The jobs were independent and fairly chunky, taking from a couple of minutes up to 90 minutes or so. From each simulation, we got back a couple dozen numbers. We worked our way up to running a few thousand simulations at a time on 32 EC2 nodes.

The two approaches I tried were Starcluster and the parallel package that comes with the R distribution. I'll save Starcluster for later. I ended up pushing through with the parallel package.

The helpful folks of the Bioconductor group put together a CloudFormation template and AMI: Bioconductor parallel cluster with SSH. It's a great way to get started hacking R in the cloud. All it takes is a mouse-click and an Amazon account.

Using the parallel package

Here's a quick example using the parallel package to start up a cluster and fool around a bit.

library(parallel)
help(package=parallel)

## compile list of workers with one entry per core
lines <- readLines("/usr/local/Rmpi/hostfile.plain")
hosts <- do.call(c, lapply(strsplit(lines, " "), function(host) { rep(host[1], as.integer(host[2])) }))

## create the cluster passing an IP address for
## the head node
## hostname -i works on Linux, but not on BSD
## descendants (like OS X)
cl <- makePSOCKcluster(hosts,
        master=system("hostname -i", intern=TRUE))

## for testing, start a cluster on your local machine
# cl <- makePSOCKcluster(rep("localhost", 3))

## do something once on each worker
ans <- clusterEvalQ(cl, { mean(rnorm(1000)) })

## test a time consuming job
## (~30 seconds on a 4 core machine)
system.time(ans <- parLapplyLB(cl, 1:100, function(i) {
  ## summarize a bunch of random sample means
  summary(
    sapply(1:runif(1, 100, 2000),
           function(j) { mean(rnorm(10000)) }))
}))

## push data to the workers
myBigData <- rnorm(10000)
moreData <- c("foo", "bar", "blabber")
clusterExport(cl, c('myBigData', 'moreData'))

## shut down worker processes
stopCluster(cl)

That was easy. But, it wouldn't be any fun if a few things didn't go wrong.

Hitting limits

Why 32 machines? As we scaled up to larger runs, I started hitting limits. The first was in the number of machines Amazon lets you start up.

According to the email Amazon sent me: "Spot Instances, which have an instance limit that is typically five times larger than your On-Demand Instance limit (On-Demand default is 20)..."

You can get the limits raised, but I'm a total cheapskate anyway, so I hacked Dan's CloudFormation template to use spot instances, adding a "SpotPrice" property in the right place. Spot instances can go away at any time, but they're so much cheaper that it's worth dealing with that.

Then, I promptly hit another limit:

Error in socketConnection ... all connections are in use

Apparently, there's a limit built into R on the number of socketConnections you can open. It says, in src/main/connections.c:

#define NCONNECTIONS 128 /* snow needs one per slave node */

Sadly, my lust for more cores is doomed to be thwarted, for now. We can have a maximum of 128 workers, or only 32 instances with 4 cores apiece. Bummer.
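
A tiny sanity check on the size of the worker list, done before calling makePSOCKcluster, makes that limit explicit instead of letting it surface as the socketConnection error above. This is just a sketch, reusing the hosts vector from the earlier snippet:

## R builds in a limit of 128 connections (NCONNECTIONS in
## src/main/connections.c), and a few of those are already taken
## by stdin, stdout and stderr, so leave a little headroom
stopifnot(length(hosts) < 128)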

makePSOCKcluster hangs

Usually, the worker processes come up fine. But sometimes, they couldn't connect to the head node. The symptom of this problem is that makePSOCKcluster hangs.

I did ps -aexww and stared at the following gobbledy-gook for a while:

/usr/local/lib/R/bin/exec/R --slave --no-restore -e parallel:::.slaveRSOCK() --args MASTER=ip-10-4-215-155 PORT=10187 OUT= TIMEOUT=2592000 METHODS=TRUE XDR=TRUE SED=/bin/sed R_INCLUDE_DIR=/usr/local/lib/R/include R_DEFAULT_PACKAGES=datasets,utils,grDevices,graphics,stats SHELL=/bin/bash SSH_CLIENT=10.4.215.155 46154 22 USER=ubuntu LD_LIBRARY_PATH=/usr/local/lib/R/lib:/usr/local/lib:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64/server:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games MAIL=/var/mail/ubuntu PWD=/home/ubuntu R_SHARE_DIR=/usr/local/lib/R/share LANG=en_US.UTF-8 R_ARCH= HOME=/home/ubuntu SHLVL=1 LOGNAME=ubuntu LC_CTYPE=en_US.UTF-8 SSH_CONNECTION=10.4.215.155 46154 10.158.53.86 22 R_HOME=/usr/local/lib/R R_DOC_DIR=/usr/local/lib/R/doc

I confirmed that I could connect to the dud machines manually, and also from there back to the head node, like so:

ssh -i ~/.ssh/id_rsa ubuntu@10.241.65.139

The bug was resistant to rebooting and even to terminating the dud node. Seemingly at random, somewhere between zero and three machines out of 32 would turn out to be duds. How irritating!

Luckily, Dan from the Bioconductor group found the problem, and you can even see it, if you know where to look, in the aforementioned gobbledy-gook. The parameter MASTER=ip-10-4-215-155 means the worker has to do name resolution, which apparently sometimes fails. (See the notes under master in the docs for makePSOCKcluster.)

We can give it an IP address, neatly bypassing any name resolution tar-pit:

cl <- makePSOCKcluster(hosts,
    master=system("hostname -i", intern=TRUE))

Huge props to Dan for figuring that out and giving me a serious case of geek envy.

Load unbalancing

The LB in parLapplyLB stands for load balancing. It uses a simple and sensible strategy: give each worker one job, then when a worker is finished, give it another job, until all the jobs are assigned.

I think I saw cases where there were idle workers at a time when there were jobs that had not yet started. The only way that could happen is if the jobs were already assigned to a busy worker.

Looking at the code, that doesn't look possible, but I have a theory. There's an option in makePSOCKcluster to specify an outfile; passing outfile="" sends the workers' stdout and stderr back to the head node. I thought that might be handy for debugging.
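
For reference, turning that on looks roughly like this (a sketch reusing the hosts vector from the earlier snippet; the default, outfile="/dev/null", discards worker output):

## keep worker stdout/stderr instead of discarding it
cl <- makePSOCKcluster(hosts,
        master=system("hostname -i", intern=TRUE),
        outfile="")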

Next, consider the call stack for parLapplyLB (down is deeper):
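
Roughly, from my reading of the parallel package source at the time (exact function names may differ between R versions):

parLapplyLB
  clusterApplyLB
    dynamicClusterApply
      recvOneResult
        recvOneData.SOCKcluster
          socketSelect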

One could start to imagine that a chatty and long-running worker sending output back to the head node via the outfile="" option would cause a socket to be readable before the job is done. So, another job gets submitted to that worker. Then workers become available and go idle for lack of work, which has already been hogged up (but not started) by the chatty worker.

If it's only a weird interaction between outfile="" and parLapplyLB, it's not that big of a deal. A more unfortunate property of parLapplyLB is what happens when a worker goes away; say, a connection is dropped or a spot instance is terminated. When that happens, parLapplyLB bombs out with a socket error, and all work on all workers is lost. Doh!

For this reason, I had the workers write out checkpoints and collected them onto the head node periodically. This way, getting a return value back from parLapplyLB wasn't all that critical.
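
The checkpointing itself can be as simple as having the worker function save its own results as it goes. Here's a minimal sketch of the idea; simulate.one and the checkpoints directory are hypothetical stand-ins, and the checkpoint files land on each worker's local disk, so they still have to be collected onto the head node (rsync or scp) from time to time:

run.one.simulation <- function(i) {
  result <- simulate.one(i)   ## stand-in for the real simulation
  ## write a checkpoint so the result survives a lost connection
  saveRDS(result, file=sprintf("checkpoints/sim-%04d.rds", i))
  result
}

ans <- parLapplyLB(cl, 1:1000, run.one.simulation)

## back on the head node, recover whatever finished, even if
## parLapplyLB itself bombed out
files <- list.files("checkpoints", pattern="\\.rds$", full.names=TRUE)
results <- lapply(files, readRDS)

And that brings me to the topic of automation.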

Slothful automation

Automation is great. Don't tell anyone, but I take something of a lazy approach to automation: starting with a hacky mess that just barely works with lots of manual intervention and gradually refining it as needed, in the general direction of greater automation and more robust error handling.

I've made a few half-hearted attempts, but all of it is closer to a hacky mess than clean automation. A lot of babysitting is still required.

Features I'd like to see

  • shared EBS volume (via NFS?) for packages, checkpoints and results
  • a queuing system that doesn't require persistent socket connections to the workers
  • async-lapply - returns a list of futures, which can be used to ask for status and results
  • progress monitoring on head node
  • support for scalable pools of spot instances that can go away at any time
  • grow and shrink pool according to size of queue

The right tool for 10,000 jobs

There are many ways to parallelize R. The approach in this document uses the parallel package and RStudio on Amazon EC2. The parallel package is nice for interactive development and has the advantage of keeping R worker processes alive rather than starting a new process for each job. But, this approach only works up to a point.

Different styles include interactive vs. batch, implicit vs. explicit and reformulating the problem as a map/reduce job. For map/reduce style computations, look at Rhipe. R at 12,000 Cores describes the “Programming with Big Data in R” project (pbdR). For batch jobs, Starcluster may be a better choice.

Starcluster provides several of those features, albeit with the caveat of restarting R for each job. Having pushed R/parallel to its limits, I intend to try Starcluster a little more. So far I've only learned the term-of-art for when your Starcluster job goes horribly wrong - that's called a starclusterfuck.

Wednesday, November 14, 2012

Functional Programming in Scala

Here I go, congratulating myself for finishing Functional Programming Principles in Scala, an online class taught by Martin Odersky, creator of the Scala programming language, professor at EPFL in Lausanne, Switzerland and co-founder of Typesafe.

Scala is a well-engineered modern successor to Java. It's a statically-typed functional/OO language with curly-brace syntax, pattern matching and type inference that runs on the JVM. Scala's libraries are well-thought-out and consistent. And, its type system is some serious rocket science.

The course took some material and examples from the classic Structure and Interpretation of Computer Programs (SICP), which is a great way to learn functional concepts in a new language. Topics included:

  • higher-order functions
  • pattern matching
  • immutability and immutable collections
  • laziness
  • inductive proofs of correctness (touched on briefly)

Boatload of Scala resources

Class staff and students compiled a great list of resources for learning Scala.

While we're on the topic, Twitter has invested big in Scala, open-sourcing some of the results. The presentation Systems Programming at Twitter gives a peek into how Twitter uses Scala for "programming the datacenter". Kestrel, their distributed message queue, is one example.

Books

Principles for Good Design

The class was a great introduction to the language and some ideas that will come in handy in any language. I had a lot of fun with it. Odersky wrapped up the course with some principles for good program design.

  • Name everything you can
  • Put operations into natural scopes
  • Keep degrees of freedom for future refinements

Saturday, October 27, 2012

Feature selection and linear modeling

...being a test of the Knitr document generation tool for R.

library(glmnet)

Let's give the glmnet package a little workout. We'll generate data for a bunch of features, some of which contribute to a response. Many of the features are unrelated to the response. Of course, we'll inject some noise into the data, too.

generate.data <- function(n=1000, m=5,
      sig.features=1:5, noise.level=0.10) {

  # create bogus feature matrix
  features <- matrix(runif(n*m), nrow=n, ncol=m)
  rownames(features) <- sprintf("ex%04d",seq(n))
  colnames(features) <- sprintf("feat%04d",seq(m))

  # generate random model parameters
  intercept <- rnorm(1)
  coefs <- rnorm(length(sig.features))
  names(coefs) <- colnames(features)[sig.features]

  # create response data
  response <- features[,sig.features] %*% coefs +
              intercept +
              rnorm(n, sd=noise.level)

  # return generated data
  list(n=n, m=m,
       sig.features=sig.features,
       noise.level=noise.level,
       features=features,
       params=list(intercept=intercept, coefs=coefs),
       response=response)
}

A function to check correspondence between true coefficients and fit coefficients:

## return a data.frame containing all non-zero fit
## coefficients along with the true values
compare.coefs <- function(data, fit) {
  coefs <- data$params$coefs
  intercept <- data$params$intercept
  merge(
    data.frame(
      feature.name=c('(Intercept)', names(coefs)),
      param=c(`(Intercept)`=intercept, coefs)),
    subset(
      data.frame(
        feature.name=rownames(coef(fit)),
        coef=coef(fit)[,1]),
      coef!=0),
    all=TRUE)
}

Make predicted vs actual plots

plot.predicted.vs.actual <-
  function(data, predicted, 
      actual, noise.level, label=NULL) {

  corr <- cor(predicted, actual)
  order.by.predicted <- order(predicted)

  ##  create a plot of predicted vs actual
  plot(actual[order.by.predicted],
       pch=21, col="#aaaaaaaa", bg="#cc000030",
       ylab="response", xlab="sample")

  title(main="predicted vs. actual",
        col.main="#666666")

  lines(predicted[order.by.predicted],
        col='blue', lwd=2)

  legend("topleft", pch=c(NA, 21), lwd=c(2,NA), 
         col=c("blue", "#aaaaaa"),
         pt.bg=c(NA,"#cc000030"),
         legend=c('predicted','actual'))

  if (!is.null(label)) mtext(label, padj=-0.5)

  legend("bottomright",
    legend=c(
      sprintf('corr=%0.3f', corr),
      if (abs(noise.level) >= 2.0)
        sprintf('noise=%0.1fx', noise.level)
      else
        sprintf('noise=%0.0f%%', noise.level*100)))
}

Generate data

n <- 1000
noise.level <- 0.50
data <- generate.data(n, m=20,
                     sig.features=1:5, noise.level)

Select training and testing sets

train <- sample.int(n, n*0.85)
test <- setdiff(seq(n), train)

Fit model using elastic net.

fit <- cv.glmnet(data$features[train,], 
                 data$response[train, drop=FALSE],
                 alpha=0.7)
compare.coefs(data, fit)
  feature.name   param     coef
1  (Intercept) -2.2119 -2.32176
2     feat0001 -1.4536 -1.23601
3     feat0002  1.2987  1.12962
4     feat0003 -0.6622 -0.56728
5     feat0004  0.8445  0.70143
6     feat0005 -2.4279 -2.25502
7     feat0013      NA -0.04111

Elastic net estimates the true parameters pretty well. It also gives a small weight to a spurious predictor. With 15 spurious features to choose from, it's not unlikely that one will look like a predictor by chance.

For some reason, cv.glmnet will cross-validate over a range of lambda values, but doesn't do the same for alpha. Tweaking alpha might help eliminate that extra coefficient.
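
Wrapping that search ourselves is straightforward, though. Here's a rough sketch, reusing the training data from above (the grid of alphas is arbitrary, and for a fair comparison the folds should really be fixed across fits via the foldid argument):

alphas <- seq(0.1, 1.0, by=0.1)
cv.fits <- lapply(alphas, function(a) {
  cv.glmnet(data$features[train,],
            data$response[train],
            alpha=a)
})

## pick the alpha whose best lambda gives the lowest
## cross-validated error
cv.errors <- sapply(cv.fits, function(f) min(f$cvm))
fit.tuned <- cv.fits[[which.min(cv.errors)]]
alphas[which.min(cv.errors)]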

Check our ability to model training data

p <- predict(fit, data$features[train,])
plot.predicted.vs.actual(
  data$features[train,], p,
  data$response[train,],
  noise.level, "training")

[plot: predicted vs. actual, training set]

Correlation with training data

cor(p, data$response[train,])
    [,1]
1 0.8831

Check our ability to predict test data

p <- predict(fit, data$features[test,])
plot.predicted.vs.actual(
  data$features[test,], p,
  data$response[test,],
  noise.level, "testing")

[plot: predicted vs. actual, test set]

Correlation with testing data

cor(p, data$response[test,])
    [,1]
1 0.8799

Created with the help of the excellent package Knitr. This document is also published at http://rpubs.com/cbare/2341.