Friday, June 24, 2011

Drawing heatmaps in R

A while back, while reading chapter 4 of Using R for Introductory Statistics, I fooled around with the mtcars dataset giving mechanical and performance properties of cars from the early 70's. Let's plot this data as a hierarchically clustered heatmap.

# scale data to mean=0, sd=1 and convert to matrix
mtscaled <- as.matrix(scale(mtcars))

# create heatmap and don't reorder columns
heatmap(mtscaled, Colv=F, scale='none')

By default, heatmap clusters by both rows and columns. It then reorders the resulting dendrograms according to mean. Setting Colv to false tells it not to reorder the columns, which will come in handy later. Let's also turn off the default scaling across rows. We've already scaled across columns, which is the sensible thing to do in this case.

If our columns are already in some special order, say as a time-series or by increasing dosage, we might want to cluster only rows. We could do that by setting the Colv argument to NA. One thing that clustering the columns tells us in this case is that some information is highly correlated, bordering on redundant. For example, displacement, horsepower and number of cylinders are quit similar. And the idea that to get more power (hp) and go faster (qsec) we need to burn more gas (mpg) is pretty well supported.

Separating clusters

If we'd like to separate out the clusters, I'm not sure of the best approach. One way is to use hclust and cutree, which allows you to specify k, the number of clusters you want. Don't forget that hclust requires a distance matrix as input.

# cluster rows
hc.rows <- hclust(dist(mtscaled))
plot(hc.rows)

# transpose the matrix and cluster columns
hc.cols <- hclust(dist(t(mtscaled)))

# draw heatmap for first cluster
heatmap(mtscaled[cutree(hc.rows,k=2)==1,], Colv=as.dendrogram(hc.cols), scale='none')

# draw heatmap for second cluster
heatmap(mtscaled[cutree(hc.rows,k=2)==2,], Colv=as.dendrogram(hc.cols), scale='none')

That works, but, I'd probably advise creating one heatmap and cutting it up in Illustrator, if need be. I have a nagging feeling that the color scale will end up being slightly different between the two clusters, since the range of values in each submatrix is different. Speaking of colors, if you don't like the default heat colors, try creating a new palette with color ramp.

palette <- colorRampPalette(c('#f0f3ff','#0033BB'))(256)
heatmap(mtscaled, Colv=F, scale='none', col=palette)

Confusing things

Another way to separate the clusters is to get the dendrograms out of heatmap and work with those. But Cutree applies to objects of class hclust, returned by hclust, and returns a map assigning each row in the original data to a cluster. Cutree takes either a height to cut at (h) or the desired number of clusters (k), which is nice.

Cut applies to dendrograms, which can be returned by heatmap if the keep.dendro option is set. Cut takes only h, not k, and returns a list with members upper and lower. Lower is a list of subtrees below the cut point.

Doing graphics with R starts easy, but gets arcane quickly. There's also a heatmap.2 function in the gplot package that adds color keys among other sparsely documented features.

This all needs some serious straightening out, but the basics are easy enough. Here are a couple more resources to make your heatmaps extra-hot:

...more on R.