A common data-munging operation is to compute cross tabulations of measurements by categories. SQL Server and Excel have a nice feature called pivot tables for this purpose. Here we'll figure out how to do pivot operations in R.
Let's imagine an experiment where we're measuring the gene activity of an organism under different conditions -- exposure to different nutrients and toxins. Our conditions are silly: copper, beer, pizza, and cheetos. First we make a list of genes. Then expand.grid generates all combinations of genes and conditions. Finally, we tack on a column of randomly generated measurements.
> genes = paste('MMP', sprintf("%04d",1:10), sep="") > data = expand.grid(gene=genes, condition=c('copper', 'cheetos', 'beer', 'pizza')) > data$value = rnorm(40) > data gene condition value 1 MMP0001 copper 0.90412805 2 MMP0002 copper 0.92664376 3 MMP0003 copper 0.27772147 4 MMP0004 copper 0.08958930 5 MMP0005 copper -0.20132304 6 MMP0006 copper 0.34524729 7 MMP0007 copper -0.33910206 8 MMP0008 copper 1.21006486 9 MMP0009 copper 0.78008022 10 MMP0010 copper 1.05364315 11 MMP0001 cheetos -2.31796229 12 MMP0002 cheetos 0.76706591 13 MMP0003 cheetos -2.93692935 14 MMP0004 cheetos 0.25452306 15 MMP0005 cheetos 0.24168329 16 MMP0006 cheetos 0.28739734 17 MMP0007 cheetos 0.69233543 18 MMP0008 cheetos 0.48865250 19 MMP0009 cheetos -0.11129319 20 MMP0010 cheetos 0.53322842 21 MMP0001 beer -0.74965948 22 MMP0002 beer 0.27105205 23 MMP0003 beer -0.99261363 24 MMP0004 beer 0.65143639 25 MMP0005 beer -0.35589696 26 MMP0006 beer 1.40147484 27 MMP0007 beer 0.37492710 28 MMP0008 beer 0.64453865 29 MMP0009 beer 0.35925345 30 MMP0010 beer 0.96394785 31 MMP0001 pizza -1.91818504 32 MMP0002 pizza 0.31690523 33 MMP0003 pizza -1.20566043 34 MMP0004 pizza -1.91750166 35 MMP0005 pizza 1.98010023 36 MMP0006 pizza 0.90468249 37 MMP0007 pizza 0.04284970 38 MMP0008 pizza -0.08141461 39 MMP0009 pizza -0.72471771 40 MMP0010 pizza -0.01085060
We want to pivot the conditions into columns so that we end up with one column for each condition and one row for each gene. The easy way is to use the reshape package by Hadley Wickham, which is made for restructuring data and does this job nicely. If you don't already have it, you'll have to run install.packages, then load the library.
> install.packages('reshape') > library(reshape)
Using cast to move conditions into columns is a snap.
> cast(data, gene ~ condition) gene copper cheetos beer pizza 1 MMP0001 0.9041281 -2.3179623 -0.7496595 -1.91818504 2 MMP0002 0.9266438 0.7670659 0.2710521 0.31690523 3 MMP0003 0.2777215 -2.9369294 -0.9926136 -1.20566043 4 MMP0004 0.0895893 0.2545231 0.6514364 -1.91750166 5 MMP0005 -0.2013230 0.2416833 -0.3558970 1.98010023 6 MMP0006 0.3452473 0.2873973 1.4014748 0.90468249 7 MMP0007 -0.3391021 0.6923354 0.3749271 0.04284970 8 MMP0008 1.2100649 0.4886525 0.6445386 -0.08141461 9 MMP0009 0.7800802 -0.1112932 0.3592535 -0.72471771 10 MMP0010 1.0536432 0.5332284 0.9639479 -0.01085060
Done!
That was too easy
Just as an exercise, what would we have to do without reshape? And, just to keep ourselves honest, let's make sure we can deal with missing data (as reshape can). Make some data go missing:
> data.incomplete <- data[data$value > -1.0,] > dim(data.incomplete) [1] 35 3
Now, split the data frame up by condition. This produces a list where each element is a data frame containing a subset of the data for each condition. Notice that the cheetos data frame has values for 8 of the 10 genes.
> data.by.condition <- split(data.incomplete, data.incomplete$condition) > typeof(data.by.condition) [1] "list" > names(data.by.condition) [1] "copper" "cheetos" "beer" "pizza" > data.by.condition$cheetos gene condition value 12 MMP0002 cheetos 0.7670659 14 MMP0004 cheetos 0.2545231 15 MMP0005 cheetos 0.2416833 16 MMP0006 cheetos 0.2873973 17 MMP0007 cheetos 0.6923354 18 MMP0008 cheetos 0.4886525 19 MMP0009 cheetos -0.1112932 20 MMP0010 cheetos 0.5332284
We're going to recombine the data into a data frame with one row for each gene, so let's get that started:
> result = data.frame(gene=genes)
Now comes some executable line noise. We're going to loop through the list and add a column to the result data frame during each iteration of the loop. We pull the column out of the data frame in the list, but we have to make sure the column has an element for each gene. Merging with the all parameter set is like an outer join. We get a row for each gene, inserting NA's where there data is missing.
> for (i in seq(along=data.by.condition)) { result[[names(data.by.condition)[i]]] <- merge(data.by.condition[[i]], genes, by.x='gene', by.y=1, all=T)$value } > result gene copper cheetos beer pizza 1 MMP0001 0.9041281 NA -0.7496595 NA 2 MMP0002 0.9266438 0.7670659 0.2710521 0.31690523 3 MMP0003 0.2777215 NA -0.9926136 NA 4 MMP0004 0.0895893 0.2545231 0.6514364 NA 5 MMP0005 -0.2013230 0.2416833 -0.3558970 1.98010023 6 MMP0006 0.3452473 0.2873973 1.4014748 0.90468249 7 MMP0007 -0.3391021 0.6923354 0.3749271 0.04284970 8 MMP0008 1.2100649 0.4886525 0.6445386 -0.08141461 9 MMP0009 0.7800802 -0.1112932 0.3592535 -0.72471771 10 MMP0010 1.0536432 0.5332284 0.9639479 -0.01085060
Extra finesse points if you can figure out how to do that last step with Reduce instead of a loop.
Reshape's cast command took about 3 minutes to do this on a 2500x1500 example.
ReplyDeleteAnother approach without reshape is to use by.
ReplyDeletedo.call('cbind',by(data, data$condition, function(z) z[,-2]))[,-seq(3,8,by=2)]
Thanks for the post, it definitely got me going in the right direction. After spending some time with the help files for 'melt' and 'cast' I was able to generate just what I needed.
ReplyDelete