A common data-munging operation is to compute cross tabulations of measurements by categories. SQL Server and Excel have a nice feature called pivot tables for this purpose. Here we'll figure out how to do pivot operations in R.
Let's imagine an experiment where we're measuring the gene activity of an organism under different conditions -- exposure to different nutrients and toxins. Our conditions are silly: copper, beer, pizza, and cheetos. First we make a list of genes. Then expand.grid generates all combinations of genes and conditions. Finally, we tack on a column of randomly generated measurements.
> genes = paste('MMP', sprintf("%04d",1:10), sep="")
> data = expand.grid(gene=genes, condition=c('copper', 'cheetos', 'beer', 'pizza'))
> data$value = rnorm(40)
> data
gene condition value
1 MMP0001 copper 0.90412805
2 MMP0002 copper 0.92664376
3 MMP0003 copper 0.27772147
4 MMP0004 copper 0.08958930
5 MMP0005 copper -0.20132304
6 MMP0006 copper 0.34524729
7 MMP0007 copper -0.33910206
8 MMP0008 copper 1.21006486
9 MMP0009 copper 0.78008022
10 MMP0010 copper 1.05364315
11 MMP0001 cheetos -2.31796229
12 MMP0002 cheetos 0.76706591
13 MMP0003 cheetos -2.93692935
14 MMP0004 cheetos 0.25452306
15 MMP0005 cheetos 0.24168329
16 MMP0006 cheetos 0.28739734
17 MMP0007 cheetos 0.69233543
18 MMP0008 cheetos 0.48865250
19 MMP0009 cheetos -0.11129319
20 MMP0010 cheetos 0.53322842
21 MMP0001 beer -0.74965948
22 MMP0002 beer 0.27105205
23 MMP0003 beer -0.99261363
24 MMP0004 beer 0.65143639
25 MMP0005 beer -0.35589696
26 MMP0006 beer 1.40147484
27 MMP0007 beer 0.37492710
28 MMP0008 beer 0.64453865
29 MMP0009 beer 0.35925345
30 MMP0010 beer 0.96394785
31 MMP0001 pizza -1.91818504
32 MMP0002 pizza 0.31690523
33 MMP0003 pizza -1.20566043
34 MMP0004 pizza -1.91750166
35 MMP0005 pizza 1.98010023
36 MMP0006 pizza 0.90468249
37 MMP0007 pizza 0.04284970
38 MMP0008 pizza -0.08141461
39 MMP0009 pizza -0.72471771
40 MMP0010 pizza -0.01085060
We want to pivot the conditions into columns so that we end up with one column for each condition and one row for each gene. The easy way is to use the reshape package by Hadley Wickham, which is made for restructuring data and does this job nicely. If you don't already have it, you'll have to run install.packages, then load the library.
> install.packages('reshape')
> library(reshape)
Using cast to move conditions into columns is a snap.
> cast(data, gene ~ condition)
gene copper cheetos beer pizza
1 MMP0001 0.9041281 -2.3179623 -0.7496595 -1.91818504
2 MMP0002 0.9266438 0.7670659 0.2710521 0.31690523
3 MMP0003 0.2777215 -2.9369294 -0.9926136 -1.20566043
4 MMP0004 0.0895893 0.2545231 0.6514364 -1.91750166
5 MMP0005 -0.2013230 0.2416833 -0.3558970 1.98010023
6 MMP0006 0.3452473 0.2873973 1.4014748 0.90468249
7 MMP0007 -0.3391021 0.6923354 0.3749271 0.04284970
8 MMP0008 1.2100649 0.4886525 0.6445386 -0.08141461
9 MMP0009 0.7800802 -0.1112932 0.3592535 -0.72471771
10 MMP0010 1.0536432 0.5332284 0.9639479 -0.01085060
Done!
That was too easy
Just as an exercise, what would we have to do without reshape? And, just to keep ourselves honest, let's make sure we can deal with missing data (as reshape can). Make some data go missing:
> data.incomplete <- data[data$value > -1.0,]
> dim(data.incomplete)
[1] 35 3
Now, split the data frame up by condition. This produces a list where each element is a data frame containing a subset of the data for each condition. Notice that the cheetos data frame has values for 8 of the 10 genes.
> data.by.condition <- split(data.incomplete, data.incomplete$condition)
> typeof(data.by.condition)
[1] "list"
> names(data.by.condition)
[1] "copper" "cheetos" "beer" "pizza"
> data.by.condition$cheetos
gene condition value
12 MMP0002 cheetos 0.7670659
14 MMP0004 cheetos 0.2545231
15 MMP0005 cheetos 0.2416833
16 MMP0006 cheetos 0.2873973
17 MMP0007 cheetos 0.6923354
18 MMP0008 cheetos 0.4886525
19 MMP0009 cheetos -0.1112932
20 MMP0010 cheetos 0.5332284
We're going to recombine the data into a data frame with one row for each gene, so let's get that started:
> result = data.frame(gene=genes)
Now comes some executable line noise. We're going to loop through the list and add a column to the result data frame during each iteration of the loop. We pull the column out of the data frame in the list, but we have to make sure the column has an element for each gene. Merging with the all parameter set is like an outer join. We get a row for each gene, inserting NA's where there data is missing.
> for (i in seq(along=data.by.condition)) { result[[names(data.by.condition)[i]]] <- merge(data.by.condition[[i]], genes, by.x='gene', by.y=1, all=T)$value }
> result
gene copper cheetos beer pizza
1 MMP0001 0.9041281 NA -0.7496595 NA
2 MMP0002 0.9266438 0.7670659 0.2710521 0.31690523
3 MMP0003 0.2777215 NA -0.9926136 NA
4 MMP0004 0.0895893 0.2545231 0.6514364 NA
5 MMP0005 -0.2013230 0.2416833 -0.3558970 1.98010023
6 MMP0006 0.3452473 0.2873973 1.4014748 0.90468249
7 MMP0007 -0.3391021 0.6923354 0.3749271 0.04284970
8 MMP0008 1.2100649 0.4886525 0.6445386 -0.08141461
9 MMP0009 0.7800802 -0.1112932 0.3592535 -0.72471771
10 MMP0010 1.0536432 0.5332284 0.9639479 -0.01085060
Extra finesse points if you can figure out how to do that last step with Reduce instead of a loop.