Saturday, January 09, 2010

Pivot tables in R

The R ProjectA common data-munging operation is to compute cross tabulations of measurements by categories. SQL Server and Excel have a nice feature called pivot tables for this purpose. Here we'll figure out how to do pivot operations in R.

Let's imagine an experiment where we're measuring the gene activity of an organism under different conditions -- exposure to different nutrients and toxins. Our conditions are silly: copper, beer, pizza, and cheetos. First we make a list of genes. Then expand.grid generates all combinations of genes and conditions. Finally, we tack on a column of randomly generated measurements.

> genes = paste('MMP', sprintf("%04d",1:10), sep="")
> data = expand.grid(gene=genes, condition=c('copper', 'cheetos', 'beer', 'pizza'))
> data$value = rnorm(40)
> data
      gene condition       value
1  MMP0001    copper  0.90412805
2  MMP0002    copper  0.92664376
3  MMP0003    copper  0.27772147
4  MMP0004    copper  0.08958930
5  MMP0005    copper -0.20132304
6  MMP0006    copper  0.34524729
7  MMP0007    copper -0.33910206
8  MMP0008    copper  1.21006486
9  MMP0009    copper  0.78008022
10 MMP0010    copper  1.05364315
11 MMP0001   cheetos -2.31796229
12 MMP0002   cheetos  0.76706591
13 MMP0003   cheetos -2.93692935
14 MMP0004   cheetos  0.25452306
15 MMP0005   cheetos  0.24168329
16 MMP0006   cheetos  0.28739734
17 MMP0007   cheetos  0.69233543
18 MMP0008   cheetos  0.48865250
19 MMP0009   cheetos -0.11129319
20 MMP0010   cheetos  0.53322842
21 MMP0001      beer -0.74965948
22 MMP0002      beer  0.27105205
23 MMP0003      beer -0.99261363
24 MMP0004      beer  0.65143639
25 MMP0005      beer -0.35589696
26 MMP0006      beer  1.40147484
27 MMP0007      beer  0.37492710
28 MMP0008      beer  0.64453865
29 MMP0009      beer  0.35925345
30 MMP0010      beer  0.96394785
31 MMP0001     pizza -1.91818504
32 MMP0002     pizza  0.31690523
33 MMP0003     pizza -1.20566043
34 MMP0004     pizza -1.91750166
35 MMP0005     pizza  1.98010023
36 MMP0006     pizza  0.90468249
37 MMP0007     pizza  0.04284970
38 MMP0008     pizza -0.08141461
39 MMP0009     pizza -0.72471771
40 MMP0010     pizza -0.01085060

We want to pivot the conditions into columns so that we end up with one column for each condition and one row for each gene. The easy way is to use the reshape package by Hadley Wickham, which is made for restructuring data and does this job nicely. If you don't already have it, you'll have to run install.packages, then load the library.

> install.packages('reshape')
> library(reshape)

Using cast to move conditions into columns is a snap.

> cast(data, gene ~ condition)
      gene     copper    cheetos       beer       pizza
1  MMP0001  0.9041281 -2.3179623 -0.7496595 -1.91818504
2  MMP0002  0.9266438  0.7670659  0.2710521  0.31690523
3  MMP0003  0.2777215 -2.9369294 -0.9926136 -1.20566043
4  MMP0004  0.0895893  0.2545231  0.6514364 -1.91750166
5  MMP0005 -0.2013230  0.2416833 -0.3558970  1.98010023
6  MMP0006  0.3452473  0.2873973  1.4014748  0.90468249
7  MMP0007 -0.3391021  0.6923354  0.3749271  0.04284970
8  MMP0008  1.2100649  0.4886525  0.6445386 -0.08141461
9  MMP0009  0.7800802 -0.1112932  0.3592535 -0.72471771
10 MMP0010  1.0536432  0.5332284  0.9639479 -0.01085060

Done!

That was too easy

Just as an exercise, what would we have to do without reshape? And, just to keep ourselves honest, let's make sure we can deal with missing data (as reshape can). Make some data go missing:

> data.incomplete <- data[data$value > -1.0,]
> dim(data.incomplete)
[1] 35  3

Now, split the data frame up by condition. This produces a list where each element is a data frame containing a subset of the data for each condition. Notice that the cheetos data frame has values for 8 of the 10 genes.

> data.by.condition <- split(data.incomplete, data.incomplete$condition)
> typeof(data.by.condition)
[1] "list"
> names(data.by.condition)
[1] "copper"  "cheetos" "beer"    "pizza"  
> data.by.condition$cheetos
      gene condition      value
12 MMP0002   cheetos  0.7670659
14 MMP0004   cheetos  0.2545231
15 MMP0005   cheetos  0.2416833
16 MMP0006   cheetos  0.2873973
17 MMP0007   cheetos  0.6923354
18 MMP0008   cheetos  0.4886525
19 MMP0009   cheetos -0.1112932
20 MMP0010   cheetos  0.5332284

We're going to recombine the data into a data frame with one row for each gene, so let's get that started:

> result = data.frame(gene=genes)

Now comes some executable line noise. We're going to loop through the list and add a column to the result data frame during each iteration of the loop. We pull the column out of the data frame in the list, but we have to make sure the column has an element for each gene. Merging with the all parameter set is like an outer join. We get a row for each gene, inserting NA's where there data is missing.

> for (i in seq(along=data.by.condition)) { result[[names(data.by.condition)[i]]] <- merge(data.by.condition[[i]], genes, by.x='gene', by.y=1, all=T)$value }

> result
      gene     copper    cheetos       beer       pizza
1  MMP0001  0.9041281         NA -0.7496595          NA
2  MMP0002  0.9266438  0.7670659  0.2710521  0.31690523
3  MMP0003  0.2777215         NA -0.9926136          NA
4  MMP0004  0.0895893  0.2545231  0.6514364          NA
5  MMP0005 -0.2013230  0.2416833 -0.3558970  1.98010023
6  MMP0006  0.3452473  0.2873973  1.4014748  0.90468249
7  MMP0007 -0.3391021  0.6923354  0.3749271  0.04284970
8  MMP0008  1.2100649  0.4886525  0.6445386 -0.08141461
9  MMP0009  0.7800802 -0.1112932  0.3592535 -0.72471771
10 MMP0010  1.0536432  0.5332284  0.9639479 -0.01085060

Extra finesse points if you can figure out how to do that last step with Reduce instead of a loop.