Thursday, December 17, 2009

Joining data frames in R

The R ProjectWant to join two R data frames on a common key? Here's one way do a SQL database style join operation in R.

We start with a data frame describing probes on a microarray. The key is the probe_id and the rest of the information describes the location on the genome targeted by that probe.

> head(probes)
          probe_id sequence strand   start     end
1 mm_ex_fwd_000541      Chr      + 1192448 1192507
2 mm_ex_fwd_000542      Chr      + 1192453 1192512
3 mm_ex_fwd_000543      Chr      + 1192458 1192517
4 mm_ex_fwd_000544      Chr      + 1192463 1192522
5 mm_ex_fwd_000545      Chr      + 1192468 1192527
6 mm_ex_fwd_000546      Chr      + 1192473 1192532

> dim(probes)
[1] 241019      5

We also have a bunch of measurements in a numeric vector. For each probe (well, a few probes missing due to bad data) we have a value.

> head(value)
mm_fwd_000002 mm_fwd_000003 mm_fwd_000004 mm_fwd_000005 mm_fwd_000006 mm_fwd_000007 
   0.05294899    0.11979251    0.28160017    0.57284569    0.74402510    0.78644199 

> length(value)
[1] 241007

Let's join up these tables, er data frame and vector. We'll use the match function. Match returns a vector of positions of the (first) matches of its first argument in its second (or NA if there is no match). So, we're matching our values into our probes.

> joined = cbind(probes[match(names(value), probes$probe_id),], value)

> dim(joined)
[1] 241007      6

> head(joined)
          probe_id sequence strand start end         value
3695 mm_fwd_000002      Chr      +    15  74 0.05294899
3696 mm_fwd_000003      Chr      +    29  88 0.11979251
3697 mm_fwd_000004      Chr      +    43 102 0.28160017
3698 mm_fwd_000005      Chr      +    57 116 0.57284569
3699 mm_fwd_000006      Chr      +    71 130 0.74402510
3700 mm_fwd_000007      Chr      +    85 144 0.78644199

Merge is probably more similar to a database join.

Inner joinmerge(df1, df2, by="common_key_column")
Outer joinmerge(df1, df2, by="common_key_column", all=TRUE)
Left outermerge(df1, df2, by="common_key_column", all.x=TRUE)
Right outermerge(df1, df2, by="common_key_column", all.y=TRUE)

If we have two data frames, we can use merge. Let's convert our vector tp to a data frame and merge, getting the same result (in a different sort order).

> tp.df = data.frame(probe_id=names(tp), value=tp)

> head(tp.df)
                   probe_id      value
mm_fwd_000002 mm_fwd_000002 0.05294899
mm_fwd_000003 mm_fwd_000003 0.11979251
mm_fwd_000004 mm_fwd_000004 0.28160017
mm_fwd_000005 mm_fwd_000005 0.57284569
mm_fwd_000006 mm_fwd_000006 0.74402510
mm_fwd_000007 mm_fwd_000007 0.78644199

> m = merge(probes, tp.df, by="probe_id")

> dim(m)
[1] 241007      6

> head(mmm)
          probe_id sequence strand   start     end     value
1 mm_ex_fwd_000541      Chr      + 1192448 1192507 0.1354668
2 mm_ex_fwd_000542      Chr      + 1192453 1192512 0.1942794
3 mm_ex_fwd_000543      Chr      + 1192458 1192517 0.1924457
4 mm_ex_fwd_000544      Chr      + 1192463 1192522 0.2526351
5 mm_ex_fwd_000545      Chr      + 1192468 1192527 0.1922655
6 mm_ex_fwd_000546      Chr      + 1192473 1192532 0.2610747

There's a good discussion of merge on Stack Overflow, which includes right, left, inner and outer joins. Also the R wiki covers both match and merge. See also, the prior entry on select operations on R data frames.