Want to join two R data frames on a common key? Here's one way do a SQL database style join operation in R.
We start with a data frame describing probes on a microarray. The key is the probe_id and the rest of the information describes the location on the genome targeted by that probe.
> head(probes)
probe_id sequence strand start end
1 mm_ex_fwd_000541 Chr + 1192448 1192507
2 mm_ex_fwd_000542 Chr + 1192453 1192512
3 mm_ex_fwd_000543 Chr + 1192458 1192517
4 mm_ex_fwd_000544 Chr + 1192463 1192522
5 mm_ex_fwd_000545 Chr + 1192468 1192527
6 mm_ex_fwd_000546 Chr + 1192473 1192532
> dim(probes)
[1] 241019 5
We also have a bunch of measurements in a numeric vector. For each probe (well, a few probes missing due to bad data) we have a value.
> head(value) mm_fwd_000002 mm_fwd_000003 mm_fwd_000004 mm_fwd_000005 mm_fwd_000006 mm_fwd_000007 0.05294899 0.11979251 0.28160017 0.57284569 0.74402510 0.78644199 > length(value) [1] 241007
Let's join up these tables, er data frame and vector. We'll use the match function. Match returns a vector of positions of the (first) matches of its first argument in its second (or NA if there is no match). So, we're matching our values into our probes.
> joined = cbind(probes[match(names(value), probes$probe_id),], value)
> dim(joined)
[1] 241007 6
> head(joined)
probe_id sequence strand start end value
3695 mm_fwd_000002 Chr + 15 74 0.05294899
3696 mm_fwd_000003 Chr + 29 88 0.11979251
3697 mm_fwd_000004 Chr + 43 102 0.28160017
3698 mm_fwd_000005 Chr + 57 116 0.57284569
3699 mm_fwd_000006 Chr + 71 130 0.74402510
3700 mm_fwd_000007 Chr + 85 144 0.78644199
Merge is probably more similar to a database join.
| Inner join | merge(df1, df2, by="common_key_column") |
| Outer join | merge(df1, df2, by="common_key_column", all=TRUE) |
| Left outer | merge(df1, df2, by="common_key_column", all.x=TRUE) |
| Right outer | merge(df1, df2, by="common_key_column", all.y=TRUE) |
If we have two data frames, we can use merge. Let's convert our vector tp to a data frame and merge, getting the same result (in a different sort order).
> tp.df = data.frame(probe_id=names(tp), value=tp)
> head(tp.df)
probe_id value
mm_fwd_000002 mm_fwd_000002 0.05294899
mm_fwd_000003 mm_fwd_000003 0.11979251
mm_fwd_000004 mm_fwd_000004 0.28160017
mm_fwd_000005 mm_fwd_000005 0.57284569
mm_fwd_000006 mm_fwd_000006 0.74402510
mm_fwd_000007 mm_fwd_000007 0.78644199
> m = merge(probes, tp.df, by="probe_id")
> dim(m)
[1] 241007 6
> head(mmm)
probe_id sequence strand start end value
1 mm_ex_fwd_000541 Chr + 1192448 1192507 0.1354668
2 mm_ex_fwd_000542 Chr + 1192453 1192512 0.1942794
3 mm_ex_fwd_000543 Chr + 1192458 1192517 0.1924457
4 mm_ex_fwd_000544 Chr + 1192463 1192522 0.2526351
5 mm_ex_fwd_000545 Chr + 1192468 1192527 0.1922655
6 mm_ex_fwd_000546 Chr + 1192473 1192532 0.2610747
There's a good discussion of merge on Stack Overflow, which includes right, left, inner and outer joins. Also the R wiki covers both match and merge. See also, the prior entry on select operations on R data frames.




R Bloggers
Comments closed. Go away, spammers!
ReplyDelete