Sunday, July 26, 2009

Select operations on R data frames

The R Project

The R language is weird - particularly for those coming from a typical programmer's background, which likely includes OO languages in the curly-brace family and relational databases using SQL. A key data structure in R, the data.frame, is used something like a table in a relational database. In terms of R's somewhat byzantine type system (which is explained nicely here), a data.frame is a list of vectors of varying types. Each vector is a column in the data.frame making this a column-oriented data structure as opposed to the row-oriented nature of relational databases.

In spite of this difference, we often want to do the same sorts of things to an R data.frame that we would to a SQL table. The R docs confuse the SQL-savvy by using different terminology, so here is a quick crib-sheet for applying SQL concepts to data.frames.

We're going to use a sample data.frame with the following configuration of columns, or schema, if you prefer: (sequence:factor, strand:factor, start:integer, end:integer, common_name:character, value:double) where the type character is a string and a factor is something like an enum. Well, more accurately, value is a vector of type double and so forth. Anyway, our example is motivated by annotation of genome sequences, but the techniques aren't particular to any type of data.

> head(df)
    sequence strand start   end common_name      value
1 chromosome      +  1450  2112        yvrO  0.9542516
2 chromosome      + 41063 41716       graD6  0.2374012
3 chromosome      + 62927 63640       graD3  1.0454790
4 chromosome      + 63881 64807         gmd  1.4383845
5 chromosome      + 71811 72701        moaE -1.8739953
6 chromosome      + 73639 74739        moaA  1.2711058

So, given a data.frame of that schema, how do we do some simple select operations?

Selecting columns by name is easy:

> df[,c('sequence','start','end')]
       sequence   start     end
1    chromosome    1450    2112
2    chromosome   41063   41716
3    chromosome   62927   63640
4    chromosome   63881   64807
5    chromosome   71811   72701
...

As is selecting row names, or both:

> df[566:570,c('sequence','start','end')]
      sequence  start    end
566 chromosome 480999 479860
567 chromosome 481397 480999
568 chromosome 503053 501275
569 chromosome 506476 505712
570 chromosome 515461 514277

Selecting rows that meet certain criteria is a lot like a SQL where clause:

> df[df$value>3.0,]
      sequence strand   start     end common_name    value
199 chromosome      +  907743  909506        hutU 3.158821
321 chromosome      + 1391811 1393337        nadB 3.092771
556 chromosome      -  431600  431037         apt 3.043373
572 chromosome      -  519043  518186        hbd1 3.077040

For extra bonus points, let's find tRNAs.

> df[grep("trna", df$common_name, ignore.case=T),]
      sequence strand   start     end common_name        value
18  chromosome      +  115152  115224    Asn tRNA -0.461038128
19  chromosome      +  115314  115422    Ile tRNA -0.925268307
31  chromosome      +  167315  167388    Tyr tRNA  0.112527023
32  chromosome      +  191112  191196    Ser tRNA  0.986357577
...

Duplicate row names

Row names are not necessarily unique in R, which breaks the method shown above for selecting by row name. Take matrix a:

< a = matrix(1:18, nrow=6, ncol=3)
< rownames(a) <- c('a', 'a', 'a', 'b', 'b', 'b')
< colnames(a) <- c('foo', 'bar', 'bat')
< a
  foo bar bat
a   1   7  13
a   2   8  14
a   3   9  15
b   4  10  16
b   5  11  17
b   6  12  18

It looks to me like trying to index by the row names just returns the first row of a given name:

< a['a',]
foo bar bat 
  1   7  13
< a['b',]
foo bar bat 
  4  10  16 

But this works:

< a[rownames(a)=='a',]
  foo bar bat
a   1   7  13
a   2   8  14
a   3   9  15

More Resources:

Help for R, the R language, or the R project is notoriously hard to search for, so I like to stick in a few extra keywords, like R, data frames, data.frames, select subset, subsetting, selecting rows from a data.frame that meet certain criteria, and find.

...more on R.

12 comments:

  1. How can i add header to data set containing no col names?

    ReplyDelete
  2. By header, I'm guessing you mean column names?

    You add them like this:

    colnames(a) <- c('foo', 'bar', 'bat')

    ReplyDelete
  3. Aw, come on! R is easy compared to learning a programming language. I can get interesting results (graphics, analysis) within twenty minutes of playing with R. Say the same for C++, PHP, or anywhere else on the gamut. I still can't even get a simple Perl script to I/O with my directory structure.

    ReplyDelete
  4. You can use sqldf if you really miss SQL, as long as you don't mind the 1000x performance hit. Or you can use bit vectors and 'ff' or 'hdf5' with some forethought, and get a 1000x speedup. And there's always 'data.table' (the package).

    Many roads to Rome.

    ReplyDelete
  5. By the way, are you actually using a data.frame to hold genomic ranges, or are you just using the data.frame-like accessors for IRanges/GenomicRanges?

    ReplyDelete
  6. @Tim, For our purposes with microbial genomes a regular data frame usually suffices. For cases where it doesn't, we've experimented with some of the disk-backed data structures. I've come to like RSQLite, since it's easy, portable, and accessible from other languages. You can populate a SQLite DB from a Python script and then manipulate the data in R.

    I should probably try GenomicRanges. I know those were getting some love from the BioConductor folks to support next-gen sequencing.

    ReplyDelete
  7. @Tim. Its a common misconception that due to the way sqldf works that it must be slow. Its true that the purpose of sqldf is convenience yet sqldf is often faster, not slower, than R. In one example here [link] sqldf was faster than R, plyr and data.table whereas in another example it was slower. Also there are some benchmarks that sqldf users provided which have been linked to on the sqldf home page and on all of those sqldf was faster than R. It very much depends on what you are doing and how you are doing it. Also, consider that sqldf can handle intermediate objects too large for R itself to handle since it performs its computations outside of R. Finally note that the above comparisons have been done with the SQLite backend but sqldf also supports several other backends too.

    ReplyDelete
  8. I agree with Gabor: sqldf, when the SQL is optimized, can be faster than native R - but may have a hard time competing with R functions that are really optimized C. It all depends on what you are doing.

    I can even show examples when a join of two tables (1 million rows in one and 2,000 in the other) runs slower in MySQL database then in R when you get the data from that database (I use RODBC) into R data frames and use sqldf inside R, even when the database and the repository are on the same machine as the one where the R code is running.

    It is important that the are on the same machine, because it eliminates the consideration of I/O and CPU power from deciding the runtime competition.

    ReplyDelete
  9. The examples use data frames right up until the point were it says duplicate row names are OK. At that point it switches to using a matrix. While matrices do allow duplicate names, data frames don't:

    > as.data.frame(a)
    Error in data.frame(foo = c("1", "2", "3", "4", "5", "6"), bar = c(" 7", :
    duplicate row.names: a, b

    ReplyDelete
  10. This article uses data frames right up to the duplicate row name example. Had it continued using them, it would have become clear that while matrices allow duplicate row names, data frames do not:

    > as.data.frame(a)

    Error in data.frame(foo = c("1", "2", "3", "4", "5", "6"), bar = c(" 7", :
    duplicate row.names: a, b

    ReplyDelete
  11. Hi guys,I am new to R bloggers.Could someone help me regarding the use of "if else".

    I have a data frame (dat) with a categorical variable Q1 (dat$Q1). dat$Q1 was coded as 1,2,3 or 4. I need to create a new column data$new1 based on the following rule: if dat$Q1==3 then dat$new1 should be given a new code 1. otherwise, dat$new1 will be given a code 0.

    What is the most efficient way of doing this please? thanks

    ReplyDelete
    Replies
    1. Howdy, this should work as you describe:

      dat$new1 <- ifelse(dat$Q1==3, 1, 0)

      Delete