The R language is weird - particularly for those coming from a typical programmer's background, which likely includes OO languages in the curly-brace family and relational databases using SQL. A key data structure in R, the data.frame, is used something like a table in a relational database. In terms of R's somewhat byzantine type system (which is explained nicely here), a data.frame is a list of vectors of varying types. Each vector is a column in the data.frame making this a column-oriented data structure as opposed to the row-oriented nature of relational databases.
In spite of this difference, we often want to do the same sorts of things to an R data.frame that we would to a SQL table. The R docs confuse the SQL-savvy by using different terminology, so here is a quick crib-sheet for applying SQL concepts to data.frames.
We're going to use a sample data.frame with the following configuration of columns, or schema, if you prefer: (sequence:factor, strand:factor, start:integer, end:integer, common_name:character, value:double) where the type character is a string and a factor is something like an enum. Well, more accurately, value is a vector of type double and so forth. Anyway, our example is motivated by annotation of genome sequences, but the techniques aren't particular to any type of data.
> head(df) sequence strand start end common_name value 1 chromosome + 1450 2112 yvrO 0.9542516 2 chromosome + 41063 41716 graD6 0.2374012 3 chromosome + 62927 63640 graD3 1.0454790 4 chromosome + 63881 64807 gmd 1.4383845 5 chromosome + 71811 72701 moaE -1.8739953 6 chromosome + 73639 74739 moaA 1.2711058
So, given a data.frame of that schema, how do we do some simple select operations?
Selecting columns by name is easy:
> df[,c('sequence','start','end')] sequence start end 1 chromosome 1450 2112 2 chromosome 41063 41716 3 chromosome 62927 63640 4 chromosome 63881 64807 5 chromosome 71811 72701 ...
As is selecting row names, or both:
> df[566:570,c('sequence','start','end')] sequence start end 566 chromosome 480999 479860 567 chromosome 481397 480999 568 chromosome 503053 501275 569 chromosome 506476 505712 570 chromosome 515461 514277
Selecting rows that meet certain criteria is a lot like a SQL where clause:
> df[df$value>3.0,] sequence strand start end common_name value 199 chromosome + 907743 909506 hutU 3.158821 321 chromosome + 1391811 1393337 nadB 3.092771 556 chromosome - 431600 431037 apt 3.043373 572 chromosome - 519043 518186 hbd1 3.077040
For extra bonus points, let's find tRNAs.
> df[grep("trna", df$common_name, ignore.case=T),] sequence strand start end common_name value 18 chromosome + 115152 115224 Asn tRNA -0.461038128 19 chromosome + 115314 115422 Ile tRNA -0.925268307 31 chromosome + 167315 167388 Tyr tRNA 0.112527023 32 chromosome + 191112 191196 Ser tRNA 0.986357577 ...
Duplicate row names
Row names are not necessarily unique in R, which breaks the method shown above for selecting by row name. Take matrix a:
< a = matrix(1:18, nrow=6, ncol=3) < rownames(a) <- c('a', 'a', 'a', 'b', 'b', 'b') < colnames(a) <- c('foo', 'bar', 'bat') < a foo bar bat a 1 7 13 a 2 8 14 a 3 9 15 b 4 10 16 b 5 11 17 b 6 12 18
It looks to me like trying to index by the row names just returns the first row of a given name:
< a['a',] foo bar bat 1 7 13 < a['b',] foo bar bat 4 10 16
But this works:
< a[rownames(a)=='a',] foo bar bat a 1 7 13 a 2 8 14 a 3 9 15
More Resources:
- R programming for those coming from other languages
- Extract or Replace Parts of a Data Frame
- Subsetting Vectors, Matrices and Data Frames
- R & BioConductor Manual
- R Help search engine
- Abandon hope all ye who enter The R Inferno
- R Programming for Bioinformatics
Help for R, the R language, or the R project is notoriously hard to search for, so I like to stick in a few extra keywords, like R, data frames, data.frames, select subset, subsetting, selecting rows from a data.frame that meet certain criteria, and find.
...more on R.
How can i add header to data set containing no col names?
ReplyDeleteBy header, I'm guessing you mean column names?
ReplyDeleteYou add them like this:
colnames(a) <- c('foo', 'bar', 'bat')
Aw, come on! R is easy compared to learning a programming language. I can get interesting results (graphics, analysis) within twenty minutes of playing with R. Say the same for C++, PHP, or anywhere else on the gamut. I still can't even get a simple Perl script to I/O with my directory structure.
ReplyDeleteYou can use sqldf if you really miss SQL, as long as you don't mind the 1000x performance hit. Or you can use bit vectors and 'ff' or 'hdf5' with some forethought, and get a 1000x speedup. And there's always 'data.table' (the package).
ReplyDeleteMany roads to Rome.
By the way, are you actually using a data.frame to hold genomic ranges, or are you just using the data.frame-like accessors for IRanges/GenomicRanges?
ReplyDelete@Tim, For our purposes with microbial genomes a regular data frame usually suffices. For cases where it doesn't, we've experimented with some of the disk-backed data structures. I've come to like RSQLite, since it's easy, portable, and accessible from other languages. You can populate a SQLite DB from a Python script and then manipulate the data in R.
ReplyDeleteI should probably try GenomicRanges. I know those were getting some love from the BioConductor folks to support next-gen sequencing.
@Tim. Its a common misconception that due to the way sqldf works that it must be slow. Its true that the purpose of sqldf is convenience yet sqldf is often faster, not slower, than R. In one example here [link] sqldf was faster than R, plyr and data.table whereas in another example it was slower. Also there are some benchmarks that sqldf users provided which have been linked to on the sqldf home page and on all of those sqldf was faster than R. It very much depends on what you are doing and how you are doing it. Also, consider that sqldf can handle intermediate objects too large for R itself to handle since it performs its computations outside of R. Finally note that the above comparisons have been done with the SQLite backend but sqldf also supports several other backends too.
ReplyDeleteI agree with Gabor: sqldf, when the SQL is optimized, can be faster than native R - but may have a hard time competing with R functions that are really optimized C. It all depends on what you are doing.
ReplyDeleteI can even show examples when a join of two tables (1 million rows in one and 2,000 in the other) runs slower in MySQL database then in R when you get the data from that database (I use RODBC) into R data frames and use sqldf inside R, even when the database and the repository are on the same machine as the one where the R code is running.
It is important that the are on the same machine, because it eliminates the consideration of I/O and CPU power from deciding the runtime competition.
The examples use data frames right up until the point were it says duplicate row names are OK. At that point it switches to using a matrix. While matrices do allow duplicate names, data frames don't:
ReplyDelete> as.data.frame(a)
Error in data.frame(foo = c("1", "2", "3", "4", "5", "6"), bar = c(" 7", :
duplicate row.names: a, b
This article uses data frames right up to the duplicate row name example. Had it continued using them, it would have become clear that while matrices allow duplicate row names, data frames do not:
ReplyDelete> as.data.frame(a)
Error in data.frame(foo = c("1", "2", "3", "4", "5", "6"), bar = c(" 7", :
duplicate row.names: a, b
Hi guys,I am new to R bloggers.Could someone help me regarding the use of "if else".
ReplyDeleteI have a data frame (dat) with a categorical variable Q1 (dat$Q1). dat$Q1 was coded as 1,2,3 or 4. I need to create a new column data$new1 based on the following rule: if dat$Q1==3 then dat$new1 should be given a new code 1. otherwise, dat$new1 will be given a code 0.
What is the most efficient way of doing this please? thanks
Howdy, this should work as you describe:
Deletedat$new1 <- ifelse(dat$Q1==3, 1, 0)