Sunday, February 21, 2010

The R type system

The R ProjectR is a weird beast. Through it's ancestor the S language, it claims a proud heritage reaching back to Bell Labs in the 1970's when S was created as an interactive wrapper around a set of statistical and numerical subroutines. As a programming language, R takes ideas from Unix shell scripting, functional languages (Lisp and ML), and also a little from C. Programmers will usually have at least some background in these languages, but one aspect of R that might remain puzzling is it's type system.

Because the purpose of R is programming with data, it has some fairly sophisticated tools to represent and manipulate data. First off, the basic unit of data in R is the vector. Even a single integer is represented as a vector of length 1. All elements in an atomic vector are of the same type. The sizes of integers and doubles are implementation dependent. Generic vectors, or lists, hold elements of varying types and can be nested to create compound data structures, as in Lisp-like languages.

Fundamental types

  • vectors
    • an ordered collection of elements all of one type
    • atomic types: logical, numeric (integer or double), complex, character or raw
    • special values:
      • NA (not available, missing data)
      • NaN (not a number)
      • +/-Inf (infinity)
  • lists
    • generic vectors, elements can be of any type, including list
    • because they can be nested, lists are sometimes called recursive
  • functions
    • functions are "first class" data types
    • can be assigned, passed as arguments and returned from functions
# a is a vector of length 1
> a <- 101
> length(a)
[1] 1

# the function c() combines is arguments
# construct a vector of numeric data and access its members
> ages <- c(40, 36, 2, 38, 27, 1)
> ages[2]
[1] 36
> ages[4:6]
[1] 38 27  1

> movie <- list(title='Monty Python\'s The Meaning of Life', year=1983, cast=c('Graham Chapman','John Cleese','Terry Gilliam','Eric Idle','Terry Jones','Michael Palin'))
> movie
$title
[1] "Monty Python's The Meaning of Life"
$year
[1] 1983
$cast
[1] "Graham Chapman" "John Cleese"    "Terry Gilliam"  "Eric Idle"      "Terry Jones"    "Michael Palin"

Attributes

R objects can have attributes - arbitrary key/value pairs - attached to them. One use for this is that elements in vectors or lists can be named. R's object system is based on the class attribute. (OK, I really mean the simpler of R's two object systems, but let's avoid that topic.) Attributes are also used to turn one-dimensional vectors into multi-dimensional structures by specifying their dimensions, as we'll see next.

Matrices and arrays

Matrices and arrays are special types of vectors, distinguished by having a dim (dimensions) attribute. A matrix has two dimensions, so the value of its dim attribute is a vector of length 2 specifying numbers of rows and columns in the matrix. Arrays are n dimensional vectors, sometimes used like an OLAP data cube, with dimension vectors of length n.

# create some data series
> bac = c(14.08, 7.05, 13.05, 16.21)
> hbc = c(48.67, 29.51, 41.93, 55.82)
> jpm = c(31.53, 28.14, 33.77, 41.37)

# create a matrix whose rows are companies and columns are quarters
# values in the matrix is closing stock price on the first day of the quarter
> m <- matrix(c(bac,hbc,jpm), nrow=3, byrow=T)
> rownames(m) <- c('bac','hbc','jpm')
> colnames(m) <- c('q1', 'q2', 'q3', 'q4')

> m
       q1    q2    q3    q4
bac 14.08  7.05 13.05 16.21
hbc 48.67 29.51 41.93 55.82
jpm 31.53 28.14 33.77 41.37

# check out the attributes
> attributes(m)
$dim
[1] 3 4
$dimnames
$dimnames[[1]]
[1] "bac" "hbc" "jpm"
$dimnames[[2]]
[1] "q1" "q2" "q3" "q4"

Factors

Statisticians divide data into four types: nominal, ordinal, interval and ratio. Factors are for the first two, depending on whether they are ordered or not. This makes a difference for some of the stats algorithms in R, but from a programmers point of view, a factor is just an enum. R turns character vectors into factors at the slightest provocation. It's sometimes necessary to coerce factors back to character strings, using as.character().

  • represent categorical or rank data compactly
  • examples: countries, male/female, small/medium/large, etc.

Data frames

A data frame is a special list in which all elements are vectors of equal length. It is analagous to a table in a database, except that it's column-oriented rather than row-oriented. Because the vectors are constrained to be of the same length, you can index any cell in a data frame by its row and column.

  • a list of vectors of the same length (columns)
  • like a table in a database
# make a simple data frame
> df <- data.frame(ticker=c('bac', 'hbc', 'jpm'), market.cap=c(137.37, 185.65, 157.80), yield=c(0.25,3.00,0.50))
> df
  ticker market.cap yield
1    bac     137.37  0.25
2    hbc     185.65  3.00
3    jpm     157.80  0.50

There's more, of course, but this gives you enough to be dangerous. Note that, because R natively works with vectors, many operations in R are vectorized, meaning they operate on whole vectors at once, rather than on a single scalar value. The key to performance in R is making good use of vectorized operations. Also, being functional, R inherits a full compliment of higher-order functions - Map, Reduce, Filter and many forms of apply (lapply, sapply, and tapply). Mixing higher-order functions and vectorized operations can get confusing (and is the source of the proliferation of apply functions). Both these techniques, as well as the organization of the type system, encourage you to work with blocks of data as a unit. This is what John Chambers called high-level prototyping for computations with data.

More information

2 comments:

  1. Very good summary of data types in R. Much appreciated is description of the nature and role of factors. Thank you, AW (one typo: coersion should be: coercion, from the verb: coerce)

    ReplyDelete