Saturday, June 04, 2011

Environments in R


One interesting thing about R is that you can get down into the insides fairly easily. You're allowed to see more of how things are put together than in most languages. One of the ways R does this is by having first-class environments.

At first glance, environments are simple enough. An environment is just a place to store variables - a set of bindings between symbols and objects. If you start up R and make an assignment, you're adding an entry in the global environment.

> a <- 1234
> e <- globalenv()
> ls()
[1] "a" "e"
> ls(e)
[1] "a" "e"
> e$a
[1] 1234
> class(e)
[1] "environment"

Hmmm, the variable e is part of the global environment and it refers to the global environment, too, which is kind of circular.

> ls(e$e$e$e$e$e$e$e)
[1] "a" "e"

We'd better cut that out, before we're sucked into a cosmic vortex.

> rm(e)

Most functional languages have some concept of environments, which serves as a higher level of abstraction over implementation details like allocating variables on the heap or stack. Saying that environments are first-class means that you can manipulate them from within the language, which is less common. Several advanced language features of R are built out of environments. We'll look at functions, packages and namespaces, and point out several Scheme-like features in R.

But first, the basics. The R Language Definition gives this definition:

Environments can be thought of as consisting of two things: a frame, which is a set of symbol-value pairs, and an enclosure, a pointer to an enclosing environment. When R looks up the value for a symbol the frame is examined and if a matching symbol is found its value will be returned. If not, the enclosing environment is then accessed and the process repeated. Environments form a tree structure in which the enclosures play the role of parents. The tree of environments is rooted in an empty environment, available through emptyenv(), which has no parent.

You can make a new environment with new.env() and assign a couple of variables. The assign function works, as does the odd but convenient dollar sign notation. Think of the dollar sign as equivalent to the 'dot' operator that dereferences object members in Java-ish languages.

> my.env <- new.env()
> my.env
<environment: 0x114a9d940>
> ls(my.env)
character(0)
> assign("a", 999, envir=my.env)
> my.env$foo = "This is the variable foo."
> ls(my.env)
[1] "a"   "foo"
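
As a side note, here is a minimal sketch of a few other ways to work with the bindings in an environment. The names scratch and bump are made up for illustration. It also shows that an environment is not copied when it's passed to a function, so changes made inside the function are visible to the caller.

```r
# Hypothetical environment used only for this sketch
scratch <- new.env()

assign("a", 999, envir = scratch)   # same effect as scratch$a <- 999
scratch[["foo"]] <- "bar"           # double brackets work like $

exists("a", envir = scratch, inherits = FALSE)   # TRUE
ls(scratch)                                      # "a" "foo"

# Environments have reference semantics: no copy is made here
bump <- function(env) {
  env$a <- env$a + 1
  invisible(NULL)
}
bump(scratch)
scratch$a                           # 1000

rm("foo", envir = scratch)          # remove a binding
ls(scratch)                         # "a"
```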

Now we have two variables named a, one in the global environment, the other in our new environment. Let's stick another variable b in the global environment, just for kicks.

> a
[1] 1234
> my.env$a
[1] 999
> b <- 4567

Also, note that the parent environment of my.env is the global environment.

> parent.env(my.env)
<environment: R_GlobalEnv>
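
The parent can also be chosen explicitly. A small sketch, assuming nothing beyond base R (the names child, isolated and marker are hypothetical): new.env takes a parent argument, and an environment whose parent is the empty environment can't see anything outside itself.

```r
# By default the parent is the environment new.env() was called from
child <- new.env()

# An explicit parent: the empty environment, which has no bindings at all
isolated <- new.env(parent = emptyenv())

marker <- 4567   # hypothetical variable in the calling environment

exists("marker", envir = child)               # TRUE, found via the parent
exists("marker", envir = isolated)            # FALSE, nothing to inherit from
identical(parent.env(isolated), emptyenv())   # TRUE
```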

A variable can be accessed using get or the dollar operator. By default, get continues up the chain of parents until it either finds a binding or reaches the empty environment. The dollar operator looks only in the given environment and returns NULL if the name isn't bound there.

> get('a', envir=my.env)
[1] 999
> get('b', envir=my.env)
[1] 4567
> my.env$a
[1] 999
> my.env$b
NULL
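
To make the difference concrete, here is a hand-rolled sketch of the lookup that get performs by default. The function lookup and the two environment names are hypothetical; real code would just call get. Passing inherits = FALSE to get or exists restricts the search to the given frame.

```r
# Walk the chain of enclosures by hand, roughly as get() does by default
lookup <- function(name, env) {
  while (!identical(env, emptyenv())) {
    if (exists(name, envir = env, inherits = FALSE)) {
      return(get(name, envir = env, inherits = FALSE))
    }
    env <- parent.env(env)
  }
  stop("object '", name, "' not found")
}

outer_env <- new.env()
inner_env <- new.env(parent = outer_env)
assign("a", 999, envir = inner_env)
assign("b", 4567, envir = outer_env)

lookup("a", inner_env)                             # 999, from inner_env's own frame
lookup("b", inner_env)                             # 4567, from the parent
exists("b", envir = inner_env, inherits = FALSE)   # FALSE, which is why inner_env$b is NULL
```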

Functions and environments

Functions have their own environments. This is the key to implementing closures. If you've never heard of a closure, it's just a function packaged up with some state. In fact, some say closures are a poor man's object, while others insist it's the other way 'round. The R Language Definition explains the relationship between functions and environments like this:

Functions (or more precisely, function closures) have three basic components: a formal argument list, a body and an environment. [...] A function's environment is the environment that was active at the time that the function was created. [...] When a function is called, a new environment (called the evaluation environment) is created, whose enclosure is the environment from the function closure. This new environment is initially populated with the unevaluated arguments to the function; as evaluation proceeds, local variables are created within it.

When a function is evaluated, R looks in a series of environments for any variables in scope. The evaluation environment is first, then the function's enclosing environment, which will be the global environment for functions defined in the workspace. So, the global variable a, which had the value 1234 last time we looked, can be referenced inside a function.

> f <- function(x) { x + a }
> environment(f)
<environment: R_GlobalEnv>
> f(4321)
[1] 5555
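
The evaluation environment described in the quote can be inspected from inside a function. A minimal sketch, with the hypothetical function show_envs: calling environment() with no arguments returns the current evaluation environment, and its parent is the function's closure environment.

```r
show_envs <- function(x) {
  here <- environment()            # the evaluation environment of this call
  list(
    here      = here,
    enclosure = parent.env(here),  # the closure environment
    locals    = ls(here)           # arguments and local variables so far
  )
}

first  <- show_envs(1)
second <- show_envs(2)

identical(first$enclosure, environment(show_envs))   # TRUE
identical(first$here, second$here)                   # FALSE, one per call
first$locals                                         # "here" "x"
```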

We can change a function's environment if we want to.

> environment(f) <- my.env
> environment(f)
<environment: 0x114a9d940>
> my.env$a
[1] 999
> f(1)
[1] 1000

Suppose we wanted a counter to keep track of progress of some kind. That could be written and applied like so:

> createCounter <- function(value) { function(i) { value <<- value+i} }
> counter <- createCounter(0)
> counter(1)
> a <- counter(0)
> a
[1] 1
> counter(1)
> counter(1)
> a <- counter(1)
> a
[1] 4
> a <- counter(5)
> a
[1] 9

Notice the special <<- assignment operator. If we had used the normal <- assignment operator, we would have created a new variable 'value' in the evaluation environment of the function, masking the value in the function closure environment. That environment disappears as soon as the function returns, sending our new value into the ether. What we want to do is change the value in the function closure environment, so that assignments to value will be persistent across invocations of our counter. Mutable state is generally not the default in functional languages, so we have to use the special assignment operator.
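
For contrast, here is a sketch of what goes wrong with the ordinary assignment operator. The names createBrokenCounter, createWorkingCounter, broken and working are hypothetical, chosen so they don't clobber the counter defined above.

```r
# Uses <-, so each call only modifies a throwaway local copy of value
createBrokenCounter <- function(value) {
  function(i) {
    value <- value + i
    value
  }
}

# Uses <<-, so the binding in the enclosing environment is updated
createWorkingCounter <- function(value) {
  function(i) {
    value <<- value + i
    value
  }
}

broken  <- createBrokenCounter(0)
working <- createWorkingCounter(0)

broken(1); broken(1)     # the second call still returns 1
working(1); working(1)   # the second call returns 2
```

One caveat: if <<- finds no existing binding in any enclosing environment, it assigns in the global environment, which is rarely what you want.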

Just to look under the covers, where is that mutable state? In the counter function's enclosing environment.

> ls(environment(counter))
[1] "value"
> environment(counter)$value
[1] 9
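
Each call to createCounter produces a fresh enclosing environment, so separate counters don't share state. A short sketch (createCounter is repeated from above so the example stands alone; apples and oranges are made-up names):

```r
createCounter <- function(value) { function(i) { value <<- value + i } }

apples  <- createCounter(0)
oranges <- createCounter(100)

apples(1)
apples(1)
oranges(5)

environment(apples)$value    # 2
environment(oranges)$value   # 105
identical(environment(apples), environment(oranges))   # FALSE
```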

For those that geek out on this stuff, this is an implementation of Paul Graham's Accumulator Generator from his article Revenge of the Nerds, which, years ago, I struggled to implement in Java.

Inspired by Scheme, lexical scoping is R's major point of departure from the S language. Gentleman and Ihaka's papers R: A Language for Data Analysis and Graphics (pdf) and Lexical Scope and Statistical Computing (pdf) describe some of their language design decisions around this point.

For functions defined in a package, the situation gets a bit more interesting. The various parts of the plot function are visible below: a parameter list (x, y, and some other junk), a block of code, elided here, and an environment, which is the namespace for the graphics package. Packages and namespaces are our next topic.

> plot
function (x, y, ...) 
{
  ...blah, blah, blah...
}
<environment: namespace:graphics>
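
The same thing can be checked programmatically. A small sketch using sd from the stats package, picked here only because it ships with R:

```r
f_env <- environment(stats::sd)   # the environment sd was defined in

environmentName(f_env)   # "stats"
isNamespace(f_env)       # TRUE, it's a namespace, not package:stats
```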

Packages and namespaces

Walking up the chain of environments starting with the global environment, we see the packages loaded into R.

> globalenv()
<environment: R_GlobalEnv>
> g <- globalenv()
> while (environmentName(g) != 'R_EmptyEnv') { g <- parent.env(g); str(g, give.attr=FALSE) }
<environment: 0x100fdf078>
<environment: package:stats>
<environment: package:graphics>
<environment: package:grDevices>
<environment: package:utils>
<environment: package:datasets>
<environment: package:methods>
<environment: 0x101a19f58>
<environment: base>
<environment: R_EmptyEnv>

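Oddly, you can't compare environments with the == operator. If you try, it says, "comparison (1) is possible only for atomic and list types". The identical function does work on environments, so identical(g, emptyenv()) would be the more direct test; here we test for the end of the chain by name.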
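
Here is a sketch of the same walk with the end of the chain detected by identical(), which, unlike ==, accepts environments. The helper name env_chain is made up for this example.

```r
# Collect the names of all environments from env up to the empty environment
env_chain <- function(env = globalenv()) {
  out <- character(0)
  repeat {
    out <- c(out, environmentName(env))
    if (identical(env, emptyenv())) break
    env <- parent.env(env)
  }
  out
}

chain <- env_chain()
head(chain, 1)   # "R_GlobalEnv"
tail(chain, 1)   # "R_EmptyEnv"
```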

This same information can be had in slightly nicer form using search.

> search()
 [1] ".GlobalEnv"        "tools:RGUI"        "package:stats"     "package:graphics" 
 [5] "package:grDevices" "package:utils"     "package:datasets"  "package:methods"  
 [9] "Autoloads"         "package:base"

By now, you can guess how attach works. It creates an environment and slots it into the list right after the global environment, then populates it with the objects we're attaching.

> beatles <- list('george'='guitar','ringo'='drums','paul'='bass guitar','john'='guitar')
> attach(beatles)
> search()
 [1] ".GlobalEnv"        "beatles"           "tools:RGUI"        "package:stats"    
 [5] "package:graphics"  "package:grDevices" "package:utils"     "package:datasets" 
 [9] "package:methods"   "Autoloads"         "package:base"     
> john
[1] "guitar"
> paul
[1] "bass guitar"
> george
[1] "guitar"
> ringo
[1] "drums"
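
One consequence worth seeing in code: attach copies the list's elements into the new environment, so later changes to the original list are not reflected on the search path. A sketch with a hypothetical list called band:

```r
band <- list(john = "guitar", ringo = "drums")
attach(band)

band$john <- "piano"      # modify the original list after attaching
attached_john <- john     # still "guitar": the attached copy is unchanged

detach(band)              # remove the entry from the search path again
"band" %in% search()      # FALSE
```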

Attaching a package using library adds an entry to the chain of environments. A package can optionally have another environment, a namespace, whose purpose is to prevent naming clashes between packages and hide internal implementation details. R Internals explains it like this:

A package pkg with a name space defines two environments namespace:pkg and package:pkg. It is package:pkg that can be attached and form part of the search path.

When a namespaced package is attached, a new environment is created and all exported items are copied into it. That's package:pkg in the example above and is what you see in the search path. The namespace becomes the environment for the functions in that package. The parent environment of the namespace holds all the imports declared by the package. The parent of that is the base namespace, whose parent is the global environment.
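
This chain can be walked by hand. A sketch using the stats package, which is attached by default (any attached package with a namespace would do); asNamespace and .BaseNamespaceEnv are part of base R:

```r
ns      <- asNamespace("stats")   # namespace:stats
imports <- parent.env(ns)         # holds the package's imports
base_ns <- parent.env(imports)    # the base namespace
top     <- parent.env(base_ns)    # the global environment

identical(base_ns, .BaseNamespaceEnv)   # TRUE
identical(top, globalenv())             # TRUE

# The attached package environment is a separate object holding the exports
pkg <- as.environment("package:stats")
identical(pkg, ns)                            # FALSE
exists("sd", envir = pkg, inherits = FALSE)   # TRUE
```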

We can see what namespaces are loaded using loadedNamespaces.

> loadedNamespaces()
[1] "base"      "graphics"  "grDevices" "methods"   "stats"     "tools"    
[7] "utils"

What if the same name is used in multiple environments? In general, R walks up the chain of environments and uses the first binding for a symbol it finds. There's one twist: when a symbol appears in function-call position, R skips bindings that aren't functions and keeps looking. Here we try to mask the mean function, but R can still find it, knowing that we're trying to apply a function.

> z = list(mean='fluffernutter')
> attach(z)
> mean
[1] "fluffernutter"
> mean(c(1,2,3,4))
[1] 2.5
> detach(z)
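
The same rule applies to ordinary local variables, and get can be asked to apply it explicitly through its mode argument. A sketch that avoids attach by using local() (the name res is made up):

```r
res <- local({
  mean <- "fluffernutter"                  # a non-function binding named mean
  list(
    as_value = mean,                       # plain lookup finds the string
    as_call  = mean(c(1, 2, 3, 4)),        # call position skips non-functions
    as_fn    = get("mean", mode = "function")
  )
})

res$as_value                       # "fluffernutter"
res$as_call                        # 2.5
identical(res$as_fn, base::mean)   # TRUE
```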

We can mask a function with another function. Now, the mean of any list of numbers is "flapdoodle".

> z = list(mean=function(x){ return("flapdoodle") })
> attach(z)
The following object(s) are masked from 'package:base':
    mean
> mean(c(4,5,6,7))
[1] "flapdoodle"

The double-colon operator will let us specify which mean function we want. And, if you like to break the rules, the triple-colon operator lets you reach inside namespaces and touch private non-exported elements.

> base::mean(c(6,7,8,9))
[1] 7.5
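
A quick sketch of the difference between the two operators, using t.test.default, an S3 method that the stats namespace registers but does not export:

```r
# Exported names are reachable with ::
stats::sd(c(6, 7, 8, 9))

exports <- getNamespaceExports("stats")
"sd" %in% exports               # TRUE
"t.test.default" %in% exports   # FALSE, internal to the namespace

# ::: reaches into the namespace anyway (fine for exploring, fragile in real code)
is.function(stats:::t.test.default)   # TRUE
```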

So, there you have two fairly advanced language features built on the simple abstraction of environments. Thrown in for free is a nice look at R's functional side.

Is that everything you wanted to know about environments but were afraid to ask? Be warned that I'm just figuring this stuff out myself. If I've gotten anything bass-ackwards, please let me know. There's more information below, in case you can't get enough.

More Information

6 comments:

  1. Hey, thanks for this post. Your counter inspired me to write this: http://mickeymousemodels.blogspot.com/2011/06/little-r-counter.html

  2. @ALT, I like your improvement on my counter example. It takes a step further in the direction of implementing OO as a special case of closures.

    For everyone else, there's a lot of neat R stuff on mickeymousemodels.

  3. For a more thorough description of how R environments work, check out Suraj Gupta's excellent article, How R Searches and Finds Stuff.

  4. See also: Luke Tierney's article Name space management for R from R News, June 2003

  5. Well written blogpost. Very informative!
    Thanks for posting, Christopher.

  6. Nice job, this finally made me understand. Thank you.
