Hadley Wickham gave a Google Tech Talk a couple of weeks back titled "Engineering Data Analysis (with R and ggplot2)". These are my notes.
The data analysis cycle is to iteratively transform, visualize and model. Leading into the cycle is data access and the output of the process is knowledge, insight and understanding which can be communicated to others. Transforming the data is almost always necessary to bring data into a workable form. Visualization and modeling have something of a duality where visualization is good at revealing the unexpected but has problems scaling. Models scale better, but will only find expected relationships. A larger cycle comes about when answers to one question lead to more questions.
Hadley makes a case for doing data analysis in code rather than in GUIs, and for R in particular. Working in a programming language gives you a means of:
- reproducibility
- automation
- version control
- communication
Advantages of R:
- open source
- runs anywhere
- well established community
- huge library of packages
- connectivity to other languages
Downsides of R are its learning curve, its strangeness relative to other programming languages, its lack of programming infrastructure and the prickliness of the community. R scales well up to about a million observations. How to scale the interactive analysis cycle up to billions of observations is an open question. Programming infrastructure is an area where programmers can contribute.
DSLs (domain-specific languages) help express and think clearly about common problems in data analysis. Hadley views his libraries as DSLs within R for the phases of the analysis cycle. For visualization, there's ggplot2, where the DSL idea aligns nicely with its philosophy as a grammar of graphics. R's model formula is the DSL for modeling, and plyr is the DSL for data transformation.
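To make the DSL idea a bit more concrete, here is a minimal sketch of my own (not code from the talk). It assumes the ggplot2 package is installed and uses R's built-in mtcars data set, which I picked only for illustration.

```r
library(ggplot2)

# Visualization DSL: map columns to aesthetics, then add a layer of points.
# The plot object is built here; printing p would draw it.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Modeling DSL: the formula mpg ~ wt reads "mpg as a function of wt".
fit <- lm(mpg ~ wt, data = mtcars)

# Intercept and slope of the fitted line.
coef(fit)
```

In both cases you describe what you want (which variables play which role) and leave the mechanics to the library.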
The four key verbs of data transformation are:
- subset
- mutate
- arrange
- summarize
...plus...
- group by
- join
Data can be divided by subsetting or filtering; mutated, for example adding new columns to a table that are functions of other columns; rearranged or sorted; and summarized, condensing a data set down to a smaller number of values. These actions can be combined with a group by operator. Finally, data sets can be joined to other related data sets.
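As a rough sketch of how these verbs look in plyr, here is a toy example of my own. The data frames and column names are invented for illustration and are not the data from the talk. Note that subset comes from base R, while the other functions come from plyr.

```r
library(plyr)

# Hypothetical toy data, invented for this example.
deaths <- data.frame(
  state = c("A", "A", "B", "B"),
  year  = c(2007, 2008, 2007, 2008),
  count = c(10, 14, 30, 24)
)
states <- data.frame(state = c("A", "B"), pop = c(1000, 2000))

# subset: keep only the rows matching a condition.
recent <- subset(deaths, year == 2008)

# mutate: add a column that is a function of other columns.
with_share <- mutate(deaths, share = count / sum(count))

# arrange: sort rows, here by descending count.
sorted <- arrange(deaths, desc(count))

# summarise: condense the whole table down to a single value.
total <- summarise(deaths, total = sum(count))

# group by: apply summarise separately within each state.
by_state <- ddply(deaths, .(state), summarise, total = sum(count))

# join: attach related data from another table (a left join by default).
joined <- join(by_state, states, by = "state")
```

The grouped version is where this pays off: the same summarise call works on the whole table or per group, depending only on whether it is wrapped in ddply.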
The second half of the talk is a case study, dissecting a set of cause-of-death statistics from the Mexican government. Finally, Hadley makes a familiar-sounding point about the tension between making new things and making well-engineered, user-friendly software that does old things.