Digithead's Lab Notebook

Sunday, October 07, 2012

Hacking education

On-line education is blossoming in a virtuous cycle of innovation, threatening disruption to expensive traditional universities and opening access to higher learning for anyone with an internet connection and a curious mind.

For developers, the on-line education boom means rich opportunities to learn, to create learning environments, and to analyze the data collected in the process of running massive open online courses - hacking education itself.

Coursera, an early leader, signed up 17 more universities on top of the 12 that joined in July and is now offering 198 classes from 33 schools.

Founders Daphne Koller & Andrew Ng published an article in Forbes, Log On and Learn: The Promise of Access in Online Education. Koller, whose "Probabilistic Graphical Models" course strained my few remaining brain cells, spoke at TED on What we’re learning from online education.

An Old Scholar - Salomon Koninck

A Seattle Times piece on the recent wave of educational startups featured enthusiastic comments from Ed Lazowska at UW and Vitalina Komashko at Sage Bionetworks.

Why some of the best universities are giving away their courses: Each has answers. But basically it comes down to these: To serve the greater good. To win a public-relations race. And, most especially, to enhance reputations.

On-line education is a perfect complement to the hotness that is data science. Not only is it a means for transferring trendy skills, but the data collected in the process should have amazing things to teach us about learning.

Not all the action is in cyber-space, either. If you have a hungry mind that needs feeding, you can:

study for Coursera classes at meetups in 928 cities around the world
study improvisation with Jazz vibraphonist Gary Burton (Your neighbors will love it!)
go to Hacker School, Startup School or hacker dojo
attend Biohacker Boot Camp at genspace
learn about the Design of Computer Programs from Google's Director of Research

The only limit is your own bandwidth. That, and the tolerance of your spouse.

The flashy technology is new, but the ideal of open access to knowledge has been around for a long time. The Seattle Times quotes Dave Cillay, executive director of WSU Online, "We've had MOOCs and open learning resources for centuries. They're called libraries."

Echoing Carnegie and his libraries, the Gates Foundation announced in June $9 million in grants for on-line secondary education, including a million to the MIT/Harvard venture edX.

I remember poking my head into a cinder-block schoolhouse in a tiny village in Laos, back in my traveling days. There were two books: one that appeared to be equal parts farming manual and government propaganda and another of Buddhist scripture. The potential to mitigate that kind of information-poverty in the remote corners of the world is one of the most exciting aspects of on-line education.

Building on previous innovation is key to progress, especially in science and technology. Hacking education will help information flow faster, getting people to the frontier where they can start pushing the envelope and maybe make the world a slightly better place. That's why these are exciting times for those that love learning.

Still don't believe me that there's a lot going on? Here's more Ed-tech news:

Scaling higher education
Economists Tyler Cowen and Alex Tabarrok introduced Marginal Revolution University. The first course offered at MRUniversity is Development Economics.
An NPR story on Coursera: Online Education Grows Up, And For Now, It's Free
College 2.0: Inside the Coursera Contract: How an Upstart Company Might Profit From Free Courses
The School of Data is a collaboration between the Open Knowledge Foundation and P2PU (Peer to Peer University)
University of the People: A Free Online University Tests the Waters
CodeNow teaches coding to kids in DC. "Coding is the new literacy. It gives individuals the power to innovate and create."
Three Tips For a New Wave of Ed-Tech Entrepreneurs from the founder of ShowMe

Monday, October 01, 2012

The future of dev tools

Classic developer tools have a timeless quality to them. Greybeards and college kids alike happily hack away in Emacs, vi, Bash and a host of other tools older than many of their users. It's surprisingly hard to improve upon these old tools.

But interest in designing new developer tools finally seems to be emerging. Rather than replacing powerful and expressive textual interfaces with pretty but limiting graphical interfaces, these new tools augment the command-line experience with immediate visual feedback.

Light Table

Chris Granger's re-imagining of the IDE, Light Table, seeks to enhance the developer's "ability to traverse abstraction". Slick demo videos here and here show Light Table's good looks, but also its presentation of functions as the primary unit of abstraction, the ability to show values propagating through code and its handy access to documentation.

During Granger's talk at StrangeLoop, one questioner raised the issue of whether dynamic languages lead to a different style of interaction between tool and developer. Static languages lend themselves to autocompletion and refactoring tools as in Eclipse, whereas dynamic languages emphasize the REPL and perhaps metaprogramming and DSLs.

Though originally funded through a Kickstarter project, Light Table is closed-source, at least for now. It's scheduled for release next May with support for Clojure, Javascript, and Python.

Neo4j

Modern browsers provide capabilities like accelerated graphics, advanced page layout and process isolation, enabling environments like Neo4j's console demo that combine command shells with interactive graphics.

As far as I can tell, that's just a demo. The real admin console for Neo4j doesn't suck either, but requires tabbing between command shell and graph visualization.

Dev tools in browsers

Browsers keep getting better. In-browser REPLs exist for numerous languages: clojure, haskell, javascript and others. These are typically targeted at language learners, as is Chas Emerick's Clojure Atlas, a visual and conceptual interface for navigating Clojure's documentation. But I expect more advanced tools will find their way into the browser over time. Fogus's Himera project shows one way forward, delegating some of the heavy lifting to the server.

Amazingly, R Studio Server puts a full IDE into the browser, with the help of QTWebkit, GWT and some grand-master level wizardry. With Knitr integration, R Studio approaches live document capabilities.

Sublime Text

On the desktop, the Sublime Text editor picks up where TextMate left off. Sublime can use syntax files from TextMate, which means it already supports your favorite language, plus it's programmable in Python.

Xiki

I heard about Xiki (for executable wiki) from Tom Henderson, who owes me a book, by the way. An impressive demo video shows off Xiki's merger of advanced UI and command shell.

Xiki integration with Sublime is progressing.

Design principles for programming tools

Bret Victor, known for his design work at Apple and lots of other cool things, has thought deeply about using technology to help people “learn, understand and create”. Victor followed up the inspiring talk Inventing on Principle with his presentation at StrangeLoop on design principles for programming tools, captured in the essay Learnable Programming.

Tools should enable the programmer to:

Read the vocabulary
Follow the flow
See the state
Create by reacting
Create by abstracting

What this means is roughly this: Quick access to docs, often triggered by mouse-over, simplifies reading. Visibility into flow and state increases comprehension. “Dumping the parts bucket onto the floor” encourages mixing and matching and provides visual prompting emphasizing recognition over recall. Abstractions are created by starting concrete and generalizing.

Code is written for a dual audience: machine and human reader, requiring a difficult combination of precision and clarity. As the tools get smarter, the conversation between machine and programmer will get richer. The common thread here is supporting the programmer without imposing limitations, providing an experience more like a blank page and a box of sharp pencils than a menu of canned options, helping to create what Bret Victor calls, “environments that function as an external imagination”.

Monday, September 24, 2012

Computing kook density in R

Do you ever see strange lights in the sky? Do you wonder what really goes on in Area 51? Would you like to use your R hacking skills to get to the bottom of the whole UFO conspiracy? Of course, you would!

UFO data from infochimps is the focus of a data munging exercise in Chapter 1 of Machine Learning for Hackers by Drew Conway and John Myles White, two social scientists with a penchant for statistical computing.

The exercise starts with slightly messy data and proceeds through cleaning up some dates. I think I slightly improved on the code given in the book. Have a look (gist:3775873) and see if you agree.

Dividing the data up by state (for sightings in the US), I noticed something funny. My home state of Washington has a lot of UFO sightings. Normalizing by population, this becomes even more pronounced.

I learned a neat trick from the chapter. The transform function helps to compute derived fields in a data.frame. I use transform to compute UFO sightings per capita, after merging in population data by state from the 2000 census.

sightings.by.state <- transform(
 sightings.by.state,
 state=state, state.name=name,
 sightings=sightings,
 sightings.per.cap=sightings/pop)
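
For anyone who wants to try the merge-then-transform pattern without downloading anything, here is a minimal sketch. The two small data frames, their state codes and every number in them are invented for illustration; they are not the infochimps or census data.

```r
# Toy inputs: every value below is invented purely for illustration
counts <- data.frame(
  state = c("AA", "BB", "CC"),
  sightings = c(50, 200, 10),
  stringsAsFactors = FALSE
)
population <- data.frame(
  state = c("AA", "BB", "CC"),
  pop = c(1000000, 8000000, 100000),
  stringsAsFactors = FALSE
)

# merge() joins the two data frames on their shared 'state' column
sightings.by.state <- merge(counts, population, by = "state")

# transform() evaluates its arguments with the columns in scope and
# returns a copy of the data frame with the derived column added
sightings.by.state <- transform(
  sightings.by.state,
  sightings.per.cap = sightings / pop
)

# rank states from most to fewest sightings per resident
ranked <- sightings.by.state[
  order(sightings.by.state$sightings.per.cap, decreasing = TRUE), ]
print(ranked)
```

In the toy data, the state with the fewest raw sightings ends up on top once population is taken into account, which is the same effect that makes normalizing by population worthwhile.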

Creating the plot above, with a pile of ggplot code, we see that Washington state really is off the deep end when it comes to UFO sightings. Our northwest neighbors in Oregon come in second. I asked a couple fellow Washington residents what they thought. The first reasonably conjectured a relationship to the number of air bases. The second Washingtonian gave the explanation I favor: "High kook density".

If you'd like to explore the data, it's from Chapter 1 of Machine Learning for Hackers. Data and code can be found in John Myles White's github repo.

Thursday, September 13, 2012

OO in R

"Is there a package for obfuscating code in #rstats?", someone asked. "The S4 object system?!" came the snarky reply. If you're smiling right now, you know that it wouldn't be funny if it weren't at least a little bit true.

Options: S3, S4 or R5?

There can be little doubt that object oriented programming in R is the cause of some confusion. We'll look at S4 classes more closely in a minute, but be warned that S4 classes are just one of at least three object systems available to the R programmer:

S3: simple and lightweight
S4: formal classes implemented by the methods package
R5: Reference classes

It's not super clear when to use which, at least not to me. It seems to depend strongly on style and personal preference. The Bioconductor folks, for example, make heavy use of S4 classes. Google, on the other hand, advises to "avoid S4 objects and methods when possible".

Here's the way it looks to me. S3 classes feel a bit like Javascript classes - easy, loose and informal. S4 classes are rigid, verbose and harder to understand. But, they offer a better separation between interface and implementation, along with some advanced features like multiple dispatch, validation and type coercion. Reference classes (aka R5) encapsulate mutable state and look more like familiar Java-style classes. They're new and pass-by-reference can violate expectations of R users.
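
To make "easy, loose and informal" concrete, here is a rough S3 sketch. The names new_person and describe are made up for this illustration and don't come from any package.

```r
# An S3 object is an ordinary list tagged with a class attribute
new_person <- function(name, age = NA_real_) {
  structure(list(name = name, age = age), class = "person")
}

# A generic is a plain function that hands off to UseMethod()
describe <- function(x, ...) UseMethod("describe")

# Methods are found by naming convention: generic.class
describe.person <- function(x, ...) paste("Person named", x$name)
describe.default <- function(x, ...) "Something else"

ada <- new_person("Ada")
describe(ada)   # dispatches to describe.person
describe(42)    # falls through to describe.default

# Nothing stops you from attaching fields the constructor never mentioned
ada$instrument <- "Theremin"
```

There is no class definition anywhere, just a naming convention and an attribute. That is the appeal and the hazard.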

An S4 class example

Now, let's return to S4 classes with a simple example. First, we define a class to represent people.

# define an S4 class for people
setClass(
  "Person",
  representation(name="character", age="numeric"),
  prototype(name=NA_character_, age=NA_real_)
)

A person has a name and an age, which default to NAs of their respective types - character string and numeric. For the sake of demonstrating polymorphism, let's define a couple subclasses.
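
As an aside, here is a hedged sketch of the validation feature mentioned earlier. The class name CheckedPerson and the rule about negative ages are assumptions made up for this illustration, kept separate from the Person class so nothing above changes.

```r
library(methods)

# Hypothetical class, named so it does not clash with Person
setClass(
  "CheckedPerson",
  representation(name = "character", age = "numeric"),
  prototype(name = NA_character_, age = NA_real_),
  validity = function(object) {
    # return TRUE when valid, or a message describing the problem
    if (length(object@age) == 1 && !is.na(object@age) && object@age < 0) {
      "age must not be negative"
    } else {
      TRUE
    }
  }
)

# slots left unset take their prototype defaults
nobody <- new("CheckedPerson")
is.na(nobody@name)

# a wrong slot type or a failed validity check makes new() raise an error
bad.type <- tryCatch(new("CheckedPerson", age = "old"),
  error = function(e) "rejected")
bad.age <- tryCatch(new("CheckedPerson", name = "Benjamin", age = -1),
  error = function(e) "rejected")
```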

# define subclasses for different types of people
setClass("Musician",
  representation(instrument="character"),
  contains="Person")

setClass("Programmer",
  representation(language="character"),
  contains="Person")

There's no reason not to write normal R functions that take S4 classes as arguments. Polymorphism is called for when a method has different implementations for different classes. In that case, we declare a generic method.

# create a generic method called 'talent' that
# dispatches on the type of object it's applied to
setGeneric(
  "talent",
  function(object) {
    standardGeneric("talent")
  }
)

The following code implements the talent method for our two subtypes of Person, each with a talent for something.

setMethod(
  "talent",
  signature("Programmer"),
  function(object) {
    paste("Codes in", 
      paste(object@language, collapse=", "))
  }
)

setMethod(
  "talent",
  signature("Musician"),
  function(object) {
    paste("Plays the",
      paste(object@instrument, collapse=", "))
  }
)

Now, let's make some talented people.

# create some talented people
donald <- new("Programmer",
  name="Donald Knuth",
  age=74,
  language=c("MMIX"))

coltrane <- new("Musician",
  name="John Coltrane",
  age=40,
  instrument=c("Tenor Sax", "Alto Sax"))

miles <- new("Musician",
  name="Miles Dewey Davis",
  instrument=c("Trumpet"))

monk <- new("Musician",
  name="Thelonious Sphere Monk",
  instrument=c("Piano"))

talent(miles)
[1] "Plays the Trumpet"

talent(donald)
[1] "Codes in MMIX"

talent(coltrane)
[1] "Plays the Tenor Sax, Alto Sax"
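
Calling talent on a plain Person would fail, since no method is defined for that class. One way to handle that, sketched here as an assumption rather than part of the original gist, is a fallback method on the parent class. The classes are re-declared so the block runs on its own; joe and the "No known talent" message are invented for the example.

```r
library(methods)

# Re-declared with the same definitions so the sketch is self-contained
setClass(
  "Person",
  representation(name = "character", age = "numeric"),
  prototype(name = NA_character_, age = NA_real_)
)
setClass("Musician",
  representation(instrument = "character"),
  contains = "Person")

setGeneric("talent", function(object) standardGeneric("talent"))

# Fallback: used for any Person that has no more specific method
setMethod("talent", signature("Person"), function(object) {
  "No known talent"
})

# More specific method: wins for Musician objects
setMethod("talent", signature("Musician"), function(object) {
  paste("Plays the", paste(object@instrument, collapse = ", "))
})

joe <- new("Person", name = "Joe")
miles <- new("Musician", name = "Miles Dewey Davis", instrument = "Trumpet")

talent(joe)         # handled by the Person method
talent(miles)       # handled by the Musician method
is(miles, "Person") # a Musician is still a Person
```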

Mutability

One common stumbling block with S4 classes concerns changes in state. For instance, we might want to give our hard-working employees a raise.

setClass("Employee",
  representation(boss="Person", salary="numeric"),
  contains = "Person"
)

setGeneric(
  "raise",
  function(object, percent=0) {
    standardGeneric("raise")
  }
)

setMethod(
  "raise",
  signature("Employee"),
  function(object, percent=0) {
    # percent is in percentage points, so 15 means a 15% raise
    object@salary <- object@salary * (1 + percent/100)
    object
  }
)

True to its functional heritage, R deals with immutable values. Changes in state happen by making new objects. The trick is to return the new object from the mutator methods and capture it on the way out.

smithers <- new("Employee",
  name="Waylon Smithers",
  boss=new("Person",name="Mr. Burns"),
  salary=100000)

# doesn't work?!?!
raise(smithers, percent=15)
smithers@salary
[1] 100000

Setting a new salary creates a new value. Notice that we return the modified object from the raise function. Don't forget to catch it.


# remember to reassign smithers to the new value
smithers <- raise(smithers, percent=15)
smithers@salary
[1] 115000
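
For contrast, here is roughly how the same raise might look with a reference class, where the object really is modified in place. EmployeeRef and its fields are hypothetical names for this sketch, separate from the S4 Employee class.

```r
library(methods)

# Hypothetical reference class, independent of the S4 Employee class
EmployeeRef <- setRefClass(
  "EmployeeRef",
  fields = list(name = "character", salary = "numeric"),
  methods = list(
    raise = function(percent = 0) {
      # <<- assigns to this object's field, changing it in place
      salary <<- salary * (1 + percent / 100)
      invisible(.self)
    }
  )
)

smithers.ref <- EmployeeRef$new(name = "Waylon Smithers", salary = 100000)

# no reassignment needed: the object itself changes
smithers.ref$raise(percent = 15)
smithers.ref$salary

# the flip side: a second name refers to the same object, not a copy
alias <- smithers.ref
alias$raise(percent = 10)
smithers.ref$salary
```

The second raise, made through alias, shows up in smithers.ref as well. That is the pass-by-reference behavior that can surprise R users who expect values to be copied.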

Multiple Inheritance

Through the magic of multiple inheritance, the lowly Code Monkey is both a programmer and an employee. Just set the contains value to indicate its two parent classes.


setClass("Code Monkey",
  contains=c("Programmer","Employee"))

setMethod(
  "talent",
  signature("Code Monkey"),
  function(object) {
    paste("Codes in",
      paste(object@language, collapse=", "),
        "for", object@boss@name)
  }
)

chris <- new("Code Monkey",
  name="Chris",
  age=29,
  boss=new("Person", name="The Man"),
  salary=2L,
  language=c("Java", "R", "Python", "Clojure"))


talent(chris)
[1] "Codes in Java, R, Python, Clojure for The Man"

So, there you have it - encapsulation, polymorphism and inheritance in S4 classes. Complete code for this example is in gist:3670578.

OO in R resources

It's lucky that there are loads of places to go to learn about S4 classes.

First, look at Hadley Wickham's devtools wiki which has a boatload of information for R package developers, in addition to info on S3, S4 and reference classes. Also from Hadley is a slide deck on Object Oriented Programming.
S4 Classes in 15 pages, more or less
How S4 Methods Work by John Chambers
The R-docs for the Methods package are comprehensive. See:
...wait.. just do help(package="methods")
Dirk Eddelbuettel and Romain Francois gave a Google TechTalk titled Integrating R with C++: Rcpp, RInside, and RProtobuf. It covers integration between R and C++, but also has some good information on OO programming in R, particularly starting around the one hour mark (1:00). Romain Francois' slide deck Object Oriented Design(s) in R is really good.
Inside-R's references on the class systems:
A section of Introduction to R covers Classes, generic functions and object orientation with S3 classes, The classic bank-account example in the section on Scope
R5 Reference classes
Slides from Martin Morgan on Reference Classes
R for Programmers, by Norman Matloff of UC Davis, or buy Norm's book: The Art of R Programming
A (Not So) Short Introduction to S4

Tuesday, August 28, 2012

The Joy of Clojure

Clojure is a modern Lisp dialect that runs on the JVM. Since its release in 2007, it's become quite popular with elite hackers of a certain persuasion. Michael Fogus and Chris Houser's Joy of Clojure is a thoroughly enjoyable bit of summer reading explaining the philosophy behind the language, showing how Clojure promotes simplicity and flexibility, deals with time and state, and generally brings fun back into programming.

Concurrency is one of the motivating factors behind Clojure. Rich Hickey, Clojure's creator, anchors his approach to sane management of state on the concept of immutability. The language encourages "immutability by default" with persistent collections - efficient implementations of list, vector, map and set that preserve historical versions of their state. When concurrency becomes necessary, Clojure provides some very modern constructs including STM (software transactional memory), agents similar to the actor model of Erlang, and atomic values.

Aside from the parentheses, one big difference between Clojure and Java (as well as fellow JVM resident, Scala) is dynamic typing. Entities in Clojure are typically modeled in terms of the basic data structures already mentioned, especially maps and vectors. In practice, this ends up looking like JSON or Javascript objects, which are just bags of properties. The book includes a basic implementation of prototype based inheritance as an example. I've always thought that was more appropriate to dynamic languages than building class hierarchies (as in Python and Ruby). You might go so far as to say that JSON is to Clojure what lists are to classic Lisp.

Clojure's interop with its host platform gracefully bridges the conceptual gap between Java and Lisp, enabling Clojure to call into Java code and making available the wealth of Java libraries. It's equally possible to expose Clojure APIs to Java code. Dynamically generating Java classes and proxies as well as creating and implementing interfaces on the fly leads to the belief in the Clojure community that "Clojure does Java better than Java." In spite of the differences in semantics, Clojure feels like a natural layer on top of Java, raising the levels of abstraction and dynamism while easing many pain-points that every Java programmer stumbles over.

The Clojure language is also branching out beyond the JVM. ClojureScript is a Clojure to JavaScript compiler. Another offshoot targets Microsoft .NET's CLR.

Aside from persistent data structures and Java interop, Clojure comes with a bunch of functional goodness built in. Clojure's multimethods put control of method dispatch into the programmer's hands. Macros let you construct your own flow-of-control structures, as demonstrated by do-until and unless examples in the book. Illustrating how Lisp and its macros can be used to construct little languages or DSLs, the book implements a mini-SQL interpreter.

One especially nice aspect of the book is the putting it all together sections, which cover examples like: lazy quicksort, A*, and a builder for chess moves. These longer examples are still bite sized, making them more easily digested than the extended case studies found in some books.

The Joy of Clojure is not an introductory book nor is it a language reference. It will appeal to the reader who already has some programming experience. It's a good idea to spend some time with the online tutorials first. Where the book is strongest is answering the why questions, getting you started learning how to think about programming in Clojure and showing you how Clojure changes the way you think.

Companies using Clojure:

Prismatic: a highly addictive social content recommendation engine whose backend pipeline and APIs are written mostly in Clojure. Prismatic's architecture is a great example of how to use machine learning and social graphs in a real application.
Climate corporation: data driven weather insurance
Relevance: a consultancy and home of Clojure/core.
Lonocloud

Popular Clojure projects and libraries:

Leiningen a build tool and dependency management system
Incanter a Clojure-based, R-like platform for statistical computing and graphics
Chris Granger's Light Table IDE
Noir web framework
Datomic: a distributed rethinking of the database
core.logic: Prolog-like logic programming for Clojure

Videos

Luckily for those wanting to learn more without leaving their hammock, there are lots of videos about Clojure. A lot of clear thinking about software, whether you do it in Clojure or not, can be found in Rich Hickey's talks.

More Clojure Stuff

Planet Clojure

Sunday, August 19, 2012

Scientific Python

Astronomer Joshua Bloom gave a talk titled Python as Super Glue for the Modern Scientific Workflow at SciPy2012. Bloom teaches Python for Scientific Computing at Berkeley (available as a podcast). Bloom showcased Pythonic contributions to work on Supernova and machine-learning in astronomy.

Python has solid support for data analysis and scientific computing, starting with Numpy for matrix manipulation and SciPy, which adds diff-eqs, optimization, statistics, etc. and matplotlib for graphics. I keep meaning to check out Pandas, scikit-learn, and Sage.

IPython keeps getting more impressive and appears to be evolving in the direction of Mathematica's Live Notebooks. I have a long-standing thing for mixing prose and code. Fernando Perez’s talk on IPython at SciPy2012 and the longer version IPython in-depth are in my viewing queue.

Part of the beauty of Python is its breadth as a general purpose programming language. Libraries for everyday programming tasks like web applications and database interaction are well developed in the Python world. There are good arguments in favor of DSLs for math and statistics, the approach embodied by R, Matlab, Mathematica and Julia. On the other hand, some may agree with John Cook, who puts it like this: "I’d rather do math in a general-purpose language than try to do general-purpose programming in a math language."

Data Mining and Machine-Learning in Time-Domain Discovery & Classification, Bloom and Richards
BigMACC: Bloom's machine classified catalog of variable stars
pydata.org
NetworkX, a Python network library

Sunday, August 05, 2012

Simplicity made easy

Rich Hickey, inventor of Clojure, gave a talk at last year's Strange Loop called Simple Made Easy. In it, he makes the case that Lispy languages and Clojure in particular provide tools for achieving modularity and simplicity. Simplicity Matters, Hickey's keynote at RailsConf in April of this year is a variant on the same themes. Here are my notes, either quoted verbatim or paraphrased.

“Simplicity is a prerequisite for reliability”
- Edsger Dijkstra

“Simplicity is the ultimate sophistication”
- Leonardo da Vinci

Simplicity vs complexity

Complexity comes from braiding together, interleaving or conflating multiple concepts. Complect (braid together) vs. compose (place together)

Simplicity is defined as having one role - one task, one concept, one dimension. Composing simple components is the way we write robust software. A good designer splits things apart. When concepts are split apart, it becomes possible to vary them independently, use them in different contexts or to farm out their implementation.

There are reasons why you have to get off of this list. But there is NO reason why you shouldn't start with it.

Abstraction for simplicity

Analyze programs by asking who, what, when, where, why and how. The more your software components can say, "I don't know, I don't want to know", the better.

what: operations - specify interface and semantics, don't complect with how; how can be somebody else's problem
who: data or entities
how: implementation, connected by polymorphism constructs
when and where: use queues to make when and where less of a problem
why: policy and rules of the application

Data representation

Data is really simple. There is not a lot of variation in the essential nature of data. There are maps, there are sets, there are linear sequential things... We create hundreds of thousands of variations that have nothing to do with the essence of the stuff and that make it hard to manipulate the essence of the stuff. We should just manipulate the essence of the stuff. It's not hard, it's simpler. Same goes for communications... (44:30)

Information is simple, represent data as data, use maps and sets directly (56:30)

(From the RailsConf talk:) Subsystems must have well defined boundaries and interfaces and take and return data as lists and maps (aka JSON). ReSTful interfaces are popular in the context of web services. Why not do the same within our programs?

Choosing simplicity

Simplicity is a choice, but it takes work. Simplicity requires vigilance, sensibility and care. Choose simple constructs. Simplicity enables change and is the source of true agility. Simplicity = opportunity. Go make simple things.