Sunday, April 29, 2012

The Scalable Adapter Design Pattern for Interoperability

When you're wrestling with a gnarly problem, it's interesting to compare notes with others who've faced the same dilemma. Having worked on an interoperability framework called Gaggle, I had a feeling of familiarity when I came across this well-thought-out paper:

The Scalable Adapter Design Pattern: Enabling Interoperability between Educational Software Tools, Harrer, Pinkwart, McLaren, and Scheuer, IEEE Transactions on Learning Technologies, 2008

The paper describes a design pattern for getting heterogeneous software tools to interoperate by exchanging data. Each application is augmented with an adapter that interacts with a shared representation. This hub-and-spoke model makes sense because it reduces the effort from writing n(n-1) adapters to connect every pair of applications to writing just one adapter per application.
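
A rough sketch of the shape of that arrangement (the interface names below are mine, not the paper's): each tool implements a single adapter against the shared representation, so adding a sixth tool means writing one more adapter rather than five more pairwise translators.

    // Illustrative TypeScript only -- not the paper's actual types. The point
    // is the topology: every tool talks to the shared representation (the
    // hub), never directly to another tool's internal format.
    interface SharedRepresentation {
      get(path: string[]): unknown;
      put(path: string[], value: unknown): void;
    }

    // Each of the n applications supplies exactly one of these, giving n
    // adapters in total instead of n(n-1) pairwise translators.
    interface ApplicationAdapter {
      // Push the application's internal state into the shared representation.
      publish(shared: SharedRepresentation): void;
      // Refresh the application's internal state from the shared representation.
      refresh(shared: SharedRepresentation): void;
    }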

Scalability here refers to the Composite design pattern, which implements what I would call a more general concept: hierarchically structured data. If you've ever worked with large XML documents, calling them scalable might seem like an overstatement, but I see their point. XML nicely represents small objects, like events, as well as moderately sized data documents. The same can be said of JSON.

Applications remain decoupled from the data structure, with an adapter mediating between the two. The adapter also provides events when the shared data structure is updated. A nice effect of the hierarchical data structure is that client applications can register their interest in receiving events at different levels of the tree structure.

The Scalable Adapter pattern combines well-established patterns - Adapter, Composite, and Publish-Subscribe - yielding a flexible way for arbitrary application data to be exchanged at different levels of granularity.
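
To make the combination concrete, here is a rough sketch of a composite data node with publish-subscribe behavior: a client can listen at a leaf, a subtree, or the root, and changes bubble up the hierarchy. The class and names are my own illustration, not the paper's code.

    // Sketch of a composite node with publish-subscribe semantics. Events
    // raised on a child bubble up toward the root, so a subscriber at any
    // level of the hierarchy sees changes within its subtree.
    type Listener = (path: string[], value: unknown) => void;

    class DataNode {
      private children = new Map<string, DataNode>();
      private listeners: Listener[] = [];
      private value: unknown;

      constructor(private name: string, private parent?: DataNode) {}

      child(name: string): DataNode {
        let node = this.children.get(name);
        if (!node) {
          node = new DataNode(name, this);
          this.children.set(name, node);
        }
        return node;
      }

      subscribe(listener: Listener): void {
        this.listeners.push(listener);
      }

      set(value: unknown): void {
        this.value = value;
        this.notify([this.name], value);
      }

      // Notify local subscribers, then bubble the event toward the root.
      private notify(path: string[], value: unknown): void {
        this.listeners.forEach(l => l(path, value));
        if (this.parent) {
          this.parent.notify([this.parent.name, ...path], value);
        }
      }
    }

    // A client interested in anything under "experiment1" registers once:
    const root = new DataNode("root");
    root.child("experiment1").subscribe((path, v) =>
      console.log(`changed: ${path.join("/")} =`, v));
    root.child("experiment1").child("samples").set(["s1", "s2"]);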

The main difference between Scalable Adapter and Gaggle is that Gaggle focused on message passing rather than centrally stored data. The paper says, "it is critical that both the syntax and semantics of the data represented in the composite data structure...", but the authors don't really address what belongs in the "Basic Element" - the data to be shared. Gaggle solves this problem by explicitly defining a handful of universal data structures. Applications are free to implement their own data models, achieving interoperability by mapping (also in an adapter) their internal data structures onto the shared Gaggle data types.
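
Here is a rough sketch of that mapping idea. The types and function names are invented for illustration; Gaggle's actual universal types are things like name lists, matrices, and networks.

    // The application keeps its own data model; its adapter translates into
    // a small set of shared, universal types before broadcasting, and back
    // again on receipt. All names below are illustrative only.
    interface Gene { id: string; symbol: string; expression: number; }

    // One of a handful of shared, universal data types.
    interface NameList { species: string; names: string[]; }

    // One tool's adapter maps its internal model onto the shared type...
    function toNameList(genes: Gene[], species: string): NameList {
      return { species, names: genes.map(g => g.symbol) };
    }

    // ...and another tool's adapter maps the shared type back into its own model.
    function selectGenes(internalModel: Map<string, Gene>, list: NameList): Gene[] {
      return list.names
        .map(name => internalModel.get(name))
        .filter((g): g is Gene => g !== undefined);
    }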

The Scalable Adapter paper breaks the problem down systematically in terms of design patterns, while Gaggle was motivated by the software engineering strategies of separation of concerns and parsimony, plus the linguistic concept of semantic flexibility. It's remarkable that the two systems worked out quite similarly, given the different domains they were built for.

Friday, April 27, 2012

Sage Bionetworks Synapse

Michael Kellen, Director of Technology at Sage Bionetworks, is trying to build a GitHub for science. It's called Synapse, and Kellen described it in a talk this past weekend at the Sage Bionetworks Commons Congress 2012: 'Synapse' Pilot for Building an 'Information Commons'.

To paraphrase Kellen's intro:

Science works better when people build on each other's work. Every great advance is preceded by a multitude of smaller advances. It's no accident that the invention of the printing press and the emergence of the first scientific journals coincide with the many great scientific discoveries of the Age of Enlightenment. But scientific journals are stuck in a paradigm revolving around the printing press. In other domains, namely open source software, people are more radically reinventing systems for sharing information with each other. GitHub is a collaborative environment for the domain of software. Synapse aims to be a similar environment for medical and genomic research.

The Synapse concept of a project packages together data and the code to process it. I tried to download the R script shown in the contents and couldn't, either because I'm a knucklehead or because Synapse is a work in progress. On the plus side, they give you a helpful cut-n-paste snippet of R code in the lower right corner to access the project through their R API. When this is fully implemented, it could provide a key piece of computing infrastructure for reproducible data-driven science.

Sage intends to explore ways of connecting to traditional scientific journals. Picture figures that link to interactive visualizations or computational methods that link to code. I'm a big fan of the "live document" concept and it would be great to see journal articles evolve in that direction.

An unintended consequence of next-generation sequencing (NGS), Robert Gentleman points out, is that the data is too big for existing pipes. Any concept of a GitHub for science will have to incorporate processing biological data in the cloud. I could imagine a Synapse project containing data sets, code, and a recipe for standing up an EC2 instance (or several). At a click, a scripted process would run, bootstrapping the machines, installing software and dependencies, running a processing pipeline, and visualizing the results in a browser. How would that be for reproducible science?
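
Purely as a thought experiment, such a recipe might be little more than a declarative description that a launcher could act on. Everything in this sketch is made up; none of it corresponds to a real Synapse or AWS API.

    // Hypothetical shape of a reproducible-analysis recipe a project could
    // carry: data + code + a description of the machines to run it on.
    interface AnalysisRecipe {
      dataSets: string[];      // identifiers of shared data sets
      machineImage: string;    // base machine image to boot
      instanceCount: number;   // how many instances to stand up
      dependencies: string[];  // software to install at boot time
      pipeline: string[];      // commands to run, in order
      resultsViewer: string;   // URL template for browsing the output
    }

    const exampleRecipe: AnalysisRecipe = {
      dataSets: ["example-study/expression-matrix"],
      machineImage: "some-base-image",
      instanceCount: 4,
      dependencies: ["R", "Bioconductor"],
      pipeline: ["Rscript normalize.R", "Rscript fit_model.R"],
      resultsViewer: "http://example.org/results/{runId}",
    };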

Michael Kellen's blog has a bunch of interesting stuff about why building a GitHub for biology is more fun than selling sheets. I bet it is.

Wednesday, April 11, 2012

Interactive Dynamics for Visual Analysis

“To be most effective, visual analytics tools must support the fluent and flexible use of visualizations at rates resonant with the pace of human thought.”

This comes from a recent paper by Jeffrey Heer of Stanford's Visualization Group and Ben Shneiderman, titled Interactive Dynamics for Visual Analysis, in ACM Queue, following up on another Queue article from 2010 by Jeffrey Heer, Michael Bostock, and Vadim Ogievetsky, A Tour Through the Visualization Zoo. The Interactive Dynamics article categorizes aspects of visual analysis that deserve careful consideration when designing visual analysis tools.

Taxonomy of interactive dynamics for visual analysis

  • Data and View Specification
    • Visualize
    • Filter
    • Sort
    • Derive (values or models)
  • View Manipulation
    • Select
    • Navigate (zoom, drill-down)
    • Coordinate (linked views)
    • Organize (multiple windows)
  • Analysis Process & Provenance
    • Record
    • Annotate
    • Share
    • Guide

Heer and collaborators created a series of software libraries for interactive visualization: prefuse, flare, Protovis and D3. These frameworks are designed to "represent data to facilitate reasoning", "flexibly construct representations" and "enable representational shifts" or transformations.

Protovis, which I've fooled around with a bit, is a functional domain-specific language for data visualization. Its successor, D3 (Data-Driven Documents), is an adaptation that increases performance and expressivity by making more direct use of the model (the DOM) inherent in the browser.
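
For a feel of the difference, here is a tiny sketch of D3's data-to-DOM binding, written in TypeScript; the selector and data are placeholders.

    import * as d3 from "d3";

    // One div per data value, with width proportional to the value. D3 binds
    // the data array to a selection of DOM elements and creates the missing
    // elements via enter().
    const values = [4, 8, 15, 16, 23, 42];

    d3.select("#chart")
      .selectAll("div")
      .data(values)
      .enter()
      .append("div")
      .style("width", d => `${d * 10}px`)
      .text(d => String(d));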

Jeffrey Heer spoke recently at the University of Washington (video available) about his research, citing as an early influence a 1962 paper by John W. Tukey titled The Future of Data Analysis.

In some ways, Heer nicely echoes themes from Hadley Wickham's talk on Engineering Data Analysis.

Data analysis pipeline:

Heer outlines an iterative process with these steps: Acquisition > Cleaning > Integration > Visualization > Modeling > Presentation > Dissemination.

Also from the Interactive Dynamics paper:

“In concert with data-management systems and statistical algorithms, analysis requires contextualized human judgments regarding the domain-specific significance of the clusters, trends, and outliers discovered in data.”

I've been a Jeffrey Heer fan-boy for some time, starting with his 2006 paper, Software Design Patterns for Information Visualization. Those interested in learning data science skills could do a lot worse than study Jeffrey Heer's work.

More papers