Tuesday, July 10, 2012

Katy Börner's Plug and Play Macroscopes

Two Cytoscape engineers pointed me towards Plug and Play Macroscopes by Katy Börner. The paper envisions highly flexible and configurable software tools for science, built on a plugin architecture, and is well worth reading if you're involved in building scientific software.

Decision making in science, industry, and politics, as well as in daily life, requires that we make sense of data sets representing the structure and dynamics of complex systems. Analysis, navigation, and management of these continuously evolving data sets require a new kind of data-analysis and visualization tool we call a macroscope...

A macroscope is a modular software framework enabling end users - biologists, physicists, social scientists - to assemble customized tools reusing and recombining data sources, algorithms and visualizations. These tools consist of reconfigurable bundles of software plug-ins, using the same OSGi framework as the Eclipse IDE. Once a component has been packaged as a plug-in, it can be shared and combined with other components in new and creative ways to make sense of complexity, synthesizing related elements, finding patterns and detecting outliers. Börner sees these software tools as instruments on par with the microscope and the telescope.

CIShell

These concepts are implemented in a framework called Cyberinfrastructure Shell (CIShell), whose source is in the CIShell repo on GitHub. The core abstraction inside CIShell is that of an algorithm: something that can be executed, might throw exceptions, and returns some data. Data is some object that has a format and key/value metadata.

public interface Algorithm {
  public Data[] execute() throws AlgorithmExecutionException;
}

public interface Data {
  public Dictionary getMetadata();
  public Object getData();
  public String getFormat();
}

Parenthetically, it's too bad there's no really universal abstraction for a function in Java... Callable, Runnable, Method, Function. In general, trying to wedge dynamic code into the highly static world of Java is not the most natural fit.

I'm guessing that integrating a tool involves wrapping its functionality in a set of algorithm implementations.
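To make that guess concrete, here is a minimal sketch of what wrapping a tool might look like. It is not taken from the CIShell source: the interfaces are simplified local stand-ins for the ones quoted above (with a generic Dictionary so the sketch compiles without warnings), and BasicData, WordCounter, WordCountAlgorithm, the "label" metadata key and the format strings are all hypothetical names invented for illustration.

```java
import java.util.Dictionary;
import java.util.Hashtable;

// Simplified local stand-ins for the CIShell types quoted above,
// declared here only so the sketch compiles on its own.
class AlgorithmExecutionException extends Exception {
    AlgorithmExecutionException(String message, Throwable cause) {
        super(message, cause);
    }
}

interface Data {
    Dictionary<String, Object> getMetadata();
    Object getData();
    String getFormat();
}

interface Algorithm {
    Data[] execute() throws AlgorithmExecutionException;
}

// Hypothetical minimal Data implementation: a payload, a format, and metadata.
class BasicData implements Data {
    private final Object data;
    private final String format;
    private final Dictionary<String, Object> metadata = new Hashtable<>();

    BasicData(Object data, String format) {
        this.data = data;
        this.format = format;
    }

    @Override public Dictionary<String, Object> getMetadata() { return metadata; }
    @Override public Object getData() { return data; }
    @Override public String getFormat() { return format; }
}

// Hypothetical pre-existing tool that knows nothing about the framework.
class WordCounter {
    static int count(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}

// The wrapper: adapts the tool to the Algorithm interface.
class WordCountAlgorithm implements Algorithm {
    private final Data input;

    WordCountAlgorithm(Data input) {
        this.input = input;
    }

    @Override
    public Data[] execute() throws AlgorithmExecutionException {
        try {
            String text = (String) input.getData();
            Data result = new BasicData(WordCounter.count(text), "java.lang.Integer");
            result.getMetadata().put("label", "word count");
            return new Data[] { result };
        } catch (RuntimeException e) {
            // Translate tool failures (e.g. a wrong input type) into the framework's exception.
            throw new AlgorithmExecutionException("Word count failed", e);
        }
    }
}

class WordCountDemo {
    public static void main(String[] args) throws AlgorithmExecutionException {
        Data input = new BasicData("plug and play macroscopes", "java.lang.String");
        Data[] out = new WordCountAlgorithm(input).execute();
        System.out.println(out[0].getData() + " (" + out[0].getFormat() + ")");
    }
}
```

The point of the sketch is that the tool itself stays untouched. Only the thin wrapper knows about Data and Algorithm, which is presumably what lets independently written components be recombined.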

The framework also features what looks to be support for dynamic GUI construction, a database abstraction with a slot for a named schema, and support for scripting in Python.

An upcoming version of Cytoscape is built on OSGi. Someone should write a genome browser along these same lines.

"To serve the needs of scientists the core architecture must empower non-programmers to plug, play, and share their algorithms and to design custom macroscopes and other tools." In my experience, scientists who are capable of designing workflows in visual tools are not afraid of learning enough R or Python to accomplish much the same thing. I'm not saying it's obvious that they should, just that the trade-offs are worth considering. The real benefit comes from raising the level of abstraction, rather than replacing command-line code with point-and-click GUIs.

Means of composition

Plugin architecture isn't the only way to compose independently developed software tools. My lab's Gaggle framework links software through messaging. Service-oriented architecture boils down to composition of web services, as in Taverna and MyGrid. GenePattern and Galaxy both fit into this space, although I'm not sure I can do a good job of characterizing them. If I understand correctly, both seem to use common file formats and conversions between them as the intermediary between programs. The classic means of composition are Unix pipes and filters - small programs loosely joined - and scripting.
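For comparison with the plugin approach, the pipes-and-filters style can be mimicked inside a single program. The following is only an illustrative sketch, assuming Java 8 or later for java.util.function.Function. The Filters class and its grep, sort and uniq methods are hypothetical names chosen to echo the Unix tools, and the shared "format" is simply a list of lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Each filter maps a list of lines to a list of lines,
// much as a Unix filter maps stdin to stdout.
class Filters {
    // Keep only lines containing the given substring.
    static Function<List<String>, List<String>> grep(String needle) {
        return lines -> lines.stream()
                .filter(line -> line.contains(needle))
                .collect(Collectors.toList());
    }

    // Sort lines in natural (lexicographic) order.
    static Function<List<String>, List<String>> sort() {
        return lines -> lines.stream().sorted().collect(Collectors.toList());
    }

    // Drop adjacent duplicate lines, like Unix uniq.
    static Function<List<String>, List<String>> uniq() {
        return lines -> {
            List<String> out = new ArrayList<>();
            for (String line : lines) {
                if (out.isEmpty() || !out.get(out.size() - 1).equals(line)) {
                    out.add(line);
                }
            }
            return out;
        };
    }
}

class PipelineDemo {
    public static void main(String[] args) {
        // Roughly equivalent to: grep gene | sort | uniq
        Function<List<String>, List<String>> pipeline =
                Filters.grep("gene").andThen(Filters.sort()).andThen(Filters.uniq());

        List<String> input = Arrays.asList("gene B", "protein X", "gene A", "gene B");
        System.out.println(pipeline.apply(input)); // prints [gene A, gene B]
    }
}
```

The filters compose only because they agree on one intermediate representation, which is the same role the common file formats appear to play between programs in GenePattern and Galaxy.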

Visualization

In a keynote on Visual Design Principles at VIZBI in March of this year, Börner channeled inspiration from Yann Arthus-Bertrand's Home and Earth from Above, Drew Berry and Edward Tufte, and advised students of visualization to study human visual perception and cognitive processing first in order to "design for the human system".

Her workflow combines data-centric steps similar to processes outlined by Jeffrey Heer and Hadley Wickham with a detailed breakdown of the elements of visualization.

Katy Börner directs the Cyberinfrastructure for Network Science Center at Indiana University. Not content to have written the Atlas of Science (MIT Press, 2010), she is currently hanging out in Amsterdam writing an Atlas of Knowledge.
