Tuesday, March 20, 2012

Finding the right questions

A 2010 PLOS Biology article, “Finding the right questions: exploratory pathway analysis to enhance biological discovery in large datasets”, makes some good points about exploratory analysis of noisy biological data and the design of software that supports it.

At a time when biological data are increasingly digital and thus amenable to computationally driven statistical analysis, it is easy to lose sight of the important role of data exploration. Succinctly defined over 30 years ago by John Tukey, exploratory data analysis is an approach to data analysis that focuses on finding the right question, rather than the right answer. In contrast to confirmatory analysis, which involves testing preconceived hypotheses, exploratory data analysis involves a broad investigation, a key component of which may be visual display. Though his arguments predate personal computing and thus focus on graph paper and ink, the point still stands: good data visualization leads to simpler (better) descriptions and underlying fundamental concepts. Today, there is tremendous potential for computational biologists, bioinformaticians, and related software developers to shape and direct scientific discovery by designing data visualization tools that facilitate exploratory analysis and fuel the cycle of ideas and experiments that gets refined into well-formed hypotheses, robust analyses, and confident results.

The authors are involved in WikiPathways, an open platform for curation of biological pathways comparable to KEGG, Ingenuity Pathway Analysis, MetaCyc, and Pathway Commons, and this provides the context for their comments. But most of their conclusions apply more generally to software for research, where the goal is to enable “researchers to take a flexible, exploratory attitude and facilitate construction of an understandable biological story from complex data.”

...instead of aiming for a single, isolated software package, developers should implement flexible solutions that can be integrated in a larger toolbox [...], in which each tool provides a different perspective on the dataset.
For developers, realizing that exploratory pathway analysis tools might be used not only in isolation but also with other software and different types of data in a flexible analysis setup might guide software design and implementation.

Effective data integration

Flexibility and interactivity are key to effectiveness. “Determining what to integrate and how to present it to the user depends on the context and the question being asked.” Researchers often need to follow up on a weak or uncertain signal by finding confirmatory evidence in relevant orthogonal or correlated datasets. This underscores the importance of well-curated data, whether pathways or annotated genome assemblies, which can form the scaffolding on which integration takes place.

Providing an API, in addition to a UI, opens up possibilities for scripting and automation and enables advanced users to “combine functionalities of different tools to perform novel types of analysis.” The authors note that defining general data models increases reusability and unity among software tools. This resonates with my own experience. One of the key virtues of Gaggle is its highly general data model, consisting of basic data structures (list, matrix, network, table, and tuples) free of specific semantics.
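
To make the idea of a semantics-free data model a little more concrete, here is a minimal Python sketch. It is not Gaggle's actual API; the class and function names (NameList, Matrix, Network, network_to_names, select_rows) are hypothetical and chosen only for illustration. The point is that when tools exchange plain lists, matrices, and networks, the output of one tool can feed another without either knowing what the names mean biologically.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class NameList:
    """A named list of identifiers, with no biological meaning attached."""
    name: str
    items: List[str]


@dataclass
class Matrix:
    """Numeric values indexed by row and column names."""
    name: str
    row_names: List[str]
    column_names: List[str]
    values: List[List[float]]


@dataclass
class Network:
    """Edges stored as (source, interaction, target) triples."""
    name: str
    edges: List[Tuple[str, str, str]]


def network_to_names(network: Network) -> NameList:
    """Collect the unique node names of a network into a sorted NameList."""
    nodes = set()
    for source, _interaction, target in network.edges:
        nodes.add(source)
        nodes.add(target)
    return NameList(name=f"nodes of {network.name}", items=sorted(nodes))


def select_rows(matrix: Matrix, names: NameList) -> Matrix:
    """Return a new Matrix holding only the rows whose names are in `names`."""
    keep = set(names.items)
    pairs = [
        (row_name, row)
        for row_name, row in zip(matrix.row_names, matrix.values)
        if row_name in keep
    ]
    return Matrix(
        name=f"{matrix.name} [{names.name}]",
        row_names=[row_name for row_name, _ in pairs],
        column_names=list(matrix.column_names),
        values=[list(row) for _, row in pairs],
    )
```

A network tool could hand the result of network_to_names to a matrix tool, which would call select_rows to pull out the matching rows. Neither function needs to know whether the names refer to genes, proteins, or anything else.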

Also noted are the difficulties which make current analysis tools “rather isolated and hard to combine”, specifically:

  • reformatting data
  • mapping identifiers
  • learning curve of multiple software packages
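
As an illustration of why identifier mapping is a recurring chore, here is a rough Python sketch of re-keying a set of measurements from one identifier system to another. The function name, the shape of the mapping table, and the placeholder identifiers in the comments are all assumptions made for this example; in practice the mapping table would come from a curated resource, and the paper does not prescribe this approach.

```python
from typing import Dict, List, Tuple


def remap_measurements(
    measurements: Dict[str, float],
    id_map: Dict[str, List[str]],
) -> Tuple[Dict[str, List[float]], List[str]]:
    """Re-key measurements from a source ID system to a target ID system.

    `id_map` maps each source ID to zero or more target IDs, because
    mappings between identifier systems are often one-to-many.
    Returns the re-keyed values plus the source IDs that had no mapping,
    so that nothing is dropped silently.
    """
    mapped: Dict[str, List[float]] = {}
    unmapped: List[str] = []
    for source_id, value in measurements.items():
        targets = id_map.get(source_id, [])
        if not targets:
            unmapped.append(source_id)
            continue
        for target_id in targets:
            # Several source IDs may map to the same target ID, so keep
            # every value and let the caller decide how to combine them.
            mapped.setdefault(target_id, []).append(value)
    return mapped, unmapped


# Placeholder identifiers, purely illustrative:
# measurements = {"src1": 2.5, "src2": -1.0, "src3": 0.3}
# id_map = {"src1": ["tgtA"], "src2": ["tgtA", "tgtB"]}
# mapped, unmapped = remap_measurements(measurements, id_map)
```

Even this small sketch has to deal with one-to-many mappings, collisions, and unmapped identifiers, which hints at why every tool re-solving the problem in its own way makes tools hard to combine.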

“Succinctly defined over 30 years ago by John Tukey, exploratory data analysis is an approach to data analysis that focuses on finding the right question, rather than the right answer.” We've not yet found the right answer to data integration in biology, but it's clear that “integration of annotations and data is critical to extracting the full potential from large and high-throughput datasets.” This paper contains some exceptionally clear thinking on building software tools that will help bring that about.

Citation

Kelder T, Conklin BR, Evelo CT, Pico AR (2010) Finding the Right Questions: Exploratory Pathway Analysis to Enhance Biological Discovery in Large Datasets. PLoS Biol 8(8): e1000472. doi:10.1371/journal.pbio.1000472