Friday, January 15, 2010

State of data integration for biology

The field of biology has a proliferation of data repositories and software tools. It's probably inevitable. Lab techniques, computational methods, and algorithms advance rapidly. Technology trends come and go. And funding for sustained maintenance is hard to come by. Biological data comes in a diversity of types and varying quality, driving the need to coordinate several lines of evidence.

Data integration has been a persistent problem. Lincoln Stein's Creating a bioinformatics nation and Integrating biological databases give the background. A pair of reviews update this topic.

State of the nation in data integration for bioinformatics

In State of the nation in data integration for bioinformatics, Carole Goble and Robert Stevens give an update (to 2007) with a semantic web spin. The complexity of biological data arises from:

  • diversity of data types
  • varying quality
  • describing a sample and its originating context
  • ridiculous situation with identifiers: Every conceivable pathology of identifiers is standard practice in biology - multiple names for the same thing, the same name for different things. Equality is often hazy and context dependent.
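A toy sketch of the identifier pathology: aliases map many-to-one onto canonical IDs, while some symbols are ambiguous without context. The gene symbols and alias table below are invented for illustration, not any real database:

```python
# Hypothetical alias table: a systematic name and a common name both
# resolve to one canonical ID; "CAP" is ambiguous on its own.
ALIASES = {
    "vng1187": "VNG_1187G",   # systematic name
    "gvpA":    "VNG_1187G",   # common name for the same gene
    "CAP":     None,          # ambiguous: which CAP, in which organism?
}

def resolve(symbol, context=None):
    """Map a symbol to a canonical ID, failing loudly when ambiguous."""
    if symbol not in ALIASES:
        raise KeyError(f"unknown symbol: {symbol}")
    canonical = ALIASES[symbol]
    if canonical is None:
        raise ValueError(f"'{symbol}' is ambiguous without context: {context!r}")
    return canonical

print(resolve("gvpA"))     # → VNG_1187G
print(resolve("vng1187"))  # → VNG_1187G, equality only after resolution
```

The point of the sketch: equality of two identifiers is only decidable after mapping both to a canonical form, and sometimes not even then without context.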

To address this complication, various strategies have been tried, which they categorize as:

  • Service oriented architectures: CORBA, WS, ReST
  • Link integration
  • Data warehousing
  • View integration (distributed queries)
  • Model-driven service oriented architecture (caBIG, Gaggle?)
  • Integration applications (not clear to me what these are)
  • Workflows (Taverna)
  • Mashups
  • Semantic Web (smashups):
    • RDF
    • Ontologies
    • Linked data sets
    • Semantic mapping

Messaging probably belongs on this list. They call Gaggle a Model-driven service oriented architecture, which I'm not sure is the right classification. Message passing fits Gaggle better. Messaging tends to get overlooked, in spite of lots of work on message-oriented middleware and enterprise application integration in the business domain. (See Gregor Hohpe.) Messaging has evolved from its roots in the remote procedure call to a more document-centric view, but that shift is not always recognized.
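The RPC-to-document shift can be illustrated with a toy example: instead of invoking a remote method, a tool broadcasts a self-describing document that any listener can interpret. The message schema here is invented for illustration, not Gaggle's actual wire format:

```python
import json

# RPC style couples sender and receiver to a method signature:
#   remote.show_genes(["VNG_1187G", "VNG_2243G"])
# Document style ships a self-describing message instead; any tool
# that understands the (hypothetical) schema can react to it.
message = {
    "type": "name-list",
    "species": "Halobacterium sp. NRC-1",
    "names": ["VNG_1187G", "VNG_2243G"],
}

wire = json.dumps(message)           # what actually travels between tools

received = json.loads(wire)
if received["type"] == "name-list":  # receiver dispatches on the document,
    names = received["names"]        # not on a procedure name
```

The receiver's coupling is to the document format, not to any particular sender, which is what makes broadcast-style integration between loosely related tools workable.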

They use the term touch-points to mean common keys on which to join data.

They integrate on a range of common or corresponding touch-points: data values, names, identities, schema properties, ontology terms, keywords, loci, spatial-temporal points, etc. These touch-points are the means by which integration is possible, and a great deal of a bioinformatician’s work is the mapping of one touch-point to another.
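In code, a touch-point is just a shared key. A minimal sketch of joining records from two sources on a common identifier, with made-up records and ID scheme:

```python
# Two hypothetical data sources that happen to share an identifier
# scheme -- the "touch-point" on which they can be joined.
expression = [
    {"gene": "VNG_1187G", "log_ratio": 1.7},
    {"gene": "VNG_2243G", "log_ratio": -0.4},
]
annotation = [
    {"gene": "VNG_1187G", "function": "gas vesicle protein"},
    {"gene": "VNG_0101G", "function": "hypothetical protein"},
]

def join_on(key, left, right):
    """Inner join two lists of records on a shared key (the touch-point)."""
    index = {row[key]: row for row in right}
    return [
        {**row, **index[row[key]]}
        for row in left
        if row[key] in index
    ]

merged = join_on("gene", expression, annotation)
# Only VNG_1187G appears in both sources, so merged has one record.
```

The hard part in practice is rarely the join itself; it is getting both sides onto the same identifier scheme in the first place, which is the mapping work the quote describes.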

They mention BioMoby, a tool for composing web-services into workflows based on describing services and their input and output data types with controlled vocabularies. The founders of BioMoby seem to have drifted steadily towards the semantic web, but at one time they said this:

[...] interoperability in the domain of bioinformatics is, unexpectedly, largely a syntactic rather than a semantic problem. That is to say, interoperability between bioinformatics Web Services can be largely achieved simply by specifying the data structures being passed between the services (syntax) even without rich specification of what those data structures mean (semantics).

Microformats get a brief mention as a means to enrich web content with semantics - structured representation, links to supporting evidence or related data, provenance, etc.

Goble and Stevens also cite the late great SIMILE Piggy Bank, a Firefox extension allowing users to add their own semantic markup to web pages - a key source of inspiration for Firegoose. Another interesting reference is Alon Halevy's Principles of dataspace systems (2006), an effort to think about (just-in-time) data integration at web scale. (See Dynamic Fusion of Web Data for more.)

State of the cyberinfrastructure

Lincoln Stein's Towards a cyberinfrastructure for the biological sciences (2008) assesses the state of computing in biology with the aim of “the ability to create predictive, quantitative models of complex biological processes”. He defines cyberinfrastructure as consisting of data sources, computing resources, communication (not just networks, but syntactic and semantic connectivity as well), and human infrastructure (skills and cultural aspects).

The current biology cyberinfrastructure has a strong data infrastructure, a weak to non‑existent computational grid, patchy syntactic and semantic connectivity, and a strengthening human infrastructure.

He seems to conclude that the semantic web is promising, but not quite there yet. Almost echoing the quote above from the BioMoby paper, Stein says:

[...] in the short term, [...] we can gain many of the benefits of more sophisticated systems by leveraging the human capacity to make sense of noisy and contradictory information. Semantic integration will occur in the traditional way: many human eyes interpreting what they read and many human hands organizing and reorganizing the information.

Both of these papers suggest that the semantic web may finally solve these persistent data integration problems, but this cynical reader was left with a picture of a string of technologies that promised the same and never lived up to the hype. Semantic markup is a significant effort, whose costs fall on the data provider but whose benefits go mainly to others. Maybe tools will come along that reduce that effort, but I can't imagine there will ever be a day when quick-and-dirty loses its appeal. This is research, after all. In an exploratory analysis, a scientist is likely to try several false starts before finding a fruitful path. If integration at the (quicker and dirtier) syntactic level is even a little easier, it seems like a better choice for this purpose. The opposite is probably true for final published results.

In both articles, mashups and just-in-time integration show signs of getting some respect, which I like. Same for ideas related to linked data.

The really important goal

There is probably a small set of principles of data organization within a given domain that would impose little or no burden on data providers. And if we all accepted those principles up front, any two arbitrary pieces of data from different sources could be joined together either automatically or with reasonably little effort, and without any need for further central coordination.

Mo' links