Friday, March 23, 2012

Applying Semantic Web Services to bioinformatics

Applying Semantic Web Services to bioinformatics: Experiences gained, lessons learnt
Phillip Lord, Sean Bechhofer, Mark D. Wilkinson, Gary Schiltz, Damian Gessler, Duncan Hull, Carole Goble, and Lincoln Stein
International Semantic Web Conference, Vol. 3298 (2004), pp. 350-364, doi:10.1007/b102467

Applying Semantic Web Services to bioinformatics is a 2004 paper on Semantic Web Services in the context of bioinformatics, based on the experiences of the myGrid and BioMoby projects. The important and worthy goal behind these projects is enabling composition and interoperability of heterogeneous software. Is Semantic Web technology the answer to data integration in biology? I'm a little skeptical.

Here's a biased selection of what the paper has to say:

  • "The importance of fully automated service discovery and composition is an open question. It is unclear whether it is either possible or desirable, for all services, in this domain..."
  • "Requiring service providers and consumers to re-structure their data in a new formalism for external integration is also inappropriate."
  • "Bioinformaticians are just not structuring their data in XML schema, because it provides little value to them."
  • "All three projects have accepted that much of the data that they receive will not be structured in a standard way. The obvious corollary of this is that without restructuring, the information will be largely opaque to the service layer."

A couple of interesting asides are addressed:

  • Most services or operations can be described in terms of inputs, outputs, and configuration parameters (secondary inputs). When building a pipeline, only the main input and output need be considered, leaving parameters for later.
  • A mixed user base divided between biologists and bioinformaticians is one difficulty noted in the paper. I've also found that tricky. Actually, the situation has changed since the article was written. Point-and-click biologists are getting to be an endangered species. The crop of biologists I see coming up these days is very computationally savvy. What I think of as the scripting-enabled biologist is a lot more common. Those not so enabled are increasingly likely to specialize in wet-lab work and do little or no data analysis.
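
The first aside, that composing a pipeline depends only on each service's main input and output, can be sketched in a few lines of Python. This is a toy illustration, not code from myGrid or BioMoby: the Service class, the type labels, and the two example services are all made up for the purpose.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Service:
    """Hypothetical service description: one main input, one main output,
    plus secondary parameters that can be filled in later."""
    name: str
    input_type: str
    output_type: str
    run: Callable[..., Any]
    params: Dict[str, Any] = field(default_factory=dict)


def can_chain(services: List[Service]) -> bool:
    """A pipeline is valid if each output type matches the next input type."""
    return all(a.output_type == b.input_type
               for a, b in zip(services, services[1:]))


def run_pipeline(services: List[Service], data: Any) -> Any:
    """Check the chain on main inputs/outputs only, then run each step
    with whatever secondary parameters it has been given."""
    if not can_chain(services):
        raise TypeError("output/input types do not line up")
    for service in services:
        data = service.run(data, **service.params)
    return data


# Two toy services (assumptions for illustration only).
def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def gc_fraction(seq: str, round_to: int = 2) -> float:
    return round((seq.count("G") + seq.count("C")) / len(seq), round_to)


rc = Service("reverse_complement", "dna_sequence", "dna_sequence",
             reverse_complement)
gc = Service("gc_fraction", "dna_sequence", "fraction", gc_fraction)

pipeline = [rc, gc]            # composed by looking at main input/output only
gc.params["round_to"] = 3      # secondary parameter supplied afterwards
result = run_pipeline(pipeline, "ATGC")
```

Note that the type labels here are just strings compared for equality. Nothing says what a "dna_sequence" means, only that two services agree on the label.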

In BioMOBY Successfully Integrates Distributed Heterogeneous Bioinformatics Web Services. The PlaNet Exemplar Case (2005), Wilkinson writes:

...interoperability in the domain of bioinformatics is, unexpectedly, largely a syntactic rather than a semantic problem. That is to say, interoperability between bioinformatics Web Services can be largely achieved simply by specifying the data structures being passed between the services (syntax) even without rich specification of what those data structures mean (semantics).

In The Life Sciences Semantic Web is Full of Creeps! (2006), Wilkinson and co-author Benjamin M. Good write, "both sociological and technological barriers are acting to inhibit widespread adoption of SW technologies," and acknowledge the complexity and high curatorial burden involved.

The Semantic Web for the Life Sciences (SWLS), when realized, will dramatically improve our ability to conduct bioinformatics analyses... The ultimate goal of the SWLS is not to create many separate, non-interacting data warehouses (as we already can), but rather to create a single, ‘crawlable’ and ‘queriable’ web of biological data and knowledge... This vision is currently being delayed by the timid and partial creep of semantic technologies and standards into the resources provided by the life sciences community.

These days, Mark Wilkinson is working on SADI, which “defines an open set of best-practices and conventions, within the spectrum of existing standards, that allow for a high degree of semantic discoverability and interoperability”.

More on the Semantic Web

...looks like this old argument is still playing out.