Sunday, June 24, 2012

Data analysis workflow patterns

About a year ago, I ran a kooky idea past some colleagues. They gave it a big WTF, so I sat on it for a while. Not to be deterred, I still like the idea, so here it is.

Workflows are key to bioinformatics. For example, an analysis of gene expression might go something like the following. Measure gene expression using arrays or RNA-seq. After a bit of normalization, cluster genes by correlated expression. Then compute functional enrichment on the clusters. This helps get at questions about how the cell reacts to some stimulus, whether that's nutrients, toxins, pH, sunlight, whatever. Or, what's the difference between a sick cell and a healthy one? The answer comes in terms of processes or pathways that are up- or down-regulated.

You might implement our analysis of gene expression in R or Python. You could do it with point-and-click software like MeV, web tools like DAVID, or workflow tools like Galaxy. You might use k-means, hierarchical clustering or something fancier. You might use GO terms or KEGG pathways to annotate gene function. There are lots of options, but the central idea is conserved.
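To make "the central idea is conserved" a bit more concrete, here's a minimal sketch in plain Python. The skeleton (normalize, cluster, enrich) stays fixed while each step is a swappable function. Everything here is an assumption made for illustration: the function names, the z-score normalization, the toy k-means and the hypergeometric test are stand-ins I picked, not what MeV, DAVID or Galaxy actually do.

```python
import math
from typing import Callable, Dict, List, Set

Expression = Dict[str, List[float]]  # gene -> expression across conditions


def zscore_normalize(expr: Expression) -> Expression:
    """Center and scale each gene's profile (one of many possible normalizations)."""
    normalized = {}
    for gene, values in expr.items():
        mean = sum(values) / len(values)
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        normalized[gene] = [(v - mean) / sd if sd > 0 else 0.0 for v in values]
    return normalized


def kmeans_cluster(expr: Expression, k: int = 2, iterations: int = 20) -> Dict[str, int]:
    """Toy k-means with a deterministic farthest-first initialization.

    On z-scored profiles, Euclidean distance tracks correlation, so this
    roughly groups genes by correlated expression.
    """
    genes = sorted(expr)
    centers = [expr[genes[0]]]
    while len(centers) < k:
        # Next center: the gene farthest from all centers chosen so far.
        farthest = max(genes, key=lambda g: min(math.dist(expr[g], c) for c in centers))
        centers.append(expr[farthest])
    labels: Dict[str, int] = {}
    for _ in range(iterations):
        for g in genes:
            labels[g] = min(range(k), key=lambda c: math.dist(expr[g], centers[c]))
        for c in range(k):
            members = [expr[g] for g in genes if labels[g] == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def hypergeom_enrichment(cluster: Set[str], annotated: Set[str], universe: Set[str]) -> float:
    """P-value for seeing at least the observed overlap by chance."""
    n_universe = len(universe)
    n_annotated = len(annotated & universe)
    n_cluster = len(cluster)
    overlap = len(cluster & annotated)
    total = math.comb(n_universe, n_cluster)
    tail = sum(
        math.comb(n_annotated, i) * math.comb(n_universe - n_annotated, n_cluster - i)
        for i in range(overlap, min(n_annotated, n_cluster) + 1)
    )
    return tail / total


def run_workflow(
    expr: Expression,
    annotations: Dict[str, Set[str]],  # term (e.g. a GO term) -> annotated genes
    normalize: Callable[[Expression], Expression] = zscore_normalize,
    cluster: Callable[[Expression], Dict[str, int]] = kmeans_cluster,
    enrich: Callable[[Set[str], Set[str], Set[str]], float] = hypergeom_enrichment,
) -> Dict[int, Dict[str, float]]:
    """The conserved skeleton: normalize, cluster, then test each cluster for enrichment."""
    labels = cluster(normalize(expr))
    universe = set(expr)
    results: Dict[int, Dict[str, float]] = {}
    for label in sorted(set(labels.values())):
        members = {g for g, assigned in labels.items() if assigned == label}
        results[label] = {
            term: enrich(members, genes, universe) for term, genes in annotations.items()
        }
    return results
```

Swapping k-means for hierarchical clustering, or GO terms for KEGG pathways, would mean passing a different `cluster` function or a different `annotations` mapping. The body of `run_workflow` wouldn't change, which is roughly what I mean by a pattern.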

The notion of a data analysis workflow is something like the concept of a software design pattern, an idea that software engineers borrowed from architects. A design pattern is a general reusable template for how to solve a commonly occurring problem. Naming and documenting a design pattern makes sharing knowledge easier and lets you talk and think at a higher level of abstraction.

If we think of workflows as assets, it follows that they should be collected, documented and published. "In bioinformatics, we are familiar with the idea of curated data as a prerequisite for data integration. We neglect, often to our cost, the curation and cataloguing of the processes that we use to integrate and analyse our data." (Carol Goble et al. 2008) Both Taverna and Galaxy have mechanisms for doing this, albeit within the context of those tools. Maybe a better place to document workflows would be in journals. Philip Bourne, editor-in-chief of PLOS Computational Biology, says, "I want the publisher of the future to be the guardian of these workflows create a better scholarly record." He's speaking here of scientific workflows in a more general context than just data analysis.

Design patterns are documented in a specific format, detailing the scenarios in which that pattern applies, the intent behind it, its structure, implementation and consequences. Examples are usually given of real usages of the pattern along with discussion of alternatives or related patterns and its risks and pitfalls. For data analysis workflows, we'd probably want to discuss possible sources of error, required conditions and statistical properties.
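As a rough illustration of what such a documented entry might look like, here's a sketch of a record type with one field per section listed above. The class name, the field names and the `missing_sections` helper are all hypothetical. Nobody has standardized this, and the fields are just my reading of the usual design pattern format plus the data analysis extras.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class WorkflowPattern:
    """One catalog entry for a data analysis workflow (hypothetical format)."""

    name: str
    intent: str
    applicability: str  # scenarios in which the pattern applies
    structure: List[str]  # the ordered steps of the workflow
    implementation: str = ""
    consequences: str = ""
    known_uses: List[str] = field(default_factory=list)
    related_patterns: List[str] = field(default_factory=list)
    pitfalls: List[str] = field(default_factory=list)
    # Extras that matter for data analysis in particular:
    sources_of_error: List[str] = field(default_factory=list)
    required_conditions: List[str] = field(default_factory=list)
    statistical_properties: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Names of sections that are still empty, handy for a curator."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def render(self) -> str:
        """Plain-text rendering that skips empty sections."""
        lines = [self.name, "=" * len(self.name)]
        for f in fields(self):
            value = getattr(self, f.name)
            if f.name == "name" or not value:
                continue
            lines.append("")
            lines.append(f.name.replace("_", " ").capitalize() + ":")
            if isinstance(value, list):
                lines.extend("  - " + item for item in value)
            else:
                lines.append("  " + value)
        return "\n".join(lines)


# Example entry based on the gene expression analysis described earlier.
expression_clustering = WorkflowPattern(
    name="Expression clustering with functional enrichment",
    intent="Find processes or pathways that respond to a stimulus.",
    applicability="Gene expression measured by arrays or RNA-seq.",
    structure=[
        "Normalize expression values",
        "Cluster genes by correlated expression",
        "Compute functional enrichment on each cluster",
    ],
)
```

Calling `expression_clustering.missing_sections()` lists what a curator would still need to fill in, such as sources of error and statistical properties.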

This is not to say we need formalism for the sake of formalism. It's tempting to get caught up in impractical methodological hoo-ha, focusing on process to the exclusion of real and practical goals. A workflow should be a tool in a researcher's toolbox, a pragmatic way to package a bite-sized piece of knowledge. What we want to avoid is the over-abstraction and questionable utility that some (i.e. me) see in BPEL, BPM, MDA and related (increasingly defunct) technology trends.

Rather than design patterns, maybe a more biologist-friendly term is a protocol. Plus, there's already a long history of journals dedicated to lab protocols. Why not do the same for protocols for data analysis?

More

As with every idea, good or crackpot, people have thought of this one before me.