Monday, November 09, 2009

Computational representation of biological systems

Computational representation of biological systems by Zach Frazier, Jason McDermott, Michal Guerquin, Ram Samudrala is a book chapter in Springer's Computational Systems Biology. It introduces basic data warehousing concepts along with a data warehousing effort targeted at biology called Bioverse.

They contrast data warehousing with online transaction processing. OLTP entails frequent concurrent updates. Updates traditionally look like bank machine operations or travel reservations. Data warehousing, in contrast, typically updates only occasionally in an additive way as new data arrives or annotations are added. The star schema, which supports efficient subsetting and computing of aggregates (min, max, sum, count, average), centers on a table of atomic data elements called facts surrounded by related tables holding different types of search criteria called dimensions.
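To make the star schema concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and rows are all made up for illustration and are not taken from Bioverse: a fact table of proteins sits in the middle, with organism and functional category as dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: the search criteria (hypothetical names)
CREATE TABLE organism (organism_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE function_category (category_id INTEGER PRIMARY KEY, label TEXT);

-- Fact table: one row per protein, with a foreign key into each dimension
CREATE TABLE protein_fact (
    protein_id  INTEGER PRIMARY KEY,
    organism_id INTEGER REFERENCES organism(organism_id),
    category_id INTEGER REFERENCES function_category(category_id),
    length      INTEGER
);
""")

# Invented sample data, only to exercise the queries below
conn.executemany("INSERT INTO organism VALUES (?, ?)",
                 [(1, "E. coli"), (2, "S. cerevisiae")])
conn.executemany("INSERT INTO function_category VALUES (?, ?)",
                 [(1, "kinase"), (2, "transporter")])
conn.executemany("INSERT INTO protein_fact VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 300), (2, 1, 2, 450),
                  (3, 2, 1, 520), (4, 2, 1, 480), (5, 2, 2, 610)])

def summarize_by_organism(conn):
    """Aggregate the facts along the organism dimension."""
    return conn.execute("""
        SELECT o.name, COUNT(*), MIN(f.length), MAX(f.length), AVG(f.length)
        FROM protein_fact f
        JOIN organism o ON o.organism_id = f.organism_id
        GROUP BY o.name
        ORDER BY o.name
    """).fetchall()

def count_subset(conn, organism_name, category_label):
    """Subset the facts by two dimensions at once, then count."""
    row = conn.execute("""
        SELECT COUNT(*)
        FROM protein_fact f
        JOIN organism o ON o.organism_id = f.organism_id
        JOIN function_category c ON c.category_id = f.category_id
        WHERE o.name = ? AND c.label = ?
    """, (organism_name, category_label)).fetchone()
    return row[0]

summary = summarize_by_organism(conn)
yeast_kinases = count_subset(conn, "S. cerevisiae", "kinase")
```

Every query has the same shape: join the fact table to whichever dimensions you want to filter or group by, then aggregate. That regularity is what makes the design easy to index and optimize.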

Bioverse nicely illustrates several of the main challenges with data warehousing. First, its data (54 organisms) appears to have been last updated in 2005. Second, we must choose the granularity of the fact table based on the questions we expect to ask, but questions come at many scales.

Hierarchical data occurs throughout the Bioverse. Representation of these structures is particularly difficult in relational databases.

They go on to cite a method for supporting efficient hierarchical queries from Kimball, R., Ross, M., (2002). The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling.
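I won't try to reproduce the cited method here. One common technique for this kind of query is a bridge (or closure) table that stores every ancestor-descendant pair, so a subtree lookup becomes a single indexed select instead of a recursive walk. Whether this matches the chapter's approach exactly is my assumption. The sketch below uses a toy hierarchy with invented labels and a hypothetical term_bridge table.

```python
import sqlite3

# Toy hierarchy, child -> parent (labels invented for illustration)
PARENT = {
    "root": None,
    "enzymes": "root",
    "transporters": "root",
    "kinases": "enzymes",
}

def closure_rows(parent):
    """Emit one (ancestor, descendant, depth) row per ancestor of each node,
    including the node itself at depth 0."""
    rows = []
    for node in parent:
        ancestor, depth = node, 0
        while ancestor is not None:
            rows.append((ancestor, node, depth))
            ancestor = parent[ancestor]
            depth += 1
    return rows

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE term_bridge (ancestor TEXT, descendant TEXT, depth INTEGER)"
)
bridge = closure_rows(PARENT)
conn.executemany("INSERT INTO term_bridge VALUES (?, ?, ?)", bridge)

def descendants(conn, term):
    """All nodes strictly below `term`, found without any recursion in SQL."""
    rows = conn.execute(
        "SELECT descendant FROM term_bridge "
        "WHERE ancestor = ? AND depth > 0 ORDER BY descendant",
        (term,),
    )
    return [r[0] for r in rows]

below_root = descendants(conn, "root")
below_enzymes = descendants(conn, "enzymes")
```

The trade-off is storage and maintenance: the bridge table grows with the depth of the hierarchy and has to be rebuilt or patched whenever the tree changes, which is tolerable in a warehouse that only updates occasionally.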

In addition to hierarchies, graphs and networks are common structures in biological systems including protein-protein interaction networks, biochemical pathways, and others. However, the techniques outlined for trees and directed acyclic graphs are no longer appropriate for graphs. Answering any more than very basic graph queries is hard in relational databases.
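To get a feel for the difficulty, here is a sketch of about the simplest graph question there is: which proteins are reachable from a given protein in an interaction network? The interaction table and its rows are made up. Even this needs a recursive common table expression in SQLite, and it has to use UNION rather than UNION ALL so that cycles do not cause it to loop forever.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interaction (a TEXT, b TEXT)")

# Invented undirected interactions; P1-P2-P3 form a cycle
edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P1"), ("P3", "P4"), ("P5", "P6")]
conn.executemany("INSERT INTO interaction VALUES (?, ?)", edges)

def reachable(conn, start):
    """Return every node connected to `start`, including `start` itself."""
    rows = conn.execute("""
        WITH RECURSIVE
          -- Treat each stored interaction as an edge in both directions
          edge(src, dst) AS (
            SELECT a, b FROM interaction
            UNION
            SELECT b, a FROM interaction
          ),
          -- UNION (not UNION ALL) drops already-seen nodes, which is
          -- what stops the recursion on cyclic graphs
          reach(node) AS (
            SELECT ?
            UNION
            SELECT edge.dst FROM edge JOIN reach ON edge.src = reach.node
          )
        SELECT node FROM reach ORDER BY node
    """, (start,))
    return [r[0] for r in rows]

component_p1 = reachable(conn, "P1")
component_p5 = reachable(conn, "P5")
```

Reachability is the easy case. Shortest paths, neighborhoods of a given radius, or motif searches each need more bookkeeping in the recursion, and that is where the relational approach starts to feel like the wrong tool.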

So, with all these drawbacks, you have to wonder whether the relational database is a good basis for data mining applications, especially given the prevalence of networks in biological data. I'm a lot more intrigued by ideas along the lines of NoSQL schema-less databases, dynamic fusion of web data, and the web as a channel for structured data.