Monday, December 27, 2010

Cloud bioinformatics

Personally, the thing I love about cloud computing is never having to ask permission. There's no ops guy or pointy-haired boss between me and the launch-instance button. As lovely as that is, the cloud is also a powerful tool for scientific computing, particularly bioinformatics.

Next-gen sequencing, which can produce gigabytes per day, is one factor pushing bioinformatics into the cloud. Data analysis is now the major bottleneck for sequencing-based experiments. Labs are finding out that generating sequencing data is getting to be cheaper than processing it. According to Dave O’Connor Lab at the University of Wisconsin's Department of Pathology and Laboratory Medicine, "There is a real disconnect between the ability to collect next-generation sequence data (easy) and the ability to analyze it meaningfully (hard)."

O'Connor's group works with LabKey Software, a Seattle-based bioinformatics software company founded by the Fred Hutchinson Cancer Research Center. LabKey develops open-source data management software for proteomics, flow cytometry, plate-based assay, and HIV vaccine study data, described in a presentation by Lead Developer Adam Rauch. Their technology stack seems to include: Java, Spring, GWT, Lucene and Gauva (aka Google Collections). LabKey integrates with the impressive Galaxy genomics workflow system and the Trans-Proteomic Pipeline (TPP).

A good part of modern biology boils down to mining biological data, with the goal of correlating sequence, transcription or peptides to outputs like function, phenotype or disease. Machine learning and statistical modeling tend toward long-running CPU-intensive jobs that get run intermittently as new data arrives, making them ideal candidates for the cloud.

Amazon's EC2 seems to be better positioned than either Microsoft's Azure or Google's AppEngine for scientific computing. Amazon has been ahead of the curve in seeing the opportunity in genomic data overload. Microsoft has made some welcome efforts to attract scientific computing, including the Microsoft Biology Foundation and grants for scientific computing in Azure. But they're fighting a headwind arising from proprietary licensing and a closed ecosystem. Oddly, considering Google's reputation for openness, AppEngine looks surprisingly restrictive. Research computing typically involves building and installing binaries, programming in an odd patchwork of languages and long running CPU intensive tasks, none of which is particularly welcome on AppEngine. Maybe Google has a better offering in the works?

It's worth noting that open-source works without friction in cloud environments while many proprietary vendors have been slow to adapt their licensing models to on-demand scaling. For example, lots of folks are using R for machine learning in the cloud, while MatLab is still bogged down in licensing issues. The not-having-to-ask-permission aspect is lost.

According to Xconomy, Seattle has a growing advantage in the cloud. There are several Seattle companies operating in the bioinformatics and cloud spaces. Sage Bionetworks, also linked to FHCRC, was founded by Eric Schadt, also of Pacific Biosciences, and Stephen Friend former founder of Rosetta Inpharmatics. Revolution Analytics sells a scalable variant of R for all kinds of applications including life sciences. Seattle hosts a lot of activity in analytics, cloud computing and biotechnology, which will keep Seattle on the technology map for some time to come.