Thursday, April 21, 2011

Genome Browser's Anonymous

As you may know, I'm starting a support group for those afflicted with the tragedy of having written a genome browser. Mine is called the Gaggle Genome Browser. About the time I was writing it, everyone and their uncle's dog decided to write a genome browser. New instruments with new data types were coming into the lab. Computers had more memory and CPU cores than ever. It seemed like a good idea at the time.

The Broad Institute's Integrative genomics viewer (shown below) got a write-up in the January Nature Biotechnology. IGV seems particularly well developed for next-gen sequencing data, nicely displaying coverage plots and alignments of short reads, with attention to the nuances of paired-ends.

IGV is a Java desktop app that pulls data down from a server component, the IGV Data Server. In my case, I cooked up a two level hierarchy for caching chunks of data in memory backed by SQLite. It's probably smart to add a server as a third level. IGV's multi-resolution data mode precomputes aggregations for zoomed-out views in which data is denser than the pixels in which to display it. IGV splits data into "tiles" stored in a custom indexed binary file format. "Hence a single tile at the lowest resolution, which spans the entire genome, has the same memory footprint as a tile at the very high zoom levels, which might span only a few kilobases." My GGB aggregates on the fly, which hurts performance in zoomed out views.

The IGV Data Server seems to derive a lot of it's data from the UCSC Genome Browser, which maintains nicely curated data mapped to genomic coordinates for a bunch of eukaryotes and also microbes. One thing I enjoyed hacking with on GGB was integration with R. I wonder if that would be worthwhile for IGV.

Which functionality to put on the client vs. which in the server is debatable. We considered building a browser based implementation, experimenting a bit with the super-cool protovis visualization library. We went with desktop. X:map is a nice counter-point, an interactive web-based genome browser. In their approach, the Google Maps API serves up pre-rendered image tiles, keeping the big data and heavy-weight computing tasks on the server. They also have an R and Java program that lets you plot custom data. JBrowse, from CSHL, does the rendering in the browser. Putting a data intensive and graphically interactive app in the browser is still somewhere near the edge of the envelope, but browsers are improving like crazy, as are programming models for this type of development.

For what it's worth, I like the format of the IGV paper. It concisely covers motivation, what the software does, a few unique features and a couple figures showing example applications, all at a high level overview in just two pages. A supplement contains the technical detail of interest to software developers along with more example applications. I like that better than trying to awkwardly shoehorn biology and software engineering together.

Anyway... Nice work, IGV team! Let me know if you'd like to join the support group. We're here to help.