Wednesday, May 14, 2008

Working with microarray data

State-of-the-art DNA microarrays contain between 1 million and 6 million features (different probes) on a single slide. Assuming 32-bit floats, we need 4 bytes per feature. If we include start and stop coordinates on the genome for each feature, we're up to 12 bytes per feature.

features       MB    MB w/ coords
1 million       4    12
6.5 million    24    75
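
The arithmetic behind the table can be sketched in a few lines of Python. This is only a back-of-the-envelope check, assuming one 32-bit value per feature and that the start and stop coordinates are each stored as 4-byte integers (which is what 12 bytes per feature implies). The function name `array_size_bytes` is made up for illustration. Depending on whether "MB" means 10^6 or 2^20 bytes, the results shift by about 5%, which may account for small differences from the rounded figures above.

```python
BYTES_PER_VALUE = 4  # assumption: 32-bit float intensity, 32-bit integer coordinates


def array_size_bytes(n_features, with_coords=False):
    """Raw storage for one value per feature, optionally plus start and stop."""
    values_per_feature = 3 if with_coords else 1
    return n_features * values_per_feature * BYTES_PER_VALUE


for n in (1_000_000, 6_500_000):
    plain_mb = array_size_bytes(n) / 1e6                      # decimal megabytes
    coords_mb = array_size_bytes(n, with_coords=True) / 1e6
    print(f"{n:>9,} features: {plain_mb:5.1f} MB, {coords_mb:5.1f} MB with coords")
```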

In tiling arrays, probes target regularly spaced segments of the genome so that expression can be measured at every point along the genome. Our group has done some work with arrays containing 60 bp probes tiled every 20 bp along the genome of Halobacterium salinarum. The genome of H. salinarum is 2.6 million bases, so we are able to cover most of the genome with a little less than a quarter of a million probes. To cover the entire human genome at similar resolution would take ~307 million probes.
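
As a rough check on those probe counts, here is a minimal sketch. It assumes one probe starts every 20 bases on each of the two strands, ignores end effects, and takes the human genome to be about 3.07 billion bases. Both the two-strand reading and the genome size are assumptions, chosen because they reproduce the ~307 million figure. The function name `probes_needed` is illustrative.

```python
import math


def probes_needed(genome_length, step, strands=2):
    """Probe count when one probe starts every `step` bases on each strand."""
    return strands * math.ceil(genome_length / step)


# H. salinarum: 2.6 million bases, probes every 20 bases, both strands
print(probes_needed(2_600_000, 20))

# Human genome: the 3.07 billion base length is an assumption
print(probes_needed(3_070_000_000, 20))
```

Full two-strand coverage of 2.6 million bases comes out slightly above a quarter of a million, which fits with an array of just under that size covering most, but not all, of the genome.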

As a side note, Affymetrix sells a set of 14 arrays with a total of almost 90 million probes that covers the whole human genome at 35 bp resolution. (90 million * 35 bp = 3,150 Mbp. Does their idea of a probe include both forward and reverse strands?)
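
One way to poke at that question is to compare the bases spanned by the probes with the genome length. The sketch below assumes a human genome of roughly 3.07 billion bases (an assumption, not a vendor figure) and treats "35 bp resolution" as one probe per 35 bases. The names are illustrative.

```python
HUMAN_GENOME_BP = 3_070_000_000  # assumption: approximate human genome length


def bases_spanned(n_probes, spacing_bp):
    """Total bases covered if each probe accounts for `spacing_bp` bases."""
    return n_probes * spacing_bp


span = bases_spanned(90_000_000, 35)
strand_passes = span / HUMAN_GENOME_BP  # how many times the genome is covered

print(f"{span / 1e6:,.0f} Mbp spanned")
print(f"{strand_passes:.2f} passes over the genome")
```

Under these assumptions the probes amount to roughly one pass over the genome, which would suggest a single strand's worth of positions. That is only arithmetic, though, and does not settle how the vendor actually counts probes.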

These arrays are all custom made for an individual genome. I wonder if an array with all 4 million possible 11-mers (4^11 = 4,194,304) would be useful? Since there are only about 4 million of them, you could expect a fair number of collisions, where a probe is non-unique on the genome. How do you do that calculation?
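
One possible way to approach that calculation, as a sketch only: treat the genome as a random sequence with equal base frequencies, so that the number of times a given 11-mer occurs is approximately Poisson with mean (number of positions) / 4^k. This is a strong simplification, since real genomes have skewed base composition and repeats, so actual collision rates will be higher. The function names are hypothetical, and counting both strands as independent positions is an assumption.

```python
import math
import random
from collections import Counter


def kmer_occupancy(genome_length, k, strands=2):
    """Poisson estimate of k-mer occurrence in a uniformly random genome.

    Returns (expected hits per k-mer, P(absent), P(exactly once), P(repeated)).
    """
    positions = strands * (genome_length - k + 1)
    lam = positions / 4 ** k
    p_absent = math.exp(-lam)
    p_unique = lam * math.exp(-lam)
    p_repeated = 1.0 - p_absent - p_unique
    return lam, p_absent, p_unique, p_repeated


def count_kmers(sequence, k):
    """Count every overlapping k-mer on one strand of `sequence`."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def simulated_unique_fraction(genome_length, k, seed=0):
    """Fraction of all 4**k k-mers occurring exactly once in a random sequence."""
    rng = random.Random(seed)
    seq = "".join(rng.choices("ACGT", k=genome_length))
    counts = count_kmers(seq, k)
    n_unique = sum(1 for c in counts.values() if c == 1)
    return n_unique / 4 ** k


# Estimate for a 2.6 million base genome and all 11-mers, both strands
lam, p_absent, p_unique, p_repeated = kmer_occupancy(2_600_000, 11)
print(f"expected hits per 11-mer: {lam:.2f}")
print(f"absent: {p_absent:.2f}  unique: {p_unique:.2f}  repeated: {p_repeated:.2f}")

# Small single-strand simulation to sanity-check the Poisson approximation
print(simulated_unique_fraction(40_000, 8))
print(kmer_occupancy(40_000, 8, strands=1)[2])
```

Under this model a 2.6 million base genome gives a little over one expected hit per 11-mer, so a large share of probes would be either absent or repeated. For a genome the size of the human one, the expected count per 11-mer runs into the hundreds or more, so essentially no 11-mer would be unique. For a real genome, `count_kmers` run over the actual sequence and its reverse complement would give the true numbers.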