Monday, February 02, 2009

Spelunking in the UCSC Genome Browser

The UCSC genome browser is an established workhorse of bioinformatics led by Jim Kent and David Haussler. The software is an open source C and MySQL web-app that generates gifs to display genome annotation in tracks plotted against genomic coordinates. The main UCSC genome browser covers eukaryotic model organisms, while its twin, The UCSC Archaeal Genome Browser covers archaea and, despite its name, several bacteria as well.

Aside from the visualization, the genome browser is also a great source for curated genomic data, which is made available through a convenient mechanism called the Table Browser. Much of this data takes the form of a tuple containing genomic coordinates and some other data, which together are traditionally known as a feature.

(sequence, strand, start, end, ...)

Examples of features might be a gene, a PFAM domain, or a measurement from a microarray or mass spectrometry.

I happen to have a motive for wanting to grab some of this data and use it for my own nefarious purposes. So, here's a little documentation on how to go about doing that.

They allow public access to their MySQL database, so let's explore that first. Each assembly gets its own database, for example hg18 for the March 2006 assembly of the human genome. Application scope data is in a database called hgcentral, which includes a table dbDb, a datab ase of databases. Most of this information is also in the FAQ entry on releases, but since we can get to the tables, we may as well have a look.

mysql> use hgcentral;
mysql> describe dbDb;
+----------------+--------------+------+-----+---------+-------+
| Field          | Type         | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------+-------+
| name           | varchar(32)  | NO   |     |         |       | 
| description    | varchar(255) | NO   |     |         |       | 
| nibPath        | varchar(255) | NO   |     |         |       | 
| organism       | varchar(255) | NO   |     |         |       | 
| defaultPos     | varchar(255) | NO   |     |         |       | 
| active         | int(1)       | NO   |     | 0       |       | 
| orderKey       | int(11)      | NO   |     | 1000000 |       | 
| genome         | varchar(255) | NO   |     |         |       | 
| scientificName | varchar(255) | NO   |     |         |       | 
| htmlPath       | varchar(255) | NO   |     |         |       | 
| hgNearOk       | tinyint(4)   | NO   |     | 0       |       | 
| hgPbOk         | tinyint(4)   | NO   |     | 0       |       | 
| sourceName     | varchar(255) | NO   |     |         |       | 
| taxId          | int(11)      | NO   |     | 0       |       | 
+----------------+--------------+------+-----+---------+-------+

The clade table partitions the organisms into clades, so a categorized list of currently active genomes can be generated like this:

     select
       d.name, d.description, d.genome, d.scientificName,
       d.taxId, gc.clade
     from dbDb as d join genomeClade as gc
       on d.genome=gc.genome
     where d.active > 0
     order by clade, scientificName;

There's also a table called dbDbArch, which I was hoping stood for archaea. Sadly, no luck. It looks to mean archived, instead. I wasn't able to find an open MySQL DB for the archaeal genome browser (hints, anyone?), but they list the available organisms right on the home page, so no worries. Scraping that and applying some regex's will get you at least a table of organisms and database names which is all you really need.

With that, knowing an organism, we can figure out which database to ransack. The first thing we'll want to know is the configuration of the organism's genome. How many chromosomes (or plasmids, etc.) are there and what are their sizes? An HTTP request for the chromInfo table will do that for us. Both GET and POST seem to work.

http://genome.ucsc.edu/cgi-bin/hgTables?db=hg18&hgta_group=allTables&hgta_track=hg18&hgta_table=chromInfo&hgta_regionType=genome&hgta_outputType=primaryTable&hgta_doTopSubmit=

Try it. Now, let's break that down.

http://genome.ucsc.edu/cgi-bin/hgTables?
db=hg18
hgta_group=allTables
hgta_track=hg18
hgta_table=chromInfo
hgta_regionType=genome
hgta_outputType=primaryTable
hgta_doTopSubmit=

I'm not sure whether all fields shown here are necessary, but this seems to do the trick. Well, one issue is that along with the expected chromosomes 1 through 22, plus X, Y and M for mitochondrial, we get chr1_random, chr6_cox_hap1 and a bunch of weird things like that. What are these things?

The equivalent for prokaryotes looks like this:

http://archaea.ucsc.edu/cgi-bin/hgTables?
db=eschColi_K12
hgta_group=allTables
hgta_track=eschColi_K12
hgta_table=chromInfo
hgta_regionType=genome
hgta_outputType=primaryTable
hgta_doTopSubmit=

Now, let's get us some features, in this case refseq genes:

http://genome.ucsc.edu/cgi-bin/hgTables?
db=hg18
hgta_group=genes
hgta_track=refGene
hgta_table=refGene
hgta_regionType=genome
hgta_outputType=primaryTable
hgta_doTopSubmit=

Oddly, on the archaeal side, we need to ask for refSeq rather than refGene. Try that, too.

http://archaea.ucsc.edu/cgi-bin/hgTables?
db=eschColi_K12
hgta_group=genes
hgta_track=refSeq
hgta_table=refSeq
hgta_regionType=genome
hgta_outputType=primaryTable
hgta_doTopSubmit=

Now we can get chromosome information and gene locations for our choice of organism. Of course, we've only scratched the surface of the tracks available, as a click here or here will show. The FAQ entry on linking also has a few hints.

With all this nicely curated data available so easily over HTTP and even straight from the database, it's begging to be mashed up, recombined and reintegrated. The Table Browser is a great idea. It's just a simple database dump to tab-delimited text, but it's so much easier to work with than, say for example, SOAP/WS-*. Thanks, UCSC, for making such a useful resource available!