Monday, October 27, 2008

WTF NCBI?

A previous post, Hacking NCBI Entrez, dealt with how to retrieve sequence information from NCBI's databases. That method seems to work for prokaryotes and for yeast, but fails for most other eukaryotes.

For mammals, efetch for genome XML gives back crap like this (for rat):

<gbseq_contig>join(NW_001084776.1:1..691014,gap(182895),NW_001084777.1:1..1914699,gap(182895),NW_001084778.1:1..26673,gap(182895),NW_001084779.1:1..2730,gap(182895),NW_001084780.1:1..61755,gap(182895),NW_001084781.1:1..20466,gap(182895),NW_001084782.1:1..657670,gap(182895),NW_001084783.1:1..55883,gap(182895),NW_001084784.1:1..9292,gap(182895),NW_001084785.1:1..10599,gap(182895),NW_001084786.1:1..14198,gap(182895),NW_001084787.1:1..3561,gap(182895),NW_001084788.1:1..106511,gap(182895),NW_001084789.1:1..21205827,gap(182895),NW_001084790.1:1..11152534,gap(182895),NW_001084791.1:1..6015389,gap(182895),NW_001084792.1:1..686425,gap(182895),NW_001084793.1:1..9344793)</gbseq_contig>

Or this, the XML for the human Y chromosome. Totally useless. I take it I'm supposed to request each of the referenced sequences and parse out the regions for each? What a colossal pain in the ass!
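If that really is the intended workflow, the first step would be splitting the join(...) string into its pieces. Here is a rough sketch in Python, not anything NCBI provides: the names parse_contig and assembled_length are made up for illustration, and it assumes the string contains only accession:start..end ranges and gap(n) entries like the rat example above (no complement(...) or other location operators).

```python
import re

# One piece of a contig join(): either a slice of a referenced accession
# (e.g. NW_001084776.1:1..691014) or a gap of a given length.
SEGMENT_RE = re.compile(
    r"(?P<acc>[A-Z_]+\d+(?:\.\d+)?):(?P<start>\d+)\.\.(?P<end>\d+)"
    r"|gap\((?P<gap>\d+)\)"
)

def parse_contig(contig):
    """Return ('seq', accession, start, end) and ('gap', length) tuples, in order."""
    segments = []
    for m in SEGMENT_RE.finditer(contig):
        if m.group("gap") is not None:
            segments.append(("gap", int(m.group("gap"))))
        else:
            segments.append(
                ("seq", m.group("acc"), int(m.group("start")), int(m.group("end")))
            )
    return segments

def assembled_length(segments):
    """Total length of the assembled sequence, counting gaps (coordinates are 1-based, inclusive)."""
    total = 0
    for seg in segments:
        if seg[0] == "gap":
            total += seg[1]
        else:
            total += seg[3] - seg[2] + 1
    return total

# Shortened version of the rat example above.
example = "join(NW_001084776.1:1..691014,gap(182895),NW_001084777.1:1..1914699)"
segments = parse_contig(example)
for seg in segments:
    print(seg)
print(assembled_length(segments))
```

Each ('seq', ...) tuple is one more sequence to fetch and trim, and each ('gap', n) is a stretch of unknown bases to fill in between them.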

By the way, BioJava has a parser called GenbankXmlFormat, but its docs say, "Deprecated. Use org.biojavax.bio.seq.io.INSDseqFormat". What INSDseqFormat is, or how it is supposed to replace GenbankXmlFormat, is totally unclear.