Monday, October 27, 2008

WTF NCBI?

A previous post, Hacking NCBI Entrez, dealt with how to retrieve sequence information from NCBI's databases. That method seems to work for prokaryotes and for yeast, but fails for most other eukaryotes.

For mammals, efetch for genome XML gives back crap like this (for rat):

<gbseq_contig>join(NW_001084776.1:1..691014,gap(182895),NW_001084777.1:1..1914699,gap(182895),NW_001084778.1:1..26673,gap(182895),NW_001084779.1:1..2730,gap(182895),NW_001084780.1:1..61755,gap(182895),NW_001084781.1:1..20466,gap(182895),NW_001084782.1:1..657670,gap(182895),NW_001084783.1:1..55883,gap(182895),NW_001084784.1:1..9292,gap(182895),NW_001084785.1:1..10599,gap(182895),NW_001084786.1:1..14198,gap(182895),NW_001084787.1:1..3561,gap(182895),NW_001084788.1:1..106511,gap(182895),NW_001084789.1:1..21205827,gap(182895),NW_001084790.1:1..11152534,gap(182895),NW_001084791.1:1..6015389,gap(182895),NW_001084792.1:1..686425,gap(182895),NW_001084793.1:1..9344793)</gbseq_contig>

Or this, for the XML for the human Y chromosome. Totally useless. I take it I'm supposed to request each of the referenced sequences and parse out the regions for each? What a colossal pain in the ass!
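For what it's worth, pulling the pieces out of that join(...) expression is not hard. Here is a minimal sketch in Python, assuming the expression contains only plain `ACCESSION.VERSION:start..end` ranges and `gap(N)` entries like the rat example above; it does not handle `complement(...)` or gaps of unknown length. The names `parse_contig` and `total_length` are my own, not part of any NCBI or BioJava API.

```python
import re

# Matches either "ACCESSION.VERSION:start..end" or "gap(length)".
# Assumption: accessions are upper-case letters, digits and underscores.
SEGMENT_RE = re.compile(
    r"(?P<acc>[A-Z][A-Z0-9_]*(?:\.\d+)?):(?P<start>\d+)\.\.(?P<end>\d+)"
    r"|gap\((?P<gap>\d+)\)"
)


def parse_contig(contig):
    """Split a join(...) expression into ('seq', acc, start, end) and ('gap', length) tuples."""
    contig = contig.strip()
    if not (contig.startswith("join(") and contig.endswith(")")):
        raise ValueError("expected a join(...) expression")
    segments = []
    # Strip the leading "join(" and the trailing ")" before scanning.
    for match in SEGMENT_RE.finditer(contig[5:-1]):
        if match.group("gap") is not None:
            segments.append(("gap", int(match.group("gap"))))
        else:
            segments.append(
                ("seq", match.group("acc"), int(match.group("start")), int(match.group("end")))
            )
    return segments


def total_length(segments):
    """Length of the assembled sequence, counting gaps (coordinates are 1-based, inclusive)."""
    length = 0
    for segment in segments:
        if segment[0] == "gap":
            length += segment[1]
        else:
            length += segment[3] - segment[2] + 1
    return length


example = "join(NW_001084776.1:1..691014,gap(182895),NW_001084777.1:1..1914699)"
segments = parse_contig(example)
for segment in segments:
    print(segment)
print(total_length(segments))
```

That only gives you the list of accessions and ranges. Each referenced contig still has to be fetched separately and stitched together with the gaps.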

By the way, BioJava has a parser called GenbankXmlFormat, but its docs say, "Deprecated. Use org.biojavax.bio.seq.io.INSDseqFormat". What INSDseqFormat is or how it is supposed to replace GenbankXML is totally unclear.

4 comments:

  1. Some good, if dated, information about NCBI can be found in the NCBI Resource Guide. In particular, it has an explanation of the GenBank Flat File Format.

  2. Thanks to Wayne Matten at NCBI for the following hints: Entrez Links.

    Other possibilities for getting the specific information you mention include
    the files here:

    ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/

  3. Hello,

    I found this blog post about eFetch and the GBSeq_contig XML element. I have the same trouble that you mentioned - I would like to get to the entire sequence, not just this 'join' command.

    Did you ever find a way to do this easily through eFetch? (or other Genbank URL?)

    thank you,
    John Fowler
    Assoc. Professor, Oregon State University, Botany & Plant Pathology

  4. Hi John,

    I was interested mainly in getting the annotations (gene start and end locations), which end up being more easily acquired from the UCSC genome browser (see Spelunking in the UCSC Genome Browser). They have a nice feature called the table browser, which can be accessed programmatically.

    If you're just interested in sequence, I think NCBI will give that to you if you ask for a return type of fasta. A link such as this to the nucleotide database seems to work, but I haven't tested it much. Sometimes it seems easier to use the text output mode (like this) rather than bother with eutils.

    I also think I could have gotten what I wanted (the features) from refseq (for example, like this), but I'm sticking with UCSC for now.

    Most of this is what I've figured out by hacking around, as the docs are pretty thin for this stuff. I hope this helps.

    -chris

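The fasta suggestion in the last comment could be sketched roughly as follows. This only builds an efetch URL and makes no network call. It assumes the E-utilities `efetch.fcgi` endpoint with its `db`, `id`, `rettype`, `retmode`, `seq_start` and `seq_stop` parameters; the helper name `fasta_url` and the default database `nuccore` are choices made for this example, not something taken from the post.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession, start=None, stop=None, db="nuccore"):
    """Build an efetch URL asking for FASTA text, optionally for a sub-range."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    # Only restrict the range when both ends are given (1-based, inclusive).
    if start is not None and stop is not None:
        params["seq_start"] = start
        params["seq_stop"] = stop
    return EFETCH + "?" + urlencode(params)


# First contig referenced in the rat join(...) expression from the post.
url = fasta_url("NW_001084776.1", 1, 691014)
print(url)
```

Whether a given accession still resolves is not something this sketch checks, so treat the URL as a starting point for experimenting.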