Digithead's Lab Notebook: WTF NCBI?

Monday, October 27, 2008

WTF NCBI?

A previous post, Hacking NCBI Entrez, dealt with how to retrieve sequence information from NCBI's databases. That method seems to work for prokaryotes and for yeast, but fails for most other eukaryotes.

NCBI's data formats (as ASN.1)

For mammals, efetch for genome XML gives back crap like this (for rat):

<gbseq_contig>join(NW_001084776.1:1..691014,gap(182895),NW_001084777.1:1..1914699,gap(182895),NW_001084778.1:1..26673,gap(182895),NW_001084779.1:1..2730,gap(182895),NW_001084780.1:1..61755,gap(182895),NW_001084781.1:1..20466,gap(182895),NW_001084782.1:1..657670,gap(182895),NW_001084783.1:1..55883,gap(182895),NW_001084784.1:1..9292,gap(182895),NW_001084785.1:1..10599,gap(182895),NW_001084786.1:1..14198,gap(182895),NW_001084787.1:1..3561,gap(182895),NW_001084788.1:1..106511,gap(182895),NW_001084789.1:1..21205827,gap(182895),NW_001084790.1:1..11152534,gap(182895),NW_001084791.1:1..6015389,gap(182895),NW_001084792.1:1..686425,gap(182895),NW_001084793.1:1..9344793)</gbseq_contig>

Or this XML for the human Y chromosome. Totally useless. I take it I'm supposed to request each of the referenced sequences and parse out the regions for each? What a colossal pain in the ass!
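For what it's worth, the join(...) expression itself is easy enough to pick apart. Here is a minimal Python sketch that splits it into referenced regions and gaps. The function names (parse_contig, total_length) are made up for illustration, and it assumes the contig only contains plain ACCESSION:start..end ranges and gap(N) entries like the rat example above. Anything else, such as complement(...), is rejected rather than guessed at.

```python
import re

# One element of a join(...) list: "ACCESSION:start..end" or "gap(length)".
_PIECE = re.compile(
    r"(?P<acc>[A-Za-z0-9_.]+):(?P<start>\d+)\.\.(?P<end>\d+)"
    r"|gap\((?P<gap>\d+)\)"
)


def parse_contig(contig):
    """Split a GBSeq_contig join(...) string into a list of pieces.

    Each piece is ("seq", accession, start, end) with 1-based inclusive
    coordinates, or ("gap", length).
    """
    contig = contig.strip()
    if not (contig.startswith("join(") and contig.endswith(")")):
        raise ValueError("expected a join(...) expression")
    body = contig[len("join("):-1]

    pieces = []
    for item in body.split(","):
        match = _PIECE.fullmatch(item.strip())
        if match is None:
            # Assumption: only simple ranges and numeric gaps are handled.
            raise ValueError("unrecognized element: %r" % item)
        if match.group("gap") is not None:
            pieces.append(("gap", int(match.group("gap"))))
        else:
            pieces.append(
                ("seq", match.group("acc"),
                 int(match.group("start")), int(match.group("end")))
            )
    return pieces


def total_length(pieces):
    """Length of the assembled sequence, counting gaps."""
    length = 0
    for piece in pieces:
        if piece[0] == "gap":
            length += piece[1]
        else:
            _, _, start, end = piece
            length += end - start + 1
    return length


example = "join(NW_001084776.1:1..691014,gap(182895),NW_001084777.1:1..1914699)"
pieces = parse_contig(example)
```

Each ("seq", ...) piece still has to be fetched separately, which is the painful part. This only gets you the list of what to ask for.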

By the way, BioJava has a parser called GenbankXmlFormat, but its docs say, "Deprecated. Use org.biojavax.bio.seq.io.INSDseqFormat". What INSDseqFormat is, or how it is supposed to replace GenbankXML, is totally unclear.

4 comments:

Christopher Bare, 2/05/2009 3:33 PM
Some good, if dated, information about NCBI can be found in the NCBI Resource Guide. See especially the explanation of the GenBank Flat File Format.
Christopher Bare, 2/05/2009 3:44 PM
Thanks to Wayne Matten at NCBI for the following hints: Entrez Links.

Other possibilities for getting the specific information you mention include the files here:

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
John Fowler, 2/25/2009 4:46 PM
Hello,

I found this blog post about eFetch and the GBSeq_contig XML element. I have the same trouble that you mentioned - I would like to get to the entire sequence, not just this 'join' command.

Did you ever find a way to do this easily through eFetch? (or other Genbank URL?)

thank you,
John Fowler
Assoc. Professor, Oregon State University, Botany & Plant Pathology
Christopher Bare, 2/27/2009 12:41 AM
Hi John,

I was interested mainly in getting the annotations (gene start and end locations), which end up being more easily acquired from the UCSC genome browser (see Spelunking in the UCSC Genome Browser). They have a nice feature called the table browser, which can be accessed programmatically.

If you're just interested in sequence, I think NCBI will give that to you if you ask for a return type of fasta. A link such as this to the nucleotide database seems to work, but I haven't tested it much. Sometimes it seems easier to use the text output mode (like this) rather than bother with eutils.
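As a rough sketch of the fasta request, something like the following Python builds an efetch URL. The helper name efetch_fasta_url is made up for illustration, and the choice of db="nuccore" as the name of the nucleotide database is an assumption on my part, so check it against the current eutils docs. The seq_start and seq_stop parameters are optional and ask for just a sub-range.

```python
from urllib.parse import urlencode

# Base URL of the efetch utility in NCBI's E-utilities.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession, start=None, end=None, db="nuccore"):
    """Build an efetch URL asking for FASTA text for one accession.

    start and end are 1-based inclusive coordinates; if both are given,
    only that region is requested. db="nuccore" is an assumption.
    """
    params = {
        "db": db,
        "id": accession,
        "rettype": "fasta",  # return type: FASTA sequence
        "retmode": "text",   # plain text rather than XML
    }
    if start is not None and end is not None:
        params["seq_start"] = start
        params["seq_stop"] = end
    return EFETCH + "?" + urlencode(params)


url = efetch_fasta_url("NW_001084776.1", 1, 691014)
```

The resulting URL can be pasted into a browser or handed to urllib.request.urlopen. I haven't tested it against every kind of record.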

I also think I could have gotten what I wanted (the features) from refseq (for example, like this), but I'm sticking with UCSC for now.

Most of this is what I've figured out by hacking around, as the docs are pretty thin for this stuff. I hope this helps.

-chris