Monday, July 27, 2009

Getting sequence data out of NCBI Entrez

Thanks to a coworker, I finally found out how to get sequence data out of NCBI programmatically. The catch was that I wanted to get a chunk of sequence at a time, without needing to download the whole genome. Now, I can do that through NCBI's eutils. Yay! Here's a link to the key efetch help page.

First we can use the elink call to get a list of sequences (seems to return GI accession numbers) related to a genome project:

I suppose you'll have to make a few more calls to figure out which sequence is which, but I happen to know the one I want is 15789340. So, getting a chunk of sequence is as simple as this:


You can also use refseq accession numbers instead of GIs:


You can even do tricky stuff like get a gene (in this case VNG1179C) in the reverse strand along with 70 nucleotides on either side.


For more laughs, see my previous inept fumbling related to NCBI.