Monday, July 27, 2009

Getting sequence data out of NCBI Entrez

Thanks to a coworker, I finally found out how to get sequence data out of NCBI programmatically. The catch was that I wanted to get a chunk of sequence at a time, without needing to download the whole genome. Now, I can do that through NCBI's eutils. Yay! Here's a link to the key efetch help page.

First we can use the elink call to get a list of sequences (seems to return GI accession numbers) related to a genome project:

http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=genomeprj&id=217&db=nucleotide

I suppose you'll have to make a few more calls to figure out which sequence is which, but I happen to know the one I want is 15789340. So, getting a chunk of sequence is as simple as this:

efetch.fcgi?db=nucleotide&id=15789340&seq_start=528770&seq_stop=531343&rettype=fasta

You can also use refseq accession numbers instead of GIs:

efetch.fcgi?db=nucleotide&id=NC_002607&seq_start=528770&seq_stop=531343&rettype=fasta

You can even do tricky stuff like get a gene (in this case VNG1179C) in the reverse strand along with 70 nucleotides on either side.

efetch.fcgi?db=nucleotide&id=NC_002607&seq_start=888547&seq_stop=889397&strand=2&rettype=fasta

For more laughs, see my previous inept fumbling related to NCBI.