Friday, December 28, 2007

Hacking NCBI Entrez

NCBI's Entrez collection of databases holds mountains of bioinformatics data, but getting at it programmatically can be a little tricky. There are two ways to go. Entrez offers a web service called EUtilities, available either as XML over HTTP or as SOAP. EUtilities sometimes provides a frustratingly minimal amount of information and does so very slowly. Another option is to use the CGI scripts that make up the web interface, which can usually return results as perfectly fine tabular text files. These are undocumented as far as I can tell, so the trick is figuring out how to call them.

Let's say our goal is to select a completely sequenced organism and download a list of its genes and their coordinates. First, we want the user to input an organism. Then we look up the given organism and let the user resolve any ambiguity.

Entrez EUtilities

EUtilities provides the esearch method, which helpfully returns primary IDs related to the given search terms. We can search for an organism in the genome project database like this:

(prefix omitted:

The [orgn] is a semi-documented way of telling Entrez to search for our keyword in the organism field. That gets you a list of matching IDs, like so:
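As a rough sketch of this esearch step in Python: the snippet below builds the query URL and pulls the IDs out of the response. The database name `genomeprj` is my assumption for the genome project database (it may no longer exist under that name), the helper names are made up for illustration, and the sample response is hand-written with placeholder IDs rather than real Entrez output.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Public E-utilities base URL; the default database name below is an assumption.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(organism, db="genomeprj"):
    """Build an esearch URL that restricts the term to the organism field."""
    query = urlencode({"db": db, "term": f"{organism}[orgn]"})
    return f"{EUTILS}/esearch.fcgi?{query}"

def parse_ids(xml_text):
    """Pull the primary IDs out of an eSearchResult document."""
    root = ET.fromstring(xml_text)
    return [id_el.text for id_el in root.findall("./IdList/Id")]

# Hand-written response with placeholder IDs, for illustration only.
SAMPLE = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>111</Id><Id>222</Id></IdList>
</eSearchResult>"""

print(esearch_url("halobacterium"))
print(parse_ids(SAMPLE))
```

Note that `urlencode` percent-escapes the square brackets in the term, which the server decodes on its side.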


OK, so now what? I need to ask the user which one he meant, so I have to query EUtilities again for a summary of each hit:


For which you get something like:

<Item Name="Organism_Name" Type="String">Halobacterium sp. NRC-1</Item>
<Item Name="Organism_Kingdom" Type="String">Archaea</Item>
<Item Name="Organism_Group" Type="String">Euryarchaeota</Item>
<Item Name="Center" Type="String">University of Massachusetts-Amherst, University of Washington</Item>
<Item Name="Sequencing_Status" Type="String">complete</Item>
<Item Name="Project_Type" Type="String">Genome sequencing</Item>
<Item Name="Genome_Size" Type="String">2</Item>
<Item Name="Number_of_Chromosomes" Type="String">1</Item>
<Item Name="Trace_Species_Code" Type="String"></Item>
<Item Name="Trace_Center_Name" Type="String"></Item>
<Item Name="Trace_Type_Code" Type="String"></Item>
<Item Name="Defline" Type="String">Chemoheterotrophic obligate extreme halophilic archeon</Item>
<Item Name="Number_of_Mitochondrion" Type="String"></Item>
<Item Name="Number_of_Plastid" Type="String"></Item>
<Item Name="Number_of_Plasmid" Type="String">2</Item>
<Item Name="Create_Date" Type="Date">1/01/01 00:00</Item>
<Item Name="Release_Date" Type="Date">1/01/01 00:00</Item>
<Item Name="Options" Type="String"></Item>
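Turning those Item elements into something you can show the user is straightforward. Here is a minimal sketch, assuming the response wraps each hit in a DocSum element with an Id child (the usual esummary layout); the function names, the `genomeprj` database name, and the placeholder ID 111 are mine, and the sample only reuses a few of the fields shown above.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esummary_url(ids, db="genomeprj"):
    """Build an esummary URL for several IDs at once (db name is an assumption)."""
    query = urlencode({"db": db, "id": ",".join(ids)})
    return f"{EUTILS}/esummary.fcgi?{query}"

def parse_summaries(xml_text):
    """Map each DocSum's Id to a dict of its Item names and values."""
    root = ET.fromstring(xml_text)
    summaries = {}
    for doc in root.findall("DocSum"):
        # Empty elements have text None; normalize those to empty strings.
        items = {item.get("Name"): (item.text or "") for item in doc.findall("Item")}
        summaries[doc.findtext("Id")] = items
    return summaries

# Trimmed, hand-assembled sample; the Id value is a placeholder.
SAMPLE = """<eSummaryResult>
  <DocSum>
    <Id>111</Id>
    <Item Name="Organism_Name" Type="String">Halobacterium sp. NRC-1</Item>
    <Item Name="Sequencing_Status" Type="String">complete</Item>
    <Item Name="Trace_Type_Code" Type="String"></Item>
  </DocSum>
</eSummaryResult>"""

summaries = parse_summaries(SAMPLE)
# Keep only completely sequenced projects, since that is what we are after.
complete = [doc_id for doc_id, items in summaries.items()
            if items.get("Sequencing_Status") == "complete"]
for doc_id in complete:
    print(doc_id, summaries[doc_id]["Organism_Name"])
```

Passing all the IDs in one comma-separated request saves a round trip per hit, which matters given how slow the service can be.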

Once we do that for each project that matched our search, we can present the user with a choice and pick one of the projects. The next step is to use elink to link from the genome project to the related entries in the genome database.
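The elink step might look roughly like the following. Again, this is a sketch under assumptions: `genomeprj` and `genome` are my guesses at the database names, the helper names are hypothetical, and the sample response is hand-written with placeholder IDs just to show the shape of the LinkSet structure.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def elink_url(project_id, dbfrom="genomeprj", db="genome"):
    """Build an elink URL from one database to another (db names are assumptions)."""
    query = urlencode({"dbfrom": dbfrom, "db": db, "id": project_id})
    return f"{EUTILS}/elink.fcgi?{query}"

def parse_linked_ids(xml_text):
    """Return the target-database IDs, skipping the source ID echoed in IdList."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall("./LinkSet/LinkSetDb/Link/Id")]

# Hand-written sample: one project linking to three sequences (placeholder IDs).
SAMPLE = """<eLinkResult>
  <LinkSet>
    <DbFrom>genomeprj</DbFrom>
    <IdList><Id>111</Id></IdList>
    <LinkSetDb>
      <DbTo>genome</DbTo>
      <Link><Id>901</Id></Link>
      <Link><Id>902</Id></Link>
      <Link><Id>903</Id></Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>"""

print(elink_url("111"))
print(parse_linked_ids(SAMPLE))
```

The explicit path through LinkSetDb matters here, because a bare search for every Id element would also pick up the source project ID.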




The esummary method provides some data including length of the sequence.

Finally, efetch can be used to get a full report on each chromosome/replicon/whatever. It returns results in ASN.1 or XML format, which you'll then have to parse.
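For completeness, here is a small sketch of the efetch call. The `genome` database name and the placeholder ID are assumptions, `retmode` is the E-utilities parameter that selects the output format, and `fetch_report` is a hypothetical helper that just hands back the raw bytes, since what you do with them depends on the report's schema.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def efetch_url(seq_id, db="genome", retmode="xml"):
    """Build an efetch URL for one record (db name is an assumption)."""
    query = urlencode({"db": db, "id": seq_id, "retmode": retmode})
    return f"{EUTILS}/efetch.fcgi?{query}"

def fetch_report(seq_id, timeout=30):
    """Download the raw report; parsing is left to the caller."""
    with urlopen(efetch_url(seq_id), timeout=timeout) as response:
        return response.read()

# Only builds the URL; call fetch_report("901") to actually hit the network.
print(efetch_url("901"))
```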

Hacking Entrez's CGI scripts

There's a helpful, though incomplete, guide to Entrez URLs called the Entrez Link Helper. It's often possible to get exactly what you want this way more easily than through EUtilities. For example, my list of genes can be had through this URL:
(omitting prefix:
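Assuming the response is tab-delimited text with a header row, reading it takes very little code. The column names and rows below are made up for illustration (the real file's columns may differ), and `parse_gene_table` is a hypothetical helper.

```python
import csv
import io

# Made-up rows standing in for the real response; column names are hypothetical.
SAMPLE = "name\tstart\tend\tstrand\ngeneA\t100\t400\t+\ngeneB\t520\t900\t-\n"

def parse_gene_table(text):
    """Read tab-delimited text into dicts, converting coordinates to ints."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {**row, "start": int(row["start"]), "end": int(row["end"])}
        for row in reader
    ]

genes = parse_gene_table(SAMPLE)
for gene in genes:
    print(gene["name"], gene["start"], gene["end"], gene["strand"])
```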


The results can be parsed in about 3 lines. Not that I'm some kind of disgruntled anti-XML militant, but sometimes a text file is a lot easier to deal with.

The following will get you a nice table of the available prokaryotic genomes:

Another way to get info about a specific genomic sequence (chromosome or plasmid) is this:

One thing I haven't figured out is how to do this and get parsable text rather than HTML back.


NCBI Resource Locator

Note that NCBI's Resource Locator, a RESTful and less heinous URL scheme, seems to have escaped my attention when I wrote this.