Friday, December 28, 2007

Hacking NCBI Entrez

NCBI's Entrez collection of databases holds mountains of bioinformatics data, but getting at it programmatically can be a little tricky. There are two ways to go. Entrez offers a web service called EUtilities, available through either XML-over-HTTP or SOAP. EUtilities sometimes provides a frustratingly minimal amount of information and does so very slowly. Another option is to use the CGI scripts that make up the web interface, which can usually return results in perfectly fine tabular text files. These are undocumented as far as I can tell, so the trick is figuring out how to call them.

Let's say our goal is to select a completely sequenced organism and download a list of its genes and their coordinates. First, we want the user to input an organism. Then we want to look up the given organism and allow the user to resolve any ambiguity.

Entrez EUtilities

EUtilities provides the esearch method, which helpfully returns primary IDs related to the given search terms. We can search for an organism in the genome project database like this:

(prefix omitted: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/)
esearch.fcgi?db=genomeprj&term=Halobacterium[orgn]

The [orgn] is a semi-documented way of telling Entrez to search for our keyword in the organism field. That gets you a list of matching IDs, like so:

<idlist>
<id>106</id>
<id>217</id>
</idlist>
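
As a rough sketch of this first step, the snippet below builds the esearch URL and pulls the IDs out of a response. The helper names eutilsUrl and parseIds are made up for illustration, and the regular expression stands in for a real XML parser. It assumes only that the response contains id elements like the ones shown above.

```javascript
// Base URL taken from the prefix noted above.
const EUTILS = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/";

// Hypothetical helper: build a URL for an EUtilities method such as "esearch".
// URLSearchParams percent-encodes the query, so [orgn] becomes %5Borgn%5D.
function eutilsUrl(method, params) {
  const query = new URLSearchParams(params).toString();
  return EUTILS + method + ".fcgi?" + query;
}

// Hypothetical helper: collect the text of every <id> element.
// The match is case-insensitive in case a response capitalizes the element names.
function parseIds(xml) {
  const ids = [];
  const pattern = /<id>\s*(\d+)\s*<\/id>/gi;
  let match;
  while ((match = pattern.exec(xml)) !== null) {
    ids.push(match[1]);
  }
  return ids;
}

const searchUrl = eutilsUrl("esearch", {
  db: "genomeprj",
  term: "Halobacterium[orgn]",
});
const ids = parseIds("<idlist>\n<id>106</id>\n<id>217</id>\n</idlist>");

console.log(searchUrl);
console.log(ids);
```

The same eutilsUrl helper would cover esummary, elink and efetch, since they differ only in the method name and parameters.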

OK, so now what? I need to ask the user which one he meant, so I have to query EUtilities again for a summary of each hit:

esummary.fcgi?db=genomeprj&id=217

For which you get something like:

<eSummaryResult>
<DocSum>
<Id>217</Id>
<Item Name="Organism_Name" Type="String">Halobacterium sp. NRC-1</Item>
<Item Name="Organism_Kingdom" Type="String">Archaea</Item>
<Item Name="Organism_Group" Type="String">Euryarchaeota</Item>
<Item Name="Center" Type="String">University of Massachusetts-Amherst, University of Washington</Item>
<Item Name="Sequencing_Status" Type="String">complete</Item>
<Item Name="Project_Type" Type="String">Genome sequencing</Item>
<Item Name="Genome_Size" Type="String">2</Item>
<Item Name="Number_of_Chromosomes" Type="String">1</Item>
<Item Name="Trace_Species_Code" Type="String"></Item>
<Item Name="Trace_Center_Name" Type="String"></Item>
<Item Name="Trace_Type_Code" Type="String"></Item>
<Item Name="Defline" Type="String">Chemoheterotrophic obligate extreme halophilic archeon</Item>
<Item Name="Number_of_Mitochondrion" Type="String"></Item>
<Item Name="Number_of_Plastid" Type="String"></Item>
<Item Name="Number_of_Plasmid" Type="String">2</Item>
<Item Name="Create_Date" Type="Date">1/01/01 00:00</Item>
<Item Name="Release_Date" Type="Date">1/01/01 00:00</Item>
<Item Name="Options" Type="String"></Item>
</DocSum>
</eSummaryResult>
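
To show the user something readable, the Item elements can be flattened into a plain object. This is only a sketch: parseDocSum is a made-up helper, the regular expression assumes the attribute order shown above (Name, then Type), and the sample is an abbreviated copy of the response.

```javascript
// Hypothetical helper: turn the <Item> elements of a DocSum into a plain object.
// A regular expression keeps the sketch short; a real XML parser would be sturdier.
function parseDocSum(xml) {
  const summary = {};
  const pattern = /<Item Name="([^"]+)" Type="[^"]*">([^<]*)<\/Item>/g;
  let match;
  while ((match = pattern.exec(xml)) !== null) {
    summary[match[1]] = match[2];
  }
  return summary;
}

// Abbreviated copy of the esummary response shown above.
const sample = [
  "<eSummaryResult>",
  "<DocSum>",
  "<Id>217</Id>",
  '<Item Name="Organism_Name" Type="String">Halobacterium sp. NRC-1</Item>',
  '<Item Name="Sequencing_Status" Type="String">complete</Item>',
  '<Item Name="Trace_Species_Code" Type="String"></Item>',
  '<Item Name="Number_of_Plasmid" Type="String">2</Item>',
  "</DocSum>",
  "</eSummaryResult>",
].join("\n");

const project = parseDocSum(sample);

// A one-line label to display when asking which project was meant.
const label = project.Organism_Name + " (" + project.Sequencing_Status + ")";
console.log(label);
```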

Once we do that for each project that matched our search, we can present the user with a choice and let him pick one of the projects. The next step is to use elink to link from the genome project to the related entries in the genome database.

elink.fcgi?dbfrom=genomeprj&id=217&db=genome

<eLinkResult>
<LinkSet>
<DbFrom>genomeprj</DbFrom>
<IdList>
<Id>217</Id>
</IdList>
<LinkSetDb>
<DbTo>genome</DbTo>
<LinkName>genomeprj_genome</LinkName>
<Link>
<Id>13234</Id>
</Link>
<Link>
<Id>166</Id>
</Link>
<Link>
<Id>165</Id>
</Link>
</LinkSetDb>
</LinkSet>
</eLinkResult>
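
Pulling the linked genome IDs out of that response might look something like the following. The helper name parseLinkedIds is an invention for this sketch, and the sample is a copy of the response above.

```javascript
// Hypothetical helper: extract the linked IDs from an eLinkResult.
// Only IDs wrapped in <Link> are collected, so the query ID listed
// under <IdList> is not mistaken for a result.
function parseLinkedIds(xml) {
  const ids = [];
  const pattern = /<Link>\s*<Id>(\d+)<\/Id>\s*<\/Link>/g;
  let match;
  while ((match = pattern.exec(xml)) !== null) {
    ids.push(match[1]);
  }
  return ids;
}

const linkXml = [
  "<eLinkResult>",
  "<LinkSet>",
  "<DbFrom>genomeprj</DbFrom>",
  "<IdList>",
  "<Id>217</Id>",
  "</IdList>",
  "<LinkSetDb>",
  "<DbTo>genome</DbTo>",
  "<LinkName>genomeprj_genome</LinkName>",
  "<Link>",
  "<Id>13234</Id>",
  "</Link>",
  "<Link>",
  "<Id>166</Id>",
  "</Link>",
  "<Link>",
  "<Id>165</Id>",
  "</Link>",
  "</LinkSetDb>",
  "</LinkSet>",
  "</eLinkResult>",
].join("\n");

const genomeIds = parseLinkedIds(linkXml);
console.log(genomeIds);
```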

The esummary method provides some data, including the length of the sequence.

esummary.fcgi?db=genome&id=13234

Finally, efetch can be used to get a full report on each chromosome/replicon/whatever. It returns results in ASN.1 or XML format, which you'll then have to parse.

efetch.fcgi?db=genome&retmode=xml&id=13234

Hacking Entrez's CGI scripts

There's a helpful, though incomplete, guide to Entrez URLs called the Entrez Link Helper. It's often possible to get exactly what you want this way more easily than through EUtilities. For example, my list of genes can be had through this URL:
(omitting prefix: http://www.ncbi.nlm.nih.gov/sites/)

entrez?db=genome&cmd=text&dopt=Protein+Table&list_uids=13234

The results can be parsed in about 3 lines. Not that I'm some kind of disgruntled anti-XML militant, but sometimes a text file is a lot easier to deal with. The following will get you a nice table of the available prokaryotic genomes:

http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi?view=1&dump=selected&p3=6:|7:

Another way to get info about a specific genomic sequence (chromosome or plasmid) is this:

entrez?db=genome&cmd=Text&dopt=Overview&list_uids=13234
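
For the tabular text results, parsing really is short. The sketch below assumes tab-delimited text with a single header row, which may not match what Entrez returns in every case. The helper name parseTable, the column names and the two sample rows are all made up for illustration.

```javascript
// Hypothetical helper: parse tab-delimited text whose first line is a header row.
function parseTable(text) {
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) {
    return [];
  }
  const header = lines[0].split("\t");
  return lines.slice(1).map((line) => {
    const fields = line.split("\t");
    const row = {};
    header.forEach((name, i) => {
      row[name] = fields[i];
    });
    return row;
  });
}

// Made-up sample data; the real column names and values will differ.
const text = "name\tstart\tend\tstrand\ngeneA\t100\t400\t+\ngeneB\t900\t450\t-\n";
const genes = parseTable(text);
console.log(genes);
```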

One thing I haven't figured out is how to do this and get parsable text rather than HTML back:

entrez?Db=genomeprj&Cmd=search&Term=%22Halobacterium%22%5BOrganism%5D&doptcmdl=brief

NCBI Resource Locator

Note that NCBI's Resource Locator, a RESTful and less heinous URL scheme, seems to have escaped my attention when I wrote this.

Sunday, December 23, 2007

Messaging

Messaging is hip and trendy these days. It's central to application integration and distributed computing, and may represent a way forward in tackling concurrency (see “communicating sequential processes”). Here are a couple of messaging resources:

Mule may be the future of the Gaggle. The Gaggle is a framework for integrating bioinformatics software and databases on which I've done some work. Gaggle is currently implemented on Java RMI. It really needs to move to a more language-neutral messaging protocol. A messaging approach should also make it easier for Gaggle to interoperate with web-service based systems like bioMoby and Taverna.

Friday, December 21, 2007

What's a blog with no links to other blogs?

Bioinformatics Programming
Random stuff
BTW, I have a feeling that the ratio of blog readers to blogs is below 1 and falling.

Wednesday, December 05, 2007

Prototyping and Dynamic languages

The nearest thing to heroin available at my local supermarket is Häagen-Dazs chocolate peanut butter ice cream. It's so good, I'm lucky I don't weigh 500 pounds.

Dynamic languages seem to be the flavor of the month in programming languages. And, why not? They enable fast development, concise code, and freedom from the rigid static type systems of Java or C#. Everyone, for example, seems to be raving about Ruby.

Javascript gets fewer raves and a lot of (sometimes well deserved) abuse. Leaving aside its well-documented faults, one nice thing about Javascript is that functions are truly first class citizens. Javascript's semantics regarding functions, complete with the ability to create closures, look cleaner to my eyes than Ruby's code blocks. Treating functions as data is a feature borrowed from languages like Scheme. From the more obscure Self language, Javascript borrows another advanced feature that - though elegant and useful - is often misunderstood: prototype based inheritance.
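
For anyone who hasn't seen first class functions and closures in action, here is a minimal sketch. The names makeCounter and applyTwice are just for illustration.

```javascript
// A function that returns a function. The inner function closes over `count`,
// which stays alive (and private) after makeCounter returns.
function makeCounter() {
  let count = 0;
  return function () {
    count += 1;
    return count;
  };
}

// Functions are values: they can be stored, passed as arguments, and called later.
function applyTwice(fn) {
  fn();
  return fn();
}

const counter = makeCounter();
const other = makeCounter();

const twice = applyTwice(counter); // counter has now been called two times
const first = other();             // other has its own, separate count

console.log(twice, first);
```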

Prototyping is an idea worth understanding, because it will help you think beyond traditional object oriented programming. Prototyping and class-based inheritance can be seen at a higher level as two variants of the same general concepts. Simply put, prototyping is inheritance without classes.

Classes = delegation + interfaces

Like class-based inheritance, prototyping is a shorthand form of delegation. When one object delegates to another, the first object may simply pass a method call on to its delegate or may perform some additional functionality before or after the call. Those familiar with design patterns will recognize the Decorator pattern.


Not to get too far off track, but a lot of the power of aspect oriented programming boils down to a type of decorator usually called an interceptor in that context. The common ingredients here are delegation and a common interface.
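
The decorator style of delegation can be sketched in a few lines of Javascript. The names here (withLogging, describe) are invented for the example. The point is only that the wrapper exposes the same interface as its delegate and does extra work around the call.

```javascript
// The delegate: a plain object with one method.
const employee = {
  describe() {
    return "writes code";
  },
};

// Hypothetical decorator: offers the same interface, passes the call on to
// its delegate, and adds behavior before and after the call.
function withLogging(delegate, log) {
  return {
    describe() {
      log.push("before describe");
      const result = delegate.describe();
      log.push("after describe");
      return result;
    },
  };
}

const log = [];
const decorated = withLogging(employee, log);
const description = decorated.describe();

console.log(description, log);
```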

To see that subclassing is a form of delegation, picture an instance of a Java class CodeMonkey, which is a subclass of Employee. Simple enough. Call a method on an instance of CodeMonkey and (conceptually) we first check if the method is defined in the class. If not, we delegate the handling of the call to the superclass Employee. Employee, in turn, can delegate to the class Object, which is the root of the inheritance hierarchy.

Subclassing defines an is-a relationship. A CodeMonkey is an Employee. In practical terms, this means that there is an implicit common interface implemented by both Employee and CodeMonkey, namely the interface of the superclass Employee. This is required for polymorphism. We have to be able to use a CodeMonkey anywhere we could have used an Employee.

There is something of a weird dichotomy between classes and objects. It doesn't seem weird because we're used to it, but think of all the awkwardness around reflection. That comes about because sometimes you want to treat a class as an object, while usually it's something quite different.

In Java, methods belong to the class, while instance variables belong to the object. Two instances can't have different implementations of a method like they can have different values for a member variable. Imagine, then, how all this might work in a language with no classes, only objects.

Prototypes, a shorthand for delegation

In Javascript, the only place for a method to live is on an object. This is natural because a function in Javascript is just another piece of data. Objects have members. Members are just data. They may be things like strings or integers or they may be functions. Since there are no classes there is no question of whether two objects can have different implementations of a method. Any two objects, whether or not one is a prototype of the other, can share a method, or share a common method signature with different implementations. That's a lot of flexibility, right there.
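
A small illustration, with made-up objects: two objects share one function as a method, while a third carries its own implementation under the same name.

```javascript
// A function is just a value, so two objects can share it as a method...
function greet() {
  return "Hi, I'm " + this.name;
}

const alice = { name: "Alice", greet: greet };
const bob = { name: "Bob", greet: greet };

// ...or an individual object can carry its own implementation of the same signature.
const carol = {
  name: "Carol",
  greet() {
    return "Carol here";
  },
};

const greetings = [alice, bob, carol].map((person) => person.greet());
console.log(greetings);
```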

In such a language, an object can't delegate method calls to its class or superclass. So, an object can only delegate to another object. That's the essence of prototype based inheritance. An object's prototype is just another object. That object may have its own prototype, forming a chain, just as with classes. The root of the prototype chain is the prototype of Object. Well, that's true unless you change it, which brings up another strange aspect of prototypes. The prototype chain, being just data, can be modified at runtime. Thinking about that for a while could really warp your brain cells.
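
Here is a sketch of a prototype chain being built and then rewired. It leans on Object.create, Object.getPrototypeOf and Object.setPrototypeOf, which come from later versions of the language than were current when this was written. The objects themselves are invented for the example.

```javascript
const employee = {
  describe() {
    return this.name + " is an employee";
  },
};

// codeMonkey has no describe() of its own; the call is delegated to its prototype.
const codeMonkey = Object.create(employee);
codeMonkey.name = "Sam";
const before = codeMonkey.describe();

// By default the chain ends at the prototype of Object.
const rootIsObject = Object.getPrototypeOf(employee) === Object.prototype;

// The chain is just data, so it can be rewired at runtime.
const contractor = {
  describe() {
    return this.name + " is a contractor";
  },
};
Object.setPrototypeOf(codeMonkey, contractor);
const after = codeMonkey.describe();

console.log(before, rootIsObject, after);
```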

So how do prototypes mix with dynamic languages? Polymorphism can be taken for granted in a dynamic language. By virtue of “duck typing”, we don't need to worry about explicitly defining common base classes or interfaces. If the method walks like a duck, we can call it. So, in most dynamic languages polymorphism is decoupled from type.
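
Duck typing in miniature, using two made-up objects that share nothing except a method name:

```javascript
// Neither object declares an interface, and they share no prototype beyond Object's.
const duck = {
  speak() {
    return "Quack";
  },
};
const robot = {
  speak() {
    return "Beep";
  },
};

// Any object with a callable `speak` member will do.
function makeItSpeak(thing) {
  if (typeof thing.speak !== "function") {
    throw new TypeError("object cannot speak");
  }
  return thing.speak();
}

const sounds = [duck, robot].map((thing) => makeItSpeak(thing));
console.log(sounds);
```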

Prototyping decouples inheritance from type. In a dynamically-typed language, why expend effort to define type hierarchies? Dynamic languages are supposed to free us from worrying excessively about types. Certainly, delegation is still worthwhile. Interfaces, though implicit, remain important. But what benefit do classes bring to the table?

Prototypes and dynamic languages

The popularity of class-based dynamic languages is due more to familiarity than function. Fluency in dynamic languages entails a shift in thinking of at least the same magnitude as going from procedural languages to OO. In a few years, building class hierarchies in dynamic languages may look as silly as the Fortran-written-in-Java or C++ foibles that were so common not too long ago.

Prototyping fits much better than classes with the ethos of dynamic languages. It provides the benefits of inheritance without the baggage of types. Like class based OO, prototypes are something you have to work with to get a feel for. Once you do, you'll appreciate their flexibility as a shorthand notation for delegation.

The Gang of Four advise us to favor composition over inheritance precisely because of the extra baggage carried along by subclassing. In the context of dynamic typing, prototype based inheritance offers a middle road, uncluttered by type hierarchies. Prototyping and dynamic languages - two great tastes that go great together!

Sunday, December 02, 2007

The web as a channel for structured data

Structured data on the web is really taking off these days. It takes lots of different forms: mash-ups, web services, AJAX, microformats, and the semantic web. The common element in these "web 2.0" technologies is structured data traveling over HTTP.

This is in contrast to HTML, which has structure of course, but is oriented towards page layout. In its emphasis on display, HTML loses the metadata which specifies, for example, that a certain bit of text is a person's last name, or the name of a city in an address, or a list of proteins involved in a specific metabolic process.

My experience with these ideas comes from my work on a Firefox extension called Firegoose. (Paper in BMC Bioinformatics) Firegoose attempts to provide users of biological data repositories on the web with an easy means to query those resources using local data and retrieve data for use in desktop analysis and visualization packages. In a way, Firegoose does on-demand data integration in the browser.

Browser extensions or plugins are a great way to combine the point-and-click ease of the browser with more powerful tools for working with data. Operator, for example, is a Firefox extension that reads standard microformats for basic data types like calendar events and contact information. So adding appointments to a calendar or contacts to an address book can be done by surfing to the appropriate web page and clicking a button.

Notable Firefox plug-ins:

Other potentially related stuff: