Friday, December 28, 2007

Hacking NCBI Entrez

NCBI's Entrez collection of databases holds mountains of bioinformatics data, but getting at it programmatically can be a little tricky. There are two ways to go. Entrez offers a web service called EUtilities, available through either XML-over-HTTP or SOAP. EUtilities sometimes provides a frustratingly minimal amount of information and does so very slowly. Another option is to use the CGI scripts that make up the web interface, which can usually return results in perfectly fine tabular text files. These are undocumented as far as I can tell, so the trick is figuring out how.

Let's say our goal is to select a completely sequenced organism and download a list of its genes and their coordinates. First, we want the user to input an organism. Then we want to look up the given organism and allow the user to resolve any ambiguity.

Entrez EUtilities

EUtilities provides the esearch method, which helpfully returns primary IDs related to the given search terms. We can search for an organism in the genome project database like this:

(prefix omitted: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/)
esearch.fcgi?db=genomeprj&term=Halobacterium[orgn]

The [orgn] is a semi-documented way of telling Entrez to search for our keyword in the organism field. That gets you a list of matching IDs, like so:

<idlist>
<id>106</id>
<id>217</id>
</idlist>
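Here's a minimal Ruby sketch of that first step. The helper names esearch_url and parse_id_list are made up for illustration, and the XML is pasted in from the example above rather than fetched, so nothing here touches the network. The tag-case handling is an assumption on my part, not something the EUtilities documentation promises.

```ruby
require 'uri'
require 'rexml/document'

EUTILS = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Build an esearch URL (esearch_url is a made-up helper name).
def esearch_url(db, term)
  EUTILS + "esearch.fcgi?" + URI.encode_www_form(db: db, term: term)
end

# Pull the IDs out of an esearch response as strings.
# Tag names are compared case-insensitively, on the assumption that
# the casing may differ from the excerpt shown in this post.
def parse_id_list(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//*")
              .select { |element| element.name.downcase == "id" }
              .map { |element| element.text.to_s.strip }
end

sample = <<~XML
  <idlist>
  <id>106</id>
  <id>217</id>
  </idlist>
XML

puts esearch_url("genomeprj", "Halobacterium[orgn]")
puts parse_id_list(sample).inspect
```

Fetching the URL for real would be a job for something like Net::HTTP; I've left that out so the sketch runs offline.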

OK, so now what? I need to ask the user which one he meant, so I have to query EUtilities again for a summary of each hit:

esummary.fcgi?db=genomeprj&id=217

For which you get something like:

<eSummaryResult>
<DocSum>
<Id>217</Id>
<Item Name="Organism_Name" Type="String">Halobacterium sp. NRC-1</Item>
<Item Name="Organism_Kingdom" Type="String">Archaea</Item>
<Item Name="Organism_Group" Type="String">Euryarchaeota</Item>
<Item Name="Center" Type="String">University of Massachusetts-Amherst, University of Washington</Item>
<Item Name="Sequencing_Status" Type="String">complete</Item>
<Item Name="Project_Type" Type="String">Genome sequencing</Item>
<Item Name="Genome_Size" Type="String">2</Item>
<Item Name="Number_of_Chromosomes" Type="String">1</Item>
<Item Name="Trace_Species_Code" Type="String"></Item>
<Item Name="Trace_Center_Name" Type="String"></Item>
<Item Name="Trace_Type_Code" Type="String"></Item>
<Item Name="Defline" Type="String">Chemoheterotrophic obligate extreme halophilic archeon</Item>
<Item Name="Number_of_Mitochondrion" Type="String"></Item>
<Item Name="Number_of_Plastid" Type="String"></Item>
<Item Name="Number_of_Plasmid" Type="String">2</Item>
<Item Name="Create_Date" Type="Date">1/01/01 00:00</Item>
<Item Name="Release_Date" Type="Date">1/01/01 00:00</Item>
<Item Name="Options" Type="String"></Item>
</DocSum>
</eSummaryResult>
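To show the user something readable, the Item elements can be flattened into a hash. This is only a sketch: summarize_docsum is a made-up helper name, and the sample is a trimmed copy of the response above.

```ruby
require 'rexml/document'

# Turn the <Item> elements of an eSummary DocSum into a Hash of
# name => text. Empty items come back as empty strings.
def summarize_docsum(xml)
  doc = REXML::Document.new(xml)
  summary = {}
  doc.elements.each("eSummaryResult/DocSum/Item") do |item|
    summary[item.attributes["Name"]] = item.text.to_s
  end
  summary
end

sample = <<~XML
  <eSummaryResult>
  <DocSum>
  <Id>217</Id>
  <Item Name="Organism_Name" Type="String">Halobacterium sp. NRC-1</Item>
  <Item Name="Sequencing_Status" Type="String">complete</Item>
  <Item Name="Number_of_Plasmid" Type="String">2</Item>
  <Item Name="Options" Type="String"></Item>
  </DocSum>
  </eSummaryResult>
XML

info = summarize_docsum(sample)
puts "#{info["Organism_Name"]} (#{info["Sequencing_Status"]})"
```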

Once we do that for each project that matched our search, we can present the user with a choice and pick one of the projects. The next step is to use elink to link from the genome project to the related entries in the genome database.

elink.fcgi?dbfrom=genomeprj&id=217&db=genome

<eLinkResult>
<LinkSet>
<DbFrom>genomeprj</DbFrom>
<IdList>
<Id>217</Id>
</IdList>
<LinkSetDb>
<DbTo>genome</DbTo>
<LinkName>genomeprj_genome</LinkName>
<Link>
<Id>13234</Id>
</Link>
<Link>
<Id>166</Id>
</Link>
<Link>
<Id>165</Id>
</Link>
</LinkSetDb>
</LinkSet>
</eLinkResult>
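The linked genome IDs can be collected the same way and turned into follow-up queries. Again a sketch only: linked_ids is a made-up helper name, and the sample is the elink response above with whitespace squeezed out.

```ruby
require 'rexml/document'

# Collect the linked IDs from an elink response, in document order.
# Only IDs under LinkSetDb/Link are taken, not the source ID in IdList.
def linked_ids(xml)
  ids = []
  doc = REXML::Document.new(xml)
  doc.elements.each("eLinkResult/LinkSet/LinkSetDb/Link/Id") do |id|
    ids << id.text.strip
  end
  ids
end

sample = <<~XML
  <eLinkResult>
  <LinkSet>
  <DbFrom>genomeprj</DbFrom>
  <IdList><Id>217</Id></IdList>
  <LinkSetDb>
  <DbTo>genome</DbTo>
  <LinkName>genomeprj_genome</LinkName>
  <Link><Id>13234</Id></Link>
  <Link><Id>166</Id></Link>
  <Link><Id>165</Id></Link>
  </LinkSetDb>
  </LinkSet>
  </eLinkResult>
XML

genome_ids = linked_ids(sample)
genome_ids.each { |id| puts "esummary.fcgi?db=genome&id=#{id}" }
```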

The esummary method provides some data, including the length of the sequence.

esummary.fcgi?db=genome&id=13234

Finally, efetch can be used to get a full report on each chromosome/replicon/whatever. It returns results in ASN.1 or XML format, which you'll then have to parse.

efetch.fcgi?db=genome&retmode=xml&id=13234

Hacking Entrez's CGI scripts

There's a helpful, though incomplete, guide to Entrez URLs called the Entrez Link Helper. It's often possible to get exactly what you want this way more easily than through EUtilities. For example, my list of genes can be had through this URL:
(omitting prefix: http://www.ncbi.nlm.nih.gov/sites/)

entrez?db=genome&cmd=text&dopt=Protein+Table&list_uids=13234

The results can be parsed in about 3 lines. Not that I'm some kind of disgruntled anti-XML militant, but sometimes a text file is a lot easier to deal with.

The following will get you a nice table of the available prokaryotic genomes:

http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi?view=1&dump=selected&p3=6:|7:

Another way to get info about a specific genomic sequence (chromosome or plasmid) is this:

entrez?db=genome&cmd=Text&dopt=Overview&list_uids=13234
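For tabular text like the Protein Table output, the parsing really is short. The sketch below assumes tab-separated columns with a header row, which I haven't verified against the real output. The sample rows and column names are hypothetical stand-ins, and parse_table is a made-up name.

```ruby
# Parse tab-delimited text into an array of hashes keyed by the header row.
# Assumes the first non-blank line is the header.
def parse_table(text)
  rows = text.lines.map(&:chomp).reject(&:empty?).map { |line| line.split("\t", -1) }
  header = rows.shift
  rows.map { |fields| header.zip(fields).to_h }
end

# Hypothetical sample; the real Protein Table columns may differ.
sample = "name\tstart\tend\tstrand\n" \
         "geneA\t100\t400\t+\n" \
         "geneB\t450\t900\t-\n"

genes = parse_table(sample)
genes.each do |gene|
  puts "#{gene["name"]}: #{gene["start"]}..#{gene["end"]} (#{gene["strand"]})"
end
```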

One thing I haven't figured out: how do you run a search like the following and get parsable text rather than HTML back?

entrez?Db=genomeprj&Cmd=search&Term=%22Halobacterium%22%5BOrganism%5D&doptcmdl=brief

NCBI Resource Locator

Note that NCBI's Resource Locator, a RESTful and less heinous URL scheme, seems to have escaped my attention when I wrote this.

Sunday, December 23, 2007

Messaging

Messaging is hip and trendy these days. It's central to application integration and distributed computing, and may represent a way forward in tackling concurrency (see “communicating sequential processes”). Here are a couple of messaging resources:

Mule may be the future of the Gaggle. The Gaggle is a framework for integrating bioinformatics software and databases on which I've done some work. Gaggle is currently implemented on top of Java RMI. It really needs to move to a more language-neutral messaging protocol. A messaging approach should also make it easier for Gaggle to interoperate with web-service-based systems like bioMoby and Taverna.

Friday, December 21, 2007

What's a blog with no links to other blogs?

Bioinformatics Programming
Random stuff
BTW, I have a feeling that the ratio of blog readers to blogs is below 1 and falling.

Wednesday, December 05, 2007

Prototyping and Dynamic languages

The nearest thing to heroin available at my local supermarket is Häagen-Dazs chocolate peanut butter ice cream. It's so good, I'm lucky I don't weigh 500 pounds.

Dynamic languages seem to be the flavor of the month in programming languages. And, why not? They enable fast development, concise code, and freedom from the rigid static type systems of Java or C#. Everyone, for example, seems to be raving about Ruby.

Javascript gets fewer raves and a lot of (sometimes well deserved) abuse. Leaving aside its well-documented faults, one nice thing about Javascript is that functions are truly first class citizens. Javascript's semantics regarding functions, complete with the ability to create closures, look cleaner to my eyes than Ruby's code blocks. Treating functions as data is a feature borrowed from languages like Scheme. From the more obscure Self language, Javascript borrows another advanced feature that - though elegant and useful - is often misunderstood: prototype-based inheritance.

Prototyping is an idea worth understanding, because it will help you think beyond traditional object oriented programming. Prototyping and class-based inheritance can be seen at a higher level as two variants of the same general concepts. Simply put, prototyping is inheritance without classes.

Classes = delegation + interfaces

Like class-based inheritance, prototyping is a shorthand form of delegation. When one object delegates to another, the first object may simply pass a method call on to its delegate or may perform some additional functionality before or after the call. Those familiar with design patterns will recognize the Decorator pattern.
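A minimal Ruby sketch of the decorator flavor of delegation. The class names are invented for illustration. Note that the two classes share nothing except an implicit interface, the greet method.

```ruby
# A plain object that does the real work.
class Greeter
  def greet(name)
    "Hello, #{name}"
  end
end

# A decorator: same interface, passes the call on to its delegate,
# then adds some behavior of its own after the call returns.
class ShoutingGreeter
  def initialize(delegate)
    @delegate = delegate
  end

  def greet(name)
    @delegate.greet(name).upcase + "!"
  end
end

plain = Greeter.new
loud = ShoutingGreeter.new(plain)

puts plain.greet("world")
puts loud.greet("world")
```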


Not to get too far off track, but a lot of the power of aspect oriented programming boils down to a type of decorator usually called an interceptor in that context. The common ingredients here are delegation and a common interface.

To see that subclassing is a form of delegation, picture an instance of a Java class CodeMonkey, which is a subclass of Employee. Simple enough. Call a method on an instance of CodeMonkey and (conceptually) we first check if the method is defined in the class. If not, we delegate the handling of the call to the superclass Employee. Employee, in turn, can delegate to the class Object, which is the root of the inheritance hierarchy.
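The same lookup can be poked at directly in Ruby, which exposes the chain a method call walks. This is a Ruby analogue of the Java example, not Java itself, and the method names are invented.

```ruby
class Employee
  def name
    "employee"
  end

  def work
    "shuffles paper"
  end
end

class CodeMonkey < Employee
  # Overrides Employee#work; name is left to the superclass.
  def work
    "writes code"
  end
end

monkey = CodeMonkey.new
puts monkey.work   # defined on CodeMonkey itself
puts monkey.name   # not defined on CodeMonkey, so the call falls through to Employee

# Ask Ruby where each method actually lives.
puts CodeMonkey.instance_method(:work).owner
puts CodeMonkey.instance_method(:name).owner

# The start of the chain that method lookup follows.
puts CodeMonkey.ancestors.first(2).inspect
```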

Subclassing defines an is-a relationship. A CodeMonkey is an Employee. In practical terms, this means that there is an implicit common interface implemented by both Employee and CodeMonkey, namely the interface of the superclass Employee. This is required for polymorphism. We have to be able to use a CodeMonkey anywhere we could have used an Employee.

There is something of a weird dichotomy between classes and objects. It doesn't seem weird because we're used to it, but think of all the awkwardness around reflection. That comes about because sometimes you want to treat a class as an object, while usually it's something quite different.

In Java, methods belong to the class, while instance variables belong to the object. Two instances can't have different implementations of a method like they can have different values for a member variable. Imagine, then, how all this might work in a language with no classes, only objects.

Prototypes, a shorthand for delegation

In Javascript, the only place for a method to live is on an object. This is natural because a function in Javascript is just another piece of data. Objects have members. Members are just data. They may be things like strings or integers or they may be functions. Since there are no classes there is no question of whether two objects can have different implementations of a method. Any two objects, whether or not one is a prototype of the other, can share a method, or share a common method signature with different implementations. That's a lot of flexibility, right there.
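Ruby is class-based, but its singleton methods give a taste of methods living on individual objects. This is a Ruby analogue, not how Javascript does it, and the object and method names are invented.

```ruby
alice = Object.new
bob = Object.new

# Same method name, different implementation on each object.
alice.define_singleton_method(:greet) { "Hi, I'm Alice" }
bob.define_singleton_method(:greet) { "Bob here" }

# A method is just data: one Proc can be attached to both objects.
# When called, self is whichever object the method was attached to.
shout = proc { greet.upcase }
alice.define_singleton_method(:shout, &shout)
bob.define_singleton_method(:shout, &shout)

puts alice.greet
puts bob.greet
puts alice.shout
puts bob.shout
```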

In such a language, an object can't delegate method calls to its class or superclass. So, an object can only delegate to another object. That's the essence of prototype-based inheritance. An object's prototype is just another object. That object may have its own prototype, forming a chain, just as in classes. The root of the prototype chain is the prototype of Object. Well, that's true unless you change it, which brings up another strange aspect of prototypes. The prototype chain, being just data, can be modified at runtime. Thinking about that for a while could really warp your brain cells.
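To make the chain concrete, here is a toy emulation in Ruby. It is a sketch of the idea only, not how any Javascript engine works, and ProtoObject and its methods are invented for the purpose.

```ruby
# A toy prototype object: a bag of slots plus a pointer to another object.
class ProtoObject
  attr_accessor :proto

  def initialize(proto = nil)
    @proto = proto
    @slots = {}
  end

  def set(name, value)
    @slots[name] = value
  end

  # Look in our own slots first, then delegate up the prototype chain.
  def lookup(name)
    return @slots[name] if @slots.key?(name)
    return @proto.lookup(name) if @proto
    raise NameError, "no slot named #{name}"
  end

  # Call a function-valued slot, passing this object as the receiver.
  def invoke(name, *args)
    lookup(name).call(self, *args)
  end
end

employee = ProtoObject.new
employee.set(:title, "employee")
employee.set(:describe, ->(me) { "Job title: #{me.lookup(:title)}" })

monkey = ProtoObject.new(employee)
monkey.set(:title, "code monkey")

puts employee.invoke(:describe)
puts monkey.invoke(:describe)   # describe comes from employee, title from monkey

# The chain is just data, so it can be rewired at runtime.
manager = ProtoObject.new
manager.set(:describe, ->(me) { "#{me.lookup(:title)}, now attending meetings" })
monkey.proto = manager
puts monkey.invoke(:describe)
```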

So how do prototypes mix with dynamic languages? Polymorphism can be taken for granted in a dynamic language. By virtue of “duck typing”, we don't need to worry about explicitly defining common base classes or interfaces. If the method walks like a duck, we can call it. So, in most dynamic languages polymorphism is decoupled from type.
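A quick Ruby illustration of duck typing, with invented class names: the two classes share no base class and declare no interface, yet both work anywhere a speak method is expected.

```ruby
class Duck
  def speak
    "Quack"
  end
end

class Robot
  def speak
    "Beep"
  end
end

# Anything that responds to speak will do; no type is ever checked.
def chorus(things)
  things.map(&:speak).join(" ")
end

puts chorus([Duck.new, Robot.new])
```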

Prototyping decouples inheritance from type. In a dynamically-typed language, why expend effort to define type hierarchies? Dynamic languages are supposed to free us from worrying excessively about types. Certainly, delegation is still worthwhile. Interfaces, though implicit, remain important. But what benefit do classes bring to the table?

Prototypes and dynamic languages

The popularity of class-based dynamic languages is due more to familiarity than function. Fluency in dynamic languages entails a shift in thinking of at least the same magnitude as going from procedural languages to OO. In a few years, building class hierarchies in dynamic languages may look as silly as the Fortran-written-in-Java or C++ foibles that were so common not too long ago.

Prototyping fits much better than classes with the ethos of dynamic languages. It provides the benefits of inheritance without the baggage of types. Like class-based OO, prototypes are something you have to work with to get a feel for. Once you do, you'll appreciate their flexibility as a shorthand notation for delegation.

The Gang of Four advise us to favor composition over inheritance precisely because of the extra baggage carried along by subclassing. In the context of dynamic typing, prototype-based inheritance offers a middle road, uncluttered by type hierarchies. Prototyping and dynamic languages - two great tastes that go great together!

Sunday, December 02, 2007

The web as a channel for structured data

Structured data on the web is really taking off these days. It takes lots of different forms: mash-ups, web services, AJAX, microformats, and the semantic web. The common element in these "web 2.0" technologies is structured data traveling over http.

This is in contrast to HTML, which has structure of course, but is oriented towards page layout. In its emphasis on display, HTML loses the metadata which specifies, for example, that a certain bit of text is a person's last name, or the name of a city in an address, or a list of proteins involved in a specific metabolic process.

My experience with these ideas comes from my work on a Firefox extension called Firegoose. (Paper in BMC Bioinformatics) Firegoose attempts to provide users of biological data repositories on the web with an easy means to query those resources using local data and retrieve data for use in desktop analysis and visualization packages. In a way, Firegoose does on-demand data integration in the browser.

Browser extensions or plugins are a great way to combine the point-and-click ease of the browser with more powerful tools for working with data. The Operator extension, for example, reads standard microformats for basic data types like calendar events and contact information. So adding appointments to a calendar or contacts to an address book can be done by surfing to the appropriate web page and clicking a button.

Notable Firefox plug-ins:

Other potentially related stuff:

Friday, November 30, 2007

13949712720901ForOSX

There's a campaign called “Vote For Java 6 On Leopard” to nag Apple into producing a Java 6 JDK for OS X. Somehow, it's supposed to help if you stick this in your blog: 13949712720901ForOSX

There's also Landon Fuller's project SoyLatte to port FreeBSD's port of the JDK to OS X. Fight the power: Donate here!

Personally, I plan to eat a pint of chocolate ice-cream every night and sleep in an extra 15 minutes every morning until Apple delivers a Java 6 JDK. Take that, Steve Jobs! Hya! 13949712720901ForOSX! Nyaaa!

Monday, November 26, 2007

Formatting Ruby source code in HTML

BTW, I used the syntax library to format the Ruby source code snippets to HTML. It works like this:

require 'rubygems'
require 'syntax/convertors/html'

code = File.read("my_nifty_program.rb")
convertor = Syntax::Convertors::HTML.for_syntax "ruby"
code_html = convertor.convert( code )
print code_html

You'll also need the corresponding CSS:

pre {
    background-color: #f1f1f3;
    border: 1px dashed #333333;
    padding: 10px;
    overflow: auto;
    margin: 4px 0px;
    width: 95%;
}

/* Syntax highlighting */
pre .normal {}
pre .comment { color: #999999; font-style: italic; }
pre .keyword { color: #006699; font-weight: bold; }
pre .method { color: #077; }
pre .class { color: #074; }
pre .module { color: #050; }
pre .punct { color: #000099; font-weight: bold; }
pre .symbol { color: #099; }
pre .string { color: #00CC00; }
pre .char { color: #00CC00; }
pre .ident { color: #333333; }
pre .constant { color: #0099CC; }
pre .regex { color: #00CC00; }
pre .number { color: #0099CC; }
pre .attribute { color: #5bb; }
pre .global { color: #7FB; }
pre .expr { color: #227; }
pre .escape { color: #277; }

I found this nifty information in an article called "Howto format ruby code for blogs".

Powersets in Ruby

Another nifty exercise in Ruby - this one generates all subsets of a given array. Maybe I should have used a set.

# Powerset
# find all subsets of a set

# recursively compute all subsets of the given set
# based on the idea that for any given subset, either an
# item is in the subset or it isn't
def powerset(set)
  return [[]] if set.empty?

  # split off the first item without modifying the caller's array
  # (shift would empty the array that was passed in)
  item = set.first
  rest = set.drop(1)

  # compute the powerset of the smaller list
  sps = powerset(rest)

  # for each subset, either re-add the item or don't
  result = []
  sps.each do |subset|
    result << subset
    result << (subset + [item])
  end
  result
end

set = %w{ a b c d }
subsets = powerset(set)

# sort by length of subset
subsets = subsets.sort_by { |subset| subset.length }

subsets.each do |subset|
  puts "[#{subset.join(", ")}]"
end
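For comparison, Ruby's built-in Array#combination gets to the same place with less ceremony. This is a sketch, and powerset_by_size is a made-up name. It builds the subsets size by size, so they come out already ordered by length.

```ruby
# All subsets of items, smallest first: every combination of size 0,
# then size 1, and so on up to the full list.
def powerset_by_size(items)
  (0..items.length).flat_map { |k| items.combination(k).to_a }
end

letters = %w{ a b c }
all_subsets = powerset_by_size(letters)

all_subsets.each do |subset|
  puts "[#{subset.join(", ")}]"
end
```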

Sunday, November 25, 2007

Permutations in Ruby

During the long weekend I decided to break it down geek style with some Ruby, because that's the way I roll. Or anyway, I'm starting to learn a little Ruby and here's a first little exercise. I'm not sure if it's officially sanctioned Ruby idiom, but it's a start.

# find and print all permutations of a list

# find permutations recursively based on the idea that
# for each item in the list, there is a set of permutations
# that start with that item followed by some permutation of
# the remaining items.
def permute(list)
  return [list.dup] if list.length <= 1

  permutations = []
  list.each_with_index do |item, i|
    # remove only the item at position i; list - [item] would drop
    # every copy of a repeated value
    rest = list[0...i] + list[(i + 1)..-1]
    permute(rest).each do |permutation|
      permutations << ([item] + permutation)
    end
  end
  permutations
end

# test our permute function
list = %w{ a b c d }
permute(list).each do |permutation|
  puts permutation.join(", ")
end
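As it turns out, Ruby arrays ship with a permutation method, so the exercise above is mostly for learning. A short sketch using the built-in:

```ruby
letters = %w{ a b c }

# Array#permutation with no argument yields every ordering of the array.
orderings = letters.permutation.to_a

orderings.each do |ordering|
  puts ordering.join(", ")
end
puts "#{orderings.length} permutations"
```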

Friday, November 23, 2007

A digithead's to-do list

Programming related stuff I want to learn, if I ever get time.

Thursday, November 22, 2007

More Swing Misery

Making a ListCellRenderer containing a word-wrapping JTextArea is apparently asking for trouble. The root of the problem is that there isn't a specified protocol for the negotiation between container and contained GUI components to establish size. What you want here is for the JList to compute its width independent of the JTextAreas in the cells. The JTextAreas then get their width from the JList and wrap appropriately. Finally, the JTextAreas now know their heights, so the JList can compute its height.

To laugh at my suffering see these forum threads:

Wednesday, November 21, 2007

Cruft and loathing in Java/Swing

Why am I writing a Swing app in 2007? That's what I'm asking myself. It's a conspiracy, I tell you, involving the dreaded secret society the FuglyPLAF Posse. Anyway, onward to more pissing and moaning...

Take the humble MouseEvent (alias java.awt.event.MouseEvent). The integer constants on MouseEvent include:

  • BUTTON1
  • BUTTON1_MASK
  • BUTTON1_DOWN_MASK

BUTTON1 is for use with the getButton() method, which (you'd think) might return the button being pressed when the event was generated. It actually does that on the Apple Java 5 JVM. On the Sun JVMs, apparently, getButton() returns the button "whose state has changed". If you're holding down a mouse button and dragging, getButton() lamely returns zero. Thanks for nothing.

BUTTON1_MASK is for use with getModifiers(). According to the JavaDocs, it's now recommended to use BUTTON1_DOWN_MASK which goes with getModifiersEx().

And, before you can say "Type-safe Enumeration", stop to ponder how many constants are needed to represent mouse buttons. How about none? Wouldn't a plain integer do just fine? How about a boolean function: event.buttonDown(int button)? OK, maybe it's useful to expose the bitmasking of the modifiers to the user, but the resulting mess argues that it would have been more useful not to.

Three different constants for a concept with all the complexity of mouse buttons? I sincerely hope someone was taken out and given an over-the-head turbo-wedgie for that one.