Showing posts with label gaggle. Show all posts

Thursday, December 10, 2009

Microformats

Jeff Atwood, who writes the well-known Coding Horror blog, took on the topic of Microformats recently. His misguided comments about the presumed hackiness of overloading CSS classes with semantic meaning (actually their intended purpose) had people quoting the HTML spec:

The class attribute, on the other hand, assigns one or more class names to an element; the element may be said to belong to these classes. A class name may be shared by several element instances. The class attribute has several roles in HTML:
  • As a style sheet selector (when an author wishes to assign style information to a set of elements).
  • For general purpose processing by user agents.

Browsers work great for navigation and presentation, but we can only really compute with structured data. Microformats combine the virtues of both.

There are at least a couple of ways in which the ability to script interaction with web applications comes in handy. For starters, microformats are a huge advance compared to screen-scraping. The fact that so many people suffered through the hideous ugliness of screen-scraping proves that there must be some utility to be had there.

Also, web-based data sources have a browser-based front-end and often expose a web service as well. Microformats link these together. A user can find records of interest by searching in the browser, and embedded microformats then allow the automated construction of a web service call to retrieve the data in structured form.
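To make that last step concrete, here is a minimal sketch of going from class-annotated markup to a web service call. The class names (gene, organism, identifier), the sample values, and the example.org endpoint are all made up for illustration; they don't come from any published microformat or real service.

```javascript
// Hypothetical markup: the class names and values are invented for this sketch.
const html = `
<div class="gene">
  <span class="organism">Halobacterium salinarum</span>
  <span class="identifier">VNG1179C</span>
</div>`;

// Collect the text of every element whose class attribute contains className.
// A regex is enough for this flat snippet; real pages need an HTML parser.
function textByClass(markup, className) {
  const pattern = new RegExp(
    `<[a-z]+[^>]*class="[^"]*\\b${className}\\b[^"]*"[^>]*>([^<]*)<`, "g");
  return Array.from(markup.matchAll(pattern), (m) => m[1].trim());
}

// Build a web service URL from the extracted values.
// The endpoint is a placeholder, not a real service.
function buildServiceUrl(markup) {
  const url = new URL("https://example.org/service/genes");
  url.searchParams.set("organism", textByClass(markup, "organism")[0]);
  url.searchParams.set("id", textByClass(markup, "identifier").join(","));
  return url.toString();
}

console.log(buildServiceUrl(html));
```

The point is only that the same page a person browses carries enough structure for a script to ask for the data again in structured form.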

Microformats aren't anywhere near the whole answer. But, the real question is how to do data integration at web scale using the web as a channel for structured data.

See also

Friday, March 20, 2009

More Hacking NCBI

Writing scripts to interface with NCBI's web site has its challenges. Getting data from the UCSC genome browser is simpler.

If you need a list of complete genomes, that can be had from the NCBI Genome database. One form of list is the genlist.cgi script. The type parameter seems to be a flag that limits the list to chromosomes, plasmids, or organelle specific sequences. The name parameter seems to be there only for looks. So far, I haven't figured out how to make genlist spit out either XML or text.

Two other scripts can produce text output, lproks and leuks.

These two can be scripted using parameters like these: view=1 dump=selected p3=11:|12:Green Algae. This information is also available by ftp from ftp://ftp.ncbi.nih.gov/genomes/genomeprj/. There are 3 lproks.txt files, which look to correspond to the three tabs Organism info, Complete genomes, Genomes in progress. lproks_1.txt is the one we want. There's a lot of good information in the ftp directories to plunder.
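As a rough sketch, the parameters above can be turned into a properly encoded query string like this. The base URL is a placeholder, since the script's actual address isn't given here, and buildDumpUrl is just a helper name I made up.

```javascript
// Placeholder address (assumption): substitute the real location of the script.
const BASE_URL = "https://example.org/lproks.cgi";

// Encode a plain object of parameters as a query string and append it.
function buildDumpUrl(baseUrl, params) {
  const query = new URLSearchParams(params);
  return `${baseUrl}?${query.toString()}`;
}

// The parameter names and values are the ones quoted in the post.
const dumpUrl = buildDumpUrl(BASE_URL, {
  view: "1",
  dump: "selected",
  p3: "11:|12:Green Algae",
});

console.log(dumpUrl);
```

Letting URLSearchParams do the encoding matters here because the p3 value contains colons, a pipe, and a space, none of which are safe to paste into a URL raw.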

There seems to be yet a third script: GenomesGroup.cgi. This one is linked from the Virus genomes page.

If I really wanted to suffer, I'd look into NCBI's source. Does anyone know where the source of lproks.cgi or genlist.cgi is? Is that part of the NCBI C++ Toolkit? (which is on macports here.) Maybe it's buried in NCBI's ftp site? Maybe I should ask the NCBI Information Engineering Branch? Maybe I need to start doing something more productive!

Tuesday, December 09, 2008

Bioinformatics as a Queryable Knowledge Map: the Pygr Project

Pygr is a hypergraph database in Python with applications in bioinformatics written by Christopher Lee, a faculty member at UCLA. There's a 30 minute video of a talk about Pygr and a bunch of other resources on the Lee Lab website and Lee's thinking bioinformatics blog.

Thesis: Hypergraphs are a general model for bioinformatics and Python’s core models are already a good model of Bioinformatics Data
  • Sequence: protein and nucleic acid sequences 
  • Mapping / Graphs: alignment, annotation 
  • Attributes: schema, i.e. relations between data 
  • Namespace (import): the ontology of all bioinformatics data 
Pygr aims to show that these Pythonic patterns are a general and scalable solution for bioinformatics.

The general idea is not entirely different from the data types behind Gaggle, especially in the emphasis on basic data structures without a heavy semantic component.

Dr. Lee is also writing a textbook on probabilistic inference.

Saturday, December 06, 2008

Dynamic Fusion of Web Data

I happened across a very cool project on web data integration at the University of Leipzig. Their paper Dynamic Fusion of Web Data is worth a look. They're working towards a theory of on-the-fly data integration for mashup applications that they refer to as dynamic data fusion. Data integration in mashups is dynamic in that it occurs at runtime. This provides for a pay-as-you-go model, rather than a large up-front semantic mapping task that limits the scalability of traditional data integration methods like data warehouses.

They describe mashups as workflow-like. Do they mean mashups are programmatic as opposed to declarative? In place of SQL, this group's iFuice system uses a scripting language with "set operations (e.g., union, intersection, and difference) and data transformation (e.g., fuse, aggregate) which can be used to post-process query results". Other key features are instance-level mapping and accommodation of structured and unstructured data.
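To get a feel for what post-processing query results with set operations and a fuse step might look like, here is a small sketch. These helpers are my own illustration and are not iFuice's actual scripting language; in particular, "fuse" is interpreted here as merging records that share an id, which is an assumption.

```javascript
// Set operations over lists of identifiers (duplicates are dropped).
function union(a, b) {
  return [...new Set([...a, ...b])];
}

function intersection(a, b) {
  const inB = new Set(b);
  return [...new Set(a)].filter((x) => inB.has(x));
}

function difference(a, b) {
  const inB = new Set(b);
  return [...new Set(a)].filter((x) => !inB.has(x));
}

// Merge records from several sources by their id field.
// Later sources add new fields but never overwrite earlier values.
function fuse(...sources) {
  const merged = new Map();
  for (const records of sources) {
    for (const record of records) {
      const existing = merged.get(record.id) || {};
      merged.set(record.id, { ...record, ...existing });
    }
  }
  return [...merged.values()];
}

// Made-up query results from two sources.
const sourceA = [{ id: "g1", name: "alpha" }, { id: "g2", name: "beta" }];
const sourceB = [{ id: "g2", name: "BETA", score: 0.9 }, { id: "g3", score: 0.4 }];

const idsA = sourceA.map((r) => r.id);
const idsB = sourceB.map((r) => r.id);
console.log(union(idsA, idsB), intersection(idsA, idsB), difference(idsA, idsB));
console.log(fuse(sourceA, sourceB));
```

Note that everything operates on instances (records and their ids) rather than on schemas, which is the instance-level flavor the paper emphasizes.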

This definitely gets at what Firegoose is good for - using the web as a channel for structured data - an approach that does for data integration what loose coupling does for software. Firegoose, part of the Gaggle framework, is a toolbar for Firefox that allows data to be exchanged between desktop software and the web. Firegoose can read microformats, call web services, query databases, or even perform nasty dirty screen scraping. Unlike a mashup, data integration in Firegoose and Gaggle requires user participation, although the user never deals with schemas, only instances of the Gaggle data types - mainly lists of identifiers, matrices of numeric data, networks, and tuples. The identifiers serve in a role somewhat analogous to primary keys.
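Here is a rough sketch of how identifiers can act like primary keys across those data types. The object shapes and field names below are hypothetical stand-ins I chose for illustration, not Gaggle's actual classes, and the values are made up.

```javascript
// Hypothetical plain-object shapes for a list of identifiers and a matrix.
const nameList = { type: "list", names: ["g1", "g3"] };

const matrix = {
  type: "matrix",
  rowNames: ["g1", "g2", "g3"],
  columnNames: ["t0", "t1"],
  values: [
    [0.1, 0.5],
    [1.2, 0.9],
    [-0.3, 0.0],
  ],
};

// Keep only the matrix rows whose identifier appears in the list,
// much like joining two tables on a key.
function selectRows(m, list) {
  const wanted = new Set(list.names);
  const keep = [];
  m.rowNames.forEach((name, i) => {
    if (wanted.has(name)) keep.push(i);
  });
  return {
    ...m,
    rowNames: keep.map((i) => m.rowNames[i]),
    values: keep.map((i) => m.values[i]),
  };
}

const selected = selectRows(matrix, nameList);
console.log(selected.rowNames, selected.values);
```

The user picks the list and the matrix; the only thing the two need to agree on is the identifiers themselves.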

More papers in a similar vein

Tuesday, December 02, 2008

Browsing genomes

I may as well come clean and admit that I'm developing a genome browser. What? Another genome browser? Why? You may well ask these questions. Well, it's a long story. But here is a completely non-exhaustive list of existing genome browsers.

Note: updated in Sept. 2009 to reflect the fact that everyone and their uncle built a genome browser this past couple of years. See Brother, can you spare a genome browser?

Note: updated again in May of 2010 and again in Feb 2011 to add Savant.

Monday, December 01, 2008

UCSC Genome Browser

A while back, I wrote a little hack to download and parse genome data from NCBI, but was flummoxed by NCBI's format for eukaryotes. A couple of local bioinformatics gurus directed me to UCSC as an alternate data source. UCSC's Genome Browser provides a nice interface to its underlying data through a Table Browser. The main genome browser has data for eukaryotes, while archaea (and other prokaryotes) are in a separate project. The Table Browser for the archaeal genome browser is a little tricky to find, but it's there.

Friday, November 28, 2008

Better color chooser for Java Swing

I came across a color chooser that kicks the Swing JColorChooser's booty. It looks great and supports transparency. It's available at colorchooser.dev.java.net and on the developer's blog. It will soon be used in my genome browser. Thanks, Jeremy!

Tuesday, November 18, 2008

Java in Firefox extension hosed again

A full-featured browser, an XUL front end, and the wealth of libraries available in Java make for a powerful and flexible combination. The browser extension capability of Firefox, along with LiveConnect, has been used by at least three extensions:

Most of what I've figured out about using Java from an extension came from Simile's David Huynh. Sadly, development of Piggy Bank has now "quiesced".

I don't know about the others, but Firegoose is hosed by the latest Java 6 update 10. Apparently, Java 6.10 introduces some significant changes into LiveConnect and the Java browser plugin. It's certainly good that Java in the browser is getting some attention, but I wish Java in a Firefox extension was a supported and regression tested use case (see whining here). The fact that it's such an arcane, unsupported and brittle hack is holding back what could otherwise be a nice technique.

Interest in Java in Firefox extensions appears to exist according to these posts in the MozillaZine Extension Development forum:

First Problem: The error I get appears to happen when reflectively instantiating a Java array and looks like this:

Error calling method on NPObject!
[plugin exception: java.lang.IllegalArgumentException: No method
found matching name newInstance and arguments
[com.sun.java.browser.plugin2.liveconnect.v1.JavaNameSpace,
java.lang.Integer]].

Instantiating the array through reflection was, itself, a work-around for another LiveConnect issue with type conversion between Javascript arrays and Java arrays. It's barfing on the call to java.lang.reflect.Array.newInstance below:

// from http://simile.mit.edu/repository/java-firefox-extension/firefox/chrome/content/scripts/browser-overlay.js
_toJavaUrlArray: function(a) {
 var urlArray = java.lang.reflect.Array.newInstance(java.net.URL, a.length);
 for (var i = 0; i < a.length; i++) {
  var url = a[i];
  java.lang.reflect.Array.set(
   urlArray,
   i,
   (typeof url == "string") ? new java.net.URL(url) : url
  );
 }
 return urlArray;
}

Update 1: First problem solved easily enough.

var dummyUrl = new java.net.URL("http://gaggle.systemsbiology.org");
var urlArray = java.lang.reflect.Array.newInstance(dummyUrl.getClass(), a.length);

Now, on to more hideousness:

Error calling method on NPObject!
[plugin exception: java.security.AccessControlException: access denied
(java.lang.RuntimePermission createClassLoader)].

...caused by trying to instantiate a URLClassLoader. The next-generation Java Plug-in, included in update 10, makes changes to the security policy such that calls from Javascript to Java are uniformly treated as untrusted.

Update 2: A Work-around!

A post on the java.net forum has a work-around. You can disable the "next-generation plug-in" through the Java control panel. Under the Advanced tab, open Java Plug-in, deselect Enable the next-generation Java Plug-in, then, restart Firefox. There is a bug filed whose comments seem to suggest that it will be addressed in a future release of the Java Plug-in.

Update 3: According to this thread on java.net, a fix is on the way in Java SE 6 update 12. Thanks, Sun!

More references:

Tuesday, June 24, 2008

Firefox extension security

Firefox 3 takes a couple of steps to make extensions a little more secure. In general, extensions run with full privileges, so this is important. The extension update mechanism now requires either SSL or digital signatures for both the update.rdf file and the xpi file.
If you want to bypass these restrictions, open the URL "about:config" and create a preference called extensions.checkUpdateSecurity whose value is set to false. This can be useful in testing, but is discouraged in practice.
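If you'd rather not click through about:config every time, the same preference can be set from a user.js file in the profile directory. This is a sketch assuming a throwaway profile used only for testing; it is a preferences fragment read by Firefox, not a script to run on its own.

```javascript
// user.js in a test profile only (assumption: a profile you don't browse with).
// Turns off the update security check described above.
user_pref("extensions.checkUpdateSecurity", false);
```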
Resources

Wednesday, June 18, 2008

Firefox extension development

Uhoh. Now that Firefox 3 is out, it looks like I have to update my extension Firegoose to work. The trick here is that Firegoose uses Java.
Briefly, Firegoose is a Firefox extension that integrates several bioinformatics web resources into the Gaggle integration framework. The Gaggle is based on passing messages of a few fundamental data types in the bioinformatics domain, including lists, matrices, and networks. The transport protocol used in Gaggle is (for better or worse) Java RMI, and that, of course, requires Java. Hence, the Firegoose's reliance on being able to crank up a working (and unrestricted) JVM from inside Firefox.
That was done for Firefox 1.x and 2.x using an arcane and dirty trick from the fine folks at the Simile project at MIT called javaFirefoxExtension. And the sad thing for me is that the trick is apparently broken in Firefox 3. [Note: this turns out not to be the case.]
Resources
Silent failure is the curse of the Firefox extension developer. Debugging in Firefox is painful, at best, and even more so when using the bridge between Java and javascript (aka LiveConnect).
In order to do anything useful with Java in a Firefox Extension, there are at least 2 nasty bits to overcome. First, you have to load classes from inside an XPI file. This is essentially a variant of the classpath problem. Second, you probably have to give yourself full permissions (by manipulating java.security.Policy).
As mentioned above, an arcane solution to these problems has been worked out by folks at the SIMILE lab at MIT. Their PiggyBank extension is the prototype for use of Java for heavy lifting inside a Firefox extension. They actually run a full app server inside the browser. Another really cool extension that uses the same technique is xquseme, which embeds the Saxon XQuery processor. You can then perform arbitrary XQueries against any document you can browse to. Using Java in an extension gives you the power to combine the wealth of libraries available in the Java universe with a fully featured browser. So how do we go about doing it?
Loading classes from your XPI file is possible using Firefox's capability to resolve chrome URLs to paths in the filesystem. The mapping between chrome URLs and files is defined in the chrome.manifest file in your XPI. Once we have paths (as file:/ URLs), we create our own java.net.URLClassLoader. Calling the constructor java.net.URLClassLoader(URL[] urls) requires a trick, because the Java bridge seems not to do a very good job of coercing javascript types to java types. To further muddy the waters, the way js-to-Java type coercion is handled in LiveConnect changed in Firefox 3. You'd expect passing a javascript Array containing java.net.URL objects to work. But, try that and you'll get an error like this:
InternalError: Unable to convert JavaScript value [...blah blah...] to Java value of type java.net.URL[]
Code gleaned from the SIMILE lab (thanks!) solves this particular pain in the ass:
// from http://simile.mit.edu/repository/java-firefox-extension/firefox/chrome/content/scripts/browser-overlay.js
_toJavaUrlArray: function(a) {
 var urlArray = java.lang.reflect.Array.newInstance(java.net.URL, a.length);
 for (var i = 0; i < a.length; i++) {
  var url = a[i];
  java.lang.reflect.Array.set(
   urlArray,
   i,
   (typeof url == "string") ? new java.net.URL(url) : url
  );
 }
 return urlArray;
}
Truth and soul
Actually, problems with js-to-Java type conversion aren't limited to constructing classLoaders, but seem to be pervasive in FF3. Apparently, whenever you try to convert a js array to Java, LiveConnect screws it up. For example, passing a js array of strings, I get an array full of "true". Yes, it's true. There was an object there. Thanks a lot for that! (Happens on both Mac OS X and Windows, btw.)

Sunday, December 23, 2007

Messaging

Messaging is hip and trendy these days. It's central to application integration and distributed computing, and may represent a way forward in tackling concurrency (see “communicating sequential processes”). Here are a couple messaging resources:

Mule may be the future of the Gaggle. The Gaggle is a framework for integrating bioinformatics software and databases on which I've done some work. Gaggle currently is implemented on Java RMI. It really needs to move to a more language-neutral messaging protocol. A messaging approach should also make it easier for Gaggle to interoperate with web-service based systems like bioMoby and Taverna.