Showing posts with label microformats. Show all posts

Thursday, December 10, 2009

Microformats

Jeff Atwood, who writes the well-known Coding Horror blog, took on the topic of Microformats recently. His misguided comments about the presumed hackiness of overloading CSS classes with semantic meaning (actually their intended purpose) had people quoting the HTML spec:

The class attribute, on the other hand, assigns one or more class names to an element; the element may be said to belong to these classes. A class name may be shared by several element instances. The class attribute has several roles in HTML:
  • As a style sheet selector (when an author wishes to assign style information to a set of elements).
  • For general purpose processing by user agents.

Browsers work great for navigation and presentation, but we can only really compute with structured data. Microformats combine the virtues of both.

There are at least a couple of ways in which the ability to script interaction with web applications comes in handy. For starters, microformats are a huge advance compared to screen-scraping. The fact that so many people suffered through the hideous ugliness of screen-scraping proves that there must be some utility to be had there.
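To make the contrast with screen-scraping concrete, here is a minimal sketch of reading an hCard with nothing but the Python standard library. The sample markup, the `HCardParser` class and the `parse_hcard` helper are invented for illustration; only the class names `vcard`, `fn`, `org` and `url` come from the hCard microformat, and a real parser would handle many more properties and nesting rules.

```python
from html.parser import HTMLParser

# Sample markup made up for this example; the class names follow hCard.
SAMPLE_HTML = """
<div class="vcard">
  <span class="fn">Ada Lovelace</span>
  <span class="org">Analytical Engines Ltd</span>
  <a class="url" href="http://example.org/ada">home page</a>
</div>
"""


class HCardParser(HTMLParser):
    """Collect a few hCard properties by looking at class names only."""

    PROPERTIES = {"fn", "org", "url"}
    VOID = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag

    def __init__(self):
        super().__init__()
        self.card = {}
        self._stack = []  # one entry per open element: a property name or None

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        prop = next((c for c in classes if c in self.PROPERTIES), None)
        if prop == "url" and attributes.get("href"):
            # For links, the value lives in the href attribute, not the link text.
            self.card["url"] = attributes["href"]
            prop = None
        self._stack.append(prop)

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        prop = self._stack[-1] if self._stack else None
        if prop and data.strip():
            self.card[prop] = data.strip()


def parse_hcard(html):
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.card


print(parse_hcard(SAMPLE_HTML))
```

The point is that the parser keys on class names that carry agreed-upon meaning, so it keeps working when the page layout changes. A screen scraper tied to the position of a table cell would not.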

Web-based data sources also typically have a browser-based front-end and often expose a web service as well. Microformats link the two together. A user can find records of interest by searching in the browser, and the embedded microformats then allow the automated construction of a web service call to retrieve the same data in structured form.
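As a rough sketch of that link between page and service: the code below pulls record identifiers out of a search-results page and turns them into a web service URL. Everything specific here is an assumption made for illustration, including the `record-id` class name, the `example.org` endpoint and the `ids` and `format` query parameters. It does not describe any particular site's API.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical search-results page with identifiers marked up by class name.
SAMPLE_PAGE = """
<ul>
  <li><span class="record-id">rec-001</span> first search hit</li>
  <li><span class="record-id">rec-002</span> second search hit</li>
</ul>
"""


class IdentifierParser(HTMLParser):
    """Collect the text of every element carrying a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.identifiers = []
        self._capture = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self._capture = self.class_name in classes

    def handle_endtag(self, tag):
        self._capture = False

    def handle_data(self, data):
        if self._capture and data.strip():
            self.identifiers.append(data.strip())


def build_service_url(html, endpoint, class_name="record-id"):
    """Turn the identifiers found in the page into a (hypothetical) service call."""
    parser = IdentifierParser(class_name)
    parser.feed(html)
    parser.close()
    query = urlencode({"ids": ",".join(parser.identifiers), "format": "json"})
    return endpoint + "?" + query


print(build_service_url(SAMPLE_PAGE, "http://example.org/service/records"))
```

The user does the searching in the browser; the script only has to notice which records ended up on the page and ask the service for them by identifier.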

Microformats aren't anywhere near the whole answer. The real question is how to do data integration at web scale, using the web as a channel for structured data.

See also

Saturday, December 06, 2008

Dynamic Fusion of Web Data

I happened across a very cool project on web data integration at the University of Leipzig. Their paper Dynamic Fusion of Web Data is worth a look. They're working towards a theory of on-the-fly data integration for mashup applications that they refer to as dynamic data fusion. Data integration in mashups is dynamic in that it occurs at runtime. This provides for a pay-as-you-go model, rather than a large up-front semantic mapping task of the kind that limits the scalability of traditional data integration methods like data warehouses.

They describe mashups as workflow-like. Do they mean mashups are programmatic as opposed to declarative? In place of SQL, this group's iFuice system uses a scripting language with "set operations (e.g., union, intersection, and difference) and data transformation (e.g., fuse, aggregate) which can be used to post-process query results". Other key features are instance-level mapping and accommodation of structured and unstructured data.
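To get a feel for what post-processing query results with set operations and a fuse step might look like, here is a small Python sketch. It is not iFuice syntax, and the `fuse` function below is my own guess at the general idea (merge records that refer to the same object), not the paper's definition. The two "sources" and their records are made up.

```python
# Two hypothetical query results, keyed by a shared object identifier.
source_a = {
    "p1": {"title": "Paper One"},
    "p2": {"title": "Paper Two"},
}
source_b = {
    "p2": {"title": "Paper Two (preprint)", "year": 2008},
    "p3": {"title": "Paper Three", "year": 2007},
}

# Set operations work on the identifiers (instance level), not on schemas.
in_both = set(source_a) & set(source_b)    # intersection
in_either = set(source_a) | set(source_b)  # union
only_in_a = set(source_a) - set(source_b)  # difference


def fuse(*sources):
    """Merge records that share an identifier.

    Assumption for this sketch: when two sources disagree on an
    attribute, the source listed first wins.
    """
    fused = {}
    for source in sources:
        for key, record in source.items():
            merged = fused.setdefault(key, {})
            for attribute, value in record.items():
                merged.setdefault(attribute, value)
    return fused


fused = fuse(source_a, source_b)
print(sorted(in_both), sorted(only_in_a))
print(fused["p2"])
```

Nothing here needed an up-front mapping between the two sources' schemas; the only thing they had to agree on was the identifier, which is the pay-as-you-go flavor the paper is after.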

This definitely gets at what Firegoose is good for - using the web as a channel for structured data - an approach that does for data integration what loose coupling does for software. Firegoose, part of the Gaggle framework, is a toolbar for Firefox that allows data to be exchanged between desktop software and the web. Firegoose can read microformats, call web services, query databases, or even perform nasty dirty screen scraping. Unlike a mashup, data integration in Firegoose and Gaggle requires user participation, although the user never deals with schemas, only instances of the Gaggle data types - mainly lists of identifiers, matrices of numeric data, networks, and tuples. The identifiers serve in a role somewhat analogous to primary keys.
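The way identifiers act like primary keys can be shown with a toy model of two of those data types. The `NameList` and `Matrix` classes below are hypothetical Python stand-ins written for this example. They are not the Gaggle API, and the field names are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NameList:
    """A list of identifiers, loosely modeled on the idea of a Gaggle name list."""
    species: str
    names: List[str]


@dataclass
class Matrix:
    """Numeric data whose rows are labeled with identifiers."""
    row_names: List[str]
    column_names: List[str]
    values: List[List[float]]

    def select_rows(self, namelist):
        """Keep only the rows whose identifier appears in the name list."""
        wanted = set(namelist.names)
        keep = [i for i, name in enumerate(self.row_names) if name in wanted]
        return Matrix(
            [self.row_names[i] for i in keep],
            self.column_names,
            [self.values[i] for i in keep],
        )


# Made-up data: three rows of measurements and a selection of two identifiers.
matrix = Matrix(
    row_names=["g1", "g2", "g3"],
    column_names=["t0", "t1"],
    values=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
)
selection = NameList(species="hypothetical", names=["g3", "g1"])
subset = matrix.select_rows(selection)
print(subset.row_names, subset.values)
```

A list of identifiers picked up from a web page can be used to slice a matrix that came from somewhere else entirely, without either side knowing about the other's schema.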

More papers in a similar vein

Monday, April 07, 2008

Microformats and JSON

I maintain a Firefox extension called Firegoose that (among other things) works with data embedded in HTML pages. We do this using microformats, embedded XML, and even, if we have to, nasty screen scraping. I'm not too religious about the format of the data - just give me something I can parse.

So, it occurred to me to wonder why JSON wasn't more popular in this niche. Easily parsable, legal in a browser, scriptable from within the page - why not? Of course people are, as usual, miles ahead of me here:

Strangely, activity around this topic seems to have surfaced briefly and then disappeared sometime in mid-2007.
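For what it's worth, here is a minimal sketch of what reading JSON embedded in a page could look like. It assumes one possible convention, a `script` element with `type="application/json"`, which is my choice for the example and not an established microformat. The sample page and the `EmbeddedJsonParser` class are likewise invented.

```python
import json
from html.parser import HTMLParser

# Hypothetical page carrying the same data twice: once for people, once for programs.
SAMPLE_PAGE = """
<html><body>
<p>Readable text for people.</p>
<script type="application/json">
{"type": "namelist", "names": ["rec-001", "rec-002"]}
</script>
</body></html>
"""


class EmbeddedJsonParser(HTMLParser):
    """Collect and decode every <script type="application/json"> block."""

    def __init__(self):
        super().__init__()
        self.documents = []
        self._in_json = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_json:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_json = False


def extract_json(html):
    parser = EmbeddedJsonParser()
    parser.feed(html)
    parser.close()
    return parser.documents


print(extract_json(SAMPLE_PAGE))
```

The parsing side is about as easy as it gets, which is what made the lack of uptake surprising to me.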