Tuesday, June 25, 2013

The DREAM 8 Challenges

The 8th iteration of the DREAM Challenges is underway.

DREAM is something like the Kaggle of computational biology with an open science bent. Participating teams apply machine learning and statistical modeling methods to biological problems, competing to achieve the best predictive accuracy.

This year's three challenges focus on reverse engineering cancer, toxicology and the kinetics of the cell.

Sage Bionetworks (my employer) has teamed up with DREAM to offer our Synapse platform as an integral part of the challenges. Synapse is targeted at providing a platform for hosting data analysis projects, much like GitHub is a platform for software development projects.

My own part in Synapse is on the Python client and a bit on the R client. I expect to get totally pummeled by bug reports once participation in the challenges really gets going.

Open data, collaborative innovation and reproducible science

The goal of Synapse is to enable scientists to combine data, source code, provenance, prose and figures to tell a story with data. The emphasis is on open data and collaboration, but access control and governance for restricted data are built in.

In contrast to Kaggle, the DREAM Challenges are run in the spirit of open science. Winning models become part of the scientific record rather than the intellectual property of the organizers. Sharing code and building on other contestants' models is encouraged, in the hope of forming networks of collaborative innovation.

Aside from lively competition, these challenges are a great way to compare the virtues of various methods on a standardized problem. Synapse is aiming to become an environment for hosting standard open data sets and documenting reproducible methods for deriving models from them.

Winning methods will be presented at the RECOMB/ISCB Conference in Toronto this fall.

So, if you want to sharpen your data science chops on some important open biological problems, check out the DREAM8 challenges.


Wednesday, June 12, 2013

Big Data Book

Big Data: A Revolution That Will Transform How We Live, Work, and Think asks what happens when data goes from scarce to abundant. The book expands on The data deluge, a report that appeared in the Economist in early 2010, providing a concise overview of a topic that is at once over-hyped and genuinely transformative.

Authors Kenneth Cukier, Data Editor at the Economist, and Viktor Mayer-Schonberger, Oxford professor of public policy, look at big data trends through a business-oriented lens.

Their book identifies two important shifts in thinking as key to dealing with big data: comfort with the messy uncertainty of probability, and what the book calls correlation over causality. The latter just means that some tasks are better suited to a correlative, pattern-matching approach than to modeling from first principles. No one drives a car by thinking about physics.

In domains such as natural language and biology we'd like to know the underlying theory, but don't - at least, not yet. Where theory is too immature to answer pressing questions directly, a correlative, probabilistic approach is the best we can do. The merits of this concession to incomplete information can be debated, as Norvig and Chomsky have done. But empirical disciplines like medicine and engineering rely on it, as do the really messy worlds of economics and politics. It's often necessary to act in the absence of mechanistic understanding.

As data goes from scarcity to abundance, judgment and interpretation remain valuable. What changes is that you don't have to formulate a question first and only then collect data that may help answer it. Another strategy becomes viable - collect a trove of data first and then interrogate it. That may bother hypothesis-driven traditionalists, but once you have a high-throughput data source - a DNA sequencer, a sensor network or an e-commerce site - it makes sense to ask what questions you can put to that data.


One industry or discipline after another has undergone the transition from information scarcity to super-abundance. One of the last hold-outs is medicine, where the potential impact is huge. Internet companies are accumulating vast data hoards streaming in through social networks and smartphones. The process has something of the character of a land rush. Given the economies of scale, each category has its dominant player.

The dark side

As for the obligatory dark side of big data, the book worries about "the prospect of being penalized for things we haven't even done yet, based on big data's ability to predict our future behavior."

While this won't keep me up at night, technology is not necessarily the liberating force we hope it is, nor is it necessarily a tool of Orwellian control. Certain technologies lend themselves more readily to one use or the other.

A more pedestrian "threat to human agency" comes from marketers and political consultants equipped with data feeds, powerful machines and brainy statisticians, able to manipulate people into consuming useless junk and acting against their own interests. Given detailed individual profiles and constant feedback on the effectiveness of targeted marketing, the puppet masters are likely to keep getting better at pulling our strings.

Data governance

To safeguard privacy and ethical uses of data, the book advocates a self-policing professional association of data scientists, similar to those existing for accountants, lawyers and brokers. Personally, I'm skeptical that these organizations have much in the way of teeth. An informed public debate might be a healthier option.

Bolder thinking in this area comes from Harvard's cyber-law group. Their VRM (for vendor relationship management) project proposes a protocol by which individuals and vendors can specify the terms under which they are willing to share data. Software can automate the negotiation and present options where terms are compatible. Precedents for these ideas can be found in finance, where bonds and options are standardized contracts and can thus be fluidly traded.
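In the simplest case, the automated matching step could reduce to intersecting the terms each side accepts. Here's a toy sketch of that idea - all names and the set-of-strings representation of terms are invented for illustration, and this is in no way the actual VRM protocol:

```python
# Toy model of automated terms matching, loosely inspired by the VRM idea.
# The representation of terms as sets of strings is invented for illustration.

def compatible_terms(user_allows, vendor_wants):
    """Return the data uses acceptable to both parties (empty set = no deal)."""
    return set(user_allows) & set(vendor_wants)

def matching_vendors(user_allows, vendors):
    """Filter vendors down to those sharing at least one mutually acceptable use."""
    result = {}
    for name, wants in vendors.items():
        shared = compatible_terms(user_allows, wants)
        if shared:
            result[name] = shared
    return result

user = {"order-fulfillment", "anonymized-analytics"}
vendors = {
    "BookShop": {"order-fulfillment", "third-party-ads"},
    "AdBroker": {"third-party-ads"},
}
print(matching_vendors(user, vendors))
# -> {'BookShop': {'order-fulfillment'}}
```

Real negotiation would of course involve far richer, standardized vocabularies of terms - which is exactly why the analogy to standardized financial contracts is apt.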

The new oil?

Land, labor and capital are the classic factors of production - the primary inputs to the economy - or so they teach in Econ 101. Technology certainly belongs on that list and data does as well. The original robber barons sat in castles overlooking the Rhine or the Danube taxing the flow of goods and travelers around Europe. The term was later applied to giant industrial firms built on other flows: oil, rail and steel. Companies like Google, Facebook and Amazon extract wealth from a new flow: that of data. It's time to think of data as another factor of economic production, a raw material out of which something can be made.
