Wednesday, June 12, 2013

Big Data Book

Big Data: A Revolution That Will Transform How We Live, Work, and Think asks what happens when data goes from scarce to abundant. The book expands on "The data deluge", a report that appeared in the Economist in early 2010, and provides a concise overview of a topic that is at once over-hyped and genuinely transformative.

Authors Kenneth Cukier, Data Editor at the Economist, and Viktor Mayer-Schonberger, Oxford professor of public policy, look at big data trends through a business-oriented lens.

Their book identifies two important shifts in thinking as key to dealing with big data: comfort with the messy uncertainty of probability and what the book refers to as correlations over causality. This just means that some tasks are better suited to a correlative or pattern matching approach rather than modeling from first principles. No one drives a car by thinking about physics.

In domains such as natural language and biology we'd like to know the underlying theory, but don't. At least, not yet. Where theory is insufficiently developed to directly answer pressing questions, a correlative, probabilistic approach is the best we can do. The merits of this concession to incomplete information can be debated, and were by Norvig and Chomsky. But empirical disciplines like medicine and engineering rely on it - let alone the really messy worlds of economics and politics. It's often necessary to act in the absence of mechanistic understanding.

As data goes from scarcity to abundance, judgment and interpretation remain valuable. What changes is that you don't necessarily have to formulate a question and only then collect data that may help answer it. Another strategy becomes viable - collect a trove of data first and then interrogate it. That may bother hypothesis-driven traditionalists, but once you have a high-throughput data source - a DNA sequencer, a sensor network or an e-commerce site - it makes sense to ask yourself what questions you can ask of that data.


One industry or discipline after another has undergone the transition from information scarcity to super-abundance. One of the last hold-outs is medicine, where the potential impact is huge. Internet companies are accumulating vast data hoards streaming in through social networks and smartphones. The process has something of the character of a land rush. Given the economies of scale, each category has its dominant player.

The dark side

As for the obligatory dark side of big data, the book worries about "the prospect of being penalized for things we haven't even done yet, based on big data's ability to predict our future behavior."

While this won't keep me up at night, technology is not necessarily the liberating force that we may hope it is, nor is it necessarily a tool of Orwellian control. Certain technologies lend themselves more readily one way or the other.

A more pedestrian "threat to human agency" comes from marketers and political consultants equipped with data feeds, powerful machines and brainy statisticians, and able to manipulate people into consuming useless junk and acting against their own interests. Given detailed individual profiles and constant feedback on the effectiveness of targeted marketing, the puppet masters are likely to keep getting better at pulling our strings.

Data governance

To safeguard privacy and ethical uses of data, the book advocates a self-policing professional association of data scientists similar to those existing for accountants, lawyers and brokers. Personally, I'm skeptical that these organizations have much in the way of teeth. An informed public debate might be a healthier option.

Bolder thinking in this area comes from Harvard's cyber-law group. Their VRM (for vendor relationship management) project proposes a protocol by which individuals and vendors can specify the terms under which they are willing to share data. Software can automate the negotiation and present options where terms are compatible. Precedents for these ideas can be found in finance where bonds and options are standardized contracts and can thus be fluidly traded.
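
To make the idea of automated matching concrete, here's a minimal sketch in Python. This is not the VRM project's actual protocol - the Terms fields, the compatibility rule and the vendor names are all hypothetical, invented purely for illustration. It just shows how software might compare what an individual is willing to share against what each vendor asks for, and surface only the compatible options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Terms:
    """Hypothetical data-sharing terms; the fields are illustrative only."""
    purposes: frozenset      # uses of the data that are permitted or requested
    max_retention_days: int  # longest retention permitted or requested
    allows_resale: bool      # whether data may be passed on to third parties


def compatible(offered: Terms, requested: Terms) -> bool:
    """True if the vendor's request stays within what the individual offers."""
    return (
        requested.purposes <= offered.purposes
        and requested.max_retention_days <= offered.max_retention_days
        and (offered.allows_resale or not requested.allows_resale)
    )


def matching_vendors(offered: Terms, requests: dict) -> list:
    """Return the names of vendors whose requested terms are compatible."""
    return sorted(name for name, req in requests.items()
                  if compatible(offered, req))


# Made-up example: one individual's terms and two vendors' requests.
my_terms = Terms(frozenset({"shipping", "recommendations"}), 90, False)
vendor_requests = {
    "bookshop": Terms(frozenset({"shipping"}), 30, False),
    "ad_broker": Terms(frozenset({"recommendations"}), 365, True),
}

print(matching_vendors(my_terms, vendor_requests))  # ['bookshop']
```

Because the terms are standardized fields rather than free-form legalese, the comparison is mechanical, which is the same property that makes standardized financial contracts easy to trade.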

The new oil?

Land, labor and capital are the classic factors of production - the primary inputs to the economy - or so they teach in Econ 101. Technology certainly belongs on that list and data does as well. The original robber barons sat in castles overlooking the Rhine or the Danube taxing the flow of goods and travelers around Europe. The term was later applied to giant industrial firms built on other flows: oil, rail and steel. Companies like Google, Facebook and Amazon extract wealth from a new flow: that of data. It's time to think of data as another factor of economic production, a raw material out of which something can be made.

Links, links, links