Wednesday, December 11, 2013

DREAM conference

Apparently, live-blogging isn't my thing. This post has been sitting on my desktop getting stale for entirely too long...

I'm just back (a month ago) from the conference for the DREAM challenges in Toronto. It was great to see the Dream 8 challenges wrap up and another round of challenges get under way. Plus, I think my IQ went up a few points just by standing in a room full of smart people applying machine learning to better understand biology.

The DREAM organization, led by Gustavo Stolovitzky, partnered with Sage Bionetworks to host the challenges on the Synapse platform. Running the challenges involves a surprising amount of logistics and the behind-the-scenes labor of many people to pose a good question, collect and prepare data, provide clear instructions, and evaluate submissions. Puzzling out how to score submissions can be a data analysis challenge in its own right.

HPN

The winners of the Breast Cancer Network Inference Challenge applied a concept from economics called Granger causality. I have a thing for ideas that cross over from one domain to another, especially from economics to biology. The team, from Josh Stuart's lab at UCSC, calls their algorithm Prophetic Granger Causality, which they combined with data from Pathway Commons to infer signaling pathways from time-series proteomics data taken from breast cancer cell lines.
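For anyone who wants to play with the basic idea, the textbook bivariate Granger test is available in R through the lmtest package. Here's a minimal sketch on made-up data - not the team's Prophetic Granger Causality, just the plain version of the question: does the history of x improve prediction of y beyond y's own history?

## plain bivariate Granger test with lmtest (illustration only)
library(lmtest)

set.seed(1)
n <- 200
x <- rnorm(n)
y <- 0.8 * c(0, head(x, -1)) + rnorm(n, sd = 0.5)  # y lags x by one time step

## null hypothesis: x does not Granger-cause y (at lag order 1)
grangertest(y ~ x, order = 1)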

The visualization portion of the same challenge was swept by Amina Qutub's lab at Rice University with their stylish BioWheel visualization based on Circos. Qutub showed a prototype of a D3 re-implementation, and I'm told the source code will be available soon. There's a nice write-up, Qutub bioengineering lab builds winning tool to visualize protein networks, and a video.

BioWheel from Team ABCD writeup

Whole-cell modeling

In the whole-cell modeling challenge, led by Markus Covert of Stanford University, participants were tasked with estimating the model parameters for specific biological processes from a simulated microbial metabolic model.

Since synthetic data is cheap, a little artificial scarcity was introduced by giving participants a budget with which they could purchase data of several different types. Using this data budget wisely then becomes an interesting part of the challenge. One thing that's not cheap is running these models. That part was handled by BitMill, a distributed computing platform developed by Numerate and presented by Brandon Allgood.

Making hard problems accessible to outsiders with new perspectives is a big goal of organizing a challenge. One of the winners of this challenge, a student in a neuroscience lab at Brandeis, is a great example. They're considering a second round for this challenge next year.

Toxicogenetics

Yang Xie's lab at UT Southwestern's QBRC added another win to their collection in the Toxicogenetics challenge. Tao Wang revealed some of their secrets in his presentation.

From Tao Wang's slides

Wang explained that their approach relies on extensive exploratory analysis, careful feature selection, dimensionality reduction, and rigorous cross-validation.
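Here's a generic sketch of what dimensionality reduction inside cross-validation can look like in R - illustrative only, on fake data, and not the QBRC team's actual pipeline. The detail worth noticing is that the PCA is re-fit within each training fold, so no information leaks from the held-out samples.

## toy example: PCA + linear model evaluated by 5-fold cross-validation
set.seed(42)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), nrow = n)      # fake high-dimensional features
y <- X[, 1] - 2 * X[, 2] + rnorm(n)      # fake response

k <- 5
folds <- sample(rep(1:k, length.out = n))
rmse <- numeric(k)

for (i in 1:k) {
  train <- folds != i
  pca <- prcomp(X[train, ], center = TRUE, scale. = TRUE)  # fit PCA on training fold only
  n_pc <- 10                                               # keep a handful of components
  Z_train <- pca$x[, 1:n_pc]
  Z_test  <- predict(pca, X[!train, ])[, 1:n_pc]           # project held-out samples
  fit <- lm(y[train] ~ ., data = as.data.frame(Z_train))
  pred <- predict(fit, newdata = as.data.frame(Z_test))
  rmse[i] <- sqrt(mean((y[!train] - pred)^2))
}
mean(rmse)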

Keynotes

Trey Ideker presented NeXO, the Network Extracted Ontology, a biological ontology comparable in some ways to GO. The important difference is this: whereas GO is human-curated from the literature, NeXO is derived from data. There's a paper, A gene ontology inferred from molecular networks, and a slick-looking NeXO web application.

Tim Hughes gave the other keynote on RNA-binding motifs complementing prior work constructing RBPDB, a database of RNA-binding specificities.

DREAM 8.5

Since the next round of challenges is coming sooner than DREAM's traditional yearly cycle, it's numbered 8.5. Three challenges are getting under way now.

Discussion

During the discussion session, questions were raised about whether to optimize the design of the challenges to produce generalizable methods or to answer a specific biological question. Domain knowledge can give competitors advantages in terms of understanding noise characteristics and artifacts, in addition to informing realistic assumptions. If the goal is strictly to compare algorithms, this might be a bug. But in the context of answering biological questions, it's a feature.

Support was voiced for data homogeneity (no NAs, data plus a quality/confidence metric) and for more real-time feedback.

It takes impressive community management skills to maintain a balance that appeals to a diverse array of talent as well as providers of data and support.

Monday, November 25, 2013

Vaclav Smil

I am an incorrigible interdisciplinarian. I was trained in a broad range of basic natural sciences (biology, chemistry, geography, geology) and then branched into energy engineering, population and economic studies and history. For the past 30 years my main effort has gone into writing books that offer new, interdisciplinary perspectives on inherently complex, messy realities.

On Writing

Hemingway knew the secret. I mean, he was a lush and a bad man in many ways, but he knew the secret. You get up and, first thing in the morning, you do your 500 words. Do it every day and you’ve got a book in eight or nine months.
Bill Gates reads you a lot. Who are you writing for?
I have no idea. I just write.

On Apple

Apple! Boy, what a story... When people start playing with color, you know they’re played out.

Wednesday, October 30, 2013

Building distributed systems out of crap

Pat Helland gave the opening keynote at Basho's conference, Ricon West, yesterday. The general topic was building distributed systems with enterprise guarantees and web scalability on crap. His argument is that enterprise-grade SLAs with lots of nines can be supported on cheap hardware using a strategy of expecting failure and recovering quickly.

Helland, who previously did time on the Bing team and at Amazon, is building a distributed data storage system for Salesforce.com. Its design involves a catalog stored in a relational DB and files stored on clusters of storage servers, a technique Helland calls blobs-by-reference.

The files are stored in fragments distributed across a cluster. There was another concept called an “extent”. I wasn't sure if that meant an aggregation of related fragments or just a bucket to dump them in.

SSDs are used as a new layer of the memory hierarchy. Helland argues for using the cheapest and crappiest available. This entails a couple of engineering tweaks. Because SSDs degrade with every operation, the software has to manage read/write cycles. To detect data corruption, each fragment is packaged with a CRC error-detecting code.

“By surrounding the data with aggressive error checking, we can be extremely confident of detecting an error and fetching the desired data from one of the other places it has been stored.”
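The idea of wrapping each fragment with an error-detecting code is simple enough to sketch. Here's a toy illustration in R using the digest package's crc32 - nothing to do with the actual Salesforce implementation, just the store-a-checksum, verify-on-read pattern.

## toy fragment-with-checksum pattern (digest package, crc32)
library(digest)

make_fragment <- function(payload) {
  list(payload = payload,
       crc = digest(payload, algo = "crc32", serialize = FALSE))
}

verify_fragment <- function(fragment) {
  identical(digest(fragment$payload, algo = "crc32", serialize = FALSE),
            fragment$crc)
}

frag <- make_fragment("some immutable chunk of a file")
verify_fragment(frag)                    # TRUE

frag$payload <- "bit rot happened here"  # simulate corruption
verify_fragment(frag)                    # FALSE -> fetch another replica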

Helland emphasized the importance of immutable data, which goes a long way towards mitigating the inconsistency and race conditions that come with distributed computing. In the proposed storage system, fragments are immutable, which greatly reduces opportunity for the storage nodes to get out of sync with the catalog.

Aside from this talk, Ricon is loaded with good content including a talk by Jeff Dean coming up this afternoon. Send me next year!

More

Monday, October 14, 2013

Concurrency and Parallelism - What's the difference?

For a while, I've been coming across references to the difference between concurrency and parallelism. The definitions go something like this: Concurrency concerns "interleaved threads of execution with access to shared state" which is distinct from parallelism because "parallel operations run simultaneously".

I'm quoting from - "Clojure Programming" by Chas Emerick, Brian Carper and Christophe Grand - which is a perfectly good book. I've seen similar definitions elsewhere, so I don't want to pick on these guys in particular. I'm going to disagree a bit, but overall, the book is really well done and I'm enjoying it.

My beef is this: I couldn't see the utility of the distinction they're drawing. I couldn't see why you'd want to design a program differently to run as threads scheduled on a single core versus threads scheduled on several cores. In fact, treating those cases the same seems like a plus.

In contrast, there are some distinctions between types of concurrency that are useful. Knowing your code will be distributed across machines tells you to bring network latency into the picture. Likewise, only certain problems are amenable to the single-instruction-multiple-data (SIMD) model of vector processors such as GPUs. These considerations have a real impact. But, why the pedantry over concurrency versus parallelism?

I was about to write a little rant about why this distinction is useless. But, keeping an open mind, I googled around a bit and up popped a talk by Rob Pike called "Concurrency Is Not Parallelism". Change of plan. Rob Pike is a bad-ass, well known as a Unix pioneer, Bell Labs veteran and Google Distinguished Engineer. New plan: go back to school and find out why I'm wrong.

Pike's talk explains things beautifully, and not just because he's wearing an orange suit jacket and a gopher t-shirt. Here's my understanding of the take-away:

Concurrency is a more general and abstract idea than parallelism. Concurrency is about the decomposition of a problem into subtasks at the design level. If you're creating a concurrent design, you haven't said yet whether your design will be executed in parallel. Parallelism is a detail to be decided at run-time. Which brings us to the good part.

Whenever you can take two things that were previously conjoined and let them vary independently, you're making progress. The two things in question here are the design - the decomposition of a problem into concurrent parts - and the execution of those parts, perhaps in parallel. Making this separation allows programs to be expressed correctly and structured clearly while making good use of available resources whether that's one core or many.

This important point is what's missing from the definitions above. That and they're comparing things at different levels of generality.
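To make that separation concrete in R terms (my analogy, not Pike's; his examples are in Go): the decomposition into independent tasks is written once, and whether they run on one core or several is a run-time decision made with the parallel package.

## same concurrent design, two execution strategies
library(parallel)

slow_task <- function(i) { Sys.sleep(0.1); i^2 }
tasks <- 1:8

## concurrent design, sequential execution
res_serial <- lapply(tasks, slow_task)

## same design, parallel execution
## (mclapply forks; on Windows use mc.cores = 1 or parLapply instead)
res_parallel <- mclapply(tasks, slow_task, mc.cores = 4)

identical(res_serial, res_parallel)   # TRUE -- the design didn't change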

Next, Pike relates these ideas to Go. The language provides three concurrency primitives: goroutines, channels and select. Goroutines are like threads but much cheaper. During execution, they're mapped onto OS threads by a scheduler. Channels and select statements enable communication. Go is an implementation of concepts that have their origin in the classic paper Communicating Sequential Processes by Tony Hoare.

The moral of the story? Learn from the masters ...and from the gophers.

More

Tony Hoare's paper is on several "greats" lists.

Thursday, September 19, 2013

Special secret stuff with s3cmd

According to my coworker, "Amazon's S3 is the best thing since sliced bread." For working with S3, s3cmd is a handy little tool. Its documentation is a bit on the sparse side, but, what do you expect for free?

One gotcha with S3 is that buckets and the files in them have entirely distinct ACLs. This can lead to scenarios where the owner of a bucket can't work with the files in it. An easy way for this to come about is to log into the S3 console and create a bucket with one set of credentials, then upload files with a tool like s3cmd under another set of credentials.


cd /path/to/files/
s3cmd -v sync . s3://damnbucket.bucketowners.org

You can give permissions to the bucket owner like so:


s3cmd setacl --acl-grant=full_control:i_own_the_damn_bucket@bucketowners.org --recursive s3://damnbucket.bucketowners.org

You might also want to make the files public, so they can be served as a static website.


s3cmd setacl --acl-public --recursive s3://damnbucket.bucketowners.org

AWS Command Line Interface

I've been using s3cmd for a while, out of habit, but maybe it's time to try Amazon's AWS Command Line Interface, which just had its version 1.0 release.

From a brief look, AWS CLI looks nice. You can do the same sync operation as above and make files public in one command:


aws s3 sync . s3://damnbucket.bucketowners.org --acl public-read

Amazon is very mysterious about how to specify the target of a grant of permissions, aka the grantee. I tried to give permission to the owner of a bucket, but kept getting an error. Some more examples in the docs would help! I also get "Invalid Id" for no apparent reason in the permissions section of the web UI for S3, so maybe I'm just clueless.


aws s3api put-object-acl --bucket damnbucket.bucketowners.org --grant-full-control i_own_the_damn_bucket@bucketowners.org --key genindex.html
#> A client error (InvalidArgument) occurred: Argument format not recognized.

As far as I could tell, the AWS CLI tool seems to be missing the --recursive option that we used with s3cmd. That seems like a fairly essential thing.

Also, I couldn't get the profile feature to work:


aws s3 ls s3://damnbucket.bucketowners.org --profile docgenerator
#>The config profile (docgenerator) could not be found

NOTE: Many thanks to Mitch Garnaat, I now know how the --profile switch works. Contrary to the documentation, you need a heading in your config file like this: [profile docgenerator] rather than like this: [docgenerator].

I'm glad Amazon is taking the lead in developing this tool, and I'm sure they'll keep making it better. And, there's a Github repo, so I'm guessing that means they take pull requests.

Saturday, September 07, 2013

Who will prosper in the age of smart machines?

What if Siri really worked? ...worked so well that those that mastered co-operation with a digital assistant had a serious competitive advantage over those relying on their own cognitive powers alone?

That question is considered by economist Tyler Cowen in a new book called Average is Over. Cowen previously wrote The Great Stagnation, maintains the Marginal Revolution blog, and teaches online classes at Marginal Revolution University.

"Increasingly, machines are providing not only the brawn but the brains, too, and that raises the question of where humans fit into this picture — who will prosper and who won’t in this new kind of machine economy?"

Tyler Cowen's answers

Who will gain: People who can collaborate with smart machines; life long learners; people with marketing skills; motivators.

"Sheer technical skill can be done by the machines, but integrating the tech side with an attention-grabbing innovation is a lot harder."

The psychological aspect is interesting. The traditional techy nerd (ahem... hello, self) has a psychology adapted to machines. But machines are gaining the capacity to interface on a level adapted to human psychology.

Who will lose: People who compete with smart machines; people who are squeamish about being tracked, evaluated and rated; the sick; bohemians; political radicals.

On being quantified

"Computing and software will make it easier to measure performance and productivity. [...] In essence everyone will suffer the fate of professional chess players, who always know when they have lost a game, have an exact numerical rating for their overall performance, and find excuses for failure hard to come by."

On hipsters

"These urban areas [he doesn't mention Portland by name] are full of people who are bright, culturally literate, Internet-savvy and far from committed to the idea of hard work directed toward earning a good middle-class living. We’ll need a new name for the group of people who have the incomes of the lower middle class and the cultural habits of the wealthy or upper middle class. They will spread a libertarian worldview that working for other people full time is an abominable way to get by."

How many will prosper

The current trend of unequal wealth distribution will only continue as technological literacy takes on new dimensions. "Big data" makes it easier to measure and grade our skills and failings. Apex skills are the ability to grab human attention, to motivate, to manage humans and machines in collaboration.

Another, not quite contrasting view comes from Northwestern University economist Robert Gordon. His paper "Is U.S. Economic Growth Over? Faltering Innovation Confronts the Six Headwinds", suggests that "the rapid progress made over the past 250 years could well turn out to be a unique episode in human history." In particular, he raises the possibility that the internet and digital and mobile technology may contribute less to productivity than previous industrial revolutions.

Technology can create winner-take-all situations, where a few capture enormous gains leaving the also-rans with little. A lot depends on the distribution of technology's benefits.

More

Thursday, July 11, 2013

Generate UUIDs in R

Here's a snippet of R to generate a Version 4 UUID. Dunno why there wouldn't be an official function for that in the standard libraries, but if there is, I couldn't find it.


## Version 4 UUIDs have the form:
##    xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx
##    where x is any hexadecimal digit and
##    y is one of 8, 9, A, or B
##    f47ac10b-58cc-4372-a567-0e02b2c3d479
uuid <- function(uppercase=FALSE) {

  hex_digits <- c(as.character(0:9), letters[1:6])
  hex_digits <- if (uppercase) toupper(hex_digits) else hex_digits

  y_digits <- hex_digits[9:12]

  paste(
    paste0(
      sample(hex_digits, 8, replace=TRUE),
      collapse=''),
    paste0(
      sample(hex_digits, 4, replace=TRUE),
      collapse=''),
    paste0(
      '4',
      paste0(sample(hex_digits, 3, replace=TRUE),
             collapse=''),
      collapse=''),
    paste0(
      sample(y_digits,1),
      paste0(sample(hex_digits, 3, replace=TRUE),
             collapse=''),
      collapse=''),
    paste0(
      sample(hex_digits, 12, replace=TRUE),
      collapse=''),
    sep='-')
}

View as a gist: https://gist.github.com/cbare/5979354

Note: Thanks to Carl Witthoft for pointing out that my first version was totally broken. Turns out calling sample with replace=TRUE greatly expands the possible UUIDs you might generate!

Carl also says, "In general, as I understand it, the value of UUID codes is directly dependent on the quality of the pseudo-random number generator behind them, so I’d recommend reading some R-related literature to make sure “sample” will be good enough for your purposes."

This sounds wise, but I'm not sure if I'm smart enough to follow up on it. It could be that the randomness of these UUIDs is less than ideal.
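If pulling in a dependency is an option, there's also a uuid package on CRAN. As I understand it, UUIDgenerate() calls a native UUID generator rather than going through R's pseudo-random number stream, which would sidestep the concern above.

## alternative: the CRAN uuid package
## install.packages("uuid")
library(uuid)
UUIDgenerate()   # e.g. "f47ac10b-58cc-4372-a567-0e02b2c3d479"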

Saturday, July 06, 2013

Automate This!

The invention of the printing press by German blacksmith Johannes Gutenberg in 1439, the foundational event of the information age, is a common touchstone for technology stories, appearing in the opening chapter of both Nate Silver's The Signal and the Noise and Viktor Mayer-Schonberger and Kenneth Cukier's Big Data.

Automate This!, by Christopher Steiner, comes at current technology trends from a more mathy angle, tracing roots back to Leibniz and Gauss. Here, it's algorithms rather than data that take center stage. Data and algorithms are two sides of the same coin, really. But, it's nice to see some of the heroes of CS nerds everywhere get their due: Al Khwarizmi, Fibonacci, Pascal, the Bernoullis, Euler, George Boole, Ada Lovelace and Claude Shannon.

Automate This! is more anecdotal than Big Data, avoiding sweeping conclusions except the one announced in bold letters as the title of the last chapter: "The future belongs to the algorithms and their creators." The stories, harvested from Steiner's years as a tech journalist at Forbes, cover finance and the start-up scene, but also medicine, music and analysis of personality.

Many of the same players from Nate Silver's book or from Big Data make an appearance here as well: Moneyball baseball manager Billy Beane, game-theorist and political scientist Bruce Bueno de Mesquita, and Facebook data scientist and Cloudera founder Jeff Hammerbacher.

Finance

In the chapter on algorithmic trading we meet Hungarian-born electronic-trading pioneer Thomas Peterffy, who built financial models in software in the 80's, before it was cool, by hacking a NASDAQ terminal.

In the same chapter, I gained new respect for financial commentator Jim Cramer. In contrast to his buffoonish on-screen persona, his real-time analysis of the May 2010 flash-crash was both "uncharacteristically calm" and uncannily accurate. As blue-chip stocks like JNJ dived to near-zero, he made the savvy assessment, "That's not a real price. Just go and buy it!" and, as prices recovered only minutes later, "You'll never know what happened here." There's little doubt that algorithmic trading was the culprit, but the unanswered question is whether it was a bot run amok or an intentional strategy that worked a little too well. Too bad: if you did buy, they probably canceled it.

Music

Less scarily, algorithms can rate a pop song's chances of becoming a hit single. Serious composer and professor of music David Cope uses hand-coded (in Lisp) programs to compose music, pushing boundaries in automating the creative process.

Medicine

Having mastered Jeopardy, IBM's Watson is gearing up to take a crack at medical diagnostics, which is a field Hammerbacher thinks is ripe for hacking.

Psych and soc

Computers are beginning to understand people, which gives them a leg up on me, I'd have to say. Taibi Kahler developed a classification system for personality types based on patterns in language usage. Used by NASA psychiatrist Terry McGuire to compose well-balanced flight crews, the system divides personality into six bins: emotions-driven, thoughts-based, actions-driven, reflection-driven, opinions-driven, and reactions-based. If you know people more than superficially, they probably don't fit neatly into one of those categories, but some do (by which I mean they have me pretty well pegged).

At Cornell, Jon Kleinberg's research provides clues to the natural pecking order that emerges among working or social relationships - Malcolm Gladwell's influencers detected programmatically. One wonders: if corporate hierarchies were better aligned with such psychological factors, would the result be a harmonious workplace where everyone knows and occupies their right place? Or a brave new world of speciation into some technological caste system?

What next?

Perhaps surprisingly, Steiner cites the Kauffman Foundation's Financialization and Its Entrepreneurial Consequences on "the damage wrought by Wall Street" - the brain drain toward finance and away from actual productive activity. The book ends with the hopeful message that the decline of finance will set quantitative minds free to work on creative entrepreneurial projects. For the next generation, there's a plea for urgently needed improvements in quantitative education, especially at the high-school level.

Automate This! is a quick and fun read. Steiner's glasses are a bit rose-tinted at times, and his book will make you feel like a chump if you haven't made a fortune algorithmically gaming the markets or disrupting some backwards corner of the economy. As my work-mates put it, we're living proof that tech skills are a necessary but not sufficient condition.

Links

Kahler's personality types

  • Emotions-driven: form relationships, get to know people; in tense situations -> dramatic, overreactive
  • Thoughts-based: do away with pleasantries, straight to the facts; rigid pragmatism, humorless, pedantic, controlling
  • Actions-driven: crave action and progress; always pushing; charming. Under pressure -> impulsive, irrational, vengeful
  • Reflection-driven: calm and imaginative; think about what could be rather than work with what is; can dig into a new subject for hours; applying knowledge to the real world is a weakness
  • Opinions-driven: see one side, stick to their opinions in the face of proof; persistent workers, but can be judgmental, suspicious and sensitive
  • Reactions-based: rebels; spontaneous, creative and playful; react strongly, either "I love it!" or "That sucks!" Under pressure, can be stubborn, negative, and blameful

RESTful APIs

At Clojure-West, this past March, former ThoughtWorker Siva Jagadeesan spoke on how to build good web APIs using Resource Oriented Architecture (ROA) and Clojure.

The talk isn't really specific to Clojure; it's more a primer on REST APIs.

Rants about true REST versus REST-ish, REST influenced, or partially RESTful are not especially interesting. That's not what this is. It's a nicely pragmatic guide to the architectural patterns and why you might use them.

  1. The swamp of POX (plain old XML, or more likely JSON, these days): At first glance, many, myself included, take REST to mean RPC over HTTP with XML/JSON.
  2. URI: Start thinking in terms of resources.
  3. HTTP: The HTTP verbs together with URIs define a uniform interface. CRUD operations are always handled the same way. So are error conditions.
  4. Hypermedia (HATEOAS): Possible state transitions are given in the message body, allowing the logic of the application to reside (mostly? completely?) on the server side. The key here is "discoverability of actions on a resource."
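Here's a small illustration of that discoverability idea - my example, not from the talk - using R's httr package against GitHub's API, whose root endpoint (at the time of writing) returns a map of links for the client to follow rather than making the client hard-code URLs.

## hypermedia-style navigation: follow links the server advertises
library(httr)
library(jsonlite)

## the root endpoint tells the client where it can go next
resp <- GET("https://api.github.com", accept_json())
stop_for_status(resp)
links <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

## follow an advertised link instead of hard-coding the URL
emojis <- fromJSON(content(GET(links$emojis_url), as = "text", encoding = "UTF-8"))
head(names(emojis))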

More

Tuesday, June 25, 2013

The Dream 8 Challenges

The 8th iteration of the DREAM Challenges is under way.

DREAM is something like the Kaggle of computational biology with an open science bent. Participating teams apply machine learning and statistical modeling methods to biological problems, competing to achieve the best predictive accuracy.

This year's three challenges focus on reverse engineering cancer, toxicology and the kinetics of the cell.

Sage Bionetworks (my employer) has teamed up with DREAM to offer our Synapse platform as an integral part of the challenges. Synapse is targeted at providing a platform for hosting data analysis projects, much like GitHub is a platform for software development projects.

My own part in Synapse is on the Python client and a bit on the R client. I expect to get totally pummeled by bug reports once participation in the challenges really gets going.

Open data, collaborative innovation and reproducible science

The goal of Synapse is to enable scientists to combine data, source code, provenance, prose and figures to tell a story with data. The emphasis is on open data and collaboration, but full access control and governance for restricted access is built in.

In contrast to Kaggle, the DREAM Challenges are run in the spirit of open science. Winning models become part of the scientific record rather than the intellectual property of the organizers. Sharing code and building on other contestants' models is encouraged, with hopes of forming networks of collaborative innovation.

Aside from lively competition, these challenges are a great way to compare the virtues of various methods on a standardized problem. Synapse is aiming to become an environment for hosting standard open data sets and documenting reproducible methods for deriving models from them.

Winning methods will be presented at the RECOMB/ISCB Conference in Toronto this fall.

So, if you want to sharpen your data science chops on some important open biological problems, check out the DREAM8 challenges.

More on DREAM, Sage Bionetworks, and Challenges

Wednesday, June 12, 2013

Big Data Book

Big Data: A Revolution That Will Transform How We Live, Work, and Think asks what happens when data goes from scarce to abundant. The book expands on The data deluge, a report appearing in the Economist in early 2010, providing a concise overview of a topic that is at the same time over-hyped and also genuinely transformative.

Authors Kenneth Cukier, Data Editor at the Economist, and Viktor Mayer-Schonberger, Oxford professor of public policy, look at big data trends through a business oriented lens.

Their book identifies two important shifts in thinking as key to dealing with big data: comfort with the messy uncertainty of probability and what the book refers to as correlations over causality. This just means that some tasks are better suited to a correlative or pattern matching approach rather than modeling from first principles. No one drives a car by thinking about physics.

In domains such as natural language and biology we'd like to know the underlying theory, but don't. At least, not yet. Where theory is insufficiently developed to directly answer pressing questions, a correlative, probabilistic approach is the best we can do. The merits of this concession to incomplete information can be debated, and was by Norvig and Chomsky. But, empirical disciplines like medicine and engineering rely on it - let alone the really messy worlds of economics and politics. It's often necessary to act in absence of mechanistic understanding.

As data goes from scarcity to abundance, judgment and interpretation remain valuable. What changes is that you don't necessarily have to formulate a question and only then collect data that may help answer it. Another strategy becomes viable - collect a trove of data first and then interrogate it. That may bother hypothesis-driven traditionalists, but once you have a high-throughput data source - a DNA sequencer, a sensor network or an e-commerce site - it makes sense to ask yourself what questions you can ask of that data.

Datafication

One industry or discipline after another has undergone the transition from information scarcity to super-abundance. One of the last hold-outs is medicine, where the potential impact is huge. Internet companies are accumulating vast data hoards streaming in through social networks and smartphones. The process has something of the character of a land-rush. Given the economies of scale, each category has its dominant player.

The dark side

As for the obligatory dark side of big data, the book worries about "the prospect of being penalized for things we haven't even done yet, based on big data's ability to predict our future behavior."

While this won't keep me up at night, technology is not necessarily the liberating force that we may hope it is, nor is it necessarily a tool of Orwellian control. Certain technologies lend themselves more readily one way or the other.

A more pedestrian "threat to human agency" comes from marketers and political consultants equipped with data feeds, powerful machines and brainy statisticians, and able to manipulate people into consuming useless junk and acting against their own interests. Given detailed individual profiles and constant feedback on the effectiveness of targeted marketing, the puppet masters are likely to keep getting better at pulling our strings.

Data governance

To safeguard privacy and ethical uses of data, the book advocates a self-policing professional association of data scientists similar to those existing for accountants, lawyers and brokers. Personally, I'm skeptical that these organizations have much in the way of teeth. An informed public debate might be a healthier option.

Bolder thinking in this area comes from Harvard's cyber-law group. Their VRM (for vendor relationship management) project proposes a protocol by which individuals and vendors can specify the terms under which they are willing to share data. Software can automate the negotiation and present options where terms are compatible. Precedents for these ideas can be found in finance where bonds and options are standardized contracts and can thus be fluidly traded.

The new oil?

Land, labor and capital are the classic factors of production - the primary inputs to the economy - or so they teach in Econ 101. Technology certainly belongs on that list and data does as well. The original robber barons sat in castles overlooking the Rhine or the Danube taxing the flow of goods and travelers around Europe. The term was later applied to giant industrial firms built on other flows: oil, rail and steel. Companies like Google, Facebook and Amazon extract wealth from a new flow: that of data. It's time to think of data as another factor of economic production, a raw material out of which something can be made.

Links, links, links

Wednesday, May 29, 2013

Shiny talk by Joe Cheng

Shiny is a framework for creating web applications with R. Joe Cheng of RStudio, Inc. presented on Shiny last evening in Zillow's offices, 30 stories up in the former WaMu Center. Luckily, the talk was interesting enough to compete with the view of Elliott Bay aglow with late evening sunlight streaming through breaks in the clouds over the Olympics.

Shiny is very slick, achieving interactive and pleasant looking web UIs with node.js, websockets and Bootstrap under the hood. It's designed on a reactive programming model (like Bacon.js and Ember) that eliminates a lot of the boilerplate code associated with listeners or observers in UI coding.
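A minimal single-file app gives the flavor of the reactive model - my own toy example, not one of the demos from the talk. The renderPlot block re-executes automatically whenever the slider's value changes; there's no explicit listener wiring anywhere.

## minimal Shiny app demonstrating reactivity
library(shiny)

ui <- fluidPage(
  sliderInput("n", "Number of observations:", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)

server <- function(input, output) {
  ## re-runs automatically whenever input$n changes
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = paste(input$n, "draws from a standard normal"))
  })
}

shinyApp(ui, server)   # launches the app in a browser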

Shiny comes in two parts: the shiny R package for developing Shiny apps, and Shiny Server for deploying them. RStudio intends to create a paid tier consisting of an enterprise server and a paid hosting service, Glimmer, which is free for now.

Among several demos were a plot of TV Show Rankings over time and a neat integration with Google's Geochart library to map World Bank health nutrition and population statistics. There are also some examples of combining D3 with Shiny (G3Plot).

Possibly the coolest demo was a tutorial on reactive programming in the form of an R console in a browser. Chunks of code could be dragged and dropped around, as in "live document" systems like IPython notebooks or Chris Granger's Clojure IDE Light Table.

Links

Monday, April 08, 2013

HiveBio: a DIY biology lab for Seattle

Seattle is one of the few cities with a big biotech industry lacking a community lab space. Katriona Guthrie-Honea and Bergen McMurray are going to fix that by creating a DIY bioscience lab. The Seattle HiveBio Community Lab will be a community supported Do-It-Yourself (DIY) biology hacker-space or maker-space.

Katriona Guthrie-Honea is a student at Ingraham High and an intern at Fred Hutch. Bergen McMurray is a neuroscience student and an alumna of the Allen Brain Institute and Jigsaw Renaissance, a maker-space in Seattle's International District.

Worrying about an "innovational stagnation period" because not enough people are learning and playing with biotech, Guthrie-Honea wants to provide a place where people of all ages can do just that.

Synthetic Biology was founded on the idea of bringing an engineering mindset to biotechnology, with one result being BioBricks, the beginnings of a set of modular components. The iGEM competitions drive education and open community around synthetic biology.

But, one could argue that a standard engineer wouldn't make a centrifuge out of a salad spinner or a ceiling fan. To do that, what you need is a hacker.

I love the idea of bringing the hacker mentality to life sciences. Just like we should all take the lids off our computers and root our phones, we should be hacking the yeast in our beer like mad scientist Belgian monks.

Anticipating a May opening, Guthrie-Honea and McMurray are seeking funding from Microryza, which is like a Kickstarter for science, and a great idea in itself.

Do you love the idea, too? Want to help? Just like Kickstarter, Microryza is a crowdfunding platform. Check out their project and kick in a few bucks.

Thursday, March 28, 2013

Playing with earthquake data

This is a little example inspired by Jeff Leek's Data analysis class. Jeff pointed us towards some data from the US Geological Survey on earthquakes. I believe these are the earthquakes for the last 7 days.

According to Nate Silver, analysis of earthquake data has made fools of many. But, that won't stop us. Let's see what we can do with it.

fileUrl <- "http://earthquake.usgs.gov/earthquakes/catalogs/eqs7day-M1.txt"
download.file(fileUrl, destfile = "./data/earthquake_7day.csv", method = "curl")
eq <- read.csv("./data/earthquake_7day.csv")

Let's be good citizens and record the provenance of our data.

attributes(eq)["date.downloaded"] <- date()
attributes(eq)["url"] <- fileUrl
save(eq, file = "earthquakes_7day.RData")
attributes(eq)$date.downloaded
## [1] "Mon Feb  4 20:03:48 2013"

Let's see what they give us.

colnames(eq)
##  [1] "Src"       "Eqid"      "Version"   "Datetime"  "Lat"      
##  [6] "Lon"       "Magnitude" "Depth"     "NST"       "Region"

Something looks a little funny about the distribution of magnitudes. What's up with that bump at magnitude 5? I'm guessing that some detectors are more sensitive than others and the less sensitive ones bottom out at around 5.

hist(eq$Magnitude, main = "Histogram of Earthquake Magnitude", col = "#33669988", 
    border = "#000066")

[plot: Histogram of Earthquake Magnitude]

I'm guessing that the better detectors are on the west coast of the US.

westcoast <- c("Northern California", "Central California", "Southern California",  "Greater Los Angeles area, California", "Santa Monica Bay, California", "Long Valley area, California", "San Francisco Bay area, California", "Lassen Peak area, California", "offshore Northern California", "Washington", "Seattle-Tacoma urban area, Washington", "Puget Sound region, Washington", "Olympic Peninsula, Washington", "Mount St. Helens area, Washington", "San Pablo Bay, California", "Portland urban area, Oregon", "off the coast of Oregon")
hist(eq[eq$Region %in% westcoast, "Magnitude"], main = "Histogram of Earthquake Magnitude - West Coast US", 
    col = "#33669988", border = "#000066")

[plot: Histogram of Earthquake Magnitude - West Coast US]

Now that looks a little smoother. Let's try R's mapping ability. Check out the ring of fire.

library(maps)
map()
points(x = eq$Lon, y = eq$Lat, col = rgb(1, 0, 0, alpha = 0.3), cex = eq$Magnitude * 
    0.5, pch = 19)

[plot: world map with recent earthquakes marked]

Zoom in on the west coast.

map("state", xlim = range(c(-130, -110)), ylim = range(c(25, 50)))
points(x = eq$Lon, y = eq$Lat, col = rgb(1, 0, 0, alpha = 0.3), cex = eq$Magnitude * 
    0.5, pch = 19)

[plot: west coast US map with recent earthquakes marked]

Lucky for us, almost all the recent quakes out here were tiny and the biggest was way off the coast.

summary(eq[eq$Region %in% westcoast, "Magnitude"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.2     1.4     1.5     1.7     5.3

There are supposed to be active fault lines here in Seattle and you can see a few little ones. Down by Mt. St. Helens, they get a few moderately bigger ones. The only quakes I've felt recently were the contractors doing demolition for remodeling my basement.

The knitr output can also be seen on RPubs: Playing with Earthquake Data and on GitHub cbare/earthquakes.rmd.