Apparently, live-blogging isn't my thing. This post has been sitting on my desktop getting stale for entirely too long...
I'm just back (a month ago) from the conference for the DREAM challenges in Toronto. It was great to see the Dream 8 challenges wrap up and another round of challenges get under way. Plus, I think my IQ went up a few points just by standing in a room full of smart people applying machine learning to better understand biology.
The DREAM organization, led by Gustavo Stolovitzky, partnered with Sage Bionetworks to host the challenges on the Synapse platform. Running the challenges involves a surprising amount of logistics and the behind-the-scenes labor of many: posing a good question, collecting and preparing data, providing clear instructions, and evaluating submissions. Puzzling out how to score submissions can be a data analysis challenge in its own right.
HPN
The winners of the Breast Cancer Network Inference Challenge applied a concept from economics called Granger causality. I have a thing for ideas that cross over from one domain to another, especially from economics to biology. The team, from Josh Stuart's lab at UCSC, calls its algorithm Prophetic Granger Causality, which it combined with data from Pathway Commons to infer signaling pathways from time-series proteomics data taken from breast cancer cell lines.
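For anyone unfamiliar with the concept: Granger causality asks whether the past of one time series improves predictions of another beyond what that series' own past already provides. Here's a minimal sketch of the vanilla test using statsmodels on toy data; to be clear, this illustrates plain Granger causality, not the team's Prophetic Granger Causality algorithm, and the toy series are invented for illustration.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

# Toy example: does series x "Granger-cause" series y?
# Here x leads y by one time step, plus a little noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.roll(x, 1) + 0.1 * rng.normal(size=200)

# grangercausalitytests expects a two-column array and tests whether
# the SECOND column helps predict the FIRST.
results = grangercausalitytests(np.column_stack([y, x]), maxlag=2)
# Small p-values on the F-tests suggest x's past improves predictions of y.
```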
The visualization portion of the same challenge was swept by Amina Qutub's lab at Rice University with their stylish BioWheel visualization, based on Circos. Qutub showed a prototype of a D3 re-implementation, and I'm told the source code will be available soon. There's a nice write-up, Qutub bioengineering lab builds winning tool to visualize protein networks, and a video.
BioWheel, from the Team ABCD write-up
Whole-cell modeling
In the whole-cell modeling challenge, led by Markus Covert of Stanford University, participants were tasked with estimating model parameters for specific biological processes in a simulated microbial metabolic model.
Since synthetic data is cheap, a little artificial scarcity was introduced by giving participants a budget with which they could purchase data of several different types. Using this data budget wisely then becomes an interesting part of the challenge. One thing that's not cheap is running these models. That part was handled by BitMill, a distributed computing platform developed by Numerate and presented by Brandon Allgood.
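As a toy illustration of the parameter-estimation task (with a drastically simpler stand-in for a whole-cell model), here's a sketch of fitting a Michaelis-Menten rate law to a small "budget" of noisy simulated measurements. The rate law, parameters, and budget are all invented for illustration; they have nothing to do with the actual challenge model.

```python
import numpy as np
from scipy.optimize import curve_fit

# Stand-in for one biological process in the model:
# a Michaelis-Menten rate law with unknown parameters Vmax and Km.
def mm_rate(s, vmax, km):
    return vmax * s / (km + s)

rng = np.random.default_rng(42)
true_vmax, true_km = 2.0, 0.5

# Pretend each measurement costs part of our data budget,
# so we can only afford a handful of noisy observations.
budget_points = 15
s = rng.uniform(0.05, 3.0, size=budget_points)  # substrate concentrations
v = mm_rate(s, true_vmax, true_km) + 0.05 * rng.normal(size=budget_points)

# Least-squares estimate of the parameters from the purchased data.
(est_vmax, est_km), _ = curve_fit(mm_rate, s, v, p0=[1.0, 1.0])
print(f"Vmax ~ {est_vmax:.2f} (true {true_vmax}), Km ~ {est_km:.2f} (true {true_km})")
```

Deciding which substrate concentrations to sample, given limited funds, is the kind of experimental-design question the budget mechanic makes participants confront.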
Making hard problems accessible to outsiders with new perspectives is a big goal of organizing a challenge. One of the winners of this challenge, a student in a neuroscience lab at Brandeis, is a great example. The organizers are considering a second round of this challenge next year.
Toxicogenetics
Yang Xie's lab at UT Southwestern's QBRC added another win to their collection in the Toxicogenetics challenge. Tao Wang revealed some of their secrets in his presentation.
Wang explained that their approach relies on extensive exploratory analysis, careful feature selection, dimensionality reduction, and rigorous cross-validation.
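Those ingredients map naturally onto a scikit-learn pipeline. The sketch below is my own illustration of the general recipe, not the team's code; the toy data and parameter choices are made up.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Toy data standing in for, e.g., many genomic features vs. a toxicity readout.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))
y = X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=100)

# Feature selection -> dimensionality reduction -> regularized regression,
# evaluated with cross-validation over the whole pipeline at once.
pipeline = Pipeline([
    ("select", SelectKBest(f_regression, k=50)),
    ("reduce", PCA(n_components=10)),
    ("model", Ridge(alpha=1.0)),
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean())
```

Nesting feature selection inside the pipeline keeps it within each cross-validation fold, which is what "rigorous" means here: selecting features on the full dataset first would leak information and inflate the scores.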
Keynotes
Trey Ideker presented a biological ontology called NeXO, for Network-Extracted Ontology, comparable in some ways to GO. The important difference is this: whereas GO is curated by humans from the literature, NeXO is derived from data. There's a paper, A gene ontology inferred from molecular networks, and a slick-looking NeXO web application.
Tim Hughes gave the other keynote, on RNA-binding motifs, complementing his prior work constructing RBPDB, a database of RNA-binding specificities.
DREAM 8.5
Since the next round of challenges is coming sooner than DREAM's traditional yearly cycle, it's numbered 8.5. There are three challenges that are getting underway now:
- Predicting progression in Alzheimer's
- Mutation calling from next-gen sequencing data
- Predicting drug response in rheumatoid arthritis
Discussion
During the discussion session, questions were raised about whether to optimize the design of the challenges for producing generalizable methods or for answering specific biological questions. Domain knowledge can give competitors an advantage in understanding noise characteristics and artifacts, in addition to informing realistic assumptions. If the goal is strictly to compare algorithms, this is a bug. But in the context of answering biological questions, it's a feature.
Support was voiced for data homogeneity (no NAs; data accompanied by quality/confidence metrics) and for more real-time feedback.
It takes impressive community-management skills to maintain a balance that appeals to a diverse array of talent as well as to the providers of data and support.