Thursday, June 28, 2012

Institute for Systems Biology at Google I/O

The Shmulevich Lab scored a nice feature in the day 2 keynote at Google I/O. (See video starting at about 40:25 and again at 46:05.) Their work on The Cancer Genome Atlas (TCGA) was part of the introduction of Google Compute Engine.

Ilya and Hector are quoted in the case study, Cancer Investigators Use Google Compute Engine to Accelerate Life-Saving Research.

The machine learning component is a random forest application called RF-ACE and the visualization is circvis. More Shmulevich lab code can be found at Code for Systems Biology.

Sunday, June 24, 2012

Data analysis workflow patterns

About a year ago, I ran a kooky idea past some colleagues. They gave it a big WTF, so I sat on it for a while. Not to be deterred, I still like the idea, so here it is.

Workflows are key to bioinformatics. For example, an analysis of gene expression might go something like the following. Measure gene expression using arrays or RNA-seq. After a bit of normalization, cluster genes by correlated expression. Then compute functional enrichment on the clusters. This helps get at questions about how the cell reacts to some stimulus, whether that's nutrients, toxins, pH, sunlight, whatever. Or, what's the difference between a sick cell and a healthy one? The answer comes in terms of processes or pathways that are up- or down-regulated.

You might implement our analysis of gene expression in R or Python. You could do it with point and click software like MeV and web tools like DAVID or workflow tools like Galaxy. You might use k-means, hierarchical clustering or something fancier. You might use GO terms to annotate gene function or KEGG pathways. There are lots of options, but the central idea is conserved.
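To make the conserved idea concrete, here's a minimal sketch of the normalize, cluster, enrich sequence in plain Python. Everything in it is an assumption for illustration: the gene names and expression values are toy data, the greedy correlation clustering is a simple stand-in for k-means or hierarchical clustering, and the "annotated" gene set plays the role of a single GO term or KEGG pathway.

```python
import math
from statistics import mean, pstdev

def normalize(profile):
    """Z-score one gene's expression profile across samples.
    Assumes the profile is not constant (non-zero standard deviation)."""
    m, s = mean(profile), pstdev(profile)
    return [(x - m) / s for x in profile]

def correlation(a, b):
    """Pearson correlation of two profiles that are already z-scored."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

def cluster(expr, threshold=0.9):
    """Greedy clustering: a gene joins the first cluster whose seed gene
    it correlates with, otherwise it starts a new cluster."""
    clusters = []
    for gene, profile in expr.items():
        for members in clusters:
            if correlation(expr[members[0]], profile) >= threshold:
                members.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters

def enrichment_p(cluster_genes, annotated, universe_size):
    """Hypergeometric upper tail: the chance of seeing at least the
    observed overlap between a cluster and an annotated gene set."""
    n, K = len(cluster_genes), len(annotated)
    k = len(set(cluster_genes) & annotated)
    total = math.comb(universe_size, n)
    hits = sum(math.comb(K, i) * math.comb(universe_size - K, n - i)
               for i in range(k, min(n, K) + 1))
    return hits / total

# Toy data: two genes rise across samples, two fall.
raw = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8],
    "geneC": [4, 3, 2, 1],
    "geneD": [8, 6, 4, 2],
}
expr = {gene: normalize(profile) for gene, profile in raw.items()}
clusters = cluster(expr)

# Hypothetical annotation: geneA and geneB share one functional term.
annotated = {"geneA", "geneB"}
p_value = enrichment_p(clusters[0], annotated, universe_size=len(raw))
```

Swapping in a different clustering method or annotation source changes the function bodies but not the shape of the workflow, which is the point.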

The notion of a data analysis workflow is something like the concept of a software design pattern, an idea that software engineers borrowed from architects. A design pattern is a general reusable template for how to solve a commonly occurring problem. Naming and documenting a design pattern makes sharing knowledge easier and lets you talk and think at a higher level of abstraction.

Thinking of workflows as assets, it follows that they should be collected, documented and published. "In bioinformatics, we are familiar with the idea of curated data as a prerequisite for data integration. We neglect, often to our cost, the curation and cataloguing of the processes that we use to integrate and analyse our data." (Carol Goble et al. 2008) Both Taverna and Galaxy have mechanisms for doing this, albeit within the context of those tools. Maybe a better place to document workflows would be in journals. Philip Bourne, Editor-in-Chief of PLOS Computational Biology, says, "I want the publisher of the future to be the guardian of these workflows create a better scholarly record." He's speaking here of scientific workflows in a more general context than just data analysis.

Design patterns are documented in a specific format, detailing the scenarios in which that pattern applies, the intent behind it, its structure, implementation and consequences. Examples are usually given of real usages of the pattern along with discussion of alternatives or related patterns and its risks and pitfalls. For data analysis workflows, we'd probably want to discuss possible sources of error, required conditions and statistical properties.
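As a rough sketch of what such a record might look like, here's one possible template written as a Python dataclass. The class name and field names are my own invention, not an established standard. The example entry is filled in from the gene expression workflow described above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class WorkflowPattern:
    """Hypothetical template for documenting a data analysis workflow."""
    name: str
    intent: str
    applies_when: str
    structure: List[str]
    consequences: str = ""
    known_uses: List[str] = field(default_factory=list)
    related_patterns: List[str] = field(default_factory=list)
    pitfalls: List[str] = field(default_factory=list)
    # Extra sections suggested for data analysis workflows:
    error_sources: List[str] = field(default_factory=list)
    required_conditions: List[str] = field(default_factory=list)
    statistical_properties: List[str] = field(default_factory=list)

    def summary(self):
        """One-line description, handy for listing in a catalog."""
        return f"{self.name}: {self.intent} ({len(self.structure)} steps)"

pattern = WorkflowPattern(
    name="Expression clustering with enrichment",
    intent="find processes or pathways up or down regulated",
    applies_when="gene expression measured by arrays or RNA-seq",
    structure=[
        "normalize expression values",
        "cluster genes by correlated expression",
        "compute functional enrichment on the clusters",
    ],
    required_conditions=["expression values are normalized first"],
)
```

Sections left empty, like pitfalls and statistical properties, are exactly the parts that would need real curation.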

This is not to say we need formalism for the sake of formalism. It's tempting to get caught up with impractical methodological hoo-ha, focusing on process to the exclusion of real and practical goals. A workflow should be a tool in a researcher's toolbox, a pragmatic way to package a bite-sized piece of knowledge. To be avoided is the over-abstraction and questionable utility that some (i.e. me) see in BPEL, BPM, MDA and related (increasingly defunct) technology trends.

Rather than design patterns, maybe a more biologist-friendly term is a protocol. Plus, there's already a long history of journals dedicated to lab protocols. Why not do the same for protocols for data analysis?


Like every idea, good or crackpot, people have thought of this one before me.

Saturday, June 23, 2012

Composition methods compared

Clojurist, technomancer and Leiningen creator, Phil Hagelberg does a nice job of dissecting "two ways to compose a number of small programs into a coherent system". Read the original in which three programming methods are compared. These are my notes, quoted mostly verbatim:

The Unix way

Consists of many small programs which communicate by sending text over pipes or using the occasional signal. Around this compelling simplicity and universality has grown a rich ecosystem of text-based processes with a long history of well-understood conventions. Anyone can tie into it with programs written in any language. But it's not well-suited for everything: sometimes the requirement of keeping each part of the system in its own process is too high a price to pay, and sometimes circumstances require a richer communication channel than just a stream of text.
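Here's a small sketch of that style, driven from Python. Two tiny programs, each in its own process, are connected by a pipe and exchange nothing but lines of text. The two inline scripts are made up for the example and stand in for something like grep and wc -l. They run through the current Python interpreter so the sketch doesn't depend on which utilities are installed.

```python
import subprocess
import sys

# Program 1: pass through only lines mentioning "gene" (a grep stand-in).
FILTER = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    if 'gene' in line:\n"
    "        sys.stdout.write(line)\n"
)

# Program 2: count the lines it receives (a wc -l stand-in).
COUNT = (
    "import sys\n"
    "print(sum(1 for _ in sys.stdin))\n"
)

def run_pipeline(text):
    """Feed text to FILTER, pipe FILTER's output into COUNT, return COUNT's output."""
    first = subprocess.Popen([sys.executable, "-c", FILTER],
                             stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                             text=True)
    second = subprocess.Popen([sys.executable, "-c", COUNT],
                              stdin=first.stdout, stdout=subprocess.PIPE,
                              text=True)
    first.stdout.close()   # the second process now owns the read end
    first.stdin.write(text)
    first.stdin.close()    # signals end of input to the first process
    output = second.communicate()[0]
    first.wait()
    return output.strip()

result = run_pipeline("geneA up\nnoise\ngeneB down\n")
```

Neither program knows anything about the other beyond "text in, text out", which is both the appeal and the limitation described above.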

The Emacs way

A small core written in a low-level language implements a higher-level language in which most of the rest of the program is implemented. Not only does the higher-level language ease the development of the trickier parts of the program, but it also makes it much easier to implement a good extension system since extensions are placed on even ground with the original program itself.

The core Mozilla platform is implemented mostly in a gnarly mash of C++, but applications like Firefox and Conkeror are primarily written in JavaScript, as are extensions.
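A toy version of the same idea, with big caveats: Python is not a low-level language, and this evaluator for a tiny prefix-notation language (programs written as nested lists) is purely my own illustration, not how Emacs or Mozilla are built. What it shows is the shape of the approach. The core is a few primitives plus an evaluator, and both the "application" and a later "extension" are written in the higher-level language, on even ground.

```python
import operator

def evaluate(expr, env):
    """The core: evaluate a nested-list expression in an environment."""
    if isinstance(expr, str):            # a symbol: look it up
        return env[expr]
    if not isinstance(expr, list):       # a literal such as a number
        return expr
    head = expr[0]
    if head == "define":                 # ["define", name, value]
        _, name, value = expr
        env[name] = evaluate(value, env)
        return env[name]
    if head == "lambda":                 # ["lambda", [params], body]
        _, params, body = expr
        return lambda *args: evaluate(body, {**env, **dict(zip(params, args))})
    func = evaluate(head, env)           # otherwise: a function call
    return func(*[evaluate(arg, env) for arg in expr[1:]])

# Primitives supplied by the core.
env = {"+": operator.add, "*": operator.mul}

# "Application" code, written in the little language.
evaluate(["define", "square", ["lambda", ["x"], ["*", "x", "x"]]], env)

# An "extension", written the same way and free to build on the application.
evaluate(["define", "sum-of-squares",
          ["lambda", ["a", "b"], ["+", ["square", "a"], ["square", "b"]]]], env)

answer = evaluate(["sum-of-squares", 3, 4], env)
```

The extension calls square exactly as the application would, with no special plugin interface in between.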

Sunday, June 10, 2012

Scientific imaging with Hanchuan Peng

Hanchuan Peng of Janelia Farm spoke on bioimaging at ISB a couple of weeks back. He's doing some very cool work mining microscopy images and doing registration - aligning individual cells across images. They've created a 3D atlas of C. elegans which tracks every cell. The still pictures don't do it justice. Check out the movies.

By localizing and registering neural fibers in 2,954 fly brains, Peng's group constructed this wiring diagram of the fly's 100,000 neurons.


Tuesday, June 05, 2012

Scaling higher education

In the fall of last year, over 100,000 students signed up for Andrew Ng's Machine Learning Class and more than 12,000 of them completed the course. Sebastian Thrun and Peter Norvig taught Artificial Intelligence with similarly impressive numbers.

I was one of the thousands in the Machine Learning class. I had so much fun with that, I also took Daphne Koller's Probabilistic Graphical Models. That one was quite a bit harder, covering some fairly advanced stuff, at least for my few remaining brain cells. But, I finished! For the PGM class, 6702 took the first quiz and 1441 took the final - pretty good retention for such a ball-buster of a class.

This spring, at least two new companies offering online courses were founded. Andrew Ng and Daphne Koller founded Coursera. Partnering with professors from Princeton, Penn, University of Michigan, and Berkeley, they've broadened their course catalog from a base in computer science to include classes in history, mathematics and even poetry. Sebastian Thrun, founder of Udacity, calls his approach University 2.0 and speaks of "democratizing higher education" and "empowering students" especially in the developing world where access to higher education is more limited. Almost three quarters of Coursera's students are outside the US, in countries like Brazil, Britain, India and Russia.

The classes are surprisingly fun. The formula boils down to two key elements:

  • short segments
  • interaction

In the mold of Khan Academy, the lectures are broken into short segments of 10 to 15 minutes, which fit nicely into busy schedules. Short quizzes test the student's understanding. The courses have a social aspect, as well. Online forums provide a place for questions and a sense of camaraderie while struggling through difficult concepts. Meetups and study groups have sprung up in several cities across the world.

The programming exercises are where the real fun begins. Students write code that implements the crux of an operation, filling in the blanks in provided boilerplate code. Grading works a bit like unit testing. Progressing through the assignment by getting tests to pass gives gratifyingly immediate feedback. Completing an assignment results in working code for handwriting recognition, spam classification, image processing or recognizing an action from Kinect position sensing data.
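To give a flavor of the fill-in-the-blank and grade-by-tests loop, here's a hypothetical sketch. It is not the actual course grader. The function being "submitted", the test cases and the tolerance are all made up for illustration.

```python
import math

def sigmoid(z):
    """The 'blank' a student fills in: the logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def grade(func, cases, tol=1e-6):
    """Run a submission against (args, expected) cases, unit-test style.
    Returns (number passed, number of cases)."""
    passed = 0
    for args, expected in cases:
        try:
            ok = abs(func(*args) - expected) <= tol
        except Exception:
            ok = False   # a crash counts as a failed case
        passed += ok
    return passed, len(cases)

# Hypothetical grader cases for the sigmoid exercise.
CASES = [
    ((0.0,), 0.5),
    ((100.0,), 1.0),
    ((-100.0,), 0.0),
]

score = grade(sigmoid, CASES)
```

A half-finished submission gets partial credit right away, for instance a function that always returns 0.5 passes one case of three, which is where the immediate feedback comes from.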

Thomas Friedman says, Let the revolution come:

Welcome to the college education revolution. Big breakthroughs happen when what is suddenly possible meets what is desperately necessary.

With the cost of tuition rising, and public funding falling, the timing might be right for some disruptive innovation in higher education. And, the skills on offer are in high demand. One proposed business model is to offer classes for free and charge employers for access to the data.

Refactoring the university classroom to function at internet scale meshes with the building momentum behind open access journals that some are calling an Academic Spring as well as with citizen science projects like Galaxy Zoo.

Increasing openness in academics, in both teaching and research, can only reduce friction in the process of transferring technology from the lab to production and may help engage the public with science. Look for lots of interesting developments in the next few years, as technology knocks a new door into the ivory tower.



Coursera continues to generate lots of news, signing up 12 new universities, including the University of Washington (yay, UW!), attracting the attention of Bill Gates, and being described as The Single Most Important Experiment in Higher Education by the Atlantic, The Beginning of the End for Traditional Higher Education by Fortune and Reshaping Education on the Web by the NYT.

Sunday, June 03, 2012

Working in academics

The benefits of working in academics are:

  • Important and interesting work
  • Opportunities for development and growth
  • Freedom and fun

Those paying attention will recognize Daniel Pink's elements of motivation - autonomy, mastery, and purpose.

The downside? Having to explain to your spouse why you're not making as much money as so-and-so, who has basically the same skills as you. Or worse yet, why you make less than some other so-and-so who can barely tie his own shoes.

That is, unless your spouse is in academics, too. In which case, God help you.

Explaining this is not fun, but if the three elements above are in abundant supply, a fairly convincing case can be made. If those factors start to run low... well, your spouse might be right.