Showing posts with label software engineering.

Wednesday, October 30, 2013

Building distributed systems out of crap

Pat Helland gave the opening keynote at Basho's Ricon West conference yesterday. The general topic was building distributed systems with enterprise guarantees and web scalability on crap. His argument is that enterprise-grade SLAs with lots of nines can be supported on cheap hardware using a strategy of expecting failure and recovering quickly.

Helland, who previously did time on the Bing team and at Amazon, is building a distributed data storage system for Salesforce.com. Its design involves a catalog stored in a relational DB and files stored on clusters of storage servers, a technique Helland calls blobs-by-reference.

The files are stored in fragments distributed across a cluster. There was another concept called an “extent”. I wasn't sure if that meant an aggregation of related fragments or just a bucket to dump them in.

SSDs are used as a new layer of the memory hierarchy. Helland argues for using the cheapest and crappiest available. This entails a couple of engineering tweaks. Because SSDs degrade with every operation, the software has to manage read/write cycles. To detect data corruption, each fragment is packaged with a CRC error-detecting code.

“By surrounding the data with aggressive error checking, we can be extremely confident of detecting an error and fetching the desired data from one of the other places it has been stored.”
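In R terms (purely my own toy illustration using the digest package, nothing from Helland's talk), the store-and-verify cycle might look like this:

library(digest)

## a toy version of the store-and-verify cycle: keep a checksum alongside the
## fragment, recompute it on every read, and fall back to another replica on mismatch
fragment <- as.raw(sample(0:255, 1024, replace=TRUE))
stored <- list(data=fragment,
               crc=digest(fragment, algo="crc32", serialize=FALSE))

readFragment <- function(stored) {
  if (digest(stored$data, algo="crc32", serialize=FALSE) != stored$crc)
    stop("corrupt fragment -- fetch a replica from another node")
  stored$data
}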

Helland emphasized the importance of immutable data, which goes a long way towards mitigating the inconsistency and race conditions that come with distributed computing. In the proposed storage system, fragments are immutable, which greatly reduces opportunity for the storage nodes to get out of sync with the catalog.

Aside from this talk, Ricon is loaded with good content including a talk by Jeff Dean coming up this afternoon. Send me next year!

More

Monday, October 14, 2013

Concurrency and Parallelism - What's the difference?

For a while, I've been coming across references to the difference between concurrency and parallelism. The definitions go something like this: Concurrency concerns "interleaved threads of execution with access to shared state" which is distinct from parallelism because "parallel operations run simultaneously".

I'm quoting from "Clojure Programming" by Chas Emerick, Brian Carper, and Christophe Grand, which is a perfectly good book. I've seen similar definitions elsewhere, so I don't want to pick on these guys in particular. I'm going to disagree a bit, but overall, the book is really well done and I'm enjoying it.

My beef is this: I couldn't see the utility of the distinction they're drawing. I couldn't see why you'd want to design a program differently to run as threads scheduled on a single core versus threads scheduled on several cores. In fact, treating those cases the same seems like a plus.

In contrast, there are some distinctions between types of concurrency that are useful. Knowing your code will be distributed across machines tells you to bring network latency into the picture. Likewise, only certain problems are amenable to the single-instruction-multiple-data (SIMD) model of vector processors such as GPUs. These considerations have a real impact. But, why the pedantry over concurrency versus parallelism?

I was about to write a little rant about why this distinction is useless. But, keeping an open mind, I googled around a bit and up popped a talk by Rob Pike called "Concurrency Is Not Parallelism". Change of plan. Rob Pike is a bad-ass, well known as a Unix pioneer, Bell Labs veteran and Google Distinguished Engineer. New plan: go back to school and find out why I'm wrong.

Pike's talk explains things beautifully, and not just because he's wearing an orange suit jacket and a gopher t-shirt. Here's my understanding of the take-away:

Concurrency is a more general and abstract idea than parallelism. Concurrency is about the decomposition of a problem into subtasks at the design level. If you're creating a concurrent design, you haven't said yet whether your design will be executed in parallel. Parallelism is a detail to be decided at run-time. Which brings us to the good part.

Whenever you can take two things that were previously conjoined and let them vary independently, you're making progress. The two things in question here are the design - the decomposition of a problem into concurrent parts - and the execution of those parts, perhaps in parallel. Making this separation allows programs to be expressed correctly and structured clearly while making good use of available resources whether that's one core or many.

This important point is what's missing from the definitions above. That, and they're comparing things at different levels of generality.
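Here's a homely illustration in R, since that's what I've been hacking in lately: the decomposition of the work is identical in both cases, and only the decision about how to execute it changes. (slowTask is just a stand-in for real work.)

library(parallel)

## the concurrent design: the problem decomposed into independent pieces
slowTask <- function(i) { Sys.sleep(0.1); i^2 }   # stand-in for real work
pieces <- 1:20

## sequential execution of that design, on one core
ans1 <- lapply(pieces, slowTask)

## parallel execution of the very same design -- the decomposition didn't change
cl <- makePSOCKcluster(4)
ans2 <- parLapply(cl, pieces, slowTask)
stopCluster(cl)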

Next, Pike relates these ideas to Go. The language provides three concurrency primitives: goroutines, channels, and select. Goroutines are like threads but much cheaper. During execution, they're mapped onto OS threads by a scheduler. Channels and select statements enable communication. Go is an implementation of concepts that have their origin in the classic paper Communicating Sequential Processes by Tony Hoare.

The moral of the story? Learn from the masters ...and from the gophers.

More

Tony Hoare's paper is on several "greats" lists including these:

Thursday, December 06, 2012

R in the Cloud

I've been having some great fun parallelizing R code on Amazon's cloud. Now that things are chugging away nicely, it's time to document my foibles so I can remember not to fall into the same pits of despair again.

The goal was to perform lots of trials of a randomized statistical simulation. The jobs were independent and fairly chunky, taking from a couple of minutes up to 90 minutes or so. From each simulation, we got back a couple dozen numbers. We worked our way up to running a few thousand simulations at a time on 32 EC2 nodes.

The two approaches I tried were Starcluster and the parallel package that comes with the R distribution. I'll save Starcluster for later. I ended up pushing through with the parallel package.

The helpful folks of the Bioconductor group put together a CloudFormation template and AMI: Bioconductor parallel cluster with SSH. It's a great way to get started hacking R in the cloud. All it takes is a mouse-click and an Amazon account.

Using the parallel package

Here's a quick example using the parallel package to start up a cluster and fool around a bit.

library(parallel)
help(package=parallel)

## compile list of workers with one entry per core
lines <- readLines("/usr/local/Rmpi/hostfile.plain")
hosts <- do.call(c, lapply(strsplit(lines, " "), function(host) { rep(host[1], as.integer(host[2])) }))

## create the cluster passing an IP address for
## the head node
## hostname -i works on Linux, but not on BSD
## descendants (like OS X)
cl <- makePSOCKcluster(hosts,
        master=system("hostname -i", intern=TRUE))

## for testing, start a cluster on your local machine
# cl <- makePSOCKcluster(rep("localhost", 3))

## do something once on each worker
ans <- clusterEvalQ(cl, { mean(rnorm(1000)) })

## test a time consuming job
## (~30 seconds on a 4 core machine)
system.time(ans <- parLapplyLB(cl, 1:100, function(i) {
  ## summarize a bunch of random sample means
  summary(
    sapply(1:runif(1, 100, 2000),
           function(j) { mean(rnorm(10000)) }))
}))

## push data to the workers
myBigData <- rnorm(10000)
moreData <- c("foo", "bar", "blabber")
clusterExport(cl, c('myBigData', 'moreData'))

## shut down worker processes
stopCluster(cl)

That was easy. But, it wouldn't be any fun if a few things didn't go wrong.

Hitting limits

Why 32 machines? As we scaled up to larger runs, I started hitting limits. The first was in the number of machines Amazon lets you start up.

According to the email Amazon sent me: "Spot Instances, which have an instance limit that is typically five times larger than your On-Demand Instance limit (On-Demand default is 20)..."

You can get the limits raised, but I'm a total cheapskate anyway, so I hacked Dan's CloudFormation template to use spot instances, adding a "SpotPrice" property in the right place. Spot instances can go away at any time, but they're so much cheaper that it's worth dealing with that.

Then, I promptly hit another limit:

Error in socketConnection ... all connections are in use

Apparently, there's a limit built into R on the number of socketConnections you can open. It says, in src/main/connections.c:

#define NCONNECTIONS 128 /* snow needs one per slave node */

Sadly, my lust for more cores is doomed to be thwarted, for now. We can have a maximum of 128 workers, or only 32 instances with 4 cores apiece. Bummer.

makePSOCKcluster hangs

Usually, the worker processes come up fine. But sometimes they can't connect to the head node. The symptom of this problem is that makePSOCKcluster hangs.

I did ps -aexww and stared at the following gobbledy-gook for a while:

/usr/local/lib/R/bin/exec/R --slave --no-restore -e parallel:::.slaveRSOCK() --args MASTER=ip-10-4-215-155 PORT=10187 OUT= TIMEOUT=2592000 METHODS=TRUE XDR=TRUE SED=/bin/sed R_INCLUDE_DIR=/usr/local/lib/R/include R_DEFAULT_PACKAGES=datasets,utils,grDevices,graphics,stats SHELL=/bin/bash SSH_CLIENT=10.4.215.155 46154 22 USER=ubuntu LD_LIBRARY_PATH=/usr/local/lib/R/lib:/usr/local/lib:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64/server:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games MAIL=/var/mail/ubuntu PWD=/home/ubuntu R_SHARE_DIR=/usr/local/lib/R/share LANG=en_US.UTF-8 R_ARCH= HOME=/home/ubuntu SHLVL=1 LOGNAME=ubuntu LC_CTYPE=en_US.UTF-8 SSH_CONNECTION=10.4.215.155 46154 10.158.53.86 22 R_HOME=/usr/local/lib/R R_DOC_DIR=/usr/local/lib/R/doc

I confirmed that I could connect to the dud machines manually, and also from there back to the head node, like so:

ssh -i ~/.ssh/id_rsa ubuntu@10.241.65.139

The bug is resistant to rebooting and even terminating the dud node. Seemingly at random, somewhere between none and 3 machines out of 32 would turn out to be duds. How irritating!

Luckily, Dan from the Bioconductor group found the problem, and you can even see it, if you know where to look, in the aforementioned gobbledy-gook. The parameter MASTER=ip-10-4-215-155 means the worker has to do name resolution, which apparently sometimes fails. (See the notes under master in the docs for makePSOCKcluster.)

We can give it an IP address, neatly bypassing any name resolution tar-pit:

cl <- makePSOCKcluster(hosts,
    master=system("hostname -i", intern=TRUE))

Huge props to Dan for figuring that out and giving me a serious case of geek envy.

Load unbalancing

The LB in parLapplyLB stands for load balancing. It uses a simple and sensible strategy: give each worker one job, then when a worker is finished, give it another job, until all the jobs are assigned.

I think I saw cases where there were idle workers at a time when there were jobs that had not yet started. The only way that could happen is if the jobs were already assigned to a busy worker.

Looking at the code, that doesn't look possible, but I have a theory. There's an option in makePSOCKcluster to specify an outfile; outfile="" sends stdout and stderr back to the head node. I thought that might be handy for debugging.

Next, consider the call stack for parLapplyLB (down is deeper):

One could start to imagine that a chatty and long-running worker sending output back to the head node via the outfile="" option would cause a socket to be readable before the job is done. So, another job gets submitted to that worker. Then workers become available and go idle for lack of work, which has already been hogged up (but not started) by the chatty worker.

If it's only a weird interaction between outfile="" and parLapplyLB, it's not that big of a deal. A more unfortunate property of parLapplyLB is what happens when a worker goes away - say, a connection is dropped or a spot instance is terminated. In that case, parLapplyLB bombs out with a socket error and all work on all workers is lost. Doh!

For this reason, I had the workers write out checkpoints and collected them onto the head node periodically. This way, getting a return value back from parLapplyLB wasn't all that critical. And that brings me to the topic of automation.
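For the record, the checkpointing wrapper was nothing fancy. Here's a stripped-down sketch of the idea; runSimulation, nJobs, and the checkpoints directory are made-up stand-ins, cl is the cluster from the earlier example, and the real version had more error handling:

## stand-ins for the real thing
runSimulation <- function(i) { Sys.sleep(60); summary(rnorm(10000)) }
nJobs <- 1000

## each job saves its own result as soon as it finishes, so a crashed or
## interrupted run can be resumed by skipping completed jobs
checkpointed <- function(i) {
  dir.create("checkpoints", showWarnings=FALSE)
  f <- sprintf("checkpoints/job-%04d.rds", i)
  if (file.exists(f)) return(readRDS(f))
  result <- runSimulation(i)
  saveRDS(result, f)
  result
}

## the workers need to see runSimulation (cl is the cluster from above)
clusterExport(cl, "runSimulation")

## a dropped worker still kills this call, but the finished checkpoints
## survive on the workers' disks -- rsync them back periodically
results <- try(parLapplyLB(cl, 1:nJobs, checkpointed))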

Slothful automation

Automation is great. Don't tell anyone, but I take something of a lazy approach to automation: starting with a hacky mess that just barely works with lots of manual intervention and gradually refining it as needed, in the general direction of greater automation and more robust error handling.

Here are a few half-hearted attempts:

All this is closer to a hacky mess than clean automation. A lot of babysitting is still required.

Features I'd like to see

  • shared EBS volume (via NFS?) for packages, checkpoints and results
  • a queuing system that doesn't require persistent socket connections to the workers
  • async-lapply - returns a list of futures, which can be used to ask for status and results
  • progress monitoring on head node
  • support for scalable pools of spot instances that can go away at any time.
  • grow and shrink pool according to size of queue

The right tool for 10,000 jobs

There are many ways to parallelize R. The approach in this document uses the parallel package and RStudio on Amazon EC2. The parallel package is nice for interactive development and has the advantage of keeping R worker processes alive rather than starting a new process for each job. But, this approach only works up to a point.

Different styles include interactive vs. batch, implicit vs. explicit and reformulating the problem as a map/reduce job. For map/reduce style computations, look at Rhipe. R at 12,000 Cores describes the “Programming with Big Data in R” project (pbdR). For batch jobs, Starcluster may be a better choice.

Starcluster provides several of those features, albeit with the caveat of restarting R for each job. Having pushed R/parallel to its limits, I intend to try Starcluster a little more. So far I've only learned the term of art for when your Starcluster job goes horribly wrong - that's called a starclusterfuck.

Sunday, August 05, 2012

Simplicity made easy

Rich Hickey, inventor of Clojure, gave a talk at last year's Strange Loop called Simple Made Easy. In it, he makes the case that Lispy languages, and Clojure in particular, provide tools for achieving modularity and simplicity. Simplicity Matters, Hickey's keynote at RailsConf in April of this year, is a variant on the same themes. Here are my notes, either quoted verbatim or paraphrased.

“Simplicity is a prerequisite for reliability”
- Edsger Dijkstra

“Simplicity is the ultimate sophistication”
- Leonardo da Vinci

Simplicity vs complexity

Complexity comes from braiding together, interleaving or conflating multiple concepts. Complect (braid together) vs. compose (place together)

Simplicity is defined as having one role - one task, one concept, one dimension. Composing simple components is the way we write robust software. A good designer splits things apart. When concepts are split apart, it becomes possible to vary them independently, use them in different contexts or to farm out their implementation.

There are reasons why you have to get off of this list. But there is NO reason why you shouldn't start with it.

Abstraction for simplicity

Analyze programs by asking who, what, when, where, why and how. The more your software components can say, "I don't know, I don't want to know", the better.

  • what: operations - specify interface and semantics, don't complect with how; how can be somebody else's problem
  • who: data or entities
  • how: implementation, connected by polymorphism constructs
  • when and where: use queues to make when and where less of a problem
  • why: policy and rules of the application

Data representation

Data is really simple. There is not a lot of variation in the essential nature of data. There are maps, there are sets, there are linear sequential things... We create hundreds of thousands of variations that have nothing to do with the essence of the stuff and that make it hard to manipulate the essence of the stuff. We should just manipulate the essence of the stuff. It's not hard, it's simpler. Same goes for communications... (44:30)

Information is simple, represent data as data, use maps and sets directly (56:30)

(From the RailsConf talk:) Subsystems must have well-defined boundaries and interfaces, and take and return data as lists and maps (aka JSON). ReSTful interfaces are popular in the context of web services. Why not do the same within our programs?

Choosing simplicity

Simplicity is a choice, but it takes work. Simplicity requires vigilance, sensibility and care. Choose simple constructs. Simplicity enables change and is the source of true agility. Simplicity = opportunity. Go make simple things.

Saturday, June 23, 2012

Composition methods compared

Clojurist, technomancer, and Leiningen creator Phil Hagelberg does a nice job of dissecting "two ways to compose a number of small programs into a coherent system". Read the original, in which three programming methods are compared. These are my notes, quoted mostly verbatim:

The Unix way

Consists of many small programs which communicate by sending text over pipes or using the occasional signal. Around this compelling simplicity and universality has grown a rich ecosystem of text-based processes with a long history of well-understood conventions. Anyone can tie into it with programs written in any language. But it's not well-suited for everything: sometimes the requirement of keeping each part of the system in its own process is too high a price to pay, and sometimes circumstances require a richer communication channel than just a stream of text.

The Emacs way

A small core written in a low-level language implements a higher-level language in which most of the rest of the program is implemented. Not only does the higher-level language ease the development of the trickier parts of the program, but it also makes it much easier to implement a good extension system since extensions are placed on even ground with the original program itself.

The core Mozilla platform is implemented mostly in a gnarly mash of C++, but applications like Firefox and Conkeror are primarily written in JavaScript, as are extensions.

Tuesday, May 08, 2012

Design Philosophies of Developer Tools

Clojure hacker Stuart Sierra wrote an insightful piece on the design philosophies of developer tools. His conclusions are paraphrased here:

Git is an elegantly designed system of many small programs, linked by the shared repository data structure.

Maven plugins, by contrast, are not designed to be composed and the APIs are underdocumented. With Maven, plugins fit into the common framework of the engine, which strikes me as maybe a more difficult proposition than many small programs working on a common data structure.

Of the anarchic situation with Ruby package management (Gems, Bundler, and RVM) Sierra says, “layers of indirection make debugging harder,” and “The speed of development comes with its own cost.”

Principles

  • Plan for integration
  • Rigorously specify the boundaries and extension points of your system

...and I really like this idea:

  • The filesystem is the universal integration point
  • Fork/exec is the universal plugin architecture

Sunday, April 29, 2012

The Scalable Adapter Design Pattern for Interoperability

When wrestling with a gnarly problem, it's interesting to compare notes with others who've faced the same dilemma. Having worked on an interoperability framework, a system called Gaggle, I had a feeling of familiarity when I came across this well-thought-out paper:

The Scalable Adapter Design Pattern: Enabling Interoperability between Educational Software Tools, Harrer, Pinkwart, McLaren, and Scheuer, IEEE Transactions on Learning Technologies, 2008

The paper describes a design pattern for getting heterogeneous software tools to interoperate with each other by exchanging data. Each application is augmented with an adapter that can interact with a shared representation. This hub-and-spokes model makes sense because it reduces the effort from writing n(n-1) adapters to connect all pairs of applications to writing one adapter per application.

Scalability refers to the composite design pattern, implementing (what I would call) a more general concept, that of hierarchically structured data. If you've ever worked with large XML documents, calling them scalable might seem like an overstatement, but I see their point. XML nicely represents small objects, like events, as well as moderately sized data documents. The same can be said of JSON.

Applications remain decoupled from the data structure, with an adapter mediating between the two. The adapter also provides events when the shared data structure is updated. A nice effect of the hierarchical data structure is that client applications can register their interest in receiving events at different levels of the tree structure.

The Scalable Adapter pattern combines well-established patterns - Adapter, Composite, and Publish-subscribe - yielding a flexible way for arbitrary application data to be exchanged at different levels of granularity.

The main difference between Scalable Adapter and Gaggle is that Gaggle focused on message passing rather than centrally stored data. The paper says, "it is critical that both the syntax and semantics of the data represented in the composite data structure...", but they don't really address what belongs in the "Basic Element" - the data to be shared. Gaggle solves this problem by explicitly defining a handful of universal data structures. Applications are free to implement their own data model, achieving interoperability by mapping (also in an adapter) their internal data structures onto the shared Gaggle data types.

The Scalable Adapter paper breaks the problem down systematically in terms of design patterns, while Gaggle was motivated by the software engineering strategies of separation of concerns and parsimony, plus the linguistic concept of semantic flexibility. It's remarkable that the two systems worked out quite similarly, given the different domains they were built for.

Friday, March 23, 2012

Applying Semantic Web Services to bioinformatics

Applying Semantic Web Services to bioinformatics: Experiences gained, lessons learnt
Phillip Lord, Sean Bechhofer, Mark D. Wilkinson, Gary Schiltz, Damian Gessler, Duncan Hull, Carole Goble, and Lincoln Stein
International Semantic Web Conference, Vol. 3298 (2004), pp. 350-364, doi:10.1007/b102467

Applying Semantic Web Services to bioinformatics is a 2004 paper on Semantic Web Services in context of bioinformatics, based on the experiences of the myGrid and BioMoby projects. The important and worthy goal behind these projects is enabling composition and interoperability of heterogeneous software. Is Semantic Web technology the answer to data integration in biology? I'm a little skeptical.

Here's a biased selection of what the paper has to say:

  • "The importance of fully automated service discovery and composition is an open question. It is unclear whether it is either possible or desirable, for all services, in this domain..."
  • "Requiring service providers and consumers to re-structure their data in a new formalism for external integration is also inappropriate."
  • "Bioinformaticians are just not structuring their data in XML schema, because it provides little value to them."
  • "All three projects have accepted that much of the data that they receive will not be structured in a standard way. The obvious corollary of this is that without restructuring, the information will be largely opaque to the service layer."

A couple of interesting asides are addressed:

  • Most services or operations can be described in terms of inputs, outputs, and configuration parameters or secondary inputs. When building a pipeline, only the main input and output need be considered, leaving parameters for later.
  • A user base divided between biologists and bioinformaticians is one difficulty noted in the paper. I've also found that tricky. Actually, the situation has changed since the article was written. Point-and-click biologists are getting to be an endangered species. The crop of biologists I see coming up these days is very computationally savvy. What I think of as the scripting-enabled biologist is a lot more common. Those not so enabled are increasingly likely to specialize in wet-lab work and do little or no data analysis.

In BioMOBY Successfully Integrates Distributed Heterogeneous Bioinformatics Web Services. The PlaNet Exemplar Case (2005), Wilkinson writes,

...interoperability in the domain of bioinformatics is, unexpectedly, largely a syntactic rather than a semantic problem. That is to say, interoperability between bioinformatics Web Services can be largely achieved simply by specifying the data structures being passed between the services (syntax) even without rich specification of what those data structures mean (semantics).

In The Life Sciences Semantic Web is Full of Creeps!, (2006) Wilkinson and co-author Benjamin M. Good write, "both sociological and technological barriers are acting to inhibit widespread adoption of SW technologies," and acknowledge the complexity and high curatorial burden.

The Semantic Web for the Life Sciences (SWLS), when realized, will dramatically improve our ability to conduct bioinformatics analyses... The ultimate goal of the SWLS is not to create many separate, non-interacting data warehouses (as we already can), but rather to create a single, ‘crawlable’ and ‘queriable’ web of biological data and knowledge... This vision is currently being delayed by the timid and partial creep of semantic technologies and standards into the resources provided by the life sciences community.

These days, Mark Wilkinson is working on SADI, which “defines an open set of best-practices and conventions, within the spectrum of existing standards, that allow for a high degree of semantic discoverability and interoperability”.

More on the Semantic Web

...looks like this old argument is still playing out.

Tuesday, March 20, 2012

Finding the right questions

A 2010 PLOS Biology article, Finding the right questions: exploratory pathway analysis to enhance biological discovery in large datasets, makes some good points about exploratory analysis of noisy biological data and the design of software to help do it.

At a time when biological data are increasingly digital and thus amenable to computationally driven statistical analysis, it is easy to lose sight of the important role of data exploration. Succinctly defined over 30 years ago by John Tukey, exploratory data analysis is an approach to data analysis that focuses on finding the right question, rather than the right answer. In contrast to confirmatory analysis, which involves testing preconceived hypotheses, exploratory data analysis involves a broad investigation, a key component of which may be visual display. Though his arguments predate personal computing and thus focus on graph paper and ink, the point still stands: good data visualization leads to simpler (better) descriptions and underlying fundamental concepts. Today, there is tremendous potential for computational biologists, bioinformaticians, and related software developers to shape and direct scientific discovery by designing data visualization tools that facilitate exploratory analysis and fuel the cycle of ideas and experiments that gets refined into well-formed hypotheses, robust analyses, and confident results.

The authors are involved in WikiPathways, an open platform for curation of biological pathways comparable to KEGG, Ingenuity Pathway Analysis, MetaCyc, and Pathway Commons, and this provides the context for their comments. But, most of their conclusions apply more generally to software for research, where the goal is to enable “researchers to take a flexible, exploratory attitude and facilitate construction of an understandable biological story from complex data.”

...instead of aiming for a single, isolated software package, developers should implement flexible solutions that can be integrated in a larger toolbox [...], in which each tool provides a different perspective on the dataset.
For developers, realizing that exploratory pathway analysis tools might be used not only in isolation but also with other software and different types of data in a flexible analysis setup might guide software design and implementation.

Effective data integration

Flexibility and interactivity are keys to effectiveness. “Determining what to integrate and how to present it to the user depends on the context and the question being asked.” Researchers often need to follow up on a weak or uncertain signal by finding confirmatory evidence in relevant orthogonal or correlated datasets. This emphasizes the importance of well curated data, whether pathways or annotated genome assemblies, which can form the scaffolding on which integration takes place.

Providing an API, in addition to a UI, opens up possibilities for scripting and automation and enables advanced users to “combine functionalities of different tools to perform novel types of analysis.” The authors note that defining general data models increases reusability and unity among software tools. This resonates with my own experience. One of the key virtues of Gaggle is its highly general data model consisting of basic data structures - list, matrix, network, table, and tuples - free of specific semantics.

Also noted are the difficulties which make current analysis tools “rather isolated and hard to combine”, specifically:

  • reformatting data
  • mapping identifiers
  • learning curve of multiple software packages

“Succinctly defined over 30 years ago by John Tukey, exploratory data analysis is an approach to data analysis that focuses on finding the right question, rather than the right answer.” We've not yet found the right answer for data integration for biology, but it's clear that “integration of annotations and data is critical to extracting the full potential from large and high-throughput datasets.” This paper contains some exceptionally clear thinking on building software tools that will help bring that about.

Citation

Kelder T, Conklin BR, Evelo CT, Pico AR (2010) Finding the Right Questions: Exploratory Pathway Analysis to Enhance Biological Discovery in Large Datasets. PLoS Biol 8(8): e1000472. doi:10.1371/journal.pbio.1000472

Thursday, April 21, 2011

Genome Browser's Anonymous

As you may know, I'm starting a support group for those afflicted with the tragedy of having written a genome browser. Mine is called the Gaggle Genome Browser. About the time I was writing it, everyone and their uncle's dog decided to write a genome browser. New instruments with new data types were coming into the lab. Computers had more memory and CPU cores than ever. It seemed like a good idea at the time.

The Broad Institute's Integrative Genomics Viewer got a write-up in the January issue of Nature Biotechnology. IGV seems particularly well developed for next-gen sequencing data, nicely displaying coverage plots and alignments of short reads, with attention to the nuances of paired-end reads.

IGV is a Java desktop app that pulls data down from a server component, the IGV Data Server. In my case, I cooked up a two level hierarchy for caching chunks of data in memory backed by SQLite. It's probably smart to add a server as a third level. IGV's multi-resolution data mode precomputes aggregations for zoomed-out views in which data is denser than the pixels in which to display it. IGV splits data into "tiles" stored in a custom indexed binary file format. "Hence a single tile at the lowest resolution, which spans the entire genome, has the same memory footprint as a tile at the very high zoom levels, which might span only a few kilobases." My GGB aggregates on the fly, which hurts performance in zoomed out views.

The IGV Data Server seems to derive a lot of its data from the UCSC Genome Browser, which maintains nicely curated data mapped to genomic coordinates for a bunch of eukaryotes and also microbes. One thing I enjoyed hacking on in GGB was integration with R. I wonder if that would be worthwhile for IGV.

Which functionality to put in the client and which in the server is debatable. We considered building a browser-based implementation, experimenting a bit with the super-cool protovis visualization library. We went with desktop. X:map is a nice counterpoint - an interactive web-based genome browser. In their approach, the Google Maps API serves up pre-rendered image tiles, keeping the big data and heavyweight computing tasks on the server. They also have an R and Java program that lets you plot custom data. JBrowse, from CSHL, does the rendering in the browser. Putting a data-intensive and graphically interactive app in the browser is still somewhere near the edge of the envelope, but browsers are improving like crazy, as are programming models for this type of development.

For what it's worth, I like the format of the IGV paper. It concisely covers motivation, what the software does, a few unique features and a couple figures showing example applications, all at a high level overview in just two pages. A supplement contains the technical detail of interest to software developers along with more example applications. I like that better than trying to awkwardly shoehorn biology and software engineering together.

Anyway... Nice work, IGV team! Let me know if you'd like to join the support group. We're here to help.

Tuesday, February 01, 2011

Annotated source code

We programmers are told that reading code is a good idea. It may be good for you, but it's hard work. Jeremy Ashkenas has come up with a simple tool that makes it easier: docco. Ashkenas is also behind underscore.js and coffeescript, a dialect of javascript in which docco is written.

Interesting ways to mix prose and code have appealed to me ever since I first discovered Mathematica's live notebook, which lets you author documents that combine executable source code, typeset text, and interactive graphics. For those who remember the early '90s chiefly for their potty training, running Mathematica on the NeXT pizza boxes was like a trip to the future. Combining the quick cycles of a read-eval-print loop with complete word processing and mathematical typesetting encourages you to keep lovely notes on your thinking and trials and errors.

Along the same lines, there's Sweave for R and Sage for Python.

Likewise, one of the great innovations of Java was Javadoc. Javadoc doesn't get nearly enough credit for the success of Java as a language. It made powerful APIs like the collections classes a snap to use and even helped navigate the byzantine complexities of Swing and AWT.

These days, automated documentation is expected for any language. Nice examples are: RubyDoc, scaladoc, Haddock (for Haskell). Doxygen works with a number of languages. Python has pydoc, but in practice seems to rely more on the library reference. Anyway, there are a bunch, and if your favorite language doesn't have one, start coding now.

The grand-daddy of these ideas is Donald Knuth's literate programming.

I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: "Literate Programming."

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.

Indeed, Ashkenas references Knuth, calling docco "quick-and-dirty, hundred-line-long, literate-programming".

This goodness needs to come to more languages. There's a Ruby port called rocco by Ryan Tomayko. And for Clojure there's marginalia.

I love the quick-and-dirty aspect, and that will be the key to encouraging programmers to do more documentation that looks like this. I hope they build docco, or something like it, into GitHub. Maybe one day there will be a Norton's anthology of annotated source code.

Vaguely related

Thursday, September 23, 2010

Getting started with CouchDB

I'm investigating using CouchDB for a data mining application. CouchDB is a schema-less document-oriented database that stores JSON documents and uses JavaScript as a query language. You write queries in the form of map-reduce. Applications connect to the database over a ReSTful HTTP API. So, Couch is a creature of the web in a lot of ways.

What I have in mind (eventually) is sharding a collection of documents between several instances of CouchDB each running on their own nodes. Then, I want to run distributed map-reduce queries over the whole collection of documents. But, I'm just a beginner, so we're going to start off with the basics. The CouchDB wiki has a ton of getting started material.

CouchDB's installation instructions cover several options for installing on Mac OS X, as well as other OSes. I used MacPorts.

sudo port selfupdate
sudo port install couchdb

Did I remember to update my port definitions the first time through? Of f-ing course not. Port tries to be helpful, but it's a little late sometimes with the warnings. Anyway, now that it's installed, let's start it up. I came across CouchDB on Mac OS 10.5 via MacPorts which tells you how to start CouchDB using Apple's launchctl.

sudo launchctl load /opt/local/Library/LaunchDaemons/org.apache.couchdb.plist
sudo launchctl start org.apache.couchdb

To verify that it's up and running, type:

curl http://localhost:5984/

...which should return something like:

{"couchdb":"Welcome","version":"1.0.1"}

Futon, the web based management tool for CouchDB can be browsed to at http://localhost:5984/_utils/.

Being a nerd, I tried to run Futon's test suite. After they failed, I found this: The tests run only(!) in a separate browser and that browser needs to be Firefox. Maybe that's been dealt with by now.

Let's create a test database and add some bogus records like these:

{
   "_id": "3f8e4c80b3e591f9f53243bfc8158abf",
   "_rev": "1-896ed7982ecffb9729a4c79eac9ef08a",
   "description": "This is a bogus description of a test document in a couchdb database.",
   "foo": true,
   "bogosity": 99.87526349
}

{
   "_id": "f02148a1a2655e0ed25e61e8cee71695",
   "_rev": "1-a34ffd2bf0ef6c5530f78ac5fbd586de",
   "foo": true,
   "bogosity": 94.162327,
   "flapdoodle": "Blither blather bonk. Blah blabber jabber jigaboo splat. Pickle plop dribble quibble."
}

{
   "_id": "9c24d1219b651bfeb044a0162857f8ab",
   "_rev": "1-5dd2f82c03f7af2ad24e726ea1c26ed4",
   "foo": false,
   "bogosity": 88.334,
   "description": "Another bogus document in CouchDB."
}

When I first looked at CouchDB, I thought Views were more or less equivalent to SQL queries. That's not really true in some ways, but I'll get to that later. For now, let's try a couple in Futon. First, we'll just use a map function, no reducer. Let's filter our docs by bogosity. We want really bogus documents.

Map Function

function(doc) {
  if (doc.bogosity > 95.0)
    emit(null, doc);
}

Now, let's throw in a reducer. This mapper emits the bogosity value for all docs. The reducer takes their sum.

Map Function

function(doc) {
  emit(null, doc.bogosity);
}

Reduce Function

function (key, values, rereduce) {
  return sum(values);
}

It's a fun little exercise to try and take the average. That's tricky because, for example, ave(ave(a,b), ave(c)) is not necessarily the same as ave(a,b,c). That's important because the reducer needs to be free to operate on subsets of the keys emitted from the mapper, then combine the values. The wiki doc Introduction to CouchDB views explains the requirements on the map and reduce functions. There's a great interactive emulator and tutorial on CouchDB and map-reduce that will get you a bit further writing views.
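The standard trick is to reduce to (sum, count) pairs, which do combine correctly, and only divide at the very end. Here's the arithmetic sketched in R rather than CouchDB's JavaScript; wiring it into a real reduce function that honors the rereduce flag is left as the exercise:

## averages don't compose, but (sum, count) pairs do
partialAvg <- function(x) c(sum=sum(x), n=length(x))
combine <- function(...) Reduce(`+`, list(...))
finalAvg <- function(p) unname(p["sum"] / p["n"])

a <- c(1, 2); b <- c(3, 4, 5)
mean(c(mean(a), mean(b)))                        # 2.75 -- not the true average
finalAvg(combine(partialAvg(a), partialAvg(b)))  # 3, same as mean(c(a, b))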

One fun fact about CouchDB's views is that they're stored in CouchDB as design documents, which are just regular JSON like everything else. This is in contrast to SQL where a query is a completely different thing from the data. (OK, yes, I've heard of stored procs.)

That's the basics. At this point, a couple questions arise:

  • How do you do parameterized queries? For example, what if I wanted to let a user specify a cut-off for bogosity at run time?
  • How do I more fully get my head around these map-reduce "queries"?
  • Can CouchDB do distributed map-reduce like Hadoop?

There's more to design documents than views. Both _show and _list functions let you transform documents. List functions use a cursor-like iterator that enables on-the-fly filtering and aggregating. Apparently, there are plans for _update and _filter functions as well. I'll have to do some more reading and hacking and leave those for later.

Links

Tuesday, July 20, 2010

How to design good APIs

A long time ago, I asked a bunch of programming gurus how to go about designing an API. Several gave answers that boiled down to the unsettling advice, "Try to get it right the first time," to which a super-guru then added, "...but you'll never get it right the first time." With that zen wisdom in mind, here's a pile of resources that may help get it slightly less wrong.

Joshua Bloch, designer of the Java collection classes and author of Effective Java, gives a Google tech-talk called How to Design a Good API & Why it Matters. Video for another version of the same talk is available on InfoQ. He starts off with the observation that, "Good programming is modular. Module boundaries are APIs."

Characteristics of a Good API

  • Easy to learn
  • Easy to use, even without documentation
  • Hard to misuse
  • Easy to read and maintain code that uses it
  • Sufficiently powerful to satisfy requirements
  • Easy to extend
  • Appropriate to audience

Michi Henning, in API Design Matters (Communications of the ACM, May 2009), observes that, "An API is a user interface. APIs should be designed from the perspective of the caller."

Much of software development is about creating abstractions, and APIs are the visible interfaces to these abstractions. Abstractions reduce complexity because they throw away irrelevant detail and retain only the information that is necessary for a particular job. Abstractions do not exist in isolation; rather, we layer abstractions on top of each other. [...] This hierarchy of abstraction layers is an immensely powerful and useful concept. Without it, software as we know it could not exist because programmers would be completely overwhelmed by complexity.

Because you'll get it wrong the first time, and just because things change, you'll have to evolve APIs. Breaking clients is unpleasant, but "Backward compatibility erodes APIs over time."

My own little bit of wisdom is this: Performance characteristics are often part of the API. Unless stated otherwise, the caller will assume that a function will complete quickly. For example, it often seems like a good idea to make remote method calls look just like local method calls. This is a bad idea, because you can't abstract away time.

Links

Monday, May 31, 2010

How many distinct paradigms of programming are there?

I learned a few styles of programming in school. At least, I heard of them somewhere in passing.

  • Procedural
  • Object-oriented
  • Functional
  • Logical

Then I came across a thread on Hacker News: Ask HN: Do you recognize this approach to programming? The answer seems to be something called functional reactive programming. So, I started wondering, "How many distinct styles or paradigms of programming are there?"

That led me to Peter Van Roy's The principal programming paradigms and Programming Paradigms for Dummies. PVR is a co-author of Concepts, Techniques, and Models of Computer Programming, which I've heard described as, "If you liked SICP, you'll like this."

Tuesday, May 11, 2010

Amazon's Dynamo distributed key-value store

Amazon's Dynamo [is] a highly available key-value storage system, [which] sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.

You typically hear two rationales for NoSQL data stores. The first is to selectively relax ACID properties to gain performance in distributed systems. The second is to support data models that fit poorly into tables with set operations, such as documents or objects. Although I'm more interested in the latter, this paper is all about the former. Even so, it's an impressive piece of software engineering.

Dynamo is a distributed key-value store, otherwise known as a distributed hash table, which is like a normal hash table except the buckets are on different nodes. It can store small objects on lots of servers and look them up by primary key. Its API is get(key, context) and put(key, context, object). The context is like a cookie: it's whatever the server wants you to remind it of in your next request. Nodes are assigned blocks of key-space using consistent hashing, ensuring that keys and/or load are evenly distributed among the nodes. One big motivation for a system like this is fault tolerance. Nodes can come and go and the system keeps working.

Redundant copies of data are kept on several nodes via replication. In RDBMS's, replication is typically done in ways that favor consistency over availability. This is one reason why RDBMS's don't scale out easily. Dynamo trades consistency for availability, especially for write operations using optimistic replication, meaning that replicas are guaranteed to converge only when the system has been quiesced for a period of time. Changes are allowed to propagate to replicas in the background, and concurrent, disconnected work is tolerated. Conflicting changes must be detected and resolved, later. You'll hear the terms lazy replication and eventual consistency thrown around with roughly the same meaning.

Dynamo resolves conflicts during reads and hands conflicting updates to the client application to resolve in an application-dependent way. Consistency is application defined anyway, right?

As a real-world analogy, writing paper checks is a distributed system which cannot guarantee 'consistency'. Conflict resolution happens in the clearing process. Also, distributed source control systems must face similar issues. How does Git handle this sort of thing?

Strategies

  • Vector clocks - detect conflicting updates (see the sketch after this list).
  • Sloppy Quorum and hinted handoff - handle failure and recovery.
  • Merkle hash trees - A tree in which parents are hashes of child nodes. They are a quick way to compute a diff, detecting divergence between nodes.
  • Gossip based protocol - distribute node membership and up/down status information.
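Here's the toy vector-clock sketch promised above, in R only because that's what I've been writing lately. The representation (a named vector of per-node counters) is my own simplification, not Dynamo's:

## a clock is a named vector of per-node update counters, e.g. c(A=2, B=1)
counter <- function(clock, node) if (node %in% names(clock)) clock[[node]] else 0

dominates <- function(a, b) {
  nodes <- union(names(a), names(b))
  all(sapply(nodes, function(n) counter(a, n) >= counter(b, n)))
}

## two versions conflict when neither clock dominates the other
conflicting <- function(a, b) !dominates(a, b) && !dominates(b, a)

v1 <- c(A=1)          # first write, coordinated by node A
v2 <- c(A=2)          # a later update of v1, also through A
v3 <- c(A=1, B=1)     # a concurrent update of v1 through node B
conflicting(v2, v1)   # FALSE: v2 supersedes v1
conflicting(v2, v3)   # TRUE: the application has to reconcile these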

Links

Sunday, December 20, 2009

Layered architecture for a Java application

This is a template for an application architecture that came out of my genome browser project. This Java/Swing desktop application aspires to be very flexible - implementing new visualizations and accommodating task-specific components when necessary. I'm not sure how well I did at this goal, so don't take this as here's-what-you-should-do, just here's-what-one-poor-schmuck-did.

The Application object holds application scope state, and binds components together, performing dependency injection where necessary. The entry point is myapp.Main, which configures Options using command-line arguments and starts the application. The application manages a concurrent event dispatch queue.

Components communicate with each other by events placed on the event queue and have access to an application API. As in Swing, component code may run on the event dispatch thread or, if it is long-running, spin up its own thread. Components are responsible for managing their own threading and synchronization.

Layering, in this template, works something like this:

+--------+
|   UI   |
+--------+
+-----------------+ +----------------+
|   Application   | |   components   |
+-----------------+ +----------------+
+--------------+
|   services   |
+--------------+
+------------+ +----------+
|   domain   | |   util   |
+------------+ +----------+

Higher layers can depend on lower layers, but dependencies may not flow up. Utilities and the domain model have no internal dependencies, and anything may depend on them. Services are things like file I/O, persistence, and maybe algorithms. The application exposes an API to components and will in turn implement that API in terms of components. This circular dependency doesn't bother me too much. If it did, the Application could have no dependencies on the components, and the components could have some organized dependency structure among themselves. The UI sits on top, where nothing should depend on it.

Wednesday, December 16, 2009

Modern Information Management in Bioinformatics

Jon Udell talks bioinformatics with Randy Julian of Indigo BioSystems on the topic of flexible data repositories.

… without buying into all the hype around semantic web and so on, you would argue that a flexible schema makes more sense in a knowledge gathering or knowledge generation context than a fixed schema does.

His contention is that fixed schemas don't work for knowledge discovery; instead, the right tools are flexible schemas and linked data. Also, it's not enough to represent experimental data in standard ways. We also need to describe the experimental design that provides the context for that data. To accomplish this, use documents annotated with RDF-style triples or XML plus (not-quite-free-text) descriptions built from a controlled vocabulary. Use virtualization to archive complete data analysis environments for reproducibility.

On the IndigoBio blog, there are a couple of posts about interoperable data that make use of R and Cytoscape. Sounds like territory familiar to my current project/nemesis, Gaggle.

The conversation then turns to the increasingly distributed nature of the drug industry and the IT challenges of strictly proscribed data sharing between highly paranoid competitors. The goal is to produce portable data assets with the ability to merge with any client's knowledge base -- mapping into the other's terms.

Related Links:

Monday, November 30, 2009

Design Patterns 15 years later

Design Patterns 15 Years Later: An Interview with Erich Gamma, Richard Helm, and Ralph Johnson was recently featured on Lambda the Ultimate.

Some say design patterns are just workarounds for the defects of C++. The paper Essential Programming Paradigm argues that design patterns occur because the programming paradigm disallows certain run-time composition of dynamic and static code. The GoF authors confirm that their design patterns fit object-oriented languages, and arise specifically from experience with C++ and Smalltalk, so they are tied to the language of implementation. "Design patterns eventually emerge for any language. Design déjà-vu is language neutral." Different design patterns may be emerging for dynamic languages or for functional languages.

They discuss the development of more design patterns beyond the 23 examples chosen for the Design Patterns book. Erich Gamma suggests some sort of collective intelligence approach for editing design patterns and rating their importance and applicability. Sounds like a good idea. Some new patterns they mention as candidates for inclusion in a revised set are: Null Object, Type Object, Dependency Injection, and Extension Object/Interface. Their new (draft) categorization of design patterns looks like this:

They seem to have dropped several, some of which I won't miss. But why axe Composite or Observer? And Bridge - maybe not the most useful in practice, but when I finally understood what they meant, I felt like I had accomplished something.

Design patterns links

Tuesday, November 03, 2009

Elements of Programming

Don't you hate it when someone reads the first chapter of a book and then decides they're qualified to review it? So do I. But based on the first chapter, Elements of Programming by Alexander Stepanov and Paul McJones is something I might want to come back to, so here's my reminder to myself.

Stepanov is the guy behind the C++ Standard Template Library (STL), an advocate of generic programming and a critic of OOP. (papers here)

The authors have done for C++ something that is more commonly done for the functional languages, which is to put the language on a formal mathematical basis. Their central insight is the parallel between algorithms and data structures and algebraic structures. Abstract algebraic structures consist of one or more sets and one or more operations on those sets.

The formalism introduced in the book is based on objects with mutable state: "computers with memory constitute the only available realization of a universal computing device". They seem to value that C++ is "faithful to the underlying machine". I wonder how this formalism will hold up to parallel computation, in contrast to the immutability and statelessness that seem popular in these increasingly parallel days. I also wonder about the comparison between this formalism and the lambda calculus.

They define a notion called regularity. A regular function is a pure, side-effect-free function (if I'm reading right). Procedures and types can also be regular. A regular type is required to implement operations for equality, assignment, destructor, default constructor, copy constructor, total ordering, and underlying type. A concept is a minimal set of requirements a data type must satisfy for an algorithm to be applicable. Algorithms can then be implemented in terms of concepts rather than concrete data types. (I think?) Concepts can have semantic requirements, like linear time complexity for a given operation, bringing efficiency into the mix.

Whether one likes this book will be strongly influenced by how well one likes generic programming in C++. And maybe whether one wants one's theory sullied by implementation details. I like the idea that generic programming can be something more than making a type checker happy -- more than the type-safe collections of Java's puny generics. And who would have thought you could extract a pretty mathematical theory from the warty ugster known as C++?

A reviewer on Amazon suggests a related Google Tech Talk "A Possible Future of Software Development" by Sean Parent.

Thursday, July 30, 2009

You can't control what you can't measure

In an opinion piece in IEEE Software, Software Engineering: An Idea Whose Time Has Come and Gone?, Tom DeMarco puts his finger on something that's been bugging me for a long time about what passes for software engineering.

I still believe it makes excellent sense to engineer software. But that isn’t exactly what software engineering has come to mean. The term encompasses a specific set of disciplines including defined process, inspections and walkthroughs, requirements engineering, traceability matrices, metrics, precise quality control, rigorous planning and tracking, and coding and documentation standards. All these strive for consistency of practice and predictability.

I'll try to paraphrase his point. Although you can't control what you can't measure, control and predictability are important only in projects of marginal value. In high-value projects, rigid cost control becomes insignificant. Maybe that's why so much real innovation takes place in hackers' garages rather than on corporate campuses.