Wednesday, October 30, 2013

Building distributed systems out of crap

Pat Helland gave the opening keynote at Basho's Ricon West conference yesterday. The general topic was building distributed systems with enterprise guarantees and web scalability out of crap. His argument is that enterprise-grade SLAs with lots of nines can be supported on cheap hardware using a strategy of expecting failure and recovering quickly.

Helland, who previously did time on the Bing team and at Amazon, is building a distributed data storage system for Salesforce.com. Its design involves a catalog stored in a relational database and files stored on clusters of storage servers, a technique Helland calls blobs-by-reference.
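As a rough sketch of how blobs-by-reference might look (my own illustration, not Helland's actual schema), the catalog row stays small and relational while the blob's bytes live elsewhere:

```go
// Hypothetical sketch of blobs-by-reference; the names and fields here
// are illustrative guesses, not Salesforce's schema. The relational
// catalog holds small metadata rows, each pointing at fragments that
// live on separate storage servers.
package catalog

type FragmentRef struct {
	Node       string // storage server holding the fragment
	FragmentID string // opaque identifier on that node
}

type CatalogEntry struct {
	BlobID    string        // primary key in the relational catalog
	Size      int64         // total blob size in bytes
	Fragments []FragmentRef // the blob's data is stored by reference
}
```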

The files are stored in fragments distributed across a cluster. There was another concept called an “extent”. I wasn't sure if that meant an aggregation of related fragments or just a bucket to dump them in.

SSDs are used as a new layer of the memory hierarchy. Helland argues for using the cheapest and crappiest available. This entails a couple of engineering tweaks. Because SSD cells wear out as they're written, the software has to manage write cycles. To detect data corruption, each fragment is packaged with a CRC error-detecting code.
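Here's a minimal sketch of the idea in Go, using the standard library's CRC-32 (the talk didn't say which CRC the system actually uses): compute a checksum when a fragment is written, verify it on every read, and fall back to another replica on a mismatch.

```go
package fragment

import "hash/crc32"

// Checksum computes the CRC stored alongside a fragment at write time.
func Checksum(data []byte) uint32 {
	return crc32.ChecksumIEEE(data)
}

// Verify reports whether a fragment read back from the SSD still matches
// its stored CRC; on a mismatch the caller would fetch the fragment from
// another replica instead of trusting the corrupted copy.
func Verify(data []byte, stored uint32) bool {
	return crc32.ChecksumIEEE(data) == stored
}
```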

“By surrounding the data with aggressive error checking, we can be extremely confident of detecting an error and fetching the desired data from one of the other places it has been stored.”

Helland emphasized the importance of immutable data, which goes a long way towards mitigating the inconsistency and race conditions that come with distributed computing. In the proposed storage system, fragments are immutable, which greatly reduces the opportunity for the storage nodes to get out of sync with the catalog.
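To see why immutability helps, here's a toy example of my own (not the actual system): a write-once fragment store. Because a fragment ID can never be rewritten, any reference the catalog hands out can never observe a change.

```go
package store

import (
	"fmt"
	"sync"
)

// Store is a toy write-once fragment store: a fragment ID may be written
// exactly once, so catalog references to it never go stale.
type Store struct {
	mu        sync.Mutex
	fragments map[string][]byte
}

func New() *Store {
	return &Store{fragments: make(map[string][]byte)}
}

// Put stores a fragment, rejecting any attempt to overwrite an existing one.
func (s *Store) Put(id string, data []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.fragments[id]; exists {
		return fmt.Errorf("fragment %s is immutable and already written", id)
	}
	s.fragments[id] = append([]byte(nil), data...) // defensive copy
	return nil
}
```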

Aside from this talk, Ricon is loaded with good content, including a talk by Jeff Dean coming up this afternoon. Send me next year!


Monday, October 14, 2013

Concurrency and Parallelism - What's the difference?

For a while, I've been coming across references to the difference between concurrency and parallelism. The definitions go something like this: Concurrency concerns "interleaved threads of execution with access to shared state" which is distinct from parallelism because "parallel operations run simultaneously".

I'm quoting from "Clojure Programming" by Chas Emerick, Brian Carper and Christophe Grand, which is a perfectly good book. I've seen similar definitions elsewhere, so I don't want to pick on these guys in particular. I'm going to disagree a bit, but overall, the book is really well done and I'm enjoying it.

My beef is this: I couldn't see the utility of the distinction they're drawing. I couldn't see why you'd want to design a program differently to run as threads scheduled on a single core versus threads scheduled on several cores. In fact, treating those cases the same seems like a plus.

In contrast, there are some distinctions between types of concurrency that are useful. Knowing your code will be distributed across machines tells you to bring network latency into the picture. Likewise, only certain problems are amenable to the single-instruction-multiple-data (SIMD) model of vector processors such as GPUs. These considerations have a real impact. But, why the pedantry over concurrency versus parallelism?

I was about to write a little rant about why this distinction is useless. But, keeping an open mind, I googled around a bit and up popped a talk by Rob Pike called "Concurrency Is Not Parallelism". Change of plan. Rob Pike is a bad-ass, well known as a Unix pioneer, Bell Labs veteran and Google Distinguished Engineer. New plan: go back to school and find out why I'm wrong.

Pike's talk explains things beautifully, and not just because he's wearing an orange suit jacket and a gopher t-shirt. Here's my understanding of the take-away:

Concurrency is a more general and abstract idea than parallelism. Concurrency is about the decomposition of a problem into subtasks at the design level. If you're creating a concurrent design, you haven't said yet whether your design will be executed in parallel. Parallelism is a detail to be decided at run-time. Which brings us to the good part.

Whenever you can take two things that were previously conjoined and let them vary independently, you're making progress. The two things in question here are the design - the decomposition of a problem into concurrent parts - and the execution of those parts, perhaps in parallel. Making this separation allows programs to be expressed correctly and structured clearly while making good use of available resources whether that's one core or many.
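Here's a small Go sketch of that separation (my example, not Pike's): the concurrent decomposition into goroutines stays exactly the same, and whether they actually run in parallel is just a setting decided at run time.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// The concurrent design: decompose the work into independent tasks.
	// Parallelism is decided here, at run time. With 1, the goroutines
	// are interleaved on a single core; with runtime.NumCPU(), they may
	// run simultaneously. The design doesn't change either way.
	runtime.GOMAXPROCS(1)

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			fmt.Printf("task %d done\n", n)
		}(i)
	}
	wg.Wait()
}
```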

This important point is what's missing from the definitions above. That, and they're comparing things at different levels of generality.

Next, Pike relates these ideas to Go. The language provides three concurrency primitives: goroutines, channels and select. Goroutines are like threads but much cheaper. During execution, they're mapped onto OS threads by a scheduler. Channels and select statements enable communication. Go is an implementation of concepts that have their origin in the classic paper Communicating Sequential Processes by Tony Hoare.
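A quick toy of my own (not from the talk) showing the three primitives together: two goroutines do work concurrently, channels carry their results, and select handles whichever result arrives first.

```go
package main

import (
	"fmt"
	"time"
)

// worker is a goroutine body: it does some work and reports on a channel.
func worker(name string, delay time.Duration, out chan<- string) {
	time.Sleep(delay) // stand-in for real work
	out <- name + " finished"
}

func main() {
	a := make(chan string)
	b := make(chan string)

	// Goroutines: like threads, but cheap enough to launch freely.
	go worker("one", 20*time.Millisecond, a)
	go worker("two", 10*time.Millisecond, b)

	// select waits on multiple channels and proceeds with whichever
	// is ready, in whatever order the results happen to arrive.
	for i := 0; i < 2; i++ {
		select {
		case msg := <-a:
			fmt.Println(msg)
		case msg := <-b:
			fmt.Println(msg)
		}
	}
}
```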

The moral of the story? Learn from the masters ...and from the gophers.


Tony Hoare's paper is on several "greats" lists including these: