Pat Helland gave the opening keynote at Ricon West, Basho's conference, yesterday. The general topic was building distributed systems with enterprise guarantees and web scalability on crap. His argument is that enterprise-grade SLAs with lots of nines can be supported on cheap hardware using a strategy of expecting failure and recovering quickly.
Helland, who previously did time on the Bing team and at Amazon, is building a distributed data storage system for Salesforce.com. Its design involves a catalog stored in a relational DB and files stored on clusters of storage servers, a technique Helland calls blobs-by-reference.
The files are stored in fragments distributed across a cluster. There was another concept called an “extent”. I wasn't sure if that meant an aggregation of related fragments or just a bucket to dump them in.
SSDs are used as a new layer of the memory hierarchy. Helland argues for using the cheapest and crappiest available. This entails a couple of engineering tweaks. Because SSDs wear out with use, the software has to manage read/write cycles. To detect data corruption, each fragment is packaged with a CRC error-detecting code.
“By surrounding the data with aggressive error checking, we can be extremely confident of detecting an error and fetching the desired data from one of the other places it has been stored.”
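To make the checksum idea concrete, here is a minimal sketch in Python. It is my own illustration, not Helland's code: the 4-byte CRC-32 prefix, the function names, and the "try replicas in order" policy are all assumptions, and a real system would likely use a stronger or differently placed checksum.

```python
import struct
import zlib


def pack_fragment(payload: bytes) -> bytes:
    """Prefix the payload with its CRC-32 (4 bytes, big-endian). Hypothetical layout."""
    return struct.pack(">I", zlib.crc32(payload)) + payload


def unpack_fragment(stored: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    if len(stored) < 4:
        raise ValueError("fragment too short to hold a checksum")
    (expected,) = struct.unpack(">I", stored[:4])
    payload = stored[4:]
    if zlib.crc32(payload) != expected:
        raise ValueError("CRC mismatch: fragment is corrupt")
    return payload


def read_with_fallback(replicas):
    """Try each stored copy in order and return the first one that passes its CRC."""
    for stored in replicas:
        try:
            return unpack_fragment(stored)
        except ValueError:
            continue  # this copy is bad, move on to the next one
    raise IOError("every replica failed its CRC check")


if __name__ == "__main__":
    good = pack_fragment(b"hello fragment")
    bad = bytearray(good)
    bad[-1] ^= 0xFF  # simulate a flipped byte on a cheap SSD
    print(read_with_fallback([bytes(bad), good]))
```

The point is that a corrupt copy is treated as just another expected failure: the reader discards it and moves on to the next place the fragment was stored.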
Helland emphasized the importance of immutable data, which goes a long way towards mitigating the inconsistency and race conditions that come with distributed computing. In the proposed storage system, fragments are immutable, which greatly reduces the opportunity for the storage nodes to get out of sync with the catalog.
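Here is a rough sketch of how immutable fragments and a catalog of references might fit together. Everything in it is an assumption on my part: the class names, the in-memory dicts standing in for the storage cluster and the relational DB, the tiny fragment size, and especially the use of a SHA-256 content hash as the fragment id, which the talk did not specify.

```python
import hashlib


class FragmentStore:
    """Write-once store: a fragment id never points at different bytes."""

    def __init__(self):
        self._fragments = {}

    def put(self, data: bytes) -> str:
        # Assumption: the id is derived from the content, so an "overwrite"
        # with different bytes is impossible by construction.
        frag_id = hashlib.sha256(data).hexdigest()
        self._fragments.setdefault(frag_id, bytes(data))
        return frag_id

    def get(self, frag_id: str) -> bytes:
        return self._fragments[frag_id]


class Catalog:
    """Stand-in for the relational catalog: blob name -> ordered fragment ids."""

    def __init__(self):
        self._entries = {}

    def set_blob(self, name: str, frag_ids) -> None:
        self._entries[name] = tuple(frag_ids)

    def fragments_of(self, name: str):
        return self._entries[name]


def write_blob(store, catalog, name, data, fragment_size=4):
    """Store the fragments first, then point the catalog at them."""
    ids = [store.put(data[i:i + fragment_size])
           for i in range(0, len(data), fragment_size)]
    catalog.set_blob(name, ids)


def read_blob(store, catalog, name):
    """Follow the catalog's references and reassemble the blob."""
    return b"".join(store.get(f) for f in catalog.fragments_of(name))


if __name__ == "__main__":
    store, catalog = FragmentStore(), Catalog()
    write_blob(store, catalog, "report", b"version one")
    write_blob(store, catalog, "report", b"version two!")
    print(read_blob(store, catalog, "report"))
```

An update never touches existing fragments. It writes new ones and swaps the catalog entry, so a reader holding the old list of fragment ids still sees a consistent (if stale) blob.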
Aside from this talk, Ricon is loaded with good content including a talk by Jeff Dean coming up this afternoon. Send me next year!
More
- The event was live streamed, so I'm guessing a video is forthcoming. (Ricon West Live)
- For now, there's a video of Helland's talk at the same event in 2012, Immutability Changes Everything.
- Helland spoke on Condos and Clouds at LinkedIn.
- See the Register's writeup of the talk. “Salesforce's data-center design: Oracle sits at the core, but after that things get ugly - and cheap!”
- Life beyond Distributed Transactions: an Apostate’s Opinion, a 2007 position paper by Helland.
- There's Just No Getting around It: You're Building a Distributed System by Mark Cavage writing for ACM Queue.
- A Distributed Systems Reading List from Dan Creswell and Distributed systems reading group at MIT