Wednesday, October 30, 2013

Building distributed systems out of crap

Pat Helland gave the opening keynote at Basho's Ricon West conference yesterday. The general topic was building distributed systems with enterprise guarantees and web scalability on crap. His argument is that enterprise-grade SLAs with lots of nines can be supported on cheap hardware using a strategy of expecting failure and recovering quickly.

Helland, who previously did time on the Bing team and at Amazon, is building a distributed data storage system for Salesforce.com. Its design involves a catalog stored in a relational DB and files stored on clusters of storage servers, a technique Helland calls blobs-by-reference.

The files are stored in fragments distributed across a cluster. There was another concept called an “extent”. I wasn't sure if that meant an aggregation of related fragments or just a bucket to dump them in.

SSDs are used as a new layer of the memory hierarchy. Helland argues for using the cheapest and crappiest available. This entails a couple of engineering tweaks. Because SSDs degrade with every operation, the software has to manage read/write cycles. To detect data corruption, each fragment is packaged with a CRC error-detecting code.

“By surrounding the data with aggressive error checking, we can be extremely confident of detecting an error and fetching the desired data from one of the other places it has been stored.”
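
Here's a minimal sketch of that idea in Go (my illustration, not Helland's actual code): a fragment carries the checksum computed when it was written, and every read re-verifies the bytes before trusting them. On a mismatch, you'd fetch a replica instead.


package main

import (
	"fmt"
	"hash/crc32"
)

// A hypothetical fragment: the stored bytes plus the CRC
// computed at write time.
type fragment struct {
	data []byte
	crc  uint32
}

func newFragment(data []byte) fragment {
	return fragment{data: data, crc: crc32.ChecksumIEEE(data)}
}

// verify recomputes the CRC on every read; a mismatch means the
// bytes rotted on the cheap SSD and another replica should be used.
func (f fragment) verify() bool {
	return crc32.ChecksumIEEE(f.data) == f.crc
}

func main() {
	f := newFragment([]byte("some fragment of a blob"))
	fmt.Println("intact:", f.verify()) // intact: true

	f.data[0] ^= 0xFF // simulate a flipped bit
	fmt.Println("intact:", f.verify()) // intact: false
}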

Helland emphasized the importance of immutable data, which goes a long way towards mitigating the inconsistency and race conditions that come with distributed computing. In the proposed storage system, fragments are immutable, which greatly reduces opportunity for the storage nodes to get out of sync with the catalog.

Aside from this talk, Ricon is loaded with good content including a talk by Jeff Dean coming up this afternoon. Send me next year!


Monday, October 14, 2013

Concurrency and Parallelism - What's the difference?

For a while, I've been coming across references to the difference between concurrency and parallelism. The definitions go something like this: Concurrency concerns "interleaved threads of execution with access to shared state" which is distinct from parallelism because "parallel operations run simultaneously".

I'm quoting from Clojure Programming by Chas Emerick, Brian Carper, and Christophe Grand, which is a perfectly good book. I've seen similar definitions elsewhere, so I don't want to pick on these guys in particular. I'm going to disagree a bit, but overall, the book is really well done and I'm enjoying it.

My beef is this: I couldn't see the utility of the distinction they're drawing. I couldn't see why you'd want to design a program differently to run as threads scheduled on a single core versus threads scheduled on several cores. In fact, treating those cases the same seems like a plus.

In contrast, there are some distinctions between types of concurrency that are useful. Knowing your code will be distributed across machines tells you to bring network latency into the picture. Likewise, only certain problems are amenable to the single-instruction-multiple-data (SIMD) model of vector processors such as GPUs. These considerations have a real impact. But, why the pedantry over concurrency versus parallelism?

I was about to write a little rant about why this distinction is useless. But, keeping an open mind, I googled around a bit and up popped a talk by Rob Pike called "Concurrency Is Not Parallelism". Change of plan. Rob Pike is a bad-ass, well known as a Unix pioneer, Bell Labs veteran and Google Distinguished Engineer. New plan: go back to school and find out why I'm wrong.

Pike's talk explains things beautifully, and not just because he's wearing an orange suit jacket and a gopher t-shirt. Here's my understanding of the take-away:

Concurrency is a more general and abstract idea than parallelism. Concurrency is about the decomposition of a problem into subtasks at the design level. If you're creating a concurrent design, you haven't said yet whether your design will be executed in parallel. Parallelism is a detail to be decided at run-time. Which brings us to the good part.

Whenever you can take two things that were previously conjoined and let them vary independently, you're making progress. The two things in question here are the design - the decomposition of a problem into concurrent parts - and the execution of those parts, perhaps in parallel. Making this separation allows programs to be expressed correctly and structured clearly while making good use of available resources whether that's one core or many.

This important point is what's missing from the definitions above. That, and they're comparing things at different levels of generality.

Next, Pike relates these ideas to Go. The language provides three concurrency primitives: goroutines, channels and select. Goroutines are like threads but much cheaper. During execution, they're mapped onto OS threads by a scheduler. Channels and select statements enable communication. Go is an implementation of concepts that have their origin in the classic paper Communicating Sequential Processes by Tony Hoare.
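
Here's a toy example of the three primitives together (mine, not Pike's). The design decomposes the work into two concurrent producers; whether they actually run in parallel is decided by the runtime scheduler and GOMAXPROCS, not by this code, which is exactly the separation described above.


package main

import (
	"fmt"
	"time"
)

// producer sends a few messages down a channel.
func producer(name string, out chan<- string) {
	for i := 0; i < 3; i++ {
		out <- fmt.Sprintf("%s: %d", name, i)
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	a := make(chan string)
	b := make(chan string)
	go producer("tick", a) // goroutines: cheap, multiplexed onto OS threads
	go producer("tock", b)

	// select consumes from whichever channel is ready first.
	for i := 0; i < 6; i++ {
		select {
		case msg := <-a:
			fmt.Println(msg)
		case msg := <-b:
			fmt.Println(msg)
		}
	}
}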

The moral of the story? Learn from the masters ...and from the gophers.

Tony Hoare's paper is on several "greats" lists.

Thursday, September 19, 2013

Special secret stuff with s3cmd

According to my coworker, "Amazon's S3 is the best thing since sliced bread." For working with S3, s3cmd is a handy little tool. Its documentation is a bit on the sparse side, but, what do you expect for free?

One gotcha with S3 is that buckets and the files in them have entirely distinct ACLs. This can lead to scenarios where the owner of a bucket can't work with the files in it. An easy way for this to come about is to log into the S3 console and create a bucket with one set of credentials, then upload files with a tool like s3cmd under another set of credentials.


cd /path/to/files/
s3cmd -v sync . s3://damnbucket.bucketowners.org

You can give permissions to the bucket owner like so:


s3cmd setacl --acl-grant=full_control:i_own_the_damn_bucket@bucketowners.org --recursive s3://damnbucket.bucketowners.org

You might also want to make the files public, so they can be served as a static website.


s3cmd setacl --acl-public --recursive s3://damnbucket.bucketowners.org

AWS Command Line Interface

I've been using s3cmd for a while, out of habit, but maybe it's time to try Amazon's AWS Command Line Interface, which just had its 1.0 release.

From a brief look, AWS CLI looks nice. You can do the same sync operation as above and make files public in one command:


aws s3 sync . s3://damnbucket.bucketowners.org --acl public-read

Amazon is very mysterious about how to specify the target of a grant of permissions, aka the grantee. I tried to give permission to the owner of a bucket, but kept getting an error. Some more examples in the docs would help! I also get "Invalid Id" for no apparent reason in the permissions section of the web UI for S3, so maybe I'm just clueless.


aws s3api put-object-acl --bucket damnbucket.bucketowners.org --grant-full-control i_own_the_damn_bucket@bucketowners.org --key genindex.html
#> A client error (InvalidArgument) occurred: Argument format not recognized.

As far as I could tell, the AWS CLI tool seems to be missing the --recursive option that we used with s3cmd. That seems like a fairly essential thing.

Also, I couldn't get the profile feature to work:


aws s3 ls s3://damnbucket.bucketowners.org --profile docgenerator
#>The config profile (docgenerator) could not be found

NOTE: Many thanks to Mitch Garnaat; I now know how the --profile switch works. Contrary to the documentation, you need a heading in your config file like this: [profile docgenerator] rather than like this: [docgenerator].
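
So a working config file (~/.aws/config by default) ends up looking something like this, with placeholder credentials and whatever region you use:


[profile docgenerator]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
region = us-east-1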

I'm glad Amazon is taking the lead in developing this tool, and I'm sure they'll keep making it better. And, there's a Github repo, so I'm guessing that means they take pull requests.

Saturday, September 07, 2013

Who will prosper in the age of smart machines?

What if Siri really worked? ...worked so well that those who mastered co-operation with a digital assistant had a serious competitive advantage over those relying on their own cognitive powers alone?

That question is considered by economist Tyler Cowen in a new book called Average is Over. Cowen previously wrote The Great Stagnation, maintains the Marginal Revolution blog, and teaches online classes at Marginal Revolution University.

"Increasingly, machines are providing not only the brawn but the brains, too, and that raises the question of where humans fit into this picture — who will prosper and who won’t in this new kind of machine economy?"

Tyler Cowen's answers

Who will gain: People who can collaborate with smart machines; lifelong learners; people with marketing skills; motivators.

"Sheer technical skill can be done by the machines, but integrating the tech side with an attention-grabbing innovation is a lot harder."

The psychological aspect is interesting. The traditional techy nerd (ahem... hello, self) has a psychology adapted to machines. But machines are gaining the capacity to interface on a level adapted to human psychology.

Who will lose: People who compete with smart machines; people who are squeamish about being tracked, evaluated and rated; the sick; bohemians; political radicals.

On being quantified

"Computing and software will make it easier to measure performance and productivity. [...] In essence everyone will suffer the fate of professional chess players, who always know when they have lost a game, have an exact numerical rating for their overall performance, and find excuses for failure hard to come by."

On hipsters

"These urban areas [he doesn't mention Portland by name] are full of people who are bright, culturally literate, Internet-savvy and far from committed to the idea of hard work directed toward earning a good middle-class living. We’ll need a new name for the group of people who have the incomes of the lower middle class and the cultural habits of the wealthy or upper middle class. They will spread a libertarian worldview that working for other people full time is an abominable way to get by."

How many will prosper

The current trend of unequal wealth distribution will only continue as technological literacy takes on new dimensions. "Big data" makes it easier to measure and grade our skills and failings. Apex skills are the ability to grab human attention, to motivate, to manage humans and machines in collaboration.

Another, not quite contrasting view comes from Northwestern University economist Robert Gordon. His paper "Is U.S. Economic Growth Over? Faltering Innovation Confronts the Six Headwinds", suggests that "the rapid progress made over the past 250 years could well turn out to be a unique episode in human history." In particular, he raises the possibility that the internet and digital and mobile technology may contribute less to productivity than previous industrial revolutions.

Technology can create winner-take-all situations, where a few capture enormous gains leaving the also-rans with little. A lot depends on the distribution of technology's benefits.


Thursday, July 11, 2013

Generate UUIDs in R

Here's a snippet of R to generate a Version 4 UUID. Dunno why there wouldn't be an official function for that in the standard libraries, but if there is, I couldn't find it.


## Version 4 UUIDs have the form:
##    xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx
##    where x is any hexadecimal digit and
##    y is one of 8, 9, A, or B
##    f47ac10b-58cc-4372-a567-0e02b2c3d479
uuid <- function(uppercase=FALSE) {

  hex_digits <- c(as.character(0:9), letters[1:6])
  hex_digits <- if (uppercase) toupper(hex_digits) else hex_digits

  y_digits <- hex_digits[9:12]

  paste(
    paste0(
      sample(hex_digits, 8, replace=TRUE),
      collapse=''),
    paste0(
      sample(hex_digits, 4, replace=TRUE),
      collapse=''),
    paste0(
      '4',
      paste0(sample(hex_digits, 3, replace=TRUE),
             collapse=''),
      collapse=''),
    paste0(
      sample(y_digits,1),
      paste0(sample(hex_digits, 3, replace=TRUE),
             collapse=''),
      collapse=''),
    paste0(
      sample(hex_digits, 12, replace=TRUE),
      collapse=''),
    sep='-')
}

View as a gist: https://gist.github.com/cbare/5979354

Note: Thanks to Carl Witthoft for pointing out that my first version was totally broken. Turns out calling sample with replace=TRUE greatly expands the possible UUIDs you might generate!

Carl also says, "In general, as I understand it, the value of UUID codes is directly dependent on the quality of the pseudo-random number generator behind them, so I’d recommend reading some R-related literature to make sure “sample” will be good enough for your purposes."

This sounds wise, but I'm not sure if I'm smart enough to follow up on it. It could be that the randomness of these UUIDs is less than ideal.

Saturday, July 06, 2013

Automate This!

The invention of the printing press by German blacksmith Johannes Gutenberg in 1439, the foundational event of the information age, is a common touchstone for technology stories, appearing in the opening chapter of both Nate Silver's The Signal and the Noise and Viktor Mayer-Schonberger and Kenneth Cukier's Big Data.

Automate This!, by Christopher Steiner, comes at current technology trends from a more mathy angle, tracing roots back to Leibniz and Gauss. Here, it's algorithms rather than data that take center stage. Data and algorithms are two sides of the same coin, really. But, it's nice to see some of the heroes of CS nerds everywhere get their due: Al Khwarizmi, Fibonacci, Pascal, the Bernoullis, Euler, George Boole, Ada Lovelace and Claude Shannon.

Automate This! is more anecdotal than Big Data, avoiding sweeping conclusions except the one announced in bold letters as the title of the last chapter: "The future belongs to the algorithms and their creators." The stories, harvested from Steiner's years as a tech journalist at Forbes, cover finance and the start-up scene, but also medicine, music and analysis of personality.

Many of the same players from Nate Silver's book or from Big Data make an appearance here as well: Moneyball baseball manager Billy Beane, game-theorist and political scientist Bruce Bueno de Mesquita, and Facebook data scientist and Cloudera founder Jeff Hammerbacher.

Finance

In the chapter on algorithmic trading we meet Hungarian-born electronic-trading pioneer Thomas Peterffy, who built financial models in software in the '80s, before it was cool, by hacking a NASDAQ terminal.

In the same chapter, I gained new respect for financial commentator Jim Cramer. In contrast to his buffoonish on-screen persona, his real-time analysis of the May 2010 flash crash was both "uncharacteristically calm" and uncannily accurate. As blue-chip stocks like JNJ dived to near zero, he made the savvy assessment, "That's not a real price. Just go and buy it!" and, as prices recovered only minutes later, "You'll never know what happened here." There's little doubt that algorithmic trading was the culprit, but the unanswered question is whether it was a bot run amok or an intentional strategy that worked a little too well. Too bad, though: if you did buy, they probably canceled the trade.

Music

Less scarily, algorithms can rate a pop song's chances of becoming a hit single. Serious composer and professor of music David Cope uses hand-coded (in Lisp) programs to compose music, pushing boundaries in automating the creative process.

Medicine

Having mastered Jeopardy, IBM's Watson is gearing up to take a crack at medical diagnostics, which is a field Hammerbacher thinks is ripe for hacking.

Psych and soc

Computers are beginning to understand people, which gives them a leg up on me, I'd have to say. Taibi Kahler developed a classification system for personality types based on patterns in language usage. Used by NASA psychiatrist Terry McGuire to compose well-balanced flight crews, the system divides personality into six bins: emotions-driven, thoughts-based, actions-driven, reflection-driven, opinions-driven, and reactions-based. If you know people more than superficially, they probably don't fit neatly into one of those categories, but some do (by which I mean they have me pretty well pegged).

At Cornell, Jon Kleinberg's research provides clues to the natural pecking order that emerges among working or social relationships - Malcolm Gladwell's influencers detected programmatically. One wonders: if corporate hierarchies were better aligned with such psychological factors, would the result be a harmonious workplace where everyone knows and occupies their right place? Or a brave new world of speciation into some technological caste system?

What next?

Perhaps surprisingly, Steiner cites the Kauffman Foundation's Financialization and Its Entrepreneurial Consequences on "the damage wrought by Wall Street" - the brain drain toward finance and away from actual productive activity. The book ends with the hopeful message that the decline of finance will set quantitative minds free to work on creative entrepreneurial projects. For the next generation, there's a plea for urgently needed improvements in quantitative education, especially at the high-school level.

Automate This! is a quick and fun read. Steiner's glasses are a bit rose-tinted at times, and his book will make you feel like a chump if you haven't made a fortune algorithmically gaming the markets or disrupting some backwards corner of the economy. As my work-mates put it, we're living proof that tech skills are a necessary but not sufficient condition.


Kahler's personality types

  • Emotions-driven: form relationships, get to know people, tense situations -> dramatic, overreactive
  • Thoughts-based: do away with pleasantries, straight to the facts. Rigid pragmatism, humorless, pedantic, controlling
  • Actions-driven: crave action, progress, always pushing, charming. Pressure -> impulsive, irrational, vengeful
  • Reflection-driven: calm and imaginative, think about what could be rather than work with what is, can dig into a new subject for hours, applying knowledge to the real world is a weakness
  • Opinions-driven: see one side, stick to their opinions in the face of proof. Persistent workers, but can be judgmental, suspicious and sensitive
  • Reactions-based: rebels, spontaneous, creative and playful. React strongly, either "I love it!" or "That sucks!" Under pressure, can be stubborn, negative, and blameful

RESTful APIs

At Clojure-West this past March, former ThoughtWorker Siva Jagadeesan spoke on how to build good web APIs using Resource Oriented Architecture (ROA) and Clojure.

The talk isn't specific to Clojure; it's really more a primer on REST APIs.

Rants about true REST versus REST-ish, REST influenced, or partially RESTful are not especially interesting. That's not what this is. It's a nicely pragmatic guide to the architectural patterns and why you might use them.

  1. The swamp of POX (plain old XML, or more likely JSON, these days): At first glance, many, myself included, take REST to mean RPC over HTTP with XML/JSON.
  2. URI: Start thinking in terms of resources.
  3. HTTP: The HTTP verbs together with URIs define a uniform interface. CRUD operations are always handled the same way. So are error conditions.
  4. Hypermedia (HATEOAS): Possible state transitions are given in the message body, allowing the logic of the application to reside (mostly? completely?) on the server side. The key here is "discoverability of actions on a resource."
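
To make that last level concrete, here's a hypothetical JSON response for an order resource. The fields, rel names and URIs are invented for illustration; the point is that the server advertises the possible next actions.


{
  "id": 42,
  "status": "placed",
  "links": [
    {"rel": "self",    "href": "/orders/42"},
    {"rel": "cancel",  "href": "/orders/42"},
    {"rel": "payment", "href": "/orders/42/payment"}
  ]
}

The client discovers what it can do next - say, cancel the order or pay for it - by reading the links, rather than hard-coding URIs and application flow.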
