If you want to learn Spark - and who doesn't? - sign up.
Spark is a successor to Hadoop that comes out of the AMPLab at Berkeley. It's faster for many operations because it keeps data in memory, and its programming model feels more flexible than Hadoop's rigid framework. The AMPLab provides a suite of related tools, including support for machine learning, graphs, SQL and streaming. While Hadoop is most at home with batch processing, Spark is a little better suited to interactive work.
The first class was quick and easy, covering Spark and RDDs through PySpark. No brain stretching on the order of Daphne Koller's Probabilistic Graphical Models to be found here. The lectures stuck to the "applied" aspects, but that's OK. You can always hit the papers to go deeper. The labs were fun and effective at getting you up to speed:
- Word count, the hello world of map-reduce
- Analysis of web server log files
- Entity resolution using a bag-of-words approach
- Collaborative filtering on a movie ratings database. Apparently, I should watch these: Seven Samurai, Annie Hall, Akira, Stop Making Sense, Chungking Express.
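The word-count lab boils down to a chain of three RDD operations: flatMap to split lines into words, map to pair each word with a count of 1, and reduceByKey to sum the counts per word. In PySpark that's roughly `sc.textFile(path).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(add)`. A minimal plain-Python sketch of the same dataflow (my own helper names, no Spark installation required):

```python
def flat_map(f, data):
    # Like RDD.flatMap: apply f to each element and flatten the results.
    return [y for x in data for y in f(x)]

def reduce_by_key(f, pairs):
    # Like RDD.reduceByKey: merge all values that share a key with f.
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return sorted(acc.items())

lines = ["to be or not to be"]
words = flat_map(str.split, lines)          # ['to', 'be', 'or', 'not', 'to', 'be']
pairs = [(w, 1) for w in words]             # [('to', 1), ('be', 1), ...]
counts = reduce_by_key(lambda a, b: a + b, pairs)
# counts == [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The point of doing it in Spark, of course, is that each stage is distributed across a cluster and the shuffle behind reduceByKey is handled for you.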
The second installment looks to be very cool, delving deeper into MLlib, the AMPLab's machine learning library for Spark. Its labs cover:
- Musicology: predict the release year of a song given a set of audio features
- Prediction of click-through rates
- Neuroimaging analysis of brain activity in zebrafish (which I suspect is the phrase "Just keep swimming" over and over), done in collaboration with Jeremy Freeman of the Janelia Research Campus.
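At its core, the song-year lab is linear regression fit by gradient descent: repeatedly nudge the weights against the gradient of the squared error until they converge, at which point MLlib can take over at scale. A toy single-feature version of that update rule (data and step size are my own, not the lab's):

```python
# Fit y = w * x by batch gradient descent on mean squared error.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated from the true relationship y = 2x

w, lr = 0.0, 0.05           # initial weight and learning rate
for _ in range(500):
    # Gradient of (1/n) * sum((w*x - y)^2) with respect to w.
    grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad
# w converges toward 2.0, the true slope
```

With Spark, the per-example gradient terms are computed in parallel across partitions and summed with a reduce at each iteration, which is why the algorithm maps so naturally onto RDDs.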
Echoing my own digital hoarder tendencies, the first course is liberally peppered with links, which I've dutifully culled and categorized for your clicking compulsion:
Big Data Hype
- Biology 2.0, The Economist, Jun 17th 2010
- The data deluge
- Big Data: The Management Revolution
- Data Scientist: The Sexiest Job of the 21st Century
- The Unreasonable Effectiveness of Data, Alon Halevy, Peter Norvig, and Fernando Pereira, Google
- A Very Short History Of Data Science, Fortune 2013
- The Fourth Paradigm
- MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
- Spark: Cluster Computing with Working Sets, Matei Zaharia et al, 2010
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Matei Zaharia et al, 2012
- Enterprise Data Analysis and Visualization: An Interview Study. Classifies data scientists into three archetypes: the hacker, the scripter and the application user. I have to admit, I prefer Joel Grus's What Kind of Data Scientist Are You?.
- 2009 Above the Clouds Technical Report
- Epidemiological modeling of online social network dynamics
- Structured Urban Data
- City of San Francisco open data
- Not to be outdone, City of Seattle open data
- Data Wrangler, the academic project that was the basis for the company Trifacta
- OpenRefine, source repo
- Slide deck by Ted Johnson on Data Quality and Data Cleaning
- Introduction to Probability and Statistics
- US National Institute of Standards and Technology primer on Exploratory Data Analysis
- Big Data Processing with Apache Spark – Part 1: Introduction
- Big Data Processing with Apache Spark - Part 2: Spark SQL
- Ben Fry's Keynote on Visualizing Data from VIZBI 2010. Alternate link: Keynote on Visualizing Data
The Data Science Process
In case you're still wondering what data scientists actually do, here it is according to...
- Identify problem
- Instrument data sources
- Collect data
- Prepare data (integrate, transform, clean, filter, aggregate)
- Build model
- Evaluate model
- Communicate results
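A toy end-to-end pass over those steps, from raw records to an evaluated model (the data, parsing rules, and model choice here are all my own illustration):

```python
# Collect: raw records, one per line, some malformed.
raw = ["3,8.0", "4,9.5", "5,11.0", "bad row", "6,12.5"]

# Prepare: parse and filter out records that don't have two fields.
rows = []
for line in raw:
    parts = line.split(",")
    if len(parts) == 2:
        rows.append((float(parts[0]), float(parts[1])))

# Build model: fit y = a*x + b by ordinary least squares on the clean rows.
n = len(rows)
sx = sum(x for x, _ in rows)
sy = sum(y for _, y in rows)
sxx = sum(x * x for x, _ in rows)
sxy = sum(x * y for x, y in rows)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

# Evaluate: mean squared error of the fit.
mse = sum((a * x + b - y) ** 2 for x, y in rows) / n
```

Communicating the results is left, as ever, as an exercise for the reader.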