Tuesday, March 01, 2011

Learning data science skills

According to Hal Varian and just about everyone these days, the hot skills to have are some combination of programming, statistics, machine learning, and visualization. Here's a pile of resources that'll help you get some mad data science skills.

Programming

There seem to be a few main platforms widely used for data-intensive programming. R is a statistical environment that is to statisticians what MATLAB is to engineers. It's a weird beast, but it's open source and very powerful, and it has a great community. Python also makes a strong showing, with the help of NumPy, SciPy, and matplotlib. An intriguing new entry is the combination of the Lisp dialect Clojure and Incanter. All these tools mix numerical libraries with functional and scripting programming styles in varying proportions. You'll also want to look into Hadoop to do your big data analytics map-reduce style in the cloud.
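To give a feel for the map-reduce style mentioned above, here's a minimal sketch in plain Python, with no Hadoop involved. The function names (map_phase, reduce_phase, word_count) and the sample text are made up for illustration; a real Hadoop job would distribute these same two steps across many machines rather than run them in one process.

```python
from collections import Counter
from functools import reduce


def map_phase(line):
    """Map step: turn one line of text into partial word counts."""
    return Counter(line.lower().split())


def reduce_phase(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(lines):
    """Apply the map step to every line, then fold the results together."""
    return reduce(reduce_phase, map(map_phase, lines), Counter())


# Hypothetical input, just to have something to count.
sample_lines = [
    "R is a statistical environment",
    "Python is a scripting language",
]

counts = word_count(sample_lines)
print(counts.most_common(2))
```

The point is the shape of the computation: each line is processed independently in the map step, and the reduce step only needs to know how to combine two partial results. That independence is what lets the pattern scale out.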

Statistics

  • John Verzani's Using R for Introductory Statistics, which I'm working my way through.

Machine Learning

  • Toby Segaran's Programming Collective Intelligence
  • Hastie, Tibshirani, and Friedman's The Elements of Statistical Learning: Data Mining, Inference, and Prediction
  • Bishop's Pattern Recognition and Machine Learning
  • Machine Learning, Tom Mitchell

Visualization

  • Tufte's books, especially The Visual Display of Quantitative Information
  • Processing, along with Ben Fry's book, Visualizing Data.
  • Jeffrey Heer's papers, especially Software Design Patterns for Information Visualization. Heer is one of the creators of several toolkits: Prefuse, Flare and Protovis.
  • 7 Classic Foundational Vis Papers and Seminal information visualization papers

Classes

    Starting on March 5 at the Hacker Dojo in Mountain View, CA, Mike Bowles and Patricia Hoffmann will present a course on machine learning in which R will be the "lingua franca" for working through homework problems, discussing them, and comparing different solution approaches. The class will begin at the level of elementary probability and statistics and, from that background, survey a broad array of machine learning techniques, including unsupervised learning, clustering techniques, and fault detection.

    R courses from Statistics.com

    • Feb 11: Modeling in R (Sudha Purohit)
    • Mar 4: Introduction to R - Data Handling (Paul Murrell)
    • Apr 15: Programming in R (Hadley Wickham)
    • Apr 29: Graphics in R (Paul Murrell)
    • May 20: Introduction to R - Statistical Analysis (John Verzani)

    Data bootcamp (slides and code) from the Strata Conference: tutorials covering a handful of example problems using R and Python:

    • Plotting data on maps
    • Classifying emails
    • A classification problem in image analysis

    Cosma Shalizi at CMU teaches a class: Undergraduate Advanced Data Analysis.

    More resources