Tuesday, March 01, 2011

Learning data science skills

According to Hal Varian and just about everyone these days, the hot skills to have are some combination of programming, statistics, machine learning, and visualization. Here is a pile of resources that'll help you get some mad data science skills.


There seem to be a few main platforms widely used for data-intensive programming. R is a statistical environment that is to statisticians what MATLAB is to engineers. It's a weird beast, but it's open source and very powerful, and it has a great community. Python also makes a strong showing, with the help of NumPy, SciPy and matplotlib. An intriguing new entry is the combination of the Lisp dialect Clojure and Incanter. All these tools mix numerical libraries with functional and scripting programming styles in varying proportions. You'll also want to look into Hadoop, to do your big data analytics map-reduce style in the cloud.
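If "map-reduce style" is new to you, here's a toy sketch of the idea in plain Python. To be clear, this is not Hadoop's API and it runs in a single process; the function names (map_phase, shuffle, reduce_phase, word_count) are ones I made up for illustration. It just shows the shape of the computation: emit key/value pairs, group them by key, then combine each group.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (a real framework does this between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all the values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run the three phases in order over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["R is a weird beast", "Python is a strong showing"]
    print(word_count(sample))
```

The appeal of the pattern is that the map and reduce steps only ever look at one line or one key at a time, which is what lets a system like Hadoop spread them across many machines.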


  • John Verzani's Using R for Introductory Statistics, which I'm working my way through.
Machine Learning

  • Toby Segaran's Programming Collective Intelligence
  • Hastie, Tibshirani, and Friedman's The Elements of Statistical Learning: Data Mining, Inference, and Prediction
  • Bishop's Pattern Recognition and Machine Learning
  • Tom Mitchell's Machine Learning
Visualization

  • Tufte's books, especially The Visual Display of Quantitative Information
  • Processing, along with Ben Fry's book, Visualizing Data.
  • Jeffrey Heer's papers, especially Software Design Patterns for Information Visualization. Heer is one of the creators of several toolkits: Prefuse, Flare and Protovis.
  • 7 Classic Foundational Vis Papers and Seminal information visualization papers
Classes

    Starting on March 5 at the Hacker Dojo in Mountain View (CA), Mike Bowles and Patricia Hoffmann will present a course on Machine Learning where R will be the "lingua franca" for looking at homework problems, discussing them and comparing different solution approaches. The class will begin at the level of elementary probability and statistics and from that background survey a broad array of machine learning techniques including: Unsupervised Learning, Clustering Techniques, and Fault Detection.

    R courses from Statistics.com

    Feb 11:  Modeling in R (Sudha Purohit)
    Mar 4:  Introduction to R - Data Handling (Paul Murrell)
    Apr 15:  Programming in R (Hadley Wickham)
    Apr 29:  Graphics in R (Paul Murrell)
    May 20:  Introduction to R - Statistical Analysis (John Verzani)

    Data bootcamp (slides and code) from the Strata Conference. Tutorials covering a handful of example problems using R and Python.

    • Plotting data on maps
    • Classifying emails
    • A classification problem in image analysis

    Cosma Shalizi at CMU teaches a class: Undergraduate Advanced Data Analysis.

    More resources


    1. Where are the jobs?

    2. Where are the jobs? It seems like this is a fairly hot area for jobs right now. Any web startup will want a data person on the team. Google and Facebook have armies of them.

      Sadly, a lot of the work in this field revolves around marketing, because that's where the easy money is. If you want to use these skills for good, rather than evil, bioinformatics is one idea. Healthcare informatics is another; for example, search for "The Heritage Health Prize". That just scratches the surface.

    3. More on where the jobs are: http://www.readwriteweb.com/enterprise/2011/03/good-news-for-data-geeks-bad-n.php

    4. http://www.readwriteweb.com/hack/2010/12/data-science-program.php

    5. I agree: in some places there are jobs, but not yet enough where they are needed! Traditional slow-moving industries are only beginning to define the problem, with the assistance of Deloitte and Accenture, and the creation of open job reqs is step 2 in their process. In big pharma there is a huge need, and some IT informatics groups are starting to grow... Definitely check out consulting as a start.