According to Hal Varian and just about everyone these days, the hot skills to have are some combination of programming, statistics, machine learning, and visualization. Here's a pile of resources that'll help you get some mad data science skills.
Programming
There seem to be a few main platforms widely used for data-intensive programming. R is a statistical environment that is to statisticians what MATLAB is to engineers. It's a weird beast, but it's open source and very powerful, plus it has a great community. Python also makes a strong showing, with the help of NumPy, SciPy and matplotlib. An intriguing new entry is the combination of the Lisp dialect Clojure and Incanter. All these tools mix numerical libraries with functional and scripting programming styles in varying proportions. You'll also want to look into Hadoop, to do your big data analytics map-reduce style in the cloud.
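If "map-reduce style" is new to you, here's a toy sketch of the idea in plain Python. It is not Hadoop and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs in one process. The point is only the shape of the computation: map each record to key/value pairs, group by key, then reduce each group.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, the job a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    """Run the map, shuffle, and reduce steps in sequence on a list of strings."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big ideas", "data science"])
print(counts)
```

In a real cluster the map and reduce calls would run on many machines at once, which is why each one only looks at a single document or a single key.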
Statistics
Machine Learning
Visualization
Classes
Starting on March 5 at the Hacker Dojo in Mountain View (CA), Mike Bowles and Patricia Hoffmann will present a course on Machine Learning where R will be the "lingua franca" for looking at homework problems, discussing them and comparing different solution approaches. The class will begin at the level of elementary probability and statistics and from that background survey a broad array of machine learning techniques including: Unsupervised Learning, Clustering Techniques, and Fault Detection.
Feb 11: Modeling in R (Sudha Purohit -- more details after the jump)
Mar 4: Introduction to R - Data Handling (Paul Murrell)
Apr 15: Programming in R (Hadley Wickham)
Apr 29: Graphics in R (Paul Murrell)
May 20: Introduction to R – Statistical Analysis (John Verzani)
Data bootcamp (slides and code) from the Strata Conference. Tutorials covering a handful of example problems using R and Python.
- Plotting data on maps
- Classifying emails
- A classification problem in image analysis
Cosma Shalizi at CMU teaches a class: Undergraduate Advanced Data Analysis.
More resources
- A great list of machine learning tutorials by Andrew Moore.
- There are so many classes, books and lecture videos online these days, your only limit is the rate at which you can absorb it.
- Hadley Wickham's A philosophy of clean data
- Abhishek Tiwari points us to a Quora thread: How do I become a data scientist?
- Drew Conway's Data Science Venn Diagram, which he expands on in Data science in the US intelligence community. I like Conway's emphasis on the scientific method and hypothesis testing. Drew is coming out with a book soon, Machine Learning for Hackers, that sounds promising.
- Good resources for learning about machine learning
- Machine Learning in Action
Where are the jobs?
Where are the jobs? It seems like this is a fairly hot area for jobs right now. Any web startup will want a data person on the team. Google and Facebook have armies of them.
Sadly, a lot of the work in this field revolves around marketing, because that's where the easy money is. If you want to use these skills for good, rather than evil, bioinformatics is one idea. Healthcare informatics is another; for example, search for "The Heritage Health Prize". That just scratches the surface.
More on where the jobs are: http://www.readwriteweb.com/enterprise/2011/03/good-news-for-data-geeks-bad-n.php
http://www.readwriteweb.com/hack/2010/12/data-science-program.php
http://mias.illinois.edu/DSSI2011
I agree, in some places there are jobs, but not yet enough where they are needed! Traditional slow-moving industries are only beginning to define the problem, with the assistance of Deloitte and Accenture, and the creation of open job reqs is step 2 in their process. In big pharma there is a huge need, and some IT informatics groups are starting to grow... Definitely check out consulting as a start.