Digithead's Lab Notebook: Data analysis class

Wednesday, February 06, 2013

Data analysis class

I've been writing software to help others do data analysis for a number of years and at the same time trying to work up my nerve to try my own analysis. Why let other people have all the fun? So, when I saw that Jeffrey Leek, biostatistician at Johns Hopkins and coauthor of Simply Statistics, was teaching an online course in data analysis, I signed up.

The class starts off with an overview of the landscape of data analysis. As in the data science Venn diagram, Leek posits that data analysis sits at the intersection of hacking, statistics and domain knowledge.

What follows are my crib notes from Jeff's slides and from supplementary material. To get started in a cautious frame of mind, we get some wisdom from John Tukey:

“The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data...”

To that advice, Leek adds:

...no matter how big the data are.

A cautious data analyst pursues the question at hand with the appropriate type of analysis and will avoid going further than the available data allows.

Types of data analysis

Descriptive - Summarize and highlight, leaving generalization, interpretation and modeling for later.
Exploratory - Discover new relationships and define future studies. Findings require confirmation.
Inferential - Estimate values for a large population based on a small sample, and quantify the uncertainty.
Predictive - Use data to estimate unmeasured values. If X predicts Y, X does not necessarily cause Y, which is just another way of saying correlation does not imply causation.
Causal - Find the effect on one variable of changes in another. Randomized studies are usually required.
Mechanistic - Typically, deterministic equations are known, but the parameters must be inferred. Think physics.
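
To make the difference between the descriptive and inferential types concrete, here is a minimal sketch. It is not from Jeff's slides: the class uses R, while this is plain Python, and the simulated heights, the sample size of 200 and the normal-approximation interval are all my own assumptions for illustration.

```python
import math
import random
import statistics

# Hypothetical population: simulated heights in cm (made-up data).
random.seed(42)
population = [random.gauss(170, 10) for _ in range(10_000)]

# Descriptive: summarize the sample in hand, with no claim beyond it.
sample = random.sample(population, 200)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Inferential: estimate the population mean from the sample and quantify
# the uncertainty with an approximate 95% interval (normal approximation).
standard_error = sample_sd / math.sqrt(len(sample))
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

print(f"sample mean: {sample_mean:.1f}")
print(f"approx. 95% CI for population mean: ({ci_low:.1f}, {ci_high:.1f})")
```

The descriptive step stops at the sample mean. The inferential step is the part that makes a claim about the whole population, which is why it has to carry an interval along with it.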

On process, Leek outlines a series of steps similar to those articulated by Hadley Wickham (Engineering with data analysis) and Jeffrey Heer.

Steps in a data analysis

Define question
Define ideal data set
Determine what data you can access
Obtain data
Clean data
Exploratory analysis
Statistical prediction/modeling
Interpret results
Challenge results
Synthesize/write-up results
Create reproducible code

The class is taught in R. Early lectures cover basics like how R's type system represents continuous and categorical data. Next come basic data munging operations: binning with cut, then subset, sort, merge and reshape.
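
As a rough illustration of what these munging operations do, here is a sketch in plain Python rather than R. The records, the helper names bin_value and merge_rows, and the age breaks are all hypothetical. The binning helper imitates the default behavior of R's cut (intervals closed on the right), and the join is a simple inner join that keeps the left table's row order instead of sorting by key.

```python
from bisect import bisect_left

# Made-up records, for illustration only.
people = [
    {"id": 1, "age": 23, "city_id": 10},
    {"id": 2, "age": 47, "city_id": 20},
    {"id": 3, "age": 35, "city_id": 10},
    {"id": 4, "age": 68, "city_id": 30},
]
cities = [
    {"city_id": 10, "city": "Seattle"},
    {"city_id": 20, "city": "Baltimore"},
]


def bin_value(value, breaks, labels):
    """Binning: map a number to the label of its (low, high] interval."""
    index = bisect_left(breaks, value) - 1
    if 0 <= index < len(labels):
        return labels[index]
    return None  # outside all intervals, like NA in R


def merge_rows(left, right, key):
    """Inner join of two lists of dicts on a shared key."""
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in right_by_key.get(left_row[key], [])
    ]


breaks = [0, 30, 60, 120]
labels = ["young", "middle", "older"]

# Bin: add a categorical column derived from a continuous one.
binned = [{**p, "age_group": bin_value(p["age"], breaks, labels)} for p in people]

# Subset: keep only the rows that match a condition.
over_30 = [p for p in people if p["age"] > 30]

# Sort: order rows by a column, oldest first.
by_age_desc = sorted(people, key=lambda p: p["age"], reverse=True)

# Merge: combine two tables on a shared key (person 4 has no matching city).
merged = merge_rows(people, cities, "city_id")

for row in merged:
    print(row)
```

Reshape is left out here because it is easier to show next to the tidy data properties.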

The goal of data munging is to produce clean data: data that is amenable to analysis. Hadley Wickham's paper on tidy data defines a set of properties, closely related to database normalization, oriented towards getting data ready for further manipulation, visualization and modeling. This is part of what my colleague, Brig, calls data activation.

Properties of Tidy data

One variable per column
One observation per row
Tables hold elements of only one kind

Plus

Column names are easy to use and informative
Row names are easy to use and informative
Obvious mistakes in the data have been removed
Variable values are internally consistent
Appropriate transformed variables have been added
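
The first two properties are easiest to see with a small reshape. The sketch below is plain Python rather than the R used in class, and the table, the column names and the melt helper are all made up for illustration. It takes a "wide" table, where each year is a column, and turns it into one row per country and year.

```python
# Made-up "wide" table: one row per country, one column per year.
wide = [
    {"country": "A", "2011": 10, "2012": 12},
    {"country": "B", "2011": 20, "2012": 19},
]


def melt(rows, id_column, variable_name, value_name):
    """Turn each non-id column into its own row (one observation per row)."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy_rows


# Year becomes a variable with its own column instead of being spread
# across column headers.
tidy = melt(wide, "country", "year", "cases")

for row in tidy:
    print(row)
```

In the wide form, the year is hidden in the column names. In the tidy form it is an ordinary variable, so subsetting, sorting and merging on it work the same way as for any other column.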

Luckily for us, data is the philosophy of the day. The unreasonable effectiveness of data is widely appreciated, and there is more data than analysis talent available. There are loads of resources for helping students of data analysis grow into data scientists.

Data sources

Open government data from many sources: data.gov, France, UK, GapMinder, list of cities/states with open data, Civic Commons, many served by Seattle startup Socrata.
asdfree
Infochimps
Kaggle
Hilary Mason's research data
Stanford Large Network Data
UCI Machine Learning
KDnuggets Datasets
CMU Statlib
Gene Expression Omnibus
ArXiv Data
Spambase data set from the UC Irvine Machine Learning Repository

API's

Resources

4 comments:

Unknown, 2/07/2013 12:23 PM
Is publication of the class notes a copyright violation? I am not sure Coursera or Johns Hopkins would appreciate it without their permission.

Joyce Faler, 2/07/2013 4:59 PM
I'm always amazed where my twitter feed lands me - in this case leading me to someone else taking the Data Analysis class, who happens to have an interesting blog with cool insights into various types of programming languages. I'll be checking out your other posts as I have time. Cheers, Joyce

Unknown, 2/07/2013 6:53 PM
This would definitely fall under fair use and is properly attributed. Leek himself is a big blogger and open science advocate. Check out Simply Statistics.

I really wanted to take this class, even if it is taught in R. I'm hoping he'll offer it again when I don't have any classes I am taking for credit.

Georgii Kalnytskyi, 2/10/2013 1:24 PM
Hello, nice to see a fellow student :)
