Wednesday, August 11, 2010

Using R for Introductory Statistics 3.3

...continuing our way though John Verzani's Using R for introductory statistics. Previous installments: chapt1&2, chapt3.1, chapt3.2

Relationships in numeric data

If two data series have a natural pairing (x1,y1),...,(xn,yn), then we can ask, “What (if any) is the relationship between the two variables?” Scatterplots and correlation are first-line ways of assessing a bivariate data set.

Pearson's correlation

The Pearson's correlation coefficient is calculated by dividing the covariance of the two variables by the product of their standard deviations. It ranges from 1 for perfectly correlated variables to -1 for perfectly anticorrelated variables. 0 means uncorrelated. Independent variables have a correlation coefficient close to 0, but the converse is not true because the correlation coefficient detects only linear dependencies between two variables. [see wikipedia entry on correlation]

Question 3.19 concerns a sampling of 1000 New York Marathon runners. We're asked whether we expect a correlation between age and finishing time.

attach(nym.2002)
cor(age, time)
[1] 0.1898672
cor(age, time, method="spearman")
0.1119944

We discover a low correlation - good news for us wheezing old geezers. A scatterplot might show something. And we have the gender of each runner, so let's use that, too.

First, let's set ourselves up for a two panel plot.

par(mfrow=c(2,1))
par(mar=c(2,4,4,2)+0.1)

Next let's set up colors - pink for ladies, blue for guys - and throw in some transparency because a lot of data points are on top of each other.

blue = rgb(0,0,255,64, maxColorValue=255)
pink = rgb(255,192,203,128, maxColorValue=255)

color <- rep(blue, length(gender))
color[gender=='Female'] <- pink

In the first panel, draw the scatter plot.

plot(time, age, col=color, pch=19, main="NY Marathon", ylim=c(18,80), xlab="")

And in the second panel, break it down by gender. It's a well kept secret that outcol and outpch can be used to set the color and shape of the outliers in a boxplot.

par(mar=c(5,4,1,2)+0.1)
boxplot(time ~ gender, horizontal=T, col=c(pink, blue), outpch=19, outcol=c(pink, blue), xlab="finishing time (minutes)")

Now return our settings to normal for good measure.

par(mar=c(5,4,4,2)+0.1)
par(mfrow=c(1,1))

Sure enough, there doesn't seem to be much correlation between age and finishing time. Gender has an effect, although I'm sure the elite female runners would have little trouble dusting my slow booty off the trail.

It looks like we have fewer data points for women. Let's check that. We can use table to count the number of times each level of a factor occurs, or in other words, count the number of males and females.

table(gender)
gender
Female   Male 
   292    708

I'm still a little skeptical of our previous result - the low correlation between age and finishing time. Let's look at the data binned by decade.

bins <- cut(age, include.lowest=T, breaks=c(20,30,40,50,60,70,100), right=F, labels=c('20s','30s','40s','50s','60s','70+'))
boxplot(time ~ bins, col=colorRampPalette(c('green','yellow','brown'))(6), ylim=c(570,140))

It looks like you're not washed up as a runner until your 50's. Things go down hill from there, but, it doesn't look very linear, so we shouldn't be too surprised about our low r.

Coarser bins, old and young using 50 as our cut-off, reveal that there's really no correlation in the younger group. In the older group, we're starting to see some correlation. I suppose you could play with the numbers to find an optimum cut-off that maximized the difference in correlation. Not sure what the point of that would be.

y <- nym.2002[age<50,]
cor(y$age,y$time)
[1] -0.01148919
cor(y$age,y$time, method='spearman')
[1] -0.01512368
o <- nym.2002[age>=50,]
cor(o$age, o$time)
[1] 0.3813543
cor(o$age, o$time, method='spearman')
[1] 0.1980635

I ran a marathon once in my life. I think I was 30 and my time was a pokey 270 or so. My knees hurt for days afterwards, so I'm not sure I'd try it again. I do want to do a half, though. Gotta get back in shape for that...

More on Using R for Introductory Statistics