Wednesday, August 11, 2010

Using R for Introductory Statistics 3.3

...continuing our way though John Verzani's Using R for introductory statistics. Previous installments: chapt1&2, chapt3.1, chapt3.2

Relationships in numeric data

If two data series have a natural pairing (x1,y1),...,(xn,yn), then we can ask, “What (if any) is the relationship between the two variables?” Scatterplots and correlation are first-line ways of assessing a bivariate data set.

Pearson's correlation

The Pearson's correlation coefficient is calculated by dividing the covariance of the two variables by the product of their standard deviations. It ranges from 1 for perfectly correlated variables to -1 for perfectly anticorrelated variables. 0 means uncorrelated. Independent variables have a correlation coefficient close to 0, but the converse is not true because the correlation coefficient detects only linear dependencies between two variables. [see wikipedia entry on correlation]

Question 3.19 concerns a sampling of 1000 New York Marathon runners. We're asked whether we expect a correlation between age and finishing time.

attach(nym.2002)
cor(age, time)
[1] 0.1898672
cor(age, time, method="spearman")
0.1119944

We discover a low correlation - good news for us wheezing old geezers. A scatterplot might show something. And we have the gender of each runner, so let's use that, too.

First, let's set ourselves up for a two panel plot.

par(mfrow=c(2,1))
par(mar=c(2,4,4,2)+0.1)

Next let's set up colors - pink for ladies, blue for guys - and throw in some transparency because a lot of data points are on top of each other.

blue = rgb(0,0,255,64, maxColorValue=255)
pink = rgb(255,192,203,128, maxColorValue=255)

color <- rep(blue, length(gender))
color[gender=='Female'] <- pink

In the first panel, draw the scatter plot.

plot(time, age, col=color, pch=19, main="NY Marathon", ylim=c(18,80), xlab="")

And in the second panel, break it down by gender. It's a well kept secret that outcol and outpch can be used to set the color and shape of the outliers in a boxplot.

par(mar=c(5,4,1,2)+0.1)
boxplot(time ~ gender, horizontal=T, col=c(pink, blue), outpch=19, outcol=c(pink, blue), xlab="finishing time (minutes)")

Now return our settings to normal for good measure.

par(mar=c(5,4,4,2)+0.1)
par(mfrow=c(1,1))

Sure enough, there doesn't seem to be much correlation between age and finishing time. Gender has an effect, although I'm sure the elite female runners would have little trouble dusting my slow booty off the trail.

It looks like we have fewer data points for women. Let's check that. We can use table to count the number of times each level of a factor occurs, or in other words, count the number of males and females.

table(gender)
gender
Female   Male 
   292    708

I'm still a little skeptical of our previous result - the low correlation between age and finishing time. Let's look at the data binned by decade.

bins <- cut(age, include.lowest=T, breaks=c(20,30,40,50,60,70,100), right=F, labels=c('20s','30s','40s','50s','60s','70+'))
boxplot(time ~ bins, col=colorRampPalette(c('green','yellow','brown'))(6), ylim=c(570,140))

It looks like you're not washed up as a runner until your 50's. Things go down hill from there, but, it doesn't look very linear, so we shouldn't be too surprised about our low r.

Coarser bins, old and young using 50 as our cut-off, reveal that there's really no correlation in the younger group. In the older group, we're starting to see some correlation. I suppose you could play with the numbers to find an optimum cut-off that maximized the difference in correlation. Not sure what the point of that would be.

y <- nym.2002[age<50,]
cor(y$age,y$time)
[1] -0.01148919
cor(y$age,y$time, method='spearman')
[1] -0.01512368
o <- nym.2002[age>=50,]
cor(o$age, o$time)
[1] 0.3813543
cor(o$age, o$time, method='spearman')
[1] 0.1980635

I ran a marathon once in my life. I think I was 30 and my time was a pokey 270 or so. My knees hurt for days afterwards, so I'm not sure I'd try it again. I do want to do a half, though. Gotta get back in shape for that...

More on Using R for Introductory Statistics

4 comments:

  1. This is neat. I recently had an article published analyzing the finishing times for the Cooper River Bridge Run, a large race in Charleston, SC. The article is here if you are interested: http://www.bepress.com/jqas/vol6/iss3/5

    ReplyDelete
  2. I've just got the book and am trying to learn R for the first time. Thanks for an informative blog post. I did the run in 201 once but that was many years ago....

    ReplyDelete
  3. @Matt That's quite a sophisticated analysis. Maybe I'll be able to read (and understand) your paper once I finish UsingR. That also looks like a fun dataset to play with. It's available here:

    http://www.bridgerun.com/results.php

    Sadly, they serve it as a pdf, so you'll have to munge it using pdf2txt or something like that. Is that what you did? Or is that data available in some more tractable form somewhere?

    ReplyDelete
  4. Like the analysis which you have done on this, it seems like a fun to solve the database.

    bioinformatics training india

    ReplyDelete