## Friday, May 21, 2010

### Using R for Introductory Statistics, 3.1

#### Pairs of categorical data

The grades data.frame holds two columns of letter grades, giving pairs of categorical data, like so:

```
    prev grade
1    B+    B+
2    A-    A-
3    B+    A-
...
122  B     B
```

This type of data can be summarized with the table function, which counts the occurrences of each possible pair of letter grades. But first, I was never a fan of plus-minus grading, so let's do away with that.

```
> grades2 <- data.frame(
+   prev  = factor(gsub("[+]|-| ", "", as.character(grades$prev)),  levels=c('A','B','C','D','F')),
+   grade = factor(gsub("[+]|-| ", "", as.character(grades$grade)), levels=c('A','B','C','D','F'))
+ )
```
```
> table(grades2)
    grade
prev  A  B  C  D  F
   A 22  6  3  2  0
   B  4 15  5  1  3
   C  3  2  9  9  7
   D  0  1  4  3  1
   F  1  2  4  4 11
```
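To try the counting step without the grades data at hand, here is a minimal sketch on a tiny data frame. Everything in it is made up for illustration: the name `toy`, the three levels, and the six rows are assumptions, not part of the grades data.

```r
# Hypothetical toy data: six students, two letter grades each (invented values)
lv <- c("A", "B", "C")
toy <- data.frame(
  prev  = factor(c("A", "A", "B", "B", "C", "A"), levels = lv),
  grade = factor(c("A", "B", "B", "B", "C", "A"), levels = lv)
)

# table() on a two-column data.frame cross-tabulates the pairs:
# rows are the levels of prev, columns are the levels of grade
tab <- table(toy)
print(tab)
```

Because both columns are factors with explicit levels, pairs that never occur still appear in the table as zero counts.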

You might want to compute row (1) or column (2) sums, using margin.table:

```
> margin.table(table(grades2), 1)
prev
 A  B  C  D  F
33 28 30  9 22
```
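As a sketch of how the margin argument behaves, the block below types the counts from the printed table into a matrix and computes both margins. The object names (`counts`, `row_totals`, `col_totals`) are illustrative, and the matrix is only assumed to match the real data because it was copied from the output shown earlier.

```r
# Counts copied by hand from the printed table (rows = prev, columns = grade)
lv <- c("A", "B", "C", "D", "F")
counts <- matrix(
  c(22,  6, 3, 2,  0,
     4, 15, 5, 1,  3,
     3,  2, 9, 9,  7,
     0,  1, 4, 3,  1,
     1,  2, 4, 4, 11),
  nrow = 5, byrow = TRUE,
  dimnames = list(prev = lv, grade = lv)
)
tab <- as.table(counts)

row_totals <- margin.table(tab, 1)  # one total per previous grade (sums across each row)
col_totals <- margin.table(tab, 2)  # one total per current grade (sums down each column)

print(row_totals)
print(col_totals)
```

Both sets of totals add up to the number of students, which makes for a quick consistency check.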

Of the students who got an A on the first test, what proportion also got an A on the second test? Those types of questions are answered by prop.table().

```
> options(digits=1)
> prop.table(table(grades2), 1)
    grade
prev    A    B    C    D    F
   A 0.67 0.18 0.09 0.06 0.00
   B 0.14 0.54 0.18 0.04 0.11
   C 0.10 0.07 0.30 0.30 0.23
   D 0.00 0.11 0.44 0.33 0.11
   F 0.05 0.09 0.18 0.18 0.50
> options(digits=4)
```
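The margin argument of prop.table follows the same convention as margin.table: 1 normalizes within rows, 2 within columns, and leaving it out normalizes over the whole table. A minimal sketch on a made-up 2x2 table (the counts and object names are invented for illustration):

```r
# Hypothetical 2x2 table of counts, invented for illustration
tab <- as.table(matrix(
  c(3, 1,
    2, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(prev = c("A", "B"), grade = c("A", "B"))
))

by_row  <- prop.table(tab, 1)  # each row sums to 1: share of grade given prev
by_col  <- prop.table(tab, 2)  # each column sums to 1: share of prev given grade
overall <- prop.table(tab)     # all cells together sum to 1

print(by_row)
```

For the question asked above (of those with an A on the first test, what share got an A on the second), the row-wise version is the one to read.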

Finally, this type of data can be displayed as a stacked barplot. The example below switches to the florida data, plotting 2000 election results by county.

```
m <- t(as.matrix(florida[,2:3]))
m.prop <- prop.table(m, margin=2)
colnames(m.prop) <- florida$County

# fool around with margins and set style of axis labels
# mar=c(bottom, left, top, right)
# las=2 => always perpendicular to the axis
old <- par(mar=c(6,4,6,2)+0.1, las=2)

# cex.names => "character expansion" of bar labels
# args.legend => position the legend out of the plot area
barplot(m.prop[,order(m.prop[2,])], legend.text=TRUE, cex.names=0.40, args.legend=list(x=82,y=1.2), main="2000 Election results in Florida", sub='county')

# reset old parameters
par(old)
```

1. Hi Chris,
as a newbie using R I have found the post very informative. The only problem is that when I tried to replicate the results for the grades2 data frame, I got slightly different results:
prev A B C D F
A 22 4 3 2 0
B 4 12 4 1 3
C 3 2 6 9 7
D 0 1 4 3 1
F 1 2 3 4 11

prev
A B C D F
31 24 27 9 21

Just to make sure that I entered the code correctly, would it be possible for you to clarify the different type of brackets surrounding the positive and negative signs? "[+]|-|"
I understand "[]" in the context of matrices and | as the "OR" operator but I'm not familiar with this construct.
Ruben

2. In a regular expression, both + and - (the plus and minus characters) have special meanings. Plus means one-or-more of the preceding item. Minus is used to create ranges inside character classes.

The pipe character, |, signals alternation, which is a fancy way of saying or.

So, gsub("a|b|c", "flapdoodle", x) would substitute any a, b, or c character in x with flapdoodle, if for some nutty reason we wanted to do that.

But gsub("+|-| ", "", x) doesn't do likewise. It interprets the plus as a special character rather than a literal +, although I'm not sure what a + means with nothing preceding it.

So, I put the + inside square brackets to make it a character class. In that context, a + is just a +.
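To see the pattern in action, here is a small sketch on a made-up character vector. The vector `x` and the result names are invented for the example; the two alternative patterns are equivalents I'm adding for comparison, not something used in the post.

```r
# Made-up grades, including a trailing space and a plain letter
x <- c("B+", "A-", "C ", "B")

# The pattern from the post: a literal plus (inside a character class),
# or a minus, or a space
from_post <- gsub("[+]|-| ", "", x)

# Same idea as one character class; the minus goes first so it is
# read literally rather than as a range operator
one_class <- gsub("[-+ ]", "", x)

# Escaping the plus also works; the backslash is doubled inside an R string
escaped <- gsub("\\+|-| ", "", x)

print(from_post)
```

All three calls strip the same characters, so which one to use is mostly a matter of taste.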

I'm not sure that accounts for our differing numbers. In your original grades data.frame, do you start with 122 students (rows)? You can check that we haven't lost any students like this:

 122 2

 122 2

 122
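One way such a check might look, sketched on made-up stand-in data: compare the dimensions before and after recoding, then confirm that the table total still equals the number of rows. The helper `same_size` and the three-row data frames `before` and `after` are hypothetical; with the real data they would be grades and grades2.

```r
# Hypothetical helper: TRUE when recoding kept every row and no grade
# turned into NA (table() leaves NA values out of its counts)
same_size <- function(before, after) {
  nrow(before) == nrow(after) && sum(table(after)) == nrow(before)
}

# Made-up stand-ins for grades and grades2
lv <- c("A", "B", "C", "D", "F")
before <- data.frame(prev = c("B+", "A-", "C"), grade = c("B", "A-", "C+"))
after <- data.frame(
  prev  = factor(gsub("[+]|-| ", "", as.character(before$prev)),  levels = lv),
  grade = factor(gsub("[+]|-| ", "", as.character(before$grade)), levels = lv)
)

dim(before)         # rows and columns before recoding
dim(after)          # should be unchanged
sum(table(after))   # should equal the number of rows
same_size(before, after)
```

A table total smaller than the row count would suggest that some grade string did not match any level and was silently dropped as NA.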

I hope that helps. Thanks for the comment!

3. Chris,
thanks a lot for the explanation. Now I can completely understand the code.
The reason why I got different numbers from you is because I mistyped the command. Copying directly from the post resulted in the same output as yours.
Please keep on posting, it's really useful!
Regards,
Ruben