Friday, March 18, 2011

Indexes are good

I have a MySQL database, on top of which we're trying to do a little data mining. I'm going to transpose domains, to protect the guilty. So, here's the (transposed) schema:

mysql> describe shoppingbaskets;
+-------------------+---------+
| Field             | Type    |
+-------------------+---------+
| id                | int(11) |
  ...etc...                    
| count_items       | int(11) |
+-------------------+---------+

mysql> describe shoppingbasket_items;
+-------------------+---------+
| Field             | Type    |
+-------------------+---------+
| item_id           | int(11) |
| shoppingbasket_id | int(11) |
+-------------------+---------+

mysql> describe items;
+-------------------+---------+
| Field             | Type    |
+-------------------+---------+
| id                | int(11) |
  ...etc...
+-------------------+---------+

There are 155,588 shopping baskets and 3,153,517 items. Baskets typically hold about 20 items.

It was kinda pokey to count the number of items in a basket on the fly, so I added the count_items column to shoppingbaskets to cache that information. To populate that column, I cooked up a little query like so:

mysql> update shoppingbaskets b set count_items = (select count(*) from shoppingbasket_items bi where bi.shoppingbasket_id=b.id);

So, I popped that in and waited... Looking at the query, it seems to me that it would be linear in the number of shopping baskets. But, I guess it's scanning all of the items for each basket.

Hmmm, how long was this going to take? I tried a few small test cases and came up with this bit of R code:

> qt = read.table('query_timing.txt', sep="\t", quote="", header=T)
> qt
    n seconds
1   1    0.57
2   2    6.71
3  10   55.20
4  11   61.76
5  20  116.72
6  50  297.83
7  80  481.17
8 100  606.04
9 118  709.70
> plot(seconds ~ n, data=qt, main="Time to count items in n baskets")
> model <- lm(seconds ~ n, data=qt)
> model

Call:
lm(formula = seconds ~ n, data = qt)

Coefficients:
(Intercept)            n  
     -5.331        6.081  

> abline(model, col='blue', lty=2)

Nicely linear, OK. But at 6 seconds per basket and 155,588 baskets, we're looking at 10 days. mysqladmin shutdown! Ok, now let's start up a brain cell or two.

Duh... index

create index idx_by_shoppingbasket_id on shoppingbasket_items (shoppingbasket_id);

Now filling the entire count_items column for all 155,588 rows takes 11.06 seconds. But, now that we can quickly count items on the fly, why bother caching the counts? Column dropped. Problem solved.

Conclusion: Indexes are good. Note to self: Don't forget to think now and then, you knucklehead. This web page describes exactly my situation.