I have a MySQL database, on top of which we're trying to do a little data mining. I'm going to transpose domains, to protect the guilty. So, here's the (transposed) schema:
mysql> describe shoppingbaskets;
+-------------------+---------+
| Field             | Type    |
+-------------------+---------+
| id                | int(11) |
...etc...
| count_items       | int(11) |
+-------------------+---------+

mysql> describe shoppingbasket_items;
+-------------------+---------+
| Field             | Type    |
+-------------------+---------+
| item_id           | int(11) |
| shoppingbasket_id | int(11) |
+-------------------+---------+

mysql> describe items;
+-------------------+---------+
| Field             | Type    |
+-------------------+---------+
| id                | int(11) |
...etc...
+-------------------+---------+
There are 155,588 shopping baskets and 3,153,517 items. Baskets typically hold about 20 items.
It was kinda pokey to count the number of items in a basket on the fly, so I added the count_items column to shoppingbaskets to cache that information. To populate that column, I cooked up a little query like so:
mysql> update shoppingbaskets b set count_items = (select count(*) from shoppingbasket_items bi where bi.shoppingbasket_id=b.id);
So, I popped that in and waited... Looking at the query, it seemed to me that it should be linear in the number of shopping baskets. But, I guessed, it was scanning all of the items for each basket, making the total work proportional to baskets times items rather than just baskets.
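One way to confirm that suspicion is to EXPLAIN the subquery for a single basket (the id 1 here is just a placeholder; output details vary by MySQL version, but with no index on shoppingbasket_id the type column comes back ALL, meaning a full table scan for every basket):

mysql> explain select count(*) from shoppingbasket_items bi where bi.shoppingbasket_id = 1;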
Hmmm, how long was this going to take? I tried a few small test cases and came up with this bit of R code:
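A test case for a given n can be as simple as restricting the update to the first n baskets (this sketch assumes basket ids start at 1 and are contiguous; here n = 10):

mysql> update shoppingbaskets b set count_items = (select count(*) from shoppingbasket_items bi where bi.shoppingbasket_id=b.id) where b.id <= 10;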
> qt = read.table('query_timing.txt', sep="\t", quote="", header=T)
> qt
    n seconds
1   1    0.57
2   2    6.71
3  10   55.20
4  11   61.76
5  20  116.72
6  50  297.83
7  80  481.17
8 100  606.04
9 118  709.70
> plot(seconds ~ n, data=qt, main="Time to count items in n baskets")
> model <- lm(seconds ~ n, data=qt)
> model
Call:
lm(formula = seconds ~ n, data = qt)
Coefficients:
(Intercept)            n
     -5.331        6.081
> abline(model, col='blue', lty=2)
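As a sanity check, the fitted model can do the extrapolation to the full table directly (same session as above; 86400 is just seconds per day):

> predict(model, newdata=data.frame(n=155588)) / 86400   # roughly 11 days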
Nicely linear, OK. But at about 6 seconds per basket and 155,588 baskets, we're looking at roughly 11 days. mysqladmin shutdown! OK, now let's start up a brain cell or two.
Duh... index
create index idx_by_shoppingbasket_id on shoppingbasket_items (shoppingbasket_id);
Now filling the entire count_items column for all 155,588 rows takes 11.06 seconds. But, now that we can quickly count items on the fly, why bother caching the counts? Column dropped. Problem solved.
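For completeness, here's the kind of lookup that the index makes fast (42 is an arbitrary basket id), along with the statement that drops the cache column:

mysql> select count(*) from shoppingbasket_items where shoppingbasket_id = 42;
mysql> alter table shoppingbaskets drop column count_items;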
Conclusion: Indexes are good. Note to self: Don't forget to think now and then, you knucklehead. This web page describes exactly my situation.