I have a MySQL database, on top of which we're trying to do a little data mining. I'm going to transpose domains, to protect the guilty. So, here's the (transposed) schema:
mysql> describe shoppingbaskets;
+-------------------+---------+
| Field             | Type    |
+-------------------+---------+
| id                | int(11) |
  ...etc...
| count_items       | int(11) |
+-------------------+---------+

mysql> describe shoppingbasket_items;
+-------------------+---------+
| Field             | Type    |
+-------------------+---------+
| item_id           | int(11) |
| shoppingbasket_id | int(11) |
+-------------------+---------+

mysql> describe items;
+-------------------+---------+
| Field             | Type    |
+-------------------+---------+
| id                | int(11) |
  ...etc...
+-------------------+---------+
There are 155,588 shopping baskets and 3,153,517 items. Baskets typically hold about 20 items.
It was kinda pokey to count the number of items in a basket on the fly, so I added the count_items column to shoppingbaskets to cache that information. To populate that column, I cooked up a little query like so:
mysql> update shoppingbaskets b
       set count_items = (select count(*)
                          from shoppingbasket_items bi
                          where bi.shoppingbasket_id = b.id);
So, I popped that in and waited... Looking at the query, it seemed to me that it would be linear in the number of shopping baskets. It is, but each basket's subquery rescans all 3.1 million rows of shoppingbasket_items, so the total work is really baskets times items.
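In hindsight, a quick EXPLAIN of the inner query would have told this story up front. Something along these lines (12345 is just a made-up basket id):

explain select count(*)
from shoppingbasket_items
where shoppingbasket_id = 12345;  -- 12345 is an arbitrary example id

With nothing to look the basket up by, this should report type: ALL with a rows estimate around 3.1 million, i.e. a full scan of the table for every single basket.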
Hmmm, how long was this going to take? I tried a few small test cases and came up with this bit of R code:
> qt = read.table('query_timing.txt', sep="\t", quote="", header=T)
> qt
    n seconds
1   1    0.57
2   2    6.71
3  10   55.20
4  11   61.76
5  20  116.72
6  50  297.83
7  80  481.17
8 100  606.04
9 118  709.70
> plot(seconds ~ n, data=qt, main="Time to count items in n baskets")
> model <- lm(seconds ~ n, data=qt)
> model

Call:
lm(formula = seconds ~ n, data = qt)

Coefficients:
(Intercept)            n
     -5.331        6.081

> abline(model, col='blue', lty=2)
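A sketch of what an n-basket test case looks like (the LIMIT form is just one way to do it, since MySQL allows LIMIT on a single-table update; any way of restricting the update to n baskets would work):

update shoppingbaskets b
set count_items = (select count(*)
                   from shoppingbasket_items bi
                   where bi.shoppingbasket_id = b.id)
limit 20;  -- n = 20 baskets for this test case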
Nicely linear, OK. But at about 6 seconds per basket, 155,588 baskets works out to roughly 946,000 seconds, which is nearly 11 days. mysqladmin shutdown! Ok, now let's start up a brain cell or two.
Duh... index
create index idx_by_shoppingbasket_id on shoppingbasket_items (shoppingbasket_id);
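Running the same EXPLAIN again:

explain select count(*)
from shoppingbasket_items
where shoppingbasket_id = 12345;  -- same arbitrary example id as before

it should now show type: ref with key: idx_by_shoppingbasket_id and a rows estimate around 20. MySQL can jump straight to one basket's items instead of rereading the whole table.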
Now filling the entire count_items column for all 155,588 rows takes 11.06 seconds, roughly an 85,000x speedup. But now that we can quickly count items on the fly, why bother caching the counts? Column dropped. Problem solved.
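For reference, counting on the fly is now just an indexed lookup, per basket or for all baskets at once (the left join keeps empty baskets in the result with a count of zero):

-- items in one basket
select count(*)
from shoppingbasket_items
where shoppingbasket_id = 12345;  -- arbitrary example id

-- items in every basket, empty baskets included
select b.id, count(bi.shoppingbasket_id) as count_items
from shoppingbaskets b
left join shoppingbasket_items bi on bi.shoppingbasket_id = b.id
group by b.id;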
Conclusion: Indexes are good. Note to self: Don't forget to think now and then, you knucklehead. This web page describes exactly my situation.