Monday, October 04, 2010

Worse than linear performance in Rails ActiveRecords with associations

The other day I was doing a task that required iterating through 142 thousand active record objects and computing simple aggregate statistics. Taking convenience over efficiency, I thought I'd just whip up a script to run in a the Ruby on Rails script runner. I usually put little print-line debugging statements in programs like that as a poor man's progress bar. I noticed that as the script ran, it seemed to get slower and slower.

My ActiveRecord abjects are linked up as shown below. The names are changed to protect the guilty, so I might be neglecting something important.

class Basket < ActiveRecord::Base
  has_many :basket_items
  has_many :items, :through => :basket_items
end

class Item < ActiveRecord::Base
  belongs_to :category
  has_many :basket_items
  has_many :baskets, :through => :basket_items
end
Basket.find(:all).each do |basket|
  puts "basket.id = #{basket.id-20000}\t#{Time.now - start_time}" if basket.id % 1000 == 0
end

This part is super fast, 'cause nothing much is done inside the loop other than printing. I don't know exactly how Rails does the conversion from DB fields to objects. That bit seems to be taking place outside the loop. Anyway, what happens if we access the associated items?

counter = 0
Basket.find(:all).each do |basket|
  puts "basket.id = #{basket.id-20000}\t#{Time.now - start_time}" if basket.id % 1000 == 0
  counter += basket.items.length
end

In the second case, we trigger the lazy-load of the list of items. Baskets average 18.7 items. The distribution of item count within the data is more or less random and, on average, flat. Now, we see the timings below.

In other words, this is an nxm operation, but m (the number of items) is more or less constant. I can't guess why this wouldn't be linear. Garbage collection? Should that level off? Maybe, we're just seeing the cost of maintaining the heap? Items are also associated with baskets. Maybe Rails is spending time fixing up that association?

The real script, only a little more complex that the one above, ran in about 30 hours. I realize this is all a little half-baked and I don't intend to chase it down further, but I'm hoping to nerd snipe someone into figuring it out. I'm probably missing a lot and leaving out too much.