Tuesday, December 06, 2011

K-means

The topics in this week's programming exercise in the machine learning class are K-means and PCA.

K-means is a fairly easily understood clustering algorithm. Once you specify K, the number of clusters, and pick some random initial centroids, it's just two steps. First, assign each data point to a cluster according to it's nearest centroid. Next, recompute the centroids based on the average of the data points in each cluster. Repeat until convergence. It's that simple.

Vectorizing the two steps is a little tricky. Let's take a look at what we're trying to accomplish in step 1. If we had an m by k matrix with the distances where each element i,j was the distance from data point i to centroid j, we could take the min of each row. Actually, we want the index of the min of each row. That would give the assignments for all m data points in one shot.

For example, say we have just 3 data points (m=3) with 2 features each and 2 centroids, (a,b) and (c,d). How do you get to a distance matrix given X and the centroids?

What I came up with can surely be improved upon, but here it is anyway. I loop over each cluster k, finding the distance between that cluster's centroid and all data points.

The cryptic bsxfun may sound like something republicans wouldn't approve of, but really, it's a bit like the apply functions in R. It can work several ways, but in this case takes a function, a matrix X, m by n, and the n by 1 vector of the kth centroid. It applies the function to each row in X along with the centroid vector. The result is an m by n matrix whose ith row is the difference between the ith example and the kth centroid. We square that matrix, element-wise. Then, sum all the rows to compute the vector of m squared distances. After we've filled in our distances for all the centroids, we take the min, row-wise, returning the index of the nearest centroid.

function idx = findClosestCentroids(X, centroids)
  K = size(centroids, 1);
  m = size(X,1)

  idx = zeros(m, 1);
  dist = zeros(size(X,1), K);

  for k = 1:K
    dist(:,k) = sum(bsxfun(@(x,mu_k) x-mu_k, X, centroids(k,:)) .^ 2, 2);
  end

  [min_dist, idx] = min(dist, [], 2);
end

There must be a cleaner way to do that. If we looped over m rather than k, we could compute mins one row at a time and never need the whole dist matrix in memory. Maybe there's some magic linear algebra that could efficiently do the whole thing. Anyone wanna clue me in on that?

Luckily, the next step is easier. To recompute the centroids, we're finding the average of each cluster. Again, I used a loop over the k clusters. We grab the subset of data points belonging to each cluster using find.

for k = 1:K
  example_idx = find(idx==k);
  centroids(k,:) = sum(X(example_idx,:),1) / size(example_idx,1);
end

With those two steps in place, we're clustering away. One puzzle that comes with this particular algorithm is how to choose K. According to Andrew Ng, The elbow method can be problematic because a clear elbow may not present itself. Letting downstream requirements dictate the choice of K seems to be better.

It's getting pretty clear that the trick to most of these programming exercises is vectorization. The combination of functional programming and vectorized operations is very powerful, but definitely comes at the cost of some brain-strain, at least if you have as few working brain cells as I've got left.