The first section of Andrew Ng's Machine Learning class is about applying gradient descent to linear regression problems.

Our input data is an *m-by-n* matrix *X*, where we have *m* training examples with *n* features each. For these training examples, we know the expected outputs *y* where *y* is the variable we're trying to predict. We want to find a line defined by the parameter vector ϴ that minimizes the squared error between the line and our data points.
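In the course's notation, that line (a hyperplane when *n* > 1) is the hypothesis; this sketch assumes an intercept term *x_0* = 1 is prepended to each example, as the course does:

```latex
h_\theta(x) = \theta^{\mathsf{T}} x = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n
```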

Gradient descent minimizes a cost function; for linear regression, that's the squared error of the predictions against the training data. The 2 in the denominator is there so that it cancels out when we take the derivative, leaving us with a simpler gradient function.
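Written out, the cost is the course's half mean squared error over the *m* training examples:

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
```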

The update rule for each ϴ_{j} subtracts the partial derivative of the cost function with respect to ϴ_{j}, scaled by the learning rate.
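Taking that derivative of J(θ) gives the simultaneous update for every j:

```latex
\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```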

Part of the challenge is converting this to matrix notation, to take advantage of fast matrix arithmetic algorithms.
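In matrix notation, the whole sum collapses to one expression, θ := θ − (α/m)·Xᵀ(Xθ − y). A minimal sketch in NumPy (Python rather than the course's Octave; the function name `gradient_step` is mine):

```python
import numpy as np

def gradient_step(theta, X, y, alpha):
    """One vectorized gradient descent update:
    theta := theta - (alpha/m) * X^T (X theta - y)."""
    m = len(y)
    return theta - (alpha / m) * X.T @ (X @ theta - y)
```

Every ϴ_{j} is updated at once: `X @ theta - y` is the vector of prediction errors, and multiplying by `X.T` sums each error against the matching feature column.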

Next we vectorize the update rule and show how to compute the least-squares solution directly with the normal equation.
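The normal equation gives the closed-form fit, θ = (XᵀX)⁻¹Xᵀy. A sketch of my own (it solves the linear system rather than forming the explicit inverse, which is the numerically safer route):

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta,
    the closed-form least-squares fit."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```

No learning rate and no iteration, though inverting an *n-by-n* system gets expensive as the number of features grows, which is where gradient descent pays off.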

In action, gradient descent gradually approaches the optimal values for ϴ; how quickly depends on the learning rate, α.
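To see α's effect, here is a toy loop (my own sketch, not the course code) that records the cost each iteration; too small an α converges slowly, too large and the cost blows up:

```python
import numpy as np

def gradient_descent(X, y, alpha, iters):
    """Run batch gradient descent from theta = 0,
    recording the cost J(theta) before each update."""
    m, n = X.shape
    theta = np.zeros(n)
    costs = []
    for _ in range(iters):
        residual = X @ theta - y          # prediction errors
        costs.append((residual @ residual) / (2 * m))
        theta -= (alpha / m) * X.T @ residual
    return theta, costs
```

Plotting `costs` against the iteration count is the course's suggested sanity check: a healthy run shows the curve falling monotonically toward a floor.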

While the classwork was done in Octave, I also did a simple gradient descent implementation in R.