Exercise 4 in Andrew Ng's Machine Learning class is on neural networks. Back in the cretaceous period, in 1994 or so, in on of the periodic phases of popularity for neural networks, I hacked up some neural network code in Borland Turbo C++ on a screaming 90MHz Pentium. That code is probably on a 3 1/2 inch floppy in the bottom of a box somewhere in my basement. Back then, I remember being fascinated by the idea that we could get a computer to learn by example.
Neural networks go in and out of style like miniskirts. [cue lecherous graphic.] Some people scoff. Neural networks are just a special cases of Bayesian networks, they say, and SVMs have more rigorous theory. Fine. But, it's a bit like seeing an old friend to find that neural networks are respectable again.
To see how neural networks work, let's look at how information flows through them, first in the forward direction.
An input example Xt becomes the activation at the first layer, the n x 1 vector a1. Prepending a bias node whose value is always 1, we then multiply by the weight matrix, Theta1. This returns z2, whose rows are the sum of the products of the input neurons with their respective weights. We pass those sums through the sigmoid function to get the activations of the next layer. Repeating the same procedure for the output layer gives the outputs of the neural network.
Implementing this in Octave, I think, could be fully vectorized to compute all training examples at once, but I did this in a loop over t which indexes a single training example, like this:
a1 = X(t,:)'; z2 = Theta1 * [1; a1]; a2 = sigmoid(z2); z3 = Theta2 * [1; a2]; a3 = sigmoid(z3);
As in the previous cases of linear and logistic regression, neural networks have a cost function to be minimized by moving in short steps along a gradient.
The gradients are computed by back propagation, which pushes the error backwards through the hidden layers. It was the publication of this algorithm in Nature in 1986, that led to the resurgence that I caught the tail end of in the 90's.
Cross-validation, regularization and gradient checking
When using neural networks, choices need to be made about architecture, the number of layers and number of units in each layer. Considerations include over-fitting, bias and computational costs. Trying a range of architectures and cross-validating is a good way to make this choice.
The layered approach gives neural networks the ability to fit highly non-linear boundaries, but also makes them prone to over-fitting, so it's helpful to add a regularization term to the cost function that penalizes large weights. Selecting the regularization parameter can be done by cross-validation.
Translating the math into efficient code is tricky and it's not hard to get incorrect implementations that still seem to work. It's a good idea to confirm the correctness of your computations with a technique called gradient checking. You compute the partial derivatives numerically and compare with your implementation.
Back in the day, I implemented back-prop twice. Once in in my C++ code and again in Excel to check the results.
A place in the toolbox
The progression of ideas leading up to this point in the course is very cleverly arranged. Linear regression starts you on familiar ground and helps introduce gradient descent. Logistic regression adds the simple step of transforming your inputs through the sigmoidal function. Neural networks then follow naturally. It's just logistic regression in multiple layers.
In spite of their tumultuous history, neural networks can be looked at as just another tool in the machine learning toolbox, with pluses and minus like other tools. The history of the idea is interesting, in terms of seeing inside the sausage factory of science.