Supervised Learning in Neural Networks
Back-propagation
Having understood the delta rule, we are now in a position to extend it to include hidden layers. Consider a feed-forward network consisting of several inputs feeding into hidden units, which are in turn connected to output units. As with the delta rule, the goal is to perform a gradient descent. The global error E and the pattern-specific error ep are defined as before, with ep being used as an approximation for E. Application of gradient descent to the output units is fairly straightforward: the change in each weight depends on the derivative of the sigmoid function, the input arriving on that weight, and the error term (tp - yp). The difficulty is extending this to the hidden units. While the sigmoid function and inputs are the same, the remaining term (tp - yp) is not, because a hidden unit has no target value of its own. I have designated this term delta.
How do we know which weights are to blame when they don't directly produce an output? Put simply, we first determine how much a given weight affects the different output nodes, and then how much these nodes contribute to the error. For the kth hidden unit, its effect on the jth output node is a function of wjk. The jth output node's effect on ep can then be calculated as before and may be denoted delta j. Combined, we have delta j x wjk. This calculation is done for all of k's output nodes and the results are summed, giving the sum over j of (delta j x wjk).
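Substituting this sum for (tp - yp) in our initial equation gives the rule for the hidden units: the change in a hidden unit's weight depends on delta k and the input arriving on that weight, where delta k is the sum above multiplied by the derivative of the hidden unit's sigmoid (note that I have included the sigmoid function within the delta term).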
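In symbols, the two rules described above can be sketched as follows. The notation is partly assumed rather than taken from the text: alpha is the learning rate, a is a unit's net input, h_k is the output of hidden unit k, x_i is the ith network input, and w_ki is a hypothetical name for the weight from input i to hidden unit k. The sketch also assumes the usual squared-error form of ep and folds the sigmoid derivative into every delta.

```latex
\begin{align*}
\text{output unit } j:\quad
  \delta_j &= \sigma'(a_j)\,\bigl(t_j^p - y_j^p\bigr), &
  \Delta w_{jk} &= \alpha\,\delta_j\,h_k \\
\text{hidden unit } k:\quad
  \delta_k &= \sigma'(a_k)\sum_j \delta_j\,w_{jk}, &
  \Delta w_{ki} &= \alpha\,\delta_k\,x_i
\end{align*}
% For the sigmoid, \sigma'(a) = \sigma(a)\,\bigl(1 - \sigma(a)\bigr).
```

The only difference between the two rows is where delta comes from: output units read it off the target directly, while hidden units inherit it from the output deltas through the weights wjk.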
The above learning rule may be implemented as an algorithm that begins with a forward pass through the network. The algorithm is repeated until stopping criteria are met. This may be when a sufficiently low error is reached or, alternatively, when the rate of change of the error becomes low. Another possibility is to stop when the network starts to become "over-trained", which will be discussed later.
Although the algorithm successfully allows multiple layers, it does suffer from limitations. The first problem is that of encountering local error minima. Put simply, as each weight is adjusted by itself, the network may converge towards a solution that is not the best "global minimum" of the error. To make this less likely, weights are adjusted "noisily". This is one advantage of training with individual patterns rather than an entire training set. Local minima can also be made less likely by presenting the patterns within a training set in a random permutation.
A second problem with the back-propagation algorithm in multilayer nets is the possibility of overtraining. In real life, data tends to be imprecise, and one of the strengths of neural networks is their ability to use such data and look for general patterns. However, a network with many hidden nodes has the capacity to "memorise" the desired output for specific patterns. This reduces the errors made on the training sample but increases those made in real life. The problem of how many hidden neurones to have in a network is still an active research issue.
A final point is the use of "momentum". This concept was introduced to help manage the learning rate. If the rate is too low, training takes a long time; if it is too high, the network becomes unstable and the weights often oscillate. Instead of pure gradient descent, momentum adds a portion of the last weight change to the new one. So, if the previous weight change was large, the new one will tend to be large too, i.e. changes in weight carry momentum from one moment to the next.
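The training procedure described in this section, including the momentum term, might be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the original algorithm listing: the function names (`forward`, `train`), the bias handling, the weight ranges, and the parameter names `alpha` (learning rate) and `mu` (momentum fraction) are all hypothetical choices, and ep is assumed to be half the summed squared error.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, v, w):
    """Forward pass: inputs -> hidden units -> outputs.
    A constant 1.0 is appended to each layer's input to act as a bias (assumption)."""
    xb = x + [1.0]
    h = [sigmoid(sum(vk[i] * xb[i] for i in range(len(xb)))) for vk in v]
    hb = h + [1.0]
    y = [sigmoid(sum(wj[k] * hb[k] for k in range(len(hb)))) for wj in w]
    return xb, hb, y


def pattern_error(t, y):
    """Pattern-specific error, assumed to be half the summed squared error."""
    return 0.5 * sum((tj - yj) ** 2 for tj, yj in zip(t, y))


def train(patterns, n_hidden=3, alpha=0.5, mu=0.8,
          max_epochs=5000, target_error=0.01, seed=0):
    """Per-pattern back-propagation with momentum; returns weights and error history."""
    rng = random.Random(seed)
    n_in, n_out = len(patterns[0][0]), len(patterns[0][1])
    # v: input->hidden weights, w: hidden->output weights (small random start)
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    dv = [[0.0] * (n_in + 1) for _ in range(n_hidden)]   # previous weight changes
    dw = [[0.0] * (n_hidden + 1) for _ in range(n_out)]
    history = []
    for _ in range(max_epochs):
        order = patterns[:]
        rng.shuffle(order)                    # random permutation of patterns
        for x, t in order:
            xb, hb, y = forward(x, v, w)
            # Backward pass: output deltas, then hidden deltas via w_jk
            d_out = [y[j] * (1 - y[j]) * (t[j] - y[j]) for j in range(n_out)]
            d_hid = [hb[k] * (1 - hb[k]) *
                     sum(d_out[j] * w[j][k] for j in range(n_out))
                     for k in range(n_hidden)]
            # Weight updates: gradient step plus a portion of the last change
            for j in range(n_out):
                for k in range(n_hidden + 1):
                    dw[j][k] = alpha * d_out[j] * hb[k] + mu * dw[j][k]
                    w[j][k] += dw[j][k]
            for k in range(n_hidden):
                for i in range(n_in + 1):
                    dv[k][i] = alpha * d_hid[k] * xb[i] + mu * dv[k][i]
                    v[k][i] += dv[k][i]
        total = sum(pattern_error(t, forward(x, v, w)[2]) for x, t in patterns)
        history.append(total)
        if total < target_error:              # stopping criterion: low error
            break
    return v, w, history


if __name__ == "__main__":
    xor = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
           ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
    v, w, history = train(xor)
    print("epochs run:", len(history), "final error:", history[-1])
```

Setting `mu=0` reduces the update to pure gradient descent, which makes it easy to compare the two behaviours on the same data. XOR is used only as a small example that needs a hidden layer; depending on the random start it may still settle in a local minimum.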
Momentum has the net effect of smoothing out small fluctuations in "weight space".
© 2008 Marcus bros