|
|
Unsupervised Learning in Neural NetworksReinforcement LearningThe third major category of learning that will be discussed is termed "reinforcement learning". As mentioned earlier, reinforcement learning lies between supervised and unsupervised learning. It is often called learning with a critic rather than learning with a teacher as the feedback is evaluative (right or wrong) rather than instructive (desired output). Reinforcement learning in neural networks has its roots in behavioural psychology. A psychologist, Burrhus Skinner, trained rats to press a lever in order to obtain a food reward. Initially he "rewarded" rats pressing the lever with a food reward. After a training period, the rats pressed the trigger by themselves, even in absence of the food. This kind of learning has been termed operant conditioning may involve a response being rewarded each time it occurs, so that it comes to occur more frequently. Another sort of operant conditioning involves a response being punished rather than rewarded each time it occurs. The result of this training is that the behaviour comes to occur less frequently. Unfortunately, the precise neurobiological correlate of operant conditioning is still unclear and this has meant that much of the work on artificial neural networks has been speculative and very much an area of active research. Among the issues that must be dealt with is the temporal credit assignment problem. For example, a network designed to play chess would receive a reinforcement signal at the end of the game after completing a long sequence of moves. How do we assign blame individually to each move in the sequence that leads to an eventual victory or loss? This is a sort of temporal equivalent of the spatial or structural credit problem that we encountered with the back-propagation algorithm. Another issue facing such networks is determining the trade-off between exploration and the possibility of a delayed reward versus exploitation of current knowledge to obtain an immediate reward. For example, consider the problem of the n-Armed bandit. In this case the bandit is the "agent". He must interact with his "environment" and perform the "action" of pulling one of n levers. Each lever has some average reward, called its "value". The goal is to obtain as much reward as possible. If the values of the levers are known then the problem is easy - choose the lever with the greatest value. If, however, the values are unknown then a decision must be made. The greedy approach is to exploit current knowledge and always pick the lever you think has the greatest value. Alternatively, the agent may explore and pick another lever, which over many turns will lead to a more accurate estimation of a levers value and improve long-term reward.
The balance between exploitation and exploration in order to obtain maximal long-term rewards is difficult and in part will depend on how many turns are allowed. One "policy" might be to exploit most of the time but every now and (with probability e) explore new options. How does the agent arrive at a policy? Consider an agent with a finite number of possible states S. The term "state" is used here to describe whatever information is available to the agent. For example, in poker the "state" is the value of the cards in a player's hand (not other players hands or cards still in the deck). The agent can also perform a limited number of actions A. The probability of taking a particular action "a" depends on the particular state "s" of the agent:
At each time-step t the agent finds itself in a state st and on that basis chooses an action at. One time-step later, the agent receives a reward rt+1 and finds itself in a new state st+1. The return R is the total reward received starting at time t+1. A refinement of return is known as "discounted return". This can be used to limit the infinite sum, and also to give more weight to earlier rewards. It utilises a discount factor lambda:
Now we have reviewed some of the language used in reinforcement it is time to describe the decision making process. Most Reinforcement Learning Rules use a stochastic (probability based) decision process. In most cases these processes rely on the system being modelled having the "Markov property". This means that the current state allows the model to predict the probable consequences of the actions it takes i.e. the next state and reward or transition function P. For example, in tic-tac-toe (noughts and crosses) the "state" of the board contains all the information necessary to decide on the next move. The steps that were taken to reach that state have no effect on the decision of the best move to make. If the agent can distinguish all discrete states there is a "Markov Decision Process (MDP)" otherwise if there is continuous distribution of states there is a "Partially Observable MDP (POMDP)". Although many real-world problems are non-Markov we generally assume they are. There are many possible policy adaptation methods the details of which are beyond the scope of this review. However, the methods broadly include "direct policy search" and "value function-based" methods. Direct policy search generates and evaluates different policies. Examples of direct policy search include evolutionary search and policy gradient methods. In contrast to these models, value function-based methods attempt to learn a value function for the policy and generate a new policy using this value function. Examples of value-function based methods include Q-learning, and Dynamic Programming. Previous | Next | Page 1 2 3 4 Index | Unsupervised Learning | Reinforcement Learning References: Reinforcement Learning: An Introduction. © 2008 Marcus bros |
|