Results 1 - 10 of 26
Practical Issues in Temporal Difference Learning
 Machine Learning
, 1992
Abstract

Cited by 363 (2 self)
This paper examines whether temporal difference methods for training connectionist networks, such as Sutton's TD(lambda) algorithm, can be successfully applied to complex real-world problems. A number of important practical issues are identified and discussed from a general theoretical perspective. These practical issues are then examined in the context of a case study in which TD(lambda) is applied to learning the game of backgammon from the outcome of self-play. This is apparently the first application of this algorithm to a complex non-trivial task. It is found that, with zero knowledge built in, the network is able to learn from scratch to play the entire game at a fairly strong intermediate level of performance, which is clearly better than conventional commercial programs and which in fact surpasses comparable networks trained on a massive human expert data set. This indicates that TD learning may work better in practice than one would expect based on current theory, and it suggests that further analysis of TD methods, as well as applications in other complex domains, may be worth investigating.
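The core TD(lambda) update the abstract refers to can be sketched in a few lines. This is a generic tabular illustration with accumulating eligibility traces, under assumed names (`td_lambda_episode`, reward only at termination, as in self-play games); it is not Tesauro's actual network code:

```python
import numpy as np

def td_lambda_episode(V, states, reward, alpha=0.1, lam=0.7, gamma=1.0):
    """One episode of tabular TD(lambda) with accumulating eligibility traces.

    V      : value table (1-D float array), updated in place
    states : sequence of visited state indices, terminal state last
    reward : scalar outcome received on termination (zero elsewhere)
    """
    e = np.zeros_like(V)  # eligibility traces
    for t in range(len(states) - 1):
        s, s_next = states[t], states[t + 1]
        terminal = (t + 1 == len(states) - 1)
        target = reward if terminal else gamma * V[s_next]
        delta = target - V[s]   # TD error for this transition
        e[s] += 1.0             # accumulate trace for the current state
        V += alpha * delta * e  # credit all recently visited states
        e *= gamma * lam        # decay traces toward earlier states
    return V
```

With lam = 0 this reduces to one-step TD(0); with lam = 1 the credit assignment approaches Monte Carlo updates toward the final outcome.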
Residual Algorithms: Reinforcement Learning with Function Approximation
 In Proceedings of the Twelfth International Conference on Machine Learning
, 1995
Abstract

Cited by 237 (5 self)
A number of reinforcement learning algorithms have been developed that are guaranteed to converge to the optimal solution when used with lookup tables. It is shown, however, that these algorithms can easily become unstable when implemented directly with a general function-approximation system, such as a sigmoidal multilayer perceptron, a radial-basis-function system, a memory-based learning system, or even a linear function-approximation system. A new class of algorithms, residual gradient algorithms, is proposed, which perform gradient descent on the mean squared Bellman residual, guaranteeing convergence. It is shown, however, that they may learn very slowly in some cases. A larger class of algorithms, residual algorithms, is proposed that has the guaranteed convergence of the residual gradient algorithms, yet can retain the fast learning speed of direct algorithms. In fact, both direct and residual gradient algorithms are shown to be special cases of residual algorithms, and it is s...
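For a linear value function V(s) = w . phi(s), the family of updates the abstract describes can be sketched as a single rule with a mixing parameter. This is an illustrative reconstruction (names like `residual_update` and `beta` are my own), not Baird's published code:

```python
import numpy as np

def residual_update(w, phi_s, phi_s2, r, alpha=0.1, gamma=0.9, beta=1.0):
    """One residual-algorithm update for a linear value function V(s) = w . phi(s).

    beta = 1.0 gives the pure residual-gradient update (gradient descent on the
    squared Bellman residual); beta = 0.0 recovers the direct TD(0) update;
    intermediate beta trades convergence safety against learning speed.
    """
    delta = r + gamma * np.dot(w, phi_s2) - np.dot(w, phi_s)  # Bellman residual
    direction = phi_s - beta * gamma * phi_s2                 # mixed update direction
    return w + alpha * delta * direction
```

A single beta = 1 step provably cannot increase the squared residual on the sampled transition (for small enough alpha), which is the convergence guarantee the abstract refers to.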
Stable Function Approximation in Dynamic Programming
 IN MACHINE LEARNING: PROCEEDINGS OF THE TWELFTH INTERNATIONAL CONFERENCE
, 1995
Abstract

Cited by 208 (5 self)
The success of reinforcement learning in practical problems depends on the ability to combine function approximation with temporal difference methods such as value iteration. Experiments in this area have produced mixed results; there have been both notable successes and notable disappointments. Theory has been scarce, mostly due to the difficulty of reasoning about function approximators that generalize beyond the observed data. We provide a proof of convergence for a wide class of temporal difference methods involving function approximators such as k-nearest-neighbor, and show experimentally that these methods can be useful. The proof is based on a view of function approximators as expansion or contraction mappings. In addition, we present a novel view of approximate value iteration: an approximate algorithm for one environment turns out to be an exact algorithm for a different environment.
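The key property behind the convergence proof is that averagers such as k-nearest-neighbor are non-expansions in the sup norm: each prediction is a convex combination of targets, so the approximator cannot amplify differences between value estimates. A minimal 1-D sketch (function and parameter names assumed for illustration):

```python
import numpy as np

def knn_averager(X_train, k=3):
    """Return a predictor for a k-nearest-neighbor averager over 1-D inputs,
    an example of the non-expansion approximators covered by the proof."""
    def predict(y_train, X_query):
        preds = []
        for x in X_query:
            idx = np.argsort(np.abs(X_train - x))[:k]  # k nearest training points
            preds.append(y_train[idx].mean())          # convex combination of targets
        return np.array(preds)
    return predict
```

Because composing a non-expansion with the gamma-contracting Bellman backup is still a contraction, approximate value iteration with such an approximator converges to a fixed point.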
Approximate Solutions to Markov Decision Processes
, 1999
Abstract

Cited by 66 (9 self)
One of the basic problems of machine learning is deciding how to act in an uncertain world. For example, if I want my robot to bring me a cup of coffee, it must be able to compute the correct sequence of electrical impulses to send to its motors to navigate from the coffee pot to my office. In fact, since the results of its actions are not completely predictable, it is not enough just to compute the correct sequence; instead the robot must sense and correct for deviations from its intended path. In order for any machine learner to act reasonably in an uncertain environment, it must solve problems like the above one quickly and reliably. Unfortunately, the world is often so complicated that it is difficult or impossible to find the optimal sequence of actions to achieve a given goal. So, in order to scale our learners up to real-world problems, we usually must settle for approximate solutions. One representation for a learner's environment and goals is a Markov decision process or MDP. ...
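The exact (non-approximate) baseline that the thesis's approximations relax is tabular value iteration on an MDP. A minimal sketch, with assumed array conventions (`P[a, s, s']` transition probabilities, `R[a, s]` expected rewards):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Exact value iteration for a small tabular MDP.

    P : array [A, S, S], P[a, s, s'] = transition probability
    R : array [A, S], expected immediate reward for action a in state s
    Returns the optimal state values V*.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: V(s) = max_a [ R(a,s) + gamma * sum_s' P V(s') ]
        Q = R + gamma * P @ V        # action values, shape [A, S]
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Each sweep is a gamma-contraction, so the loop converges geometrically; the cost is the explicit [A, S, S] table, which is exactly what becomes infeasible in the large problems the thesis targets.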
Modular Neural Networks for Learning Context-Dependent Game Strategies
 Master’s thesis, Computer Speech and Language Processing
, 1992
Abstract

Cited by 32 (3 self)
The method of temporal differences (TD) is a learning technique which specialises in predicting the likely outcome of a sequence over time. Examples of such sequences include speech frame vectors, whose outcome is a phoneme or word decision, and positions in a board game, whose outcome is a win/loss decision. Recent results by Tesauro in the domain of backgammon indicate that a neural network, trained by TD methods to evaluate positions generated by self-play, can reach an advanced level of backgammon skill. For my summer thesis project, I first implemented the TD/neural network learning algorithms and confirmed Tesauro's results, using the domains of tic-tac-toe and backgammon. Then, motivated by Waibel's success with modular neural networks for phoneme recognition, I experimented with using two modular architectures (DDD and Meta-Pi) in place of the monolithic networks. I found that using the modular networks significantly enhanced the ability of the backgammon evaluator to change it...
Reinforcement Learning Through Gradient Descent
, 1999
Abstract

Cited by 22 (0 self)
Reinforcement learning is often done using parameterized function approximators to store value functions. Algorithms are typically developed for lookup tables, and then applied to function approximators by using backpropagation. This can lead to algorithms diverging on very small, simple MDPs and Markov chains, even with linear function approximators and epoch-wise training. These algorithms are also very difficult to analyze, and difficult to combine with other algorithms. A series of new families of algorithms are derived based on stochastic gradient descent. Since they are derived from first principles with function approximators in mind, they have guaranteed convergence to local minima, even on general nonlinear function approximators. For both residual algorithms and VAPS algorithms, it is possible to take any of the standard algorithms in the field, such as Q-learning or SARSA or value iteration, and rederive a new form of it with provable convergence. In addition to better conve...
Multi-Player Residual Advantage Learning with General Function Approximation
 Wright Laboratory
, 1996
Abstract

Cited by 21 (1 self)
A new algorithm, advantage learning, is presented that improves on advantage updating by requiring that a single function be learned rather than two. Furthermore, advantage learning requires only a single type of update, the learning update, while advantage updating requires two different types of updates, a learning update and a normalization update. The reinforcement learning system uses the residual form of advantage learning. An application of reinforcement learning to a Markov game is presented. The testbed has continuous states and nonlinear dynamics. The game consists of two players, a missile and a plane; the missile pursues the plane and the plane evades the missile. On each time step, each player chooses one of two possible actions: turn left or turn right, resulting in a 90-degree instantaneous change in the aircraft's heading. Reinforcement is given only when the missile hits the plane or the plane reaches an escape distance from the missile. The advantage function is stor...
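The single learning update the abstract mentions can be sketched in tabular form. This is an illustrative reconstruction under assumed names (`advantage_learning_update`, scaling constant `k`), not the paper's residual, function-approximation implementation:

```python
import numpy as np

def advantage_learning_update(A, s, a, r, s2, alpha=0.1, gamma=0.9, k=0.5):
    """One tabular advantage-learning update: a single function A and a
    single update rule, as opposed to advantage updating's two of each.

    A : table [S, U] of advantage values; k scales the advantage of the
        non-optimal actions. With k = 1 the target reduces to the ordinary
        Q-learning target r + gamma * max_a' A(s', a').
    """
    v_s, v_s2 = A[s].max(), A[s2].max()           # state values as max over actions
    target = v_s + (r + gamma * v_s2 - v_s) / k   # advantage-learning target
    A[s, a] += alpha * (target - A[s, a])         # single learning update
    return A
```

Shrinking k widens the gap between the best and the other actions, which is what makes the greedy policy robust to approximation error in continuous-time problems.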
Advantage Updating Applied to a Differential Game
 Advances in Neural Information Processing Systems 7
, 1995
Abstract

Cited by 19 (5 self)
An application of reinforcement learning to a linear-quadratic differential game is presented. The reinforcement learning system uses a recently developed algorithm, the residual gradient form of advantage updating. The game is a Markov Decision Process (MDP) with continuous time, states, and actions, linear dynamics, and a quadratic cost function. The game consists of two players, a missile and a plane; the missile pursues the plane and the plane evades the missile. The reinforcement learning algorithm for optimal control is modified for differential games in order to find the minimax point, rather than the maximum. (Presented at the Neural Information Processing Systems Conference, Denver, Colorado, November 28 - December 3, 1994.) Simulation results are compared to the optimal solution, demonstrating that the simulated reinforcement learning system converges to the optimal answer. The performance of both the residual gradient and non-residual gradient forms of advantage updating an...
GP-gammon: Genetically Programming Backgammon Players
 Genetic Programming and Evolvable Machines, 6(3):283-300, September 2005
, 2005
Abstract

Cited by 14 (0 self)
We apply genetic programming to the evolution of strategies for playing the game of backgammon. We explore two different strategies of learning: using a fixed external opponent as teacher, and letting the individuals play against each other. We conclude that the second approach is better and leads to excellent results: pitted in a 1000-game tournament against a standard benchmark player, Pubeval, our best evolved program wins 62.4% of the games, the highest result to date. Moreover, several other evolved programs attain win percentages not far behind the champion, evidencing the repeatability of our approach.
TD learning of game evaluation functions with hierarchical neural architectures
 Wiering, M.A. and Dorigo, M. (1998). Learning to control forest fires. In Haasis, H.D. and Ranze, K.C., editors, Proceedings of the 12th International Symposium on "Computer Science for Environmental Protection", volume 18 of Umweltinformatik Aktuell, pages 378-388, Marbu
, 1995
Abstract

Cited by 11 (1 self)
This Master's thesis describes the efficiency of temporal difference (TD) learning and the advantages of using modular neural network architectures for learning game evaluation functions. These modular architectures use a hierarchy of gating networks to divide the input space into subspaces for which expert networks are trained. This divide-and-conquer principle might be advantageous when learning game evaluation functions which contain discontinuities, and can also lead to more understandable solutions in which strategies can be identified and explored. We compare the following three modular architectures: the hierarchical mixtures of experts, the Meta-Pi network, and the use of fixed symbolic rules. In order to generate learning samples, we combine reinforcement learning with the temporal difference method. When training neural networks with these examples, it is possible to learn to play any desired game. An extension of normal backpropagation has been used, in which the sensitivities of neurons are adapted by a learning rule. We discuss how these neuron sensitivities can be used to learn discontinuous and smooth
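The gated combination at the heart of mixtures-of-experts and Meta-Pi style architectures can be sketched with linear experts and a softmax gate. This is a one-level toy illustration under assumed names (`mixture_of_experts`), not the thesis's hierarchical implementation:

```python
import numpy as np

def mixture_of_experts(x, expert_ws, gate_w):
    """Forward pass of a one-level mixture-of-experts evaluator.

    x         : input feature vector (e.g. an encoded board position)
    expert_ws : list of weight vectors, one linear expert each
    gate_w    : [n_experts, dim] gating weights, softmax-combined
    """
    expert_out = np.array([w @ x for w in expert_ws])  # each expert's evaluation
    logits = gate_w @ x
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()                               # softmax gating weights
    return gates @ expert_out                          # gated combination
```

The gate learns to route each input subspace (for example, racing versus contact positions in backgammon) to the expert specialised for it, which is the divide-and-conquer effect the abstract describes.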