Results 1 - 10 of 79
Reinforcement learning: a survey
- Journal of Artificial Intelligence Research, 1996
Cited by 1134 (21 self)
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
Infinite-horizon policy-gradient estimation
- Journal of Artificial Intelligence Research, 2001
Cited by 119 (5 self)
Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm’s chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter β ∈ [0, 1) (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
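As a rough illustration of the kind of estimator this abstract describes, the Python sketch below runs a GPOMDP-style update on a toy problem. The single-observation, two-action environment, the softmax policy, and the names `gpomdp_estimate`, `reward_fn`, and `beta` are assumptions made for the example and are not taken from the paper. The update follows the commonly stated form of the algorithm: a discounted trace of score-function terms, averaged against the observed rewards. Only two vectors the size of the parameter vector are kept, which is consistent with the storage remark in the abstract.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of action preferences."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def gpomdp_estimate(theta, reward_fn, beta, steps, rng):
    """GPOMDP-style gradient estimate for a toy, single-observation problem.

    theta     : softmax action preferences (the policy parameters)
    reward_fn : hypothetical reward function of the chosen action
    beta      : trace discount in [0, 1); larger values lower bias, raise variance
    """
    n = len(theta)
    z = [0.0] * n      # eligibility trace, same size as theta
    delta = [0.0] * n  # running average that becomes the gradient estimate
    for t in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(n), weights=probs)[0]
        reward = reward_fn(action)
        for i in range(n):
            # Gradient of log softmax probability of the chosen action.
            score = (1.0 if i == action else 0.0) - probs[i]
            z[i] = beta * z[i] + score
            delta[i] += (reward * z[i] - delta[i]) / (t + 1)
    return delta


if __name__ == "__main__":
    rng = random.Random(0)
    # Action 1 pays reward 1, action 0 pays nothing (an assumed toy reward).
    print(gpomdp_estimate([0.0, 0.0], lambda a: float(a == 1), 0.9, 20000, rng))
```

In this toy setting the average reward equals the probability of choosing action 1, so the estimate should point toward raising that action's preference; a real POMDP would additionally condition the policy on observations.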
Machine-Learning Research -- Four Current Directions
Cited by 102 (1 self)
Machine Learning research has been making great progress in many directions. This article summarizes four of these directions and discusses some current open problems. The four directions are (a) improving classification accuracy by learning ensembles of classifiers, (b) methods for scaling up supervised learning algorithms, (c) reinforcement learning, and (d) learning complex stochastic models.
Reinforcement Learning in POMDP's via Direct Gradient Ascent
- In Proc. 17th International Conf. on Machine Learning, 2000
Cited by 61 (2 self)
This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy. The algorithm's chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of bias-variance trade-off, and it requires no knowledge of the underlying state. We prove convergence of GPOMDP and show how the gradient estimates produced by GPOMDP can be used in a conjugate-gradient procedure to find local optima of the average reward. 1. Introduction "Reinforcement learning" is used to describe the general problem of training an agent to choose its actions so as to increase its long-term average reward. The structure of th...
Evolutionary function approximation for reinforcement learning
- Journal of Machine Learning Research, 2006
Cited by 60 (15 self)
... representational decisions. This thesis investigates evolutionary function approximation, a novel approach to automatically selecting function approximator representations that enable efficient individual learning. This method evolves individuals that are better able to learn. I present a fully implemented instantiation of evolutionary function approximation which combines NEAT, a neuroevolutionary optimization technique, with Q-learning, a popular TD method. ... In many machine learning problems, an agent must learn ... Reinforcement learning problems are the subset of these tasks in which the agent never sees examples of correct behavior. Instead it receives only positive and negative rewards ... problems such as robot control, game playing, and system optimization fall in this category, developing effective reinforcement ...
Direct gradient-based reinforcement learning: I. gradient estimation algorithms
- National University, 1999
Cited by 51 (3 self)
In [2] we introduced GPOMDP, an algorithm for computing arbitrarily accurate approximations to the performance gradient of parameterized partially observable Markov decision processes (POMDPs). The algorithm’s chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of bias-variance trade-off, and it requires no knowledge of the underlying state. In addition, the algorithm can be applied to infinite state, control and observation spaces.
High-Performance Job-Shop Scheduling With A Time-Delay TD(lambda) Network
- Advances in Neural Information Processing Systems 8, 1995
Cited by 44 (3 self)
Job-shop scheduling is an important task for manufacturing industries. We are interested in the particular task of scheduling payload processing for NASA's space shuttle program. This paper summarizes our previous work on formulating this task for solution by the reinforcement learning algorithm TD(λ). A shortcoming of this previous work was its reliance on hand-engineered input features. This paper shows how to extend the time-delay neural network (TDNN) architecture to apply it to irregular-length schedules. Experimental tests show that this TDNN-TD(λ) network can match the performance of our previous hand-engineered system. The tests also show that both neural network approaches significantly outperform the best previous (non-learning) solution to this problem in terms of the quality of the resulting schedules and the number of search steps required to construct them.
Learning and Value Function Approximation in Complex Decision Processes
- 1998
Cited by 34 (4 self)
In principle, a wide variety of sequential decision problems -- ranging from dynamic resource allocation in telecommunication networks to financial risk management -- can be formulated in terms of stochastic control and solved by the algorithms of dynamic programming. Such algorithms compute and store a value function, which evaluates expected future reward as a function of current state. Unfortunately, exact computation of the value function typically requires time and storage that grow proportionately with the number of states, and consequently, the enormous state spaces that arise in practical applications render the algorithms intractable. In this thesis, we study tractable methods that approximate the value function. Our work builds on research in an area of artificial intelligence known as reinforcement learning. A point of focus of this thesis is temporal-difference learning -- a stochastic algorithm inspired to some extent by phenomena observed in animal behavior. Given a selection of...
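To make the idea of temporal-difference learning with an approximate value function concrete, here is a minimal Python sketch of TD(λ) with a linear approximator and accumulating eligibility traces. It is a textbook-style illustration and not the method analyzed in the thesis; the function name `td_lambda_linear`, the trajectory format, and the feature map are all assumptions made for the example.

```python
def td_lambda_linear(episodes, features, n_features, alpha, gamma, lam, sweeps):
    """Estimate state values with linear TD(lambda).

    episodes : list of trajectories, each a list of (state, reward, next_state)
               tuples, where next_state is None at termination (assumed format)
    features : function mapping a state to a list of n_features numbers
    alpha    : step size; gamma: discount factor; lam: trace decay
    sweeps   : number of passes over the recorded episodes
    """
    w = [0.0] * n_features

    def value(state):
        # Terminal states are defined to have value zero.
        if state is None:
            return 0.0
        return sum(wi * xi for wi, xi in zip(w, features(state)))

    for _ in range(sweeps):
        for episode in episodes:
            trace = [0.0] * n_features  # reset traces at the start of an episode
            for state, reward, next_state in episode:
                x = features(state)
                td_error = reward + gamma * value(next_state) - value(state)
                for i in range(n_features):
                    trace[i] = gamma * lam * trace[i] + x[i]
                    w[i] += alpha * td_error * trace[i]
    return w


if __name__ == "__main__":
    # Hypothetical 4-state chain: 0 -> 1 -> 2 -> 3 -> end, reward 1 on the last step.
    chain = [[(0, 0.0, 1), (1, 0.0, 2), (2, 0.0, 3), (3, 1.0, None)]]
    one_hot = lambda s: [1.0 if i == s else 0.0 for i in range(4)]
    print(td_lambda_linear(chain, one_hot, 4, alpha=0.1, gamma=0.9, lam=0.5, sweeps=2000))
```

With one-hot features the approximator reduces to a lookup table, so the learned weights approach the discounted returns of the chain; swapping in a coarser feature map is what would make the storage independent of the number of states.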
Adore: Adaptive object recognition
- Videre, 1999
Cited by 29 (1 self)
Many modern computer vision systems are built by chaining together standard vision procedures, often in graphical programming environments such as Khoros, CVIPtools or IUE. Typically, these procedures are selected and sequenced by an ad-hoc combination of programmer’s intuition and trial-and-error. This paper presents a theoretically sound method for constructing object recognition strategies by casting object recognition as a Markov Decision Problem (MDP). The result is a system called ADORE (Adaptive Object Recognition) that automatically learns object recognition control policies from training data. Experimental results are presented in which ADORE is trained to recognize five types of houses in aerial images, and where its performance can be (and is) compared to optimal.
Auto-exploratory Average Reward Reinforcement Learning
- Artificial Intelligence, 1996
Cited by 28 (8 self)
We introduce a model-based average reward Reinforcement Learning method called H-learning and compare it with its discounted counterpart, Adaptive Real-Time Dynamic Programming, in a simulated robot scheduling task. We also introduce an extension to H-learning, which automatically explores the unexplored parts of the state space, while always choosing greedy actions with respect to the current value function. We show that this "Auto-exploratory H-learning" performs better than the original H-learning under previously studied exploration methods such as random, recency-based, or counter-based exploration. Introduction Reinforcement Learning (RL) is the study of learning agents that improve their performance at some task by receiving rewards and punishments from the environment. Most approaches to reinforcement learning, including Q-learning (Watkins and Dayan 92) and Adaptive Real-Time Dynamic Programming (ARTDP) (Barto, Bradtke, & Singh 95), optimize the total discounted reward the ...

