Results 1  10
of
228
Reinforcement learning: a survey
 Journal of Artificial Intelligence Research
, 1996
"... This paper surveys the field of reinforcement learning from a computerscience perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem ..."
Abstract

Cited by 1298 (23 self)
 Add to MetaCart
This paper surveys the field of reinforcement learning from a computerscience perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trialanderror interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
Between MDPs and SemiMDPs: A Framework for Temporal Abstraction in Reinforcement Learning
 Artificial Intelligence
, 1999
"... Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key, longstanding challenges for AI. In this paper we consider how these challenges can be addressed within the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We ..."
Abstract

Cited by 426 (29 self)
 Add to MetaCart
Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key, longstanding challenges for AI. In this paper we consider how these challenges can be addressed within the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action in this framework to include optionsclosedloop policies for taking action over a period of time. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Overall, we show that options enable temporally abstract knowledge and action to be included in the reinforcement learning framework in a natural and general way. In particular, we show that options may be used interchangeably with primitive actions in planning methods such as dynamic programming and in learning methods such as Qlearning.
Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding
 Advances in Neural Information Processing Systems 8
, 1996
"... On large problems, reinforcement learning systems must use parameterized function approximators such as neural networks in order to generalize between similar situations and actions. In these cases there are no strong theoretical results on the accuracy of convergence, and computational results have ..."
Abstract

Cited by 355 (18 self)
 Add to MetaCart
On large problems, reinforcement learning systems must use parameterized function approximators such as neural networks in order to generalize between similar situations and actions. In these cases there are no strong theoretical results on the accuracy of convergence, and computational results have been mixed. In particular, Boyan and Moore reported at last year's meeting a series of negative results in attempting to apply dynamic programming together with function approximation to simple control problems with continuous state spaces. In this paper, we present positive results for all the control tasks they attempted, and for one that is significantly larger. The most important differences are that we used sparsecoarsecoded function approximators (CMACs) whereas they used mostly global function approximators, and that we learned online whereas they learned offline. Boyan and Moore and others have suggested that the problems they encountered could be solved by using actual outcomes (...
A Stochastic Model of HumanMachine Interaction for learning dialog Strategies
 IEEE Transactions on Speech and Audio Processing
, 2000
"... Abstract—In this paper, we propose a quantitative model for dialog systems that can be used for learning the dialog strategy. We claim that the problem of dialog design can be formalized as an optimization problem with an objective function reflecting different dialog dimensions relevant for a given ..."
Abstract

Cited by 167 (4 self)
 Add to MetaCart
Abstract—In this paper, we propose a quantitative model for dialog systems that can be used for learning the dialog strategy. We claim that the problem of dialog design can be formalized as an optimization problem with an objective function reflecting different dialog dimensions relevant for a given application. We also show that any dialog system can be formally described as a sequential decision process in terms of its state space, action set, and strategy. With additional assumptions about the state transition probabilities and cost assignment, a dialog system can be mapped to a stochastic model known as Markov decision process (MDP). A variety of data driven algorithms for finding the optimal strategy (i.e., the one that optimizes the criterion) is available within the MDP framework, based on reinforcement learning. For an effective use of the available training data we propose a combination of supervised and reinforcement learning: the supervised learning is used to estimate a model of the user, i.e., the MDP parameters that quantify the user’s behavior. Then a reinforcement learning algorithm is used to estimate the optimal strategy while the system interacts with the simulated user. This approach is tested for learning the strategy in an air travel information system (ATIS) task. The experimental results we present in this paper show that it is indeed possible to find a simple criterion, a state space representation, and a simulated user parameterization in order to automatically learn a relatively complex dialog behavior, similar to one that was heuristically designed by several research groups. Index Terms—Dialog systems, Markov decision process, reinforcement learning, sequential decision process, speech, spoken
The linear programming approach to approximate dynamic programming
 Operations Research
, 2001
"... The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of largescale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach “fits ” a linear ..."
Abstract

Cited by 140 (16 self)
 Add to MetaCart
The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of largescale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach “fits ” a linear combination of preselected basis functions to the dynamic programming costtogo function. We develop error bounds that offer performance guarantees and also guide the selection of both basis functions and “staterelevance weights ” that influence quality of the approximation. Experimental results in the domain of queueing network control provide empirical support for the methodology. (Dynamic programming/optimal control: approximations/largescale problems. Queues, algorithms: control of queueing networks.)
Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun System
 Journal of Artificial Intelligence Research
, 2002
"... Designing the dialogue policy of a spoken dialogue system involves many nontrivial choices. This paper presents a reinforcement learning approach for automatically optimizing a dialogue policy, which addresses the technical challenges in applying reinforcement learning to a working dialogue system w ..."
Abstract

Cited by 131 (10 self)
 Add to MetaCart
Designing the dialogue policy of a spoken dialogue system involves many nontrivial choices. This paper presents a reinforcement learning approach for automatically optimizing a dialogue policy, which addresses the technical challenges in applying reinforcement learning to a working dialogue system with human users. We report on the design, construction and empirical evaluation of NJFun, an experimental spoken dialogue system that provides users with access to information about fun things to do in New Jersey. Our results show that by optimizing its performance via reinforcement learning, NJFun measurably improves system performance.
Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems
"... In cellular telephone systems, an important problem is to dynamically allocate the communication resource (channels) so as to maximize service in a stochastic caller environment. This problem is naturally formulated as a dynamic programming problem and we use a reinforcement learning (RL) method to ..."
Abstract

Cited by 124 (5 self)
 Add to MetaCart
In cellular telephone systems, an important problem is to dynamically allocate the communication resource (channels) so as to maximize service in a stochastic caller environment. This problem is naturally formulated as a dynamic programming problem and we use a reinforcement learning (RL) method to find dynamic channel allocation policies that are better than previous heuristic solutions. The policies obtained perform well for a broad variety of call traffic patterns. We present results on a large cellular system with approximately 49^49 states.
MachineLearning Research  Four Current Directions
"... Machine Learning research has been making great progress in many directions. This article summarizes four of these directions and discusses some current open problems. The four directions are (a) improving classification accuracy by learning ensembles of classifiers, (b) methods for scaling up super ..."
Abstract

Cited by 114 (1 self)
 Add to MetaCart
Machine Learning research has been making great progress in many directions. This article summarizes four of these directions and discusses some current open problems. The four directions are (a) improving classification accuracy by learning ensembles of classifiers, (b) methods for scaling up supervised learning algorithms, (c) reinforcement learning, and (d) learning complex stochastic models.
Reinforcement Learning In Continuous Time and Space
 Neural Computation
, 2000
"... This paper presents a reinforcement learning framework for continuoustime dynamical systems without a priori discretization of time, state, and action. Based on the HamiltonJacobiBellman (HJB) equation for infinitehorizon, discounted reward problems, we derive algorithms for estimating value f ..."
Abstract

Cited by 112 (5 self)
 Add to MetaCart
This paper presents a reinforcement learning framework for continuoustime dynamical systems without a priori discretization of time, state, and action. Based on the HamiltonJacobiBellman (HJB) equation for infinitehorizon, discounted reward problems, we derive algorithms for estimating value functions and for improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuoustime form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived and their correspondences with the conventional residual gradient, TD(0), and TD() algorithms are shown. For policy improvement, two methods, namely, a continuous actorcritic method and a valuegradient based greedy policy, are formulated. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived....
Approximate Planning in Large POMDPs via Reusable Trajectories
, 1999
"... We consider the problem of choosing a nearbest strategy from a restricted class of strategies in a partially observable Markov decision process (POMDP). We assume we are given the ability to simulate the behavior of the POMDP, and we provide methods for generating simulated experience su cient to a ..."
Abstract

Cited by 111 (10 self)
 Add to MetaCart
We consider the problem of choosing a nearbest strategy from a restricted class of strategies in a partially observable Markov decision process (POMDP). We assume we are given the ability to simulate the behavior of the POMDP, and we provide methods for generating simulated experience su cient to accurately approximate the expected return of any strategy in the class. We prove upper bounds on the amount of simulated experience our methods must generate in order to achieve such uniform approximation. These bounds have no dependence on the size or complexity of the underlying POMDP, but depend only on the complexity of the restricted strategy class. The main challenge is in generating trajectories in the POMDP that can be reused, in the sense that they simultaneously provide estimates of the return of many strategies in the class. Our measure of strategy class complexity generalizes the classical notion of VC dimension, and our methods develop connections between problems of current interest in reinforcement learning and wellstudied issues in the theory of supervised learning. We also discuss a number of practical planning algorithms for POMDPs that arise from our reusable trajectories.