Results 1  10
of
457
Treebased batch mode reinforcement learning
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2005
"... Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the socalled Qfunction based on a set of fourtuples (xt,ut,rt,xt+1) where xt denotes the system state a ..."
Abstract

Cited by 224 (42 self)
 Add to MetaCart
Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the socalled Qfunction based on a set of fourtuples (xt,ut,rt,xt+1) where xt denotes the system state at time t, ut the control action taken, rt the instantaneous reward obtained and xt+1 the successor state of the system, and by determining the control policy from this Qfunction. The Qfunction approximation may be obtained from the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical treebased supervised learning methods (CART, Kdtree, tree bagging) and two newly proposed ensemble algorithms, namely extremely and totally randomized trees. We study their performances on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of fourtuples. In particular, the totally randomized trees give good results while ensuring the convergence of the sequence, whereas by relaxing the convergence constraint even better accuracy results are provided by the extremely randomized trees.
Perseus: Randomized pointbased value iteration for POMDPs
 Journal of Artificial Intelligence Research
, 2005
"... Partially observable Markov decision processes (POMDPs) form an attractive and principled framework for agent planning under uncertainty. Pointbased approximate techniques for POMDPs compute a policy based on a finite set of points collected in advance from the agent’s belief space. We present a ra ..."
Abstract

Cited by 204 (17 self)
 Add to MetaCart
(Show Context)
Partially observable Markov decision processes (POMDPs) form an attractive and principled framework for agent planning under uncertainty. Pointbased approximate techniques for POMDPs compute a policy based on a finite set of points collected in advance from the agent’s belief space. We present a randomized pointbased value iteration algorithm called Perseus. The algorithm performs approximate value backup stages, ensuring that in each backup stage the value of each point in the belief set is improved; the key observation is that a single backup may improve the value of many belief points. Contrary to other pointbased methods, Perseus backs up only a (randomly selected) subset of points in the belief set, sufficient for improving the value of each belief point in the set. We show how the same idea can be extended to dealing with continuous action spaces. Experimental results show the potential of Perseus in large scale POMDP problems. 1.
Learning nearoptimal policies with Bellmanresidual minimization based fitted policy iteration and a single sample path
 MACHINE LEARNING JOURNAL (2008) 71:89129
, 2008
"... ..."
(Show Context)
Evolutionary function approximation for reinforcement learning
 Journal of Machine Learning Research
, 2006
"... Ø�ÓÒ�ÔÔÖÓÜ�Ñ�Ø�ÓÒ�ÒÓÚ�Ð�ÔÔÖÓ��ØÓ�ÙØÓÑ�Ø��ÐÐÝ× � Ø�ÓÒ�Ð���×�ÓÒ×Ì��×Ø��×�×�ÒÚ�×Ø���Ø�×�ÚÓÐÙØ�ÓÒ�ÖÝ�ÙÒ �Ò�ÓÖ�Ñ�ÒØÐ��ÖÒ�Ò�ÔÖÓ�Ð�Ñ×�Ö�Ø��×Ù�×�ØÓ�Ø��×�Ø�×� × ÁÒÑ�ÒÝÑ���Ò�Ð��ÖÒ�Ò�ÔÖÓ�Ð�Ñ×�Ò���ÒØÑÙ×ØÐ��ÖÒ Ñ�ÒØ���Ò×Ø�ÒØ��Ø�ÓÒÓ��ÚÓÐÙØ�ÓÒ�ÖÝ�ÙÒØ�ÓÒ�ÔÔÖÓÜ�Ñ � Ù�Ð×Ø��Ø�Ö���ØØ�Ö��Ð�ØÓÐ��ÖÒÁÔÖ�×�ÒØ��ÙÐÐÝ�ÑÔÐ � Ø�Ó ..."
Abstract

Cited by 110 (17 self)
 Add to MetaCart
Ø�ÓÒ�ÔÔÖÓÜ�Ñ�Ø�ÓÒ�ÒÓÚ�Ð�ÔÔÖÓ��ØÓ�ÙØÓÑ�Ø��ÐÐÝ× � Ø�ÓÒ�Ð���×�ÓÒ×Ì��×Ø��×�×�ÒÚ�×Ø���Ø�×�ÚÓÐÙØ�ÓÒ�ÖÝ�ÙÒ �Ò�ÓÖ�Ñ�ÒØÐ��ÖÒ�Ò�ÔÖÓ�Ð�Ñ×�Ö�Ø��×Ù�×�ØÓ�Ø��×�Ø�×� × ÁÒÑ�ÒÝÑ���Ò�Ð��ÖÒ�Ò�ÔÖÓ�Ð�Ñ×�Ò���ÒØÑÙ×ØÐ��ÖÒ Ñ�ÒØ���Ò×Ø�ÒØ��Ø�ÓÒÓ��ÚÓÐÙØ�ÓÒ�ÖÝ�ÙÒØ�ÓÒ�ÔÔÖÓÜ�Ñ � Ù�Ð×Ø��Ø�Ö���ØØ�Ö��Ð�ØÓÐ��ÖÒÁÔÖ�×�ÒØ��ÙÐÐÝ�ÑÔÐ � Ø�ÓÒÛ���ÓÑ��Ò�×Æ��Ì�Ò�ÙÖÓ�ÚÓÐÙØ�ÓÒ�ÖÝÓÔØ�Ñ�Þ � Ð�Ø�Ò��ÙÒØ�ÓÒ�ÔÔÖÓÜ�Ñ�ØÓÖÖ�ÔÖ�×�ÒØ�Ø�ÓÒ×Ø��Ø�Ò��Ð� Ø�ÓÒØ��Ò�ÕÙ�Û�Ø�ÉÐ��ÖÒ�Ò��ÔÓÔÙÐ�ÖÌ�Ñ�Ø�Ó�Ì� � �Æ��ÒØ�Ò��Ú��Ù�ÐÐ��ÖÒ�Ò�Ì��×Ñ�Ø�Ó��ÚÓÐÚ�×�Ò��Ú� � ÓÔØ�Ñ�Þ�Ø�ÓÒ��ÐÐ�ÒØ��×�Ø��ÓÖÝ��Ú�ÐÓÔ�Ò��«�Ø�Ú�Ö��Ò �ÓÖÁÒ×Ø����ØÖ���Ú�×ÓÒÐÝÔÓ×�Ø�Ú��Ò�Ò���Ø�Ú�Ö�Û�Ö� × ÔÖÓ�Ð�Ñ××Ù��×ÖÓ�ÓØÓÒØÖÓÐ��Ñ�ÔÐ�Ý�Ò��Ò�×Ý×Ø�Ñ �ÒÛ���Ø�����ÒØÒ�Ú�Ö×��×�Ü�ÑÔÐ�×Ó�ÓÖÖ�Ø����Ú 1.
The jackknife—a review.
 Biometrika
, 1974
"... The Light Beyond, By Raymond A. Moody, Jr. with Paul Perry. New York, NY: Bantam Books, 1988, 161 pp., $18.95 In his foreword to this book, Andrew Greeley, a prominent priest and sociologist, introduces his comments with the following statement: "Raymond Moody has achieved a rare feat in th ..."
Abstract

Cited by 105 (0 self)
 Add to MetaCart
(Show Context)
The Light Beyond, By Raymond A. Moody, Jr. with Paul Perry. New York, NY: Bantam Books, 1988, 161 pp., $18.95 In his foreword to this book, Andrew Greeley, a prominent priest and sociologist, introduces his comments with the following statement: "Raymond Moody has achieved a rare feat in the quest for human knowledge; he has created a paradigm." He then refers to Thomas Kuhn, who pointed out in The Structure of Scientific Revolutions that scientific revolutions occur when someone creates a new perspective, a new model, a new approach to reality. Although Greeley acknowledges that Moody did not discover the neardeath experience (NDE), he contends that because Moody put a name to it in his previous bestseller Life After Life (1975), he therefore deserves credit for the new para digm that has evolved. Greeley then refers to The Light Beyond as characterized by Moody's "openness, sensitivity and modesty." This he attributes to Moody's acknowledgement that the NDE does not repre sent proof of life after death; rather, it indicates only the existence and widespread prevalence of the NDE. I must question why Greeley does not comment more on the content of the book, and why Moody felt it was appropriate to be credited with creating a new paradigm. During the last fourteen years since Life
Protovalue functions: A laplacian framework for learning representation and control in markov decision processes
 Journal of Machine Learning Research
, 2006
"... This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) A general scheme for constructing representations or basis functions by d ..."
Abstract

Cited by 92 (10 self)
 Add to MetaCart
This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) A general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators (ii) A specific instantiation of this approach where global basis functions called protovalue functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP (iii) A threephased procedure called representation policy iteration comprising of a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions. (iv) A specific instantiation of the RPI framework using leastsquares policy iteration (LSPI) as the parameter estimation method (v) Several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nyström extension for outofsample interpolation of eigenfunctions, and the use of Kronecker sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs (vi) Finally, a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach. Many challenges remain to be addressed in scaling the proposed framework to large MDPs, and several elaboration of the proposed framework are briefly summarized at the end.
Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method
 In 16th European Conference on Machine Learning
, 2005
"... Abstract. This paper introduces NFQ, an algorithm for efficient and effective training of a Qvalue function represented by a multilayer perceptron. Based on the principle of storing and reusing transition experiences, a modelfree, neural network based Reinforcement Learning algorithm is proposed. ..."
Abstract

Cited by 87 (18 self)
 Add to MetaCart
(Show Context)
Abstract. This paper introduces NFQ, an algorithm for efficient and effective training of a Qvalue function represented by a multilayer perceptron. Based on the principle of storing and reusing transition experiences, a modelfree, neural network based Reinforcement Learning algorithm is proposed. The method is evaluated on three benchmark problems. It is shown empirically, that reasonably few interactions with the plant are needed to generate control policies of high quality. 1
Regularization and feature selection in leastsquares temporal difference learning
, 2009
"... We consider the task of reinforcement learning with linear value function approximation. Temporal difference algorithms, and in particular the LeastSquares Temporal Difference (LSTD) algorithm, provide a method for learning the parameters of the value function, but when the number of features is la ..."
Abstract

Cited by 80 (1 self)
 Add to MetaCart
(Show Context)
We consider the task of reinforcement learning with linear value function approximation. Temporal difference algorithms, and in particular the LeastSquares Temporal Difference (LSTD) algorithm, provide a method for learning the parameters of the value function, but when the number of features is large this algorithm can overfit to the data and is computationally expensive. In this paper, we propose a regularization framework for the LSTD algorithm that overcomes these difficulties. In particular, we focus on the case of l1 regularization, which is robust to irrelevant features and also serves as a method for feature selection. Although the l1 regularized LSTD solution cannot be expressed as a convex optimization problem, we present an algorithm similar to the Least Angle Regression (LARS) algorithm that can efficiently compute the optimal solution. Finally, we demonstrate the performance of the algorithm experimentally.
Analyzing feature generation for valuefunction approximation
 In Proceedings of the 24th International Conference on Machine Learning
, 2007
"... We analyze a simple, Bellmanerrorbased approach to generating basis functions for valuefunction approximation. We show that it generates orthogonal basis functions that provably tighten approximation error bounds. We also illustrate the use of this approach in the presence of noise on some sample ..."
Abstract

Cited by 57 (6 self)
 Add to MetaCart
(Show Context)
We analyze a simple, Bellmanerrorbased approach to generating basis functions for valuefunction approximation. We show that it generates orthogonal basis functions that provably tighten approximation error bounds. We also illustrate the use of this approach in the presence of noise on some sample problems. 1.