A tutorial on the cross-entropy method
Annals of Operations Research, 2005
Cited by 104 (15 self)
Abstract: The cross-entropy method is a recent, versatile Monte Carlo technique. This article provides a brief introduction to the cross-entropy method and discusses how it can be used for rare-event probability estimation and for solving combinatorial, continuous, constrained and noisy optimization problems. A comprehensive list of references on cross-entropy methods and applications is included.
Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes
Journal of Machine Learning Research, 2006
Cited by 66 (10 self)
This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) a general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators; (ii) a specific instantiation of this approach where global basis functions called proto-value functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP; (iii) a three-phased procedure called representation policy iteration (RPI), comprising a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions; (iv) a specific instantiation of the RPI framework using least-squares policy iteration (LSPI) as the parameter estimation method; (v) several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nyström extension for out-of-sample interpolation of eigenfunctions, and the use of Kronecker sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs; (vi) finally, a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach. Many challenges remain to be addressed in scaling the proposed framework to large MDPs, and several elaborations of the proposed framework are briefly summarized at the end.
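To make the idea of Laplacian eigenvectors as basis functions concrete, here is a minimal sketch on a hypothetical chain of states (a path graph), which is not an example taken from the paper. For this particular graph the eigenvectors of the combinatorial Laplacian are known in closed form (cosines), so the code only builds the Laplacian and checks that the cosine vectors are eigenvectors. The function names `path_graph_laplacian`, `chain_basis`, and `residual` are illustrative assumptions; on a general state graph the eigenvectors would have to be computed numerically.

```python
import math


def path_graph_laplacian(n):
    """Combinatorial Laplacian L = D - A of an n-state chain (path graph)."""
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        # Undirected edge between neighbouring states i and i + 1.
        lap[i][i] += 1.0
        lap[i + 1][i + 1] += 1.0
        lap[i][i + 1] -= 1.0
        lap[i + 1][i] -= 1.0
    return lap


def chain_basis(n, k):
    """Closed-form k-th Laplacian eigenvector of the chain and its eigenvalue."""
    vec = [math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)]
    eigenvalue = 2.0 - 2.0 * math.cos(math.pi * k / n)
    return vec, eigenvalue


def residual(lap, vec, eigenvalue):
    """Largest entry of |L v - lambda v|; close to zero for a true eigenpair."""
    n = len(vec)
    return max(
        abs(sum(lap[i][j] * vec[j] for j in range(n)) - eigenvalue * vec[i])
        for i in range(n)
    )


if __name__ == "__main__":
    laplacian = path_graph_laplacian(10)
    for k in range(4):
        basis_vector, lam = chain_basis(10, k)
        print(k, round(lam, 4), residual(laplacian, basis_vector, lam))
```

The low-order vectors vary slowly along the chain, which is the sense in which such a basis reflects the large-scale structure of the state graph.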
Automatic basis function construction for approximate dynamic programming and reinforcement learning
In Cohen and Moore (2006), 2006
Cited by 62 (2 self)
We address the problem of automatically constructing basis functions for linear approximation of the value function of a Markov Decision Process (MDP). Our work builds on results by Bertsekas and Castañon (1989), who proposed a method for automatically aggregating states to speed up value iteration. We propose to use neighborhood component analysis (Goldberger et al., 2005), a dimensionality reduction technique created for supervised learning, in order to map a high-dimensional state space to a low-dimensional space, based on the Bellman error, or on the temporal difference (TD) error. We then place basis functions in the lower-dimensional space. These are added as new features for the linear function approximator. This approach is applied to a high-dimensional inventory control problem.
Regularization and feature selection in least-squares temporal difference learning (full version). Available at http://ai.stanford.edu/~kolter
2009
Cited by 48 (1 self)
We consider the task of reinforcement learning with linear value function approximation. Temporal difference algorithms, and in particular the Least-Squares Temporal Difference (LSTD) algorithm, provide a method for learning the parameters of the value function, but when the number of features is large this algorithm can overfit to the data and is computationally expensive. In this paper, we propose a regularization framework for the LSTD algorithm that overcomes these difficulties. In particular, we focus on the case of l1 regularization, which is robust to irrelevant features and also serves as a method for feature selection. Although the l1-regularized LSTD solution cannot be expressed as a convex optimization problem, we present an algorithm similar to the Least Angle Regression (LARS) algorithm that can efficiently compute the optimal solution. Finally, we demonstrate the performance of the algorithm experimentally.
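As a rough illustration of the LSTD estimate that the abstract refers to, the sketch below accumulates the usual LSTD statistics from sampled transitions and solves the resulting linear system. It is not the paper's algorithm: the optional `ridge` term is a simple l2 penalty swapped in for illustration, whereas the paper studies l1 regularization with a LARS-like solver. The names `lstd`, `solve`, and `one_hot`, and the two-state example, are assumptions made for this sketch.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def lstd(transitions, features, gamma, ridge=0.0):
    """Estimate weights w so that V(s) is approximated by w . features(s).

    transitions: list of (state, reward, next_state) samples.
    ridge: l2 penalty added to the diagonal of A (an assumption of this
    sketch; the paper uses l1 regularization instead).
    """
    k = len(features(transitions[0][0]))
    a = [[ridge if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for s, r, s_next in transitions:
        phi, phi_next = features(s), features(s_next)
        for i in range(k):
            b[i] += phi[i] * r
            for j in range(k):
                # A accumulates phi(s) (phi(s) - gamma * phi(s'))^T.
                a[i][j] += phi[i] * (phi[j] - gamma * phi_next[j])
    return solve(a, b)


def one_hot(state):
    """Tabular features for a hypothetical two-state problem."""
    return [1.0, 0.0] if state == 0 else [0.0, 1.0]


if __name__ == "__main__":
    # Deterministic cycle: 0 -> 1 with reward 1, then 1 -> 0 with reward 0.
    samples = [(0, 1.0, 1), (1, 0.0, 0)]
    print(lstd(samples, one_hot, gamma=0.5))
```

With tabular features and no penalty, the weights coincide with the exact state values of this small cycle, which makes the sketch easy to check by hand.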
Analyzing feature generation for value-function approximation
In Proceedings of the 24th International Conference on Machine Learning, 2007
Cited by 43 (5 self)
We analyze a simple, Bellman-error-based approach to generating basis functions for value-function approximation. We show that it generates orthogonal basis functions that provably tighten approximation error bounds. We also illustrate the use of this approach in the presence of noise on some sample problems.
Learning Tetris Using the Noisy Cross-Entropy Method
Neural Computation
Cited by 38 (2 self)
The cross-entropy method is an efficient and general optimization algorithm. However, although it is fast, its applicability in reinforcement learning seems to be limited because it often converges to suboptimal policies. A standard technique for preventing early convergence is to introduce noise. We apply the noisy cross-entropy method to the game of Tetris to demonstrate its efficiency. The resulting policy outperforms previous RL algorithms by almost two orders of magnitude, and reaches over 300,000 points on average. Key words: Tetris, cross-entropy method, reinforcement learning
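The general shape of a cross-entropy search with added noise can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Tetris setup: it assumes an independent Gaussian sampling distribution over a parameter vector, a constant noise term added to the elite variance at every iteration, and a toy quadratic score. The function name `noisy_cross_entropy` and all default values are hypothetical.

```python
import random
import statistics


def noisy_cross_entropy(score, dim, iterations=40, population=100,
                        elite_frac=0.1, noise=0.05, seed=0):
    """Maximize `score` with a Gaussian cross-entropy search plus constant noise."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [5.0] * dim  # hypothetical wide initial spread
    n_elite = max(2, int(population * elite_frac))
    for _ in range(iterations):
        # Sample candidate parameter vectors from the current distribution.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        # Keep the best-scoring fraction as the elite set.
        samples.sort(key=score, reverse=True)
        elite = samples[:n_elite]
        for d in range(dim):
            column = [x[d] for x in elite]
            mean[d] = statistics.fmean(column)
            # The constant extra variance keeps the search from collapsing early.
            std[d] = (statistics.pvariance(column) + noise) ** 0.5
    return mean


if __name__ == "__main__":
    # Toy objective with its maximum at (3, -1).
    target = lambda x: -(x[0] - 3.0) ** 2 - (x[1] + 1.0) ** 2
    print(noisy_cross_entropy(target, dim=2))
```

Setting `noise=0.0` gives the plain update, in which the sampling variance can shrink toward zero before a good region has been found.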
Representation Policy Iteration
2005
Cited by 21 (8 self)
This paper addresses a fundamental issue central to approximation methods for solving large Markov decision processes (MDPs): how to automatically learn the underlying representation for value function approximation? A novel theoretically rigorous framework is proposed that automatically generates geometrically customized orthonormal sets of basis functions, which can be used with any approximate MDP solver like least-squares policy iteration (LSPI). The key innovation is a coordinate-free representation of value functions, using the theory of smooth functions on a Riemannian manifold. Hodge theory yields a constructive method for generating basis functions for approximating value functions based on the eigenfunctions of the self-adjoint (Laplace-Beltrami) operator on manifolds. In effect, this approach performs a global Fourier analysis on the state space graph to approximate value functions, where the basis functions reflect the large-scale topology of the underlying state space. A new class of algorithms called Representation Policy Iteration (RPI) is presented that automatically learns both basis functions and approximately optimal policies. Illustrative experiments compare the performance of RPI with that of LSPI using two hand-coded basis functions (RBF and polynomial state encodings).
Samuel meets Amarel: Automating Value Function Approximation using Global State Space Analysis
2005
Cited by 20 (0 self)
Most work on value function approximation adheres to Samuel’s original design: agents learn a task-specific value function using parameter estimation, where the approximation architecture (e.g., polynomials) is specified by a human designer. This paper proposes a novel framework generalizing Samuel’s paradigm using a coordinate-free approach to value function approximation. Agents learn both representations and value functions by constructing geometrically customized task-independent basis functions that form an orthonormal set for the Hilbert space of smooth functions on the underlying state space manifold. The approach rests on a technical result showing that the space of smooth functions on a (compact) Riemannian manifold has a discrete spectrum associated with the Laplace-Beltrami operator. In the discrete setting, spectral analysis of the graph Laplacian yields a set of geometrically customized basis functions for approximating and decomposing value functions. The proposed framework generalizes Samuel’s value function approximation paradigm by combining it with a formalization of Saul Amarel’s paradigm of representation learning through global state space analysis.
A unifying framework for computational reinforcement learning theory
2009
Cited by 18 (6 self)
Computational learning theory studies mathematical models that allow one to formally analyze and compare the performance of supervised-learning algorithms, for example in terms of their sample complexity. While existing models such as PAC (Probably Approximately Correct) have played an influential role in understanding the nature of supervised learning, they have not been as successful in reinforcement learning (RL). Here, the fundamental barrier is the need for active exploration in sequential decision problems. An RL agent tries to maximize long-term utility by exploiting its knowledge about the problem, but this knowledge has to be acquired by the agent itself through exploration, which may reduce short-term utility. The need for active exploration is common in many problems in daily life, engineering, and the sciences. For example, a Backgammon program strives to make good moves to maximize the probability of winning a game, but sometimes it may try novel and possibly harmful moves to discover how the opponent reacts, in the hope of discovering a better game-playing strategy. It has been known since the early days of RL that a good tradeoff between exploration and exploitation is critical for the agent to learn fast (i.e., to reach near-optimal strategies