## Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes (2006)

Venue: | Journal of Machine Learning Research |

Citations: | 68 - 10 self |

### BibTeX

@ARTICLE{Mahadevan07protovalue,

author = {Sridhar Mahadevan and Mauro Maggioni},

title = {Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes},

journal = {Journal of Machine Learning Research},

year = {2007},

volume = {8},

pages = {2169--2231}

}

### Abstract

This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include:

- (i) a general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators;
- (ii) a specific instantiation of this approach in which global basis functions, called proto-value functions (PVFs), are formed from the eigenvectors of the graph Laplacian on an undirected graph built from state transitions induced by the MDP;
- (iii) a three-phased procedure called representation policy iteration (RPI), comprising a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions;
- (iv) a specific instantiation of the RPI framework using least-squares policy iteration (LSPI) as the parameter estimation method;
- (v) several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nyström extension for out-of-sample interpolation of eigenfunctions, and the use of Kronecker sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs; and
- (vi) a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach.

Many challenges remain in scaling the proposed framework to large MDPs, and several elaborations of the framework are briefly summarized at the end.
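
As a concrete illustration of component (ii), here is a minimal sketch of proto-value function construction on a toy chain of states; the graph, its size, and all variable names are illustrative assumptions, not from the paper:

```python
import numpy as np

# Build an undirected state-transition graph for a 10-state chain MDP.
n = 10
A = np.zeros((n, n))
for i in range(n - 1):               # connect successive states
    A[i, i + 1] = A[i + 1, i] = 1.0

D = np.diag(A.sum(axis=1))
L = D - A                            # combinatorial graph Laplacian

# eigh returns eigenvalues in ascending order; the smoothest (lowest-
# frequency) eigenvectors serve as proto-value functions.
evals, evecs = np.linalg.eigh(L)
pvfs = evecs[:, :4]                  # first 4 proto-value functions
```

The first eigenvalue is zero and its eigenvector is constant on a connected graph, so the basis starts from the globally smoothest function and adds progressively higher-frequency components.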

### Citations

2712 | Normalized cuts and image segmentation - Shi, Malik - 2000 |

2480 | A theory for multiresolution signal decomposition: the wavelet representation - Mallat - 1989 |

Citation Context: ...value function on the space spanned by the eigenvectors of the graph Laplacian, the “spatial” content of a value function is mapped into a “frequency” basis, a hallmark of classical “Fourier” analysis (Mallat, 1989). It has long been recognized that traditional parametric function approximators may have difficulty accurately modeling value functions due to nonlinearities in an MDP’s state space (Dayan, 1993). F... |

2170 | The Elements of Statistical Learning - Hastie, Tibshirani, et al. - 2001 |

Citation Context: ...d his early ideas, principally by combining the mathematical model of Markov decision processes (MDPs) (Puterman, 1994) with theoretical and algorithmic insights from statistics and machine learning (Hastie et al., 2001). This body of research has culminated in the modern fields of approximate dynamic programming (ADP) and reinforcement learning (RL) (Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998). Classica... |

2155 | A Wavelet Tour of Signal Processing - Mallat - 1999 |

Citation Context: ... In the linear least-squares approach, the approximated function $\hat{V}^\pi$ is specified as $\hat{V}^\pi = \sum_{i=1}^{k} \langle V^\pi, \tilde{\phi}_i \rangle \phi_i$ (5). 4. Nonlinear Least-Squares Projection: In the nonlinear least-squares approach (Mallat, 1998), the approximated function $\hat{V}^\pi$ is specified as $\hat{V}^\pi = \sum_{i \in I_k(V^\pi)} \langle V^\pi, \tilde{\phi}_i \rangle \phi_i$ (6), where $I_k(V^\pi)$ is the set of indices of the $k$ basis functions with the largest inner product (in absolute valu... |
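
The two projections in Eqs. (5) and (6) can be sketched numerically; the path-graph Laplacian eigenbasis and the target function below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

n, k = 50, 5
A = np.diag(np.ones(n - 1), 1); A += A.T          # path-graph adjacency
L = np.diag(A.sum(1)) - A
_, Phi = np.linalg.eigh(L)                        # orthonormal basis (columns)
V = np.sin(np.linspace(0, 3 * np.pi, n)) ** 2     # toy target "value function"

c = Phi.T @ V                                     # inner products <V, phi_i>
V_lin = Phi[:, :k] @ c[:k]                        # Eq. (5): first k basis functions
top = np.argsort(-np.abs(c))[:k]                  # I_k(V): largest |<V, phi_i>|
V_nonlin = Phi[:, top] @ c[top]                   # Eq. (6): best-k (nonlinear) projection

err_lin = np.linalg.norm(V - V_lin)
err_non = np.linalg.norm(V - V_nonlin)
```

Because the basis is orthonormal, the squared error equals the energy of the dropped coefficients, so keeping the k largest coefficients (the nonlinear projection) can never do worse than keeping the first k.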

2114 | Robot Motion Planning - Latombe - 1991 |

Citation Context: ...analysis to manifolds. Historically, manifolds have been applied to many problems in AI, for example configuration space planning in robotics, but these problems assume a model of the manifold is known (Latombe, 1991; Lavalle, 2005), unlike here where only samples of a manifold are given. Recently, there has been rapidly growing interest in manifold learning methods, including ISOMAP (Tenenbaum et al., 2000), LLE... |

1752 | A global geometric framework for nonlinear dimensionality reduction - Tenenbaum, Silva, et al. - 2000 |

1686 | Nonlinear dimensionality reduction by locally linear embedding - Roweis, Saul |

1358 | Learning from delayed rewards - Watkins - 1989 |

Citation Context: ...whether representation learning by spectral analysis of the graph Laplacian can be done incrementally, making it easier to combine with incremental parameter estimation techniques such as Q-learning (Watkins, 1989). Decentralized spectral algorithms for computing the top k eigenvectors of a symmetric matrix have been developed recently (Kempe and McSherry, 2004), which rely on gossip algorithms for computing a... |

1267 | Learning to predict by the methods of temporal differences - Sutton - 1988 |

1157 | On spectral clustering: Analysis and an algorithm - Ng, Jordan, et al. - 2002 |

Citation Context: ...Fiedler eigenvector (the eigenvector associated with the smallest non-zero eigenvalue), such as its sensitivity to bottlenecks, in order to find clusters in data or segment images (Shi and Malik, 2000; Ng et al., 2002). To formally explain this, we briefly review spectral geometry. The Cheeger constant $h_G$ of a graph $G$ is defined as $h_G(S) = \min_S$... |
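
The bottleneck sensitivity of the Fiedler eigenvector can be checked on a toy graph: two cliques joined by a single bridge edge (an illustrative construction, not from the paper). The sign of the Fiedler vector separates the two sides of the bottleneck:

```python
import numpy as np

n = 6
A = np.zeros((2 * n, 2 * n))
A[:n, :n] = 1.0                            # first clique
A[n:, n:] = 1.0                            # second clique
np.fill_diagonal(A, 0.0)
A[n - 1, n] = A[n, n - 1] = 1.0            # single bottleneck edge

L = np.diag(A.sum(1)) - A
evals, evecs = np.linalg.eigh(L)
fiedler = evecs[:, 1]                       # eigenvector of smallest nonzero eigenvalue
labels = fiedler > 0                        # sign pattern gives the bipartition
```

A low Fiedler value signals an easy cut; here the second eigenvalue is far below the clique eigenvalues, reflecting the single-edge bottleneck.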

887 | Reinforcement learning - Sutton, Barto - 1998 |

Citation Context: ...to novel states. These sections also contain a detailed experimental analysis of RPI on large MDPs, including the blockers task (Sallans and Hinton, 2004), the inverted pendulum and the mountain car (Sutton and Barto, 1998) continuous control tasks. Finally, Section 10 discusses several extensions of the proposed framework to new areas. 2. History and Related Work 2.1 Value Function Approximation In the 1950s, Samuel (... |

824 | A fast and high quality multilevel scheme for partitioning irregular graphs - Karypis, Kumar - 1998 |

Citation Context: ...method, Laplacian bases can be constructed quite efficiently (Drineas and Mahoney, 2005). Finally, a variety of other techniques can be used to sparsify Laplacian bases, including graph partitioning (Karypis and Kumar, 1999), matrix sparsification (Achlioptas et al., 2002), and automatic Kronecker matrix factorization (Van Loan and Pitsianis, 1993). Other sources of information can be additionally exploited to facilitat... |

775 | Nonlinear Programming (Athena Scientific, 2nd edition) - Bertsekas - 1999 |

Citation Context: ... to represent value functions exactly on large discrete state spaces, or in continuous spaces. Consequently, there has been much study of approximation architectures for representing value functions (Bertsekas and Tsitsiklis, 1996). Value functions generally exhibit two key properties: they are typically smooth, and they reflect the underlying state space geometry. A fundamental contribution of this paper is the use of an appr... |

677 | Planning Algorithms - LaValle - 2006 |

Citation Context: ...manifolds. Historically, manifolds have been applied to many problems in AI, for example configuration space planning in robotics, but these problems assume a model of the manifold is known (Latombe, 1991; Lavalle, 2006), unlike here where only samples of a manifold are given. 6.1 Nyström Extension To learn policies on continuous MDPs, it is necessary to be able to extend eigenfunctions computed on a set of points ∈... |

633 | Some studies in machine learning using the game of checkers. II - Recent progress - Samuel - 1967 |

631 | Matrix Perturbation Theory - Stewart, Sun - 1990 |

Citation Context: ...d using 100 basis functions. Why does the pure tensor approach perform so well even in irregular structured MDPs, as shown above? A detailed analysis requires the use of matrix perturbation analysis (Stewart and Sun, 1990), which is beyond the scope of this paper. However, Figure 19 provides some insight. In the blockers task, the state space topologically can be modeled as the product of cylinders (when the grid has ... |

597 | Finite Markov decision processes - Puterman - 1994 |

Citation Context: ..., reinforcement learning, value function approximation, manifold learning, spectral graph theory. 1. Introduction This paper introduces a novel framework for solving Markov decision processes (MDPs) (Puterman, 1994), by simultaneously learning both the underlying representation or basis functions and (approximate) optimal policies. The framework is based on a new type of basis representation for approximating v... |

435 | Decision-theoretic planning: Structural assumptions and computational leverage - Boutilier, Dean, et al. - 1999 |

Citation Context: ...y computing and compactly storing proto-value functions. Many RL domains lead to factored representations where the state space is generated as the Cartesian product of the values of state variables (Boutilier et al., 1999). Consider a hypercube Markov decision process with d dimensions, where each dimension can take on k values. The size of the resulting state space is $O(k^d)$, and the size of each proto-value functio... |
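
The Kronecker sum factorization mentioned in the abstract exploits the fact that the Laplacian of a Cartesian product graph is the Kronecker sum of the factor Laplacians, so product-space eigenpairs come from factor eigenpairs without ever diagonalizing the large matrix. A hedged sketch on two small path graphs (sizes and helper names are illustrative):

```python
import numpy as np

def path_laplacian(n):
    """Combinatorial Laplacian of an n-state path graph."""
    A = np.diag(np.ones(n - 1), 1); A += A.T
    return np.diag(A.sum(1)) - A

L1, L2 = path_laplacian(4), path_laplacian(3)

# Kronecker sum: Laplacian of the 12-state product space, formed here
# only to verify the factorization.
L = np.kron(L1, np.eye(3)) + np.kron(np.eye(4), L2)

w1, V1 = np.linalg.eigh(L1)
w2, V2 = np.linalg.eigh(L2)

# Product-space eigenpair from factor eigenpairs: eigenvalues add,
# eigenvectors are Kronecker products.
lam = w1[1] + w2[1]
phi = np.kron(V1[:, 1], V2[:, 1])
```

Only the small factor eigenproblems need to be solved and stored, which is the source of the compactness in factored MDPs.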

430 | Algebraic connectivity of graphs - Fiedler - 1973 |

Citation Context: ...global properties of the graph, such as volume, “dimension”, bottlenecks and mixing times of random walks. The latter are connected with the first non-zero eigenvalue $\lambda_1$, often called the Fiedler value (Fiedler, 1973). The lower the Fiedler value, the easier it is to partition the graph into components without breaking too many edges. 5.3 Function Approximation on Graphs and Manifolds We ... |

426 | Learning with Kernels: Support Vector Machines - Schölkopf, Smola - 2002 |

Citation Context: ...ral the whole graph. This 4. The graph Laplacian induces a smoothing prior on the space of functions of a graph that can formally be shown to define a data-dependent reproducing kernel Hilbert space (Scholkopf and Smola, 2001). [Figure: Two-Room MDP using unit vectors; total states = 100, “wall” states = 47] ... |

375 | Practical issues in temporal difference learning - Tesauro - 1992 |

Citation Context: ...6) provide an authoritative review. Parametric approaches using linear architectures, such as radial basis functions (Lagoudakis and Parr, 2003), and nonlinear architectures, such as neural networks (Tesauro, 1992), have been extensively explored. However, most approaches (with notable exceptions discussed below) are based on a fixed parametric architecture, and a parameter estimation method is used to approxi... |

345 | Manifold regularization: A geometric framework for learning from labeled and unlabeled examples - Belkin, Niyogi, et al. |

311 | Least-squares policy iteration - Lagoudakis, Parr |

297 | Using the Nyström method to speed up kernel machines - Williams, Seeger - 2001 |

Citation Context: ...and stored on sampled real-valued states, and hence must be interpolated to novel states. We apply the Nyström interpolation method. While this approach has been studied previously in kernel methods (Williams and Seeger, 2000) and spectral clustering (Belongie et al., 2002), our work represents the first detailed study of the Nyström method for learning control, as well as a detailed comparison of... |

252 | Temporal credit assignment in reinforcement learning - Sutton - 1984 |

Citation Context: ...polynomial approximator so that values of states earlier in a game reflected outcomes experienced later during actual play (a heuristic form of what is now formally called temporal difference learning (Sutton, 1984, 1988)). Board states s were translated into k-dimensional feature vectors $\phi(s)$, where the features or basis functions $\phi: S \to \mathbb{R}^k$ were hand engineered (for example, a feature... |

230 | An analysis of temporal-difference learning with function approximation - Tsitsiklis, Roy - 1996 |

218 | Stable function approximation in dynamic programming - Gordon - 1995 |

Citation Context: ...Boyan, 1999; Nedic and Bertsekas, 2003; Lagoudakis and Parr, 2003). There has also been significant work on non-parametric methods for approximating value functions, including nearest neighbor methods (Gordon, 1995) and kernel density estimation (Ormoneit and Sen, 2002). Although our approach is also nonparametric, it differs from kernel density estimation and nearest neighbor techniques by extracting a distanc... |

209 | A lower bound for the smallest eigenvalue of the Laplacian - Cheeger - 1970 |

Citation Context: ...spaces. From the geometric point of view, it is well-known from the study of the Laplacian on a manifold that its eigenfunctions capture intrinsic properties like “bottlenecks” (Cheeger, 1970). In the discrete setting, these ideas find a natural counterpart: the Cheeger constant (Chung, 1997) quantifies the decomposability of graphs and can be shown to be intimately connected to the spect... |

207 | Self-tuning spectral clustering - Zelnik-Manor, Perona |

Citation Context: ...specified; (W3) $W(i,j) = \alpha(i)\, e^{-\frac{\|x_i - x_j\|^2_{\mathbb{R}^d}}{\|x_i - x_{k(i)}\|\,\|x_j - x_{k(j)}\|}}$, where $k > 0$ is a parameter and $x_{k(i)}$ is the $k$-th nearest neighbor of $x_i$ (this is called the self-tuning Laplacian (Zelnik-Manor and Perona, 2004)). Again α is a weight function to be specified. Observe that in case (E2) the graph is in general not undirected, since $x_j$ can be among the K nearest neighbors of $x_i$ but $x_i$ may not be among the K neare... |
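
A minimal sketch of the self-tuning weight (W3), assuming α(i) = 1 and k = 2; the sample points and parameters are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 2))                     # sample points in R^2

# Pairwise squared Euclidean distances.
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

# sigma_i = distance from x_i to its k-th nearest neighbor (k = 2);
# sorted column 0 is the zero self-distance, so column k is the k-th neighbor.
sigma = np.sqrt(np.sort(D2, axis=1)[:, 2])

# Self-tuning affinity: each pair is scaled by its own local bandwidths.
W = np.exp(-D2 / (sigma[:, None] * sigma[None, :]))
np.fill_diagonal(W, 0.0)
```

The per-point bandwidths make the weights adapt to local sampling density, which is the point of the self-tuning construction.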

190 | Diffusion kernels on graphs and other discrete input spaces - Kondor, Lafferty - 2002 |

Citation Context: ...An agent explores the underlying state space by carrying out actions using some policy, say a random walk. Central to the proposed framework is the notion of a diffusion model (Coifman et al., 2005a; Kondor and Lafferty, 2002): the agent constructs a (directed or undirected) graph connecting states that are “nearby”. In the simplest setting, the diffusion model is defined by the combinatorial graph Laplacian matrix L = D ... |

186 | Linear Least-Squares Algorithms for Temporal Difference Learning - Bradtke, Barto - 1996 |

Citation Context: ...distribution $\rho^\pi$. We know the Bellman “backup” operator T defined above has a fixed point $V^\pi = T(V^\pi)$. Many standard parameter estimation methods, including LSPI (Lagoudakis and Parr, 2003) and LSTD (Bradtke and Barto, 1996), can be viewed as finding an approximate fixed point of the operator T: $\hat{V}^\pi = \Phi w^\pi = M^\pi_\phi(T(\Phi w^\pi))$. It can be shown that the operator T is a contraction mapping, where $\|T V_1 - T V_2\|_\pi \le \gamma \|V_1 - V_2\|_\pi$ ... |
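
A hedged, model-based sketch of this fixed-point computation: an LSTD-style solve with a known transition matrix and a hypothetical polynomial basis (not the paper's sampled setting). The defining property of the solution is that the Bellman residual is orthogonal to the span of the basis:

```python
import numpy as np

n, gamma = 6, 0.9
P = np.eye(n, k=1); P[-1, 0] = 1.0       # deterministic cyclic chain: i -> i+1
R = np.zeros(n); R[2] = 1.0              # reward in one state (illustrative)

# Hypothetical basis: 3 polynomial features over a 1-D state embedding.
Phi = np.vander(np.linspace(0, 1, n), 3, increasing=True)

# Model-based LSTD solve: w = (Phi^T (Phi - gamma P Phi))^{-1} Phi^T R.
A = Phi.T @ (Phi - gamma * (P @ Phi))
b = Phi.T @ R
w = np.linalg.solve(A, b)
V_hat = Phi @ w                           # approximate fixed point in span(Phi)
```

By construction, Φᵀ(R + γP V̂ − V̂) = b − Aw = 0, i.e. V̂ is the projected fixed point of the Bellman operator restricted to the subspace spanned by the basis.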

178 | Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems 13:341–379 - Barto, Mahadevan - 2003 |

Citation Context: ...9.9 Proto-Value Functions for Semi-Markov Decision Processes: Proto-value functions provide a way of constructing function approximators for hierarchical reinforcement learning (Barto and Mahadevan, 2003), as well as form a theoretical foundation for some recent attempts to automate the learning of task structure in hierarchical reinforcement learning, by discovering “symmetries” or “bottlenecks” (Mc... |

160 | Semi-supervised learning on Riemannian manifolds - Belkin, Niyogi - 2004 |

158 | Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps - Coifman, Lafon, et al. - 2005 |

Citation Context: ...builds on recent work on manifold and spectral learning, including ISOMAP (Tenenbaum et al., 2000), LLE (Roweis and Saul, 2000), Laplacian eigenmaps (Belkin and Niyogi, 2004), and diffusion geometries (Coifman et al., 2005a,b,c). One major difference is that these methods have largely (but not exclusively) been applied to nonlinear dimensionality reduction and semi-supervised learning on graphs ... |

149 | The linear programming approach to approximate dynamic programming - Farias, Roy |

144 | Fast Monte Carlo algorithms for matrices II: Computing a low rank approximation to a matrix - Drineas, Mahoney, et al. |

Citation Context: ...row or column). There are guarantees that these algorithms will select with high probability a set of rows whose span is close to that of the top singular vectors: see for example (Frieze et al., 1998; Drineas et al., 2004; Drineas and Mahoney, 2005). The Nyström method is applied to the approximation of the eigenfunctions of the graph Laplacian $L\phi_i = \lambda_i \phi_i$ by letting $F_1$ be the matrix with k eigenfunctions as columns: e... |
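
A hedged sketch of the Nyström extension, using a Gaussian kernel Gram matrix as a stand-in for the paper's Laplacian setting (kernel choice, bandwidth, and data are illustrative assumptions): an eigenvector computed on the samples extends to a new point via a weighted sum of kernel evaluations:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 2))                 # sampled states in R^2

def kern(a, b, s=1.0):
    """Gaussian kernel, broadcasting over leading axes."""
    return np.exp(-((a - b) ** 2).sum(-1) / (2 * s * s))

K = kern(X[:, None, :], X[None, :, :])           # 30x30 Gram matrix
lam, U = np.linalg.eigh(K)
u, l = U[:, -1], lam[-1]                         # top eigenpair (K u = l u)

def nystrom(x):
    """Extend the eigenvector u to a novel point x."""
    return kern(x[None, :], X) @ u / l
```

At a sample point the extension reproduces the stored eigenvector entry exactly, since k(x_i, ·)·u / λ = (Ku)_i / λ = u_i; at novel points it interpolates smoothly between samples.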

134 | Efficient solution algorithms for factored MDPs - Guestrin, Koller, et al. - 2003 |

119 | On representations of problems of reasoning about actions - Amarel - 1968 |

118 | Spectra of Graphs – Theory and Application - Cvetković, Doob, et al. - 1980 |

Citation Context: ...increasingly finding more applications in AI, from image segmentation (Shi and Malik, 2000) to clustering (Ng et al., 2002). The Laplace-Beltrami operator now becomes the graph Laplacian (Chung, 1997; Cvetkovic et al., 1980), from which an orthonormal set of basis functions $\phi^G_1(s), \ldots, \phi^G_k(s)$ can be extracted. The graph Laplacian can be defined in several ways, such as the combinatorial Laplacian and the normal... |

118 | Introduction to Smooth Manifolds - Lee - 2003 |

Citation Context: ..., we provide a brief overview of the mathematics underlying proto-value functions, in particular, spectral graph theory (Chung, 1997) and its continuous counterpart, analysis on Riemannian manifolds (Lee, 2003). 5.1 Riemannian Manifolds This section introduces the Laplace-Beltrami operator in the general setting of Riemannian manifolds (Rosenberg, 1997), as a prelude to describing the Laplace-Beltrami oper... |

118 | The Laplacian on a Riemannian Manifold - Rosenberg - 1997 |

Citation Context: ...y in the “Fourier” tradition, by diagonalizing and using the eigenfunctions of a symmetric operator called the Laplacian (associated with a heat diffusion equation) on a graph or Riemannian manifold (Rosenberg, 1997). In the second paper (Maggioni and Mahadevan, 2006), this manifold framework is extended to a multi-resolution approach, where basis functions are constructed using the newly developed framework of ... |

113 | On the Nyström method for approximating a Gram matrix for improved kernel-based learning - Drineas, Mahoney |

108 | Kernel-based reinforcement learning - Ormoneit, Sen - 2002 |

Citation Context: ...Lagoudakis and Parr, 2003). There has also been significant work on non-parametric methods for approximating value functions, including nearest neighbor methods (Gordon, 1995) and kernel density estimation (Ormoneit and Sen, 2002). Although our approach is also nonparametric, it differs from kernel density estimation and nearest neighbor techniques by extracting a distance measure through modeling the underlying graph or mani... |

107 | Towards a theoretical foundation for Laplacian-based manifold methods - Belkin, Niyogi |

Citation Context: ...a Riemannian manifold (Rosenberg, 1997). The convergence of the discrete Laplacian to the continuous Laplacian on the underlying manifold under uniform sampling conditions has been shown recently in (Belkin and Niyogi, 2005). The combinatorial Laplacian L = D − A acts on a function f as $Lf(u) = \sum_{v \sim u} w_{uv}\,(f(u) - f(v))$. Unlike the adjacency matrix operator, the combinatorial Laplacian acts as a difference operator. More i... |
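
The difference-operator identity for the combinatorial Laplacian can be verified directly on a small weighted graph (the random weights below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((5, 5))
A = np.triu(A, 1); A += A.T          # symmetric edge weights, zero diagonal
D = np.diag(A.sum(1))
L = D - A                            # combinatorial Laplacian
f = rng.standard_normal(5)           # arbitrary function on the vertices

# Lf(u) = sum over neighbors v of w_uv * (f(u) - f(v))
Lf = np.array([sum(A[u, v] * (f[u] - f[v]) for v in range(5))
               for u in range(5)])
```

The identity also makes the null space obvious: on a constant function every difference f(u) − f(v) vanishes, so L annihilates constants.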

105 | A natural policy gradient - Kakade |

Citation Context: ...to design new types of kernels for supervised machine learning (Lafferty and Lebanon, 2005) and faster policy gradient methods using the natural Riemannian gradient on a space of parametric policies (Kakade, 2002; Bagnell and Schneider, 2003; Peters et al., 2003). In recent work on manifold learning, Belkin and Niyogi (2004) have studied semi-supervised learning in Riemannian manifolds, where a large set of u... |

101 | Engineering Applications of Noncommutative Harmonic Analysis: With Emphasis on Rotation and Motion Groups - Chirikjian, Kyatkin - 2000 |

99 | Least-squares temporal difference learning - Boyan - 1999 |

Citation Context: ...(LSPI) (Lagoudakis and Parr, 2003), approximate dynamic programming using linear programming (Farias, 2003; Guestrin et al., 2003), and least-squares temporal-difference learning (Bradtke and Barto, 1996; Boyan, 1999; Nedic and Bertsekas, 2003). These approximate methods can be viewed as projecting the exact value function onto a subspace spanned by a set of basis functions. The majority of this research has assu... |

96 | The Numerical Treatment of Integral Equations - Baker - 1977 |

Citation Context: ... The Nyström method interpolates the value of eigenvectors computed on sample states to novel states, and is an application of a classical method used in the numerical solution of integral equations (Baker, 1977). It can be viewed as a technique for approximating a semi-positive definite matrix from a low-rank approximation. In this context it can be related to randomized algorithms for low-rank approximatio... |

94 | Reinforcement learning for humanoid robotics - Peters, Vijayakumar, et al. - 2003 |

Citation Context: ...supervised machine learning (Lafferty and Lebanon, 2005) and faster policy gradient methods using the natural Riemannian gradient on a space of parametric policies (Kakade, 2002; Bagnell and Schneider, 2003; Peters et al., 2003). In recent work on manifold learning, Belkin and Niyogi (2004) have studied semi-supervised learning in Riemannian manifolds, where a large set of unlabeled points are used to extract a representati... |

90 | Spectral Graph Theory. Number 92 - Chung - 1997 |

Citation Context: ...discrete setting of graphs, spectral analysis of the graph Laplacian operator provides an orthonormal set of basis functions on the graph for approximating any (square-integrable) function on the graph (Chung, 1997). This research builds on recent work on manifold and spectral learning (Tenenbaum et al., 2000; Roweis and Saul, 2000; Belkin and Niyogi, 2004). In particular, Belkin and Niyogi (2004) pioneered the... |