## Least-Squares Policy Iteration (2003)

Venue: Journal of Machine Learning Research

Citations: 298 (9 self)

### BibTeX

@ARTICLE{Lagoudakis03least-squarespolicy,
    author = {Michail G. Lagoudakis and Ronald Parr},
    title = {Least-Squares Policy Iteration},
    journal = {Journal of Machine Learning Research},
    year = {2003},
    volume = {4},
    pages = {1107--1149}
}

### Abstract

We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration.

### Citations

3763 | Reinforcement Learning: An Introduction
- Sutton, Barto
- 1998
Citation Context ...ces of evidence that the least-squares fixed-point method is preferable to the Bellman residual minimizing method: • Learning the Bellman residual minimizing approximation requires "doubled" samples (Sutton and Barto, 1998) that can only be collected from a generative model of the MDP. • Experimentally, the least-squares fixed-point approximation often delivers policies that are superior to the ones found using the Bel...

1319 | Learning from delayed rewards
- Watkins
- 1989
Citation Context ...ion. In both cases, LSPI learns to control the pendulum or the bicycle by merely observing a relatively small number of trials where actions are selected randomly. We also compare LSPI to Q-learning (Watkins, 1989), both with and without experience replay (Lin, 1993), using the same value function architecture. While LSPI achieves good performance fairly consistently on the difficult bicycle task, Q-learning var...

516 | Dynamic Programming and Markov Processes
- Howard
- 1960
Citation Context ...state. It is, therefore, sufficient to restrict the search for the optimal policy only within the space of deterministic policies. [3. Policy Iteration and Approximate Policy Iteration] Policy iteration (Howard, 1960) is a method of discovering the optimal policy for any given MDP. Policy iteration is an iterative procedure in the space of deterministic policies; it discovers the optimal policy by generating a se...
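
The iterative procedure described in this context can be sketched on a tiny tabular MDP. This is a minimal illustration rather than code from the paper: the two-state MDP, the array layouts `P[s, a, s2]` and `R[s, a]`, and the function name `policy_iteration` are all assumptions made for the example.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Exact policy iteration over deterministic policies (illustrative sketch).

    P[s, a, s2] is a transition probability and R[s, a] an expected reward;
    both layouts are assumptions of this example.
    """
    n_states = P.shape[0]
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[states, policy]
        R_pi = R[states, policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to Q.
        Q = R + gamma * (P @ V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

# Hypothetical 2-state, 2-action MDP: being in state 1 pays a reward of 1.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0  # action 0 always leads to state 0
P[:, 1, 1] = 1.0  # action 1 always leads to state 1
R = np.array([[0.0, 0.0], [1.0, 1.0]])
policy, V = policy_iteration(P, R)
```

On this toy problem the loop stops after two evaluations, once the greedy policy no longer changes.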

472 | Neuronlike adaptive elements that can solve difficult learning control problems
- Barto, Sutton, et al.
- 1983
Citation Context ...r is responsible for the way the agent acts and the critic is responsible for criticizing the way the agent acts. Hence, policy-iteration algorithms are also referred to as actor-critic architectures (Barto et al., 1983; Sutton, 1984). Figure 1 shows a block diagram of policy iteration (or an actor-critic architecture) and the dependencies among the various components. The guaranteed convergence of policy iteration ...

354 | Generalization in reinforcement learning: successful examples using sparse coarse coding
- Sutton
- 1995
Citation Context ...t needs no access to a model to perform policy iteration, nor does it need to learn one. Traditional reinforcement-learning algorithms for control, such as SARSA learning (Rummery and Niranjan, 1994; Sutton, 1996) and Q-learning (Watkins, 1989), lack any stability or convergence guarantees when combined with most forms of value-function approximation. In many cases, their learned approximations may even diver...

319 | Policy Gradient Methods for Reinforcement Learning with Function Approximation
- Sutton, McAllester, et al.
- 2000
Citation Context ...ate this matrix to standard notions of a model. In contrast to the variety of direct policy learning methods (Ng et al., 2000; Ng and Jordan, 2000; Baxter and Bartlett, 2001; Sutton et al., 2000; Konda and Tsitsiklis, 2000), LSPI offers the strength of policy iteration. Policy search methods typically make a large number of relatively small steps of gradient-based policy updates to a paramete...

243 | Temporal credit assignment in reinforcement learning, Doctoral dissertation
- Sutton
- 1984
Citation Context ... the way the agent acts and the critic is responsible for criticizing the way the agent acts. Hence, policy-iteration algorithms are also referred to as actor-critic architectures (Barto et al., 1983; Sutton, 1984). Figure 1 shows a block diagram of policy iteration (or an actor-critic architecture) and the dependencies among the various components. The guaranteed convergence of policy iteration to the optimal...

207 | Pegasus: A policy search method for large MDPs and POMDPs
- Ng, Jordan
- 2000
Citation Context ...x, further investigation will be required to relate this matrix to standard notions of a model. In contrast to the variety of direct policy learning methods (Ng et al., 2000; Ng and Jordan, 2000; Baxter and Bartlett, 2001; Sutton et al., 2000; Konda and Tsitsiklis, 2000), LSPI offers the strength of policy iteration. Policy search methods typically make a large number of relatively small step...

188 | Reinforcement Learning for Robots using neural networks
- Lin
- 1993
Citation Context ...r the bicycle by merely observing a relatively small number of trials where actions are selected randomly. We also compare LSPI to Q-learning (Watkins, 1989), both with and without experience replay (Lin, 1993), using the same value function architecture. While LSPI achieves good performance fairly consistently on the difficult bicycle task, Q-learning variants were rarely able to balance the bicycle for mor...

179 | Linear Least-Squares algorithms for temporal difference learning
- Bradtke, Barto
- 1996
Citation Context ...vely easy to get some insight into why the failure has occurred. Our enthusiasm for the approach presented in this paper is inspired by the least-squares temporal-difference learning algorithm (LSTD) (Bradtke and Barto, 1996). The LSTD algorithm is ideal for prediction problems, that is, problems where we are interested in learning the value function of a fixed policy. LSTD makes efficient use of data and converges faster ...

175 | Actor-critic algorithms
- Konda, Tsitsiklis
- 1999
Citation Context ...andard notions of a model. In contrast to the variety of direct policy learning methods (Ng et al., 2000; Ng and Jordan, 2000; Baxter and Bartlett, 2001; Sutton et al., 2000; Konda and Tsitsiklis, 2000), LSPI offers the strength of policy iteration. Policy search methods typically make a large number of relatively small steps of gradient-based policy updates to a parameterized policy function. Our u...

175 | Policy invariance under reward transformations: Theory and application to reward shaping
- Ng, Harada, et al.
- 1999
Citation Context ...cies because of the noise in the input. A number of design decisions influenced the performance of LSPI on the bicycle balancing and riding problem. As is typical with this problem, a shaping reward (Ng et al., 1999) for the distance to the goal was used. In particular, the shaping reward was 1% of the net change (in meters) in the distance to the goal. It was also observed that, in full random trajectories, mos...

102 | Kernel-based reinforcement learning
- Ormoneit, Sen

97 | Generalized polynomial approximations in Markovian decision processes
- Schweitzer, Seidmann
- 1985
Citation Context ...ries (s, a), the weighted least-squares solution of the system would be: $w^{\pi} = \left( (\Phi - \gamma P \Pi_{\pi} \Phi)^{\top} \Delta_{\mu} (\Phi - \gamma P \Pi_{\pi} \Phi) \right)^{-1} (\Phi - \gamma P \Pi_{\pi} \Phi)^{\top} \Delta_{\mu} R$. The Bellman residual minimizing approach has been proposed (Schweitzer and Seidmann, 1985) as a means of computing approximate state value functions from the model of the process. [5.2 Least-Squares Fixed-Point Approximation] Recall that the state-action value function $Q^{\pi}$ is also the fixed...
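
The weighted least-squares solution quoted in this context, w = ((Φ − γPΠΦ)ᵀ Δ (Φ − γPΠΦ))⁻¹ (Φ − γPΠΦ)ᵀ Δ R, can be evaluated directly when a model is available. The sketch below is an illustration under assumptions: the function name `bellman_residual_weights`, the argument `P_pi` (standing in for the product of the transition matrix and the policy matrix), and the two-entry toy model are hypothetical and not taken from the paper.

```python
import numpy as np

def bellman_residual_weights(Phi, P_pi, R, gamma, mu=None):
    """Weighted least-squares weights that minimize the Bellman residual.

    Phi  : (n, k) basis matrix, one row per state-action pair
    P_pi : (n, n) matrix standing in for the product P @ Pi_pi (assumed given)
    R    : (n,) expected rewards
    mu   : (n,) positive weights; uniform weights are used if omitted
    """
    n = Phi.shape[0]
    D = np.diag(np.full(n, 1.0 / n) if mu is None else np.asarray(mu))
    M = Phi - gamma * P_pi @ Phi
    return np.linalg.solve(M.T @ D @ M, M.T @ D @ R)

# Hypothetical model with two state-action pairs and a complete (identity) basis.
gamma = 0.9
Phi = np.eye(2)
P_pi = np.array([[0.5, 0.5], [0.0, 1.0]])
R = np.array([0.0, 1.0])
w = bellman_residual_weights(Phi, P_pi, R, gamma)
exact_q = np.linalg.solve(np.eye(2) - gamma * P_pi, R)
```

With a complete basis the residual can be driven to zero, so `w` coincides with the exact values in `exact_q`; with fewer basis functions than state-action pairs the two generally differ.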

88 | Technical Update: Least-Squares Temporal Difference Learning
- Boyan
- 1999

77 | An approach to fuzzy control of nonlinear systems: Stability and design issues
- Wang, Tanaka, et al.
- 1996
Citation Context ...on. The state space of the problem is continuous and consists of the vertical angle $\theta$ and the angular velocity $\dot{\theta}$ of the pendulum. The transitions are governed by the nonlinear dynamics of the system (Wang et al., 1996) and depend on the current state and the current (noisy) control u: $\ddot{\theta} = \dfrac{g \sin(\theta) - \alpha m l (\dot{\theta})^{2} \sin(2\theta)/2 - \alpha \cos(\theta)\, u}{4l/3 - \alpha m l \cos^{2}(\theta)}$, [Footnote 11:] To generate the necessary "doubled" samples, for each sam...
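
The pendulum dynamics in this context give the angular acceleration as a ratio: the numerator is g·sin(θ) − α·m·l·(θ̇)²·sin(2θ)/2 − α·cos(θ)·u and the denominator is 4l/3 − α·m·l·cos²(θ). A small function makes the structure easier to check. The excerpt does not state the physical constants, so the values below (and the definition of `ALPHA` as the reciprocal of the total mass) are assumptions chosen for illustration, as are all identifier names.

```python
import math

# Assumed constants for illustration only; the excerpt does not give them.
G = 9.8         # gravitational acceleration (m/s^2)
M_PEND = 2.0    # pendulum mass m (kg)
M_CART = 8.0    # cart mass (kg)
LENGTH = 0.5    # pendulum length l (m)
ALPHA = 1.0 / (M_PEND + M_CART)  # assumed definition of alpha

def angular_acceleration(theta, theta_dot, u):
    """Angular acceleration of the pendulum for state (theta, theta_dot) and control u."""
    numerator = (
        G * math.sin(theta)
        - ALPHA * M_PEND * LENGTH * theta_dot ** 2 * math.sin(2 * theta) / 2
        - ALPHA * math.cos(theta) * u
    )
    denominator = 4 * LENGTH / 3 - ALPHA * M_PEND * LENGTH * math.cos(theta) ** 2
    return numerator / denominator

upright = angular_acceleration(0.0, 0.0, 0.0)   # balanced, no control
tilted = angular_acceleration(0.1, 0.0, 0.0)    # small tilt, no control
pushed = angular_acceleration(0.0, 0.0, 10.0)   # upright, positive control
```

Under these assumed constants the upright, uncontrolled pendulum has zero acceleration, a small positive tilt accelerates away from vertical, and a positive control produces a negative acceleration.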

72 | Policy Iteration for Factored MDPs
- Koller, Parr
- 2000
Citation Context ...lution of this system $w^{\pi} = \left( \Phi^{\top} (\Phi - \gamma P \Pi_{\pi} \Phi) \right)^{-1} \Phi^{\top} R$ is guaranteed to exist for all, but finitely many, values of $\gamma$ (Koller and Parr, 2000). [Figure 4: Policy evaluation and projection methods.] Since the orthogonal projection minimizes the $L_2$ norm, the solution $w^{\pi}$ yields a value function $\hat{Q}^{\pi}$ which can be called the least-squares fixed-point approximation to the true value function. In...
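
The least-squares fixed-point solution named in this context, w = (Φᵀ(Φ − γPΠΦ))⁻¹ ΦᵀR, has a property that is easy to verify numerically: applying the Bellman operator to Φw and then projecting orthogonally onto the span of the basis returns Φw. The sketch below checks this on a hypothetical three-entry model; the matrices, the name `P_pi` (standing in for the product of the transition and policy matrices), and the function name are assumptions made for the example.

```python
import numpy as np

def fixed_point_weights(Phi, P_pi, R, gamma):
    """Least-squares fixed-point weights for a linear architecture (model-based form)."""
    A = Phi.T @ (Phi - gamma * P_pi @ Phi)
    b = Phi.T @ R
    return np.linalg.solve(A, b)

# Hypothetical model: three state-action pairs, two basis functions.
gamma = 0.9
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
P_pi = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 1.0])

w = fixed_point_weights(Phi, P_pi, R, gamma)

# Orthogonal projection onto the column space of Phi.
projection = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)
q_hat = Phi @ w
backed_up = R + gamma * P_pi @ q_hat  # Bellman operator applied to the approximation
```

Here `projection @ backed_up` reproduces `q_hat`, which is the sense in which the approximation is a fixed point of the projected Bellman operator.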

63 | Least Squares Policy Evaluation Algorithms with Linear Function Approximation. Discrete Event Dynamic Systems: Theory and Applications
- Nedić, Bertsekas
- 2003
Citation Context ... to a multiple of the identity matrix $\delta I$ for some small positive $\delta$, instead of 0 (ridge regression, Dempster et al., 1977). The convergence properties of the algorithm are not affected by this change (Nedić and Bertsekas, 2003). Another possibility is to use singular value decomposition (SVD) for robust inversion of $\tilde{A}$ which also eliminates singularities due to linearly dependent basis functions. However, if the linear sy...
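
The initialization described in this context, starting the accumulated matrix at a small multiple of the identity instead of zero, can be sketched in a sample-based evaluation loop. This is a minimal illustration under assumptions and not the paper's implementation: the function name `lstdq`, the sample tuple layout `(s, a, r, s_next)`, the one-hot basis, and the value of `delta` are all choices made for the example.

```python
import numpy as np

def lstdq(samples, phi, policy, k, gamma=0.9, delta=1e-3):
    """Accumulate A and b from samples, with A initialized to delta * I.

    samples : iterable of (s, a, r, s_next) tuples (assumed layout)
    phi     : function mapping (s, a) to a length-k feature vector
    policy  : function mapping a state to the action being evaluated
    """
    A = delta * np.eye(k)  # ridge-style start keeps A invertible with few samples
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)

def one_hot(s, a, n_actions=2, k=4):
    """Hypothetical tabular basis: one indicator per (state, action) pair."""
    f = np.zeros(k)
    f[s * n_actions + a] = 1.0
    return f

# Only two of the four (state, action) pairs are ever observed, so A would be
# singular without the delta * I term.
samples = [(0, 1, 0.0, 1), (1, 1, 1.0, 1)]
w = lstdq(samples, one_hot, policy=lambda s: 1, k=4)
```

Weights for the unobserved pairs stay at zero, while the observed ones are slightly shrunk relative to the unregularized solution; a smaller `delta` reduces that shrinkage.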

62 | Learning to drive a bicycle using reinforcement learning and shaping
- Randløv, Alstrøm
- 1998
Citation Context ...lized nature in conjunction with their magnitude are indicative of the appropriate adjustment to each parameter. [9.3 Bicycle Balancing and Riding] The goal in the bicycle balancing and riding problem (Randløv and Alstrøm, 1998) is to learn to balance and ride a bicycle to a target position located 1 km away from the starting location. Initially, the bicycle's orientation is at an angle of 90° to the goal. The state descri...

55 | Reinforcement learning Applied to Linear Quadratic Regulation
- Bradtke
- 1993
Citation Context ...t action. Depending on the role of the action variables in the approximate state-action value function, a closed form solution may be possible as, for example, in the adaptive policy-iteration algorithm (Bradtke, 1993) for linear quadratic regulation. Finally, any policy $\pi$ (represented by the basis functions $\phi$ and a set of parameters w) is fed to LSTDQ along with a set of samples for evaluation. LSTDQ performs the...

53 | Error bounds for approximate policy iteration
- Munos
- 2003

35 | Policy search via density estimation
- Ng, Parr, et al.
- 2000
Citation Context ... stochastic matrix, further investigation will be required to relate this matrix to standard notions of a model. In contrast to the variety of direct policy learning methods (Ng et al., 2000; Ng and Jordan, 2000; Baxter and Bartlett, 2001; Sutton et al., 2000; Konda and Tsitsiklis, 2000), LSPI offers the strength of policy iteration. Policy search methods typically make a large number of ...

26 | Infinite-Horizon Gradient-Based Policy Search: II. Gradient Ascent Algorithms and Experiments
- Baxter, Bartlett, et al.

9 | A simulation study of alternatives to ordinary least-squares
- Dempster, Schatzoff, et al.
- 1977
Citation Context ...cient number of samples has been processed. One way to avoid such singularities is to initialize $\tilde{A}$ to a multiple of the identity matrix $\delta I$ for some small positive $\delta$, instead of 0 (ridge regression, Dempster et al., 1977). The convergence properties of the algorithm are not affected by this change (Nedić and Bertsekas, 2003). Another possibility is to use singular value decomposition (SVD) for robust inversion of $\tilde{A}$ ...
