## Active policy learning for robot planning and exploration under uncertainty (2007)


### Download Links

- [www.cs.ubc.ca]
- [users.isr.ist.utl.pt]
- [webdiis.unizar.es]
- [www.roboticsproceedings.org]
- [robots.unizar.es]
- [roboticsproceedings.org]
- DBLP

### Other Repositories/Bibliography

Venue: Proceedings of Robotics: Science and Systems

Citations: 27 (2 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Martinez-cantin07activepolicy,
  author    = {Ruben Martinez-Cantin and Nando de Freitas and Arnaud Doucet and José A. Castellanos},
  title     = {Active policy learning for robot planning and exploration under uncertainty},
  booktitle = {Proceedings of Robotics: Science and Systems},
  year      = {2007}
}
```


### Abstract

This paper proposes a simulation-based active policy learning algorithm for finite-horizon, partially observed sequential decision processes. The algorithm is tested in the domain of robot navigation and exploration under uncertainty. In this setting, the expected cost, which must be minimized, is a function of the belief state (filtering distribution). This filtering distribution is in turn nonlinear and subject to discontinuities, which arise because of constraints in the robot motion and control models. As a result, the expected cost is non-differentiable and very expensive to simulate. The new algorithm overcomes the first difficulty and reduces the number of required simulations as follows. First, it assumes that previous simulations have been carried out, returning values of the expected cost for the corresponding policy parameters. Second, it fits a Gaussian process (GP) regression model to these values, so as to approximate the expected cost as a function of the policy parameters. Third, it uses the GP predicted mean and variance to construct a statistical measure that determines which policy parameters should be used in the next simulation. The process is then repeated using the new parameters and the newly gathered expected-cost observation. Since the objective is to find the policy parameters that minimize the expected cost, this iterative active learning approach effectively trades off exploration (in regions where the GP variance is large) against exploitation (where the GP mean is low). In our experiments, a robot uses the proposed algorithm to plan an optimal path for accomplishing a series of tasks, while maximizing the information about its pose and map estimates. These estimates are obtained with a standard filter for simultaneous localization and mapping. Upon gathering new observations, the robot updates the state estimates and is able to replan a new path in the spirit of open-loop feedback control.
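The GP-based loop the abstract describes can be sketched as follows, assuming a one-dimensional policy parameter, a squared-exponential kernel, and a lower-confidence-bound infill criterion. Here `expected_cost` is a hypothetical stand-in for the expensive SLAM simulation, not the paper's actual cost function, and the kernel hyperparameters are fixed rather than learned.

```python
import numpy as np

def gp_posterior(x_train, y_train, x_query, length_scale=0.3, noise=1e-4):
    """GP posterior mean/variance with a squared-exponential kernel
    (zero prior mean, unit signal variance)."""
    def k(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)
    K = k(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = k(x_train, x_query)                                 # cross-covariances
    mean = K_s.T @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.maximum(var, 0.0)

def expected_cost(theta):
    # Hypothetical stand-in for the expensive simulation that estimates the
    # average mean-square error (AMSE) of the policy with parameter theta.
    return (theta - 0.6) ** 2 + 0.1 * np.sin(8.0 * theta)

rng = np.random.default_rng(0)
thetas = rng.uniform(0.0, 1.0, size=3)   # initial policy parameters
costs = expected_cost(thetas)            # their simulated expected costs
grid = np.linspace(0.0, 1.0, 201)        # candidate policy parameters
kappa = 2.0                              # exploration/exploitation weight

for _ in range(10):
    mean, var = gp_posterior(thetas, costs, grid)
    infill = mean - kappa * np.sqrt(var)      # lower confidence bound:
    theta_next = grid[np.argmin(infill)]      # low mean OR high variance wins
    thetas = np.append(thetas, theta_next)    # simulate the chosen policy
    costs = np.append(costs, expected_cost(theta_next))

best = thetas[np.argmin(costs)]
print(f"best policy parameter found: {best:.3f}")
```

The paper uses DIRECT rather than a grid to minimize the infill function; a grid is used here only to keep the sketch short.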

### Citations

499 | Dynamic programming and optimal control. Athena Scientific
- Bertsekas
- 2007

Citation Context: ... Moreover, in our domain, the robot only sees the landmarks within an observation gate. Since the models are not linear-Gaussian, one cannot use standard linear-quadratic-Gaussian (LQG) controllers [20] to solve our problem. Moreover, since the action and state spaces are large-dimensional and continuous, one cannot discretize the problem and use closed-loop control as suggested in [21]. That is, th...

435 | Predictive Control with Constraints
- Maciejowski
- 2002

Citation Context: ...the planning horizon to recede. That is, as the robot moves, it keeps planning T steps ahead of its current position. This control framework is also known as receding-horizon model-predictive control [25]. In the following two subsections, we will describe a way of conducting the simulations to estimate the AMSE. The active policy update algorithm will be described in Section III. A. Simulation of the...
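The receding-horizon scheme in this snippet (plan T steps ahead, execute one step, re-estimate, replan) can be illustrated with a toy loop. The planner, dynamics, and estimator below are hypothetical scalar stand-ins, not the paper's models; they only show the loop shape.

```python
# Hypothetical stand-ins for the planner, dynamics model, and estimator;
# none of these come from the paper -- they only illustrate the loop shape.
T = 3  # planning horizon

def plan_T_steps(belief, horizon):
    # Toy planner: steer the (scalar) state toward 0 with bounded steps.
    return [-0.5 * belief] * horizon

def step(state, action):
    return state + action  # trivial scalar dynamics

def update_belief(belief, state):
    return state  # perfect observation, so the belief is just the state

state = belief = 8.0
for _ in range(12):
    actions = plan_T_steps(belief, T)      # plan T steps ahead of the belief...
    state = step(state, actions[0])        # ...but execute only the first action
    belief = update_belief(belief, state)  # re-estimate, then replan next loop

print(belief)  # state halves each step, so it converges toward 0
```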

349 | Simple statistical gradient-following algorithms for connectionist reinforcement learning
- Williams
- 1992

Citation Context: ...That is, the discretized partially observed Markov decision process is too large for stochastic dynamic programming [22]. As a result of these considerations, we adopt the direct policy search method [23, 24]. In particular, the initial policy is set either randomly or using prior knowledge. Given this policy, we conduct simulations to estimate the AMSE. These simulations involve sampling states and obser...

316 | Learning in Embedded Systems
- Kaelbling
- 1993

Citation Context: ...goals of exploration and exploitation in policy search. It is motivated by work on experimental design [13, 14, 15]. Simpler variations of our ideas appeared early in the reinforcement learning literature. In [16], the problem is treated in the framework of exploration/exploitation with bandits. An extension to continuous spaces (infinite number of bandits) using locally weighted regression was proposed in [17...

311 | The optimal control of partially observable Markov processes over a finite horizon
- Smallwood, Sondik
- 1973

Citation Context: ...one cannot discretize the problem and use closed-loop control as suggested in [21]. That is, the discretized partially observed Markov decision process is too large for stochastic dynamic programming [22]. As a result of these considerations, we adopt the direct policy search method [23, 24]. In particular, the initial policy is set either randomly or using prior knowledge. Given this policy, we condu...

267 | Efficient Global Optimization of Expensive Black-Box Functions
- Jones, Schonlau, et al.
- 1998

Citation Context: ...at policies are likely to result in higher expected returns. The method effectively balances the goals of exploration and exploitation in policy search. It is motivated by work on experimental design [13, 14, 15]. Simpler variations of our ideas appeared early in the reinforcement learning literature. In [16], the problem is treated in the framework of exploration/exploitation with bandits. An extension to continuous ...

265 | The Design and Analysis of Computer Experiments
- Santner, Williams, et al.
- 2003

Citation Context: ...at policies are likely to result in higher expected returns. The method effectively balances the goals of exploration and exploitation in policy search. It is motivated by work on experimental design [13, 14, 15]. Simpler variations of our ideas appeared early in the reinforcement learning literature. In [16], the problem is treated in the framework of exploration/exploitation with bandits. An extension to continuous ...

231 | Lipschitzian optimization without the Lipschitz constant
- Jones, Perttunen, et al.
- 1993

Citation Context: ... merely that we can quickly locate a point that is likely to be as good as possible. To deal with this nonlinear constrained optimization problem, we adopted the DIvided RECTangles (DIRECT) algorithm [30, 31]. DIRECT is a deterministic, derivative-free sampling algorithm. It uses the existing samples of the objective function to decide how to proceed to divide the feasible space into finer rectangles. For...

217 | PEGASUS: A policy search method for large MDPs and POMDPs
- Ng, Jordan
- 2000

Citation Context: ...That is, the discretized partially observed Markov decision process is too large for stochastic dynamic programming [22]. As a result of these considerations, we adopt the direct policy search method [23, 24]. In particular, the initial policy is set either randomly or using prior knowledge. Given this policy, we conduct simulations to estimate the AMSE. These simulations involve sampling states and obser...

169 | Infinite-horizon policy-gradient estimation
- Baxter, Bartlett
- 2001

Citation Context: ...ed to significant achievements in control and robotics [1, 2, 3, 4]. The success of the method does often, however, hinge on our ability to formulate expressions for the gradient of the expected cost [5, 4, 6]. In some important applications in robotics, such as exploration, constraints in the robot motion and control models make it hard, and often impossible, to compute derivatives of the cost function wi...

140 | A Taxonomy of Global Optimization Methods Based on Response Surfaces
- Jones
- 2001

Citation Context: ...pply. We present an alternative approach to gradient-based optimization for continuous policy spaces. This approach, which we refer to as active policy learning, is based on experimental design ideas [27, 13, 28, 29]. Active policy learning is an any-time, “black-box” statistical optimization [figure: GP mean cost, GP variance, true cost, infill, and data points plotted against the policy parameter]...

136 | Inverted autonomous helicopter flight via reinforcement learning
- Ng, Coates, et al.
- 2004

Citation Context: ...o replan a new path in the spirit of open-loop feedback control. I. INTRODUCTION The direct policy search method for reinforcement learning has led to significant achievements in control and robotics [1, 2, 3, 4]. The success of the method does often, however, hinge on our ability to formulate expressions for the gradient of the expected cost [5, 4, 6]. In some important applications in robotics, such as expl...

128 | Mobile Robot Localization and Map Building: A Multisensor Fusion Approach
- Castellanos, Tardós
- 1999

Citation Context: ...KF or particle filter) to compute the posterior mean state x̂^(i)_{1:T}. (In this paper, we adopt the EKF-SLAM algorithm to estimate the mean and covariance of this distribution. We refer the reader to [26] for implementation details.) The evaluation of the cost function is therefore extremely expensive. Moreover, since the model is nonlinear, it is hard to quantify the uncertainty introduced by the sub...

111 | Posterior Cramér-Rao bounds for discrete-time nonlinear filtering
- Tichavsky, Muravchik, et al.
- 1998

Citation Context: ... measurements and states are assumed random. It is defined as the inverse of the Fisher information matrix J and provides the following lower bound on the AMSE: C^π_AMSE ≥ C^π_PCRB = J^{−1}. Tichavský [35] derived the following Riccati-like recursion to compute the PCRB for any unbiased estimator: J_{t+1} = D_t − C′_t (J_t + B_t)^{−1} C_t + A_{t+1}, (3) where A_{t+1} = E[−Δ_{x_{t+1},x_{t+1}} log p(y_{t+1}|x_{t+1})], B_t = E[−Δ_{x_t,x_t} ...
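The recursion quoted above is truncated by the snippet. For reference, the usual full form of the terms, reconstructed from the standard posterior Cramér-Rao bound result rather than from this paper's text (so notation may differ slightly), is:

```latex
J_{t+1} = D_t - C_t^{\top}\,(J_t + B_t)^{-1}\,C_t + A_{t+1},
\qquad\text{where}\qquad
\begin{aligned}
A_{t+1} &= \mathbb{E}\!\left[-\Delta_{x_{t+1},\,x_{t+1}} \log p(y_{t+1} \mid x_{t+1})\right],\\
B_t     &= \mathbb{E}\!\left[-\Delta_{x_t,\,x_t} \log p(x_{t+1} \mid x_t)\right],\\
C_t     &= \mathbb{E}\!\left[-\Delta_{x_t,\,x_{t+1}} \log p(x_{t+1} \mid x_t)\right],\\
D_t     &= \mathbb{E}\!\left[-\Delta_{x_{t+1},\,x_{t+1}} \log p(x_{t+1} \mid x_t)\right].
\end{aligned}
```

Here A collects the observation information and B, C, D the transition information, so the recursion propagates the Fisher information matrix J forward in time.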

110 | Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion
- Kohl, Stone
- 2004

Citation Context: ...o replan a new path in the spirit of open-loop feedback control. I. INTRODUCTION The direct policy search method for reinforcement learning has led to significant achievements in control and robotics [1, 2, 3, 4]. The success of the method does often, however, hinge on our ability to formulate expressions for the gradient of the expected cost [5, 4, 6]. In some important applications in robotics, such as expl...

84 | Policy gradient methods for robotics
- Peters, Schaal
- 2006

Citation Context: ...o replan a new path in the spirit of open-loop feedback control. I. INTRODUCTION The direct policy search method for reinforcement learning has led to significant achievements in control and robotics [1, 2, 3, 4]. The success of the method does often, however, hinge on our ability to formulate expressions for the gradient of the expected cost [5, 4, 6]. In some important applications in robotics, such as expl...

70 | Information gain-based exploration using Rao-Blackwellized particle filters
- Stachniss, Grisetti, et al.
- 2005

Citation Context: ... We demonstrate the new approach on a hard robotics problem: planning and exploration under uncertainty. This problem plays a key role in simultaneous localization and mapping (SLAM), see for example [7, 8]. Mobile robots must maximize the size of the explored terrain, but, at the same time, they must ensure that localization errors are minimized. While exploration is needed to find new features, the ro...

56 | Kernel methods for missing variables
- Smola, Vishwanathan, et al.
- 2005

Citation Context: ...er motivating factor is that DIRECT’s implementation is easily available [32]. However, we conjecture that for large dimensional spaces, sequential quadratic programming or concave-convex programming [33] might be better algorithm choices for infill optimization. A. Gaussian processes A Gaussian process, z(·) ∼ GP(m(·), K(·, ·)), is an infinite random process indexed by the vector θ, such that any re...

45 | Memory-based stochastic optimization
- Moore, Schneider
- 1996

Citation Context: ...16], the problem is treated in the framework of exploration/exploitation with bandits. An extension to continuous spaces (infinite number of bandits) using locally weighted regression was proposed in [17]. Our paper presents richer criteria for active learning as well as suitable optimization objectives. This paper also presents posterior Cramér-Rao bounds to approximate the cost function in robot explor...

41 | Flexibility and Efficiency Enhancements for Constrained Global Design Optimization with Kriging Approximations
- Sasena
- 2002

Citation Context: ...pply. We present an alternative approach to gradient-based optimization for continuous policy spaces. This approach, which we refer to as active policy learning, is based on experimental design ideas [27, 13, 28, 29]. Active policy learning is an any-time, “black-box” statistical optimization [figure: GP mean cost, GP variance, true cost, infill, and data points plotted against the policy parameter]...

35 | Modifications of the DIRECT algorithm
- Gablonsky

Citation Context: ... merely that we can quickly locate a point that is likely to be as good as possible. To deal with this nonlinear constrained optimization problem, we adopted the DIvided RECTangles (DIRECT) algorithm [30, 31]. DIRECT is a deterministic, derivative-free sampling algorithm. It uses the existing samples of the objective function to decide how to proceed to divide the feasible space into finer rectangles. For...

30 | Global A-Optimal Robot Exploration
- Sim, Roy
- 2005

Citation Context: ... We demonstrate the new approach on a hard robotics problem: planning and exploration under uncertainty. This problem plays a key role in simultaneous localization and mapping (SLAM), see for example [7, 8]. Mobile robots must maximize the size of the explored terrain, but, at the same time, they must ensure that localization errors are minimized. While exploration is needed to find new features, the ro...

25 | DIRECT optimization algorithm user guide
- Finkel
- 2003

Citation Context: ..., DIRECT provides a better solution than gradient approaches because the infill function tends to have many local optima. Another motivating factor is that DIRECT’s implementation is easily available [32]. However, we conjecture that for large dimensional spaces, sequential quadratic programming or concave-convex programming [33] might be better algorithm choices for infill optimization. A. Gaussian p...

24 | Multisensor resource deployment using posterior Cramér-Rao bounds, Aerospace and Electronic Systems
- Hernandez, Kirubarajan, et al.
- 2004

Citation Context: ...ocused on robot exploration and planning, our policy search framework extends naturally to other domains. Related problems appear in the fields of terrain-aided navigation [18, 9] and dynamic sensor nets [19, 6]. II. APPLICATION TO ROBOT EXPLORATION AND PLANNING Although the algorithm proposed in this paper applies to many sequential decision making settings, we will restrict attention to the robot explorati...

23 | A new method of locating the maximum of an arbitrary multipeak curve in the presence of noise
- Kushner
- 1964

Citation Context: ...pply. We present an alternative approach to gradient-based optimization for continuous policy spaces. This approach, which we refer to as active policy learning, is based on experimental design ideas [27, 13, 28, 29]. Active policy learning is an any-time, “black-box” statistical optimization [figure: GP mean cost, GP variance, true cost, infill, and data points plotted against the policy parameter]...

17 | Efficient gradient estimation for motor control learning
- Lawrence
- 2003

12 | Using reinforcement learning to improve exploration trajectories for error minimization
- Kollar, Roy
- 2006

Citation Context: ...ns. For instance, full observability is assumed in [9, 7], known robot location is assumed in [10], myopic planning is adopted in [8], and discretization of the state and/or action spaces appears in [11, 12, 7]. The method proposed in this paper does not rely on any of these assumptions. Our direct policy solution uses an any-time probabilistic active learning algorithm to predict what policies are likely t...

11 | Optimal sensor trajectories in bearings-only tracking
- Hernandez

Citation Context: ...ns. For instance, full observability is assumed in [9, 7], known robot location is assumed in [10], myopic planning is adopted in [8], and discretization of the state and/or action spaces appears in [11, 12, 7]. The method proposed in this paper does not rely on any of these assumptions. Our direct policy solution uses an any-time probabilistic active learning algorithm to predict what policies are likely t...

11 | Optimal estimation and Cramér-Rao bounds for partial non-Gaussian state space models
- Bergman, Doucet, et al.
- 2001

Citation Context: ...case in our setting and hence a potential source of error. An alternative PCRB approximation method that overcomes this shortcoming, in the context of jump Markov linear (JML) models, was proposed by [36]. We try both approximations in our experiments and refer to them as NL-PCRB and JML-PCRB respectively. The AMSE simulation approach of Section II-A using the EKF requires that we perform an expensive ...

9 | Simulation-based optimal sensor scheduling with application to observer trajectory planning. Automatica
- Singh, Kantas, et al.
- 2007

Citation Context: ...ed to significant achievements in control and robotics [1, 2, 3, 4]. The success of the method does often, however, hinge on our ability to formulate expressions for the gradient of the expected cost [5, 4, 6]. In some important applications in robotics, such as exploration, constraints in the robot motion and control models make it hard, and often impossible, to compute derivatives of the cost function wi...

8 | Optimal observer trajectory in bearings-only tracking for manoeuvring sources
- Tremois, Le Cadre
- 1999

Citation Context: ... controllers [20] to solve our problem. Moreover, since the action and state spaces are large-dimensional and continuous, one cannot discretize the problem and use closed-loop control as suggested in [21]. That is, the discretized partially observed Markov decision process is too large for stochastic dynamic programming [22]. As a result of these considerations, we adopt the direct policy search metho...

7 | Fast parameter optimization of large-scale electromagnetic objects using DIRECT with Kriging metamodeling
- Siah, Ozdemir, et al.

Citation Context: ...at policies are likely to result in higher expected returns. The method effectively balances the goals of exploration and exploitation in policy search. It is motivated by work on experimental design [13, 14, 15]. Simpler variations of our ideas appeared early in the reinforcement learning literature. In [16], the problem is treated in the framework of exploration/exploitation with bandits. An extension to continuous ...

4 | On the Cramér-Rao bound for terrain-aided navigation
- Bergman
- 1997

Citation Context: ...on. Although the discussion is focused on robot exploration and planning, our policy search framework extends naturally to other domains. Related problems appear in the fields of terrain-aided navigation [18, 9] and dynamic sensor nets [19, 6]. II. APPLICATION TO ROBOT EXPLORATION AND PLANNING Although the algorithm proposed in this paper applies to many sequential decision making settings, we will restrict ...

2 | Planification for terrain-aided navigation
- Paris, Le Cadre
- 2002

Citation Context: ...ution). Even a toy problem requires enormous computational effort. As a result, it is not surprising that most existing approaches relax the constraints. For instance, full observability is assumed in [9, 7], known robot location is assumed in [10], myopic planning is adopted in [8], and discretization of the state and/or action spaces appears in [11, 12, 7]. The method proposed in this paper does not r...

2 | Trajectory planning for multiple robots in bearing-only target localisation
- Leung, Huang, et al.
- 2005

Citation Context: ...s computational effort. As a result, it is not surprising that most existing approaches relax the constraints. For instance, full observability is assumed in [9, 7], known robot location is assumed in [10], myopic planning is adopted in [8], and discretization of the state and/or action spaces appears in [11, 12, 7]. The method proposed in this paper does not rely on any of these assumptions. Our dire...