Results 1–10 of 63
Coastal Navigation with Mobile Robots
2000
Cited by 95 (20 self)
The problem that we address in this paper is how a mobile robot can plan in order to arrive at its goal with minimum uncertainty. Traditional motion planning algorithms often assume that a mobile robot can track its position reliably; however, in real-world situations, reliable localization may not always be feasible. Partially Observable Markov Decision Processes (POMDPs) provide one way to maximize the certainty of reaching the goal state, but at the cost of computational intractability for large state spaces. The method we propose explicitly models the uncertainty of the robot's position as a state variable, and generates trajectories through the augmented pose-uncertainty space. By minimizing the positional uncertainty at the goal, the robot reduces the likelihood of becoming lost. We demonstrate experimentally that coastal navigation reduces the uncertainty at the goal, especially with degraded localization.
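The core idea — planning over an augmented (pose, uncertainty) state so the planner prefers information-rich "coastal" routes — can be sketched in a few lines. Everything below (the corridor, the uncertainty dynamics, the penalty weight) is an illustrative assumption, not the paper's actual model.

```python
import itertools

# Toy sketch of coastal navigation: value iteration over an augmented
# state (cell, uncertainty level). "Coastal" cells offer landmarks that
# reset positional uncertainty; open cells increase it; arriving at the
# goal uncertain is penalized. All constants are illustrative assumptions.

GRID, MAX_U, GOAL = 5, 3, 4   # 1-D corridor of cells 0..4
COASTAL = {1, 3}              # cells with reliable localization

def step(cell, u, action):    # action moves left (-1) or right (+1)
    nxt = min(max(cell + action, 0), GRID - 1)
    nu = 0 if nxt in COASTAL else min(u + 1, MAX_U)
    return nxt, nu

def plan():
    # Cost-to-go: each move costs 1; terminal cost 10 * u penalizes
    # reaching the goal with uncertainty level u.
    V = {(c, u): (10.0 * u if c == GOAL else 1e9)
         for c in range(GRID) for u in range(MAX_U + 1)}
    for _ in range(50):       # enough sweeps to converge on this toy grid
        for c, u in itertools.product(range(GRID), range(MAX_U + 1)):
            if c != GOAL:
                V[(c, u)] = min(1.0 + V[step(c, u, a)] for a in (-1, 1))
    return V
```

In this toy model, V[(3, 0)] = 11 (step into the goal and pay a small uncertainty penalty) while V[(3, 3)] = 13: when very uncertain next to the goal, the optimal plan first detours to re-enter a coastal cell and re-localize — the paper's trade-off in miniature.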
Finding Approximate POMDP Solutions Through Belief Compression
Journal of Artificial Intelligence Research
2005
Cited by 85 (3 self)
Standard value function approaches to finding policies for Partially Observable Markov Decision Processes (POMDPs) are generally considered to be intractable for large models. The intractability of these algorithms is to a large extent a consequence of computing an exact, optimal policy over the entire belief space. However, in real-world POMDP problems, computing the optimal policy for the full belief space is often unnecessary for good control, even for problems with complicated policy classes. The beliefs experienced by the controller often lie near a structured, low-dimensional subspace embedded in the high-dimensional belief space. Finding a good approximation to the optimal value function for only this subspace can be much easier than computing the full value function. We introduce a new method for solving large-scale POMDPs by reducing the dimensionality of the belief space, using Exponential family Principal Components Analysis. We demonstrate the use of this algorithm on a synthetic problem and on mobile robot navigation tasks.
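The premise — that beliefs encountered in practice lie near a low-dimensional manifold — can be illustrated without E-PCA itself. In this sketch (plain Python, not the paper's algorithm), every belief over N states is a discretized Gaussian bump, so a single latent coordinate reconstructs it almost exactly; N and the bump width are illustrative assumptions.

```python
import math

# Beliefs a robot actually encounters are often unimodal bumps: a
# 20-dimensional distribution that one latent number (the bump's mean)
# locates on the manifold and reconstructs almost perfectly.

N = 20

def bump(center, width=2.0):
    w = [math.exp(-0.5 * ((i - center) / width) ** 2) for i in range(N)]
    s = sum(w)
    return [x / s for x in w]          # normalize to a distribution

def compress(belief):
    # One-number code: the belief's mean.
    return sum(i * p for i, p in enumerate(belief))

def reconstruct(code):
    return bump(code)

b = bump(7.0)
err = max(abs(x - y) for x, y in zip(b, reconstruct(compress(b))))
```

Here a 20-dimensional belief is stored as a single number with negligible reconstruction error; the difference in the paper is that E-PCA learns such a low-dimensional representation from data instead of assuming its form.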
Autonomous shaping: knowledge transfer in reinforcement learning
 In Int. Conference on Machine Learning
, 2006
Cited by 66 (5 self)
Agent-centered search
Artificial Intelligence Magazine
Cited by 55 (5 self)
In this article, we describe agent-centered search (sometimes also called real-time search or local search) and illustrate this planning paradigm with examples. Agent-centered search methods interleave planning and plan execution and restrict planning to the part of the domain around the current state of the agent, for example, the current location of a mobile robot or the current board position of a game. They can execute actions in the presence of time constraints and often have a small sum of planning and execution cost, both because they trade off planning and execution cost and because they allow agents to gather information early in nondeterministic domains, which reduces the amount of planning they have to perform for unencountered situations. These advantages become important as more intelligent systems are interfaced ...
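The canonical instance of this paradigm is LRTA* (Learning Real-Time A*): plan one step around the current state, raise the heuristic of the state just left, and move. The corridor world and unit action costs below are illustrative assumptions.

```python
# Minimal LRTA* on a 1-D corridor of cells 0..n-1. Planning is
# restricted to the successors of the current state; the heuristic
# table h is improved in place as the agent moves, so repeated trials
# converge toward optimal routes.

def lrta_star(start, goal, n, h=None, max_steps=100):
    if h is None:
        h = {s: abs(goal - s) for s in range(n)}   # admissible initial h
    s, path = start, [start]
    for _ in range(max_steps):
        if s == goal:
            return path
        succs = [t for t in (s - 1, s + 1) if 0 <= t < n]
        best = min(succs, key=lambda t: 1 + h[t])  # one-step lookahead
        h[s] = max(h[s], 1 + h[best])              # learn: raise h(s)
        s = best
        path.append(s)
    return path
```

With the exact heuristic the agent walks straight to the goal; with a poor one it may wander at first, but each step's heuristic update pays for the next trial.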
Exploration of Multi-State Environments: Local Measures and Back-Propagation of Uncertainty
1998
Cited by 54 (1 self)
This paper presents an action selection technique for reinforcement learning in stationary Markovian environments. This technique may be used in direct algorithms such as Q-learning, or in indirect algorithms such as adaptive dynamic programming. It is based on two principles. The first is to define a local measure of the uncertainty using the theory of bandit problems. We show that such a measure suffers from several drawbacks. In particular, a direct application of it leads to algorithms of low quality that can be easily misled by particular configurations of the environment. The second basic principle was introduced to eliminate this drawback. It consists of treating the local measures of uncertainty as rewards, and back-propagating them with the dynamic programming or temporal difference mechanisms. This reproduces global-scale reasoning about the uncertainty using only local measures of it. Numerical simulations clearly show the effectiveness of these proposals.
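The second principle can be sketched directly: treat a local uncertainty measure (here a bandit-style 1/sqrt(visits) bonus) as a reward and back-propagate it with TD-style backups, so uncertainty about remote, unexplored states becomes visible from the start state. The chain MDP and all constants are illustrative assumptions.

```python
import math

# U back-propagates local uncertainty bonuses along a deterministic
# "move right" chain of N states (absorbing at N - 1), exactly as a
# value function back-propagates rewards.

N, GAMMA, ALPHA = 5, 0.9, 0.5
counts = {s: 0 for s in range(N)}    # visit counts per state
U = {s: 0.0 for s in range(N)}       # back-propagated uncertainty value

def local_uncertainty(s):
    return 1.0 / math.sqrt(counts[s] + 1)

def td_sweep():
    for s in range(N - 1):
        target = local_uncertainty(s + 1) + GAMMA * U[s + 1]
        U[s] += ALPHA * (target - U[s])

counts[0] = counts[1] = 10           # the region near the start is familiar
for _ in range(200):
    td_sweep()
```

Although states 0 and 1 are well visited, U[0] remains large, because the uncertainty of the unexplored tail of the chain has been propagated back to it — the global-scale reasoning the paper describes.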
Reinforcement Learning in Finite MDPs: PAC Analysis
Cited by 52 (6 self)
We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These "PAC-MDP" algorithms include the well-known E3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state of the art by presenting bounds for the problem in a unified theoretical framework. We also present a more refined analysis that yields insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX. Finally, we conclude with open problems.
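The optimism mechanism shared by R-MAX-style PAC-MDP algorithms fits in a few lines: a state-action pair tried fewer than m times is treated as "unknown" and assumed to yield the maximum reward, so a planner built on these optimistic rewards is driven to explore it. The threshold m and reward scale below are illustrative assumptions.

```python
# Optimistic reward model in the R-MAX style: unknown state-action
# pairs look maximally rewarding until they have been sampled m times.

M, RMAX = 5, 1.0   # "known" threshold and maximum reward

def optimistic_reward(counts, observed, s, a):
    """Reward used for planning: optimistic until (s, a) is known."""
    if counts.get((s, a), 0) < M:
        return RMAX            # unknown: assume the best possible
    return observed[(s, a)]    # known: use the empirical estimate

counts = {("s0", "left"): 7}      # "left" has been tried enough...
observed = {("s0", "left"): 0.2}  # ...and pays poorly
```

The PAC analysis then bounds how many exploratory steps this optimism can cost before the learned model supports near-optimal behavior.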
Auto-exploratory Average Reward Reinforcement Learning
 Artificial Intelligence
1996
Cited by 36 (10 self)
We introduce a model-based average reward Reinforcement Learning method called H-learning and compare it with its discounted counterpart, Adaptive Real-Time Dynamic Programming, in a simulated robot scheduling task. We also introduce an extension to H-learning, which automatically explores the unexplored parts of the state space, while always choosing greedy actions with respect to the current value function. We show that this "Auto-exploratory H-learning" performs better than the original H-learning under previously studied exploration methods such as random, recency-based, or counter-based exploration.

Introduction: Reinforcement Learning (RL) is the study of learning agents that improve their performance at some task by receiving rewards and punishments from the environment. Most approaches to reinforcement learning, including Q-learning (Watkins and Dayan 92) and Adaptive Real-Time Dynamic Programming (ARTDP) (Barto, Bradtke, & Singh 95), optimize the total discounted reward the ...
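An average-reward update in the spirit of H-learning can be sketched on a toy deterministic two-state cycle (not the paper's robot scheduling task; the MDP and step size are illustrative assumptions). The cycle s0 -> s1 (reward 0) -> s0 (reward 1) has optimal average reward (gain) rho = 0.5.

```python
# Average-reward backups: instead of discounting, subtract the gain
# estimate rho from each reward and learn relative (bias) values h.

NEXT = {0: 1, 1: 0}
REWARD = {0: 0.0, 1: 1.0}   # reward for leaving each state

h = {0: 0.0, 1: 0.0}        # bias (relative value) of each state
rho = 0.0                   # estimate of the average reward
ALPHA = 0.05

s = 0
for _ in range(2000):
    s2, r = NEXT[s], REWARD[s]
    # Update the gain estimate from the greedy transition...
    rho += ALPHA * (r + h[s2] - h[s] - rho)
    # ...then apply the average-reward backup h(s) = r - rho + h(s').
    h[s] = r - rho + h[s2]
    s = s2
```

Both rho and the bias difference h(1) - h(0) converge to 0.5 here; the paper's Auto-exploratory extension changes how actions are chosen, not this backup.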
Potential-based shaping and Q-value initialization are equivalent
Journal of Artificial Intelligence Research
2003
Reinforcement Learning by Policy Search
2000
Cited by 30 (2 self)
One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations can be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are known as partially observable Markov decision processes (POMDPs). While the environment's dynamics are assumed to obey certain rules, the agent does not know them and must learn. In this dissertation we focus on the agent's adaptation as captured by the reinforcement learning framework. Reinforcement learning means learning a policy (a mapping of observations into actions) based on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. The set of policies being searched is constrained by the architecture of the agent's controller. POMDPs require a controller to have memory. We investigate various architectures for controllers with memory, including controllers with external memory, finite-state controllers, and distributed controllers for multi-agent systems. For these various controllers we work out the details of the algorithms which learn by ascending the gradient of expected cumulative reinforcement. Building on statistical learning theory and experiment design theory, a policy evaluation algorithm is developed for the case of experience reuse. We address the question of sufficient experience for uniform convergence of policy evaluation and obtain sample complexity bounds for various estimators. Finally, we demonstrate the performance of the proposed algorithms on several domains, the most complex of which is simulated adaptive packet routing in a telecommunication network.
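The gradient-ascent scheme described here can be sketched in its simplest memoryless form: a REINFORCE-style update for a softmax policy on a two-armed bandit. The bandit payoffs, learning rate, and seed are illustrative assumptions, not the dissertation's POMDP controllers.

```python
import math
import random

# Policy search by gradient ascent on expected reinforcement:
# sample an action from a softmax policy, observe the reward, and
# move the parameters along r * grad log pi(a).

random.seed(0)
PAYOFF = [0.2, 0.8]          # success probability of each arm
theta = [0.0, 0.0]           # softmax policy parameters
ALPHA = 0.1

def policy():
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [p / total for p in z]

for _ in range(3000):
    probs = policy()
    a = 0 if random.random() < probs[0] else 1
    r = 1.0 if random.random() < PAYOFF[a] else 0.0
    # grad log pi(a) under a softmax is e_a - probs.
    for i in range(2):
        theta[i] += ALPHA * r * ((1.0 if i == a else 0.0) - probs[i])
```

After training, the policy concentrates on the better arm; giving the controller memory (external memory, finite-state, or distributed, as in the dissertation) extends the same gradient scheme to POMDPs.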