Results 1 - 10 of 63
Coastal Navigation with Mobile Robots, 2000
"... The problem that we address in this paper is how a mobile robot can plan in order to arrive at its goal with minimum uncertainty. Traditional motion planning algorithms often assume that a mobile robot can track its position reliably, however, in real world situations, reliable localization may not ..."
Abstract
-
Cited by 95 (20 self)
The problem that we address in this paper is how a mobile robot can plan in order to arrive at its goal with minimum uncertainty. Traditional motion planning algorithms often assume that a mobile robot can track its position reliably; however, in real-world situations, reliable localization may not always be feasible. Partially Observable Markov Decision Processes (POMDPs) provide one way to maximize the certainty of reaching the goal state, but at the cost of computational intractability for large state spaces. The method we propose explicitly models the uncertainty of the robot’s position as a state variable and generates trajectories through the augmented pose-uncertainty space. By minimizing the positional uncertainty at the goal, the robot reduces the likelihood that it becomes lost. We demonstrate experimentally that coastal navigation reduces the uncertainty at the goal, especially when localization is degraded.
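To make the idea of planning in an augmented pose-uncertainty space concrete, here is a minimal sketch that runs Dijkstra over states of the form (row, column, uncertainty level) on a toy grid. The map, the discrete uncertainty levels, the rule that uncertainty shrinks next to walls and grows in open space, and the goal-uncertainty penalty are all illustrative assumptions, not the paper's sensor or motion model.

```python
# Minimal sketch, assuming a toy grid world: plan over an augmented
# (row, col, uncertainty) state so that "coastal" detours along walls,
# which re-localize the robot, can beat the shortest open-space route.
import heapq

GRID = [
    "..........",
    ".####.....",
    ".........#",
    "#.........",
    "..........",
]
ROWS, COLS = len(GRID), len(GRID[0])
U_LEVELS = 4                                  # hypothetical discrete uncertainty levels 0..3
START, GOAL = (0, 0), (4, 9)

def near_wall(r, c):
    """Assume localization is good next to walls or the map border."""
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        rr, cc = r + dr, c + dc
        if not (0 <= rr < ROWS and 0 <= cc < COLS) or GRID[rr][cc] == "#":
            return True
    return False

def successors(state):
    r, c, u = state
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < ROWS and 0 <= cc < COLS and GRID[rr][cc] != "#":
            # toy model: uncertainty drops near walls, grows in open space
            nu = max(0, u - 1) if near_wall(rr, cc) else min(U_LEVELS - 1, u + 1)
            yield (rr, cc, nu)

def plan(start, goal):
    """Dijkstra over the augmented space; entering the goal costs an extra
    penalty proportional to the remaining uncertainty level."""
    start_state = (*start, 0)
    frontier, best = [(0, start_state)], {start_state: 0}
    while frontier:
        cost, state = heapq.heappop(frontier)
        if cost > best.get(state, float("inf")):
            continue
        if (state[0], state[1]) == goal:
            return cost, state[2]             # total cost, uncertainty level at the goal
        for nxt in successors(state):
            step = 1 + (10 * nxt[2] if (nxt[0], nxt[1]) == goal else 0)
            ncost = cost + step
            if ncost < best.get(nxt, float("inf")):
                best[nxt] = ncost
                heapq.heappush(frontier, (ncost, nxt))
    return None

print(plan(START, GOAL))
```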
Finding Approximate POMDP Solutions Through Belief Compression - Journal of Artificial Intelligence Research, 2005
"... Abstract Standard value function approaches to finding policies for Partially Observable Markov Decision Processes (POMDPs) are generally considered to be intractable for large models. The intractability of these algorithms is to a large extent a consequence of computing an exact, optimal policy ov ..."
Abstract
-
Cited by 85 (3 self)
Standard value function approaches to finding policies for Partially Observable Markov Decision Processes (POMDPs) are generally considered to be intractable for large models. The intractability of these algorithms is to a large extent a consequence of computing an exact, optimal policy over the entire belief space. However, in real-world POMDP problems, computing the optimal policy for the full belief space is often unnecessary for good control, even for problems with complicated policy classes. The beliefs experienced by the controller often lie near a structured, low-dimensional subspace embedded in the high-dimensional belief space. Finding a good approximation to the optimal value function for only this subspace can be much easier than computing the full value function. We introduce a new method for solving large-scale POMDPs by reducing the dimensionality of the belief space. We use Exponential family Principal Components Analysis (E-PCA) to perform this dimensionality reduction. We demonstrate the use of this algorithm on a synthetic problem and on mobile robot navigation tasks.
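As a toy illustration of belief compression, the sketch below compresses simulated beliefs with ordinary PCA (via SVD) as a stand-in for the Exponential family PCA used in the paper; the simulated landmark beliefs, the number of features, and the renormalization step are assumptions made only for this example.

```python
# Minimal sketch, assuming simulated beliefs: the beliefs a controller visits
# often lie near a low-dimensional subspace, so a few learned features suffice.
# Ordinary PCA via SVD is used here as a stand-in for Exponential family PCA.
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_BELIEFS, N_FEATURES = 100, 500, 2

# Simulated visited beliefs: convex mixtures of three fixed landmark distributions,
# so the 100-dimensional belief vectors actually live on a 2-D simplex.
grid = np.arange(N_STATES)
landmarks = np.exp(-0.5 * ((grid[None, :] - np.array([[20], [50], [80]])) / 5.0) ** 2)
landmarks /= landmarks.sum(axis=1, keepdims=True)           # (3, 100)
weights = rng.dirichlet(np.ones(3), size=N_BELIEFS)         # (500, 3)
beliefs = weights @ landmarks                               # (500, 100)

# Learn a low-dimensional basis from the visited beliefs.
mean = beliefs.mean(axis=0)
_, _, Vt = np.linalg.svd(beliefs - mean, full_matrices=False)
basis = Vt[:N_FEATURES]                                     # learned belief features

# Compress, then reconstruct; planning would operate on `coords` only.
coords = (beliefs - mean) @ basis.T                         # (500, 2) compressed beliefs
recon = np.clip(coords @ basis + mean, 0.0, None)
recon /= recon.sum(axis=1, keepdims=True)                   # project back onto the simplex

err = np.abs(recon - beliefs).sum(axis=1).mean()
print(f"mean L1 reconstruction error with {N_FEATURES} features: {err:.4f}")
```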
Autonomous shaping: knowledge transfer in reinforcement learning - Int. Conference on Machine Learning, 2006
"... All in-text references underlined in blue are linked to publications on ResearchGate, letting you access and read them immediately. ..."
Abstract
-
Cited by 66 (5 self)
Agent-centered search - Artificial Intelligence Magazine
"... In this article, we describe agent-centered search (sometimes also called real-time search or local search) and illustrate this planning paradigm with examples. Agent-centered search methods interleave planning and plan execution and restrict planning to the part of the domain around the current sta ..."
Abstract
-
Cited by 55 (5 self)
In this article, we describe agent-centered search (sometimes also called real-time search or local search) and illustrate this planning paradigm with examples. Agent-centered search methods interleave planning and plan execution and restrict planning to the part of the domain around the current state of the agent, for example, the current location of a mobile robot or the current board position of a game. They can execute actions in the presence of time constraints and often have a small sum of planning and execution cost, both because they trade off planning and execution cost and because they allow agents to gather information early in nondeterministic domains, which reduces the amount of planning they have to perform for unencountered situations. These advantages become important as more intelligent systems are interfaced ...
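The sketch below illustrates the interleaving of planning and execution with an LRTA*-style update on a toy grid; the map, the Manhattan heuristic, and the step budget are assumptions for this example and are not taken from the article.

```python
# Minimal sketch, assuming a toy grid: agent-centered (real-time) search in the
# LRTA* style.  The agent only plans over the neighbours of its current cell,
# updates a learned heuristic for that cell, and immediately executes the move.
GRID = [
    "..........",
    ".#######..",
    "..........",
]
ROWS, COLS = len(GRID), len(GRID[0])
START, GOAL = (2, 4), (0, 4)                  # the wall forces a detour the heuristic ignores

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def neighbours(cell):
    r, c = cell
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < ROWS and 0 <= cc < COLS and GRID[rr][cc] != "#":
            yield (rr, cc)

def lrta_star(start, goal, max_steps=500):
    """One trial of LRTA*: one-step lookahead, execute, and update h on the way."""
    h = {}                                    # learned heuristic, defaults to Manhattan
    cur, steps = start, 0
    while cur != goal and steps < max_steps:
        value = lambda n: 1 + h.get(n, manhattan(n, goal))   # step cost + estimate
        best = min(neighbours(cur), key=value)
        # raise h(cur) so already-visited dead ends become less attractive
        h[cur] = max(h.get(cur, manhattan(cur, goal)), value(best))
        cur, steps = best, steps + 1
    return steps

print("steps to reach the goal:", lrta_star(START, GOAL))
```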
Exploration of Multi-State Environments: Local Measures and Back-Propagation of Uncertainty, 1998
"... This paper presents an action selection technique for reinforcement learning in stationary Markovian environments. This technique may be used in direct algorithms such as Q-learning, or in indirect algorithms such as adaptive dynamic programming. It is based on two principles. The first is to defin ..."
Abstract
-
Cited by 54 (1 self)
This paper presents an action selection technique for reinforcement learning in stationary Markovian environments. This technique may be used in direct algorithms such as Q-learning, or in indirect algorithms such as adaptive dynamic programming. It is based on two principles. The first is to define a local measure of uncertainty using the theory of bandit problems. We show that such a measure suffers from several drawbacks; in particular, a direct application of it leads to low-quality algorithms that can be easily misled by particular configurations of the environment. The second principle was introduced to eliminate this drawback. It consists of treating the local measures of uncertainty as rewards and back-propagating them with dynamic programming or temporal-difference mechanisms. This reproduces global-scale reasoning about uncertainty using only local measures of it. Numerical simulations clearly show the effectiveness of these proposals.
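The sketch below illustrates the second principle on a toy chain MDP: a separate value function is learned whose "rewards" are local uncertainty bonuses, propagated with the same temporal-difference rule as the ordinary Q-values. The 1/sqrt(visit count) bonus, the chain dynamics, and the step sizes are illustrative assumptions rather than the paper's exact measures.

```python
# Minimal sketch, assuming a toy chain MDP: keep a second value function U whose
# "rewards" are local uncertainty bonuses and propagate it with the same TD rule
# as Q, so actions leading toward poorly explored regions also look attractive.
import random
from collections import defaultdict

random.seed(0)
N_STATES, ACTIONS = 6, (0, 1)                 # action 1 moves right, action 0 moves left
ALPHA, GAMMA, BETA, EPISODES, HORIZON = 0.2, 0.95, 1.0, 200, 20

def step(s, a):
    """Chain dynamics: only reaching the far-right state pays off."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

Q = defaultdict(float)                        # ordinary action values
U = defaultdict(float)                        # back-propagated uncertainty values
N = defaultdict(int)                          # visit counts -> local uncertainty measure

for _ in range(EPISODES):
    s = 0
    for _ in range(HORIZON):
        # act greedily on the combined criterion: exploitation + propagated uncertainty
        a = max(ACTIONS, key=lambda a_: Q[(s, a_)] + BETA * U[(s, a_)])
        s2, r = step(s, a)
        N[(s, a)] += 1
        bonus = 1.0 / N[(s, a)] ** 0.5        # local uncertainty "reward" (assumed form)
        q_next = max(Q[(s2, a_)] for a_ in ACTIONS)
        u_next = max(U[(s2, a_)] for a_ in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * q_next - Q[(s, a)])
        U[(s, a)] += ALPHA * (bonus + GAMMA * u_next - U[(s, a)])
        s = s2

print("greedy policy:", [max(ACTIONS, key=lambda a_: Q[(s, a_)]) for s in range(N_STATES)])
```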
Reinforcement Learning in Finite MDPs: PAC Analysis
"... Editor: We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These “PAC-MDP ” algorithms include the well-known E 3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current ..."
Abstract
-
Cited by 52 (6 self)
We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These “PAC-MDP” algorithms include the well-known E3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state of the art by presenting bounds for the problem in a unified theoretical framework. We also present a more refined analysis that yields insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX. Finally, we conclude with open problems.
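To illustrate the optimism principle shared by these algorithms, the sketch below implements an R-MAX-style learner on a toy chain: any state-action pair visited fewer than m times is treated as maximally rewarding, so the greedy policy is driven to try it. The chain MDP, the known-ness threshold, and the replanning schedule are assumptions for this example, not the analyzed algorithms' exact parameters.

```python
# Minimal sketch, assuming a toy chain MDP: R-MAX-style optimism under uncertainty.
# Pairs with fewer than M_KNOWN visits get the optimistic value RMAX / (1 - GAMMA),
# which drives the greedy policy toward unexplored parts of the state space.
import numpy as np

N_S, N_A = 6, 2
RMAX, GAMMA, M_KNOWN, STEPS = 1.0, 0.95, 5, 300

def true_step(s, a):
    """Hidden dynamics: action 1 moves right (reward at the end), action 0 resets."""
    s2 = min(s + 1, N_S - 1) if a == 1 else 0
    return s2, (RMAX if s2 == N_S - 1 else 0.0)

counts = np.zeros((N_S, N_A))
trans = np.zeros((N_S, N_A, N_S))             # empirical transition counts
rew = np.zeros((N_S, N_A))                    # empirical summed rewards

def plan():
    """Value iteration on the optimistic empirical model."""
    V = np.zeros(N_S)
    Q = np.zeros((N_S, N_A))
    for _ in range(100):
        for s in range(N_S):
            for a in range(N_A):
                if counts[s, a] < M_KNOWN:    # unknown pair: assume the best
                    Q[s, a] = RMAX / (1 - GAMMA)
                else:
                    p = trans[s, a] / counts[s, a]
                    Q[s, a] = rew[s, a] / counts[s, a] + GAMMA * (p @ V)
        V = Q.max(axis=1)
    return Q

s = 0
for _ in range(STEPS):
    Q = plan()                                # replanning every step is fine at this scale
    a = int(Q[s].argmax())
    s2, r = true_step(s, a)
    counts[s, a] += 1
    trans[s, a, s2] += 1
    rew[s, a] += r
    s = s2

print("learned greedy policy:", plan().argmax(axis=1))
```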
Auto-exploratory Average Reward Reinforcement Learning - Artificial Intelligence, 1996
"... We introduce a model-based average reward Reinforcement Learning method called H-learning and compare it with its discounted counterpart, Adaptive Real-Time Dynamic Programming, in a simulated robot scheduling task. We also introduce an extension to H-learning, which automatically explores the unexp ..."
Abstract
-
Cited by 36 (10 self)
We introduce a model-based average reward Reinforcement Learning method called H-learning and compare it with its discounted counterpart, Adaptive Real-Time Dynamic Programming, in a simulated robot scheduling task. We also introduce an extension to H-learning, which automatically explores the unexplored parts of the state space, while always choosing greedy actions with respect to the current value function. We show that this "Auto-exploratory H-learning" performs better than the original H-learning under previously studied exploration methods such as random, recency-based, or counter-based exploration.

Introduction: Reinforcement Learning (RL) is the study of learning agents that improve their performance at some task by receiving rewards and punishments from the environment. Most approaches to reinforcement learning, including Q-learning (Watkins and Dayan 92) and Adaptive Real-Time Dynamic Programming (ARTDP) (Barto, Bradtke, & Singh 95), optimize the total discounted reward the ...
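The sketch below illustrates the average-reward criterion that H-learning optimizes. H-learning itself is model-based; for brevity this uses a model-free relative-value update in the style of R-learning, and the two-loop task, step sizes, and exploration rate are assumptions made for this example.

```python
# Minimal sketch of average-reward RL, assuming a toy two-loop task.  H-learning
# is model-based; this model-free relative-value update (R-learning style) optimizes
# the same average-reward criterion and is used here only for brevity.
import random
from collections import defaultdict

random.seed(0)
ACTIONS = (0, 1)
ALPHA, BETA, EPS, STEPS = 0.1, 0.01, 0.1, 50_000

def step(s, a):
    """Action 0: play it safe, return to state 0 and collect 1 per step.
    Action 1: traverse a two-step loop (0 -> 1 -> 0) that pays 5 on completion,
    an average of 2.5 per step."""
    if a == 0:
        return 0, 1.0
    if s == 0:
        return 1, 0.0
    return 0, 5.0

R = defaultdict(float)                        # relative action values
rho = 0.0                                     # estimate of the average reward (gain)
s = 0
for _ in range(STEPS):
    greedy = max(ACTIONS, key=lambda a_: R[(s, a_)])
    a = greedy if random.random() > EPS else random.choice(ACTIONS)
    s2, r = step(s, a)
    best_next = max(R[(s2, a_)] for a_ in ACTIONS)
    R[(s, a)] += ALPHA * (r - rho + best_next - R[(s, a)])
    if a == greedy:                           # update the gain only on greedy steps
        rho += BETA * (r + best_next - max(R[(s, a_)] for a_ in ACTIONS) - rho)
    s = s2

print(f"estimated average reward: {rho:.2f}  (the optimal gain in this toy is 2.5)")
```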
Potential-based shaping and Q-value initialization are equivalent - Journal of Artificial Intelligence Research, 2003
"... ..."
Reinforcement Learning by Policy Search, 2000
"... One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations could be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are know ..."
Abstract
-
Cited by 30 (2 self)
One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations could be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are known as partially observable Markov decision processes (POMDPs). While the environment's dynamics are assumed to obey certain rules, the agent does not know them and must learn. In this dissertation we focus on the agent's adaptation as captured by the reinforcement learning framework. Reinforcement learning means learning a policy---a mapping of observations into actions---based on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. The set of policies being searched is constrained by the architecture of the agent's controller. POMDPs require a controller to have a memory. We investigate various architectures for controllers with memory, including controllers with external memory, finite-state controllers, and distributed controllers for multi-agent systems. For these various controllers we work out the details of the algorithms which learn by ascending the gradient of expected cumulative reinforcement. Building on statistical learning theory and experiment design theory, a policy evaluation algorithm is developed for the case of experience re-use. We address the question of sufficient experience for uniform convergence of policy evaluation and obtain sample complexity bounds for various estimators. Finally, we demonstrate the performance of the proposed algorithms on several domains, the most complex of which is simulated adaptive packet routing in a telecommunication network.
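As a concrete illustration of policy search by gradient ascent, here is a minimal REINFORCE sketch with a memoryless tabular softmax policy on a toy chain. The dissertation studies richer memory-based controllers for POMDPs; the chain task, learning rate, and episode counts below are assumptions for this example only.

```python
# Minimal sketch, assuming a toy 4-state chain: REINFORCE with a tabular softmax
# policy, i.e. ascending a sampled gradient of the expected cumulative reward.
import numpy as np

rng = np.random.default_rng(0)
N_S, N_A, GAMMA, LR, EPISODES, HORIZON = 4, 2, 0.99, 0.05, 4000, 20
theta = np.zeros((N_S, N_A))                  # policy parameters (softmax logits)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    """Chain: action 1 moves right, action 0 moves left; reward only at the far end."""
    s2 = min(s + 1, N_S - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_S - 1 else 0.0)

for _ in range(EPISODES):
    s, traj = 0, []
    for _ in range(HORIZON):                  # sample one trajectory from the current policy
        probs = softmax(theta[s])
        a = rng.choice(N_A, p=probs)
        s2, r = step(s, a)
        traj.append((s, a, r))
        s = s2
    # REINFORCE update: each log-probability gradient is weighted by the return-to-go
    G = 0.0
    for s, a, r in reversed(traj):
        G = r + GAMMA * G
        probs = softmax(theta[s])
        grad = -probs
        grad[a] += 1.0                        # gradient of log pi(a|s) w.r.t. theta[s]
        theta[s] += LR * G * grad

print("greedy actions per state:", theta.argmax(axis=1))   # expect mostly 1 (move right)
```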