The Effect of Representation and Knowledge on Goal-Directed Exploration with Reinforcement-Learning Algorithms (1996)

by S. Koenig, R. G. Simmons
Venue: Machine Learning

Citing documents, results 1 - 10 of 63:

Relational reinforcement learning

by Saso Dzeroski, Luc De Raedt, Kurt Driessens, 2000
"... ..."
Abstract - Cited by 137 (9 self) - Add to MetaCart
Abstract not found

Coastal Navigation with Mobile Robots

by Nicholas Roy, Sebastian Thrun, 2000
"... The problem that we address in this paper is how a mobile robot can plan in order to arrive at its goal with minimum uncertainty. Traditional motion planning algorithms often assume that a mobile robot can track its position reliably, however, in real world situations, reliable localization may not ..."
Abstract - Cited by 95 (20 self) - Add to MetaCart
The problem that we address in this paper is how a mobile robot can plan in order to arrive at its goal with minimum uncertainty. Traditional motion planning algorithms often assume that a mobile robot can track its position reliably; in real-world situations, however, reliable localization may not always be feasible. Partially Observable Markov Decision Processes (POMDPs) provide one way to maximize the certainty of reaching the goal state, but at the cost of computational intractability for large state spaces. The method we propose explicitly models the uncertainty of the robot's position as a state variable and generates trajectories through the augmented pose-uncertainty space. By minimizing the positional uncertainty at the goal, the robot reduces the likelihood that it becomes lost. We demonstrate experimentally that coastal navigation reduces the uncertainty at the goal, especially with degraded localization.
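
The augmented-MDP idea in this abstract lends itself to a short sketch: treat a discretized measure of positional uncertainty as an extra state variable and run value iteration over the joint pose-uncertainty space, rewarding arrival at the goal more when uncertainty is low. The one-dimensional corridor, uncertainty levels, and "coastline" relocalization rule below are illustrative assumptions, not the model used by Roy and Thrun.

    # Sketch: value iteration over an augmented pose-uncertainty space.
    # The corridor, uncertainty levels, and relocalization rule are
    # illustrative assumptions, not the authors' model.
    import numpy as np

    N_POS, N_UNC = 10, 4        # corridor cells x discretized uncertainty levels
    GOAL, GAMMA = N_POS - 1, 0.95
    ACTIONS = (-1, +1)          # move left / move right

    def step(pos, unc, a):
        # Moving drifts uncertainty upward; cell 0 acts as a "coastline"
        # landmark that relocalizes the robot (uncertainty -> 0).
        npos = min(max(pos + a, 0), N_POS - 1)
        nunc = 0 if npos == 0 else min(unc + 1, N_UNC - 1)
        return npos, nunc

    def reward(pos, unc):
        # Reaching the goal is worth more the lower the uncertainty there.
        return 10.0 - 3.0 * unc if pos == GOAL else -1.0

    V = np.zeros((N_POS, N_UNC))
    for _ in range(200):        # value iteration over the augmented space
        for p in range(N_POS):
            for u in range(N_UNC):
                V[p, u] = max(reward(*step(p, u, a)) + GAMMA * V[step(p, u, a)]
                              for a in ACTIONS)

In such a model, the optimal policy prefers detours past the relocalizing landmark when arriving at the goal with high uncertainty would be penalized, which is the qualitative behavior the abstract describes.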

Finding Approximate POMDP Solutions Through Belief Compression

by Nicholas Roy, Geoffrey Gordon, Sebastian Thrun - Journal of Artificial Intelligence Research, 2005
"... Abstract Standard value function approaches to finding policies for Partially Observable Markov Decision Processes (POMDPs) are generally considered to be intractable for large models. The intractability of these algorithms is to a large extent a consequence of computing an exact, optimal policy ov ..."
Abstract - Cited by 85 (3 self) - Add to MetaCart
Standard value function approaches to finding policies for Partially Observable Markov Decision Processes (POMDPs) are generally considered to be intractable for large models. The intractability of these algorithms is to a large extent a consequence of computing an exact, optimal policy over the entire belief space. However, in real-world POMDP problems, computing the optimal policy for the full belief space is often unnecessary for good control, even for problems with complicated policy classes. The beliefs experienced by the controller often lie near a structured, low-dimensional subspace embedded in the high-dimensional belief space. Finding a good approximation to the optimal value function for only this subspace can be much easier than computing the full value function. We introduce a new method for solving large-scale POMDPs by reducing the dimensionality of the belief space using Exponential family Principal Components Analysis, and we demonstrate the use of this algorithm on a synthetic problem and on mobile robot navigation tasks.

Citation Context

...d as a compromise between active localisation and conventional distance-optimal planning. Similarly, POMDP-style planners have been used to recover from localisation failure (Nourbakhsh et al., 1995; Koenig and Simmons, 1996; Takeda et al., 1994), but the greedy nature of the heuristics these planners use prevents the global style of planning that the augmented MDP generates. In topological environments, full POMDP plann...
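
To make the abstract above concrete, here is a minimal belief-compression sketch. Ordinary PCA stands in for the Exponential-family PCA the paper actually uses, and the randomly sampled beliefs are placeholders for beliefs collected by a controller; both are assumptions for illustration only.

    # Sketch: compress sampled POMDP beliefs into a low-dimensional feature space.
    # Plain PCA is a stand-in for the Exponential-family PCA used in the paper.
    import numpy as np

    rng = np.random.default_rng(0)
    # 500 sparse beliefs over a 200-state POMDP (placeholder data).
    beliefs = rng.dirichlet(alpha=0.1 * np.ones(200), size=500)

    k = 5                                   # target dimensionality
    mean = beliefs.mean(axis=0)
    U, S, Vt = np.linalg.svd(beliefs - mean, full_matrices=False)
    features = (beliefs - mean) @ Vt[:k].T  # low-dimensional belief features
    approx = features @ Vt[:k] + mean       # reconstructed beliefs

    print("mean reconstruction error:", np.abs(approx - beliefs).mean())
    # Planning then proceeds over the k-dimensional feature space (e.g. by
    # discretizing it and building an approximate MDP) instead of the full
    # 200-dimensional belief simplex.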

Autonomous shaping: knowledge transfer in reinforcement learning

by George Konidaris, Andrew Barto - In Int. Conference on Machine Learning, 2006
"... All in-text references underlined in blue are linked to publications on ResearchGate, letting you access and read them immediately. ..."
Abstract - Cited by 66 (5 self) - Add to MetaCart
All in-text references underlined in blue are linked to publications on ResearchGate, letting you access and read them immediately.

Citation Context

...rience across several tasks without having to have it specified in advance. 2.2. Sequences of Goal-directed Tasks In this paper we are concerned with a sequence of goal directed exploration problems (Koenig & Simmons, 1996). In each, the agent is in an environment (characterized by a set of states and actions and transition probability and reward functions) and must get to some goal state s′, where it will receive a ...

Agent-centered search

by Sven Koenig - Artificial Intelligence Magazine
"... In this article, we describe agent-centered search (sometimes also called real-time search or local search) and illustrate this planning paradigm with examples. Agent-centered search methods interleave planning and plan execution and restrict planning to the part of the domain around the current sta ..."
Abstract - Cited by 55 (5 self) - Add to MetaCart
In this article, we describe agent-centered search (sometimes also called real-time search or local search) and illustrate this planning paradigm with examples. Agent-centered search methods interleave planning and plan execution and restrict planning to the part of the domain around the current state of the agent, for example, the current location of a mobile robot or the current board position of a game. They can execute actions in the presence of time constraints and often have a small sum of planning and execution cost, both because they trade off planning and execution cost and because they allow agents to gather information early in nondeterministic domains, which reduces the amount of planning they have to perform for unencountered situations. These advantages become important as more intelligent systems are interfaced ...

Citation Context

... the execution cost of LRTA* until it reaches a goal state and how it depends on the informedness of the initial state values and the topology of the state space is given in (Koenig and Simmons 1995; Koenig and Simmons 1996a). This analysis yields insights into when agent-centered search methods solve planning tasks in deterministic domains efficiently. For example, LRTA* tends to be more efficient the more informed the...
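
LRTA*, the algorithm analysed in the context above, is concrete enough for a short sketch: the agent repeatedly performs a one-step lookahead from its current state, raises the stored value of that state to agree with its best successor, and moves there. The toy grid, unit edge costs, and zero-initialized heuristic are illustrative choices, not the experimental setup of the cited work.

    # Sketch of LRTA* (Learning Real-Time A*) on a toy grid.
    def lrta_star(succ, cost, h, start, goal, max_steps=10_000):
        """succ(s) -> successors of s; cost(s, t) -> edge cost;
        h: dict of state values, updated in place."""
        s, trajectory = start, [start]
        for _ in range(max_steps):
            if s == goal:
                return trajectory
            # One-step lookahead: pick the successor minimizing c(s, t) + h(t).
            best = min(succ(s), key=lambda t: cost(s, t) + h.get(t, 0.0))
            # Value update: make h(s) consistent with the best successor.
            h[s] = max(h.get(s, 0.0), cost(s, best) + h.get(best, 0.0))
            s = best
            trajectory.append(s)
        return trajectory

    # Toy 5x5 grid, 4-connected, unit edge costs, uninformed (zero) heuristic.
    def succ(s):
        x, y = s
        return [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= x + dx < 5 and 0 <= y + dy < 5]

    path = lrta_star(succ, lambda s, t: 1.0, h={}, start=(0, 0), goal=(4, 4))

With a better-informed initial h, the agent wanders less before reaching the goal, which is the kind of dependence on informedness that the cited analysis quantifies.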

Exploration of Multi-State Environments: Local Measures and Back-Propagation of Uncertainty

by Nicolas Meuleau, Sridhar Mahadevan, 1998
"... This paper presents an action selection technique for reinforcement learning in stationary Markovian environments. This technique may be used in direct algorithms such as Q-learning, or in indirect algorithms such as adaptive dynamic programming. It is based on two principles. The first is to defin ..."
Abstract - Cited by 54 (1 self) - Add to MetaCart
This paper presents an action selection technique for reinforcement learning in stationary Markovian environments. This technique may be used in direct algorithms such as Q-learning, or in indirect algorithms such as adaptive dynamic programming. It is based on two principles. The first is to define a local measure of uncertainty using the theory of bandit problems. We show that such a measure suffers from several drawbacks; in particular, a direct application of it leads to low-quality algorithms that can easily be misled by particular configurations of the environment. The second principle was introduced to eliminate this drawback. It consists of treating the local measures of uncertainty as rewards and back-propagating them with dynamic programming or temporal-difference mechanisms. This allows global-scale reasoning about uncertainty to be reproduced using only local measures of it. Numerical simulations clearly show the effectiveness of these proposals.
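
The second principle in the abstract, treating local uncertainty measures as rewards and propagating them with the usual backups, can be sketched as Q-learning on a combined reward-plus-bonus signal. The count-based bonus below is a simplified stand-in for the bandit-theoretic uncertainty measure the paper actually defines.

    # Sketch: back-propagating a local uncertainty bonus with Q-learning.
    from collections import defaultdict

    ALPHA, GAMMA, BETA = 0.1, 0.95, 1.0
    Q = defaultdict(float)          # value of (state, action) under reward + bonus
    N = defaultdict(int)            # visit counts per (state, action)

    def bonus(s, a):
        # Simplified local measure of uncertainty (shrinks with experience).
        return BETA / (1 + N[(s, a)]) ** 0.5

    def update(s, a, r, s_next, actions):
        # One TD backup on the combined signal; the uncertainty bonus is thus
        # propagated backwards through the value function like a reward.
        N[(s, a)] += 1
        target = r + bonus(s, a) + GAMMA * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])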

Reinforcement Learning in Finite MDPs: PAC Analysis

by Alexander L. Strehl, Lihong Li, Michael L. Littman
"... Editor: We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These “PAC-MDP ” algorithms include the well-known E 3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current ..."
Abstract - Cited by 52 (6 self) - Add to MetaCart
We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These “PAC-MDP” algorithms include the well-known E^3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state of the art by presenting bounds for the problem in a unified theoretical framework. We also present a more refined analysis that yields insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX. Finally, we conclude with open problems.

Citation Context

...se behavior of these algorithms. Developing lower bounds, especially matching lower bounds, tells us what can (or cannot) be achieved. Although matching lower bounds are known for deterministic MDPs (Koenig and Simmons, 1996; Kakade, 2003), it remains an open question for general MDPs. The previous best lower bound is due to Kakade (2003), and was developed for the slightly different notion of H-horizon value functions i...
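
For concreteness, here is a sketch of the optimism mechanism shared by the model-based PAC-MDP algorithms the survey covers: R-MAX treats any state-action pair with fewer than m samples as maximally rewarding, which drives the agent to gather exactly the samples the PAC bounds count. The threshold, tabular bookkeeping, and omitted planner are placeholder choices, not the survey's notation.

    # Sketch of the R-MAX "known state-action" rule; m and R_MAX are placeholders.
    from collections import defaultdict

    m, R_MAX = 20, 1.0
    counts = defaultdict(int)                        # n(s, a)
    rew_sum = defaultdict(float)                     # accumulated reward per (s, a)
    trans = defaultdict(lambda: defaultdict(int))    # n(s, a, s')

    def record(s, a, r, s_next):
        counts[(s, a)] += 1
        rew_sum[(s, a)] += r
        trans[(s, a)][s_next] += 1

    def model(s, a, states):
        # Empirical model once (s, a) is known; optimistic self-loop otherwise.
        n = counts[(s, a)]
        if n < m:
            return R_MAX, {s: 1.0}                   # fictitious max-reward loop
        return rew_sum[(s, a)] / n, {t: trans[(s, a)][t] / n for t in states}

    # The agent replans (e.g. value iteration on `model`) whenever some (s, a)
    # crosses the threshold m and becomes known.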

Auto-exploratory Average Reward Reinforcement Learning

by Dokyeong Ok, Prasad Tadepalli - Artificial Intelligence, 1996
"... We introduce a model-based average reward Reinforcement Learning method called H-learning and compare it with its discounted counterpart, Adaptive Real-Time Dynamic Programming, in a simulated robot scheduling task. We also introduce an extension to H-learning, which automatically explores the unexp ..."
Abstract - Cited by 36 (10 self) - Add to MetaCart
We introduce a model-based average reward Reinforcement Learning method called H-learning and compare it with its discounted counterpart, Adaptive Real-Time Dynamic Programming, in a simulated robot scheduling task. We also introduce an extension to H-learning, which automatically explores the unexplored parts of the state space, while always choosing greedy actions with respect to the current value function. We show that this "Auto-exploratory H-learning" performs better than the original H-learning under previously studied exploration methods such as random, recency-based, or counter-based exploration.
Introduction: Reinforcement Learning (RL) is the study of learning agents that improve their performance at some task by receiving rewards and punishments from the environment. Most approaches to reinforcement learning, including Q-learning (Watkins and Dayan 92) and Adaptive Real-Time Dynamic Programming (ARTDP) (Barto, Bradtke, & Singh 95), optimize the total discounted reward the ...
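
The H-learning backup referred to in the abstract can be stated compactly: the agent maintains state values h(s) and an estimate rho of the average reward per step, and repeatedly sets h(s) to the maximum over actions of [estimated reward - rho + expected h of the successor]. A minimal tabular sketch follows; the learned model (r_hat, p_hat) and the schematic rho update are assumptions about the bookkeeping, not the paper's exact notation.

    # Sketch of the H-learning backup for average-reward RL.
    def h_backup(s, actions, r_hat, p_hat, h, rho):
        """One backup of h(s): r_hat[s][a] is the estimated reward,
        p_hat[s][a] a dict of estimated transition probabilities."""
        return max(r_hat[s][a] - rho
                   + sum(p * h.get(t, 0.0) for t, p in p_hat[s][a].items())
                   for a in actions)

    def update_rho(rho, r, h_s, h_s_next, alpha=0.05):
        # Schematic average-reward update, applied on greedy steps.
        return rho + alpha * (r + h_s_next - h_s - rho)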

Potential-based shaping and Q-value initialization are equivalent

by E. Wiewiora - Journal of Artificial Intelligence Research, 2003
"... ..."
Abstract - Cited by 31 (0 self) - Add to MetaCart
Abstract not found

Reinforcement Learning by Policy Search

by Leonid Peshkin, 2000
"... One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations could be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are know ..."
Abstract - Cited by 30 (2 self) - Add to MetaCart
One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations could be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are known as partially observable Markov decision processes (POMDPs). While the environment's dynamics are assumed to obey certain rules, the agent does not know them and must learn. In this dissertation we focus on the agent's adaptation as captured by the reinforcement learning framework. Reinforcement learning means learning a policy---a mapping of observations into actions---based on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. The set of policies being searched is constrained by the architecture of the agent's controller. POMDPs require a controller to have memory. We investigate various architectures for controllers with memory, including controllers with external memory, finite-state controllers, and distributed controllers for multi-agent systems. For these various controllers we work out the details of the algorithms that learn by ascending the gradient of expected cumulative reinforcement. Building on statistical learning theory and experiment design theory, a policy evaluation algorithm is developed for the case of experience re-use. We address the question of sufficient experience for uniform convergence of policy evaluation and obtain sample complexity bounds for various estimators. Finally, we demonstrate the performance of the proposed algorithms on several domains, the most complex of which is simulated adaptive packet routing in a telecommunication network.
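
The gradient-ascent learning the dissertation describes can be illustrated with a standard likelihood-ratio (REINFORCE-style) estimator for a memoryless softmax policy over observations; the environment interface, parameterisation, and learning rate below are assumptions made for a self-contained sketch, not the memory-based controllers studied in the thesis.

    # Sketch: gradient ascent on expected cumulative reward for a softmax
    # policy pi(a | o) = softmax(theta[o]); interface and shapes are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    def run_episode(env_reset, env_step, theta, horizon=100):
        # Collect (observation, action, reward) triples under the current policy.
        o, traj = env_reset(), []
        for _ in range(horizon):
            probs = softmax(theta[o])
            a = rng.choice(len(probs), p=probs)
            o_next, r, done = env_step(o, a)
            traj.append((o, a, r))
            o = o_next
            if done:
                break
        return traj

    def reinforce_update(theta, traj, lr=0.01):
        ret = sum(r for _, _, r in traj)      # episode return
        for o, a, _ in traj:
            grad = -softmax(theta[o])
            grad[a] += 1.0                    # d log pi(a|o) / d theta[o]
            theta[o] += lr * ret * grad       # ascend the gradient estimate
        return theta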