Results 1-10 of 5,027
Finite-time analysis of the multiarmed bandit problem
Machine Learning, 2002
Cited by 687 (13 self)
Abstract. Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. A popular measure of a policy’s success in addressing this dilemma is the regret, that is the loss due to the fact that the globally optimal policy is not followed all the times. One of the simplest examples of the exploration/exploitation dilemma is the multiarmed bandit problem. Lai and Robbins were the first ones to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support. Keywords: bandit problems, adaptive allocation rules, finite horizon regret
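As a rough illustration of the kind of simple index policy this abstract refers to, the sketch below implements UCB1, the best-known policy from this paper, on Bernoulli arms. The arm probabilities, horizon, seed, and the function name `ucb1` are assumptions made for the example and do not come from the listing.

```python
import math
import random


def ucb1(arms, horizon, seed=0):
    """Run UCB1 for `horizon` plays on Bernoulli arms (hypothetical setup).

    `arms` holds the success probability of each arm.
    Returns (pull counts per arm, cumulative expected regret).
    """
    rng = random.Random(seed)
    n_arms = len(arms)
    counts = [0] * n_arms    # number of times each arm was played
    means = [0.0] * n_arms   # empirical mean reward of each arm
    best = max(arms)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialisation: play every arm once
        else:
            plays_so_far = t - 1
            # Index = empirical mean + confidence radius sqrt(2 ln n / n_i).
            arm = max(
                range(n_arms),
                key=lambda i: means[i]
                + math.sqrt(2.0 * math.log(plays_so_far) / counts[i]),
            )
        reward = 1.0 if rng.random() < arms[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
        regret += best - arms[arm]  # expected loss of this play
    return counts, regret


counts, regret = ucb1([0.2, 0.5, 0.8], horizon=5000)
print(counts, round(regret, 1))
```

The confidence radius shrinks as an arm is played more often, so suboptimal arms are revisited only about logarithmically often, which is the behaviour the regret bound describes.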
Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning
Artificial Intelligence, 1999
Cited by 536 (38 self)
Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key, longstanding challenges for AI. In this paper we consider how these challenges can be addressed within the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action in this framework to include options: closed-loop policies for taking action over a period of time. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Overall, we show that options enable temporally abstract knowledge and action to be included in the reinforcement learning framework in a natural and general way. In particular, we show that options may be used interchangeably with primitive actions in planning methods such as dynamic programming and in learning methods such as Q-learning.
Decision-Theoretic Planning: Structural Assumptions and Computational Leverage
Journal of Artificial Intelligence Research, 1999
Cited by 490 (4 self)
Planning under uncertainty is a central problem in the study of automated sequential decision making, and has been addressed by researchers in many different fields, including AI planning, decision analysis, operations research, control theory and economics. While the assumptions and perspectives adopted in these areas often differ in substantial ways, many planning problems of interest to researchers in these fields can be modeled as Markov decision processes (MDPs) and analyzed using the techniques of decision theory. This paper presents an overview and synthesis of MDP-related methods, showing how they provide a unifying framework for modeling many classes of planning problems studied in AI. It also describes structural properties of MDPs that, when exhibited by particular classes of problems, can be exploited in the construction of optimal or approximately optimal policies or plans. Planning problems commonly possess structure in the reward and value functions used to de...
Ant algorithms for discrete optimization
Artificial Life, 1999
Cited by 434 (43 self)
This article presents an overview of recent work on ant algorithms, that is, algorithms for discrete optimization that took inspiration from the observation of ant colonies’ foraging behavior, and introduces the ant colony optimization (ACO) metaheuristic. In the first part of the article the basic biological findings on real ants are reviewed and their artificial counterparts as well as the ACO metaheuristic are defined. In the second part of the article a number of applications of ACO algorithms to combinatorial optimization and routing in communications networks are described. We conclude with a discussion of related work and of some of the most important aspects of the ACO metaheuristic.
Least-Squares Policy Iteration
Journal of Machine Learning Research, 2003
Cited by 427 (10 self)
We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration. This new approach ...
Policy Gradient Methods for Reinforcement Learning with Function Approximation
1999
Cited by 408 (20 self)
Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams’s REINFORCE method and actor–critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
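For orientation, the gradient form mentioned in this abstract is commonly written as below. The symbols are assumptions introduced here, since the abstract defines none: \rho is the policy's performance, \theta the policy parameters, d^{\pi} the state distribution induced by the policy \pi, and Q^{\pi} its action-value function.

```latex
\frac{\partial \rho}{\partial \theta}
  = \sum_{s} d^{\pi}(s) \sum_{a}
    \frac{\partial \pi(s,a)}{\partial \theta}\, Q^{\pi}(s,a)
```

No derivative of d^{\pi} appears on the right-hand side, which is what makes the expression suitable for estimation from sampled experience, with an approximate action-value or advantage function standing in for Q^{\pi}.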
Searching in Metric Spaces
1999
Cited by 407 (37 self)
The problem of searching the elements of a set which are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. A large number of solutions have been proposed in different areas, in many cases without cross-knowledge. Because of this, the same ideas have been reinvented several times, and very different presentations have been given for the same approaches. We ...
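The abstract does not describe a specific algorithm, so the following is only a minimal sketch of one common technique in this area: pivot-based range search, which uses the triangle inequality to skip distance computations. The function names, the Manhattan distance, and the grid data are all assumptions chosen for the example, and at least one pivot is assumed.

```python
def manhattan(a, b):
    """Example metric on 2-D integer points (any metric would do)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def build_index(points, pivots, dist):
    """Precompute the distance from every point to every pivot."""
    return [[dist(p, piv) for piv in pivots] for p in points]


def range_search(query, radius, points, pivots, table, dist):
    """Return the points within `radius` of `query`, plus the number of
    point-to-query distances that pruning could not avoid."""
    q_to_pivot = [dist(query, piv) for piv in pivots]
    hits, evaluated = [], 0
    for p, row in zip(points, table):
        # Triangle inequality: |d(q, piv) - d(p, piv)| <= d(q, p).
        lower_bound = max(abs(a - b) for a, b in zip(q_to_pivot, row))
        if lower_bound > radius:
            continue  # p cannot be in range; skip the real distance
        evaluated += 1
        if dist(query, p) <= radius:
            hits.append(p)
    return hits, evaluated


points = [(x, y) for x in range(10) for y in range(10)]
pivots = [(0, 0), (9, 0)]
table = build_index(points, pivots, manhattan)
hits, evaluated = range_search((4, 4), 2, points, pivots, table, manhattan)
print(len(hits), "hits after", evaluated, "distance evaluations")
```

Only the metric axioms are used, so the same code applies to objects with no coordinates at all, such as strings under edit distance.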
Experiences with an Interactive Museum Tour-Guide Robot
1998
Cited by 321 (76 self)
This article describes the software architecture of an autonomous, interactive tour-guide robot. It presents a modular and distributed software architecture, which integrates localization, mapping, collision avoidance, planning, and various modules concerned with user interaction and Web-based telepresence. At its heart, the software approach relies on probabilistic computation, online learning, and anytime algorithms. It enables robots to operate safely, reliably, and at high speeds in highly dynamic environments, and does not require any modifications of the environment to aid the robot's operation. Special emphasis is placed on the design of interactive capabilities that appeal to people's intuition. The interface provides new means for human-robot interaction with crowds of people in public places, and it also provides people all around the world with the ability to establish a "virtual telepresence" using the Web. To illustrate our approach, results are reported obtained in mid...
Multiagent Reinforcement Learning: Theoretical Framework and an Algorithm
1998
Cited by 315 (4 self)
In this paper, we adopt general-sum stochastic games as a framework for multiagent reinforcement learning. Our work extends previous work by Littman on zero-sum stochastic games to a broader framework. We design a multiagent Q-learning method under this framework, and prove that it converges to a Nash equilibrium under specified conditions. This algorithm is useful for finding the optimal strategy when there exists a unique Nash equilibrium in the game. When there exist multiple Nash equilibria in the game, this algorithm should be combined with other learning techniques to find optimal strategies.
The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity
Psychological Review 109:679–709, 2002
Cited by 314 (15 self)
The authors present a unified account of 2 neural systems concerned with the development and expression of adaptive behaviors: a mesencephalic dopamine system for reinforcement learning and a “generic” error-processing system associated with the anterior cingulate cortex. The existence of the error-processing system has been inferred from the error-related negativity (ERN), a component of the event-related brain potential elicited when human participants commit errors in reaction-time tasks. The authors propose that the ERN is generated when a negative reinforcement learning signal is conveyed to the anterior cingulate cortex via the mesencephalic dopamine system and that this signal is used by the anterior cingulate cortex to modify performance on the task at hand. They provide support for this proposal using both computational modeling and psychophysiological experimentation. Human beings learn from the consequences of their actions. Thorndike (1911/1970) originally described this phenomenon with his law of effect, which made explicit the commonsense notion that actions that are followed by feelings of satisfaction are more likely to be generated again in the future, whereas actions that are followed by negative outcomes are less likely to reoccur. This ...