Results 1–10 of 107
Classifier fitness based on accuracy
Evolutionary Computation, 1995
Cited by 284 (16 self)
Abstract:
In many classifier systems, the classifier strength parameter serves as a predictor of future payoff and as the classifier’s fitness for the genetic algorithm. We investigate a classifier system, XCS, in which each classifier maintains a prediction of expected payoff, but the classifier’s fitness is given by a measure of the prediction’s accuracy. The system executes the genetic algorithm in niches defined by the match sets, instead of panmictically. These aspects of XCS result in its population tending to form a complete and accurate mapping X × A ⇒ P from inputs and actions to payoff predictions. Further, XCS tends to evolve classifiers that are maximally general, subject to an accuracy criterion. Besides introducing a new direction for classifier system research, these properties of XCS make it suitable for a wide range of reinforcement learning situations where generalization over states is desirable.
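The accuracy-based fitness idea described above can be sketched in a few lines. This is an illustrative stand-in, not Wilson's full XCS: the parameter names (epsilon_0, alpha, nu) follow common XCS conventions, but the update below is deliberately simplified.

```python
def accuracy(error, epsilon_0=0.01, alpha=0.1, nu=5.0):
    """Map a classifier's payoff-prediction error to an accuracy value.

    Errors below the tolerance epsilon_0 count as fully accurate;
    above it, accuracy falls off as a power of the relative error.
    (Simplified sketch; not Wilson's exact formulation.)
    """
    if error < epsilon_0:
        return 1.0
    return alpha * (error / epsilon_0) ** -nu

def relative_fitness(errors):
    """Fitness of each classifier is its accuracy relative to the
    accuracies of the other classifiers in the same set."""
    ks = [accuracy(e) for e in errors]
    total = sum(ks)
    return [k / total for k in ks]
```

With errors [0.005, 0.02, 0.2], the first (accurate) classifier receives nearly all of the fitness, which is what drives selection toward accurate classifiers rather than merely high-paying ones.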
Ant colonies for the travelling salesman problem
1997
Cited by 156 (5 self)
Abstract:
We describe an artificial ant colony capable of solving the travelling salesman problem (TSP). Ants of the artificial colony are able to generate successively shorter feasible tours by using information accumulated in the form of a pheromone trail deposited on the edges of the TSP graph. Computer simulations demonstrate that the artificial ant colony is capable of generating good solutions to both symmetric and asymmetric instances of the TSP. The method is an example, like simulated annealing, neural networks and evolutionary computation, of the successful use of a natural metaphor to design an optimization algorithm.
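A minimal sketch of the pheromone-trail mechanism described above. This is illustrative, not the paper's exact algorithm: parameter values and the precise deposit/evaporation rule here are assumptions, chosen for brevity.

```python
import math
import random

def tour_length(tour, dist):
    """Total length of a closed tour under the distance matrix."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]]
               for i in range(len(tour)))

def ant_colony_tsp(dist, n_ants=10, n_iters=50,
                   alpha=1.0, beta=2.0, rho=0.5, q=1.0, seed=0):
    """Return (best_tour, best_length) found by the colony."""
    rng = random.Random(seed)
    n = len(dist)
    tau = [[1.0] * n for _ in range(n)]      # pheromone on each edge
    best_tour, best_len = None, math.inf
    for _ in range(n_iters):
        tours = []
        for _ in range(n_ants):
            start = rng.randrange(n)
            tour = [start]
            unvisited = set(range(n)) - {start}
            while unvisited:
                i = tour[-1]
                cand = list(unvisited)
                # choice probability ~ pheromone^alpha * (1/distance)^beta
                weights = [tau[i][j] ** alpha * (1.0 / dist[i][j]) ** beta
                           for j in cand]
                j = rng.choices(cand, weights=weights)[0]
                tour.append(j)
                unvisited.remove(j)
            tours.append(tour)
        # evaporate, then deposit pheromone inversely to tour length,
        # so edges belonging to short tours accumulate trail
        for i in range(n):
            for j in range(n):
                tau[i][j] *= 1.0 - rho
        for tour in tours:
            length = tour_length(tour, dist)
            if length < best_len:
                best_tour, best_len = tour, length
            for i in range(n):
                a, b = tour[i], tour[(i + 1) % n]
                tau[a][b] += q / length
                tau[b][a] += q / length
    return best_tour, best_len
```

On a tiny symmetric instance (e.g. four cities at the corners of a unit square) the colony quickly converges on the perimeter tour; the paper's experiments cover far larger symmetric and asymmetric instances.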
A Feedback Control Structure for Online Learning Tasks
Robotics and Autonomous Systems, 1997
Cited by 59 (19 self)
Abstract:
This paper addresses adaptive control architectures for systems that respond autonomously to changing tasks. Such systems often have many sensory and motor alternatives, and behavior drawn from these produces solutions of varying quality. The objective is then to ground behavior in control laws which, combined with resources, enumerate closed-loop behavioral alternatives. Use of such controllers leads to analyzable and predictable composite systems, permitting the construction of abstract behavioral models. Here, discrete event system and reinforcement learning techniques are employed to constrain the behavioral alternatives and to synthesize behavior online. To illustrate this, a quadruped robot learning a turning gait subject to safety and kinematic constraints is presented. Keywords: Control Composition, DEDS, Reinforcement Learning, Walking.
1 Introduction
Behavior generation in complex sensorimotor systems can be viewed as a scheduling problem in which a policy for engaging resour...
Evolving Optimal Populations with XCS Classifier Systems
1996
Cited by 43 (10 self)
Abstract:
This work investigates some uses of self-monitoring in classifier systems (CS) using Wilson's recent XCS system as a framework. XCS is a significant advance in classifier systems technology which shifts the basis of fitness evaluation for the Genetic Algorithm (GA) from the strength of payoff prediction to the accuracy of payoff prediction. Initial work consisted of implementing an XCS system in Pop-11 and replicating published XCS multiplexer experiments from (Wilson 1995, 1996a). In subsequent original work, the XCS Optimality Hypothesis, which suggests that under certain conditions XCS systems can reliably evolve optimal populations (solutions), is proposed. An optimal population is one which accurately maps inputs and actions to reward predictions using the smallest possible set of classifiers. An optimal XCS population forms a complete mapping of the payoff environment in the reinforcement learning tradition, in contrast to traditional classifier systems which only seek to maximise ...
Exploration of Multi-State Environments: Local Measures and Back-Propagation of Uncertainty
1998
Cited by 42 (1 self)
Abstract:
This paper presents an action selection technique for reinforcement learning in stationary Markovian environments. This technique may be used in direct algorithms such as Q-learning, or in indirect algorithms such as adaptive dynamic programming. It is based on two principles. The first is to define a local measure of the uncertainty using the theory of bandit problems. We show that such a measure suffers from several drawbacks. In particular, a direct application of it leads to algorithms of low quality that can be easily misled by particular configurations of the environment. The second basic principle was introduced to eliminate this drawback. It consists of assimilating the local measures of uncertainty to rewards, and back-propagating them with the dynamic programming or temporal difference mechanisms. This allows reproducing global-scale reasoning about the uncertainty, using only local measures of it. Numerical simulations clearly show the efficiency of these propositions. Keywords: ...
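The second principle can be sketched concretely. In this hedged, illustrative version (the paper's bandit-theoretic measure is replaced by a simple count-based bonus, and transitions are deterministic for brevity), local uncertainty is treated as a reward and swept backwards by value iteration:

```python
def propagate_uncertainty(transitions, visits, gamma=0.9, n_sweeps=50):
    """transitions[s][a] -> successor state (deterministic for brevity);
    visits[s][a]      -> how often action a has been tried in state s.

    A count-based bonus 1/(1 + visits) stands in for the local
    uncertainty measure; value-iteration sweeps back-propagate it so
    that states *leading to* poorly explored regions also score high.
    """
    u = {s: 0.0 for s in transitions}
    for _ in range(n_sweeps):
        for s in transitions:
            u[s] = max(1.0 / (1 + visits[s][a]) + gamma * u[transitions[s][a]]
                       for a in transitions[s])
    return u
```

For a chain 0 → 1 → 2 in which only state 2 is unexplored, u[0] < u[1] < u[2]: the unexplored state is most attractive, but its predecessors inherit a discounted share of its uncertainty, producing the global-scale exploratory behavior the abstract describes.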
Learning to Trade via Direct Reinforcement
2001
Cited by 35 (1 self)
Abstract:
We present methods for optimizing portfolios, asset allocations, and trading systems based on direct reinforcement (DR). In this approach, investment decision making is viewed as a stochastic control problem, and strategies are discovered directly. We present an adaptive algorithm called recurrent reinforcement learning (RRL) for discovering investment policies. The need to build forecasting models is eliminated, and better trading performance is obtained. The direct reinforcement approach differs from dynamic programming and reinforcement algorithms such as TD-learning and Q-learning, which attempt to estimate a value function for the control problem. We find that the RRL direct reinforcement framework enables a simpler problem representation, avoids Bellman's curse of dimensionality, and offers compelling advantages in efficiency. We demonstrate how direct reinforcement can be used to optimize risk-adjusted investment returns (including the differential Sharpe ratio), while accounting for the effects of transaction costs. In extensive simulation work using real financial data, we find that our approach based on RRL produces better trading strategies than systems utilizing Q-learning (a value function method). Real-world applications include an intradaily currency trader and a monthly asset allocation system for the S&P 500 Stock Index and T-Bills.
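A hedged sketch of the objective such a direct approach optimizes: rather than learning a value function, score the policy's trades directly by a risk-adjusted return that charges transaction costs on every change of position. The cost model and names here are illustrative assumptions, not the paper's exact formulation (which optimizes a differential, online variant of the Sharpe ratio by gradient ascent through the recurrent position variable).

```python
def trading_returns(positions, price_changes, cost=0.001):
    """Per-period returns of a position sequence: positions[t] in
    {-1, 0, +1} is held over period t; each change of position is
    charged a cost proportional to the turnover (illustrative model)."""
    rets, prev = [], 0.0
    for pos, dp in zip(positions, price_changes):
        rets.append(pos * dp - cost * abs(pos - prev))
        prev = pos
    return rets

def sharpe_ratio(rets):
    """Mean return divided by its standard deviation: a simple
    risk-adjusted performance measure."""
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / len(rets)
    return mean / var ** 0.5 if var > 0 else 0.0
```

Because the turnover term penalizes frequent position flips, maximizing this objective directly discourages over-trading, one of the practical effects the abstract attributes to accounting for transaction costs.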
Collaborative Multiagent Reinforcement Learning by Payoff Propagation
Journal of Machine Learning Research, 2006
Cited by 32 (2 self)
Abstract:
In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a), which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcement-learning techniques, collectively called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection results in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences.
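The payoff decomposition the abstract relies on can be illustrated on the tree-structured special case of a chain of agents, where message passing is exact (on cyclic graphs the payoff propagation algorithm is approximate). This sketch is an illustrative assumption-laden stand-in, not the paper's implementation:

```python
def best_joint_action_chain(n_agents, n_actions, edge_f):
    """Maximize sum_i edge_f(i, a_i, a_{i+1}) over joint actions when the
    coordination graph is a chain. A backward sweep of messages followed
    by a forward greedy pass is exact here and linear in n_agents,
    instead of enumerating the exponential joint action space."""
    # m[i][a]: best payoff obtainable from agents i+1..n-1 given a_i = a
    m = [[0.0] * n_actions for _ in range(n_agents)]
    for i in range(n_agents - 2, -1, -1):
        for a in range(n_actions):
            m[i][a] = max(edge_f(i, a, b) + m[i + 1][b]
                          for b in range(n_actions))
    # forward pass: commit each agent to the maximizing action in turn
    joint = [max(range(n_actions), key=lambda a: m[0][a])]
    for i in range(n_agents - 1):
        a = joint[-1]
        joint.append(max(range(n_actions),
                         key=lambda b: edge_f(i, a, b) + m[i + 1][b]))
    return joint, m[0][joint[0]]
```

For a pure coordination payoff (each edge pays 1 when neighbors agree), four agents with three actions each settle on a common action and collect the full payoff of 3, without ever touching the 81 joint actions explicitly.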
An Algebraic Approach to Abstraction in Reinforcement Learning
2003
Cited by 32 (1 self)
Abstract:
To operate effectively in complex environments learning agents have to selectively ignore irrelevant details by forming useful abstractions. In this article we outline a formulation of abstraction for reinforcement learning approaches to stochastic sequential decision problems modeled as semi-Markov Decision Processes (SMDPs). Building on existing algebraic approaches, we propose the concept of SMDP homomorphism and argue that it provides a useful tool for a rigorous study of abstraction for SMDPs. We apply this framework to different classes of abstractions that arise in hierarchical systems and discuss relativized options, a framework for compactly specifying a related family of temporally extended actions. Additional details of this work are described in refs. [1, 2, 3].
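To make the algebraic idea concrete, here is a hedged sketch for the simpler MDP case (ignoring the holding times and action recodings that the SMDP version handles): a state aggregation h respects the model when h-equivalent states agree on rewards and on transition probabilities into each block of the induced partition.

```python
def respects_mdp_structure(states, actions, P, R, h):
    """P[s][a] -> dict of next-state probabilities; R[s][a] -> reward;
    h -> candidate state aggregation map. Checks that h-equivalent
    states are behaviorally identical (illustrative MDP-only check)."""
    blocks = {}
    for s in states:
        blocks.setdefault(h(s), []).append(s)
    targets = set(h(t) for t in states)
    for group in blocks.values():
        s0 = group[0]
        for s in group[1:]:
            for a in actions:
                if R[s][a] != R[s0][a]:
                    return False
                # compare block-aggregated transition probabilities
                for target in targets:
                    p0 = sum(p for t, p in P[s0][a].items() if h(t) == target)
                    p = sum(p for t, p in P[s][a].items() if h(t) == target)
                    if abs(p - p0) > 1e-9:
                        return False
    return True
```

When the check passes, the abstract model over the blocks of h supports planning and learning whose results lift back to the original MDP, which is the sense in which homomorphisms license abstraction.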
Action-Based Sensor Space Categorization for Robot Learning
In Proc. of IEEE/RSJ International Conference on Intelligent Robots and Systems 1996 (IROS '96), 1996
Cited by 27 (11 self)
Abstract:
Robot learning such as reinforcement learning generally needs a well-defined state space in order to converge. However, building such a state space is one of the main issues of robot learning because of the interdependence between state and action spaces, which resembles the well-known "chicken and egg" problem. This paper proposes a method of action-based state space construction for vision-based mobile robots. The basic ideas for coping with the interdependence are that we define a state as a cluster of input vectors from which the robot can reach the goal state, or a state already obtained, by a sequence of one kind of action primitive regardless of its length, and that this sequence is defined as one action. To realize these ideas, we need a large amount of data (experiences) from the robot, and we cluster the input vectors as hyper-ellipsoids so that the whole state space is segmented into a state transition map in terms of actions, from which the optimal action sequence is obtained. To show the validi...
Coordination Of Multiple Behaviors Acquired By A Vision-Based Reinforcement Learning
In Proc. of IEEE/RSJ/GI International Conference on Intelligent Robots and Systems 1994 (IROS '94), 1994
Cited by 26 (4 self)
Abstract:
A method is proposed which accomplishes a whole task consisting of plural subtasks by coordinating multiple behaviors acquired by vision-based reinforcement learning. First, individual behaviors which achieve the corresponding subtasks are independently acquired by Q-learning, a widely used reinforcement learning method. Each learned behavior can be represented by an action-value function in terms of the state of the environment and the robot's action. Next, three kinds of coordination of multiple behaviors are considered: simple summation of different action-value functions, switching action-value functions according to situations, and learning with previously obtained action-value functions as initial values of a new action-value function. A task of shooting a ball into the goal while avoiding collisions with an enemy is examined. The task can be decomposed into a ball-shooting subtask and a collision-avoiding subtask. These subtasks should be accomplished simultaneously, but they are not independe...
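The first coordination scheme, summation of action-value functions, is simple enough to sketch. Table contents and names below are illustrative, not taken from the paper:

```python
def combined_greedy_action(q_tables, state, actions):
    """Act greedily with respect to the sum of several independently
    learned action-value functions. q_tables is a list of dicts mapping
    (state, action) -> value; unseen pairs default to 0."""
    return max(actions,
               key=lambda a: sum(q.get((state, a), 0.0) for q in q_tables))
```

For example, if the shooting behavior mildly prefers moving forward but the avoidance behavior strongly penalizes it, the summed values select an action that trades off both subtasks, which is exactly why simple summation can fail when the subtasks are not independent.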