Results 1–10 of 23
Practical Issues in Temporal Difference Learning
 Machine Learning, 1992
Abstract

Cited by 363 (2 self)
This paper examines whether temporal difference methods for training connectionist networks, such as Sutton's TD(lambda) algorithm, can be successfully applied to complex real-world problems. A number of important practical issues are identified and discussed from a general theoretical perspective. These practical issues are then examined in the context of a case study in which TD(lambda) is applied to learning the game of backgammon from the outcome of self-play. This is apparently the first application of this algorithm to a complex non-trivial task. It is found that, with zero knowledge built in, the network is able to learn from scratch to play the entire game at a fairly strong intermediate level of performance, which is clearly better than conventional commercial programs and which in fact surpasses comparable networks trained on a massive human expert data set. This indicates that TD learning may work better in practice than one would expect based on current theory, and it suggests that further analysis of TD methods, as well as applications in other complex domains, may be worth investigating.
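The TD(lambda) algorithm named in the abstract can be sketched in its simplest tabular form. This is a generic illustration with a made-up two-state episode, not Tesauro's neural-network implementation:

```python
# Tabular TD(lambda) with accumulating eligibility traces: a minimal,
# generic sketch of the algorithm named in the abstract (Tesauro's
# version trains a neural network rather than a table).

def td_lambda(episode, V, alpha=0.1, gamma=1.0, lam=0.7):
    """episode: list of (state, reward, next_state); next_state is None at the end."""
    e = {s: 0.0 for s in V}                      # eligibility traces
    for s, r, s_next in episode:
        v_next = V[s_next] if s_next is not None else 0.0
        delta = r + gamma * v_next - V[s]        # TD error
        e[s] += 1.0                              # bump trace for the visited state
        for state in V:                          # credit all recently visited states
            V[state] += alpha * delta * e[state]
            e[state] *= gamma * lam              # decay traces
    return V

# Toy two-state episode: A -> B -> terminal, with reward 1 on the final step.
# A receives discounted credit for the final reward via its eligibility trace.
V = td_lambda([('A', 0, 'B'), ('B', 1, None)], {'A': 0.0, 'B': 0.0})
```

With alpha = 0.1 and lambda = 0.7, B gets the full update (0.1) while A gets the trace-decayed share (0.07).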
Co-Evolution in the Successful Learning of Backgammon Strategy
 Machine Learning, 1998
Abstract

Cited by 109 (24 self)
Following Tesauro's work on TD-Gammon, we used a 4,000-parameter feed-forward neural network to develop a competitive backgammon evaluation function. Play proceeds by a roll of the dice, application of the network to all legal moves, and choosing the move with the highest evaluation. However, no back-propagation, reinforcement, or temporal difference learning methods were employed. Instead we apply simple hill-climbing in a relative fitness environment. We start with an initial champion of all zero weights and proceed simply by playing the current champion network against a slightly mutated challenger and changing weights if the challenger wins. Surprisingly, this worked rather well. We investigate how the peculiar dynamics of this domain enabled a previously discarded weak method to succeed, by preventing suboptimal equilibria in a "meta-game" of self-learning. Keywords: co-evolution, backgammon, reinforcement, temporal difference learning, self-learning
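The hill-climbing loop described above (all-zero champion, mutate, replace on a win) can be sketched as follows. The backgammon match is replaced by an invented toy contest (the weight vector closer to a hidden target wins), so this illustrates the scheme, not the paper's actual evaluation:

```python
import random

def challenger_wins(champ, chall, target):
    """Toy stand-in for a backgammon match: whichever vector is closer to a
    hidden target wins.  In the paper, fitness is relative, decided by play."""
    d = lambda w: sum((wi - ti) ** 2 for wi, ti in zip(w, target))
    return d(chall) < d(champ)

def hill_climb(n_weights=5, generations=500, sigma=0.1, seed=0):
    rng = random.Random(seed)
    target = [rng.uniform(-1, 1) for _ in range(n_weights)]
    champ = [0.0] * n_weights                             # all-zero initial champion
    for _ in range(generations):
        chall = [w + rng.gauss(0, sigma) for w in champ]  # slightly mutated challenger
        if challenger_wins(champ, chall, target):
            champ = chall                                 # challenger takes over
    return champ, target

champ, target = hill_climb()
```

Because the champion only changes when the challenger wins, the champion's fitness never decreases, which is the property that makes this "weak" method viable in the relative-fitness setting the abstract describes.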
Round Robin Classification
, 2002
Abstract
Cited by 88 (18 self)
In this paper, we discuss round robin classification (aka pairwise classification), a technique for handling multiclass problems with binary classifiers by learning one classifier for each pair of classes. We present an empirical evaluation of the method, implemented as a wrapper around the Ripper rule learning algorithm, on 20 multiclass datasets from the UCI database repository. Our results show that the technique is very likely to improve Ripper's classification accuracy without having a high risk of decreasing it. More importantly, we give a general theoretical analysis of the complexity of the approach and show that its runtime complexity is below that of the commonly used one-against-all technique.
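The round robin scheme described above can be sketched directly: one binary classifier per pair of classes, with prediction by majority vote. A toy nearest-centroid learner stands in for Ripper so the sketch stays self-contained, and the data are invented:

```python
from itertools import combinations
from collections import Counter

# Round robin (pairwise) classification: train one binary classifier per
# pair of classes; predict by majority vote over all pairwise decisions.
# A toy nearest-centroid base learner replaces Ripper for illustration.

def centroid_classifier(xs_a, xs_b):
    mean = lambda xs: [sum(col) / len(xs) for col in zip(*xs)]
    ca, cb = mean(xs_a), mean(xs_b)
    dist = lambda x, c: sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return lambda x: 'a' if dist(x, ca) <= dist(x, cb) else 'b'

def train_round_robin(X, y):
    models = {}
    for a, b in combinations(sorted(set(y)), 2):
        xs_a = [x for x, lab in zip(X, y) if lab == a]
        xs_b = [x for x, lab in zip(X, y) if lab == b]
        models[(a, b)] = centroid_classifier(xs_a, xs_b)  # classifier for this pair only
    return models

def predict(models, x):
    votes = Counter()
    for (a, b), clf in models.items():
        votes[a if clf(x) == 'a' else b] += 1
    return votes.most_common(1)[0][0]

# Invented 3-class data clustered near (0,0), (5,0), and (0,5).
X = [[0, 0], [0.5, 0], [5, 0], [5, 0.5], [0, 5], [0, 4.5]]
y = [0, 0, 1, 1, 2, 2]
models = train_round_robin(X, y)
print(predict(models, [4.8, 0.2]))   # -> 1
```

With c classes this trains c(c-1)/2 classifiers, but each sees only two classes' examples, which is the source of the favorable runtime the abstract claims.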
Two Kinds of Training Information for Evaluation Function Learning
 In Proceedings of the Ninth Annual Conference on Artificial Intelligence, 1991
Abstract

Cited by 54 (3 self)
This paper identifies two fundamentally different kinds of training information for learning search control in terms of an evaluation function. Each kind of training information suggests its own set of methods for learning an evaluation function. The paper shows that one can integrate the methods and learn simultaneously from both kinds of information.
Label Ranking by Learning Pairwise Preferences
Abstract

Cited by 46 (16 self)
Preference learning is an emerging topic that appears in different guises in the recent literature. This work focuses on a particular learning scenario called label ranking, where the problem is to learn a mapping from instances to rankings over a finite number of labels. Our approach for learning such a mapping, called ranking by pairwise comparison (RPC), first induces a binary preference relation from suitable training data using a natural extension of pairwise classification. A ranking is then derived from the preference relation thus obtained by means of a ranking procedure, whereby different ranking methods can be used for minimizing different loss functions. In particular, we show that a simple (weighted) voting strategy minimizes risk with respect to the well-known Spearman rank correlation. We compare RPC to existing label ranking methods, which are based on scoring individual labels instead of comparing pairs of labels. Both empirically and theoretically, it is shown that RPC is superior in terms of computational efficiency, and at least competitive in terms of accuracy.
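The weighted voting step mentioned in the abstract can be sketched directly: given a pairwise preference relation R, each label's score is its vote sum, and labels are ranked by score. The preference matrix here is invented example data, not a relation induced from training data:

```python
# Ranking by (weighted) voting, the final step of RPC described above:
# R[a][b] is the degree to which label a is preferred to label b.
# The matrix below is invented example data, not a learned relation.

def rank_by_voting(labels, R):
    score = {a: sum(R[a][b] for b in labels if b != a) for a in labels}
    return sorted(labels, key=lambda a: score[a], reverse=True)

labels = ['x', 'y', 'z']
R = {
    'x': {'y': 0.9, 'z': 0.8},   # x strongly preferred to both others
    'y': {'x': 0.1, 'z': 0.6},
    'z': {'x': 0.2, 'y': 0.4},
}
print(rank_by_voting(labels, R))   # -> ['x', 'y', 'z']
```

Sorting by these vote sums is the strategy the abstract shows to be risk-minimizing for Spearman rank correlation.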
Learning Subjective Functions with Large Margins
 Stanford University, 2000
Abstract

Cited by 24 (1 self)
In many optimization and decision problems the objective function can be expressed as a linear combination of competing criteria, the weights of which specify the relative importance of the criteria for the user. We consider the problem of learning such a "subjective" function from preference judgments collected from traces of user interactions. We propose a new algorithm for that task based on the theory of Support Vector Machines. One advantage of the algorithm is that prior knowledge about the domain can easily be included to constrain the solution. We demonstrate the algorithm in a route recommendation system that adapts to the driver's route preferences. We present experimental results on real users that show that the algorithm performs well in practice.
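The core idea above — recovering criterion weights from preference judgments — can be sketched with a much simpler stand-in: each judgment "option a is preferred to option b" yields the constraint w · (a − b) > 0, which a perceptron on difference vectors can satisfy. This replaces the paper's SVM machinery, and the two-criterion feature vectors are invented:

```python
# Learning a linear criterion weighting from preference pairs: each pair
# (preferred, other) gives the constraint w . (preferred - other) > 0.
# A perceptron on difference vectors stands in for the paper's SVM
# formulation; the (speed, scenery) features and pairs are invented.

def learn_weights(pairs, n_features, epochs=100, lr=0.1):
    w = [0.0] * n_features
    for _ in range(epochs):
        for pref, other in pairs:
            diff = [p - o for p, o in zip(pref, other)]
            if sum(wi * di for wi, di in zip(w, diff)) <= 0:  # constraint violated
                w = [wi + lr * di for wi, di in zip(w, diff)]  # nudge toward preferred
    return w

# Judgments from a hypothetical user who prefers faster routes over scenic ones.
pairs = [([0.9, 0.2], [0.3, 0.8]), ([0.7, 0.1], [0.2, 0.9])]
w = learn_weights(pairs, 2)
```

After training, the learned w scores every preferred option above its alternative, and the speed criterion outweighs scenery, matching the judgments.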
Why Did TD-Gammon Work?
 Advances in Neural Information Processing Systems 9
Abstract

Cited by 15 (4 self)
Although TD-Gammon is one of the major successes in machine learning, it has not led to similar impressive breakthroughs in temporal difference learning for other applications or even other games. We were able to replicate some of the success of TD-Gammon, developing a competitive evaluation function on a 4,000-parameter feed-forward neural network, without using back-propagation, reinforcement, or temporal difference learning methods. Instead we apply simple hill-climbing in a relative fitness environment. These results and further analysis suggest that the surprising success of Tesauro's program had more to do with the co-evolutionary structure of the learning task and the dynamics of the backgammon game itself than with the particular learning method used.
Separate-and-conquer learning
 Artificial Intelligence Review, 1999
Abstract

Cited by 13 (2 self)
By combining practical relevance with novel types of prediction problems, learning from and of preferences has recently received a lot of attention in the machine learning literature. Just as other types of complex learning tasks, preference learning deviates strongly from the standard problems of classification and regression. It is particularly challenging because it involves the prediction of complex structures, such as weak or partial order relations, rather than single values. This article aims at conveying a first idea of typical preference learning problems. To this end, two particular learning scenarios will be sketched, namely learning from label preferences and learning from object preferences. Both scenarios can be handled in two fundamentally different ways: by evaluating individual candidates (using a utility function) or by comparing competing candidates (using a binary "is preferred to" predicate).
Learning to Assess from Pair-Wise Comparisons
, 2002
Abstract

Cited by 8 (4 self)
In this paper we present an algorithm for learning a function able to assess objects. We assume that our teachers can provide a collection of pairwise comparisons but encounter certain difficulties in assigning a number to the qualities of the objects considered. This is a typical situation when dealing with food products, where it is very interesting to have repeatable, reliable mechanisms that are as objective as possible to evaluate quality in order to provide markets with products of a uniform quality. The same problem arises when we are trying to learn user preferences in an information retrieval system or in configuring a complex device. The algorithm is implemented using a growing variant of Kohonen's Self-Organizing Maps (growing neural gas), and is tested with a variety of data sets to demonstrate the capabilities of our approach.
Reinforcement Learning From State and Temporal Differences
, 1999
Abstract

Cited by 7 (2 self)
TD(λ) with function approximation has proved empirically successful for some complex reinforcement learning problems. For linear approximation, TD(λ) has been shown to minimise the squared error between the approximate value of each state and the true value. However, as far as policy is concerned, it is error in the relative ordering of states that is critical, rather than error in the state values. We illustrate this point, both in simple two-state and three-state systems in which TD(λ), starting from an optimal policy, converges to a suboptimal policy, and also in backgammon. We then present a modified form of TD(λ), called STD(λ), in which function approximators are trained with respect to relative state values on binary decision problems. A theoretical analysis, including a proof of monotonic policy improvement for STD(λ) in the context of the two-state system, is presented, along with a comparison with Bertsekas' differential training method [1]. This is followed by successful demonstrations of STD(λ) on the two-state system and a variation on the well-known acrobot problem.