Results 1  10
of
49
Selection of relevant features and examples in machine learning
 ARTIFICIAL INTELLIGENCE
, 1997
"... In this survey, we review work in machine learning on methods for handling data sets containing large amounts of irrelevant information. We focus on two key issues: the problem of selecting relevant features, and the problem of selecting relevant examples. We describe the advances that have been mad ..."
Abstract

Cited by 590 (2 self)
 Add to MetaCart
In this survey, we review work in machine learning on methods for handling data sets containing large amounts of irrelevant information. We focus on two key issues: the problem of selecting relevant features, and the problem of selecting relevant examples. We describe the advances that have been made on these topics in both empirical and theoretical work in machine learning, and we present a general framework that we use to compare different methods. We close with some challenges for future work in this area.
How to Use Expert Advice
 JOURNAL OF THE ASSOCIATION FOR COMPUTING MACHINERY
, 1997
"... We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worstcase situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the ..."
Abstract

Cited by 378 (74 self)
 Add to MetaCart
We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worstcase situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the algorithm by the difference between the expected number of mistakes it makes on the bit sequence and the expected number of mistakes made by the best expert on this sequence, where the expectation is taken with respect to the randomization in the predictions. We show that the minimum achievable difference is on the order of the square root of the number of mistakes of the best expert, and we give efficient algorithms that achieve this. Our upper and lower bounds have matching leading constants in most cases. We then show howthis leads to certain kinds of pattern recognition/learning algorithms with performance bounds that improve on the best results currently known in this context. We also compare our analysis to the case in which log loss is used instead of the expected number of mistakes.
Exponentiated Gradient Versus Gradient Descent for Linear Predictors
 Information and Computation
, 1995
"... this paper, we concentrate on linear predictors . To any vector u 2 R ..."
Abstract

Cited by 326 (14 self)
 Add to MetaCart
(Show Context)
this paper, we concentrate on linear predictors . To any vector u 2 R
Logistic Regression, AdaBoost and Bregman Distances
, 2000
"... We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt al ..."
Abstract

Cited by 261 (44 self)
 Add to MetaCart
We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt algorithms designed for one problem to the other. For both problems, we give new algorithms and explain their potential advantages over existing methods. These algorithms can be divided into two types based on whether the parameters are iteratively updated sequentially (one at a time) or in parallel (all at once). We also describe a parameterized family of algorithms which interpolates smoothly between these two extremes. For all of the algorithms, we give convergence proofs using a general formalization of the auxiliaryfunction proof technique. As one of our sequentialupdate algorithms is equivalent to AdaBoost, this provides the first general proof of convergence for AdaBoost. We show that all of our algorithms generalize easily to the multiclass case, and we contrast the new algorithms with iterative scaling. We conclude with a few experimental results with synthetic data that highlight the behavior of the old and newly proposed algorithms in different settings.
Universal prediction
 IEEE TRANSACTIONS ON INFORMATION THEORY
, 1998
"... This paper consists of an overview on universal prediction from an informationtheoretic perspective. Special attention is given to the notion of probability assignment under the selfinformation loss function, which is directly related to the theory of universal data compression. Both the probabili ..."
Abstract

Cited by 186 (16 self)
 Add to MetaCart
This paper consists of an overview on universal prediction from an informationtheoretic perspective. Special attention is given to the notion of probability assignment under the selfinformation loss function, which is directly related to the theory of universal data compression. Both the probabilistic setting and the deterministic setting of the universal prediction problem are described with emphasis on the analogy and the differences between results in the two settings.
Relative Loss Bounds for Online Density Estimation with the Exponential Family of Distributions
 MACHINE LEARNING
, 2000
"... We consider online density estimation with a parameterized density from the exponential family. The online algorithm receives one example at a time and maintains a parameter that is essentially an average of the past examples. After receiving an example the algorithm incurs a loss, which is the n ..."
Abstract

Cited by 153 (14 self)
 Add to MetaCart
We consider online density estimation with a parameterized density from the exponential family. The online algorithm receives one example at a time and maintains a parameter that is essentially an average of the past examples. After receiving an example the algorithm incurs a loss, which is the negative loglikelihood of the example with respect to the past parameter of the algorithm. An oline algorithm can choose the best parameter based on all the examples. We prove bounds on the additional total loss of the online algorithm over the total loss of the best oline parameter. These relative loss bounds hold for an arbitrary sequence of examples. The goal is to design algorithms with the best possible relative loss bounds. We use a Bregman divergence to derive and analyze each algorithm. These divergences are relative entropies between two exponential distributions. We also use our methods to prove relative loss bounds for linear regression.
A Game of Prediction with Expert Advice
 Journal of Computer and System Sciences
, 1997
"... We consider the following problem. At each point of discrete time the learner must make a prediction; he is given the predictions made by a pool of experts. Each prediction and the outcome, which is disclosed after the learner has made his prediction, determine the incurred loss. It is known that, u ..."
Abstract

Cited by 152 (10 self)
 Add to MetaCart
(Show Context)
We consider the following problem. At each point of discrete time the learner must make a prediction; he is given the predictions made by a pool of experts. Each prediction and the outcome, which is disclosed after the learner has made his prediction, determine the incurred loss. It is known that, under weak regularity, the learner can ensure that his cumulative loss never exceeds cL+ a ln n, where c and a are some constants, n is the size of the pool, and L is the cumulative loss incurred by the best expert in the pool. We find the set of those pairs (c; a) for which this is true.
An introduction to boosting and leveraging
 Advanced Lectures on Machine Learning, LNCS
, 2003
"... ..."
(Show Context)
Schapire R., Bounds on the Sample Complexity of Bayesian Learning Using Information Theory and the VC Dimension
"... ..."
(Show Context)
Sparse Online Learning via Truncated Gradient
"... We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to ..."
Abstract

Cited by 105 (4 self)
 Add to MetaCart
We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical onlinelearning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1