Results 1-10 of 133
Maximum Entropy Discrimination
1999
Cited by 125 (21 self)
Abstract: We present a general framework for discriminative estimation based on the maximum entropy principle and its extensions. All calculations involve distributions over structures and/or parameters rather than specific settings and reduce to relative entropy projections. This holds even when the data is not separable within the chosen parametric class, in the context of anomaly detection rather than classification, or when the labels in the training set are uncertain or incomplete. Support vector machines are naturally subsumed under this class and we provide several extensions. We are also able to estimate exactly and efficiently discriminative distributions over tree structures of class-conditional models within this framework. Preliminary experimental results are indicative of the potential of these techniques.
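The "relative entropy projections" the abstract refers to can be illustrated on a discrete hypothesis set. The sketch below is my own simplification, not the paper's formulation: `re_project`, the uniform prior, and the single expected-margin constraint are illustrative assumptions. It projects a prior onto the set of distributions whose expected margin meets a threshold, using the closed exponential-family form of the solution and a bisection on the one Lagrange multiplier.

```python
import math

def re_project(prior, margins, gamma):
    """Minimize KL(p || prior) over p subject to sum_i p[i]*margins[i] >= gamma.
    Closed form: p ~ prior * exp(lam * margin); bisect on lam."""
    def tilt(lam):
        w = [p * math.exp(lam * m) for p, m in zip(prior, margins)]
        z = sum(w)
        return [v / z for v in w]

    def exp_margin(lam):
        return sum(pi * m for pi, m in zip(tilt(lam), margins))

    if exp_margin(0.0) >= gamma:      # constraint inactive: keep the prior
        return prior
    lo, hi = 0.0, 1.0
    while exp_margin(hi) < gamma and hi < 1e6:
        hi *= 2.0                     # bracket the multiplier
    for _ in range(100):              # bisection; hi stays on the feasible side
        mid = 0.5 * (lo + hi)
        if exp_margin(mid) < gamma:
            lo = mid
        else:
            hi = mid
    return tilt(hi)

# Uniform prior over two hypotheses with margins -1 and +1; requiring an
# expected margin of 0.5 tilts the mass to [0.25, 0.75].
p = re_project([0.5, 0.5], [-1.0, 1.0], gamma=0.5)
```

The projection keeps the prior's exponential-family form and only shifts mass toward hypotheses that satisfy the constraint, which is why such calculations stay tractable.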
AUC optimization vs. error rate minimization
Advances in Neural Information Processing Systems, 2004
Cited by 110 (2 self)
Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates can be large. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that under certain conditions the global function optimized by the RankBoost algorithm is exactly the AUC. We report results of our experiments with RankBoost on several datasets that demonstrate the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms that optimize an approximation of the AUC or only optimize it locally.
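The gap between the two criteria is easy to see with the pairwise (Wilcoxon-Mann-Whitney) form of the AUC: two score assignments with identical thresholded error rates can rank positive-negative pairs differently. A small illustrative sketch (the scores and threshold are made up for the example):

```python
def auc(pos_scores, neg_scores):
    """Pairwise (Wilcoxon-Mann-Whitney) estimate of the AUC: the
    fraction of (positive, negative) pairs ranked correctly."""
    correct = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos_scores
        for q in neg_scores
    )
    return correct / (len(pos_scores) * len(neg_scores))

def error_rate(pos_scores, neg_scores, threshold=0.5):
    """Classification error when thresholding the scores."""
    errors = sum(p <= threshold for p in pos_scores)
    errors += sum(q > threshold for q in neg_scores)
    return errors / (len(pos_scores) + len(neg_scores))

# Two score assignments with the same thresholded error rate (1/3)
# but different AUC values (8/9 vs 7/9).
a_pos, a_neg = [0.9, 0.8, 0.4], [0.6, 0.2, 0.1]
b_pos, b_neg = [0.9, 0.8, 0.15], [0.7, 0.2, 0.1]
```

This is exactly why minimizing the error rate need not produce the best AUC: the error rate only counts threshold crossings, while the AUC depends on the full ranking.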
Grouped and hierarchical model selection through composite absolute penalties
Annals of Statistics, 2006
Cited by 94 (3 self)
Abstract: Extracting useful information from high-dimensional data is an important part of the focus of today's statistical research and practice. Penalized loss function minimization has been shown to be effective for this task both theoretically and empirically. With the virtues of both regularization and sparsity, the L1-penalized L2 minimization method Lasso has been popular in regression models. In this paper, we combine different norms including L1 to form an intelligent penalty in order to add side information to the fitting of a regression or classification model to obtain reasonable estimates. Specifically, we introduce the Composite Absolute Penalties (CAP) family, which allows the grouping and hierarchical relationships between the predictors to be expressed. CAP penalties are built by defining groups and combining the properties of norm penalties at the across-group and within-group levels. Grouped selection occurs for non-overlapping groups. In that case, we give a Bayesian interpretation for CAP penalties. Hierarchical variable selection is reached by defining groups with particular overlapping patterns. On the computational side, we propose using the BLASSO algorithm and cross-validation to obtain CAP estimates. For a subfamily of CAP estimates involving only the L1 and L∞ norms, we introduce the iCAP algorithm to trace the entire regularization path for the grouped selection problem. Within this subfamily, unbiased estimates of the degrees of freedom (df) are derived, allowing the regularization parameter to be selected without cross-validation. CAP is shown to improve on the predictive performance of the LASSO in a series of simulated experiments, including cases with p >> n and misspecified groupings. When the complexity of a model is properly calculated, iCAP is seen to be parsimonious in the experiments.
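The construction can be sketched as an outer norm applied to the vector of within-group norms. The helper below is a hypothetical illustration of that structure (the group definitions and exponent choices are mine), not the paper's exact estimator: `within=2, across=1` gives a group-lasso-style penalty, while `within=inf` corresponds to the L1/L∞ subfamily the iCAP algorithm handles.

```python
def cap_penalty(beta, groups, within=2.0, across=1.0):
    """CAP-style composite penalty: sum_k ||beta_Gk||_within ** across."""
    def norm(v, p):
        if p == float("inf"):
            return max(abs(x) for x in v)
        return sum(abs(x) ** p for x in v) ** (1.0 / p)
    return sum(norm([beta[i] for i in g], within) ** across for g in groups)

beta = [1.0, -2.0, 0.0, 3.0]
groups = [[0, 1], [2, 3]]        # non-overlapping groups -> grouped selection
group_lasso = cap_penalty(beta, groups, within=2.0, across=1.0)
icap_style = cap_penalty(beta, groups, within=float("inf"), across=1.0)
```

The non-differentiable outer L1 norm is what zeroes out whole groups at once; overlapping groups (not shown) would instead induce the hierarchical selection the abstract describes.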
The composite absolute penalties family for grouped and hierarchical variable selection
Annals of Statistics
Cited by 75 (3 self)
Abstract: Extracting useful information from high-dimensional data is an important focus of today's statistical research and practice. Penalized loss function minimization has been shown to be effective for this task both theoretically and empirically. With the virtues of both regularization and sparsity, the L1-penalized squared error minimization method Lasso has been popular in regression models and beyond. In this paper, we combine different norms including L1 to form an intelligent penalty in order to add side information to the fitting of a regression or classification model to obtain reasonable estimates. Specifically, we introduce the Composite Absolute Penalties (CAP) family, which allows given grouping and hierarchical relationships between the predictors to be expressed. CAP penalties are built by defining groups and combining the properties of norm penalties at the across-group and within-group levels. Grouped selection occurs for non-overlapping groups. Hierarchical variable selection is reached by defining groups with particular overlapping patterns.
Boosting as a Regularized Path to a Maximum Margin Classifier
Journal of Machine Learning Research, 2004
Cited by 74 (20 self)
Abstract: In this paper we study boosting methods from a new perspective. We build on recent work by Efron et al. to show that boosting approximately (and in some cases exactly) minimizes its loss criterion with an l1 constraint on the coefficient vector. This helps explain the success of boosting with early stopping as regularized fitting of the loss criterion. For the two most commonly used criteria (exponential and binomial log-likelihood), we further show that as the constraint is relaxed, or equivalently as the boosting iterations proceed, the solution converges (in the separable case) to an "l1-optimal" separating hyperplane. We prove that this l1-optimal separating hyperplane has the property of maximizing the minimal l1 margin of the training data, as defined in the boosting literature.
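The connection can be illustrated with epsilon-boosting on the exponential loss: each iteration takes a small step on the best coordinate, so the l1 norm of the coefficient vector grows by at most eps per round, and early stopping acts like an l1 constraint. A toy sketch (the function name, data, and step size are my own, not the paper's code):

```python
import math

def eps_boost(H, y, steps=200, eps=0.01):
    """Epsilon-boosting on the exponential loss.
    H[j][i]: +-1 prediction of base learner j on example i."""
    n, m = len(y), len(H)
    coef = [0.0] * m
    margins = [0.0] * n
    for _ in range(steps):
        w = [math.exp(-y[i] * margins[i]) for i in range(n)]
        corr = [sum(w[i] * y[i] * H[j][i] for i in range(n)) for j in range(m)]
        j = max(range(m), key=lambda k: abs(corr[k]))  # best coordinate
        step = eps if corr[j] > 0 else -eps            # small l1 step
        coef[j] += step
        for i in range(n):
            margins[i] += step * H[j][i]
    return coef

# Two base learners on four examples; the first one separates the data,
# so all coefficient mass accumulates there and ||coef||_1 == steps * eps.
y = [1, 1, -1, -1]
H = [[1, 1, -1, -1], [1, -1, 1, -1]]
coef = eps_boost(H, y)
```

Stopping after fewer rounds yields the solution of the loss minimization under a tighter l1 budget, which is the regularized-path reading of boosting described above.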
PAC-Bayes & Margins
Advances in Neural Information Processing Systems 15, 2002
Cited by 64 (10 self)
Abstract: We show two related things: (1) Given a classifier which consists of a weighted sum of features with a large margin, we can construct a stochastic classifier with a negligibly larger training error rate. The stochastic classifier has a future error rate bound that depends on the margin distribution and is independent of the size of the base hypothesis class.
AdaBoosting neural networks
Neural Computation, 1997
Cited by 51 (8 self)
Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multilayer artificial neural networks. We show that training multilayer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting one hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors.
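The incremental procedure described above resembles a boosting step: each round picks the hidden unit (a linear classifier) that minimizes a weighted sum of errors, then reweights the examples. A toy sketch over a finite candidate pool (my own simplification; the paper's formulation optimizes over an infinite set of units, and the AdaBoost-style reweighting is illustrative):

```python
import math

def insert_units(candidates, X, y, rounds=3):
    """Greedily insert hidden units from a candidate pool of (w, b)
    linear units; each round minimizes a weighted error, then reweights."""
    n = len(y)
    weights = [1.0 / n] * n
    model = []  # list of ((w, b), output_weight)

    def predict(unit, x):
        w, b = unit
        return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

    for _ in range(rounds):
        def werr(unit):
            return sum(weights[i] for i in range(n)
                       if predict(unit, X[i]) != y[i])
        unit = min(candidates, key=werr)        # best new hidden unit
        err = max(werr(unit), 1e-12)            # guard against log of 0
        alpha = 0.5 * math.log((1.0 - err) / err)
        model.append((unit, alpha))
        # shift weight toward examples the new unit still misclassifies
        for i in range(n):
            weights[i] *= math.exp(-alpha * y[i] * predict(unit, X[i]))
        z = sum(weights)
        weights = [v / z for v in weights]
    return model

# A separable 1-D toy problem with two candidate units.
X = [[-1.0], [1.0]]
y = [-1, 1]
candidates = [((1.0,), 0.0), ((-1.0,), 0.0)]
model = insert_units(candidates, X, y, rounds=2)
```

Each inserted unit is only a linear subproblem, which is the sense in which the overall training problem becomes convex despite the network being multilayer.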
Boosting for Text Classification with Semantic Features
In Proceedings of the MSW 2004 Workshop at the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2004
Cited by 44 (2 self)
Abstract: Current text classification systems typically use term stems for representing document content. Semantic Web technologies allow the usage of features on a higher semantic level than single words for text classification purposes. In this paper we propose such an enhancement of the classical document representation through concepts extracted from background knowledge. Boosting, a successful machine learning technique, is used for classification. Comparative experimental evaluations in three different settings support our approach through consistent improvement of the results. An analysis of the results shows that this improvement is due to two separate effects.
A Constructive Algorithm for Training Cooperative Neural Network Ensembles
IEEE Transactions on Neural Networks, 2003
Cited by 44 (16 self)
Abstract: This paper presents a constructive algorithm for training cooperative neural-network ensembles (CNNEs). CNNE combines ensemble architecture design with cooperative training for individual neural networks (NNs) in ensembles. Unlike most previous studies on training ensembles, CNNE puts emphasis on both accuracy and diversity among individual NNs in an ensemble. In order to maintain accuracy among individual NNs, the number of hidden nodes in individual NNs is also determined by a constructive approach. Incremental training based on negative correlation is used in CNNE to train individual NNs for different numbers of training epochs. The use of negative correlation learning and different training epochs for individual NNs reflects CNNE's emphasis on diversity among individual NNs in an ensemble. CNNE has been tested extensively on a number of benchmark problems in machine learning and neural networks, including Australian credit card assessment, breast cancer, diabetes, glass, heart disease, letter recognition, soybean, and Mackey-Glass time series prediction problems. The experimental results show that CNNE can produce NN ensembles with good generalization ability.
Almost-Everywhere Algorithmic Stability and Generalization Error
UAI 2002: Uncertainty in Artificial Intelligence, 2002
Cited by 43 (8 self)
Abstract: We introduce a new notion of algorithmic stability, which we call training stability.