Results 11-20 of 253
Grouped and hierarchical model selection through composite absolute penalties
Annals of Statistics, 2006
Cited by 60 (2 self)
Extracting useful information from high-dimensional data is an important part of the focus of today’s statistical research and practice. Penalized loss function minimization has been shown to be effective for this task both theoretically and empirically. With the virtues of both regularization and sparsity, the L1-penalized L2 minimization method Lasso has been popular in regression models. In this paper, we combine different norms including L1 to form an intelligent penalty in order to add side information to the fitting of a regression or classification model to obtain reasonable estimates. Specifically, we introduce the Composite Absolute Penalties (CAP) family which allows the grouping and hierarchical relationships between the predictors to be expressed. CAP penalties are built by defining groups and combining the properties of norm penalties at the across-group and within-group levels. Grouped selection occurs for non-overlapping groups. In that case, we give a Bayesian interpretation for CAP penalties. Hierarchical variable selection is reached by defining groups with particular overlapping patterns. In the computation aspect, we propose using the BLASSO and cross-validation to obtain CAP estimates. For a subfamily of CAP estimates involving only the L1 and L∞ norms, we introduce the iCAP algorithm to trace the entire regularization path for the grouped selection problem. Within this subfamily, unbiased estimates of the degrees of freedom (df) are derived allowing the regularization parameter to be selected without cross-validation. CAP is shown to improve on the predictive performance of the LASSO in a series of simulated experiments including cases with p >> n and mis-specified groupings. When the complexity of a model is properly calculated, iCAP is seen to be parsimonious in the experiments.
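The way a penalty can combine one norm across groups with another norm within each group may be easier to see in a small sketch. The block below assumes the penalty takes the form "sum over groups of (within-group norm) raised to an across-group exponent"; that form, the function names `lp_norm` and `cap_penalty`, and the example coefficients are illustrative assumptions, not taken from the abstract, and the sketch covers only non-overlapping groups.

```python
def lp_norm(values, p):
    """Return the L_p norm of a sequence; p may be float('inf')."""
    magnitudes = [abs(v) for v in values]
    if not magnitudes:
        return 0.0
    if p == float("inf"):
        return max(magnitudes)
    return sum(m ** p for m in magnitudes) ** (1.0 / p)


def cap_penalty(beta, groups, group_norms, gamma0=1.0):
    """Sum over groups of (within-group norm of the coefficients) ** gamma0.

    Assumed form for illustration: `groups` is a list of index lists,
    `group_norms` gives the within-group norm order for each group, and
    `gamma0` is the across-group exponent.
    """
    total = 0.0
    for indices, gamma_k in zip(groups, group_norms):
        total += lp_norm([beta[i] for i in indices], gamma_k) ** gamma0
    return total


beta = [3.0, -4.0, 0.0, 2.0]
groups = [[0, 1], [2, 3]]  # two non-overlapping groups
inf = float("inf")

# L1 across groups, L-infinity within groups (the L1/L-infinity subfamily)
linf_within = cap_penalty(beta, groups, [inf, inf], gamma0=1.0)
# L1 across groups, L2 within groups
l2_within = cap_penalty(beta, groups, [2, 2], gamma0=1.0)
# Singleton groups with gamma0 = 1 reduce to the ordinary L1 (Lasso) penalty
l1_plain = cap_penalty(beta, [[0], [1], [2], [3]], [1, 1, 1, 1], gamma0=1.0)
```

With an L1 combination across groups, a whole group drops out of the penalty only when all of its coefficients are zero, which is the intuition behind grouped selection.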
Adaptive Sparseness for Supervised Learning
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003
Cited by 57 (3 self)
The goal of supervised learning is to infer a functional mapping based on a set of training examples. To achieve good generalization, it is necessary to control the "complexity" of the learned function. In Bayesian approaches, this is done by adopting a prior for the parameters of the function being learned. We propose a Bayesian approach to supervised learning, which leads to sparse solutions; that is, in which irrelevant parameters are automatically set exactly to zero. Other ways to obtain sparse classifiers (such as Laplacian priors, support vector machines) involve (hyper)parameters which control the degree of sparseness of the resulting classifiers; these parameters have to be somehow adjusted/estimated from the training data. In contrast, our approach does not involve any (hyper)parameters to be adjusted or estimated. This is achieved by a hierarchical-Bayes interpretation of the Laplacian prior, which is then modified by the adoption of a Jeffreys' noninformative hyperprior. Implementation is carried out by an expectation-maximization (EM) algorithm. Experiments with several benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms SVMs and performs competitively with the best alternative techniques, although it involves no tuning or adjustment of sparseness-controlling hyperparameters.
Multi-Weight Enveloping: Least-Squares Approximation Techniques for Skin Animation
2002
Cited by 53 (0 self)
We present a process called multi-weight enveloping for deforming the skin geometry of the body of a digital creature around its skeleton. It is based on a deformation equation whose coefficients we compute using a statistical fit to an input training exercise. In this input, the skeleton and the skin move together, by arbitrary external means, through a range of motion representative of what the creature is expected to achieve in practice. The input can also come from existing pieces of handcrafted skin animation. Using a modified least-squares fitting technique, we compute the coefficients, or “weights”, of the deformation equation. The result is that the equation generalizes the skin movement so that it applies well to other sequences of animation. The multi-weight deformation equation is computationally efficient to evaluate; once the training process is complete, even creatures with high levels of geometric detail can move at interactive frame rates with a look that approximates that of anatomical, physically-based models. We demonstrate the technique in a feature film production environment, on a human model whose input poses are sculpted by hand and an animal model whose input poses come from the output of an anatomically-based dynamic simulation.
Piecewise linear regularized solution paths
Ann. Statist., 2007
Cited by 53 (6 self)
We consider the generic regularized optimization problem $\hat{\beta}(\lambda) = \arg\min_{\beta} L(y, X\beta) + \lambda J(\beta)$. Recently, Efron et al. (2004) have shown that for the Lasso – that is, if $L$ is squared error loss and $J(\beta) = \|\beta\|_1$ is the $\ell_1$ norm of $\beta$ – the optimal coefficient path is piecewise linear, i.e., $\partial \hat{\beta}(\lambda) / \partial \lambda$ is piecewise constant. We derive a general characterization of the properties of (loss L, penalty J) pairs which give piecewise linear coefficient paths. Such pairs allow for efficient generation of the full regularized coefficient paths. We investigate the nature of efficient path following algorithms which arise. We use our results to suggest robust versions of the Lasso for regression and classification, and to develop new, efficient algorithms for existing problems in the literature, including Mammen & van de Geer’s Locally Adaptive Regression Splines.
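The piecewise linearity of the Lasso path can be seen in the simplest possible case. The sketch below assumes a single coefficient with objective 0.5 * (z - b)^2 + lam * |b|, whose minimizer is the well-known soft-thresholding rule; the name `soft_threshold` and the value of `z` are illustrative, and this is not the path-following algorithm the abstract refers to.

```python
def soft_threshold(z, lam):
    """Minimizer over b of 0.5 * (z - b) ** 2 + lam * abs(b), for lam >= 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


z = 3.0  # unpenalized least-squares solution in this one-dimensional case
lambdas = [0.0, 1.0, 2.0, 3.0, 4.0]
path = [soft_threshold(z, lam) for lam in lambdas]

# Finite-difference slopes of the path with respect to lambda
slopes = [
    (path[i + 1] - path[i]) / (lambdas[i + 1] - lambdas[i])
    for i in range(len(lambdas) - 1)
]
```

In this toy case the slope is -1 while the coefficient is nonzero and 0 once it has been thresholded to zero, so the derivative with respect to lambda is piecewise constant with a single breakpoint at lam = |z|.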
A semidefinite framework for trust region subproblems with applications to large scale minimization
Math. Programming, 1997
Cited by 52 (8 self)
This is an abbreviated revision of the University of Waterloo research report CORR 94-32.
Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space
Journal of Machine Learning Research, 2003
Cited by 51 (1 self)
We present a novel and flexible approach to the problem of feature selection, called grafting. Rather than considering feature selection as separate from learning, grafting treats the selection of suitable features as an integral part of learning a predictor in a regularized learning framework. To make this regularized learning process sufficiently fast for large scale problems, grafting operates in an incremental iterative fashion, gradually building up a feature set while training a predictor model using gradient descent. At each iteration, a fast gradient-based heuristic is used to quickly assess which feature is most likely to improve the existing model; that feature is then added to the model, and the model is incrementally optimized using gradient descent. The algorithm scales linearly with the number of data points and at most quadratically with the number of features. Grafting can be used with a variety of predictor model classes, both linear and non-linear, and can be used for both classification and regression. Experiments are reported here on a variant of grafting for classification, using both linear and non-linear models, and using a logistic regression-inspired loss function. Results on a variety of synthetic and real world data sets are presented. Finally the relationship between grafting, stagewise additive modelling, and boosting is explored.
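One step of a gradient-based feature-selection heuristic of the kind described above might look like the following sketch. It assumes a linear model with mean logistic loss and labels in {0, 1}, and it assumes a feature is admitted only when its gradient magnitude exceeds a threshold `lam`; the threshold rule, the function names, and the toy data are assumptions made for illustration, not details stated in the abstract.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def loss_gradient(X, y, w):
    """Gradient of the mean logistic loss with respect to every weight."""
    n = len(X)
    grad = [0.0] * len(w)
    for row, label in zip(X, y):
        residual = sigmoid(sum(wj * xj for wj, xj in zip(w, row))) - label
        for j, xj in enumerate(row):
            grad[j] += residual * xj / n
    return grad


def pick_feature(X, y, w, active, lam):
    """Return the inactive feature with the largest gradient magnitude,
    or None if no inactive feature exceeds the (assumed) threshold lam."""
    grad = loss_gradient(X, y, w)
    candidates = [j for j in range(len(w)) if j not in active]
    if not candidates:
        return None
    best = max(candidates, key=lambda j: abs(grad[j]))
    return best if abs(grad[best]) > lam else None


# Toy data: feature 0 separates the classes, feature 1 is uninformative.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [1, 1, 0, 0]
w = [0.0, 0.0]

chosen = pick_feature(X, y, w, active=set(), lam=0.1)
```

After a feature is chosen, the full procedure would add it to the active set and re-optimize the weights by gradient descent before repeating the selection step; that loop is omitted here.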
Tree induction vs. logistic regression: A learning-curve analysis
CEDER Working Paper #IS-01-02, Stern School of Business, 2001
Cited by 50 (16 self)
Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree induction, assessing classification accuracy and the quality of rankings based on class-membership probabilities. We use a learning-curve analysis to examine the relationship of these measures to the size of the training set. The results of the study show several remarkable things. (1) Contrary to prior observations, logistic regression does not generally outperform tree induction. (2) More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree induction for larger data sets. Importantly, this often holds for training sets drawn from the same domain (i.e., the learning curves cross), so conclusions about induction-algorithm superiority on a given domain must be based on an analysis of the learning curves. (3) Contrary to conventional wisdom, tree induction is effective at producing probability-based rankings, although apparently comparatively less so for a given training-set size than at making classifications. Finally, (4) the domains on which tree induction and logistic regression are ultimately preferable can be characterized surprisingly well by a simple measure of signal-to-noise ratio.
Regularized estimation of large covariance matrices
Ann. Statist., 2008
Cited by 43 (12 self)
This paper considers estimating a covariance matrix of p variables from n observations by either banding or tapering the sample covariance matrix, or estimating a banded version of the inverse of the covariance. We show that these estimates are consistent in the operator norm as long as (log p)/n → 0, and obtain explicit rates. The results are uniform over some fairly natural well-conditioned families of covariance matrices. We also introduce an analogue of the Gaussian white noise model and show that if the population covariance is embeddable in that model and well-conditioned, then the banded approximations produce consistent estimates of the eigenvalues and associated eigenvectors of the covariance matrix. The results can be extended to smooth versions of banding and to non-Gaussian distributions with sufficiently short tails. A resampling approach is proposed for choosing the banding parameter in practice. This approach is illustrated numerically on both simulated and real data.
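Banding a matrix is commonly understood as keeping the entries within a fixed number of diagonals of the main diagonal and setting the rest to zero. The sketch below assumes that definition; the function name `band`, the bandwidth `k`, and the example matrix are illustrative and not taken from the abstract.

```python
def band(matrix, k):
    """Keep entries with |i - j| <= k and zero out everything else."""
    p = len(matrix)
    return [
        [matrix[i][j] if abs(i - j) <= k else 0.0 for j in range(p)]
        for i in range(p)
    ]


# Hypothetical 3 x 3 sample covariance matrix
sample_cov = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]

banded = band(sample_cov, 1)  # keeps the main diagonal and first off-diagonals
```

The bandwidth `k` plays the role of the banding parameter that the abstract proposes to choose by resampling; the selection procedure itself is not sketched here.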
Pedestrian Detection for Driving Assistance Systems: Single-frame Classification and System Level Performance
In Proceedings of IEEE Intelligent Vehicles Symposium, 2004
Cited by 41 (2 self)
We describe the functional and architectural breakdown of a monocular pedestrian detection system. We describe in detail our approach for single-frame classification based on a novel scheme of breaking down the class variability by repeatedly training a set of relatively simple classifiers on clusters of the training set. Single-frame classification performance results and system level performance figures for daytime conditions are presented with a discussion about the remaining gap to meet a daytime normal weather condition production system.
The composite absolute penalties family for grouped and hierarchical variable selection
Ann. Statist.
Cited by 37 (1 self)
Extracting useful information from high-dimensional data is an important focus of today’s statistical research and practice. Penalized loss function minimization has been shown to be effective for this task both theoretically and empirically. With the virtues of both regularization and sparsity, the L1-penalized squared error minimization method Lasso has been popular in regression models and beyond. In this paper, we combine different norms including L1 to form an intelligent penalty in order to add side information to the fitting of a regression or classification model to obtain reasonable estimates. Specifically, we introduce the Composite Absolute Penalties (CAP) family, which allows given grouping and hierarchical relationships between the predictors to be expressed. CAP penalties are built by defining groups and combining the properties of norm penalties at the across-group and within-group levels. Grouped selection occurs for nonoverlapping groups. Hierarchical variable selection is reached

