Results 1 - 10 of 96
Sparse multinomial logistic regression: fast algorithms and generalization bounds
 IEEE Trans. on Pattern Analysis and Machine Intelligence
Cited by 113 (1 self)
Recently developed methods for learning sparse classifiers are among the state-of-the-art in supervised learning. These methods learn classifiers that incorporate weighted sums of basis functions with sparsity-promoting priors encouraging the weight estimates to be either significantly large or exactly zero. From a learning-theoretic perspective, these methods control the capacity of the learned classifier by minimizing the number of basis functions used, resulting in better generalization. This paper presents three contributions related to learning sparse classifiers. First, we introduce a true multiclass formulation based on multinomial logistic regression. Second, by combining a bound optimization approach with a component-wise update procedure, we derive fast exact algorithms for learning sparse multiclass classifiers that scale favorably in both the number of training samples and the feature dimensionality, making them applicable even to large data sets in high-dimensional feature spaces. To the best of our knowledge, these are the first algorithms to perform exact multinomial logistic regression with a sparsity-promoting prior. Third, we show how nontrivial generalization bounds can be derived for our classifier in the binary case. Experimental results on standard benchmark data sets attest to the accuracy, sparsity, and efficiency of the proposed methods.
A tutorial on MM algorithms
 Amer. Statist
, 2004
Cited by 65 (3 self)
Most problems in frequentist statistics involve optimization of a function such as a likelihood or a sum of squares. EM algorithms are among the most effective algorithms for maximum likelihood estimation because they consistently drive the likelihood uphill by maximizing a simple surrogate function for the log-likelihood. Iterative optimization of a surrogate function as exemplified by an EM algorithm does not necessarily require missing data. Indeed, every EM algorithm is a special case of the more general class of MM optimization algorithms, which typically exploit convexity rather than missing data in majorizing or minorizing an objective function. In our opinion, MM algorithms deserve to be part of the standard toolkit of professional statisticians. The current article explains the principle behind MM algorithms, suggests some methods for constructing them, and discusses some of their attractive features. We include numerous examples throughout the article to illustrate the concepts described. In addition to surveying previous work on MM algorithms, this article introduces some new material on constrained optimization and standard error estimation. Key words and phrases: constrained optimization, EM algorithm, majorization, minorization, Newton-Raphson.
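The majorize-then-minimize principle described in this abstract can be illustrated with a small sketch. It is not taken from the article itself: it assumes the standard quadratic majorizer |r| <= r^2 / (2|r_k|) + |r_k| / 2 applied to the sum of absolute deviations, whose minimizer is the sample median. The function name `mm_median` and the smoothing constant `eps` are hypothetical choices made for this illustration.

```python
def mm_median(xs, iters=200, eps=1e-9):
    """Minimize sum(|x_i - theta|) with an MM iteration (illustrative sketch).

    Each absolute value is majorized at the current iterate by a quadratic,
    so every step reduces to a weighted mean. `eps` is an assumed safeguard
    against division by zero when theta lands on a data point.
    """
    theta = sum(xs) / len(xs)  # start from the sample mean
    for _ in range(iters):
        # Weights of the quadratic surrogate built at the current theta.
        weights = [1.0 / (abs(x - theta) + eps) for x in xs]
        # Minimizing the surrogate gives a weighted average of the data.
        theta = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
    return theta


if __name__ == "__main__":
    print(mm_median([1.0, 2.0, 3.0, 4.0, 100.0]))
```

Because the surrogate lies above the objective and touches it at the current iterate, each update cannot increase the (slightly smoothed) objective, which is the descent property the abstract attributes to MM and EM algorithms.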
One-step sparse estimates in nonconcave penalized likelihood models
 Ann. Statist., to appear
, 2008
Cited by 58 (0 self)
Fan and Li propose a family of variable selection methods via penalized likelihood using concave penalty functions. The nonconcave penalized likelihood estimators enjoy the oracle properties, but maximizing the penalized likelihood function is computationally challenging, because the objective function is nondifferentiable and nonconcave. In this article, we propose a new unified algorithm based on the local linear approximation (LLA) for maximizing the penalized likelihood for a broad class of concave penalty functions. Convergence and other theoretical properties of the LLA algorithm are established. A distinguishing feature of the LLA algorithm is that at each LLA step, the LLA estimator can naturally adopt a sparse representation. Thus, we suggest using the one-step LLA estimator from the LLA algorithm as the final estimate. Statistically, we show that if the regularization parameter is appropriately chosen, the one-step LLA estimates enjoy the oracle properties with good initial estimators. Computationally, the one-step LLA estimation methods dramatically reduce the computational cost in maximizing the nonconcave penalized likelihood. We conduct Monte Carlo simulations to assess the finite sample performance of the one-step sparse estimation methods. The results are very encouraging.
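A minimal sketch of one LLA step follows, under strong simplifying assumptions that are not stated in the abstract: a least-squares loss, an orthonormal design (so the weighted lasso problem separates into coordinate-wise soft-thresholding), the SCAD penalty with the conventional a = 3.7, and the ordinary least-squares coefficients `z` used as the initial estimator. All function names are hypothetical, and the scaling of the penalty relative to the sample size may differ from the paper's.

```python
def scad_derivative(t, lam, a=3.7):
    """Derivative of the SCAD penalty at t >= 0 (a = 3.7 is the usual default)."""
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1.0)


def soft_threshold(v, t):
    """Soft-thresholding operator: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def one_step_lla_orthonormal(z, lam, a=3.7):
    """One LLA step for 0.5 * sum((z_j - b_j)^2) + sum(p_lam(|b_j|)).

    The penalty is linearized at the initial estimate z, which turns the
    problem into a weighted lasso; under the assumed orthonormal design it
    is solved exactly by soft-thresholding each coordinate.
    """
    weights = [scad_derivative(abs(zj), lam, a) for zj in z]
    return [soft_threshold(zj, wj) for zj, wj in zip(z, weights)]


if __name__ == "__main__":
    print(one_step_lla_orthonormal([5.0, 0.3, -2.0], lam=1.0))
```

In this sketch large coefficients receive a zero weight and are left unshrunk, while small ones are thresholded to exactly zero, which is the sparse representation the abstract refers to. A general design would need a weighted-lasso solver in place of the closed-form thresholding.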
On semi-supervised classification
 In
, 2005
Cited by 40 (8 self)
A graph-based prior is proposed for parametric semi-supervised classification. The prior utilizes both labelled and unlabelled data; it also integrates features from multiple views of a given sample (e.g., multiple sensors), thus implementing a Bayesian form of co-training. An EM algorithm for training the classifier automatically adjusts the tradeoff between the contributions of: (a) the labelled data; (b) the unlabelled data; and (c) the co-training information. Active label query selection is performed using a mutual information based criterion that explicitly uses the unlabelled data and the co-training information. Encouraging results are presented on public benchmarks and on measured data from single and multiple sensors.
Variable Selection Using MM Algorithm
 Annals of Statistics
, 2005
Cited by 38 (4 self)
Variable selection is fundamental to high-dimensional statistical modeling. Many variable selection techniques may be implemented by maximum penalized likelihood using various penalty functions. Optimizing the penalized likelihood function is often challenging because it may be nondifferentiable and/or nonconcave. This article proposes a new class of algorithms for finding a maximizer of the penalized likelihood for a broad class of penalty functions. These algorithms operate by perturbing the penalty function slightly to render it differentiable, then optimizing this differentiable function using a minorize–maximize (MM) algorithm. MM algorithms are useful extensions of the well-known class of EM algorithms, a fact that allows us to analyze the local and global convergence of the proposed algorithm using some of the techniques employed for EM algorithms. In particular, we prove that when our MM algorithms converge, they must converge to a desirable point; we also discuss conditions under which this convergence may be guaranteed. We exploit the Newton–Raphson-like aspect of these algorithms
Distributed Weighted-Multidimensional Scaling for Node Localization in Sensor Networks
 ACM TRANSACTIONS ON SENSOR NETWORKS
, 2005
Cited by 34 (0 self)
Accurate, distributed localization algorithms are needed for a wide variety of wireless sensor network applications. This paper introduces a scalable, distributed weighted-multidimensional scaling (dwMDS) algorithm that adaptively emphasizes the most accurate range measurements and naturally accounts for communication constraints within the sensor network. Each node adaptively chooses a neighborhood of sensors, updates its position estimate by minimizing a local cost function and then passes this update to neighboring sensors. Derived bounds on communication requirements provide insight on the energy efficiency of the proposed distributed method versus a centralized approach. For received signal-strength (RSS) based range measurements, we demonstrate via simulation that location estimates are nearly unbiased with variance close to the Cramér-Rao lower bound. Further, RSS and time-of-arrival (TOA) channel measurements are used to demonstrate performance as good as the centralized maximum-likelihood estimator (MLE) in a real-world sensor network.
MM algorithms for generalized Bradley-Terry models
 The Annals of Statistics
, 2004
Cited by 29 (1 self)
The Bradley–Terry model for paired comparisons is a simple and much-studied means to describe the probabilities of the possible outcomes when individuals are judged against one another in pairs. Among the many studies of the model in the past 75 years, numerous authors have generalized it in several directions, sometimes providing iterative algorithms for obtaining maximum likelihood estimates for the generalizations. Building on a theory of algorithms known by the initials MM, for minorization–maximization, this paper presents a powerful technique for producing iterative maximum likelihood estimation algorithms for a wide class of generalizations of the Bradley–Terry model. While algorithms for problems of this type have tended to be custom-built in the literature, the techniques in this paper enable their mass production. Simple conditions are stated that guarantee that each algorithm described will produce a sequence that converges to the unique maximum likelihood estimator. Several of the algorithms and convergence results herein are new.
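For the basic (ungeneralized) Bradley–Terry model, the classical iterative update that the MM framework recovers can be sketched as below. This is an illustration only: the function name `bradley_terry_mm`, the `wins` matrix layout, and the normalization to unit sum are assumptions of the sketch, and it presumes every player has at least one win and at least one comparison so that no denominator is zero.

```python
def bradley_terry_mm(wins, iters=500):
    """Estimate Bradley-Terry strengths by the classical MM-style update.

    wins[i][j] is the number of times player i beat player j (assumed layout).
    Update: gamma_i <- W_i / sum_j N_ij / (gamma_i + gamma_j), where W_i is
    the total wins of i and N_ij the number of comparisons between i and j.
    """
    n = len(wins)
    gamma = [1.0 / n] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (gamma[i] + gamma[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        # Strengths are only identified up to scale, so rescale to sum to 1.
        scale = sum(updated)
        gamma = [g / scale for g in updated]
    return gamma


if __name__ == "__main__":
    # Player 0 beat player 1 three times and lost once.
    print(bradley_terry_mm([[0, 3], [1, 0]]))
```

With two players the estimate matches the empirical win proportions, which gives a quick sanity check on the update.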
A wide-angle view at iterated shrinkage algorithms
 in SPIE (Wavelet XII
, 2007
Cited by 25 (1 self)
Sparse and redundant representations – an emerging and powerful model for signals – suggest that a data source can be described as a linear combination of a few atoms from a pre-specified and overcomplete dictionary. This model has drawn considerable attention in the past decade, due to its appealing theoretical foundations and the promising practical results it leads to. Many of the applications that use this model are formulated as a mixture of ℓ2-ℓp (p ≤ 1) optimization expressions. Iterated Shrinkage algorithms are a new family of highly effective numerical techniques for handling these optimization tasks, surpassing traditional optimization techniques. In this paper we aim to give a broad view of this group of methods, motivate their need, present their derivation, show their comparative performance, and most important of all, discuss their potential in various applications.
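One member of the iterated-shrinkage family can be sketched for the p = 1 case, i.e. minimizing 0.5 * ||y - A x||^2 + lam * ||x||_1. The sketch below is a plain iterative soft-thresholding loop and is not claimed to be any specific variant from the paper; the names `ista` and `soft_threshold` are hypothetical, and the step size uses the squared Frobenius norm of A as a conservative upper bound on the squared spectral norm.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t (the proximal map of t * |.|)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(A, y, lam, iters=200):
    """Iterative soft-thresholding for 0.5*||y - A x||^2 + lam*||x||_1."""
    m, n = len(A), len(A[0])
    # Squared Frobenius norm: an assumed, safe (if pessimistic) Lipschitz bound.
    lipschitz = sum(a * a for row in A for a in row)
    x = [0.0] * n
    for _ in range(iters):
        # Residual y - A x.
        r = [y[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
        # A^T r, the negative gradient of the quadratic term.
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by shrinkage.
        x = [soft_threshold(x[j] + g[j] / lipschitz, lam / lipschitz)
             for j in range(n)]
    return x


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(ista(identity, [3.0, 0.5], lam=1.0))
```

With an identity dictionary the minimizer is simply the soft-thresholded observation, so the small coefficient is set exactly to zero and the large one is shrunk by lam.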
Semi-supervised hyperspectral image segmentation using multinomial logistic regression with active learning
 IEEE Transactions on Geoscience and Remote Sensing
Cited by 23 (14 self)
This paper presents a new semi-supervised segmentation algorithm, suited to high-dimensional data, of which remotely sensed hyperspectral image data sets are an example. The algorithm implements two main steps: 1) semi-supervised learning of the posterior class distributions followed by 2) segmentation, which infers an image of class labels from a posterior distribution built on the learned class distributions and on a Markov random field. The posterior class distributions are modeled using multinomial logistic regression, where the regressors are learned using both labeled and, through a graph-based technique, unlabeled samples. Such unlabeled samples are actively selected based on the entropy of the corresponding class label. The prior on the image of labels is a multilevel logistic model, which enforces segmentation results in which neighboring labels belong to the same class. The maximum a posteriori segmentation is computed by the α-expansion min-cut-based integer optimization algorithm. Our experimental results, conducted using synthetic and real hyperspectral image data sets collected by the Airborne Visible/Infrared
On the convergence of concave-convex procedure
 In NIPS Workshop on Optimization for Machine Learning
, 2009
Cited by 21 (0 self)
The concave-convex procedure (CCCP) is a majorization-minimization algorithm that solves d.c. (difference of convex functions) programs as a sequence of convex programs. In machine learning, CCCP is extensively used in many learning algorithms like sparse support vector machines (SVMs), transductive SVMs, sparse principal component analysis, etc. Though widely used in many applications, the convergence behavior of CCCP has not received much specific attention. Yuille and Rangarajan analyzed its convergence in their original paper; however, we believe the analysis is not complete. Although the convergence of CCCP can be derived from the convergence of the d.c. algorithm (DCA), its proof is more specialized and technical than actually required for the specific case of CCCP. In this paper, we follow a different reasoning and show how Zangwill’s global convergence theory of iterative algorithms provides a natural framework to prove the convergence of CCCP, allowing a more elegant and simple proof. This underlines Zangwill’s theory as a powerful and general framework to deal with the convergence issues of iterative algorithms, after also being used to prove the convergence of algorithms like expectation-maximization, generalized alternating minimization, etc. In this paper, we provide a rigorous analysis of the convergence of CCCP by addressing these questions: (i) When does CCCP find a local minimum or a stationary point of the d.c. program under consideration? (ii) When does the sequence generated by CCCP converge? We also present an open problem on the issue of local convergence of CCCP.
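The "sequence of convex programs" idea can be sketched on a toy d.c. program chosen for this illustration (it does not come from the paper): f(x) = x^4 - 2x^2, split as u(x) = x^4 and v(x) = 2x^2, both convex. Each CCCP step linearizes v at the current iterate and minimizes the resulting convex function, which here has the closed form 4x^3 = v'(x_k). The function name `cccp_quartic` is hypothetical.

```python
import math


def cccp_quartic(x0, iters=100):
    """CCCP on the toy d.c. objective f(x) = x**4 - 2*x**2 (illustrative).

    Step: x_{k+1} = argmin_x  x**4 - v'(x_k) * x,  with v'(x) = 4*x,
    whose stationarity condition 4*x**3 = 4*x_k gives a signed cube root.
    Returns the final iterate and the objective value at every iterate.
    """
    x = x0
    history = [x ** 4 - 2 * x ** 2]
    for _ in range(iters):
        slope = 4.0 * x  # gradient of the convex part being subtracted
        # Solve the convex subproblem in closed form (signed cube root).
        x = math.copysign(abs(slope / 4.0) ** (1.0 / 3.0), slope)
        history.append(x ** 4 - 2 * x ** 2)
    return x, history


if __name__ == "__main__":
    x_final, values = cccp_quartic(0.5)
    print(x_final, values[0], values[-1])
```

Starting from a positive point, the iterates approach the stationary point x = 1 while the objective does not increase, and starting exactly at x = 0 the iteration stays at that stationary point. This gives a concrete instance of question (i) in the abstract: the limit depends on the starting point and need not be a local minimum.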