Probing the Pareto frontier for basis pursuit solutions
, 2008
Cited by 95 (0 self)
The basis pursuit problem seeks a minimum one-norm solution of an underdetermined least-squares problem. Basis pursuit denoise (BPDN) fits the least-squares problem only approximately, and a single parameter determines a curve that traces the optimal trade-off between the least-squares fit and the one-norm of the solution. We prove that this curve is convex and continuously differentiable over all points of interest, and show that it gives an explicit relationship to two other optimization problems closely related to BPDN. We describe a root-finding algorithm for finding arbitrary points on this curve; the algorithm is suitable for problems that are large scale and for those that are in the complex domain. At each iteration, a spectral gradient-projection method approximately minimizes a least-squares problem with an explicit one-norm constraint. Only matrix-vector operations are required. The primal-dual solution of this problem gives function and derivative information needed for the root-finding method. Numerical experiments on a comprehensive set of test problems demonstrate that the method scales well to large problems.
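The root-finding idea in this abstract can be sketched in a few lines of pure Python. This is a sketch under stated assumptions, not the authors' implementation: plain projected gradient is swapped in for the spectral gradient-projection method, the slope formula φ'(τ) = -‖Aᵀr‖∞ / ‖r‖₂ is the standard expression for the derivative of the trade-off curve and is not quoted from the abstract, and all function names (`project_l1_ball`, `lasso_pg`, `pareto_root`) are hypothetical. Only matrix-vector products with A and Aᵀ are used.

```python
import math

def project_l1_ball(v, tau):
    """Euclidean projection of v onto {x : ||x||_1 <= tau} (sort-based)."""
    if tau <= 0.0:
        return [0.0] * len(v)
    if sum(abs(t) for t in v) <= tau:
        return list(v)
    u = sorted((abs(t) for t in v), reverse=True)
    cumsum, theta = 0.0, 0.0
    for k, uk in enumerate(u, start=1):
        cumsum += uk
        t = (cumsum - tau) / k
        if uk - t > 0.0:          # holds for a prefix of k; keep the last one
            theta = t
    return [math.copysign(max(abs(t) - theta, 0.0), t) for t in v]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rmatvec(A, r):
    """Compute A^T r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]

def residual(A, x, b):
    return [bi - yi for bi, yi in zip(b, matvec(A, x))]

def lasso_pg(A, b, tau, x0, iters=2000):
    """Approximately minimize ||Ax - b||_2 subject to ||x||_1 <= tau.

    Assumption: fixed-step projected gradient, not the spectral variant.
    """
    step = 1.0 / sum(a * a for row in A for a in row)  # 1/||A||_F^2 <= 1/L
    x = project_l1_ball(x0, tau)
    for _ in range(iters):
        g = rmatvec(A, residual(A, x, b))  # negative gradient of 0.5*||Ax-b||^2
        x = project_l1_ball([xi + step * gi for xi, gi in zip(x, g)], tau)
    return x

def pareto_root(A, b, sigma, tol=1e-6, max_newton=20):
    """Newton iteration on phi(tau) = ||b - A x_tau||_2 to reach phi(tau) = sigma."""
    tau, x = 0.0, [0.0] * len(A[0])
    for _ in range(max_newton):
        x = lasso_pg(A, b, tau, x)          # warm start from the previous x
        r = residual(A, x, b)
        phi = math.sqrt(sum(t * t for t in r))
        if abs(phi - sigma) <= tol or phi == 0.0:
            break
        dphi = -max(abs(t) for t in rmatvec(A, r)) / phi  # slope of the curve
        if dphi == 0.0:
            break
        tau += (sigma - phi) / dphi         # Newton update on tau
    else:
        x = lasso_pg(A, b, tau, x)          # keep x consistent with the final tau
    return x, tau
```

For example, `pareto_root([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]], [1.0, 2.0], 0.5)` returns a point whose residual norm is close to 0.5, with `tau` equal to the one-norm bound that achieves it.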
Wide-coverage efficient statistical parsing with CCG and log-linear models
- Computational Linguistics
, 2007
Cited by 87 (20 self)
This paper describes a number of log-linear parsing models for an automatically extracted lexicalized grammar. The models are "full" parsing models in the sense that probabilities are defined for complete parses, rather than for independent events derived by decomposing the parse tree. Discriminative training is used to estimate the models, which requires incorrect parses for each sentence in the training data as well as the correct parse. The lexicalized grammar formalism used is Combinatory Categorial Grammar (CCG), and the grammar is automatically extracted from CCGbank, a CCG version of the Penn Treebank. The combination of discriminative training and an automatically extracted grammar leads to a significant memory requirement (over 20 GB), which is satisfied using a parallel implementation of the BFGS optimisation algorithm running on a Beowulf cluster. Dynamic programming over a packed chart, in combination with the parallel implementation, allows us to solve one of the largest-scale estimation problems in the statistical parsing literature in under three hours. A key component of the parsing system, for both training and testing, is a Maximum Entropy supertagger which assigns CCG lexical categories to words in a sentence. The supertagger makes the discriminative training feasible, and also leads to a highly efficient parser. Surprisingly,
Fast maximum margin matrix factorization for collaborative prediction
- In Proceedings of the 22nd International Conference on Machine Learning (ICML)
, 2005
Cited by 85 (7 self)
Maximum Margin Matrix Factorization (MMMF) was recently suggested (Srebro et al., 2005) as a convex, infinite-dimensional alternative to low-rank approximations and standard factor models. MMMF can be formulated as a semi-definite program (SDP) and learned using standard SDP solvers. However, current SDP solvers can only handle MMMF problems on matrices of dimensionality up to a few hundred. Here, we investigate a direct gradient-based optimization method for MMMF and demonstrate it on large collaborative prediction problems. We compare against results obtained by Marlin (2004) and find that MMMF substantially outperforms all nine methods he tested.
Algorithms and applications for approximate nonnegative matrix factorization
- Computational Statistics and Data Analysis
, 2006
Cited by 81 (6 self)
In this paper we discuss the development and use of low-rank approximate nonnegative matrix factorization (NMF) algorithms for feature extraction and identification in the fields of text mining and spectral data analysis. The evolution and convergence properties of hybrid methods based on both sparsity and smoothness constraints for the resulting nonnegative matrix factors are discussed. The interpretability of NMF outputs in specific contexts is addressed, along with opportunities for future work in the modification of NMF algorithms for large-scale and time-varying datasets. Key words: nonnegative matrix factorization, text mining, spectral data analysis, email surveillance, conjugate gradient, constrained least squares.
An interior-point method for large-scale l1-regularized logistic regression
- Journal of Machine Learning Research
, 2007
Cited by 77 (3 self)
Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale ℓ1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium-sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method that uses a preconditioned conjugate gradient method to compute the search step can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently.
Predicting clicks: Estimating the click-through rate for new ads
- In Proceedings of the 16th International World Wide Web Conference (WWW-07)
, 2007
Cited by 68 (1 self)
Search engine advertising has become a significant element of the Web browsing experience. Choosing the right ads for the query and the order in which they are displayed greatly affects the probability that a user will see and click on each ad. This ranking has a strong impact on the revenue the search engine receives from the ads. Further, showing the user an ad that they prefer to click on improves user satisfaction. For these reasons, it is important to be able to accurately estimate the click-through rate of ads in the system. For ads that have been displayed repeatedly, this is empirically measurable, but for new ads, other means must be used. We show that we can use features of ads, terms, and advertisers to learn a model that accurately predicts the click-through rate for new ads. We also show that using our model improves the convergence and performance of an advertising system. As a result, our model increases both revenue and user satisfaction.
A Note on Platt's Probabilistic Outputs for Support Vector Machines
, 2003
Cited by 63 (4 self)
Platt's probabilistic outputs for Support Vector Machines [6] have been popular for applications that require posterior class probabilities. In this note, we propose an improvement which theoretically converges and avoids numerical difficulties. Simpler, ready-to-use pseudocode is included.
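To illustrate the kind of numerical difficulty mentioned here, the sketch below evaluates the per-example cross-entropy loss of a Platt-style sigmoid p = 1 / (1 + exp(A·f + B)) without overflow. It is an assumption-laden illustration rather than the note's pseudocode: the abstract does not give the formulas, the rewriting by the sign of z = A·f + B is one common way to stabilize the computation, and the function name `stable_platt_loss` is hypothetical.

```python
import math

def stable_platt_loss(f, t, A, B):
    """Cross-entropy -(t*log(p) + (1-t)*log(1-p)) with p = 1/(1+exp(A*f+B)).

    Algebraically the loss equals log(1 + exp(z)) - (1 - t) * z, z = A*f + B.
    Branching on the sign of z keeps the exponent non-positive, so exp()
    cannot overflow and log1p() never sees a huge argument.
    """
    z = A * f + B
    if z >= 0.0:
        return t * z + math.log1p(math.exp(-z))
    return (t - 1.0) * z + math.log1p(math.exp(z))
```

A naive evaluation such as `-math.log(1.0 / (1.0 + math.exp(z)))` raises `OverflowError` in Python once z is large (for example z = 1000), whereas the branch above stays finite.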
Scalable training of L1-regularized log-linear models
- In ICML ’07
, 2007
Cited by 62 (1 self)
The L-BFGS limited-memory quasi-Newton method is the algorithm of choice for optimizing the parameters of large-scale log-linear models with L2 regularization, but it cannot be used for an L1-regularized loss due to its non-differentiability whenever some parameter is zero. Efficient algorithms have been proposed for this task, but they are impractical when the number of parameters is very large. We present an algorithm, Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), based on L-BFGS, that can efficiently optimize the L1-regularized log-likelihood of log-linear models with millions of parameters. In our experiments on a parse reranking task, our algorithm was several orders of magnitude faster than an alternative algorithm, and substantially faster than L-BFGS on the analogous L2-regularized problem. We also present a proof that OWL-QN is guaranteed to converge to a globally optimal parameter vector.
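The abstract does not spell out how the non-differentiability at zero is handled. As a hedged sketch, the two ingredients usually associated with the orthant-wise approach are a pseudo-gradient of loss(x) + C·‖x‖₁ and a projection that zeroes any coordinate leaving its chosen orthant. The code below shows only those two pieces, with no L-BFGS direction or line search; the function names and the regularization weight `C` are illustrative assumptions.

```python
def pseudo_gradient(x, grad, C):
    """Pseudo-gradient of loss(x) + C * ||x||_1, given grad = gradient of loss.

    Where x_i != 0 the L1 term is differentiable. Where x_i == 0 the
    one-sided derivatives are grad_i - C (left) and grad_i + C (right);
    pick whichever one indicates a descent direction, else return 0.
    """
    pg = []
    for xi, gi in zip(x, grad):
        if xi > 0.0:
            pg.append(gi + C)
        elif xi < 0.0:
            pg.append(gi - C)
        elif gi + C < 0.0:      # moving into the positive orthant decreases f
            pg.append(gi + C)
        elif gi - C > 0.0:      # moving into the negative orthant decreases f
            pg.append(gi - C)
        else:                   # zero is optimal for this coordinate
            pg.append(0.0)
    return pg

def project_orthant(x, orthant):
    """Zero every coordinate of x whose sign disagrees with `orthant`."""
    return [xi if xi * oi > 0.0 else 0.0 for xi, oi in zip(x, orthant)]
```

A coordinate with a zero pseudo-gradient entry at x_i = 0 stays at zero, which is how sparsity is preserved from one iterate to the next in this kind of scheme.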
The importance of supertagging for wide-coverage CCG parsing
- In Proceedings of COLING-04
, 2004
Cited by 56 (15 self)
This paper describes the role of supertagging in a wide-coverage CCG parser which uses a log-linear model to select an analysis. The supertagger reduces the derivation space over which model estimation is performed, reducing the space required for discriminative training. It also dramatically increases the speed of the parser. We show that large increases in speed can be obtained by tightly integrating the supertagger with the CCG grammar and parser. This is the first work we are aware of to successfully integrate a supertagger with a full parser which uses an automatically extracted grammar. We also further reduce the derivation space using constraints on category combination. The result is an accurate wide-coverage CCG parser which is an order of magnitude faster than comparable systems for other linguistically motivated formalisms.
Parallel Lagrange-Newton-Krylov-Schur methods for PDE-constrained optimization. Part I: The Krylov-Schur solver
- SIAM J. Sci. Comput
, 2000
Cited by 53 (8 self)
Large-scale optimization of systems governed by partial differential equations (PDEs) is a frontier problem in scientific computation. The state of the art for such problems is reduced quasi-Newton sequential quadratic programming (SQP) methods. These methods take full advantage of existing PDE solver technology and parallelize well. However, their algorithmic scalability is questionable; for certain problem classes they can be very slow to converge. In this two-part article we propose a new method for steady-state PDE-constrained optimization, based on the idea of full-space SQP with reduced-space quasi-Newton SQP preconditioning. The basic components of the method are: Newton solution of the first-order optimality conditions that characterize stationarity of the Lagrangian function; Krylov solution of the Karush-Kuhn-Tucker (KKT) linear systems arising at each Newton iteration using a symmetric quasi-minimum residual method; preconditioning of the KKT system using an approximate state/decision variable decomposition that replaces the forward PDE Jacobians by their own preconditioners, and the decision space Schur complement (the reduced Hessian) by a BFGS approximation or by a two-step stationary method. Accordingly, we term the new method Lagrange-Newton-Krylov-Schur (LNKS). It is fully parallelizable, exploits the structure of available parallel algorithms for the PDE forward problem, and is locally quadratically convergent. In the first part of the paper we investigate the effectiveness of the KKT linear system solver. We test the method on two optimal control problems in which the flow is described by the steady-state Stokes equations. The

