On the algorithmic implementation of multiclass kernel-based vector machines
 Journal of Machine Learning Research
Cited by 363 (13 self)
In this paper we describe the algorithmic implementation of multiclass kernel-based vector machines. Our starting point is a generalized notion of the margin to multiclass problems. Using this notion we cast multiclass categorization problems as a constrained optimization problem with a quadratic objective function. Unlike most previous approaches, which typically decompose a multiclass problem into multiple independent binary classification tasks, our notion of margin yields a direct method for training multiclass predictors. By using the dual of the optimization problem we are able to incorporate kernels with a compact set of constraints and decompose the dual problem into multiple optimization problems of reduced size. We describe an efficient fixed-point algorithm for solving the reduced optimization problems and prove its convergence. We then discuss technical details that yield significant running time improvements for large datasets. Finally, we describe various experiments with our approach comparing it to previously studied kernel-based methods. Our experiments indicate that for multiclass problems we attain state-of-the-art accuracy.
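The abstract does not spell out its margin definition, so the following is only a minimal sketch of one common multiclass margin formulation, assumed here for illustration: each class has its own weight vector, the prediction is the highest-scoring class, and the loss measures how far the correct class falls short of beating every other class by a margin of 1. The names `multiclass_margin_loss`, `predict` and `W` are hypothetical, and the sketch is linear only (no kernels, no dual).

```python
def class_scores(W, x):
    """Inner product of each per-class weight vector in W with the feature vector x."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in W]


def predict(W, x):
    """Predict the class whose weight vector gives the highest score."""
    scores = class_scores(W, x)
    return max(range(len(scores)), key=scores.__getitem__)


def multiclass_margin_loss(W, x, y):
    """Assumed margin-based loss: max_r (score_r + 1 - [r == y]) - score_y.

    The loss is 0 when the correct class y beats every other class by at least 1.
    """
    scores = class_scores(W, x)
    augmented = [s + (0.0 if r == y else 1.0) for r, s in enumerate(scores)]
    return max(augmented) - scores[y]


W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # three classes, two features
x = [2.0, 0.5]
print(predict(W, x))                    # class 0 has the highest score
print(multiclass_margin_loss(W, x, 0))  # correct class wins with margin >= 1
print(multiclass_margin_loss(W, x, 1))  # correct class loses to class 0
```

A single loss over all classes, rather than one binary loss per class, is what makes this a direct multiclass formulation in the sense the abstract describes.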
Discriminative Reranking for Natural Language Parsing
2005
Cited by 268 (9 self)
This article considers approaches which rerank the output of an existing probabilistic parser. The base parser produces a set of candidate parses for each input sentence, with associated probabilities that define an initial ranking of these parses. A second model then attempts to improve upon this initial ranking, using additional features of the tree as evidence. The strength of our approach is that it allows a tree to be represented as an arbitrary set of features, without concerns about how these features interact or overlap and without the need to define a derivation or a generative model which takes these features into account. We introduce a new method for the reranking task, based on the boosting approach to ranking problems described in Freund et al. (1998). We apply the boosting method to parsing the Wall Street Journal treebank. The method combined the log-likelihood under a baseline model (that of Collins [1999]) with evidence from an additional 500,000 features over parse trees that were not included in the original model. The new model achieved 89.75% F-measure, a 13% relative decrease in F-measure error over the baseline model’s score of 88.2%. The article also introduces a new algorithm for the boosting approach which takes advantage of the sparsity of the feature space in the parsing data. Experiments show significant efficiency gains for the new algorithm over the obvious implementation of the boosting approach. We argue that the method is an appealing alternative, in terms of both simplicity and efficiency, to work on feature selection methods within log-linear (maximum-entropy) models. Although the experiments in this article are on natural language parsing (NLP), the approach should be applicable to many other NLP problems which are naturally framed as ranking tasks, for example, speech recognition, machine translation, or natural language generation.
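The "13% relative decrease in F-measure error" can be checked from the two scores quoted in the abstract, assuming that F-measure error means 100 minus the F-measure. The helper name `relative_error_reduction` is hypothetical; this is a short arithmetic sketch, not code from the article.

```python
def relative_error_reduction(baseline_f, new_f):
    """Fraction of the baseline F-measure error (100 - F) removed by the new model."""
    baseline_error = 100.0 - baseline_f  # 11.8 points for the baseline
    new_error = 100.0 - new_f            # 10.25 points for the reranker
    return (baseline_error - new_error) / baseline_error


reduction = relative_error_reduction(88.2, 89.75)
print(f"{reduction:.1%}")  # prints 13.1%
```

The absolute gain is 1.55 F-measure points; dividing by the baseline error of 11.8 points gives the relative figure.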
Learning to resolve natural language ambiguities: A unified approach
 In Proceedings of the National Conference on Artificial Intelligence, pp. 806-813
1998
Cited by 169 (84 self)
We analyze a few of the commonly used statistics based and machine learning algorithms for natural language disambiguation tasks and observe that they can be recast as learning linear separators in the feature space. Each of the methods makes a priori assumptions, which it employs, given the data, when searching for its hypothesis. Nevertheless, as we show, it searches a space that is as rich as the space of all linear separators. We use this to build an argument for a data driven approach which merely searches for a good linear separator in the feature space, without further assumptions ...
... distinct semantic concepts such as interest rate and has interest in Math are conflated in ordinary text. The surrounding context, word associations and syntactic patterns in this case, are sufficient to identify the correct form. Many of these are important stand-alone problems, but even more important is their role in many applications including speech recognition, machine translation, information extraction and intelligent human-machine interaction. Most of the ambiguity resolution problems are at the lower level of the natural language inferences chain; a wide range and a large number of ambigui...
The Hardness of Approximate Optima in Lattices, Codes, and Systems of Linear Equations
1993
Cited by 159 (7 self)
We prove the following about the Nearest Lattice Vector Problem (in any $\ell_p$ norm), the Nearest Codeword Problem for binary codes, the problem of learning a halfspace in the presence of errors, and some other problems. 1. Approximating the optimum within any constant factor is NP-hard. 2. If for some $\epsilon > 0$ there exists a polynomial-time algorithm that approximates the optimum within a factor of $2^{\log^{0.5-\epsilon} n}$, then every NP language can be decided in quasi-polynomial deterministic time, i.e., $\mathrm{NP} \subseteq \mathrm{DTIME}(n^{\mathrm{poly}(\log n)})$. Moreover, we show that result 2 also holds for the Shortest Lattice Vector Problem in the $\ell_\infty$ norm. Also, for some of these problems we can prove the same result as above, but for a larger factor such as $2^{\log^{1-\epsilon} n}$ or $n^{\epsilon}$. Improving the factor $2^{\log^{0.5-\epsilon} n}$ to $\sqrt{\text{dimension}}$ for either of the lattice problems would imply the hardness of the Shortest Vector Problem in the $\ell_2$ norm, an old open problem. Our proofs use reductions from fewpr...
Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey
 Data Mining and Knowledge Discovery
1997
Cited by 146 (1 self)
Decision trees have proved to be valuable tools for the description, classification and generalization of data. Work on constructing decision trees from data exists in multiple disciplines such as statistics, pattern recognition, decision theory, signal processing, machine learning and artificial neural networks. Researchers in these disciplines, sometimes working on quite different problems, identified similar issues and heuristics for decision tree construction. This paper surveys existing work on decision tree construction, attempting to identify the important issues involved, directions the work has taken and the current state of the art.
Keywords: classification, tree-structured classifiers, data compaction
1. Introduction
Advances in data collection methods, storage and processing technology are providing a unique challenge and opportunity for automated data exploration techniques. Enormous amounts of data are being collected daily from major scientific projects e.g., Human Genome...
Theory and Applications of Agnostic PAC-Learning with Small Decision Trees
1995
Cited by 75 (2 self)
We exhibit a theoretically founded algorithm T2 for agnostic PAC-learning of decision trees of at most 2 levels, whose computation time is almost linear in the size of the training set. We evaluate the performance of this learning algorithm T2 on 15 common "real-world" datasets, and show that for most of these datasets T2 provides simple decision trees with little or no loss in predictive power (compared with C4.5). In fact, for datasets with continuous attributes its error rate tends to be lower than that of C4.5. To the best of our knowledge this is the first time that a PAC-learning algorithm is shown to be applicable to "real-world" classification problems. Since one can prove that T2 is an agnostic PAC-learning algorithm, T2 is guaranteed to produce close to optimal 2-level decision trees from sufficiently large training sets for any (!) distribution of data. In this regard T2 differs strongly from all other learning algorithms that are considered in applied machine learning, for w...
Learning in natural language
 Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI ’99); 31 July–6
1999
Cited by 42 (22 self)
Statistics-based classifiers in natural language are developed typically by assuming a generative model for the data, estimating its parameters from training data and then using Bayes rule to obtain a classifier. For many problems the assumptions made by the generative models are evidently wrong, leaving open the question of why these approaches work. This paper presents a learning theory account of the major statistical approaches to learning in natural language. A class of Linear Statistical Queries (LSQ) hypotheses is defined and learning with it is shown to exhibit some robustness properties. Many statistical learners used in natural language, including naive Bayes, Markov Models and Maximum Entropy models are shown to be LSQ hypotheses, explaining the robustness of these predictors even when the underlying probabilistic assumptions do not hold. This coherent view of when and why learning approaches work in this context may help to develop better learning methods and an understanding of the role of learning in natural language inferences.
Computing the Maximum Bichromatic Discrepancy, with applications to Computer Graphics and Machine Learning
 Journal of Computer and System Sciences
1996
Cited by 39 (8 self)
Computing the maximum bichromatic discrepancy is an interesting theoretical problem with important applications in computational learning theory, computational geometry and computer graphics. In this paper we give algorithms to compute the maximum bichromatic discrepancy for simple geometric ranges, including rectangles and halfspaces. In addition, we give extensions to other discrepancy problems.
(Footnote 1: The research work of these authors was supported by NSF Grant CCR9301254 and the Geometry Center.)
1. Introduction
The main theme of this paper is to present efficient algorithms that solve the problem of computing the maximum bichromatic discrepancy for axis-oriented rectangles. This problem arises naturally in different areas of computer science, such as computational learning theory, computational geometry and computer graphics ([Ma], [DG]), and has applications in all these areas. In computational learning theory, the problem of agnostic PAC-learning with simple geometric hypothese...
Efficient agnostic PAC-learning with simple hypotheses
 Proc. of the 7th Annual ACM Conference on Computational Learning Theory
1994
Cited by 36 (3 self)
We exhibit efficient algorithms for agnostic PAC-learning with rectangles, unions of two rectangles, and unions of k intervals as hypotheses. These hypothesis classes are of some interest from the point of view of applied machine learning, because empirical studies show that hypotheses of this simple type (in just one or two of the attributes) provide good prediction rules for various real-world classification problems. In addition, optimal hypotheses of this type may provide valuable heuristic insight into the structure of a real-world classification problem. The algorithms that are introduced in this paper make it feasible to compute optimal hypotheses of this type for a training set of several hundred examples. We also exhibit an approximation algorithm that can compute nearly optimal hypotheses for much larger datasets.
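To make "optimal hypotheses of this type" concrete, here is a minimal sketch for the simplest case only: a single interval (k = 1) over one numeric attribute, chosen to minimize training errors. It is not the paper's algorithm; it assumes binary labels in {0, 1} and reduces the search to a maximum-sum run over the sorted points. The name `best_interval` is hypothetical.

```python
from itertools import groupby


def best_interval(points):
    """Find the closed interval [lo, hi] that minimizes training errors.

    points: list of (x, label) pairs with label in {0, 1}.
    The hypothesis predicts 1 inside the interval and 0 outside.
    Returns (errors, (lo, hi)), or (errors, None) if predicting 0 everywhere is best.
    """
    total_pos = sum(label for _, label in points)

    # Net gain of covering each distinct x: +1 per positive, -1 per negative.
    gains = []
    for x, group in groupby(sorted(points), key=lambda p: p[0]):
        gains.append((x, sum(1 if label == 1 else -1 for _, label in group)))

    # errors = total_pos - (positives inside - negatives inside), so we look
    # for the contiguous run of x values with the largest total gain.
    best_gain, best_range = 0, None
    run_gain, run_start = 0, None
    for x, gain in gains:
        if run_gain <= 0:
            run_gain, run_start = gain, x  # start a fresh run at x
        else:
            run_gain += gain               # extend the current run
        if run_gain > best_gain:
            best_gain, best_range = run_gain, (run_start, x)
    return total_pos - best_gain, best_range


sample = [(0.1, 0), (0.2, 1), (0.3, 1), (0.4, 0), (0.5, 1), (0.6, 0)]
print(best_interval(sample))  # one error: the positive at 0.5 is left outside
```

Because no distribution over the data is assumed, the best interval may still misclassify some training points, which is the agnostic setting the abstract refers to.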
Linear Concepts and Hidden Variables
2000
Cited by 22 (16 self)
We study a learning problem which allows for a "fair" comparison between unsupervised learning methods (probabilistic model construction) and more traditional algorithms that directly learn a classification. The merits of each approach are intuitively clear: inducing a model is more expensive computationally, but may support a wider range of predictions. Its performance, however, will depend on how well the postulated probabilistic model fits that data. To compare the paradigms we consider a model which postulates a single binary-valued hidden variable on which all other attributes depend. In this model, finding the most likely value of any one variable (given known values for the others) reduces to testing a linear function of the observed values. We learn the model with two techniques: the standard EM algorithm, and a new algorithm we develop based on covariances. We compare these, in a controlled fashion, against an algorithm (a version of Winnow) that attempts to find a good l...