Results 1 - 10
of
107
Generalized Additive Models
, 1995
"... This article describes flexible statistical methods that may be used to identify and characterize nonlinear regression effects. These methods are called "generalized additive models". For example, a commonly used statistical model in medical research is the logistic regression model for binary data. ..."
Abstract
-
Cited by 968 (32 self)
- Add to MetaCart
This article describes flexible statistical methods that may be used to identify and characterize nonlinear regression effects. These methods are called "generalized additive models". For example, a commonly used statistical model in medical research is the logistic regression model for binary data. Here we relate the mean of the binary response ¯ = P (y = 1) to the predictors via a linear regression model and the logit link function: log
Fisher Discriminant Analysis With Kernels
, 1999
"... A non-linear classification technique based on Fisher's discriminant is proposed. The main ingredient is the kernel trick which allows the efficient computation of Fisher discriminant in feature space. The linear classification in feature space corresponds to a (powerful) non-linear decision functi ..."
Abstract
-
Cited by 231 (14 self)
- Add to MetaCart
A non-linear classification technique based on Fisher's discriminant is proposed. The main ingredient is the kernel trick which allows the efficient computation of Fisher discriminant in feature space. The linear classification in feature space corresponds to a (powerful) non-linear decision function in input space. Large scale simulations demonstrate the competitiveness of our approach.
Unsupervised learning of finite mixture models
- IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2002
"... AbstractÐThis paper proposes an unsupervised algorithm for learning a finite mixture model from multivariate data. The adjective ªunsupervisedº is justified by two properties of the algorithm: 1) it is capable of selecting the number of components and 2) unlike the standard expectation-maximization ..."
Abstract
-
Cited by 201 (16 self)
- Add to MetaCart
AbstractÐThis paper proposes an unsupervised algorithm for learning a finite mixture model from multivariate data. The adjective ªunsupervisedº is justified by two properties of the algorithm: 1) it is capable of selecting the number of components and 2) unlike the standard expectation-maximization (EM) algorithm, it does not require careful initialization. The proposed method also avoids another drawback of EM for mixture fitting: the possibility of convergence toward a singular estimate at the boundary of the parameter space. The novelty of our approach is that we do not use a model selection criterion to choose one among a set of preestimated candidate models; instead, we seamlessly integrate estimation and model selection in a single algorithm. Our technique can be applied to any type of parametric mixture model for which it is possible to write an EM algorithm; in this paper, we illustrate it with experiments involving Gaussian mixtures. These experiments testify for the good performance of our approach. Index TermsÐFinite mixtures, unsupervised learning, model selection, minimum message length criterion, Bayesian methods, expectation-maximization algorithm, clustering. 1
Model-Based Clustering, Discriminant Analysis, and Density Estimation
- JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
, 2000
"... Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures and most clustering methods available in commercial software are also of this type. However, there is little ..."
Abstract
-
Cited by 171 (23 self)
- Add to MetaCart
Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as \How many clusters are there?", "Which clustering method should be used?" and \How should outliers be handled?". We outline a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, mineeld detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology, a...
A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms
, 2000
"... . Twenty-two decision tree, nine statistical, and two neural network algorithms are compared on thirty-two datasets in terms of classication accuracy, training time, and (in the case of trees) number of leaves. Classication accuracy is measured by mean error rate and mean rank of error rate. Both cr ..."
Abstract
-
Cited by 134 (6 self)
- Add to MetaCart
. Twenty-two decision tree, nine statistical, and two neural network algorithms are compared on thirty-two datasets in terms of classication accuracy, training time, and (in the case of trees) number of leaves. Classication accuracy is measured by mean error rate and mean rank of error rate. Both criteria place a statistical, spline-based, algorithm called Polyclass at the top, although it is not statistically signicantly dierent from twenty other algorithms. Another statistical algorithm, logistic regression, is second with respect to the two accuracy criteria. The most accurate decision tree algorithm is Quest with linear splits, which ranks fourth and fth, respectively. Although spline-based statistical algorithms tend to have good accuracy, they also require relatively long training times. Polyclass, for example, is third last in terms of median training time. It often requires hours of training compared to seconds for other algorithms. The Quest and logistic regression algor...
Map-reduce for machine learning on multicore
- In Proceedings of NIPS
, 2007
"... We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up. In this paper, we dev ..."
Abstract
-
Cited by 83 (5 self)
- Add to MetaCart
We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up. In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time. Specifically, we show that algorithms that fit the Statistical Query model [15] can be written in a certain “summation form, ” which allows them to be easily parallelized on multicore computers. We adapt Google’s map-reduce [7] paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression
Learning from incomplete data
, 1994
"... Real-world learning tasks often involve high-dimensional data sets with complex patterns of missing features. In this paper we review the problem of learning from incomplete data from two statistical perspectives---the likelihood-based and the Bayesian. The goal is two-fold: to place current neura ..."
Abstract
-
Cited by 49 (0 self)
- Add to MetaCart
Real-world learning tasks often involve high-dimensional data sets with complex patterns of missing features. In this paper we review the problem of learning from incomplete data from two statistical perspectives---the likelihood-based and the Bayesian. The goal is two-fold: to place current neural network approaches to missing data within a statistical framework, and to describe a set of algorithms, derived from the likelihood-based framework, that handle clustering, classification, and function approximation from incomplete data in a principled and efficient manner. These algorithms are based on mixture modeling and maketwo distinct appeals to the Expectation-Maximization (EM) principle (Dempster et al., 1977)---both for the estimation of mixture components and for coping with the missing data.
Graphical models and automatic speech recognition
- Mathematical Foundations of Speech and Language Processing
, 2003
"... Graphical models provide a promising paradigm to study both existing and novel techniques for automatic speech recognition. This paper first provides a brief overview of graphical models and their uses as statistical models. It is then shown that the statistical assumptions behind many pattern recog ..."
Abstract
-
Cited by 49 (10 self)
- Add to MetaCart
Graphical models provide a promising paradigm to study both existing and novel techniques for automatic speech recognition. This paper first provides a brief overview of graphical models and their uses as statistical models. It is then shown that the statistical assumptions behind many pattern recognition techniques commonly used as part of a speech recognition system can be described by a graph – this includes Gaussian distributions, mixture models, decision trees, factor analysis, principle component analysis, linear discriminant analysis, and hidden Markov models. Moreover, this paper shows that many advanced models for speech recognition and language processing can also be simply described by a graph, including many at the acoustic-, pronunciation-, and language-modeling levels. A number of speech recognition techniques born directly out of the graphical-models paradigm are also surveyed. Additionally, this paper includes a novel graphical analysis regarding why derivative (or delta) features improve hidden Markov model-based speech recognition by improving structural discriminability. It also includes an example where a graph can be used to represent language model smoothing constraints. As will be seen, the space of models describable by a graph is quite large. A thorough exploration of this space should yield techniques that ultimately will supersede the hidden Markov model.
Bankruptcy Analysis with Self-Organizing Maps in Learning Metrics
- IEEE Transactions on Neural Networks
, 2001
"... We introduce a method for deriving a metric, locally based on the Fisher information matrix, into the data space. A Self-Organizing Map is computed in the new metric to explore financial statements of enterprises. The metric measures local distances in terms of changes in the distribution of an auxi ..."
Abstract
-
Cited by 46 (19 self)
- Add to MetaCart
We introduce a method for deriving a metric, locally based on the Fisher information matrix, into the data space. A Self-Organizing Map is computed in the new metric to explore financial statements of enterprises. The metric measures local distances in terms of changes in the distribution of an auxiliary random variable that reflects what is important in the data. In this paper the variable indicates bankruptcy within the next few years. The conditional density of the auxiliary variable is first estimated, and the change in the estimate resulting from local displacements in the primary data space is measured using the Fisher information matrix. When a Self-Organizing Map is computed in the new metric it still visualizes the data space in a topology-preserving fashion, but represents the (local) directions in which the probability of bankruptcy changes the most.
Multiclass Linear Dimension Reduction by Weighted Pairwise Fisher Criteria
- IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2001
"... We derive a class of computationally inexpensive linear dimension reduction criteria by introducing a weighted variant of the well-known K-class Fisher criterion associated with linear discriminant analysis (LDA). It can be seen that LDA weights contributions of individual class pairs according to ..."
Abstract
-
Cited by 42 (2 self)
- Add to MetaCart
We derive a class of computationally inexpensive linear dimension reduction criteria by introducing a weighted variant of the well-known K-class Fisher criterion associated with linear discriminant analysis (LDA). It can be seen that LDA weights contributions of individual class pairs according to the Euclidian distance of the respective class means. We generalize upon LDA by introducing a different weighting function.

