Text Classification from Labeled and Unlabeled Documents using EM
Machine Learning, 1999
Cited by 632 (16 self)
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However, these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve ...
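The basic loop described in this abstract (train on labeled documents, probabilistically label the unlabeled ones, retrain on everything, repeat) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a multinomial naive Bayes model with Laplace smoothing, documents given as token lists, and a fixed number of iterations in place of a convergence test. The function names `train_nb`, `posterior` and `em_naive_bayes` are hypothetical, and the two extensions mentioned in the abstract are not covered.

```python
import math
from collections import Counter


def train_nb(docs, weights, n_classes, vocab):
    """M-step: fit naive Bayes from (possibly fractional) class memberships.

    weights[i][c] is the degree to which document i belongs to class c.
    Laplace smoothing (a count of 1) is an assumption of this sketch.
    """
    prior = [1.0] * n_classes
    word = [{t: 1.0 for t in vocab} for _ in range(n_classes)]
    for doc, w in zip(docs, weights):
        counts = Counter(doc)
        for c in range(n_classes):
            prior[c] += w[c]
            for tok, n in counts.items():
                word[c][tok] += w[c] * n
    total = sum(prior)
    log_prior = [math.log(p / total) for p in prior]
    log_word = []
    for c in range(n_classes):
        z = sum(word[c].values())
        log_word.append({t: math.log(v / z) for t, v in word[c].items()})
    return log_prior, log_word


def posterior(doc, model, n_classes):
    """E-step for one document: class probabilities under the current model."""
    log_prior, log_word = model
    scores = [
        log_prior[c] + sum(log_word[c][t] for t in doc if t in log_word[c])
        for c in range(n_classes)
    ]
    m = max(scores)  # subtract the max before exponentiating for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def em_naive_bayes(labeled, unlabeled, n_classes, n_iter=10):
    """labeled: list of (tokens, class_index); unlabeled: list of token lists."""
    vocab = {t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d}
    lab_docs = [d for d, _ in labeled]
    # Labeled documents keep hard (one-hot) memberships throughout.
    lab_w = [[1.0 if c == y else 0.0 for c in range(n_classes)] for _, y in labeled]
    model = train_nb(lab_docs, lab_w, n_classes, vocab)  # initial classifier
    for _ in range(n_iter):
        unl_w = [posterior(d, model, n_classes) for d in unlabeled]  # E-step
        model = train_nb(lab_docs + unlabeled, lab_w + unl_w, n_classes, vocab)  # M-step
    return model


# Toy usage: "team" and "senate" never appear in a labeled document.
labeled = [(["ball", "goal"], 0), (["vote", "law"], 1)]
unlabeled = [["ball", "team"], ["goal", "team"], ["law", "senate"], ["vote", "senate"]]
model = em_naive_bayes(labeled, unlabeled, n_classes=2)
print(posterior(["team"], model, 2))
print(posterior(["senate"], model, 2))
```

In the toy data the words "team" and "senate" occur only in unlabeled documents, so any class preference the model develops for them comes from the unlabeled pool through their co-occurrence with labeled words.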
Using Unlabeled Data to Improve Text Classification
2001
Cited by 41 (0 self)
One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. By assuming that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum a posteriori models and classifiers from all the data -- labeled and unlabeled. These generative models do not capture all the intricacies of text; however, on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse. Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling sub-topic class structure, and by modeling super-topic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve classification performance. The second problem is that even with a representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-probability local maxima.
Dirichlet Prior Sieves in Finite Normal Mixtures
Statistica Sinica, 2002
Cited by 24 (1 self)
Abstract: The use of a finite dimensional Dirichlet prior in the finite normal mixture model has the effect of acting like a Bayesian method of sieves. Posterior consistency is directly related to the dimension of the sieve and the choice of the Dirichlet parameters in the prior. We find that naive use of the popular uniform Dirichlet prior leads to an inconsistent posterior. However, a simple adjustment to the parameters in the prior induces a random probability measure that approximates the Dirichlet process and yields a posterior that is strongly consistent for the density and weakly consistent for the unknown mixing distribution. The dimension of the resulting sieve can be selected easily in practice and a simple and efficient Gibbs sampler can be used to sample the posterior of the mixing distribution. Key words and phrases: Bose-Einstein distribution, Dirichlet process, identification, method of sieves, random probability measure, relative entropy, weak convergence.
Approximate Dirichlet Process Computing in Finite Normal Mixtures: Smoothing and Prior Information
Journal of Computational and Graphical Statistics, 2000
Bayesian Model Selection in Finite Mixtures by Marginal Density Decompositions
Journal of the American Statistical Association, 2001
The Likelihood Ratio Test for Homogeneity in the Finite Mixture Models
2001
Cited by 11 (1 self)
The authors study the asymptotic behaviour of the likelihood ratio statistic for testing homogeneity in the finite mixture models of a general parametric distribution family. They prove that the limiting distribution of this statistic is the squared supremum of a truncated standard Gaussian process. The autocorrelation function of the Gaussian process is explicitly presented. A re-sampling procedure is recommended to obtain the asymptotic p-value. Three kernel functions, normal, binomial and Poisson, are used in a simulation study which illustrates the procedure.
Rates Of Convergence For The Gaussian Mixture Sieve
The Annals of Statistics, 2000
Cited by 8 (0 self)
Gaussian mixtures provide a convenient method of density estimation that lies somewhere between parametric models and kernel...
Semiparametric estimation of a two-component mixture model
Annals of Statistics, 2006
Cited by 8 (0 self)
Suppose that univariate data are drawn from a mixture of two distributions that are equal up to a shift parameter. Such a model is known to be nonidentifiable from a nonparametric viewpoint. However, if we assume that the unknown mixed distribution is symmetric, we obtain the identifiability of this model, which is then defined by four unknown parameters: the mixing proportion, two location parameters and the cumulative distribution function of the symmetric mixed distribution. We propose estimators for these four parameters when no training data is available. Our estimators are shown to be strongly consistent under mild regularity assumptions and their convergence rates are studied. Their finite-sample properties are illustrated by a Monte Carlo study and our method is applied to real data.
Identifiability of Finite Linear Regression Mixtures
1996
Cited by 2 (2 self)
Identifiability is a necessary condition for the existence of consistent estimates for the parameters of mixture models. In this paper the identifiability of finite mixtures of linear regression models with Normal errors is investigated. Three different models are treated: Mixture models with random and fixed independent variables and a model with fixed partition of the data to the mixture components. Sometimes only parts of the unknown parameter values are of interest. "Partial identifiability" is introduced for this purpose. It turns out that identifiability of finite linear regression mixtures depends on the number of (p − 1)-dimensional hyperplanes which one needs to cover the independent variables. Counterexamples and sufficient conditions for identifiability are given for all models.
1 Introduction
In general, a stochastic identifiability problem can be explained as follows:
Definition 1.1 (Identifiability) Let Ω be an arbitrary parameter space, P be some space of distr...
Toward Learning Gaussian Mixtures with Arbitrary Separation
Cited by 1 (1 self)
In recent years analysis of complexity of learning Gaussian mixture models from sampled data has received significant attention in computational machine learning and theory communities. In this paper we present the first result showing that polynomial time learning of multidimensional Gaussian Mixture distributions is possible when the separation between the component means is arbitrarily small. Specifically, we present an algorithm for learning the parameters of a mixture of k identical spherical Gaussians in n-dimensional space with an arbitrarily small separation between the components, which is polynomial in dimension, inverse component separation and other input parameters for a fixed number of components k. The algorithm uses a projection to k dimensions and then a reduction to the 1-dimensional case. It relies on a theoretical analysis showing that two 1-dimensional mixtures whose densities are close in the L2 norm must have similar means and mixing coefficients. To produce the necessary lower bound for the L2 norm in terms of the distances between the corresponding means, we analyze the behavior of the Fourier transform of a mixture of Gaussians in one dimension around the origin, which turns out to be closely related to the properties of the Vandermonde matrix obtained from the component means. Analysis of minors of the Vandermonde matrix together with basic function approximation results allows us to provide a lower bound for the norm of the mixture in the Fourier domain and hence a bound in the original space. Additionally, we present a separate argument for reconstructing variance.
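For orientation, the connection between the Fourier transform near the origin and the Vandermonde matrix can be seen from a standard identity. The sketch below assumes a one-dimensional mixture with weights \alpha_j, means \mu_j and a common variance \sigma^2; these symbols are chosen here for illustration and are not taken from the paper, and the derivation is not the paper's actual argument.

```latex
% Assumed setting: 1-D mixture of k Gaussians with identical variance.
p(x) = \sum_{j=1}^{k} \alpha_j \, \mathcal{N}(x;\, \mu_j, \sigma^2)

% Fourier transform (characteristic function) of the mixture:
\hat{p}(t) = \int_{-\infty}^{\infty} p(x)\, e^{itx}\, dx
           = e^{-\sigma^2 t^2 / 2} \sum_{j=1}^{k} \alpha_j\, e^{i \mu_j t}

% Taylor expansion of the sum around t = 0:
\sum_{j=1}^{k} \alpha_j\, e^{i \mu_j t}
  = \sum_{m=0}^{\infty} \frac{(it)^m}{m!} \sum_{j=1}^{k} \alpha_j\, \mu_j^{m}

% The first k coefficients form a Vandermonde system in the means:
\Big( \sum_{j=1}^{k} \alpha_j\, \mu_j^{m} \Big)_{m=0}^{k-1} = V \alpha,
\qquad V_{mj} = \mu_j^{m}
```

Since V is invertible exactly when the means are distinct, the low-order behavior of the transform near t = 0 carries information about how well separated the means are.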

