Results 1-10 of 15
Linear Concepts and Hidden Variables
2000
Cited by 22 (16 self)
We study a learning problem which allows for a "fair" comparison between unsupervised learning methods (probabilistic model construction) and more traditional algorithms that directly learn a classification. The merits of each approach are intuitively clear: inducing a model is more expensive computationally, but may support a wider range of predictions. Its performance, however, will depend on how well the postulated probabilistic model fits the data. To compare the paradigms we consider a model which postulates a single binary-valued hidden variable on which all other attributes depend. In this model, finding the most likely value of any one variable (given known values for the others) reduces to testing a linear function of the observed values. We learn the model with two techniques: the standard EM algorithm, and a new algorithm we develop based on covariances. We compare these, in a controlled fashion, against an algorithm (a version of Winnow) that attempts to find a good l...
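The claim that prediction in this model reduces to a linear test can be checked numerically. The sketch below is illustrative only, not the paper's learning algorithm. It assumes the parameters are already known, that the observed attributes are binary and independent given the hidden variable H, and it uses made-up values for prior, p0 and p1. It compares a threshold test on the log-odds of H (an affine function of the observed bits) with direct marginalization over H.

```python
import itertools
import math

def log_odds_hidden(x, prior, p0, p1, skip):
    """Log-odds of H=1 vs H=0 given every observed bit except x[skip].
    The result is a constant plus a weighted sum of the observed bits."""
    s = math.log(prior / (1 - prior))
    for j, xj in enumerate(x):
        if j == skip:
            continue
        bias = math.log((1 - p1[j]) / (1 - p0[j]))
        weight = math.log(p1[j] / p0[j]) - bias
        s += weight * xj + bias
    return s

def predict_linear(x, prior, p0, p1, k):
    """Most likely value of x[k], decided by thresholding the linear function."""
    lo, hi = sorted((p0[k], p1[k]))
    if hi <= 0.5:
        return 0
    if lo >= 0.5:
        return 1
    # Posterior P(H=1 | rest) at which P(x[k]=1 | rest) equals exactly 1/2.
    t = (0.5 - p0[k]) / (p1[k] - p0[k])
    above = log_odds_hidden(x, prior, p0, p1, k) > math.log(t / (1 - t))
    return int(above) if p1[k] > p0[k] else int(not above)

def predict_bruteforce(x, prior, p0, p1, k):
    """Most likely value of x[k], computed by summing over the hidden variable."""
    joint = []
    for ph, p in ((1 - prior, p0), (prior, p1)):
        l = ph
        for j, xj in enumerate(x):
            if j != k:
                l *= p[j] if xj else 1 - p[j]
        joint.append(l)
    q = joint[1] / (joint[0] + joint[1])  # P(H=1 | rest)
    return int(p0[k] + q * (p1[k] - p0[k]) > 0.5)

# Hypothetical parameters: P(H=1), and P(x_j=1 | H=0), P(x_j=1 | H=1).
prior = 0.35
p0 = [0.2, 0.7, 0.45, 0.1]
p1 = [0.9, 0.3, 0.6, 0.85]

agree = all(
    predict_linear(x, prior, p0, p1, k) == predict_bruteforce(x, prior, p0, p1, k)
    for x in itertools.product([0, 1], repeat=4)
    for k in range(4)
)
print(agree)
```

The two predictors agree because P(x_k = 1 | rest) is monotone in the posterior of H, and the log-odds of that posterior is linear in the observed bits.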
Learning Mixtures of Product Distributions using Correlations and Independence
Cited by 13 (1 self)
We study the problem of learning mixtures of distributions, a natural formalization of clustering. A mixture of distributions is a collection of distributions D = {D1, ..., DT} and mixing weights {w1, ..., wT} such that Σ_i wi = 1. A sample from a mixture is generated by choosing i with probability wi and then choosing a sample from distribution Di. The problem of learning the mixture is that of finding the parameters of the distributions comprising D, given only the ability to sample from the mixture. In this paper, we restrict ourselves to learning mixtures of product distributions. The key to learning the mixtures is to find a few vectors, such that points from different distributions are sharply separated upon projection onto these vectors. Previous techniques use the vectors corresponding to the top few directions of highest variance of the mixture. Unfortunately, these directions may be directions of high noise and not directions along which the distributions are separated. Further, skewed mixing weights amplify the effects of noise, and as a result, previous techniques only work when the separation between the input distributions is large relative to the imbalance in the mixing weights. In this paper, we show an algorithm which successfully learns mixtures of distributions with a separation condition that depends only logarithmically on the skewed mixing weights. In particular, it succeeds for a separation between the centers that is Θ(σ √ T log Λ), where σ is the maximum directional standard deviation of any distribution in the mixture, T is the number of distributions, and Λ is polynomial in T, σ, log n and the imbalance in the mixing
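The sampling process described in this abstract (choose component i with probability wi, then draw from Di) can be sketched as follows. This is a minimal illustration, not the paper's learning algorithm. It assumes each Di is a product of independent Bernoulli coordinates, and the weights and biases are made-up values with deliberately skewed mixing weights.

```python
import random

def sample_mixture(weights, biases, rng):
    """Draw one point from a mixture of product (Bernoulli) distributions.

    weights[i] is the mixing weight w_i; biases[i][j] is P(x_j = 1) under D_i.
    Returns the hidden component index and the sampled point."""
    # Step 1: choose component i with probability w_i.
    i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Step 2: sample each coordinate independently from D_i.
    x = [1 if rng.random() < p else 0 for p in biases[i]]
    return i, x

rng = random.Random(0)                        # fixed seed for repeatability
weights = [0.9, 0.1]                          # hypothetical skewed mixing weights
biases = [[0.8, 0.8, 0.2], [0.2, 0.2, 0.8]]   # hypothetical component parameters

samples = [sample_mixture(weights, biases, rng) for _ in range(10000)]
frac_first = sum(1 for i, _ in samples if i == 0) / len(samples)
print(round(frac_first, 2))
```

A learner sees only the points x, not the component indices. The rare component contributes few samples, which is why skewed weights make the problem harder.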
Learning Mixtures of Separated Nonspherical Gaussians
2005
Cited by 10 (1 self)
Mixtures of Gaussian (or normal) distributions arise in a variety of application areas. Many heuristics have been proposed for the task of finding the component Gaussians given samples from the mixture, such as the EM algorithm, a local-search heuristic from Dempster, Laird and Rubin [J. Roy. Statist. Soc. Ser. B 39 (1977) 1–38]. These do not provably run in polynomial time. We present the first algorithm that provably learns the component Gaussians in time that is polynomial in the dimension. The Gaussians may have arbitrary shape, but they must satisfy a “separation condition” which places a lower bound on the distance between the centers of any two component Gaussians. The mathematical results at the heart of our proof are “distance concentration” results, proved using isoperimetric inequalities, which establish bounds on the probability distribution of the distance between a pair of points generated according to the mixture. We also formalize the more general problem of max-likelihood fit of a Gaussian mixture to unstructured data.
Some Discriminant-Based PAC Algorithms
Journal of Machine Learning Research, 2006
Cited by 7 (3 self)
A classical approach in multiclass pattern classification is the following. Estimate the probability distributions that generated the observations for each label class, and then label new instances by applying the Bayes classifier to the estimated distributions. That approach provides more useful information than just a class label; it also provides estimates of the conditional distribution of class labels, in situations where there is class overlap. We would
When Can Two Unsupervised Learners Achieve PAC Separation?
Procs. of COLT/EUROCOLT, LNAI 2111, 2001
Cited by 5 (1 self)
In this paper we study a new restriction of the PAC learning framework, in which each label class is handled by an unsupervised learner that aims to fit an appropriate probability distribution to its own data. A hypothesis is derived by choosing, for any unlabeled instance, the label whose distribution assigns it the higher likelihood. The motivation for the new learning setting is that the general approach of fitting separate distributions to each label class is often used in practice for classification problems. The set of probability distributions that is obtained is more useful than a collection of decision boundaries. A question that arises, however, is whether it is ever more tractable (in terms of computational complexity or sample size required) to find a simple decision boundary than to divide the problem up into separate unsupervised learning problems and find appropriate distributions. Within the framework, we give algorithms for learning various simple geometric concept classes. In the boolean domain we show how to learn parity functions, and functions having a constant upper bound on the number of relevant attributes. These results distinguish the new setting from various other well-known restrictions of PAC-learning. We give an algorithm for learning monomials over input vectors generated by an unknown product distribution. The main open problem is whether monomials (or any other concept class) distinguish learnability in this framework from standard PAC-learnability.
Robust PCA and Clustering on Noisy Mixtures
In Proc. of SODA, 2009
Cited by 4 (1 self)
This paper presents a polynomial algorithm for learning mixtures of log-concave distributions in R^n in the presence of malicious noise. That is, each sample is corrupted with some small probability, being replaced by a point about which we can make no assumptions. A key element of the algorithm is Robust Principal Component Analysis (PCA), which is less susceptible to corruption by noisy points. While noise may cause standard PCA to collapse well-separated mixture components so that they are indistinguishable, Robust PCA preserves the distance between some of the components, making a partition possible. It then recurses on each half of the mixture until every component is isolated. The success of this algorithm requires only an O*(log n) factor increase in the required separation between components of the mixture compared to the noiseless case.
Efficient learning of naive Bayes classifiers under class-conditional classification noise
In Proceedings of the 23rd International Conference on Machine Learning (ICML’06), 2006
Cited by 1 (0 self)
We address the problem of efficiently learning Naive Bayes classifiers under class-conditional classification noise (CCCN). Naive Bayes classifiers rely on the hypothesis that the distributions associated to each class are product distributions. When data is subject to CCC-noise, these conditional distributions are themselves mixtures of product distributions. We give analytical formulas that make it possible to identify them from data subject to CCCN. Then, we design a learning algorithm based on these formulas that is able to learn Naive Bayes classifiers under CCCN. We present results on artificial datasets and datasets extracted from the UCI repository database. These results show that CCCN can be efficiently and successfully handled.
Learning and Approximation Algorithms for problems motivated by Evolutionary Trees
1999
Cited by 1 (0 self)
Chapter 1 Introduction
  1.1 Overview
  1.2 Biological Background
    1.2.1 Data
    1.2.2 Models and Methods
  1.3 Learning in the General Markov Model
    1.3.1 The Model
    1.3.2 Learning Problems for Evolutionary Trees
  1.4 Layout of the thesis
Chapter 2 Learning Two-State Markov Evolutionary Trees
  2.1 Previous research
    2.1.1 The General Idea
    2.1.2 Previous work on learning the distribution
    2.1.3 Previous work on finding the topology
    2.1.4 Re...
Incomplete statistical information fusion and its application to clinical trials data
In Scalable Uncertainty Management (SUM’07), volume 4772 of LNCS, 2007
Cited by 1 (0 self)
In medical clinical trials, overall trial results are highlighted in the abstracts of papers/reports. These results are summaries of underlying statistical analysis where most of the time normal distributions are assumed in the analysis. It is common for clinicians to focus on the information in the abstracts in order to review or integrate several clinical trial results that address the same or similar medical question(s). Therefore, developing techniques to merge results from clinical trials based on information in the abstracts is useful and important. In reality, information in an abstract can either provide sufficient details about a normal distribution or just partial information about a distribution. In this paper, we first propose approaches to constructing normal distributions from both complete and incomplete statistical information in the abstracts. We then provide methods to merge these normal distributions (or sampling distributions). Following this, we investigate the conditions under which two normal distributions can be merged. Finally, we design an algorithm to sequence the merging of trial results to ensure that the most reliable trials are considered first.
Separating Populations with Wide Data: a Spectral Analysis
Cited by 1 (0 self)
In this paper, we consider the problem of partitioning a small data sample drawn from a mixture of k product distributions. We are interested in the case that individual features are of low average quality γ, and we want to use as few of them as possible to correctly partition the sample. We analyze a spectral technique that is able to approximately optimize the total data size (the product of the number of data points n and the number of features K) needed to correctly perform this partitioning as a function of 1/γ for K > n. Our goal is motivated by an application in clustering individuals according to their population of origin using markers, when the divergence between any two of the populations is small.