Results 1  10
of
22
ModelBased Clustering, Discriminant Analysis, and Density Estimation
 JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
, 2000
"... Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures and most clustering methods available in commercial software are also of this type. However, there is little ..."
Abstract

Cited by 260 (24 self)
 Add to MetaCart
Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as \How many clusters are there?", "Which clustering method should be used?" and \How should outliers be handled?". We outline a general methodology for modelbased clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, mineeld detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology, a...
Learning with Labeled and Unlabeled Data
, 2001
"... In this paper, on the one hand, we aim to give a review on literature dealing with the problem of supervised learning aided by additional unlabeled data. On the other hand, being a part of the author's first year PhD report, the paper serves as a frame to bundle related work by the author as well as ..."
Abstract

Cited by 165 (3 self)
 Add to MetaCart
In this paper, on the one hand, we aim to give a review on literature dealing with the problem of supervised learning aided by additional unlabeled data. On the other hand, being a part of the author's first year PhD report, the paper serves as a frame to bundle related work by the author as well as numerous suggestions for potential future work. Therefore, this work contains more speculative and partly subjective material than the reader might expect from a literature review. We give a rigorous definition of the problem and relate it to supervised and unsupervised learning. The crucial role of prior knowledge is put forward, and we discuss the important notion of inputdependent regularization. We postulate a number of baseline methods, being algorithms or algorithmic schemes which can more or less straightforwardly be applied to the problem, without the need for genuinely new concepts. However, some of them might serve as basis for a genuine method. In the literature revi...
A DataClustering Algorithm On Distributed Memory Multiprocessors
 In LargeScale Parallel Data Mining, Lecture Notes in Artificial Intelligence
, 2000
"... To cluster increasingly massive data sets that are common today in data and text mining, we propose a parallel implementation of the kmeans clustering algorithm based on the message passing model. The proposed algorithm exploits the inherent dataparallelism in the kmeans algorithm. We analyticall ..."
Abstract

Cited by 95 (1 self)
 Add to MetaCart
To cluster increasingly massive data sets that are common today in data and text mining, we propose a parallel implementation of the kmeans clustering algorithm based on the message passing model. The proposed algorithm exploits the inherent dataparallelism in the kmeans algorithm. We analytically show that the speedup and the scaleup of our algorithm approach the optimal as the number of data points increases. We implemented our algorithm on an IBM POWERparallel SP2 with a maximum of 16 nodes. On typical test data sets, we observe nearly linear relative speedups, for example, 15.62 on 16 nodes, and essentially linear scaleup in the size of the data set and in the number of clusters desired. For a 2 gigabyte test data set, our implementation drives the 16 node SP2 at more than 1.8 gigaflops. Keywords: kmeans, data mining, massive data sets, messagepassing, text mining. 1 Introduction Data sets measuring in gigabytes and even terabytes are now quite common in data and text minin...
A Probabilistic Framework for the Hierarchic Organisation and Classification of Document Collections
, 2002
"... This paper presents a probabilistic mixture modeling framework for the hierarchic organisation of document collections. It is demonstrated that the probabilistic corpus model which emerges from the automatic or unsupervised hierarchical organisation of a document collection can be further exploited ..."
Abstract

Cited by 23 (4 self)
 Add to MetaCart
This paper presents a probabilistic mixture modeling framework for the hierarchic organisation of document collections. It is demonstrated that the probabilistic corpus model which emerges from the automatic or unsupervised hierarchical organisation of a document collection can be further exploited to create a kernel which boosts the performance of stateoftheart Support Vector Machine document classifiers. It is shown that the performance of such a classifier is further enhanced when employing the kernel derived from an appropriate hierarchic mixture model used for partitioning a document corpus rather than the kernel associated with a at nonhierarchic mixture model. This has important implications for document classification when a hierarchic ordering of topics exists. This can be considered as the eective combination of documents with no topic or class labels (unlabeled data), labeled documents, and prior domain knowledge (in the form of the known hierarchic structure), in providing enhanced document classification performance.
Learning recursive Bayesian multinets for data clustering by means of constructive induction
, 2001
"... This paper introduces and evaluates a new class of knowledge model, the recursive Bayesian multinet (RBMN), which encodes the joint probability distribution of a given database. RBMNs extend Bayesian networks (BNs) as well as partitional clustering systems. Briefly, a RBMN is a decision tree with co ..."
Abstract

Cited by 18 (7 self)
 Add to MetaCart
This paper introduces and evaluates a new class of knowledge model, the recursive Bayesian multinet (RBMN), which encodes the joint probability distribution of a given database. RBMNs extend Bayesian networks (BNs) as well as partitional clustering systems. Briefly, a RBMN is a decision tree with component BNs at the leaves. A RBMN is learnt using a greedy, heuristic approach akin to that used by many supervised decision tree learners, but where BNs are learnt at leaves using constructive induction. A key idea is to treat expected data as real data. This allows us to complete the database and to take advantage of a closed form for the marginal likelihood of the expected complete data that factorizes into separate marginal likelihoods for each family (a node and its parents). Our approach is evaluated on synthetic and realworld databases.
Structural Mining of Molecular Biology Data
 IEEE Engineering in Medicine and Biology, special issue on Advances in Genomics
, 2001
"... The increasing amount and complexity of molecular biology data evokes a need to focus on automated methods for mining this data. In addition, molecular biology data is frequently structural in nature, or is composed of parts and relations between the parts. Hence, there exists a need to develop t ..."
Abstract

Cited by 16 (1 self)
 Add to MetaCart
The increasing amount and complexity of molecular biology data evokes a need to focus on automated methods for mining this data. In addition, molecular biology data is frequently structural in nature, or is composed of parts and relations between the parts. Hence, there exists a need to develop tools to analyze and discover concepts in structural databases.
On a Recursive Spectral Algorithm for Clustering from Pairwise Similarities
, 2003
"... We present a practical implementation of the clustering algorithm described in [20]. The clustering algorithm is given either an implicit or explicit representation of the pairwise similarities between n objects and produces a complete hierarchical clustering of the n objects. The implementation r ..."
Abstract

Cited by 12 (1 self)
 Add to MetaCart
We present a practical implementation of the clustering algorithm described in [20]. The clustering algorithm is given either an implicit or explicit representation of the pairwise similarities between n objects and produces a complete hierarchical clustering of the n objects. The implementation runs in O(M log n) time per cluster where M is the number of nonzero entries in the \documentterm" matrix, a common implicit representation of similarities between data objects. We perform a thorough experimental evaluation of the algorithm in practice. The results show that the algorithm is better or competitive with existing clustering algorithms (e.g. kmeans [21], ROCK [18], pQR [37]).
Fusion of Domain Knowledge with Data for Structural Learning in Object Oriented Domains
, 2003
"... When constructing a Bayesian network, it can be advantageous to employ structural learning algorithms to combine knowledge captured in databases with prior information provided by domain experts. Unfortunately, conventional learning algorithms do not easily incorporate prior information, if this inf ..."
Abstract

Cited by 9 (0 self)
 Add to MetaCart
When constructing a Bayesian network, it can be advantageous to employ structural learning algorithms to combine knowledge captured in databases with prior information provided by domain experts. Unfortunately, conventional learning algorithms do not easily incorporate prior information, if this information is too vague to be encoded as properties that are local to families of variables. For instance, conventional algorithms do not exploit prior information about repetitive structures, which are often found in object oriented domains such as computer networks, large pedigrees and genetic analysis.
Mixnets: Factored Mixtures of Gaussians in Bayesian Networks with Mixed Continuous And Discrete Variables
, 2000
"... Recently developed techniques have made it possible to quickly learn accurate probability density functions from data in lowdimensional continuous spaces. In particular, mixtures of Gaussians can be fitted to data very quickly using an accelerated EM algorithm that employs multiresolution kdtrees ..."
Abstract

Cited by 7 (2 self)
 Add to MetaCart
Recently developed techniques have made it possible to quickly learn accurate probability density functions from data in lowdimensional continuous spaces. In particular, mixtures of Gaussians can be fitted to data very quickly using an accelerated EM algorithm that employs multiresolution kdtrees (Moore, 1999). In this paper, we propose a kind of Bayesian network in which lowdimensional mixtures of Gaussians over different subsets of the domain’s variables are combined into a coherent joint probability model over the entire domain. The network is also capable of modeling complex dependencies between discrete variables and continuous variables without requiring discretization of the continuous variables. We present efficient heuristic algorithms for automatically learning these networks from data, and perform comparative experiments illustrating how well these networks model real scientific data and synthetic data. We also briefly discuss some possible improvements to the networks, as well as possible applications.