Results 1 - 10 of 26
Constrained K-means Clustering with Background Knowledge
In ICML, 2001
Cited by 250 (9 self)
Clustering is traditionally viewed as an unsupervised method for data analysis. However, in some cases information about the problem domain is available in addition to the data instances themselves. In this paper, we demonstrate how the popular k-means clustering algorithm can be profitably modified to make use of this information. In experiments with artificial constraints on six data sets, we observe improvements in clustering accuracy. We also apply this method to the real-world problem of automatically detecting road lanes from GPS data and observe dramatic increases in performance.
Model-Based Clustering, Discriminant Analysis, and Density Estimation
Journal of the American Statistical Association, 2000
Cited by 171 (23 self)
Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as "How many clusters are there?", "Which clustering method should be used?" and "How should outliers be handled?". We outline a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology, a...
Support Vector Clustering
2001
Cited by 124 (1 self)
We present a novel clustering method using the approach of support vector machines. Data points are mapped by means of a Gaussian kernel to a high dimensional feature space, where we search for the minimal enclosing sphere. This sphere, when mapped back to data space, can separate into several components, each enclosing a separate cluster of points. We present a simple algorithm for identifying these clusters. The width of the Gaussian kernel controls the scale at which the data is probed while the soft margin constant helps coping with outliers and overlapping clusters. The structure of a dataset is explored by varying the two parameters, maintaining a minimal number of support vectors to assure smooth cluster boundaries. We demonstrate the performance of our algorithm on several datasets.
Semi-supervised Clustering by Seeding
In Proceedings of 19th International Conference on Machine Learning (ICML-2002), 2002
Cited by 98 (14 self)
Semi-supervised clustering uses a small amount of labeled data to aid and bias the clustering of unlabeled data. This paper explores the use of labeled data to generate initial seed clusters, as well as the use of constraints generated from labeled data to guide the clustering process. It introduces two semi-supervised variants of KMeans clustering that can be viewed as instances of the EM algorithm, where labeled data provides prior information about the conditional distributions of hidden category labels. Experimental results demonstrate the advantages of these methods over standard random seeding and COP-KMeans, a previously developed semi-supervised clustering algorithm.
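The seeding idea in this abstract can be illustrated with a minimal sketch. The abstract does not spell out the two variants, so the code below shows only the general idea of initialising k-means centroids from labeled points. The function name `seeded_kmeans`, its parameters, and the choice to leave seed labels free after initialisation are assumptions made for illustration, not the authors' implementation.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seeded_kmeans(X, seeds, k, n_iter=100):
    """X: list of vectors; seeds: dict {point index: label in 0..k-1} (hypothetical API)."""
    # Initialisation from labeled data: centroid j is the mean of the seeds labeled j.
    # Assumption: every cluster has at least one seed.
    centers = [mean([X[i] for i, lab in seeds.items() if lab == j]) for j in range(k)]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(x, centers[j])) for x in X]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move either
        labels = new_labels
        # Update step: recompute centroids; an empty cluster keeps its old centroid.
        for j in range(k):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:
                centers[j] = mean(members)
    return labels, centers


X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = seeded_kmeans(X, seeds={0: 0, 3: 1}, k=2)
print(labels)
```

Compared with random seeding, the only change is the first line of the function body: the starting centroids come from the labeled points, and the rest is the usual assign-and-update loop.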
Algorithmic Approaches to Clustering Gene Expression Data
Current Topics in Computational Biology, 2001
Cited by 53 (2 self)
Technologies for generating high-density arrays of cDNAs and oligonucleotides are developing rapidly, and changing the landscape of biological and biomedical research. They enable, for the first time, a global, simultaneous view on the transcription levels of many thousands of genes, when the cell undergoes specific conditions or processes. For several organisms that had their genomes completely sequenced, the full set of genes can already be monitored this way today. The potential of such technologies is tremendous: The information obtained by monitoring gene expression levels in different developmental stages, tissue types, clinical conditions and different organisms can help understanding gene function and gene networks, and assist in the diagnostic of disease conditions and of effects of medical treatments. Undoubtedly, other applications will emerge in coming years. A key step in the analysis of gene expression data is the identification of groups of genes that manifest...
Predicting the Future of Discrete Sequences From Fractal Representations of the Past
2001
Cited by 24 (8 self)
We propose a novel approach for building finite memory predictive models similar in spirit to variable memory length Markov models (VLMMs). The models are constructed by first transforming the n-block structure of the training sequence into a geometric structure of points in a unit hypercube, such that the longer is the common suffix shared by any two n-blocks, the closer lie their point representations.
Alternatives to the k-Means Algorithm That Find Better Clusterings
Cited by 23 (3 self)
We investigate here the behavior of the standard k-means clustering algorithm and several alternatives to it: the k-harmonic means algorithm due to Zhang and colleagues, fuzzy k-means, Gaussian expectation-maximization, and two new variants of k-harmonic means. Our aim is to find which aspects of these algorithms contribute to finding good clusterings, as opposed to converging to a low-quality local optimum. We describe each algorithm in a unified framework that introduces separate cluster membership and data weight functions.
Learning recursive Bayesian multinets for data clustering by means of constructive induction
2001
Cited by 18 (7 self)
This paper introduces and evaluates a new class of knowledge model, the recursive Bayesian multinet (RBMN), which encodes the joint probability distribution of a given database. RBMNs extend Bayesian networks (BNs) as well as partitional clustering systems. Briefly, a RBMN is a decision tree with component BNs at the leaves. A RBMN is learnt using a greedy, heuristic approach akin to that used by many supervised decision tree learners, but where BNs are learnt at leaves using constructive induction. A key idea is to treat expected data as real data. This allows us to complete the database and to take advantage of a closed form for the marginal likelihood of the expected complete data that factorizes into separate marginal likelihoods for each family (a node and its parents). Our approach is evaluated on synthetic and real-world databases.
One Step Evolutionary Mining of Context Sensitive Associations and Web Navigation Patterns
In SIAM Conference on Data Mining, 2002
Cited by 11 (6 self)
this paper is our previous work [9, 10, 11] where we have proposed new robust and fuzzy rela- Partial support of this work was provided by the National Science Foundation Grant IIS 9800899 to Raghu Krishnapuram and National Science Foundation CAREER Award IIS 0133948 to Olfa Nasraoui
Refining Clusters in High Dimensional Text Data
2002
Cited by 9 (1 self)
The k-means algorithm with cosine similarity, also known as the spherical k-means algorithm, is a popular method for clustering document collections. However, spherical k-means can often yield qualitatively poor results, especially for small clusters, say 25-30 documents per cluster, where it tends to get stuck at a local maximum far away from the optimal. In this paper, we present the first-variation principle that refines a given clustering by incrementally moving data points between clusters, thus achieving a higher objective function value.
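For readers unfamiliar with the baseline this abstract refers to, the following is a minimal sketch of spherical k-means: k-means over unit-length vectors with cosine similarity as the closeness measure and an objective that is maximised. It does not implement the first-variation refinement, whose details the abstract does not give. The function names, the explicit `init_centers` argument, and the assumption of non-zero, non-negative document vectors are illustrative choices.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Dot product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(docs, init_centers, n_iter=100):
    """Cluster document vectors by cosine similarity (hypothetical API)."""
    X = [normalize(d) for d in docs]
    centers = [normalize(c) for c in init_centers]
    k = len(centers)
    labels = []
    for _ in range(n_iter):
        # Assignment step: pick the centroid with the highest cosine similarity.
        new_labels = [max(range(k), key=lambda j: dot(x, centers[j])) for x in X]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: the new centroid is the normalised sum of its members.
        # Assumption: members do not cancel to the zero vector (true for
        # non-negative term-weight vectors).
        for j in range(k):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:
                centers[j] = normalize([sum(col) for col in zip(*members)])
    # Objective: total cosine similarity of each point to its own centroid.
    objective = sum(dot(x, centers[lab]) for x, lab in zip(X, labels))
    return labels, centers, objective


docs = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0]]
labels, centers, objective = spherical_kmeans(docs, init_centers=[[1.0, 0.0], [0.0, 1.0]])
print(labels, objective)
```

Because the loop stops as soon as assignments stop changing, it can settle at a local maximum of the objective, which is the situation the abstract's refinement step is meant to address.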

