A Probabilistic Framework for Semi-Supervised Clustering
2004
Cited by 134 (10 self)
Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints to either modify the objective function, or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.
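As a rough illustration of how pairwise constraints can modify a clustering objective, the sketch below adds a fixed penalty for every violated constraint to the usual squared Euclidean distortion. It is a simplified stand-in, not the HMRF posterior energy from the paper: the function name `constrained_objective`, the uniform weight `w`, and the toy data are all assumptions made for this example.

```python
def constrained_objective(points, labels, centroids, must_link, cannot_link, w=1.0):
    """Distortion plus a fixed penalty w per violated pairwise constraint.

    Hypothetical helper for illustration only; a uniform penalty weight
    is an assumption, not something the abstract specifies.
    """
    # Distortion: squared Euclidean distance from each point to its centroid.
    distortion = sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[k]))
        for p, k in zip(points, labels)
    )
    # A must-link pair is violated when its two points get different labels.
    ml_violations = sum(1 for i, j in must_link if labels[i] != labels[j])
    # A cannot-link pair is violated when its two points share a label.
    cl_violations = sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return distortion + w * (ml_violations + cl_violations)


# Toy data: two well-separated pairs of 2-D points.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
centroids = [(0.0, 1.0), (4.0, 1.0)]
labels = [0, 0, 1, 1]
must_link = [(0, 1)]    # points 0 and 1 should share a cluster
cannot_link = [(1, 2)]  # points 1 and 2 should be separated

print(constrained_objective(points, labels, centroids, must_link, cannot_link))  # 4.0
```

A partitional algorithm in this spirit would alternate between reassigning points to reduce this value and recomputing centroids; swapping the squared Euclidean term for another distortion measure changes only the first sum.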
Integrating Constraints and Metric Learning in Semi-Supervised Clustering
In ICML, 2004
Cited by 124 (6 self)
Semi-supervised clustering employs a small amount of labeled data to aid unsupervised learning. Previous work in the area has utilized supervised data in one of two approaches: 1) constraint-based methods that guide the clustering algorithm towards a better grouping of the data, and 2) distance-function learning methods that adapt the underlying similarity metric used by the clustering algorithm. This paper provides new methods for the two approaches and presents a new semi-supervised clustering algorithm that integrates both techniques in a uniform, principled framework. Experimental results demonstrate that the unified approach produces better clusters than either individual approach as well as previously proposed semi-supervised clustering algorithms.
Clustering with instance-level constraints
In Proceedings of the Seventeenth International Conference on Machine Learning, 2000
Cited by 116 (6 self)
One goal of research in artificial intelligence is to automate tasks that currently require human expertise; this automation is important because it saves time and brings problems that were previously too large to be solved into the feasible domain. Data analysis, or the ability to identify meaningful patterns and trends in large volumes of data, is an important task that falls into this category. Clustering algorithms are a particularly useful group of data analysis tools. These methods are used, for example, to analyze satellite images of the Earth to identify and categorize different land and foliage types or to analyze telescopic observations to determine what distinct types of astronomical bodies exist and to categorize each observation. However, most existing clustering methods apply general similarity techniques rather than making use of problem-specific information. This dissertation first presents a novel method for converting existing clustering algorithms into constrained clustering algorithms. The resulting methods are able to accept domain-specific information in the form of constraints on the output clusters. At the most general level, each constraint is an instance-level statement
Active Semi-Supervision for Pairwise Constrained Clustering
Proc. 4th SIAM Intl. Conf. on Data Mining (SDM-2004)
Cited by 60 (6 self)
Semi-supervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of must-link and cannot-link constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for actively selecting informative pairwise constraints to get improved clustering performance. The clustering and active learning methods are both easily scalable to large datasets, and can handle very high dimensional data. Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves the accuracy of clustering when given a relatively small amount of supervision.
Learning to classify text using positive and unlabeled data
In Proceedings of the 19th international joint conference on artificial intelligence, 2003
Cited by 42 (9 self)
In traditional text classification, a classifier is built using labeled training documents of every class. This paper studies a different problem. Given a set P of documents of a particular class (called positive class) and a set U of unlabeled documents that contains documents from class P and also other types of documents (called negative class documents), we want to build a classifier to classify the documents in U into documents from P and documents not from P. The key feature of this problem is that there is no labeled negative document, which makes traditional text classification techniques inapplicable. In this paper, we propose an effective technique to solve the problem. It combines the Rocchio method and the SVM technique for classifier building. Experimental results show that the new method outperforms existing methods significantly.
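To make the Rocchio half of that combination concrete, here is a minimal sketch of one way a Rocchio classifier can be used when no negatives are labeled: treat the unlabeled set U as provisional negatives, build one prototype vector per class, and flag the documents in U that sit closer to the negative prototype. Whether this matches the paper's exact procedure is an assumption; the helper names, the toy term-count vectors, and the weights `alpha=16` and `beta=4` (values often quoted for Rocchio) are illustrative choices, and the SVM stage is not shown.

```python
from math import sqrt


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def rocchio_prototypes(positive, unlabeled, alpha=16.0, beta=4.0):
    """Build class prototypes, treating `unlabeled` as provisional negatives."""
    p_mean = mean([normalize(d) for d in positive])
    u_mean = mean([normalize(d) for d in unlabeled])
    c_pos = [alpha * p - beta * u for p, u in zip(p_mean, u_mean)]
    c_neg = [alpha * u - beta * p for p, u in zip(p_mean, u_mean)]
    return c_pos, c_neg


def likely_negatives(positive, unlabeled):
    """Indices of unlabeled documents closer to the negative prototype."""
    c_pos, c_neg = rocchio_prototypes(positive, unlabeled)
    return [i for i, d in enumerate(unlabeled) if cosine(d, c_neg) > cosine(d, c_pos)]


# Toy term-count vectors over a three-word vocabulary (made up for this example).
P = [[3, 0, 1], [2, 1, 0]]
U = [[4, 0, 0], [0, 3, 1], [0, 2, 2]]
print(likely_negatives(P, U))  # [1, 2]
```

The flagged documents could then serve as negative training examples for a second-stage classifier such as an SVM.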
Semi-supervised learning with penalized probabilistic clustering
In Advances in
2005
Cited by 26 (1 self)
While clustering is usually an unsupervised operation, there are circumstances in which we believe (with varying degrees of certainty) that items A and B should be assigned to the same cluster, while items A and C should not. We would like such pairwise relations to influence cluster assignments of out-of-sample data in a manner consistent with the prior knowledge expressed in the training set. Our starting point is probabilistic clustering based on Gaussian mixture models (GMM) of the data distribution. We express clustering preferences in the prior distribution over assignments of data points to clusters. This prior penalizes cluster assignments according to the degree to which they violate the preferences. We fit the model parameters with EM. Experiments on a variety of data sets show that PPC can consistently improve clustering results.
Semi-supervised graph clustering: a kernel approach
2008
Cited by 24 (1 self)
Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors. In this paper, we unify vector-based and graph-based approaches. We first show that a recently-proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective (Dhillon et al., in Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, 2004a). A recent theoretical connection between weighted kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data given either as vectors or as a graph. For graph data, this result leads to algorithms for optimizing several new semi-supervised graph clustering objectives. For vector data, the kernel approach also enables us to find clusters with non-linear boundaries in the input data space. Furthermore, we show that recent work on spectral learning (Kamvar et al., in Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2003) may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current state-of-the-art semi-supervised algorithms on both vector-based and graph-based data sets.
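For readers unfamiliar with kernel k-means, the sketch below shows why it needs only a kernel matrix rather than explicit vectors: the squared feature-space distance from a point to a cluster mean can be written entirely in terms of kernel entries. This is the unweighted special case, not the weighted objective or the constraint-derived kernel discussed in the paper; the function names and toy data are assumptions, and every cluster is assumed non-empty.

```python
def kernel_distance_to_cluster(K, i, members):
    """Squared feature-space distance from point i to the mean of `members`.

    Uses ||phi(x_i) - m_c||^2 = K_ii - (2/|c|) sum_j K_ij + (1/|c|^2) sum_{j,l} K_jl,
    with j and l ranging over the cluster members.
    """
    m = len(members)
    cross = sum(K[i][j] for j in members) / m
    within = sum(K[j][l] for j in members for l in members) / (m * m)
    return K[i][i] - 2.0 * cross + within


def assign(K, clusters):
    """One assignment step: give each point the index of its nearest cluster mean."""
    labels = []
    for i in range(len(K)):
        d = [kernel_distance_to_cluster(K, i, c) for c in clusters]
        labels.append(d.index(min(d)))
    return labels


# With a linear kernel (plain dot products) the result matches ordinary k-means.
X = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
K = [[sum(a * b for a, b in zip(x, y)) for y in X] for x in X]

print(assign(K, [[0, 1], [2, 3]]))  # [0, 0, 1, 1]
```

Replacing `K` with a non-linear kernel matrix, or with a matrix derived from a graph, leaves the code unchanged, which is the property that lets one formulation cover both vector and graph inputs.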
Comparing and unifying search-based and similarity-based approaches to semi-supervised clustering
In Proceedings of the ICML-2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, 2003
Cited by 19 (3 self)
Semi-supervised clustering employs a small amount of labeled data to aid unsupervised learning. Previous work in the area has employed one of two approaches: 1) search-based methods that utilize supervised data to guide the search for the best clustering, and 2) similarity-based methods that use supervised data to adapt the underlying similarity metric used by the clustering algorithm. This paper presents a unified approach based on the K-Means clustering algorithm that incorporates both of these techniques. Experimental results demonstrate that the combined approach generally produces better clusters than either of the individual approaches.
Supervised Clustering – Algorithms and Benefits
In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI04), Boca
2004
Cited by 18 (13 self)
This paper centers on a novel data mining technique we term supervised clustering. Unlike traditional clustering, supervised clustering assumes that the examples are classified and has the goal of identifying class-uniform clusters that have high probability densities. Four representative-based algorithms for supervised clustering are introduced: a greedy algorithm with random restart, named SRIDHCR, that searches for solutions by inserting and removing single objects from the current solution; SPAM, a variation of the clustering algorithm PAM; an evolutionary computing algorithm named SCEC; and a fast medoid-based top-down splitting algorithm named TDS. The four algorithms were evaluated using a benchmark consisting of four UCI machine learning data sets. In general, it seems that "greedy" algorithms, such as SPAM, SRIDHCR, and TDS, do not perform particularly well for supervised clustering and seem to terminate prematurely too often. We also briefly describe the applications of supervised clustering.

