Results 1–10 of 118
From Instance-level Constraints to Space-level Constraints: Making the Most of Prior Knowledge in Data Clustering
, 2002
Cited by 153 (4 self)

Abstract
We present an improved method for clustering in the presence of very limited supervisory information, given as pairwise instance constraints. By allowing instance-level constraints to have space-level inductive implications, we are able to successfully incorporate constraints for a wide range of data set types. Our method greatly improves on the previously studied constrained k-means algorithm, generally requiring less than half as many constraints to achieve a given accuracy on a range of real-world data, while also being more robust when over-constrained. We additionally discuss an active learning algorithm which increases the value of constraints even further.
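One way to give instance-level constraints the kind of space-level effect this abstract describes is to edit the distance matrix and restore metric consistency afterwards, so that neighbourhoods of constrained points are pulled together too. The sketch below is our illustration of that idea, not the paper's exact procedure; the function name is ours.

```python
import numpy as np

def propagate_must_links(dist, must_links):
    """Give instance-level must-link constraints a space-level effect:
    constrained pairs get distance 0, then an all-pairs shortest-path
    closure propagates the edit to nearby points as well."""
    d = dist.astype(float).copy()
    for i, j in must_links:
        d[i, j] = d[j, i] = 0.0
    n = d.shape[0]
    # Floyd-Warshall closure: restores the triangle inequality after editing.
    for k in range(n):
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    return d
```

With three points at 0, 1, and 10 on a line, a must-link between the outer two also shrinks the middle point's distance to both, which is exactly the "inductive implication" the abstract refers to.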
Active Semi-Supervision for Pairwise Constrained Clustering
 Proc. 4th SIAM Intl. Conf. on Data Mining (SDM-2004)
Cited by 90 (10 self)

Abstract
Semi-supervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of must-link and cannot-link constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for actively selecting informative pairwise constraints to get improved clustering performance. The clustering and active learning methods are both easily scalable to large datasets, and can handle very high-dimensional data. Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves the accuracy of clustering when given a relatively small amount of supervision.
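A common ingredient of active constraint selection of this kind is a farthest-first traversal, which spreads the initial queried points across distinct regions of the data so that the answers establish disjoint neighbourhoods. The sketch below shows that selection step only, under the assumption of Euclidean data; the function name is ours.

```python
import numpy as np

def farthest_first_queries(X, k):
    """Pick k well-separated points to query first: each new point is
    the one whose distance to the already-chosen set is largest, so
    queries land in distinct regions of the data."""
    chosen = [0]  # start from an arbitrary point
    for _ in range(k - 1):
        # distance of every point to its nearest chosen point
        d = np.min(
            np.linalg.norm(X[:, None, :] - X[chosen][None, :, :], axis=2),
            axis=1)
        chosen.append(int(np.argmax(d)))
    return chosen
```

Pairwise queries between the selected points and existing neighbourhoods then yield must-link or cannot-link answers that are informative almost by construction.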
Spectral learning
 In IJCAI
, 2003
Cited by 71 (5 self)

Abstract
We present a simple, easily implemented spectral learning algorithm which applies equally whether we have no supervisory information, pairwise link constraints, or labeled examples. In the unsupervised case, it performs consistently with other spectral clustering algorithms. In the supervised case, our approach achieves high accuracy on the categorization of thousands of documents given only a few dozen labeled training documents for the 20 Newsgroups data set. Furthermore, its classification accuracy increases with the addition of unlabeled documents, demonstrating effective use of unlabeled data. By using normalized affinity matrices which are both symmetric and stochastic, we also obtain both a probabilistic interpretation of our method and certain guarantees of performance.
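One concrete normalization with the stated properties, symmetric and row-stochastic at once, is the additive form N = (A + d_max·I − D)/d_max, where D is the degree matrix; we assume this as an illustration of the kind of construction the abstract alludes to, and the function names are ours.

```python
import numpy as np

def normalized_affinity(A):
    """Additive normalization of a symmetric affinity matrix A:
    N = (A + d_max I - D) / d_max is symmetric AND row-stochastic,
    so its eigenvectors admit a random-walk interpretation."""
    deg = A.sum(axis=1)
    d_max = deg.max()
    return (A + np.diag(d_max - deg)) / d_max

def spectral_embedding(A, k):
    """Embed points as the top-k eigenvectors of the normalized affinity;
    a downstream clusterer (e.g. k-means) then runs in this space."""
    N = normalized_affinity(A)
    _, vecs = np.linalg.eigh(N)   # eigh returns ascending eigenvalues
    return vecs[:, -k:]
```

Because every row of N sums to one, supervision can be injected by editing affinities (e.g. setting linked pairs to 1) without losing the stochastic interpretation.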
Non-Redundant Data Clustering
, 2004
Cited by 68 (3 self)

Abstract
Data clustering is a popular approach for automatically finding classes, concepts, or groups of patterns. In practice this discovery process should avoid redundancies with existing knowledge about class structures or groupings, and reveal novel, previously unknown aspects of the data. In order to deal with this problem, we present an extension of the information bottleneck framework, called coordinated conditional information bottleneck, which takes negative relevance information into account by maximizing a conditional mutual information score subject to constraints. Algorithmically, one can apply an alternating optimization scheme that can be used in conjunction with different types of numeric and non-numeric attributes. We present experimental results for applications in text mining and computer vision.
Document clustering with committees
 In Proc. of SIGIR’02
, 2002
Cited by 57 (4 self)

Abstract
Document clustering is useful in many information retrieval tasks: document browsing, organization and viewing of retrieval results, generation of Yahoo-like hierarchies of documents, etc. The general goal of clustering is to group data elements such that the intra-group similarities are high and the inter-group similarities are low. We present a clustering algorithm called CBC (Clustering By Committee) that is shown to produce higher quality clusters in document clustering tasks as compared to several well-known clustering algorithms. It initially discovers a set of tight clusters (high intra-group similarity), called committees, that are well scattered in the similarity space (low inter-group similarity). The union of the committees is but a subset of all elements. The algorithm proceeds by assigning elements to their most similar committee. Evaluating cluster quality has always been a difficult task. We present a new evaluation methodology that is based on the editing distance between output clusters and manually constructed classes (the answer key). This evaluation measure is more intuitive and easier to interpret than previous evaluation measures.
Clustering with Constraints: Feasibility Issues and the k-Means Algorithm
, 2005
Cited by 56 (7 self)

Abstract
Recent work has looked at extending the k-means algorithm to incorporate background information in the form of instance-level must-link and cannot-link constraints. We introduce two ways of specifying additional background information in the form of δ and ε constraints that operate on all instances but which can be interpreted as conjunctions or disjunctions of instance-level constraints and hence are easy to implement. We present complexity results for the feasibility of clustering under each type of constraint individually and several types together. A key finding is that determining whether there is a feasible solution satisfying all constraints is, in general, NP-complete. Thus, an iterative algorithm such as k-means should not try to find a feasible partitioning at each iteration. This motivates our derivation of a new version of the k-means algorithm that minimizes the constrained vector quantization error but at each iteration does not attempt to satisfy all constraints. Using standard UCI datasets, we find that using constraints improves accuracy as others have reported, but we also show that our algorithm reduces the number of iterations until convergence. Finally, we illustrate these benefits and our new constraint types on a complex real-world object identification problem using the infrared detector on an Aibo robot.
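On our reading, the δ-constraint requires points in different clusters to be at least δ apart (a conjunction of pairwise conditions), while the ε-constraint requires each point to have a same-cluster neighbour within ε (a disjunction). The feasibility checks below are a sketch of that interpretation, not the paper's definitions verbatim; the function names are ours.

```python
import numpy as np

def satisfies_delta(labels, X, delta):
    """delta-constraint (our reading): any two points in DIFFERENT
    clusters must be at least delta apart."""
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] != labels[j] and np.linalg.norm(X[i] - X[j]) < delta:
                return False
    return True

def satisfies_epsilon(labels, X, eps):
    """epsilon-constraint (our reading): every point with cluster-mates
    must have at least one of them within distance eps."""
    for i, ci in enumerate(labels):
        mates = [j for j, cj in enumerate(labels) if cj == ci and j != i]
        if mates and not any(np.linalg.norm(X[i] - X[j]) <= eps for j in mates):
            return False
    return True
```

Checks like these run per candidate partition; the NP-completeness result concerns finding *some* partition satisfying all constraints, which is why the paper's modified k-means relaxes feasibility during iterations.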
Segmentation given partial grouping constraints
 IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2004
Cited by 55 (3 self)

Abstract
We consider data clustering problems where partial grouping is known a priori. We formulate such biased grouping problems as a constrained optimization problem, where structural properties of the data define the goodness of a grouping and partial grouping cues define the feasibility of a grouping. We enforce grouping smoothness and fairness on labeled data points so that sparse partial grouping information can be effectively propagated to the unlabeled data. Considering the normalized cuts criterion in particular, our formulation leads to a constrained eigenvalue problem. By generalizing the Rayleigh-Ritz theorem to projected matrices, we find the global optimum in the relaxed continuous domain by eigendecomposition, from which a near-global optimum to the discrete labeling problem can be obtained effectively. We apply our method to real image segmentation problems, where partial grouping priors can often be derived based on a crude spatial attentional map that binds places with common salient features or focuses on expected object locations. We demonstrate not only that it is possible to integrate both image structures and priors in a single grouping process, but also that objects can be segregated from the background without specific object knowledge. Index Terms—Grouping, image segmentation, graph partitioning, bias, spatial attention, semi-supervised clustering, partially labeled classification.
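The core linear-algebra step, maximizing a quadratic form subject to linear grouping constraints, can be sketched by projecting onto the constraints' null space and eigendecomposing there. This is a simplified stand-in for the paper's projected Rayleigh-Ritz construction (which works with the normalized-cuts objective); the function name is ours.

```python
import numpy as np

def constrained_leading_vector(W, U):
    """Solve max x^T W x subject to ||x|| = 1 and U^T x = 0:
    build an orthonormal basis Q for the null space of U^T,
    eigendecompose the projected matrix Q^T W Q, and lift back."""
    _, s, Vt = np.linalg.svd(U.T)          # rows of Vt beyond rank(U^T)
    rank = int(np.sum(s > 1e-10))          # span the feasible subspace
    Q = Vt[rank:].T                        # columns: basis of {x : U^T x = 0}
    _, vecs = np.linalg.eigh(Q.T @ W @ Q)  # ascending eigenvalues
    return Q @ vecs[:, -1]                 # lift top eigenvector back to R^n
```

Every feasible x is Qy for some y, so optimizing over the smaller projected problem is exact; the partial-grouping cues enter through the columns of U.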
Clustering by Committee
, 2003
Cited by 27 (0 self)

Abstract
… children, the narratives that capture our thoughts, and the stories that shape our world. In this work, we present some recent advances in automatically acquiring knowledge from text. We propose a general-purpose clustering algorithm called CBC (Clustering By Committee) with which we organize documents according to topics as well as discover concepts and word senses. We explore the value of these systems by experimenting with two novel evaluation methodologies that attempt to define what a word sense is and define the quality of a particular clustering.
Near-duplicate Detection by Instance-level Constrained Clustering
 In Proceedings of the 29th ACM Conference on Research and Development in Information Retrieval (SIGIR-06), 2006
Cited by 26 (6 self)

Abstract
For the task of near-duplicate document detection, neither the traditional fingerprinting techniques used in the database community nor the bag-of-words comparison approaches used in the information retrieval community are sufficiently accurate. This is because the characteristics of near-duplicate documents differ from those of both “almost-identical” documents in the data cleaning task and “relevant” documents in the search task. This paper presents an instance-level constrained clustering approach for near-duplicate detection. The framework incorporates information such as document attributes and content structure into the clustering process to form near-duplicate clusters. On several collections of public comments sent to U.S. government agencies on proposed new regulations, the experimental results demonstrate that our approach outperforms other near-duplicate detection algorithms and is about as effective as human assessors.
Measuring constraint-set utility for partitional clustering algorithms
 In: Proceedings of the Tenth European Conference on Principles and Practice of Knowledge Discovery in Databases
, 2006
Cited by 25 (3 self)

Abstract
Clustering with constraints is an active area of machine learning and data mining research. Previous empirical work has convincingly shown that adding constraints to clustering improves the performance of a variety of algorithms. However, in most of these experiments, results are averaged over different randomly chosen constraint sets from a given set of labels, thereby masking interesting properties of individual sets. We demonstrate that constraint sets vary significantly in how useful they are for constrained clustering; some constraint sets can actually decrease algorithm performance. We create two quantitative measures, informativeness and coherence, that can be used to identify useful constraint sets. We show that these measures can also help explain differences in performance for four particular constrained clustering algorithms.
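One natural reading of informativeness, which we sketch here as an assumption rather than the paper's exact formula, is the fraction of constraints that the *unconstrained* algorithm's output violates: constraints the algorithm could not have inferred on its own carry more new information. The function name is ours.

```python
def informativeness(labels, must_links, cannot_links):
    """Fraction of constraints violated by a baseline (unconstrained)
    clustering, given as a label per point. A must-link is violated when
    its pair is split; a cannot-link when its pair is merged."""
    violated = sum(labels[i] != labels[j] for i, j in must_links)
    violated += sum(labels[i] == labels[j] for i, j in cannot_links)
    total = len(must_links) + len(cannot_links)
    return violated / total if total else 0.0
```

A score near zero means the constraint set mostly restates what the algorithm already finds, which helps explain why some randomly drawn sets barely move, or even hurt, performance.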