Results 1-10 of 134
A Probabilistic Framework for Semi-Supervised Clustering
, 2004
Cited by 277 (14 self)
Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints to either modify the objective function, or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.
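As a rough illustration of what "modifying the objective function" with pairwise constraints can look like, the sketch below adds a penalty for each violated must-link or cannot-link pair to the usual k-means distortion. It is a simplified stand-in, not the posterior energy of the HMRF model from the paper: the function names, the squared Euclidean distortion, and the single constant penalty weight `w` are all assumptions made for this example.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def constrained_objective(points, labels, centroids,
                          must_link, cannot_link, w=1.0):
    """Distortion plus a constant penalty per violated pairwise constraint.

    points      -- list of coordinate tuples
    labels      -- labels[i] is the cluster index assigned to points[i]
    centroids   -- centroids[k] is the prototype of cluster k
    must_link   -- pairs (i, j) that should share a cluster
    cannot_link -- pairs (i, j) that should be in different clusters
    w           -- hypothetical uniform penalty weight (an assumption)
    """
    # Standard prototype-based distortion term.
    total = sum(squared_euclidean(p, centroids[k])
                for p, k in zip(points, labels))
    # A must-link pair is violated when its points are split apart.
    total += sum(w for i, j in must_link if labels[i] != labels[j])
    # A cannot-link pair is violated when its points are put together.
    total += sum(w for i, j in cannot_link if labels[i] == labels[j])
    return total
```

A clustering algorithm built on such an objective would alternate between reassigning points and updating prototypes so that this value decreases; the paper's own algorithm and penalty terms are described in the full text.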
Semi-supervised graph clustering: a kernel approach
, 2008
Cited by 95 (3 self)
Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors. In this paper, we unify vector-based and graph-based approaches. We first show that a recently-proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective (Dhillon et al., in Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, 2004a). A recent theoretical connection between weighted kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data given either as vectors or as a graph. For graph data, this result leads to algorithms for optimizing several new semi-supervised graph clustering objectives. For vector data, the kernel approach also enables us to find clusters with nonlinear boundaries in the input data space. Furthermore, we show that recent work on spectral learning (Kamvar et al., in Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2003) may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current state-of-the-art semi-supervised algorithms on both vector-based and graph-based data sets.
Clustering with Constraints: Feasibility Issues and the k-Means Algorithm
, 2005
Cited by 89 (9 self)
Recent work has looked at extending the k-Means algorithm to incorporate background information in the form of instance level must-link and cannot-link constraints. We introduce two ways of specifying additional background information in the form of δ and ε constraints that operate on all instances but which can be interpreted as conjunctions or disjunctions of instance level constraints and hence are easy to implement. We present complexity results for the feasibility of clustering under each type of constraint individually and several types together. A key finding is that determining whether there is a feasible solution satisfying all constraints is, in general, NP-complete. Thus, an iterative algorithm such as k-Means should not try to find a feasible partitioning at each iteration. This motivates our derivation of a new version of the k-Means algorithm that minimizes the constrained vector quantization error but at each iteration does not attempt to satisfy all constraints. Using standard UCI datasets, we find that using constraints improves accuracy as others have reported, but we also show that our algorithm reduces the number of iterations until convergence. Finally, we illustrate these benefits and our new constraint types on a complex real world object identification problem using the infrared detector on an Aibo robot.
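To make the must-link and cannot-link vocabulary concrete, the sketch below checks one easy necessary condition only: whether some cannot-link pair has both endpoints forced into the same group by the transitive closure of the must-link pairs. This is not the feasibility test analysed in the paper, which is NP-complete in general once the number of clusters is bounded; the function names and the union-find approach are assumptions chosen for this example.

```python
def must_link_components(n, must_link):
    """Return a component id per instance after merging must-link pairs."""
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in must_link:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i
    return [find(i) for i in range(n)]


def directly_contradictory(n, must_link, cannot_link):
    """True if a cannot-link pair lies inside one must-link component.

    A True result means no clustering can satisfy every constraint.
    A False result does NOT guarantee that a feasible clustering with
    a given number of clusters exists.
    """
    component = must_link_components(n, must_link)
    return any(component[i] == component[j] for i, j in cannot_link)
```

For example, with must-link pairs (0, 1) and (1, 2), a cannot-link pair (0, 2) is contradictory, while (0, 3) is not.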
Active learning for anomaly and rare-category detection
In Advances in Neural Information Processing Systems 18
, 2004
Cited by 46 (0 self)
We introduce a novel active-learning scenario in which a user wants to work with a learning algorithm to identify useful anomalies. These are distinguished from the traditional statistical definition of anomalies as outliers or merely ill-modeled points. Our distinction is that the usefulness of anomalies is categorized subjectively by the user. We make two additional assumptions. First, there exist extremely few useful anomalies to be hunted down within a massive dataset. Second, both useful and useless anomalies may sometimes exist within tiny classes of similar anomalies. The challenge is thus to identify “rare category” records in an unlabeled noisy set with help (in the form of class labels) from a human expert who has a small budget of datapoints that they are prepared to categorize. We propose a technique to meet this challenge, which assumes a mixture model fit to the data, but otherwise makes no assumptions on the particular form of the mixture components. This property promises wide applicability in real-life scenarios and for various statistical models. We give an overview of several alternative methods, highlighting their strengths and weaknesses, and conclude with a detailed empirical analysis. We show that our method can quickly zoom in on an anomaly set containing a few tens of points in a dataset of hundreds of thousands.
A Discriminative Learning Framework with Pairwise Constraints for Video Object Classification
In Proc. of CVPR
, 2004
Cited by 38 (5 self)
In video object classification, insufficient labeled data may at times be easily augmented with pairwise constraints on sample points, i.e., whether they are in the same class or not. In this paper, we propose a discriminative learning approach which incorporates pairwise constraints into a conventional margin-based learning framework. The proposed approach offers several advantages over existing approaches dealing with pairwise constraints. First, as opposed to learning distance metrics, the new approach derives its classification power by directly modeling the decision boundary. Second, most previous work handles labeled data by converting them to pairwise constraints and thus leads to much more computation. The proposed approach can handle pairwise constraints together with labeled data so that the computation is greatly reduced. The proposed approach is evaluated on a people classification task with two surveillance video datasets.
Active co-analysis of a set of shapes
 ACM Trans. on Graph (SIGGRAPH Asia
, 2012
Cited by 33 (9 self)
Figure 1: Overview of our active co-analysis: (a) We start with an initial unsupervised co-segmentation of the input set. (b) During active learning, the system automatically suggests constraints which would refine results and the user interactively adds constraints as appropriate. In this example, the user adds a cannot-link constraint (in red) and a must-link constraint (in blue) between segments. (c) The constraints are propagated to the set and the co-segmentation is refined. The process from (b) to (c) is repeated until the desired result is obtained.
Unsupervised co-analysis of a set of shapes is a difficult problem since the geometry of the shapes alone cannot always fully describe the semantics of the shape parts. In this paper, we propose a semi-supervised learning method where the user actively assists in the co-analysis by iteratively providing inputs that progressively constrain the system. We introduce a novel constrained clustering method based on a spring system which embeds elements to better respect their inter-distances in feature space together with the user-given set of constraints. We also present an active learning method that suggests to the user where his input is likely to be the most effective in refining the results. We show that each single pair of constraints affects many relations across the set. Thus, the method requires only a sparse set of constraints to quickly converge toward a consistent and error-free semantic labeling of the set.
Learning with constrained and unlabeled data
In CVPR
, 2005
Cited by 28 (3 self)
Classification problems abundantly arise in many computer vision tasks – being of supervised, semi-supervised or unsupervised nature. Even when class labels are not available, a user still might favor certain grouping solutions over others. This bias can be expressed either by providing a clustering criterion or cost function and, in addition to that, by specifying pairwise constraints on the assignment of objects to classes. In this work, we discuss a unifying formulation for labelled and unlabelled data that can incorporate constrained data for model fitting. Our approach models the constraint information by the maximum entropy principle. This modeling strategy allows us (i) to handle constraint violations and soft constraints, and, at the same time, (ii) to speed up the optimization process. Experimental results on face classification and image segmentation indicate that the proposed algorithm is computationally efficient and generates superior groupings when compared with alternative techniques.
Subjectivity Word Sense Disambiguation
Cited by 27 (2 self)
This paper investigates a new task, subjectivity word sense disambiguation (SWSD), which is to automatically determine which word instances in a corpus are being used with subjective senses, and which are being used with objective senses. We provide empirical evidence that SWSD is more feasible than full word sense disambiguation, and that it can be exploited to improve the performance of contextual subjectivity and sentiment analysis systems.
Generation of Alternative Clusterings Using the CAMI Approach
Cited by 26 (3 self)
Exploratory data analysis aims to discover and generate multiple views of the structure within a dataset. Conventional clustering techniques, however, are designed to only provide a single grouping or clustering of a dataset. In this paper, we introduce a novel algorithm called CAMI, that can uncover alternative clusterings from a dataset. CAMI takes a mathematically appealing approach, combining the use of mutual information to distinguish between alternative clusterings, coupled with an expectation maximization framework to ensure clustering quality. We experimentally test CAMI on both synthetic and real-world datasets, comparing it against a variety of state-of-the-art algorithms. We demonstrate that CAMI’s performance is high and that its formulation provides a number of advantages compared to existing techniques.
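The role mutual information can play in telling two clusterings apart is easy to see on hard label assignments. The sketch below computes the empirical mutual information (in nats) between two labelings of the same instances: a value near zero means the two clusterings share little structure, which is what one would want from an "alternative" clustering. This is only an illustration of the quantity; it is not CAMI's formulation, and the function name and the restriction to hard assignments are assumptions.

```python
from collections import Counter
from math import log


def mutual_information(labels_a, labels_b):
    """Empirical mutual information between two hard clusterings.

    labels_a[i] and labels_b[i] are the cluster labels that the two
    clusterings give to instance i. The result is in nats.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same instances")
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))  # co-occurrence counts
    count_a = Counter(labels_a)               # cluster sizes in A
    count_b = Counter(labels_b)               # cluster sizes in B
    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) ), written with counts.
        mi += (n_ab / n) * log(n * n_ab / (count_a[a] * count_b[b]))
    return mi
```

Two identical two-cluster labelings of equal size give log 2, while two labelings that cut the data along independent directions give 0.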
Comparing and unifying search-based and similarity-based approaches to semi-supervised clustering
In Proceedings of the ICML-2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining
, 2003
Cited by 25 (3 self)
Semi-supervised clustering employs a small amount of labeled data to aid unsupervised learning. Previous work in the area has employed one of two approaches: 1) Search-based methods that utilize supervised data to guide the search for the best clustering, and 2) Similarity-based methods that use supervised data to adapt the underlying similarity metric used by the clustering algorithm. This paper presents a unified approach based on the K-Means clustering algorithm that incorporates both of these techniques. Experimental results demonstrate that the combined approach generally produces better clusters than either of the individual approaches.
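One simple way labeled data can guide the search in a K-Means style algorithm is to initialise each cluster centre from the labeled examples of one class instead of at random. The sketch below shows only that seeding step, under the assumption that each labeled class corresponds to one cluster; it is not the unified method of the paper, and `seed_centroids` and its arguments are hypothetical names.

```python
def seed_centroids(points, seed_labels):
    """Initial centroids computed as per-class means of labeled points.

    points      -- list of coordinate tuples (labeled and unlabeled)
    seed_labels -- dict mapping a point index to its known class label;
                   only the labeled subset appears here
    Returns a dict mapping each class label to its mean point.
    """
    sums, counts = {}, {}
    for index, label in seed_labels.items():
        point = points[index]
        if label not in sums:
            sums[label] = [0.0] * len(point)
            counts[label] = 0
        # Accumulate coordinates for this class.
        sums[label] = [s + x for s, x in zip(sums[label], point)]
        counts[label] += 1
    return {label: tuple(s / counts[label] for s in sums[label])
            for label in sums}
```

The returned centroids would then replace the random initialisation of an ordinary K-Means run over all points, labeled and unlabeled alike.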