Results 1  10
of
146
A Probabilistic Framework for SemiSupervised Clustering
, 2004
"... Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supe ..."
Abstract

Cited by 183 (13 self)
 Add to MetaCart
Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints to either modify the objective function, or to learn the distance measure. We propose a probabilistic model for semisupervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototypebased clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and Idivergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semisupervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework. 1.
Incremental Clustering and Dynamic Information Retrieval
, 1997
"... Motivated by applications such as document and image classification in information retrieval, we consider the problem of clustering dynamic point sets in a metric space. We propose a model called incremental clustering which is based on a careful analysis of the requirements of the information retri ..."
Abstract

Cited by 153 (5 self)
 Add to MetaCart
Motivated by applications such as document and image classification in information retrieval, we consider the problem of clustering dynamic point sets in a metric space. We propose a model called incremental clustering which is based on a careful analysis of the requirements of the information retrieval application, and which should also be useful in other applications. The goal is to efficiently maintain clusters of small diameter as new points are inserted. We analyze several natural greedy algorithms and demonstrate that they perform poorly. We propose new deterministic and randomized incremental clustering algorithms which have a provably good performance. We complement our positive results with lower bounds on the performance of incremental algorithms. Finally, we consider the dual clustering problem where the clusters are of fixed diameter, and the goal is to minimize the number of clusters. 1 Introduction We consider the following problem: as a sequence of points from a metric...
Videobased face recognition using probabilistic appearance manifolds
 In Proc. IEEE Conference on Computer Vision and Pattern Recognition
, 2003
"... This paper presents a novel method to model and recognize human faces in video sequences. Each registered person is represented by a lowdimensional appearance manifold in the ambient image space. The complex nonlinear appearance manifold expressed as a collection of subsets (named pose manifolds), ..."
Abstract

Cited by 112 (5 self)
 Add to MetaCart
This paper presents a novel method to model and recognize human faces in video sequences. Each registered person is represented by a lowdimensional appearance manifold in the ambient image space. The complex nonlinear appearance manifold expressed as a collection of subsets (named pose manifolds), and the connectivity among them. Each pose manifold is approximated by an affine plane. To construct this representation, exemplars are sampled from videos, and these exemplars are clustered with a Kmeans algorithm; each cluster is represented as a plane computed through principal component analysis (PCA). The connectivity between the pose manifolds encodes the transition probability between images in each of the pose manifold and is learned from a training video sequences. A maximum a posteriori formulation is presented for face recognition in test video sequences by integrating the likelihood that the input image comes from a particular pose manifold and the transition probability to this pose manifold from the previous frame. To recognize faces with partial occlusion, we introduce a weight mask into the process. Extensive experiments demonstrate that the proposed algorithm outperforms existing framebased face recognition methods with temporal voting schemes. 1
Clustering data streams: Theory and practice
 IEEE TKDE
, 2003
"... Abstract—The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little ..."
Abstract

Cited by 106 (2 self)
 Add to MetaCart
Abstract—The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm’s performance on synthetic and real data streams. Index Terms—Clustering, data streams, approximation algorithms. 1
Improved fast Gauss transform and efficient kernel density estimation
 In ICCV
, 2003
"... Evaluating sums of multivariate Gaussians is a common computational task in computer vision and pattern recognition, including in the general and powerful kernel density estimation technique. The quadratic computational complexity of the summation is a significant barrier to the scalability of this ..."
Abstract

Cited by 103 (7 self)
 Add to MetaCart
Evaluating sums of multivariate Gaussians is a common computational task in computer vision and pattern recognition, including in the general and powerful kernel density estimation technique. The quadratic computational complexity of the summation is a significant barrier to the scalability of this algorithm to practical applications. The fast Gauss transform (FGT) has successfully accelerated the kernel density estimation to linear running time for lowdimensional problems. Unfortunately, the cost of a direct extension of the FGT to higherdimensional problems grows exponentially with dimension, making it impractical for dimensions above 3. We develop an improved fast Gauss transform to efficiently estimate sums of Gaussians in higher dimensions, where a new multivariate expansion scheme and an adaptive space subdivision technique dramatically improve the performance. The improved FGT has been applied to the mean shift algorithm achieving linear computational complexity. Experimental results demonstrate the efficiency and effectiveness of our algorithm. 1
Efficient algorithms for geometric optimization
 ACM Comput. Surv
, 1998
"... We review the recent progress in the design of efficient algorithms for various problems in geometric optimization. We present several techniques used to attack these problems, such as parametric searching, geometric alternatives to parametric searching, pruneandsearch techniques for linear progra ..."
Abstract

Cited by 94 (12 self)
 Add to MetaCart
We review the recent progress in the design of efficient algorithms for various problems in geometric optimization. We present several techniques used to attack these problems, such as parametric searching, geometric alternatives to parametric searching, pruneandsearch techniques for linear programming and related problems, and LPtype problems and their efficient solution. We then describe a variety of applications of these and other techniques to numerous problems in geometric optimization, including facility location, proximity problems, statistical estimators and metrology, placement and intersection of polygons and polyhedra, and ray shooting and other querytype problems.
Active SemiSupervision for Pairwise Constrained Clustering
 Proc. 4th SIAM Intl. Conf. on Data Mining (SDM2004
"... Semisupervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of mustlink and cannotlink constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for acti ..."
Abstract

Cited by 90 (10 self)
 Add to MetaCart
Semisupervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of mustlink and cannotlink constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for actively selecting informative pairwise constraints to get improved clustering performance. The clustering and active learning methods are both easily scalable to large datasets, and can handle very high dimensional data. Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves the accuracy of clustering when given a relatively small amount of supervision. 1
Online Choice of Active Learning Algorithms
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2004
"... This work is concerned with the question of how to combine online an ensemble of active learners so as to expedite the learning progress in poolbased active learning. We develop an activelearning master algorithm, based on a known competitive algorithm for the multiarmed bandit problem. A major ..."
Abstract

Cited by 88 (2 self)
 Add to MetaCart
This work is concerned with the question of how to combine online an ensemble of active learners so as to expedite the learning progress in poolbased active learning. We develop an activelearning master algorithm, based on a known competitive algorithm for the multiarmed bandit problem. A major challenge in successfully choosing top performing active learners online is to reliably estimate their progress during the learning session. To this end we propose a simple maximum entropy criterion that provides effective estimates in realistic settings. We study the performance of the proposed master algorithm using an ensemble containing two of the best known activelearning algorithms as well as a new algorithm. The resulting
Nearestneighbor searching and metric space dimensions
 In NearestNeighbor Methods for Learning and Vision: Theory and Practice
, 2006
"... Given a set S of n sites (points), and a distance measure d, the nearest neighbor searching problem is to build a data structure so that given a query point q, the site nearest to q can be found quickly. This paper gives a data structure for this problem; the data structure is built using the distan ..."
Abstract

Cited by 87 (0 self)
 Add to MetaCart
Given a set S of n sites (points), and a distance measure d, the nearest neighbor searching problem is to build a data structure so that given a query point q, the site nearest to q can be found quickly. This paper gives a data structure for this problem; the data structure is built using the distance function as a “black box”. The structure is able to speed up nearest neighbor searching in a variety of settings, for example: points in lowdimensional or structured Euclidean space, strings under Hamming and edit distance, and bit vector data from an OCR application. The data structures are observed to need linear space, with a modest constant factor. The preprocessing time needed per site is observed to match the query time. The data structure can be viewed as an application of a “kdtree ” approach in the metric space setting, using Voronoi regions of a subset in place of axisaligned boxes. 1
Clustering in Large Graphs and Matrices
, 1999
"... We consider the problem of dividing a set of m points in Euclidean n\Gammaspace into k clusters (m; n are variable while k is fixed), so as to minimize the sum of distance squared of each point to its "cluster center". This formulation differs in two ways from the most frequently considered clusteri ..."
Abstract

Cited by 84 (20 self)
 Add to MetaCart
We consider the problem of dividing a set of m points in Euclidean n\Gammaspace into k clusters (m; n are variable while k is fixed), so as to minimize the sum of distance squared of each point to its "cluster center". This formulation differs in two ways from the most frequently considered clustering problems in the literature, namely, here we have k fixed and m;n variable and we use the sum of squared distances as our measure; we will argue that our problem is natural in many contexts. We consider a relaxation of the discrete problem : find the k\Gammadimensional subspace V so that the sum of distances squared to V (of the m points) is minimized. We show : (i) The relaxation can be solved by Singular Value Decomposition (SVD) of Linear Algebra. (ii) The solution of the relaxation can be used to get a 2approximation algorithm for the original problem. More importantly, (iii) we argue that in fact the relaxation provides a generalized clustering which is useful in its own right. Final...