Results 1–10 of 32
Index-driven similarity search in metric spaces
 ACM Transactions on Database Systems
, 2003
Abstract

Cited by 133 (6 self)
Similarity search is a very important operation in multimedia databases and other database applications involving complex objects, and involves finding objects in a data set S similar to a query object q, based on some similarity measure. In this article, we focus on methods for similarity search that make the general assumption that similarity is represented with a distance metric d. Existing methods for handling similarity search in this setting typically fall into one of two classes. The first directly indexes the objects based on distances (distance-based indexing), while the second is based on mapping to a vector space (mapping-based approach). The main part of this article is dedicated to a survey of distance-based indexing methods, but we also briefly outline how search occurs in mapping-based methods. We also present a general framework for performing search based on distances, and present algorithms for common types of queries that operate on an arbitrary “search hierarchy.” These algorithms can be applied on each of the methods presented, provided a suitable search hierarchy is defined.
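The distance-based pruning these indexing methods build on can be sketched in a few lines. The following toy example (function names and data are illustrative, not from the article) uses one pivot's precomputed distances and the triangle inequality to skip distance computations during a range query:

```python
# Hypothetical sketch of triangle-inequality pruning, the core idea
# behind distance-based indexing. All names here are illustrative.

def range_query(data, pivot_dists, pivot, d, q, r):
    """Return all x in data with d(q, x) <= r, skipping candidates
    whose precomputed pivot distance already rules them out."""
    dq_pivot = d(q, pivot)
    results = []
    for x in data:
        # Triangle inequality: |d(q, pivot) - d(x, pivot)| <= d(q, x).
        # If that lower bound exceeds r, skip the real distance computation.
        if abs(dq_pivot - pivot_dists[x]) > r:
            continue
        if d(q, x) <= r:
            results.append(x)
    return results

# Toy 1-D metric: absolute difference.
d = lambda a, b: abs(a - b)
data = [1, 2, 5, 9, 14]
pivot = 0
pivot_dists = {x: d(x, pivot) for x in data}
print(range_query(data, pivot_dists, pivot, d, 4, 2))  # -> [2, 5]
```

Real distance-based indexes organize many pivots hierarchically; this only illustrates the pruning rule a single node of such a hierarchy applies.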
Using the Triangle Inequality to Accelerate k-Means
, 2003
Abstract

Cited by 98 (1 self)
The k-means algorithm is by far the most widely used method for discovering clusters in data. We show how to accelerate it dramatically, while still always computing exactly the same result as the standard algorithm. The accelerated algorithm avoids unnecessary distance calculations by applying the triangle inequality in two different ways, and by keeping track of lower and upper bounds for distances between points and centers. Experiments show that the new algorithm is effective for datasets with up to 1000 dimensions, and becomes more and more effective as the number k of clusters increases. For k >= 20 it is many times faster than the best previously known accelerated k-means method.
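The simplest of the triangle-inequality tests can be illustrated directly: if the distance between two centers is at least twice a point's distance to its current best center, the other center cannot be closer, so its distance need not be computed. A minimal sketch of that one test (ours; the paper's full algorithm additionally caches upper and lower bounds across iterations):

```python
# Sketch of the half-distance pruning test used to accelerate k-means.
# Names and data are illustrative, not from the paper.

def assign_with_pruning(points, centers, d):
    """Assign each point to its nearest center, counting how many
    distance computations the triangle-inequality test avoids."""
    skipped = 0
    assignments = []
    for x in points:
        best = 0
        best_d = d(x, centers[0])
        for j in range(1, len(centers)):
            # Triangle inequality: d(x, c_j) >= d(c_best, c_j) - d(x, c_best),
            # so d(c_best, c_j) >= 2 * best_d implies c_j is no closer than c_best.
            if d(centers[best], centers[j]) >= 2 * best_d:
                skipped += 1
                continue
            dj = d(x, centers[j])
            if dj < best_d:
                best, best_d = j, dj
        assignments.append(best)
    return assignments, skipped

d = lambda a, b: abs(a - b)
points = [0.1, 0.2, 9.8, 10.1]
centers = [0.0, 10.0]
print(assign_with_pruning(points, centers, d))  # -> ([0, 0, 1, 1], 2)
```

The result is identical to a plain nearest-center assignment; only the number of distance evaluations changes, which is the paper's exactness guarantee in miniature.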
Active Semi-Supervision for Pairwise Constrained Clustering
 Proc. 4th SIAM Intl. Conf. on Data Mining (SDM 2004)
Abstract

Cited by 90 (10 self)
Semi-supervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of must-link and cannot-link constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for actively selecting informative pairwise constraints to get improved clustering performance. The clustering and active learning methods are both easily scalable to large datasets, and can handle very high-dimensional data. Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves the accuracy of clustering when given a relatively small amount of supervision.
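The pairwise supervision itself is easy to make concrete: a must-link pair should share a cluster, a cannot-link pair should not. A tiny illustrative check (our own names; the paper's framework penalizes such violations inside the clustering objective rather than just counting them):

```python
# Illustrative sketch: counting must-link / cannot-link violations
# for a candidate clustering. Names and data are ours, not the paper's.

def constraint_violations(labels, must_link, cannot_link):
    """labels: dict point -> cluster id; constraints are pairs of points."""
    v = 0
    for a, b in must_link:
        if labels[a] != labels[b]:   # must-link pair split across clusters
            v += 1
    for a, b in cannot_link:
        if labels[a] == labels[b]:   # cannot-link pair placed together
            v += 1
    return v

labels = {"p1": 0, "p2": 0, "p3": 1, "p4": 1}
must_link = [("p1", "p2"), ("p2", "p3")]   # second pair is violated
cannot_link = [("p3", "p4")]               # violated
print(constraint_violations(labels, must_link, cannot_link))  # -> 2
```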
A Comparison of Spectral Clustering Algorithms
, 2003
Abstract

Cited by 59 (2 self)
Spectral clustering has become quite popular over the last few years and several new algorithms have been published. In this paper, we compare several of the best-known algorithms from the point of view of clustering quality over artificial and real datasets. We implement many variations of the existing spectral algorithms and compare their performance to see which features are more important. We also demonstrate that spectral methods show competitive performance on real datasets with respect to existing methods.
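The common core the compared algorithms share can be shown in its simplest form: build an affinity graph, take the graph Laplacian, and cluster using its low eigenvectors. A bare-bones two-way split by the sign of the Fiedler vector (our simplified sketch, assuming a Gaussian affinity with an illustrative sigma; the surveyed algorithms differ mainly in normalization and in how multiple eigenvectors are used):

```python
import numpy as np

# Minimal spectral bipartition: sign of the second-smallest eigenvector
# (the Fiedler vector) of the unnormalized graph Laplacian.
# Data and sigma are illustrative.

def spectral_split(X, sigma=1.0):
    # Gaussian affinity matrix from pairwise squared distances.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(1)) - W            # unnormalized Laplacian L = D - W
    vals, vecs = np.linalg.eigh(L)       # eigh returns ascending eigenvalues
    fiedler = vecs[:, 1]                 # eigenvector of 2nd-smallest eigenvalue
    return (fiedler > 0).astype(int)     # split by sign

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = spectral_split(X)
print(labels)  # the two well-separated pairs land in different clusters
```

With k > 2 clusters, the usual recipe instead runs k-means on the rows of the first k eigenvectors; normalized variants replace L with D^{-1/2} L D^{-1/2}.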
Approximation Algorithms for Hierarchical Location Problems
 in Proceedings of the 35th Annual ACM Symposium on Theory of Computing. ACM
, 2002
Abstract

Cited by 11 (0 self)
We formulate and (approximately) solve hierarchical versions of two prototypical problems in discrete location theory, namely, the metric uncapacitated k-median and facility location problems. Our work yields new insights into hierarchical clustering, a widely used technique in data analysis. First, we show that every metric space admits a hierarchical clustering that is within a constant factor of optimal at every level of granularity with respect to the average (squared) distance objective. Second, we provide a natural solution to the leaf ordering problem encountered in the traditional dendrogram-based approach to the visualization of hierarchical clusterings.
Robust hierarchical clustering
, 2010
Abstract

Cited by 8 (2 self)
One of the most widely used techniques for data clustering is agglomerative clustering. Such algorithms have long been used across many different fields, ranging from computational biology to social sciences to computer vision, in part because their output is easy to interpret. It is well known, however, that many of the classic agglomerative clustering algorithms are not robust to noise [14]. In this paper we propose and analyze a new robust algorithm for bottom-up agglomerative clustering. We show that our algorithm can be used to cluster accurately in cases where the data satisfies a number of natural properties and where the traditional agglomerative algorithms fail. We also show how to adapt our algorithm to the inductive setting where our given data is only a small random sample of the entire data set.
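For reference, the classic bottom-up scheme whose noise-sensitivity motivates the paper can be written in a few lines. This is plain single linkage (our illustrative sketch, not the paper's robust variant): repeatedly merge the two clusters whose closest members are nearest.

```python
# Naive single-linkage agglomerative clustering, the classic bottom-up
# baseline. O(n^3)-ish and noise-sensitive; data and names are ours.

def single_linkage(points, k, d):
    """Merge the two closest clusters until only k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                dist = min(d(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

d = lambda a, b: abs(a - b)
print(single_linkage([1, 2, 10, 11, 12], k=2, d=d))  # -> [[1, 2], [10, 11, 12]]
```

The fragility is easy to see: a single noisy point placed between the two groups would chain them together, which is exactly the failure mode robust variants guard against.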
Frequency-Based Views to Pattern Collections
, 2003
Abstract

Cited by 6 (5 self)
Finding frequently occurring patterns in data sets is a central computational task in data mining. In this paper we suggest focusing on pattern frequencies. We advocate frequency simplifications as a complementary approach to structural constraints on patterns. As a special case of frequency simplification, we consider discretizing the frequencies. We analyze the worst-case error of certain discretization functions and give efficient algorithms minimizing several error functions. In addition, we show that the discretizations can be used to find small approximate condensed representations of the frequent patterns.
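One concrete discretization of this flavor (our example, not necessarily the paper's exact scheme) rounds each frequency down to the nearest power of (1 - eps); since consecutive grid points differ by a factor (1 - eps), the relative error of every stored frequency is at most eps:

```python
import math

# Illustrative frequency discretization: round each frequency in (0, 1]
# down onto the geometric grid 1, (1-eps), (1-eps)^2, ...
# The grid and names are our assumptions, used to show the error bound.

def discretize(freq, eps):
    """Largest power of (1 - eps) that is <= freq."""
    level = math.ceil(math.log(freq) / math.log(1 - eps))
    return (1 - eps) ** level

freqs = [0.90, 0.41, 0.10]
approx = [discretize(f, 0.1) for f in freqs]
errors = [(f - a) / f for f, a in zip(freqs, approx)]
print(max(errors) <= 0.1)  # relative error stays within eps -> True
```

Because only the grid level needs storing, a whole pattern collection's frequencies compress to small integers, which is the idea behind the approximate condensed representations mentioned above.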
Hierarchical flow
 in Proceedings of the 2nd International Network Optimization Conference, 2005
, 2004
Abstract

Cited by 4 (4 self)
This paper defines a hierarchical version of the maximum flow problem. In this model, the capacities increase over time and the resulting solution is a sequence of flows that build on each other incrementally. Thus far, hierarchical problems considered in the literature have been built on NP-complete problems. To the best of our knowledge, our results are the first to find a polynomial-time problem whose hierarchical version is NP-complete. We present approximation algorithms and hardness results for many versions of this problem, and comment on the relation to multicommodity flow.
Active Clustering of Biological Sequences
Abstract

Cited by 4 (1 self)
Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume that we have access to one-versus-all queries that, given a point s ∈ S, return the distances between s and all other points. We show that, given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting nicely fits our model because we can use a fast sequence database search program to query a sequence against an entire data set. We conduct an empirical study which shows that even though we query only a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification.
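The landmark idea can be sketched under our own simplified assumptions (farthest-first selection, nearest-landmark assignment; the paper's actual selection strategy and guarantees differ). Each chosen landmark costs exactly one one-versus-all query, so k clusters cost O(k) queries:

```python
# Illustrative landmark-based clustering sketch. Each landmark pick
# corresponds to one "one versus all" distance query; all names and
# data are ours, not the paper's.

def landmark_cluster(points, k, d):
    landmarks = [points[0]]
    queries = 1                       # one one-vs-all query per landmark
    while len(landmarks) < k:
        # Farthest-first: next landmark is the point farthest from all chosen.
        nxt = max(points, key=lambda p: min(d(p, l) for l in landmarks))
        landmarks.append(nxt)
        queries += 1
    # Assign every point to its nearest landmark.
    labels = [min(range(k), key=lambda i: d(p, landmarks[i])) for p in points]
    return labels, queries

d = lambda a, b: abs(a - b)
labels, queries = landmark_cluster([0, 1, 2, 20, 21, 22], k=2, d=d)
print(labels, queries)  # -> [0, 0, 0, 1, 1, 1] 2
```

In the protein setting, d would be a sequence-similarity distance and the one-versus-all query a database search of one sequence against the whole set.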
An incremental model for combinatorial minimization
, 2006
Abstract

Cited by 3 (2 self)
Traditional optimization algorithms deal with static input and static constraints, and attempt to produce static output of optimal value. Recent literature has strayed from this conventional approach to deal with more realistic situations in which the input changes over time. Incremental optimization is a new framework for handling this type of dynamic behavior. We consider a general model for producing incremental versions of traditional covering problems, along with several natural incremental metrics. Using this model, we demonstrate how to convert conventional algorithms into incremental algorithms with only a constant-factor loss in approximation power. We introduce incremental versions of min cut, edge cover, and (k, r)-center and present some hardness results. Lastly, we discuss how the incremental model can help us more fully understand online problems and their corresponding algorithms.