Results 1–10 of 55
Index-driven similarity search in metric spaces
 ACM Transactions on Database Systems
, 2003
Abstract
Cited by 149 (6 self)
Similarity search is a very important operation in multimedia databases and other database applications involving complex objects, and involves finding objects in a data set S similar to a query object q, based on some similarity measure. In this article, we focus on methods for similarity search that make the general assumption that similarity is represented with a distance metric d. Existing methods for handling similarity search in this setting typically fall into one of two classes. The first directly indexes the objects based on distances (distance-based indexing), while the second is based on mapping to a vector space (mapping-based approach). The main part of this article is dedicated to a survey of distance-based indexing methods, but we also briefly outline how search occurs in mapping-based methods. We also present a general framework for performing search based on distances, and present algorithms for common types of queries that operate on an arbitrary “search hierarchy.” These algorithms can be applied on each of the methods presented, provided a suitable search hierarchy is defined.
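Distance-based indexing exploits the triangle inequality to discard candidates without computing their distance to the query. A minimal pure-Python sketch of single-pivot pruning for a range query (the function names and the single-pivot setup are illustrative, not taken from the article):

```python
import math

def range_search(points, q, r, dist):
    """Brute-force range query: all points within distance r of q."""
    return [p for p in points if dist(p, q) <= r]

def pivot_range_search(points, pivot_dists, pivot, q, r, dist):
    """Range query with pivot-based pruning.

    pivot_dists[i] = dist(points[i], pivot), precomputed at index time.
    The triangle inequality gives |dist(q, pivot) - dist(p, pivot)| <= dist(p, q),
    so any p whose lower bound already exceeds r is skipped without
    computing dist(p, q).
    """
    dq = dist(q, pivot)
    out = []
    for p, dp in zip(points, pivot_dists):
        if abs(dq - dp) > r:      # lower bound on dist(p, q) exceeds r: prune
            continue
        if dist(p, q) <= r:       # exact check only for surviving candidates
            out.append(p)
    return out
```

Both functions return the same result set; the pivoted version merely avoids some exact distance evaluations, which is the point of the indexing methods surveyed.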
Using the Triangle Inequality to Accelerate k-Means
, 2003
Abstract
Cited by 115 (1 self)
The k-means algorithm is by far the most widely used method for discovering clusters in data. We show how to accelerate it dramatically, while still always computing exactly the same result as the standard algorithm. The accelerated algorithm avoids unnecessary distance calculations by applying the triangle inequality in two different ways, and by keeping track of lower and upper bounds for distances between points and centers. Experiments show that the new algorithm is effective for datasets with up to 1000 dimensions, and becomes more and more effective as the number k of clusters increases. For k >= 20 it is many times faster than the best previously known accelerated k-means method.
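As a rough illustration of one of the pruning rules (a sketch only; the paper's full algorithm additionally caches lower and upper bounds across iterations): if d(c_best, c) >= 2·d(x, c_best), the triangle inequality gives d(x, c) >= d(c_best, c) − d(x, c_best) >= d(x, c_best), so center c cannot beat the current best and d(x, c) need not be computed.

```python
import math

def assign_with_bounds(points, centers):
    """One k-means assignment pass with triangle-inequality pruning.

    Returns (labels, skipped), where skipped counts the point-center
    distance computations avoided by the pruning rule.
    """
    # Pairwise center-center distances, computed once per pass.
    cc = [[math.dist(a, b) for b in centers] for a in centers]
    labels, skipped = [], 0
    for x in points:
        best = 0
        dbest = math.dist(x, centers[0])
        for j in range(1, len(centers)):
            # If d(c_best, c_j) >= 2 * d(x, c_best), then by the triangle
            # inequality d(x, c_j) >= d(x, c_best): skip the computation.
            if cc[best][j] >= 2 * dbest:
                skipped += 1
                continue
            d = math.dist(x, centers[j])
            if d < dbest:
                best, dbest = j, d
        labels.append(best)
    return labels, skipped
```

The assignment is identical to the brute-force one; only redundant distance evaluations are elided, mirroring the "exactly the same result" guarantee described above.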
Active Semi-Supervision for Pairwise Constrained Clustering
 Proc. 4th SIAM Intl. Conf. on Data Mining (SDM 2004)
Abstract
Cited by 100 (9 self)
Semi-supervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of must-link and cannot-link constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for actively selecting informative pairwise constraints to get improved clustering performance. The clustering and active learning methods are both easily scalable to large datasets, and can handle very high dimensional data. Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves the accuracy of clustering when given a relatively small amount of supervision.
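The must-link/cannot-link formulation can be made concrete with a small helper that counts constraint violations; in a penalized clustering objective such violations would be weighted against distance (a sketch only, the paper's framework and active selection strategy are more involved):

```python
def violations(labels, must_link, cannot_link):
    """Count pairwise-constraint violations in a clustering.

    labels: dict mapping each example to its cluster id.
    must_link / cannot_link: iterables of (a, b) example pairs.
    """
    v = 0
    for a, b in must_link:
        if labels[a] != labels[b]:   # must-link pair split across clusters
            v += 1
    for a, b in cannot_link:
        if labels[a] == labels[b]:   # cannot-link pair placed together
            v += 1
    return v
```

A constrained clustering algorithm would seek an assignment minimizing distortion plus a penalty proportional to this count.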
A Comparison of Spectral Clustering Algorithms
, 2003
Abstract
Cited by 63 (3 self)
Spectral clustering has become quite popular over the last few years, and several new algorithms have been published. In this paper, we compare several of the best-known algorithms in terms of clustering quality over artificial and real datasets. We implement many variations of the existing spectral algorithms and compare their performance to see which features are most important. We also demonstrate that spectral methods show competitive performance on real datasets with respect to existing methods.
Robust hierarchical clustering
, 2010
Abstract
Cited by 13 (2 self)
One of the most widely used techniques for data clustering is agglomerative clustering. Such algorithms have long been used across many different fields, ranging from computational biology to social sciences to computer vision, in part because their output is easy to interpret. Unfortunately, it is well known that many of the classic agglomerative clustering algorithms are not robust to noise [14]. In this paper we propose and analyze a new robust algorithm for bottom-up agglomerative clustering. We show that our algorithm can be used to cluster accurately in cases where the data satisfies a number of natural properties and where the traditional agglomerative algorithms fail. We also show how to adapt our algorithm to the inductive setting, where the given data is only a small random sample of the entire data set.
Approximation Algorithms for Hierarchical Location Problems
 in Proceedings of the 35th Annual ACM Symposium on Theory of Computing. ACM
, 2002
Abstract
Cited by 12 (0 self)
We formulate and (approximately) solve hierarchical versions of two prototypical problems in discrete location theory, namely, the metric uncapacitated k-median and facility location problems. Our work yields new insights into hierarchical clustering, a widely used technique in data analysis. First, we show that every metric space admits a hierarchical clustering that is within a constant factor of optimal at every level of granularity with respect to the average (squared) distance objective. Second, we provide a natural solution to the leaf ordering problem encountered in the traditional dendrogram-based approach to the visualization of hierarchical clusterings.
Frequency-Based Views to Pattern Collections
, 2003
Abstract
Cited by 7 (5 self)
Finding frequently occurring patterns in data sets is a central computational task in data mining. In this paper we suggest focusing on pattern frequencies. We advocate frequency simplifications as a complementary approach to structural constraints on patterns. As a special case of frequency simplifications, we consider discretizing the frequencies. We analyze the worst-case error of certain discretization functions and give efficient algorithms minimizing several error functions. In addition, we show that the discretizations can be used to find small approximate condensed representations of the frequent patterns.
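One standard discretization with a provable worst-case guarantee (used here as an illustrative assumption, not necessarily one of the paper's functions) rounds each frequency down to the nearest power of (1 + eps), which bounds the relative error by eps:

```python
import math

def discretize(f, eps):
    """Round a frequency f in (0, 1] down to the nearest power of (1 + eps).

    The result f_hat satisfies f_hat <= f < f_hat * (1 + eps), so the
    relative error (f - f_hat) / f is strictly below eps.
    """
    k = math.floor(math.log(f, 1 + eps))   # largest k with (1+eps)**k <= f
    return (1 + eps) ** k
```

Because every frequency collapses onto a geometric grid with about log(1/f_min)/log(1+eps) levels, many distinct frequencies share one stored value, which is what makes the condensed representations small.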
Active Clustering of Biological Sequences
, 2012
Abstract
Cited by 5 (1 self)
Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume that we have access to one-versus-all queries that, given a point s ∈ S, return the distances between s and all other points. We show that given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting nicely fits our model because we can use a fast sequence database search program to query a sequence against an entire data set. We conduct an empirical study that shows that even though we query a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification.
Hierarchical flow
 in Proceedings of the 2nd International Network Optimization Conference, 2005
, 2004
Abstract
Cited by 4 (4 self)
This paper defines a hierarchical version of the maximum flow problem. In this model, the capacities increase over time and the resulting solution is a sequence of flows that build on each other incrementally. Thus far, hierarchical problems considered in the literature have been built on NP-complete problems. To the best of our knowledge, our results are the first to find a polynomial time problem whose hierarchical version is NP-complete. We present approximation algorithms and hardness results for many versions of this problem, and comment on the relation to multicommodity flow.
Incremental Medians via Online Bidding
, 2008
Abstract
Cited by 3 (0 self)
In the k-median problem we are given sets of facilities and customers, and distances between them. For a given set F of facilities, the cost of serving a customer u is the minimum distance between u and a facility in F. The goal is to find a set F of k facilities that minimizes the sum, over all customers, of their service costs. Following the work of Mettu and Plaxton, we study the incremental medians problem, where k is not known in advance. An incremental algorithm produces a nested sequence of facility sets F1 ⊆ F2 ⊆ ··· ⊆ Fn, where |Fk| = k for each k. Such an algorithm is called c-cost-competitive if the cost of each Fk is at most c times the optimum k-median cost. We give improved incremental algorithms for the metric version of this problem: an 8-cost-competitive deterministic algorithm, a (2e ≈ 5.44)-cost-competitive randomized algorithm, a (24 + ɛ)-cost-competitive polynomial-time deterministic algorithm, and a (6e + ɛ ≈ 16.31)-cost-competitive polynomial-time randomized algorithm. We also consider the competitive ratio with respect to size. An algorithm is s-size-competitive if the cost of each Fk is at most the minimum cost of any set of k facilities, while the size of Fk is at most sk. We show that the optimal size-competitive ratios for this problem, in the deterministic and randomized cases, are 4 and e. For polynomial-time algorithms, we present the first polynomial-time O(log m)-size
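To make the nesting requirement concrete, here is a naive greedy construction of a nested facility sequence (purely illustrative: this helper is not from the paper and achieves none of the competitive ratios above):

```python
import math

def greedy_incremental_medians(customers, facilities):
    """Build a nested sequence F1 ⊆ F2 ⊆ ... ⊆ Fn of facility sets.

    At each step the facility yielding the smallest resulting total
    service cost is added.  Nesting holds by construction; no
    cost-competitiveness is claimed.
    """
    cost = [math.inf] * len(customers)   # current service cost per customer
    chosen, remaining, seq = [], list(facilities), []
    while remaining:
        def total_cost(f):
            # Total cost if facility f were added to the chosen set.
            return sum(min(cost[i], math.dist(u, f))
                       for i, u in enumerate(customers))
        best = min(remaining, key=total_cost)
        remaining.remove(best)
        chosen.append(best)
        cost = [min(cost[i], math.dist(u, best))
                for i, u in enumerate(customers)]
        seq.append(list(chosen))
    return seq
```

The interesting part of the paper is exactly that such greedy nesting can be far from optimal, and that online-bidding techniques bound the gap by the constants quoted above.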