Index-driven similarity search in metric spaces
- ACM Transactions on Database Systems
, 2003
Cited by 118 (6 self)
Similarity search is a very important operation in multimedia databases and other database applications involving complex objects, and involves finding objects in a data set S similar to a query object q, based on some similarity measure. In this article, we focus on methods for similarity search that make the general assumption that similarity is represented with a distance metric d. Existing methods for handling similarity search in this setting typically fall into one of two classes. The first directly indexes the objects based on distances (distance-based indexing), while the second is based on mapping to a vector space (mapping-based approach). The main part of this article is dedicated to a survey of distance-based indexing methods, but we also briefly outline how search occurs in mapping-based methods. We also present a general framework for performing search based on distances, and present algorithms for common types of queries that operate on an arbitrary “search hierarchy.” These algorithms can be applied on each of the methods presented, provided a suitable search hierarchy is defined.
Using the Triangle Inequality to Accelerate k-Means
, 2003
Cited by 65 (0 self)
The k-means algorithm is by far the most widely used method for discovering clusters in data. We show how to accelerate it dramatically, while still always computing exactly the same result as the standard algorithm. The accelerated algorithm avoids unnecessary distance calculations by applying the triangle inequality in two different ways, and by keeping track of lower and upper bounds for distances between points and centers. Experiments show that the new algorithm is effective for datasets with up to 1000 dimensions, and becomes more and more effective as the number k of clusters increases. For k>=20 it is many times faster than the best previously known accelerated k-means method.
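The pruning idea behind the abstract above can be illustrated with a minimal sketch. It shows only one standard consequence of the triangle inequality: if d(c, c') >= 2 d(x, c), then d(x, c') >= d(x, c), so the distance from x to c' never needs to be computed. It does not reproduce the paper's full algorithm, which also maintains lower and upper bounds across iterations. The function name `assign_with_pruning` and the sample data are hypothetical.

```python
import math

def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping distance
    computations that the triangle inequality proves unnecessary.
    Illustrative sketch only; not the full bound-tracking algorithm."""
    k = len(centers)
    # Precompute all center-to-center distances once per assignment pass.
    cc = [[math.dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    labels = []
    skipped = 0
    for x in points:
        best = 0
        best_d = math.dist(x, centers[0])
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(x, c_best), then c_j cannot be
            # strictly closer to x than c_best, so skip computing d(x, c_j).
            if cc[best][j] >= 2 * best_d:
                skipped += 1
                continue
            d = math.dist(x, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped

def assign_brute_force(points, centers):
    """Reference assignment that computes every point-to-center distance."""
    return [min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
            for x in points]

points = [(0, 0), (0, 1), (10, 10), (10, 11), (4, 4)]
centers = [(0, 0), (10, 10), (100, 100)]
labels, skipped = assign_with_pruning(points, centers)
```

On this sample the pruned assignment agrees with the brute-force one while computing fewer distances, which is the sense in which such acceleration returns exactly the same result as the standard algorithm.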
Active Semi-Supervision for Pairwise Constrained Clustering
- Proc. 4th SIAM Intl. Conf. on Data Mining (SDM-2004)
Cited by 60 (6 self)
Semi-supervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of must-link and cannot-link constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for actively selecting informative pairwise constraints to get improved clustering performance. The clustering and active learning methods are both easily scalable to large datasets, and can handle very high dimensional data. Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves the accuracy of clustering when given a relatively small amount of supervision.
A Comparison of Spectral Clustering Algorithms
, 2003
Cited by 47 (2 self)
Spectral Clustering has become quite popular over the last few years and several new algorithms have been published. In this paper, we compare several of the best-known algorithms from the point of view of clustering quality over artificial and real datasets. We implement many variations of the existing spectral algorithms and compare their performance to see which features are more important. We also demonstrate that spectral methods show competitive performance on real datasets with respect to existing methods.
Approximation Algorithms for Hierarchical Location Problems
- in Proceedings of the 35th Annual ACM Symposium on Theory of Computing. ACM
, 2002
Cited by 10 (0 self)
We formulate and (approximately) solve hierarchical versions of two prototypical problems in discrete location theory, namely, the metric uncapacitated k-median and facility location problems. Our work yields new insights into hierarchical clustering, a widely used technique in data analysis. First, we show that every metric space admits a hierarchical clustering that is within a constant factor of optimal at every level of granularity with respect to the average (squared) distance objective. Second, we provide a natural solution to the leaf ordering problem encountered in the traditional dendrogram-based approach to the visualization of hierarchical clusterings.
Frequency-Based Views to Pattern Collections
, 2003
Cited by 6 (5 self)
Finding frequently occurring patterns from data sets is a central computational task in data mining. In this paper we suggest to focus on pattern frequencies. We advocate frequency simplifications as a complementary approach to structural constraints on patterns. As a special case of the frequency simplifications, we consider discretizing the frequencies. We analyze the worst case error of certain discretization functions and give efficient algorithms minimizing several error functions. In addition, we show that the discretizations can be used to find small approximate condensed representations for the frequent patterns.
Hierarchical flow
- in Proceedings of the 2nd International Network Optimization Conference, 2005
, 2004
Cited by 4 (4 self)
This paper defines a hierarchical version of the maximum flow problem. In this model, the capacities increase over time and the resulting solution is a sequence of flows that build on each other incrementally. Thus far, hierarchical problems considered in the literature have been built on NP-complete problems. To the best of our knowledge, our results are the first to find a polynomial time problem whose hierarchical version is NP-complete. We present approximation algorithms and hardness results for many versions of this problem, and comment on the relation to multicommodity flow.
An incremental model for combinatorial minimization
, 2006
Cited by 3 (2 self)
Traditional optimization algorithms are concerned with static input, static constraints, and attempt to produce static output of optimal value. Recent literature has strayed from this conventional approach to deal with more realistic situations in which the input changes over time. Incremental optimization is a new framework for handling this type of dynamic behavior. We consider a general model for producing incremental versions of traditional covering problems along with several natural incremental metrics. Using this model, we demonstrate how to convert conventional algorithms into incremental algorithms with only a constant factor loss in approximation power. We introduce incremental versions of min cut, edge cover, and (k, r)-center and present some hardness results. Lastly, we discuss how the incremental model can help us more fully understand online problems and their corresponding algorithms.
Incremental Medians via Online Bidding
, 2008
Cited by 2 (0 self)
In the k-median problem we are given sets of facilities and customers, and distances between them. For a given set F of facilities, the cost of serving a customer u is the minimum distance between u and a facility in F. The goal is to find a set F of k facilities that minimizes the sum, over all customers, of their service costs. Following the work of Mettu and Plaxton, we study the incremental medians problem, where k is not known in advance. An incremental algorithm produces a nested sequence of facility sets F1 ⊆ F2 ⊆ ··· ⊆ Fn, where |Fk| = k for each k. Such an algorithm is called c-cost-competitive if the cost of each Fk is at most c times the optimum k-median cost. We give improved incremental algorithms for the metric version of this problem: an 8-cost-competitive deterministic algorithm, a 2e ≈ 5.44-cost-competitive randomized algorithm, a (24 + ɛ)-cost-competitive, polynomial-time deterministic algorithm, and a 6e + ɛ ≈ 16.31-cost-competitive, polynomial-time randomized algorithm. We also consider the competitive ratio with respect to size. An algorithm is s-size-competitive if the cost of each Fk is at most the minimum cost of any set of k facilities, while the size of Fk is at most sk. We show that the optimal size-competitive ratios for this problem, in the deterministic and randomized cases, are 4 and e. For polynomial-time algorithms, we present the first polynomial-time O(log m)-size-
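The definitions in the abstract above (service cost, nested facility sets, c-cost-competitiveness) can be made concrete with a small brute-force sketch. It only evaluates the cost-competitive ratio of a given opening order on a tiny instance by enumerating every k-subset; it is not any of the paper's algorithms, and the names `service_cost`, `optimal_cost`, and `competitive_ratio`, as well as the example instance on a line, are assumptions made for illustration.

```python
from itertools import combinations

def service_cost(F, customers, d):
    """Sum over customers of the distance to the nearest facility in F."""
    return sum(min(d(u, f) for f in F) for u in customers)

def optimal_cost(facilities, customers, d, k):
    """Optimum k-median cost by exhaustive search (tiny instances only)."""
    return min(service_cost(F, customers, d)
               for F in combinations(facilities, k))

def competitive_ratio(order, customers, d):
    """Worst ratio cost(F_k) / opt_k over all k, where F_k is the first k
    facilities of `order`, so the sets F_1, F_2, ... are nested by construction."""
    worst = 1.0
    for k in range(1, len(order) + 1):
        cost = service_cost(order[:k], customers, d)
        opt = optimal_cost(order, customers, d, k)
        if opt > 0:
            worst = max(worst, cost / opt)
        elif cost > 0:
            # The optimum is free but the incremental solution is not.
            worst = float("inf")
    return worst

# Hypothetical instance: facilities and customers at the same points on a line.
line_metric = lambda a, b: abs(a - b)
customers = [0, 5, 10]
good_order = [5, 0, 10]   # opens the middle facility first
bad_order = [0, 5, 10]    # opens an end facility first
```

Here `competitive_ratio(good_order, customers, line_metric)` evaluates to 1.0 and `competitive_ratio(bad_order, customers, line_metric)` to 1.5, since opening the facility at 0 first costs 15 against an optimum of 10 for k = 1.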
Robust hierarchical clustering
, 2010
Cited by 2 (0 self)
One of the most widely used techniques for data clustering is agglomerative clustering. Such algorithms have been long used across many different fields ranging from computational biology to social sciences to computer vision in part because their output is easy to interpret. Unfortunately, it is well known, however, that many of the classic agglomerative clustering algorithms are not robust to noise [14]. In this paper we propose and analyze a new robust algorithm for bottom-up agglomerative clustering. We show that our algorithm can be used to cluster accurately in cases where the data satisfies a number of natural properties and where the traditional agglomerative algorithms fail. We also show how to adapt our algorithm to the inductive setting where our given data is only a small random sample of the entire data set.

