Parallel Algorithms for Hierarchical Clustering - Parallel Computing, 1995
Abstract - Cited by 69 (1 self)
Hierarchical clustering is a common method used to determine clusters of similar data points in multidimensional spaces. O(n^2) algorithms are known for this problem [3, 4, 10, 18]. This paper reviews important results for sequential algorithms and describes previous work on parallel algorithms for hierarchical clustering. Parallel algorithms to perform hierarchical clustering using several distance metrics are then described. Optimal PRAM algorithms using n/log n processors are given for the average link, complete link, centroid, median, and minimum variance metrics. Optimal butterfly and tree algorithms using n/log n processors are given for the centroid, median, and minimum variance metrics. Optimal asymptotic speedups are achieved for the best practical algorithm to perform clustering using the single link metric on an n/log n processor PRAM, butterfly, or tree.
Keywords: hierarchical clustering, pattern analysis, parallel algorithm, butterfly network, PRAM algorithm.
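To make the O(n^2) sequential bound for the single link metric concrete, here is a minimal Python sketch. It is not taken from the paper: it relies on the well-known fact that the single-link merge distances are exactly the edge weights of a minimum spanning tree, and it builds that tree with Prim's algorithm on the complete Euclidean distance graph. The function name `single_link_merge_heights` and the choice of Euclidean distance are assumptions made for illustration.

```python
import math


def single_link_merge_heights(points):
    """Return the n-1 single-link merge distances in increasing order.

    Illustrative sketch (hypothetical name, Euclidean distance assumed).
    Prim's algorithm on the complete graph: O(n^2) time, O(n) extra space.
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    # best[j] is the distance from point j to the nearest point already in the tree
    best = [math.inf] * n
    in_tree[0] = True
    for j in range(1, n):
        best[j] = math.dist(points[0], points[j])
    heights = []
    for _ in range(n - 1):
        # Attach the outside point that is closest to the tree
        k = min((j for j in range(n) if not in_tree[j]), key=lambda j: best[j])
        heights.append(best[k])
        in_tree[k] = True
        # Relax distances of the remaining points against the new tree member
        for j in range(n):
            if not in_tree[j]:
                d = math.dist(points[k], points[j])
                if d < best[j]:
                    best[j] = d
    return sorted(heights)
```

For example, `single_link_merge_heights([(0, 0), (1, 0), (5, 0), (5, 1), (20, 0)])` gives the distances at which single-link clusters merge, smallest first.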
A structured family of clustering and tree construction methods - Advances in Applied Mathematics, 2001
Abstract - Cited by 10 (4 self)
A cluster A is an Apresjan cluster if every pair of objects within A is more similar than either is to any object outside A. The criterion is intuitive, compelling, but often too restrictive for applications in classification. We therefore explore extensions of Apresjan clustering to a family of related hierarchical clustering methods. The extensions are shown to be closely connected with the well-known single and average linkage tree constructions. A dual family of methods for classification by splits is also presented. Splits are partitions of the set of objects into two disjoint blocks and are widely used in domains such as phylogenetics. Both the cluster and split methods give rise to progressively refined tree representations. We exploit dualities and connections between the various methods, giving polynomial time construction algorithms for most of the constructions and NP-hardness results for the
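The Apresjan criterion stated at the start of this abstract can be checked directly. The following Python sketch is only an illustration, not the paper's construction algorithm: it assumes similarity is given as a symmetric dissimilarity function (smaller means more similar), and the names `is_apresjan_cluster`, `objects`, and `dissim` are hypothetical.

```python
from itertools import combinations


def is_apresjan_cluster(cluster, objects, dissim):
    """Check the Apresjan criterion for `cluster` within `objects`.

    Assumes `dissim(a, b)` is a symmetric dissimilarity (smaller = more similar).
    The cluster passes if every pair x, y inside it is strictly closer to each
    other than either x or y is to any object z outside it.
    """
    inside = [x for x in objects if x in cluster]
    outside = [z for z in objects if z not in cluster]
    for x, y in combinations(inside, 2):
        d_xy = dissim(x, y)
        for z in outside:
            if d_xy >= dissim(x, z) or d_xy >= dissim(y, z):
                return False
    return True
```

With points on a line, `objects = [0, 1, 2, 10, 11]` and `dissim = lambda a, b: abs(a - b)`, the set `{0, 1, 2}` satisfies the criterion while `{0, 1}` does not, because 1 is exactly as close to 2 as it is to 0. This shows how restrictive the strict inequality can be.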
Clustering in Massive Data Sets - Handbook of Massive Data Sets, 1999
Abstract - Cited by 8 (0 self)
We review the time and storage costs of search and clustering algorithms. We exemplify these, based on case studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Sections 2 to 6 relate to nearest neighbor searching, an elemental form of clustering, and a basis for clustering algorithms to follow. Sections 7 to 11 review a number of families of clustering algorithms. Sections 12 to 14 relate to visual or image representations of data sets, from which a number of interesting algorithmic developments arise.
Characterizing Computer Systems' Workloads, 2002
Abstract - Cited by 6 (5 self)
The performance of any system cannot be determined without knowing the workload, that is, the set of requests presented to the system. Workload characterization is the process by which we produce models that are capable of describing and reproducing the behavior of a workload. Such models are essential to any performance-related studies such as capacity planning, workload balancing, performance prediction, and system tuning. In this paper, we survey workload characterization techniques used for several types of computer systems. We identify significant issues and concerns encountered during the characterization process and propose an augmented methodology for workload characterization as a framework.
Fast Full-Search Equivalent Nearest-Neighbour Search Algorithms, 1999
Abstract - Cited by 3 (2 self)
A fundamental activity common to many image processing, pattern classification, and clustering algorithms involves searching a set of n k-dimensional data items for the one which is nearest to a given target item with respect to a distance function. Our goal is to find fast search algorithms which are full-search equivalent: that is, the resulting match is as good as what we could obtain if we were to search the set exhaustively. We propose a framework made up of three components, namely (i) a technique for obtaining a good initial match, (ii) an inexpensive method for determining whether the current match is a full-search equivalent match, and (iii) an effective technique for improving the current match. Our approach is to consider good solutions for each component in order to find an algorithm which balances the overall complexity of the search. We also propose a technique for hierarchical ordering and cluster elimination using a minimal cost spanning tree. Our experiments on vector quantisation coding of images show that the framework and techniques we proposed can be used to construct suitable algorithms for most of our data sets which require full-search equivalent matches at an average arithmetic cost of less than O(k log n) while using only O(n) space.
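As a small illustration of what "full-search equivalent" means, the Python sketch below uses partial distance elimination, a standard technique that is swapped in here for clarity and is not claimed to be the method proposed in the paper. It returns the same nearest item as an exhaustive search under squared Euclidean distance, but stops accumulating a candidate's distance as soon as it can no longer beat the current best match. The function name `nearest_partial_distance` is hypothetical, and `data` is assumed to be non-empty.

```python
def nearest_partial_distance(data, target):
    """Return (index, squared distance) of the item in `data` nearest to `target`.

    Illustrative sketch: squared Euclidean distance, non-empty `data` assumed.
    A candidate is abandoned as soon as its partial sum reaches the best
    distance found so far, so the result matches an exhaustive search.
    Ties keep the earliest item.
    """
    best_index = 0
    best_sq = sum((a - b) ** 2 for a, b in zip(data[0], target))
    for i in range(1, len(data)):
        partial = 0.0
        for a, b in zip(data[i], target):
            partial += (a - b) ** 2
            if partial >= best_sq:
                break  # this candidate cannot improve on the current match
        else:
            # All k dimensions were summed without exceeding the best distance
            best_index, best_sq = i, partial
    return best_index, best_sq
```

The saving depends on how early a good match is found, which is why an initial match of good quality (component (i) of the framework described above) matters for this kind of elimination.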

