
## Clustering for Data Reduction: A Divide and Conquer Approach (2007)

Citations: 2 (0 self)

### Citations

3722 | Normalized cuts and image segmentation
- Shi, Malik
- 1997
Citation Context: ...more local decisions are made, this does not always occur in practice. Spectral cuts, which look at the clustering problem from a graph theoretic standpoint, have enjoyed considerable recent interest [15, 12]. One appeal of this clustering formulation is that no assumptions are made concerning the properties of the clusters; k-means, for instance, assumes that the clusters form disjoint convex sets, and i...

2153 | Finding Groups in Data: An Introduction to Cluster Analysis
- Kaufman, Rousseeuw
- 1990
Citation Context: ...∈X A different formulation known as k-medoids examines the cost of swapping, from an initial partition, the current prototype with another potential prototype. The Partitioning Around Medoids (PAM) [8] algorithm accomplishes its clustering objective by examining the cost of swapping the current prototype with every other potential prototype at each iteration. This formulation has the advantage of b...
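The swap-based PAM objective quoted above is easy to state in code. A minimal sketch (function and variable names, and the toy data, are illustrative, not from the cited work; `D` is a precomputed dissimilarity matrix):

```python
import numpy as np

def pam(D, k, max_iter=50):
    """Sketch of Partitioning Around Medoids: from an initial set of
    medoids, repeatedly take the single (medoid, non-medoid) swap that
    most lowers the total assignment cost, until no swap improves it.
    Each pass examines O(k(n-k)) candidate swaps."""
    n = len(D)
    medoids = list(range(k))  # simple deterministic initialization

    def cost(ms):
        # Each point pays its dissimilarity to the nearest medoid.
        return D[:, ms].min(axis=1).sum()

    current = cost(medoids)
    for _ in range(max_iter):
        best = None
        for i, m in enumerate(medoids):
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                c = cost(trial)
                if best is None or c < best[0]:
                    best = (c, trial)
        if best[0] >= current:
            break  # no improving swap exists: local optimum reached
        current, medoids = best
    return medoids

# Two tight groups on a line: one medoid should land in each group.
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(pts[:, None] - pts[None, :])
meds = sorted(pam(D, 2))
```

Because medoids must be actual data points, the result doubles as a set of prototypes for data reduction, which is the property the surrounding text relies on.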

1673 | On spectral clustering: Analysis and an algorithm
- Ng, Jordan, et al.
- 2001
Citation Context: ...more local decisions are made, this does not always occur in practice. Spectral cuts, which look at the clustering problem from a graph theoretic standpoint, have enjoyed considerable recent interest [15, 12]. One appeal of this clustering formulation is that no assumptions are made concerning the properties of the clusters; k-means, for instance, assumes that the clusters form disjoint convex sets, and i...

683 | Clustering by passing messages between data points
- Frey, Dueck
- 2007
Citation Context: ...AP) has been proposed, which, while having quadratic runtime, is more accurate than k-centers and provides a convenient way of controlling the reduction ratio based on a parameter called a preference [5, 6]. Properties of this algorithm as it applies to the task of data reduction will be discussed in Section 3. Unfortunately, a quadratic runtime can be problematic when dealing with very large datasets, as is...
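The preference parameter mentioned here is the self-similarity each point is given; raising it produces more exemplars and thus a larger reduction ratio. A minimal message-passing sketch of affinity propagation (my own compact implementation, with illustrative names and toy data, not the cited authors' code):

```python
import numpy as np

def affinity_propagation(S, preference, iters=200, damping=0.5):
    """Sketch of affinity propagation: exchange responsibility and
    availability messages until exemplars emerge.  S is an n x n
    similarity matrix; the self-similarity 'preference' controls how
    many exemplars (and hence what reduction ratio) you get."""
    n = len(S)
    S = S.copy()
    np.fill_diagonal(S, preference)
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    for _ in range(iters):
        # Responsibility r(i,k): evidence that k should be i's exemplar.
        AS = A + S
        idx = AS.argmax(1)
        first = AS[np.arange(n), idx].copy()
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(1)
        Rnew = S - first[:, None]
        Rnew[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * Rnew
        # Availability a(i,k): evidence that k is a good exemplar overall.
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        Anew = Rp.sum(0)[None, :] - Rp
        diag = Anew.diagonal().copy()
        Anew = np.minimum(0, Anew)
        np.fill_diagonal(Anew, diag)
        A = damping * A + (1 - damping) * Anew
    # Points whose net self-evidence is positive become exemplars.
    return np.where((A + R).diagonal() > 0)[0]

# Two tight groups; a mid-range preference yields one exemplar per group.
x = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
S = -(x[:, None] - x[None, :]) ** 2
exemplars = affinity_propagation(S, preference=np.median(S))
```

Note the quadratic cost the context warns about: every iteration updates two dense n x n message matrices.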

296 | Some methods for classification and analysis of multivariate observations
- MacQueen
- 1967
Citation Context: ...al validation of its effectiveness in Section 5. 2 Background. Exact clustering is an NP-hard problem. The most popular approximate solution is the well-known and studied Lloyd’s or k-means algorithm [11], which can be derived as a special case of expectation maximization. Let φ(x1, x2) denote the similarity between two objects from a set x1, x2 ∈ X. Then k-means optimizes the following objective fun...
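The Lloyd's/k-means procedure referenced here alternates an assignment step and a mean-update step. A minimal sketch (function names and the toy data are illustrative, not from the paper):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assigning points to the nearest
    prototype and recomputing each prototype as its cluster mean."""
    rng = np.random.default_rng(seed)
    # Initialize prototypes with k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest center under squared Euclidean distance.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its old center).
        new = np.array([X[labels == j].mean(0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break  # converged to a local optimum of the objective
        centers = new
    return centers, labels

# Two well-separated 1-D blobs reduce to two prototypes.
X = np.vstack([np.arange(20).reshape(-1, 1) * 0.01,
               10.0 + np.arange(20).reshape(-1, 1) * 0.01])
centers, labels = kmeans(X, 2)
```

Each iteration weakly decreases the k-means objective, which is why the procedure can be read as a special case of expectation maximization, as the context notes.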

211 | A min-max cut algorithm for graph partitioning and data clustering
- Ding, He, et al.
Citation Context: ...ervals along the eigenvector has been found to work well, even when the intervals are quite large. It should be noted that other such criteria exist, such as the MinMax cut, which works similarly well [3]. In addition, while we are interested in bisections, k-way clusterings can be achieved by clustering (using e.g., k-means) in the eigenspace spanned by the first k eigenvectors of the Laplacian. Our n...
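The bisection-by-eigenvector idea in this context can be sketched compactly: build the normalized Laplacian, take the eigenvector of its second-smallest eigenvalue (the Fiedler vector), and cut at zero. This is a generic relaxation of cut criteria like normalized cut, not the cited authors' exact procedure; names and the toy graph are mine:

```python
import numpy as np

def spectral_bisect(W):
    """Bisect a weighted graph given its affinity matrix W by
    thresholding the Fiedler vector of the normalized Laplacian."""
    d = W.sum(1)
    Dinv = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - Dinv @ W @ Dinv   # normalized Laplacian
    vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    fiedler = vecs[:, 1]                    # second-smallest eigenvalue
    return (fiedler > 0).astype(int)        # cut the eigenvector at zero

# Two 3-cliques joined by one weak edge separate cleanly.
W = np.zeros((6, 6))
for i in range(3):
    for j in range(3):
        if i != j:
            W[i, j] = W[i + 3, j + 3] = 1.0
W[2, 3] = W[3, 2] = 0.1
labels = spectral_bisect(W)
```

For the k-way case the context describes, one would instead run k-means on the rows of the first k eigenvectors rather than thresholding a single one.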

138 | CLARANS: A Method for Clustering Objects for Spatial Data Mining
- Ng, Han
- 2002
Citation Context: ...h iteration, the number of operations to find the best swap is of order O(k(n − k)²). Therefore, a number of sampling strategies, such as CLARANS, have been developed to scale it to larger datasets [13]. A good data reduction is one in which the reduction ratio is small and where prototypes well describe their clusters. We take the reduction ratio to be the ratio of the number of prototypes to the numbe...
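The sampling idea behind CLARANS can be contrasted with PAM's full scan: instead of evaluating every swap per iteration, test a bounded number of random swaps and move whenever one improves the cost. A hedged sketch (parameter names and toy data are mine, not from [13]):

```python
import numpy as np

def clarans_style(D, k, max_neighbors=50, restarts=3, seed=0):
    """CLARANS-style randomized local search over medoid sets: from a
    random start, sample random (medoid, non-medoid) swaps; accept any
    improvement, and stop a restart after max_neighbors consecutive
    failures.  D is a precomputed dissimilarity matrix."""
    rng = np.random.default_rng(seed)
    n = len(D)

    def cost(ms):
        return D[:, ms].min(axis=1).sum()

    best_ms, best_c = None, np.inf
    for _ in range(restarts):
        ms = list(rng.choice(n, k, replace=False))
        c = cost(ms)
        fails = 0
        while fails < max_neighbors:
            i = rng.integers(k)           # which medoid to swap out
            h = int(rng.integers(n))      # candidate replacement
            if h in ms:
                continue
            trial = ms[:i] + [h] + ms[i + 1:]
            tc = cost(trial)
            if tc < c:
                ms, c, fails = trial, tc, 0
            else:
                fails += 1
        if c < best_c:
            best_ms, best_c = ms, c
    return best_ms

# Two groups, k = 2 prototypes: reduction ratio here is k/n = 2/6.
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(pts[:, None] - pts[None, :])
meds = sorted(clarans_style(D, 2))
```

Each accepted or rejected swap costs one O(nk) evaluation instead of scanning all k(n − k) candidates, which is the scaling gain the context refers to.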

69 | A Divide-and-Merge Methodology for Clustering
- Cheng, Kannan, et al.

61 | Iterative clustering of high dimensional text data augmented by local search
- Dhillon, Guan, et al.
- 2002
Citation Context: ...maxima and is quite susceptible to the quality of the initialization. Furthermore, k-means has a tendency to get “stuck” on smaller datasets, a problem which is exacerbated in high-dimensional space [2]. Thus, while we expect the clusters to become purer as progressively more local decisions are made, this does not always occur in practice. Spectral cuts, which look at the clustering problem from a ...

48 | Generative model-based document clustering: a comparative study
- Zhong, Ghosh
Citation Context: ...llections, varying in their balance (ratio of largest to smallest cluster) and number of classes, which have been fully described elsewhere and are common benchmarks of clustering performance on text [17]. In addition to those described in the cited work, we construct some collections from the 20 newsgroups dataset². The dataset used for illustrative purposes in the previous sections is a sampling of ...

26 | Mixture modeling by affinity propagation
- Frey, Dueck
- 2006
Citation Context: ...AP) has been proposed, which, while having quadratic runtime, is more accurate than k-centers and provides a convenient way of controlling the reduction ratio based on a parameter called a preference [5, 6]. Properties of this algorithm as it applies to the task of data reduction will be discussed in Section 3. Unfortunately, a quadratic runtime can be problematic when dealing with very large datasets, as is...

16 | A fast k-means implementation using coresets
- Frahling, Sohler
- 2006
Citation Context: ...fective summary of the document. Data reduction is also a common method of dealing with intractability, whereby we are interested in finding a small subset, called a coreset in computational geometry [4], that well approximates the properties of the original data. Such reduction can be useful on a small scale for real-time applications; for example, by reducing a large number of query results to a mo...

14 | Scalable, balanced model-based clustering
- Zhong, Ghosh
Citation Context: ...e portion of the entire dataset (particularly as k increases), limiting the usefulness of this approach for improving scalability. Enforcing the balancing constraint has been an area of some interest [16]. For this purpose, we opt to use a simple variant of k-means which repeatedly bisects the data until the desired value of k is reached. This formulation has the added benefit of being eminently paral...

7 | Data reduction techniques for instancebased learning from human/computer interface data
- Lane, Brodley
- 2000
Citation Context: ...sensors used in the Large Hadron Collider, or LADAR scanners used in autonomous navigation and obstacle avoidance). Data reduction has also been used to ease storage costs for instance-based learning [9] and for model selection algorithms [14]. This latter application is a particular motivation for this work, as model selection (that is, finding the number of clusters from the data) is mainly concern...

5 | Parallel bisecting k-means with prediction clustering algorithm, The Journal of Supercomputing 39
- Li, Chung
- 2007
Citation Context: ...purpose, we opt to use a simple variant of k-means which repeatedly bisects the data until the desired value of k is reached. This formulation has the added benefit of being eminently parallelizable [10]. That is, rather than searching for k means at once, the data is divided into two clusters k times. By choosing the largest cluster as the next to bisect, the algorithm typically converges to a balan...
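The bisecting variant described in this context is short to write down: run 2-means, then repeatedly pick the largest cluster and split it again until k clusters exist. A sketch under illustrative names and toy data (not the cited implementation):

```python
import numpy as np

def bisecting_kmeans(X, k, seed=0):
    """Repeatedly split the largest cluster with 2-means until k
    clusters remain; choosing the largest cluster tends to keep the
    final clustering balanced."""
    rng = np.random.default_rng(seed)

    def two_means(pts, iters=50):
        # Plain Lloyd's algorithm with k = 2 on one cluster's points.
        c = pts[rng.choice(len(pts), 2, replace=False)]
        for _ in range(iters):
            lab = ((pts[:, None] - c[None]) ** 2).sum(-1).argmin(1)
            c = np.array([pts[lab == j].mean(0) if (lab == j).any()
                          else c[j] for j in (0, 1)])
        return lab

    clusters = [np.arange(len(X))]  # index sets into X
    while len(clusters) < k:
        clusters.sort(key=len)
        big = clusters.pop()        # largest cluster is bisected next
        lab = two_means(X[big])
        clusters += [big[lab == 0], big[lab == 1]]
    return clusters

# Three separated 1-D blobs, recovered by two successive bisections.
X = np.concatenate([np.arange(8) * 0.01,
                    5 + np.arange(8) * 0.01,
                    10 + np.arange(8) * 0.01]).reshape(-1, 1)
clusters = bisecting_kmeans(X, 3)
```

The parallelism the context mentions comes from the structure of the loop: once a split is made, the two halves can be processed independently.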

3 | Best of both: a hybridized centroid-medoid clustering heuristic
- Best
- 2007
Citation Context: ...onquer approach. Our idea is similar in spirit to that of a recent “best of both” approach to improving k-means performance, where the authors combine batch k-means with a k-medoid style local search [7], and also to the sampling strategies for PAM mentioned in the previous section. We would like to scale k-medoid data reduction by restricting the problem space to groups of already similar objects out...

3 | Clump: A scalable and robust framework for structure discovery
- Punera, Ghosh
- 2005
Citation Context: ...er, or LADAR scanners used in autonomous navigation and obstacle avoidance). Data reduction has also been used to ease storage costs for instance-based learning [9] and for model selection algorithms [14]. This latter application is a particular motivation for this work, as model selection (that is, finding the number of clusters from the data) is mainly concerned with the shape of the data, and there...