Results 1–10 of 32
Incremental Clustering and Dynamic Information Retrieval
, 1997
Abstract

Cited by 153 (5 self)
Motivated by applications such as document and image classification in information retrieval, we consider the problem of clustering dynamic point sets in a metric space. We propose a model called incremental clustering which is based on a careful analysis of the requirements of the information retrieval application, and which should also be useful in other applications. The goal is to efficiently maintain clusters of small diameter as new points are inserted. We analyze several natural greedy algorithms and demonstrate that they perform poorly. We propose new deterministic and randomized incremental clustering algorithms which have provably good performance. We complement our positive results with lower bounds on the performance of incremental algorithms. Finally, we consider the dual clustering problem where the clusters are of fixed diameter, and the goal is to minimize the number of clusters.

1 Introduction

We consider the following problem: as a sequence of points from a metric...
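The incremental setting described above can be illustrated with a toy routine in the spirit of doubling-style incremental clustering: keep at most k centers, and when a new point cannot be absorbed, coarsen the scale. The function name, the merge rule, and the initial radius are illustrative assumptions, not the paper's algorithm.

```python
import math

def incremental_centers(points, k, d0=1.0):
    """Toy doubling-style incremental clustering: keep at most k
    centers; when a new point fits no cluster of radius d and the
    center budget is exceeded, merge nearby centers and double d.
    (Illustrative sketch only, not the paper's exact algorithm.)"""
    centers, d = [], d0
    for p in points:
        if any(math.dist(p, c) <= d for c in centers):
            continue                      # absorbed by an existing cluster
        centers.append(p)
        while len(centers) > k:
            d *= 2                        # coarsen the working scale
            merged = []
            for c in centers:
                if all(math.dist(c, m) > d for m in merged):
                    merged.append(c)      # keep only well-separated centers
            centers = merged
    return centers, d
```

The doubling step guarantees termination: once d exceeds the largest pairwise distance, the merge collapses everything to a single center.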
Similarity-based approaches to natural language processing
, 1997
Abstract

Cited by 40 (3 self)
Statistical methods for automatically extracting information about associations between words or documents from large collections of text have the potential to have considerable impact in a number of areas, such as information retrieval and natural-language-based user interfaces. However, even huge bodies of text yield highly unreliable estimates of the probability of relatively common events, and, in fact, perfectly reasonable events may not occur in the training data at all. This is known as the sparse data problem. Traditional approaches to the sparse data problem use crude approximations. We propose a different solution: if we are able to organize the data into classes of similar events, then, if information about an event is lacking, we can estimate its behavior from information about similar events. This thesis presents two such similarity-based approaches, where, in general, we measure similarity by the Kullback-Leibler divergence, an information-theoretic quantity. Our first approach is to build soft, hierarchical clusters: soft, because each event belongs to each cluster with some probability; hierarchical, because cluster centroids are iteratively split to model finer distinctions. Our clustering method, which uses the technique of deterministic annealing,
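The Kullback-Leibler divergence used here as the similarity measure is straightforward to compute for discrete distributions; a minimal sketch, assuming the second distribution has been smoothed so it is nonzero wherever the first is:

```python
import math

def kl_divergence(p, q):
    """D(p || q) for discrete distributions given as dicts mapping
    outcome -> probability. Assumes q[x] > 0 wherever p[x] > 0
    (e.g. after smoothing), so the sum is finite."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)
```

Similar events are then those whose distributions sit at small divergence from the target's.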
J-Means: A New Local Search Heuristic for Minimum Sum-of-Squares Clustering
Abstract

Cited by 35 (10 self)
A new local search heuristic, called J-Means, is proposed for solving the minimum sum-of-squares clustering problem. The neighborhood of the current solution is defined by all possible centroid-to-entity relocations followed by corresponding changes of assignments. Moves are made in such neighborhoods until a local optimum is reached. The new heuristic is compared with two other well-known local search heuristics, K-Means and H-Means, as well as with H-Means+, an improved version of the latter in which degeneracy is removed. Moreover, another heuristic, which fits into the Variable Neighborhood Search metaheuristic framework and uses J-Means in its local search step, is proposed too. Results on standard test problems from the literature are reported. It appears that J-Means outperforms the other local search methods, quite substantially when many entities and clusters are considered.

1 Introduction

Consider a set X = {x_1, ..., x_N}, x_j = (x_1j, ..., x_qj) ∈ R^q of N entiti...
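The centroid-to-entity relocation neighborhood can be sketched as follows. This is a simplified illustration: the real J-Means restricts relocations to unoccupied entities and interleaves them with K-Means refinement, neither of which is reproduced here.

```python
import math

def sse(points, centroids, assign):
    """Sum-of-squares objective for a given assignment."""
    return sum(math.dist(p, centroids[assign[i]]) ** 2
               for i, p in enumerate(points))

def assign_nearest(points, centroids):
    """Assign each point to the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def j_means_step(points, centroids):
    """One relocation-neighborhood search: try moving each centroid
    onto each entity and keep the best improving move (sketch)."""
    assign = assign_nearest(points, centroids)
    best = sse(points, centroids, assign)
    best_cents = centroids
    for j in range(len(centroids)):
        for p in points:
            cand = list(centroids)
            cand[j] = p                       # relocate centroid j onto entity p
            a = assign_nearest(points, cand)
            v = sse(points, cand, a)
            if v < best:
                best, best_cents = v, cand
    return best_cents, best
```

Iterating `j_means_step` until no move improves the objective yields a local optimum in this neighborhood.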
A Theory of Proximity Based Clustering: Structure Detection by Optimization
 Pattern Recognition
, 1999
Abstract

Cited by 34 (8 self)
In this paper, a systematic optimization approach for clustering proximity or similarity data is developed. Starting from fundamental invariance and robustness properties, a set of axioms is proposed and discussed to distinguish different cluster compactness and separation criteria. The approach covers the case of sparse proximity matrices, and is extended to nested partitionings for hierarchical data clustering. To solve the associated optimization problems, a rigorous mathematical framework for deterministic annealing and mean-field approximation is presented. Efficient optimization heuristics are derived in a canonical way, which also clarifies the relation to stochastic optimization by Gibbs sampling. Similarity-based clustering techniques have a broad range of possible applications in computer vision, pattern recognition, and data analysis. As a major practical application we present a novel approach to the problem of unsupervised texture segmentation, which relies on statistical...
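The deterministic-annealing and mean-field idea can be illustrated with temperature-controlled soft assignments (a generic sketch, not the paper's derivation): at high temperature every object is spread over all clusters, and as T is lowered the assignments harden.

```python
import math

def soft_assignments(points, centroids, T):
    """Mean-field style soft assignment: Gibbs weights
    exp(-d^2 / T), normalized per point. At large T the weights
    approach uniform; at small T the nearest centroid dominates."""
    out = []
    for p in points:
        w = [math.exp(-math.dist(p, c) ** 2 / T) for c in centroids]
        z = sum(w)
        out.append([wi / z for wi in w])
    return out
```

Annealing alternates such assignment steps with centroid updates while gradually decreasing T, which is what helps the optimization avoid poor local minima.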
Inferring Congestion Sharing and Path Characteristics from Packet Interarrival Times
Abstract

Cited by 32 (0 self)
This paper presents new non-intrusive measurement techniques to detect sharing of upstream congestion and discover bottleneck router link speeds. Our techniques are completely passive and require only arrival times of packets and flow identifiers. Our technique for detecting shared congestion is based upon the observation that an aggregated arrival trace from flows that share a bottleneck has very different statistics from those that do not share a bottleneck. In particular, the entropy of the interarrival times is much lower for aggregated traffic sharing a bottleneck. Additionally, this paper identifies mode structure in the interarrival distribution that enables discovery of the link bandwidths of multiple upstream routers. We
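The entropy statistic at the heart of the detection technique is simple to compute from a trace; a sketch, where the bin width and the synthetic traces in the usage below are illustrative assumptions:

```python
import math
from collections import Counter

def interarrival_entropy(arrivals, bin_width=1.0):
    """Empirical entropy (bits) of binned packet interarrival times."""
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    counts = Counter(int(g / bin_width) for g in gaps)
    n = len(gaps)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Flows clocked out back-to-back by a shared bottleneck produce near-regular gaps and hence low entropy, while independently congested flows spread their gaps over many bins.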
Geometric Embeddings for Faster and Better Multi-Way Netlist Partitioning
 Proc. ACM/IEEE Design Automation Conf
, 1993
Abstract

Cited by 30 (15 self)
We give new, effective algorithms for k-way circuit partitioning in the two regimes of k ≪ n and k = Θ(n), where n is the number of modules in the circuit. We show that partitioning an appropriately designed geometric embedding of the netlist, rather than a traditional graph representation, yields improved results as well as large speedups. We derive d-dimensional geometric embeddings of the netlist via (i) a new "partitioning-specific" net model for constructing the Laplacian of the netlist, and (ii) computation of d eigenvectors of the netlist Laplacian; we then apply (iii) fast top-down and bottom-up geometric clustering methods.

1 Preliminaries

In top-down layout synthesis of complex VLSI systems, the goal of partitioning/clustering is to reveal the natural circuit structure, via a decomposition into k subcircuits which minimizes connectivity between subcircuits. A generic problem statement is as follows: k-Way Partitioning: Given a circuit netlist G = (V, E) with |V|...
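The eigenvector embedding of step (ii) can be sketched with a plain graph Laplacian; the paper's partitioning-specific net model for hypergraph netlists is not reproduced here, so an ordinary weighted adjacency matrix stands in for it.

```python
import numpy as np

def spectral_embedding(adj, d):
    """Embed graph nodes as rows of the d Laplacian eigenvectors with
    smallest nonzero eigenvalues (illustrative sketch). `adj` is a
    symmetric weighted adjacency matrix."""
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj                      # combinatorial Laplacian L = D - A
    vals, vecs = np.linalg.eigh(lap)     # eigenvalues in ascending order
    return vecs[:, 1:d + 1]              # skip the trivial constant eigenvector
```

Step (iii) then runs a geometric clustering such as k-means on the embedded rows; with d = 1 the sign of the Fiedler vector already exposes a natural 2-way cut.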
Geometric Clusterings
, 1990
Abstract

Cited by 28 (1 self)
A k-clustering of a given set of points in the plane is a partition of the points into k subsets ("clusters"). For any fixed k, we can find a k-clustering which minimizes any monotone function of the diameters or the radii of the clusters in polynomial time. The algorithm is based on the fact that any two clusters in an optimal solution can be separated by a line.
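For intuition, the objective (here: minimize the larger cluster diameter with k = 2) can be brute-forced on tiny inputs. The line-separation fact is what lets the paper replace this exponential search with a polynomial one; everything below is an illustrative sketch of the objective only.

```python
import math
from itertools import combinations

def diameter(pts):
    """Largest pairwise distance in a point set (0 for <2 points)."""
    return max((math.dist(a, b) for a, b in combinations(pts, 2)),
               default=0.0)

def best_2_clustering(points):
    """Exhaustive 2-clustering minimizing the larger diameter."""
    n = len(points)
    best_val, best_split = float("inf"), None
    for mask in range(1, 2 ** (n - 1)):       # point 0 fixed in cluster A
        b = [points[i] for i in range(1, n) if (mask >> (i - 1)) & 1]
        a = [points[i] for i in range(n)
             if i == 0 or not (mask >> (i - 1)) & 1]
        val = max(diameter(a), diameter(b))
        if val < best_val:
            best_val, best_split = val, (a, b)
    return best_val, best_split
```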
Clustering with the connectivity kernel
 In NIPS
, 2004
Abstract

Cited by 24 (1 self)
Clustering aims at extracting hidden structure in a data set. While the problem of finding compact clusters has been widely studied in the literature, extracting arbitrarily formed elongated structures is considered a much harder problem. In this paper we present a novel clustering algorithm which tackles the problem by a two-step procedure: first, the data are transformed in such a way that elongated structures become compact ones. In a second step, these new objects are clustered by optimizing a compactness-based criterion. The advantages of the method over related approaches are threefold: (i) robustness properties of compactness-based criteria naturally transfer to the problem of extracting elongated structures, leading to a model which is highly robust against outlier objects; (ii) the transformed distances induce a Mercer kernel which allows us to formulate a polynomial approximation scheme to the generally NP-hard clustering problem; (iii) the new method does not contain free kernel parameters, in contrast to methods like spectral clustering or mean-shift clustering.
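The transformation in the first step can be illustrated with the path-based ("minimax") distance, which collapses elongated chains of close points into compact groups; the paper's actual kernel construction involves more than this raw transform.

```python
def minimax_distances(d):
    """Path-based distance transform: dp[i][j] becomes the minimum,
    over all paths i -> j, of the largest single hop on the path.
    Computed with a Floyd-Warshall style update on a full distance
    matrix given as a list of lists."""
    n = len(d)
    dp = [row[:] for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                dp[i][j] = min(dp[i][j], max(dp[i][k], dp[k][j]))
    return dp
```

Two ends of a long chain of closely spaced points end up at small transformed distance, so a compactness-based criterion applied afterwards recovers the elongated cluster.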
Differential evolution and particle swarm optimisation in partitional clustering
 Comput. Stat. Data Anal
, 2006
Abstract

Cited by 21 (0 self)
In recent years, many partitional clustering algorithms based on genetic algorithms (GA) have been proposed to tackle the problem of finding the optimal partition of a data set. Surprisingly, very few studies considered alternative stochastic search heuristics other than GAs or simulated annealing. Two promising algorithms for numerical optimization, which are hardly known outside the heuristic search field, are particle swarm optimisation (PSO) and differential evolution (DE). In this study, we compared the performance of GAs with PSO and DE for a medoid evolution approach to clustering, which Paterlini and Minerva (2003) introduced in a previous paper. Moreover, we compared these results with the nominal classification, k-means and random search (RS) as a lower bound. Our results show that DE is clearly and consistently superior compared to GAs and PSO for hard clustering problems, both with respect to precision and robustness (reproducibility) of the results. Only for simple data sets can the GA and PSO obtain the same quality of results, in contrast to k-means and RS, and, as expected, for trivial problems all algorithms can obtain comparable results. Apart from superior performance, DE is very easy to implement and requires hardly any parameter tuning, compared to substantial tuning for GAs and PSOs. Our study shows that DE rather than GAs should receive primary attention in partitional cluster algorithms.

Keywords: Cluster analysis, partitional clustering, differential evolution, particle swarm optimization, genetic algorithms.
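A minimal DE/rand/1/bin loop over flattened centroid coordinates, minimizing the sum-of-squares criterion, shows why DE is easy to implement. Note the cited study evolves medoids rather than centroids, and all parameter values below (population size, F, CR) are illustrative assumptions.

```python
import math
import random

def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

def de_cluster(points, k, pop_size=20, gens=100, F=0.8, CR=0.9, seed=0):
    """DE/rand/1/bin over flattened centroid vectors (sketch)."""
    rng = random.Random(seed)
    q = len(points[0])                       # dimensionality of the data
    dim = k * q
    lo = [min(p[j] for p in points) for j in range(q)]
    hi = [max(p[j] for p in points) for j in range(q)]

    def decode(v):
        return [tuple(v[i * q:(i + 1) * q]) for i in range(k)]

    pop = [[rng.uniform(lo[j % q], hi[j % q]) for j in range(dim)]
           for _ in range(pop_size)]
    fit = [sse(points, decode(v)) for v in pop]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = rng.sample([x for x in range(pop_size) if x != i], 3)
            jr = rng.randrange(dim)          # forced crossover position
            trial = [pop[a][j] + F * (pop[b][j] - pop[c][j])
                     if (rng.random() < CR or j == jr) else pop[i][j]
                     for j in range(dim)]
            f = sse(points, decode(trial))
            if f <= fit[i]:                  # greedy one-to-one selection
                pop[i], fit[i] = trial, f
    best = min(range(pop_size), key=fit.__getitem__)
    return decode(pop[best]), fit[best]
```

The whole optimizer is the mutation line, the binomial crossover, and the greedy selection: there is essentially nothing else to tune, which matches the abstract's point about implementation effort.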
Distributional Similarity Models: Clustering vs. Nearest Neighbors
 Proceedings of the 37th Annual Meeting of the ACL, pp. 33–40, 1999
, 1999
Abstract

Cited by 20 (1 self)
Distributional similarity is a useful notion in estimating the probabilities of rare joint events. It has been employed both to cluster events according to their distributions, and to directly compute averages of estimates for distributional neighbors of a target event. Here, we examine the tradeoffs between model size and prediction accuracy for cluster-based and nearest-neighbor distributional models of unseen events.
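The nearest-neighbor variant can be sketched directly: rank candidate distributions by divergence from the target and average the neighbors' estimates for the unseen outcome. The smoothing constant and helper names below are illustrative assumptions.

```python
import math

def kl(p, q, eps=1e-9):
    """Smoothed KL-style dissimilarity between dict distributions;
    eps keeps the logarithm finite for outcomes unseen in q."""
    keys = set(p) | set(q)
    return sum((p.get(x, 0.0) + eps) *
               math.log((p.get(x, 0.0) + eps) / (q.get(x, 0.0) + eps))
               for x in keys)

def neighbor_estimate(target, candidates, outcome, m=2):
    """Estimate P(outcome | target) by averaging the estimates of the
    m distributional nearest neighbors of `target`."""
    nbrs = sorted(candidates, key=lambda q: kl(target, q))[:m]
    return sum(q.get(outcome, 0.0) for q in nbrs) / len(nbrs)
```

The cluster-based alternative replaces the per-target neighbor search with a fixed set of cluster centroids, trading prediction accuracy for a much smaller model, which is exactly the tradeoff the paper examines.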