Results 1-10 of 45
Streaming-Data Algorithms for High-Quality Clustering
, 2001
Cited by 74 (1 self)
As data gathering grows easier, and as researchers discover new ways to interpret data, streaming-data algorithms have become essential in many fields. Data stream computation precludes algorithms that require random access or large memory. In this paper, we consider the problem of clustering data streams, which is important in the analysis of a variety of sources of data streams, such as routing data, telephone records, web documents, and clickstreams. We provide a new clustering algorithm with theoretical guarantees on its performance. We give empirical evidence of its superiority over the commonly used k-Means algorithm. We then adapt our algorithm to be able to operate on data streams and experimentally demonstrate its superior performance in this context.
A Clustering Technique for the Identification of Piecewise Affine Systems
, 2001
Cited by 48 (7 self)
We propose a new technique for the identification of discrete-time hybrid systems in the PieceWise Affine (PWA) form. This problem can be formulated as the reconstruction of a possibly discontinuous PWA map with a multidimensional domain. To achieve this goal, we provide an algorithm that combines clustering, linear identification, and pattern recognition techniques. This makes it possible to identify both the affine submodels and the polyhedral partition of the domain on which each submodel is valid, avoiding gridding procedures. Moreover, the clustering step (used for classifying the data points) is performed in a suitably defined feature space, which also makes it possible to reconstruct different submodels that share the same coefficients but are defined on different regions. Measures of confidence on the samples are introduced and exploited to improve the performance of both the clustering and the final linear regression procedure.
Genetic algorithm-based clustering technique
 Pattern Recognition
, 2000
Cited by 44 (0 self)
A genetic algorithm-based clustering technique, called GA-clustering, is proposed in this article. The searching capability of genetic algorithms is exploited in order to search for appropriate cluster centres in the feature space such that a similarity metric of the resulting clusters is optimized. The chromosomes, which are represented as strings of real numbers, encode the centres of a fixed number of clusters. The superiority of the GA-clustering algorithm over the commonly used K-means algorithm is extensively demonstrated for four artificial and three real-life data sets.
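The chromosome encoding described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `decode` and `clustering_metric`, the toy data, and the choice of metric (sum of Euclidean distances from each point to its nearest encoded centre) are assumptions made for illustration, and the genetic operators (selection, crossover, mutation) are left out.

```python
import math

def decode(chromosome, k, dim):
    """Split a flat string of k*dim real numbers into k cluster centres."""
    if len(chromosome) != k * dim:
        raise ValueError("chromosome length must equal k * dim")
    return [tuple(chromosome[i * dim:(i + 1) * dim]) for i in range(k)]

def clustering_metric(chromosome, points, k, dim):
    """Assumed metric: sum of distances from each point to its nearest
    encoded centre. Lower values mean more compact clusters."""
    centres = decode(chromosome, k, dim)
    return sum(min(math.dist(p, c) for c in centres) for p in points)

# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = [0.0, 0.5, 10.0, 10.5]   # centres placed inside each group
bad = [5.0, 5.0, 5.0, 6.0]      # both centres between the groups
```

A genetic algorithm would evolve a population of such chromosomes and prefer those with a better metric value; here `good` scores lower (better) than `bad`.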
Document Clustering using Particle Swarm Optimization
 IEEE Swarm Intelligence Symposium, The Westin
, 2005
Cited by 22 (5 self)
Fast and high-quality document clustering algorithms play an important role in effectively navigating, summarizing, and organizing information. Recent studies have shown that partitional clustering algorithms are more suitable for clustering large datasets. However, the K-means algorithm, the most commonly used partitional clustering algorithm, can only generate a locally optimal solution. In this paper, we present a Particle Swarm Optimization (PSO) document clustering algorithm. Contrary to the localized searching of the K-means algorithm, the PSO clustering algorithm performs a globalized search in the entire solution space. In the experiments we conducted, we applied the PSO, K-means, and hybrid PSO clustering algorithms to four different text document datasets. The number of documents in the datasets ranges from 204 to over 800, and the number of terms ranges from over 5000 to over 7000. The results illustrate that the hybrid PSO algorithm can generate more compact clustering results than the K-means algorithm.
Document Clustering Analysis Based on Hybrid PSO+K-means Algorithm
 Special Issue
, 2005
Cited by 12 (6 self)
There is a tremendous proliferation in the amount of information available on the largest shared information source, the World Wide Web. Fast and high-quality document clustering algorithms play an important role in helping users to effectively navigate, summarize, and organize the information. Recent studies have shown that partitional clustering algorithms are more suitable for clustering large datasets. The K-means algorithm is the most commonly used partitional clustering algorithm because it can be easily implemented and is the most efficient one in terms of execution time. The major problem with this algorithm is that it is sensitive to the selection of the initial partition and may converge to a local optimum. In this paper, we present a hybrid Particle Swarm Optimization (PSO)+K-means document clustering algorithm that performs fast document clustering and can avoid being trapped in a locally optimal solution as well. For comparison purposes, we applied the PSO+K-means, PSO, K-means, and two other hybrid clustering algorithms to four different text document datasets. The number of documents in the datasets ranges from 204 to over 800, and the number of terms ranges from over 5000 to over 7000. The results illustrate that the PSO+K-means algorithm can generate more compact clustering results than the other four algorithms.
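The sensitivity of K-means to its starting point, mentioned in this abstract, can be seen on a tiny example. The sketch below is a plain Lloyd-style K-means written for illustration only (it is not the PSO+K-means hybrid from the paper); the helper names, the four-point dataset, and the two initialisations are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, centres, iters=100):
    """Plain K-means: alternate assignment and centroid update until stable.
    Returns the final centres and the sum of squared distances (cost)."""
    centres = list(centres)
    for _ in range(iters):
        groups = [[] for _ in centres]
        for p in points:
            j = min(range(len(centres)), key=lambda c: sq_dist(p, centres[c]))
            groups[j].append(p)
        # An empty cluster keeps its previous centre.
        new = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
               for g, c in zip(groups, centres)]
        if new == centres:
            break
        centres = new
    cost = sum(min(sq_dist(p, c) for c in centres) for p in points)
    return centres, cost

# Four corners of a 4 x 1 rectangle: the natural clusters are left and right.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

good_centres, good_cost = lloyd(points, [(0.0, 0.0), (4.0, 1.0)])
stuck_centres, stuck_cost = lloyd(points, [(2.0, 0.0), (2.0, 1.0)])
```

Starting from the second initialisation, the algorithm splits the points into a bottom and a top cluster and stops immediately with a much higher cost, which is the kind of local optimum a global search stage is meant to avoid.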
A New Conceptual Clustering Framework
 MACHINE LEARNING
, 2004
Cited by 12 (2 self)
We propose a new formulation of the conceptual clustering problem where the goal is to explicitly output a collection of simple and meaningful conjunctions of attributes that define the clusters. The formulation differs from previous approaches since the clusters discovered may overlap and also may not cover all the points. In addition, a point may be assigned to a cluster description even if it only satisfies most, and not necessarily all, of the attributes in the conjunction. Connections between this conceptual clustering problem and the maximum edge biclique problem are made. Simple, randomized algorithms are given that discover a collection of approximate conjunctive cluster descriptions in sublinear time.
Cluster generation and cluster labelling for web snippets: A fast and accurate hierarchical solution
 In Proceedings of the 13th Symposium on String Processing and Information Retrieval (SPIRE 2006)
, 2006
Cited by 10 (2 self)
This paper describes Armil, a metasearch engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted “external” metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms.
Self-organizing Maps as Substitutes for K-Means Clustering
 In
, 2005
Cited by 8 (3 self)
One of the most widely used clustering techniques in GISc problems is the k-means algorithm. One of the most important issues in the correct use of k-means is the initialization procedure that ultimately determines which part of the solution space will be searched. In this paper we briefly review different initialization procedures, and propose Kohonen's Self-Organizing Maps as the most convenient method, given the proper training parameters. Furthermore, we show that in the final stages of its training procedure the Self-Organizing Map algorithm is rigorously the same as the k-means algorithm. Thus we propose the use of Self-Organizing Maps as possible substitutes for the more classical k-means clustering algorithms.
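The claimed equivalence in the final training stage can be illustrated with a small Python sketch. It rests on assumptions not stated in the abstract: a one-dimensional map, a "bubble" neighbourhood (every unit within `radius` grid steps of the winner is updated), and an online k-means update with a fixed learning rate. All function names and data are hypothetical. Under these assumptions, once the radius has shrunk to 0 only the winning unit moves, which is the same update an online k-means step applies.

```python
def nearest(units, x):
    """Index of the unit whose weight vector is closest to x."""
    return min(range(len(units)),
               key=lambda j: sum((w - xi) ** 2 for w, xi in zip(units[j], x)))

def som_step(units, x, lr, radius):
    """One online SOM update on a 1-D map with a bubble neighbourhood."""
    bmu = nearest(units, x)  # best-matching unit
    return [
        [w + lr * (xi - w) for w, xi in zip(u, x)]
        if abs(j - bmu) <= radius else list(u)
        for j, u in enumerate(units)
    ]

def online_kmeans_step(centres, x, lr):
    """One online k-means update: move only the nearest centre towards x."""
    win = nearest(centres, x)
    return [
        [w + lr * (xi - w) for w, xi in zip(c, x)] if j == win else list(c)
        for j, c in enumerate(centres)
    ]

units = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
x = [0.9, 1.2]

late_som = som_step(units, x, lr=0.5, radius=0)    # final training stage
early_som = som_step(units, x, lr=0.5, radius=1)   # neighbours also move
kmeans = online_kmeans_step(units, x, lr=0.5)
```

With `radius=0` the two updates coincide; with `radius=1` the neighbouring units are dragged along as well, which is what distinguishes the earlier SOM training stages.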
Multiobjective Genetic Algorithm Partitioning for Hierarchical Learning of High-Dimensional Pattern Spaces: A Learning-Follows-Decomposition Strategy
, 1998
Cited by 7 (4 self)
In this paper, we present a novel approach to partitioning pattern spaces using a multiobjective genetic algorithm for identifying (near-)optimal subspaces for hierarchical learning. Our approach of "learning-follows-decomposition" is a generic solution to complex high-dimensional problems where the input space is partitioned prior to the hierarchical neural domain instead of by competitive learning. In this technique, clusters are generated on the basis of fitness of purpose; that is, they are explicitly optimized for their subsequent mapping onto the hierarchical classifier. Results of partitioning pattern spaces are presented. This strategy of preprocessing the data and explicitly optimizing the partitions for subsequent mapping onto a hierarchical classifier is found to reduce both the learning complexity and the classification time with no degradation in overall classification error rate. The classification performance of various algorithms is compared and it is suggested that the neural modules are superior for learning the localized decision surfaces of such partitions and offer better generalization.
Kboost: A Scalable Algorithm for High Quality Clustering of Microarray Gene Expression Data
 TR IIT2007015, Istituto di Informatica e Telematica del CNR
, 2007
Cited by 7 (3 self)
We consider the problem of partitioning, in a highly accurate and highly efficient way, a set of n documents lying in a metric space into k non-overlapping clusters. We augment the well-known furthest-point-first algorithm for k-center clustering in metric spaces with a filtering scheme based on the triangular inequality. We apply this algorithm to Web snippet clustering, comparing it against strong baselines consisting of recent, fast variants of the classical k-means iterative algorithm. Our main conclusion is that our method attains solutions of better or comparable accuracy, and does this within a fraction of the time required by the baselines. Our algorithm is thus valuable when, as in Web snippet clustering, either the real-time nature of the task or the large amount of data makes the poorly scalable, traditional clustering methods unsuitable.
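For readers unfamiliar with the furthest-point-first heuristic named in this abstract, here is a minimal Python sketch of the basic version only. It does not include the triangular-inequality filtering the authors add, and the function name, the Euclidean metric, the choice of the first centre, and the toy points are all assumptions made for illustration.

```python
import math

def furthest_point_first(points, k, first=0):
    """Basic furthest-point-first heuristic for k-center clustering.

    Starts from the point at index `first`, then repeatedly promotes the
    point that is furthest from its nearest centre. Returns the indices of
    the chosen centres and, for each point, the position of its centre in
    that list.
    """
    centres = [first]
    # dist[i] is the distance from point i to its nearest centre so far.
    dist = [math.dist(p, points[first]) for p in points]
    assign = [0] * len(points)
    while len(centres) < min(k, len(points)):
        nxt = max(range(len(points)), key=dist.__getitem__)
        centres.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < dist[i]:  # the new centre is closer: reassign the point
                dist[i] = d
                assign[i] = len(centres) - 1
    return centres, assign

# Hypothetical toy data: two pairs on a line and one distant outlier.
points = [(0, 0), (1, 0), (10, 0), (11, 0), (5, 20)]
centres, assign = furthest_point_first(points, k=3)
```

Each round costs one pass over the points, so the basic heuristic runs in O(nk) distance computations; the filtering scheme described in the abstract is aimed at skipping some of those computations.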