Results 1–10 of 28
Survey of clustering data mining techniques
, 2002
Abstract

Cited by 247 (0 self)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique ...
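The abstract's core definition, a division of data into groups of similar objects, with the clusters acting as a simplified model of the data, can be illustrated with a minimal k-means sketch. K-means is one of the classical partitioning methods such surveys cover; the toy 2-D points below are purely illustrative:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: partition 2-D points into k clusters of similar objects."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers, clusters

# Two well-separated groups: the 6 points are "represented" by just 2 centers.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, 2)
```

Representing the six points by two centers is exactly the simplification-with-loss-of-detail trade-off the survey describes.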
Automated Text Summarization in SUMMARIST
, 1999
Abstract

Cited by 135 (10 self)
SUMMARIST is an attempt to create a robust automated text summarization system, based on the equation: summarization = topic identification + interpretation + generation. Each of these stages contains several independent modules, many of them trained on large corpora of text. We describe the system's architecture and provide details of some of its modules.
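The equation reads as a three-stage pipeline. The sketch below is a hypothetical stand-in for SUMMARIST's modules: word frequency for topic identification, keyword matching for interpretation, and sentence extraction for generation. None of these toy functions are the system's actual trained components:

```python
from collections import Counter

def identify_topics(text, n=3):
    """Topic identification: score words by frequency (toy stand-in)."""
    words = [w.strip(".,").lower() for w in text.split()]
    content = [w for w in words if len(w) > 3]  # crude stopword filter
    return [w for w, _ in Counter(content).most_common(n)]

def interpret(sentences, topics):
    """Interpretation: keep sentences that mention a topic word."""
    return [s for s in sentences if any(t in s.lower() for t in topics)]

def generate(selected, limit=1):
    """Generation: emit the first `limit` selected sentences as the summary."""
    return " ".join(selected[:limit])

def summarize(text):
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    topics = identify_topics(text)
    return generate(interpret(sentences, topics))
```

The point is the staged decomposition, not the stage internals: each function can be swapped for a corpus-trained module without changing the pipeline shape.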
Enriching Very Large Ontologies Using the WWW
 In: Proc. of 1st International Workshop on Ontology Learning (OL 2000). Held in Conjunction with the 14th European Conference on Artificial Intelligence (ECAI 2000)
, 2000
Abstract

Cited by 104 (5 self)
This paper explores the possibility of exploiting text on the World Wide Web in order to enrich the concepts in existing ontologies. First, a method to retrieve documents from the WWW related to a concept is described. These document collections are used 1) to construct topic signatures (lists of topically related words) for each concept in WordNet, and 2) to build hierarchical clusters of the concepts (the word senses) that lexicalize a given word. The overall goal is to overcome two shortcomings of WordNet: the lack of topical links among concepts, and the proliferation of senses. Topic signatures are validated on a word sense disambiguation task with good results, which are improved when the hierarchical clusters are used.
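A topic signature, a list of topically related words for a concept, can be sketched by contrasting word frequencies in the concept's retrieved documents against a background collection. The frequency-ratio score below is a simplification chosen for illustration, not necessarily the weighting the paper uses:

```python
from collections import Counter

def topic_signature(concept_docs, background_docs, top_n=5):
    """Rank words that are relatively more frequent in the concept's documents."""
    def freqs(docs):
        c = Counter(w.lower() for d in docs for w in d.split())
        return c, sum(c.values())

    cf, ct = freqs(concept_docs)
    bf, bt = freqs(background_docs)
    scores = {}
    for w, n in cf.items():
        p_concept = n / ct
        p_background = (bf[w] + 1) / (bt + len(cf))  # add-one smoothing
        scores[w] = p_concept / p_background
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_n]]

# Toy corpora for the "bank (river)" sense vs. a generic financial background.
sig = topic_signature(
    ["river bank water flow", "water river erosion"],
    ["money market stock", "stock trading money"],
    top_n=2)
```

Words frequent in the concept documents but rare in the background float to the top, which is the behavior a signature needs for disambiguation.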
Parallel Algorithms for Hierarchical Clustering
 Parallel Computing
, 1995
Abstract

Cited by 80 (1 self)
Hierarchical clustering is a common method used to determine clusters of similar data points in multidimensional spaces. O(n²) algorithms are known for this problem [3, 4, 10, 18]. This paper reviews important results for sequential algorithms and describes previous work on parallel algorithms for hierarchical clustering. Parallel algorithms to perform hierarchical clustering using several distance metrics are then described. Optimal PRAM algorithms using n log n processors are given for the average link, complete link, centroid, median, and minimum variance metrics. Optimal butterfly and tree algorithms using n log n processors are given for the centroid, median, and minimum variance metrics. Optimal asymptotic speedups are achieved for the best practical algorithm to perform clustering using the single link metric on an n log n processor PRAM, butterfly, or tree. Keywords. Hierarchical clustering, pattern analysis, parallel algorithm, butterfly network, PRAM algorithm.
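The sequential baseline being parallelized can be sketched as naive agglomerative clustering: repeatedly merge the two closest clusters under a chosen linkage. The sketch below uses single linkage on 1-D points (an illustrative assumption; the paper also covers average, complete, centroid, median, and minimum-variance metrics), and its repeated pairwise scans show where the quadratic cost comes from:

```python
def single_link(points, k):
    """Agglomerative single-linkage clustering of 1-D points down to k clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest inter-point distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

# Two natural groups on the line.
groups = single_link([1, 2, 3, 10, 11, 12], 2)
```

Running the merge loop to completion (k = 1) would trace out the full dendrogram; the parallel algorithms in the paper distribute exactly this closest-pair search.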
An Analysis of Recent Work on Clustering Algorithms
, 1999
Abstract

Cited by 73 (0 self)
This paper describes four recent papers on clustering, each of which approaches the clustering problem from a different perspective and with different goals. It analyzes the strengths and weaknesses of each approach and describes how a user could decide which algorithm to use for a given clustering application. Finally, it concludes with ideas that could make the selection and use of clustering algorithms for data analysis less difficult.
Efficient Pose Clustering Using a Randomized Algorithm
, 1997
Abstract

Cited by 23 (6 self)
Pose clustering is a method to perform object recognition by determining hypothetical object poses and finding clusters of the poses in the space of legal object positions. An object that appears in an image will yield a large cluster of such poses close to the correct position of the object. If there are m model features and n image features, then there are O(m³n³) hypothetical poses that can be determined from minimal information for the case of recognition of three-dimensional objects from feature points in two-dimensional images. Rather than clustering all of these poses, we show that pose clustering can have equivalent performance for this case when examining only O(mn) poses, due to correlation between the poses, if we are given two correct matches between model features and image features. Since we do not usually know two correct matches in advance, this property is used with randomization to decompose the pose clustering problem into O(n²) problems, each of which...
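The counting argument can be made concrete: each hypothetical pose comes from matching a triple of model features to a triple of image features, giving O(m³n³) poses, while fixing two correct matches leaves only the third match free, an O(mn) set. A quick numeric check of the gap (the values of m and n are illustrative, not from the paper):

```python
from math import comb

m, n = 20, 30  # illustrative feature counts

# Choose 3 model features, 3 image features, and one of 3! correspondences.
all_poses = comb(m, 3) * comb(n, 3) * 6

# With two correct matches fixed, only the third correspondence varies.
reduced_poses = (m - 2) * (n - 2)
```

Even at these modest sizes the reduction is roughly five orders of magnitude, which is why randomizing over candidate match pairs pays off.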
Order-Theoretical Ranking
 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE (JASIS)
, 2000
Abstract

Cited by 21 (3 self)
Current best-match ranking (BMR) systems perform well but cannot handle word mismatch between a query and a document. The best known alternative ranking method, hierarchical clustering-based ranking (HCR), seems to be more robust than BMR with respect to this problem, but it is hampered by theoretical and practical limitations. We present an approach to document ranking that explicitly addresses the word mismatch problem by exploiting inter-document similarity information in a novel way. Document ranking is seen as a query-document transformation driven by a conceptual representation of the whole document collection, into which the query is merged. Our approach is based on the theory of concept (or Galois) lattices, which, we argue, provides a powerful, well-founded, and computationally tractable framework to model the space in which documents and query are represented and to compute such a transformation. We compared information retrieval using concept lattice-based ranking (CLR) to BMR and HCR. The results showed that HCR was outperformed by CLR as well as by BMR, and suggested that, of the two best methods, BMR achieved better performance than CLR on the whole document set while CLR compared more favorably when only the first retrieved documents were used for evaluation. We also evaluated the three methods' specific ability to rank documents that did not match the query, in which case the superiority of CLR over BMR and HCR (and that of HCR over BMR) was apparent.
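The Galois (concept) lattice underlying CLR pairs each set of documents with the full set of terms those documents share; a concept is a pair closed under this connection. A minimal sketch enumerating the formal concepts of a tiny binary document-term table (the three documents and their terms are toy data, not the paper's collection):

```python
from itertools import combinations

# Toy incidence relation: which terms occur in which documents.
docs = {
    "d1": {"cluster", "rank"},
    "d2": {"cluster", "query"},
    "d3": {"rank", "query"},
}

def common_terms(doc_subset):
    """Intent: terms shared by every document in the subset."""
    sets = [docs[d] for d in doc_subset]
    return set.intersection(*sets) if sets else set.union(*docs.values())

def docs_with_terms(terms):
    """Extent: documents containing every term in the set."""
    return {d for d, t in docs.items() if terms <= t}

def formal_concepts():
    """Enumerate (extent, intent) pairs closed under the Galois connection."""
    concepts = set()
    names = sorted(docs)
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            intent = common_terms(subset)
            extent = frozenset(docs_with_terms(intent))
            concepts.add((extent, frozenset(intent)))
    return concepts
```

Ordering these concepts by extent inclusion yields the lattice; ranking in CLR then amounts to measuring how far each document's concept sits from the concept the query maps into.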
Time and Space Efficient Pose Clustering
 In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, 1994
Abstract

Cited by 14 (6 self)
This paper shows that the pose clustering method of object recognition can be decomposed into small subproblems without loss of accuracy. Randomization can then be used to limit the number of subproblems that need to be examined to achieve accurate recognition. These techniques are used to decrease the computational complexity of pose clustering. The clustering step is formulated as an efficient tree search of the pose space. This method requires little memory since not many poses are clustered at a time. Analysis shows that pose clustering is not inherently more sensitive to noise than other methods of generating hypotheses. Finally, experiments on real and synthetic data are presented. Model-based object recognition systems determine which objects appear in images using a catalog of object models and estimate their positions and orientations (poses) relative to the camera. This paper examines methods of improving the efficiency of the pose clustering method of object ...
Clustering in Massive Data Sets
 Handbook of massive data sets
, 1999
Abstract

Cited by 11 (0 self)
We review the time and storage costs of search and clustering algorithms. We exemplify these, based on case studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Sections 2 to 6 relate to nearest neighbor searching, an elemental form of clustering, and a basis for the clustering algorithms that follow. Sections 7 to 11 review a number of families of clustering algorithms. Sections 12 to 14 relate to visual or image representations of data sets, from which a number of interesting algorithmic developments arise.
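Nearest neighbor searching, the elemental operation this chapter builds on, costs O(n) distance evaluations per query when done by brute force; the indexing structures the chapter surveys exist to beat this baseline. A minimal sketch (squared Euclidean distance on toy 2-D points):

```python
def nearest_neighbor(query, points):
    """Brute-force linear scan: one distance evaluation per stored point."""
    return min(points, key=lambda p: sum((a - b) ** 2 for a, b in zip(query, p)))

pts = [(0.0, 0.0), (5.0, 5.0), (9.0, 1.0)]
best = nearest_neighbor((4.0, 4.0), pts)
```

For massive data sets this linear scan per query is exactly the cost that tree- and grid-based methods amortize away.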
Clustering the Hypercube
 SFB Report Series 93, TU Graz
, 1996
Abstract

Cited by 3 (2 self)
In this paper we consider various clustering methods for objects represented as binary strings of fixed length d. The dissimilarity of two given objects is the number of disagreeing bits, that is, their Hamming distance. Clustering these objects can be seen as clustering a subset of the vertices of a d-dimensional hypercube, and thus is a geometric problem in d dimensions. We give algorithms for various agglomerative hierarchical methods (including single linkage and complete linkage) as well as for two-clusterings and divisive methods. We only present linear space algorithms since for most practical applications the number of objects to be clustered is usually too large for non-linear space solutions to be practicable. All algorithms are easy to implement and the constants in their asymptotic runtime are small. We give experimental results for all cluster methods considered, and for uniformly distributed hypercube vertices as well as for specially chosen sets. These experiments indicate ...
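The setting can be sketched directly: dissimilarity is the count of disagreeing bits, and a divisive two-clustering can split the vertex set around its most dissimilar pair. The split-around-the-farthest-pair heuristic below is an illustrative assumption, not the paper's (linear-space) algorithm:

```python
from itertools import combinations

def hamming(a, b):
    """Dissimilarity of two equal-length binary strings: disagreeing bits."""
    return sum(x != y for x, y in zip(a, b))

def two_clustering(strings):
    """Divisive two-clustering: split around the two most dissimilar strings."""
    c1, c2 = max(combinations(strings, 2), key=lambda p: hamming(*p))
    groups = {c1: [c1], c2: [c2]}
    for s in strings:
        if s not in (c1, c2):
            # Assign each remaining string to the nearer of the two pivots.
            pivot = c1 if hamming(s, c1) <= hamming(s, c2) else c2
            groups[pivot].append(s)
    return list(groups.values())

# Four vertices of the 4-cube forming two tight antipodal pairs.
groups = two_clustering(["0000", "0001", "1110", "1111"])
```

Each string is a hypercube vertex, so the split is a geometric partition of a vertex subset of the d-cube, which is the view the paper exploits.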