Results 1  10
of
79
Survey of clustering data mining techniques
, 2002
"... Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in math ..."
Abstract

Cited by 351 (0 self)
 Add to MetaCart
(Show Context)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
Evaluation of Hierarchical Clustering Algorithms for Document Datasets
 Data Mining and Knowledge Discovery
, 2002
"... Fast and highquality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at ..."
Abstract

Cited by 230 (6 self)
 Add to MetaCart
(Show Context)
Fast and highquality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections.
PeertoPeer Information Retrieval Using SelfOrganizing Semantic Overlay Networks
, 2003
"... Contentbased fulltext search is a challenging problem in PeertoPeer (P2P) systems. Traditional approaches have either been centralized or use flooding to ensure accuracy of the results returned. In this paper, we present pSearch, a decentralized nonflooding P2P information retrieval system. pSea ..."
Abstract

Cited by 228 (7 self)
 Add to MetaCart
(Show Context)
Contentbased fulltext search is a challenging problem in PeertoPeer (P2P) systems. Traditional approaches have either been centralized or use flooding to ensure accuracy of the results returned. In this paper, we present pSearch, a decentralized nonflooding P2P information retrieval system. pSearch distributes document indices through the P2P network based on document semantics generated by Latent Semantic Indexing (LSI). The search cost (in terms of different nodes searched and data transmitted) for a given query is thereby reduced, since the indices of semantically related documents are likely to be colocated in the network. We also describe techniques that help distribute the indices more evenly across the nodes, and further reduce the number of nodes accessed using appropriate index distribution as well as using index samples and recently processed queries to guide the search. Experiments show that pSearch can achieve performance comparable to centralized information retrieval systems by searching only a small number of nodes. For a system with 128,000 nodes and 528,543 documents (from news, magazines, etc.), pSearch searches only 19 nodes and transmits only 95.5KB data during the search, whereas the top 15 documents returned by pSearch and LSI have a 91.7% intersection.
Criterion Functions for Document Clustering: Experiments and Analysis
, 2002
"... In recent years, we have witnessed a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and companywide intranets. This has led to an increased interest in developing methods that can help users to effectively navigate, summarize, and org ..."
Abstract

Cited by 186 (12 self)
 Add to MetaCart
(Show Context)
In recent years, we have witnessed a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and companywide intranets. This has led to an increased interest in developing methods that can help users to effectively navigate, summarize, and organize this information with the ultimate goal of helping them to find what they are looking for. Fast and highquality document clustering algorithms play an important role towards this goal as they have been shown to provide both an intuitive navigation/browsing mechanism by organizing large amounts of information into a small number of meaningful clusters as well as to greatly improve the retrieval performance either via clusterdriven dimensionality reduction, termweighting, or query expansion. This everincreasing importance of document clustering and the expanded range of its applications led to the development of a number of new and novel algorithms with different complexityquality tradeoffs. Among them, a class of clustering algorithms that have relatively low computational requirements are those that treat the clustering problem as an optimization process which seeks to maximize or minimize a particular clustering criterion function defined over the entire clustering solution.
Discovery and Evaluation of Aggregate Usage Profiles for Web Personalization
 Data Mining and Knowledge Discovery
, 2002
"... Web usage mining, possibly used in conjunction with standard approaches to personalization such as collaborative filtering, can help address some of the shortcomings of these techniques, including reliance on subjective user ratings, lack of scalability, and poor performance in the face of highdime ..."
Abstract

Cited by 130 (15 self)
 Add to MetaCart
(Show Context)
Web usage mining, possibly used in conjunction with standard approaches to personalization such as collaborative filtering, can help address some of the shortcomings of these techniques, including reliance on subjective user ratings, lack of scalability, and poor performance in the face of highdimensional and sparse data. However, the discovery of patterns from usage data by itself is not sufficient for performing the personalization tasks. The critical step is the effective derivation of good quality and useful (i.e., actionable) "aggregate usage profiles" from these patterns. In this paper we present and experimentally evaluate two techniques, based on clustering of user transactions and clustering of pageviews, in order to discover overlapping aggregate profiles that can be effectively used by recommender systems for realtime Web personalization. We evaluate these techniques both in terms of the quality of the individual profiles generated, as well as in the context of providing recommendations as an integrated part of a personalization engine. In particular, our results indicate that using the generated aggregate profiles, we can achieve effective personalization at early stages of users' visits to a site, based only on anonymous clickstream data and without the benefit of explicit input by these users or deeper knowledge about them.
Orthogonal nonnegative matrix trifactorizations for clustering
 In SIGKDD
, 2006
"... Currently, most research on nonnegative matrix factorization (NMF) focus on 2factor X = FG T factorization. We provide a systematic analysis of 3factor X = FSG T NMF. While unconstrained 3factor NMF is equivalent to unconstrained 2factor NMF, constrained 3factor NMF brings new features to constr ..."
Abstract

Cited by 102 (21 self)
 Add to MetaCart
(Show Context)
Currently, most research on nonnegative matrix factorization (NMF) focus on 2factor X = FG T factorization. We provide a systematic analysis of 3factor X = FSG T NMF. While unconstrained 3factor NMF is equivalent to unconstrained 2factor NMF, constrained 3factor NMF brings new features to constrained 2factor NMF. We study the orthogonality constraint because it leads to rigorous clustering interpretation. We provide new rules for updating F,S,G and prove the convergence of these algorithms. Experiments on 5 datasets and a real world case study are performed to show the capability of biorthogonal 3factor NMF on simultaneously clustering rows and columns of the input data matrix. We provide a new approach of evaluating the quality of clustering on words using class aggregate distribution and multipeak distribution. We also provide an overview of various NMF extensions and examine their relationships.
Empirical and theoretical comparisons of selected criterion functions for document clustering
 Machine Learning
"... Abstract. This paper evaluates the performance of different criterion functions in the context of partitional clustering algorithms for document datasets. Our study involves a total of seven different criterion functions, three of which are introduced in this paper and four that have been proposed i ..."
Abstract

Cited by 100 (7 self)
 Add to MetaCart
Abstract. This paper evaluates the performance of different criterion functions in the context of partitional clustering algorithms for document datasets. Our study involves a total of seven different criterion functions, three of which are introduced in this paper and four that have been proposed in the past. We present a comprehensive experimental evaluation involving 15 different datasets, as well as an analysis of the characteristics of the various criterion functions and their effect on the clusters they produce. Our experimental results show that there are a set of criterion functions that consistently outperform the rest, and that some of the newly proposed criterion functions lead to the best overall results. Our theoretical analysis shows that the relative performance of the criterion functions depends on (i) the degree to which they can correctly operate when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters. Keywords:
Discovery of Aggregate Usage Profiles for Web Personalization
 Proceedings of the WebKDD Workshop at the ACM SIGKKD
"... ..."
(Show Context)
Adaptive Dimension Reduction for Clustering High Dimensional Data
, 2002
"... It is wellknown that for high dimensional data clustering, standard algorithms such as EM and the Kmeans are often trapped in local minimum. Many initialization methods were proposed to tackle this problem , but with only limited success. In this paper we propose a new approach to resolve this pro ..."
Abstract

Cited by 64 (2 self)
 Add to MetaCart
It is wellknown that for high dimensional data clustering, standard algorithms such as EM and the Kmeans are often trapped in local minimum. Many initialization methods were proposed to tackle this problem , but with only limited success. In this paper we propose a new approach to resolve this problem by repeated dimension reductions such that Kmeans or EM are performed only in very low dimensions. Cluster membership is utilized as a bridge between the reduced dimensional subspace and the original space, providing flexibility and ease of implementation. Clustering analysis performed on highly overlapped Gaussians, DNA gene expression profiles and internet newsgroups demonstrate the e#ectiveness of the proposed algorithm.
On Scaling Latent Semantic Indexing for Large PeerToPeer Systems
 Proc. 27th Annual International ACM SIGIR Conference
, 2004
"... The exponential growth of data demands scalable infrastructures capable of indexing and searching rich content such as text, music, and images. A promising direction is to combine information retrieval with peertopeer technology for scalability, faulttolerance, and low administration cost. One pi ..."
Abstract

Cited by 36 (0 self)
 Add to MetaCart
(Show Context)
The exponential growth of data demands scalable infrastructures capable of indexing and searching rich content such as text, music, and images. A promising direction is to combine information retrieval with peertopeer technology for scalability, faulttolerance, and low administration cost. One pioneering work along this direction is pSearch [32, 33]. pSearch places documents onto a peerto peer overlay network according to semantic vectors produced using Latent Semantic Indexing (LSI). The search cost for a query is reduced since documents related to the query are likely to be colocated on a small number of nodes. Unfortunately, because of its reliance on LSI, pSearch also inherits the limitations of LSI. (1) When the corpus is large and heterogeneous, LSI's retrieval quality is inferior to methods such as Okapi. (2) The Singular Value Decomposition (SVD) used in LSI is unscalable in terms of both memory consumption and computation time.