Results 1 - 10
of
54
Web Document Clustering: A Feasibility Demonstration
, 1998
"... Abstract Users of Web search engines are often forced to sift through the long ordered list of document “snippets” returned by the engines. The IR community has explored document clustering as an alternative method of organizing retrieval results, but clustering has yet to be deployed on the major s ..."
Abstract
-
Cited by 279 (3 self)
- Add to MetaCart
Abstract Users of Web search engines are often forced to sift through the long ordered list of document “snippets” returned by the engines. The IR community has explored document clustering as an alternative method of organizing retrieval results, but clustering has yet to be deployed on the major search engines. The paper articulates the unique requirements of Web document clustering and reports on the first evaluation of clustering methods in this domain. A key requirement is that the methods create their clusters based on the short snippets returned by Web search engines. Surprisingly, we find that clusters based on snippets are almost as good as clusters created using the full text of Web documents. To satisfy the stringent requirements of the Web domain, we introduce an incremental, linear time (in the document collection size) algorithm called Suffix Tree Clustering (STC). which creates clusters based on phrases shared between documents. We show that STC is faster than standard clustering methods in this domain, and argue that Web document clustering via STC is both feasible and potentially beneficial. 1
Concept Decompositions for Large Sparse Text Data using Clustering
- Machine Learning
, 2000
"... . Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors--a few thousand dimensions and a sparsity of 95 to 99 ..."
Abstract
-
Cited by 231 (23 self)
- Add to MetaCart
. Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors--a few thousand dimensions and a sparsity of 95 to 99% is typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high-dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain "fractal-like" and "self-similar" behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned...
Grouper: A Dynamic Clustering Interface to Web Search Results
, 1999
"... Users of Web search engines are often forced to sift through the long ordered list of document "snippets" returned by the engines. The IR community has explored document clustering as an alternative method of organizing retrieval results, but clustering has yet to be deployed on most major search en ..."
Abstract
-
Cited by 196 (2 self)
- Add to MetaCart
Users of Web search engines are often forced to sift through the long ordered list of document "snippets" returned by the engines. The IR community has explored document clustering as an alternative method of organizing retrieval results, but clustering has yet to be deployed on most major search engines. The NorthernLight search engine organizes its output into "custom folders" based on pre-computed document labels, but does not reveal how the folders are generated or how well they correspond to users' interests. In this paper, we introduce Grouper -- an interface to the results of the HuskySearch meta-search engine, which dynamically groups the search results into clusters labeled by phrases extracted from the snippets. In addition, we report on the first empirical comparison of user Web search behavior on a standard ranked-list presentation versus a clustered presentation. By analyzing HuskySearch logs, we are able to demonstrate substantial differences in the number of documents f...
Towards Adaptive Web Sites: Conceptual Framework and Case Study
- ARTIFICIAL INTELLIGENCE
, 2000
"... The creation of a complex web site is a thorny problem in user interface design. In this paper we explore the notion of adaptiveweb sites: sites that semi-automatically improve their organization and presentation by learning from visitor access patterns. It is easy to imagine and implementweb sit ..."
Abstract
-
Cited by 122 (4 self)
- Add to MetaCart
The creation of a complex web site is a thorny problem in user interface design. In this paper we explore the notion of adaptiveweb sites: sites that semi-automatically improve their organization and presentation by learning from visitor access patterns. It is easy to imagine and implementweb sites that offer shortcuts to popular pages. Are more sophisticated adaptiveweb sites feasible? What degree of automation can weachieve? To address the questions above, we describe the design space of adaptiveweb sites and consider a case study: the problem of synthesizing new index pages that facilitate navigation of a web site. We presentthePageGather algorithm, which automatically identifies candidate link sets to include in index pages based on user access logs. We demonstrate experimentally that PageGather outperforms the Apriori data mining algorithm on this task. In addition, we compare PageGather's link sets to pre-existing, human-authored index pages.
Cluster-based retrieval using language models
- In Proceedings of SIGIR
, 2004
"... Previous research on cluster-based retrieval has been inconclusive as to whether it does bring improved retrieval effectiveness over document-based retrieval. Recent developments in the language modeling approach to IR have motivated us to re-examine this problem within this new retrieval framework. ..."
Abstract
-
Cited by 90 (6 self)
- Add to MetaCart
Previous research on cluster-based retrieval has been inconclusive as to whether it does bring improved retrieval effectiveness over document-based retrieval. Recent developments in the language modeling approach to IR have motivated us to re-examine this problem within this new retrieval framework. We propose two new models for cluster-based retrieval and evaluate them on several TREC collections. We show that cluster-based retrieval can perform consistently across collections of realistic size, and significant improvements over document-based retrieval can be obtained in a fully automatic manner and without relevance information provided by human.
Fast and Intuitive Clustering of Web Documents
- In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining
, 1997
"... Conventional document retrieval systems (e.g., Alta Vista) return long lists of ranked documents in response to user queries. Recently, document clustering has been put forth as an alternative method of organizing retrieval results (Cutting et al. 1992). A person browsing the clusters can discover ..."
Abstract
-
Cited by 87 (2 self)
- Add to MetaCart
Conventional document retrieval systems (e.g., Alta Vista) return long lists of ranked documents in response to user queries. Recently, document clustering has been put forth as an alternative method of organizing retrieval results (Cutting et al. 1992). A person browsing the clusters can discover patterns that could be overlooked in the traditional presentation. This paper describes two novel clustering methods that intersect the documents in a cluster to determine the set of words (or phrases) shared by all the documents in the cluster. We report on experiments that evaluate these intersectionbased clustering methods on collections of snippets returned from Web search engines. First, we show that word-intersection clustering produces superior clusters and does so faster than standard techniques. Second, we show that our O(n log n) time phrase-intersection clustering method produces comparable clusters and does so more than two orders of magnitude faster than all methods tested. I...
A Data-Clustering Algorithm On Distributed Memory Multiprocessors
- In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence
, 2000
"... To cluster increasingly massive data sets that are common today in data and text mining, we propose a parallel implementation of the k-means clustering algorithm based on the message passing model. The proposed algorithm exploits the inherent data-parallelism in the k-means algorithm. We analyticall ..."
Abstract
-
Cited by 79 (1 self)
- Add to MetaCart
To cluster increasingly massive data sets that are common today in data and text mining, we propose a parallel implementation of the k-means clustering algorithm based on the message passing model. The proposed algorithm exploits the inherent data-parallelism in the k-means algorithm. We analytically show that the speedup and the scaleup of our algorithm approach the optimal as the number of data points increases. We implemented our algorithm on an IBM POWERparallel SP2 with a maximum of 16 nodes. On typical test data sets, we observe nearly linear relative speedups, for example, 15.62 on 16 nodes, and essentially linear scaleup in the size of the data set and in the number of clusters desired. For a 2 gigabyte test data set, our implementation drives the 16 node SP2 at more than 1.8 gigaflops. Keywords: k-means, data mining, massive data sets, message-passing, text mining. 1 Introduction Data sets measuring in gigabytes and even terabytes are now quite common in data and text minin...
Efficient Clustering Of Very Large Document Collections
, 2001
"... An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical da ..."
Abstract
-
Cited by 74 (9 self)
- Add to MetaCart
An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time and memory ecient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-ecient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented - a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
A lattice conceptual clustering system and its application to browsing retrieval
- Machine Learning
, 1996
"... Abstract. The theory of concept (or Galois) lattices provides a simple and formal approach to conceptual clustering. In this paper we present GALOIS, a system that automates and applies this theory. The algorithm utilized by GALOIS to build a concept lattice is incremental and efficient, each update ..."
Abstract
-
Cited by 66 (6 self)
- Add to MetaCart
Abstract. The theory of concept (or Galois) lattices provides a simple and formal approach to conceptual clustering. In this paper we present GALOIS, a system that automates and applies this theory. The algorithm utilized by GALOIS to build a concept lattice is incremental and efficient, each update being done in time at most quadratic in the number of objects in the lattice. Also, the algorithm may incorporate background information into the lattice, and through clustering, extend the scope of the theory. The application we present is concerned with information retrieval via browsing, for which we argue that concept lattices may represent major support structures. We describe a prototype user interface for browsing through the concept lattice of a document-term relation, possibly enriched with a thesaurus of terms. An experimental evaluation of the system performed on a medium-sized bibliographic database shows good retrieval performance and a significant improvement after the introduction of background knowledge.

