Results 1 - 10 of 12
Document Clustering using Word Clusters via the Information Bottleneck Method
- In ACM SIGIR 2000, 2000
Cited by 123 (16 self)
We present a novel implementation of the recently introduced information bottleneck method for unsupervised document clustering. Given a joint empirical distribution of words and documents, p(x, y), we first cluster the words, Y, so that the obtained word clusters, Y_hat, maximally preserve the information on the documents. The resulting joint distribution, p(X, Y_hat), contains most of the original information about the documents, I(X; Y_hat) ~= I(X; Y), but it is much less sparse and noisy. Using the same procedure we then cluster the documents, X, so that the information about the word clusters is preserved. Thus, we first find word clusters that capture most of the mutual information about the set of documents, and then find document clusters that preserve the information about the word clusters. We tested this procedure over several document collections based on subsets taken from the standard 20Newsgroups corpus. The results were assessed by calculating the correlation between the document clusters and the correct labels for these documents. Findings from our experiments show that this double clustering procedure, which uses the information bottleneck method, yields significantly superior performance compared to other common document distributional clustering algorithms. Moreover, the double clustering procedure improves all the distributional clustering methods examined here.
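The quantity this abstract relies on, the mutual information between documents and words and how much of it survives when words are merged into clusters, can be illustrated on a toy joint distribution. The sketch below is not the authors' clustering algorithm: the function names, the 2 x 4 toy table, and the two hand-picked word groupings are assumptions made only for illustration. It shows that one grouping of words keeps essentially all of I(X; Y) while another destroys it.

```python
import math

def mutual_information(joint):
    """Return I(X; Y) in bits for a joint table joint[x][y] = p(x, y)."""
    px = [sum(row) for row in joint]            # marginal over documents
    py = [sum(col) for col in zip(*joint)]      # marginal over words
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:                           # 0 * log(0) is taken as 0
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi

def merge_columns(joint, clusters):
    """Collapse word columns into clusters: p(x, y_hat) = sum of p(x, y) for y in y_hat."""
    return [[sum(row[j] for j in cluster) for cluster in clusters] for row in joint]

# Hypothetical toy data: 2 documents (rows) and 4 words (columns).
joint = [
    [0.20, 0.20, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.20],
]

# Grouping words with similar document profiles keeps the information;
# grouping words with opposite profiles averages it away.
informative = merge_columns(joint, [[0, 1], [2, 3]])
uninformative = merge_columns(joint, [[0, 2], [1, 3]])

i_full = mutual_information(joint)
i_informative = mutual_information(informative)
i_uninformative = mutual_information(uninformative)
print(i_full, i_informative, i_uninformative)
```

A clustering procedure in the spirit of the abstract would search for the grouping that loses the least mutual information, rather than being handed the groupings as it is here.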
What the Query Told the Link: The Integration of Hypertext and Information Retrieval
- In Proceedings of Hypertext 97, 1997
Cited by 37 (5 self)
Traditionally, hypertexts have been limited in size by the manual effort required to create hypertext links. In addition, large hyperlinked collections may overwhelm users with the range of possible links from any node, only a fraction of which may be appropriate for a given user at any time. This work explores automatic methods of link construction based on feedback from users collected during browsing. A full-text search engine mediates the linking process. Query terms that distinguish well among documents in the database become candidate anchors; links are mediated by passage-based relevance feedback queries. The newspaper metaphor is used to organize the retrieval results. VOIR, a software prototype that implements these algorithms, has been used to browse a 74,500-node (250 MB) database of newspaper articles. An experiment has been conducted to test the relative effectiveness of dynamic links and user-specified queries. Experimental results suggest that link-mediated queries are more effective than user-specified queries in retrieving relevant information. The paper concludes with a discussion of possible extensions to the linking algorithms.
The effectiveness of query-specific hierarchic clustering in information retrieval
- Information Processing and Management, 2002
Cited by 29 (2 self)
Hierarchic document clustering has been widely applied to Information Retrieval (IR) on the grounds of its potential improved effectiveness over inverted file search. However, previous research has been inconclusive as to whether clustering does bring improvements. In this paper we take the view that if hierarchic clustering is applied to search results (query-specific clustering), then it has the potential to increase the retrieval effectiveness compared both to that of static clustering and of conventional inverted file search. We conducted a number of experiments using five document collections and four hierarchic clustering methods. Our results show that the effectiveness of query-specific clustering is indeed higher, and suggest that there is scope for its application to IR.
On-Line New Event Detection, Clustering, And Tracking
- 1999
Cited by 16 (0 self)
In this work, we discuss and evaluate solutions to text classification problems associated with the events that are reported in on-line sources of news. We present solutions to three related classification problems: new event detection, event clustering, and event tracking. The primary focus of this thesis is new event detection, where the goal is to identify news stories that have not previously been reported, in a stream of broadcast news comprising radio, television, and newswire. We present an algorithm for new event detection, and analyze the effects of incorporating domain properties into the classification algorithm. We explore a solution that models the temporal relationship between news stories, and investigate the use of proper noun phrase
CitiViz: A visual user interface to the CITIDEL system
- In Proc. of ECDL-04, 2004
Cited by 8 (6 self)
The Digital Library (DL) field is one of the most promising areas of application for information visualization technology. In this paper, we propose a visual user interface toolkit for digital libraries, to deliver an overview of document sets, with support for interactive direct manipulation. Our system, CitiViz, employs a dynamic hyperbolic tree to display hierarchical relationships among documents, based on where their topics fit into the ACM classification system. CitiViz also provides an interactive, animated 2-dimensional scatter plot. With it, users may gain insight by changing various parameters, or may jump directly to a particular document based on its label or location. According to a preliminary evaluation, our system shows advantages in performance and user preference relative to traditional text-based DL web interfaces.
Active Information Retrieval
- 2001
Cited by 5 (1 self)
In classical large information retrieval systems, the system responds to a user initiated query with a list of results ranked by relevance.
Addressing Heterogeneity in the Networked Information Environment
Cited by 4 (3 self)
Several ongoing Stanford University Digital Library projects address the issue of heterogeneity in networked information environments. A networked information environment has the following components: users, information repositories, information services, and payment mechanisms. This paper describes three of the heterogeneity-focused Stanford projects: InfoBus, REACH, and DLITE. The InfoBus project is at the protocol level, while the REACH and DLITE projects are both at the conceptual model level. The InfoBus project provides the infrastructure necessary for accessing heterogeneous services and utilizing heterogeneous payment mechanisms. The REACH project sets forth a uniform conceptual model for finding information in networked information repositories. The DLITE project presents a general task-based strategy for building user interfaces to heterogeneous networked information services. 1.0 Introduction The recent surge of research in "digital libraries" has energized discussion ab...
An Efficient Document Clustering Algorithm based on the Topic Binder Hypothesis
Cited by 2 (0 self)
We present an efficient document clustering algorithm that uses a term frequency vector for each document instead of using a huge proximity matrix. The algorithm has the following features: 1) it consumes relatively little memory space and runs fast, 2) it produces a hierarchy in the form of a document classification tree, and 3) the hierarchy obtained by the algorithm explicitly reveals the collection structure. We confirm these features and thus show the algorithm's feasibility with clustering experiments in which we use two collections of Japanese documents, the sizes of which are 83,099 and 14,701. We also introduce two applications of this algorithm. 1 Motivation Document clustering has long received keen attention from those concerned with document retrieval, and some of the many papers are those of (van Rijsbergen, 1986), (Croft, 1980) and (Griffiths et al., 1984) which have studied the document clustering techniques for retrieval purposes. The research interest has mainly con...
Hierarchical Structural Approach to
Web users have been mainly relying on Web search engines to find information of interest on the Web.
Chapter IV
- In Managing Business with Electronic Commerce 02, 2002
In this paper, we describe a system to cluster search engine results based on a robust relational fuzzy clustering algorithm that we have recently developed. We compare the use of Vector Space based and N-Gram based dissimilarity measures to cluster the results from search engines such as MetaCrawler and Google. We start by providing a brief background on the clustering algorithm. We then describe our system, and discuss results from our experiments. These include a study of the efficiency of the Vector Space and the N-Gram methods, as well as a comparison with Husky Search (Huskysearch Web Site).
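The abstract does not say which Vector Space or N-Gram dissimilarity measures were used, so the following is only a rough sketch of the two families it contrasts. It assumes cosine dissimilarity over word-count vectors for the vector space side and a Dice-coefficient dissimilarity over character trigram sets for the n-gram side; the function names and the sample snippets are hypothetical.

```python
import math
from collections import Counter

def cosine_dissimilarity(a, b):
    """1 - cosine similarity of whitespace-tokenized word-count vectors."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * tb[term] for term, count in ta.items())
    norm_a = math.sqrt(sum(v * v for v in ta.values()))
    norm_b = math.sqrt(sum(v * v for v in tb.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0                      # an empty text shares nothing
    return 1.0 - dot / (norm_a * norm_b)

def char_ngrams(text, n=3):
    """Set of overlapping character n-grams of the lower-cased text."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_dissimilarity(a, b, n=3):
    """1 - Dice coefficient of the two character n-gram sets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 0.0                      # two texts too short to differ
    return 1.0 - 2.0 * len(ga & gb) / (len(ga) + len(gb))

# Hypothetical result snippets: same topic, different word order and inflection.
first = "document clustering"
second = "clustering documents"
d_vector = cosine_dissimilarity(first, second)
d_ngram = ngram_dissimilarity(first, second)
print(d_vector, d_ngram)
```

On this pair the word-level measure treats "document" and "documents" as unrelated terms, while the character-level measure still sees most of their trigrams in common. That tolerance to inflection is one commonly cited reason to consider n-gram measures for short result snippets.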

