Hierarchical Document Clustering Using Frequent Itemsets
- In Proc. SIAM International Conference on Data Mining 2003 (SDM 2003)
, 2003
Cited by 55 (1 self)
A major challenge in document clustering is the extremely high dimensionality. For example, the vocabulary for a document set can easily be thousands of words. On the other hand, each document often contains a small fraction of words in the vocabulary. These features require special handling. Another requirement is hierarchical clustering, where clustered documents can be browsed according to the increasing specificity of topics. In this paper, we propose to use the notion of frequent itemsets, which comes from association rule mining, for document clustering. The intuition of our clustering criterion is that each cluster is identified by some common words, called frequent itemsets, for the documents in the cluster. Frequent itemsets are also used to produce a hierarchical topic tree for clusters. By focusing on frequent items, the dimensionality of the document set is drastically reduced. We show that this method outperforms the best existing methods in terms of both clustering accuracy and scalability.
The effectiveness of query-specific hierarchic clustering in information retrieval
- Information Processing and Management
, 2002
Cited by 29 (2 self)
Hierarchic document clustering has been widely applied to Information Retrieval (IR) on the grounds of its potential improved effectiveness over inverted file search. However, previous research has been inconclusive as to whether clustering does bring improvements. In this paper we take the view that if hierarchic clustering is applied to search results (query-specific clustering), then it has the potential to increase the retrieval effectiveness compared both to that of static clustering and of conventional inverted file search. We conducted a number of experiments using five document collections and four hierarchic clustering methods. Our results show that the effectiveness of query-specific clustering is indeed higher, and suggest that there is scope for its application to IR.
Automatic Thesaurus Construction Based on Grammatical Relations
, 1995
Cited by 16 (4 self)
We propose a method to build thesauri on the basis of grammatical relations. The proposed method constructs thesauri by using a hierarchical clustering algorithm. An important point in this paper is the claim that thesauri in order to be efficient need to take (surface) case information into account. We refer to the thesauri as "relation-based thesaurus (RBT)." In the experiment, four RBTs of Japanese nouns were constructed from 26,023 verb-noun co-occurrences, and each RBT was evaluated by objective criteria. The experiment has shown that the RBTs have better properties for selectional restriction of case frames than conventional ones.
1 Introduction
For most natural language processing (NLP) systems, thesauri are one of the basic ingredients. In particular, coupled with case frames, they are useful to guide correct analysis [Allen, 1988]. In the example-based frameworks, thesauri are also used to compensate for insufficient example data [ Sato and Nagao, 1990, Nagao and Kurohashi...
Clustering in Massive Data Sets
- Handbook of massive data sets
, 1999
Cited by 8 (0 self)
We review the time and storage costs of search and clustering algorithms. We exemplify these, based on case-studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Sections 2 to 6 relate to nearest neighbor searching, an elemental form of clustering, and a basis for clustering algorithms to follow. Sections 7 to 11 review a number of families of clustering algorithms. Sections 12 to 14 relate to visual or image representations of data sets, from which a number of interesting algorithmic developments arise.
Combining Text-, Link-, and Classification-based Retrieval Methods to Enhance Information Discovery on the Web
, 2002
Information Retrieval on the Web: Selected Topics
- IBM research, Tokyo Research Laboratory, IBM
, 1999
Cited by 4 (0 self)
In this paper we review studies on the growth of the Internet and technologies which are useful for information search and retrieval on the Web. In the first section, we present data on the Internet from several different sources, e.g., current as well as projected number of users, hosts and Web sites. Although the numerical figures vary, the overall trends cited by the sources are consistent and point to exponential growth during the coming decade. And Internet users are increasingly using search engines and search services to find specific information of interest. However, users are not satisfied with the performance of the current generation of search engines; the slow speed of retrieval, communication delays, and poor quality of retrieved results (e.g., noise and broken links) are commonly cited problems. The main body of our paper focuses on linear algebraic models and techniques for solving these problems. keywords: clustering, indexing, information retrieval, Internet, late...
Automatic thesaurus construction based on grammatical relations
- In Proceedings of IJCAI-95
, 1995
Cited by 3 (0 self)
We propose a method to build thesauri on the basis of grammatical relations. The proposed method constructs thesauri by using a hierarchical clustering algorithm. An important point in this paper is the claim that thesauri in order to be efficient need to take (surface) case information into account. We refer to the thesauri as "relation-based thesaurus (RBT)." In the experiment, four RBTs of Japanese nouns were constructed from 26,023 verb-noun co-occurrences, and each RBT was evaluated by objective criteria. The experiment has shown that the RBTs have better properties for selectional restriction of case frames than conventional ones.
Clustering information retrieval search outputs
- In: Proceedings of the 21st BCS IRSG Colloquium on Information Retrieval
, 1999
Cited by 3 (1 self)
Users are known to experience difficulties in dealing with information retrieval search outputs, especially if those outputs are above a certain size. It has been argued by several researchers that search output clustering can help users in their interaction with IR systems in some retrieval situations, providing them with an overview of their results by exploiting the topicality information that resides in the output but has not been used at the retrieval stage. This overview might enable them to find relevant documents more easily by focusing on the most promising clusters, or to use the clusters as a starting-point for query refinement or expansion. In this paper, the results of experiments carried out to assess the viability of clustering as a search output presentation method are reported and discussed.
Guru: Information retrieval for reuse
- Landmark Contributions in Software Reuse and Reverse Engineering. Unicom Seminars Ltd
, 1994
Cited by 3 (0 self)
Although software reuse presents clear advantages for programmer productivity and code reliability, it is not practiced enough. One of the reasons for the only moderate success of reuse is the lack of software libraries that facilitate the actual locating and understanding of reusable components. This paper describes a technology for automatically assembling large software libraries that promote software reuse by helping the user locate the components closest to her/his needs. Software libraries are automatically assembled from a set of unorganized components by using information retrieval techniques. The construction of the library is done in two steps. First, attributes are automatically extracted from natural language documentation by using a new free-text indexing scheme based on the notions of lexical affinities and quantity of information. Then, a hierarchy for browsing is automatically generated using a clustering technique that draws only on the information provided by the attributes. Thanks to the free-text indexing scheme, tools following this approach can accept free-style natural language queries. This technology has been implemented in the Guru system, which has been applied to construct an organized library of AIX utilities. An experiment was conducted in order to evaluate the retrieval effectiveness of Guru as compared to InfoExplorer, a hypertext library system for AIX 3 on the IBM RS/6000 series, as well as to two other indexing schemes. We followed the usual evaluation procedure used in information retrieval, based upon recall and precision measures, and determined that our system retrieved more effectively than the others.
UMass at TREC 2003: HARD and QA
- In Proceedings of the Twelfth Text Retrieval Conference (TREC-2003), Washington, DC. U.S. Government Printing Office. NIST Special Publication
, 2004
Cited by 2 (0 self)
• In the HARD track, we developed document metadata to correspond to query metadata requirements; implemented clarification forms based on query expansion, passage retrieval, and clustering; and retrieved variable length passages deemed most likely to be relevant. This work is discussed at length in Section 1.
• In the QA track, we focused on retrieving passages that were likely to contain the answer to the question.
1 HARD track
1.1 Overview
The goal of the High Accuracy Retrieval from Documents track was to explore techniques for improving the accuracy of the top-ranked documents in response to a query. We participated in all three aspects of the problem:
• We mapped query metadata values to document metadata values that we assigned. We then adjusted the ranking of documents depending on whether their metadata matched the query metadata.
• We generated clarification forms to tease more information out of the searcher. We tried several types of clarification forms, including providing a list of keywords that might appear in relevant documents, a list of top-ranking clusters that might contain relevant documents, and a list of passages that might appear in relevant documents.
• We explored passage-level retrieval of documents to see if we could pinpoint the relevant portions of documents.
In the final analysis, all runs using metadata or clarification forms failed to outperform our best baseline run. We interpret

