Ephemeral Document Clustering for Web Applications
, 2000
Cited by 16 (0 self)
We revisit document clustering in the context of the Web. Specifically, we investigate on-line ephemeral clustering, whereby the input document set is generated dynamically, typically by search results, and the output clustering hierarchy has a short life span and is used for interactive browsing purposes. Ephemeral clustering for interactive use introduces several new challenges. It requires an efficient algorithm, since clustering is performed on-line. It also requires high precision, because users who are not domain experts are less tolerant of errors, and because the resulting hierarchy is generated fully automatically, as opposed to off-line clustering, in which the hierarchy is often manually modified. Finally, interactive clustering requires a presentation layer that enables users to browse the hierarchy effectively, including visualization techniques and automatic annotations of the hierarchy. We present new concepts, techniques and algorithms that tailor clustering to...
Guru: Information retrieval for reuse
- Landmark Contributions in Software Reuse and Reverse Engineering. Unicom Seminars Ltd
, 1994
Cited by 3 (0 self)
Although software reuse presents clear advantages for programmer productivity and code reliability, it is not practiced enough. One of the reasons for the only moderate success of reuse is the lack of software libraries that facilitate the actual locating and understanding of reusable components. This paper describes a technology for automatically assembling large software libraries that promote software reuse by helping the user locate the components closest to her/his needs. Software libraries are automatically assembled from a set of unorganized components by using information retrieval techniques. The construction of the library is done in two steps. First, attributes are automatically extracted from natural language documentation by using a new free-text indexing scheme based on the notions of lexical affinities and quantity of information. Then, a hierarchy for browsing is automatically generated using a clustering technique that draws only on the information provided by the attributes. Thanks to the free-text indexing scheme, tools following this approach can accept free-style natural language queries. This technology has been implemented in the Guru system, which has been applied to construct an organized library of AIX utilities. An experiment was conducted in order to evaluate the retrieval effectiveness of Guru as compared to InfoExplorer, a hypertext library system for AIX 3 on the IBM RS/6000 series, as well as to two other indexing schemes. We followed the usual evaluation procedure used in information retrieval, based upon recall and precision measures, and determined that our system retrieved more effectively than the others.
The GURU System in TREC-5
, 1997
Cited by 2 (0 self)
Probabilistic ranking uses a unique feature called "Lexical Affinity" (LA). LA between two terms is a correlation measure of their common occurrences in text, as defined by Maarek (1991). The occurrences of correlated pairs of words in a document are ranked higher than the occurrences of the individual words over greater distances. The analyzed query is expressed as an "F" statement. "F" is a formal language which we have developed for specifying different search operations and expansions of the query terms. For example: label1 { morph (word1) word2 } determines that the query consists of one query term (corresponding to one label). The curly brackets specify that word1 and word2 are variants of it. Occurrences of either variant will be treated by the search engine as occurrences of the query term. Variants are added manually by the user. The "morph" operator instructs the search engine to expand each word within its scope to all of its morphological forms. Morphological expansion is
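To illustrate the lexical-affinity idea described in this abstract, the following minimal Python sketch counts pairs of words that occur close to each other in a text. The window size, the whitespace tokenizer, and the function name `lexical_affinities` are assumptions made for illustration only; the abstract does not say how Guru computes or scores the correlation, so this is a plain co-occurrence count rather than the system's actual measure.

```python
from collections import Counter

def lexical_affinities(text, window=5):
    """Count unordered word pairs that co-occur within `window` words.

    Hypothetical helper: the window size and the tokenization are
    assumptions, not details taken from the Guru system.
    """
    tokens = text.lower().split()
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look ahead at most `window` tokens from position i.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                # Sort the pair so (a, b) and (b, a) share one counter.
                counts[tuple(sorted((left, right)))] += 1
    return counts

pairs = lexical_affinities("file system check repairs the file system", window=2)
print(pairs[("file", "system")])
```

A ranking function could then give extra credit to documents in which a frequently co-occurring pair from the query appears together, which is the behaviour the abstract describes in general terms.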
Allowing Users to Weight Search Terms in Information Retrieval
- IBM Research Report RJ 10108
, 1998
We give a principled method for allowing users to assign subjective weights to the importance of search terms, that is, the terms forming a query, in information retrieval systems. For example, our method makes it possible for a user to say that she cares twice as much about the first search term as the second search term, and to obtain a ranked list of results that reflects this preference. Our method is based upon a simple formula derived from any existing "unweighted" ranking function. A naive application of the weighted formula would require issuing as many distinct queries as there are search terms, thus damaging the response time of the retrieval. We explain here how to "smoothly" integrate the formula in most retrieval engines, so as not to affect the retrieval performance in terms of response time. Most of this research was done while the author was a Research Fellow at the IBM Haifa Research Laboratory. 1 Introduction and Motivation Users issuing a query (either f...
Allowing Users to Weight Search Terms
- Proc. Recherche d'Informations Assistee par Ordinateur RIAO '2000
Information retrieval systems typically weight the importance of search terms according to document and collection statistics (such as by using tf × idf scores, where less common terms are given higher weight). We consider here the scenario where a user can express her own subjective weighting of the importance of the terms that form the query on top of the system-generated weighting, and show how this should modify the relevance scores of documents. This has been allowed before, but only by ad hoc heuristics. We give the first principled method for taking into account the user's subjective weighting of the importance of query terms. Our method is based on an approach by Fagin and Wimmers that gives a simple formula derived from any existing "unweighted" ranking function. A naive application of the formula would require issuing as many distinct queries as there are terms in the query (search terms), thus damaging the response time of the retrieval. We explain here how to...
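The weighting formula is not reproduced in this abstract. As a rough sketch, the Fagin and Wimmers formula is usually stated as follows: with weights normalized to sum to 1 and sorted so that θ1 ≥ θ2 ≥ ... ≥ θm, the weighted score is the sum over i of i · (θi − θ(i+1)) · f(x1, ..., xi), taking θ(m+1) = 0. The Python below assumes that form. The function name `weighted_score` and the use of `min` as the unweighted ranking function `f` are illustrative assumptions, not the paper's choices.

```python
def weighted_score(scores, weights, f=min):
    """Combine per-term scores under user weights (sketch).

    scores  -- per-term relevance scores for one document
    weights -- non-negative user weights, one per term, not all zero
    f       -- unweighted ranking function over a list of scores
               (min is an assumed stand-in, not the paper's choice)
    """
    total = float(sum(weights))
    # Normalize the weights and order the terms by decreasing weight.
    ranked = sorted(zip(weights, scores), key=lambda p: p[0], reverse=True)
    thetas = [w / total for w, _ in ranked] + [0.0]
    ordered = [s for _, s in ranked]
    result = 0.0
    for i in range(1, len(ordered) + 1):
        # Term i of the sum: i * (theta_i - theta_{i+1}) * f(x_1..x_i).
        result += i * (thetas[i - 1] - thetas[i]) * f(ordered[:i])
    return result

# The first term matters twice as much as the second.
print(weighted_score([0.9, 0.3], [2, 1]))
```

Each `f(ordered[:i])` call scores a prefix of the query terms, which corresponds to the separate sub-queries that the abstract says a naive application would have to issue. With equal weights the sum collapses to the unweighted `f` over all terms.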

