Improving automatic query expansion
, 1998
Cited by 195 (3 self)
Most casual users of IR systems type short queries. Recent research has shown that adding new words to these queries via ad hoc feedback improves the retrieval effectiveness of such queries. We investigate ways to improve this query expansion process by refining the set of documents used in feedback. We start by using manually formulated Boolean filters along with proximity constraints. Our approach is similar to the one proposed by Hearst [12]. Next, we investigate a completely automatic method that makes use of term cooccurrence information to estimate word correlation. Experimental results show that refining the set of documents used in query expansion often prevents the query drift caused by blind expansion and yields substantial improvements in retrieval effectiveness, both in terms of average precision and precision in the top twenty documents. More importantly, the fully automatic approach developed in this study performs competitively with the best manual approach and requires little computational overhead.
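The idea of refining the feedback set before expanding can be illustrated with a minimal sketch. It is not the authors' method: the proximity filter, the whitespace tokenizer, the document-frequency ranking of candidate terms, and all names (`filter_feedback_docs`, `expand_query`, `window`, `min_terms`) are assumptions chosen for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems stem and remove stopwords)."""
    return text.lower().split()


def filter_feedback_docs(query_terms, ranked_docs, window=10, min_terms=2):
    """Keep only documents in which at least `min_terms` distinct query
    terms occur within `window` tokens of each other."""
    wanted = set(query_terms)
    kept = []
    for doc in ranked_docs:
        positions = [(i, t) for i, t in enumerate(tokenize(doc)) if t in wanted]
        for idx, (pos, _) in enumerate(positions):
            nearby = {t for p, t in positions[idx:] if p - pos < window}
            if len(nearby) >= min_terms:
                kept.append(doc)
                break
    return kept


def expand_query(query_terms, feedback_docs, n_new=3):
    """Add the `n_new` non-query terms that occur in the most feedback documents."""
    counts = Counter()
    for doc in feedback_docs:
        counts.update(set(tokenize(doc)) - set(query_terms))
    # Sort by document frequency, then alphabetically, so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [term for term, _ in ranked[:n_new]]
```

Blind expansion would draw terms from every top-ranked document; here a document that mentions only one query term is discarded first, which is one simple way to limit drift toward an unrelated sense of that term.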
Using Clustering and SuperConcepts within SMART: TREC 6
- THE SIXTH TEXT RETRIEVAL CONFERENCE (TREC-6). NIST SPECIAL PUBLICATION 500-240, NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
, 1998
Cited by 49 (6 self)
The Smart information retrieval project emphasizes completely automatic approaches to the understanding and retrieval of large quantities of text. We continue our work in TREC 6, performing runs in the routing, ad-hoc, and foreign language environments, including cross-lingual runs. The major focus for TREC 6 is on trying to maintain the balance of the query -- attempting to ensure the various aspects of the original query are appropriately addressed, especially while adding expansion terms. Exactly the same procedure is used for foreign language environments as for English; our tenet is that good information retrieval techniques are more powerful than linguistic knowledge. We also give an interesting cross-lingual run, assuming that French and English are closely enough related so that a query in one language can be run directly on a collection in the other language by just "correcting" the spelling of the query words. This is quite successful for most queries.
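The spelling "correction" idea can be sketched as follows, assuming a plain string-similarity match against the target collection's vocabulary. The abstract does not say how SMART performs the correction; the use of Python's `difflib`, the function name `respell_query`, and the `cutoff` value are all assumptions.

```python
import difflib


def respell_query(query_words, target_vocabulary, cutoff=0.8):
    """Replace each query word with its closest match in the target
    vocabulary, or keep the word unchanged when nothing is similar enough."""
    respelled = []
    for word in query_words:
        matches = difflib.get_close_matches(word, target_vocabulary, n=1, cutoff=cutoff)
        respelled.append(matches[0] if matches else word)
    return respelled
```

For example, with an English vocabulary containing "government", the French word "gouvernement" would be mapped to it, while a word such as "chien" that has no close English spelling is left as typed.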
Using Structured Queries for Disambiguation in Cross-Language Information Retrieval
, 1997
Cited by 27 (1 self)
Bilingual transfer dictionaries are an important resource for query translation in cross-language text retrieval. However, term translation is not an isomorphic process, so dictionary-based systems must address the problem of ambiguity in language translation. In this paper, we claim that boolean conjunction (the AND operator) provides simple and automatic disambiguation in the target language. We derive a new weighted boolean model based on a probabilistic formulation and apply it to the cross-language text retrieval problem. The results suggest that the weighted boolean model is highly effective for general text retrieval, but more experimental evidence is needed to conclude that it is particularly advantageous for cross-language application. Nonetheless, the preliminary results are quite promising. 1 Introduction With the ongoing development of multilingual information retrieval systems, researchers are becoming increasingly interested in the problem of cross-language information retrie...
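To see why conjunction can disambiguate, consider a minimal sketch in which each source-language term becomes an OR over its dictionary translations and the query is the AND of those groups. This uses the textbook independence interpretation of AND and OR over per-term probabilities; it is not the weighted boolean model the paper derives, and the names `p_or`, `p_and`, and `score` are hypothetical.

```python
def p_or(probs):
    """Probability that at least one alternative holds, assuming independence."""
    none_hold = 1.0
    for p in probs:
        none_hold *= 1.0 - p
    return 1.0 - none_hold


def p_and(probs):
    """Probability that all conjuncts hold, assuming independence."""
    all_hold = 1.0
    for p in probs:
        all_hold *= p
    return all_hold


def score(doc_weights, translated_query):
    """Score a document against a translated query.

    doc_weights: dict mapping a target-language term to a weight in [0, 1].
    translated_query: one list of candidate translations per source term.
    """
    return p_and(
        p_or(doc_weights.get(term, 0.0) for term in alternatives)
        for alternatives in translated_query
    )
```

A document that matches a wrong translation of one source term but nothing from the other groups scores zero, so only documents where the translations of different source terms co-occur are rewarded.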
Answering clinical questions with knowledge-based and statistical techniques
- Computational Linguistics
, 2007
Cited by 24 (6 self)
The combination of recent developments in question-answering research and the availability of unparalleled resources developed specifically for automatic semantic processing of text in the medical domain provides a unique opportunity to explore complex question answering in the domain of clinical medicine. This article presents a system designed to satisfy the information needs of physicians practicing evidence-based medicine. We have developed a series of knowledge extractors, which employ a combination of knowledge-based and statistical techniques, for automatically identifying clinically relevant aspects of MEDLINE abstracts. These extracted elements serve as the input to an algorithm that scores the relevance of citations with respect to structured representations of information needs, in accordance with the principles of evidence-based medicine. Starting with an initial list of citations retrieved by PubMed, our system can bring relevant abstracts into higher ranking positions, and from these abstracts generate responses that directly answer physicians' questions. We describe three separate evaluations: one focused on the accuracy of the knowledge extractors, one conceptualized as a document reranking task, and finally, an evaluation of answers by two physicians. Experiments on a collection of real-world clinical questions show that our approach significantly outperforms the already competitive PubMed baseline.
AT&T at TREC-6
- In Proceedings of the Sixth Text REtrieval Conference (TREC-6)
, 1998
Cited by 21 (2 self)
TREC-6 is AT&T's first independent TREC participation. We are participating in the main tasks (adhoc, routing), the filtering track, the VLC track, and the SDR track. This year, in the main tasks, we experimented with multi-pass query expansion using Rocchio's formulation. We concentrated a reasonable amount of our effort on our VLC track system, which is based on locally distributed, disjoint, and smaller sub-collections of the large collection. Our filtering track runs are based on our routing runs, followed by similarity thresholding to make a binary decision of the relevance prediction for a document. 1 Introduction TREC-6 is the first TREC in which AT&T is participating as an independent group. Much of our work is largely inspired by Smart's philosophy of fully automatic processing of large text collections. Our participation is based on an internally modified version of Cornell's SMART system. We submitted runs for the adhoc task, the routing task, the filtering track, the VL...
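Rocchio's formulation, mentioned above, moves the query vector toward the centroid of relevant documents and away from the centroid of non-relevant ones. The sketch below shows one pass over sparse term-weight dictionaries; the coefficient values, the clipping of negative weights, and the function name `rocchio` are assumptions, not the settings used in the AT&T runs.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """One pass of Rocchio query modification.

    query: dict mapping term to weight.
    relevant, nonrelevant: lists of such dicts (document vectors).
    Terms whose new weight is not positive are dropped (a common convention).
    """
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms.update(doc)

    new_query = {}
    for term in terms:
        weight = alpha * query.get(term, 0.0)
        if relevant:
            weight += beta * sum(d.get(term, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            weight -= gamma * sum(d.get(term, 0.0) for d in nonrelevant) / len(nonrelevant)
        if weight > 0:
            new_query[term] = weight
    return new_query
```

A multi-pass variant would retrieve again with the returned query, pick a new feedback set, and call `rocchio` once more; how the passes are configured in the actual system is not stated in the abstract.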
Document Classification using Multiword Features
- in: Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM’98), (ACM
, 1998
Cited by 12 (1 self)
We investigate the use of multiword query features to improve the effectiveness of text-retrieval systems that accept natural-language queries. A relevance feedback process is explained that expands an initial query with single and multiword features. The multiword features are modelled as a set of words appearing within windows of varying sizes. Our experimental results suggest that windows of larger span yield improvements in retrieval over windows of smaller span. This result gives rise to a query contraction process that prunes 25% of the features in an expanded query with no loss in retrieval effectiveness. 1 Introduction The following work investigates the representation for queries used in text-based information retrieval systems. The query representation described has applications in document filtering, routing, and clustering in addition to website searching. Our primary focus is the use of query features that represent concepts expressible in natural language by multiple word...
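A multiword feature modelled as a set of words within a window can be matched with a simple sliding check, sketched below. The token-based window, the requirement that all words of the feature be present, and the name `matches_in_window` are assumptions; the paper's exact feature definition and weighting are not given in the abstract.

```python
def matches_in_window(tokens, feature_words, window):
    """Return True if every word in `feature_words` occurs inside some
    span of `window` consecutive tokens."""
    needed = set(feature_words)
    # When the text is shorter than the window, check it once as a whole.
    last_start = max(1, len(tokens) - window + 1)
    for start in range(last_start):
        if needed <= set(tokens[start:start + window]):
            return True
    return False
```

With a small window the feature behaves like a phrase; with a larger window it behaves more like loose co-occurrence, which is the range of spans the abstract compares.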
WebCrawler: Finding What People Want
, 2000
Cited by 8 (0 self)
WebCrawler, the first comprehensive full-text search engine for the World-Wide Web, has played a fundamental role in making the Web easier to use for millions of people. Its invention and subsequent evolution, spanning a three-year period, helped fuel the Web's growth by creating a new way of navigating hypertext. Before search engines like WebCrawler, users found Web documents by following hypertext links from one document to another. When the Web was small and its documents shared the same fundamental purpose, users could find documents with relative ease. However, the Web quickly grew to millions of pages, making navigation difficult. WebCrawler assists users in their Web navigation by automating the task of link traversal, creating a searchable index of the Web, and fulfilling searchers' queries from the index. To use WebCrawler, a user issues a query to a pre-computed index, quickly retrieving a list of documents that match the query. This dissertation describes WebCrawler's scientific contributions: a method for choosing a subset of the Web to index; an approach to creating a search service that is easy to use; a new way to rank search results that can generate highly effective results for both naive and expert searchers; and an architecture for the service that has effectively handled a three-order-of-magnitude increase in load. This dissertation also describes how WebCrawler evolved to accommodate the extraordinary growth of the Web. This growth affected WebCrawler not only by increasing the size and scope of its index, but also by increasing the demand for its service. Each of WebCrawler's components had to change to accommodate this growth: the crawler had to download more documents, the full-text index had to become more efficient at storing and finding those documents, and the service had to accommodate heavier demand.
Such changes were not only related to scale, however: the evolving nature of the Web meant that functional changes were necessary, too, such as the ability to handle naive queries from searchers.
Polyphonic Music Retrieval: The N-gram Approach
, 2004
Cited by 4 (1 self)
This Music Information Retrieval (MIR) study investigates the use of n-grams and textual Information Retrieval (IR) approaches for the retrieval and access of polyphonic music data. IR, synonymous with text IR, implies the task of retrieving documents or texts with information content that is relevant to a user’s information need. With music retrieval, the use of n-grams has largely been confined to monophonic musical sequences. The few studies that have investigated its use with polyphonic music collections typically reduce a polyphonic file into a monophonic sequence for n-gram construction. Techniques for full-music indexing of polyphonic music data with n-grams are investigated. A method to obtain n-grams from polyphonic music data is introduced. The information content of ‘musical n-grams’ is extended to include rhythmic information in addition to intervallic information. For this, ratios of onset times between two adjacent pairs of pitch events are used. To encode ‘musical n-grams’ to obtain ‘musical words’ for indexing, a function that maps interval classes to text characters is formulated, and ranges of ratio bins are defined. These encoding approaches enable encoding of the pitch and rhythm information at vari-
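The encoding described above can be sketched for the simplified monophonic case: each n-gram of pitch events yields n-1 pitch intervals and n-2 onset-time ratios, which are then turned into characters. Everything specific here is hypothetical: the interval-to-letter mapping, the ratio bin edges, and the function names are illustrative stand-ins, and the study's handling of simultaneous notes in polyphonic data is not reproduced.

```python
from bisect import bisect_right

# Hypothetical ratio bins: 0 = next gap shorter, 1 = about equal, 2 = longer.
RATIO_BIN_EDGES = [0.75, 1.5]


def encode_interval(semitones):
    """Hypothetical mapping: lowercase for unison or ascending intervals,
    uppercase for descending ones, with the size capped at 25 semitones."""
    size = min(abs(semitones), 25)
    base = "A" if semitones < 0 else "a"
    return chr(ord(base) + size)


def encode_ratio(ratio):
    """Map an onset-time ratio to the index of its bin, as a digit."""
    return str(bisect_right(RATIO_BIN_EDGES, ratio))


def musical_words(events, n=3):
    """Build 'musical words' from (onset_time, midi_pitch) events.

    Assumes events are sorted with strictly increasing onset times,
    so every gap between adjacent onsets is positive.
    """
    words = []
    for i in range(len(events) - n + 1):
        gram = events[i:i + n]
        intervals = [gram[k + 1][1] - gram[k][1] for k in range(n - 1)]
        gaps = [gram[k + 1][0] - gram[k][0] for k in range(n - 1)]
        ratios = [gaps[k + 1] / gaps[k] for k in range(n - 2)]
        word = encode_interval(intervals[0])
        for interval, ratio in zip(intervals[1:], ratios):
            word += encode_ratio(ratio) + encode_interval(interval)
        words.append(word)
    return words
```

Because the output is ordinary text, the resulting words can be fed to a standard text indexer, which is the point of applying textual IR approaches to music data.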
Keyphrase extraction-based query expansion in digital libraries
- in Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, Chapel Hill, NC, USA 2006
Cited by 4 (0 self)
In pseudo-relevance feedback, the two key factors affecting the retrieval performance most are the source from which expansion terms are generated and the method of ranking those expansion terms. In this paper, we present a novel unsupervised query expansion technique that utilizes keyphrases and POS phrase categorization. The keyphrases are extracted from the retrieved documents and weighted with an algorithm based on information gain and co-occurrence of phrases. The selected keyphrases are translated into Disjunctive Normal Form (DNF) based on the POS phrase categorization technique for better query reformulation. Furthermore, we study whether ontologies such as WordNet and MeSH improve the retrieval performance in conjunction with the keyphrases. We test our techniques on TREC 5, 6, and 7 as well as a MEDLINE collection. The experimental results show that the use of keyphrases with POS phrase categorization produces the best average precision.
A fusion approach to XML structured document retrieval
- Information Retrieval
, 2005
Cited by 2 (1 self)
XML has emerged as a lingua franca of the WWW and is rapidly replacing other formats as the preferred form for information ranging from protocol exchange messages to full documents and databases. With this rapid growth, and the conversion of information resources to XML, comes an increasing need for effective search and retrieval of XML documents and their constituent elements.

