Inverted files versus signature files for text indexing
- ACM Transactions on Database Systems, 1998
Cited by 74 (3 self)
Two well-known indexing methods are inverted files and signature files. We have undertaken a detailed comparison of these two approaches in the context of text indexing, paying particular attention to query evaluation speed and space requirements. We have examined their relative performance using both experimentation and a refined approach to modeling of signature files, and demonstrate that inverted files are distinctly superior to signature files. Not only can inverted files be used to evaluate typical queries in less time than can signature files, but inverted files require less space and provide greater functionality. Our results also show that a synthetic text database can provide a realistic indication of the behavior of an actual text database. The tools used to generate the synthetic database have been made publicly available.
Optimizing Ranking Functions: A Connectionist Approach to Adaptive Information Retrieval
- Department of Computer Science and Engineering, University of California, San Diego, 1994
Cited by 26 (5 self)
This dissertation examines the use of adaptive methods to automatically improve the performance of ranked text retrieval systems. The goal of a ranked retrieval system is to manage a large collection of text documents and to order documents for a user based on the estimated relevance of the documents to the user's information need (or query). The ordering enables the user to quickly find documents of interest. Ranked retrieval is a difficult problem because of the ambiguity of natural language, the large size of the collections, and because of the varying needs of users and varying collection characteristics. We propose and empirically validate general adaptive methods which improve the ability of a large class of retrieval systems to rank documents effectively. Our main adaptive method is to numerically optimize free parameters in a retrieval system by minimizing a non-metric criterion function. The criterion measures how well the system is ranking documents relative to a target ordering, defined by a set of training queries which include the users' desired document orderings. Thus, the system learns parameter settings which better enable it to rank relevant documents before irrelevant. The non-metric approach is interesting because it is a general adaptive method, an alternative to supervised methods for training neural networks in domains in which rank order or prioritization is important. A second adaptive method is also examined, which is applicable to a restricted class of retrieval systems but which permits an analytic solution. The adaptive methods are applied to a number of problems in text retrieval to validate their utility and practical efficiency. The applications include: A dimensionality reduction of vector-based document representations to a vector spa...
TREC-3 Ad-Hoc, Routing Retrieval and Thresholding Experiments using PIRCS
- In Proceedings of TREC-3, 1995
Cited by 22 (2 self)
The PIRCS retrieval system has been upgraded in TREC-3 to handle the full English collections of 2 GB in an efficient manner. For ad-hoc retrieval, we use recurrent spreading of activation in our network to implement query learning and expansion based on the best-ranked subdocuments of an initial retrieval. We also augment our standard retrieval algorithm with a soft-Boolean component. For routing, we use learning from signal-rich short documents or subdocument segments. For the optional thresholding experiment, we tried two approaches to transforming retrieval status values (RSVs) so that they could be used to partition documents into retrieved and nonretrieved sets. The first method normalizes RSVs using a query self-retrieval score. The second, which requires training data, uses logistic regression to convert RSVs into estimates of probability of relevance. Overall, our results are highly competitive with those of other participants. 1. INTRODUCTION PIRCS is an experimental info...
TREC-4 Ad-Hoc, Routing Retrieval and Filtering Experiments using PIRCS
- NIST Special Publication, 1996
Cited by 12 (2 self)
Our ad-hoc submissions are pircs1, which is fully automatic, and pircs2, which involves manually weighting some terms and adding some new words to the original topic descriptions. The number of words added is minimal. Both methods involve training and query expansion using the best-ranked subdocuments from an initial retrieval as feedback. For our routing experiments we make use of massive query expansion of 350 terms in pircsL, with emphasis on expansion with low-frequency terms. Training is done using short and top-ranked known relevant subdocuments. In pircsC, we define four different 'expert' queries (pircsL being one of them) for each topic by using different subsets of training documents, and later combine their retrieval results into one. The filtering experiment is done with the retrieval lists of pircsL. For each query, we use the training collections to define retrieval status values (RSVs) where the utilities are maximum for the three precision types. These RSVs are then used as t...
Natural Language Information Retrieval: TREC-7 Report
Cited by 10 (0 self)
merged using a combined strategy developed at GE and SICS. 2. Background The work reported here was part of the Natural Language Information Retrieval project (NLIR) (Strzalkowski et al., 1997
TREC2 Document Retrieval Experiments using PIRCS
- NIST Special Publication 500-215, 1994
Cited by 4 (2 self)
We performed the full experiments using our network implementation of the component probabilistic indexing and retrieval model. Documents were enhanced with a list of semi-automatically generated two-word phrases, and queries with automatic Boolean expressions. An item self-learning procedure was used to initiate network edge weights for retrieval. Initial results submitted were above median for ad hoc, and below median for routing. They were not up to expectation because of a bad choice of high-frequency cutoff for terms, and no query expansion for routing. Later experiments showed that our system does return very good results after correcting the earlier problems and adjusting some parameters. We also redesigned our system to handle virtually any number of large files in an incremental fashion, and to do retrieval and learning by initiating our network on demand, without first creating a full inverted file. 1. Introduction In TREC1 our system called PIRCS (acronym for Probabilistic Inde...
Document Representation in Natural Language Text Retrieval
- In Proceedings of the Human Language Technology (HLT) Conference, 1994
Cited by 2 (0 self)
In information retrieval, the content of a document may be represented as a collection of terms: words, stems, phrases, or other units derived or inferred from the text of the document. These terms are usually weighted to indicate their importance within the document, which can then be viewed as a vector in an N-dimensional space. In this paper we demonstrate that proper term weighting is at least as important as term selection, and that different types of terms (e.g., words, phrases, names), and terms derived by different means (e.g., statistical, linguistic), must be treated differently for maximum benefit in retrieval. We report some observations made during and after the second Text REtrieval Conference (TREC-2).
Towards the Next Generation Information Retrieval
- 2000
Future information retrieval systems will be expected to fetch specific facts, answer questions, give advice, or compose reports that satisfy users' ever more demanding information needs. We discuss an early prototype of a next generation IR system (NGIR) where users can interactively assemble information briefs on topics of interest. Preamble Future information retrieval systems will be expected to fetch specific facts, answer questions, give advice, or compose reports that satisfy users' ever more demanding information needs. Moreover, this information will have to be presented in a manner which is immediately understandable and usable. A ranked list of documents returned by today's search engines may be understandable (though often misunderstood), but it is rarely usable. A list of `hits' is easy enough to produce; however, to prepare an effective and useful information takes a good deal more effort, as exemplified by, for example, the news production process. A producer (television ...

