Simple, Proven Approaches to Text Retrieval
, 1997
Cited by 86 (3 self)
This technical note describes straightforward techniques for document indexing and retrieval that have been solidly established through extensive testing and are easy to apply. They are useful for many different types of text material, are viable for very large files, and have the advantage that they do not require special skills or training for searching, but are easy for end users. The document and text retrieval methods described here have a sound theoretical basis, are well established by extensive testing, and the ideas involved are now implemented in some commercial retrieval systems. Testing in the last few years has, in particular, shown that the methods presented here work very well with full texts, not only titles and abstracts, and with large files of texts containing three quarters of a million documents. These tests, the TREC Tests (see Harman 1993 - 1997; IP&M 1995), have been rigorous comparative evaluations involving many different approaches to information retrieval. ...
Noun-Phrase Analysis in Unrestricted Text for Information Retrieval
, 1996
Cited by 64 (10 self)
Information retrieval is an important application area of natural-language processing where one encounters the genuine challenge of processing large quantities of unrestricted natural-language text. This paper reports on the application of a few simple, yet robust and efficient noun-phrase analysis techniques to create better indexing phrases for information retrieval. In particular, we describe a hybrid approach to the extraction of meaningful (continuous or discontinuous) subcompounds from complex noun phrases using both corpus statistics and linguistic heuristics. Results of experiments show that indexing based on such extracted subcompounds improves both recall and precision in an information retrieval system. The noun-phrase analysis techniques are also potentially useful for book indexing and automatic thesaurus extraction.
Stylistic Experiments For Information Retrieval
, 2000
Cited by 47 (8 self)
Information retrieval systems are built to handle texts as topical items: texts are tabulated by occurrence frequencies of content words in them, under the assumption that text topic is reasonably well modeled by content word occurrence. But texts have several interesting characteristics beyond topic. The experiments described in this text investigate stylistic variation. Roughly put, style is the difference between two ways of saying the same thing -- and systematic stylistic variation can be used to characterize the genre of documents. These experiments investigate if stylistic information is distinguishable using simple language engineering methods, and if in that case this type of information can be used to improve information retrieval systems.
Translingual Information Retrieval: Learning from Bilingual Corpora
- Artificial Intelligence
, 1997
Cited by 34 (5 self)
Translingual information retrieval (TLIR) consists of providing a query in one language and searching document collections in one or more different languages. This paper introduces new TLIR methods and reports on comparative TLIR experiments with these new methods and with previously reported ones in a realistic setting. Methods fall into two categories: query translation and statistical-IR approaches establishing translingual associations. The results show that using bilingual corpora for automated extraction of term equivalences in context outperforms dictionary-based methods. Translingual versions of the Generalized Vector Space Model (GVSM) and Latent Semantic Indexing (LSI) also perform well, as do translingual pseudo relevance feedback (PRF) and Example-Based Term-in-context Translation (EBT). All showed relatively small performance loss between monolingual and translingual versions, ranging from 87% to 101% of monolingual IR performance. Query translation based on a general...
Writing CGI scripts in Tcl
- Proceedings of Tcl/Tk Workshop 96
, 1996
Cited by 11 (2 self)
CGI scripts enable dynamic generation of HTML pages. This paper describes how to write CGI scripts using Tcl. Many people use Tcl for this purpose already but in an ad hoc way and without realizing many of the more nonobvious benefits. This paper reviews these benefits and provides a framework and examples. Canonical solutions to HTML quoting problems are presented. This paper also discusses using Tcl for the generation of different formats from the same document. As an example, FAQ generation in both text and HTML is described.
Selection of Passages for Information Reduction
, 1996
Cited by 10 (5 self)
... is a huge manual undertaking, particularly when there are fifty or more texts. Unfortunately, full-text understanding is not yet feasible as an alternative and information extraction techniques themselves rely on large numbers of training texts with manually encoded answer keys. By locating and presenting relevant passages to the user, we will have significantly reduced the time and effort expenditure. Alternatively, we could save an automated information-extraction system from processing an entire text by focusing the system on those portions of the text most likely to contain the desired information. This work integrates a case-based reasoner with an IR engine to reduce the information bottleneck. (This research was supported by NSF Grant no. EEC-9209623, State/Industry/University Cooperative Research on Intelligent Information Retrieval, Digital Equipment Corporation and the National Center for Automated Information Research.) SPIRE [Selection of Passages for Inf...
Automatic Indexing: An Approach Using an Index Term Corpus and Combining Linguistic and Statistical Methods
, 2000
Cited by 6 (0 self)
This thesis discusses the problems and the methods of finding relevant information in large collections of documents. The contribution of this thesis to this problem is to develop better content analysis methods which can be used to describe document content with index terms. Index terms can be used as meta-information that describes documents, and that is used for seeking information. The main point of this thesis is to illustrate the process of developing an automatic indexer which analyses the content of documents by combining evidence from word frequencies and evidence from linguistic analysis provided by a syntactic parser. The indexer weights the expressions of a text according to their estimated importance for describing the content of a given document on the basis of the content analysis. The typical linguistic features of index terms were explored using a linguistically analysed text collection where the index terms are manually marked up. This text collection is referred to as an index term corpus. Specific features of the index terms provided the basis for a linguistic term-weighting scheme, which was then combined with a frequency-based term-weighting scheme. The use of an index term corpus like this as training material is a new method of developing an automatic indexer. The results of the experiments were promising.
Combining Evidence for Effective Information Filtering
- In AAAI Spring Symposium on Machine Learning and Information Retrieval
, 1996
Cited by 6 (0 self)
As part of NIST/ARPA's TREC Workshop, we used Latent Semantic Indexing (LSI) for filtering 336k incoming documents from diverse sources (newswires, patents, technical abstracts) for 50 topics of interest. We developed representations of user interests, or filters, for these topics using two sources of training information. A Word Filter used just the words in the topic statements, and a RelDocs Filter used just the known relevant training documents and ignored the topic statement. Using the relevant training documents (a variant of relevance feedback) was more effective than using a detailed natural language description of interests. Combining these two vectors provided some additional improvements in filtering. On average, 7 of the top 10 documents and 44 of the top 100 documents are relevant using the combined vector method. Data combination using the results of the Word and RelDocs retrieval sets was not generally successful in improving performance compared to the best individual me...
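The idea of combining two filter vectors and then counting relevant documents among the top-ranked ones can be illustrated with a minimal sketch. It assumes plain sparse term-weight dictionaries, not the LSI vectors used in the paper, and the equal weighting (`alpha=0.5`) and all function names are hypothetical choices made for illustration only.

```python
import math


def normalize(vec):
    """Scale a sparse term-weight vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def combine(word_vec, reldocs_vec, alpha=0.5):
    """Weighted sum of two unit-normalized filter vectors.

    alpha is an assumed mixing weight; the abstract does not state
    how the two vectors were weighted.
    """
    a, b = normalize(word_vec), normalize(reldocs_vec)
    return {
        t: alpha * a.get(t, 0.0) + (1 - alpha) * b.get(t, 0.0)
        for t in set(a) | set(b)
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k
```

Under this reading, "7 of the top 10 documents are relevant" corresponds to `precision_at_k(ranking, relevant, 10)` returning 0.7, where `ranking` would come from sorting incoming documents by `cosine` against the combined filter.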
A PADRE in MUFTI (A Multi User Free Text retrieval Intermediary)
- in Proceedings of the Fourth Parallel Computing Workshop paper 26
, 1995
Cited by 2 (1 self)
The Parallel Document Retrieval Engine (PADRE) has hitherto lacked the ability to support multiple time-sharing searchers, a deficiency which detracted from its cost effectiveness. Extensions, adjuncts and performance improvements are now proposed to permit multiple use and to ensure reasonable response times. This paper describes the proposed multi-user architecture, reports query-processing speed-ups, and outlines a number of alternative types of user-interaction client, including an interface to network browsers in common use. Implementation progress is reported and potential load handling capacity is discussed.
KEYWORDS: Text retrieval, information retrieval, parallel computing.
1 Introduction
The Parallel Document Retrieval Engine (PADRE) has previously demonstrated that full text scanning methods supported by parallel hardware permit powerful query constructors and rapid response to changing document collections [3, 5, 6]. The addition of parallel disk-resident inverted file indexes ...
An Evaluation of Statistical Approaches to Text Categorization
- Journal of Information Retrieval
, 1999
Cited by 1 (0 self)
This paper is a comparative study of text categorization methods. Fourteen methods are investigated, based on previously published results and newly obtained results from additional experiments. Corpus biases in commonly used document collections are examined using the performance of three classifiers. Problems in previously published experiments are analyzed, and the results of flawed experiments are excluded from the cross-method evaluation. As a result, eleven of the fourteen methods remain. A k-nearest neighbor (kNN) classifier was chosen for the performance baseline on several collections; on each collection, the performance scores of other methods were normalized using the score of kNN. This provides a common basis for a global observation on methods whose results are only available on individual collections. Widrow-Hoff, k-nearest neighbor, neural networks and the Linear Least Squares Fit mapping are the top-performing classifiers, while the Rocchio approaches had rela...
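The baseline normalization described in this abstract can be sketched as a per-collection division by the kNN score. This is one plausible reading, assuming normalization is a simple ratio; the function name, the collection and method labels, and the numbers in the usage note are invented for illustration and are not results from the paper.

```python
def normalize_by_baseline(scores, baseline="kNN"):
    """Divide every method's score by the baseline score on the same collection.

    scores maps collection name -> {method name -> raw score}.
    Collections without a usable (non-zero) baseline score are skipped,
    since their methods cannot be placed on the common scale.
    """
    normalized = {}
    for collection, by_method in scores.items():
        base = by_method.get(baseline)
        if not base:
            continue
        normalized[collection] = {
            method: score / base for method, score in by_method.items()
        }
    return normalized
```

For example, with made-up scores `{"collection_A": {"kNN": 0.8, "method_X": 0.4}}`, `method_X` is reported as half the kNN baseline on `collection_A`, which lets it be compared with methods that were only evaluated on other collections.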

