Inverted files for text search engines
ACM Computing Surveys, 2006
Cited by 136 (2 self)
The technology underlying text search engines has advanced dramatically in the past decade. The development of a family of new index representations has led to a wide range of innovations in index storage, index construction, and query evaluation. While some of these developments have been consolidated in textbooks, many specific techniques are not widely known or the textbook descriptions are out of date. In this tutorial, we introduce the key techniques in the area, describing both a core implementation and how the core can be enhanced through a range of extensions. We conclude with a comprehensive bibliography of text indexing literature.
UMass at TREC 2004: Novelty and HARD
In Proceedings of TREC-13, 2004
Cited by 16 (4 self)
For the TREC 2004 Novelty track, UMass participated in all four tasks. Although finding relevant sentences was harder this year than last, we continue to show marked improvements over the baseline of calling all sentences relevant, with a variant of tfidf being the most successful approach. We achieve 5–9% improvements over the baseline in locating novel sentences, primarily by looking at the similarity of a sentence to earlier sentences and focusing on named entities. For the High Accuracy Retrieval from Documents (HARD) track, we investigated the use of clarification forms, fixed- and variable-length passage retrieval, and the use of metadata. Clarification form results indicate that passage-level feedback can provide improvements comparable to user-supplied related-text for document evaluation and outperforms related-text for passage evaluation. Document retrieval methods without a query expansion component show the most gains from related-text. We also found that displaying the top passages for feedback outperformed displaying centroid passages. Named entity feedback resulted in mixed performance. Our primary findings for passage retrieval are that document retrieval methods performed better than passage retrieval methods on the passage evaluation metric of binary preference at 12,000 characters, and that clarification forms improved passage retrieval for every retrieval method explored. We found no benefit to using variable-length passages over fixed-length passages for this corpus. Our use of geography and genre metadata resulted in no significant changes in retrieval performance.
Information Retrieval: A Survey
2000
Cited by 14 (0 self)
Information Retrieval (IR) is the discipline that deals with retrieval of unstructured data, especially textual documents, in response to a query or topic statement, which may itself be unstructured, e.g., a sentence or even another document, or which may be structured, e.g., a boolean expression. The need for effective methods of automated IR has grown in importance because of the tremendous explosion in the amount of unstructured data, both internal, corporate document collections, and the immense and growing number of document sources on the Internet. This report is a tutorial and survey of the state of the art, both research and commercial, in this dynamic field. The topics covered include: formulation of structured and unstructured queries and topic statements, indexing (including term weighting) of document collections, methods for computing the similarity of queries and documents, classification and routing of documents in an incoming stream to users on the basis of topic or nee...
UMass at TREC 2004: Notebook
In TREC 2004, 2004
Cited by 8 (2 self)
The retrieval model implemented in the Indri search engine is an enhanced version of the model described in [30], which combines the language modeling [35] and inference network [38] approaches to information retrieval. The resulting model allows structured queries similar to those used in INQUERY [4] to be evaluated using language modeling estimates within the network, rather than tf.idf estimates. Figure 1.1 shows a graphical model representation of the network. As in the original inference network framework, documents are ranked according to P(I|D, α, β), the belief the information need I is met given document D and hyperparameters α and β as evidence. Due to space limitations, a general understanding of the inference network framework is assumed. See [30] and [38] to fill in any missing details.
Passage retrieval and evaluation
2005
Cited by 6 (0 self)
Information retrieval researchers have studied passage retrieval extensively, yet there is no consensus within the community about how to evaluate the results of passage retrieval experiments. This paper describes five character-level passage evaluation measures and tasks for which they may be appropriate. In the second half of the paper we compare several passage retrieval models, including a new generative mixture model that outperforms strong baselines on many of the evaluation measures discussed in part one.
Experimental evaluation of passage-based document retrieval
In Proceedings of the Sixth International Conference on Document Analysis and Recognition (ICDAR’01), 2001
Cited by 5 (4 self)
Retrieval of electronic documents is a fundamental component for intelligent access to the contents of documents. Although the history of its research is long, it is still not a trivial task, in particular when we retrieve long documents with short queries. For the retrieval of long documents, a method called passage-based document retrieval has proven to be effective. In this paper, we experimentally show that passage-based retrieval is also advantageous for dealing with short queries, provided that documents are long. We employ a passage-based method based on density distributions of query terms in documents, and compare it with three conventional methods: the vector space model, pseudo-feedback and latent semantic indexing.
On The Design Of Reliable Efficient Information Systems
2001
Cited by 5 (0 self)
Fast and Furious Text Mining
Cited by 5 (1 self)
Text mining studies in biology are often limited to thousands instead of millions of Medline records or are very slow. However, with a modified search engine, many common text mining tasks can be done rapidly. In fact, some information extraction and text categorization tasks can be achieved in seconds or minutes even across tens of gigabytes of (previously indexed) text. In this paper, we present TLM, an efficient implementation of a text analysis engine that uses a highly expressive query language. With this language, users can create queries that quickly accomplish what previously required several different custom-built systems to achieve.
Melbourne TREC-9 Experiments
In Proceedings of the Ninth Text Retrieval Conference (TREC-9), 2001
Cited by 4 (0 self)
Submitted to the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, USA, November 13–16, 2000.
We report results for experiments conducted in Melbourne, at CSIRO, RMIT, and The University of Melbourne, for TREC-9. We present results for the interactive track, cross-lingual track, main web track, and the query track. We have been continuously investigating technologies for delivering retrieved documents to support interactive question answering. In this year's interactive track, we focused on the role of a document surrogate in the interactive fact finding task. In this experiment, we compared two types of document surrogates in the two experimental systems. One system uses the document title and the first 20 words of a document as the document's surrogate, while the other system uses the document title and the best three Answer Indicative Sentences extracted from the document as the document's surrogate. The results show
Multi-User File System Search
2007
Cited by 4 (1 self)
Information retrieval research usually deals with globally visible, static document collections. Practical applications, in contrast, like file system search and enterprise search, have to cope with highly dynamic text collections and have to take into account user-specific access permissions when generating the results to a search query. The goal of this thesis is to close the gap between information retrieval research and the requirements exacted by these real-life applications. The algorithms and data structures presented in this thesis can be used to implement a file system search engine that is able to react to changes in the file system by updating its index data in real time. File changes (insertions, deletions, or modifications) are reflected by the search results within a few seconds,

