Results 1 - 10
of
22
Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval
, 1998
"... The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval. We review some of the variations of naive Bayes models used for text retrieval and classification, focusing on the distributional assump- tions made abou ..."
Abstract
-
Cited by 268 (1 self)
- Add to MetaCart
The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval. We review some of the variations of naive Bayes models used for text retrieval and classification, focusing on the distributional assump- tions made about word occurrences in documents.
How effective is suffixing
- Journal of the American Society for Information Science
, 1991
"... The interaction of suffixing algorithms and ranking techniques in retrieval performance, particularly in an online environment, was investigated. Three general purpose suffixing algorithms were used for retrieval on the Cranfield 1400, Medlars, and CACM test collections, with no significant improvem ..."
Abstract
-
Cited by 118 (0 self)
- Add to MetaCart
The interaction of suffixing algorithms and ranking techniques in retrieval performance, particularly in an online environment, was investigated. Three general purpose suffixing algorithms were used for retrieval on the Cranfield 1400, Medlars, and CACM test collections, with no significant improvement in performance shown for any of the algorithms. A failure analysis suggested three modifications to ranking techniques: variable weighting of term variants, selective stemming depend-ing on query length, and selective stemming depending on term importance. None of these modifications im-proved performance. Recommendations are made re-garding the uses of suffixing in an online environment. introduction Traditional statistically based keyword retrieval systems have been the subject of experiments for over 30 years. The use of simple keyword matching as a basis for re-trieval can produce acceptable results, and the addition of ranking techniques based on the frequency of a given matching term within a document collection and/or within a given document adds considerable improvement (Sparck Jones, 1972; Salton, 1983). The conflation of word variants using suffixing al-gorithms was one of the earliest enhancements to statistical keyword retrieval systems (Salton, 1971), and has become so standard a part of most systems that many system descriptions neglect to mention the use of suffixing, or to identify the algorithm was used. Suffixing was originally done for two principle reasons: the large reduction in stor-age required by a retrieval dictionary (Bell, 1979), and the increase in performance due to the use of word variants. Recent research has been more concerned with perfor-mance improvement than with storage reduction.
Lexical Ambiguity and Information Retrieval
- ACM TRANSACTIONS ON INFORMATION SYSTEMS
, 2000
"... Lexical ambiguity is a pervasive problem in natural language processing. However, little quantitative information is available about the extent of the problem, or about the impact that it has on information retrieval systems. We report on an analysis of lexical ambiguity in information retrieval ..."
Abstract
-
Cited by 113 (3 self)
- Add to MetaCart
Lexical ambiguity is a pervasive problem in natural language processing. However, little quantitative information is available about the extent of the problem, or about the impact that it has on information retrieval systems. We report on an analysis of lexical ambiguity in information retrieval test collections, and on experiments to determine the utility of word meanings for separating relevant from non-relevant documents. The experiments show that there is considerable ambiguity even in a specialized database. Word senses
A Probabilistic Learning Approach for Document Indexing
- ACM TRANSACTIONS ON INFORMATION SYSTEMS
, 1991
"... We describe a method for probabilistic document indexing using relevance feedback data that has been collected from a set of queries. Our approach is based on three new concepts: (1) Abstraction from specific terms and documents, which overcomes the restriction of limited relevance information fo ..."
Abstract
-
Cited by 84 (12 self)
- Add to MetaCart
We describe a method for probabilistic document indexing using relevance feedback data that has been collected from a set of queries. Our approach is based on three new concepts: (1) Abstraction from specific terms and documents, which overcomes the restriction of limited relevance information for parameter estimation. (2) Flexibility of the representation, which allows the integration of new text analysis and knowledge-based methods in our approach as well as the consideration of document structures or different types of terms. (3) Probabilistic learning or classification methods for the estimation of the indexing weights making better use of the available relevance information. Our approach can be applied under restrictions that hold for real applications. We give experimental results for five test collections which show improvements over other indexing methods.
Models for retrieval with probabilistic indexing
- Information Processing and Management
, 1989
"... Abstract- in this article three retrieval models for probabilistic indexing are described along with evaluation results for each. First is the binary independence indexing @II) model, which is a generalized version of the Maron and Kuhns indexing model. In this model, the indexing weight of a descri ..."
Abstract
-
Cited by 78 (14 self)
- Add to MetaCart
Abstract- in this article three retrieval models for probabilistic indexing are described along with evaluation results for each. First is the binary independence indexing @II) model, which is a generalized version of the Maron and Kuhns indexing model. In this model, the indexing weight of a descriptor in a document is an estimate of the proba-bility of relevance of this document with respect to queries using this descriptor. Sec-ond is the retrieval-with-probabilistic-indexing (RPI) model, which is suited to different kinds of probabilistic indexing. For that we assume that each indexing scheme has its own concept of “correctness ” to which the probabilities relate. In addition to the prob-abilistic indexing weights, the RPI model provides the possibility of reIevance weight-ing of search terms. A third mode1 that is similar was proposed by Croft some years ago as an extension of the binary independence retrieval model but it can be shown that this model is not based on the probabilistic ranking principle. The probabilistic indexing weights required for any of these models can be provided by an application of the Darm-stadt indexing approach (DIA) for indexing with descriptors from a controlled vocabu-Iary. The experimental results show signi~cant improvements over retrieval with binary indexing. Finally, suggestions are made regarding how the DIA can be applied to prob-abilistic indexing with free text terms. 1.
Text categorization of low quality images
- In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval
, 1995
"... Categorization of text images into content-oriented classes would be a useful capability in a variety of document handling systems. Many methods can be usedtocategorize texts once their words are known, but OCR can garble a large proportion of words, particularly when low quality images are used. De ..."
Abstract
-
Cited by 52 (2 self)
- Add to MetaCart
Categorization of text images into content-oriented classes would be a useful capability in a variety of document handling systems. Many methods can be usedtocategorize texts once their words are known, but OCR can garble a large proportion of words, particularly when low quality images are used. Despite this, we show for one data set that fax quality images can be categorized with nearly the same accuracy as the original text. Further, the categorization system can be trained on noisy OCR output, without need for the true text of any image, or for editing of OCR output. The useofavector space classi er and training method robust to large feature sets, combined with discarding of low frequency OCR output strings are the key to our approach. 1
Limited-Memory Matrix Methods with Applications
, 1997
"... Abstract. The focus of this dissertation is on matrix decompositions that use a limited amount of computer memory � thereby allowing problems with a very large number of variables to be solved. Speci�cally � we will focus on two applications areas � optimization and information retrieval. We introdu ..."
Abstract
-
Cited by 28 (6 self)
- Add to MetaCart
Abstract. The focus of this dissertation is on matrix decompositions that use a limited amount of computer memory � thereby allowing problems with a very large number of variables to be solved. Speci�cally � we will focus on two applications areas � optimization and information retrieval. We introduce a general algebraic form for the matrix update in limited�memory quasi� Newton methods. Many well�known methods such as limited�memory Broyden Family meth� ods satisfy the general form. We are able to prove several results about methods which sat� isfy the general form. In particular � we show that the only limited�memory Broyden Family method �using exact line searches � that is guaranteed to terminate within n iterations on an n�dimensional strictly convex quadratic is the limited�memory BFGS method. Further� more � we are able to introduce several new variations on the limited�memory BFGS method that retain the quadratic termination property. We also have a new result that shows that full�memory Broyden Family methods �using exact line searches � that skip p updates to the quasi�Newton matrix will terminate in no more than n�p steps on an n�dimensional strictly convex quadratic. We propose several new variations on the limited�memory BFGS method
Detection of Heterogeneities in a Multiple Text Database Environment
- IN PROCEEDINGS OF THE FOURTH IFCIS INTERNATIONAL CONFERENCE ON COOPERATIVE INFORMATION SYSTEMS
, 1999
"... As the number of text retrieval systems (search engines) grows rapidly on the World Wide Web, there is an increasing need to build search brokers (metasearch engines) on top of them. Often, the task of building an effective and efficient metasearch engine is hindered by the heterogeneities among the ..."
Abstract
-
Cited by 20 (7 self)
- Add to MetaCart
As the number of text retrieval systems (search engines) grows rapidly on the World Wide Web, there is an increasing need to build search brokers (metasearch engines) on top of them. Often, the task of building an effective and efficient metasearch engine is hindered by the heterogeneities among the underlying local search engines. In this paper, we first analyze the impact of various heterogeneities on building a metasearch engine. We then present some techniques that can be used to detect the most prominent heterogeneities among multiple search engines. Applications of utilizing the detected heterogeneities in building better metasearch engines will be provided.
Searching and browsing collections of structural information
- In IEEE Advances in Digital Libraries (ADL’2000
, 1997
"... This paper proposes a new approach to querying collections of structured textual information such as SGML/XML documents. Knowledge about the structure of documents is an additional resource that should be exploited during retrieval since the semantics of the different textual objects can be used to ..."
Abstract
-
Cited by 19 (0 self)
- Add to MetaCart
This paper proposes a new approach to querying collections of structured textual information such as SGML/XML documents. Knowledge about the structure of documents is an additional resource that should be exploited during retrieval since the semantics of the different textual objects can be used to specify an information need much more precisely. However, the traditional probabilistic retrieval model lacks the ability to handle structural information. We define a new retrieval function based on the probabilistic model which overcomes this drawback. The presented query language allows the assignment of structural roles to individual terms. The efficient evaluation of queries in this framework requires appropriate index structures. We design text and structure indexes and show how their information is combined during evaluation. The implementation supports additional functionalities such as a table of contents for browsing. First evaluation results show the feasibility of the approach on collections of unstructured documents. 1

