A Probabilistic Model of Information Retrieval: Development and Status
, 1998
Cited by 206 (16 self)
The paper combines a comprehensive account of the probabilistic model of retrieval with new systematic experiments on TREC Programme material. It presents the model from its foundations through its logical development to cover more aspects of retrieval data and a wider range of system functions. Each step in the argument is matched by comparative retrieval tests, to provide a single coherent account of a major line of research. The experiments demonstrate, for a large test collection, that the probabilistic model is effective and robust, and that it responds appropriately, with major improvements in performance, to key features of retrieval situations.
Probabilistic Models for Information Retrieval based on Divergence from Randomness
- ACM Transactions on Information Systems
, 2002
Cited by 111 (5 self)
We introduce and create a framework for deriving probabilistic models of Information Retrieval. The models are nonparametric models of IR obtained in the language model approach. We derive term-weighting models by measuring the divergence of the actual term distribution from that obtained under a random process. Among the random processes we study the binomial distribution and Bose–Einstein statistics. We define two types of term frequency normalization for tuning term weights in the document–query matching process. The first normalization assumes that documents have the same length and measures the information gain with the observed term once it has been accepted as a good descriptor of the observed document. The second normalization is related to the document length and to other statistics. These two normalization methods are applied to the basic models in succession to obtain weighting formulae. Results show that our framework produces different nonparametric models forming baseline alternatives to the standard tf-idf model.
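The weighting scheme outlined in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a binomial random process with p = 1/N, a Laplace-style gain of 1/(tfn + 1) for the first normalization, and a logarithmic document-length normalization with a free parameter c for the second. These specific formulas, the function names, and the sample numbers are assumptions that the abstract itself does not state.

```python
import math


def binomial_information(tf, total_occurrences, num_docs):
    """Information content, in bits, of seeing `tf` occurrences in one document.

    Assumption: the term's `total_occurrences` tokens are scattered at random
    over `num_docs` documents, so each token lands in a given document with
    probability p = 1 / num_docs (a binomial random process).
    lgamma is used so that non-integer normalized frequencies are accepted.
    """
    p = 1.0 / num_docs
    log_comb = (
        math.lgamma(total_occurrences + 1)
        - math.lgamma(tf + 1)
        - math.lgamma(total_occurrences - tf + 1)
    )
    log_prob = (
        log_comb + tf * math.log(p) + (total_occurrences - tf) * math.log(1.0 - p)
    )
    return -log_prob / math.log(2)


def length_normalized_tf(tf, doc_len, avg_doc_len, c=1.0):
    """Rescale tf to what it would be in a document of average length.

    Assumption: logarithmic normalization with a hypothetical tuning
    parameter `c`.
    """
    return tf * math.log2(1.0 + c * avg_doc_len / doc_len)


def dfr_weight(tf, doc_len, avg_doc_len, total_occurrences, num_docs, c=1.0):
    """Term weight = information gain * divergence from randomness."""
    tfn = length_normalized_tf(tf, doc_len, avg_doc_len, c)
    gain = 1.0 / (tfn + 1.0)  # assumed Laplace-style gain
    return gain * binomial_information(tfn, total_occurrences, num_docs)


# Hypothetical collection: 1000 documents, the term occurs 50 times in total.
frequent = dfr_weight(tf=5, doc_len=100, avg_doc_len=100,
                      total_occurrences=50, num_docs=1000)
rare = dfr_weight(tf=1, doc_len=100, avg_doc_len=100,
                  total_occurrences=50, num_docs=1000)
long_doc = dfr_weight(tf=5, doc_len=400, avg_doc_len=100,
                      total_occurrences=50, num_docs=1000)
```

Under these assumptions, a term that occurs far more often in a document than chance predicts receives a larger weight, and the same raw frequency in a much longer document is discounted.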
Matching and Record Linkage
- Business Survey Methods
, 1995
Cited by 77 (14 self)
INTRODUCTION Matching has a long history of uses in statistical surveys and administrative data development. A business register consisting of names, addresses, and other identifying information such as total financial receipts might be constructed from tax and employment data bases (see chapters by Colledge, Nijhowne, and Archer). A survey of retail establishments or agricultural establishments might combine results from an area frame and a list frame. To produce a combined estimator, units from the area frame would need to be identified in the list frame (see Vogel-Kott chapter). To estimate the size of a (sub)population via capture-recapture techniques, one needs to accurately determine units common to two or more independent listings (Sekar and Deming 1949; Scheuren 1983; Winkler 1989b). Samples must be drawn appropriately to estimate overlap (Deming and Gleser 1959). Rather than develop a special survey to collect data for policy decisions, it might be more appropriate t
A Risk Minimization Framework for Information Retrieval
- In Proceedings of the ACM SIGIR 2003 Workshop on Mathematical/Formal Methods in IR. ACM
, 2003
Cited by 36 (1 self)
This paper presents a novel probabilistic information retrieval framework in which the retrieval problem is formally treated as a statistical decision problem. In this framework, queries and documents are modeled using statistical language models (i.e., probabilistic models of text), user preferences are modeled through loss functions, and retrieval is cast as a risk minimization problem. We discuss how this framework can unify existing retrieval models and accommodate the systematic development of new retrieval models. As an example of using the framework to model non-traditional retrieval problems, we derive new retrieval models for subtopic retrieval, which is concerned with retrieving documents to cover many different subtopics of a general query topic. These new models differ from traditional retrieval models in that they go beyond independent topical relevance.
Inferring Informational Goals from Free-Text Queries: A Bayesian Approach
- In Proceedings of Fourteenth Conference on Uncertainty in Artificial Intelligence
, 1998
Cited by 30 (4 self)
People using consumer software applications typically do not use technical jargon when querying an online database of help topics. Rather, they attempt to communicate their goals with common words and phrases that describe software functionality in terms of structure and objects they understand. We describe a Bayesian approach to modeling the relationship between words in a user’s query for assistance and the informational goals of the user. After reviewing the general method, we describe several extensions that center on integrating additional distinctions and structure about language usage and user goals into the Bayesian models.
Methods for evaluating and creating data quality
- Information Systems
, 2003
Cited by 19 (2 self)
This paper provides a survey of two classes of methods that can be used in determining and improving the quality of individual files or groups of files. The first are edit/imputation methods for maintaining business rules and for imputing for missing data. The second are methods of data cleaning for finding duplicates within files or across files. Published by Elsevier Ltd.
Assessing Deduplication and Data Linkage Quality: What to Measure
- In Proc. of the 2005 Australian Conf. on Data Mining
, 2005
Cited by 1 (0 self)
Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when deduplicating or linking very large data sets. Different measures have been used to characterise the quality of data linkage algorithms. This paper presents an overview of the issues involved in measuring deduplication and data linkage quality, and it is shown that measures in the space of record pair comparisons can produce deceptive accuracy results. Various measures are discussed and recommendations are given on how to assess deduplication and data linkage quality.
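The point about deceptive accuracy in the record-pair space can be illustrated with a small sketch. The numbers below are hypothetical and are not taken from the paper: two data sets of 1,000 records each give 1,000,000 candidate pairs, of which at most 1,000 are true matches, so the true non-matches dominate any measure that counts them.

```python
def pair_quality(true_pos, false_pos, false_neg, true_neg):
    """Compute common quality measures over classified record pairs."""
    total = true_pos + false_pos + false_neg + true_neg
    predicted_matches = true_pos + false_pos
    actual_matches = true_pos + false_neg
    accuracy = (true_pos + true_neg) / total
    precision = true_pos / predicted_matches if predicted_matches else 0.0
    recall = true_pos / actual_matches if actual_matches else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical linkage of two 1,000-record data sets (1,000,000 pairs,
# 1,000 true matches). The classifier finds 600 true matches and also
# reports 200 false matches.
n_pairs = 1000 * 1000
tp, fp, fn = 600, 200, 400
tn = n_pairs - tp - fp - fn

linkage = pair_quality(tp, fp, fn, tn)

# A degenerate classifier that declares every pair a non-match.
all_non_match = pair_quality(0, 0, 1000, n_pairs - 1000)
```

In this hypothetical case accuracy stays above 0.999 both for the imperfect linkage and for the classifier that finds no matches at all, while precision and recall, which ignore the true non-matches, separate the two clearly.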
Chapter ADVANCES IN COMPUTERS 10/6/2003
Cognitive Hacking In this chapter, we define and propose countermeasures for a category of computer security exploits which we call "cognitive hacking." Cognitive hacking refers to a computer or information system attack that relies on changing human users' perceptions and corresponding behaviors in order to be successful. This is in contrast to denial of service (DOS) and other kinds of well-known attacks that operate solely within the computer and network infrastructure. Examples are given of several cognitive hacking techniques, and a taxonomy for these types of attacks is developed. Legal, economic, and

