Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval
- In Proceedings of SIGIR’94, 1994
Cited by 289 (9 self)
The 2–Poisson model for term frequencies is used to suggest ways of incorporating certain variables in probabilistic models for information retrieval. The variables concerned are within-document term frequency, document length, and within-query term frequency. Simple weighting functions are developed, and tested on the TREC test collection. Considerable performance improvements (over simple inverse collection frequency weighting) are demonstrated.
Understanding inverse document frequency: On theoretical arguments for IDF
- Journal of Documentation, 2004
Cited by 55 (1 self)
The term weighting function known as IDF was proposed in 1972, and has since been extremely widely used, usually as part of a TF*IDF function. It is often described as a heuristic, and many papers have been written (some based on Shannon’s Information Theory) seeking to establish some theoretical basis for it. Some of these attempts are reviewed, and it is shown that the Information Theory approaches are problematic, but that there are good theoretical justifications of both IDF and TF*IDF in the traditional probabilistic model of information retrieval.
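As an illustration of the TF*IDF function mentioned in this abstract, here is a minimal Python sketch. It assumes the common textbook form, IDF(t) = log(N / n_t), with raw term counts as TF; the function names, the toy documents, and the choice to give unseen terms zero weight are assumptions for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def idf(term, docs):
    """Textbook IDF: log(N / n_t), where n_t counts documents containing the term."""
    n_t = sum(1 for doc in docs if term in doc)
    if n_t == 0:
        return 0.0  # assumption: a term absent from the collection gets zero weight
    return math.log(len(docs) / n_t)


def tf_idf(term, doc, docs):
    """Raw within-document term frequency multiplied by IDF."""
    return Counter(doc)[term] * idf(term, docs)


# Toy collection (hypothetical), each document already tokenized.
docs = [
    ["probabilistic", "retrieval", "model"],
    ["term", "weighting", "retrieval"],
    ["meeting", "summarization"],
]

weight_common = tf_idf("retrieval", docs[0], docs)  # term occurs in 2 of 3 documents
weight_rare = tf_idf("meeting", docs[2], docs)      # term occurs in 1 of 3 documents
```

In this sketch the rarer term receives the larger weight, which is the behaviour IDF is meant to capture.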
Term-weighting for summarization of multi-party spoken dialogues
- In: Proc. of MLMI 2007, 2007
Cited by 6 (5 self)
This paper explores the issue of term-weighting in the genre of spontaneous, multi-party spoken dialogues, with the intent of using such term-weights in the creation of extractive meeting summaries. The field of text information retrieval has yielded many term-weighting techniques to import for our purposes; this paper implements and compares several of these, namely tf.idf, Residual IDF and Gain. We propose that term-weighting for multi-party dialogues can exploit patterns in word usage among participant speakers, and introduce the su.idf metric as one attempt to do so. Results for all metrics are reported on both manual and automatic speech recognition (ASR) transcripts, and on both the ICSI and AMI meeting corpora.
Discovery of Similarity Computations of Search Engines
- In Proceedings of the 9th International Conference on Information and Knowledge Management (CIKM). ACM, 2000
Cited by 4 (1 self)
Two typical situations in which it is of practical interest to determine the similarities of text documents to a query, as computed by a search engine, are: (1) a global search engine, constructed on top of a group of local search engines, wishes to retrieve the set of local documents globally most similar to a given query; and (2) an organization wants to compare the retrieval performance of search engines. The dot-product function is a widely used similarity function. For a search engine using such a function, we can determine its similarity computations if we know how the search engine sets the weights of terms, which is usually not the case. In this paper, techniques are presented to discover certain mathematical expressions of these formulas and the values of embedded constants when the dot-product similarity function is used. Preliminary results from experiments on the WebCrawler search engine are given to illustrate our techniques.
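For readers unfamiliar with the dot-product similarity function named in this abstract, a minimal Python sketch follows. It only shows the similarity itself, given term weights that are already known; it does not attempt the paper's discovery techniques. The function name, the sparse-dictionary representation, and all weights are hypothetical values chosen for illustration.

```python
def dot_product_similarity(query_weights, doc_weights):
    """Sum of products of query and document weights over the query's terms."""
    return sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())


# Hypothetical term weights; a real engine would derive these from its own formulas.
query = {"search": 1.0, "engine": 1.0}
documents = {
    "doc_a": {"search": 0.5, "engine": 0.25, "web": 0.75},
    "doc_b": {"search": 0.25, "crawler": 0.5},
}

# Score every document against the query and rank by descending similarity.
scores = {name: dot_product_similarity(query, weights) for name, weights in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Terms that appear in only one of the two vectors contribute nothing to the score, so the ranking depends entirely on how the shared terms are weighted.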
York University at TREC 2006: Legal Track
York University participated in the legal track this year. For this track, we developed an Okapi-based Legal Search Engine (LSE) v1.0. Our experiments mainly focused on evaluating the effect of a probabilistic text retrieval model on the legal domain. In order to address the special problems in legal text retrieval, new automatic feedback methods and term weighting methods are proposed and tested.
Using Speech-Specific Characteristics for . . .
- 2008
In this thesis we address the challenge of automatically summarizing spontaneous, multi-party spoken dialogues. The experimental hypothesis is that it is advantageous when summarizing such meeting speech to exploit a variety of speech-specific characteristics, rather than simply treating the task as text summarization with a noisy transcript. We begin by investigating which term-weighting metrics are effective for summarization of meeting speech, with the inclusion of two novel metrics designed specifically for multi-party dialogues. We then provide an in-depth analysis of useful multi-modal features for summarization, including lexical, prosodic, speaker, and structural features. A particular type of speech-specific information we explore is the presence of meta comments in meeting speech, which can be exploited to make extractive summaries more high-level and increasingly abstractive in quality. We conduct our experiments on the AMI and ICSI meeting corpora, illustrating how informative utterances can be realized in contrasting ways in differing domains of meeting speech. Our central summarization evaluation is a large-scale extrinsic task, a decision audit evaluation. In this evaluation, we explicitly compare the usefulness of extractive summaries to gold-standard abstracts and a baseline keyword condition for navigating through a large amount of meeting data in order to satisfy a complex information need.
Query Expansion for Noisy Legal Documents
The vocabulary of the TREC Legal OCR collection is noisy and huge. Standard techniques for improving retrieval performance, such as content-based query expansion, are ineffective for such a document collection. In our work, we focused on exploiting metadata using blind relevance feedback, iterative improvement from the reference Boolean run, and the effects of using terms from different topic fields for automatic query formulation. This paper describes our methodologies and results.
Utilizing User-input Contextual Terms for Query Disambiguation
Precision-oriented search results such as those typically returned by the major search engines are vulnerable to issues of polysemy. When the same term refers to different things, the dominant sense is preferred in the rankings of search results. In this paper, we propose a novel technique in the context of web search that utilizes contextual terms provided by users for query disambiguation, making it possible to prefer other senses without altering the original query.
Estimating Relevance for the Emergency
- IEEE Transactions on Intelligent Transportation Systems
In this paper, we compare two methods of estimating relevance for the emergency electronic brake light application: One method uses an analytically derived formula based on the minimum safety gap that is required to avoid a collision, whereas the other method uses a machine learning approach. The application works by disseminating reports about vehicles that perform emergency deceleration in an effort to warn drivers about the need to perform emergency braking. Vehicles that receive such reports have to decide on whether the information contained in the report is relevant to the driver and warn the driver if that is the case. Common ways of determining relevance are based on the lane or direction information, but using only these attributes can lead to many false warnings, which can desensitize the driver. Desensitized drivers may ignore warnings or completely turn off the system, thus eliminating any safety benefits of the application. We show that the machine learning method, compared with the analytically derived formula, can significantly reduce the number of false warnings by learning from the actions that drivers take after receiving a report. The methods were compared using simulated experiments with a range of traffic and communication parameters. Index Terms: machine learning, vehicle safety.

