Connections: using context to enhance file search
- In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP ’05)
, 2005
Cited by 43 (3 self)
Connections is a file system search tool that combines traditional content-based search with context information gathered from user activity. By tracing file system calls, Connections can identify temporal relationships between files and use them to expand and reorder traditional content search results. Doing so improves both recall (reducing false negatives) and precision (reducing false positives). For example, Connections improves the average recall (from 13% to 22%) and precision (from 23% to 29%) on the first ten results. When averaged across all recall levels, Connections improves precision from 17% to 28%. Connections provides these benefits with only modest increases in average query time (2 seconds), indexing time (23 seconds daily), and index size (under 1% of the user’s data set).
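The recall and precision figures "on the first ten results" quoted above are cutoff-based measures. As a minimal sketch of how such numbers are typically computed for one query (the function name and the toy file names are illustrative assumptions, not taken from the paper):

```python
def precision_recall_at_k(ranked, relevant, k=10):
    """Return (precision@k, recall@k) for a single query.

    ranked:   list of result identifiers, best first
    relevant: set of identifiers judged relevant for the query
    """
    top_k = ranked[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    # Precision: fraction of the k returned slots that hold a relevant item.
    precision = hits / k if k > 0 else 0.0
    # Recall: fraction of all relevant items that appear in the top k.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four results, three relevant files overall.
ranked = ["a.txt", "b.txt", "c.txt", "d.txt"]
relevant = {"a.txt", "d.txt", "z.txt"}
p, r = precision_recall_at_k(ranked, relevant, k=2)
```

Here one of the top two results is relevant, so precision at 2 is 1/2 and recall at 2 is 1/3. Reported averages such as those in the abstract would be means of these per-query values over a query set.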
Regularizing ad hoc retrieval scores
, 2005
Cited by 31 (1 self)
The cluster hypothesis states: closely related documents tend to be relevant to the same request. We exploit this hypothesis directly by adjusting ad hoc retrieval scores from an initial retrieval so that topically related documents receive similar scores. We refer to this process as score regularization. Score regularization can be presented as an optimization problem, allowing the use of results from semisupervised learning. We demonstrate that regularized scores consistently and significantly rank documents better than un-regularized scores, given a variety of initial retrieval algorithms. We evaluate our method on two large corpora across a substantial number of topics.
Evaluating high accuracy retrieval techniques
- In Proceedings of SIGIR
, 2004
Cited by 21 (0 self)
Although information retrieval research has always been concerned with improving the effectiveness of search, in some applications, such as information analysis, a more specific requirement exists for high accuracy retrieval. This means that achieving high precision in the top document ranks is paramount. In this paper we present work aimed at achieving high accuracy in ad-hoc document retrieval by incorporating approaches from question answering (QA). We focus on getting the first relevant result as high as possible in the ranked list and argue that traditional precision and recall are not appropriate measures for evaluating this task. We instead use the mean reciprocal rank (MRR) of the first relevant result. We evaluate three different methods for modifying queries to achieve high accuracy. The experiments done on TREC data provide support for the approach of using MRR and incorporating QA techniques for getting high accuracy in the ad-hoc retrieval task. Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software--Performance evaluation (efficiency and effectiveness); H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval--Query formulation
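The mean reciprocal rank mentioned in this abstract scores each query by the inverse of the rank at which the first relevant result appears, then averages over queries. A minimal sketch follows; the function names and the toy document identifiers are illustrative assumptions, and a query with no relevant result retrieved is scored 0 here, which is a common convention rather than something stated in the abstract.

```python
def reciprocal_rank(ranked, relevant):
    """Return 1/rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (ranked_list, relevant_set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical runs: first relevant hit at rank 2, at rank 1, and never.
runs = [
    (["d3", "d1", "d7"], {"d1"}),
    (["d2", "d9"], {"d2"}),
    (["d4", "d5"], {"d8"}),
]
mrr = mean_reciprocal_rank(runs)
```

The three queries contribute 1/2, 1, and 0, giving an MRR of 0.5. Because only the first relevant result matters, the measure rewards exactly the behavior the abstract targets: placing one relevant document as high as possible.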
Indri at TREC 2005: Terabyte Track
, 2004
Cited by 17 (2 self)
This work details the experiments carried out using the Indri search engine during the TREC 2005 Terabyte Track. Results are presented for each of the three tasks, including efficiency, ad hoc, and named page finding. Our efficiency runs focused on query optimization techniques, our ad hoc runs look at the importance of term proximity and document quality, and our named-page finding runs investigate the use of document priors and document structure.
Find-similar: similarity browsing as a search tool
- In Proceedings of SIGIR 2006
, 2006
Cited by 17 (2 self)
Search systems have for some time provided users with the ability to request documents similar to a given document. Interfaces provide this feature via a link or button for each document in the search results. We call this feature find-similar or similarity browsing. We examined find-similar as a search tool, like relevance feedback, for improving retrieval performance. Our investigation focused on find-similar’s document-to-document similarity, the reexamination of documents during a search, and the user’s browsing pattern. Find-similar with a query-biased similarity, avoiding the reexamination of documents, and a breadth-like browsing pattern achieved a 23% increase in the arithmetic mean average precision and a 66% increase in the geometric mean average precision over our baseline retrieval. This performance matched that of a more traditionally styled iterative relevance feedback technique.
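The abstract reports gains in both the arithmetic and the geometric mean of average precision. The two can diverge sharply because the geometric mean is sensitive to improvements on poorly performing topics. A minimal sketch of the difference, using made-up per-topic average precision values (not data from the paper) and a small floor `eps` to avoid taking the log of zero, which is an assumption of this sketch:

```python
import math


def arithmetic_map(ap_scores):
    """Arithmetic mean of per-topic average precision values."""
    return sum(ap_scores) / len(ap_scores)


def geometric_map(ap_scores, eps=1e-5):
    """Geometric mean of per-topic average precision values.

    Values below eps are raised to eps so that log() is defined for
    topics with zero average precision (an assumption of this sketch).
    """
    log_sum = sum(math.log(max(ap, eps)) for ap in ap_scores)
    return math.exp(log_sum / len(ap_scores))


# Hypothetical per-topic scores: only the hard topic improves.
baseline = [0.50, 0.02]
improved = [0.50, 0.08]

amap_gain = arithmetic_map(improved) / arithmetic_map(baseline) - 1.0
gmap_gain = geometric_map(improved) / geometric_map(baseline) - 1.0
```

In this toy case the arithmetic mean rises from 0.26 to 0.29 (about 12%), while the geometric mean doubles from 0.1 to 0.2. A larger relative gain in the geometric mean, as in the abstract, is therefore consistent with improvements concentrated on difficult topics.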
Automatic Feature Selection in the Markov Random Field Model for Information Retrieval
- In Proceedings of CIKM’07
, 2007
Cited by 16 (6 self)
Previous applications of the Markov random field model for information retrieval have used manually chosen features. However, it is often difficult or impossible to know, a priori, the best set of features to use for a given task or data set. Therefore, there is a need to develop automatic feature selection techniques. In this paper we describe a greedy procedure for automatically selecting features to use within the Markov random field model for information retrieval. We also propose a novel, robust method for describing classes of textual information retrieval features. Experimental results, evaluated on standard TREC test collections, show that our feature selection algorithm produces models that are either significantly more effective than, or equally effective as, models with manually selected features, such as those used in the past.
Document Representation and Query Expansion Models for Blog Recommendation
Cited by 14 (3 self)
We explore several different document representation models and two query expansion models for the task of recommending blogs to a user in response to a query. Blog relevance ranking differs from traditional document ranking in ad-hoc information retrieval in several ways: (1) the unit of output (the blog) is composed of a collection of documents (the blog posts) rather than a single document, (2) the query represents an ongoing – and typically multifaceted – interest in the topic rather than a passing ad-hoc information need and (3) due to the propensity of spam, splogs, and tangential comments, the blogosphere is particularly challenging to use as a source for high-quality query expansion terms. We address these differences at the document representation level, by comparing retrieval models that view either the blog or its constituent posts as the atomic units of retrieval, and at the query expansion level, by making novel use of the links and anchor text in Wikipedia to expand a user’s initial query. We develop two complementary models of blog retrieval that perform at comparable levels of precision and recall. We also show consistent and significant improvement across all models using our Wikipedia expansion strategy.
The role of knowledge in conceptual retrieval: A study in the domain of clinical medicine
- In SIGIR 2006
, 2006
Cited by 10 (1 self)
Despite its intuitive appeal, the hypothesis that retrieval at the level of “concepts” should outperform purely term-based approaches remains unverified empirically. In addition, the use of “knowledge” has not consistently resulted in performance gains. After identifying possible reasons for previous negative results, we present a novel framework for “conceptual retrieval” that articulates the types of knowledge that are important for information seeking. We instantiate this general framework in the domain of clinical medicine based on the principles of evidence-based medicine (EBM). Experiments show that an EBM-based scoring algorithm dramatically outperforms a state-of-the-art baseline that employs only term statistics. Ablation studies further yield a better understanding of the performance contributions of different components. Finally, we discuss how other domains can benefit from knowledge-based approaches.
Regularizing query-based retrieval scores
- Information Retrieval
, 2007
Cited by 9 (2 self)
We adapt the cluster hypothesis for score-based information retrieval by claiming that closely related documents should have similar scores. Given a retrieval from an arbitrary system, we describe an algorithm which directly optimizes this objective by adjusting retrieval scores so that topically related documents receive similar scores. We refer to this process as score regularization. Because score regularization operates on retrieval scores, regardless of their origin, we can apply the technique to arbitrary initial retrieval rankings. Document rankings derived from regularized scores, when compared to rankings derived from un-regularized scores, consistently and significantly result in improved performance given a variety of baseline retrieval algorithms. We also present several proofs demonstrating that regularization generalizes methods such as pseudo-relevance feedback, document expansion, and cluster-based retrieval. Because of these strong empirical and theoretical results, we argue for the adoption of score regularization as a general design principle or post-processing step for information retrieval systems.
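The idea that related documents should receive similar scores can be illustrated with a generic graph-smoothing iteration. The sketch below is not the authors' algorithm: it is a simple fixed-point update, assumed here for illustration, in which each score is pulled toward the weighted average of its neighbors' scores while staying anchored to its initial value. The function name, the `alpha` mixing weight, and the toy affinity matrix are all hypothetical.

```python
def regularize_scores(scores, affinity, alpha=0.5, iterations=50):
    """Smooth retrieval scores over a document similarity graph.

    scores:   initial retrieval scores, one per document
    affinity: square matrix of non-negative document similarities
    alpha:    weight on neighbor agreement (0 keeps the original scores)
    """
    n = len(scores)
    # Row-normalize so each document averages over its neighbors.
    # A document with no neighbors gets a self-loop and keeps its score.
    norm = []
    for i, row in enumerate(affinity):
        total = sum(row)
        if total > 0:
            norm.append([w / total for w in row])
        else:
            norm.append([1.0 if j == i else 0.0 for j in range(n)])

    smoothed = list(scores)
    for _ in range(iterations):
        # Mix the initial score with the neighbor-weighted average.
        smoothed = [
            (1 - alpha) * scores[i]
            + alpha * sum(norm[i][j] * smoothed[j] for j in range(n))
            for i in range(n)
        ]
    return smoothed


# Hypothetical example: documents 0 and 1 are similar, document 2 is isolated.
scores = [1.0, 0.0, 0.2]
affinity = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
smoothed = regularize_scores(scores, affinity)
```

With these toy inputs the scores converge to roughly 0.67, 0.33, and 0.2, so document 1, which is similar to the top-scoring document, moves above the isolated document 2. Since the update only needs scores and a similarity matrix, it could be applied to the output of any initial ranker, which matches the post-processing role the abstract describes.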

