Results 1 - 10
of
23
Analysis of Anchor Text for Web Search
, 2003
"... It has been observed that anchor text in web documents is very useful in improving the quality of web text search for some classes of queries. By examining properties of anchor text in a large intranet, we hope to shed light on why this is the case. Our main premise is that anchor text behaves very ..."
Abstract
-
Cited by 50 (1 self)
- Add to MetaCart
It has been observed that anchor text in web documents is very useful in improving the quality of web text search for some classes of queries. By examining properties of anchor text in a large intranet, we hope to shed light on why this is the case. Our main premise is that anchor text behaves very much like real user queries and consensus titles. Thus an understanding of how anchor text is related to a document will likely lead to better understanding of how to translate a user's query into high quality search results. Our approach is experimental, based on a study of a large corporate intranet, including the content as well as a large stream of queries against that content. We conduct experiments to investigate several aspects of anchor text, including their relationship to titles, the frequency of queries that can be satisfied by anchortext alone, and the homogeneity of results fetched by anchor text.
Searching the Workplace Web
, 2003
"... The social impact from the World Wide Web cannot be underestimated, but technologies used to build the Web are also revolutionizing the sharing of business and government information within intranets. In many ways the lessons learned from the Internet carry over directly to intranets, but others do ..."
Abstract
-
Cited by 46 (4 self)
- Add to MetaCart
The social impact from the World Wide Web cannot be underestimated, but technologies used to build the Web are also revolutionizing the sharing of business and government information within intranets. In many ways the lessons learned from the Internet carry over directly to intranets, but others do not apply. In particular, the social forces that guide the development of intranets are quite di#erent, and the determination of a "good answer" for intranet search is quite di#erent than on the Internet. In this paper we study the problem of intranet search. Our approach focuses on the use of rank aggregation, and allows us to examine the e#ects of di#erent heuristics on ranking of search results.
Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR
- ACM SIGIR 2006, Aug 6-11
, 2006
"... Semantic smoothing, which incorporates synonym and sense information into the language models, is effective and potentially significant to improve retrieval performance. The implemented semantic smoothing models, such as the translation model which statistically maps document terms to query terms, a ..."
Abstract
-
Cited by 15 (7 self)
- Add to MetaCart
Semantic smoothing, which incorporates synonym and sense information into the language models, is effective and potentially significant to improve retrieval performance. The implemented semantic smoothing models, such as the translation model which statistically maps document terms to query terms, and a number of works that have followed have shown good experimental results. However, these models are unable to incorporate contextual information. Thus, the resulting translation might be mixed and fairly general. To overcome this limitation, we propose a novel context-sensitive semantic smoothing method that decomposes a document or a query into a set of weighted context-sensitive topic signatures and then translate those topic signatures into query terms. In detail, we solve this problem through (1) choosing concept pairs as topic signatures and adopting an ontology-based approach to extract concept pairs; (2) estimating the translation model for each topic signature using the EM algorithm; and (3) expanding document and query models based on topic signature translations. The new smoothing method is evaluated on TREC 2004/05 Genomics Track collections and significant improvements are obtained. The MAP (mean average precision) achieves a 33.6 % maximal gain over the simple language model, as well as a 7.8 % gain over the language model with context-insensitive semantic smoothing.
Empirical Development of an Exponential Probabilistic Model for Text Retrieval: Using Textual Analysis to Build a Better Model
- In Proceedings of the 26th Annual ACM Conference on Research and Development in Information Retrieval
, 2003
"... Much work in information retrieval focuses on using a model of documents and queries to derive retrieval algorithms. Model based development is a useful alternative to heuristic development because in a model the assumptions are explicit and can be examined and refined independent of the particular ..."
Abstract
-
Cited by 9 (0 self)
- Add to MetaCart
Much work in information retrieval focuses on using a model of documents and queries to derive retrieval algorithms. Model based development is a useful alternative to heuristic development because in a model the assumptions are explicit and can be examined and refined independent of the particular retrieval algorithm. We explore the explicit assumptions underlying the naive Bayesian framework by performing computational analysis of actual corpora and queries to devise a generative document model that closely matches text. Our thesis is that a model so developed will be more accurate than existing models, and thus more useful in retrieval, as well as other applications. We test this by learning from a corpus the best document model. We find the learned model better predicts the existence of text data and has improved performance on certain IR tasks.
Probabilistic Model for Contextual Retrieval
- In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2004
, 2004
"... Contextual retrieval is a critical technique for facilitating many important applications such as mobile search, personalized search, PC troubleshooting, etc. Despite of its importance, there is no comprehensive retrieval model to describe the contextual retrieval process. We observed that incompati ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
Contextual retrieval is a critical technique for facilitating many important applications such as mobile search, personalized search, PC troubleshooting, etc. Despite of its importance, there is no comprehensive retrieval model to describe the contextual retrieval process. We observed that incompatible context, noisy context and incomplete query are several important issues commonly existing in contextual retrieval applications. However, these issues have not been previously explored and discussed. In this paper, we propose probabilistic models to address these problems. Our study clearly shows that query log is the key to build effective contextual retrieval models. We also conduct a case study in the PC troubleshooting domain to testify the performance of the proposed models and experimental results show that the models can achieve very good retrieval precision.
Topic Signature Language Models for Ad Hoc Retrieval
, 2007
"... Semantic smoothing, which incorporates synonym and sense information into the language models, is effective and potentially significant to improve retrieval performance. Previously implemented semantic smoothing models such as the translation model have shown good experimental results. However, the ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
Semantic smoothing, which incorporates synonym and sense information into the language models, is effective and potentially significant to improve retrieval performance. Previously implemented semantic smoothing models such as the translation model have shown good experimental results. However, these models are unable to incorporate contextual information. To overcome this limitation, we propose a novel context-sensitive semantic smoothing method that decomposes a document into a set of weighted context-sensitive topic signatures and then maps those topic signatures into query terms. The language model with such a contextsensitive semantic smoothing is referred to as the topic signature language model. In detail, we implement two types of topic signatures, depending on whether ontology exists in the application domain. One is the ontology-based concept and the other is the multiword phrase. The mapping probabilities from each topic signature to individual terms are estimated through the EM algorithm. Document models based on topic signature mapping are then derived. The new smoothing method is evaluated on the TREC 2004/ 2005 Genomics Track with ontology-based concepts, as well as the TREC Ad Hoc Track (Disks 1, 2, and 3) with multiword phrases. Both experiments show significant improvements over the two-stage language model, as well as the language model with contextinsensitive semantic smoothing.
Clickthrough-Based Translation Models for Web Search: from Word Models to Phrase Models
"... Web search is challenging partly due to the fact that search queries and Web documents use different language styles and vocabularies. This paper provides a quantitative analysis of the language discrepancy issue, and explores the use of clickthrough data to bridge documents and queries. We assume t ..."
Abstract
-
Cited by 7 (4 self)
- Add to MetaCart
Web search is challenging partly due to the fact that search queries and Web documents use different language styles and vocabularies. This paper provides a quantitative analysis of the language discrepancy issue, and explores the use of clickthrough data to bridge documents and queries. We assume that a query is parallel to the titles of documents clicked on for that query. Two translation models are trained and integrated into retrieval models: A word-based translation model that learns the translation probability between single words, and a phrase-based translation model that learns the translation probability between multi-term phrases. Experiments are carried out on a real world data set. The results show that the retrieval systems that use the translation models outperform significantly the systems that do not. The paper also demonstrates that standard statistical machine translation techniques such as word alignment, bilingual phrase extraction, and phrase-based decoding, can be adapted for building a better Web document retrieval system.
Clickthrough-Based Latent Semantic Models for Web Search
- In Proceedings of SIGIR
, 2011
"... This paper presents two new document ranking models for Web search based upon the methods of semantic representation and the statistical translation-based approach to information retrieval (IR). Assuming that a query is parallel to the titles of the documents clicked on for that query, large amounts ..."
Abstract
-
Cited by 5 (3 self)
- Add to MetaCart
This paper presents two new document ranking models for Web search based upon the methods of semantic representation and the statistical translation-based approach to information retrieval (IR). Assuming that a query is parallel to the titles of the documents clicked on for that query, large amounts of query-title pairs are constructed from clickthrough data; two latent semantic models are learned from this data. One is a bilingual topic model within the language modeling framework. It ranks documents for a query by the likelihood of the query being a semantics-based translation of the documents. The semantic representation is language independent and learned from query-title pairs, with the assumption that a query and its paired titles share the same distribution over semantic topics. The other is a discriminative projection model within the vector space modeling framework. Unlike Latent Semantic Analysis and its variants, the projection matrix in our model, which is used to map from term vectors into sematic space, is learned discriminatively such that the distance between a query and its paired title, both represented as vectors in the projected semantic space, is smaller than that between the query and the titles of other documents which have no clicks for that query. These models are evaluated on the Web search task using a real world data set. Results show that they significantly outperform their corresponding baseline models, which are state-of-the-art.
Regularizing Translation Models for Better Automatic Image Annotation
- In Proceedings of CIKM’04
, 2004
"... The goal of automatic image annotation is to automatically generate annotations for images to describe their content. In the past, statistical machine translation models have been successfully applied to automatic image annotation task [8]. It views the process of annotating images as a process of t ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
The goal of automatic image annotation is to automatically generate annotations for images to describe their content. In the past, statistical machine translation models have been successfully applied to automatic image annotation task [8]. It views the process of annotating images as a process of translating the content from a ‘visual language ’ to textual words. One problem with the existing translation models is that common words are usually associated with too many different image regions. As a result, uncommon words have little chance to be used for annotating images. Uncommon words are important for automatic image annotation because they are often used in the queries. In this paper, we propose two modified translation models for automatic image annotation, namely the normalized translation model and the regularized translation model, that specifically address the problem of common annotated words. The basic idea is to raise the number of blobs that are associated with uncommon words. The normalized translation model realizes this by scaling translation probabilities of different words with different factors. The same goal is achieved in the regularized translation model through the introduction of a special Dirichlet prior. Empirical study with the Corel dataset has shown that both two modified translation models outperform the original translation model and several existing approaches for automatic image annotation substantially.
Using Thematic Information in Statistical Headline Generation
, 2003
"... We explore the problem of single sentence summarisation. In the news domain, such a summary might resemble a headline. The headline generation system we present uses Singular Value Decomposition (SVD) to guide the generation of a headline towards the theme that best represents the document t ..."
Abstract
-
Cited by 4 (3 self)
- Add to MetaCart
We explore the problem of single sentence summarisation. In the news domain, such a summary might resemble a headline. The headline generation system we present uses Singular Value Decomposition (SVD) to guide the generation of a headline towards the theme that best represents the document to be summarised. In doing so, the intuition is that the generated summary will more accurately reflect the content of the source document. This paper presents SVD as an alternative method to determine if a word is a suitable candidate for inclusion in the headline. The results of a recall based evaluation comparing three different strategies to word selection, indicate that thematic information does help improve recall.

