Results 1 - 10
of
31
Implicit User Modeling for Personalized Search
- In Proceedings of CIKM
, 2005
"... Information retrieval systems (e.g., web search engines) are critical for overcoming information overload. A major deficiency of existing retrieval systems is that they generally lack user modeling and are not adaptive to individual users, resulting in inherently non-optimal retrieval performance. F ..."
Abstract
-
Cited by 47 (9 self)
- Add to MetaCart
Information retrieval systems (e.g., web search engines) are critical for overcoming information overload. A major deficiency of existing retrieval systems is that they generally lack user modeling and are not adaptive to individual users, resulting in inherently non-optimal retrieval performance. For example, a tourist and a programmer may use the same word “java ” to search for different information, but the current search systems would return the same results. In this paper, we study how to infer a user’s interest from the user’s search context and use the inferred implicit user model for personalized search. We present a decision theoretic framework and develop techniques for implicit user modeling in information retrieval. We develop an intelligent client-side web search agent (UCAIR) that can perform eager implicit feedback, e.g., query expansion based on previous queries and immediate result reranking based on clickthrough information. Experiments on web search show that our search agent can improve search accuracy over the popular Google search engine.
FRank: A Ranking Method with Fidelity Loss
, 2007
"... Ranking problem is becoming important in many fields, especially in information retrieval (IR). Many machine learning techniques have been proposed for ranking problem, such as RankSVM, RankBoost, and RankNet. Among them, RankNet, which is based on a probabilistic ranking framework, is leading to pr ..."
Abstract
-
Cited by 26 (10 self)
- Add to MetaCart
Ranking problem is becoming important in many fields, especially in information retrieval (IR). Many machine learning techniques have been proposed for ranking problem, such as RankSVM, RankBoost, and RankNet. Among them, RankNet, which is based on a probabilistic ranking framework, is leading to promising results and has been applied to a commercial Web search engine. In this paper we conduct further study on the probabilistic ranking framework and provide a novel loss function named fidelity loss for measuring loss of ranking. The fidelity loss not only inherits effective properties of the probabilistic ranking framework in RankNet, but possesses new properties that are helpful for ranking. This includes the fidelity loss obtaining zero for each document pair, and having a finite upper bound that is necessary for conducting query-level normalization. We also propose an algorithm named FRank based on a generalized additive model for the sake of minimizing the fidelity loss and learning an effective ranking function. We evaluated the proposed algorithm for two datasets: TREC dataset and real Web search dataset. The experimental results show that the proposed FRank algorithm outperforms other learning-based ranking methods on both conventional IR problem and Web searching.
Boosting web retrieval through query operations
- Proceedings ECIR 2005
, 2005
"... We explore the use of phrase and proximity terms in the context of web retrieval, which is different from traditional ad-hoc retrieval both in document structure and in query characteristics. We show that for this type of task, the usage of both phrase and proximity terms is highly beneficial for e ..."
Abstract
-
Cited by 24 (2 self)
- Add to MetaCart
We explore the use of phrase and proximity terms in the context of web retrieval, which is different from traditional ad-hoc retrieval both in document structure and in query characteristics. We show that for this type of task, the usage of both phrase and proximity terms is highly beneficial for early precision as well as for overall retrieval effectiveness. We also analyze why phrase and proximity terms are far more effective for web retrieval than for ad-hoc retrieval.
Very Large Scale Retrieval and Web Search
, 2004
"... Together, the TREC Very Large Collection (VLC) Track and its successor the Web Track have run for seven years, after an initial VLC pre-track. During that time five new test collections have been created, five different types of retrieval task have been studied, a large number of important issues ..."
Abstract
-
Cited by 20 (1 self)
- Add to MetaCart
Together, the TREC Very Large Collection (VLC) Track and its successor the Web Track have run for seven years, after an initial VLC pre-track. During that time five new test collections have been created, five different types of retrieval task have been studied, a large number of important issues have been addressed, and new methods have been tried, not only for retrieval, but also for test collection construction. Since the Web Track was a natural evolutionary step from the VLC Track, from here on we will refer to them as a single VLC/Web track. The corpora created in support of the track have been distributed to more than 120 organisations world wide; they are clearly being used for evaluation and research purposes well beyond the confines of TREC. Not only that but the Web Track model has been adopted for similar Japanese language evaluations within the context of NTCIR (NII-NACSIS Test Collection for IR Systems, research.nii. ac.jp/ntcir/index-en.html). Each editio
University of Glasgow at TREC2004: Experiments in Web, Robust and Terabyte tracks with Terrier
- In Proceedings of TREC 2004
, 2004
"... With our participation in TREC2004, we test Terrier, a modular and scalable Information Retrieval framework, in three tracks. For the mixed query task of the Web track, we employ a decision mechanism for selecting appropriate retrieval approaches on a per-query basis. For the robust track, in order ..."
Abstract
-
Cited by 17 (6 self)
- Add to MetaCart
With our participation in TREC2004, we test Terrier, a modular and scalable Information Retrieval framework, in three tracks. For the mixed query task of the Web track, we employ a decision mechanism for selecting appropriate retrieval approaches on a per-query basis. For the robust track, in order to cope with the poorlyperforming queries, we use two pre-retrieval performance predictors and a weighting function recommender mechanism. We also test a new training approach for the automatic tuning of the term frequency normalisation parameters. In the Terabyte track, we employ a distributed version of Terrier and test the effectiveness of techniques, such as using the anchor text, pseudo query expansion and selecting different weighting models for each query. 1
Overview of TREC 2004
- IN NIST SPECIAL PUBLICATION 500-261: THE THIRTEENTH TEXT RETRIEVAL CONFERENCE PROCEEDINGS (TREC 2004
, 2004
"... ..."
Toward Better Weighting of Anchors
- SIGIR '04
, 2004
"... Okapi BM25 scoring of anchor text surrogate documents has been shown to facilitate e#ective ranking in navigational search tasks over web data. We hypothesize that even better ranking can be achieved in certain important cases, particularly when anchor scores must be fused with content scores, by av ..."
Abstract
-
Cited by 16 (2 self)
- Add to MetaCart
Okapi BM25 scoring of anchor text surrogate documents has been shown to facilitate e#ective ranking in navigational search tasks over web data. We hypothesize that even better ranking can be achieved in certain important cases, particularly when anchor scores must be fused with content scores, by avoiding length normalisation and by reducing the attentuation of scores associated with high tf . Preliminary results are presented.
Query-level loss functions for information retrieval
- INFORMATION PROCESSING AND MANAGEMENT
, 2008
"... Many machine learning technologies such as support vector machines, boosting, and neural networks have been applied to the ranking problem in information retrieval. However, since originally the methods were not developed for this task, their loss functions do not directly link to the criteria used ..."
Abstract
-
Cited by 15 (9 self)
- Add to MetaCart
Many machine learning technologies such as support vector machines, boosting, and neural networks have been applied to the ranking problem in information retrieval. However, since originally the methods were not developed for this task, their loss functions do not directly link to the criteria used in the evaluation of ranking. Specifically, the loss functions are defined on the level of documents or document pairs, in contrast to the fact that the evaluation criteria are defined on the level of queries. Therefore, minimizing the loss functions does not necessarily imply enhancing ranking performances. To solve this problem, we propose using query-level loss functions in learning of ranking functions. We discuss the basic properties that a query-level loss function should have and propose a query-level loss function based on the cosine similarity between a ranking list and the corresponding ground truth. We further design a coordinate descent algorithm, referred to as RankCosine, which utilizes the proposed loss function to create a generalized additive ranking model. We also discuss whether the loss functions of existing ranking algorithms can be extended to query-level. Experimental results on the datasets of TREC web track, OHSUMED, and a commercial web search engine show that with the use of the proposed querylevel loss function we can significantly improve ranking accuracies. Furthermore, we found that it is difficult to extend the document-level loss functions to query-level loss functions.
Server Selection Methods in Hybrid Portal Search
- IN PROC. ACM SIGIR CONF
, 2005
"... The TREC .GOV collection makes a valuable web testbed for distributed information retrieval methods because it is naturally partitioned and includes 725 web-oriented queries with judged answers. It can usefully model aspects of government and large corporate portals. Analysis of the .gov data shows ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
The TREC .GOV collection makes a valuable web testbed for distributed information retrieval methods because it is naturally partitioned and includes 725 web-oriented queries with judged answers. It can usefully model aspects of government and large corporate portals. Analysis of the .gov data shows that a purely distributed approach would not be feasible for providing search on a .gov portal because of the large number (17,000+) of web sites and the high proportion that do not provide a search interface. An alternative hybrid approach, combining both distributed and centralized techniques, is proposed and server selection methods are evaluated within this framework using web-oriented evaluation methodology. A number of well-known algorithms are compared against representatives (highest anchor ranked page (HARP) and anchor weighted sum (AWSUM)) of a family of new selection methods which use link anchortext extracted from an auxiliary crawl to provide descriptions of sites which are not themselves crawled. Of the previously published methods, ReDDE substantially outperformed three variants of CORI and also outperformed a method based on Kullback-Leibler Divergence (extended) except on topic distillation. HARP and AWSUM performed best overall but were outperformed on the topic distillation task by extended KL Divergence.
Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval
- In Proceedings of the 28th Annual international ACM SIGIR Conference on Research and Development in information Retrieval
, 2005
"... This paper is concerned with automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; however, in reality HTML titles are often bogus. It is desirable to conduct automatic extraction of titles from the bodies of HTML ..."
Abstract
-
Cited by 13 (5 self)
- Add to MetaCart
This paper is concerned with automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; however, in reality HTML titles are often bogus. It is desirable to conduct automatic extraction of titles from the bodies of HTML documents. This is an issue which does not seem to have been investigated previously. In this paper, we take a supervised machine learning approach to address the problem. We propose a specification on HTML titles. We utilize format information such as font size, position, and font weight as features in title extraction. Our method significantly outperforms the baseline method of using the lines in largest font size as title (20.9%-32.6 % improvement in F1 score). As application, we consider web page retrieval. We use the TREC Web Track data for evaluation. We propose a new method for HTML documents retrieval using extracted titles. Experimental results indicate that the use of both extracted titles and title fields is almost always better than the use of title fields alone; the use of extracted titles is particularly helpful in the task of named page finding (23.1 %-29.0 % improvements).

