Combining Document Representations for Known-Item Search
, 2003
Cited by 67 (4 self)
This paper investigates the pre-conditions for successful combination of document representations formed from structural markup for the task of known-item search. As this task is very similar to work in meta-search and data fusion, we adapt several hypotheses from those research areas and investigate them in this context. To investigate these hypotheses, we present a mixture-based language model and also examine many of the current metasearch algorithms. We find that compatible output from systems is important for successful combination of document representations. We also demonstrate that combining low performing document representations can improve performance, but not consistently. We find that the techniques best suited for this task are robust to the inclusion of poorly performing document representations. We also explore the role of variance of results across systems and its impact on the performance of fusion, with the surprising result that the correct documents have higher variance across document representations than highly ranking incorrect documents.
Searching the Workplace Web
, 2003
Cited by 46 (4 self)
The social impact from the World Wide Web cannot be underestimated, but technologies used to build the Web are also revolutionizing the sharing of business and government information within intranets. In many ways the lessons learned from the Internet carry over directly to intranets, but others do not apply. In particular, the social forces that guide the development of intranets are quite different, and the determination of a "good answer" for intranet search is quite different than on the Internet. In this paper we study the problem of intranet search. Our approach focuses on the use of rank aggregation, and allows us to examine the effects of different heuristics on ranking of search results.
Harvesting Image Databases from the Web
- In ICCV
, 2007
Cited by 42 (0 self)
The objective of this work is to automatically generate a large number of images for a specified object class (for example, penguin). A multi-modal approach employing both text, meta data and visual features is used to gather many, high-quality images from the web. Candidate images are obtained by a text based web search querying on the object identifier (the word penguin). The web pages and the images they contain are downloaded. The task is then to remove irrelevant images and re-rank the remainder. First, the images are re-ranked using a Bayes posterior estimator trained on the text surrounding the image and meta data features (such as the image alternative tag, image title tag, and image filename). No visual information is used at this stage. Second, the top-ranked images are used as (noisy) training data and an SVM visual classifier is learnt to improve the ranking further. The principal novelty is in combining text/meta-data and visual features in order to achieve a completely automatic ranking of the images. Examples are given for a selection of animals (e.g. camels, sharks, penguins), vehicles (cars, airplanes, bikes) and other classes (guitar, wristwatch), totalling 18 classes. The results are assessed by precision/recall curves on ground truth annotated data and by comparison to previous approaches including those of Berg et al. [5] (on an additional six classes) and Fergus et al. [9].
A Semisupervised Learning Method to Merge Search Engine Results
- ACM Transactions on Information Systems
, 2003
Cited by 34 (8 self)
This article presents a semisupervised learning solution to the result merging problem. The key contribution is the observation that information used to create resource descriptions for resource selection can also be used to create a centralized sample database to guide the normalization of document scores returned by different databases. At retrieval time, the query is sent to the selected databases, which return database-specific document scores, and to a centralized sample database, which returns database-independent document scores. Documents that have both a database-specific score and a database-independent score serve as training data for learning to normalize the scores of other documents. An extensive set of experiments demonstrates that this method is more effective than the well-known CORI result-merging algorithm under a variety of conditions.
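The normalization idea in this abstract can be illustrated with a minimal sketch. The abstract does not state the form of the learned mapping, so the code below assumes a simple least-squares line fitted on the overlap documents; the function names `fit_score_mapping` and `normalize_scores` and the example scores are hypothetical, not taken from the article.

```python
def fit_score_mapping(pairs):
    """Fit independent = a * specific + b by least squares.

    `pairs` holds (database_specific, database_independent) scores for
    documents that received both. The linear form is an assumption.
    """
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two overlap documents")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    if var_x == 0:
        raise ValueError("overlap documents all share one database-specific score")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = cov / var_x
    b = mean_y - a * mean_x
    return a, b


def normalize_scores(db_scores, sample_scores):
    """Map every database-specific score onto the database-independent scale."""
    # Overlap documents (scored by both sources) act as the training data.
    overlap = [(db_scores[d], sample_scores[d]) for d in db_scores if d in sample_scores]
    a, b = fit_score_mapping(overlap)
    return {doc: a * score + b for doc, score in db_scores.items()}


# Hypothetical scores: d1 and d3 also appear in the centralized sample.
db_scores = {"d1": 10.0, "d2": 20.0, "d3": 30.0}
sample_scores = {"d1": 0.2, "d3": 0.6}
normalized = normalize_scores(db_scores, sample_scores)
```

Here `d2` has no database-independent score of its own, so its normalized score comes entirely from the line fitted on `d1` and `d3`.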
Using Sampled Data and Regression to Merge Search Engine Results
- In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
, 2002
Cited by 32 (10 self)
This paper addresses the problem of merging results obtained from different databases and search engines in a distributed information retrieval environment. The prior research on this problem either assumed the exchange of statistics necessary for normalizing scores (cooperative solutions) or is heuristic. Both approaches have disadvantages.
Cranking: Combining Rankings Using Conditional Probability Models on Permutations
- In Proceedings of the 19th International Conference on Machine Learning
, 2002
Cited by 31 (3 self)
A new approach to ensemble learning is introduced that takes ranking rather than classification as fundamental, leading to models on the symmetric group and its cosets. The approach uses a generalization of the Mallows model on permutations to combine multiple input rankings. Applications include the task of combining the output of multiple search engines and multiclass or multilabel classification, where a set of input classifiers is viewed as generating a ranking of class labels.
Relevance score normalization for metasearch
- 10th Conf. on Information and Knowledge Management (CIKM 2001). Atlanta, GA
, 2001
Cited by 30 (1 self)
Given the ranked lists of documents returned by multiple search engines in response to a given query, the problem of metasearch is to combine these lists in a way which optimizes the performance of the combination. This problem can be naturally decomposed into three subproblems: (1) normalizing the relevance scores given by the input systems, (2) estimating relevance scores for unretrieved documents, and (3) combining the newly-acquired scores for each document into one, improved score. Research on the problem of metasearch has historically concentrated on algorithms for combining (normalized) scores. In this paper, we show that the techniques used for normalizing relevance scores and estimating the relevance scores of unretrieved documents can have a significant effect on the overall performance of metasearch. We propose two new normalization/estimation techniques and demonstrate empirically that the performance of well known metasearch algorithms can be significantly improved through their use.
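The three subproblems named in this abstract can be sketched with standard baseline choices. The code below is not the paper's proposed techniques: it assumes min-max normalization for step (1), a fixed score of zero for unretrieved documents in step (2), and CombSUM (summing normalized scores) for step (3). The function names and example runs are hypothetical.

```python
def min_max_normalize(scores):
    """Step 1: rescale one system's scores into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every retrieved document alike.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(runs, unretrieved_score=0.0):
    """Steps 2 and 3: fill in unretrieved documents, then sum the scores.

    `runs` is a list of {doc_id: raw_score} dicts, one per search engine.
    Returns (doc_id, fused_score) pairs, best first, ties broken by id.
    """
    normalized = [min_max_normalize(run) for run in runs]
    all_docs = set().union(*normalized)
    fused = {
        doc: sum(run.get(doc, unretrieved_score) for run in normalized)
        for doc in all_docs
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical runs from two engines with incomparable raw score ranges.
runs = [
    {"a": 3.0, "b": 1.0, "c": 2.0},
    {"b": 10.0, "d": 0.0},
]
fused_ranking = comb_sum(runs)
```

Changing `unretrieved_score` or swapping out `min_max_normalize` is the kind of variation the abstract reports as having a significant effect on fused performance.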
Uncertainty and description logic programs: A proposal . . .
- FUZZY LOGIC AND THE SEMANTIC WEB, CAPTURING INTELLIGENCE, CHAPTER 7
, 2004
Cited by 23 (6 self)
Rule-based and object-oriented techniques are rapidly making their way into the infrastructure for representing and reasoning about the Semantic Web and combining these two paradigms emerges as an important objective. We present a new family of representation languages, which extends existing language families for the Semantic Web: namely Description Logic Programs (DLPs) and DLPs with uncertainty (µDLPs). The former combine the expressive power of description logics (which capture the meaning of the most popular features of structured representation of knowledge) and disjunctive logic programs (powerful rule-based representation languages). The latter are DLPs in which the management of uncertainty is considered as well. We show that µDLPs may be applied in the context of distributed information search in the Semantic Web, where the representation of the inherent uncertainty of the relationships among resource ontologies, to which an automated agent has access, is required.
Web Metasearch: Rank vs. Score Based Rank Aggregation Methods
, 2003
Cited by 21 (3 self)
Given a set of rankings, the task of ranking fusion is the problem of combining these lists in such a way to optimize the performance of the combination. The ranking fusion problem is encountered in many situations and, e.g., metasearch is a prominent one. It deals with the problem of combining the result lists returned by multiple search engines in response to a given query, where each item in a result list is ordered with respect to a search engine and a relevance score. Several ranking fusion methods have been proposed in the literature. They can be classified based on whether: (i) they rely on the rank; (ii) they rely on the score; and (iii) they require training data or not. Our paper will make the following contributions: (i) we will report experimental results for the Markov chain rank based methods, for which no large experimental tests have yet been made; (ii) while it is believed that the rank based method, named Borda Count, is competitive with score based methods, we will show that this is not true for metasearch; and (iii) we will show that Markov chain based methods compete with score based methods. This is especially important in the context of metasearch as scores are usually not available from the search engines.
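For readers unfamiliar with the rank-based baseline named in this abstract, Borda Count can be sketched in a few lines. This is a minimal illustration, not the paper's experimental setup: it assumes each input ranking lists distinct items best first, and that an item missing from a ranking simply earns no points from it (other conventions exist). The function name `borda_count` and the example rankings are hypothetical.

```python
def borda_count(rankings):
    """Fuse ranked lists using rank positions only, no relevance scores.

    With n distinct items overall, an item at 0-based position p in a
    ranking earns n - p points from that ranking. Items absent from a
    ranking earn nothing from it (a simplifying assumption).
    """
    candidates = set().union(*rankings)
    n = len(candidates)
    points = {item: 0 for item in candidates}
    for ranking in rankings:
        for position, item in enumerate(ranking):
            points[item] += n - position
    # Highest total first; ties broken alphabetically for determinism.
    return sorted(points, key=lambda item: (-points[item], item))


# Hypothetical result lists from three search engines.
rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
fused = borda_count(rankings)
```

Because only positions are used, this kind of method applies even when engines do not expose their scores, which is the situation the abstract highlights for metasearch.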
Probabilistic Models for Combining Diverse Knowledge Sources in Multimedia Retrieval
- In Ph.D Thesis
, 2006
Cited by 18 (2 self)
In recent years, the multimedia retrieval community is gradually shifting its emphasis from analyzing one media source at a time to exploring the opportunities of combining diverse knowledge sources from correlated media types and context. This thesis presents a conditional probabilistic retrieval model as a principled framework to combine diverse knowledge sources. An efficient rank-based learning approach has been developed to explicitly model the ranking relations in the learning process. Under this retrieval framework, we overview and develop a number of state-of-the-art approaches for extracting ranking features from multimedia knowledge sources. To incorporate query information in the combination model, this thesis develops a number of query analysis models that can automatically discover mixing structure of the query space based on previous retrieval results. To adapt the combination function on a per query basis, this thesis also presents a probabilistic local context analysis (pLCA) model to automatically leverage additional retrieval sources to improve initial retrieval outputs. All the proposed approaches are evaluated on multimedia retrieval tasks with large-scale video collections as well as meta-search tasks with large-scale text collections.

