Distributed search over the hidden web: Hierarchical database sampling and selection
- In VLDB, 2002
Cited by 85 (12 self)
Many valuable text databases on the web have non-crawlable contents that are “hidden” behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically relies on statistical summaries of the database contents. Unfortunately, web-accessible text databases do not generally export content summaries. In this paper, we present an algorithm to derive content summaries from “uncooperative” databases by using “focused query probes,” which adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. Our content summaries are the first to include absolute document frequency estimates for the database words. We also present a novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases, automatically derived during probing, to compensate for potentially incomplete content summaries. Finally, we evaluate our techniques thoroughly using a variety of databases, including 50 real web-accessible text databases. Our experiments indicate that our new content-summary construction technique is efficient and produces more accurate summaries than those from previously proposed strategies. Also, our hierarchical database selection algorithm exhibits significantly higher precision than its flat counterparts.
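To make the idea of selecting databases from content summaries concrete, here is a minimal sketch of a flat selection step. It is not the hierarchical algorithm from the paper. The `ContentSummary` type, the database names, the frequencies, and the scoring rule (an independence-based estimate of how many documents match every query word) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ContentSummary:
    """Hypothetical summary: database size plus per-word document frequencies."""
    num_docs: int
    doc_freq: Dict[str, int]


def estimated_matches(summary: ContentSummary, query: List[str]) -> float:
    # Assumed scoring rule: treat query words as independent, so the expected
    # number of documents containing all of them is |db| * prod(df(w) / |db|).
    if summary.num_docs == 0:
        return 0.0
    estimate = float(summary.num_docs)
    for word in query:
        estimate *= summary.doc_freq.get(word, 0) / summary.num_docs
    return estimate


def rank_databases(
    summaries: Dict[str, ContentSummary], query: List[str]
) -> List[Tuple[str, float]]:
    # Score every database and return them best-first.
    scored = [(name, estimated_matches(s, query)) for name, s in summaries.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up summaries for two hypothetical databases.
summaries = {
    "medical_db": ContentSummary(1000, {"cancer": 400, "therapy": 200}),
    "sports_db": ContentSummary(5000, {"cancer": 10, "soccer": 3000}),
}
ranking = rank_databases(summaries, ["cancer", "therapy"])
```

In this sketch a query word that is missing from a summary drives the score to zero. That is the kind of gap, caused by incomplete summaries, that the abstract says the hierarchical classification is meant to compensate for.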
Fast and effective query refinement
- In Proc. of the 20th Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1997
Cited by 25 (1 self)
Query Refinement is an essential information retrieval tool that interactively recommends new terms related to a particular query. This paper introduces concept recall, an experimental measure of an algorithm's ability to suggest terms humans have judged to be semantically related to an information need. This study uses precision improvement experiments to measure the ability of an algorithm to produce single term query modifications that predict a user's information need as partially encoded by the query. An oracle algorithm produces ideal query modifications, providing a meaningful context for interpreting precision improvement results. This study also introduces RMAP, a fast and practical query refinement algorithm that refines multiple term queries by dynamically combining precomputed suggestions for single term queries. RMAP achieves accuracy comparable to a much slower algorithm, although both RMAP and the slower algorithm lag behind the best possible term suggestions offered by the oracle. We believe RMAP is fast enough to be integrated into present-day Internet search engines: RMAP computes 100 term suggestions for a 160,000 document collection in 15 ms on a low-end PC.
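The idea of combining precomputed single-term suggestions at query time can be sketched as follows. This is not RMAP itself, because the abstract does not say how the suggestions are combined. The `PRECOMPUTED` table, its weights, and the rule of summing weights across query terms are assumptions made purely for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical precomputed table: single query term -> weighted suggestions.
PRECOMPUTED: Dict[str, List[Tuple[str, float]]] = {
    "jaguar": [("car", 0.6), ("cat", 0.5), ("speed", 0.2)],
    "engine": [("car", 0.7), ("motor", 0.4), ("speed", 0.1)],
}


def refine(query: List[str], k: int = 3) -> List[str]:
    # Assumed combination rule: sum the precomputed weights of each
    # suggestion over all query terms, skipping terms already in the query.
    totals: Dict[str, float] = defaultdict(float)
    for term in query:
        for suggestion, weight in PRECOMPUTED.get(term, []):
            if suggestion not in query:
                totals[suggestion] += weight
    # Highest combined weight first; break ties alphabetically.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [suggestion for suggestion, _ in ranked[:k]]


suggestions = refine(["jaguar", "engine"])
```

Because the per-term lists are built offline, the query-time work is only a few dictionary lookups and a sort over a small candidate set, which is the general reason precomputation of this kind can be fast.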
Supporting Dynamic Interactions among Web-Based Information Sources
- IEEE Transactions on Knowledge and Data Engineering, 2000
When one sample is not enough: Improving text database selection using shrinkage
- In SIGMOD’04, 2004
Cited by 18 (3 self)
Database selection is an important step when searching over large numbers of distributed text databases. The database selection task relies on statistical summaries of the database contents, which are not typically exported by databases. Previous research has developed algorithms for constructing an approximate content summary of a text database from a small document sample extracted via querying. Unfortunately, Zipf’s law practically guarantees that content summaries built this way for any relatively large database will fail to cover many low-frequency words. Incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To improve the coverage of approximate content summaries, we build on the observation that topically similar databases tend to have related vocabularies. Therefore, the approximate
Landscaping the Information Space of Large Multi-Database Networks
- 2001
Cited by 8 (3 self)
The promises of network-accessible information are increasingly difficult to achieve. These difficulties are due to a variety of causes, such as the rapid growth in the volume of network-available information and the increasing complexity, diversity and terminological fluctuations of the different information sources available. This paper presents a conceptual architecture for the organisation of information space across collections of component systems in multi-databases that provides serendipity, exploration and contextualisation support so that users can achieve logical connections between concepts they are familiar with and schema terms employed in multi-database systems. Large-scale searching for multi-database schema information is guided by a combination of lexical, structural and semantic aspects of schema terms in order to reveal more meaning both about the contents of a requested information term and about its placement within the distributed information space.
Maintaining Retrieval Effectiveness in Distributed, Dynamic Information Retrieval Systems
- 1996
Cited by 3 (1 self)
Traditional information retrieval (IR) techniques were developed under the tacit assumptions of static, centralized archives of documents. Advanced techniques invariably use information derived from the entire collection in an effort to produce high-quality responses to user queries. In dynamic, distributed information environments these assumptions are clearly not met. Heretofore easily obtainable collection-wide information (CWI) may be unavailable to some or all member sites in a distributed document archive, so some degree of incompleteness or inconsistency must be tolerated. In this dissertation, we present a rigorous empirical study investigating how allowing the view of CWI to drift from rigorously defined values influences retrieval effectiveness. We give a generic model for searching a document collection that allows for the use of CWI derived from a subset of the collection. Within this model, we identify two realistic scenarios where the use of subset-derived collection stat...
Independent Proprietorship and Competition in Distributed Web Search Architectures
- In: Proceedings of the 7th IEEE International Conference on Engineering of Complex Computer Systems (ICECCS 2001), 2001
Cited by 3 (3 self)
The predominant Web search model attempts to use multiple computers under centralised management to act as one search engine for the entire Web. As the quantity of online information increases, systems based on this model become prohibitively expensive for all but the largest organisations. We advocate the use of distributed search architectures where multiple independently owned and managed search engines act as one search system. This approach has significant advantages including low market entry cost for individual search providers and the potential to stimulate the provision of high-quality services through competition. The low entry cost allows small organisations and even individual users to influence service features and quality by establishing specialised search services. However, independent proprietorship also greatly complicates the search system design. The potential for competition between engines requires new approaches to effective engine management. Many new issues arise such as deciding what information an engine will index. In this paper, we analyse the sources of complexity in distributed Web search architectures with independent proprietorship and competition between engines. We outline possible ways to cope with this complexity using techniques from the field of computational economics such as game theory.
Pro-active Information Elicitation in Wide-Area Information Networks
- 1996
Cited by 2 (1 self)
Rapid growth in the volume of data, complexity, diversity and terminological variations render network-accessible information increasingly difficult to achieve. This paper compares current approaches in networked-based information retrieval, Internet resource discovery and multi-agent systems and presents a set of criteria for improving the quality of networked information delivery. It subsequently presents an information elicitation scheme based on these criteria in order to enhance scalability, accelerate searches and make interactions in a large collection of networked databases more tractable.
1 Introduction
Contemporary wide-area information networks provide large, rapidly evolving and autonomously managed information spaces. Such inter-networked information access systems have placed vast amounts of information within easy reach for users. However, information access is quite intricate not only due to the sheer volume of information available, but also because of naming conventions...
Adaptive Distributed Search and Advertising for WWW
- 2001
In this paper, we present the concept of, and discuss problems related to, distributed search architectures for the World Wide Web. We structure the problem area and analyse what aspects have already been covered by existing research and what needs to be done. We outline possible approaches to some of the important research issues in distributed search architectures and present the ADSA (Adaptive Distributed Search and Advertising) project which aims to resolve them.

