Results 1 - 10
of
71
Query-Based Sampling of Text Databases
- ACM TRANSACTIONS ON INFORMATION SYSTEMS
, 1999
"... ... This paper presents query-based sampling, a new technique for acquiring accurate resource descriptions. Query-based sampling does not require the cooperationof resource providers nor does it require that resource providers use a particular search engine or representation technique. An extensive ..."
Abstract
-
Cited by 134 (13 self)
- Add to MetaCart
... This paper presents query-based sampling, a new technique for acquiring accurate resource descriptions. Query-based sampling does not require the cooperationof resource providers nor does it require that resource providers use a particular search engine or representation technique. An extensive set of experimental results demonstrates that accurate resource descriptions are created, that computation and communication costs are reasonable, and that the resource descriptions do in fact enable accurate automatic database selection.
Distributed Information Retrieval
- In: Advances in Information Retrieval
, 2000
"... A multi-database model of distributed information retrieval is presented, in which people are assumed to have access to many searchable text databases. In such an environment, full-text information retrieval consists of discovering database contents, ranking databases by their expected ability to sa ..."
Abstract
-
Cited by 116 (18 self)
- Add to MetaCart
A multi-database model of distributed information retrieval is presented, in which people are assumed to have access to many searchable text databases. In such an environment, full-text information retrieval consists of discovering database contents, ranking databases by their expected ability to satisfy the query, searching a small number of databases, and merging results returned by different databases. This paper presents algorithms for each task. It also discusses how to reorganize conventional test collections into multi-database testbeds, and evaluation methodologies for multi-database experiments. A broad and diverse group of experimental results is presented to demonstrate that the algorithms are effective, efficient, robust, and scalable. 1. INTRODUCTION Wide area networks, particularly the Internet, have transformed how people interact with information. Much of the routine information access by the general public is now based on full-text information retrieval, as opposed t...
Building efficient and effective metasearch engines
- ACM Computing Surveys
, 2002
"... Frequently a user's information needs are stored in the databases of multiple search engines. It is inconvenient and inefficient for an ordinary user to invoke multiple search engines and identify useful documents from the returned results. To support unified access to multiple search engines, a met ..."
Abstract
-
Cited by 107 (9 self)
- Add to MetaCart
Frequently a user's information needs are stored in the databases of multiple search engines. It is inconvenient and inefficient for an ordinary user to invoke multiple search engines and identify useful documents from the returned results. To support unified access to multiple search engines, a metasearch engine can be constructed. When a metasearch engine receives a query from a user, it invokes the underlying search engines to retrieve useful information for the user. Metasearch engines have other benefits as a search tool such as increasing the search coverage of the Web and improving the scalability of the search. In this article, we survey techniques that have been proposed to tackle several underlying challenges for building a good metasearch engine. Among the main challenges, the database selection problem is to identify search engines that are likely to return useful documents to a given query. The document selection problem is to determine what documents to retrieve from each identified search engine. The result merging problem is to combine the documents returned from multiple search engines. We will also point out some problems that need to be further researched.
Distributed search over the hidden web: Hierarchical database sampling and selection
- In VLDB
, 2002
"... Many valuable text databases on the web have non-crawlable contents that are “hidden ” behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and eff ..."
Abstract
-
Cited by 85 (12 self)
- Add to MetaCart
Many valuable text databases on the web have non-crawlable contents that are “hidden ” behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically relies on statistical summaries of the database contents. Unfortunately, web-accessible text databases do not generally export content summaries. In this paper, we present an algorithm to derive content summaries from “uncooperative ” databases by using “focused query probes,” which adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. Our content summaries are the first to include absolute document frequency estimates for the database words. We also present a novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases, automatically derived during probing, to compensate for potentially incomplete content summaries. Finally, we evaluate our techniques thoroughly using a variety of databases, including 50 real web-accessible text databases. Our experiments indicate that our new content-summary construction technique is efficient and produces more accurate summaries than those from previously proposed strategies. Also, our hierarchical database selection algorithm exhibits significantly higher precision than its flat counterparts. 1
Server Selection on the World Wide Web
, 2000
"... We evaluate server selection methods in a Web environment, modeling a digital library which makes use of existing Web search servers rather than building its own index. The evaluation framework portrays the Web realistically in several ways. Its search servers index real Web documents, are of variou ..."
Abstract
-
Cited by 66 (4 self)
- Add to MetaCart
We evaluate server selection methods in a Web environment, modeling a digital library which makes use of existing Web search servers rather than building its own index. The evaluation framework portrays the Web realistically in several ways. Its search servers index real Web documents, are of various sizes, cover different topic areas and employ different retrieval methods. Selection is based on statistics extracted from the results of probe queries submitted to each server. We evaluate published selection methods and a new method for enhancing selection based on expected search server effectiveness. Results show CORI to be the most effective of three published selection methods. CORI selection steadily degrades with fewer probe queries, causing a drop in early precision of as much as 0#05 (one relevant document out of 20). Modifying CORI selection based on an estimation of expected effectiveness disappointingly yields no significant improvement in effectiveness. However, modifying COR...
QProber: A system for automatic classification of hidden-web databases
- ACM TOIS
, 2003
"... The contents of many valuable web-accessible databases are only available through search interfaces and are hence invisible to traditional web “crawlers. ” Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. ..."
Abstract
-
Cited by 53 (11 self)
- Add to MetaCart
The contents of many valuable web-accessible databases are only available through search interfaces and are hence invisible to traditional web “crawlers. ” Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. Here, we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
Structured databases on the web: Observations and implications
- SIGMOD Record[J
"... The Web has been rapidly “deepened ” by the prevalence of databases online. With the potentially unlimited information hidden behind their query interfaces, this “deep Web ” of searchable databases is clearly an important frontier for data access. This paper surveys this relatively unexplored fronti ..."
Abstract
-
Cited by 50 (19 self)
- Add to MetaCart
The Web has been rapidly “deepened ” by the prevalence of databases online. With the potentially unlimited information hidden behind their query interfaces, this “deep Web ” of searchable databases is clearly an important frontier for data access. This paper surveys this relatively unexplored frontier, measuring characteristics pertinent to both exploring and integrating structured Web sources. On one hand, our “macro ” study surveys the deep Web at large, in April 2004, adopting the random IP-sampling approach, with one million samples. (How large is the deep Web? How is it covered by current directory services?) On the other hand, our “micro ” study surveys source-specific characteristics over 441 sources in eight representative domains, in December 2002. (How “hidden ” are deep-Web sources? How do search engines cover their data? How complex and expressive are query forms?) We report our observations and publish the resulting datasets to the research community. We conclude with several implications (of our own) which, while necessarily subjective, might help shape research directions and solutions. 1.
Probe, Count, and Classify: Categorizing Hidden-Web Databases
, 2001
"... The contents of many valuable web-accessible databases are only accessible through search interfaces and are hence invisible to traditional web "crawlers." Recent studies have estimated the size of this "hidden web" to be 500 billion pages, while the size of the "crawlable" web is only an estimated ..."
Abstract
-
Cited by 41 (4 self)
- Add to MetaCart
The contents of many valuable web-accessible databases are only accessible through search interfaces and are hence invisible to traditional web "crawlers." Recent studies have estimated the size of this "hidden web" to be 500 billion pages, while the size of the "crawlable" web is only an estimated two billion pages. Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. In this paper, we introduce a method for automating this classification process by using a small number of query probes. To classify a database, our algorithm does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of our technique over collections of real documents, including over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases. 1.
Collection selection and results merging with topically organized U.S. patents and TREC data
- In CIKM 2000
, 2000
"... We investigate three issues in distributed information retrieval, considering both TREC data and U.S. Patents: (1) topical organization of large text collections, (2) collection ranking and selection with topically organized collections (3) results merging, particularly document score normalization, ..."
Abstract
-
Cited by 38 (7 self)
- Add to MetaCart
We investigate three issues in distributed information retrieval, considering both TREC data and U.S. Patents: (1) topical organization of large text collections, (2) collection ranking and selection with topically organized collections (3) results merging, particularly document score normalization, with topically organized collections. We find that it is better to organize collections topically, and that topical collections can be well ranked using either INQUERY’s CORI algorithm, or the Kullback-Leibler divergence (KL), but KL is far worse than CORI for non-topically organized collections. For results merging, collections organized by topic require global idfs for the best performance. Contrary to results found elsewhere, normalized scores are not as good as global idfs for merging when the collections are topically organized.
Downloading textual hidden web content through keyword queries
- In JCDL
, 2005
"... An ever-increasing amount of information on the Web today is available only through search interfaces: the users have to type in a set of keywords in a search form in order to access the pages from certain Web sites. These pages are often referred to as the Hidden Web or the Deep Web. Since there ar ..."
Abstract
-
Cited by 37 (1 self)
- Add to MetaCart
An ever-increasing amount of information on the Web today is available only through search interfaces: the users have to type in a set of keywords in a search form in order to access the pages from certain Web sites. These pages are often referred to as the Hidden Web or the Deep Web. Since there are no static links to the Hidden Web pages, search engines cannot discover and index such pages and thus do not return them in the results. However, according to recent studies, the content provided by many Hidden Web sites is often of very high quality and can be extremely valuable to many users. In this paper, we study how we can build an effective Hidden Web crawler that can autonomously discover and download pages from the Hidden Web. Since the only “entry point ” to a Hidden Web site is a query interface, the main challenge that a Hidden Web crawler has to face is how to automatically generate meaningful queries to issue to the site. Here, we provide a theoretical framework to investigate the query generation problem for the Hidden Web and we propose effective policies for generating queries automatically. Our policies proceed iteratively, issuing a different query in every iteration. We experimentally evaluate the effectiveness of these policies on 4 real Hidden Web sites and our results are very promising. For instance, in one experiment, one of our policies downloaded more than 90 % of a Hidden Web site (that contains 14 million documents) after issuing fewer than 100 queries.

