Results 1 - 8 of 8
An adaptive crawler for locating hidden-Web entry points
In Proceedings of WWW, 2007
Cited by 21 (10 self)
In this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit. We propose a new framework whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning. Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates—the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy.
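The idea of prioritizing promising links and adapting that priority as the crawl progresses can be illustrated with a minimal sketch. It is not the paper's algorithm: the `AdaptiveFrontier` class, the anchor-text word features, and the smoothed success-rate score are all assumptions made for illustration. Harvest rate is taken here to mean forms found per page crawled.

```python
import heapq
from collections import defaultdict


class AdaptiveFrontier:
    """Hypothetical crawl frontier whose link scores are learned online."""

    def __init__(self):
        self._heap = []
        self._counter = 0               # tie-breaker so heap entries stay comparable
        self.hits = defaultdict(int)    # word -> links with this word that led to a form
        self.tries = defaultdict(int)   # word -> links with this word that were followed

    def score(self, anchor_words):
        # Mean Laplace-smoothed success rate of the words in the anchor text.
        if not anchor_words:
            return 0.5
        rates = [(self.hits[w] + 1) / (self.tries[w] + 2) for w in anchor_words]
        return sum(rates) / len(rates)

    def push(self, url, anchor_words):
        # heapq is a min-heap, so the score is negated to pop the best link first.
        # Links already queued keep the score they had when they were pushed.
        entry = (-self.score(anchor_words), self._counter, url, tuple(anchor_words))
        heapq.heappush(self._heap, entry)
        self._counter += 1

    def pop(self):
        _, _, url, words = heapq.heappop(self._heap)
        return url, words

    def feedback(self, anchor_words, found_form):
        # Online update: record whether following this link paid off.
        for w in anchor_words:
            self.tries[w] += 1
            if found_form:
                self.hits[w] += 1


def harvest_rate(forms_found, pages_crawled):
    """Forms retrieved per page crawled (0.0 when nothing was crawled)."""
    return forms_found / pages_crawled if pages_crawled else 0.0
```

A crawler built this way would call `feedback` after each fetched page, so links pushed later are ranked by what has been learned so far rather than by a fixed focus strategy.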
Organizing hidden-web databases by clustering visible web documents
In ICDE, 2007
Cited by 15 (6 self)
In this paper we address the problem of organizing hidden-Web databases. Given a heterogeneous set of Web forms that serve as entry points to hidden-Web databases, our goal is to cluster the forms according to the database domains to which they belong. We propose a new clustering approach that models Web forms as a set of hyperlinked objects and considers visible information in the form context (both within and in the neighborhood of forms) as the basis for similarity comparison. Since the clustering is performed over features that can be automatically extracted, the process is scalable. In addition, because it uses a rich set of metadata, our approach is able to handle a wide range of forms, including content-rich forms that contain multiple attributes, as well as simple keyword-based search interfaces. An experimental evaluation over real Web data shows that our strategy generates high-quality clusters, measured both in terms of entropy and F-measure. This indicates that our approach provides an effective and general solution to the problem of organizing hidden-Web databases.
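The entropy measure mentioned in the abstract can be sketched as follows. This is the standard size-weighted cluster entropy, given as an assumption: the abstract does not state the exact formula or logarithm base used in the paper, and the function name `cluster_entropy` is hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(clusters):
    """Size-weighted average entropy of a clustering.

    `clusters` is a list of clusters, each a list of the true domain labels
    of the forms assigned to it. Lower is better; 0.0 means every cluster
    contains forms from a single domain. Base-2 logarithm is an assumption.
    """
    total = sum(len(labels) for labels in clusters)
    result = 0.0
    for labels in clusters:
        if not labels:
            continue  # an empty cluster contributes nothing
        n = len(labels)
        entropy = -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())
        result += (n / total) * entropy
    return result
```

For example, two clusters that each mix two domains evenly score 1.0, while two pure clusters score 0.0.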
Combining classifiers to identify online databases
In Proceedings of WWW, 2007
Cited by 11 (4 self)
We address the problem of identifying the domain of online databases. More precisely, given a set F of Web forms automatically gathered by a focused crawler and an online database domain D, our goal is to select from F only the forms that are entry points to databases in D. Having a set of Web forms that serve as entry points to similar online databases is a requirement for many applications and techniques that aim to extract and integrate hidden-Web information, such as meta-searchers, online database directories, hidden-Web crawlers, and form-schema matching and merging. We propose a new strategy that automatically and accurately classifies online databases based on features that can be easily extracted from Web forms. By judiciously partitioning the space of form features, this strategy allows the use of simpler classifiers that can be constructed using learning techniques that are better suited for the features of each partition. Experiments using real Web data in a representative set of domains show that the use of different classifiers leads to high accuracy, precision and recall. This indicates that our modular classifier composition provides an effective and scalable solution for classifying online databases.
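The composition of simple classifiers over separate partitions of the form features can be sketched roughly as below. Everything here is an illustrative assumption: the form record fields (`text_inputs`, `has_password`, `text`), the hand-written rules standing in for learned classifiers, and the choice to combine them by requiring all to accept. The abstract does not specify any of these details.

```python
from typing import Callable, Dict, List, Set

Form = Dict[str, object]              # hypothetical form record
Classifier = Callable[[Form], bool]


def looks_searchable(form: Form) -> bool:
    # Structural partition: a stand-in rule for a learned classifier that
    # separates searchable forms from login or subscription forms.
    return int(form["text_inputs"]) >= 1 and not form["has_password"]


def make_domain_classifier(domain_terms: Set[str], min_hits: int = 2) -> Classifier:
    # Textual partition: a stand-in rule for a learned classifier over form text.
    def classify(form: Form) -> bool:
        words = set(str(form["text"]).lower().split())
        return len(words & domain_terms) >= min_hits
    return classify


def select_forms(forms: List[Form], classifiers: List[Classifier]) -> List[Form]:
    # Keep only the forms that every classifier in the composition accepts.
    return [f for f in forms if all(c(f) for c in classifiers)]
```

Because each classifier sees only its own partition of the features, each one can be swapped for a model suited to that feature type without touching the others.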
HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching
Cited by 2 (1 self)
We address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The matching process is the first step of a framework to integrate data feeds from third-party data providers into a structured-search engine’s data warehouse. Our experiments show that traditional schema-based and instance-based schema matching methods fall short. We propose a new technique based on the search engine’s clicklogs. Two schema elements are matched if the distributions of keyword queries that cause click-throughs on their instances are similar. We present experiments on large commercial datasets that show the new technique has much better accuracy than traditional techniques.
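The matching criterion, comparing the query distributions behind click-throughs on two schema elements, can be illustrated with a small sketch. The abstract does not name a similarity measure; Jensen-Shannon divergence is used here purely as an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def query_distribution(queries):
    """Turn a list of clicked-through keyword queries into a probability distribution."""
    counts = Counter(queries)
    total = sum(counts.values())
    return {q: c / total for q, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): 0.0 for identical, 1.0 for disjoint."""
    keys = set(p) | set(q)
    mid = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mid(dist):
        # Kullback-Leibler divergence from `dist` to the midpoint distribution.
        return sum(v * math.log2(v / mid[k]) for k, v in dist.items() if v > 0)

    return 0.5 * kl_to_mid(p) + 0.5 * kl_to_mid(q)
```

Under this sketch, two schema elements would be proposed as a match when the divergence between their query distributions falls below some threshold, which would have to be tuned on real clicklog data.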
Automatically Constructing a Directory of Molecular Biology Databases
Cited by 1 (0 self)
There has been an explosion in the volume of biology-related information that is available in online databases. But finding the right information can be challenging. Not only is this information spread over multiple sources, but often it is hidden behind the form interfaces of online databases. There are several ongoing efforts that aim to simplify the process of finding, integrating and exploring these data. However, existing approaches are not scalable and require substantial manual input. Notable examples include the NCBI databases and the NAR database compilation. As an important step towards a scalable solution to this problem, we describe a new infrastructure that automates, to a large extent, the process of locating and organizing online databases. We show how this infrastructure can be used to automate the construction and maintenance of a Molecular Biology database collection. We also provide an evaluation which shows that the infrastructure is scalable and effective: it is able to efficiently locate and accurately identify the relevant online databases.
HAMSTER: Human Assisted Mapping of Schema & Taxonomies to Enhance Relevance
We address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The matching process is the first step of a framework to integrate data feeds from third-party data providers into a structured-search engine’s data warehouse. Our experiments show that traditional schema-based and instance-based schema matching methods fall short. We propose a new technique based on the search engine’s clicklogs. Two schema elements are matched if the distributions of keyword queries that cause click-throughs on their instances are similar. We present experiments on large commercial datasets that show the new technique has much better accuracy than traditional techniques.
No Relation: The Mixed Blessings of Non-Relational Databases
To my wife Jill, without whose unending support I most certainly would not be here today. Acknowledgements I would like to acknowledge the generous support of my many professors at the University of Texas who have given graciously of their time in support of my education and research over the past two years; especially, Professors Daniel Miranker and Adnan Aziz, who advised on this project; and Professors Christine Julien and Joydeep Ghosh, whose courses and research heavily informed the work herein.

