Understanding web query interfaces: Best-effort parsing with hidden syntax
- In SIGMOD Conference
, 2004
Cited by 56 (14 self)
Recently, the Web has been rapidly “deepened” by many searchable databases online, where data are hidden behind query forms. For modelling and integrating Web databases, the very first challenge is to understand what a query interface says – or what query capabilities a source supports. Such automatic extraction of interface semantics is challenging, as query forms are created autonomously. Our approach builds on the observation that, across myriad sources, query forms seem to reveal some “concerted structure,” by sharing common building blocks. Toward this insight, we hypothesize the existence of a hidden syntax that guides the creation of query interfaces, albeit from different sources. This hypothesis effectively transforms query interfaces into a visual language with a non-prescribed grammar – and, thus, their semantic understanding a parsing problem. Such a paradigm enables principled solutions for both declaratively representing common patterns, by a derived grammar, and systematically interpreting query forms, by a global parsing mechanism. To realize this paradigm, we must address the challenges of a hypothetical syntax – that it is to be derived, and that it is secondary to the input. At the heart of our form extractor, we thus develop a 2P grammar and a best-effort parser, which together realize a parsing mechanism for a hypothetical syntax. Our experiments show the promise of this approach – it achieves above 85% accuracy for extracting query conditions across random sources.
Structured databases on the web: Observations and implications
- SIGMOD Record
Cited by 50 (19 self)
The Web has been rapidly “deepened” by the prevalence of databases online. With the potentially unlimited information hidden behind their query interfaces, this “deep Web” of searchable databases is clearly an important frontier for data access. This paper surveys this relatively unexplored frontier, measuring characteristics pertinent to both exploring and integrating structured Web sources. On one hand, our “macro” study surveys the deep Web at large, in April 2004, adopting the random IP-sampling approach, with one million samples. (How large is the deep Web? How is it covered by current directory services?) On the other hand, our “micro” study surveys source-specific characteristics over 441 sources in eight representative domains, in December 2002. (How “hidden” are deep-Web sources? How do search engines cover their data? How complex and expressive are query forms?) We report our observations and publish the resulting datasets to the research community. We conclude with several implications (of our own) which, while necessarily subjective, might help shape research directions and solutions.
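The random IP-sampling idea in the abstract above can be illustrated with a simple extrapolation: probe a uniform random sample of addresses, count how many host a deep-Web source, and scale that rate to the whole address space. The sketch below is only an illustration under stated assumptions. The function name, the use of the full IPv4 space as the population, and the hit count of 10 are hypothetical placeholders, not figures or methods taken from the survey.

```python
def estimate_total_sources(hits_in_sample: int, sample_size: int, address_space: int) -> float:
    """Scale the hit rate observed in a uniform random sample to the full address space."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= hits_in_sample <= sample_size:
        raise ValueError("hits_in_sample must lie between 0 and sample_size")
    return hits_in_sample / sample_size * address_space


# Assumption: the population is the entire IPv4 space; a real study would
# likely restrict this to the addresses that can actually be assigned.
IPV4_ADDRESS_SPACE = 2 ** 32

# Hypothetical hit count, chosen only to show the arithmetic.
estimate = estimate_total_sources(hits_in_sample=10,
                                  sample_size=1_000_000,
                                  address_space=IPV4_ADDRESS_SPACE)
print(f"Estimated number of sources: {estimate:,.0f}")
```

With so few hits in a sample, the estimate carries wide uncertainty, so a range is usually more honest than a single number.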
Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach
, 2004
Cited by 41 (12 self)
To enable information integration, schema matching is a critical step for discovering semantic correspondences of attributes across heterogeneous sources. While complex matchings are common, because of their far more complex search space, most existing techniques focus on simple 1:1 matchings. To tackle this challenge, this paper takes a conceptually novel approach by viewing schema matching as correlation mining, for our task of matching Web query interfaces to integrate the myriad databases on the Internet. On this "deep Web," query interfaces generally form complex matchings between attribute groups (e.g., {author} corresponds to {first name, last name} in the Books domain). We observe that the co-occurrence patterns across query interfaces often reveal such complex semantic relationships: grouping attributes (e.g., {first name, last name}) tend to be co-present in query interfaces and thus positively correlated. In contrast, synonym attributes are negatively correlated because they rarely co-occur. This insight enables us to discover complex matchings by a correlation mining approach. In particular, we develop the DCM framework, which consists of data preparation, dual mining of positive and negative correlations, and finally matching selection. Unlike previous correlation mining algorithms, which mainly focus on finding strong positive correlations, our algorithm considers both positive and negative correlations, especially the subtlety of negative correlations, due to their special importance in schema matching. This leads to the introduction of a new correlation measure, H-measure, distinct from those proposed in previous work. We evaluate our approach extensively and the results show good accuracy for discovering complex matchings.
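The co-occurrence intuition in this abstract can be made concrete with a toy example. The sketch below is not the DCM framework and does not implement the H-measure, whose definition is not given here. It substitutes the plain Jaccard coefficient as a stand-in, and the interface data and function names are invented for illustration.

```python
def cooccurrence_counts(interfaces, a, b):
    """Count interfaces containing both attributes, only a, and only b."""
    both = sum(1 for attrs in interfaces if a in attrs and b in attrs)
    only_a = sum(1 for attrs in interfaces if a in attrs and b not in attrs)
    only_b = sum(1 for attrs in interfaces if b in attrs and a not in attrs)
    return both, only_a, only_b


def jaccard(interfaces, a, b):
    """Stand-in correlation score: share of interfaces with a or b that have both."""
    both, only_a, only_b = cooccurrence_counts(interfaces, a, b)
    total = both + only_a + only_b
    return both / total if total else 0.0


# Hypothetical Books-domain interfaces, each modelled as a set of attribute labels.
interfaces = [
    {"author", "title"},
    {"first name", "last name", "title"},
    {"author", "title", "isbn"},
    {"first name", "last name", "isbn"},
]

# Grouping attributes are always co-present in this toy data.
print(jaccard(interfaces, "first name", "last name"))
# Synonym candidates never co-occur in this toy data.
print(jaccard(interfaces, "author", "first name"))
```

A low score alone does not prove synonymy, since two rare and unrelated attributes also seldom co-occur. That subtlety is presumably part of why the abstract stresses the handling of negative correlations.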
Automatic complex schema matching across web query interfaces: A correlation mining approach
- ACM Transactions on Database Systems
, 2003
Cited by 18 (3 self)
To enable information integration, schema matching is a critical step for discovering semantic correspondences of attributes across heterogeneous sources. While complex matchings are common, because of their far more complex search space, most existing techniques focus on simple 1:1 matchings. To tackle this challenge, this article takes a conceptually novel approach by viewing schema matching as correlation mining, for our task of matching Web query interfaces to integrate the myriad databases on the Internet. On this “deep Web,” query interfaces generally form complex matchings between attribute groups (e.g., {author} corresponds to {first name, last name} in the Books domain). We observe that the co-occurrence patterns across query interfaces often reveal such complex semantic relationships: grouping attributes (e.g., {first name, last name}) tend to be co-present in query interfaces and thus positively correlated. In contrast, synonym attributes are negatively correlated because they rarely co-occur. This insight enables us to discover complex matchings by a correlation mining approach. In particular, we develop the DCM framework, which consists of data preprocessing, dual mining of positive and negative correlations, and finally matching construction. We evaluate the DCM framework on manually extracted interfaces and the results show good accuracy for discovering complex matchings. Further, to automate the
Knocking the Door to the Deep Web: Integrating Web Query Interfaces
- In SIGMOD Conference, System Demonstration
, 2004
Cited by 7 (3 self)
Recently, we have witnessed the rapid growth and thus the prevalence of databases on the Web. Our recent survey [2] in December 2002 estimated between 127,000 and 330,000 deep Web sources. On this deep Web, myriad online databases provide dynamic query-based data access through their query interfaces, instead of static URL links. Because these query interfaces are the "door" to the deep Web, integrating them is essential for integrating the deep Web. The MetaQuerier project (http://metaquerier.cs.uiuc.edu) aims at opening up the deep Web to users, by building a system to help users explore and integrate deep Web sources. In particular, to start with, we focus on the integration of deep Web sources in the same domain (e.g., Books, Airfares), which is itself an important integration task. The typical scenarios include purchasing a book with lowest price among book sources and a flight ticket with the best trade-off between price and number of connections among airl
MetaQuerier over the Deep Web: Shallow Integration across Holistic Sources
, 2004
Cited by 6 (1 self)
The Web has been rapidly "deepened" by myriad searchable databases online. To enable effective access to the "deep Web," we are building the MetaQuerier, for exploring and integrating databases on the Web. Such metaquerying must tackle integration at a large scale (as sources are proliferating online) and of a dynamic nature (as each query will access different sources). Toward such integration, our approach hinges on the insight that the challenge of large scale is itself an opportunity: We observe that the desired "semantics" often connects to surface presentation characteristics, through some hidden regularities over many sources. Generalizing our recent works, this paper thus proposes our approach of shallow integration across holistic sources: to discover desired semantics by exploiting the hidden regularities of shallow clues across many sources holistically.
On-the-Fly Constraint Mapping across Web Query Interfaces
- In Proceedings of the VLDB Workshop on Information Integration on the Web (VLDB-IIWeb’04)
, 2004
Cited by 6 (3 self)
Recently, the Web has been rapidly "deepened" with the prevalence of databases online and has become an important frontier for data integration. On this deep Web, a significant amount of information can only be accessed in response to dynamically issued queries to the query interface of a back-end database, instead of by traversing static URL links. Such a query interface expresses a set of constraint templates, where each constraint template states how an attribute can be queried. To enable automatic query mediation among heterogeneous deep Web sources, it is critical to automatically translate those constraints, which we call constraint mapping. In particular, this paper aims at enabling on-the-fly constraint mapping, which is a critical task for integrating the large-scale and dynamic deep Web. Such on-the-fly query translation poses a significant new challenge for the generality and extensibility of the translation framework. Existing works pursue a per-source rule-driven framework and thus cannot satisfy such requirements. In contrast, we propose a generic type-based search-driven translation framework by considering the constraint mapping for each data type as a search problem. In particular, in this paper, we develop search algorithms for text and numeric types. Our experiments over real deep Web sources show that our approach is promising for mediating queries in large-scale integration.
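To give a feel for what mapping a numeric constraint might involve, the sketch below translates a free-form range query into the fixed range options that a target form offers, by searching the options for those that overlap the query. This is a toy illustration under assumed data, not the search algorithm the paper develops. The function name, the half-open range convention, and the price ranges are all hypothetical.

```python
def map_numeric_range(query, target_options):
    """Return the target range options that overlap the queried range.

    Ranges are (low, high) pairs treated as half-open intervals [low, high).
    """
    low, high = query
    if low >= high:
        raise ValueError("query range must have low < high")
    return [(a, b) for (a, b) in target_options if a < high and b > low]


# Hypothetical target form: price can only be chosen from fixed ranges.
target_options = [(0, 25), (25, 45), (45, 65)]

# Source constraint: price between 10 and 30.
selected = map_numeric_range((10, 30), target_options)
print(selected)
```

The selected options may cover more than the original range, so a mediator would typically filter the returned results against the source constraint afterwards.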
A Holistic Paradigm for Large Scale Schema Matching
, 2004
Cited by 5 (0 self)
Schema matching is a critical problem for integrating heterogeneous information sources. Traditionally, the problem of matching multiple schemas has essentially relied on finding pairwise-attribute correspondences in isolation. In contrast, we propose a new matching paradigm, holistic schema matching, to match many schemas at the same time and find all matchings at once. By handling a set of schemas together, we can explore their context information that reflects the semantic correspondences among attributes. Such information is not available when schemas are matched only in pairs. As realizations of holistic schema matching, we develop two alternative approaches: global evaluation and local evaluation. Global evaluation exhaustively assesses all possible "models," where a model expresses all attribute matchings. In particular, we propose the MGS framework for such global evaluation, building upon the hypothesis of the existence of a hidden schema model that probabilistically generates the schemas we observed. On the other hand, local evaluation independently assesses every single matching to incrementally construct such a model. In particular, we develop the DCM framework for local evaluation, building upon the observation that co-occurrence patterns across schemas often reveal the complex relationships of attributes. We apply our approaches to match query interfaces on the deep Web. The results show the effectiveness of both the MGS and DCM approaches, which together demonstrate the promise of holistic schema matching.
Holistic schema matching for web query interfaces
- In Advances in Database Technology - EDBT 2006, 10th International Conference on Extending Database Technology
, 2006
Cited by 5 (1 self)
One significant part of today’s Web is Web databases, which can dynamically provide information in response to user queries. To help users submit queries to and collect query results from different Web databases, the query interface matching problem needs to be addressed. To solve this problem, we propose a new complex schema matching approach, Holistic Schema Matching (HSM). By examining the query interfaces of real Web databases, we observe that attribute matchings can be discovered from attribute-occurrence patterns. For example, First Name often appears together with Last Name while it is rarely co-present with Author in the Books domain. Thus, we design a count-based greedy algorithm to identify which attributes are more likely to be matched in the query interfaces. In particular, HSM can identify both simple matching and complex matching, where the former refers to 1:1 matching between attributes and the latter refers to 1:n or m:n matching between attributes. Our experiments show that HSM can discover both simple and complex matchings accurately and efficiently on real data sets.

