Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach
, 2004
Cited by 41 (12 self)
To enable information integration, schema matching is a critical step for discovering semantic correspondences of attributes across heterogeneous sources. While complex matchings are common, most existing techniques focus on simple 1:1 matchings because the search space for complex matchings is far larger. To tackle this challenge, this paper takes a conceptually novel approach by viewing schema matching as correlation mining, for our task of matching Web query interfaces to integrate the myriad databases on the Internet. On this "deep Web," query interfaces generally form complex matchings between attribute groups (e.g., {author} corresponds to {first name, last name} in the Books domain). We observe that the co-occurrence patterns across query interfaces often reveal such complex semantic relationships: grouping attributes (e.g., {first name, last name}) tend to be co-present in query interfaces and are thus positively correlated. In contrast, synonym attributes are negatively correlated because they rarely co-occur. This insight enables us to discover complex matchings by a correlation mining approach. In particular, we develop the DCM framework, which consists of data preparation, dual mining of positive and negative correlations, and finally matching selection. Unlike previous correlation mining algorithms, which mainly focus on finding strong positive correlations, our algorithm considers both positive and negative correlations, especially the subtlety of negative correlations, because of their special importance in schema matching. This leads to the introduction of a new correlation measure, the H-measure, distinct from those proposed in previous work. We evaluate our approach extensively, and the results show good accuracy for discovering complex matchings.
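The co-occurrence observation in this abstract can be illustrated with a minimal sketch. The toy interfaces, the function names, and the use of the Jaccard coefficient are all assumptions made for illustration; in particular, Jaccard is a stand-in and is not the H-measure that the paper introduces.

```python
# Hypothetical toy data: each set holds the attribute labels of one query interface.
interfaces = [
    {"author", "title"},
    {"first name", "last name", "title"},
    {"author", "title", "isbn"},
    {"first name", "last name", "isbn"},
]

def cooccurrence_counts(schemas, a, b):
    """Return (both, only_a, only_b, neither) counts for attributes a and b."""
    both = sum(1 for s in schemas if a in s and b in s)
    only_a = sum(1 for s in schemas if a in s and b not in s)
    only_b = sum(1 for s in schemas if b in s and a not in s)
    neither = len(schemas) - both - only_a - only_b
    return both, only_a, only_b, neither

def jaccard(schemas, a, b):
    """Jaccard co-occurrence, both / (both + only_a + only_b).

    A stand-in correlation measure, not the paper's H-measure.
    """
    both, only_a, only_b, _ = cooccurrence_counts(schemas, a, b)
    denom = both + only_a + only_b
    return both / denom if denom else 0.0

# Grouping attributes appear together, so their score is high.
group_score = jaccard(interfaces, "first name", "last name")
# Synonym candidates rarely appear together, so their score is low.
synonym_score = jaccard(interfaces, "author", "first name")
```

On this toy data the grouping pair always co-occurs while the synonym pair never does, which is the contrast the abstract describes between positive and negative correlation.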
Google’s Deep-Web Crawl
, 2008
Cited by 26 (3 self)
The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. The results of our surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content. Surfacing the Deep Web poses several challenges. First, our goal is to index the content behind many millions of HTML forms that span many languages and hundreds of domains. This necessitates an approach that is completely automatic, highly scalable, and very efficient. Second, a large number of forms have text inputs and require valid input values to be submitted. We present an algorithm for selecting input values for text search inputs that accept keywords and an algorithm for identifying inputs that accept only values of a specific type. Third, HTML forms often have more than one input, and hence a naive strategy of enumerating the entire Cartesian product of all possible inputs can result in a very large number of URLs being generated. We present an algorithm that efficiently navigates the search space of possible input combinations to identify only those that generate URLs suitable for inclusion into our web search index. We present an extensive experimental evaluation validating the effectiveness of our algorithms.
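The third challenge, the size of the Cartesian product of form inputs, can be made concrete with a small sketch. The form, its input names, and its candidate values are hypothetical, and the code shows only the naive enumeration that the abstract warns against, not the paper's algorithm for pruning the search space.

```python
from itertools import product
from math import prod

# Hypothetical form: each input name maps to its candidate values.
form_inputs = {
    "make": ["ford", "honda", "toyota"],
    "year": ["2006", "2007", "2008"],
    "color": ["red", "blue"],
}

def naive_url_count(inputs):
    """Number of submissions if every combination of values is enumerated."""
    return prod(len(values) for values in inputs.values())

def enumerate_submissions(inputs):
    """Yield one dict per combination, i.e. the full Cartesian product."""
    names = list(inputs)
    for combo in product(*(inputs[name] for name in names)):
        yield dict(zip(names, combo))

total = naive_url_count(form_inputs)
submissions = list(enumerate_submissions(form_inputs))
```

The count is the product of the per-input value counts, so each additional input multiplies the number of candidate URLs. That growth is why enumerating every combination does not scale to forms with many inputs.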
Semantic matching: Algorithms and implementation
- JOURNAL ON DATA SEMANTICS
, 2007
Cited by 24 (12 self)
We view match as an operator that takes two graph-like structures (e.g., classifications, XML schemas) and produces a mapping between the nodes of these graphs that correspond semantically to each other. Semantic matching is based on two ideas: (i) we discover mappings by computing semantic relations (e.g., equivalence, more general); (ii) we determine semantic relations by analyzing the meaning (concepts, not labels) that is codified in the elements and the structures of schemas. In this paper we present basic and optimized algorithms for semantic matching, and we discuss their implementation within the S-Match system. We evaluate S-Match against three state-of-the-art matching systems, thereby justifying empirically the strength of our approach.
Probabilistic Top-k and Ranking-Aggregate Queries
, 2008
Cited by 12 (0 self)
Ranking and aggregation queries are widely used in data exploration, data analysis, and decision-making scenarios. While most of the currently proposed ranking and aggregation techniques focus on deterministic data, several emerging applications involve data that is unclean or uncertain. Ranking and aggregating uncertain (probabilistic) data raises new challenges in query semantics and processing, making conventional methods inapplicable. Furthermore, uncertainty imposes probability as a new ranking dimension that does not exist in traditional settings. In this article we introduce new probabilistic formulations for top-k and ranking-aggregate queries in probabilistic databases. Our formulations are based on a marriage of traditional top-k semantics with possible worlds semantics. In light of these formulations, we construct a generic processing framework that supports both query types and leverages existing query processing and indexing capabilities in current RDBMSs. The framework encapsulates a state space model and efficient search algorithms to compute query answers. Our proposed techniques minimize the number of accessed tuples and the size of the materialized search space needed to compute query answers. Our experimental study shows the efficiency of our techniques under different data distributions with
The ICoP Framework: Identification of Correspondences between Process Models
Cited by 9 (3 self)
Business process models can be compared, for example, to determine their consistency. Any comparison between process models relies on a mapping that identifies which activity in one model corresponds to which activity in another. Tools that generate such mappings are called matchers. This paper presents the ICoP framework, which can be used to develop such matchers. It consists of an architecture and reusable matcher components. The framework enables the creation of matchers from the reusable components and, if desired, newly developed components. It focuses on matchers that also detect complex correspondences between groups of activities, whereas existing matchers focus on 1:1 correspondences. We evaluate the framework by applying it to find matches in process models from practice. We show that the framework can be used to develop matchers in a flexible and adaptable manner and that the resulting matchers can identify a significant number of complex correspondences.
A Holistic Paradigm for Large Scale Schema Matching
, 2004
Cited by 5 (0 self)
Schema matching is a critical problem for integrating heterogeneous information sources. Traditionally, the problem of matching multiple schemas has essentially relied on finding pairwise-attribute correspondences in isolation. In contrast, we propose a new matching paradigm, holistic schema matching, to match many schemas at the same time and find all matchings at once. By handling a set of schemas together, we can explore their context information that reflects the semantic correspondences among attributes. Such information is not available when schemas are matched only in pairs. As realizations of holistic schema matching, we develop two alternative approaches: global evaluation and local evaluation. Global evaluation exhaustively assesses all possible "models," where a model expresses all attribute matchings. In particular, we propose the MGS framework for such global evaluation, building upon the hypothesis of the existence of a hidden schema model that probabilistically generates the schemas we observed. On the other hand, local evaluation independently assesses every single matching to incrementally construct such a model. In particular, we develop the DCM framework for local evaluation, building upon the observation that co-occurrence patterns across schemas often reveal the complex relationships of attributes. We apply our approaches to match query interfaces on the deep Web. The results show the effectiveness of both the MGS and DCM approaches, which together demonstrate the promise of holistic schema matching.
A Holistic Paradigm for Schema Matching
, 2004
Cited by 1 (0 self)
Schema matching is a critical problem for integrating heterogeneous information sources. Traditionally, the problem of matching multiple schemas has essentially relied on finding pairwise-attribute correspondences. In contrast, we propose a new matching paradigm, holistic schema matching, to match many schemas at the same time and find all the matchings at once. By handling a set of schemas together, we can explore their context information that reflects the semantic correspondences among attributes, which is not available when schemas are matched only in pairs. As realizations of the holistic paradigm, we recently developed two alternative approaches. This article takes an initial step to unify those two approaches and further contrasts their strengths and weaknesses. Specifically, we develop two alternative methods for realizing holistic schema matching: global evaluation and local evaluation. Global evaluation exhaustively assesses all the possible models, where a model expresses all attribute matchings. In particular, we propose the MGS framework for such global evaluation with the hypothesis of the existence of generative models. On the other hand, local evaluation independently assesses every single matching to incrementally construct the model. In particular, we develop the DCM framework for such local evaluation with the observation that co-occurrence patterns across schemas often reveal the complex relationships of attributes. We apply our approaches to matching query interfaces on the deep Web. The results show the effectiveness of both the MGS and DCM approaches, which together demonstrate the promise of the holistic paradigm for schema matching.
OpenKnowledge Deliverable 3.1: Dynamic Ontology Matching: a Survey
, 2006
Cited by 1 (0 self)
Matching has been recognized as a plausible solution for the semantic heterogeneity problem in many traditional applications, such as schema integration, ontology integration, data warehouses, data integration, and so on. Recently, a line of new applications has emerged that are characterized by their dynamics, such as peer-to-peer systems, agents, and web services. In this deliverable we extend the notion of ontology matching, as it has been understood in traditional applications, to dynamic ontology matching. In particular, we examine real-world scenarios and collect the requirements they pose towards a plausible solution. We consider five general matching directions which we believe can appropriately address those requirements. These are: (i) approximate and partial ontology matching, (ii) interactive ontology matching, (iii) continuous "design-time" ontology matching, (iv) community-driven ontology matching and
Synthesizing products for online catalogs
- PVLDB
Cited by 1 (0 self)
A comprehensive product catalog is essential to the success of Product Search engines and shopping sites such as Yahoo! Shopping, Google Product Search, and Bing Shopping. Given the large number of products and the speed at which they are released to the market, keeping catalogs up-to-date becomes a challenging task that calls for automated techniques. In this paper, we introduce the problem of product synthesis, a key component of catalog creation and maintenance. Given a set of offers advertised by merchants, the goal is to identify new products and add them to the catalog, together with their (structured) attributes. A fundamental challenge in product synthesis is the scale of the problem. A Product Search engine receives data from thousands of merchants about millions of products; the product taxonomy contains thousands of categories, where each category has a different schema; and merchants use representations for products that are different from the ones used in the catalog of the Product Search engine. We propose a system that provides an end-to-end solution to the product synthesis problem and addresses issues involved in data extraction from offers, schema reconciliation, and data fusion. For the schema reconciliation component, we developed a novel and scalable technique for schema matching that leverages knowledge about previously known instance-level associations between offers and products and is trained using automatically created training sets (no manually labeled data is needed). We present an experimental evaluation using data from Bing Shopping for more than 800K offers, a thousand merchants, and 400 categories. The evaluation confirms that our approach is able to automatically generate a large number of accurate product specifications. Furthermore, the evaluation shows that our schema reconciliation component outperforms state-of-the-art schema matching techniques in terms of precision and recall.
Mediation Queries Adaptation After the Removal of a Data Source
, 2008
A broad variety of data is available in distinct heterogeneous sources, stored under different formats: database formats (in relational and object-oriented models), document formats (SGML/XML), browser formats (HTML), message formats, etc. The integration of such data is increasingly important for modern information systems applications such as data warehousing, data mining, and web applications. This is realized by providing a uniform view of data sources (called the mediation schema or global schema) and defining a set of queries (called mediation queries or mediation mappings) that define the objects of the mediation schema. One important problem that merits consideration is the impact of schema evolution on mediation queries. Mappings left inconsistent by a schema change have to be detected and updated. In particular, a source may be removed from the system because it always provides obsolete information or because it is unavailable. In this case it is necessary to update the inconsistent mappings. In this paper, we study the removal of a source from an integration system and show how to correctly update the mappings between the mediation schema and the distributed sources after this change, in the context of the global-as-view approach (each relation of the global schema is expressed as a view on the data sources).

