Learning Semantic String Transformations from Examples
Cited by 3 (3 self)
We address the problem of performing semantic transformations on strings, which may represent a variety of data types (or their combination) such as a column in a relational table, time, date, currency, etc. Unlike syntactic transformations, which are based on regular expressions and which interpret a string as a sequence of characters, semantic transformations additionally require exploiting the semantics of the data type represented by the string, which may be encoded as a database of relational tables. Manually performing such transformations on a large collection of strings is error-prone and cumbersome, while programmatic solutions are beyond the skill-set of end-users. We present a programming-by-example technology that allows end-users to automate such repetitive tasks. We describe an expressive transformation language for semantic manipulation that combines table lookup operations and syntactic manipulations. We then present a synthesis algorithm that can learn all transformations in the language that are consistent with the user-provided set of input-output examples. We have implemented this technology as an add-in for the Microsoft Excel Spreadsheet system and have evaluated it successfully over several benchmarks picked from various Excel help-forums.
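The idea of learning every program consistent with a set of examples can be illustrated with a toy sketch. This is not the paper's transformation language or synthesis algorithm: it assumes a hypothetical three-column table and a deliberately tiny program space, in which a program drops a fixed-length prefix from the input (the syntactic step) and then looks the remainder up in one column to return another (the semantic step). All names below are illustrative.

```python
from itertools import permutations

# Hypothetical lookup table; each row maps column names to strings.
TABLE = [
    {"code": "USD", "name": "US Dollar", "symbol": "$"},
    {"code": "EUR", "name": "Euro", "symbol": "€"},
    {"code": "JPY", "name": "Japanese Yen", "symbol": "¥"},
]


def lookup(table, key_col, val_col, key):
    """Return val_col of the first row whose key_col equals key, else None."""
    for row in table:
        if row[key_col] == key:
            return row[val_col]
    return None


def synthesize(table, examples):
    """Enumerate every (key_col, val_col, start) program consistent with all examples.

    A program means: take text[start:], look it up in key_col, return val_col.
    """
    columns = list(table[0])
    max_start = min(len(inp) for inp, _ in examples)
    consistent = []
    for key_col, val_col in permutations(columns, 2):
        for start in range(max_start):
            if all(lookup(table, key_col, val_col, inp[start:]) == out
                   for inp, out in examples):
                consistent.append((key_col, val_col, start))
    return consistent


def run(table, program, text):
    """Apply a synthesized program to a new input string."""
    key_col, val_col, start = program
    return lookup(table, key_col, val_col, text[start:])


examples = [("ID-USD", "US Dollar"), ("ID-EUR", "Euro")]
programs = synthesize(TABLE, examples)
print(programs)
print(run(TABLE, programs[0], "ID-JPY"))
```

With these two examples only one program survives, and it generalizes to the unseen input `ID-JPY`. Exhaustive enumeration is viable here only because the program space is tiny.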
Data integration through transform reuse in the morpheus project
In Proceedings of ACM SIGMOD, 2006
Cited by 2 (0 self)
We discuss Morpheus, a data transformation construction tool and associated repository. The architecture of Morpheus is motivated by the goal to reuse (pieces of) previously written transformations to solve data integration problems by finding relevant ones in the repository and then modifying them for repurposing. In addition, Morpheus is integrated with a DBMS so as to leverage existing capabilities including the runtime environment for transforms. We discuss the architecture of Morpheus and illustrate its usage with the help of a simple transform construction scenario.
Managing Uncertainty in Schema Matcher Ensembles
Cited by 2 (1 self)
Schema matching is the task of matching between concepts describing the meaning of data in various heterogeneous, distributed data sources. With many heuristics to choose from, several tools have enabled the use of schema matcher ensembles, combining principles by which different schema matchers judge the similarity between concepts. In this work, we investigate means of estimating the uncertainty involved in schema matching and harnessing it to improve an ensemble outcome. We propose a model for schema matching, based on simple probabilistic principles. We then propose the use of machine learning in determining the best mapping and discuss its pros and cons. Finally, we provide a thorough empirical analysis, using both real-world and synthetic data, to test the proposed technique. We conclude that the proposed heuristic performs well, given an accurate modeling of uncertainty in matcher decision making.
Establishing Value Mappings using Statistical Models and User Feedback
In ACM CIKM, 2005
Cited by 2 (1 self)
In this paper, we present a “value mapping” algorithm that does not rely on syntactic similarity or semantic interpretation of the values. The algorithm first constructs a statistical model (e.g., co-occurrence frequency or entropy vector) that captures the unique characteristics of values and their co-occurrence. It then finds the matching values by computing the distances between the models while refining the models using user feedback through iterations. Our experimental results suggest that our approach successfully establishes value mappings even in the presence of opaque data values and thus can be a useful addition to the existing data integration techniques.
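The model-then-distance idea can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: each value is modeled by its sorted co-occurrence frequency vector against a second column, models are compared with an L1 distance, and the user-feedback refinement loop is omitted. The function names and the opaque sample data are hypothetical.

```python
from collections import Counter, defaultdict


def profiles(rows):
    """Map each value of the first column to its sorted co-occurrence frequencies.

    Sorting discards the labels of the co-occurring values, so a profile
    does not depend on how either column happens to be encoded.
    """
    counts = defaultdict(Counter)
    for a, b in rows:
        counts[a][b] += 1
    result = {}
    for a, counter in counts.items():
        total = sum(counter.values())
        result[a] = sorted((n / total for n in counter.values()), reverse=True)
    return result


def distance(p, q):
    """L1 distance between two profiles, padding the shorter one with zeros."""
    size = max(len(p), len(q))
    p = p + [0.0] * (size - len(p))
    q = q + [0.0] * (size - len(q))
    return sum(abs(x - y) for x, y in zip(p, q))


def match_values(source_rows, target_rows):
    """Pair each source value with the target value whose profile is nearest."""
    src, tgt = profiles(source_rows), profiles(target_rows)
    return {a: min(tgt, key=lambda t: distance(src[a], tgt[t])) for a in src}


# Hypothetical data: the two sources share no tokens at all.
source = [("A1", "x")] * 3 + [("A1", "y")] + [("A2", "x"), ("A2", "y")]
target = [("T3", "c1")] * 6 + [("T3", "c2")] * 2 + [("T7", "c1")] * 2 + [("T7", "c2")] * 2

mapping = match_values(source, target)
print(mapping)
```

Nearest-profile matching as written may map two source values to the same target value. A one-to-one assignment step, or feedback from a user, would be needed to resolve such ties.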
Boosting Schema Matchers
OTM Proc. 2008
Cited by 2 (0 self)
Schema matching is recognized to be one of the basic operations required by the process of data and schema integration, and thus has a great impact on its outcome. We propose a new approach to combining matchers into ensembles, called Schema Matcher Boosting (SMB). This approach is based on a well-known machine learning technique, called boosting. We present a boosting algorithm for schema matching with a unique ensembler feature, namely the ability to choose the matchers that participate in an ensemble. SMB introduces a new promise for schema matcher designers. Instead of trying to design a perfect schema matcher that is accurate for all schema pairs, a designer can focus on finding better-than-random schema matchers. We provide thorough comparative empirical results showing that SMB outperforms, on average, any individual matcher. In our experiments we compared SMB with more than 30 other matchers and several ensembling approaches, including the Meta-Learner of LSD, over real-world data of 230 schemata. Our empirical analysis shows that SMB improves, on average, over the performance of individual matchers. Moreover, SMB is shown to be consistently dominant, far beyond any other individual matcher. Finally, we observe that SMB performs better than the Meta-Learner in terms of precision, recall and F-Measure.
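How boosting combines better-than-random matchers can be sketched with textbook discrete AdaBoost. This is not the SMB algorithm itself, whose details the abstract does not give. The three weak matchers and the labeled attribute pairs are hypothetical, chosen so that each matcher errs on a different pair.

```python
import math


def same_name(pair):
    """+1 if the attribute names are equal ignoring case, else -1."""
    return 1 if pair[0].lower() == pair[1].lower() else -1


def shared_prefix(pair):
    """+1 if the first three characters agree ignoring case, else -1."""
    return 1 if pair[0][:3].lower() == pair[1][:3].lower() else -1


def similar_length(pair):
    """+1 if the name lengths differ by at most three, else -1."""
    return 1 if abs(len(pair[0]) - len(pair[1])) <= 3 else -1


def adaboost(matchers, pairs, labels, rounds):
    """Return [(alpha, matcher)] selected by discrete AdaBoost."""
    weights = [1.0 / len(pairs)] * len(pairs)
    ensemble = []
    for _ in range(rounds):
        def weighted_error(m):
            return sum(w for w, p, y in zip(weights, pairs, labels) if m(p) != y)

        # Choose the matcher with the lowest weighted error this round.
        best = min(matchers, key=weighted_error)
        err = min(max(weighted_error(best), 1e-9), 1 - 1e-9)
        if err >= 0.5:
            break  # no matcher beats random guessing on this weighting
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, best))
        # Re-weight so that misclassified pairs count more in the next round.
        weights = [w * math.exp(-alpha * y * best(p))
                   for w, p, y in zip(weights, pairs, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, pair):
    """Weighted vote of the selected matchers: +1 (match) or -1 (no match)."""
    score = sum(alpha * m(pair) for alpha, m in ensemble)
    return 1 if score > 0 else -1


# Hypothetical training data: attribute pairs with known match labels.
pairs = [("Phone", "phone"), ("city", "zip"), ("addr", "address"),
         ("price", "priority_level"), ("fax", "description")]
labels = [1, -1, 1, -1, -1]
matchers = [same_name, shared_prefix, similar_length]

ensemble = adaboost(matchers, pairs, labels, rounds=3)
predictions = [predict(ensemble, p) for p in pairs]
print(predictions)
```

On this toy data no single matcher labels all five pairs correctly, while the three-round ensemble does, because each round concentrates weight on the pair the previous choice got wrong.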
Compositional Mining of Multi-Relational Biological Datasets
2007
Cited by 1 (1 self)
High-throughput biological screens are yielding ever-growing streams of information about multiple aspects of cellular activity. As more and more categories of datasets come online, there is a corresponding multitude of ways in which inferences can be chained across them, motivating the need for compositional data mining algorithms. In this paper, we argue that such compositional data mining can be effectively realized by functionally cascading redescription mining and biclustering algorithms as primitives. Both these primitives mirror shifts of vocabulary that can be composed in arbitrary ways to create rich chains of inferences. Given a relational database and its schema, we show how the schema can be automatically compiled into a compositional data mining program, and how different domains in the schema can be related through logical sequences of biclustering and redescription invocations. This feature allows us to rapidly prototype new data mining applications, yielding greater understanding of scientific datasets. We describe two applications of compositional data mining: (i) matching terms across categories of the Gene Ontology and (ii) understanding the molecular mechanisms underlying stress response in human cells.
A Holistic Paradigm for Schema Matching
2004
Cited by 1 (0 self)
Schema matching is a critical problem for integrating heterogeneous information sources. Traditionally, the problem of matching multiple schemas has essentially relied on finding pairwise-attribute correspondence. In contrast, we propose a new matching paradigm, holistic schema matching, to holistically match many schemas at the same time and find all the matchings at once. By handling a set of schemas together, we can explore their context information that reflects the semantic correspondences among attributes, which is not available when schemas are matched only in pairs. As realizations of the holistic paradigm, we recently developed two alternative approaches. This article takes an initial step to unify those two approaches and further contrasts their strengths and weaknesses. Specifically, we develop two alternative methods for realizing holistic schema matching: global evaluation and local evaluation. Global evaluation exhaustively assesses all the possible models, where a model expresses all attribute matchings. In particular, we propose the MGS framework for such global evaluation with the hypothesis of the existence of generative models. On the other hand, local evaluation independently assesses every single matching to incrementally construct the model. In particular, we develop the DCM framework for such local evaluation with the observation that co-occurrence patterns across schemas often reveal the complex relationships of attributes. We apply our approaches to matching Web query interfaces on the deep Web. The results show the effectiveness of both the MGS and DCM approaches, which together demonstrate the promise of the holistic paradigm for schema matching.
Identifying Value Mappings for Data Integration: An Unsupervised Approach
Cited by 1 (0 self)
The Web is a distributed network of information sources where the individual sources are autonomously created and maintained. Consequently, syntactic and semantic heterogeneity of data among sources abounds. Most of the current data cleaning solutions assume that the data values referencing the same object bear some textual similarity. However, this assumption is often violated in practice. “Two-door front wheel drive” can be represented as “2DR-FWD” or “R2FD”, or even as “CAR TYPE 3” in different data sources. To address this problem, we propose a novel two-step automated technique that exploits statistical dependency structures among objects, which are invariant to the tokens representing the objects. The algorithm achieved a high accuracy in our empirical study, suggesting that it can be a useful addition to the existing information integration techniques.
Automated Semantic Analysis of Schematic Data
Cited by 1 (0 self)
Content in numerous Web data sources, designed primarily for human consumption, is not directly amenable to machine processing. Automated semantic analysis of such content facilitates its transformation into machine-processable and richly structured semantically annotated data. This paper describes a learning-based technique for semantic analysis of schematic data, which are characterized by being template-generated from backend databases. Starting with a seed set of hand-labeled instances of semantic concepts in a set of Web pages, the technique learns statistical models of these concepts using light-weight content features. These models direct the annotation of diverse Web pages possessing similar content semantics. The principles behind the technique find application in information retrieval and extraction problems. Focused Web browsing activities require only selective fragments of particular Web pages but are often performed using bookmarks which fetch the contents of the entire page. This results in information overload for users of constrained interaction modality devices such as small-screen handheld devices. Fine-grained information extraction from Web pages, which is typically performed using page-specific and syntactic expressions known as wrappers, suffers from lack of scalability and robustness. We report on the application of our technique in developing semantic bookmarks for retrieving targeted browsing content and semantic wrappers for robust and scalable information extraction from Web pages sharing a semantic domain.

