Identifying Value Mappings for Data Integration: An Unsupervised Approach
Cited by 1 (0 self)
Abstract. The Web is a distributed network of information sources where the individual sources are autonomously created and maintained. Consequently, syntactic and semantic heterogeneity of data among sources abounds. Most current data cleaning solutions assume that data values referencing the same object bear some textual similarity. However, this assumption is often violated in practice: “Two-door front wheel drive” can be represented as “2DR-FWD” or “R2FD”, or even as “CAR TYPE 3”, in different data sources. To address this problem, we propose a novel two-step automated technique that exploits statistical dependency structures among objects, which are invariant to the tokens representing the objects. The algorithm achieved high accuracy in our empirical study, suggesting that it can be a useful addition to existing information integration techniques.
Web Data Integration Using Approximate String Join
2004
mining. It is highly likely that several records on the web whose textual representations differ represent the same real-world entity. These records are called approximate duplicates. Data integration seeks to identify such approximate duplicates and merge them into integrated records. Many existing data integration algorithms make use of approximate string join, which seeks to (approximately) find all pairs of strings whose distances are less than a certain threshold. In this paper, we propose a new mapping method to detect pairs of strings with similarity above a certain threshold. In our method, each string is first mapped to a point in a high-dimensional grid space, then pairs of points whose distances are 1 are identified. We implement it using Oracle SQL and PL/SQL. Finally, we evaluate this method using real data sets. Experimental results suggest that our method is both accurate and efficient.
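As a point of reference for the approximate string join mentioned in this abstract, the following is a minimal brute-force sketch in Python. It is not the grid-mapping method the paper proposes, nor its Oracle SQL and PL/SQL implementation; it only illustrates the baseline task of finding all pairs of strings whose distance is less than a threshold. The choice of Levenshtein edit distance and the names `edit_distance`, `approximate_join`, and `threshold` are assumptions made for illustration.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def approximate_join(left, right, threshold):
    """Return every (l, r) pair whose edit distance is less than threshold.

    Hypothetical helper: compares all pairs, so it costs
    O(len(left) * len(right)) distance computations.
    """
    return [(l, r) for l, r in product(left, right)
            if edit_distance(l, r) < threshold]


if __name__ == "__main__":
    pairs = approximate_join(["Jon Smith", "Ann Lee"],
                             ["John Smith", "Anne Li"],
                             threshold=2)
    print(pairs)
```

The quadratic all-pairs comparison is what makes this baseline expensive on large inputs, which is the cost that mapping-based methods such as the one described above aim to reduce.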
MEDIATE: Learning to Match Entity Mentions across Text and Databases
Many real-world applications increasingly involve both structured data and text. A given real-world entity is often referred to in different ways, such as “Helen Hunt” and “Mrs. H. E. Hunt”, both within and across the structured data and the text. Due to this semantic heterogeneity, it remains extremely difficult to glue together information about real-world entities from the available data sources and effectively utilize both types of information. This paper describes the MEDIATE system, which automatically matches entity mentions within and across both text and databases. The system can handle multiple types of entities (e.g., people, movies, locations), is easily extensible to new entity types, and operates with no need for annotated training data. Given a relational database and a set of text documents, MEDIATE learns from the data a generative model that provides a probabilistic view on how a data creator might have generated mentions, then applies it to matching the mentions. The model exploits the similarity of mention names, common transformations across mentions, and context information such as age, gender, and entity co-occurrence. To maximize matching accuracy, MEDIATE also propagates information across contexts. Experiments on real-world data show that MEDIATE significantly outperforms existing methods that address aspects of this problem, and that it can exploit text to improve record linkage, and vice versa.
Benchmarking Declarative Approximate Selection Predicates
Copyright © 2007 by Oktie Hassanzadeh
Declarative data quality has been an active research topic. The fundamental principle behind a declarative approach to data quality is the use of declarative statements to realize data quality primitives on top of any relational data source. A primary advantage of such an approach is the ease of use and integration with existing applications. Over the last couple of years several similarity predicates have been proposed for common quality primitives (approximate selections, joins, etc.) and have been fully expressed using declarative SQL statements. In this thesis, new similarity predicates are proposed along with their declarative realization, based on notions of probabilistic information retrieval. Then, full declarative specifications of previously proposed similarity predicates in the literature are presented, grouped into classes according to their primary characteristics. Finally, a thorough performance and accuracy study comparing a large number of similarity predicates for data cleaning operations is performed.

Dedication. This thesis is dedicated to my brother, Aidin, and to my parents who have always supported me.

Acknowledgements. First, I would like to thank my supervisor, Nick Koudas. This work would not have been possible without his invaluable guidance and support. Special thanks to John Mylopoulos, the second reader of my thesis, for his valuable time and comments. During my research, I had the pleasure of working in a wonderful atmosphere in the database lab. I had an unforgettable year with my colleagues there. While enjoying the taste of fresh coffee from our fancy coffee machine that helped us stay awake all long nights before the deadlines, we had many fruitful discussions that often resulted in brilliant new ideas. I would like to thank all my friends in the database lab, particularly
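To make the idea of a similarity predicate "fully expressed using declarative SQL statements" concrete, here is a small sketch using Python's built-in `sqlite3` module. It is not one of the predicates proposed or benchmarked in the thesis; it assumes a simple Jaccard similarity over padded 2-grams, and the table names (`base`, `base_grams`, `query_grams`) and helper names are hypothetical. The point is only that, once strings are tokenized into a gram table, the approximate selection itself is an ordinary SQL join with aggregation.

```python
import sqlite3


def qgrams(s: str, q: int = 2) -> set:
    """Set of lowercase q-grams of s, padded so string edges are represented."""
    padded = "#" * (q - 1) + s.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def build_db(records, q: int = 2) -> sqlite3.Connection:
    """Load records and their q-grams into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE base (id INTEGER PRIMARY KEY, val TEXT)")
    conn.execute("CREATE TABLE base_grams (id INTEGER, gram TEXT)")
    conn.execute("CREATE TABLE query_grams (gram TEXT)")
    for rid, val in enumerate(records):
        conn.execute("INSERT INTO base VALUES (?, ?)", (rid, val))
        conn.executemany("INSERT INTO base_grams VALUES (?, ?)",
                         [(rid, g) for g in qgrams(val, q)])
    return conn


# Jaccard = |intersection| / (|record grams| + |query grams| - |intersection|).
# Grams are stored as sets, so COUNT(*) after the join is the intersection size.
JACCARD_SQL = """
SELECT val, jaccard FROM (
    SELECT b.val AS val,
           CAST(COUNT(*) AS REAL)
             / (s.n + (SELECT COUNT(*) FROM query_grams) - COUNT(*)) AS jaccard
    FROM base b
    JOIN base_grams g ON g.id = b.id
    JOIN query_grams qg ON qg.gram = g.gram
    JOIN (SELECT id, COUNT(*) AS n FROM base_grams GROUP BY id) s
      ON s.id = b.id
    GROUP BY b.id, b.val, s.n
)
WHERE jaccard >= ?
ORDER BY jaccard DESC
"""


def approximate_select(conn, query: str, threshold: float, q: int = 2):
    """Return (value, score) rows whose Jaccard similarity to query >= threshold."""
    conn.execute("DELETE FROM query_grams")
    conn.executemany("INSERT INTO query_grams VALUES (?)",
                     [(g,) for g in qgrams(query, q)])
    return conn.execute(JACCARD_SQL, (threshold,)).fetchall()


if __name__ == "__main__":
    db = build_db(["John Smith", "Jon Smith", "Mary Jones"])
    for val, score in approximate_select(db, "John Smith", threshold=0.6):
        print(val, round(score, 3))
```

Because the predicate lives entirely in the SQL text, swapping in a different similarity measure means changing the scoring expression rather than the surrounding application code, which is the ease-of-integration advantage the abstract describes.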
Contents lists available at ScienceDirect Information Systems
"... journal homepage: www.elsevier.com/locate/infosys ..."

