Efficient similarity-based operations for data integration
- Data Knowl. Eng
Cited by 11 (1 self)
Dealing with discrepancies in data is still a big challenge in data integration systems. The problem occurs both when eliminating duplicates from semantically overlapping sources and when combining complementary data from different sources. Though using SQL operations like grouping and join seems to be a viable way, they fail if the attribute values of the potential duplicates or related tuples are not equal but only similar by certain criteria. As a solution to this problem, we present in this paper similarity-based variants of grouping and join operators. The extended grouping operator produces groups of similar tuples, and the extended join combines tuples satisfying a given similarity condition. We describe the semantics of these operators, discuss efficient implementations for the edit distance similarity and present evaluation results. Finally, we give examples of applications from the context of a data reconciliation project for looted art.
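To make the idea of a similarity join concrete, the following is a minimal sketch, not the operator implementation from the paper. It assumes the similarity condition is "edit distance at most k" and uses a naive nested loop with a simple length filter; the names `edit_distance`, `similarity_join`, and the sample artist lists are hypothetical and chosen only for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity_join(left, right, k):
    """Pair up values whose edit distance is at most k (naive nested loop)."""
    pairs = []
    for x in left:
        for y in right:
            # Length filter: strings whose lengths differ by more than k
            # cannot be within edit distance k, so skip the expensive check.
            if abs(len(x) - len(y)) > k:
                continue
            if edit_distance(x, y) <= k:
                pairs.append((x, y))
    return pairs


# Hypothetical sample data: two sources spelling artist names differently.
artists_a = ["Rembrandt", "Vermeer", "Duerer"]
artists_b = ["Rembrant", "Vermer", "Monet"]
matches = similarity_join(artists_a, artists_b, k=1)
print(matches)
```

An equality join on these two lists would return nothing, while the threshold-based variant pairs the misspelled names. A similarity-based grouping could be sketched along the same lines by clustering values that are transitively within distance k, though that is an assumption about one possible semantics rather than a statement of the paper's definition.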
Indexing Mixed Types for Approximate Retrieval
- IN PROC. OF VLDB
, 2005
Cited by 9 (5 self)
In various applications such as data cleansing, being able to retrieve categorical or numerical attributes based on notions of approximate match (e.g., edit distance, numerical distance) is of profound importance. Commonly, approximate match predicates are specified on combinations of attributes in conjunction. Existing database techniques for approximate retrieval, however, limit their applicability to single attribute retrieval through B-trees and their variants. In this paper, we propose a methodology that utilizes known multidimensional indexing structures for the problem of approximate multi-attribute retrieval. Our method
Matchsimile: A Flexible Approximate Matching Tool for Personal Names Searching
- JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY
, 2003
Cited by 8 (0 self)
In this paper we present the architecture and algorithms behind Matchsimile, an approximate string matching lookup tool especially designed for searching personal and company names in a large textual database. Part of a larger information retrieval environment, this specific engine accepts an input text file with a set of personal and company names and a set of restrictions for the search. After batch processing, the engine outputs another text file containing the occurrences that match each record of the input names file, according to its search parameters. Beyond the similarity search capabilities applied to each word that forms a name, the tool considers a set of formation rules for the words of personal names, such as combination, abbreviation, character mapping, duplicate detection, ordering, and word omission and insertion, among others. This engine is used in a successful commercial application (also named Matchsimile), which uses this tool to search for lawyers' names in many official law journal publications.
Using Similarity-based Operations for Resolving Data-Level Conflicts
- IN: BNCOD
, 2003
Cited by 5 (0 self)
Dealing with discrepancies in data is still a big challenge in data integration systems. The problem occurs both when eliminating duplicates from semantically overlapping sources and when combining complementary data from different sources. Though using SQL operations like grouping and join seems to be a viable way, they fail if the attribute values of the potential duplicates or related tuples are not equal but only similar by certain criteria. As a solution to this problem, we present in this paper similarity-based variants of grouping and join operators. The extended grouping operator produces groups of similar tuples, and the extended join combines tuples satisfying a given similarity condition. We describe the semantics of these operators, discuss efficient implementations for the edit distance similarity and present evaluation results. Finally, we give examples of how the operators can be used in given application scenarios.
Supporting Similarity Operations based on Approximate String Matching on the Web
- In CoopIS
, 2004
Cited by 4 (3 self)
Querying and integrating sources of structured data from the Web in most cases requires similarity-based concepts to deal with data-level conflicts. This is due to the often erroneous and imprecise nature of the data and diverging conventions for their representation. On the other hand, Web databases offer only limited interfaces and almost no support for similarity queries. The approach presented in this paper maps string similarity predicates to standard predicates like substring and keyword search as offered by many of the mentioned systems. To minimize the local processing costs and the required network traffic, the mapping uses materialized information on the selectivity of string samples such as q-samples, substrings, and keywords. Based on the predicate mapping, similarity selections and joins are described, and the quality and required effort of the operations are evaluated experimentally.
Database Structures, Based on Tries, for Text, Spatial, and General Data
- In International Symposium on Cooperative Database Systems for Advanced Applications
, 1996
Cited by 3 (0 self)
Digital trees, or tries, were introduced thirty years ago for sublinear-time retrieval of substrings from large texts. They were exploited for this, as a well-known example, by the University of Waterloo project to put the New Oxford English Dictionary onto CD-ROM. We have recently improved the performance of trie techniques for text and shown their use in searches for approximations to a given string. We have also shown that tries have excellent retrieval properties for spatial data. We have shown how to use tries to represent, without redundancy, spatial data which can be displayed to any resolution, retrieving from disk or from network only the amount of data that will finally be displayed. We have done this particularly for two-dimensional vector data, such as makes up very large maps, but have also established that the trie techniques apply to raster data and to data of other than two dimensions. These results are the basis for a claim that tries offer the best storage ...
Efficient approximate dictionary look-up over small alphabets
, 2005
Cited by 3 (1 self)
Given a dictionary W consisting of n binary strings of length m each, a d-query asks if there exists a string in W within Hamming distance d of a given binary query string q. The problem was posed by Minsky and Papert in 1969 [10] as a challenge to data structure design. Efficient solutions have been developed only for the special case when d = 1 (the 1-query problem). We assume the standard RAM model of computation, and consider the case of the problem when alphabet size is arbitrary but finite, and d is small. We preprocess the dictionary, and construct an edge-labelled tree with bounded branching factor and height. We present an algorithm to answer dictionary look-up within a given distance d of a given query string q. The algorithm is efficient when the alphabet size is small, or the dictionary is sparse. In particular, for the d-query problem the algorithm takes time O(m (log_{4/3} n − 1)^d (log_2 n)^{d+1}). This is an improvement over previously known algorithms for the d-query problem when d > 1. We also generalize the results for the case of the problem when edit distances are used. The algorithm can be modified such that it allows for words of different lengths as well as different lengths of query strings.
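As a point of reference for the d-query definition above, here is a minimal sketch of the brute-force baseline: a linear scan that takes O(nm) time per query. It is not the tree-based algorithm from the paper; the function names `hamming` and `d_query` and the sample dictionary `W` are assumptions made for illustration only.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


def d_query(dictionary, q, d):
    """Return True if some word in the dictionary is within Hamming distance d of q."""
    # Linear scan over all n words, each compared in O(m) time.
    return any(hamming(w, q) <= d for w in dictionary)


# Hypothetical dictionary of n = 3 binary strings of length m = 4.
W = ["0000", "0110", "1111"]
print(d_query(W, "0100", 1))  # "0000" differs from "0100" in one position
print(d_query(W, "1001", 1))  # every word differs in at least two positions
print(d_query(W, "1001", 2))
```

The scan touches every dictionary word on every query, which is the cost a preprocessed structure such as the one described in the abstract aims to avoid.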
NTCIR-3 PAT Experiments at Osaka Kyoiku University - Long Gram-based Index and Essential Words
Cited by 2 (1 self)
Long gram-based indices are evaluated experimentally in the NTCIR-3 patent task. Building gram-based indices requires no linguistic analysis, such as morphological analysis.
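The following is a minimal sketch of a generic character n-gram inverted index, illustrating why no morphological analysis is needed: the text is cut into overlapping character windows without any tokenization. It is not the index used in the NTCIR-3 experiments; the gram length `n=4`, the function names, and the sample documents are hypothetical.

```python
from collections import defaultdict


def ngrams(text, n):
    """All overlapping character n-grams of a string, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs, n):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for g in ngrams(text, n):
            index[g].add(doc_id)
    return index


def candidates(index, query, n):
    """Documents containing every n-gram of the query (a candidate set)."""
    grams = ngrams(query, n)
    if not grams:
        return set()  # query shorter than n: no grams to look up
    result = set(index.get(grams[0], set()))
    for g in grams[1:]:
        result &= index.get(g, set())
    return result


# Hypothetical sample documents.
docs = {
    1: "approximate string matching",
    2: "string indexing",
    3: "patent retrieval",
}
index = build_index(docs, n=4)
hits = candidates(index, "string", n=4)
print(hits)
```

The intersection yields candidates only: a document can contain all grams of the query without containing the query itself, so a verification step against the text would normally follow.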
Advanced Grouping and Aggregation for Data Integration
- in: Proceedings of the 10th International Conference on Information and Knowledge Management, CIKM'01
, 2001
Cited by 1 (1 self)
New applications from the areas of analytical data processing and data integration require powerful features to condense and reconcile available data. Object-relational and other data management systems available today provide only limited concepts to deal with these requirements. The general concept of grouping and aggregation appears to be a fitting paradigm for a number of the mentioned issues, but in its common form of equality-based groups and restricted aggregate functions a number of problems remain unsolved. Various extensions to this concept have been introduced in recent years regarding user-defined functions for aggregation and grouping. In particular, existing extensions to the grouping operation, like simple derivations of group-by values, do not meet the requirements of data integration applications. We propose generic interfaces for user-defined grouping and aggregation as part of an SQL extension, allowing for more complex functions, for instance the integration of data mining algorithms. Furthermore, we discuss high-level language primitives for common applications and illustrate the approach by introducing new concepts for similarity-based duplicate detection and elimination.
Trie Methods for Structured Data on Secondary Storage
, 2000
Cited by 1 (0 self)
We apply trie structures to indexing, storing and querying structured data on secondary storage. We are interested in storage compactness, I/O efficiency, order-preserving properties, general orthogonal range queries and exact match queries for very large files and databases. We also apply trie structures to relational joins (set operations). We compare trie structures to various data structures on secondary storage: multipaging and grid files in the direct access method category, R-trees/R*-trees and X-trees in the logarithmic access cost category, as well as some representative join algorithms for performing join operations. Our results show that range queries using the trie method are superior to these competitors in search cost when queries return more than a few records and are competitive with direct access methods for exact match queries. Furthermore, as the trie structure compresses data, it is the winner in terms of storage compared to all other methods mentioned above. We also present a new tidy function for order-preserving key-to-address transformation. Our tidy function is easy to construct and cheaper in access time and storage cost compared to its closest competitor.

