Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity
, 1998
Cited by 193 (13 self)
Most databases contain "name constants" like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. However, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, or both. In this paper, we reject the assumption that global domains can be easily constructed, and assume instead that the names are given in natural language text. We then propose a logic called WHIRL which reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval. We describe an efficient implementation of WHIRL and evaluate it experimentally on data extracted from the World Wide Web. We show that WHIR...
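As a rough illustration of the vector-space similarity mentioned in the abstract above, the following sketch scores pairs of name strings by the cosine of their TF-IDF vectors. The tokenizer, the smoothed weighting formula, and the sample names are assumptions chosen for this sketch; they are not taken from the WHIRL implementation.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens; a real system might also stem.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Assumed weighting: (1 + log tf) * log((n + 1) / df), always positive.
        vec = {t: (1 + math.log(c)) * math.log((n + 1) / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Hypothetical name constants from two sources.
names = ["AT&T Labs Research", "AT&T Labs", "Florham Park NJ"]
vecs = tfidf_vectors(names)
print(cosine(vecs[0], vecs[1]))  # shares two terms, so the score is positive
print(cosine(vecs[0], vecs[2]))  # shares no terms, so the score is zero
```

Names that share rare terms score higher than names that share only common ones, which is the property that lets textual similarity stand in for an exact key match.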
Data Integration Using Similarity Joins and a Word-Based Information Representation Language
- ACM TRANSACTIONS ON INFORMATION SYSTEMS
, 2000
A Web-based Information System that Reasons with Structured Collections of Text
- In Agents '98
, 1998
Cited by 53 (7 self)
The degree to which information sources are pre-processed by Web-based information systems varies greatly. In search engines like Altavista, little pre-processing is done, while in "knowledge integration" systems, complex site-specific "wrappers" are used to integrate different information sources into a common database representation. In this paper we describe an intermediate point between these two models. In our system, information sources are converted into a highly structured collection of small fragments of text. Database-like queries to this structured collection of text fragments are approximated using a novel logic called WHIRL, which combines inference in the style of deductive databases with ranked retrieval methods from information retrieval. WHIRL allows queries that integrate information from multiple Web sites, without requiring the extraction and normalization of object identifiers that can be used as keys; instead, operations that in conventional databases require equality tests...
Knowledge Integration for Structured Information Sources Containing Text (Extended Abstract)
- In The SIGIR-97 Workshop on Networked Information Retrieval
, 1997
Cited by 10 (6 self)
William W. Cohen, AT&T Labs-Research, 180 Park Avenue, Florham Park NJ 07932, wcohen@research.att.com. August 1, 1997.
Knowledge integration is the integration of distributed, heterogeneous databases, such as those available on the World Wide Web. In this paper we will consider a new type of knowledge integration problem, namely, the problem of combining information from relations that lack common object identifiers. A general technique for this problem is proposed, based on well-studied similarity measures for text, and the observation that Web-based databases often present their information to the end user through a veneer of text. We describe an extension of Datalog called WHIRL which allows passages of ordinary text to be used as keys. WHIRL supports documents as a built-in type, similarity reasoning with a built-in predicate, and answers every query with a list of answer substitutions that are ranked according to an overall similarity score. Experiments with a prototype...
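The idea of joining relations on text rather than on shared identifiers, and returning ranked answers, can be sketched as follows. This is a minimal illustration, not WHIRL itself: the similarity function is a simple word-overlap (Jaccard) stand-in rather than a vector-space measure, and the relation contents, field names, and the `soft_join` helper are all hypothetical.

```python
import heapq


def jaccard(a, b):
    """Stand-in similarity: word-set overlap between two text fields."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def soft_join(left, right, key_left, key_right, r=3, sim=jaccard):
    """Return the r best-scoring (score, left_row, right_row) triples.

    Instead of requiring key_left == key_right, every pair of rows is scored
    by the textual similarity of the two fields, and zero-score pairs are dropped.
    """
    scored = (
        (sim(l[key_left], rt[key_right]), l, rt)
        for l in left
        for rt in right
    )
    candidates = (triple for triple in scored if triple[0] > 0)
    # Ranking by score only; the key avoids comparing the row dicts themselves.
    return heapq.nlargest(r, candidates, key=lambda triple: triple[0])


# Hypothetical relations with no common object identifier.
listings = [
    {"title": "The Lost World Jurassic Park", "cinema": "Roxy"},
    {"title": "Men in Black", "cinema": "Odeon"},
]
reviews = [
    {"name": "Lost World", "rating": "mixed"},
    {"name": "Men in Black 1997", "rating": "good"},
]

for score, listing, review in soft_join(listings, reviews, "title", "name"):
    print(round(score, 2), listing["cinema"], review["rating"])
```

This brute-force version scores every pair of rows, which is only practical for small relations; it is meant to show the ranked-answer behaviour, not an efficient evaluation strategy.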
The WHIRL Approach to Integration: An Overview
- in Proceedings of the AAAI-98 Workshop on AI and Information Integration
, 1998
Cited by 9 (0 self)
We describe a new integration system, in which information sources are converted into a highly structured collection of small fragments of text. Database-like queries to this structured collection of text fragments are approximated using a novel logic called WHIRL, which combines inference in the style of deductive databases with ranked retrieval methods from information retrieval. WHIRL allows queries that integrate information from information sources, without requiring the extraction and normalization of object identifiers that can be used as keys; instead, operations that in conventional databases require equality tests on keys are approximated using IR similarity metrics for text. This leads to a reduction in the amount of human engineering required to field an integration system.
Introduction
Knowledge integration systems like the Information Manifold (Levy, Rajaraman, & Ordille 1996), TSIMMIS (Garcia-Molina et al. 1995), and others (Arens, Knoblock, & Hsu 1996; Atzeni, Mecca, &...
Reasoning about Textual Similarity in a Web-Based Information Access System
- Autonomous Agents and Multi-Agent Systems
, 1999
Cited by 2 (0 self)
The degree to which information sources are pre-processed by Web-based information systems varies greatly. In search engines like Altavista, little pre-processing is done, while in "knowledge integration" systems, complex site-specific "wrappers" are used to integrate different information sources into a common database representation. In this paper we describe an intermediate point between these two models. In our system, information sources are converted into a highly structured collection of small fragments of text. Database-like queries to this structured collection of text fragments are approximated using a novel logic called WHIRL, which combines inference in the style of deductive databases with ranked retrieval methods from information retrieval (IR). WHIRL allows queries that integrate information from multiple Web sites, without requiring the extraction and normalization of object identifiers that can be used as keys; instead, operations that in conventional databases require...
Data Integration for Many Data Sources using Context-Sensitive Similarity Metrics
Good similarity functions are crucial for many important subtasks in data integration, such as “soft joins” and data deduping, and one widely-used similarity function is TFIDF similarity. In this paper we describe a modification of TFIDF similarity that is more appropriate for certain datasets: namely, large data collections formed by merging together many smaller collections, each of which is (nearly) duplicate-free. Our similarity metric, called CX.IDF, shares TFIDF’s most important properties: it can be computed efficiently and stored compactly; it can be “learned” using few passes over a dataset (in experiments, one or three passes are used), and is well-suited to parallelization; and finally, like TFIDF, it requires no labeled training data. In experiments, the new similarity function reduces matching errors relative to TFIDF by up to 80%, and reduces k-nearest neighbor classification error by 20% on average.
Categories and Subject Descriptors: H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics - complexity measures, performance measures
Context-Sensitive Similarity Metrics
, 2011
Good similarity functions are crucial for many important subtasks in data integration, such as “soft joins” and data deduping, and one widely-used similarity function is TFIDF similarity. In this paper we describe a modification of TFIDF similarity that is more appropriate for certain datasets: namely, large data collections formed by merging together many smaller collections, each of which is (nearly) duplicate-free. Our similarity metric, called CX.IDF, shares TFIDF’s most important properties: it can be computed efficiently and stored compactly; it can be “learned” using few passes over a dataset (in experiments, one or three passes are used), and is well-suited to parallelization; and finally, like TFIDF, it requires no labeled training data. In experiments, the new similarity function reduces matching errors relative to TFIDF by up to 80%, and reduces k-nearest ...
An important step in integrating heterogeneous datasets is determining a mapping between objects from one source and objects from another source, a step variously known as record linkage, matching, and deduping (among other terms) in the literature. One useful matching strategy is to use an appropriately thresholded similarity function, i.e., to consider objects as identical if they are
Query Efficiency Prediction for Dynamic Pruning
Dynamic pruning strategies are effective yet permit efficient retrieval by pruning, i.e., not fully scoring all postings of all documents matching a given query. However, the amount of pruning possible for a query can vary, resulting in queries with similar properties (query length, total numbers of postings) taking different amounts of time to retrieve search results. In this work, we investigate the causes for inefficient queries, identifying reasons such as the balance between informativeness of query terms, and the distribution of retrieval scores within the posting lists. Moreover, we note the advantages in being able to predict the efficiency of a query, and propose various query efficiency predictors. Using
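To make the notion of dynamic pruning above concrete, here is a small sketch of one upper-bound-based scheme: a document stops being scored as soon as its partial score plus the best possible contribution of the remaining terms cannot beat the current top-k threshold. It is a simplified illustration under assumed data structures (an in-memory `postings` dict with precomputed per-posting scores), not the strategy or the predictors studied in the paper. The function also reports how many postings were actually scored, since that count is what varies between queries.

```python
import heapq


def top_k_pruned(postings, query_terms, k):
    """Return (top-k list of (score, doc), number of postings scored).

    postings: dict term -> dict doc_id -> score contribution (assumed non-empty).
    """
    if k <= 0:
        return [], 0
    terms = [t for t in query_terms if t in postings]
    # Upper bound on what each term can contribute to any document.
    max_score = {t: max(postings[t].values()) for t in terms}
    terms.sort(key=lambda t: max_score[t], reverse=True)
    # suffix_bound[i]: best possible total contribution of terms[i:].
    suffix_bound = [0.0] * (len(terms) + 1)
    for i in range(len(terms) - 1, -1, -1):
        suffix_bound[i] = suffix_bound[i + 1] + max_score[terms[i]]

    heap = []  # min-heap holding the current top-k (score, doc) pairs
    scored_postings = 0
    candidates = sorted({d for t in terms for d in postings[t]})
    for doc in candidates:
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        score, pruned = 0.0, False
        for i, t in enumerate(terms):
            # Even with maximal remaining contributions the document cannot
            # exceed the threshold, so its remaining postings are skipped.
            if score + suffix_bound[i] <= threshold:
                pruned = True
                break
            if doc in postings[t]:
                score += postings[t][doc]
                scored_postings += 1
        if pruned:
            continue
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    return sorted(heap, reverse=True), scored_postings


# Hypothetical index with two query terms and four documents.
postings = {
    "whirl": {1: 2.0, 2: 0.5, 3: 0.4},
    "join": {1: 1.0, 3: 0.3, 4: 0.2},
}
top, scored = top_k_pruned(postings, ["whirl", "join"], k=1)
print([doc for _, doc in top], scored)
```

With this toy index, asking for a single result lets most postings be skipped, while asking for two results forces every posting to be scored; the same query terms can therefore cost very different amounts of work depending on how scores are distributed.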

