Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity
, 1998
Cited by 193 (13 self)
Most databases contain "name constants" like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. However, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, or both. In this paper, we reject the assumption that global domains can be easily constructed, and assume instead that the names are given in natural language text. We then propose a logic called WHIRL which reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval. We describe an efficient implementation of WHIRL and evaluate it experimentally on data extracted from the World Wide Web. We show that WHIR...
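The idea of joining name constants by textual similarity rather than by exact match can be illustrated with a small sketch. This is not WHIRL's implementation: it is a brute-force join that scores every pair, and the tokenizer, the TF-IDF weighting formula, and the function names (`tokenize`, `tfidf_vectors`, `similarity_join`) are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-alphanumerics (a simplifying assumption).
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(names):
    """Return unit-length TF-IDF vectors (as dicts) for a list of name strings."""
    bags = [Counter(tokenize(name)) for name in names]
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    n = len(bags)
    vectors = []
    for bag in bags:
        # One common TF-IDF variant; the paper's exact weighting may differ.
        vec = {t: (1 + math.log(tf)) * math.log(1 + n / doc_freq[t])
               for t, tf in bag.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def similarity_join(left, right, top_k=5):
    """Brute-force join: score every (left, right) pair by cosine similarity."""
    vecs = tfidf_vectors(left + right)
    left_vecs, right_vecs = vecs[:len(left)], vecs[len(left):]
    scored = []
    for a_name, a in zip(left, left_vecs):
        for b_name, b in zip(right, right_vecs):
            # Vectors are unit length, so the dot product is the cosine.
            score = sum(w * b.get(t, 0.0) for t, w in a.items())
            if score > 0:
                scored.append((score, a_name, b_name))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    for score, a, b in similarity_join(
        ["General Motors", "Kleiser Walczak Construction Co"],
        ["General Motors Corp", "Kleiser-Walczak", "Motorola Inc"],
    ):
        print(f"{score:.3f}  {a!r} ~ {b!r}")
```

Pairs are returned in decreasing order of score, which mirrors the ranked-answer style the abstract describes; an efficient implementation would avoid scoring all pairs.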
Data Integration Using Similarity Joins and a Word-Based Information Representation Language
- ACM TRANSACTIONS ON INFORMATION SYSTEMS
, 2000
Probabilistic Modeling of Distributed Information Retrieval
- ACM SIGIR Conference
, 1997
Cited by 43 (4 self)
This paper describes a model for optimum information retrieval over a distributed document collection. The model stems from Robertson's Probability Ranking Principle: Having computed individual document rankings correlated to different subcollections, these local rankings are stepwise merged into a final ranking list where the documents are ordered according to their probability of relevance. Here, a full dissemination of subcollection-wide information is not required. The documents of different subcollections are assumed to be indexed using different indexing vocabularies. Moreover, local rankings may be computed by individual probabilistic retrieval methods. The underlying data volume is arbitrarily scalable. A criterion for effectively limiting the ranking process to a subset of subcollections extends the model.
Keywords: Information Retrieval, Probabilistic Model, Distributed Systems
1 Introduction
Information retrieval (IR) methods have been developed to support the search for t...
WHIRL: A Word-based Information Representation Language
- Artificial Intelligence
, 1999
Cited by 24 (1 self)
We describe WHIRL, an "information representation language" that synergistically combines properties of logic-based and text-based representation systems. WHIRL is a subset of Datalog that has been extended by introducing an atomic type for textual entities, an atomic operation for computing textual similarity, and a "soft" semantics; that is, inferences in WHIRL are associated with numeric scores, and presented to the user in decreasing order by score. This paper briefly describes WHIRL, and then surveys a number of applications. We show that WHIRL strictly generalizes both ranked retrieval of documents, and logical deduction; that non-trivial queries about large databases can be answered efficiently; that WHIRL can be used to accurately integrate data from heterogeneous information sources, such as those found on the Web; that WHIRL can be used effectively for inductive classification of text; and finally, that WHIRL can be used to semi-automatically generate extraction programs for structured documents.
Metadata for Integrating Speech Documents in a Text Retrieval System
- SIGMOD Record
, 1994
"... CH-8092 Zürich (Switzerland) ..."
On the Update of Term Weights in Dynamic Information Retrieval Systems
- In Proceedings of the 4th International Conference on Knowledge and Information Management
, 1995
Cited by 19 (4 self)
Using the vector space information retrieval model, we show that the update of term weights under document insertions is computationally expensive for weighting schemes that use collection statistics and normalization by document vector lengths. In the dynamic setting, we argue that strict adherence to such schemes is impractical and unnecessary as long as retrieval effectiveness commensurate with strict adherence is attained. Experiments using standard test collections as a source of document insertions support this argument. These experiments indicate that term weights may drift from their mathematically defined values without a serious loss of retrieval effectiveness. The only problematic setting is when new terms are present in newly inserted documents. Ignoring these terms can cause an effectiveness degradation.
1 Introduction
The rapid growth in online information has fueled recent interest in techniques to handle the burgeoning flood of data becoming electronically available. ...
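To make the notion of term-weight drift concrete, here is a minimal sketch, not the paper's experimental setup. It assumes a smoothed IDF formula, omits length normalization, and uses a hypothetical `DriftingIndex` class that freezes each document's weights at insertion time and then compares them with the values the formula would give for the current collection.

```python
import math
from collections import Counter


class DriftingIndex:
    """Stores TF-IDF weights computed once, at insertion, and never updated."""

    def __init__(self):
        self.n_docs = 0
        self.doc_freq = Counter()
        self.term_freqs = []   # raw term counts per document
        self.stored = []       # weights frozen at insertion time

    def idf(self, term):
        # Smoothed IDF (an assumed variant); depends on collection statistics.
        return math.log((self.n_docs + 1) / (self.doc_freq[term] + 1)) + 1.0

    def insert(self, text):
        tf = Counter(text.lower().split())
        self.n_docs += 1
        self.doc_freq.update(tf.keys())
        self.term_freqs.append(tf)
        self.stored.append({t: f * self.idf(t) for t, f in tf.items()})
        return len(self.stored) - 1

    def exact_weights(self, doc_id):
        # What the formula gives now; recomputing this for every document
        # after every insertion is the expensive step.
        return {t: f * self.idf(t) for t, f in self.term_freqs[doc_id].items()}

    def drift(self, doc_id):
        # Largest absolute gap between a stored weight and its exact value.
        exact = self.exact_weights(doc_id)
        return max(abs(exact[t] - w) for t, w in self.stored[doc_id].items())


if __name__ == "__main__":
    index = DriftingIndex()
    first = index.insert("apple banana")
    index.insert("apple cherry")
    print("drift of first document:", index.drift(first))
```

Each insertion changes the collection statistics, so the stored weights of earlier documents fall out of date unless the whole index is reweighted.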
Execution Performance Issues in Full-Text Information Retrieval
, 1995
Cited by 18 (0 self)
The task of an information retrieval system is to identify documents that will satisfy a user's information need. Effective fulfillment of this task has long been an active area of research, leading to sophisticated retrieval models for representing information content in documents and queries and measuring similarity between the two. The maturity and proven effectiveness of these systems has resulted in demand for increased capacity, performance, scalability, and functionality, especially as information retrieval is integrated into more traditional database management environments. In this dissertation we explore a number of functionality and performance issues in information retrieval. First, we consider creation and modification of the document collection, concentrating on management of the inverted file index. An inverted file architecture based on a persistent object store is described and experimental results are presented for inverted file creation and modification. Our architecture provides performance that scales well with document collection size and the database features supported by the persistent object store provide many solutions to issues that arise during integration of information retrieval into more general database environments. We then turn to query evaluation speed and introduce a new optimization technique for statistical ranking retrieval systems that support structured queries. Experimental results from a variety of query sets show that execution time can be reduced by more than 50% wit...
A retrieval mechanism for semi-structured photographic collections
- In LNCS 1308 (Proceedings of DEXA 97)
, 1997
Cited by 15 (7 self)
Abstract. In this paper, a new approach for retrieval from semi-structured photographic collections is described. We have developed a retrieval model based on the Dempster-Shafer theory of evidence combination. Basic concepts of the Dempster-Shafer theory are explained and the suitability of this theory for information retrieval is explored. A retrieval model for a semi-structured photographic collection is presented. Extensibility of this retrieval model for multimedia information retrieval is discussed. Integration of database and information retrieval concepts is a major requirement for semi-structured multimedia information retrieval and is accomplished in this model. A novel indexing scheme for photographic materials is described. We use spatial features, which are objects and their location, as photographic features. We have developed a multi-modal query interface for querying a photographic collection. A prototype system, Epic, has been implemented and is described in this paper.
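For readers unfamiliar with Dempster-Shafer evidence combination, the standard rule of combination can be sketched as follows. This is the textbook rule only, not the retrieval model of the Epic prototype; the function name `combine` and the example mass values are assumptions made for illustration.

```python
def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to a mass in [0, 1];
    the masses of one function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            common = b & c
            if common:
                combined[common] = combined.get(common, 0.0) + mass_b * mass_c
            else:
                # Mass assigned to contradictory evidence.
                conflict += mass_b * mass_c
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    # Renormalize so the combined masses sum to 1.
    return {a: mass / (1.0 - conflict) for a, mass in combined.items()}


if __name__ == "__main__":
    # Hypothetical evidence about which of two photos, d1 or d2, is relevant.
    frame = frozenset({"d1", "d2"})
    from_caption = {frozenset({"d1"}): 0.6, frame: 0.4}
    from_layout = {frozenset({"d2"}): 0.5, frame: 0.5}
    for hypothesis, mass in combine(from_caption, from_layout).items():
        print(sorted(hypothesis), round(mass, 4))
```

Mass left on the whole frame expresses ignorance rather than belief in any single photo, which is what distinguishes this scheme from a plain probability distribution over documents.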
The QUIQ engine: A hybrid IR DB system
- In ICDE
, 2003
Cited by 15 (2 self)
For applications that involve rapidly changing textual data and also require traditional DBMS capabilities, current systems are unsatisfactory. In this paper, we describe a hybrid IR-DB system that serves as the basis for the QUIQConnect product, a collaborative customer support application. We present the novel query paradigm and system architecture, along with performance results.
Cross-Language Speech Retrieval: Establishing a Baseline Performance
- In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
, 1997
Cited by 12 (1 self)
We present here the realisation of a cross-language speech retrieval system which retrieves German speech documents in response to user queries specified as French text. This has been achieved through the integration of two existing modules of the SPIDER information retrieval system, namely the query pseudo-translation module and the speech retrieval module. Our approach to cross-language retrieval uses an automatically constructed corpus-based information structure called a similarity thesaurus. A similarity thesaurus can be constructed over any loosely comparable corpus - a parallel corpus is not necessary. The similarity thesaurus used here was constructed over a 330 MByte corpus of comparable German and French news stories. Our speech retrieval module is based on a speaker-independent phoneme recognizer and it indexes speech documents by N-grams of phonemic features. The speech retrieval module includes an additional probabilistic matching technique designed to aid retrieval from e...

