Results 1 - 10 of 20
Integrating Structured Data and Text: A relational approach
- Journal of the American Society for Information Science
, 1997
Cited by 50 (27 self)
We integrate structured data and text using the unchanged, standard relational model. We started with the premise that a relational system could be used to implement an Information Retrieval (IR) system. After implementing a prototype to verify that premise, we then began to investigate the performance of a parallel relational database system for this application. We also tested the effect of query reduction on accuracy and found that queries can be reduced prior to their implementation without incurring a significant loss in precision/recall. This reduction also serves to improve run-time performance. After comparing our results to a special purpose IR system, we conclude that the relational model offers scalable performance and includes the ability to integrate structured data and text in a portable fashion.
1 Introduction
Increasingly, applications integrate structured and unstructured data, responding to requests such as "Find articles containing vehicle and sales published in jou...
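As a rough illustration of the premise that a standard relational system can serve as an IR engine, the sketch below keeps an inverted index in an ordinary table and answers a combined text and structured-data request with plain SQL. The schema (`doc`, `posting`), the sample rows, the summed term-frequency score, and the function names are all assumptions made for illustration; the abstract does not describe the authors' actual schema or ranking.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE doc (doc_id INTEGER PRIMARY KEY, journal TEXT, year INTEGER);
        CREATE TABLE posting (term TEXT, doc_id INTEGER, tf INTEGER);
    """)
    conn.executemany("INSERT INTO doc VALUES (?, ?, ?)", [
        (1, "Journal A", 1996),
        (2, "Journal B", 1996),
        (3, "Journal A", 1997),
    ])
    conn.executemany("INSERT INTO posting VALUES (?, ?, ?)", [
        ("vehicle", 1, 3), ("sales", 1, 2),
        ("vehicle", 2, 1),
        ("vehicle", 3, 1), ("sales", 3, 5),
    ])
    return conn

def search(conn, terms, journal):
    """Return (doc_id, score) for documents in `journal` containing every term."""
    placeholders = ",".join("?" for _ in terms)
    sql = f"""
        SELECT p.doc_id, SUM(p.tf) AS score
        FROM posting AS p JOIN doc AS d ON d.doc_id = p.doc_id
        WHERE p.term IN ({placeholders}) AND d.journal = ?
        GROUP BY p.doc_id
        HAVING COUNT(DISTINCT p.term) = ?
        ORDER BY score DESC, p.doc_id
    """
    return conn.execute(sql, [*terms, journal, len(terms)]).fetchall()

# Text condition (both terms) combined with a structured condition (journal).
results = search(build_demo_db(), ["vehicle", "sales"], "Journal A")
```

Because the text condition and the structured condition live in the same query, no separate IR engine is needed to combine them, which is the portability argument the abstract makes.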
Content-Based Retrieval for Music Collections
, 1999
Cited by 35 (1 self)
A content-based retrieval model for tackling the mismatch problems specific to music data is proposed and implemented. The system uses a pitch profile encoding for queries in any key and an n-note indexing method for approximate matching in sub-linear time. A distinct function that extracts key melodies for query suggestion is developed. The Web-based system provides flexible user interface for query formulation and result browsing. Users can search the system by a short sequence of notes, by uploading a file created by singing, or by clicking suggested key melodies without input. Experiments show that the pitch profile encoding and a 3-note indexing are able to overcome the key mismatch problem and the random errors caused by pitch error, note deletion and insertion. The use of extracted key melodies improves performance over direct search of the music database. For the type of burst mismatch, a query expansion approach is applied.
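The abstract does not define the pitch profile encoding, so the sketch below substitutes a simple stand-in: semitone intervals between successive notes, which do not change when a melody is transposed. Every run of three notes (two intervals) is then indexed, as a loose analogue of the 3-note indexing. The song data, function names, and vote-count ranking are hypothetical.

```python
from collections import defaultdict

def intervals(pitches):
    """Semitone differences between successive MIDI pitches (key-invariant)."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

def three_note_keys(pitches):
    """Each window of three notes becomes a pair of intervals."""
    iv = intervals(pitches)
    return [tuple(iv[i:i + 2]) for i in range(len(iv) - 1)]

def build_index(songs):
    """Map each 3-note key to the set of songs that contain it."""
    index = defaultdict(set)
    for name, pitches in songs.items():
        for key in three_note_keys(pitches):
            index[key].add(name)
    return index

def search(index, query):
    """Rank songs by how many of the query's 3-note keys they share."""
    votes = defaultdict(int)
    for key in three_note_keys(query):
        for name in index.get(key, ()):
            votes[name] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

songs = {
    "tune_a": [60, 62, 64, 65, 67],  # MIDI note numbers (assumed input format)
    "tune_b": [60, 60, 67, 67, 69],
}
index = build_index(songs)
query = [65, 67, 69, 70]  # opening of tune_a, transposed up five semitones
hits = search(index, query)
```

Because a song is ranked by partial key overlap rather than an exact match, a single wrong, missing, or extra note only removes the few keys that span it, which is one plausible reading of how short n-note keys tolerate random errors.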
Probabilistic Retrieval of OCR Degraded Text Using N-Grams
- European Conference on Digital Libraries
, 1997
Cited by 32 (0 self)
The retrieval of OCR degraded text using n-gram formulations within a probabilistic retrieval system is examined in this paper. Direct retrieval of documents using n-gram databases of 2 and 3-grams or 2, 3, 4 and 5-grams resulted in improved retrieval performance over standard (word based) queries on the same data when a level of 10 percent degradation or worse was achieved. A second method of using n-grams to identify appropriate matching and near matching terms for query expansion which also performed better than using standard queries is also described. This method was less effective than direct n-gram query formulations but can likely be improved with alternative query component weighting schemes and measures of term similarity. Finally, a web based retrieval application using n-gram retrieval of OCR text and display, with query term highlighting, of the source document image is described.
1 Introduction
A major problem with retrieval of OCR text from image data is the inevitabl...
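To make the idea of direct n-gram retrieval concrete, the sketch below matches a clean query against OCR-garbled text using character 2-grams and 3-grams. It scores with a plain Dice overlap rather than the probabilistic model used in the paper, and the sample documents, garbling, and function names are invented for illustration.

```python
def char_ngrams(text, sizes=(2, 3)):
    """Set of character n-grams of the given sizes, after whitespace/case folding."""
    text = " ".join(text.lower().split())
    grams = set()
    for n in sizes:
        grams.update(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

def dice(a, b):
    """Dice coefficient between two n-gram sets (a swapped-in similarity measure)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def rank(query, docs, sizes=(2, 3)):
    """Return document ids ordered by decreasing n-gram overlap with the query."""
    q = char_ngrams(query, sizes)
    scored = [(dice(q, char_ngrams(text, sizes)), doc_id)
              for doc_id, text in docs.items()]
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]

docs = {
    # Simulated OCR errors: "m" read as "rn", "l" read as "1".
    "ocr_page": "inforrnation retrieva1 of degraded text",
    "other": "music collections and melodies",
}
ranking = rank("information retrieval", docs)
```

A word-based match would miss `ocr_page` entirely, since neither query word survives the garbling intact, while most of the n-grams inside each word still do.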
TELLTALE: Experiments in a Dynamic Hypertext Environment for Degraded and Multilingual Data
- Journal of the American Society for Information Science (JASIS)
, 1996
Cited by 31 (11 self)
Methods and tools for finding documents relevant to a user’s needs in document corpora can be found in the information retrieval, library science, and hypertext communities. Typically, these systems provide retrieval capabilities for fairly static corpora, their algorithms are dependent on the language for which they are written, e.g. English, and they do not perform well when presented with misspelled words or text that has been degraded by OCR (optical character recognition) techniques. In this article, we present experimentation results for the TELLTALE system. TELLTALE is a dynamic hypertext environment that provides full-text search from a hypertext-style user interface for text corpora that may be garbled by OCR or transmission errors, and that may contain languages other than English. TELLTALE uses several techniques based on n-grams (n-character sequences of text). With these results we show that the dynamic linkage mechanisms in TELLTALE are tolerant of garbles in up to 30% of the characters in the body of the text.
The TELLTALE Dynamic Hypertext Environment: Approaches to Scalability
- In Advances in Intelligent Hypertext, Lecture Notes in Computer Science
, 1997
Cited by 15 (1 self)
Methods and tools for finding documents relevant to a user's needs in document corpora can be found in the information retrieval, library science, and hypertext communities. Typically
Information Retrieval: A Survey
, 2000
Cited by 14 (0 self)
Information Retrieval (IR) is the discipline that deals with retrieval of unstructured data, especially textual documents, in response to a query or topic statement, which may itself be unstructured, e.g., a sentence or even another document, or which may be structured, e.g., a boolean expression. The need for effective methods of automated IR has grown in importance because of the tremendous explosion in the amount of unstructured data, both internal, corporate document collections, and the immense and growing number of document sources on the Internet. This report is a tutorial and survey of the state of the art, both research and commercial, in this dynamic field. The topics covered include: formulation of structured and unstructured queries and topic statements, indexing (including term weighting) of document collections, methods for computing the similarity of queries and documents, classification and routing of documents in an incoming stream to users on the basis of topic or nee...
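Two of the topics listed, term weighting and query-document similarity, can be illustrated with a small TF-IDF and cosine-similarity sketch. This is one common textbook formulation chosen for illustration, not necessarily the variant the survey describes; the smoothed IDF formula, the sample documents, and the function names are assumptions.

```python
import math
from collections import Counter

def build_idf(token_lists):
    """Smoothed inverse document frequency for every term in the collection."""
    n = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def weight(tokens, idf):
    """TF-IDF vector as a dict; terms outside the collection vocabulary are dropped."""
    return {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

docs = [
    "relational database text retrieval",
    "music melody retrieval",
    "ocr text degraded",
]
token_lists = [d.lower().split() for d in docs]
idf = build_idf(token_lists)
doc_vecs = [weight(toks, idf) for toks in token_lists]

query_vec = weight("music retrieval".split(), idf)
scores = [cosine(query_vec, v) for v in doc_vecs]
best = max(range(len(docs)), key=scores.__getitem__)
```

Here `best` is the index of the highest-scoring document; a document sharing no terms with the query scores exactly zero.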
Meta-data for Distributed Text Retrieval
- In Proceedings of First IEEE Metadata Conference
, 1997
Cited by 9 (2 self)
duced by every back-end for use by the broker. The first back-end information provider to be adapted for use in CARROT was Telltale [5], which is an n-gram based information retrieval system. We have since added Witten, Moffat, and Bell's mg system [7], with a stripped-down version of Telltale to generate n-gram based meta-data. The goal of n-gram reduction is to keep those n-grams which provide the best information for making routing decisions, while keeping the size of the meta-data object as small as possible. In our experiments with the documents in a single corpus, we have investigated two reduction techniques so far. We find that by discarding singleton n-grams, i.e. n-grams that occur exactly once in any one document, we are able to reduce the size of the corpus centroid by as much as 60% for some small test collections, without significant change in effectiveness as measured using st
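The singleton-discarding reduction can be sketched as follows. The excerpt's wording is read here as "drop any n-gram whose count within a document is exactly one before adding that document to the corpus centroid"; that reading, the n-gram length of 3, the toy documents, and the function names are assumptions, and the resulting size reduction says nothing about the 60% figure reported in the excerpt.

```python
from collections import Counter

def ngram_counts(text, n=3):
    """Count character n-grams in one document after whitespace/case folding."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def drop_singletons(counts):
    """Keep only n-grams that occur more than once in this document."""
    return Counter({g: c for g, c in counts.items() if c > 1})

def centroid(profiles):
    """Sum per-document n-gram counts into a single corpus-level profile."""
    total = Counter()
    for profile in profiles:
        total.update(profile)
    return total

docs = ["abcabcabc xyz", "abcabc q"]
full = centroid(ngram_counts(d) for d in docs)
reduced = centroid(drop_singletons(ngram_counts(d)) for d in docs)

# Fraction of distinct n-grams removed from the centroid by the reduction.
saving = 1 - len(reduced) / len(full)
```

Whether routing effectiveness survives such a reduction depends on the collection and has to be measured, as the authors did; the sketch only shows the size side of the trade-off.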
An Approach to Large Scale Distributed Information Systems Using Statistical Properties of Text to Guide Agent Search
- In CIKM Workshop on Intelligent Information Agents
Cited by 6 (2 self)
Introduction and Problem Statement
Given a large, dynamic corpus, distributed over several (possibly hundreds or thousands of) physical sites, traditional Information Retrieval techniques don't seem to scale [BDMS94]. Web-walking robots are fine as far as they go, but they're centralized, they don't stay current, and, except for Harvest [BDHMSW95], none, as yet, have any notion of local indices that are shared.
2 A Proposed Solution
The solution we propose uses a mediated agent architecture composed of local server agents managing local corpora and communicating via agent name servers and brokers [see Figure 1]. Use of this intelligent agent architecture leverages the scalability solutions from the intelligent agent community for use in distributed information retrieval. In addition to the use of a mediated architecture, our solution depends upon the use of automatically generated, effective metadata. In this case, metadata is effective when it provides accu
PAI: Automatic Indexing for Extracting Asserted Keywords from a Document
- Journal of New Generation Computing
, 2002
Cited by 3 (0 self)
This paper proposes an automatic indexing method called PAI (Priming Activation Indexing) that extracts keywords representing the author's main point from a document based on the priming effect in cognitive process. The basic idea of PAI is that since an author writes a document emphasiz-
Copyright © 2002, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

