Results 1 - 3 of 3
Probabilistic Retrieval of OCR Degraded Text Using N-Grams
- European Conference on Digital Libraries, 1997
Cited by 32 (0 self)
This paper examines the retrieval of OCR degraded text using n-gram formulations within a probabilistic retrieval system. Direct retrieval of documents using n-gram databases of 2 and 3-grams or 2, 3, 4 and 5-grams improved retrieval performance over standard (word based) queries on the same data at degradation levels of 10 percent or worse. A second method, which uses n-grams to identify matching and near-matching terms for query expansion, is also described; it too performed better than standard queries. This method was less effective than direct n-gram query formulations but can likely be improved with alternative query component weighting schemes and measures of term similarity. Finally, a web based retrieval application is described that uses n-gram retrieval of OCR text and displays the source document image with query term highlighting.
1 Introduction A major problem with retrieval of OCR text from image data is the inevitabl...
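To illustrate why character n-grams tolerate OCR errors better than whole-word matching, here is a minimal sketch. It is not the paper's probabilistic retrieval model; the function names, the overlap score, and the sample OCR error are assumptions chosen for illustration only.

```python
def char_ngrams(text, sizes=(2, 3)):
    """Return the set of character n-grams of the given sizes (here 2- and 3-grams)."""
    text = text.lower()
    return {text[i:i + n] for n in sizes for i in range(len(text) - n + 1)}


def ngram_overlap(query, candidate, sizes=(2, 3)):
    """Fraction of the query's n-grams that also occur in the candidate."""
    query_grams = char_ngrams(query, sizes)
    if not query_grams:
        return 0.0
    return len(query_grams & char_ngrams(candidate, sizes)) / len(query_grams)


clean = "retrieval"
degraded = "retrieva1"  # hypothetical OCR error: 'l' misread as '1'

# A word-based match fails outright, while most n-grams still agree.
word_match = clean == degraded
overlap = ngram_overlap(clean, degraded)
print(word_match, round(overlap, 3))
```

A single misread character destroys the word-level match but only removes the few n-grams that span the damaged position, which is the intuition behind querying an n-gram index directly.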
Identification of Confusable Drug Names: A New Approach and Evaluation Methodology
- In Proceedings of COLING 2004, 2004
Cited by 11 (1 self)
This paper addresses the mitigation of medical errors due to the confusion of sound-alike and look-alike drug names. Our approach applies two new methods: one based on orthographic similarity ("look-alike") and the other based on phonetic similarity ("sound-alike"). We present a new recall-based evaluation methodology for determining the effectiveness of different similarity measures on drug names. We show that the new orthographic measure (BI-SIM) outperforms other commonly used measures of similarity on a set containing both look-alike and sound-alike pairs, and that the feature-based phonetic approach (ALINE) outperforms orthographic approaches on a test set containing solely sound-alike confusion pairs. However, an approach that combines several different measures achieves the best results on both test sets.
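As a rough illustration of what an orthographic similarity measure computes, the sketch below scores two names with the Dice coefficient over character bigrams. This is a generic baseline, not BI-SIM or ALINE, whose definitions are not given in the abstract; the function names and the example names are assumptions and are not taken from the paper.

```python
from collections import Counter


def bigrams(name):
    """Return the list of adjacent character pairs in a lowercased name."""
    name = name.lower()
    return [name[i:i + 2] for i in range(len(name) - 1)]


def dice_similarity(a, b):
    """Dice coefficient over character bigrams: 2 * shared / (|a| + |b|)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a and not grams_b:
        return 1.0  # two names too short to have bigrams are treated as identical
    # Counter intersection counts each shared bigram at most as often as it
    # occurs in both names (multiset intersection).
    shared = sum((Counter(grams_a) & Counter(grams_b)).values())
    return 2 * shared / (len(grams_a) + len(grams_b))


# Illustrative pair chosen for this sketch, not drawn from the paper's test sets.
score = dice_similarity("Zantac", "Xanax")
print(round(score, 3))
```

A score near 1 flags a candidate look-alike pair. A purely orthographic score like this one says nothing about pronunciation, which is why the abstract also reports a separate phonetic measure.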
Pattern occurrences in multicomponent models
- 2004
Cited by 1 (1 self)
In this paper we determine some limit distributions of pattern statistics in rational stochastic models, defined by means of nondeterministic weighted finite automata. We present a general approach to analyzing these statistics in rational models having an arbitrary number of connected components. We explicitly establish the limit distributions in the most significant cases; these are characterized by a family of unimodal density functions defined by polynomials over adjacent intervals.

