Fast String Correction with Levenshtein-Automata
- INTERNATIONAL JOURNAL OF DOCUMENT ANALYSIS AND RECOGNITION
, 2002
Cited by 19 (3 self)
The Levenshtein-distance between two words is the minimal number of insertions, deletions or substitutions that are needed to transform one word into the other. Levenshtein-automata of degree n for a word W are defined as finite state automata that recognize the set of all words V where the Levenshtein-distance between V and W does not exceed n. We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein-automaton of degree n for W in time linear in the length of W. Given an electronic dictionary that is implemented in the form of a trie or a finite state automaton, the Levenshtein-automaton for W can be used to control search in the lexicon in such a way that exactly the lexical words V are generated where the Levenshtein-distance between V and W does not exceed the given bound. This leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries. We then introduce a second method that avoids the explicit computation of Levenshtein-automata and leads to even better efficiency. We also describe how to extend both methods to variants of the Levenshtein-distance where further primitive edit operations (transpositions, merges and splits) may be used.
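The idea of letting a distance bound steer a lexicon search can be sketched in a few lines. The sketch below is not the automaton construction described in the abstract: it swaps in the simpler, well-known technique of carrying one row of the edit-distance table down each trie branch and pruning a branch once every entry in the row exceeds the bound. The names `build_trie` and `fuzzy_search`, the nested-dict trie, and the `"$"` end-of-word marker are all assumptions made for illustration.

```python
def build_trie(words):
    """Build a nested-dict trie; the key "$" marks the end of a word.

    Assumption for this sketch: no word contains the character "$".
    """
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def fuzzy_search(trie, query, max_dist):
    """Return {word: distance} for every trie word within max_dist of query."""
    results = {}
    # Row for the empty prefix: turning "" into query[:i] costs i insertions.
    first_row = list(range(len(query) + 1))

    def walk(node, prefix, prev_row):
        for ch, child in node.items():
            if ch == "$":
                continue
            # Extend the edit-distance table by one row for prefix + ch.
            row = [prev_row[0] + 1]
            for i in range(1, len(query) + 1):
                cost = 0 if query[i - 1] == ch else 1
                row.append(min(row[i - 1] + 1,           # insertion
                               prev_row[i] + 1,          # deletion
                               prev_row[i - 1] + cost))  # match or substitution
            word = prefix + ch
            if "$" in child and row[-1] <= max_dist:
                results[word] = row[-1]
            # Prune: if no entry is within the bound, no extension can be.
            if min(row) <= max_dist:
                walk(child, word, row)

    if "$" in trie and first_row[-1] <= max_dist:
        results[""] = first_row[-1]
    walk(trie, "", first_row)
    return results


lexicon = build_trie(["recognize", "recognise", "recognition"])
print(fuzzy_search(lexicon, "regognize", 1))
```

Each trie edge costs time proportional to the length of the query here, whereas a precomputed deterministic automaton would take a single transition per edge; that difference is the kind of saving the abstract is concerned with.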
Bootstrapping Text Recognition from Stop Words
- Procs. ICPR-14
, 1998
Cited by 10 (4 self)
Recognition of arbitrary noisy English text has been difficult because of problems in character segmentation and multi-font symbol classification. Both segmentation and recognition can be easier with more knowledge of the dominant font used in a given text page. This has led to some recent studies that show promising methods for extracting character prototypes from a text image provided that truth is given for part of the image. In this paper we investigate the feasibility of such a strategy without dependence on ground truth. We replace the needed truth by results of direct recognition of some frequently occurring words. The method makes use of the observation that over half of the words in a typical English text passage are contained in a very small lexicon.
Whole-Book Recognition using Mutual-Entropy-Driven Model Adaptation
Cited by 9 (6 self)
We describe an approach to unsupervised high-accuracy recognition of the textual contents of an entire book using fully automatic mutual-entropy-based model adaptation. Given images of all the pages of a book together with approximate models of image formation (e.g. a character-image classifier) and linguistics (e.g. a word-occurrence probability model), we detect evidence for disagreements between the two models by analyzing the mutual entropy between two kinds of probability distributions: (1) the a posteriori probabilities of character classes (the recognition results from image classification alone), and (2) the a posteriori probabilities of word classes (the recognition results from image classification combined with linguistic constraints). The most serious of these disagreements are identified as candidates for automatic corrections to one or the other of the models. We describe a formal information-theoretic framework for detecting model disagreement and for proposing corrections. We illustrate this approach on a small test case selected from real book-image data. This reveals that a sequence of automatic model corrections can drive improvements in both models, and can achieve a lower recognition error rate. The importance of considering the contents of the whole book is motivated by a series of studies, over the last decade, showing that isogeny can be exploited to achieve unsupervised improvements in recognition accuracy.
Stop Word Location and Identification for Adaptive Text Recognition
- J. of Document Analysis and Recognition
, 2000
Cited by 3 (1 self)
We propose a new adaptive strategy for text recognition that attempts to derive knowledge about the dominant font on a given page. The strategy uses a linguistic observation that over half of all words in a typical English passage are contained in a small set of fewer than 150 stop words. A small dictionary of such words is compiled from the Brown corpus. An arbitrary text page first goes through layout analysis that produces word segmentation. A fast procedure is then applied to locate the most likely candidates for those words, using only the widths of the word images. The identity of each word is determined using a word shape classifier. Using the word images together with their identities, character prototypes can be extracted using a previously proposed method. We describe experiments using simulated and real images. In an experiment using 400 real page images, we show that on average, 8 distinct characters can be learned from each page, and the method is successful on 90% of all the pages. These can serve as useful seeds to bootstrap font learning.
A Unified Approach Towards Text Recognition
- Proceedings of SPIE: The International Society for Optical Engineering
, 1996
Cited by 2 (0 self)
In our recent research, we found that visual inter-word relations can be useful for different stages of English text recognition such as character segmentation and postprocessing. Different methods have been designed for different stages. In this paper, we propose a unified approach to use visual contextual information for text recognition. Each word image has a lattice, which is a data structure that keeps the results of segmentation, recognition and visual inter-word relation analysis. A lattice allows ambiguity and uncertainty at different levels. A lattice-based unification algorithm is proposed to analyze information in the lattices of two or more visually related word images, and upgrade their contents. Under the approach, different stages of text recognition can be accomplished by the same set of operations: inter-word relation analysis and lattice-based unification. The segmentation and recognition result of a word image can be propagated to those visually related word images and can ...
A Study of Touching Characters in degraded Gurmukhi Script
- in Int. Conf. on Pattern Recognition and Computer Vision, PRCV 2005
, 2005
Cited by 2 (0 self)
Character segmentation is an important preprocessing step for text recognition. In degraded documents, the existence of touching characters drastically decreases the recognition rate of any optical character recognition (OCR) system. In this paper a study of touching Gurmukhi characters is carried out, and these characters have been divided into various categories after a careful analysis. Structural properties of the Gurmukhi characters are used for defining the categories. New algorithms have been proposed to segment the touching characters in the middle zone. These algorithms have shown a reasonable improvement in segmenting the touching characters in degraded Gurmukhi script. The algorithms proposed in this paper are applicable only to machine-printed text.
Enhancing image-based arabic document translation using a noisy channel correction model
- In Proceedings of MT Summit XI
, 2007
Cited by 1 (1 self)
An image-based document translation system consists of several components, among which OCR (Optical Character Recognition) plays an important role. However, existing OCR software is not robust against environmental variations. Furthermore, OCR errors are often propagated into the translation component, causing poor end-to-end performance. In this paper, we propose an image-based document translation system that uses an error correction model to correct misrecognized words in OCR output. We train our correction model on synthetic data with different fonts and sizes to simulate real-world situations. We further enhance our correction model with bigrams to improve word segmentation error correction. Experimental results show substantial improvements in both word recognition accuracy and translation quality. For instance, in an experiment using Arabic Transparent Font, the BLEU score increases from 18.70 to 33.47 with the use of our noisy channel model.
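As a rough illustration of noisy channel correction in general, the sketch below picks the lexicon word w that maximizes P(w) · P(observed | w). It is not the model from the paper: the per-character confusion table, the equal-length restriction, the fallback probability, and the names `channel_log_prob` and `correct` are all hypothetical, and the toy probabilities are invented for the example. The bigram extension mentioned in the abstract is not modeled.

```python
import math


def channel_log_prob(observed, intended, confusion, p_correct=0.9, p_unseen=1e-6):
    """log P(observed | intended), assuming independent per-character errors.

    Simplification for this sketch: only equal-length pairs are scored, so
    insertions, deletions and segmentation errors are not handled.
    """
    if len(observed) != len(intended):
        return float("-inf")
    total = 0.0
    for seen, true in zip(observed, intended):
        if seen == true:
            total += math.log(p_correct)
        else:
            # confusion[(true, seen)] = P(OCR outputs `seen` when `true` was printed)
            total += math.log(confusion.get((true, seen), p_unseen))
    return total


def correct(observed, unigram, confusion):
    """Return the lexicon word maximizing log P(w) + log P(observed | w)."""
    best_word, best_score = observed, float("-inf")
    for word, prior in unigram.items():
        score = math.log(prior) + channel_log_prob(observed, word, confusion)
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Toy, made-up probabilities purely for illustration.
unigram = {"form": 0.6, "farm": 0.3, "firm": 0.1}
confusion = {("o", "c"): 0.08, ("a", "c"): 0.01}
print(correct("fcrm", unigram, confusion))
```

In a real system the confusion probabilities would be estimated from training data, which the abstract says is generated synthetically across fonts and sizes.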
Cherry Blossom: A System for Japanese Character Recognition
- Symposium on Document Image Understanding Technologies
, 1997
Cited by 1 (0 self)
A general-purpose Japanese character recognition system, Cherry Blossom, has been developed at CEDAR in past years. It is designed to recognize Japanese document images in low resolution or with poor print quality. The system includes modules for page skew correction, document segmentation, text segmentation, character recognition and postprocessing. The API code for each module has been developed so that each module can be tested as a standalone program or can be called in large application systems such as a document indexing and retrieval system. In the character recognition module, two classification methods, the nearest-neighbor classifier and the subspace method, have been integrated in an efficient way. The speed of character recognition is now about six characters per second, up from the original one character per second. New techniques, including dynamic feature selection, incremental nearest prototype search and visual similarity analysis, have been developed to speed up character cla...
Probabilistic Structured Query Methods
- In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
, 2003
Structured methods for query term replacement rely on separate estimates of term frequency and document frequency to compute the weight for each query term. This paper reviews prior work on structured query techniques and introduces three new variants that leverage estimates of replacement probabilities. Statistically significant improvements in retrieval effectiveness are demonstrated for cross-language retrieval and for retrieval based on optical character recognition when replacement probabilities are used to estimate both term frequency and document frequency.

