A Theory of Multiple Classifier Systems And Its Application to Visual Word Recognition
, 1992
Cited by 31 (8 self)
Despite the success of many pattern recognition systems in constrained domains, problems that involve noisy input and many classes remain difficult. A promising direction is to use several classifiers simultaneously, such that they can complement each other in correctness. This thesis is concerned with decision combination in a multiple classifier system that is critical to its success. A multiple classifier system consists of a set of classifiers and a decision combination function. It is a preferred solution to a complex recognition problem because it allows simultaneous use of feature descriptors of many types, corresponding measures of similarity, and many classification procedures. It also allows dynamic selection, so that classifiers adapted to inputs of a particular type may be applied only when those inputs are encountered. Decisions by the classifiers are represented as rankings of the class set that are derivable from the results of feature matching. Rank scores contain more ...
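The rank-based decision combination described in this abstract can be illustrated with a Borda count. This is only a minimal sketch: the abstract does not say which combination function the thesis uses, so the Borda count, the function name `borda_combine`, and the toy word classes are assumptions chosen for illustration.

```python
from collections import defaultdict

def borda_combine(rankings):
    """Combine classifier rankings (best class first) with a Borda count.

    Assumption: every classifier ranks the same class set.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            # A class ranked first among n classes earns n - 1 points,
            # the last-ranked class earns 0.
            scores[label] += n - 1 - position
    # Highest total first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda label: (-scores[label], label))

# Three hypothetical classifiers ranking three candidate words.
rankings = [
    ["cat", "cot", "cut"],
    ["cot", "cat", "cut"],
    ["cat", "cut", "cot"],
]
combined = borda_combine(rankings)
```

Here `combined` places "cat" first because it collects the most points across the three rankings, even though one classifier preferred "cot".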
Developing NLP Tools for Genome Informatics: An Information Extraction Perspective
- In Genome Informatics. Universal Academy
, 1998
Cited by 19 (2 self)
Huge quantities of on-line medical texts such as Medline are available, and we would hope to extract as much useful information from these resources as possible, ideally automatically, with the aid of computer technologies. In particular, recent advances in Natural Language Processing (NLP) techniques raise new challenges and opportunities for tackling genome-related on-line text; combining NLP techniques with genome informatics extends beyond the traditional realms of either technology to a variety of emerging applications. In this paper, we explain some of our current efforts in developing various NLP-based tools for tackling genome-related on-line documents for the information extraction task.
An effective algorithm for string correction using generalized edit distances-III. Computational complexity of the algorithm and some applications. Information Sci
Cited by 18 (10 self)
This paper deals with the problem of estimating a transmitted string X_s from the corresponding received string Y, which is a noisy version of X_s. We assume that Y contains any number of substitution, insertion, and deletion errors, and that no two consecutive symbols of X_s were deleted in transmission. We have shown that for channels which cause independent errors, and whose error probabilities exceed those of noisy strings studied in the literature [12], at least 99.5% of the erroneous strings will not contain two consecutive deletion errors. The best estimate X* of X_s is defined as that element of H which minimizes the generalized Levenshtein distance D(X/Y) between X and Y. Using dynamic programming principles, an algorithm is presented which yields X* without computing individually the distances between every word of H and Y. Though this algorithm requires more memory, it can be shown that it is, in general, computationally less complex than all other existing algorithms which perform the same task.
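To make the definition of the best estimate concrete, the sketch below computes a generalized Levenshtein distance with caller-supplied costs and picks the dictionary word that minimizes it. It is the naive baseline, not the paper's algorithm: it computes the distance to every word of H individually (which the paper avoids) and does not enforce the no-consecutive-deletions assumption. The function names and the unit costs are assumptions for illustration.

```python
def generalized_edit_distance(x, y, sub_cost, ins_cost, del_cost):
    """Weighted edit distance between x and y by dynamic programming."""
    rows, cols = len(x) + 1, len(y) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):            # delete every symbol of x
        d[i][0] = d[i - 1][0] + del_cost(x[i - 1])
    for j in range(1, cols):            # insert every symbol of y
        d[0][j] = d[0][j - 1] + ins_cost(y[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]),
                d[i - 1][j] + del_cost(x[i - 1]),
                d[i][j - 1] + ins_cost(y[j - 1]),
            )
    return d[-1][-1]

def best_estimate(dictionary, y, **costs):
    """Return the element of the dictionary closest to the received string y."""
    return min(dictionary, key=lambda x: generalized_edit_distance(x, y, **costs))

# Unit costs are an assumption; real channels would use symbol-dependent costs.
unit_costs = {
    "sub_cost": lambda a, b: 0 if a == b else 1,
    "ins_cost": lambda b: 1,
    "del_cost": lambda a: 1,
}
estimate = best_estimate(["form", "from", "farm"], "fron", **unit_costs)
```

With unit costs, "from" is one substitution away from the received string "fron", while the other two candidates are three edits away, so it is returned as the estimate.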
A Computational Theory of Visual Word Recognition
, 1988
Cited by 14 (6 self)
A computational theory of the visual recognition of words of text is developed. The theory, based on previous studies of how people read, includes three stages: hypothesis generation, hypothesis testing, and global contextual analysis. Hypothesis generation uses gross visual features, such as those that could be extracted from the peripheral presentation of a word, to provide expectations about word identity. Hypothesis testing integrates the information determined by hypothesis generation with more detailed features that are extracted from the word image. Global contextual analysis provides syntactic and semantic information that influences hypothesis testing.
Algorithmic realization of the computational theory also consists of three stages. Hypothesis generation is implemented by extracting simple features from an input word and using those features to find a set of dictionary words with those features in common. Hypothesis testing uses this set of words to drive further selective image analysis that matches the input to one of the members of this set. This is done with a tree of feature tests that can be executed in several different ways to recognize an input word. Global contextual analysis is implemented with a process that uses knowledge of typical word-class transitions to improve the performance of the hypothesis testing stage. This process can be executed in parallel with hypothesis testing.
This methodology is in sharp contrast to conventional machine reading algorithms, which usually segment a word into characters and recognize the individual characters, so that a word decision is arrived at as a composite of character decisions. The algorithm presented here avoids the segmentation stage and does not require an exhaustive analysis of each character, and thus is not a character recognition algorithm.
Statistical projections show the viability of all three stages of the proposed approach. Experiments with images of text show that the methodology performs well in difficult situations, such as touching and overlapping characters.
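The hypothesis generation stage, in which simple features select a set of dictionary words sharing those features, can be sketched as an index lookup. The abstract does not list the features used, and the real system extracts them from word images; the ascender/descender signature below, computed from text, is a stand-in assumption, as are all the names.

```python
from collections import defaultdict

# Hypothetical gross features: which letters rise above or fall below
# the body of the line. A real system would measure these on the image.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")

def shape_signature(word):
    """Map a word to a coarse shape string: A(scender), D(escender), x."""
    return "".join(
        "A" if ch in ASCENDERS else "D" if ch in DESCENDERS else "x"
        for ch in word.lower()
    )

def build_index(dictionary):
    """Group dictionary words by their shape signature."""
    index = defaultdict(list)
    for word in dictionary:
        index[shape_signature(word)].append(word)
    return index

def generate_hypotheses(index, signature):
    """Return the dictionary words that share the observed signature."""
    return index.get(signature, [])

index = build_index(["dog", "dug", "bag", "cat", "cot"])
hypotheses = generate_hypotheses(index, "AxD")
```

The resulting candidate set ("dog", "dug", "bag") is what a later hypothesis testing stage would discriminate between using more detailed features.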
Integrating Knowledge Sources in Devanagari Text Recognition
, 1999
Cited by 9 (0 self)
The reading process has been widely studied, and there is general agreement among researchers that knowledge in different forms and at different levels plays a vital role. The same is the underlying philosophy of the Devanagari document recognition system described in this work. We have identified various relevant knowledge sources, which have been integrated using a blackboard model. Some of the knowledge sources are acquired a priori by an automated training process. The efficacy of each of these knowledge sources depends on the coverage of the sample space, the training algorithm, and the nature of the knowledge source itself. Other knowledge sources are constituted from the knowledge extracted from the text as it is processed. These knowledge sources are transient in nature and are meaningful in the domain of the text under consideration. The initial segmentation of a text zone into text lines is based on the image profile. However, the initial segmentation leaves overlapping text lines unsegmented. The height information of the text lines obtained after initial segmentation is statistically analyzed. The most frequent line height becomes the threshold line height for the text zone under consideration. The threshold line height is used for detecting overlapping text lines. This knowledge also provides a clue to the possible segmentation points for these lines. The structural properties of Devanagari script, namely the header line and the three horizontal strips of a word due to the two-dimensional composition of the script, are exploited by the segmentation process at the word level as well as at the character level.
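The use of the most frequent line height as a threshold can be sketched as follows. The abstract does not say how much taller than the threshold a segment must be to count as overlapping, nor how the split points are derived, so the `tolerance` factor and the rounding rule for estimating the number of merged lines are assumptions.

```python
from collections import Counter

def threshold_line_height(heights):
    """Return the most frequent line height in the text zone (the mode)."""
    return Counter(heights).most_common(1)[0][0]

def find_overlapping(heights, tolerance=1.5):
    """Flag segments much taller than the threshold line height.

    Returns {segment index: estimated number of merged text lines}.
    The tolerance factor is a hypothetical parameter.
    """
    threshold = threshold_line_height(heights)
    return {
        i: round(h / threshold)
        for i, h in enumerate(heights)
        if h > tolerance * threshold
    }

# Pixel heights of segments produced by a hypothetical profile-based pass.
heights = [32, 31, 32, 66, 32, 33, 97]
overlapping = find_overlapping(heights)
```

In this toy input the mode is 32 pixels, so the 66-pixel and 97-pixel segments are flagged as roughly two and three merged lines, which suggests where further segmentation points should be sought.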
A Natural Language Parser With Interleaved Spelling Correction Supporting Lexical Functional Grammar And Ill-Formed Input
, 1994
Cited by 6 (1 self)
I. INTRODUCTION
1.1 Introduction
1.2 Outline of This Thesis
1.3 Summary of the Research
1.4 Organization of the Thesis
II. REVIEW OF RELEVANT RESEARCH
2.1 Syntactic Ambiguity
2.2 The Parser
2.2.1 Bottom Up and Top Down Parsers
2.2.2 Ellipsis
2.2.3 Syntactic Error Recovery
2.2.4 Chart Parser
2....
A Fast Algorithm for Finding the Nearest Neighbor of a Word in a Dictionary
- In Proc. 2nd Int. Conference on Document Analysis and Recognition ICDAR’93
, 1993
Cited by 5 (1 self)
In this paper a new algorithm for string edit distance computation is proposed. It is based on the classical approach [11]. However, while in [11] the two strings to be compared may be given online, our algorithm assumes that one of the two strings to be compared is a dictionary entry that is known a priori. This dictionary word is converted, in an off-line phase to be carried out beforehand, into a special type of deterministic finite state automaton. Now, given an input string corresponding to a word with possible OCR errors and the automaton derived from the dictionary word, the computation of the edit distance between the two strings corresponds to a traversal of the states of the automaton. This procedure needs time which is only linear in the length of the OCR word. It is independent of the length of the dictionary word. Given not only one but N different dictionary words, their corresponding automata can be combined into a single deterministic finite state automaton. Thus the co...
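The idea of preprocessing the dictionary so that many words are matched in one pass can be illustrated with a simpler, well-known substitute: a trie over the dictionary in which words with a common prefix share rows of the edit distance table. This is not the deterministic finite state automaton construction of the paper and does not achieve its linear-time bound; the names and the "$" end-of-word marker are assumptions.

```python
def build_trie(words):
    """Build a nested-dict trie; the key "$" stores the complete word.

    Assumption: "$" never occurs inside a dictionary word.
    """
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root

def nearest_word(trie, query):
    """Return (distance, word) for the dictionary word closest to query."""
    best = (float("inf"), None)

    def walk(node, row):
        nonlocal best
        if "$" in node and row[-1] < best[0]:
            best = (row[-1], node["$"])
        for ch, child in node.items():
            if ch == "$":
                continue
            # Extend the shared table by one row for this trie edge.
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                new_row.append(min(
                    new_row[j - 1] + 1,                 # insertion
                    row[j] + 1,                         # deletion
                    row[j - 1] + (query[j - 1] != ch),  # substitution
                ))
            # Prune branches that cannot beat the best distance so far.
            if min(new_row) < best[0]:
                walk(child, new_row)

    walk(trie, list(range(len(query) + 1)))
    return best

trie = build_trie(["recognition", "recognize", "cognition"])
result = nearest_word(trie, "recogniton")
```

Because "recognition" and "recognize" share the prefix "recogni", the rows for that prefix are computed once, which is the same motivation as merging the per-word automata into one.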
Spelling correction for search engine queries
- In Proceedings of EsTAL-04, España for Natural Language Processing
, 2004
Cited by 5 (3 self)
Search engines have become the primary means of accessing information on the Web. However, recent studies show that misspelled words are very common in queries to these systems. When users misspell a query, the results are incorrect or provide inconclusive information. In this work, we discuss the integration of a spelling correction component into tumba!, our community Web search engine. We present an algorithm that attempts to select the best choice among all possible corrections for a misspelled term, and discuss its implementation based on a ternary search tree data structure.
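A ternary search tree, the data structure named in this abstract, stores one character per node with three children: lower, equal, and higher. The sketch below shows only insertion and exact lookup; the paper's correction algorithm and the way tumba! uses the tree are not reproduced, and the class and attribute names are assumptions.

```python
class Node:
    __slots__ = ("char", "lo", "eq", "hi", "is_word")

    def __init__(self, char):
        self.char = char
        self.lo = self.eq = self.hi = None
        self.is_word = False

class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, word):
        if word:  # empty strings are not stored
            self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.char:
            node.lo = self._insert(node.lo, word, i)
        elif ch > node.char:
            node.hi = self._insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.is_word = True  # last character of the word
        return node

    def __contains__(self, word):
        node, i = self.root, 0
        while node is not None and word:
            ch = word[i]
            if ch < node.char:
                node = node.lo
            elif ch > node.char:
                node = node.hi
            elif i + 1 < len(word):
                node, i = node.eq, i + 1
            else:
                return node.is_word
        return False

tree = TernarySearchTree()
for term in ["search", "sea", "engine"]:
    tree.insert(term)
```

A lookup such as `"sea" in tree` succeeds, while the prefix "se" is rejected because its final node is not marked as a word end. A spelling corrector would extend the traversal to tolerate edits along the way.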
Stochastic Error-Correcting Parsing for OCR Post-processing.
, 2000
Cited by 5 (0 self)
In this paper, stochastic error-correcting parsing is proposed as a powerful and flexible method to post-process the results of an optical character recognizer (OCR). Deterministic and non-deterministic approaches are possible under the proposed setting. The basic units of the model can be words or complete sentences, and the lexicons or the language databases can be simple enumerations or may convey probabilistic information from the application domain.
1 Introduction
The result of automatic optical recognition of printed or handwritten text is often affected by a considerable amount of error and uncertainty, and the application of a correction algorithm is therefore essential. A significant portion of the ability of humans to read a handwritten text is due to their extraordinary error recovery power, thanks to the lexical, syntactic, semantic, pragmatic and discursive language constraints they apply. Among the different levels at which language can be modeled [2], the lowest one...
OCR error correction of an inflectional Indian language using morphological parsing
- Journal of Information Science and Engineering
, 2000
Cited by 5 (0 self)
This paper deals with an OCR (Optical Character Recognition) error detection and correction technique for a highly inflectional Indian language, Bangla, the second-most popular language in India and the fifth-most popular language in the world. The technique is based on morphological parsing: using two separate lexicons of root words and suffixes, the candidate root-suffix pairs of each input string are detected, their grammatical agreement is tested, and the root or suffix part in which the error has occurred is noted. The correction is made to the corresponding erroneous part of the input string by means of a fast dictionary access technique. To do so, the information about the error patterns generated by the OCR system is examined, and some alternative strings are generated for an erroneous word. Among the alternative strings, those satisfying grammatical agreement in root and suffix are finally chosen as suggested words. In the list of suggested words generated by the system, the desired word is available in 84.22% of cases. Keywords: OCR (Optical Character Recognition), error detection, error correction, Indian language, morphological parsing, suffix, inflectional language
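The root-suffix detection and agreement test can be sketched with two small lexicons. The paper's Bangla lexicons, agreement rules, and dictionary access technique are not given in the abstract, so the English toy data, the `agreement` set of permitted pairs, and the function names below are all hypothetical.

```python
def candidate_splits(word, roots, suffixes):
    """Return every (root, suffix) split whose parts are both in the lexicons."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word) + 1)
        if word[:i] in roots and word[i:] in suffixes
    ]

def valid_parses(word, roots, suffixes, agreement):
    """Keep only the splits whose root and suffix agree grammatically."""
    return [
        pair for pair in candidate_splits(word, roots, suffixes)
        if pair in agreement
    ]

# Toy lexicons; the empty suffix lets a bare root count as a word.
roots = {"walk", "talk", "wal"}
suffixes = {"", "s", "ed", "ing", "king"}
# Hypothetical agreement table: which suffixes each root may take.
agreement = {("walk", ""), ("walk", "s"), ("walk", "ed"), ("walk", "ing"),
             ("talk", ""), ("talk", "s"), ("talk", "ed"), ("talk", "ing")}

splits = candidate_splits("walking", roots, suffixes)
parses = valid_parses("walking", roots, suffixes, agreement)
```

Both "wal" + "king" and "walk" + "ing" are found as candidate splits, and the agreement test keeps only the second. A word with no valid parse would be treated as erroneous, and alternative strings would then be generated for the faulty part.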

