Results 1 - 4 of 4
Fast Approximate Search in Large Dictionaries - COMPUTATIONAL LINGUISTICS, 2004
Abstract - Cited by 8 (2 self)
The need to correct garbled strings arises in many areas of natural language processing. If a dictionary is available that covers all possible input tokens, a natural set of candidates for correcting an erroneous input P is the set of all words in the dictionary for which the Levenshtein distance to P does not exceed a given (small) bound k. In this article we describe methods for efficiently selecting such candidate sets. After introducing as a starting point a basic correction method based on the concept of a "universal Levenshtein automaton," we show how two filtering methods known from the field of approximate text search can be used to improve the basic procedure in a significant way. The first method, which uses standard dictionaries plus dictionaries with reversed words, leads to very short correction times for most classes of input strings. Our evaluation results demonstrate that correction times for fixed-distance bounds depend on the expected number of correction candidates, which decreases for longer input words. Similarly, the choice of an optimal filtering method depends on the length of the input words.
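The candidate set described in this abstract (all dictionary words whose Levenshtein distance to P is at most k) can be illustrated with a naive baseline. This is only a sketch of the problem definition: it scans every word with a dynamic-programming distance check and is not the universal Levenshtein automaton or either filtering method from the article. The function names `within_distance` and `candidates` and the toy word list are assumptions made for the example.

```python
def within_distance(a: str, b: str, k: int) -> bool:
    """Return True if the Levenshtein distance between a and b is at most k."""
    # A length difference larger than k already forces more than k edits.
    if abs(len(a) - len(b)) > k:
        return False
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        # Row minima never decrease, so the bound can no longer be met.
        if min(cur) > k:
            return False
        prev = cur
    return prev[-1] <= k


def candidates(pattern: str, dictionary: list[str], k: int) -> list[str]:
    """Naive candidate selection: test every dictionary word against the bound."""
    return [word for word in dictionary if within_distance(pattern, word, k)]


if __name__ == "__main__":
    words = ["child", "hold", "cold", "world", "chord"]  # made-up dictionary
    print(candidates("chold", words, 1))
```

The full scan costs time proportional to the dictionary size for every input, which is the cost the methods in the article are designed to avoid.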
Text Augmentation: Inserting XML tags into natural language text with PPM Models and Viterbi-like search, 2003
Abstract - Cited by 2 (0 self)
This thesis develops work on using Hidden Markov Models to insert tags into natural language text. A taxonomy of tags is developed unifying the fields of text segmentation tagging, part-of-speech tagging, proper noun extraction and hierarchical entity extraction. The search spaces for inserting tags are examined from both a theoretical and experimental point of view across the taxonomy and on four corpora. An analysis of different correctness measures for different types of tag-insertion problems is undertaken and a technique to determine whether tag-insertion errors are the result of a modelling failure or a searching failure is discovered.
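As a rough illustration of the kind of search the title refers to, the sketch below runs standard Viterbi decoding over a two-state hidden Markov model that labels each token as outside or inside a name. It is not the thesis's method: the thesis uses PPM models and a Viterbi-like search, whereas this example uses plain probability tables. The state names, observation classes, and all probabilities are invented for the example, and the function assumes a non-empty observation sequence.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for a non-empty observation list."""
    # best[t][s] = (log-probability of the best path ending in s at time t,
    #               predecessor state on that path)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), None)
             for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            row[s] = (score + math.log(emit_p[s][o]), prev)
        best.append(row)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for row in reversed(best[1:]):
        path.append(row[path[-1]][1])
    return path[::-1]


# Hypothetical model: all names and numbers below are illustrative only.
STATES = ["OUT", "NAME"]
START = {"OUT": 0.6, "NAME": 0.4}
TRANS = {"OUT": {"OUT": 0.7, "NAME": 0.3},
         "NAME": {"OUT": 0.4, "NAME": 0.6}}
EMIT = {"OUT": {"lower": 0.5, "punct": 0.4, "capital": 0.1},
        "NAME": {"lower": 0.1, "punct": 0.3, "capital": 0.6}}

if __name__ == "__main__":
    print(viterbi(["lower", "punct", "capital"], STATES, START, TRANS, EMIT))
```

In a tag-insertion setting, a transition from `OUT` to `NAME` would correspond to emitting an opening tag and the reverse transition to a closing tag.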
The Recognition Of Handwritten Chinese Characters From Paper Records - Records, IEEE TENCON, Digital Signal Processing Applications, 1996
Abstract - Cited by 1 (0 self)
This paper describes a method used for the recognition of handwritten simplified Chinese characters from paper records. The method is based on the use of discrete hidden Markov models. The recognition accuracy achieved for all 3755 common simplified Chinese characters in GB1 is 91.2% for top 1 choice and 98.5% for top 5 choice. The method recognizes isolated characters only and not words or phrases. The test set contained about 35,000 characters. All characters were written in a print style.
1. OVERVIEW
Chinese characters are ideographic in nature with over 3000 characters in common use for simplified Chinese. Chinese characters can be written in a neat print style where rules based on stroke order and number are followed but are generally written in a more cursive style where strokes are joined. The main problems for handwritten simplified Chinese character recognition are the large number of characters used, the complexity of the characters and the character distortion due to non...
Lexicon and hidden Markov model-based optimisation, 2005
Abstract
The Brahmi-descended Sinhala script is used by 75% of the 18 million population in Sri Lanka. To the best of our knowledge, none of the Brahmi-descended scripts used by hundreds of millions of people in South Asia has a commercial OCR product. In the process of implementing an OCR system for the printed Sinhala script which is easily adaptable to similar scripts [Premaratne, L., Assabie, Y., Bigun, J., 2004. Recognition of modification-based scripts using direction tensors. In: 4th Indian Conf. on Computer Vision, Graphics and Image Processing (ICVGIP2004), pp. 587--592], a segmentation-free recognition method using orientation features has been proposed in [Premaratne, H.L., Bigun, J., 2004. A segmentation-free approach to recognise printed Sinhala script using linear symmetry. Pattern Recognition 37, 2081--2089]. Due to the limitations of image analysis techniques, the character-level accuracy of the results directly produced by the proposed character recognition algorithm saturates at 94%. The false rejections from the recognition algorithm are initially identified only as `missing character positions' or `blank characters'. It is necessary to identify suitable substitutes for such `missing character positions' and optimise the accuracy of words to an acceptable level. This paper proposes a novel method that explores the lexicon in association with hidden Markov models to improve the accuracy of the recognised script. The proposed method could easily be extended with minor changes to other modification-based scripts consisting of confusing characters. The word-level accuracy, which was 81.5%, is improved to 88.5% by the proposed optimisation algorithm.
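The lexicon side of this idea can be sketched as follows, assuming the recogniser marks each `missing character position' with a placeholder symbol. The sketch returns every lexicon word that agrees with the recognised characters at the known positions. The abstract does not say how the paper combines the lexicon with hidden Markov models to choose among substitutes, so that part is omitted. The function name `fill_missing`, the `?` placeholder, and the Latin-script toy lexicon are assumptions made for the example.

```python
def fill_missing(word: str, lexicon: list[str], blank: str = "?") -> list[str]:
    """Return lexicon words consistent with a recognised word containing blanks.

    A lexicon entry is consistent if it has the same length and matches every
    recognised (non-blank) character position.
    """
    return [
        entry for entry in lexicon
        if len(entry) == len(word)
        and all(r == blank or r == e for r, e in zip(word, entry))
    ]


if __name__ == "__main__":
    lexicon = ["cat", "cot", "cut", "coat", "dog"]  # made-up lexicon
    print(fill_missing("c?t", lexicon))
```

When several entries match, some further model is needed to pick one; a single match can be substituted directly.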

