Identification of Case, Digits and Special Symbols Using a Context Window
, 2001
Cited by 1 (0 self)
We present strategies and results for identifying the symbol type of every character in a text document. Assuming reasonable word and character segmentation for shape clustering, we designed several type recognition methods that depend on cluster n-grams, characteristics of neighbors, and within-word context. On an ASCII test corpus of 925 articles, these methods represent a substantial improvement over default assignment of all characters to lower case.
Accenting Unknown Words in a Specialized Language
, 2002
We propose two internal methods for accenting unknown words, which both learn on a reference set of accented words the contexts of occurrence of the various accented forms of a given letter. One method is adapted from POS tagging, the other is based on finite state transducers.
Mitsubishi Electric Research Laboratories
- in Proceedings of International Symposium on Non-Photorealistic Animation and Rendering (Annecy
, 2002
In this paper we describe a system to show some limited effects on a static toy-car model and present techniques that can be used in similar setups. Our focus is on creating apparent motion for animation
Diacritics Restoration in Romanian Texts
There are several languages that use diacritical characters outside the ASCII charset. For some of these languages, most diacritical characters can be recovered deterministically, but in general this is not the case. The difficulty of the task differs from language to language, depending on the functional role of the diacritical characters. For Romanian, automatic restoration of diacritics is a real challenge, both because of their frequency and because of their significant contribution to the morpho-lexical and semantic disambiguation of words. In Romanian, every third word might contain at least one diacritical character, and for large texts that lack diacritics, inserting them manually is highly time-consuming, tedious, and error-prone. We present a professional implementation, embedded into the MS Office environment, which builds on our tiered tagging technologies. Keywords: diacritics restoration, part-of-speech tagging, tiered tagging
Word Sense Disambiguation Based on Sense Similarity and Syntactic Context
This is to certify that I have examined this copy of a master’s thesis by
A Case Restoration Approach to Named Entity Tagging in Degraded Documents
This paper describes a novel approach to named entity (NE) tagging on degraded documents. NE tagging is the process of identifying salient text strings in unstructured text, corresponding to names of people, places, organizations, times/dates, etc. Although NE tagging is typically part of a larger information extraction process, it has other applications, such as improving search in an information retrieval system and post-processing the results of an OCR system. We focus on degraded documents, i.e. case-insensitive documents that lack orthographic information. Examples include the output of speech recognition systems, as well as e-mail. The traditional approach involves retraining an NE tagger on degraded text, a cumbersome operation. This paper describes an approach whereby text is first “restored” to its implicit case-sensitive form and subsequently processed by the original NE tagger. Results show that this new approach leads to far less precision loss in NE tagging of degraded documents.
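The "restore case first, then tag" idea in the abstract above can be illustrated with a very small sketch. This is not the paper's method: it is a unigram baseline that assumes a cased reference corpus is available and simply picks the most frequent cased form of each token. The function names `train_truecaser` and `restore_case` are hypothetical.

```python
from collections import Counter, defaultdict

def train_truecaser(cased_tokens):
    """Map each lowercased token to its most frequent cased form (unigram baseline)."""
    counts = defaultdict(Counter)
    for tok in cased_tokens:
        counts[tok.lower()][tok] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}

def restore_case(tokens, model):
    """Replace each token with its learned cased form; unknown tokens pass through."""
    return [model.get(tok.lower(), tok) for tok in tokens]

# Toy reference corpus (illustrative only).
reference = "Paris is nice . I like Paris and IBM .".split()
model = train_truecaser(reference)
restored = restore_case("i like paris and ibm".split(), model)
```

The restored token list could then be passed to an unmodified, case-sensitive NE tagger. A realistic system would use context rather than unigram counts alone.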
Learning on the fly: a font-free approach toward multilingual OCR
- IJDAR, DOI 10.1007/s10032-011-0164-6
Despite ubiquitous claims that optical character recognition (OCR) is a “solved problem,” many categories of documents continue to break modern OCR software, such as documents with moderate degradation or unusual fonts. Many approaches rely on pre-computed or stored character models, but these are vulnerable to cases when the font of a particular document was not part of the training set or when there is so much noise in a document that the font model becomes weak. To address these difficult cases, we present a form of iterative contextual modeling that learns character models directly from the document it is trying to recognize. We use these learned models both to segment the characters and to recognize them in an incremental, iterative process. We present results comparable with those of a commercial OCR system on a subset of characters from a difficult test document in both English and Greek.
Statistical Unicodification of African Languages
, 2010
Many languages in Africa are written using Latin-based scripts, but often with extra diacritics (e.g. dots below in Igbo: ị, ọ, ụ) or modifications to the letters themselves (e.g. open vowels “e” and “o” in Lingala: ɛ, ɔ). While it is possible to render these characters accurately in Unicode, oftentimes keyboard input methods are not easily accessible or are cumbersome to use, and so the vast majority of electronic texts in many African languages are written in plain ASCII. We call the process of converting an ASCII text to its proper Unicode form unicodification. This paper describes an open-source package which performs automatic unicodification, implementing a variant of an algorithm described in previous work of De Pauw, Wagacha, and de Schryver. We have trained models for more than 100 languages using web data, and have evaluated each language using a range of feature sets.
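To make the ASCII-to-Unicode conversion described above concrete, here is a minimal sketch. It is not the package's algorithm: it is a word-level lookup baseline that assumes a properly accented word list is available for training. The names `asciify`, `train`, `unicodify`, and the `EXTRA_ASCII` table are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict

# Modified letters with no Unicode decomposition need an explicit ASCII
# fallback. This table is an illustrative subset, not a complete mapping.
EXTRA_ASCII = {"\u025b": "e", "\u0254": "o", "\u0190": "E", "\u0186": "O"}

def asciify(text):
    """Drop combining marks and map a few modified letters to ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = (ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(EXTRA_ASCII.get(ch, ch) for ch in kept)

def train(words):
    """Map each ASCII form to the most frequent Unicode form seen in training."""
    counts = defaultdict(Counter)
    for word in words:
        word = unicodedata.normalize("NFC", word)
        counts[asciify(word)][word] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}

def unicodify(text, model):
    """Restore whitespace-separated tokens by lookup; unknown tokens pass through."""
    return " ".join(model.get(tok, tok) for tok in text.split())

# Toy training list (illustrative only): the dotted form is more frequent.
model = train(["\u1ecd", "\u1ecd", "o", "na"])
result = unicodify("o na", model)
```

A word-level lookup cannot handle words unseen in training, which is one reason to fall back to character-level context in practice.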

