Results 1 - 5 of 5
A Generative Probabilistic OCR Model for NLP Applications
- In Proceedings of the Human Language Technology Conference (HLTNAACL
Cited by 11 (1 self)
In this paper, we introduce a generative probabilistic optical character recognition (OCR) model that describes an end-to-end process in the noisy channel framework, progressing from generation of true text through its transformation into the noisy output of an OCR system. The model is designed for use in error correction, with a focus on post-processing the output of black-box OCR systems in order to make it more useful for NLP tasks. We present an implementation of the model based on finite-state models, demonstrate the model’s ability to significantly reduce character and word error rate, and provide evaluation results involving automatic extraction of translation lexicons from printed text.
Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition
Arabic handwriting recognition (HR) is a challenging problem due to Arabic’s connected letter forms, consonantal diacritics and rich morphology. In this paper we isolate the task of identification of erroneous words in HR from the task of producing corrections for these words. We consider a variety of linguistic (morphological and syntactic) and non-linguistic features to automatically identify these errors. Our best approach achieves an absolute increase of roughly 15% in F-score over a simple but reasonable baseline. A detailed error analysis shows that linguistic features, such as lemma (i.e., citation form) models, help improve HR-error detection precisely where we expect them to: semantically incoherent error words.
Phrase Based Direct Model for Improving Handwriting Recognition Accuracies
We propose a method for increasing word recognition accuracies by correcting the output of a handwriting recognition system. We treat the handwriting recognizer as a black box, such that there is no access to its internals. This enables us to keep our algorithm general and independent of any particular system. We use a novel method for correcting the output based on a direct “phrase-based” system, in contrast to traditional source-channel models. We report the accuracies of an in-house handwritten word recognizer before and after the correction. We achieve highly encouraging results for a large dataset.
Acknowledgments
, 2008
I thank the Almighty for providing me with this opportunity to serve Him and make a contribution through His infinite wisdom. I thank my parents for their perseverance and unconditional support, without which I could never have accomplished this endeavor. I would also like to thank other members of my family, including my cousin Muneer, who has been watching my back from day one. I want to extend my deep appreciation to Dr. Venu Govindaraju, the chair of my dissertation committee. He has been an advisor and a mentor. His persistent guidance, omnipresent motivation and overall support have been the foundation of this thesis. He introduced me to the area of handwriting recognition and encouraged me to address the open challenge of retrieval from handwritten documents. I want to show my gratitude to Dr. Peter Scott, member of my dissertation committee. His course Computer Vision and Image Processing indeed laid a solid foundation for this research. His guidance and advice have always been helpful. In addition, I had the opportunity to be his Teaching Assistant for three semesters, and his passion for teaching was a great motivation.
Performing Information Extraction to Improve OCR Error Detection in Semi-structured Historical Documents
Optical character recognition (OCR) produces transcriptions of document images. These transcriptions often contain incorrectly recognized characters which we must avoid or correct downstream. An ability to both identify OCR errors and extract information from OCR output would allow us to extract and index only correct information and to post-process specific parts of the OCR output with targeted resources (e.g., re-OCR using specialized dictionaries). We present a general approach to OCR error detection that uses a hidden Markov model trained to simultaneously detect OCR errors and extract information. We evaluate this approach in two information extraction settings and on semi-structured text from two machine-printed family history documents. We show this joint approach to OCR error detection to be an improvement over two alternative approaches, one based on dictionary matching and the other using a hidden Markov model trained only to detect OCR errors. In particular, we report an average 8% increase in macro-averaged F-measure between the dictionary approach and our best HMM. Our contribution is to show how an OCR error detection approach based on a word model can be improved by joining this task with an information extraction task, and that an improvement in OCR error detection is achieved regardless of the information extraction task.

