Using the web for language independent spellchecking and autocorrection
- In EMNLP, 2009
Cited by 6 (0 self)
We have designed, implemented and evaluated an end-to-end spellchecking and autocorrection system that does not require any manually annotated training data. The World Wide Web is used as a large noisy corpus from which we infer knowledge about misspellings and word usage. This is used to build an error model and an n-gram language model. A small secondary set of news texts with artificially inserted misspellings is used to tune confidence classifiers. Because no manual annotation is required, our system can easily be instantiated for new languages. When evaluated on human-typed data with real misspellings in English and German, our web-based systems outperform baselines which use candidate corrections based on hand-curated dictionaries. Our system achieves a 3.8% total error rate in English. We show similar improvements in preliminary results on artificial data for Russian and Arabic.
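One common way to combine an error model with an n-gram language model is noisy-channel scoring: each candidate is ranked by the log-probability of the typo given the candidate plus the log-probability of the candidate in context. The sketch below is only an illustration of that idea, not the system described in the abstract. The counts, the edit-distance penalty standing in for a learned error model, and the names `UNIGRAMS`, `BIGRAMS`, `error_logprob`, `lm_logprob` and `correct` are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy stand-ins for web-derived statistics (all counts are invented).
UNIGRAMS = Counter({"the": 500, "and": 300, "more": 100,
                    "they": 80, "then": 60, "than": 40})
BIGRAMS = Counter({("more", "than"): 30, ("and", "then"): 25,
                   ("more", "the"): 2})
VOCAB = set(UNIGRAMS)


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def error_logprob(typed, candidate, per_edit_penalty=3.0):
    """Assumed error model: a fixed log-penalty per edit operation."""
    return -per_edit_penalty * edit_distance(typed, candidate)


def lm_logprob(prev_word, word):
    """Bigram language model with add-one smoothing."""
    return math.log((BIGRAMS[(prev_word, word)] + 1)
                    / (UNIGRAMS[prev_word] + len(VOCAB)))


def correct(prev_word, typed):
    """Return the vocabulary word with the best combined score."""
    return max(VOCAB, key=lambda c: error_logprob(typed, c)
               + lm_logprob(prev_word, c))
```

With these toy counts, the same typo resolves differently depending on the preceding word: `correct("more", "thn")` picks `than`, while `correct("and", "thn")` picks `then`.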
Improving OCR Accuracy for Classical Critical Editions
Cited by 1 (0 self)
This paper describes a workflow designed to populate a digital library of ancient Greek critical editions with highly accurate OCR-scanned text. While the most recently available OCR engines can now, after suitable training, deal with the polytonic Greek fonts used in 19th and 20th century editions, further improvements can be achieved with postprocessing. In particular, this paper discusses the progressive multiple alignment method applied to different OCR outputs based on the same images.
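The general idea of progressive alignment followed by voting can be sketched as follows: OCR outputs are merged one at a time into a growing set of aligned columns, and each column is then resolved by majority vote. This is a rough illustration only. The abstract does not specify the alignment scoring, so Python's `difflib.SequenceMatcher` is used here as a stand-in, and the names `add_sequence` and `consensus` are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

GAP = ""  # marks "no character here" for one OCR output


def add_sequence(columns, seq):
    """Align seq against the existing columns and return the extended columns."""
    n_prev = len(columns[0]) if columns else 0
    # Summarize each column by its most frequent non-gap character.
    profile = [Counter(c for c in col if c != GAP).most_common(1)[0][0]
               for col in columns]
    matcher = SequenceMatcher(None, profile, list(seq), autojunk=False)
    out = []
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old, new = columns[i1:i2], list(seq[j1:j2])
        paired = min(len(old), len(new))
        for k in range(paired):          # matched or substituted positions
            out.append(old[k] + [new[k]])
        for col in old[paired:]:         # seq has nothing for this column
            out.append(col + [GAP])
        for ch in new[paired:]:          # seq has an extra character
            out.append([GAP] * n_prev + [ch])
    return out


def consensus(outputs):
    """Progressively align all outputs, then take a majority vote per column."""
    columns = []
    for seq in outputs:
        columns = add_sequence(columns, seq)
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)
```

For example, `consensus(["the quick brovvn fox", "the qu1ck brown fox", "the quick brown f0x"])` yields `the quick brown fox`, because each individual error is outvoted by the other two outputs. A production system would more likely use character-confusion costs tuned to the OCR engines rather than generic sequence matching.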
Parallel OCR Error Correction
process that recognizes alphanumeric characters on printed pages and converts them to a machine-readable text file. OCR makes it possible to digitize books and other printed materials. Our current research work is investigating ways to speed up implementation of digital libraries. Multicore systems are now becoming common on desktops, servers and even laptops. In order to take full advantage of such multicore systems, current research is looking at ways to make parallel programming mainstream. One such effort is the Intel Cilk Plus extensions to C and C++ from Intel Corporation, which offer a quick, easy and reliable way to improve the performance of programs on multicore processors. In this paper, we present the results from our work using the Cilk Plus extensions to parallelize OCR error correction. Keywords: OCR, digital library.

