ARABIC TEXT (2003)
BibTeX
@MISC{Jacob03arabictext,
author = {Dr. Bruce Jacob and Dr. Douglas W. Oard and Kareem Darwish},
title = {ARABIC TEXT},
year = {2003}
}
Abstract
Since many Arabic documents are available only in print, automating retrieval from collections of scanned Arabic document images using Optical Character Recognition (OCR) is an important problem. This dissertation addresses retrieval of Arabic document images based on OCR, with emphasis on probabilistic methods to improve retrieval effectiveness. Arabic’s rich morphology (word construction) and complex orthography (writing system) present unique challenges for OCR and Information Retrieval (IR) systems. New probabilistic structured query methods that leverage replacement probabilities were developed in this research to improve retrieval effectiveness on OCR-degraded text, and their generality has been shown in cross-language information retrieval. For OCR-degraded text retrieval, the probabilistic structured query methods were applied using the most effective index terms for OCR-degraded text, with replacement probabilities estimated using an OCR degradation model. Overlapping character n-grams and combinations of character n-grams with terms obtained through morphological analysis were found to be the most effective indexing terms for Arabic collections of varying sizes, genres, and degradation levels. For index terms requiring morphological







