Evaluation of model-based retrieval effectiveness with OCR text
- ACM Transactions on Information Systems
, 1996
Cited by 30 (12 self)
We give a comprehensive report on our experiments with retrieval from OCR-generated text using systems based on standard models of retrieval. More specifically, we show that average precision and recall are not affected by OCR errors across systems for several collections. The collections used in these experiments include both actual OCR-generated text and standard information retrieval collections corrupted through the simulation of OCR errors. Both the actual and simulation experiments include full-text and abstract-length documents. We also demonstrate that the ranking and feedback methods associated with these models are generally not robust enough to deal with OCR errors. It is further shown that the OCR errors and garbage strings generated from the mistranslation of graphic objects increase the size of the index by a wide margin. We not only point out problems that can arise from applying OCR text within an information retrieval environment, but also suggest solutions to overcome some of these problems.
Measuring the Effects of Data Corruption on Information Retrieval
- In Proceedings of the SDAIR 96 Conference
, 1996
Cited by 22 (1 self)
A probability model is introduced which helps to describe the effects that data corruption has on information retrieval. This leads us to the definition of a measure which analyses the effects of data corruption on retrieval ranking. The behaviour of this measure is analysed theoretically. We give an explanation of some results found empirically by others. Our main results explain how retrieval ranking is affected by the length of documents and the characteristics of the query features. The longer the documents are, the less the overall ranking is corrupted. If the documents are long, a large variation of recognition characteristics of query features can have a heavy influence on the ranking corruption. We show that some data corruption simulations are of questionable value when measuring effects on information retrieval.
1 Introduction
For several years now there has been a growing interest in information retrieval of objects other than text documents, such as images and audio recordings....
Length Normalization in Degraded Text Collections
- Proceedings of Fifth Annual Symposium on Document Analysis and Information Retrieval
, 1995
Cited by 14 (2 self)
Optical character recognition (OCR) is the most commonly used technique to convert printed material into electronic form. Using OCR, large repositories of machine readable text can be created in a short time. An information retrieval system can then be used to search through large information bases thus created. Many information retrieval systems use sophisticated term weighting functions to improve the effectiveness of a search. Term weighting schemes can be highly sensitive to the errors in the input text, introduced by the OCR process. This study examines the effects of the well-known cosine normalization method in the presence of OCR errors, and proposes a new, more robust, normalization method. Experiments show that the new scheme is less sensitive to OCR errors and facilitates use of more diverse basic weighting schemes. When used in a correct text collection, the new normalization scheme yields significant improvements in retrieval effectiveness over cosine normalization.
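As a rough illustration of why cosine normalization can be sensitive to OCR noise, the sketch below scales raw term frequencies so that each document vector has unit Euclidean length. It covers only plain cosine normalization; the paper's proposed scheme is not described in this abstract and is not reproduced here. The function name, the sample text, and the garbage tokens are all hypothetical.

```python
import math
from collections import Counter


def cosine_normalized_weights(tokens):
    """Scale raw term frequencies so the document vector has unit length."""
    tf = Counter(tokens)
    norm = math.sqrt(sum(count * count for count in tf.values()))
    if norm == 0:
        return {}
    return {term: count / norm for term, count in tf.items()}


# Hypothetical clean document and an OCR-degraded copy of it.
clean = "retrieval of text from scanned text".split()
noisy = clean + ["rn0del", "tbe", "x#q"]  # made-up garbage strings

clean_w = cosine_normalized_weights(clean)
noisy_w = cosine_normalized_weights(noisy)

# Each garbage string adds to the vector length, so the weight of every
# correctly recognized term (here "text") shrinks in the noisy copy.
print(clean_w["text"], noisy_w["text"])
```

Under these assumptions, the same correctly recognized term receives a lower weight in the degraded copy purely because unrelated garbage strings inflate the normalization factor.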
Evaluating text categorization in the presence of OCR errors
- In Proc. IS&T/SPIE 2001 Intl. Symp. on Electronic Imaging Science and Technology
, 2001
Cited by 6 (2 self)
In this paper we describe experiments that investigate the effects of OCR errors on text categorization. In particular, we show that in our environment, OCR errors have no effect on categorization when we use a classifier based on the naive Bayes model. We also observe that dimensionality reduction techniques eliminate a large number of OCR errors and improve categorization results.
Document Image Retrieval Techniques for Chinese
- Proceedings of the Fourth Symposium on Document Image Understanding Technology
, 2001
Cited by 5 (1 self)
In this paper we present experiment results for retrieval from a collection of scanned article clippings from Chinese newspapers. The test collection consists of 8,438 articles from China, Taiwan and Hong Kong in a mix of traditional and simplified Chinese. A commercial OCR system was used to produce errorful text. Exhaustive relevance assessment was performed over the entire collection for 30 Chinese queries by multiple judges. Indexing a combination of unigrams and overlapping bigrams was found to outperform overlapping bigram indexing alone, and byte length normalization was found to outperform cosine normalization. No improvement resulted from the addition of query expansion using blind relevance feedback on the same collection.
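A minimal sketch of what indexing "a combination of unigrams and overlapping bigrams" could look like for unsegmented Chinese text, assuming character-level n-grams. The abstract does not give the paper's actual tokenization, so the function name and the sample string are illustrative assumptions only.

```python
def unigrams_and_bigrams(text):
    """Return character unigrams followed by overlapping character bigrams."""
    chars = [ch for ch in text if not ch.isspace()]
    # Pairing each character with its successor yields overlapping bigrams.
    bigrams = [a + b for a, b in zip(chars, chars[1:])]
    return chars + bigrams


# Hypothetical four-character query ("information retrieval").
terms = unigrams_and_bigrams("信息检索")
print(terms)
```

Overlapping bigrams avoid the need for word segmentation. Keeping the unigrams as well still allows a partial match when an OCR error corrupts one character of a pair.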
Summarizing noisy documents
- In Proceedings of the Symposium on Document Image Understanding Technology
, 2003
Cited by 5 (2 self)
We investigate the problem of summarizing text documents that contain errors as a result of optical character recognition. Each stage in the process is tested, the error effects analyzed, and possible solutions suggested. Our experimental results show that current approaches, which are developed to deal with clean text, suffer significant degradation even with slight increases in the noise level of a document. We conclude by proposing possible ways of improving the performance of noisy document summarization.
Performance Evaluation for Text Processing of Noisy Inputs
- Symposium on Applied Computing, pp 759 - 763, March 13-17, 2005
, 2005
Cited by 5 (1 self)
We investigate the problem of evaluating the performance of text processing algorithms on inputs that contain errors as a result of optical character recognition. A new hierarchical paradigm is proposed based on approximate string matching, allowing each stage in the processing pipeline to be tested, the error effects analyzed, and possible solutions suggested.
Robust document image understanding technologies
- In HDP ’04: Proceedings of the 1st ACM Workshop on Hardcopy Document Processing, pages 9–14, 2004
Cited by 5 (0 self)
No existing document image understanding technology, whether experimental or commercially available, can guarantee high accuracy across the full range of documents of interest to industrial and government agency users. Ideally, users should be able to search, access, examine, and navigate among document images as effectively as they can among encoded data files, using familiar interfaces and tools as fully as possible. We are investigating novel algorithms and software tools at the frontiers of document image analysis, information retrieval, text mining, and visualization that will assist in the full integration of such documents into collections of textual document images as well as “born digital” documents. Our approaches emphasize versatility first: that is, methods which work reliably across the broadest possible range of documents.
Autotag: A tool for creating structured document collections from printed materials
- in Electronic Publishing, Artistic Imaging, and Digital Typography, Proc. of the EP ’98 and RIDT ’98 Conferences
, 1998
Cited by 3 (1 self)
We report on the design and implementation of a system which automates the process of capturing structured documents from the optically recognized form of printed materials. The system is intended to be used to convert printed collections into their corresponding tagged electronic versions with little or no manual intervention. This conversion process has some unique problems associated with it; these are discussed, along with our attempts to solve them. This system also establishes a mapping between the bitmap image and its corresponding ASCII representation that can be used to design flexible image-based interfaces for IR-related applications.
UNLV-ISRI Document Collection for Research in OCR and Information Retrieval
- in Proc. IS&T/SPIE 2000 Intl. Symp. on Electronic Imaging Science and Technology
, 2000
Cited by 2 (2 self)
We report on the UNLV-ISRI document collection history, composition, and characteristics. We further provide a short summary of research projects that were conducted using subsets of this collection. These projects were designed to address the retrieval effectiveness from OCR generated collections. Along with this report, ISRI is making this collection available to researchers for further study on the topic of OCR and Information Retrieval.
Keywords: information retrieval, OCR, document collection, relevancy
1. BACKGROUND
The Information Science Research Institute (ISRI) has been involved with research of the interaction between optical character recognition (OCR) and information retrieval (IR) since 1989. This research has focused on issues associated with the construction of a large document database containing over 20 million pages of scientific, legal, and official memoranda. This collection will be used for online legal discoveries by the Department of Energy (DOE), its contractors,...

