Information retrieval in document image databases
- IEEE Transactions on Knowledge and Data Engineering
Cited by 15 (4 self)
With the rising popularity and importance of document images as an information source, information retrieval in document image databases has become a growing and challenging problem. In this paper, we propose an approach with the capability of matching partial word images to address two issues in document image retrieval: word spotting and similarity measurement between documents. First, each word image is represented by a primitive string. Then, an inexact string matching technique is utilized to measure the similarity between the two primitive strings generated from two word images. Based on the similarity, we can estimate how relevant one word image is to the other and, thereby, decide whether one is a portion of the other. To deal with various character fonts, we use a primitive string which is tolerant to serif and font differences to represent a word image. Using this technique of inexact string matching, our method is able to successfully handle the problem of heavily touching characters. Experimental results on a variety of document image databases confirm the feasibility, validity, and efficiency of our proposed approach in document image retrieval. Index Terms: Document image retrieval, partial word image matching, primitive string, word searching, document similarity measurement.
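The idea of deciding whether one primitive string is a portion of another can be illustrated with approximate substring matching. The following is a minimal sketch, not the paper's method: it assumes primitives are encoded as single characters, uses plain unit-cost edit distance (the paper's actual primitives and scoring are not given in this abstract), and the names `best_partial_match` and `is_portion_of` are hypothetical.

```python
def best_partial_match(pattern: str, text: str) -> tuple[int, int]:
    """Return (edit distance, end index) of the best approximate
    occurrence of `pattern` as a substring of `text`.

    Assumption: each character stands in for one primitive code.
    """
    # prev[i] = cost of aligning pattern[:i] with a substring of text
    # that ends at the previous text position.
    prev = list(range(len(pattern) + 1))
    best_cost, best_end = prev[-1], 0
    for j, tc in enumerate(text, start=1):
        curr = [0]  # a match may start at any text position for free
        for i, pc in enumerate(pattern, start=1):
            curr.append(min(
                prev[i - 1] + (pc != tc),  # match or substitute
                prev[i] + 1,               # extra symbol in text
                curr[i - 1] + 1,           # missing symbol from pattern
            ))
        if curr[-1] < best_cost:
            best_cost, best_end = curr[-1], j
        prev = curr
    return best_cost, best_end


def is_portion_of(pattern: str, text: str, threshold: float = 0.8) -> bool:
    """Hypothetical decision rule: accept when the normalized
    similarity 1 - cost/len(pattern) reaches `threshold`."""
    if not pattern:
        return True
    cost, _ = best_partial_match(pattern, text)
    return 1 - cost / len(pattern) >= threshold
```

Because the first row of the table is all zeros, the pattern may begin anywhere in the longer string, which is what allows a word to be found inside a run of touching characters that was segmented as a single word image.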
Locating And Recognizing Text in . . .
- INFORMATION RETRIEVAL 2, 177--206 (2000)
, 2000
Cited by 8 (0 self)
The explosive growth of the World Wide Web has resulted in a distributed database consisting of hundreds of millions of documents. While existing search engines index a page based on the text that is readily extracted from its HTML encoding, an increasing amount of the information on the Web is embedded in images. This situation presents a new and exciting challenge for the fields of document analysis and information retrieval, as WWW image text is typically rendered in color and at very low spatial resolutions. In this paper, we survey the results of several years of our work in the area. For the problem of locating text in Web images, we describe a procedure based on clustering in color space followed by a connected-components analysis that seems promising. For character recognition, we discuss techniques using polynomial surface fitting and "fuzzy" n-tuple classifiers. Also presented are the results of several experiments that demonstrate where our methods perform well and where more work needs to be done. We conclude with a discussion of topics for further research.
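As a rough illustration of the text-location pipeline described above (color-space grouping followed by connected-components analysis), here is a minimal sketch. It makes strong simplifying assumptions: the cluster centers are supplied by the caller rather than learned by a clustering algorithm, pixels are plain RGB tuples, and connectivity is 4-neighbor. The function names are hypothetical and none of this is taken from the paper.

```python
from collections import deque

Color = tuple[int, int, int]


def nearest_center(pixel: Color, centers: list[Color]) -> int:
    """Index of the center closest to `pixel` (squared RGB distance)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(pixel, centers[k])),
    )


def color_components(image: list[list[Color]], centers: list[Color]):
    """Group pixels by nearest color center, then split each group
    into 4-connected components. Returns (cluster, pixels) pairs."""
    h, w = len(image), len(image[0])
    labels = [[nearest_center(px, centers) for px in row] for row in image]
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if seen[y][x]:
                continue
            cluster = labels[y][x]
            queue, pixels = deque([(y, x)]), []
            seen[y][x] = True
            while queue:  # breadth-first flood fill within one cluster
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx]
                            and labels[ny][nx] == cluster):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append((cluster, pixels))
    return components
```

Each resulting component is a candidate character or background region; a real system would go on to filter components by size and shape before attempting recognition.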
Document Image Retrieval Techniques for Chinese
- Proceedings of the Fourth Symposium on Document Image Understanding Technology
, 2001
Cited by 5 (1 self)
In this paper we present experimental results for retrieval from a collection of scanned article clippings from Chinese newspapers. The test collection consists of 8,438 articles from China, Taiwan and Hong Kong in a mix of traditional and simplified Chinese. A commercial OCR system was used to produce errorful text. Exhaustive relevance assessment was performed over the entire collection for 30 Chinese queries by multiple judges. Indexing a combination of unigrams and overlapping bigrams was found to outperform overlapping bigram indexing alone, and byte length normalization was found to outperform cosine normalization. No improvement resulted from the addition of query expansion using blind relevance feedback on the same collection.
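To make the indexing scheme concrete, the sketch below generates character unigrams together with overlapping character bigrams for a string. It is only an illustration: the function name `index_terms` is hypothetical, whitespace is simply discarded, and the paper's actual tokenization and normalization details are not given in this abstract.

```python
def index_terms(text: str) -> list[str]:
    """Return character unigrams followed by overlapping bigrams.

    Assumption: whitespace carries no meaning and is dropped, so
    bigrams may span what was a space in the input.
    """
    chars = [c for c in text if not c.isspace()]
    bigrams = [a + b for a, b in zip(chars, chars[1:])]
    return chars + bigrams
```

For a four-character string this yields four unigrams and three bigrams. Keeping the unigrams gives a query a chance to match even when an OCR error corrupts one character of every bigram that contains it.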
Robust document image understanding technologies
- In HDP ’04: Proceedings of the 1st ACM Workshop on Hardcopy Document Processing, pages 9–14, 2004
Cited by 5 (0 self)
No existing document image understanding technology, whether experimental or commercially available, can guarantee high accuracy across the full range of documents of interest to industrial and government agency users. Ideally, users should be able to search, access, examine, and navigate among document images as effectively as they can among encoded data files, using familiar interfaces and tools as fully as possible. We are investigating novel algorithms and software tools at the frontiers of document image analysis, information retrieval, text mining, and visualization that will assist in the full integration of such documents into collections of textual document images as well as “born digital” documents. Our approaches emphasize versatility first: that is, methods which work reliably across the broadest possible range of documents.
A survey of retrieval strategies for OCR text collections
- In Proceedings of the Symposium on Document Image Understanding Technologies
, 2003
Cited by 4 (0 self)
The importance of effectively retrieving OCR text has grown significantly in recent years. We provide a brief overview of work done to improve the effectiveness of retrieval of OCR text.
Retrieving imaged documents in digital libraries based on word image coding
- In Proc. of Int’l Workshop on Document Image Analysis for Libraries, DIAL’04
, 2004
Cited by 3 (1 self)
A great number of documents are scanned and archived in the form of digital images in digital libraries, to make them available and accessible on the Internet. Information retrieval in these imaged documents has become a growing and challenging problem. For this purpose, a word image coding technique is proposed in this paper, and a web-based system for efficiently retrieving imaged documents from digital libraries is described. Some image preprocessing is first carried out off-line to extract word objects from imaged documents stored in the digital library. Then each word object is represented by a string of feature codes. As a result, each document image is represented by a series of feature code strings of its words, which are stored in a feature code file. Upon receiving a user’s request, the server converts the query word into a feature code string using the same conversion mechanism as is used in producing feature codes for the underlying imaged documents. Searching is then performed among those feature code files generated off-line. An inexact string matching technique, with the ability to match a portion of a word, is applied to match the query word with the words in the documents, and then the occurrence frequency of the query word in each corresponding document is calculated for relevance ranking. Preliminary experimental results with some imaged documents of students’ theses in the digital library of our university show that the proposed approach is efficient and promising for retrieving imaged documents, with potential applications to digital libraries.
Diagnostic Evaluation of Information Retrieval Models
Cited by 1 (1 self)
Developing effective retrieval models is a long-standing central challenge in information retrieval research. In order to develop more effective models, it is necessary to understand the deficiencies of the current retrieval models and the relative strengths of each of them. In this article, we propose a general methodology to analytically and experimentally diagnose the weaknesses of a retrieval function, which provides guidance on how to further improve its performance. Our methodology is motivated by the empirical observation that good retrieval performance is closely related to the use of various retrieval heuristics. We connect the weaknesses and strengths of a retrieval function with its implementations of these retrieval heuristics, and propose two strategies to check how well a retrieval function implements the desired retrieval heuristics. The first strategy is to formalize heuristics as constraints, and use constraint analysis to analytically check the implementation of retrieval heuristics. The second strategy is to define a set of relevance-preserving perturbations and perform diagnostic tests to empirically evaluate how well a retrieval function implements retrieval heuristics. Experiments show that both strategies are effective in identifying potential problems in implementations of the retrieval heuristics. The performance of retrieval functions can be improved after we fix these problems.
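The second strategy can be pictured with a small experiment. The sketch below is an assumption-laden illustration, not the article's procedure: the abstract does not list the perturbations, so a simple one is assumed here (repeating a document k times, which plausibly leaves its relevance unchanged), and the two toy scoring functions `raw_tf` and `length_normalized_tf` as well as the helper `scaling_profile` are hypothetical.

```python
from typing import Callable, Sequence

Scorer = Callable[[Sequence[str], Sequence[str]], float]


def raw_tf(query: Sequence[str], doc: Sequence[str]) -> float:
    """Toy scorer: total count of query terms in the document."""
    return float(sum(list(doc).count(t) for t in query))


def length_normalized_tf(query: Sequence[str], doc: Sequence[str]) -> float:
    """Toy scorer: query-term count divided by document length."""
    return raw_tf(query, doc) / len(doc) if doc else 0.0


def scaling_profile(score: Scorer, query: Sequence[str],
                    doc: Sequence[str],
                    factors: Sequence[int] = (1, 2, 4)) -> list[float]:
    """Score the document after repeating it k times for each k.

    Assumed perturbation: self-concatenation keeps relevance fixed,
    so any change in score reflects how the function treats length.
    """
    return [score(query, list(doc) * k) for k in factors]
```

With the query `["ocr"]` and the document `["ocr", "text", "retrieval", "ocr"]`, the raw scorer's profile grows with every repetition while the normalized scorer's stays flat. Which behavior is desirable depends on the retrieval heuristic being tested; the profile only exposes how a given function responds.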
A Word Image Coding Technique and its Applications in Information Retrieval from Imaged Documents
With the needs of today's fast-evolving digital libraries, an increasing number of documents are being digitized into an electronic format for easy archival and dissemination. Thus Document Image Retrieval (DIR), as part of the information retrieval (IR) paradigm, has been receiving attention in the IR community in recent years. This paper presents two DIR applications based on a word image coding technique that extracts features from each word image object and represents them as feature code strings for comparison. The first application is a web-based retrieval system that retrieves document images online from digital libraries based on a set of input query words. The second one is a plug-in search tool embedded in Acrobat Reader that performs word spotting within the opened document images and marks the matching words explicitly in the document. Both applications achieve good precision and recall according to our experiments on document images such as students’ theses provided by our university digital library.
Content-free Document Genre Classification Using First Order Random Graphs
- PROCEEDINGS OF SEVENTH ANNUAL CONFERENCE OF THE ADVANCED SCHOOL FOR COMPUTING AND IMAGING, HEIJEN, THE NETHERLANDS
, 2001
We approach the general problem of machine-printed document genre classification using content-free layout structure analysis. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our approach uses attributed relational graphs (ARGs) to represent the layout structure of document instances, and first order random graphs (FORGs) to represent document genres. In this paper we develop our FORG-based genre classification method and present a comparative evaluation between our technique and a variety of statistical pattern classifiers. FORGs are capable of modeling common layout structure within a document genre and are shown to outperform traditional pattern classification techniques when fine-grained genre distinctions must be drawn.
The Indexing and Retrieval of . . .
- COMPUTER VISION AND IMAGE UNDERSTANDING
, 1998
The economic feasibility of maintaining large databases of document images has created a tremendous demand for robust ways to access and manipulate the information these images contain. In an attempt to move toward a paper-less office, large quantities of printed documents are often scanned and archived as images, without adequate index information. One way to

