A Survey of Table Recognition: Models, Observations, Transformations, and Inferences
- International Journal of Document Analysis and Recognition
, 2003
Cited by 32 (3 self)
Table characteristics vary widely. Consequently, a great variety of computational approaches have been applied to table recognition. In this survey, the table recognition literature is presented as an interaction of table models, observations, transformations and inferences. A table model defines the physical and logical structure of tables; the model is used to detect tables, and to analyze and decompose the detected tables. Observations perform feature measurements and data lookup, transformations alter or restructure data, and inferences generate and test hypotheses. This presentation clarifies the decisions that are made by a table recognizer, and the assumptions and inferencing techniques that underlie these decisions.
The Development of a General Framework for Intelligent Document Image Retrieval
- In Document Analysis Systems
, 1996
Cited by 14 (5 self)
Work has recently begun on a joint project between the Universities of Maryland and Oulu on the development of a system for Intelligent Document Image Retrieval (IDIR). The IDIR system will provide close connections with and utilization of document analysis and image processing techniques, advanced computing and networking, and modern approaches to database management. The system design consists of aggressively modularized components to enhance the development of individual parts which are used in the complete solution, including interface specifications, multipurpose feature extraction, an integrated efficient query language, physical retrieval from an object-oriented database, and delivery of retrieved objects. In this paper, we introduce the general framework, feature extraction modules, query capabilities, a graphical query interface, and the application interface. We demonstrate each component of the system and how the query mechanisms can be used to handle both content and struc...
A Language for Specifying and Comparing Table Recognition Strategies
, 2004
Cited by 7 (3 self)
Table recognition algorithms may be described by models of table location and structure, and decisions made relative to these models. These algorithms are usually defined informally as a sequence of decisions with supporting data observations and transformations. In this investigation, we formalize these algorithms as strategies in an imitation game, where the goal of the game is to match table interpretations from a chosen procedure as closely as possible. The chosen procedure may be a person or persons producing ‘ground truth,’ or an algorithm. To describe table recognition strategies we have defined the Recognition Strategy Language (RSL). RSL is a simple functional language for describing strategies as sequences of abstract decision types whose results are determined by any suitable decision method. RSL defines and maintains interpretation trees, a simple data structure for describing recognition results. For each interpretation in an interpretation tree, we annotate hypothesis histories which capture the creation, revision, and rejection of individual hypotheses, such as the logical type and structure of regions. We present a proof-of-concept using two strategies from the literature. We demonstrate how RSL allows strategies to be specified at the level of decisions rather than algorithms, and we compare results of our strategy implementations using new techniques. In particular, we introduce historical recall and precision metrics. Conventional recall and precision characterize hypotheses accepted after a strategy has finished. Historical recall and precision provide additional information by describing all generated hypotheses, including any rejected in the final result.
Integrating Geometrical and Linguistic Analysis for E-Mail Signature Block Parsing
, 1999
Cited by 6 (0 self)
H. Chen, J. Hu and R. W. Sproat. 1. INTRODUCTION. The rapidly increasing use of the Internet in recent years has made e-mail one of the most common forms of business and personal communication. How to manage the large and dynamic collections of e-mail documents for efficient storage and information retrieval, and how to provide conversions between e-mail and other forms of messages (e.g., voice mail and fax) to allow convenient access whenever and wherever the user needs, are some of the most important research areas in multimedia messaging. The content of modern-day e-mail has expanded beyond text to include encoded docum...
Part-of-Speech Tagging for Table of Contents Recognition
Cited by 1 (0 self)
A labeling approach to automatic recognition of tables of contents (TOCs) is described. A prototype is used for electronic consultation of scientific papers in a digital library system named Calliope. The method operates on a roughly structured ASCII file produced by OCR. Labeling is based on part-of-speech (POS) tagging. Tagging is initiated by a primary labeling of text components using some specific dictionaries. Significant tags are then grouped into title and author strings and reduced to canonical forms according to contextual rules. Unlabeled tokens are integrated into one field or another either by applying contextual correction rules or by using a structure model generated from well-detected articles. The prototype performs satisfactorily on different TOC layouts and character recognition qualities. Without manual intervention, a 95.41% rate of correct segmentation and an 81.74% rate of correct field extraction were obtained on 38 journals comprising 2703 articles.
A Tool for Arabic Documents Indexing and Retrieval From a Web Virtual Library
- in: Proceedings of the ACL/EACL 2001 Workshop on Arabic Language Processing: Status and Prospects
, 2001
Cited by 1 (0 self)
This paper presents a method for automatic indexing and retrieval of Arabic documents from a virtual library. The library can be multilingual and can hold documents written in different languages. All the documents are scanned in order to be stored in the library. The indexing method uses the document contents as indexes. The documents are first scanned and then submitted to OCR software, which produces a textual form of the document contents. In a second phase, the textual form serves as input to a module which automatically translates it to HTML (or XML). The different parts of the document contents become hyperlinks to the appropriate scanned document images. The end user can then ask to download a PostScript version of the document. This method was tested on Latin documents, specifically scientific reviews. This paper presents the adaptation of the method to Arabic reviews and other kinds of documents.
E-mail Signature Block Analysis
- In ICPR'98
, 1998
Cited by 1 (1 self)
The signature block is a common structured component found in e-mail messages. Accurate identification and analysis of signature blocks are important in many multimedia messaging and information retrieval applications such as e-mail text-to-speech rendering. It is also a very challenging task, because signature blocks often appear in complex two-dimensional layouts which are guided only by loose conventions. Traditional text analysis methods designed to deal with sequential text cannot handle two-dimensional structures, while the highly unconstrained nature of signature blocks makes the application of two-dimensional grammars very difficult. In this paper we describe an algorithm for signature block analysis which combines two-dimensional structural segmentation with one-dimensional grammatical constraints. The information obtained from both geometrical and linguistic analysis is integrated in the form of weighted finite state transducers (WFST), and the final solution is the optimal interp...
unknown title
The signature block is a common structured component found in e-mail messages. Accurate identification and analysis of signature blocks are important in many multimedia messaging and information retrieval applications such as e-mail text-to-speech rendering. It is also a very challenging task, because signature blocks often appear in complex two-dimensional layouts which are guided only by loose conventions. Traditional text analysis methods designed to deal with sequential text cannot handle two-dimensional structures, while the highly unconstrained nature of signature blocks makes the application of two-dimensional grammars very difficult. In this paper we describe an algorithm for signature block analysis which combines two-dimensional structural segmentation with one-dimensional grammatical constraints. The information obtained from both geometrical and linguistic analysis is integrated in the form of weighted finite state transducers (WFST), and the final solution is the optimal interpretation under both constraints.
Metadata Extraction from Bibliographic Documents for Digital Library
This chapter addresses the problem of automatic metadata extraction within digitized documents by retro-conversion techniques. The focus is on bibliographic documents, as they are by nature a source of such metadata. Because they provide strong structure for a digital library (DL), their automatic recognition is of obvious interest. However, as their origins are very different (references, citations, tables of contents, index cards), a generic methodology is proposed for their structure. Based on a first morphological labeling of the text, it looks for syntactic elements (syntagmas) revealing the nature of the bibliographic field (title, authors, date, publication source, etc.). Depending on the case, the syntax is validated either by a given grammar or by occurrence analysis in the different document elements (i.e., several references in a bibliography, or articles in a table of contents). In the latter case, the bottom-up procedure generates a structure model from the well-recognized elements and applies it to the rest. The modeling requires taking into consideration the inter- and intra-field relationships. The experiments performed on different types of documents confirm the interest of this approach.
The Indexing and Retrieval of . . .
- COMPUTER VISION AND IMAGE UNDERSTANDING
, 1998
The economic feasibility of maintaining large databases of document images has created a tremendous demand for robust ways to access and manipulate the information these images contain. In an attempt to move toward a paperless office, large quantities of printed documents are often scanned and archived as images, without adequate index information. One way to

