A Survey of Table Recognition: Models, Observations, Transformations, and Inferences
- International Journal of Document Analysis and Recognition, 2003
- Cited by 32 (3 self)
Table characteristics vary widely. Consequently, a great variety of computational approaches have been applied to table recognition. In this survey, the table recognition literature is presented as an interaction of table models, observations, transformations and inferences. A table model defines the physical and logical structure of tables; the model is used to detect tables, and to analyze and decompose the detected tables. Observations perform feature measurements and data lookup, transformations alter or restructure data, and inferences generate and test hypotheses. This presentation clarifies the decisions that are made by a table recognizer, and the assumptions and inferencing techniques that underlie these decisions.
Hidden Tree Markov Models for Document Image Classification
- IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003
- Cited by 21 (0 self)
Classification is an important problem in image document processing and is often a preliminary step towards recognition, understanding, and information extraction. In this paper, the problem is formulated in the framework of concept learning and each category corresponds to the set of image documents with similar physical structure. We propose a solution based on two algorithmic ideas. First, we obtain a structured representation of images based on labeled XY-trees (this representation informs the learner about important relationships between image sub-constituents). Second, we propose a probabilistic architecture that extends hidden Markov models for learning probability distributions defined on spaces of labeled trees. Finally, a successful application of this method to the categorization of commercial invoices is presented.
A Hierarchical Representation of Form Documents for Identification and Retrieval
- Cited by 8 (0 self)
In this paper, we present a logical representation for form documents to be used for identification and retrieval. A hierarchical structure is proposed to represent the structure of a form by using lines and the XY-tree approach. The approach is top-down and no domain knowledge such as the preprinted data or filled-in data is used. Geometrical modifications and slight variations are handled by this representation. Logically identical forms are associated with the same or a similar hierarchical structure. Identification and retrieval of similar forms are performed by computing the edit distances between the generated trees.
A Language for Specifying and Comparing Table Recognition Strategies
- 2004
- Cited by 7 (3 self)
Table recognition algorithms may be described by models of table location and structure, and decisions made relative to these models. These algorithms are usually defined informally as a sequence of decisions with supporting data observations and transformations. In this investigation, we formalize these algorithms as strategies in an imitation game, where the goal of the game is to match table interpretations from a chosen procedure as closely as possible. The chosen procedure may be a person or persons producing ‘ground truth,’ or an algorithm. To describe table recognition strategies we have defined the Recognition Strategy Language (RSL). RSL is a simple functional language for describing strategies as sequences of abstract decision types whose results are determined by any suitable decision method. RSL defines and maintains interpretation trees, a simple data structure for describing recognition results. For each interpretation in an interpretation tree, we annotate hypothesis histories which capture the creation, revision, and rejection of individual hypotheses, such as the logical type and structure of regions. We present a proof-of-concept using two strategies from the literature. We demonstrate how RSL allows strategies to be specified at the level of decisions rather than algorithms, and we compare results of our strategy implementations using new techniques. In particular, we introduce historical recall and precision metrics. Conventional recall and precision characterize hypotheses accepted after a strategy has finished. Historical recall and precision provide additional information by describing all generated hypotheses, including any rejected in the final result.
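The difference between the two kinds of metric can be illustrated with a minimal Python sketch. It assumes, as a simplification not taken from the abstract, that hypotheses are hashable items (here hypothetical (row, column) cell positions) and that the historical variants are computed over the union of accepted and rejected hypotheses. The function name and the sample data are illustrative only.

```python
def precision_recall(hypotheses, ground_truth):
    """Return (precision, recall) of a hypothesis set against ground truth."""
    hypotheses, ground_truth = set(hypotheses), set(ground_truth)
    correct = hypotheses & ground_truth
    precision = len(correct) / len(hypotheses) if hypotheses else 1.0
    recall = len(correct) / len(ground_truth) if ground_truth else 1.0
    return precision, recall


# Hypothetical cell hypotheses, written as (row, column) pairs.
ground_truth = {(0, 0), (0, 1), (1, 0), (1, 1)}
accepted = {(0, 0), (0, 1), (2, 2)}   # kept in the final interpretation
rejected = {(1, 0), (3, 3)}           # generated at some point, later rejected

# Conventional metrics look only at what survives to the final result.
conventional = precision_recall(accepted, ground_truth)

# Historical metrics (as assumed here) look at every hypothesis ever generated.
historical = precision_recall(accepted | rejected, ground_truth)

print("conventional (precision, recall):", conventional)
print("historical   (precision, recall):", historical)
```

In this toy data the historical recall is higher than the conventional recall because a correct hypothesis, (1, 0), was generated and then rejected.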
Retrieval by Layout Similarity of Documents Represented with MXY Trees
- Proceedings of 5th IAPR International Workshop on Document Analysis Systems, Princeton (NJ, USA), volume 2423 of Lecture Notes in Computer Science, 2002
- Cited by 5 (2 self)
Document image retrieval can be carried out either by processing the converted text (obtained with OCR) or by measuring the layout similarity of images. We describe a system for document image retrieval based on layout similarity. The layout is described by means of a tree-based representation: the Modified X-Y tree. Each page in the database is represented by a feature vector containing both global features of the page and a vectorial representation of its layout that is derived from the corresponding MXY tree. Occurrences of tree patterns are handled similarly to index terms in Information Retrieval in order to compute the similarity. When retrieving relevant documents, the images in the collection are sorted on the basis of a measure that is the combination of two values describing the similarity of global features and of the occurrences of tree patterns. The system is applied to the retrieval of documents belonging to digital libraries. Tests of the system are made on a data-set of more than 600 pages belonging to a journal of the 19th Century, and on a collection of monographs printed in the same Century and containing more than 600 pages.
Using tree-grammars for training set expansion in page classification
- Seventh International Conference on Document Analysis and Recognition, 2003
- Cited by 3 (0 self)
In this paper we describe a method for the expansion of training sets made of XY trees representing page layout. This approach is appropriate when dealing with page classification based on MXY tree page representations. The basic idea is the use of tree grammars to model the variations in the tree which are caused by segmentation algorithms. A set of general grammatical rules is defined and used to expand the training set. Pages are classified with a k-NN approach where the distance between pages is computed by means of tree-edit distance.
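The classification step can be sketched in Python as follows. This is not the authors' implementation: it assumes ordered, labeled trees encoded as hypothetical (label, children) tuples, unit costs for insertion, deletion and relabeling, and a plain memoized recursion over forests in place of an optimized tree-edit-distance algorithm. The labels "H", "V", "text" and "image" and the class names are illustrative.

```python
from collections import Counter
from functools import lru_cache


def size(forest):
    """Number of nodes in a forest (a tuple of (label, children) trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests."""
    if not f1:
        return size(f2)          # insert everything in f2
    if not f2:
        return size(f1)          # delete everything in f1
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    return min(
        forest_distance(f1[:-1] + c1, f2) + 1,   # delete rightmost root of f1
        forest_distance(f1, f2[:-1] + c2) + 1,   # insert rightmost root of f2
        forest_distance(f1[:-1], f2[:-1])        # match the two rightmost roots
        + forest_distance(c1, c2) + (l1 != l2),
    )


def tree_distance(t1, t2):
    return forest_distance((t1,), (t2,))


def knn_classify(tree, training, k=3):
    """Majority label among the k training trees closest to `tree`."""
    nearest = sorted(training, key=lambda item: tree_distance(tree, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leaf(label):
    return (label, ())


# Hypothetical training set: layout trees with their page classes.
training = [
    (("H", (leaf("text"), leaf("image"), leaf("text"))), "title page"),
    (("H", (leaf("text"), leaf("image"))), "title page"),
    (("V", (leaf("text"), leaf("text"))), "regular page"),
    (("V", (leaf("text"), leaf("text"), leaf("text"))), "regular page"),
]
query = ("H", (leaf("text"), leaf("image"), leaf("image")))
predicted = knn_classify(query, training, k=3)
print(predicted)
```

The memoized recursion is exponential in the worst case and is only suitable for small trees; a production system would more likely use a polynomial-time algorithm such as Zhang-Shasha.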
Page classification for meta-data extraction from digital collections
- Database and Expert Systems Applications, 2001
- Cited by 2 (1 self)
Automatic extraction of meta-data from collections of scanned documents (books and journals) is a useful task in order to increase the accessibility of these digital collections. In order to improve the extraction of meta-data, the classification of the page layout into a set of pre-defined classes can be helpful. In this paper we describe a method for classifying document images on the basis of their physical layout, which is described by means of a hierarchical representation: the Modified X-Y tree. The Modified X-Y tree describes a document by means of a recursive segmentation by alternating horizontal and vertical cuts along either spaces or lines. Each internal node of the tree represents a separator (a space or a line), whereas leaves represent regions in the page or separating lines. The Modified X-Y tree is built starting from a symbolic description of the document, instead of dealing directly with the image. The tree is afterwards encoded into a fixed-size representation that takes into account occurrences of tree-patterns in the tree representing the page. Lastly, this feature vector is fed to an artificial neural network that is trained to classify document images. The system is applied to the classification of documents belonging to Digital Libraries; examples of classes taken into account for a journal are “title page”, “index”, and “regular page”. Some tests of the system are made on a data-set of more than 600 pages belonging to a journal of the 19th Century.
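The recursive segmentation by alternating cuts can be sketched in Python. This is a plain X-Y cut on a small binary grid, not the Modified X-Y tree of the paper: it cuts only along white space (not along lines), works on a bitmap rather than a symbolic description, and labels internal nodes by cut direction. All names and the sample page are hypothetical.

```python
def split_runs(profile):
    """Return (start, end) pairs of maximal runs where `profile` is True."""
    runs, start = [], None
    for i, filled in enumerate(profile):
        if filled and start is None:
            start = i
        elif not filled and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def xy_cut(page, top, bottom, left, right, horizontal=True, stalled=False):
    """Recursively split a region along empty rows, then empty columns, and so on.

    Boxes are (top, bottom, left, right) with exclusive upper bounds.
    `stalled` is True when the previous direction found no cut.
    """
    if horizontal:   # cut along empty rows
        profile = [any(page[r][left:right]) for r in range(top, bottom)]
        boxes = [(top + a, top + b, left, right) for a, b in split_runs(profile)]
    else:            # cut along empty columns
        profile = [any(page[r][c] for r in range(top, bottom))
                   for c in range(left, right)]
        boxes = [(top, bottom, left + a, left + b) for a, b in split_runs(profile)]
    if not boxes:
        return None                      # region contains no ink at all
    if len(boxes) == 1:
        if stalled:                      # no cut in either direction: a leaf
            return ("leaf", boxes[0])
        return xy_cut(page, *boxes[0], not horizontal, True)
    kind = "H" if horizontal else "V"
    return (kind, [xy_cut(page, *box, not horizontal) for box in boxes])


# Hypothetical page: a full-width title above two columns ('#' marks ink).
rows = [
    "######",
    "......",
    "##..##",
    "##..##",
]
page = [[ch == "#" for ch in row] for row in rows]
layout = xy_cut(page, 0, len(page), 0, len(page[0]))
print(layout)
```

For the sample page the sketch produces one horizontal cut separating the title from the body, followed by one vertical cut separating the two columns.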
Shape-Free Statistical Information in Optical Character Recognition
- Cited by 2 (0 self)
The fundamental task facing Optical Character Recognition (OCR) systems involves the conversion of input document images into corresponding sequences of symbolic character codes. Traditionally, this has been accomplished in a bottom-up fashion: the image of each symbol is isolated, then classified based on its pixel intensities. While such shape-based classifiers are initially trained on a wide array of fonts, they still tend to perform poorly when faced with novel glyph shapes. In this thesis, we attempt to bypass this problem by pursuing a top-down “codebreaking” approach. We assume no a priori knowledge of character shape, instead relying on statistical information and language constraints to determine an appropriate character mapping. We introduce and contrast three new top-down approaches, and present experimental results on several real and synthetic datasets. Given sufficient amounts of data, our font and shape independent approaches are shown to perform about as well as shape-based classifiers.
Using Tree-Grammars for Training Set Expansion in Page Classification
In this paper we describe a method for the expansion of training sets made of XY trees representing page layout. This approach is appropriate when dealing with page classification based on MXY tree page representations. The basic idea is the use of tree grammars to model the variations in the tree which are caused by segmentation algorithms. A set of general grammatical rules is defined and used to expand the training set. Pages are classified with a k-NN approach where the distance between pages is computed by means of tree-edit distance.
A Distance-based Technique for non-Manhattan Layout Analysis
Layout analysis is a fundamental step in automatic document processing. Many different techniques have been proposed in the literature to perform this task. These are broadly divided into two main categories according to the approach they follow: top-down methods start by identifying the high-level components of the page structure and then recursively split them until basic blocks are found. On the other hand, bottom-up approaches start with the smallest elements (e.g., the pixels in the case of digitized documents) and then recursively merge them into higher-level components. A first limitation of such methods is that most of them are designed to deal only with digitized documents and hence are not applicable to the general case of native digital documents, which are by now the most widespread. Furthermore, top-down and most bottom-up methods are able to process Manhattan layout documents only. In this work, we propose a general bottom-up strategy to tackle the layout analysis of (possibly) non-Manhattan documents, and two specializations of it to handle the different cases of bitmaps and PS/PDF sources.

