A Survey of Table Recognition: Models, Observations, Transformations, and Inferences
- International Journal of Document Analysis and Recognition, 2003
Cited by 32 (3 self)
Table characteristics vary widely. Consequently, a great variety of computational approaches have been applied to table recognition. In this survey, the table recognition literature is presented as an interaction of table models, observations, transformations and inferences. A table model defines the physical and logical structure of tables; the model is used to detect tables, and to analyze and decompose the detected tables. Observations perform feature measurements and data lookup, transformations alter or restructure data, and inferences generate and test hypotheses. This presentation clarifies the decisions that are made by a table recognizer, and the assumptions and inferencing techniques that underlie these decisions.
Model-based Analysis of Printed Tables
- In Proceedings of International Conference on Document Analysis and Recognition (ICDAR), 1995
Cited by 19 (0 self)
In this paper we describe a system that can analyze a wide variety of printed table formats. The adaptability of this system is realized by a model of the table's organization. Printed representations of relational data rely on several kinds of visual clues for imparting the table's logical structure to the reader. For example, ruling lines of various widths might indicate a grouping of consecutive items or attributes. A system reading a table must make deductions based on these visual devices before it is able to specify the relational organization of the table. Some of the visual clues used to logically organize the physically structured information in the table are:
A Retargetable Table Reader
- In Proc. Fourth Int’l Conf. Document Analysis and Recognition, 1997
Cited by 13 (1 self)
We describe the architecture of a system for reading machine-printed documents in known predefined tabular-data layout styles. In these tables, textual data are presented in record lines made up of fixed-width fields. Tables often do not rely on line-art (ruled lines) to delimit fields, and in this way differ crucially from fixed forms. Our system performs these steps: copes with multiple tables per page; identifies records within tables; segments records into fields; and recognizes characters within fields, constrained by field-specific contextual knowledge. Obstacles to good performance on tables include small print, tight line-spacing, poor-quality text (such as photocopies), and line-art or background patterns that touch the text. Precise skew-correction and pitch-estimation, and high-performance OCR using neural nets proved crucial in overcoming these obstacles. The most significant technical advances in this work appear to be algorithms for identifying and segmenting records with k...
A Language for Specifying and Comparing Table Recognition Strategies
- 2004
Cited by 7 (3 self)
Table recognition algorithms may be described by models of table location and structure, and decisions made relative to these models. These algorithms are usually defined informally as a sequence of decisions with supporting data observations and transformations. In this investigation, we formalize these algorithms as strategies in an imitation game, where the goal of the game is to match table interpretations from a chosen procedure as closely as possible. The chosen procedure may be a person or persons producing ‘ground truth,’ or an algorithm. To describe table recognition strategies we have defined the Recognition Strategy Language (RSL). RSL is a simple functional language for describing strategies as sequences of abstract decision types whose results are determined by any suitable decision method. RSL defines and maintains interpretation trees, a simple data structure for describing recognition results. For each interpretation in an interpretation tree, we annotate hypothesis histories which capture the creation, revision, and rejection of individual hypotheses, such as the logical type and structure of regions. We present a proof-of-concept using two strategies from the literature. We demonstrate how RSL allows strategies to be specified at the level of decisions rather than algorithms, and we compare results of our strategy implementations using new techniques. In particular, we introduce historical recall and precision metrics. Conventional recall and precision characterize hypotheses accepted after a strategy has finished. Historical recall and precision provide additional information by describing all generated hypotheses, including any rejected in the final result.
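The distinction between conventional and historical metrics in the abstract above can be illustrated with a minimal Python sketch. It assumes, based only on the abstract's wording, that the conventional metrics are computed over the hypotheses still accepted at the end of a strategy and the historical metrics over every hypothesis generated along the way; the paper's exact definitions may differ. The hypothesis labels (`c1`, `x1`, ...) and the function name `recall_precision` are hypothetical.

```python
from typing import Set, Tuple


def recall_precision(hypotheses: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Return (recall, precision) of a hypothesis set against ground truth."""
    hits = len(hypotheses & truth)
    recall = hits / len(truth) if truth else 1.0
    precision = hits / len(hypotheses) if hypotheses else 1.0
    return recall, precision


# Hypothetical cell hypotheses, named only for illustration.
truth = {"c1", "c2", "c3", "c4"}            # ground-truth cells
generated = {"c1", "c2", "c3", "x1", "x2"}  # everything the strategy ever proposed
rejected = {"c3", "x1"}                     # hypotheses withdrawn before the end
accepted = generated - rejected             # what survives in the final result

# Assumed reading: conventional metrics look only at accepted hypotheses,
# historical metrics look at all generated ones, including rejected ones.
conventional = recall_precision(accepted, truth)
historical = recall_precision(generated, truth)
```

In this toy case the historical recall is higher than the conventional recall because a correct hypothesis (`c3`) was generated and then wrongly rejected, which is the kind of information the final result alone does not expose.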
Extracting Tabular Information From Text Files
- EECS Department, Tufts University, 1996
Cited by 6 (0 self)
This paper presents work done in locating and extracting tables and their contents from document images. While most research in the area of table analysis and recognition has focused on analyzing the raster image, our approach builds upon the advances in optical character recognition (OCR) software to preserve the layout of tabular data by means of white space. By using methods to analyze the geometry, syntax, and the semantics of the character data, as well as utilizing some well-known image processing techniques, we are able to 1) isolate embedded tables from documents, and 2) identify table components such as title blocks, table entries, and footer blocks. Furthermore, the table analysis techniques presented in this paper can also be applied when analyzing blocks of text isolated by traditional methods such as connected component analysis [1] or bounding box [2]. 1. Introduction: Tables are a means for presenting structured data in paper documents. They provide an efficient method fo...
A Constraint-based Approach to Table Structure Derivation
- 2003
Cited by 5 (0 self)
This paper presents an approach to deriving an abstract geometric model of a table from a physical representation. The technique developed uses a graph of constraints between cells which must be satisfied in order to determine their relative horizontal and vertical position. The method is evaluated with a test set of tables drawn from US Securities and Exchange Commission (SEC) filings.
Document image analysis: Automated performance evaluation
- In Document Analysis Systems, 1995
Cited by 4 (0 self)
Both users and developers of OCR systems benefit from objective performance evaluation and benchmarking. The need for improved tests has given additional impetus to research on evaluation methodology. We discuss some of the statistical and combinatorial principles underlying error estimation, propose a taxonomy for reference data, and review evaluation paradigms in current use. We provide pointers to recent work on the evaluation of isolated character classification, text reading, layout analysis, interpretation of hand-printed forms, and line-drawing conversion.
Interactive Conversion of Large Web Tables
Cited by 2 (0 self)
Two hundred web tables from ten sites were imported into Excel. The tables were edited as needed, then converted into layout-independent Wang Notation using the recently developed Table Abstraction Tool (TAT). The output generated by TAT consists of XML files to be used for constructing narrow-domain ontologies. On average, each table required 104 seconds for editing. Augmentations like aggregates, footnotes, table titles and notes were also extracted. Every user intervention was logged and audited. The logged interactions were analyzed to determine the relative influence of factors like table size, number of categories (Wang dimension), and various types of augmentations on the processing time. The analysis suggests which aspects of interactive table processing can be automated in the near term, and how much time such automation would save.
Table Detection in Noisy Off-line Handwritten Documents
Cited by 1 (1 self)
Table detection can be a valuable step in the analysis of unstructured documents. Although much work has been conducted in the domain of machine-print including books, scientific papers, etc., little has been done to address the case of handwritten inputs. In this paper, we study table detection in scanned handwritten documents subject to challenging artifacts and noise. First, we separate text components (machine-print, handwriting) from the rest of the page using an SVM classifier. We then employ a correlation-based approach to measure the coherence between adjacent text lines which may be part of the same table, solving the resulting page decomposition problem using dynamic programming. A report of preliminary results from ongoing experiments concludes the paper. Keywords: off-line handwriting; table detection; noisy documents.
Medium-Independent Table Detection
- In SPIE Document Recognition and Retrieval VII, 2000
An important step towards the goal of table understanding is a method for reliable table detection. This paper describes a general solution for detecting tables based on computing an optimal partitioning of a document into some number of tables. A dynamic programming algorithm is given to solve the resulting optimization problem. This high-level framework is independent of any particular table quality measure and independent of the document medium. Moreover, it does not rely on the presence of ruling lines or other table delimiters. We also present table quality measures based on white space correlation and vertical connected component analysis. These measures can be applied equally well to ASCII text and scanned images. We report on some preliminary experiments using this method to detect tables in both ASCII text and scanned images, yielding promising results. We present detailed evaluation of these results using three different criteria which by themselves pose interesting research...
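The partitioning idea in the abstract above can be sketched in Python for the ASCII-text case. This is an illustration under stated assumptions, not the paper's algorithm: the quality measure (Jaccard overlap of space columns between adjacent lines, minus a `threshold` of 0.5) is a hypothetical stand-in for the paper's white space correlation measure, and the names `whitespace_correlation`, `span_quality`, and `find_tables` are invented here. The dynamic program itself only requires some score for a candidate span of lines, which matches the abstract's claim that the framework is independent of the particular quality measure.

```python
from typing import Callable, List, Optional, Set, Tuple


def space_columns(line: str, width: int) -> Set[int]:
    """Column indices holding a space, after padding the line to `width`."""
    return {i for i, ch in enumerate(line.ljust(width)) if ch == " "}


def whitespace_correlation(a: str, b: str) -> float:
    """Jaccard overlap of the space columns of two lines (assumed measure)."""
    width = max(len(a), len(b))
    sa, sb = space_columns(a, width), space_columns(b, width)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def span_quality(lines: List[str], start: int, end: int, threshold: float = 0.5) -> float:
    """Score lines[start:end] as one table; positive when gutters align."""
    return sum(whitespace_correlation(lines[i], lines[i + 1]) - threshold
               for i in range(start, end - 1))


def find_tables(lines: List[str],
                quality: Callable[[List[str], int, int], float] = span_quality
                ) -> List[Tuple[int, int]]:
    """Partition lines into table spans [start, end) maximizing total quality."""
    n = len(lines)
    best = [0.0] * (n + 1)                        # best[j]: top score for lines[:j]
    choice: List[Optional[int]] = [None] * (n + 1)
    for j in range(1, n + 1):
        best[j] = best[j - 1]                     # line j-1 left outside any table
        for i in range(j - 1):                    # candidate table lines[i:j], >= 2 lines
            score = best[i] + quality(lines, i, j)
            if score > best[j]:
                best[j], choice[j] = score, i
    tables, j = [], n
    while j > 0:                                  # walk back through stored choices
        start = choice[j]
        if start is None:
            j -= 1
        else:
            tables.append((start, j))
            j = start
    return tables[::-1]


# Hypothetical sample document: prose around a three-row table.
rows = [("Name", "Qty", "Price"), ("Apple", "3", "1.20"), ("Pear", "12", "0.80")]
lines = (["This is an introductory sentence."]
         + ["{:<8}{:>4}{:>7}".format(*r) for r in rows]
         + ["Closing remarks follow the table."])
tables = find_tables(lines)
```

Because `quality` is passed in as a parameter, a different measure (for example one computed from connected components in a scanned image) could be substituted without touching the partitioning code.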

