Results 1 - 10
of
17
A Survey of Table Recognition: Models, Observations, Transformations, and Inferences
- International Journal of Document Analysis and Recognition
, 2003
"... Table characteristics vary widely. Consequently, a great variety of computational approaches have been applied to table recognition. In this survey, the table recognition literature is presented as an interaction of table models, observations, transformations and inferences. A table model defines ..."
Abstract
-
Cited by 32 (3 self)
- Add to MetaCart
Table characteristics vary widely. Consequently, a great variety of computational approaches have been applied to table recognition. In this survey, the table recognition literature is presented as an interaction of table models, observations, transformations and inferences. A table model defines the physical and logical structure of tables; the model is used to detect tables, and to analyze and decompose the detected tables. Observations perform feature measurements and data lookup, transformations alter or restructure data, and inferences generate and test hypotheses. This presentation clarifies the decisions that are made by a table recognizer, and the assumptions and inferencing techniques that underlie these decisions.
Tableseer: Automatic table metadata extraction and searching in digital libraries
- In Technical Report
, 2007
"... Tables are ubiquitous in digital libraries. In scientific documents, tables are widely used to present experimental results or statistical data in a condensed fashion. However, current search engines do not support table search. The difficulty of automatic extracting tables from un-tagged documents, ..."
Abstract
-
Cited by 11 (5 self)
- Add to MetaCart
Tables are ubiquitous in digital libraries. In scientific documents, tables are widely used to present experimental results or statistical data in a condensed fashion. However, current search engines do not support table search. The difficulty of automatic extracting tables from un-tagged documents, the lack of a universal table metadata specification, and the limitation of the existing ranking schemes make table search problem challenging. In this paper, we describe TableSeer, a search engine for tables. TableSeer crawls digital libraries, detects tables from documents, extracts tables metadata, indexes and ranks tables, and provides a userfriendly search interface. We propose an extensive set of medium-independent metadata for tables that scientists and other users can adopt for representing table information. In addition, we devise a novel page box-cutting method to improve the performance of the table detection. Given a query, TableSeer ranks the matched tables using an innovative ranking algorithm – TableRank. TableRank rates each <query, table> pair with a tailored vector space model and a specific term weighting scheme. Overall, T ableSeer eliminates the burden of manually extract table data from digital libraries and enables users to automatically examine tables. We demonstrate the value of TableSeer with empirical studies on scientific documents.
A Language for Specifying and Comparing Table Recognition Strategies
, 2004
"... Table recognition algorithms may be described by models of table location and struc-ture, and decisions made relative to these models. These algorithms are usually defined informally as a sequence of decisions with supporting data observations and transformations. In this investigation, we formalize ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
Table recognition algorithms may be described by models of table location and struc-ture, and decisions made relative to these models. These algorithms are usually defined informally as a sequence of decisions with supporting data observations and transformations. In this investigation, we formalize these algorithms as strategies in an imitation game, where the goal of the game is to match table interpretations from a chosen procedure as closely as possible. The chosen procedure may be a person or persons producing ‘ground truth, ’ or an algorithm. To describe table recognition strategies we have defined the Recognition Strat-egy Language (RSL). RSL is a simple functional language for describing strategies as sequences of abstract decision types whose results are determined by any suit-able decision method. RSL defines and maintains interpretation trees, a simple data structure for describing recognition results. For each interpretation in an interpreta-tion tree, we annotate hypothesis histories which capture the creation, revision, and rejection of individual hypotheses, such as the logical type and structure of regions. We present a proof-of-concept using two strategies from the literature. We demon-strate how RSL allows strategies to be specified at the level of decisions rather than ii algorithms, and we compare results of our strategy implementations using new tech-niques. In particular, we introduce historical recall and precision metrics. Con-ventional recall and precision characterize hypotheses accepted after a strategy has finished. Historical recall and precision provide additional information by describing all generated hypotheses, including any rejected in the final result. iii
Automatic extraction of table metadata from digital documents
- In JCDL
, 2006
"... Tables are used to present, list, summarize, and structure important data in documents. In scholarly articles, they are often used to present the relationships among data and highlight a collection of results obtained from experiments and scientific analysis. In digital libraries, extracting this da ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
Tables are used to present, list, summarize, and structure important data in documents. In scholarly articles, they are often used to present the relationships among data and highlight a collection of results obtained from experiments and scientific analysis. In digital libraries, extracting this data automatically and understanding the structure and content of tables are very important to many applications. Automatic identification extraction, and search for the contents of tables can be made more precise with the help of metadata. In this paper, we propose a set of mediumindependent table metadata to facilitate the table indexing, searching, and exchanging. To extract the contents of tables and their metadata, an automatic table metadata extraction algorithm is designed and tested on PDF documents.
From legacy documents to xml: A conversion framework
- In Proc. European Conf. Digital Libraries
, 2005
"... Abstract. We present an integrated framework for the document conversion from legacy formats to XML format. We describe the LegDoC project, aimed at automating the conversion of layout annotations layout-oriented formats like PDF, PS and HTML to semantic-oriented annotations. A toolkit of different ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
Abstract. We present an integrated framework for the document conversion from legacy formats to XML format. We describe the LegDoC project, aimed at automating the conversion of layout annotations layout-oriented formats like PDF, PS and HTML to semantic-oriented annotations. A toolkit of different components covers complementary techniques the logical document analysis and semantic annotations with the methods of machine learning. We use a real case conversion project as a driving example to exemplify different techniques implemented in the project. 1
Abstract Using Visual Features for Fine-Grained Genre Classification of Web Pages
"... The field of automatic genre classification has primarily focused on extracting textual features from documents. The goal of this research is to investigate whether visual features of HTML web pages can improve the classification of fine-grained genres. Intuitively it seems that this should be helpf ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
The field of automatic genre classification has primarily focused on extracting textual features from documents. The goal of this research is to investigate whether visual features of HTML web pages can improve the classification of fine-grained genres. Intuitively it seems that this should be helpful and the challenge is to extract those visual features that capture the layout characteristics of the genres. A corpus of Web pages from different e-commerce sites was generated and manually classified into several genres. Three different sets of features were compared- one with just textual features, one with HTML level features added, and a third with visual features added. Our experiments confirm that using HTML features and particularly URL address features can improve classification beyond using textual features alone. We also show that adding visual features can be useful for further improving fine-grained genre classification.
Online medical journal article layout analysis
- Proc. SPIE 6500
, 2007
"... We describe a physical and logical layout analysis algorithm, which is applied to segment and label online medical journal articles (regular HTML and PDF-Converted-HTML files). For these articles, the geometric layout of the Web page is the most important cue for physical layout analysis. The key to ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
We describe a physical and logical layout analysis algorithm, which is applied to segment and label online medical journal articles (regular HTML and PDF-Converted-HTML files). For these articles, the geometric layout of the Web page is the most important cue for physical layout analysis. The key to physical layout analysis is then to render the HTML file in a Web browser, so that the visual information in zones (composed of one or a set of HTML DOM nodes), especially their relative position, can be utilized. The recursive X-Y cut algorithm is adopted to construct a hierarchical zone tree structure. In logical layout analysis, both geometric and linguistic features are used. The HTML documents are modeled by a Hidden Markov Model with 16 states, and the Viterbi algorithm is then used to find the optimal label sequence, concluding the logical layout analysis.
Page classification for meta-data extraction from digital collections
- In Database and Expert Systems Applications
, 2001
"... Abstract. Automatic extraction of meta-data from collections of scanned documents (books and journals) is a useful task in order to increase the accessibility of these digital collections. In order to improve the extraction of meta-data, the classification of the page layout into a set of pre-define ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Abstract. Automatic extraction of meta-data from collections of scanned documents (books and journals) is a useful task in order to increase the accessibility of these digital collections. In order to improve the extraction of meta-data, the classification of the page layout into a set of pre-defined classes can be helpful. In this paper we describe a method for classifying document images on the basis of their physical layout, that is described by means of a hierarchicalrepresentation: the Modified X-Y tree. The Modified X-Y tree describes a document by means of a recursive segmentation by alternating horizontaland verticalcuts along either spaces or lines. Each internal node of the tree represents a separator (a space or a line), whereas leaves represent regions in the page or separating lines. The Modified X-Y tree is built starting from a symbolic description of the document, instead of dealing directly with the image. The tree is afterwards encoded into a fixed-size representation that takes into account occurrences of tree-patterns in the tree representing the page. Lastly, this feature vector is fed to an artificialneuralnetwork that is trained to classify document images. The system is applied to the classification of documents belonging to Digital Libraries, examples of classes taken into account for a journal are “title page”, “index”, “regular page”. Some tests of the system are made on a data-set of more than 600 pages belonging to a journal of the 19th Century. 1
A fast preprocessing method for table boundary detection: Narrowing down the sparse lines using solely coordinate information
- In DAS
, 2008
"... As the rapid growth of PDF document in digital libraries, recognizing the document structure and detecting specific document components are useful for document storage, classification and retrieval. Tables, as a specific document component, are ubiquitous everywhere. Accurately detecting the table b ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
As the rapid growth of PDF document in digital libraries, recognizing the document structure and detecting specific document components are useful for document storage, classification and retrieval. Tables, as a specific document component, are ubiquitous everywhere. Accurately detecting the table boundary plays a crucial role for the later table structure decomposition and table data collection. In this paper, we propose an easy but effective table boundary detection method. Our method has two unique advantages comparing with other works in this field: 1) Because most tables are text-based, we claim that the text object of PDF provides enough information for table detection. In addition, we believe that the font information is not so reliable as other work stated. 2) Based on the nature of the table cells, we notice that almost all the table rows are sparse lines. By filtering out the non-sparse lines initially, the table boundary detection problem can be simplified into the sparse line analysis problem easily. The experimental results not only confirm the importance of the coordinate information, but also demonstrate the effectiveness of sparse lines in the table boundary detection. Combining with other keywords, our method is even applicable to detect other document components (e.g., mathematical formula or the references). 1
Correlating the Visual Representation of User Interfaces with their Internal Structures and Metadata
"... Pixel-based methods emerge to be a new and promising way for developing new interaction techniques on top of existing user interfaces. However, in order to maintain platform independence, other available low-level information about GUI widgets, such as accessibility metadata, was neglected intention ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Pixel-based methods emerge to be a new and promising way for developing new interaction techniques on top of existing user interfaces. However, in order to maintain platform independence, other available low-level information about GUI widgets, such as accessibility metadata, was neglected intentionally. In this paper, we present a hybrid framework, PAX, which correlates the visual representation of user interfaces (i.e. the pixels) and their internal hierarchical metadata (i.e. the content, role, and value). We identify challenges to build such a framework. We also develop and evaluate two new algorithms for detecting text at arbitrary places on the screen, and for segmenting a text image into individual word blobs. Finally, we validate our framework in implementations of three applications. We enhance an existing pixel-based system, Sikuli Script, and preserve the readability of its script code at the same time. Further, we create two novel applications, Screen Search and Screen Copy, to demonstrate how PAX can be applied to developing desktop-level interactive systems. ACM Classification: H5.2. Information interfaces and presentation: User Interfaces.

