Results 1 - 10
of
13
What have Innsbruck and Leipzig in common? Extracting Semantics from Wiki Content
- In ESWC
, 2007
"... Abstract Wikis are established means for the collaborative authoring, versioning and publishing of textual articles. The Wikipedia project, for example, succeeded in creating the by far largest encyclopedia just on the basis of a wiki. Recently, several approaches have been proposed on how to extend ..."
Abstract
-
Cited by 57 (7 self)
- Add to MetaCart
Abstract Wikis are established means for the collaborative authoring, versioning and publishing of textual articles. The Wikipedia project, for example, succeeded in creating the by far largest encyclopedia just on the basis of a wiki. Recently, several approaches have been proposed on how to extend wikis to allow the creation of structured and semantically enriched content. However, the means for creating semantically enriched structured content are already available and are, although unconsciously, even used by Wikipedia authors. In this article, we present a method for revealing this structured content by extracting information from template instances. We suggest ways to efficiently query the vast amount of extracted information (e.g. more than 8 million RDF statements for the English Wikipedia version alone), leading to astonishing query answering possibilities (such as for the title question). We analyze the quality of the extracted content, and propose strategies for quality improvements with just minor modifications of the wiki systems being currently used. 1
Document Structure Analysis Algorithms: A Literature Survey
, 2003
"... Document structure analysis can be regarded as a syntactic analysis problem. The order and containment relations among the physical or logical components of a document page can be described by an ordered tree structure and can be modeled by a tree grammar which describes the page at the component le ..."
Abstract
-
Cited by 35 (0 self)
- Add to MetaCart
Document structure analysis can be regarded as a syntactic analysis problem. The order and containment relations among the physical or logical components of a document page can be described by an ordered tree structure and can be modeled by a tree grammar which describes the page at the component level in terms of regions or blocks. This paper provides a detailed survey of past work on document structure analysis algorithms and summarize the limitations of past approaches. In particular, we survey past work on document physical layout representations and algorithms, document logical structure representations and algorithms, and performance evaluation of document structure analysis algorithms. In the last section, we summarize this work and point out its limitations.
A Survey of Table Recognition: Models, Observations, Transformations, and Inferences
- International Journal of Document Analysis and Recognition
, 2003
"... Table characteristics vary widely. Consequently, a great variety of computational approaches have been applied to table recognition. In this survey, the table recognition literature is presented as an interaction of table models, observations, transformations and inferences. A table model defines ..."
Abstract
-
Cited by 32 (3 self)
- Add to MetaCart
Table characteristics vary widely. Consequently, a great variety of computational approaches have been applied to table recognition. In this survey, the table recognition literature is presented as an interaction of table models, observations, transformations and inferences. A table model defines the physical and logical structure of tables; the model is used to detect tables, and to analyze and decompose the detected tables. Observations perform feature measurements and data lookup, transformations alter or restructure data, and inferences generate and test hypotheses. This presentation clarifies the decisions that are made by a table recognizer, and the assumptions and inferencing techniques that underlie these decisions.
A Language for Specifying and Comparing Table Recognition Strategies
, 2004
"... Table recognition algorithms may be described by models of table location and struc-ture, and decisions made relative to these models. These algorithms are usually defined informally as a sequence of decisions with supporting data observations and transformations. In this investigation, we formalize ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
Table recognition algorithms may be described by models of table location and struc-ture, and decisions made relative to these models. These algorithms are usually defined informally as a sequence of decisions with supporting data observations and transformations. In this investigation, we formalize these algorithms as strategies in an imitation game, where the goal of the game is to match table interpretations from a chosen procedure as closely as possible. The chosen procedure may be a person or persons producing ‘ground truth, ’ or an algorithm. To describe table recognition strategies we have defined the Recognition Strat-egy Language (RSL). RSL is a simple functional language for describing strategies as sequences of abstract decision types whose results are determined by any suit-able decision method. RSL defines and maintains interpretation trees, a simple data structure for describing recognition results. For each interpretation in an interpreta-tion tree, we annotate hypothesis histories which capture the creation, revision, and rejection of individual hypotheses, such as the logical type and structure of regions. We present a proof-of-concept using two strategies from the literature. We demon-strate how RSL allows strategies to be specified at the level of decisions rather than ii algorithms, and we compare results of our strategy implementations using new tech-niques. In particular, we introduce historical recall and precision metrics. Con-ventional recall and precision characterize hypotheses accepted after a strategy has finished. Historical recall and precision provide additional information by describing all generated hypotheses, including any rejected in the final result. iii
pdf2table: A Method to Extract Table Information from PDF Files
"... Abstract. Tables are a common structuring element in many documents, such as PDF files. To reuse such tables, appropriate methods need to be develop, which capture the structure and the content information. We have developed several heuristics which together recognize and decompose tables in PDF fil ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
Abstract. Tables are a common structuring element in many documents, such as PDF files. To reuse such tables, appropriate methods need to be develop, which capture the structure and the content information. We have developed several heuristics which together recognize and decompose tables in PDF files and store the extracted data in a structured data format (XML) for easier reuse. Additionally, we implemented a prototype, which gives the user the ability of making adjustments on the extracted data. Our work shows that purely heuristic-based approaches can achieve good results, especially for lucid tables. 1
Transforming Arbitrary Tables into Logic Frames with TARTAR
, 2005
"... The tremendous success of the World Wide Web is countervailed by efforts needed to search and find relevant information. For tabular structures embedded in HTML documents typical keyword or link-analysis based search fails. The Semantic Web relies on annotating resources such as documents by means o ..."
Abstract
- Add to MetaCart
The tremendous success of the World Wide Web is countervailed by efforts needed to search and find relevant information. For tabular structures embedded in HTML documents typical keyword or link-analysis based search fails. The Semantic Web relies on annotating resources such as documents by means of ontologies and aims to overcome the bottleneck of finding relevant information. Turning the current Web into a Semantic Web requires automatic approaches for annotation since manual approaches will not scale in general. Most efforts have been devoted to automatic generation of ontologies from text, but with quite limited success. However, tabular structures require additional efforts, mainly because understanding of table contents requires a table structures comprehension task and a semantic interpretation task, which exceeds in complexity the linguistic task. The focus of this paper is on automatic transformation and generation of semantic (F-Logic) frames from table-like structures. The presented work consists of a methodology, an accompanying implementation (called TARTAR) and a thorough evaluation. It is based on a grounded cognitive table model which is stepwise instantiated by the methodology. A typical application scenario is the automatic population of ontologies to enable query answering over arbitrary tables (e.g. HTML tables).
White-Box Evaluation of Computer Vision Algorithms through Explicit Decision-Making
"... Abstract. Traditionally computer vision and pattern recognition algorithms are evaluated by measuring differences between final interpretations and ground truth. These black-box evaluations ignore intermediate results, making it difficult to use intermediate results in diagnosing errors and optimiza ..."
Abstract
- Add to MetaCart
Abstract. Traditionally computer vision and pattern recognition algorithms are evaluated by measuring differences between final interpretations and ground truth. These black-box evaluations ignore intermediate results, making it difficult to use intermediate results in diagnosing errors and optimization. We propose “opening the box, ” representing vision algorithms as sequences of decision points where recognition results are selected from a set of alternatives. For this purpose, we present a domain-specific language for pattern recognition tasks, the Recognition Strategy Language (RSL). At run-time, an RSL interpreter records a complete history of decisions made during recognition, as it applies them to a set of interpretations maintained for the algorithm. Decision histories provide a rich new source of information: recognition errors may be traced back to the specific decisions that caused them, and intermediate interpretations may be recovered and displayed. This additional information also permits new evaluation metrics that include false negatives (correct hypotheses that the algorithm generates and later rejects), such as the percentage of ground truth hypotheses generated (historical recall), and the percentage of generated hypotheses that are correct (historical precision). We illustrate the approach through an analysis of cell detection in two published table recognition algorithms.
A Multi-Level Table Evaluation Method For Plain Text Documents
"... Due to their expressive powers, tables are popularly used in documents. The presence of tables in documents creates needs for table-processing algorithms. Over the years, many systems were developed, and their performance, usually in terms of recalls and precisions, were reported. According to curre ..."
Abstract
- Add to MetaCart
Due to their expressive powers, tables are popularly used in documents. The presence of tables in documents creates needs for table-processing algorithms. Over the years, many systems were developed, and their performance, usually in terms of recalls and precisions, were reported. According to current practice, systems are tested using some propriety data. This raises some important questions: how should we interpret the results? how should we compare different systems objectively? how do we know that the high recall and precision rates do not come at the expense of making mistakes in other parts of the table-processing task. In this paper, we would like to address these problems by proposing a multi-level table evaluation method. 1.
Document-Zone Classification using Partial Least Squares and Hybrid Classifiers
"... This paper introduces a novel document-zone classification algorithm. Low level image features are first extracted from document zones and Partial Least Squares is used on pairs of classes to compute discriminating pairwise features. Rather than using the popular oneagainst-all and one-against-one v ..."
Abstract
- Add to MetaCart
This paper introduces a novel document-zone classification algorithm. Low level image features are first extracted from document zones and Partial Least Squares is used on pairs of classes to compute discriminating pairwise features. Rather than using the popular oneagainst-all and one-against-one voting schemes, we introduce a novel hybrid method which combines the benefits of the two schemes. The algorithm is applied on the University of Washington dataset and 97.3 % classification accuracy is obtained. 1.

