Results 1 - 10 of 27
What have Innsbruck and Leipzig in common? Extracting Semantics from Wiki Content
- In ESWC, 2007
Cited by 57 (7 self)
Wikis are established means for the collaborative authoring, versioning and publishing of textual articles. The Wikipedia project, for example, succeeded in creating by far the largest encyclopedia just on the basis of a wiki. Recently, several approaches have been proposed on how to extend wikis to allow the creation of structured and semantically enriched content. However, the means for creating semantically enriched structured content are already available and are, although unconsciously, even used by Wikipedia authors. In this article, we present a method for revealing this structured content by extracting information from template instances. We suggest ways to efficiently query the vast amount of extracted information (e.g. more than 8 million RDF statements for the English Wikipedia version alone), leading to astonishing query answering possibilities (such as for the title question). We analyze the quality of the extracted content, and propose strategies for quality improvements with just minor modifications of the wiki systems being currently used.
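As a rough illustration of extracting statements from a template instance, the sketch below turns the `| key = value` pairs of one wiki template into subject-predicate-object triples. It is not the authors' extraction method: the function name `template_to_triples`, the sample template, and all sample values are hypothetical, and the naive split on `|` would break on nested templates or piped links.

```python
import re

def template_to_triples(subject, wikitext):
    """Turn the `| key = value` pairs of one template instance into triples.

    Simplifying assumption: the template holds no nested templates or
    piped links, so splitting on "|" is safe.
    """
    match = re.search(r"\{\{(.*)\}\}", wikitext, re.DOTALL)
    if match is None:
        return []
    parts = [p.strip() for p in match.group(1).split("|")]
    triples = []
    for part in parts[1:]:            # parts[0] is the template name
        if "=" not in part:
            continue                  # skip positional parameters
        key, value = part.split("=", 1)
        key, value = key.strip(), value.strip()
        if key and value:
            triples.append((subject, key, value))
    return triples

# Made-up sample input, not real Wikipedia content.
SAMPLE = """{{Infobox city
| name = Example Town
| country = Exampleland
| population = 100000
}}"""

triples = template_to_triples("Example_Town", SAMPLE)
for triple in triples:
    print(triple)
```

Each resulting tuple corresponds loosely to one RDF statement; a real pipeline would also need to map keys to stable predicate identifiers and to type the values.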
WebTables: Exploring the Power of Tables on the Web
- 2008
Cited by 39 (4 self)
The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google’s general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each relational table has its own “schema” of labeled and typed columns, each such table can be considered a small structured database. The resulting corpus of databases is larger than any other corpus we are aware of, by at least five orders of magnitude. We describe the WebTables system to explore two fundamental questions about this collection of databases. First, what are effective techniques for searching for structured data at search-engine scales? Second, what additional power
Information extraction
- FnT Databases
Cited by 24 (2 self)
The automatic extraction of information from unstructured sources has opened up new avenues for querying, organizing, and analyzing data by drawing upon the clean semantics of structured databases and the abundance of unstructured data. The field of information extraction has its genesis in the natural language processing community, where the primary impetus came from competitions centered around the recognition of named entities, such as person names and organizations, in news articles. As society became more data-oriented, with easy online access to both structured and unstructured data, new applications of structure extraction emerged. Now, there is interest in converting our personal desktops to structured databases, the knowledge in scientific publications to structured records, and harnessing the Internet for structured fact-finding queries. Consequently, there are many different communities of researchers bringing in techniques from machine learning, databases, information retrieval, and computational linguistics for various aspects of the information extraction problem. This review is a survey of information extraction research of over two decades from these diverse communities. We create a taxonomy of the field along various dimensions derived from the nature of the extraction task, the techniques used for extraction, the variety of input resources exploited, and the type of output produced. We elaborate on rule-based and statistical methods for entity and relationship extraction. In each case we highlight the different kinds of models for capturing the diversity of clues driving the recognition process and the algorithms for training and efficiently deploying the models. We survey techniques for optimizing the various steps in an information extraction pipeline, adapting to dynamic data, integrating with existing entities and handling uncertainty in the extraction process.
Using Visual Cues for Extraction of Tabular Data from Arbitrary HTML Documents
- In Proc. of the 14th Int’l Conf. on World Wide Web, 2005
Cited by 16 (3 self)
We describe a method to extract tabular data from web pages. Rather than just analyzing the DOM tree, we also exploit visual cues in the rendered version of the document to extract data from tables which are not explicitly marked with an HTML table element. To detect tables, we rely on a variant of the well-known X-Y cut algorithm as used in the OCR community. We implemented the system by directly accessing Mozilla's box model that contains the positional data for all HTML elements of a given web page.
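For readers unfamiliar with the X-Y cut idea, here is a minimal sketch of the classic recursive version applied to rectangular boxes. It is not the variant used in the paper and does not touch Mozilla's box model: the box format `(x0, y0, x1, y1)`, the function names, and the `min_gap` threshold are all assumptions made for illustration.

```python
def find_gap(boxes, axis, min_gap):
    """Return a cut position inside the widest empty gap along `axis`, or None.

    axis 0 projects boxes onto the x axis, axis 1 onto the y axis.
    """
    lo, hi = axis, axis + 2                      # start/end coordinate indexes
    intervals = sorted((b[lo], b[hi]) for b in boxes)
    best_width, best_cut = 0, None
    reach = intervals[0][1]                      # furthest end seen so far
    for start, end in intervals[1:]:
        gap = start - reach
        if gap >= min_gap and gap > best_width:
            best_width, best_cut = gap, (reach + start) / 2
        reach = max(reach, end)
    return best_cut

def xy_cut(boxes, min_gap=10):
    """Recursively split boxes at empty gaps; return the leaf groups."""
    if len(boxes) <= 1:
        return [boxes]
    for axis in (1, 0):                          # horizontal cut first, then vertical
        cut = find_gap(boxes, axis, min_gap)
        if cut is not None:
            first = [b for b in boxes if b[axis + 2] <= cut]
            second = [b for b in boxes if b[axis] >= cut]
            return xy_cut(first, min_gap) + xy_cut(second, min_gap)
    return [boxes]                               # no gap wide enough: one block

# Four boxes laid out as a 2 x 2 grid (hypothetical coordinates).
grid = [(0, 0, 40, 10), (60, 0, 100, 10), (0, 30, 40, 40), (60, 30, 100, 40)]
cells = xy_cut(grid, min_gap=10)
print(len(cells), "blocks")
```

With a larger `min_gap`, the same four boxes stay together as one block, which is how the threshold controls the granularity of the segmentation.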
Uncovering the relational web
- Under review, 2008
Cited by 14 (6 self)
The World-Wide Web consists of a huge number of unstructured hypertext documents, but it also contains structured data in the form of HTML tables. Many of these tables contain both relational-style data and a small “schema” of labeled and typed columns, making each such table a small structured database. The WebTables project is an effort to extract and make use of the huge number of these structured tables on the Web. A clean collection of relational-style tables could be useful for improving web search, schema design, and many other applications. This paper describes the first stage of the WebTables project. First, we give an in-depth study of the Web’s HTML table corpus. For example, we extracted 14.1 billion HTML tables from a several-billion-page portion of Google’s general-purpose web crawl, and estimate that 154 million of these tables contain high-quality relational-style data. We also describe the crawl’s distribution of table sizes and data types. Second, we describe a system for performing relation recovery. The Web mixes relational and non-relational tables indiscriminately (often on the same page), so there is no simple way to distinguish the 1.1% of good relations from the remainder, nor to recover column label and type information. Our mix of hand-written detectors and statistical classifiers takes a raw Web crawl as input, and generates a collection of databases that is five orders of magnitude larger than any other collection we are aware of. Relation recovery achieves precision and recall that are comparable to other domain-independent information extraction systems.
Tableseer: Automatic table metadata extraction and searching in digital libraries
- In Technical Report, 2007
Cited by 11 (5 self)
Tables are ubiquitous in digital libraries. In scientific documents, tables are widely used to present experimental results or statistical data in a condensed fashion. However, current search engines do not support table search. The difficulty of automatically extracting tables from un-tagged documents, the lack of a universal table metadata specification, and the limitations of the existing ranking schemes make the table search problem challenging. In this paper, we describe TableSeer, a search engine for tables. TableSeer crawls digital libraries, detects tables from documents, extracts table metadata, indexes and ranks tables, and provides a user-friendly search interface. We propose an extensive set of medium-independent metadata for tables that scientists and other users can adopt for representing table information. In addition, we devise a novel page box-cutting method to improve the performance of the table detection. Given a query, TableSeer ranks the matched tables using an innovative ranking algorithm – TableRank. TableRank rates each <query, table> pair with a tailored vector space model and a specific term weighting scheme. Overall, TableSeer eliminates the burden of manually extracting table data from digital libraries and enables users to automatically examine tables. We demonstrate the value of TableSeer with empirical studies on scientific documents.
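To make the phrase "vector space model with a term weighting scheme" concrete, the sketch below ranks tables against a query with plain TF-IDF weights and cosine similarity. This is a generic textbook baseline, not TableRank: the abstract does not specify TableRank's tailored model or weighting, and the function names, the smoothed IDF formula, and the sample term lists here are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one TF-IDF vector (dict term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}   # assumed smoothing: +1
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, docs):
    """Return document indexes ordered by decreasing similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(query).items()}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda p: (-p[0], p[1]))]

# Hypothetical table metadata, already tokenized into terms.
tables = [
    ["precision", "recall", "table"],
    ["melting", "point", "alloy"],
    ["alloy", "density"],
]
ranking = rank(["alloy", "density"], tables)
print(ranking)
```

A table-specific scheme would presumably weight terms differently depending on where they occur (caption, column headers, cells), which is the kind of tailoring this baseline leaves out.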
Semantically Conceptualizing and Annotating Tables
Cited by 4 (3 self)
Enabling a system to automatically conceptualize and annotate a human-readable table is one way to create interesting semantic-web content. But exactly “how?” is not clear. With conceptualization and annotation in mind, we investigate a semantic-enrichment procedure as a way to turn syntactically observed table layout into semantically coherent ontological concepts, relationships, and constraints. Our semantic-enrichment procedure shows how to make use of auxiliary world knowledge to construct rich ontological structures and to populate these ontological structures with instance data. The system uses auxiliary knowledge (1) to recognize concepts and which data values belong to which concepts, (2) to discover relationships among concepts and which data-value combinations represent relationship instances, and (3) to discover constraints over the concepts and relationships that the data values and data-value combinations should satisfy. Experimental evaluations indicate that the automatic conceptualization and annotation processes perform well, yielding F-measures of 90% for concept recognition, 77% for relationship discovery, and 90% for constraint discovery in web tables selected from the geopolitical domain.
Notes on Contemporary Table Recognition
- In Proc. Document Analysis Systems VII, 7th International Workshop, DAS 2006
Cited by 3 (2 self)
The shift of interest to web tables in HTML and PDF files, coupled with the incorporation of table analysis and conversion routines in commercial desktop document processing software, is likely to turn table recognition into more of a systems than an algorithmic issue. We illustrate the transition with some actual examples of web table conversion. We then suggest that the appropriate target format for table analysis, whether performed by conventional customized programs or by off-the-shelf software, is a representation based on the abstract table introduced by X. Wang in 1996. We show that the Wang model is adequate for some useful tasks that prove elusive for less explicit representations, and outline our plans to develop a semi-automated table processing system to demonstrate this approach. Screen snapshots of a prototype tool to allow table mark-up in the style of Wang are also presented.
A fast preprocessing method for table boundary detection: Narrowing down the sparse lines using solely coordinate information
- In DAS, 2008
Cited by 2 (2 self)
With the rapid growth of PDF documents in digital libraries, recognizing the document structure and detecting specific document components are useful for document storage, classification and retrieval. Tables, as a specific document component, are ubiquitous. Accurately detecting the table boundary plays a crucial role in the later table structure decomposition and table data collection. In this paper, we propose an easy but effective table boundary detection method. Our method has two unique advantages compared with other works in this field: 1) Because most tables are text-based, we claim that the text object of PDF provides enough information for table detection. In addition, we believe that the font information is not as reliable as other work has stated. 2) Based on the nature of the table cells, we notice that almost all the table rows are sparse lines. By filtering out the non-sparse lines initially, the table boundary detection problem can be simplified into the sparse line analysis problem. The experimental results not only confirm the importance of the coordinate information, but also demonstrate the effectiveness of sparse lines in the table boundary detection. Combined with other keywords, our method is even applicable to detecting other document components (e.g., mathematical formulas or the references).
Historical recall and precision: summarizing generated hypotheses
- In Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on
Cited by 2 (2 self)
Document recognition involves many kinds of hypotheses: segmentation hypotheses, classification hypotheses, spatial relationship hypotheses, and so on. Many recognition strategies generate valid hypotheses which are eventually rejected, but current evaluation methods consider only accepted hypotheses. As a result, we have no way to measure errors associated with rejecting valid hypotheses. We propose describing hypothesis generation in more detail, by collecting the complete set of generated hypotheses and computing the recall and precision of this set: we call these the ‘historical recall’ and ‘historical precision.’ Using table cell detection examples, we demonstrate how historical recall and precision along with the complete set of generated hypotheses assist in the evaluation, debugging, and design of recognition strategies.
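A small worked example may help separate the two pairs of metrics. The sketch below computes ordinary recall and precision over the accepted hypotheses, and then the same two ratios over every hypothesis the recognizer generated, which is how the abstract describes historical recall and precision. The hypothesis labels, the matching of hypotheses to ground truth by exact label, and the convention of returning 1.0 for empty sets are assumptions made for this illustration.

```python
def recall_precision(hypotheses, ground_truth):
    """Recall and precision of a hypothesis set against the ground truth."""
    hypotheses, ground_truth = set(hypotheses), set(ground_truth)
    correct = hypotheses & ground_truth
    # Assumed convention: an empty denominator yields 1.0.
    recall = len(correct) / len(ground_truth) if ground_truth else 1.0
    precision = len(correct) / len(hypotheses) if hypotheses else 1.0
    return recall, precision

# Hypothetical table-cell hypotheses: c* are true cells, x* are spurious.
ground_truth = {"c1", "c2", "c3", "c4"}
accepted = {"c1", "c2", "x9"}          # final output of the recognizer
rejected = {"c3", "x7", "x8"}          # generated, then discarded
generated = accepted | rejected        # complete set of generated hypotheses

final = recall_precision(accepted, ground_truth)
historical = recall_precision(generated, ground_truth)
print("final recall/precision:     ", final)
print("historical recall/precision:", historical)
```

Here the historical recall (0.75) exceeds the final recall (0.5) because the valid hypothesis `c3` was generated but rejected, which is the kind of error the accepted-only metrics cannot reveal.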

