Results 1 -
3 of
3
Lessons Learned in Automatically Detecting Lists in OCRed Historical Documents
"... Lists are often the most data-rich parts of a document collection, but are usually not set apart explicitly from the rest of the text, especially in a corpus of historical OCRed documents. There are many kinds of lists, differing from each other in both layout and content. Writing individualized cod ..."
Abstract
- Add to MetaCart
Lists are often the most data-rich parts of a document collection, but are usually not set apart explicitly from the rest of the text, especially in a corpus of historical OCRed documents. There are many kinds of lists, differing from each other in both layout and content. Writing individualized code to process all possible types of lists is an expensive challenge. In the present research, we focus on general list detection, the first step in a larger process of general list reading. Our system, ListDetector, automatically locates lists from noisy word labels obtained from cheaply developed dictionaries and regular expressions. In this paper, we start by describing a simple baseline system—the first system we are aware of to address general list detection in plain text or OCRed documents. From there, we present several of the challenges and corresponding solutions that we discovered as we raised the F-measure of ListDetector from 79 % to 86.3%. We compute evaluation metrics against a gold standard corpus of OCRed documents in the family history domain that we have manually annotated for the tasks of list detection and structure recogniton. We will continue adding to this corpus and make it publically available for other researchers to use. 1.
Management
"... Building a database of facts extracted from historical documents to enable database-like query and search would reduce the tedium of gleaning facts of interest from historical documents. We propose a solution in which historical documents themselves constitute the stored database. In our solution, w ..."
Abstract
- Add to MetaCart
Building a database of facts extracted from historical documents to enable database-like query and search would reduce the tedium of gleaning facts of interest from historical documents. We propose a solution in which historical documents themselves constitute the stored database. In our solution, we use information-extraction techniques to produce a conceptualized external annotation of facts found in each document, and we superimpose the conceptualization over the document collection. The annotation process populates the conceptualization producing a repository of extracted facts, and a reasoner obtains inferred facts from these extracted facts. Our query interface accepts free-form queries and converts them to formal queries over the extracted and inferred facts. Displayed results include, in addition to standard query results, images of original documents with results highlighted along with reasoning chains for inferred facts grounded in these highlighted facts. Along with giving the implementation status of our proof-of-concept prototype, we present results for extraction accuracy and efficiency and point to current and future work needed to enable a practical solution for the envisioned historical-document database.
An Analysis of Structured Data on the Web
"... In this paper, we analyze the nature and distribution of structured data on the Web. Web-scale information extraction, or the problem of creating structured tables using extraction from the entire web, is gathering lots of research interest. We perform a study to understand and quantify the value of ..."
Abstract
- Add to MetaCart
In this paper, we analyze the nature and distribution of structured data on the Web. Web-scale information extraction, or the problem of creating structured tables using extraction from the entire web, is gathering lots of research interest. We perform a study to understand and quantify the value of Web-scale extraction, and how structured information is distributed amongst top aggregator websites and tail sites for various interesting domains. We believe this is the first study of its kind, and gives us new insights for information extraction over the Web.

