Building a dynamic lexicon from a digital library
In JCDL ’08: Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (ACM), 2008
Cited by 13 (3 self)
We describe here in detail our work toward creating a dynamic lexicon from the texts in a large digital library. By leveraging a small structured knowledge source (a 30,457 word treebank), we are able to extract selectional preferences for words from a 3.5 million word Latin corpus. This is promising news for low-resource languages and digital collections seeking to leverage a small human investment into much larger gain. The library architecture in which this work is developed allows us to query customized subcorpora to report on lexical usage by author, genre or era and allows us to continually update the lexicon as new texts are added to the collection.
The Latin Dependency Treebank in a Cultural Heritage Digital Library
Cited by 10 (5 self)
This paper describes the mutually beneficial relationship between a cultural heritage digital library and a historical treebank: an established digital library can provide the resources and structure necessary for efficiently building a treebank, while a treebank, as a language resource, is a valuable tool for audiences traditionally served by such libraries.
Improving OCR Accuracy for Classical Critical Editions
Cited by 10 (1 self)
This paper describes a workflow designed to populate a digital library of ancient Greek critical editions with highly accurate OCR scanned text. While the most recently available OCR engines are now able, after suitable training, to deal with the polytonic Greek fonts used in 19th and 20th century editions, further improvements can also be achieved with postprocessing. In particular, the progressive multiple alignment method applied to different OCR outputs based on the same images is discussed in this paper.
The Ancient Greek and Latin Dependency Treebanks
Cited by 5 (0 self)
This paper describes the development, composition, and several uses of the Ancient Greek and Latin Dependency Treebanks, large collections of Classical ...
A new generation of textual corpora: mining corpora from very large collections
In JCDL, 2007
Cited by 4 (0 self)
While digital libraries based on page images and automatically generated text have made possible massive projects such as the Million Book Library, Open Content Alliance, Google, and others, humanists still depend upon textual corpora expensively produced with labor-intensive methods such as double-keyboarding and manual correction. This paper reports the results from an analysis of OCR-generated text for classical Greek source texts. Classicists have depended upon specialized manual keyboarding that costs two or more times as much as keyboarding of English both for accuracy and because classical Greek OCR produced no usable results. We found that we could produce texts by OCR that, in some cases, approached the 99.95 % professional data entry accuracy rate. In most cases, OCR-generated text yielded results that, by including the variant readings that digital corpora traditionally have left out, provide better recall and, we argue, can better serve many scholarly needs than the expensive corpora upon which classicists have relied for a generation. As digital collections expand, we will be able to collate multiple editions against each other, identify quotations of primary sources, and provide a new generation of services.
Transferring Structural Markup Across Translations Using Multilingual Alignment and Projection
Cited by 3 (2 self)
We present here a method for automatically projecting structural information across translations, including canonical citation structure (such as chapters and sections), speaker information, quotations, markup for people and places, and any other element in TEI-compliant XML that delimits spans of text that are linguistically symmetrical in two languages. We evaluate this technique on two datasets, one containing perfectly transcribed texts and one containing errorful OCR, and achieve an accuracy rate of 88.2 % projecting 13,023 XML tags from source documents to their transcribed translations, with an 83.6 % accuracy rate when projecting to texts containing uncorrected OCR. This approach has the potential to allow a highly granular multilingual digital library to be bootstrapped by applying the knowledge contained in a small, heavily curated collection to a much larger but unstructured one.
A New Generation of Textual Corpora
Cited by 2 (1 self)
While digital libraries based on page images and automatically generated text have made possible massive projects such as the Million Book Library, Open Content Alliance, Google, and others, humanists still depend upon textual corpora expensively produced with labor-intensive methods such as double-keyboarding and manual correction. This paper reports the results from an analysis of OCR-generated text for classical Greek source texts. Classicists have depended upon specialized manual keyboarding that costs two or more times as much as keyboarding of English both for accuracy and because classical Greek OCR produced no usable results. We found that we could produce texts by OCR that, in some cases, approached the 99.95 % professional data entry accuracy rate. In most cases, OCR-generated text yielded results that, by including the variant readings that digital corpora traditionally have left out, provide better recall and, we argue, can better serve many scholarly needs than the expensive corpora upon which classicists have relied for a generation. As digital collections expand, we will be able to collate multiple editions against each other, identify quotations of primary sources, and provide a new generation of services.
[Forthcoming in Blackwell Companion to Digital Literary Studies, Ray Siemens and
Writing, Phaedrus, has this strange quality, and is very like painting; for the creatures of painting stand like living beings, but if one asks them a question, they preserve a solemn silence. And so it is with written words; you might think they spoke as if they had intelligence, but if you question them, wishing to know about their sayings, they always say only one and the same thing. Plato, Phaedrus 275d
Suggestions and Strategies for the Text Encoding Initiative
The contents of this site are subject to the French law on intellectual property and are the exclusive property of the publisher. The works on this site can be accessed and reproduced on paper or digital media, provided that they are strictly used for personal, scientific or educational purposes excluding any commercial exploitation. Reproduction must necessarily mention the editor, the journal name, the author and the document reference. Any other reproduction is strictly forbidden without permission of the publisher, except in cases provided by legislation.
eScience and the Humanities
Manuscript, International Journal on Digital Libraries
Humanists face problems that are comparable to their colleagues in the sciences. Like scientists, humanists have electronic sources and datasets that are too large for traditional labor intensive analysis. They also need to work with materials that presuppose more background knowledge than any one researcher can master: no one can, for example, know all the languages needed for subjects that cross multiple disciplines. Unlike their colleagues in the sciences, however, humanists have relatively few resources with which to develop this new infrastructure. They must therefore systematically cultivate alliances with better funded disciplines, learning how to build on emerging infrastructure from other disciplines and, where possible, contributing to the design of a cyberinfrastructure that serves all of academia, including the humanities.