The TIGER Treebank
, 2002
Cited by 332 (3 self)
This paper reports on the TIGER Treebank, a corpus of currently 35,000 syntactically annotated German newspaper sentences. We describe what kind of information is encoded in the treebank and introduce the different representation formats that are used for the annotation and exploitation of the treebank. We explain the different methods used for the annotation: interactive annotation, using the tool Annotate, and LFG parsing. Furthermore, we give an account of the annotation scheme used for the TIGER treebank. This scheme is an extended and improved version of the NEGRA annotation scheme, and we illustrate in detail the linguistic extensions that were made to the annotation in the TIGER project. The main differences concern coordination, verb subcategorization, expletives, and proper nouns. In addition, the paper presents the query tool TIGERSearch, which was developed in the project to exploit the treebank adequately. We describe the query language, which was designed to facilitate a simple formulation of complex queries; furthermore, we briefly introduce TIGERin, a graphical user interface for query input. The paper concludes with a summary and some directions for future work.
Maltparser: A language-independent system for data-driven dependency parsing
- In Proc. of the Fourth Workshop on Treebanks and Linguistic Theories
, 2005
Empirical Methods for Compound Splitting
- In Proceedings of EACL
, 2003
Cited by 121 (3 self)
Compounded words are a challenge for NLP applications such as machine translation (MT). We introduce methods to learn splitting rules from monolingual and parallel corpora. We evaluate them against a gold standard and measure their impact on performance of statistical MT systems. Results show accuracy of 99.1% and performance gains for MT of 0.039 BLEU on a German-English noun phrase translation task.
Deriving a Large Scale Taxonomy from Wikipedia
- In Proceedings of AAAI
, 2007
Cited by 112 (6 self)
We take the category system in Wikipedia as a conceptual network. We label the semantic relations between categories using methods based on connectivity in the network and lexico-syntactic matching. As a result, we are able to derive a large-scale taxonomy containing a large number of subsumption (i.e., is-a) relations. We evaluate the quality of the created resource by comparing it with ResearchCyc, one of the largest manually annotated ontologies, as well as by computing semantic similarity between words in benchmarking datasets.
The German Text-to-Speech synthesis system MARY: A tool for research, development and teaching
- International Journal of Speech Technology
, 2001
Cited by 91 (16 self)
This paper introduces the German text-to-speech synthesis system MARY. The system’s main features, namely a modular design and an XML-based system-internal data representation, are pointed out, and the properties of the individual modules are briefly presented. An interface allowing the user to access and modify intermediate processing steps without the need for a technical understanding of the system is described, along with examples of how this interface can be put to use in research, development and teaching. The usefulness of the modular and transparent design approach is further illustrated with an early prototype of an interface for emotional speech synthesis.
SVMTool: A general POS tagger generator based on Support Vector Machines
, 2004
Cited by 81 (0 self)
This report presents SVMTool, a simple, flexible, effective and efficient part-of-speech tagger based on Support Vector Machines. SVMTool offers a fairly good balance among these properties, which makes it really practical for current NLP applications. It is very easy to use and easily configurable so as to fit the needs of a number of different applications. Results are also very competitive, achieving an accuracy of 97.16% for English on the Wall Street Journal corpus. It has also been successfully applied to Spanish and Catalan, exhibiting similar performance. A first release of the SVMTool Perl prototype is now freely available for public use. A more efficient C++ version is coming very soon, by summer 2004.
A Protégé Plug-In for Ontology Extraction from Text Based on Linguistic Analysis
- In European Semantic Web Symposium
, 2004
Cited by 80 (4 self)
In this paper we describe a plug-in (OntoLT) for the widely used Protégé ontology development tool that supports the interactive extraction and/or extension of ontologies from text. The OntoLT approach provides an environment for the integration of linguistic analysis in ontology engineering through the definition of mapping rules that map linguistic entities in annotated text collections to concept and attribute candidates (i.e. Protégé classes and slots). The paper explains this approach in more detail and discusses some initial experiments on deriving a shallow ontology for the neurology domain from a corresponding collection of neurological scientific abstracts.
A linguistically interpreted corpus of German newspaper text
- In Proceedings of the ESSLLI Workshop on Recent Advances in Corpus Annotation
, 1998
Cited by 64 (1 self)
In this paper, we report on the development of an annotation scheme and annotation tools for unrestricted German text. Our representation format is based on argument structure, but also permits the extraction of other kinds of representations. We discuss several methodological issues and the analysis of some phenomena. Additional focus is on the tools developed in our project and their applications.
Wide-coverage probabilistic sentence processing
- Lectures on Government and Binding, Foris, Dordrecht 36
, 2000
Cited by 59 (10 self)
This paper describes a fully implemented, broad-coverage model of human syntactic processing. The model uses probabilistic parsing techniques, which combine phrase structure, lexical category, and limited subcategory probabilities with an incremental, left-to-right “pruning” mechanism based on cascaded Markov models. The parameters of the system are established through a uniform training algorithm, which determines maximum-likelihood estimates from a parsed corpus. The probabilistic parsing mechanism enables the system to achieve good accuracy on typical, “garden-variety” language (i.e., when tested on corpora). Furthermore, the incremental probabilistic ranking of the preferred analyses during parsing also naturally explains observed human behavior for a range of garden-path structures. We do not make strong psychological claims about the specific probabilistic mechanism discussed here, which is limited by a number of practical considerations. Rather, we argue that incremental probabilistic parsing models are, in general, extremely well suited to explaining this dual nature (generally good and occasionally pathological) of human linguistic performance. KEY WORDS: probabilistic parsing; frequency; Markov models.
Stylebook for the Tübingen Treebank of Written German (TüBa-D/Z)
, 2009
Cited by 56 (10 self)
This stylebook is an updated version of Telljohann et al. (2009). It describes the design principles and the annotation scheme for the German treebank TüBa-D/Z developed by the Division of Computational Linguistics (Lehrstuhl Prof. Hinrichs) at the Department of Linguistics (Seminar für Sprachwis-