The English-Slovene ACQUIS corpus (2006)

by Tomaž Erjavec
Results 1 - 2 of 2

Towards a Slovene dependency treebank

by Sašo Džeroski, Tomaž Erjavec, Nina Ledinek, Petr Pajas, Zdenek Žabokrtsky, Andreja Žele - In Proc. Int. Conf. on Language Resources and Evaluation (LREC), 2006
Cited by 11 (1 self)
The paper presents the initial release of the Slovene Dependency Treebank, currently containing 2000 sentences or 30,000 words. Our approach to annotation is based on the Prague Dependency Treebank, which serves as an excellent model due to the similarity of the languages, the existence of a detailed annotation guide and an annotation editor. The initial treebank contains a portion of the MULTEXT-East parallel word-level annotated corpus, namely the first part of the Slovene translation of Orwell’s “1984”. This corpus was first parsed automatically, to arrive at the initial analytic level dependency trees. These were then hand corrected using the tree editor TrEd; simultaneously, the Czech annotation manual was modified for Slovene. The current version is available in XML/TEI, as well as derived formats, and has been used in a comparative evaluation using the MALT parser, and as one of the languages present in the CoNLL-X shared task on dependency parsing. The paper also discusses further work, in the first instance the composition of the corpus to be annotated next.
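The abstract mentions that the treebank was used in the CoNLL-X shared task. As a rough illustration of what a dependency tree looks like in that kind of tabular format, the sketch below reads a three-token toy sentence with ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The sentence, the labels `Sb`, `Pred` and `Obj`, and the helper names `parse_conllx` and `children` are assumptions made for this example; nothing here is taken from the treebank itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    idx: int      # 1-based position in the sentence (ID column)
    form: str     # surface word form (FORM column)
    head: int     # index of the governing token, 0 for the root (HEAD column)
    deprel: str   # dependency label (DEPREL column)


def parse_conllx(block: str) -> List[Token]:
    """Read one sentence given as CoNLL-X style tab-separated lines."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        tokens.append(Token(int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return tokens


def children(tokens: List[Token], head: int) -> List[str]:
    """Return the word forms directly governed by the token with index `head`."""
    return [t.form for t in tokens if t.head == head]


# Hypothetical toy sentence; unused columns are filled with "_".
TOY = "\n".join([
    "1\tShe\t_\t_\t_\t_\t2\tSb\t_\t_",
    "2\treads\t_\t_\t_\t_\t0\tPred\t_\t_",
    "3\tbooks\t_\t_\t_\t_\t2\tObj\t_\t_",
])

sentence = parse_conllx(TOY)
root = next(t for t in sentence if t.head == 0)
```

Here `root.form` is the verb, and `children(sentence, root.idx)` lists its two dependents; a real treebank file would contain many such sentences separated by blank lines.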

Morphosyntactic Tagging of Slovene Legal Language. Informatica 30:483–488

by Tomaž Erjavec, Bence Sárossy - Informatica, 2006
Cited by 1 (0 self)
Part-of-speech tagging or, more accurately, morphosyntactic tagging, is a procedure that assigns to each word token appearing in a text its morphosyntactic description, e.g. “masculine singular common noun in the genitive case”. Morphosyntactic tagging is an important component of many language technology applications, such as machine translation, speech synthesis, or information extraction. In the paper we report on an experiment on morphosyntactic tagging of Slovene, on a sample of Slovene legal language. We evaluate the accuracy of the TnT tagger, which had been trained on the MULTEXT-East language resources for Slovene. The test data come from the freely available parallel English-Slovene corpus SVEZ-IJS, which contains the Slovene translation of European Union legal acts. Presented are the details of the manually corrected test corpus and an analysis of the tagging errors. The paper also discusses a simple transformation-based program that fixes some of the more common errors, and concludes with some directions for future work. Summary: The paper describes an experiment in morphosyntactic tagging on a sample of Slovene legal texts.
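The evaluation described in this abstract, per-token accuracy against a manually corrected test corpus followed by rule-based correction of common errors, can be sketched roughly as follows. This is not the authors' program: the tag strings, the single rule, and the function names `accuracy` and `apply_rules` are all hypothetical, and the rule format (rewrite a tag when the preceding tag matches) is only one simple way to write a transformation-style fix.

```python
from typing import List, Tuple

# A rule is (previous_tag, wrong_tag, replacement_tag): when a token tagged
# wrong_tag directly follows a token tagged previous_tag, rewrite its tag.
Rule = Tuple[str, str, str]


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def apply_rules(tags: List[str], rules: List[Rule]) -> List[str]:
    """Apply each contextual rewrite rule in order, left to right."""
    fixed = list(tags)
    for prev_tag, wrong, replacement in rules:
        for i in range(1, len(fixed)):
            if fixed[i - 1] == prev_tag and fixed[i] == wrong:
                fixed[i] = replacement
    return fixed


# Hypothetical four-token example with placeholder tag names.
gold = ["PREP", "NOUN-GEN", "VERB", "NOUN-ACC"]
predicted = ["PREP", "NOUN-NOM", "VERB", "NOUN-ACC"]
rules = [("PREP", "NOUN-NOM", "NOUN-GEN")]

before = accuracy(predicted, gold)
after = accuracy(apply_rules(predicted, rules), gold)
```

In this toy case one of four tags is wrong before the rule is applied and none afterwards. With real data the rules would be derived from the error analysis, and they should be checked on held-out text so that a fix for one error does not introduce another.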