Multilingual Part-of-Speech Tagging: Two Unsupervised Approaches
Cited by 10 (3 self)
We demonstrate the effectiveness of multilingual learning for unsupervised part-of-speech tagging. The central assumption of our work is that by combining cues from multiple languages, the structure of each becomes more apparent. We consider two ways of applying this intuition to the problem of unsupervised part-of-speech tagging: a model that directly merges tag structures for a pair of languages into a single sequence and a second model which instead incorporates multilingual context using latent variables. Both approaches are formulated as hierarchical Bayesian models, using Markov Chain Monte Carlo sampling techniques for inference. Our results demonstrate that by incorporating multilingual evidence we can achieve impressive performance gains across a range of scenarios. We also found that performance improves steadily as the number of available languages increases.
MULTEXT-East morphosyntactic specifications and XML
- Readings in multilinguality
, 2006
Cited by 9 (4 self)
Word-level morphosyntactic descriptions, such as “Ncmsn” designating a common masculine singular noun in the nominative, have been developed for all Slavic languages, yet there have been few attempts to arrive at a proposal that would be harmonised across the languages. Standardisation adds to the interchange potential of the resources, making it easier to develop multilingual applications or to evaluate language technology tools across several languages. The process of harmonising morphosyntactic categories, especially for morphologically rich Slavic languages, is also interesting from a language-typological perspective. The EU MULTEXT-East project developed corpora, lexica and tools for seven languages, with the focus being on morphosyntactic data, including formal, EAGLES-based specifications for lexical morphosyntactic descriptions. The specifications were later extended, so that they currently cover nine languages, five from the Slavic family: Bulgarian, Croatian, Czech, Serbian and Slovene. The paper presents these morphosyntactic specifications, giving their background and structure, including the encoding of the tables as TEI feature structures. The five Slavic language specifications are discussed in more depth.
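To illustrate the positional encoding behind a morphosyntactic description such as “Ncmsn”, here is a minimal Python sketch. The attribute tables are hypothetical and cover only the values named in the abstract above (noun, common, masculine, singular, nominative). The actual specification tables are language-specific and much larger, and the function name `decode_msd` is an assumption made for this example.

```python
# Hypothetical, deliberately tiny attribute tables: they cover only the
# values spelled out for "Ncmsn" and are NOT the real specification tables.
CATEGORIES = {"N": "Noun"}
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common"}),
    ("Gender", {"m": "masculine"}),
    ("Number", {"s": "singular"}),
    ("Case", {"n": "nominative"}),
]


def decode_msd(msd: str) -> dict:
    """Expand a positional MSD string into attribute-value pairs."""
    if not msd or msd[0] not in CATEGORIES:
        raise ValueError(f"unknown category in MSD {msd!r}")
    features = {"Category": CATEGORIES[msd[0]]}
    # Each character after the first fills one attribute slot, in order.
    for code, (name, values) in zip(msd[1:], NOUN_ATTRIBUTES):
        if code == "-":
            # Assumed convention: "-" marks an attribute that does not apply.
            continue
        if code not in values:
            raise ValueError(f"unknown value {code!r} for attribute {name}")
        features[name] = values[code]
    return features


print(decode_msd("Ncmsn"))
```

Because every position has a fixed meaning, the same string can be expanded into a feature structure or collapsed back into a compact tag without loss.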
A flexible framework for integrating annotations from different tools and tag sets
- Traitement Automatique des Langues, Volume 49, No. 2/2008, pages 217-246
, 2008
Cited by 7 (3 self)
We present a general framework for integrating annotations from different tools and tag sets. When annotating corpora at multiple linguistic levels, annotators may use different expert tools for different phenomena or types of annotation. These tools employ different data models and accompanying approaches to visualization, and they produce different output formats. For the purposes of uniformly processing these outputs, we developed a pivot format called PAULA, along with converters to and from tool formats. Different annotations are not only integrated at the level of data format, but are also joined on the level of conceptual representation. For this purpose, we introduce OLiA, an ontology of linguistic annotations that mediates between alternative tag sets that cover the same class of linguistic phenomena. All components are integrated in the linguistic information system ANNIS: Annotation tool output is converted to the pivot format PAULA and read into a database where the data can be visualized, queried, and evaluated across multiple layers. For cross-tag set querying and statistical evaluation, ANNIS uses the ontology of linguistic annotations. Finally, ANNIS is also tied to a machine learning component for semiautomatic annotation.
Evaluating the word sense disambiguation accuracy with three different sense inventories
- Proceedings of the 2nd International Workshop on Natural Language Understanding and Cognitive Science (NLUCS2005)
, 2005
The English-Slovene ACQUIS corpus
, 2006
Cited by 5 (2 self)
The paper presents the SVEZ-IJS corpus, a large parallel annotated English-Slovene corpus containing translated legal texts of the European Union, the ACQUIS Communautaire. The corpus contains approx. 2 x 5 million words and was compiled from the translation memory obtained from the Translation Unit of the Slovene Government Office for European Affairs. The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines TEI P4, where each translation memory unit contains useful metadata and the two aligned segments (sentences). Both the Slovene and the English text are linguistically annotated at the word level with context-disambiguated lemmas and morphosyntactic descriptions, which follow the MULTEXT guidelines. The complete corpus is freely available for research, either via an on-line concordancer, or for downloading from the corpus home page at
Multi-Source Translation Methods
Cited by 5 (0 self)
Multi-parallel corpora provide a potentially rich resource for machine translation. This paper surveys existing methods for utilizing such resources, including hypothesis ranking and system combination techniques. We find that despite significant research into system combination, relatively little is known about how best to translate when multiple parallel source languages are available. We provide results to show that the MAX multilingual multi-source hypothesis ranking method presented by Och and Ney (2001) does not reliably improve translation quality when a broad range of language pairs is considered. We also show that the PROD multilingual multi-source hypothesis ranking method of Och and Ney (2001) cannot be used with standard phrase-based translation engines, due to a high number of unreachable hypotheses. Finally, we present an oracle experiment which shows that current hypothesis ranking methods fall far short of the best results reachable via sentence-level ranking.
Learning Morphology with Morfette
Cited by 3 (0 self)
Morfette is a modular, data-driven, probabilistic system which learns to perform joint morphological tagging and lemmatization from morphologically annotated corpora. The system is composed of two learning modules which are trained to predict morphological tags and lemmas using a Maximum Entropy classifier. The third module dynamically combines the predictions of the Maximum Entropy models and outputs a probability distribution over tag-lemma pair sequences. The lemmatization module exploits the idea of recasting lemmatization as a classification task by using class labels which encode mappings from wordforms to lemmas. Experimental evaluation results and error analysis on three morphologically rich languages show that the system achieves high accuracy with no language-specific feature engineering or additional resources.
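As a rough illustration of recasting lemmatization as classification, the Python sketch below derives a class label from a (wordform, lemma) pair and applies that label to an unseen wordform. The label used here, a pair of (number of trailing characters to strip, suffix to append), is a simplifying assumption and not necessarily the encoding Morfette itself uses. The function names `lemma_class` and `apply_class` are hypothetical.

```python
from os.path import commonprefix
from typing import Tuple

# Assumed label format: (characters to strip from the end, suffix to append).
Label = Tuple[int, str]


def lemma_class(wordform: str, lemma: str) -> Label:
    """Derive a class label that maps `wordform` onto `lemma`."""
    # commonprefix compares the strings character by character.
    stem_len = len(commonprefix([wordform, lemma]))
    return (len(wordform) - stem_len, lemma[stem_len:])


def apply_class(wordform: str, label: Label) -> str:
    """Apply a previously derived class label to a (possibly unseen) wordform."""
    strip, suffix = label
    if strip > len(wordform):
        raise ValueError("label strips more characters than the wordform has")
    return wordform[: len(wordform) - strip] + suffix


# A label learned from one pair generalises to other words of the same pattern.
label = lemma_class("cities", "city")
print(label, apply_class("parties", label))
```

Since many wordforms share the same label, a classifier only has to choose among a modest set of labels rather than generate lemma strings directly.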
SI-PRON Pronunciation Lexicon: a New Language Resource for Slovenian
- Informatica 30:447–452
, 2006
Cited by 1 (0 self)
We present the efforts involved in designing SI-PRON, a comprehensive machine-readable pronunciation lexicon for Slovenian. It has been built from two sources and contains all the lemmas from the Dictionary of Standard Slovenian (SSKJ), the most frequent inflected word forms found in contemporary Slovenian texts, and a first pass of inflected word forms derived from SSKJ lemmas. The lexicon file contains the orthography, corresponding pronunciations, lemmas and morphosyntactic descriptors of lexical entries in a format based on requirements defined by the W3C Voice Browser Activity. The current version of the SI-PRON pronunciation lexicon contains over 1.4 million lexical entries. The word list determination procedure, the generation and validation of phonetic transcriptions, and the lexicon format are described in the paper. Along with Onomastica, SI-PRON represents a valuable language resource for linguistic studies and research on speech technologies for Slovenian. The lexicon is already being used by the Proteus Slovenian text-to-speech synthesis system and for generating audio samples of the SSKJ headwords. Summary: The article describes a new language resource for Slovenian, the SI-PRON pronunciation lexicon.
Avoiding Data Graveyards: Deriving an Ontology for Accessing Heterogeneous Data Collections
- Proceedings of the International Workshop “Ontologies in Text Technology”
, 2006
Cited by 1 (0 self)
In this paper, I describe the derivation and practical application of an ontology of word classes manually derived from four different sources: the EAGLES recommendations for the morphosyntactic annotation of corpora; several language-specific or task-specific tag sets for part-of-speech tagging; the typologically oriented SFB632 guidelines for part-of-speech tagging; and the General Ontology for Linguistic Description (GOLD). The resulting ontology is intended to provide integrated representation of and access to terminologically heterogeneous resources. It will be applied as part of a sustainable archive of linguistic resources to be developed by the project “Sustainability of Linguistic Data”, a recently started joint initiative by three German special research centers. While in the first phase the focus of the ontology development has been on terminology for part-of-speech (POS) tagging, which requires hand-crafted methods, a possible extension towards the semi-automatic integration of syntactic annotation will be sketched as an outlook.
A Text Processing Tool for the Romanian Language
- Proc. of the EuroLAN 2005 Workshop on Cross-Language Knowledge Induction
, 2005
Cited by 1 (1 self)
BALIE is a multilingual text processing tool designed to support information extraction. In this paper we explain how we adapted it to work for the Romanian language. With this addition, the tool supports five languages: English, French, German, Spanish, and Romanian. The services offered by the tool are language identification, tokenization, sentence boundary detection, and part-of-speech tagging. We also present evaluation and results for the four newly added components for the Romanian language.

