Alignment tools for parallel treebanks
In Proc. of the Linguistic Annotation Workshop (LAW) at ACL, 2007
Cited by 7 (1 self)
This paper describes a tool for aligning and searching parallel treebanks. Such treebanks are a new type of parallel corpus that comes with syntactic annotation in both languages plus sub-sentential alignment. Our tool supports the visualization of tree pairs and the convenient annotation of word and phrase alignments. It also supports monolingual and bilingual searches, including the specification of alignment constraints. We show that the TIGER-Search query language can easily be combined with such alignment constraints to obtain a powerful cross-lingual query language.
Tree-based Target Language Modeling
Cited by 4 (3 self)
In this paper we describe an approach to target language modeling which is based on a large treebank. We assume a bag of bags as input for the target language generation component, leaving it up to this component to decide upon word and phrase order. An experiment with Dutch as target language shows that this approach to candidate translation reranking outperforms standard n-gram modeling, when measuring
Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian
Cited by 3 (0 self)
This paper presents how we adapted a website search engine for cross-language information retrieval, using the Uplug word alignment tool for parallel corpora. We first studied the monolingual search queries posed by visitors to the website of the Nordic Council, which contains five different languages. To compare how well different types of bilingual dictionaries covered the most common queries and terms on the website, we tried a collection of ordinary bilingual dictionaries, a small manually constructed trilingual dictionary, and an automatically constructed trilingual dictionary built from the news corpus on the website using Uplug. The precision and recall of the Swedish-English dictionary automatically constructed with Uplug were 71 and 93 percent, respectively. We found that precision and recall increase significantly in samples with high word frequency, but we could not confirm that POS tags improve precision. The collection of ordinary dictionaries, consisting of about 200 000 words, covers only 41 of the top 100 search queries on the website. The automatically built trilingual dictionary combined with the small manually built trilingual dictionary, consisting of about 2 300 words, covers 36 of the top search queries.
The impact of lemmatization in word alignment
2005
Cited by 2 (0 self)
This thesis examines whether the precision and recall of word alignment can be improved through lemmatization, and through the extraction of lemma dictionaries from the resulting links. Lemmas are extracted from existing lexical resources in order to replace word forms in two parallel corpus documents, one featuring the language pair English-Swedish and the other the language pair Swedish-English. The parallel corpora, consisting of a technical Scania manual and a Saul Bellow novel, originate from the PLUG project (Sågvall Hein 2002) and were originally aligned and evaluated by Jörg Tiedemann (2003). Using a Perl script, four lemmatized documents are created. These are aligned by the Clue aligner, a word aligner constructed by Tiedemann (2003), which is also used to align the word form versions of the same texts. The alignments of the lemmatized corpora and of the word form corpora are evaluated automatically against a reference alignment and compared. The links derived from the lemmatized corpora yield an improvement in recall, and in one case in precision, compared to the word form links.
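As a rough illustration of what evaluating links against a reference alignment can look like, the sketch below computes exact-match precision and recall over sets of word links. The link representation, the function name `precision_recall`, and the toy data are all assumptions made for this example; the thesis may well use a different scheme (for instance, one that gives credit for partially correct links).

```python
from typing import Set, Tuple

# A link pairs a source token index with a target token index (assumed representation).
Link = Tuple[int, int]


def precision_recall(predicted: Set[Link], reference: Set[Link]) -> Tuple[float, float]:
    """Exact-match precision and recall of predicted links against a reference."""
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical toy links, not data from the thesis.
predicted = {(0, 0), (1, 2), (2, 1), (3, 3)}
reference = {(0, 0), (1, 1), (2, 1), (3, 3), (4, 4)}

p, r = precision_recall(predicted, reference)
print(f"precision={p:.2f} recall={r:.2f}")
```

Running the same function once on the links from the lemmatized texts and once on the links from the word form texts would give the kind of side-by-side comparison the abstract describes.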
Automatic Construction of Domain-specific Dictionaries on Sparse Parallel Corpora in the Nordic Languages
Cited by 2 (1 self)
Hallå Norden is a web site with information regarding mobility between the Nordic countries in five different languages: Swedish, Danish, Norwegian, Icelandic and Finnish. We wanted to create a Nordic cross-language dictionary for use in a cross-language search engine for Hallå Norden. The entire set of texts on the web site was treated as one multilingual parallel corpus. From this we extracted parallel corpora for each language pair. The corpora were very sparse, containing on average fewer than 80 000 words per language pair. We used the Uplug word alignment system (Tiedemann 2003a) to create the dictionaries. The results gave on average 213 new dictionary words (frequency > 3) per language pair. The average error rate was 16 percent. Combinations involving Finnish had a higher error rate, 33 percent, whereas the remaining language pairs averaged only 9 percent. The high error rate for Finnish is possibly due to the fact that Finnish belongs to a different language family. Although the corpora were very sparse, the word alignment results for the combinations of Swedish, Danish, Norwegian and Icelandic were surprisingly good compared to other experiments with larger corpora.
Harvesting Multi-Word Expressions from Parallel Corpora
Cited by 2 (1 self)
The paper presents a set of approaches to extend the automatically created Slovene wordnet with nominal multiword expressions. In the first approach, multiword expressions from Princeton WordNet are translated with a technique based on word alignment and lexico-syntactic patterns. This is followed by extracting new terms from a monolingual corpus using keywordness ranking and contextual patterns. Finally, the multiword expressions are assigned a hypernym and added to our wordnet. Manual evaluation and comparison of the results show that the translation approach is the most straightforward and accurate. However, it is successfully complemented by the two monolingual approaches, which are able to identify more term candidates in the corpus that would otherwise go unnoticed. Some weaknesses of the proposed wordnet extension techniques are also addressed.
Removing the Distinction Between a Translation Memory, a Bilingual Dictionary and a Parallel Corpus
Cited by 2 (2 self)
This paper presents a prototype MT system which does not make the distinction between a dictionary, a sub-sentential aligned parallel corpus, and post-edited information (translators' output) like a translation memory. The system is based on the METIS approach (Vandeghinste et al., 2006), and uses an XML-based dictionary format in which not only simple word-to-word translations can be included, but which also contains complex dictionary entries, including discontinuous entries, like idioms and proverbs. The presented prototype is a system that automatically adapts its dictionary and target language corpus depending on the post-edited output as made by the users of the system, and will therefore have a learning curve in its performance.
Example-based Segmentation of Swedish Compounds in a Swedish–English bilingual corpus and the possibility of Evaluating Compound Links based on that Segmentation
A Multilingual Approach to Building Slovene Wordnet
Cited by 1 (1 self)
The paper presents an experiment in which synsets for Slovene wordnet were induced automatically from several multilingual resources. Our research is based on the assumption that translations are a plausible source of semantically relevant information. More specifically, we argue that the translational relation on the one hand reduces ambiguity of a source word and on the other conveys semantic relatedness of a set of target words. We tried to identify sense distinctions of polysemous words and obtain sets of synonyms by first extracting multilingual lexicons from a word-aligned JRC-Acquis parallel corpus and then comparing them with the already existing wordnets in various languages. At this stage, lexicon entries were disambiguated and appropriate synset ids were assigned to their Slovene translation equivalents. Finally, the Slovene lexicon entries sharing the same assigned synset id were organized into a synset.
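The final step described above, organizing lexicon entries that share a synset id into one synset, can be sketched as a simple grouping operation. This is only an illustrative sketch: the function name `build_synsets`, the `(lemma, synset_id)` pair representation, the placeholder ids, and the example words are assumptions made here, not the authors' implementation or data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_synsets(entries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (slovene_lemma, synset_id) pairs into synsets keyed by synset id."""
    synsets: Dict[str, List[str]] = defaultdict(list)
    for lemma, synset_id in entries:
        # Skip repeats so each lemma appears once per synset.
        if lemma not in synsets[synset_id]:
            synsets[synset_id].append(lemma)
    return dict(synsets)


# Hypothetical disambiguated entries with placeholder ids (illustration only).
entries = [
    ("avto", "id-001"),
    ("avtomobil", "id-001"),
    ("pes", "id-002"),
    ("avto", "id-001"),  # the same entry reached via another language pair
]

synsets = build_synsets(entries)
for synset_id, lemmas in synsets.items():
    print(synset_id, lemmas)
```

Keying on the synset id means that translation equivalents found through different language pairs end up in the same synset without any pairwise comparison of lemmas.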
Translations of free software into Irish
2006
controls about 95% of the desktop computer market, this was clearly a major step forward in the provision of technology to Irish speakers in a native language context. At the same time, tucked away among the recalcitrant 5% of non-Windows users, there is a small community of volunteer translators and software developers that has been enjoying a completely free Irish language desktop system since 2002. This system is based on Linux, a free alternative to the Windows operating system, and includes a complete range of end-user applications such as web browsers, email handlers, office software, and games. Here “free” has a technical definition which means roughly that the software in question can be copied, modified, redistributed, or even sold by anyone, as long as the redistributed versions preserve these same freedoms for others. While there is no requirement that the software be distributed at no cost, in practice it almost always is. One occasionally hears reference to “open source” software, which, for the purposes of this paper, amounts to the same thing, despite endless hair-splitting in the free software community.

