Exploiting Parallel Texts to Produce a Multilingual Sense Tagged Corpus for Word Sense Disambiguation
- In Proceedings of RANLP-05, Borovets, 2005
Cited by 9 (6 self)
We describe an approach to the automatic creation of a sense tagged corpus intended to train a word sense disambiguation (WSD) system for English-Portuguese machine translation. The approach uses parallel corpora, translation dictionaries and a set of straightforward heuristics. In an evaluation with nine corpora containing 10 ambiguous verbs, the approach achieved an average precision of 94%, compared with 58% when a state of the art statistical alignment tool was used. The resulting corpus consists of 113,802 instances tagged with the senses (i.e., translations) of the 10 verbs. Besides the word-sense tags, this corpus provides other useful information, such as POS-tags, and can be readily used as input to supervised machine learning algorithms in order to build WSD models for machine translation.
DISPARA, a system for distributing parallel corpora on the Web
- In Ranchhod, E. & N.J, 2002
Cited by 6 (4 self)
The main purpose of the present paper is to document the process of creating a parallel corpus available on the Web, thereby illuminating technical and design issues involved in such a project. By this we hope to attract more researchers to help with the building process, as well as to boost considerably the number of users of the parallel corpus.
A Hybrid Model for Word Sense Disambiguation in English-Portuguese Machine Translation
- In Proceedings of the 8th Research Colloquium of the UK Special-Interest Group in Computational Linguistics, 2005
Cited by 4 (3 self)
We present the proposal for an approach to word sense disambiguation with application in machine translation from English to Brazilian Portuguese. This approach follows a hybrid natural language processing method, that is, a mixture of knowledge and corpus-based approaches. The main innovative feature is the formalism that we intend to use to represent the instances and the background knowledge. Opposed to
NatServer: A Client-Server Architecture for building Parallel Corpora applications
Cited by 3 (3 self)
Parallel corpora are important resources for most natural language processing tasks. From common applications, such as machine translation, to usually monolingual tasks, such as paraphrase detection and word sense disambiguation, most researchers are using massive parallel corpora. Thus, the availability of an efficient way to manage them is very important. This paper presents a client-server architecture to efficiently query parallel corpora and probabilistic translation dictionaries.
Annotating COMPARA, a Grammar-aware Parallel Corpus
Cited by 3 (2 self)
In this paper we describe the annotation of COMPARA, currently the largest post-edited parallel corpus that includes Portuguese. We describe the motivation, the results so far, and the way the corpus is being annotated. We also provide the first grounded results about syntactic ambiguity in Portuguese. Finally, we discuss some interesting problems in this connection.
M.: Translation Context Sensitive WSD
- the 11th Annual Conference of the European Association for Machine Translation, 2006
Cited by 1 (1 self)
While it is generally agreed that Word Sense Disambiguation (WSD) is an application-dependent task, the great majority of systems pursue application-independent approaches. We propose a strategy to support WSD for Machine Translation which is designed specifically for this application. It relies on the analysis of co-occurrences in the context that refer to words which have already been translated. Experiments on the English-Portuguese translation of 10 verbs using just this knowledge yielded an accuracy of 51%, which outperforms the baseline using the most frequent translation (37%). A less strict evaluation criterion considering the 10 best ranked translations showed the potential for this approach to be used as an extra knowledge source for WSD: the correct translation was among the top 10 results in 92% of the cases.
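The two evaluation criteria mentioned in this abstract (strict accuracy against a most-frequent-translation baseline, and a relaxed top-k criterion) can be illustrated with a minimal sketch. The function names and the toy data below are hypothetical and are not taken from the paper; only the general idea of the metrics is shown.

```python
from collections import Counter


def most_frequent_baseline(train_labels):
    """Return the translation seen most often in the training labels."""
    return Counter(train_labels).most_common(1)[0][0]


def top_k_accuracy(ranked_predictions, gold, k=1):
    """Fraction of instances whose gold translation is among the top-k candidates."""
    hits = sum(1 for ranked, g in zip(ranked_predictions, gold) if g in ranked[:k])
    return hits / len(gold)


# Toy, made-up data (assumption): Portuguese translations of one English verb.
train = ["vir", "chegar", "vir", "vir", "chegar"]
gold = ["vir", "chegar", "vir", "ocorrer"]
ranked = [
    ["vir", "chegar"],
    ["vir", "chegar"],
    ["chegar", "vir"],
    ["vir", "ocorrer"],
]

baseline = most_frequent_baseline(train)
# The baseline always predicts the same single translation.
baseline_acc = top_k_accuracy([[baseline]] * len(gold), gold, k=1)
strict_acc = top_k_accuracy(ranked, gold, k=1)   # only the best candidate counts
relaxed_acc = top_k_accuracy(ranked, gold, k=2)  # any of the top 2 counts
```

With k=10 the relaxed criterion corresponds to the "10 best ranked translations" evaluation described in the abstract.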
Parallel Corpora based Translation Resources Extraction
Cited by 1 (1 self)
This paper describes NATools, a toolkit to process, analyze and extract translation resources from parallel corpora. It includes a sentence aligner, a word aligner, a probabilistic translation dictionary extractor, a corpus server, a set of tools to query corpora and dictionaries, and a set of tools to extract bilingual resources. Keywords: parallel corpora, bilingual resources, machine translation.
Parallel corpus-based bilingual terminology extraction
Cited by 1 (1 self)
This paper presents a parallel corpora-based bilingual terminology extraction method based on the occurrence of bilingual morphosyntactic patterns in probabilistic translation dictionaries. We discuss an experiment focused on two language pairs, English-Galician and English-Portuguese, and show results which experimentally confirm the high degree of accuracy of the proposed extraction technique.
Automatic Parallel Corpora and Bilingual Terminology extraction from Parallel WebSites
Cited by 1 (0 self)
The importance of parallel corpora is now so well established that it needs no special introduction. Unfortunately, publicly available parallel corpora are somewhat limited in range. There are big corpora about politics or legislation, about medicine and other specific areas, but corpora for other areas are missing. Currently there is a huge investment in using the Web as a corpus. This article presents GWB, a tool that aims at the automatic construction of parallel corpora from the Web. We argue that it is possible to build high-quality terminological corpora automatically, just by specifying a sensible Internet domain and using an appropriate set of seed keywords. GWB is a web spider that works in conjunction with a set of other open-source tools, defining a pipeline that includes document retrieval from the Web, sentence-level alignment and its quality analysis, extraction of bilingual dictionaries and terminology, and construction of off-line dictionaries.
Mining Rules for Word Sense Disambiguation
, 2005
This paper describes the automatic generation and the evaluation of sets of rules for word sense disambiguation (WSD) in machine translation. The ultimate aim is to identify high-quality rules that can be used as knowledge sources in a relational WSD model. The evaluation was carried out both automatically, by means of four objective measures (error, coverage, support and novelty), and manually, by means of a subjective analysis of the level of interest of the best rules as pointed out by the objective measures. As a result, we selected 63 rules addressing seven highly ambiguous verbs. The evaluation also showed which kinds of knowledge were effectively used by the WSD rules, which are not always the same as those revealed by traditional evaluations of complete WSD models.
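The four objective measures named in this abstract are not defined in the listing. As a rough sketch, the code below uses definitions that are common in the rule-evaluation literature (an assumption, not necessarily the ones the authors used): coverage is the share of examples on which the rule fires, support is the share on which it fires and is correct, error is the share of covered examples it gets wrong, and novelty is P(body and head) - P(body) * P(head). The example format and the sample rule are hypothetical.

```python
def rule_measures(examples, condition, predicted_sense):
    """Compute coverage, support, error and novelty for one rule.

    `examples` is a list of dicts with a "sense" key (assumed format);
    `condition` is the rule body, a predicate over one example;
    `predicted_sense` is the rule head.
    """
    n = len(examples)
    covered = [ex for ex in examples if condition(ex)]
    correct = [ex for ex in covered if ex["sense"] == predicted_sense]
    with_sense = [ex for ex in examples if ex["sense"] == predicted_sense]

    coverage = len(covered) / n
    support = len(correct) / n
    # Error is measured only over the examples the rule actually covers.
    error = 1 - len(correct) / len(covered) if covered else 0.0
    # Novelty: how far the rule departs from body/head independence.
    novelty = support - coverage * (len(with_sense) / n)
    return {"coverage": coverage, "support": support,
            "error": error, "novelty": novelty}


# Toy, made-up examples for the verb "take" (assumption).
examples = [
    {"object": "bus", "sense": "pegar"},
    {"object": "bus", "sense": "pegar"},
    {"object": "bus", "sense": "levar"},
    {"object": "time", "sense": "levar"},
]

# Hypothetical rule: if the object is "bus", translate as "pegar".
measures = rule_measures(examples, lambda ex: ex["object"] == "bus", "pegar")
```

A rule with high coverage but low novelty mostly restates the prior distribution of senses, which is one reason several measures are usually inspected together.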

