Results 1 - 3 of 3
The Web as a Parallel Corpus
- Computational Linguistics, 2003
Cited by 101 (3 self)
Parallel corpora have become an essential resource for work in multilingual natural language processing. In this report, we describe our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale.
Automatic association of web directories to word senses
- Computational Linguistics, 2003
Cited by 6 (1 self)
We describe an algorithm that combines lexical information (from WordNet 1.7) with Web directories (from the Open Directory Project) to associate word senses with such directories. Such associations can be used as rich characterizations to acquire sense-tagged corpora automatically, cluster topically related senses, and detect sense specializations. The algorithm is evaluated for the 29 nouns (147 senses) used in the Senseval 2 competition, obtaining 148 (word sense, Web directory) associations covering 88% of the domain-specific word senses in the test data with 86% accuracy. The richness of Web directories as sense characterizations is evaluated in a supervised Word Sense Disambiguation task using the Senseval 2 test suite. The results indicate that, when the directory/word sense association is correct, the samples automatically acquired from the Web directories are nearly as valid for training as the original Senseval 2 training instances. The results support our hypothesis that Web directories are a rich source of lexical information: cleaner, more reliable, and more structured than the full Web as a corpus.
Collecting Polish-German Parallel Corpora in the Internet
Parallel corpora have recently become indispensable resources in multilingual natural language processing. Manual preparation of a bilingual corpus is a laborious task. Therefore, methods for the automated creation of parallel corpora are currently a topic of concern for many researchers. A number of sophisticated and effective algorithms for collecting parallel texts from the Internet have already been created. The aim of the research has been to verify the efficiency of existing algorithms for the collection of Polish-German parallel corpora, intended as a reference source for a Machine Translation system, and, possibly, to propose a new algorithm best suited to the task.

