Identification of Parallel Text Pairs Using Fingerprints
Abstract
When creating dictionaries for use in, for example, cross-language search engines, one often uses a word alignment system that takes parallel or comparable text pairs as input and produces a word list. Multilingual web sites may contain parallel texts, but these can be difficult to detect. In this article we describe an experiment on the automatic identification of parallel text pairs. We use the frequency distribution of word-initial letters to map a text in one language to its corresponding text in another, using the JRC-Acquis corpus (European Council legal texts). With English and Swedish as the language pair and a ten-fold random pairing, the algorithm made 87 percent correct matches (random baseline: 50 percent). When the task was to pick the correct text from ten candidates, nine randomly chosen false matches and one true match, the success rate was 68 percent (random baseline: 10 percent).
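
As a rough illustration of the fingerprint idea, the sketch below builds a word-initial letter frequency vector for each text and picks the candidate whose vector is closest to the source. The abstract does not say how fingerprints are compared or how language-specific letters are handled, so the cosine similarity, the restriction to the letters a-z, and the names `fingerprint`, `cosine`, and `best_match` are all assumptions made for this example, not the authors' method.

```python
from collections import Counter
import math
import string

# Assumption: only the 26 ASCII letters are counted; Swedish initials
# such as å, ä and ö are ignored in this simplified sketch.
ALPHABET = string.ascii_lowercase


def fingerprint(text):
    """Return the relative frequency of each word-initial letter (a-z)."""
    initials = [w[0].lower() for w in text.split() if w[0].lower() in ALPHABET]
    counts = Counter(initials)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [counts[ch] / total for ch in ALPHABET]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(source_text, candidates):
    """Return the index of the candidate whose fingerprint is most similar
    to the source text's fingerprint (assumed comparison: cosine similarity)."""
    src = fingerprint(source_text)
    scores = [cosine(src, fingerprint(c)) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)
```

Calling `best_match(english_text, swedish_candidates)` with two candidates would correspond to the pairwise setting, and with ten candidates to the one-in-ten setting described in the abstract; the reported accuracy figures come from the authors' experiment and should not be expected from this simplified sketch.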

