Results 11 - 20 of 173
Finding Terminology Translations From Non-Parallel Corpora
1997
Cited by 34 (3 self)
In this paper, we present an initial algorithm for translating technical terms using a pair of non-parallel corpora. Evaluation results show translation precision of around 30% when only the top candidate is considered. While this precision is lower than that achieved with parallel corpora, we show that the top-20 candidate output from our algorithm allows translators to increase their accuracy by 50.9%. In the following sections, we first describe a pair of non-parallel corpora we use for experiments, and then we introduce the Word Relation Matrix (WoRM), a statistical word feature representation for technical term translation from non-parallel corpora. We evaluate the effectiveness of this feature with two sets of experiments, using English/English and English/Japanese non-parallel corpora.
An Unsupervised Iterative Method for Chinese New Lexicon Extraction
International Journal of Computational Linguistics & Chinese Language Processing, 1997
Cited by 32 (3 self)
An unsupervised iterative approach for extracting a new lexicon (or unknown words) from a Chinese text corpus is proposed in this paper. Instead of using a non-iterative segmentation-merging-filtering-and-disambiguation approach, the proposed method iteratively integrates the contextual constraints (among word candidates) and a joint character association metric to progressively improve the segmentation results of the input corpus (and thus the new word list). An augmented dictionary, which includes potential unknown words (in addition to known words), is used to segment the input corpus, unlike traditional approaches, which use only known words for segmentation. In the segmentation process, the augmented dictionary is used to impose contextual constraints over known words and potential unknown words within input sentences; an unsupervised Viterbi Training process is then applied to ensure that the selected potential unknown words (and known words) maximize the likelihood of the input ...
A Language Model Approach to Keyphrase Extraction
In Proceedings of ACL Workshop on Multiword Expressions, 2003
Cited by 30 (0 self)
We present a new approach to extracting keyphrases based on statistical language models. Our approach is to use pointwise KL-divergence between multiple language models for scoring both phraseness and informativeness, which can be unified into a single score to rank extracted phrases.
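As a rough illustration of the scoring idea in this abstract, the sketch below computes a pointwise KL-divergence term, p * log(p / q), for a single candidate phrase. The abstract does not say which language models are compared or how the two scores are unified, so the choices here are assumptions: phraseness compares a bigram probability with the product of its unigram probabilities, informativeness compares a foreground-corpus probability with a background-corpus probability, and the two are simply added. All probabilities are made-up values.

```python
import math


def pointwise_kl(p: float, q: float) -> float:
    """Contribution of one event to KL(p || q): p * log(p / q)."""
    return p * math.log(p / q)


# Hypothetical probabilities for one candidate bigram (illustrative values only).
p_fg_bigram = 0.002    # bigram probability under a foreground-corpus model
p_fg_word1 = 0.01      # unigram probability of the first word (foreground)
p_fg_word2 = 0.02      # unigram probability of the second word (foreground)
p_bg_bigram = 0.0001   # bigram probability under a background-corpus model

# Assumed phraseness: how much the bigram model gains over independent unigrams.
phraseness = pointwise_kl(p_fg_bigram, p_fg_word1 * p_fg_word2)

# Assumed informativeness: how much the foreground model gains over the background.
informativeness = pointwise_kl(p_fg_bigram, p_bg_bigram)

# Assumed unification: a plain sum of the two scores.
score = phraseness + informativeness
```

Candidates would then be ranked by `score`; a phrase whose foreground probability equals its comparison probability contributes zero.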
Large-Scale Automatic Extraction of an English-Chinese Translation Lexicon
Machine Translation, 1995
Cited by 30 (1 self)
We report experimental results on automatic extraction of an English-Chinese translation lexicon, by statistical analysis of a large parallel corpus, using limited amounts of linguistic knowledge. To our knowledge, these are the first empirical results of the kind between an Indo-European and non-Indo-European language for any significant vocabulary and corpus size. The learned vocabulary size is about 6,500 English words, achieving translation precision in the 86-96% range, with alignment proceeding at paragraph, sentence, and word levels. Specifically, we report (1) progress on the HKUST English-Chinese Parallel Bilingual Corpus, (2) experiments supporting the usefulness of restricted lexical cues for statistical paragraph and sentence alignment, and (3) experiments that question the role of hand-derived monolingual lexicons for automatic word translation acquisition. Using a hand-derived monolingual lexicon, the learned translation lexicon averages 2.33 Chinese translations per English entry, with a manually-filtered precision of 95.1%, and an automatically-filtered weighted precision of 86.0%. We then introduce a fully automatic two-stage statistical methodology that is able to learn translations for collocations. A statistically-learned monolingual Chinese lexicon is first used to segment the Chinese text, before applying bilingual training to produce 6,429 English entries with 2.25 Chinese translations per entry. This method improves the manually-filtered precision to 96.0% and the automatically-filtered weighted precision to 91.0%, an error rate reduction of 35.7% from using a hand-derived monolingual lexicon.
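The 35.7% error rate reduction quoted in this abstract can be reproduced from the two automatically-filtered weighted precisions (86.0% and 91.0%), assuming error rate is taken as one minus precision. The helper name below is illustrative, not from the paper.

```python
def error_rate_reduction(old_precision: float, new_precision: float) -> float:
    """Relative drop in error rate, where error rate = 1 - precision."""
    old_error = 1.0 - old_precision
    new_error = 1.0 - new_precision
    return (old_error - new_error) / old_error


# Automatically-filtered weighted precision: 86.0% (hand-derived lexicon)
# versus 91.0% (statistically-learned lexicon), as stated in the abstract.
reduction = error_rate_reduction(0.860, 0.910)
print(f"{reduction:.1%}")  # prints 35.7%
```

The error rate falls from 14.0% to 9.0%, and 5.0 / 14.0 is about 0.357.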
Extracting the Unextractable: A Case Study on Verb-particles
In Proc. of the 6th Conference on Natural Language Learning (CoNLL-2002), 2002
Cited by 30 (7 self)
This paper proposes a series of techniques for extracting English verb-particle constructions from raw text corpora. We initially propose three basic methods, based on tagger output, chunker output and a chunk grammar, respectively, with the chunk grammar method optionally combining with an attachment resolution module to determine the syntactic structure of verb-preposition pairs in ambiguous constructs. We then combine the three methods together into a single classifier, and add in a number of extra lexical and frequentistic features, producing a final F-score of 0.865 over the WSJ.
Corpus-based learning of analogies and semantic relations
Machine Learning, 2005
Cited by 28 (8 self)
We present an algorithm for learning from unlabeled text, based on the Vector Space Model (VSM) of information retrieval, that can solve verbal analogy questions of the kind found in the SAT college entrance exam. A verbal analogy has the form A:B::C:D, meaning “A is to B as C is to D”; for example, mason:stone::carpenter:wood. SAT analogy questions provide a word pair, A:B, and the problem is to select the most analogous word pair, C:D, from a set of five choices. The VSM algorithm correctly answers 47% of a collection of 374 college-level analogy questions (random guessing would yield 20% correct; the average college-bound senior high school student answers about 57% correctly). We motivate this research by applying it to a difficult problem in natural language processing, determining semantic relations in noun-modifier pairs. The problem is to classify a noun-modifier pair, such as “laser printer”, according to the semantic relation between the noun (printer) and the modifier (laser). We use a supervised nearest-neighbour algorithm that assigns a class to a given noun-modifier pair by finding the most analogous noun-modifier pair in the training data. With 30 classes of semantic relations, on a collection of 600 labeled noun-modifier pairs, the learning algorithm attains an F value of 26.5% (random guessing: 3.3%). With 5 classes of semantic relations, the F value is 43.2% (random: 20%). The performance is state-of-the-art for both verbal analogies and noun-modifier relations.
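A minimal sketch of the selection step described in this abstract: each word pair is represented as a vector, and the choice whose vector is closest to the stem pair is picked. The abstract does not say how the vectors are built or which similarity measure is used, so cosine similarity is an assumption here, the feature vectors are invented numbers, and every choice pair other than carpenter:wood is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_analogous(stem_vector, choices):
    """Return the choice label whose vector is most similar to the stem vector."""
    return max(choices, key=lambda label: cosine(stem_vector, choices[label]))


# Made-up feature vectors for the stem pair mason:stone and five choices.
stem = [3.0, 0.0, 1.0]
choices = {
    "teacher:chalk": [0.0, 2.0, 1.0],
    "carpenter:wood": [6.0, 0.0, 2.5],
    "soldier:gun": [1.0, 1.0, 1.0],
    "photograph:camera": [0.0, 1.0, 0.0],
    "book:word": [0.5, 3.0, 0.0],
}

best = most_analogous(stem, choices)
```

With five choices per question, picking uniformly at random gives the 20% baseline the abstract mentions.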
Empirical Observation of Term Variations and Principles for their Description
2000
Cited by 23 (0 self)
Contents: 1 Introduction; 1.1 Do terms vary?; 1.2 A Symbolic Framework for the Study of Terminological Variation; 2 The Most Common Types of English Two-word Terms; 2.1 Adjective Noun (A N); 2.2 Noun Noun (N2 N1); 2.3 Noun Preposition Noun (N1 P N2); 3 Observing and Representing Term Variants; 3.1 An Observation of Term Variants; 3.2 A Two-level Lexico-syntactic Description of Terms; 3.3 Two Families of Grammatical Rules ...
Corpus-Derived First, Second and Third-Order Word Affinities
In Proceedings of Euralex, 1994
Cited by 22 (1 self)
A number of corpus-based extraction techniques have been successfully implemented which
Term Extraction and Automatic Indexing
2003
Cited by 22 (0 self)
This chapter presents a new domain of research and development in Natural Language Processing (NLP) that is concerned with the representation, acquisition, and recognition of terms. Terms are pervasive in scientific and technical documents; their identification is a crucial issue for any application dealing with the analysis, understanding, generation, or translation of such documents. In particular, the ever-growing mass of specialized documentation available on-line, in industrial and governmental archives or in digital libraries, calls for advances in terminology processing for such purposes as information retrieval, cross-language querying, indexing of multimedia documents, translation aids, document routing and summarization, etc. This chapter introduces the basic linguistic characteristics of terms. It presents the main methods in NLP for recognizing or discovering terms and their interrelationships in large corpora. It is divided into three sections: an introduction to the bas...
Automatically constructing a lexicon of verb phrase idiomatic combinations
In Proceedings of EACL-06, 2006
Cited by 21 (4 self)
We investigate the lexical and syntactic flexibility of a class of idiomatic expressions. We develop measures that draw on such linguistic properties, and demonstrate that these statistical, corpus-based measures can be successfully used for distinguishing idiomatic combinations from non-idiomatic ones. We also propose a means for automatically determining which syntactic forms a particular idiom can appear in, and hence should be included in its lexical representation.

