A Word-to-Word Model of Translational Equivalence
, 1997
Cited by 73 (6 self)
Many multilingual NLP applications need to translate words between different languages, but cannot afford the computational expense of inducing or applying a full translation model. For these applications, we have designed a fast algorithm for estimating a partial translation model, which accounts for translational equivalence only at the word level. The model's precision/recall trade-off can be directly controlled via one threshold parameter. This feature makes the model more suitable for applications that are not fully statistical. The model's hidden parameters can be easily conditioned on information extrinsic to the model, providing an easy way to integrate pre-existing knowledge such as part-of-speech, dictionaries, word order, etc. Our model can link word tokens in parallel texts as well as other translation models in the literature. Unlike other translation models, it can automatically produce dictionary-sized translation lexicons, and it can do so with over 99% accuracy.
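As a rough illustration of the single-threshold control mentioned in the abstract above, the sketch below filters a scored translation lexicon. The scores, the word pairs, and the function name `filter_lexicon` are hypothetical; the abstract does not say how the model computes its scores, so this only shows how raising one threshold trades recall for precision.

```python
def filter_lexicon(scored_pairs, threshold):
    """Keep only word pairs whose (hypothetical) score meets the threshold."""
    return {pair: score for pair, score in scored_pairs.items() if score >= threshold}


# Hypothetical scores for candidate English-French word pairs.
candidates = {
    ("house", "maison"): 0.95,
    ("house", "chambre"): 0.40,
    ("dog", "chien"): 0.90,
}

strict = filter_lexicon(candidates, 0.9)   # fewer entries, likely higher precision
lenient = filter_lexicon(candidates, 0.3)  # more entries, likely higher recall
```

A higher threshold keeps fewer but more reliable entries; a lower one keeps more entries at the cost of more noise.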
Identifying idiomatic expressions using automatic word alignment
- Proceedings of the EACL 2006 Workshop on Multiword Expressions in
, 2006
Cited by 15 (1 self)
For NLP applications that require some sort of semantic interpretation it would be helpful to know what expressions exhibit an idiomatic meaning and what expressions exhibit a literal meaning. We investigate whether automatic word-alignment in existing parallel corpora facilitates the classification of candidate expressions along a continuum ranging from literal and transparent expressions to idiomatic and opaque expressions. Our method relies on two criteria: (i) meaning predictability that is measured as semantic entropy and (ii) the overlap between the meaning of an expression and the meaning of its component words. We approximate the mentioned overlap as the proportion of default alignments. We obtain a significant improvement over the baseline with both measures.
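The two criteria named in the abstract above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' formulation: it assumes semantic entropy is the Shannon entropy of the translations aligned to an expression, and that a "default alignment" is an alignment of a component word to one designated default translation. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(aligned_translations):
    """Shannon entropy (in bits) of the translations aligned to one expression."""
    counts = Counter(aligned_translations)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def default_alignment_proportion(component_alignments, default_translation):
    """Share of a component word's alignments that match its assumed default translation."""
    if not component_alignments:
        return 0.0
    hits = sum(1 for t in component_alignments if t == default_translation)
    return hits / len(component_alignments)


# Hypothetical alignment data: two translations, equally frequent.
entropy = semantic_entropy(["a", "a", "b", "b"])
overlap = default_alignment_proportion(["a", "b", "a", "a"], "a")
```

Under these assumptions, a higher entropy would indicate a less predictable meaning, and a low proportion of default alignments would indicate little overlap with the meaning of the component words.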
Unsupervised type and token identification of idiomatic expressions
- Computational Linguistics
, 2009
Cited by 11 (1 self)
Idiomatic expressions are plentiful in everyday language, yet they remain mysterious, as it is not clear exactly how people learn and understand them. They are of special interest to linguists, psycholinguists, and lexicographers, mainly because of their syntactic and semantic idiosyncrasies as well as their unclear lexical status. Despite a great deal of research on the properties of idioms in the linguistics literature, there is not much agreement on which properties are characteristic of these expressions. Because of their peculiarities, idiomatic expressions have mostly been overlooked by researchers in computational linguistics. In this article, we look into the usefulness of some of the identified linguistic properties of idioms for their automatic recognition. Specifically, we develop statistical measures that each model a specific property of idiomatic expressions by looking at their actual usage patterns in text. We use these statistical measures in a type-based classification task where we automatically separate idiomatic expressions (expressions with a possible idiomatic interpretation) from similar-on-the-surface literal phrases (for which no idiomatic interpretation is possible). In addition, we use some of the measures in a token identification task where we distinguish idiomatic and literal usages of potentially-idiomatic expressions in context.
A Scalable Architecture for Bilingual Lexicography
- University of Pennsylvania
, 1997
Cited by 7 (1 self)
SABLE (Scalable Architecture for Bilingual LExicography) is a turn-key system for producing clean broad-coverage translation lexicons from raw, unaligned parallel texts (bitexts). SABLE is designed to work for any text genre, in any pair of languages. As long as the input texts are mutual translations, the relative word order of the input languages makes no difference. No SABLE component makes any assumptions about the kinds of text units in the input: no component makes any use of sentence boundaries. SABLE was designed with the following features in mind:
• Black box functionality: Automatic construction of translation lexicons requires only that the user provide the input bitexts and identify the two languages involved.
• Robustness: SABLE copes well with omissions and inversions in translations.
• Scalability: SABLE has been used successfully on bitexts larger than 130MB.
Punctuation: Making a Point in Unsupervised Dependency Parsing
Cited by 5 (4 self)
We show how punctuation can be used to improve unsupervised dependency parsing. Our linguistic analysis confirms the strong connection between English punctuation and phrase boundaries in the Penn Treebank. However, approaches that naively include punctuation marks in the grammar (as if they were words) do not perform well with Klein and Manning’s Dependency Model with Valence (DMV). Instead, we split a sentence at punctuation and impose parsing restrictions over its fragments. Our grammar inducer is trained on the Wall Street Journal (WSJ) and achieves 59.5% accuracy out-of-domain (Brown sentences with 100 or fewer words), more than 6% higher than the previous best results. Further evaluation, using the 2006/7 CoNLL sets, reveals that punctuation aids grammar induction in 17 of 18 languages, for an overall average net gain of 1.3%. Some of this improvement is from training, but more than half is from parsing with induced constraints, in inference. Punctuation-aware decoding works with existing (even already-trained) parsing models and always increased accuracy in our experiments.
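The splitting step mentioned in the abstract above might look like the sketch below. It covers only the segmentation of a tokenized sentence into punctuation-delimited fragments; the parsing restrictions imposed over those fragments are not described in the abstract and are not modeled here. The punctuation inventory and the function name `split_at_punctuation` are assumptions made for illustration.

```python
# Assumed punctuation inventory (Penn Treebank style tokens); not taken from the paper.
PUNCTUATION = {",", ".", ";", ":", "?", "!", "(", ")", "--", "``", "''"}


def split_at_punctuation(tokens, punctuation=PUNCTUATION):
    """Split a token list into fragments, dropping the punctuation tokens themselves."""
    fragments, current = [], []
    for tok in tokens:
        if tok in punctuation:
            if current:  # close the fragment that ended at this punctuation mark
                fragments.append(current)
                current = []
        else:
            current.append(tok)
    if current:  # trailing fragment with no closing punctuation
        fragments.append(current)
    return fragments


fragments = split_at_punctuation("We came , we saw .".split())
```

A constrained parser could then treat each fragment as a unit when deciding which attachments to allow.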
Cross-lingual bootstrapping for semantic lexicons
- In Proceedings of the Spring Symposia of the American Association for Artificial Intelligence (AAAI
, 2005
Cited by 4 (1 self)
This paper considers the problem of unsupervised semantic lexicon acquisition. We introduce a fully automatic approach which exploits parallel corpora, relies on shallow text properties, and is relatively inexpensive. Given the English FrameNet lexicon, our method exploits word alignments to generate frame candidate lists for new languages, which are subsequently pruned automatically using a small set of linguistically motivated filters. Evaluation shows that our approach can produce high-precision multilingual FrameNet lexicons without recourse to bilingual dictionaries or deep syntactic and semantic analysis.
An Ontological-Semantic Framework for Text Analysis
, 1997
Cited by 3 (0 self)
The Knowledge-Based Machine Translation paradigm requires a comprehensive analysis of input texts into an unambiguous machine-tractable representation of the propositional and meta-propositional meaning of that text, for which we use a particular framework referred to as ontological semantics. The work presented here begins with a definition of a representation language for lexical semantic specification (and syntax/semantics interface) to support such an analysis, as well as a generalized algorithm for building the meaning representation from these lexical semantic specifications, utilizing the ontology and a syntactic parse as knowledge sources. The core of the algorithm is a procedure for semantic constraint satisfaction and relaxation, which involves finding the best path over the ontology between a candidate filler of a relation and the semantic constraints on that relation. The ontology is viewed as a multi-dimensional graph, with distinct topologies in each dimension reflecting specific semantic relations between nodes (representing concepts), where weights or arc distances reflect the strength of semantic relatedness in context (where the path-so-far context is maintained in a state transition table).
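To make the "best path over the ontology" idea in the abstract above concrete, here is a minimal sketch that uses Dijkstra's algorithm over a weighted graph. This is a stand-in technique, not the algorithm from the paper: the abstract describes context-dependent weights tracked in a state transition table, which this sketch does not model. The toy ontology, its arc weights, and the function name `best_path_cost` are hypothetical.

```python
import heapq


def best_path_cost(graph, source, target):
    """Lowest total arc weight from source to target (Dijkstra); inf if unreachable."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return float("inf")


# Hypothetical ontology fragment: arcs point from a concept to a related concept.
ontology = {
    "dog": {"mammal": 1.0, "pet": 0.5},
    "pet": {"animal": 2.0},
    "mammal": {"animal": 1.0},
    "animal": {},
}

cost = best_path_cost(ontology, "dog", "animal")
```

In this toy setting, a lower path cost between a candidate filler and a constraint concept would count as a better match, and constraint relaxation could correspond to accepting higher costs.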
Automatic Acquisition of Lexical Knowledge about Multiword Predicates
, 2007
Cited by 3 (0 self)
A multiword predicate is the combination of a predicate (often a verb) with one or more of its arguments, that together form a single unit of predicative meaning. We focus on a broad class of multiword predicates, in which a verb combines with a noun in the direct object position (e.g., give a groan and shoot the breeze). The semantic interpretation of such multiword predicates involves a certain degree of idiosyncrasy; moreover, they are crosslinguistically frequent and appear in all text genres. Hence, they pose a great challenge to the current models of natural language processing. Most existing computational models treat multiword predicates as syntactically-dependent word sequences or collocations. Such a treatment ignores other important characteristics of these constructions, reflected in their distinct lexical and syntactic behaviour. Nonetheless, cues from the lexicosyntactic properties of multiword predicates have often been used in linguistic and psycholinguistic studies to explain their peculiar semantic behaviour. On the one hand, simple statistical approaches that only draw on the frequency of multiword predicates fail to account for much of the syntactic and semantic behaviour of these constructions. On the other hand, linguistic theories provide generalizations about the behaviour of multiword predicates that can be augmented with probabilistic knowledge about language in use. The main goal of the present study is to propose ways of combining the predictive power of linguistic theories with the coverage and robustness of statistical techniques to acquire linguistically-plausible and reliable corpus-drawn knowledge about multiword predicates.
Automatic Validation of Terminology Translation Consistency with Statistical Method
This paper presents a novel method to automatically validate terminology consistency in localized materials. The goal of the paper is two-fold. First, we explore a way to extract phrase pair translations for compound nouns from a bilingual corpus using word alignment data. To validate the quality of the extracted phrase pair translations, we use a Gaussian mixture model (GMM) classifier. Second, we quantify consistency of translation as a measurement of quality. With this approach, a quality assurance process for terminology translation can be fully automated. It can also be used for maintaining bilingual training data quality for machine translation.
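One simple way to picture "quantifying consistency of translation" from the abstract above is the share of a term's occurrences that use its most frequent translation. This definition is an assumption made for illustration; the abstract does not give the paper's actual measure, and the function name and sample data are hypothetical.

```python
from collections import Counter


def translation_consistency(translations):
    """Fraction of occurrences that use the most frequent translation of a term."""
    if not translations:
        return 0.0
    counts = Counter(translations)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(translations)


# Hypothetical translations observed for one source term across a localized corpus.
score = translation_consistency(["Datei", "Datei", "File", "Datei"])
```

Under this assumed definition, a score of 1.0 would mean the term is always translated the same way, and lower scores would flag terms worth reviewing.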
Identification of Idiomatic Expressions Using Parallel Corpora
, 2008
I hereby declare that I have written this thesis independently and have used no sources or aids other than those indicated.

