Results 1 - 10
of
193
Vector-based models of semantic composition
- In Proceedings of ACL-08: HLT
, 2008
"... This paper proposes a framework for representing the meaning of phrases and sentences in vector space. Central to our approach is vector composition which we operationalize in terms of additive and multiplicative functions. Under this framework, we introduce a wide range of composition models which ..."
Abstract
-
Cited by 220 (5 self)
- Add to MetaCart
(Show Context)
This paper proposes a framework for representing the meaning of phrases and sentences in vector space. Central to our approach is vector composition which we operationalize in terms of additive and multiplicative functions. Under this framework, we introduce a wide range of composition models which we evaluate empirically on a sentence similarity task. Experimental results demonstrate that the multiplicative models are superior to the additive alternatives when compared against human judgments.
Corpus-based and knowledge-based measures of text semantic similarity
- IN IPROCEEDINGS OF THE 21ST NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE - VOLUME 1
, 2006
"... This paper presents a method for measuring the semantic similarity of texts, using corpus-based and knowledge-based measures of similarity. Previous work on this problem has focused mainly on either large documents (e.g. text classification, information retrieval) or individual words (e.g. synonymy ..."
Abstract
-
Cited by 151 (4 self)
- Add to MetaCart
(Show Context)
This paper presents a method for measuring the semantic similarity of texts, using corpus-based and knowledge-based measures of similarity. Previous work on this problem has focused mainly on either large documents (e.g. text classification, information retrieval) or individual words (e.g. synonymy tests). Given that a large fraction of the information available today, on the Web and elsewhere, consists of short text snippets (e.g. abstracts of scientific documents, imagine captions, product descriptions), in this paper we focus on measuring the semantic similarity of short texts. Through experiments performed on a paraphrase data set, we show that the semantic similarity method outperforms methods based on simple lexical matching, resulting in up to 13 % error rate reduction with respect to the traditional vector-based similarity metric.
Posterior regularization for structured latent variable models
- Journal of Machine Learning Research
, 2010
"... We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model co ..."
Abstract
-
Cited by 138 (8 self)
- Add to MetaCart
(Show Context)
We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction,
Improved statistical machine translation using paraphrases
- In Proceedings of HLT/NAACL-2006
, 2006
"... Parallel corpora are crucial for training SMT systems. However, for many language pairs they are available only in very limited quantities. For these language pairs a huge portion of phrases encountered at run-time will be unknown. We show how techniques from paraphrasing can be used to deal with th ..."
Abstract
-
Cited by 117 (3 self)
- Add to MetaCart
Parallel corpora are crucial for training SMT systems. However, for many language pairs they are available only in very limited quantities. For these language pairs a huge portion of phrases encountered at run-time will be unknown. We show how techniques from paraphrasing can be used to deal with these otherwise unknown source language phrases. Our results show that augmenting a stateof-the-art SMT system with paraphrases leads to significantly improved coverage and translation quality. For a training corpus with 10,000 sentence pairs we increase the coverage of unique test set unigrams from 48 % to 90%, with more than half of the newly covered items accurately translated, as opposed to none in current approaches. 1
A survey of statistical machine translation
, 2007
"... Statistical machine translation (SMT) treats the translation of natural language as a machine learning problem. By examining many samples of human-produced translation, SMT algorithms automatically learn how to translate. SMT has made tremendous strides in less than two decades, and many popular tec ..."
Abstract
-
Cited by 93 (6 self)
- Add to MetaCart
Statistical machine translation (SMT) treats the translation of natural language as a machine learning problem. By examining many samples of human-produced translation, SMT algorithms automatically learn how to translate. SMT has made tremendous strides in less than two decades, and many popular techniques have only emerged within the last few years. This survey presents a tutorial overview of state-of-the-art SMT at the beginning of 2007. We begin with the context of the current research, and then move to a formal problem description and an overview of the four main subproblems: translational equivalence modeling, mathematical modeling, parameter estimation, and decoding. Along the way, we present a taxonomy of some different approaches within these areas. We conclude with an overview of evaluation and notes on future directions.
Statistical Machine Translation for Query Expansion in Answer Retrieval
"... We present an approach to query expansion in answer retrieval that uses Statistical Machine Translation (SMT) techniques to bridge the lexical gap between questions and answers. SMT-based query expansion is done by i) using a full-sentence paraphraser to introduce synonyms in context of the entire q ..."
Abstract
-
Cited by 74 (4 self)
- Add to MetaCart
We present an approach to query expansion in answer retrieval that uses Statistical Machine Translation (SMT) techniques to bridge the lexical gap between questions and answers. SMT-based query expansion is done by i) using a full-sentence paraphraser to introduce synonyms in context of the entire query, and ii) by translating query terms into answer terms using a full-sentence SMT model trained on question-answer pairs. We evaluate these global, context-aware query expansion techniques on tfidf retrieval from 10 million question-answer pairs extracted from FAQ pages. Experimental results show that SMTbased expansion improves retrieval performance over local expansion and over retrieval without expansion. 1
Syntactic Constraints on Paraphrases Extracted from Parallel Corpora
"... ccb cs jhu edu We improve the quality of paraphrases extracted from parallel corpora by requiring that phrases and their paraphrases be the same syntactic type. This is achieved by parsing the English side of a parallel corpus and altering the phrase extraction algorithm to extract phrase labels alo ..."
Abstract
-
Cited by 65 (10 self)
- Add to MetaCart
(Show Context)
ccb cs jhu edu We improve the quality of paraphrases extracted from parallel corpora by requiring that phrases and their paraphrases be the same syntactic type. This is achieved by parsing the English side of a parallel corpus and altering the phrase extraction algorithm to extract phrase labels alongside bilingual phrase pairs. In order to retain broad coverage of non-constituent phrases, complex syntactic labels are introduced. A manual evaluation indicates a 19% absolute improvement in paraphrase quality over the baseline method. 1
Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems
"... This paper describes Meteor 1.3, our submission ..."
Collecting Highly Parallel Data for Paraphrase Evaluation
"... A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly paralle ..."
Abstract
-
Cited by 59 (8 self)
- Add to MetaCart
(Show Context)
A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments. 1
A Survey of Paraphrasing and Textual Entailment Methods
, 2010
"... Paraphrasing methods recognize, generate, or extract phrases, sentences, or longer natural language expressions that convey almost the same information. Textual entailment methods, on the other hand, recognize, generate, or extract pairs of natural language expressions, such that a human who reads ( ..."
Abstract
-
Cited by 57 (3 self)
- Add to MetaCart
(Show Context)
Paraphrasing methods recognize, generate, or extract phrases, sentences, or longer natural language expressions that convey almost the same information. Textual entailment methods, on the other hand, recognize, generate, or extract pairs of natural language expressions, such that a human who reads (and trusts) the first element of a pair would most likely infer that the other element is also true. Paraphrasing can be seen as bidirectional textual entailment and methods from the two areas are often similar. Both kinds of methods are useful, at least in principle, in a wide range of natural language processing applications, including question answering, summarization, text generation, and machine translation. We summarize key ideas from the two areas by considering in turn recognition, generation, and extraction methods, also pointing to prominent articles and resources.