Results 1 - 10 of 18
The VNC-Tokens Dataset
In Proceedings of the MWE Workshop, ACL, 2008
"... Idiomatic expressions formed from a verb and a noun in its direct object position are a productive cross-lingual class of multiword expressions, which can be used both idiomatically and as a literal combination. This paper presents the VNC-Tokens dataset, a resource of almost 3000 English verb–noun ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
(Show Context)
Idiomatic expressions formed from a verb and a noun in its direct object position are a productive cross-lingual class of multiword expressions, which can be used both idiomatically and as a literal combination. This paper presents the VNC-Tokens dataset, a resource of almost 3000 English verb–noun combination usages annotated as to whether they are literal or idiomatic. Previous research using this dataset is described, and other studies which could be evaluated more extensively using this resource are identified.

1. Verb–Noun Combinations

Identifying multiword expressions (MWEs) in text is essential for accurately performing natural language processing tasks (Sag et al., 2002). A broad class of MWEs with distinct semantic and syntactic properties is that of idiomatic expressions. A productive process of idiom creation across languages is to combine a high-frequency verb with one or more of its arguments. In particular, many such idioms are formed from the combination of a verb and a noun in the direct object position (Cowie et al., 1983; Nunberg et al., 1994; Fellbaum, 2002), e.g., give the sack, make a face, and see stars. Given the richness and productivity of the class of idiomatic verb–noun combinations (VNCs), we choose to focus on these expressions. It is a commonly held belief that expressions with an idiomatic interpretation are primarily used idiomatically, and that they lose their literal meanings over time. Nonetheless, it is still possible for a potentially idiomatic combination to be used in a literal sense, as in: She made a face on the snowman using a carrot and two buttons. Contrast this literal usage with the idiomatic use in: The little girl made a funny face at her mother. Interestingly, in our analysis of 60 VNCs, we found that approximately half of these expressions are attested fairly frequently in their literal sense in the British National Corpus (BNC). Clearly, automatic methods are required for distinguishing between idiomatic and literal usages of such expressions, and indeed there have recently been several studies addressing this issue.
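The annotation scheme described above (a verb–noun pair, its sentence, and a literal/idiomatic label) maps naturally onto a simple record type. The Python sketch below is only illustrative; the field names are hypothetical and the actual VNC-Tokens distribution defines its own file format.

```python
from dataclasses import dataclass

@dataclass
class VNCToken:
    """One annotated verb-noun combination usage (hypothetical layout)."""
    verb: str       # e.g. "make"
    noun: str       # e.g. "face"
    sentence: str   # the BNC sentence containing the usage
    label: str      # "literal" or "idiomatic"

example = VNCToken(
    verb="make", noun="face",
    sentence="She made a face on the snowman using a carrot and two buttons.",
    label="literal",
)
```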
Linguistic Cues for Distinguishing Literal and Non-Literal Usages
"... We investigate the effectiveness of different linguistic cues for distinguishing literal and non-literal usages of potentially idiomatic expressions. We focus specifically on features that generalize across different target expressions. While idioms on the whole are frequent, instances of each parti ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
We investigate the effectiveness of different linguistic cues for distinguishing literal and non-literal usages of potentially idiomatic expressions. We focus specifically on features that generalize across different target expressions. While idioms on the whole are frequent, instances of each particular expression can be relatively infrequent, and it will often not be feasible to extract and annotate a sufficient number of examples for each expression one might want to disambiguate. We experimented with a number of different features and found that features encoding lexical cohesion, as well as some syntactic features, can generalize well across idioms.
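Because such features deliberately avoid naming the target expression, a single classifier can be trained on usages of many different idioms. A minimal scikit-learn sketch, assuming toy feature dicts (the feature names and values are illustrative, not the paper's feature set):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Training usages drawn from *different* expressions; no feature
# identifies the expression itself, so the model can be applied to
# idioms unseen during training.
train_X = [
    {"cohesion": 0.42, "has_determiner": 1},  # literal "make a face"
    {"cohesion": 0.07, "has_determiner": 1},  # idiomatic "see stars"
    {"cohesion": 0.39, "has_determiner": 0},  # literal "see stars"
    {"cohesion": 0.05, "has_determiner": 1},  # idiomatic "make a face"
]
train_y = ["literal", "idiomatic", "literal", "idiomatic"]

model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit(train_X, train_y)
print(model.predict([{"cohesion": 0.31, "has_determiner": 1}]))
```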
Unsupervised Classification of Verb Noun Multi-Word Expression Tokens
In CICLing, 2009
"... Abstract. We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNC) that vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. Our approach hinges upon the assumption tha ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
(Show Context)
We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNCs), which vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. Our approach hinges on the assumption that a literal VNC will have more in common with its component words than an idiomatic one. Commonality is measured by contextual overlap. To this end, we set out to explore different contextual variations and different similarity measures. We also identify a new data set, OPAQUE, which comprises only non-decomposable VNC expressions. Our approach yields state-of-the-art performance, with an overall accuracy of 77.56% on a TEST data set and 81.66% on the newly characterized OPAQUE data set.
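The core assumption, that a literal token shares more context with its component words than an idiomatic one, can be made concrete with bag-of-words contexts and cosine similarity. This is a rough sketch under arbitrary choices; the paper compares several context variations and similarity measures, and the threshold below is invented:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words context vectors.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify_token(token_context, verb_context, noun_context, threshold=0.1):
    """Label a VNC token by its contextual overlap with the typical
    contexts of its component verb and noun (all inputs: word lists)."""
    t = Counter(token_context)
    overlap = max(cosine(t, Counter(verb_context)),
                  cosine(t, Counter(noun_context)))
    return "literal" if overlap >= threshold else "idiomatic"
```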
Automatic Idiom Identification in Wiktionary
"... Online resources, such as Wiktionary, provide an accurate but incomplete source of idiomatic phrases. In this paper, we study the problem of automatically identifying idiomatic dictionary entries with such resources. We train an idiom classifier on a newly gathered corpus of over 60,000 Wiktionary m ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Online resources, such as Wiktionary, provide an accurate but incomplete source of idiomatic phrases. In this paper, we study the problem of automatically identifying idiomatic dictionary entries in such resources. We train an idiom classifier on a newly gathered corpus of over 60,000 Wiktionary multi-word definitions, incorporating features that model whether phrase meanings are constructed compositionally. Experiments demonstrate that the learned classifier can provide high-quality idiom labels, more than doubling the number of idiomatic entries from 7,764 to 18,155 at precision levels of over 65%. These gains also translate to idiom detection in sentences, by simply using known word sense disambiguation algorithms to match phrases to their definitions. In a set of Wiktionary definition example sentences, the more complete set of idioms boosts detection recall by over 28 percentage points.
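One cheap family of compositionality signals can be computed directly from a phrase and its definition text. The sketch below is a guess at the flavor of such features, not the paper's actual feature set:

```python
def definition_features(phrase: str, definition: str) -> dict:
    """Features hinting at whether a definition reads compositionally.
    Intuition (hypothetical): compositional definitions tend to reuse
    the phrase's own words; idiomatic ones tend not to."""
    phrase_words = set(phrase.lower().split())
    def_words = set(definition.lower().split())
    return {
        "component_overlap": len(phrase_words & def_words) / len(phrase_words),
        "definition_length": len(def_words),
    }

# "kick the bucket" defined as "to die": no overlap -> likely idiomatic.
print(definition_features("kick the bucket", "to die"))
```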
A Cohesion Graph Based Approach for Unsupervised Recognition of Literal and Non-literal Use of Multiword Expressions
"... We present a graph-based model for representing the lexical cohesion of a discourse. In the graph structure, vertices correspond to the content words of a text and edges connecting pairs of words encode how closely the words are related semantically. We show that such a structure can be used to dist ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
We present a graph-based model for representing the lexical cohesion of a discourse. In the graph structure, vertices correspond to the content words of a text and edges connecting pairs of words encode how closely the words are related semantically. We show that such a structure can be used to distinguish literal and non-literal usages of multi-word expressions.
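Such a graph can be summarized by its average edge weight, and a token judged by whether the expression's component words raise or lower that average. A small sketch with a pluggable relatedness function standing in for whatever semantic measure is used; the decision rule is a paraphrase of the intuition, not necessarily the paper's exact criterion:

```python
from itertools import combinations

def avg_cohesion(words, relatedness):
    # Average semantic relatedness over all pairs of content words,
    # i.e. the mean edge weight of the cohesion graph.
    pairs = list(combinations(sorted(set(words)), 2))
    if not pairs:
        return 0.0
    return sum(relatedness(u, v) for u, v in pairs) / len(pairs)

def looks_literal(context_words, component_words, relatedness):
    """Literal if the MWE's component words fit the discourse, i.e.
    including them does not lower the graph's average cohesion."""
    with_mwe = avg_cohesion(context_words + component_words, relatedness)
    without = avg_cohesion(context_words, relatedness)
    return with_mwe >= without
```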
Identifying Verbal Collocations in Wikipedia Articles
In Proceedings of the 14th International Conference on Text, Speech and Dialogue (TSD'11), 2011
"... Abstract. In this paper, we focus on various methods for detecting ver-bal collocations, i.e. verb-particle constructions and light verb construc-tions in Wikipedia articles. Our results suggest that for verb-particle constructions, POS-tagging and restriction on the particle seem to yield the best ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
In this paper, we focus on various methods for detecting verbal collocations, i.e. verb-particle constructions and light verb constructions, in Wikipedia articles. Our results suggest that for verb-particle constructions, POS-tagging and a restriction on the particle seem to yield the best results, whereas the combination of POS-tagging, syntactic information, and restrictions on the nominal and verbal components has the most beneficial effect on identifying light verb constructions. The identification of multiword semantic units can be successfully exploited in several applications, such as machine translation and information extraction.
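For the verb-particle side, the "POS-tagging plus a restriction on the particle" recipe can be approximated with a two-token pattern over tagged text. A sketch assuming Penn Treebank tags and an invented particle whitelist:

```python
# Candidate verb-particle constructions: a verb tag followed by a
# particle/preposition tag whose token is on a small particle list.
PARTICLES = {"up", "off", "out", "down", "in", "on", "away", "back"}

def vpc_candidates(tagged_sentence):
    """tagged_sentence: list of (token, Penn-Treebank-tag) pairs."""
    for (w1, t1), (w2, t2) in zip(tagged_sentence, tagged_sentence[1:]):
        if t1.startswith("VB") and t2 in {"RP", "IN"} and w2.lower() in PARTICLES:
            yield (w1, w2)

print(list(vpc_candidates([("gave", "VBD"), ("up", "RP"), ("smoking", "VBG")])))
# [('gave', 'up')]
```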
English Complex Verb Constructions: Identification and Inference
Ph.D. dissertation by Yuancheng Tu, 2012
"... The fundamental problem faced by automatic text understanding in Natural Language Processing (NLP) is to identify semantically related pieces of text and integrate them together to compute the meaning of the whole text. However, the principle of compositionality runs into trouble very quickly when r ..."
Abstract
- Add to MetaCart
(Show Context)
The fundamental problem faced by automatic text understanding in Natural Language Processing (NLP) is to identify semantically related pieces of text and integrate them to compute the meaning of the whole text. However, the principle of compositionality runs into trouble very quickly when real language is examined, with its frequent appearance of Multiword Expressions (MWEs), whose meaning is not based on the meaning of their parts. MWEs occur in all text genres, are far more frequent and productive than is generally recognized, and pose serious difficulties for every kind of NLP application. Given these diverse kinds of MWEs, this dissertation focuses on English verb-related MWEs, constructs stochastic models to identify these complex verb predicates within a given context, and discusses empirically the significance of this MWE recognition component in the context of Textual Entailment (TE), an intricate semantic inference task that involves various levels of linguistic knowledge and logical reasoning. This dissertation develops high-quality computational models for three of the most frequent kinds of English complex verb constructions: the Light Verb Construction (LVC), the Phrasal Verb Construction (PVC) and the Embedded Verb Construction (EVC), and demonstrates empirically their ...
Combining Dictionary- and Corpus-Based Concept Extraction
"... Abstract. Concept extraction is an increasingly popular topic in deep text analysis. Concepts are individual content elements. Their extraction offers thus an overview of the content of the material from which they were extracted. In the case of domain-specific material, concept extraction boils do ..."
Abstract
- Add to MetaCart
(Show Context)
Concept extraction is an increasingly popular topic in deep text analysis. Concepts are individual content elements, so their extraction offers an overview of the content of the material from which they were extracted. In the case of domain-specific material, concept extraction boils down to term identification. The most straightforward strategy for term identification is a lookup in existing terminological resources. In recent research, this strategy has a poor reputation because it is prone to scaling limitations due to neologisms, lexical variation, synonymy, etc., which subject the terminology to constant change. For this reason, many works have developed statistical techniques to extract concepts. But the existence of a crowdsourced resource such as Wikipedia is changing the landscape. We present a hybrid approach that combines state-of-the-art statistical techniques with the large-scale term acquisition tool BabelFy to perform concept extraction. The combination of the two boosts performance compared to approaches that use these techniques separately.
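At its simplest, the hybrid idea reduces to merging two candidate sets: terms scored by a corpus statistic and terms found by resource lookup. A deliberately simplified sketch; the scoring, the lookup, and the union rule are placeholders for the paper's actual components, including the BabelFy step:

```python
def merge_concepts(statistical_scores, dictionary_hits, min_score=0.5):
    """statistical_scores: candidate term -> corpus-based score.
    dictionary_hits: set of terms found by resource lookup.
    Keep a candidate if either evidence source supports it."""
    corpus_terms = {t for t, s in statistical_scores.items() if s >= min_score}
    return corpus_terms | set(dictionary_hits)

print(merge_concepts({"concept extraction": 0.9, "the results": 0.1},
                     {"machine translation"}))
```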
A Generic and Open Environment for Multiword Expression Processing: From Acquisition to Applications
2012
"... ..."
Thesis Submitted for the Degree of Doctor of the Université de Grenoble
2013
Prepared in the NANO-D team, INRIA Grenoble Rhône-Alpes.