Retrieving Collocations from Text: Xtract
- Computational Linguistics, 1993
Cited by 229 (1 self)
Natural languages are full of collocations, recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages. Recent work in lexicography indicates that collocations are pervasive in English; apparently, they are common in all types of writing, including both technical and nontechnical genres. Several approaches have been proposed to retrieve various types of collocations from the analysis of large samples of textual data. These techniques automatically produce large numbers of collocations along with statistical figures intended to reflect the relevance of the associations. However, none of these techniques provides functional information along with the collocation. Also, the results produced often contained improper word associations reflecting some spurious aspect of the training corpus that did not stand for true collocations. In this paper, we describe a set of techniques based on statistical methods for retrieving and identifying collocations from large textual corpora. These techniques produce a wide range of collocations and are based on some original filtering methods that allow the production of richer and higher-precision output. These techniques have been implemented and resulted in a lexicographic tool, Xtract. The techniques are described and some results are presented on a 10-million-word corpus of stock market news reports. A lexicographic evaluation of Xtract as a collocation retrieval tool has been made, and the estimated precision of Xtract is 80%.
Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures
- ACL2007 Workshop on Multiword Expressions: A Broader Perspective on Multiword Expressions, 2007
Cited by 11 (2 self)
We identify several classes of multiword expressions that each require a different encoding in a (computational) lexicon, as well as a different treatment within a computational system. We examine linguistic properties pertaining to the degree of semantic idiosyncrasy of these classes of expressions. Accordingly, we propose statistical measures to quantify each property, and use the measures to automatically distinguish the classes.
Looking for lexical gaps
- In Proceedings of the Ninth EURALEX International Congress, 2000
Cited by 7 (6 self)
In this paper we present the results of a quantitative evaluation of the discrepancies between the Italian and English lexica in terms of lexical gaps. This evaluation has been carried out in the context of MultiWordNet, an ongoing project that aims at building a multilingual lexical database. The quantitative evaluation of the English-to-Italian lexical gaps shows that the English and Italian lexica are highly comparable and gives empirical support to the MultiWordNet model.
Detecting Hidden Multiwords in Bilingual Dictionaries
Cited by 1 (0 self)
Dictionaries are a valuable source of information about multiwords. Unfortunately, only few multiwords are explicitly marked as such in dictionaries: most of them are presented without being distinguished from free combinations of words. In this paper we present a methodology for detecting hidden multiwords in bilingual dictionaries, along with their translation in another language. The methodology is based on a number of automatic procedures which exploit regularities in the different kinds of expressions that can be found in the Collins English-Italian bilingual dictionary to select those phrases that are most likely to contain multiwords. The quantitative results of the experiment are provided.
The dynamics of collocation: A corpus-based study of the phraseology and pragmatics of the introductory-it construction
- 2005
Acquiring Multiword Verbs: The Role of Statistical Evidence
In addition to words and grammar, young children learn a large number of multiword sequences that are semantically idiosyncratic and have particular syntactic behaviour, e.g., expressions formed from the combination of a verb and a noun, such as take the train and give a kiss. Given the high degree of polysemy of verbs that commonly participate in such constructions, an important question is what cues children use to identify (nonliteral) multiword combinations. We provide evidence that certain statistical cues tapping into the properties of non-literal expressions are useful in separating these from literal combinations. Moreover, our experiments on naturally occurring child-directed data show that these cues are easily extractable from the input children receive.

