Results 1 - 7 of 7
Measures of Distributional Similarity
In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 1999
Cited by 229 (2 self)
We study distributional similarity measures for the purpose of improving probability estimation for unseen cooccurrences. Our contributions are threefold: an empirical comparison of a broad range of measures; a classification of similarity functions based on the information that they incorporate; and the introduction of a novel function that is superior at evaluating potential proxy distributions.
Similarity-based models of word cooccurrence probabilities
Machine Learning, 1999
Cited by 90 (0 self)
In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations “eat a peach” and “eat a beach” is more likely. Statistical NLP methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in any given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on “most similar” words. We describe probabilistic word association models based on distributional word similarity, and apply them to two tasks, language modeling and pseudoword disambiguation. In the language modeling task, a similarity-based model is used to improve probability estimates for unseen bigrams in a back-off language model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error. We also compare four similarity-based estimation methods against back-off and maximum-likelihood estimation methods on a pseudoword sense disambiguation task in which we controlled for both unigram and bigram frequency to avoid giving too much weight to easy-to-disambiguate high-frequency configurations. The similarity-based methods perform up to 40% better on this particular task.
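The idea of estimating an unseen word combination from "most similar" words can be illustrated with a minimal sketch. This is not the authors' model: the function names, the toy bigrams, and the similarity weights below are all hypothetical, and the weights are simply supplied by hand rather than derived from a distributional similarity measure. The sketch only shows the general shape of a similarity-weighted average of conditional probabilities.

```python
from collections import Counter, defaultdict


def conditional_probs(bigrams):
    """Maximum-likelihood estimates of P(w2 | w1) from (w1, w2) pairs."""
    counts = defaultdict(Counter)
    for w1, w2 in bigrams:
        counts[w1][w2] += 1
    return {
        w1: {w2: c / sum(ctr.values()) for w2, c in ctr.items()}
        for w1, ctr in counts.items()
    }


def similarity_estimate(w1, w2, probs, neighbours):
    """Weighted average of P(w2 | n) over the neighbours n of w1.

    `neighbours` maps a word to {similar_word: weight}. The weights are
    hypothetical inputs here; a real system would compute them from a
    distributional similarity measure.
    """
    weights = neighbours.get(w1, {})
    total = sum(weights.values())
    if total == 0:
        return 0.0  # no usable neighbours, so no evidence either way
    weighted = sum(
        wt * probs.get(n, {}).get(w2, 0.0) for n, wt in weights.items()
    )
    return weighted / total


# Toy data: "munch peach" never occurs, but "eat peach" does.
bigrams = [
    ("eat", "peach"), ("eat", "peach"), ("eat", "apple"),
    ("devour", "apple"),
    ("munch", "apple"),
]
probs = conditional_probs(bigrams)
neighbours = {"munch": {"eat": 0.75, "devour": 0.25}}  # assumed weights

mle = probs["munch"].get("peach", 0.0)
sim = similarity_estimate("munch", "peach", probs, neighbours)
print(mle, sim)
```

In this toy setting the maximum-likelihood estimate for the unseen pair is zero, while the similarity-weighted estimate borrows probability mass from "eat", which was observed with "peach".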
Similarity-Based Estimation of Word Cooccurrence Probabilities
In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994
Cited by 75 (8 self)
In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the likelihood of a word combination according to its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in a given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on "most similar" words. We describe a probabilistic word association model based on distributional word similarity, and apply it to improving probability estimates for unseen word bigrams in a variant of Katz's back-off model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error.
Characterising Measures of Lexical Distributional Similarity
In COLING-04, 2004
Cited by 45 (1 self)
This work investigates the variation in a word's distributionally nearest neighbours with respect to the similarity measure used. We identify one type of variation as being the relative frequency of the neighbour words with respect to the frequency of the target word. We then demonstrate a three-way connection between relative frequency of similar words, a concept of distributional generality and the semantic relation of hyponymy. Finally, we consider the impact that this has on one application of distributional similarity methods (judging the compositionality of collocations).
The estimation of powerful language models from small and large corpora
1993
Cited by 41 (5 self)
This paper deals with the estimation of powerful statistical language models using a technique that scales from very small to very large amounts of domain-dependent data. We begin with an improved modeling of the grammar statistics, based on a combination of the backing-off technique [6] and zero-frequency techniques [2, 9]. These are extended to be more amenable to our particular system. Our resulting technique is greatly simplified, more robust, and gives better recognition performance than either of the previous techniques. We then further attack the problem of robustness of a model based on a small training corpus by grouping words into obvious semantic classes. This significantly improves the robustness of the resulting statistical grammar. We also present a technique that allows the estimation of a high-order model on modest computation resources. This allows us to run a 4-gram statistical model of a 50 million word corpus on a workstation of only modest capability and cost. Finally, we discuss results from applying a 2-gram statistical language model integrated in the HMM search, obtaining a list of the N-Best recognition results, and rescoring this list with a higher-order statistical model.
Similarity-based approaches to natural language processing
1997
Cited by 40 (3 self)
Statistical methods for automatically extracting information about associations between words or documents from large collections of text have the potential to have considerable impact in a number of areas, such as information retrieval and natural-language-based user interfaces. However, even huge bodies of text yield highly unreliable estimates of the probability of relatively common events, and, in fact, perfectly reasonable events may not occur in the training data at all. This is known as the sparse data problem. Traditional approaches to the sparse data problem use crude approximations. We propose a different solution: if we are able to organize the data into classes of similar events, then, if information about an event is lacking, we can estimate its behavior from information about similar events. This thesis presents two such similarity-based approaches, where, in general, we measure similarity by the Kullback-Leibler divergence, an information-theoretic quantity. Our first approach is to build soft, hierarchical clusters: soft, because each event belongs to each cluster with some probability; hierarchical, because cluster centroids are iteratively split to model finer distinctions. Our clustering method, which uses the technique of deterministic annealing,
Measures and Applications of Lexical Distributional Similarity
2003
Cited by 19 (0 self)
This thesis is concerned with the measurement and application of lexical distributional similarity. Two words are said to be distributionally similar if they appear in similar contexts. This loose definition, however, has led to many measures being proposed or adopted from fields such as geometry, statistics, Information Retrieval (IR) and Information Theory. Our aim is to investigate the properties which make a good measure of lexical distributional similarity. We start by introducing the concept of lexical distributional similarity. We discuss potential applications, which can be roughly divided into distributional or language modelling applications and semantic applications, and methods of evaluation (Chapter 2). We look at existing measures of distributional similarity and carry out an empirical comparison of fifteen of these measures, paying particular attention to the effects of word frequency (Chapter 3). We propose a new general framework for distributional similarity based on the context of lexical substitutability, which we measure using the IR concepts of precision and recall. This framework allows us to investigate the key factors in similarity of asymmetry, the relative influence of different contexts and the extent to which words share a context (Chapter 4). Finally, we consider the application of distributional similarity in language modelling (Chapter 5) and as a predictor of semantic similarity using human judgements of similarity and a spelling correction task (Chapter 6).
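To make the precision and recall view of similarity concrete, here is a small sketch under strong simplifying assumptions. The abstract does not specify how contexts are weighted, so this version treats each word's contexts as a plain set; the function name, the direction chosen for "precision", and the toy context sets are all hypothetical. It illustrates only why such a measure is asymmetric.

```python
def context_precision_recall(contexts_a, contexts_b):
    """Unweighted precision and recall of word A's contexts against word B's.

    Precision: fraction of A's contexts that B also occurs in.
    Recall:    fraction of B's contexts that A also occurs in.
    This set-based definition is an assumption made for illustration.
    """
    shared = contexts_a & contexts_b
    precision = len(shared) / len(contexts_a) if contexts_a else 0.0
    recall = len(shared) / len(contexts_b) if contexts_b else 0.0
    return precision, recall


# Toy context sets (invented for the example).
dog = {"bark", "run", "eat", "sleep"}
animal = {"run", "eat", "sleep", "live", "grow", "die"}

p_dog, r_dog = context_precision_recall(dog, animal)
p_animal, r_animal = context_precision_recall(animal, dog)
print(p_dog, r_dog)        # 0.75 0.5
print(p_animal, r_animal)  # 0.5 0.75
```

Swapping the arguments swaps precision and recall, which is one simple way to see the asymmetry the abstract mentions: the narrower word shares most of its contexts with the broader one, but not the reverse.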