## Similarity-based models of word cooccurrence probabilities (1999)

### Download Links

- [springerlink.metapress.com]
- [l2r.cs.uiuc.edu]
- [www.cs.cornell.edu]
- [www.cs.biu.ac.il]
- [u.cs.biu.ac.il]
- [www.cis.upenn.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 98 (0 self)

### BibTeX

@ARTICLE{Lee99similarity-basedmodels,
  author  = {Ido Dagan and Lillian Lee and Fernando C. N. Pereira},
  title   = {Similarity-based models of word cooccurrence probabilities},
  journal = {Machine Learning},
  year    = {1999},
  volume  = {34},
  number  = {1--3},
  pages   = {43--69}
}

### Abstract

In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations “eat a peach” and “eat a beach” is more likely. Statistical NLP methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in any given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on “most similar” words. We describe probabilistic word association models based on distributional word similarity, and apply them to two tasks, language modeling and pseudo-word disambiguation. In the language modeling task, a similarity-based model is used to improve probability estimates for unseen bigrams in a back-off language model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error. We also compare four similarity-based estimation methods against back-off and maximum-likelihood estimation methods on a pseudo-word sense disambiguation task in which we controlled for both unigram and bigram frequency to avoid giving too much weight to easy-to-disambiguate high-frequency configurations. The similarity-based methods perform up to 40% better on this particular task.

### Citations

9088 | Elements of information theory
- Cover, Thomas
- 1991
Citation Context: ...training takes place. 2.3.1. KL divergence The Kullback-Leibler (KL) divergence is a standard information-theoretic measure of the dissimilarity between two probability mass functions (Kullback, 1959; Cover & Thomas, 1991). We can apply it to the conditional distributions induced by words in V1 on words in V2: $D(w_1 \| w_1') = \sum_{w_2} P(w_2 \mid w_1) \log \frac{P(w_2 \mid w_1)}{P(w_2 \mid w_1')}$ (4). $D(w_1 \| w_1')$ is non-negative, and is zero if and ...
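
A minimal Python sketch of the KL-based dissimilarity between conditional word distributions described in this excerpt; the toy distributions are invented for illustration, not taken from the paper:

```python
import math

def kl_divergence(p, q):
    """D(p || q) for discrete distributions given as {word: prob} dicts.
    Infinite when q assigns zero probability to a word that p supports,
    which is why base estimates are smoothed before comparison."""
    total = 0.0
    for w, pw in p.items():
        if pw > 0:
            qw = q.get(w, 0.0)
            if qw == 0.0:
                return math.inf
            total += pw * math.log(pw / qw)
    return total

# Toy conditional distributions P(verb | noun) for two nouns.
p_peach = {"eat": 0.6, "buy": 0.3, "grow": 0.1}
p_apple = {"eat": 0.5, "buy": 0.4, "grow": 0.1}

d = kl_divergence(p_peach, p_apple)  # small positive value; 0 iff equal
```

Note that the divergence is asymmetric, which motivates the symmetric Jensen-Shannon variant discussed elsewhere in the paper.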

4120 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context: ...t first sight seem to be special cases of clustering and weighted nearest-neighbor approaches used widely in machine learning and pattern recognition (Aha, Kibler, & Albert, 1991; Cover & Hart, 1967; Duda & Hart, 1973; Stanfill & Waltz, 1986; Devroye, Györfi, & Lugosi, 1996; Atkeson, Moore, & Schaal, 1997). There are important differences between those methods and ours. Clustering and nearest-neighbor techniques o...

1648 | WordNet: an online lexical database
- Miller
- 1990
Citation Context: ... or word clusters from cooccurrence statistics in a corpus. Other researchers developed methods which quantify similarity relationships based on information in the manually crafted WordNet thesaurus (Miller, Beckwith, Fellbaum, Gross, & Miller, 1990). Resnik (1992, 1995) proposes a node-based approach for measuring the similarity between a pair of words in the thesaurus and applies it to various disambiguation tasks. His similarity function is a...

1269 | Information theory and statistics
- Kullback
- 1959
Citation Context: ...e any parameter training takes place. 2.3.1. KL divergence The Kullback-Leibler (KL) divergence is a standard information-theoretic measure of the dissimilarity between two probability mass functions (Kullback, 1959; Cover & Thomas, 1991). We can apply it to the conditional distributions induced by words in V1 on words in V2: $D(w_1 \| w_1') = \sum_{w_2} P(w_2 \mid w_1) \log \frac{P(w_2 \mid w_1)}{P(w_2 \mid w_1')}$ (4). $D(w_1 \|...

1125 | Instance-based learning algorithms
- Aha, Albert
- 1991
Citation Context: ...ity-based methods for cooccurrence modeling may at first sight seem to be special cases of clustering and weighted nearest-neighbor approaches used widely in machine learning and pattern recognition (Aha, Kibler, & Albert, 1991; Cover & Hart, 1967; Duda & Hart, 1973; Stanfill & Waltz, 1986; Devroye, Györfi, & Lugosi, 1996; Atkeson, Moore, & Schaal, 1997). There are important differences between those methods and ours. Clust...

1041 | A probabilistic theory of pattern recognition
- Devroye, Györfi, et al.
- 1996
Citation Context: ...lustering and weighted nearest-neighbor approaches used widely in machine learning and pattern recognition (Aha, Kibler, & Albert, 1991; Cover & Hart, 1967; Duda & Hart, 1973; Stanfill & Waltz, 1986; Devroye, Györfi, & Lugosi, 1996; Atkeson, Moore, & Schaal, 1997). There are important differences between those methods and ours. Clustering and nearest-neighbor techniques often rely on representing objects as points in a multidime...

956 | Nearest neighbor pattern classification
- Cover, Hart
- 1967
Citation Context: ...rence modeling may at first sight seem to be special cases of clustering and weighted nearest-neighbor approaches used widely in machine learning and pattern recognition (Aha, Kibler, & Albert, 1991; Cover & Hart, 1967; Duda & Hart, 1973; Stanfill & Waltz, 1986; Devroye, Györfi, & Lugosi, 1996; Atkeson, Moore, & Schaal, 1997). There are important differences between those methods and ours. Clustering and nearest-ne...

924 | An empirical study of smoothing techniques for language modeling - Chen, Goodman - 1996

736 | Class-Based n-gram Models of Natural Language
- Brown, Pietra, et al.
- 1992
Citation Context: ... 12% of the bigrams in the test partition did not occur in the training portion. For trigrams, the sparse data problem is even more severe: for instance, researchers at IBM (Brown, DellaPietra, deSouza, Lai, & Mercer, 1992) examined a training corpus consisting of almost 366 million English words, and discovered that one can expect 14.7% of the word triples in any new English text to be absent from the training sample....

706 | A stochastic parts program and a noun phrase parser for unrestricted text
- Church
- 1988
Citation Context: ...Data: $\hat{P}(w_2 \mid w_1) = P_d(w_2 \mid w_1)$ if $c(w_1, w_2) > 0$, and $\hat{P}(w_2 \mid w_1) = \alpha(w_1)\,P_r(w_2 \mid w_1)$ if $c(w_1, w_2) = 0$, where $P_r(w_2 \mid w_1) = P_{SIM}(w_2 \mid w_1) = \sum_{w_1' \in S(w_1)} \frac{W(w_1, w_1')}{\mathrm{norm}(w_1)}\,P(w_2 \mid w_1')$. We used a statistical part-of-speech tagger (Church, 1988) and pattern matching and concordancing tools (due to David Yarowsky) to identify transitive main verbs (V2) and head nouns (V1) of the corresponding direct objects in 44 million words of 1988 Associ...
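
The back-off scheme in this excerpt can be sketched as follows; the similar-word sets, weights, and probabilities below are invented toy data, not values from the paper:

```python
def p_sim(w2, w1, similar, weight, cond_prob):
    """Similarity-based estimate P_SIM(w2 | w1): an average of P(w2 | w1')
    over the most similar words w1' in S(w1), weighted by W(w1, w1')
    and normalized so the weights sum to one."""
    norm = sum(weight[(w1, v)] for v in similar[w1])
    return sum(weight[(w1, v)] / norm * cond_prob.get((w2, v), 0.0)
               for v in similar[w1])

def backoff(w2, w1, counts, discounted, alpha, similar, weight, cond_prob):
    """Use the discounted estimate for seen pairs; back off to the
    similarity-based estimate, scaled by alpha(w1), for unseen ones."""
    if counts.get((w1, w2), 0) > 0:
        return discounted[(w1, w2)]
    return alpha[w1] * p_sim(w2, w1, similar, weight, cond_prob)

# Hypothetical data: "beach" is rare, so we borrow from similar nouns.
similar = {"beach": ["peach", "coast"]}
weight = {("beach", "peach"): 0.7, ("beach", "coast"): 0.3}
cond_prob = {("eat", "peach"): 0.5, ("eat", "coast"): 0.1}
alpha = {"beach": 0.2}

est = backoff("eat", "beach", {}, {}, alpha, similar, weight, cond_prob)
# 0.2 * (0.7 * 0.5 + 0.3 * 0.1) = 0.076
```

The unseen bigram ("beach", "eat") thus inherits probability mass from distributionally similar conditioning words rather than from the unconditional frequency of "eat" alone.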

698 | Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer
- Katz
- 1987
Citation Context: ...s never occurred in training following the conditioning word w1 is typically calculated from the probability of w2, as estimated by w2's frequency in the corpus (Jelinek, Mercer, & Roukos, 1992; Katz, 1987). This method makes an independence assumption on the cooccurrence of w1 and w2: the more frequent w2 is, the higher the estimate of P(w2|w1) will be, regardless of w1. Class-based and sim...

608 | Semantic similarity based on corpus statistics and lexical taxonomy - Jiang, Conrath - 1997

562 | Distributional Clustering of English Words
- Pereira, Tishby, et al.
- 1993
Citation Context: ...ive exponential function of the KL divergence for the weight function by analogy with the form of the cluster membership function in related distributional clustering work (Pereira et al., 1993) and also because that is the form for the probability that w1's distribution arose from a sample drawn from the distribution of w1' (Cover & Thomas, 1991; Lee, 1997). However, these reasons are he...
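
A sketch of the weight function under the stated form, a negative exponential of the KL divergence; the decay parameter `beta` and the sample divergences are assumptions for illustration:

```python
import math

def weight(d_kl, beta=1.0):
    """W(w1, w1') = exp(-beta * D(w1 || w1')): smaller divergence
    (more similar words) yields larger weight; beta controls how
    sharply the weight decays with dissimilarity."""
    return math.exp(-beta * d_kl)

# More similar conditioning word (smaller D) => larger contribution.
w_close = weight(0.1)  # close neighbor
w_far = weight(2.0)    # distant neighbor
```

This is the same functional form as a cluster membership probability in the distributional clustering work cited above, which is the analogy the excerpt appeals to.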

505 | Toward Memory-Based Reasoning
- Stanfill, Waltz
- 1986
Citation Context: ...to be special cases of clustering and weighted nearest-neighbor approaches used widely in machine learning and pattern recognition (Aha, Kibler, & Albert, 1991; Cover & Hart, 1967; Duda & Hart, 1973; Stanfill & Waltz, 1986; Devroye, Györfi, & Lugosi, 1996; Atkeson, Moore, & Schaal, 1997). There are important differences between those methods and ours. Clustering and nearest-neighbor techniques often rely on representin...

493 | Locally weighted learning
- Atkeson, Moore, et al.
- 1997
Citation Context: ...ighbor approaches used widely in machine learning and pattern recognition (Aha, Kibler, & Albert, 1991; Cover & Hart, 1967; Duda & Hart, 1973; Stanfill & Waltz, 1986; Devroye, Györfi, & Lugosi, 1996; Atkeson, Moore, & Schaal, 1997). There are important differences between those methods and ours. Clustering and nearest-neighbor techniques often rely on representing objects as points in a multidimensional space with coordinates ...

444 | Divergence measures based on the Shannon entropy
- Lin
- 1991
Citation Context: ...imilarity-based models performed almost 40% better than back-off, which yielded about 49% accuracy in our experimental setting. Furthermore, a scheme based on the Jensen-Shannon divergence (Rao, 1982; Lin, 1991) yielded statistically significant improvement in error rate over cooccurrence smoothing. We also investigated the effect of removing extremely low-frequency events from the training set. We found ...
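
The Jensen-Shannon divergence referenced here can be sketched as the average KL divergence of each distribution to their mean; unlike the KL divergence it is symmetric and always finite. The toy distributions are illustrative:

```python
import math

def kl(p, q):
    # q[w] > 0 is guaranteed below because q is the average distribution.
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def jensen_shannon(p, q):
    """JS(p, q) = (1/2) D(p || m) + (1/2) D(q || m), with m = (p + q)/2."""
    words = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in words}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = {"eat": 0.6, "buy": 0.4}
q = {"eat": 0.2, "grow": 0.8}
d = jensen_shannon(p, q)  # finite even though the supports differ
```

Boundedness (at most ln 2 with natural logs) is what makes it usable even when two words never share a cooccurring neighbor.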

393 | The population frequencies of species and the estimation of population parameters
- Good
- 1953
Citation Context: ... alleviate its unreliability. Our proposals address the zero-count problem exclusively, and we rely on existing techniques to smooth other small counts. Previous proposals for the zero-count problem (Good, 1953; Jelinek et al., 1992; Katz, 1987; Church & Gale, 1991) adjust the MLE so that the total probability of seen word pairs is less than one, leaving some probability mass to be redistributed among the u...

351 | Interpolated estimation of Markov source parameters from sparse data. In Pattern Recognition in Practice - Jelinek, Mercer - 1980

344 | Explorations in Automatic Thesaurus Discovery - Grefenstette - 1994

314 | Word-sense disambiguation using statistical models of Roget's categories trained on large corpora
- Yarowsky
- 1992
Citation Context: ...larity We now consider several word similarity measures that can be derived automatically from the statistics of a training corpus, as opposed to being derived from manually-constructed word classes (Yarowsky, 1992; Resnik, 1992, 1995; Luk, 1995; Lin, 1997). Sections 2.3.1 and 2.3.2 discuss two related information-theoretic functions, the KL divergence and the Jensen-Shannon divergence. Section 2.3.3 describes ...

250 | Noun classification from predicate-argument structures - Hindle - 1990

249 | Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach - Ng, Lee - 1996

241 | The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression
- Witten, Bell
- 1991
Citation Context: ...mples of text or speech. The most likely analysis will be taken to be the one that contains the most frequent configurations. The problem of data sparseness, also known as the zero-frequency problem (Witten & Bell, 1991), arises when analyses contain configurations that never occurred in the training corpus. Then it is not possible to estimate probabilities from observed frequencies, and some other estimation scheme...

152 | Dimensions of meaning
- Schütze
- 1992
Citation Context: ...enses, whereas others are essentially monosemous; this means that test cases are not all uniformly hard. To circumvent these and other difficulties, we set up a pseudo-word disambiguation experiment (Schütze, 1992a; Gale, Church, & Yarowsky, 1992), the format of which is as follows. First, a list of pseudo-words is constructed, each of which is the combination of two different words in V2. Each word in V2 c...
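
The pseudo-word construction described in this excerpt can be sketched as follows; the vocabulary and the random pairing are invented for illustration (the paper controls pairings for frequency rather than pairing purely at random):

```python
import random

def make_pseudo_words(v2, seed=0):
    """Partition the vocabulary V2 into pairs; each pair {w, w'} forms
    one pseudo-word, so every word contributes to exactly one
    pseudo-word. Assumes an even-sized vocabulary."""
    words = sorted(v2)
    random.Random(seed).shuffle(words)
    return [(words[i], words[i + 1]) for i in range(0, len(words), 2)]

pairs = make_pseudo_words({"eat", "drink", "teach", "build"})
# In the test corpus, every occurrence of w is replaced by its pseudo-word;
# a model "disambiguates" by picking the member with the higher estimated
# cooccurrence probability with the context word.
```

Because the true member is known by construction, this yields labeled sense-disambiguation data without any manual annotation.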

132 | Using syntactic dependency as local context to resolve word sense ambiguity
- Lin
- 1997
Citation Context: ...measures that can be derived automatically from the statistics of a training corpus, as opposed to being derived from manually-constructed word classes (Yarowsky, 1992; Resnik, 1992, 1995; Luk, 1995; Lin, 1997). Sections 2.3.1 and 2.3.2 discuss two related information-theoretic functions, the KL divergence and the Jensen-Shannon divergence. Section 2.3.3 describes the L1 norm, a geometric distance functio...

128 | Disambiguating noun groupings with respect to WordNet senses - Resnik - 1995

104 | Improved clustering techniques for class-based statistical language modelling - Kneser, Ney - 1993

84 | Use of syntactic context to produce term association lists for text retrieval - Grefenstette - 1992

82 | Contextual word similarity and estimation from sparse data - Dagan, Marcus, et al. - 1994

81 | Aggregate and mixed-order Markov models for statistical language processing
- Saul, Pereira
- 1997
Citation Context: ... make γ depend on w1, so that the contribution of the similarity estimate could vary among words. Such dependences are often used in interpolated models (Jelinek & Mercer, 1980; Jelinek et al., 1992; Saul & Pereira, 1997) and are indeed advantageous. However, since they introduce hidden variables, they require a more complex training algorithm, and we did not pursue that direction in the present work. 2.3. Measures o...

76 | A case-based approach to knowledge acquisition for domain-specific sentence analysis - Cardie - 1993

75 | Similarity-based estimation of word cooccurrence probabilities
- Dagan, Pereira, et al.
- 1994
Citation Context: ...rs of this paper for their constructive criticisms, and the editors of the present issue, Claire Cardie and Ray Mooney, for their help and suggestions. Portions of this work have appeared previously (Dagan, Pereira, & Lee, 1994; Dagan, Lee, & Pereira, 1997); we thank the reviewers of those papers for their comments. Part of this work was done while the first author was a member of technical staff and then a visitor at AT&T ...

69 | Context space
- Schütze
- 1992
Citation Context: ...enses, whereas others are essentially monosemous; this means that test cases are not all uniformly hard. To circumvent these and other difficulties, we set up a pseudo-word disambiguation experiment (Schütze, 1992a; Gale, Church, & Yarowsky, 1992), the format of which is as follows. First, a list of pseudo-words is constructed, each of which is the combination of two different words in V2. Each word in V2 c...

63 | Principles of lexical language modeling for speech recognition
- Jelinek, Mercer, et al.
- 1991
Citation Context: ...2|w1) of a conditioned word w2 that has never occurred in training following the conditioning word w1 is typically calculated from the probability of w2, as estimated by w2's frequency in the corpus (Jelinek, Mercer, & Roukos, 1992; Katz, 1987). This method makes an independence assumption on the cooccurrence of w1 and w2: the more frequent w2 is, the higher the estimate of P(w2|w1) will be, regardless of w1. Class-based and s...

59 | Similarity-Based Methods for Word Sense Disambiguation
- Dagan, Lee, et al.
- 1997
Citation Context: ...nstructive criticisms, and the editors of the present issue, Claire Cardie and Ray Mooney, for their help and suggestions. Portions of this work have appeared previously (Dagan, Pereira, & Lee, 1994; Dagan, Lee, & Pereira, 1997); we thank the reviewers of those papers for their comments. Part of this work was done while the first author was a member of technical staff and then a visitor at AT&T Labs, and the second author w...

57 | Work on statistical methods for word sense disambiguation
- Gale, Church, et al.
- 1992
Citation Context: ...thers are essentially monosemous; this means that test cases are not all uniformly hard. To circumvent these and other difficulties, we set up a pseudo-word disambiguation experiment (Schütze, 1992a; Gale, Church, & Yarowsky, 1992), the format of which is as follows. First, a list of pseudo-words is constructed, each of which is the combination of two different words in V2. Each word in V2 contributes to exactly one pseudo-wor...

53 | WordNet and distributional analysis: A class-based approach to lexical discovery
- Resnik
- 1992
Citation Context: ...nsider several word similarity measures that can be derived automatically from the statistics of a training corpus, as opposed to being derived from manually-constructed word classes (Yarowsky, 1992; Resnik, 1992, 1995; Luk, 1995; Lin, 1997). Sections 2.3.1 and 2.3.2 discuss two related information-theoretic functions, the KL divergence and the Jensen-Shannon divergence. Section 2.3.3 describes the L1 norm, ...

49 | Experiments on linguistically-based term associations - Ruge - 1992

48 | Exemplar-based word sense disambiguation: Some recent improvements - Ng - 1997

46 | Cooccurrence smoothing for stochastic language modeling - Essen, Steinbiss - 1992

45 | Similarity-Based Approaches to Natural Language
- Lee
- 1997
Citation Context: ...l clustering work (Pereira et al., 1993) and also because that is the form for the probability that w1's distribution arose from a sample drawn from the distribution of w1' (Cover & Thomas, 1991; Lee, 1997). However, these reasons are heuristic rather than theoretical, since we do not have a rigorous probabilistic justification for similarity-based methods. 2.3.2. Jensen-Shannon divergence A related mea...

39 | Statistical Sense Disambiguation with Relatively Small Corpus Using Dictionary Definitions
- Luk
- 1995
Citation Context: ...similarity measures that can be derived automatically from the statistics of a training corpus, as opposed to being derived from manually-constructed word classes (Yarowsky, 1992; Resnik, 1992, 1995; Luk, 1995; Lin, 1997). Sections 2.3.1 and 2.3.2 discuss two related information-theoretic functions, the KL divergence and the Jensen-Shannon divergence. Section 2.3.3 describes the L1 norm, a geometric dista...

31 | Discovery procedures for sublanguage selectional patterns: Initial experiments
- Grishman, Hirschman, et al.
- 1986
Citation Context: ...y scores between a word and representatives of precomputed similarity classes. An early attempt to automatically classify words into semantic classes was carried out in the Linguistic String Project (Grishman, Hirschman, & Nhan, 1986). Semantic classes were derived from similar cooccurrence patterns of words within syntactic relations. Cooccurrence statistics were then considered at the class level and used to alleviate data spar...

23 | Learning similarity-based word sense disambiguation from sparse data - Karov, Edelman - 1996

21 | Smoothing of automatically generated selectional constraints - Grishman, Sterling - 1993

12 | Hierarchical Clustering of Words and Application to NLP Tasks - Ushioda - 1996

11 | Isolated Word Recognition Using Hidden Markov Models - Sugawara, Nishimura, et al. - 1985

9 | An extended clustering algorithm for statistical language models - Ueberla - 1994

3 | Diversity and dissimilarity coefficients: A unified approach - Rao - 1982