In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in any given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on "most similar" words. We describe probabilistic word association models based on distributional word similarity, and apply them to two tasks, language modeling and pseudo-word disambiguation. In the language modeling task, a similarity-based model is used to improve probability estimates for unseen bigrams in a back-off language model. The similarity-based ...
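To make the idea of borrowing evidence from "most similar" words concrete, here is a minimal sketch in Python. It is not the paper's exact model: the toy word pairs, the use of Jensen-Shannon divergence as the distributional similarity measure, the exponential weighting, and the parameters `k` and `beta` are all assumptions chosen for illustration. The sketch covers only the similarity-weighted average for an unseen pair and leaves out the integration into a back-off language model.

```python
import math
from collections import Counter, defaultdict


def conditional_distributions(pairs):
    """Maximum-likelihood P(w2 | w1) from a list of (w1, w2) pairs."""
    counts = defaultdict(Counter)
    for w1, w2 in pairs:
        counts[w1][w2] += 1
    return {
        w1: {w2: c / sum(ctr.values()) for w2, c in ctr.items()}
        for w1, ctr in counts.items()
    }


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two sparse distributions."""
    total = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        m = 0.5 * (pw + qw)
        if pw > 0:
            total += 0.5 * pw * math.log(pw / m)
        if qw > 0:
            total += 0.5 * qw * math.log(qw / m)
    return total


def similarity_estimate(w1, w2, dists, k=2, beta=4.0):
    """Weighted average of P(w2 | w1') over the k words w1' most similar to w1.

    The weight exp(-beta * divergence) is an illustrative assumption.
    """
    neighbours = sorted(
        (js_divergence(dists[w1], dists[other]), other)
        for other in dists
        if other != w1
    )[:k]
    weights = [(math.exp(-beta * d), other) for d, other in neighbours]
    norm = sum(wt for wt, _ in weights)
    return sum(wt * dists[other].get(w2, 0.0) for wt, other in weights) / norm


# Hypothetical toy data: (verb, object) pairs standing in for a training corpus.
pairs = [
    ("eat", "peach"), ("eat", "apple"), ("eat", "pear"),
    ("devour", "apple"), ("devour", "pear"),
    ("visit", "beach"), ("visit", "city"),
]
dists = conditional_distributions(pairs)

# Neither pair occurs in the toy data, so the raw frequency estimate is zero for both.
p_peach = similarity_estimate("devour", "peach", dists)
p_beach = similarity_estimate("devour", "beach", dists)
```

In this toy setting "devour" shares its observed objects with "eat" and none with "visit", so the similarity-weighted estimate ranks "devour peach" above "devour beach" even though neither pair was observed.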