## Similarity-based approaches to natural language processing (1997)

Citations: 45 (3 self)

### BibTeX

@TECHREPORT{Lee97similarity-basedapproaches,
  author = {Lillian Jane Lee},
  title = {Similarity-based approaches to natural language processing},
  institution = {},
  year = {1997}
}

### Abstract

Statistical methods for automatically extracting information about associations between words or documents from large collections of text have the potential to have considerable impact in a number of areas, such as information retrieval and natural-language-based user interfaces. However, even huge bodies of text yield highly unreliable estimates of the probability of relatively common events, and, in fact, perfectly reasonable events may not occur in the training data at all. This is known as the sparse data problem. Traditional approaches to the sparse data problem use crude approximations. We propose a different solution: if we are able to organize the data into classes of similar events, then, if information about an event is lacking, we can estimate its behavior from information about similar events. This thesis presents two such similarity-based approaches, where, in general, we measure similarity by the Kullback-Leibler divergence, an information-theoretic quantity. Our first approach is to build soft, hierarchical clusters: soft, because each event belongs to each cluster with some probability; hierarchical, because cluster centroids are iteratively split to model finer distinctions. Our clustering method, which uses the technique of deterministic annealing,
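
The abstract names the Kullback-Leibler divergence as the similarity measure. The following is a minimal sketch of that quantity, not the thesis's implementation: the dict-based representation, the function name `kl_divergence`, and the toy verb distributions are all assumptions made for illustration. It applies the usual limiting conventions, treating a term with q(y) = 0 as zero and returning infinity when q(y) > 0 but r(y) = 0.

```python
import math


def kl_divergence(q, r):
    """Return D(q || r) = sum over y of q(y) * log(q(y) / r(y)), in nats.

    q and r are dicts mapping outcomes to probabilities (an assumed
    representation chosen for this sketch).
    """
    total = 0.0
    for outcome, q_prob in q.items():
        if q_prob == 0.0:
            continue  # convention: 0 * log(0 / r) = 0, even if r = 0
        r_prob = r.get(outcome, 0.0)
        if r_prob == 0.0:
            return math.inf  # convention: q * log(q / 0) is infinite for q > 0
        total += q_prob * math.log(q_prob / r_prob)
    return total


# Hypothetical verb distributions for three nouns (made-up numbers).
wine = {"drink": 0.7, "pour": 0.2, "read": 0.1}
beer = {"drink": 0.6, "pour": 0.3, "read": 0.1}
book = {"drink": 0.05, "pour": 0.05, "read": 0.9}

# A smaller divergence means the two distributions are more alike.
print(kl_divergence(wine, beer), kl_divergence(wine, book))
```

Note that the divergence is asymmetric, so `kl_divergence(q, r)` and `kl_divergence(r, q)` generally differ, and it is infinite whenever `r` assigns zero probability to an outcome that `q` considers possible.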

### Citations

9054 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...at maximize entropy, though, and so we repeat these two steps until a stable configuration is reached. This two-step estimation iteration is reminiscent of the EM (Estimation-Maximization) algorithm (Dempster, Laird, and Rubin, 1977) commonly used to find maximum likelihood solutions. Before we continue, we review the notation that will be used in the following sections. Model probabilities are always marked with a tilde ($\tilde{P}$). ... |

3768 | Fuzzy sets - Zadeh - 1965 |

1563 | Wordnet: a lexical database for english
- Miller
- 1995
Citation Context ...n, and attempt to learn characteristics of the language from the statistics in the sample. They may also make use of auxiliary information gained from such sources as on-line dictionaries or WordNet (Miller, 1995). An important advantage of statistical approaches over traditional linguistic models is that statistical methods yield probabilities. These probabilities can easily be combined with estimates from o... |

1535 | Finding Groups in Data: An Introduction to Cluster Analysis
- Kaufman, Rousseeuw
- 1990
Citation Context ...us working with representations adhering to the second condition tends to be much more convenient. Many clustering schemes represent objects in terms of a set $\{A_1, A_2, \ldots, A_N\}$ of attributes (Kaufman and Rousseeuw, 1990). Each object is associated with an attribute vector $(a_1, a_2, \ldots, a_N)$ of values for the attributes. Some attributes can take on an infinite number of values; for example, the mean of a norma... |

1450 | Pattern recognition with fuzzy objective function algorithms
- Bezdek
- 1981
Citation Context ...e known. In practice, however, estimates of inter-object distances can be quite sensitive to noise; centroid methods overcome this problem by averaging together many points. The fuzzy k-means method (Bezdek, 1981), a generalization of the k-means approach, bears some resemblance to our procedure. It is a centroid method using the Euclidean distance ($L_2$) as distance function. The centroid distributions depen... |

1317 | Information Theory and Statistics
- Kullback
- 1959
Citation Context ...ack Leibler distance (Cover and Thomas, 1991). Kullback himself refers to the function as information for discrimination, reserving the term "divergence" for the symmetric function $D(q||r) + D(r||q)$ (Kullback, 1959). We will use the name Kullback-Leibler (KL) divergence throughout this thesis. The KL divergence is a standard information-theoretic "measure" of the dissimilarity between two probability mass funct... |

1093 | The use of multiple measurements in taxonomic problems
- Fisher
- 1936
Citation Context ... and compare. For the moment, we will be vague about what sorts of objects we will be considering; researchers have clustered everything from documents (Salton, 1968; Cutting et al., 1992) to irises (Fisher, 1936; Cheeseman et al., 1988). We want the representation we choose to satisfy two requirements. First, the representation should be general enough to apply to many different types of objects. Second, any ... |

940 | An empirical study of smoothing techniques for language modeling. Computer Speech and Language - Chen, Goodman - 1999 |

900 | Word association norms, mutual information, and lexicography
- Church, Hanks
- 1990
Citation Context ...ion, Aczel and Daroczy (1975) for an axiomatic development, and Renyi (1970) for a description of information theory that uses the KL divergence as a starting point. Some authors (Brown et al., 1992; Church and Hanks, 1990; Dagan, Marcus, and Markovitch, 1995; Luk, 1995) use the mutual information, which is the KL divergence between the joint distribution of two random variables and their product distributions. Let A a... |

751 | Information theory and statistical mechanics - Jaynes - 1957 |

738 | Class-Based N-Gram Models of Natural Language
- Brown, Pietra
- 1992
Citation Context ...for general information, Aczel and Daroczy (1975) for an axiomatic development, and Renyi (1970) for a description of information theory that uses the KL divergence as a starting point. Some authors (Brown et al., 1992; Church and Hanks, 1990; Dagan, Marcus, and Markovitch, 1995; Luk, 1995) use the mutual information, which is the KL divergence between the joint distribution of two random variables and their produc... |

716 | A stochastic parts program and noun phrase parser for unrestricted text
- Church
- 1988
Citation Context ...t were derived from newswire text automatically parsed by Hindle's parser Fidditch (Hindle, 1994). Later, we constructed similar frequency tables with the help of a statistical part-of-speech tagger (Church, 1988) and tools for regular expression pattern-matching on tagged corpora (Yarowsky, 1992a). We have not compared the accuracy and coverage of the two methods or studied what biases they introduce, alt... |

701 | Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer
- Katz
- 1987
Citation Context ...noted in chapter 1, using the maximum likelihood estimate tends to grossly underestimate the probability of low-frequency events. Many alternatives to the MLE (Good, 1953; Jelinek and Mercer, 1980; Katz, 1987; Church and Gale, 1991) take the MLE as an initial estimate and adjust it so that the total estimated probability of pairs occurring in the sample is less than one, leaving some probability mass for ... |

680 | Scatter/gather: A cluster-based approach to browsing large document collections
- Cutting, Pedersen, et al.
- 1992
Citation Context ...or the objects we wish to cluster and compare. For the moment, we will be vague about what sorts of objects we will be considering; researchers have clustered everything from documents (Salton, 1968; Cutting et al., 1992) to irises (Fisher, 1936; Cheeseman et al., 1988). We want the representation we choose to satisfy two requirements. First, the representation should be general enough to apply to many different types... |

568 | Distributional Clustering of English Words
- Pereira, Tishby, et al.
- 1993
Citation Context ...raphic Notes Portions of this thesis are joint work and have appeared elsewhere. Chapter 3 is based on the paper "Distributional Clustering of English Words" with Fernando Pereira and Naftali Tishby (Pereira, Tishby, and Lee, 1993), which appeared in the proceedings of the 31st meeting of the ACL. We thank Don Hindle for making available the 1988 Associated Press verb-object data set, the Fidditch parser, and a verb-object str... |

515 | Bayesian Classification (AutoClass): Theory and Results
- Cheeseman, Stutz
- 1996
Citation Context ...ibutional setting we have been considering, using the KL divergence is well-motivated, whereas it is not entirely clear why the $L_2$ norm would be meaningful. Bayesian methods (Wallace and Dowe, 1994; Cheeseman and Stutz, 1996) combine well-formedness constraints and performance criteria. They seek to find the model with the maximum posterior probability given the data, where the posterior probability is based on the produ... |

514 | Syntactic Structures
- Chomsky
- 1957
Citation Context ...ed .... Hence, in any statistical model ... these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not." (Chomsky, 1964, pg. 16) This thought experiment helped "[disabuse] the field once and for all of the notion that there was anything of interest to statistical models of language" (Abney, 1996). However, in the year... |

422 | A maximum likelihood approach to continuous speech recognition - Bahl, Jelinek, et al. - 1983 |

411 | The population frequencies of species and the estimation of population parameters
- Good
- 1953
Citation Context ...the training sample is impossible. As noted in chapter 1, using the maximum likelihood estimate tends to grossly underestimate the probability of low-frequency events. Many alternatives to the MLE (Good, 1953; Jelinek and Mercer, 1980; Katz, 1987; Church and Gale, 1991) take the MLE as an initial estimate and adjust it so that the total estimated probability of pairs occurring in the sample is less than o... |

358 | The Art of Computer Programming, Volume 1: Fundamental Algorithms, 2nd ed - Knuth - 1973 |

357 | Interpolated estimation of Markov source parameters from sparse data
- Jelinek, Mercer
- 1980
Citation Context ... sample is impossible. As noted in chapter 1, using the maximum likelihood estimate tends to grossly underestimate the probability of low-frequency events. Many alternatives to the MLE (Good, 1953; Jelinek and Mercer, 1980; Katz, 1987; Church and Gale, 1991) take the MLE as an initial estimate and adjust it so that the total estimated probability of pairs occurring in the sample is less than one, leaving some probabili... |

317 | Word-sense disambiguation using statistical models of roget’s categories trained on large corpora
- Yarowsky
- 1992
Citation Context ...(Hindle, 1994). Later, we constructed similar frequency tables with the help of a statistical part-of-speech tagger (Church, 1988) and tools for regular expression pattern-matching on tagged corpora (Yarowsky, 1992a). We have not compared the accuracy and coverage of the two methods or studied what biases they introduce, although we took care to filter out certain systematic errors (for instance, subjects of... |

256 | Noun Classification from Predicate-Argument Structures - Hindle - 1990 |

252 | AutoClass: A Bayesian Classification System
- Cheeseman, Kelly, et al.
- 1988
Citation Context ...For the moment, we will be vague about what sorts of objects we will be considering; researchers have clustered everything from documents (Salton, 1968; Cutting et al., 1992) to irises (Fisher, 1936; Cheeseman et al., 1988). We want the representation we choose to satisfy two requirements. First, the representation should be general enough to apply to many different types of objects. Second, any particular object's repr... |

249 | Elements of information theory. Wiley series in telecommunications
- Cover, Thomas
- 1991
Citation Context ...re the performance of our similarity-based estimates. Section 2.3 studies various functions measuring similarity between distributions. We pay particular attention to the Kullback-Leibler divergence (Cover and Thomas, 1991), which plays a central role in our work. 2.1 Objects as Distributions The first issue we must address is what representation to use for the objects we wish to cluster and compare. For the moment, we... |

247 | Selection and Information: A Class-based Approach to Lexical Relationships
- Resnik
- 1993
Citation Context ...mentioned in chapter 2, however, we are interested in ways to derive classes directly from distributional data. Resnik's thesis contains a discussion of the relative advantages of the two approaches (Resnik, 1993). In what follows, we will consider two sets of words, the set X of nouns, and the set Y of transitive verbs. We are interested in the object-verb relation: the pair (x, y) denotes the event that nou... |

240 | Probability theory
- Renyi
- 1970
Citation Context ... the logarithm). Limiting arguments lead us to set $0 \log \frac{0}{r} = 0$, even if $r = 0$, and $q \log \frac{q}{0} = \infty$ when $q$ is not zero. Function (2.5) goes by many names in the literature, including information gain (Renyi, 1970), error (Kerridge, 1961), relative entropy, cross entropy, and Kullback Leibler distance (Cover and Thomas, 1991). Kullback himself refers to the function as information for discrimination, reservi... |

204 | Pairwise data clustering by deterministic annealing - Hofmann, Buhmann - 1997 |

141 | Statistical mechanics and phase transitions in clustering
- Rose, Gurewitz, et al.
- 1990
Citation Context ...|x), (3.4) which is the average entropy of the membership probabilities. We can combine distortion and entropy into a single function, the free energy, which appears in work on statistical mechanics (Rose, Gurewitz, and Fox, 1990): $F = D - H/\beta$ (3.5). This function is not arbitrary; indeed, at maximum entropy points (see section 3.3.1), we can show that $H = -\partial F / \partial T$ (3.6) and $D = \partial(\beta F) / \partial \beta$ (3.7), where $T = 1/\beta$. The minima of... |

131 | A Comparison of the Enhanced Good-Turing and Deleted Estimation Methods for Estimating
- Church, Gale
- 1991
Citation Context ...hapter 1, using the maximum likelihood estimate tends to grossly underestimate the probability of low-frequency events. Many alternatives to the MLE (Good, 1953; Jelinek and Mercer, 1980; Katz, 1987; Church and Gale, 1991) take the MLE as an initial estimate and adjust it so that the total estimated probability of pairs occurring in the sample is less than one, leaving some probability mass for unseen pairs. These tec... |

123 | The Art of Computer Programming, volume 1
- Knuth
- 1968
Citation Context ...clustering schemes are enormous, as there are $\left\{ {n \atop k} \right\}$ ways to group n observations into k non-empty sets, where $\left\{ {n \atop k} \right\} = \frac{1}{k!} \sum_{i=0}^{k} (-1)^{k-i} \binom{k}{i} i^n$ is a Stirling number of the second kind (Knuth, 1973). There are a huge number of possible groupings even for small values of k and n: Hatzivassiloglou and McKeown (1993) observe that one can divide twenty-one points into nine sets in approximately 1.2... |
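
The Stirling-number formula quoted in the context above can be checked directly. This is a minimal sketch; the function name `stirling2` is an assumption made for illustration and does not come from the thesis.

```python
from math import comb, factorial


def stirling2(n, k):
    """Number of ways to partition n items into k non-empty sets."""
    # Inclusion-exclusion sum; the total is always divisible by k!.
    total = sum((-1) ** (k - i) * comb(k, i) * i ** n for i in range(k + 1))
    return total // factorial(k)


# Twenty-one points into nine sets, the case mentioned in the context.
print(stirling2(21, 9))
```

Because Python integers are arbitrary precision, the sum is exact even though the individual terms are far larger than the final count.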

104 | Improved clustering techniques for class-based statistical language modelling - Kneser, Ney - 1993 |

96 | Synopsis of Linguistic Theory 1930-1955
- Firth
- 1957
Citation Context ... Chapter 1 Introduction "You shall know a word by the company it keeps!" (Firth, 1957, pg. 11) We begin by considering the problem of predicting string probabilities. Suppose we are presented with two strings, 1. "Grill doctoral candidates", and 2. "Grill doctoral updates", and are as... |

84 | Statistical methods and linguistics
- Abney
- 1996
Citation Context ...hile (2) is not." (Chomsky, 1964, pg. 16) This thought experiment helped "[disabuse] the field once and for all of the notion that there was anything of interest to statistical models of language" (Abney, 1996). However, in the years since Chomsky wrote this remark, some progress on ameliorating the sparse data problem has been made. Indeed, Chomsky's statement is based on the false assumption that "any st... |

84 | Contextual word similarity and estimation from sparse data
- Dagan, Shaul, et al.
- 1993
Citation Context ...1975) for an axiomatic development, and Renyi (1970) for a description of information theory that uses the KL divergence as a starting point. Some authors (Brown et al., 1992; Church and Hanks, 1990; Dagan, Marcus, and Markovitch, 1995; Luk, 1995) use the mutual information, which is the KL divergence between the joint distribution of two random variables and their product distributions. Let A and B be two random variables with pro... |

77 | Similarity-based estimation of word cooccurrence probabilities
- Dagan, Pereira, et al.
- 1994
Citation Context ...ork with Ido Dagan and Fernando Pereira that is described in "Similarity-Based Estimation of Word Cooccurrence Probabilities", which appeared in the proceedings of the 32nd Annual meeting of the ACL (Dagan, Pereira, and Lee, 1994). We thank Slava Katz for discussions on the topic of this paper, Doug McIlroy for detailed comments, Doug Paul for help with his baseline back-off model, and Andre Ljolje and Michael Riley for provid... |

71 | Word Space
- Schütze
- 1993
Citation Context ...training data for which the correct senses have been assigned, which can require considerable human effort. To circumvent these and other difficulties, we set up a pseudo-word disambiguation experiment (Schütze, 1992; Gale, Church, and Yarowsky, 1992), the general format of which is as follows. We first construct a list of pseudo-words, each of which is the combination of two different words in Y. Each word in Y c... |

69 | Towards the Automatic Identification of Adjectival Scales: Clustering Adjectives According to Meaning - Hatzivassiloglou, McKeown - 1993 |

64 | Principles of lexical language modeling for speech recognition
- Jelinek, Mercer, et al.
- 1991
Citation Context ...ion error rate. Perplexity is often used as a performance metric for language modeling systems; it is generally assumed that lowering the perplexity is correlated with better performance in practice (Jelinek, Mercer, and Roukos, 1992). Let $P_{LM}$ be a probability model and $S$ some sample of text. Then the perplexity $PP$ measures how well $P_{LM}$ models $S$: $PP = P_{LM}(S)^{-1/|S|}$. The intuition behind this expression is that a good language... |
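
The perplexity expression in the context above can be evaluated in log space to avoid numerical underflow on long samples. This is a minimal sketch under one stated assumption that is not in the source: the model's probability of the sample factors into a product of per-token probabilities, passed in as the hypothetical list `token_probs`.

```python
import math


def perplexity(token_probs):
    """PP = P(S) ** (-1 / |S|), where P(S) is the product of token_probs.

    token_probs holds the probability the model assigned to each token of
    the sample (an assumed input format for this sketch).
    """
    if not token_probs:
        raise ValueError("sample must contain at least one token")
    log_prob = sum(math.log(p) for p in token_probs)  # log P(S)
    return math.exp(-log_prob / len(token_probs))


# A model that spreads probability uniformly over four choices per token.
print(perplexity([0.25] * 10))
```

A model that assigns higher probability to the sample gets a lower perplexity, which matches the intuition the context begins to describe.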

62 | Numerical methods for fuzzy clustering - Ruspini - 1970 |

61 | Similarity-based methods for word sense disambiguation
- Dagan, Lee, et al.
- 1997
Citation Context ...2 are adapted from "Similarity-Based Methods for Word Sense Disambiguation". This paper, co-written with Ido Dagan and Fernando Pereira, will appear in the proceedings of the 35th meeting of the ACL (Dagan, Lee, and Pereira, 1997). We thank Hiyan Alshawi, Joshua Goodman, Rebecca Hwa, Stuart Shieber, and Yoram Singer for many helpful comments and discussions. Chapter 5 is based on work with Ido Dagan and Fernando Pereira that ... |

57 | Work on statistical methods for word sense disambiguation
- Gale, Church, et al.
- 1992
Citation Context ...or which the correct senses have been assigned, which can require considerable human effort. To circumvent these and other difficulties, we set up a pseudo-word disambiguation experiment (Schütze, 1992; Gale, Church, and Yarowsky, 1992), the general format of which is as follows. We first construct a list of pseudo-words, each of which is the combination of two different words in Y. Each word in Y contributes to exactly one pseudo-w... |

53 | On the estimation of 'small' probabilities by leaving-one-out
- Ney, Essen, et al.
- 1995
Citation Context ...of up to 25% on Wall Street Journal data with respect to Katz's back-off method. This is a rather stunning result. However, the class-based model uses a smoothing method known as absolute discounting (Ney and Essen, 1993). An interesting question is how much of the performance is due to the smoothing method and how much is due to the clustering (Brown et al. did not smooth the data); no comparison was done between th... |

53 | WordNet and distributional analysis: A class-based approach to lexical discovery - Resnik - 1992 |

48 | Part-of-speech induction from scratch
- Schütze
- 1993
Citation Context ...functions, including the $L_1$ and $L_2$ norms and the cosine function. All three of these functions appear quite commonly in the clustering literature (Kaufman and Rousseeuw, 1990; Cutting et al., 1992; Schütze, 1993). The first two functions are true metrics, as the name "norm" suggests. The $L_1$ norm (also called the "Manhattan" or "taxi-cab" distance) is defined as $L_1(q, r) = \sum_{y \in Y} |q(y) - r(y)|$ (2.9). Clearl... |

47 | On the Complexity of Clustering Problems
- Brucker
- 1978
Citation Context ...ve that one can divide twenty-one points into nine sets in approximately $1.23 \times 10^{14}$ ways. As it turns out, the problem of finding a partition that minimizes some optimization function is NP-complete (Brucker, 1978), so, not surprisingly, most hard clustering algorithms resort to greedy or hill-climbing search to find a good partition. Greedy and hill-climbing approaches all first create an initial clustering a... |

46 | Cooccurrence smoothing for stochastic language modeling
- Essen, Steinbiss
- 1992
Citation Context ...ioned words (y) rather than on the similarity between conditioning words (x). For example, Essen and Steinbiss's variation 1 considers the confusion probability (4.3) of contexts rather than objects (Essen and Steinbiss, 1992). However, they noted that model 1-A was equivalent to model 2-B (which we discussed in section 4.4.2; it uses the confusion probability of conditioning words), and that their other model using varia... |

39 | Statistical Sense Disambiguation with Relatively Small Corpus Using Dictionary Definitions, 33rd Annual Meeting of the Association for Computational Linguistics,26-30
- Luk
- 1995
Citation Context ...nd Renyi (1970) for a description of information theory that uses the KL divergence as a starting point. Some authors (Brown et al., 1992; Church and Hanks, 1990; Dagan, Marcus, and Markovitch, 1995; Luk, 1995) use the mutual information, which is the KL divergence between the joint distribution of two random variables and their product distributions. Let A and B be two random variables with probability ma... |

36 | Intrinsic classification by MML - the Snob program
- Wallace, Dowe
- 1994
Citation Context ...e function. In the distributional setting we have been considering, using the KL divergence is well-motivated, whereas it is not entirely clear why the $L_2$ norm would be meaningful. Bayesian methods (Wallace and Dowe, 1994; Cheeseman and Stutz, 1996) combine well-formedness constraints and performance criteria. They seek to find the model with the maximum posterior probability given the data, where the posterior probab... |

35 | Bootstrapping syntactic categories - Finch, Chater - 1992 |