## Unsupervised Learning from Dyadic Data (1998)

Citations: 100 (9 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Hofmann98unsupervisedlearning,
  author    = {Thomas Hofmann},
  title     = {Unsupervised Learning from Dyadic Data},
  booktitle = {},
  year      = {1998},
  pages     = {466--472},
  publisher = {MIT Press}
}
```

### Abstract

Dyadic data refers to a domain with two finite sets of objects in which observations are made for dyads, i.e., pairs with one element from either set. This includes event co-occurrences, histogram data, and single stimulus preference data as special cases. Dyadic data arises naturally in many applications ranging from computational linguistics and information retrieval to preference analysis and computer vision. In this paper, we present a systematic, domain-independent framework for unsupervised learning from dyadic data by statistical mixture models. Our approach covers different models with flat and hierarchical latent class structures and unifies probabilistic modeling and structure discovery. Mixture models provide both a parsimonious yet flexible parameterization of probability distributions with good generalization performance on sparse data and structural information about data-inherent grouping structure. We propose an annealed version of the standard Expectation Maximization algorithm for model fitting, which is empirically evaluated on a variety of data sets from different domains.

### Citations

8558 | Elements of Information Theory - Cover, Thomas - 1991
Citation Context: ...senting all co-occurring y in that code. The asymptotic average codeword length of a code based on P(y|c(x)) is exactly the cross entropy which governs the latent class posterior probabilities (cf. [CT91]). The one-sided clustering model can also be viewed as an unsupervised version of the naive Bayes classifier, if we give Y the interpretation of a feature space for x ∈ X. Each sample set S_x then ... |

8083 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977 |

3119 | Introduction to Modern Information Retrieval - Salton - 1983 |

2717 | Indexing by latent semantic analysis - Deerwester, Dumais, et al. - 1990
Citation Context: ...ry may not occur in a document, the (indirect) conditional probability based on the aspect model can nevertheless be high. This is similar to the dimension-reduction approach pursued in standard LSI [DTGL90]. We tested the PLSI method on a number of medium-sized standard document test collections with relevance assessments by computing precision-recall curves. The precision-recall curve reported in the... |

2146 | Algorithms for Clustering Data - Jain, Dubes - 1988
Citation Context: ...88] for an overview), but is of great importance in interactive retrieval. The most frequently used methods in this context are linkage algorithms (single linkage, complete linkage, Ward's method, cf. [JD88]), or hybrid combinations of agglomerative and centroid-based methods ... |

942 | The EM Algorithm and Extensions - McLachlan, Krishnan - 1996
Citation Context: ...P(S, a; θ) for given posterior probabilities with respect to θ. The EM algorithm is known to increase the observed likelihood in each step, and converges to a (local) maximum under mild assumptions [MK97]. In the aspect model, θ contains all continuous parameters, namely P(a), P(x|a) and P(y|a). The E-step equations for the class posterior probabilities in the aspect model can be derived from Bayes... |
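The aspect-model E- and M-steps described in this context can be sketched end-to-end. The following is a minimal illustration of EM on a count matrix n(x, y), with E-step P(a|x, y) ∝ P(a)P(x|a)P(y|a) and M-step re-estimation from expected counts; the function name and the Dirichlet initialization are my own choices, not from the paper.

```python
import numpy as np

def aspect_model_em(n, n_aspects=2, iters=100, seed=0):
    """EM for the aspect model on a count matrix n of shape (X, Y).
    E-step: P(a|x,y) proportional to P(a) P(x|a) P(y|a).
    M-step: re-estimate P(a), P(x|a), P(y|a) from expected counts."""
    rng = np.random.default_rng(seed)
    X, Y = n.shape
    Pa = np.full(n_aspects, 1.0 / n_aspects)
    Pxa = rng.dirichlet(np.ones(X), n_aspects).T   # (X, A), columns sum to 1
    Pya = rng.dirichlet(np.ones(Y), n_aspects).T   # (Y, A), columns sum to 1
    for _ in range(iters):
        # E-step: posterior over aspects for every dyad (x, y)
        post = Pa[None, None, :] * Pxa[:, None, :] * Pya[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: accumulate expected counts per aspect
        exp_counts = n[:, :, None] * post          # (X, Y, A)
        Pa = exp_counts.sum(axis=(0, 1))           # total mass per aspect
        Pxa = exp_counts.sum(axis=1) / (Pa + 1e-12)
        Pya = exp_counts.sum(axis=0) / (Pa + 1e-12)
        Pa /= Pa.sum()
    return Pa, Pxa, Pya
```

On a block-structured count matrix, the fitted aspects recover the blocks, mirroring the grouping behavior the paper reports for co-occurrence data.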

878 | Finite Mixture Models - McLachlan, Peel - 2000
Citation Context: ...erior probabilities in (15) as P{C(x) = c | S_x; θ} ∝ P(c) exp(−n(x)(−Σ_{y∈Y} P(y|x) log P(y|c))) (18). Comparing (18) with standard models like the Gaussian mixture model [MB88] or probabilistic vector quantization [RGF92] demonstrates that the cross entropy between the empirical conditional probability P(y|x) and the class-conditional P(y|c) serves as a distortion measur... |
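Equation (18) in the context above is straightforward to evaluate in log space. A hedged sketch (argument names are hypothetical; `p_y_given_x` is the empirical conditional P(y|x), `n_x` the sample size n(x)):

```python
import numpy as np

def class_posterior(p_y_given_x, n_x, p_c, p_y_given_c):
    """Posterior over clusters for object x, as in eq. (18):
    P{C(x)=c | S_x} proportional to P(c) * exp(-n(x) * H(P(.|x), P(.|c))),
    where H is the cross entropy acting as a distortion measure."""
    # cross entropy between empirical and class-conditional distributions
    cross_ent = -(p_y_given_x[None, :] * np.log(p_y_given_c)).sum(axis=1)
    log_post = np.log(p_c) - n_x * cross_ent
    log_post -= log_post.max()          # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()
```

The cluster whose conditional P(y|c) best matches the empirical P(y|x) has the smallest cross entropy and hence the largest posterior, which is the point the quoted comparison with Gaussian mixtures makes.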

831 | An introduction to variational methods for graphical models - Jordan, Ghahramani, et al. - 1999 |

764 | A view of the EM algorithm that justifies incremental sparse and other variants
- Neal, Hinton
- 1998
(Show Context)
Citation Context ... sharing their expertise and data in natural language processing as well as to J.M.H. du Buf for providing the image data depicted in Fig. 13. Appendix EM, Annealed EM and Free Energy Minimization In =-=[NH98]-=-, it has been shown that both, the E-step and M-step of the EM algorithm, are minimizinga (generalized) free energy criterion. This fact is of importance, in particular for deriving approximate E-step... |
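Written out, the free energy criterion the context refers to takes the standard variational form; this is a generic reconstruction (with q an arbitrary distribution over the latent variable a and θ the model parameters), not copied from the paper:

```latex
F(q, \theta) \;=\; -\sum_{a} q(a)\,\log P(S, a; \theta) \;+\; \sum_{a} q(a)\,\log q(a)
```

The E-step minimizes F over q at fixed θ, yielding q(a) = P(a | S; θ); the M-step minimizes F over θ at fixed q, i.e., maximizes the expected complete-data log-likelihood. Since F upper-bounds −log P(S; θ) and the bound is tight after each E-step, this coordinate descent never decreases the observed likelihood, which is the Neal-Hinton view the appendix builds on.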

724 | Hierarchical mixtures of experts and the EM algorithm - Jordan, Jacobs - 1994
Citation Context: ...ver, especially in the context of structure discovery, it is important to find a hierarchical data organization. There are well-known learning architectures like the Hierarchical Mixtures of Experts [JJ94] which can fit hierarchical models to data. Yet, in the case of dyadic data there is an alternative way to define a hierarchical model by combining aspects and clusters. In the hierarchical clustering... |

692 | Using collaborative filtering to weave an information tapestry - Goldberg, Nichols, et al. - 1992 |

625 | Statistical Analysis of Finite Mixture Distributions - Titterington, Smith, et al. - 1985 |

622 | Scatter/Gather: a cluster-based approach to browsing large document collections - Cutting, Karger, et al. - 1992
Citation Context: ...Figure 10: Exemplary relative word distributions over nodes for the CLUSTER dataset for the keywords 'cluster', 'decision', 'glass', 'robust', 'segment', and 'channel'. [CKP92] which have no probabilistic interpretation and have a number of other disadvantages. In contrast, the hierarchical mixture model provides a sound statistical basis and also has many additional feature... |

592 | GroupLens: applying collaborative filtering to Usenet news - Konstan, Miller, et al. - 1997 |

548 | Distributional clustering of English words - Pereira, Tishby, et al. - 1993
Citation Context: ...content of a document [HPJ99]. • Computational linguistics in the corpus-based statistical analysis of word co-occurrences, which has applications in probabilistic language modeling, word clustering [PTL93], word sense disambiguation [Hin90, DLP97] and discrimination [Sch98], and automated thesaurus construction [SP97b]. • Preference analysis and consumption behavior by identifying X with individuals ... |

462 | Unsupervised texture segmentation using Gabor filters - Jain, Farrokhnia - 1991 |

441 | Graphical models in applied multivariate statistics - Whittaker - 1990 |

372 | Automatic word sense discrimination - Schütze - 1998
Citation Context: ...corpus-based statistical analysis of word co-occurrences which has applications in probabilistic language modeling, word clustering [PTL93], word sense disambiguation [Hin90, DLP97] and discrimination [Sch98], and automated thesaurus construction [SP97b]. • Preference analysis and consumption behavior by identifying X with individuals and Y with objects. Dyads then correspond to single stimulus preferen... |

353 | The population frequencies of the species and the estimation of population parameters - Good - 1953 |

336 | Interpolated estimation of markov source parameters from sparse data - Jelinek, Mercer - 1980 |

324 | Learning to order things - Cohen, Schapire, et al. - 1999
Citation Context: ...s setting. Examples that have recently received some attention are proximity data [HB97, BWD97], which replace metric distances by the weaker notion of pairwise similarities, and ranked preference data [CSS98]. A variety of other types of non-metrical data can be found, for example, in the psychometric literature [Coo64, Kru78, CA80], in particular in the context of multidimensional scaling and correspond... |

288 | Multidimensional scaling - Kruskal, Wish - 1978 |

240 | Noun classification from predicate-argument structures - Hindle - 1990 |

227 | Latent Variable Models and Factor Analysis - Bartholomew - 1987
Citation Context: ...tained by interchanging the role of the sets X and Y. Probabilistic Factor Analysis for Discrete Data: the dimensionality reduction obtained by the aspect model is similar in spirit to factor analysis [Bar87]. The factor analysis of co-occurrence data is also known as Latent Semantic Analysis (LSA) [LS89, DTGL90]. In LSA, the co-occurrence matrix of counts C = (n(x_i, y_j))_{i,j} is decomposed by Singular... |
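The SVD decomposition of the count matrix that this context describes can be sketched in a few lines. A minimal sketch assuming a plain count matrix (practical LSA typically reweights the counts, e.g. with tf-idf, before decomposing):

```python
import numpy as np

def lsa_embed(counts, k):
    """Truncated SVD of a co-occurrence count matrix C = (n(x_i, y_j)):
    keep only the k largest singular values and their vectors, as in LSA."""
    U, s, Vt = np.linalg.svd(counts, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def lsa_reconstruct(U, s, Vt):
    """Rank-k approximation C_k = U_k diag(s_k) V_k^T, which is the best
    rank-k approximation of C in the Frobenius norm (Eckart-Young)."""
    return (U * s) @ Vt
```

In contrast to this linear-algebraic factorization, the aspect model achieves its dimensionality reduction with a proper probabilistic model fitted by EM.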

227 | The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression - Witten, Bell - 1991 |

202 | Recent Trends in Hierarchic Document Clustering: a critical review - Willett - 1988
Citation Context: ...structuring large text repositories. Clustering of documents provides a popular way of prestructuring a database that has been applied with mixed success in the context of query-based retrieval (cf. [Wil88] for an overview), but is of great importance in interactive retrieval. The most frequently used methods in this context are linkage algorithms (single linkage, complete linkage, Ward's method, cf. [JD... |

197 | Pairwise data clustering by deterministic annealing - Hofmann, Buhmann - 1997 |

191 | Texture classification and segmentation using multiresolution simultaneous autoregressive models - Mao, Jain - 1992
Citation Context: ...e past decades, most of which obey a two-stage scheme. In the modeling stage, characteristic features are extracted from the textured input image, e.g. spatial frequencies [JF91, HPB98] or MRF models [MJ92]. In the clustering stage, features are grouped into homogeneous segments, where homogeneity of features is typically formalized by a clustering optimization criterion. Most widely, features are interp... |

135 | Statistical mechanics and phase transitions in clustering - Rose, Gurewitz, et al. - 1990
Citation Context: ...uch as the log-likelihood) is smoothed for large T. For hierarchical models, the annealed EM algorithm offers a natural way to generate tree topologies. As is known from adaptive vector quantization [RGF90], starting at a high value of T and successively lowering T typically leads through a sequence of phase transitions. At each phase transition, the effective number of distinguishable clusters grows un... |

129 | Boundary detection by constrained optimization - Geman, Geman, et al. - 1990 |

119 | Markov random field models for unsupervised segmentation of textured colour images - Panjwani, Healey - 1995 |

102 | A Theory of Data - COOMBS - 1964 |

92 | Statistical models for co-occurrence data - Hofmann, Puzicha - 1998 |

91 | Unsupervised texture segmentation in a deterministic annealing framework - Hofmann, Puzicha, et al. - 1998
Citation Context: ...he context of image segmentation, where X corresponds to image locations, Y to discretized or categorical feature values, and a dyad denotes the occurrence of a feature at a particular image location [HPB98]. • Text-based information retrieval, where X corresponds to a document collection, Y to the vocabulary, and a dyad represents the occurrence of a token in the content of a document [HPJ99]. • Com... |

85 | Multiscale minimization of global energy functions in some visual recovery problems - Heitz, Pérez, et al. - 1994 |

82 | A co-occurrence-based thesaurus and two applications to information retrieval - Schütze, Pedersen - 1997
Citation Context: ...ccurrences which has applications in probabilistic language modeling, word clustering [PTL93], word sense disambiguation [Hin90, DLP97] and discrimination [Sch98], and automated thesaurus construction [SP97b]. • Preference analysis and consumption behavior by identifying X with individuals and Y with objects. Dyads then correspond to single stimulus preferences. This type of data is the starting point f... |

79 | Aggregate and Mixed-Order Markov Models for Statistical Language Processing - Saul, Pereira - 1997
Citation Context: ...m. Mixture and clustering models for dyadic data have been investigated before under the titles of class-based n-gram models [BdM+92], distributional clustering [PTL93], and aggregate Markov models [SP97a] in natural language processing. All three approaches are recovered as special cases in our general learning framework. There is also a close relation to clustering methods for qualitative data like t... |

74 | Vector quantization by deterministic annealing - Rose, Gurewitz, et al. - 1992
Citation Context: ...P{C(x) = c | S_x; θ} ∝ P(c) exp(−n(x)(−Σ_{y∈Y} P(y|x) log P(y|c))) (18). Comparing (18) with standard models like the Gaussian mixture model [MB88] or probabilistic vector quantization [RGF92] demonstrates that the cross entropy between the empirical conditional probability P(y|x) and the class-conditional P(y|c) serves as a distortion measure in the one-sided clustering model. Notice t... |

67 | Annealed competition of experts for a segmentation and classification of switching dynamics - Pawelzik, Kohlmorgen, et al. - 1996 |

57 | Data clustering using a model granular magnet. Neural Computation, 9:1805–1842 - Blatt, Wiseman, et al. - 1997 |

53 | Update rules for parameter estimation in Bayesian networks - Bauer, Koller, et al. - 1997
Citation Context: ...imple way to accelerate EM algorithms is by over-relaxation in the M-step. This has been discussed in the context of mixture models [PW78] and was recently 'rediscovered' under the title of EM(η) in [BKS97]. We found this method useful in accelerating the fitting procedure for all discussed models. Essentially the estimator for a generic parameter θ in the M-step is modified by θ^(t+1) = (1 − η) θ^(t) ... |
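The EM(η) over-relaxation rule quoted in this context extrapolates past the plain EM estimate: θ^(t+1) = (1 − η) θ^(t) + η θ_EM^(t+1). A one-line sketch (for probability parameters the result may need re-projection onto the simplex, which is omitted here):

```python
import numpy as np

def em_eta_update(theta_old, theta_em, eta):
    """Over-relaxed M-step (EM(eta)): extrapolate from the previous
    parameter value through the plain EM estimate. eta = 1 recovers
    standard EM; 1 < eta < 2 often accelerates convergence."""
    return (1.0 - eta) * theta_old + eta * theta_em
```

With η = 1 the update collapses to the ordinary M-step, which is why the method slots into any of the fitting procedures discussed in the paper.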

53 | Similarity-based methods for word sense disambiguation - Dagan, Lee, et al. - 1997 |

52 | On substantive research hypotheses, conditional independence graphs and graphical chain models - Wermuth, Lauritzen - 1990 |

48 | The Development of an Experimental Discrete Dictation Recognizer - Jelinek |

46 | Co-occurrence Smoothing for Stochastic Language Modeling - Essen, Steinbiss - 1992 |

39 | Multidimensional Scaling - Arabie - 1980 |

38 | Histogram clustering for unsupervised segmentation and image retrieval - Puzicha, Hofmann, et al. - 1999
Citation Context: ...h visually and semantically satisfying. A detailed evaluation of the one-sided clustering model for unsupervised texture segmentation is outside the scope of this paper and will be published elsewhere [PH98]. 5 Conclusion: As the main contribution of this paper, a novel class of statistical models for the analysis of co-occurrence data has been proposed and evaluated. We have introduced and discussed severa... |

35 | Hierarchical image segmentation by multi-dimensional clustering and orientation-adaptive boundary refinement - Schroeter, Bigun - 1995
Citation Context: ...t manner. In contrast to pairwise similarity clustering, they offer a sound generative model for texture class description which can be utilized in subsequent processing stages like edge localization [SB95]. Furthermore, there is no need to compute a large matrix of pairwise similarity scores between image sites, which greatly reduces the overall processing time and memory requirements. Compared to K-... |

34 | A Theory of Proximity Based Clustering: Structure Detection by Optimization - Puzicha, Hofmann, et al. - 1999 |

30 | Global text matching for information retrieval - Salton, Buckley - 1991 |