## Semisupervised learning of hierarchical latent trait models for data visualisation (2005)

Venue: IEEE Transactions on Knowledge and Data Engineering

Citations: 7 (1 self)

### BibTeX

```bibtex
@ARTICLE{Nabney05semisupervisedlearning,
  author  = {Ian T. Nabney and Yi Sun and Peter Tiňo and Ata Kabán},
  title   = {Semisupervised learning of hierarchical latent trait models for data visualisation},
  journal = {IEEE Transactions on Knowledge and Data Engineering},
  year    = {2005},
  volume  = {17},
  pages   = {2005}
}
```

### Abstract

Recently, we have developed the hierarchical Generative Topographic Mapping (HGTM), an interactive method for visualisation of large high-dimensional real-valued data sets. In this paper, we propose a more general visualisation system by extending HGTM in three ways that allow the user to visualise a wider range of datasets and better support the model development process. (i) We integrate HGTM with noise models from the exponential family of distributions. The basic building block is the Latent Trait Model (LTM). This enables us to visualise data of an inherently discrete nature (e.g., collections of documents) in a hierarchical manner. (ii) We give the user a choice of initialising the child plots of the current plot in either interactive or automatic mode. In the interactive mode the user selects "regions of interest", whereas in the automatic mode an unsupervised minimum message length (MML)-inspired construction of a mixture of LTMs is employed. The unsupervised construction is particularly useful when high-level plots are covered with dense clusters of highly overlapping data projections, making it difficult to use the interactive mode. Such a situation often arises when visualising large data sets. (iii) We derive general formulas for magnification factors in latent trait models. Magnification factors are a useful tool to improve our understanding of the visualisation plots, since they can highlight the boundaries between data clusters. We illustrate our approach on a toy example and evaluate it on three more complex real data sets.
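The magnification factors mentioned in the abstract admit a compact numerical illustration. The sketch below is illustrative only: the paper derives closed-form expressions for latent trait models, whereas this estimates the local magnification factor of an arbitrary smooth latent-to-data map `z` as sqrt(det(JᵀJ)) with a finite-difference Jacobian; the toy map is made up.

```python
import numpy as np

def magnification_factor(z, x, eps=1e-5):
    """Local magnification factor of a smooth map z: R^2 -> R^D at latent
    point x, estimated as sqrt(det(J^T J)) with a central-difference
    Jacobian J (columns = partial derivatives of z)."""
    x = np.asarray(x, dtype=float)
    D = len(z(x))
    J = np.zeros((D, 2))
    for j in range(2):
        dx = np.zeros(2)
        dx[j] = eps
        J[:, j] = (z(x + dx) - z(x - dx)) / (2 * eps)
    return np.sqrt(np.linalg.det(J.T @ J))

# Toy map: embeds the 2D latent plane into 3D with a quadratic bend.
z = lambda x: np.array([x[0], x[1], x[0] ** 2 + x[1] ** 2])
print(magnification_factor(z, [0.0, 0.0]))  # flat at the origin -> 1.0
```

Regions where this quantity is large are stretched by the mapping, which is why magnification factors can highlight boundaries between data clusters.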

### Citations

8187 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...recursive way; after viewing the plots at a given level, the user may add further plots at the next level down in order to provide more insight. These child plots can be trained using the EM algorithm [10], but their parameters must be initialized in some way. Existing hierarchical models do this by allowing the user to select the position of each child plot in an interactive mode; see [27]. In this pa...

3282 | Self-Organizing Maps
- Kohonen
- 1997
Citation Context: ...of non-linear visualisation hierarchies [27], the basic building block of which is the Generative Topographic Mapping (GTM) [4]. GTM is a probabilistic reformulation of the self-organizing map (SOM) [17] in the form of a non-linear latent variable model with a spherical Gaussian noise model. The extension of the GTM algorithm to discrete variables was described in [5] and a generalisation of this to ...

1392 | Generalized linear models
- McCullagh, Nelder
- 1989
Citation Context: ...the function b(·) denotes the gradient of the cumulant function B(·)², Φ is an M × K matrix with φ_k in its k-th column, T is the data matrix including the N data vectors {t_n} as columns, R = (R_kn), k = 1, ..., K, n = 1, ..., N, and G is a diagonal matrix with elements g_kk = Σ_{n=1}^{N} R_kn, where R_kn, computed via Bayes' theorem in the ... (Footnote 2: it is the inverse link function [21] of the noise distribution.)
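The responsibilities R_kn and the diagonal matrix G quoted above can be sketched numerically. In this hypothetical snippet, `log_lik` stands in for the K × N matrix of log-likelihoods log p(t_n | x_k), and a uniform prior over the K latent grid points is assumed, so Bayes' theorem reduces to normalising over k.

```python
import numpy as np

def responsibilities(log_lik):
    """R[k, n] = p(x_k | t_n): posterior over K latent grid points for each
    of N data points, from a K x N matrix of log-likelihoods, assuming a
    uniform 1/K prior over latent points (Bayes' theorem)."""
    log_post = log_lik - log_lik.max(axis=0, keepdims=True)  # numerical stability
    R = np.exp(log_post)
    return R / R.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
log_lik = rng.normal(size=(4, 10))   # K = 4 latent points, N = 10 data points
R = responsibilities(log_lik)
G = np.diag(R.sum(axis=1))           # g_kk = sum_n R_kn, as in the quoted M-step
print(R.sum(axis=0))                 # each column sums to 1
```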

1047 | Bayesian Theory
- Bernardo, Smith
- 1994
Citation Context: ...information matrix, |I(θ)| is its determinant, and c is the number of free parameters, i.e. the dimension of θ. This approach was first proposed in [33]. By imposing a non-informative Jeffreys' prior [3] on both the vector of mixing coefficients {π(M)} and the parameters Θ(M) of the individual mixture components [11], equation (15) becomes ...

313 | An information measure for classification
- Wallace, Boulton
- 1968
Citation Context: ...specifying the model parameters, the other specifying the data given the model: Length(θ, ζ) = Length(θ) + Length(ζ|θ). The MML principle was first applied to unsupervised learning of mixture models in [29] and was extended to hierarchical models in [8]. A computer program, Snob, that uses these principles for both parameter estimation and model selection was described in [30]; this provides a flat clus...
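The two-part message Length(θ, ζ) = Length(θ) + Length(ζ|θ) quoted above can be made concrete with a deliberately crude sketch: here the parameter part charges a fixed number of bits per parameter, whereas real MML derives that precision from the Fisher information. The model and parameter names are illustrative.

```python
import numpy as np

def two_part_message_length(data, mu, sigma, bits_per_param=32):
    """Length(theta) + Length(data | theta), in bits, for a 1-D Gaussian
    model. Crude sketch: a fixed precision per parameter stands in for the
    Fisher-information-derived precision of real MML."""
    model_bits = 2 * bits_per_param                     # mu and sigma
    nll = 0.5 * np.sum(((data - mu) / sigma) ** 2) \
          + len(data) * np.log(sigma * np.sqrt(2 * np.pi))
    data_bits = nll / np.log(2)                         # nats -> bits
    return model_bits + data_bits

rng = np.random.default_rng(1)
data = rng.normal(size=100)
# The better-fitting parameters transmit the data in a shorter total message.
print(two_part_message_length(data, 0.0, 1.0) <
      two_part_message_length(data, 5.0, 1.0))
```

Model selection under MML then amounts to picking the candidate with the smallest total message length.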

274 | Unsupervised learning of finite mixture models
- Figueiredo, Jain
Citation Context: ...e a principled minimum message length (MML)-based learning of mixture models with an embedded model selection criterion; this approach has been used for Gaussian mixture models [11]¹. Hence, given a parent LTM, the number and position of its children is based on the modelling properties of the children themselves – without any ad-hoc criteria which would be exterior to the mod...

188 | Estimation and inference by compact coding
- Wallace, Freeman
- 1987
Citation Context: ...L(θ, ζ), where I(θ) is the expected Fisher information matrix, |I(θ)| is its determinant, and c is the number of free parameters, i.e. the dimension of θ. This approach was first proposed in [33]. By imposing a non-informative Jeffreys' prior [3] on both the vector of mixing coefficients {π(M)} and the parameters Θ(M) of the individual mixture components [11], equation (15) ...

170 | Information and Exponential Families in Statistical Theory
- Barndorff-Nielsen
- 1978
Citation Context: ...= (1/K) Σ_{k=1}^{K} p(t|x_k). (1) The conditional data distribution p(t|x_k) (conditioned on the k-th latent space point x_k ∈ H) is modelled as a member of the exponential family in a parameterised functional form [2]: p_B(t|x_k, Θ) = exp{f_Θ(x_k) · t − B(f_Θ(x_k))} p_0(t). (2) Here Θ is the parameter vector of the model, B(f_Θ(x_k)) = ln ∫ exp(f_Θ(x_k) · t) p_0(t) dt denotes the cumulant generating function of p(t|x_k), and ...

162 | Hyperdimensional data analysis using parallel coordinates
- Wegman
- 1990
Citation Context: ...the minimum and maximum values in each variable of the cluster. This can be depicted using any multivariate visualisation technique that shows all the variables: [36] uses parallel coordinates [13], [34], star glyphs [24], scatterplot matrices and dimensional stacking [20]. A band is assigned the colour of the cluster it represents. The strategy for this, called proximity-based colouring, maps colours...

91 | Exploring n-dimensional databases
- LeBlanc, Ward, et al.
- 1990
Citation Context: ...can be depicted using any multivariate visualisation technique that shows all the variables: [36] uses parallel coordinates [13], [34], star glyphs [24], scatterplot matrices and dimensional stacking [20]. A band is assigned the colour of the cluster it represents. The strategy for this, called proximity-based colouring, maps colours by cluster proximity based on the structure of the hierarchical tree....

87 | A hierarchical latent variable model for data visualization
- Bishop, Tipping
- 1998
Citation Context: ...al projection of high-dimensional data may not be sufficient to capture all of the interesting aspects of the data. Therefore, hierarchical extensions of visualisation methods [7], [22] have been developed. These allow the user to 'drill down' into the data; each plot covers a smaller region and it is therefore easier to discern the structure of data. Also plots may be at an a...

68 | Recursive pattern: A technique for visualizing very large amounts of data
- Keim, Ankerst, et al.
- 1995
Citation Context: ...visualisation of large multivariate datasets has been growing recently; as well as the generative approach taken by [7], [27] and this paper, more heuristic methods have also been developed [36], [19], [16]. We have selected the first of these, Interactive Hierarchical Displays (IHDs), as a benchmark for two main reasons: it is recent work that unifies several features of earlier techniques and it is th...

62 | Script recognition with hierarchical feature maps
- Miikkulainen
- 1990
Citation Context: ...projection of high-dimensional data may not be sufficient to capture all of the interesting aspects of the data. Therefore, hierarchical extensions of visualisation methods [7], [22] have been developed. These allow the user to 'drill down' into the data; each plot covers a smaller region and it is therefore easier to discern the structure of data. Also plots may be at an angle a...

57 | Glyphmaker: Creating customized visualization of complex data
- Ribarsky, Ayers, et al.
- 1994
Citation Context: ...maximum values in each variable of the cluster. This can be depicted using any multivariate visualisation technique that shows all the variables: [36] uses parallel coordinates [13], [34], star glyphs [24], scatterplot matrices and dimensional stacking [20]. A band is assigned the colour of the cluster it represents. The strategy for this, called proximity-based colouring, maps colours by cluster proxim...

49 | A probabilistic classification system for predicting the cellular localization sites of proteins
- Horton, Nakai
- 1996
Citation Context: ...distribution. Our findings are confirmed by the poor classification results (around 55% accuracy) obtained on this data set using various classification techniques [12]. B. Comparison: Although the primary focus of this paper is on automating the development of hierarchical models, it is also useful to compare our results with another hierarchical visualisation techn...

34 | A component-wise em algorithm for mixtures
- Celeux, Chrétien, et al.
- 2001
Citation Context: ...A_max points act as centres of regions of interest in the data space. In other words, they play the role of vectors z(c_i) from section III-B. As in [11], we adopt the component-wise EM (CEM) algorithm [9], i.e. rather than simultaneously updating all the LTMs, we first update the parameters Θ(1) of the first LTM (13), while parameters of the remaining LTMs are fixed, then we recompute the component r...
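The component-wise update order described in this snippet can be sketched for a plain 1-D Gaussian mixture rather than the paper's LTM mixture: each component is refitted in turn, with responsibilities recomputed in between, instead of one batch update of all components as in standard EM. All names here are illustrative.

```python
import numpy as np

def cem_step(data, mus, sigmas, pis):
    """One sweep of component-wise EM (CEM) for a 1-D Gaussian mixture:
    update each component's parameters in turn, recomputing responsibilities
    before every partial M-step (the 1/sqrt(2*pi) constant cancels)."""
    for m in range(len(mus)):
        # E-step with the *current* parameters of all components
        dens = np.array([p * np.exp(-0.5 * ((data - mu) / s) ** 2) / s
                         for mu, s, p in zip(mus, sigmas, pis)])
        R = dens / dens.sum(axis=0, keepdims=True)
        # Partial M-step: update only component m
        w = R[m]
        mus[m] = np.sum(w * data) / w.sum()
        sigmas[m] = np.sqrt(np.sum(w * (data - mus[m]) ** 2) / w.sum())
        pis[m] = w.mean()
    pis /= pis.sum()                 # renormalise mixing coefficients
    return mus, sigmas, pis
```

On well-separated data a few such sweeps drive the component means to the cluster centres, mirroring the fixed-then-updated schedule described for the LTMs.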

34 | Intrinsic Classification by MML – the snob program
- Wallace, Dowe
- 1994
Citation Context: ...learning of mixture models in [29] and was extended to hierarchical models in [8]. A computer program, Snob, that uses these principles for both parameter estimation and model selection was described in [30]; this provides a flat clustering model. The hierarchical model used in these papers differs from ours in three main ways: firstly, only the leaf nodes define a probability density, while our hierarch...

32 | MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions
- Wallace, Dowe
Citation Context: ...Jeffreys' prior, the algorithm selects the "appropriate" number of components while the parameters of each model are estimated by ML. (A similar approach for other density models was formulated in [30] and [32].) The novelty of their proposed approach is that parameter estimation and model selection are integrated in a single EM algorithm, rather than using a model selection criterion on a set of pre-estima...

28 | Interactive hierarchical displays: a general framework for visualization and exploration of large multivariate data sets
- Yang, Ward, et al.
- 2003
Citation Context: ...interest in visualisation of large multivariate datasets has been growing recently; as well as the generative approach taken by [7], [27] and this paper, more heuristic methods have also been developed [36], [19], [16]. We have selected the first of these, Interactive Hierarchical Displays (IHDs), as a benchmark for two main reasons: it is recent work that unifies several features of earlier techniques ...

21 | A combined latent class and trait model for the analysis and visualization of discrete data
- Kabán, Girolami
- 2001
Citation Context: ...described in [5] and a generalisation of this to the Latent Trait Model (LTM), a latent variable model class whose noise models are selected from the exponential family of distributions, was developed in [14]. In this paper we extend the hierarchical GTM (HGTM) visualisation system to incorporate LTMs. This enables us to visualise data of an inherently discrete nature, e.g. collections of documents. A hie...

21 | Minimum-entropy data partitioning using reversible jump Markov chain Monte Carlo
- Roberts, Holmes, et al.
Citation Context: ...that make visualisation plots at higher levels complex and difficult to deal with in an interactive manner. An intuitively simple but flawed approach would be to use a data partitioning technique (e.g. [25]) for segmenting the data set, followed by constructing visualisation plots in the individual compartments. Clearly, in this case there would be no direct connection between the criterion for choosing...

18 | Hierarchical GTM: Constructing localized nonlinear projection manifolds in a principled way
- Tino, Nabney
- 2002
Citation Context: ...clusters may be split apart instead of lying on top of each other. Recently, we have developed a general and principled approach to the interactive construction of non-linear visualisation hierarchies [27], the basic building block of which is the Generative Topographic Mapping (GTM) [4]. GTM is a probabilistic reformulation of the self-organizing map (SOM) [17] in the form of a non-linear latent varia...

17 | Magnification factors for the GTM algorithm
- Bishop, Svensén, et al.
- 1997
Citation Context: ...indicated that magnification factors may provide valuable additional information to the user's understanding of the visualisation plots, since they can highlight the boundaries between data clusters. In [6], formulas for magnification factors were only derived for the GTM. In this paper, we derive formulas for magnification factors in full generality for latent trait models. In the next section we brief...

17 | An information measure for hierarchic classification
- Boulton, Wallace
- 1973
Citation Context: ...ing the data given the model: Length(θ, ζ) = Length(θ) + Length(ζ|θ). The MML principle was first applied to unsupervised learning of mixture models in [29] and was extended to hierarchical models in [8]. A computer program, Snob, that uses these principles for both parameter estimation and model selection was described in [30]; this provides a flat clustering model. The hierarchical model used in th...

15 | GTM: The Generative Topographic Mapping
- Bishop, Svensén, et al.
- 1998
Citation Context: ...fication factors, data visualization, document mining. 1 INTRODUCTION: Topographic visualization of multidimensional data has been an important method of data analysis and data mining for several years [4], [18]. Visualization is an effective way for domain experts to detect clusters, outliers, and other important structural features in data. In addition, it can be used to guide the data mining process...

14 | Minimum Message Length and Kolmogorov Complexity
- Wallace, Dowe
- 1999
Citation Context: ...models. Given a set ζ = {t_1, t_2, ..., t_N} of data points, minimum message length (MML) strategies select, among the models inferred from ζ, the one which minimizes the length of the message transmitting ζ [28]. Given that the data is modeled by a parametric probabilistic model P(ζ|θ), the message consists of two parts – one specifying the model parameters, the other specifying the data given the model: Le...

11 | Visualizing Changes in the Structure of Data for Exploratory Feature Selection
- Pampalk, Goebl, et al.
- 2003
Citation Context: ...experts to detect clusters, outliers and other important structural features in data. In addition, it can be used to guide the data mining process itself by giving feedback on the results of analysis [23]. In this paper we use latent variable models to visualise data, so that a single plot may contain several data clusters; our aim is to provide sufficiently informative plots that the clusters can be ...

11 | Iteratively reweighted least squares algorithms, convergence analysis, and numerical comparisons
- Wolke, Schwetlick
- 1988
Citation Context: ...After the initialisation of each child model, the full hierarchical training described in section III-A is used. (Footnote 4: In this partial M-step we could alternatively use iteratively reweighted least squares [35].) Fig. 1. An example of strongly overlapping clusters: visualisation of a collection of documents with a single Latent Trait Model. The documents are classified according to th...
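The alternative mentioned in the quoted footnote, iteratively reweighted least squares, is equivalent to Newton's method for generalized linear models. The sketch below is a generic logistic-regression IRLS, not the paper's partial M-step; all names are illustrative.

```python
import numpy as np

def irls_logistic(X, y, iters=10):
    """Iteratively reweighted least squares for logistic regression: each
    iteration is a Newton step that solves a weighted least-squares problem
    with weights W = p(1 - p) on the diagonal."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # current predicted probabilities
        W = p * (1 - p)                          # diagonal IRLS weights
        H = X.T @ (W[:, None] * X)               # X^T W X
        w = w + np.linalg.solve(H, X.T @ (y - p))  # Newton update
    return w
```

On non-separable data this converges in a handful of iterations, which is why IRLS is a plausible drop-in for a partial M-step over exponential-family noise models.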

9 | A two-way visualization method for clustered data
- Koren, Harel
Citation Context: ...in visualisation of large multivariate datasets has been growing recently; as well as the generative approach taken by [7], [27] and this paper, more heuristic methods have also been developed [36], [19], [16]. We have selected the first of these, Interactive Hierarchical Displays (IHDs), as a benchmark for two main reasons: it is recent work that unifies several features of earlier techniques and it...

4 | General framework for a principled hierarchical visualization of high-dimensional data
- Kabán, Tino, et al.
- 2002
Citation Context: ...tion [14]. If the data contains outliers, a Student t-distribution may be more appropriate than the Gaussian used in HGTM. Preliminary results of organising LTMs into a hierarchy have been encouraging [15], and motivated the work described in this paper. The hierarchical LTM arranges a set of LTMs and their corresponding plots in a tree structure T. The Root is at level 1, children of level-ℓ models a...

3 | Approximation of improper prior measures by proper probability measures
- Stein
- 1965
Citation Context: ...rs. As pointed out in [31], the use of the non-informative Jeffreys' prior in general raises problems from the Bayesian point of view. For instance, improper priors may lead to inadmissible estimates [26]. However, such priors have been extensively used mainly due to mathematical convenience: we do not have to compute the Fisher information matrix (typically a computationally expensive step). Neverthe...

1 | Voronoi diagrams—survey of a fundamental geometric data structure
- Aurenhammer
- 1991
Citation Context: ...space. The points c_i selected in the latent space H correspond to the "centres" of these regions. These "centres" of the "regions of interest" are mapped back to the data space and Voronoi compartments [1] defined by the mapped points z(c_i) ∈ D, where z is the map (3) of the corresponding LTM, are calculated in the data space. In the case of a Gaussian noise model, the child LTMs are initialized by loc...
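Computationally, the Voronoi-compartment construction described above reduces to nearest-centre assignment: each data point belongs to the compartment of the closest mapped centre z(c_i). A minimal sketch with made-up centres and points:

```python
import numpy as np

def voronoi_compartments(data, centres):
    """Assign each data point to the Voronoi compartment of its nearest
    centre: returns, for each of the N points, the index of the closest
    of the A centres (squared Euclidean distance)."""
    # Squared distances, shape (N, A), via broadcasting
    d2 = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

centres = np.array([[0.0, 0.0], [10.0, 0.0]])   # stand-ins for mapped z(c_i)
data = np.array([[1.0, 1.0], [9.0, -1.0], [4.0, 0.0]])
print(voronoi_compartments(data, centres))      # -> [0 1 0]
```

The points falling in each compartment would then serve to initialise the corresponding child model.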

1 | Parallel coordinates: A tool for visualizing multidimensional geometry
- Inselberg, Dimsdale
- 1990
Citation Context: ...giving the minimum and maximum values in each variable of the cluster. This can be depicted using any multivariate visualisation technique that shows all the variables: [36] uses parallel coordinates [13], [34], star glyphs [24], scatterplot matrices and dimensional stacking [20]. A band is assigned the colour of the cluster it represents. The strategy for this, called proximity-based colouring, maps c...
