## Optimal cluster preserving embedding of nonmetric proximity data (2003)

### Download Links

- www.inf.ethz.ch
- www.informatik.uni-bonn.de
- DBLP

### Other Repositories/Bibliography

Venue: IEEE Trans. Pattern Analysis and Machine Intelligence

Citations: 43 (4 self)

### BibTeX

@ARTICLE{Roth03optimalcluster,
  author  = {Volker Roth and Julian Laub and Motoaki Kawanabe and Joachim M. Buhmann},
  title   = {Optimal cluster preserving embedding of nonmetric proximity data},
  journal = {IEEE Trans. Pattern Analysis and Machine Intelligence},
  year    = {2003},
  volume  = {25},
  number  = {12},
  pages   = {1540--1551}
}

### Abstract

For several major applications of data analysis, objects are often not represented as feature vectors in a vector space, but rather by a matrix gathering pairwise proximities. Such pairwise data often violates metricity and, therefore, cannot be naturally embedded in a vector space. Concerning the problem of unsupervised structure detection or clustering, this paper introduces a new method for embedding pairwise data into Euclidean vector spaces. We show that all clustering methods which are invariant under additive shifts of the pairwise proximities can be reformulated as grouping problems in Euclidean spaces. The most prominent property of this constant shift embedding framework is the complete preservation of the cluster structure in the embedding space. Restating pairwise clustering problems in vector spaces has several important consequences, such as the statistical description of the clusters by way of cluster prototypes, the generic extension of the grouping procedure to a discriminative prediction rule, and the applicability of standard preprocessing methods like denoising or dimensionality reduction.

Index Terms: Clustering, pairwise proximity data, cost function, embedding, MDS.
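
The constant shift embedding procedure summarized in the abstract lends itself to a compact numerical sketch: center the dissimilarity matrix into S^c = -1/2 QDQ, shift the diagonal of S^c by its most negative eigenvalue so the result is positive semidefinite, and read embedded vectors off the eigendecomposition. The snippet below is my own minimal reading of that recipe, not the authors' reference code; the function name and toy matrix are illustrative:

```python
# Minimal sketch of constant shift embedding (my reading of the
# paper's recipe; function name and toy data are illustrative).
import numpy as np

def constant_shift_embedding(D):
    """Embed a symmetric, zero-diagonal dissimilarity matrix D.

    1. Center:  S^c = -1/2 * Q D Q  with  Q = I - (1/n) 1 1^T.
    2. Shift:   S~  = S^c - lambda_min(S^c) * I   (now PSD).
    3. Factor:  S~  = X X^T  via the eigendecomposition.
    """
    n = D.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n
    Sc = -0.5 * Q @ D @ Q
    lam_min = np.linalg.eigvalsh(Sc).min()
    S_shift = Sc - min(lam_min, 0.0) * np.eye(n)
    w, V = np.linalg.eigh(S_shift)
    X = V * np.sqrt(np.clip(w, 0.0, None))  # rows = embedded points
    return X

# Toy "distances" violating the triangle inequality (1 + 1 < 4)
D = np.array([[0.0,  1.0, 16.0],
              [1.0,  0.0,  1.0],
              [16.0, 1.0,  0.0]])
X = constant_shift_embedding(D)
# Squared Euclidean distances of the embedded points differ from D
# only by one constant added to every off-diagonal entry.
```

Because every off-diagonal dissimilarity is shifted by the same constant, any clustering criterion that is invariant under additive shifts yields the same partition on the embedded vectors as on the original proximity matrix.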

### Citations

3277 | The Self-Organizing Map
- Kohonen
- 1990

Citation Context: ... PCA, we can also apply any other standard method for dimensionality reduction or visualization, such as projection pursuit [5], local linear embedding (LLE) [13], Isomap [18] or self-organizing maps [7]. We now have presented various aspects of constant shift embedding. The next section contains a detailed analysis of the differences to the classical MDS approach. 3.2 Comparison between MDS and co...

2617 | Normalized cuts and image segmentation
- Shi, Malik

Citation Context: ...we will focus on those techniques that explicitly minimize a certain cost function. This subset of clustering methods includes e.g. graph-theoretic approaches like several variations of Cut criteria [16], and many methods derived from an axiomatization of pairwise cost functions in [12]. From a theoretical viewpoint, cost-based clustering methods are interesting insofar as many properties of the gro...

2039 | Principal Component Analysis
- Jolliffe
- 2002

Citation Context: ... sciences measure so-called preference data. The missing vector space representation precludes the use of well established clustering or classification techniques such as Principal Component Analysis [1] or Support Vector Machines [2]. Nonvectorial data sets as such are difficult to handle and, for data mining purposes, we need to relate them to some mathematical concept. A common approach is to repl...

1936 | Pattern classification
- Duda, Hart, et al.
- 2001

Citation Context: ...ering cost function and the classical k-means grouping criterion in the embedding space. 2 PROXIMITY-BASED CLUSTERING Unsupervised grouping or clustering aims at extracting hidden structure from data [4]. The term data refers to both a set of objects and a set of corresponding object representations resulting from some physical measurement process. Different types of object representations are possib...

1698 | A Global Geometric Framework for Nonlinear Dimensionality Reduction
- Tenenbaum, Silva, et al.
- 2000

Citation Context: ...n R^(n-2) found by loss-free kernel PCA, we can also apply any other standard method for dimensionality reduction or visualization, such as projection pursuit [5], local linear embedding (LLE) [13], Isomap [18] or self-organizing maps [7]. We now have presented various aspects of constant shift embedding. The next section contains a detailed analysis of the differences to the classical MDS approach. 3.2 Co...

1629 | Nonlinear dimensionality reduction by locally linear embedding
- Roweis, Saul

Citation Context: ...vectors in R^(n-2) found by loss-free kernel PCA, we can also apply any other standard method for dimensionality reduction or visualization, such as projection pursuit [5], local linear embedding (LLE) [13], Isomap [18] or self-organizing maps [7]. We now have presented various aspects of constant shift embedding. The next section contains a detailed analysis of the differences to the classical MDS app...

1309 | Data clustering: a review
- Jain, Murty, et al.
- 1999

Citation Context: ...cond case we are given an n x n pairwise proximity matrix. The problem of grouping vectorial data has been widely studied in the literature, and many clustering algorithms have been proposed (see e.g. [1][3]). One of the most popular methods is k-means clustering. It derives a set of k prototype vectors which quantize the data set with minimal quantization error. Partitioning proximity data is conside...

1058 | Nonlinear component analysis as a kernel eigenvalue problem
- Scholkopf, Smola, et al.
- 1998

Citation Context: ...oise reduction step into the clustering procedure. While it is not clear how to implement this data preprocessing for the original pairwise object representations, we can easily apply kernel PCA (see [14]) in the constant-shift setting. This will be the main focus of the next section. 3.1 Constant shift embedding and denoising. We now dispose of a distance matrix D̃ which derives from real points in...

408 | Multidimensional scaling
- Cox

Citation Context: ...ansformed into vectorial problems by means of classical embedding strategies. Presumably, the most popular classical embedding method for nonmetric data is Multidimensional Scaling (MDS) (see, e.g., [6] for a recent overview), where one seeks a low-dimensional representation of data such that the distortion of the pairwise dissimilarities D_ij is minimal with respect to some cost function. One widely...

380 | An introduction to kernel-based learning algorithms
- Müller, Mika, et al.
- 2001

Citation Context: ...ata is not available as points in a vector space. This precludes the use of well established clustering or classification techniques such as Principal Component Analysis [6] or Support Vector Machines [11]. For instance, genomics typically produce data represented as strings from some alphabet, psychology yields sets of similarity judgments, yet other fields like social sciences measure so called prefer...

255 | Theory and Methods of Scaling
- Torgerson
- 1958

Citation Context: ...S in S_D, the centralized version S^c is identical and unique. We now state the following: Theorem 3.1. D derives from a squared Euclidean distance if and only if S^c is positive semi-definite. Proof. [19] referring to [20], or the following simple argument: (=>) Since D derives from a squared Euclidean distance, we can take vectors x_1, ..., x_n in R^d (d <= n-1) which satisfy D_ij = ||x_i - x_j||^2. Then...
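
The theorem quoted in this context is easy to exercise numerically: D is a squared Euclidean distance matrix exactly when the centered similarity matrix S^c = -1/2 QDQ has no negative eigenvalues. A small self-contained check (variable names are mine):

```python
# Numerical check of Theorem 3.1 (Torgerson / Young-Householder):
# D is squared Euclidean  <=>  S^c = -1/2 Q D Q is PSD.
import numpy as np

def is_squared_euclidean(D, tol=1e-10):
    n = D.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n
    Sc = -0.5 * Q @ D @ Q                  # centered similarities
    return np.linalg.eigvalsh(Sc).min() >= -tol

# Squared distances of the collinear points 0, 1, 2: embeddable
D_good = np.array([[0.0, 1.0, 4.0],
                   [1.0, 0.0, 1.0],
                   [4.0, 1.0, 0.0]])
# Triangle-inequality violation (distances 1, 1, 4): not embeddable
D_bad = np.array([[0.0,  1.0, 16.0],
                  [1.0,  0.0,  1.0],
                  [16.0, 1.0,  0.0]])
# is_squared_euclidean(D_good) -> True, is_squared_euclidean(D_bad) -> False
```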

198 | Pairwise Data Clustering by Deterministic Annealing
- Hofmann, Buhmann
- 1997

Citation Context: ...e pairwise clustering cost function (see eq. (34)). It is of particular interest, since it combines the properties of additivity, scale- and shift-invariance, and statistical robustness, see [12]. In [4] this grouping problem is stated as a combinatorial optimization problem, which is optimized in a deterministic annealing framework after applying a mean-field approximation. According to the theorems ...

191 | Data Clustering: A Review
- Jain, Murty, et al.
- 1999

Citation Context: ...the second case, we are given an n x n pairwise proximity matrix. The problem of grouping vectorial data has been widely studied in the literature, and many clustering algorithms have been proposed [4], [5]. One of the most popular methods is k-means clustering. It derives a set of k prototype vectors which quantize the d...

168 | A prediction-based resampling method for estimating the number of clusters in a dataset
- Dudoit, Fridlyand

Citation Context: ... the embedding space, a deterministic annealing method was applied. Concerning the selection of the “correct” number of clusters, we used the concept of cluster stability which has been introduced in [25] and refined in [26]. The main idea is to draw resamples from the data set and then to compare the inferred data-partitions across these resamples. The variations of the partitions are transformed int...

125 | Kernel PCA and de-noising in feature spaces
- Mika, Scholkopf, et al.
- 1998

Citation Context: ...rather interested in some approximative reconstructions of the real vectors. In the PCA framework, one usually assumes that the directions corresponding to small eigenvalues contain the noise (c.f. [10], [15]). We can thus obtain a representation in a space of reduced dimension (with the well-defined error of PCA reconstruction) when choosing t <= n-2 in step 3 of the above algorithm: X_t = V_t (...

113 | Input space vs. feature space in kernel-based methods
- Schölkopf, Mika, et al.
- 1999

Citation Context: ...r interested in some approximative reconstructions of the real vectors. In the PCA framework, one usually assumes that the directions corresponding to small eigenvalues contain the noise (c.f. [10], [15]). We can thus obtain a representation in a space of reduced dimension (with the well-defined error of PCA reconstruction) when choosing t <= n-2 in step 3 of the above algorithm: X_t = V_t (Λ_t)^(1...

102 | Discussion of a Set of Points in Terms of Their Mutual Distances
- Young, Householder
- 1938

Citation Context: ...ralized version S^c is identical and unique. We now state the following: Theorem 3.1. D derives from a squared Euclidean distance if and only if S^c is positive semi-definite. Proof. [19] referring to [20] or the following simple argument: (=>) Since D derives from a squared Euclidean distance, we can take vectors x_1, ..., x_n in R^d (d <= n-1) which satisfy D_ij = ||x_i - x_j||^2. Then, D^c_ij = D_ij ...

92 | A deterministic annealing approach to clustering
- Rose, Gurewitz, et al.
- 1990

Citation Context: ...al the shifted k-means cost function H_km(D̃), for which the mean-field approximation becomes exact. For details on annealing and mean-field approximations, the interested reader is referred to [10], [21]. If one decides to insert a denoising/dimensionality-reduction step into the clustering procedure, this will usually not only speed up the computations, but it will also “robustify” optimization heuri...

92 | Improved tools for biological sequence comparison
- Pearson, Lipman
- 1988

Citation Context: ...n the embedding space. From the SWISSPROT and TrEMBL databases [22], we extracted all of the approximately 1,200 sequences annotated as “globins” or as “globin-like.” The heuristic FASTA scoring method [23] was used for computing pairwise alignment scores which, in turn, were length-corrected (a Bayesian approach for correcting local alignments, following [24]) and normalized to the length of the alignm...
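
The score-to-dissimilarity conversion quoted above, D_ij = S_ii + S_jj - 2S_ij, is a one-line matrix operation. A small sketch, assuming a symmetric score matrix (the helper name and toy scores are mine):

```python
# Converting symmetric pairwise scores S into dissimilarities via
# D_ij = S_ii + S_jj - 2*S_ij (as in the globin experiment above).
import numpy as np

def scores_to_dissimilarities(S):
    d = np.diag(S)
    return d[:, None] + d[None, :] - 2.0 * S

S = np.array([[4.0, 2.0],
              [2.0, 3.0]])
D = scores_to_dissimilarities(S)
# D[0, 1] = 4 + 3 - 2*2 = 3, and the diagonal of D is zero
```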

89 | Nonlinear Dimensionality Reduction by Locally Linear Embedding
- Roweis, Saul
- 2000

Citation Context: ... the exactly reconstructed vectors in IR^p, we can also apply any other standard method for dimensionality reduction or visualization, such as projection pursuit [16], locally linear embedding (LLE) [17], Isomap [18], or self-organizing maps [19]. The latter methods can also be viewed as approximations of the optimal structure preserving vectors, employing, however, an approximation criterion differen...

72 | Classification with nonmetric distances: Image retrieval and class representation
- Jacobs, Weinshall, et al.
Citation Context: ...y a complex alignment algorithm. This procedure yields a matrix gathering the pairwise relations between the original objects, which may be the starting point of intelligent data analysis, see, e.g., [3] for an example of such a procedure in the field of image retrieval. We like to stress here that such a matrix is by no means naturally related to the common viewpoint of objects being embedded in som...

47 | Principal Component Analysis
- Jolliffe
- 1986

Citation Context: ...r several major applications, data is not available as points in a vector space. This precludes the use of well established clustering or classification techniques such as Principal Component Analysis [6] or Support Vector Machines [11]. For instance, genomics typically produce data represented as strings from some alphabet, psychology yields sets of similarity judgments, yet other fields like social s...

40 | Projection pursuit
- Huber
- 1985

Citation Context: ...t given the exactly reconstructed vectors in R^(n-2) found by loss-free kernel PCA, we can also apply any other standard method for dimensionality reduction or visualization, such as projection pursuit [5], local linear embedding (LLE) [13], Isomap [18] or self-organizing maps [7]. We now have presented various aspects of constant shift embedding. The next section contains a detailed analysis of the di...

34 | Theory of Multidimensional Scaling
- Leeuw, Heiser
- 1982

Citation Context: ...ortion of the distance D in MDS is measured by several criteria such as SSTRESS(D, D̃) = tr(D - D̃)^2, (19) and STRAIN(D, D̃) = tr{Q(D - D̃)Q(D - D̃)}, (20) where D and D̃ are distance matrices, cf. [8]. In the following we will compare the distortions of MDS and the constant shift method by the measure STRAIN. This measure can be transformed as STRAIN(D, D̃) = tr(QDQ - QD̃Q)^2 = 4 tr(S^c - S̃^c)^2. Le...
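
The STRAIN identity in this context, tr{Q(D - D̃)Q(D - D̃)} = 4 tr((S^c - S̃^c)^2), follows from Q being idempotent and the cyclicity of the trace, and can be verified numerically. The random test matrices below are illustrative, not from the paper:

```python
# Numerical check: STRAIN(D, D~) = tr{Q(D-D~)Q(D-D~)} equals
# 4 * tr((S^c - S~^c)^2) with S^c = -1/2 Q D Q (Q idempotent).
import numpy as np

rng = np.random.default_rng(0)
n = 6
Q = np.eye(n) - np.ones((n, n)) / n

def random_dissimilarity(rng, n):
    M = rng.random((n, n))
    M = (M + M.T) / 2.0          # symmetrize
    np.fill_diagonal(M, 0.0)     # zero self-dissimilarity
    return M

D, Dt = random_dissimilarity(rng, n), random_dissimilarity(rng, n)
strain = np.trace(Q @ (D - Dt) @ Q @ (D - Dt))
Sc, Sct = -0.5 * Q @ D @ Q, -0.5 * Q @ Dt @ Q
rhs = 4.0 * np.trace((Sc - Sct) @ (Sc - Sct))
# strain and rhs agree up to floating-point rounding
```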

32 | A Theory of Proximity Based Clustering: Structure Detection by Optimization
- Puzicha, Hofmann, et al.
- 1999
Citation Context: ... This subset of clustering methods includes e.g. graph-theoretic approaches like several variations of Cut criteria [16], and many methods derived from an axiomatization of pairwise cost functions in [12]. From a theoretical viewpoint, cost-based clustering methods are interesting insofar as many properties of the grouping solution can be derived by analyzing invariance properties of the cost functio...

32 | Nonmetric Individual Differences Multidimensional Scaling: An Alternating Least Squares Method with Optimal Scaling Features
- Takane, Young, de Leeuw
- 1977

Citation Context: ...-dimensional representation of data such that the distortion of the pairwise dissimilarities D_ij is minimal with respect to some cost function. One widely used cost function is the SSTRESS criterion [7]: J = sum_{i,j=1}^n w_ij (d_ij^2 - D_ij^2)^2, (1) where d_ij = ||x_i - x_j|| are the transformed distances in low-dimensional space, and w_ij are weights. Typically, these weights read: w_ij = 1/(n(n-1)D_ij^2), or w_ij = 1/(sum_{k,...
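
The SSTRESS criterion quoted in this context is straightforward to evaluate for a candidate embedding. A minimal sketch with uniform weights (function name and toy points are mine):

```python
# Weighted SSTRESS J = sum_ij w_ij (d_ij^2 - D_ij^2)^2, where
# d_ij = ||x_i - x_j|| are distances in the embedding space.
import numpy as np

def sstress(X, D, W=None):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    if W is None:
        W = np.ones_like(D)      # uniform weights for simplicity
    return float((W * (d2 - D ** 2) ** 2).sum())

# A distortion-free embedding of three collinear points scores 0
X = np.array([[0.0], [1.0], [3.0]])
D = np.abs(X - X.T)              # their true pairwise distances
# sstress(X, D) -> 0.0
```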

25 | Biological Sequence Analysis
- Durbin, Eddy, et al.
- 1998

Citation Context: ...n-like.” The heuristic FASTA scoring method [23] was used for computing pairwise alignment scores which, in turn, were length-corrected (a Bayesian approach for correcting local alignments, following [24]) and normalized to the length of the alignment. From the pair-scores S_ij, we derived dissimilarities by setting D_ij = S_ii + S_jj - 2S_ij. The eigenvalue spectrum of the centered matrix S^c shows some hi...

11 | Pattern Classification
- Duda, Hart, et al.
- 2001

Citation Context: ...scuss the relations of our embedding with graph-theoretic clustering methods (section 4). 2 Proximity-based clustering. Unsupervised grouping or clustering aims at extracting hidden structure from data [3]. The term data refers to both a set of objects and a set of corresponding object representations resulting from some physical measurement process. Different types of object representations are possibl...

11 | On the complexity of clustering problems
- Brucker
- 1978

Citation Context: ...ise clustering cost function reads: H^pc = (1/2) sum_{v=1}^k (sum_{i=1}^n sum_{j=1}^n M_iv M_jv D_ij) / (sum_{l=1}^n M_lv). (3) The optimal assignments M-hat are obtained by minimizing H^pc. The minimization itself is an NP-hard problem [11], and some approximation heuristics have been proposed: In [10], a mean field annealing framework has been presented (see the discussion in Section 4 of this work for some comments and new results on ...
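
Equation (3) quoted above can be evaluated directly for a given hard assignment. The helper below is my own sketch, encoding the binary assignments M_iv as an integer label vector; the toy dissimilarity matrix is illustrative:

```python
# Pairwise clustering cost H^pc = 1/2 * sum_v (sum over pairs in
# cluster v of D_ij) / n_v, for hard assignments given as labels.
import numpy as np

def pairwise_clustering_cost(D, labels, k):
    cost = 0.0
    for v in range(k):
        idx = np.flatnonzero(labels == v)
        if idx.size:             # skip empty clusters
            cost += 0.5 * D[np.ix_(idx, idx)].sum() / idx.size
    return cost

# Two tight pairs, far apart: the natural 2-clustering is cheap
D = np.array([[0.0, 1.0, 9.0, 9.0],
              [1.0, 0.0, 9.0, 9.0],
              [9.0, 9.0, 0.0, 1.0],
              [9.0, 9.0, 1.0, 0.0]])
good = pairwise_clustering_cost(D, np.array([0, 0, 1, 1]), 2)
bad  = pairwise_clustering_cost(D, np.array([0, 1, 0, 1]), 2)
# good = 1.0, bad = 9.0
```

Since the cost is minimized rather than computed in closed form, the paper resorts to the annealing heuristics discussed in the surrounding contexts; this helper only scores a given partition.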

7 | Investigations of measures for grouping by graph partitioning
- Soundararajan, Sarkar
Citation Context: ...ect large differences between the models considered. This may explain the somewhat surprising results of a large-scale comparison study of graph partitioning algorithms for image segmentation tasks in [17]. 5 Discussion and Conclusion. We have introduced an optimal embedding procedure for pairwise clustering by means of constant shift embedding. For the class of shift-invariant clustering methods, it ou...

4 | Discussion of a Set of Points in Terms of Their Mutual Distances
- Young, Householder
- 1938

Citation Context: ...ng member of S_D, since the following theorem holds. Theorem 1. D derives from a squared Euclidean distance, i.e., D_ij = ||x_i - x_j||^2, if and only if S^c is positive semidefinite. Proof. [12] referring to [13]. For general dissimilarities, S^c will be indefinite. By shifting its diagonal elements, however, we can transform it into a positive semidefinite matrix: The following lemma states that, for any m...

1 | Advances in Neural Information Processing Systems 13 (NIPS)
- Meila, Shi
- 2001

Citation Context: ..., the differences between Averaged Association, Averaged Cut and Normalized Cut become vanishingly small. In such situations, all three methods can be reasonably well approximated by k-means (see also [9]). A graph G = (V, E) can be partitioned into disjoint sets A_v, v = 1, ..., k by removing edges: union_{v=1}^k A_v = V, with A_v ∩ A_u = ∅ for v != u. Following [16], we define the dissimilarity between the ...