Results 1-10 of 17
Applications of Resampling Methods to Estimate the Number of Clusters and to Improve the Accuracy of a Clustering Method
, 2001
Cited by 169 (0 self)
The burgeoning field of genomics, and in particular microarray experiments, has revived interest in both discriminant and cluster analysis by raising new methodological and computational challenges. The present paper discusses applications of resampling methods to problems in cluster analysis. A resampling method, known as bagging in discriminant analysis, is applied to increase clustering accuracy and to assess the confidence of cluster assignments for individual observations. A novel prediction-based resampling method is also proposed to estimate the number of clusters, if any, in a dataset. The performance of the proposed and existing methods is compared using simulated data and gene expression data from four recently published cancer microarray studies.
Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data
 MACHINE LEARNING, FUNCTIONAL GENOMICS SPECIAL ISSUE
, 2003
On the Performance of Clustering in Hilbert Spaces
Cited by 14 (1 self)
Based on $n$ randomly drawn vectors in a separable Hilbert space, one may construct a $k$-means clustering scheme by minimizing an empirical squared error. We investigate the risk of such a clustering scheme, defined as the expected squared distance of a random vector $X$ from the set of cluster centers. Our main result states that, for an almost surely bounded $X$, the expected excess clustering risk is $O(\sqrt{1/n})$. Since clustering in high (or even infinite)-dimensional spaces may lead to severe computational problems, we examine the properties of a dimension reduction strategy for clustering based on Johnson-Lindenstrauss-type random projections. Our results reflect a tradeoff between accuracy and computational complexity when one uses $k$-means clustering after random projection of the data to a low-dimensional space. We argue that random projections work better than other simplistic dimension reduction schemes. Index Terms: clustering, empirical risk minimization, Hilbert space, $k$-means, random projections, vector quantization.
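The strategy described in this abstract (random projection to a low-dimensional space, then $k$-means by empirical squared-error minimization) can be sketched roughly as follows. This is a minimal illustration and not the authors' procedure: the Gaussian projection matrix, the plain Lloyd iterations, the toy data, and all function names are assumptions chosen for the example.

```python
import math
import random

def random_projection(points, target_dim, rng):
    """Project points with a Gaussian matrix scaled by 1/sqrt(target_dim) (assumed JL-type map)."""
    source_dim = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
              for _ in range(target_dim)]
    return [[sum(m * x for m, x in zip(row, p)) for row in matrix] for p in points]

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def k_means(points, k, rng, iterations=50):
    """Plain Lloyd iterations started from k sampled points; returns (centers, labels)."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

def empirical_risk(points, centers):
    """Average squared distance from each point to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points) / len(points)

# Hypothetical toy data: two tight groups in 50 dimensions.
rng = random.Random(0)
group_a = [[rng.gauss(0.0, 0.1) for _ in range(50)] for _ in range(10)]
group_b = [[rng.gauss(10.0, 0.1) for _ in range(50)] for _ in range(10)]
data = group_a + group_b

projected = random_projection(data, target_dim=5, rng=rng)
centers, labels = k_means(projected, k=2, rng=rng)
risk = empirical_risk(projected, centers)
```

The risk here is measured in the projected space; the tradeoff the abstract mentions would show up when comparing it with the risk of centers fitted in the original space.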
Cluster stability for finite samples
 Annals of Probability, 10(4):919 – 926
, 1982
Cited by 9 (2 self)
Over the past few years, the notion of stability in data clustering has received growing attention as a cluster validation criterion in a sample-based framework. However, recent work has shown that as the sample size increases, any clustering model will usually become asymptotically stable. This led to the conclusion that stability is lacking as a theoretical and practical tool. The discrepancy between this conclusion and the success of stability in practice has remained an open question, which we attempt to address. Our theoretical approach is that stability, as used by cluster validation algorithms, is similar in certain respects to measures of generalization in a model-selection framework. In such cases, the model chosen governs the convergence rate of generalization bounds. By arguing that these rates are more important than the sample size, we are led to the prediction that stability-based cluster validation algorithms should not degrade with increasing sample size, despite the asymptotic universal stability. This prediction is substantiated by a theoretical analysis as well as some empirical results. We conclude that stability remains a meaningful cluster validation criterion over finite samples.
M-Estimators Converging to a Stable Limit
Cited by 6 (5 self)
Introduction. We discuss the convergence of M-estimators to a stable (possibly normal) limit distribution. Huber (1964) introduced M-estimators as a way to obtain more robust estimators. Let $(S, \mathcal{S}, P)$ be a probability space and let $\{X_i\}_{i=1}^{\infty}$ be a sequence of i.i.d. random variables with values in $S$. Let $X$ be a copy of $X_1$. Let $\Theta$ be a subset of $\mathbb{R}^d$. Let $g : S \times \Theta \to \mathbb{R}$ be a function such that $g(\cdot, \theta) : S \to \mathbb{R}$ is measurable for each $\theta \in \Theta$. Suppose that we want to estimate a parameter $\theta_0 \in \Theta$ characterized by $E[g(X, \theta) - g(X, \theta$ ...
CLUSTER ANALYSIS AND CLASSIFICATION TREE METHODOLOGY AS AN AID TO IMPROVE UNDERSTANDING OF BENIGN PROSTATIC HYPERPLASIA
, 1994
Cited by 6 (0 self)
Clear, scientifically defined guidelines for diagnosing benign prostatic hyperplasia have not been developed, and commonly used urologic measures characterizing the disease have shown lack of correlation. However, most reports in the literature are based on studies in referred patients or other nonrepresentative samples and additionally have not considered the multivariate relationship among these measures. Such commonly used measures were collected during the baseline phase of a community-based study initiated in Olmsted County, Minnesota, to study the prevalence and progression of disease in a randomly selected sample of untreated men aged 40-79 without history of prostate cancer or prior prostate surgery. In the absence of a clinical diagnosis, hierarchical group average cluster analysis and the kth nearest neighbor nonparametric density estimation (NPDE) approach were applied to group men after first standardizing variables using a robust measure. As the number of clusters has been shown to be a monotonically decreasing function of smoothing parameter k, graphical tools
Using combinatorial optimization in model–based trimmed clustering with cardinality constraints
Cited by 2 (0 self)
Statistical clustering criteria with free scale parameters and unknown cluster sizes are inclined to create small, spurious clusters. To mitigate this tendency, a statistical model for cardinality–constrained clustering of data with gross outliers is established, its maximum likelihood and maximum a posteriori clustering criteria are derived, and their consistency and robustness are analyzed. The criteria lead to constrained optimization problems that can be solved by iterative, alternating trimming algorithms of k–means type. Each step in the algorithms requires the solution to a λ–assignment problem known from combinatorial optimization. The method allows estimation of the numbers of clusters and outliers. It is illustrated with a synthetic and a real data set. Key words: model–based clustering; classification model; outliers; size constraints; combinatorial optimization; λ–assignment problem; model selection
T.: Consensus clustering
 Machine Learning 52 (2003) 91–118 Functional Genomics Special Issue
Cited by 1 (0 self)
A resampling-based method for class discovery and visualization of gene expression microarray data
On U-processes and clustering performance
Cited by 1 (1 self)
Many clustering techniques aim at optimizing empirical criteria that are of the form of a U-statistic of degree two. Given a measure of dissimilarity between pairs of observations, the goal is to minimize the within-cluster point scatter over a class of partitions of the feature space. It is the purpose of this paper to define a general statistical framework, relying on the theory of U-processes, for studying the performance of such clustering methods. In this setup, under adequate assumptions on the complexity of the subsets forming the partition candidates, the excess of clustering risk is proved to be of the order $O_P(1/\sqrt{n})$. Based on recent results related to the tail behavior of degenerate U-processes, it is also shown how to establish tighter rate bounds. Model selection issues, related to the number of clusters forming the data partition in particular, are also considered.
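As a rough illustration of the criterion this abstract refers to, the within-cluster point scatter can be computed as an average, over all unordered pairs of observations, of the dissimilarity of those pairs that fall in the same cell of the partition. The sketch below assumes a normalization by the number of pairs, $n(n-1)/2$, and uses squared Euclidean distance as the dissimilarity; the function names and toy data are hypothetical and not taken from the paper.

```python
from itertools import combinations

def within_cluster_scatter(points, labels, dissimilarity):
    """Average dissimilarity over pairs sharing a cluster (a degree-two U-statistic)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two observations")
    total = 0.0
    for i, j in combinations(range(n), 2):
        # Only pairs assigned to the same cluster contribute.
        if labels[i] == labels[j]:
            total += dissimilarity(points[i], points[j])
    # Normalize by the number of unordered pairs, n(n-1)/2 (assumed convention).
    return total / (n * (n - 1) / 2)

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

# Hypothetical toy data: two visually obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = within_cluster_scatter(points, [0, 0, 1, 1], squared_euclidean)
bad = within_cluster_scatter(points, [0, 1, 0, 1], squared_euclidean)
```

A partition that keeps nearby points together (`good`) yields a smaller scatter than one that splits them (`bad`), which is the quantity such methods seek to minimize over the candidate partitions.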