Results 1–10 of 12
Estimating the number of clusters in a dataset via the Gap statistic
, 2000
"... We propose a method (the \Gap statistic") for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. kmeans or hierarchical), comparing the change in within cluster dispersion to that expected under an appropriate reference null ..."
Abstract

Cited by 261 (1 self)
We propose a method (the "Gap statistic") for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. k-means or hierarchical), comparing the change in within-cluster dispersion to that expected under an appropriate reference null distribution. Some theory is developed for the proposal, and a simulation study shows that the Gap statistic usually outperforms other methods that have been proposed in the literature. We also briefly explore application of the same technique to the problem of estimating the number of linear principal components. 1 Introduction Cluster analysis is an important tool for "unsupervised" learning: the problem of finding groups in data without the help of a response variable. A major challenge in cluster analysis is estimation of the optimal number of "clusters". Figure 1 (top right) shows a typical plot of an error measure W_k (the within-cluster dispersion defined below) for a clustering pr...
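The reference-distribution idea described in this abstract can be sketched in a few lines: cluster both the data and several uniform reference sets, then compare log within-cluster dispersions. The following is a minimal illustration only, assuming a uniform-over-bounding-box reference and a toy Lloyd's k-means (standing in for "any clustering algorithm"); it is not the authors' implementation.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # Minimal Lloyd's algorithm; any clustering algorithm could be used here.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def log_wk(X, labels):
    # log of the pooled within-cluster dispersion W_k.
    wk = sum(((X[labels == j] - X[labels == j].mean(0)) ** 2).sum()
             for j in np.unique(labels))
    return np.log(wk)

def gap_statistic(X, k, n_refs=10, seed=0):
    # Gap(k) = mean over reference sets of log W_k(ref) minus log W_k(data),
    # with references drawn uniformly over the data's bounding box.
    rng = np.random.default_rng(seed)
    obs = log_wk(X, kmeans(X, k))
    lo, hi = X.min(0), X.max(0)
    refs = [log_wk(R, kmeans(R, k, seed=b))
            for b, R in enumerate(rng.uniform(lo, hi, size=(n_refs,) + X.shape))]
    return np.mean(refs) - obs
```

For two well-separated clusters, Gap(2) should exceed Gap(1), since the observed two-cluster dispersion drops far below its uniform-reference counterpart.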
MML clustering of multistate, Poisson, von Mises circular and Gaussian distributions
 Statistics and Computing
, 2000
"... Minimum Message Length (MML) is an invariant Bayesian point estimation technique which is also statistically consistent and efficient. We provide a brief overview of MML inductive inference ..."
Abstract

Cited by 32 (10 self)
Minimum Message Length (MML) is an invariant Bayesian point estimation technique which is also statistically consistent and efficient. We provide a brief overview of MML inductive inference.
Bayesian Model Selection in Finite Mixtures by Marginal Density Decompositions
 Journal of the American Statistical Association
, 2001
"... ..."
Testing for a Finite Mixture Model With Two Components
 Journal of the Royal Statistical Society, Ser. B
, 2004
"... We consider a finite mixture model with k components and a kernel distribution from a general parametric family. We consider the problem of testing the hypothesis k = 2 against k ≥ 3. In this problem, the likelihood ratio test has a very complicated large sample theory and is difficult to use in pra ..."
Abstract

Cited by 12 (2 self)
We consider a finite mixture model with k components and a kernel distribution from a general parametric family. We consider the problem of testing the hypothesis k = 2 against k ≥ 3. In this problem, the likelihood ratio test has a very complicated large-sample theory and is difficult to use in practice. We propose a test based on the likelihood ratio statistic where the estimates of the parameters (under the null and the alternative) are obtained from a penalized likelihood which guarantees consistent estimation of the support points. The asymptotic null distribution of the corresponding modified likelihood ratio test is derived and found to be relatively simple in nature and easily applied. Simulations based on a mixture model with a normal kernel indicate that the modified test performs well, and its use is illustrated in an example involving data from a medical study where the hypothesis arises as a consequence of a potential genetic mechanism. Key words and phrases. Asymptotic distribution, finite mixture models, likelihood ratio tests, penalty terms, nonregular estimation, strong identifiability. AMS 1980 subject classifications. Primary 62F03; secondary 62F05.
Testing For Monotonicity Of A Regression Mean Without Selecting A Bandwidth
, 1998
"... . A new approach to testing for monotonicity of a regression mean, not requiring computation of a curve estimator or a bandwidth, is suggested. It is based on the notion of `running gradients' over short intervals, although from some viewpoints it may be regarded as an analogue for monotonicity test ..."
Abstract

Cited by 4 (3 self)
A new approach to testing for monotonicity of a regression mean, not requiring computation of a curve estimator or a bandwidth, is suggested. It is based on the notion of `running gradients' over short intervals, although from some viewpoints it may be regarded as an analogue, for monotonicity testing, of the dip/excess-mass approach for testing modality hypotheses about densities. Like the latter methods, the new technique does not suffer difficulties caused by almost-flat parts of the target function. In fact, it is calibrated so as to work well for flat response curves, and as a result it has relatively good power properties in boundary cases where the curve exhibits shoulders. In this respect, as well as in its construction, the `running gradients' approach differs from alternative techniques based on the notion of a critical bandwidth. KEYWORDS. Bootstrap, calibration, curve estimation, Monte Carlo, response curve, running gradient. SHORT TITLE. Testing for monotonicity. 1 The man...
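The `running gradients' idea — least-squares slopes over short windows of the data, with a strongly negative local slope counting as evidence against monotonicity — can be illustrated as follows. This is a hedged sketch, assuming an equally spaced design and an arbitrary window length chosen by the user; it omits the paper's calibration step and is not its exact test statistic.

```python
import numpy as np

def running_gradients(y, window):
    # Least-squares slope of y over every window of consecutive observations,
    # assuming an equally spaced design (an illustrative simplification).
    n = len(y)
    x = np.arange(window, dtype=float)
    x -= x.mean()
    denom = (x ** 2).sum()
    return np.array([(x * (y[i:i + window] - y[i:i + window].mean())).sum() / denom
                     for i in range(n - window + 1)])

def monotonicity_statistic(y, window):
    # Large positive values indicate a strongly decreasing stretch,
    # i.e. evidence against a monotone increasing mean.
    return -running_gradients(y, window).min()
```

On a noisy increasing curve the statistic stays near zero, while a curve with a local dip produces a clearly larger value; calibration against flat curves would then turn this into a formal test.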
Multivariate mode hunting: Data analytic tools with measures of significance
 J. Multivariate Analysis
, 2009
"... Multivariate mode hunting is of increasing practical importance. Only a few such methods exist, however, and there usually is a trade off between practical feasibility and theoretical justification. In this paper we attempt to do both. We propose a method for locating isolated modes (or better, moda ..."
Abstract

Cited by 4 (1 self)
Multivariate mode hunting is of increasing practical importance. Only a few such methods exist, however, and there is usually a trade-off between practical feasibility and theoretical justification. In this paper we attempt to achieve both. We propose a method for locating isolated modes (or better, modal regions) in a multivariate data set without prespecifying their total number. Information on the significance of the findings is provided by means of formal testing for the presence of antimodes. Critical values of the tests are derived from large-sample considerations. The method is designed to be computationally feasible in moderate dimensions, and it is complemented by diagnostic plots. Since the null hypothesis under consideration is highly composite, the proposed tests involve calibration in order to ensure a correct (asymptotic) level. Our methods are illustrated by application to real data sets.
Bump hunting with non-Gaussian kernels
 Ann. Statist
, 2004
"... It is well known that the number of modes of a kernel density estimator is monotone nonincreasing in the bandwidth if the kernel is a Gaussian density. There is numerical evidence of nonmonotonicity in the case of some nonGaussian kernels, but little additional information is available. The present ..."
Abstract

Cited by 3 (0 self)
It is well known that the number of modes of a kernel density estimator is monotone non-increasing in the bandwidth if the kernel is a Gaussian density. There is numerical evidence of non-monotonicity in the case of some non-Gaussian kernels, but little additional information is available. The present paper provides theoretical and numerical descriptions of the extent to which the number of modes is a non-monotone function of bandwidth in the case of general compactly supported densities. Our results address popular kernels used in practice, for example the Epanechnikov, biweight and triweight kernels, and show that in such cases non-monotonicity is present with strictly positive probability for all sample sizes n ≥ 3. In the Epanechnikov and biweight cases the probability of non-monotonicity equals 1 for all n ≥ 2. Nevertheless, in spite of the prevalence of lack of monotonicity revealed by these results, it is shown that the notion of a critical bandwidth (the smallest bandwidth above which the number of modes is guaranteed to be monotone) is still well defined. Moreover, just as in the Gaussian case, the critical bandwidth is of the same size as the bandwidth that minimises mean squared error of the density estimator. These theoretical results, and new numerical evidence, show that the main effects of non-monotonicity occur for relatively small bandwidths, and have negligible impact on many aspects of bump hunting. 1. Introduction. Compactly supported kernels
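The mode counts in question are easy to inspect numerically. The toy grid-based mode counter below is not the paper's analysis; it simply evaluates a kernel density estimate on a fine grid and counts strict local maxima, with the Epanechnikov kernel included as an example of the compactly supported case the abstract discusses.

```python
import numpy as np

def kde(grid, data, h, kernel):
    # Kernel density estimate with bandwidth h, evaluated on a grid of points.
    u = (grid[:, None] - data[None, :]) / h
    return kernel(u).sum(axis=1) / (len(data) * h)

def count_modes(f):
    # Number of strict local maxima of a curve sampled on a fine grid.
    return int(np.sum((f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])))

def gaussian(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def epanechnikov(u):
    # Compactly supported kernel: 0.75 * (1 - u^2) on [-1, 1], zero outside.
    return 0.75 * np.maximum(1.0 - u ** 2, 0.0)
```

For a bimodal sample and the Gaussian kernel, the counted number of modes should be non-increasing as the bandwidth grows, matching the classical monotonicity property the abstract takes as its starting point.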
Assessing extrema of empirical principal component functions
, 2006
"... The difficulties of estimating and representing the distributions of functional data mean that principal component methods play a substantially greater role in functional data analysis than in more conventional finitedimensional settings. Local maxima and minima in principal component functions are ..."
Abstract

Cited by 2 (0 self)
The difficulties of estimating and representing the distributions of functional data mean that principal component methods play a substantially greater role in functional data analysis than in more conventional finite-dimensional settings. Local maxima and minima in principal component functions are of direct importance; they indicate places in the domain of a random function where influence on the function value tends to be relatively strong but of opposite sign. We explore statistical properties of the relationship between extrema of empirical principal component functions and their counterparts for the true principal component functions. It is shown that empirical principal component functions have relatively little trouble capturing conventional extrema, but can experience difficulty distinguishing a “shoulder” in a curve from a small bump. For example, when the true principal component function has a shoulder, the probability that the empirical principal component function instead has a bump is approximately equal to 1/2. We suggest and describe the performance of bootstrap methods for assessing the strength of extrema. It is shown that the subsample bootstrap is more effective than the standard bootstrap in this regard. A “bootstrap likelihood” is proposed for measuring extremum strength. Exploratory numerical methods are suggested.
Tests for normal mixtures based on the empirical characteristic function
 Comput. Statist. Data Anal
, 2005
"... Abstract. A goodness–of–fit test for two–component homoscedastic and homothetic mixtures of normal distributions is proposed. The tests are based on a weighted L2–type distance between the empirical characteristic function and its population counterpart, where in the latter, parameters are replaced ..."
Abstract

Cited by 1 (0 self)
Goodness-of-fit tests for two-component homoscedastic and homothetic mixtures of normal distributions are proposed. The tests are based on a weighted L2-type distance between the empirical characteristic function and its population counterpart, where, in the latter, parameters are replaced by consistent estimators. Consequently the resulting tests are consistent against general alternatives. When moment estimation is employed and the decay of the weight function tends to infinity, the test statistics approach limit values, which are related to the first non-vanishing moment equation. The new tests are compared via simulation to other omnibus tests for mixtures of normal distributions, and are applied to several real data sets. Keywords. Characteristic function, Goodness-of-fit test, Mixtures of normal distributions