Results 1 - 8 of 8
Estimating the number of clusters in a dataset via the Gap statistic
, 2000
Cited by 167 (1 self)
We propose a method (the "Gap statistic") for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. k-means or hierarchical), comparing the change in within-cluster dispersion to that expected under an appropriate reference null distribution. Some theory is developed for the proposal, and a simulation study shows that the Gap statistic usually outperforms other methods that have been proposed in the literature. We also briefly explore application of the same technique to the problem of estimating the number of linear principal components.
1 Introduction Cluster analysis is an important tool for "unsupervised" learning: the problem of finding groups in data without the help of a response variable. A major challenge in cluster analysis is estimation of the optimal number of "clusters". Figure 1 (top right) shows a typical plot of an error measure W_k (the within-cluster dispersion defined below) for a clustering pr...
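The comparison described in this abstract can be sketched in a few lines. The sketch below is illustrative only and is not the authors' code: it assumes that the dispersion W_k is the pooled within-cluster sum of squared distances to the cluster means, that the reference null distribution is uniform over the bounding box of the data, and that Gap(k) is the mean of log W_k over reference samples minus log W_k for the observed data. The names `kmeans`, `log_wk` and `gap_statistic` are hypothetical helpers introduced here.

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def within_dispersion(points, labels, k):
    """Pooled within-cluster sum of squared distances to each cluster mean."""
    total = 0.0
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            continue
        mean = tuple(sum(c) / len(members) for c in zip(*members))
        total += sum(sq_dist(p, mean) for p in members)
    return total


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd iterations from random starting centres; returns labels."""
    centers = rng.sample(points, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members)
                                     for c in zip(*members)))
            else:
                updated.append(centers[j])  # keep the centre of an empty cluster
        if updated == centers:
            break
        centers = updated
    return labels


def log_wk(points, k, rng, restarts=3):
    """log of the smallest dispersion found over a few k-means restarts."""
    best = min(within_dispersion(points, kmeans(points, k, rng), k)
               for _ in range(restarts))
    return math.log(best)  # assumes best > 0, i.e. k is well below len(points)


def gap_statistic(points, k_max, n_refs=10, seed=0):
    """Return {k: Gap(k)} for k = 1..k_max under a uniform-box reference."""
    rng = random.Random(seed)
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    gaps = {}
    for k in range(1, k_max + 1):
        ref_logs = []
        for _ in range(n_refs):
            # Assumption: reference data are uniform over the data's bounding box.
            ref = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                   for _ in points]
            ref_logs.append(log_wk(ref, k, rng))
        gaps[k] = sum(ref_logs) / n_refs - log_wk(points, k, rng)
    return gaps
```

Calling `gap_statistic(points, k_max=5)` on a list of coordinate tuples returns one gap value per candidate k; a large jump in the gap between k and k+1 suggests that the extra cluster explains real structure rather than noise. Any rule for turning these values into a single estimate of k is left out of the sketch.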
MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions
- Statistics and Computing
, 2000
Cited by 29 (8 self)
Minimum Message Length (MML) is an invariant Bayesian point estimation technique which is also statistically consistent and efficient. We provide a brief overview of MML inductive inference
Bayesian Model Selection in Finite Mixtures by Marginal Density Decompositions
- JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
, 2001
Testing For Monotonicity Of A Regression Mean Without Selecting A Bandwidth
, 1998
Cited by 3 (3 self)
A new approach to testing for monotonicity of a regression mean, not requiring computation of a curve estimator or a bandwidth, is suggested. It is based on the notion of 'running gradients' over short intervals, although from some viewpoints it may be regarded as an analogue for monotonicity testing of the dip/excess mass approach for testing modality hypotheses about densities. Like the latter methods, the new technique does not suffer difficulties caused by almost-flat parts of the target function. In fact, it is calibrated so as to work well for flat response curves, and as a result it has relatively good power properties in boundary cases where the curve exhibits shoulders. In this respect, as well as in its construction, the 'running gradients' approach differs from alternative techniques based on the notion of a critical bandwidth. KEYWORDS. Bootstrap, calibration, curve estimation, Monte Carlo, response curve, running gradient. SHORT TITLE. Testing for monotonicity. 1 The man...
Bump hunting with nonGaussian kernels
- Ann. Statist
, 2004
Cited by 2 (0 self)
It is well known that the number of modes of a kernel density estimator is monotone nonincreasing in the bandwidth if the kernel is a Gaussian density. There is numerical evidence of nonmonotonicity in the case of some non-Gaussian kernels, but little additional information is available. The present paper provides theoretical and numerical descriptions of the extent to which the number of modes is a nonmonotone function of bandwidth in the case of general compactly supported densities. Our results address popular kernels used in practice, for example, the Epanechnikov, biweight and triweight kernels, and show that in such cases nonmonotonicity is present with strictly positive probability for all sample sizes n ≥ 3. In the Epanechnikov and biweight cases the probability of nonmonotonicity equals 1 for all n ≥ 2. Nevertheless, in spite of the prevalence of lack of monotonicity revealed by these results, it is shown that the notion of a critical bandwidth (the smallest bandwidth above which the number of modes is guaranteed to be monotone) is still well defined. Moreover, just as in the Gaussian case, the critical bandwidth is of the same size as the bandwidth that minimises mean squared error of the density estimator. These theoretical results, and new numerical evidence, show that the main effects of nonmonotonicity occur for relatively small bandwidths, and have negligible impact on many aspects of bump hunting. 1. Introduction. Compactly supported kernels
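To make the quantity studied in this abstract concrete, the following sketch counts the modes of a kernel density estimate on a grid for a chosen kernel and bandwidth. It is an illustration under stated assumptions, not the paper's procedure: the grid-based counting, the grid padding of three bandwidths, and the helper names `kde` and `count_modes` are all choices made here.

```python
import math


def gaussian(u):
    """Standard Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def epanechnikov(u):
    """Epanechnikov kernel, compactly supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kde(x, data, h, kernel):
    """Kernel density estimate at x with bandwidth h."""
    return sum(kernel((x - xi) / h) for xi in data) / (len(data) * h)


def count_modes(data, h, kernel=gaussian, grid_size=2001):
    """Count local maxima of the estimate on an evenly spaced grid.

    Simplification: a grid point counts as a mode when it is strictly
    higher than its left neighbour and not lower than its right one, so
    the answer depends on the grid resolution.
    """
    lo = min(data) - 3.0 * h
    hi = max(data) + 3.0 * h
    step = (hi - lo) / (grid_size - 1)
    ys = [kde(lo + i * step, data, h, kernel) for i in range(grid_size)]
    modes = 0
    for i in range(1, grid_size - 1):
        if ys[i] > ys[i - 1] and ys[i] >= ys[i + 1]:
            modes += 1
    return modes
```

Evaluating `count_modes(data, h, kernel)` over an increasing sequence of bandwidths gives the mode-count curve whose monotonicity is at issue: nonincreasing for the Gaussian kernel, and possibly not for compactly supported kernels such as `epanechnikov`.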
Tests for normal mixtures based on the empirical characteristic function
- Comput. Statist. Data Anal
, 2005
Cited by 1 (0 self)
Abstract. A goodness-of-fit test for two-component homoscedastic and homothetic mixtures of normal distributions is proposed. The tests are based on a weighted L2-type distance between the empirical characteristic function and its population counterpart, where in the latter, parameters are replaced by consistent estimators. Consequently the resulting tests are consistent against general alternatives. When moment estimation is employed and as the decay of the weight function tends to infinity the test statistics approach limit values, which are related to the first nonvanishing moment equation. The new tests are compared via simulation to other omnibus tests for mixtures of normal distributions, and are applied to several real data sets. Keywords. Characteristic function, Goodness-of-fit test, Mixtures of Normal Distributions
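A weighted L2-type distance of the kind mentioned in this abstract can be approximated numerically. The sketch below is a rough illustration, not the paper's statistic: it assumes a Gaussian weight exp(-a t^2), a trapezoid rule on a truncated range, and mixture parameters that are supplied by the caller rather than estimated from the data. The function names are hypothetical.

```python
import cmath
import math
import random


def empirical_cf(sample, t):
    """Empirical characteristic function of the sample at t."""
    return sum(cmath.exp(1j * t * x) for x in sample) / len(sample)


def mixture_cf(t, p, mu1, mu2, sigma):
    """Characteristic function of p*N(mu1, sigma^2) + (1-p)*N(mu2, sigma^2)."""
    envelope = math.exp(-0.5 * (sigma * t) ** 2)
    phase = p * cmath.exp(1j * mu1 * t) + (1.0 - p) * cmath.exp(1j * mu2 * t)
    return phase * envelope


def ecf_distance(sample, p, mu1, mu2, sigma, a=1.0, t_max=6.0, steps=601):
    """n times the weighted squared distance between the two functions.

    Assumptions: weight exp(-a*t^2), integration over [-t_max, t_max]
    by the trapezoid rule.
    """
    n = len(sample)
    dt = 2.0 * t_max / (steps - 1)
    total = 0.0
    for i in range(steps):
        t = -t_max + i * dt
        diff = empirical_cf(sample, t) - mixture_cf(t, p, mu1, mu2, sigma)
        value = abs(diff) ** 2 * math.exp(-a * t * t)
        total += value * (0.5 if i in (0, steps - 1) else 1.0)
    return n * total * dt
```

A small value of `ecf_distance` indicates that the hypothesised mixture is compatible with the sample; turning the value into a test would additionally need a null distribution or critical value, which this sketch does not provide.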
ASSESSING EXTREMA OF EMPIRICAL PRINCIPAL COMPONENT FUNCTIONS
, 2006
Cited by 1 (0 self)
The difficulties of estimating and representing the distributions of functional data mean that principal component methods play a substantially greater role in functional data analysis than in more conventional finite-dimensional settings. Local maxima and minima in principal component functions are of direct importance; they indicate places in the domain of a random function where influence on the function value tends to be relatively strong but of opposite sign. We explore statistical properties of the relationship between extrema of empirical principal component functions, and their counterparts for the true principal component functions. It is shown that empirical principal component functions have relatively little trouble capturing conventional extrema, but can experience difficulty distinguishing a “shoulder” in a curve from a small bump. For example, when the true principal component function has a shoulder, the probability that the empirical principal component function has instead a bump is approximately equal to 1/2. We suggest and describe the performance of bootstrap methods for assessing the strength of extrema. It is shown that the subsample bootstrap is more effective than the standard bootstrap in this regard. A “bootstrap likelihood” is proposed for measuring extremum strength. Exploratory numerical methods are suggested.

