Results 1 - 10 of 258
Scalable statistical bug isolation
- In Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, 2005
- Cited by 132 (11 self)
We present a statistical debugging algorithm that isolates bugs in programs containing multiple undiagnosed bugs. Earlier statistical algorithms that focus solely on identifying predictors that correlate with program failure perform poorly when there are multiple bugs. Our new technique separates the effects of different bugs and identifies predictors that are associated with individual bugs. These predictors reveal both the circumstances under which bugs occur as well as the frequencies of failure modes, making it easier to prioritize debugging efforts. Our algorithm is validated using several case studies, including examples in which the algorithm identified previously unknown, significant crashing bugs in widely used systems. Categories and Subject Descriptors D.2.4 [Software Engineering]: Software/Program Verification—statistical methods; D.2.5
Unsupervised feature selection using feature similarity
- IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002
- Cited by 70 (1 self)
In this article, we describe an unsupervised feature selection algorithm suitable for data sets that are large in both dimension and size. The method is based on measuring similarity between features whereby redundancy therein is removed. This does not need any search and, therefore, is fast. A new feature similarity measure, called maximum information compression index, is introduced. The algorithm is generic in nature and has the capability of multiscale representation of data sets. The superiority of the algorithm, in terms of speed and performance, is established extensively over various real-life data sets of different sizes and dimensions. It is also demonstrated how redundancy and information loss in feature selection can be quantified with an entropy measure. Index Terms: Data mining, pattern recognition, dimensionality reduction, feature clustering, multiscale representation, entropy measures.
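The abstract names the maximum information compression index but does not give its formula. The sketch below assumes the usual definition from the feature selection literature, namely the smaller eigenvalue of the 2x2 covariance matrix of a pair of features; treat that definition, and the helper names `covariance` and `max_info_compression_index`, as assumptions for illustration rather than the authors' code. Under this definition the index is zero when two features are linearly dependent (one is fully redundant) and grows as they become less similar.

```python
import math


def covariance(x, y):
    """Population covariance of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)


def max_info_compression_index(x, y):
    """Smaller eigenvalue of the 2x2 covariance matrix of features x and y.

    Assumed definition (not stated in the abstract): a value near zero
    means one feature is almost a linear function of the other.
    """
    var_x = covariance(x, x)
    var_y = covariance(y, y)
    cov_xy = covariance(x, y)
    trace = var_x + var_y
    det = var_x * var_y - cov_xy * cov_xy
    # Closed-form eigenvalues of a symmetric 2x2 matrix; clamp tiny
    # negative discriminants caused by floating-point rounding.
    disc = max(trace * trace - 4.0 * det, 0.0)
    return (trace - math.sqrt(disc)) / 2.0


redundant = max_info_compression_index([1, 2, 3, 4], [2, 4, 6, 8])
unrelated = max_info_compression_index([1, -1, 1, -1], [1, 1, -1, -1])
print("linearly dependent pair:", redundant)
print("uncorrelated pair:", unrelated)
```

A similarity-based selector could use such a pairwise score to group features and keep one representative per group, which is consistent with the abstract's claim that no search over feature subsets is needed.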
Sequential Monte Carlo methods for statistical analysis of tables
- J. Amer. Statist. Assoc.
- Cited by 43 (10 self)
We describe a sequential importance sampling (SIS) procedure for analyzing two-way zero–one or contingency tables with fixed marginal sums. An essential feature of the new method is that it samples the columns of the table progressively according to certain special distributions. Our method produces Monte Carlo samples that are remarkably close to the uniform distribution, enabling one to approximate closely the null distributions of various test statistics about these tables. Our method compares favorably with other existing Monte Carlo-based algorithms, and sometimes is a few orders of magnitude more efficient. In particular, compared with Markov chain Monte Carlo (MCMC)-based approaches, our importance sampling method not only is more efficient in terms of absolute running time and frees one from pondering over the mixing issue, but also provides an easy and accurate estimate of the total number of tables with fixed marginal sums, which is far more difficult for an MCMC method to achieve.
Model Choice: A Minimum Posterior Predictive Loss Approach
- 1998
- Cited by 39 (10 self)
Model choice is a fundamental and much discussed activity in the analysis of data sets. Hierarchical models introducing random effects can not be handled by classical methods. Bayesian approaches using predictive distributions can, though the formal solution, which includes Bayes factors as a special case, can be criticized. We propose a predictive criterion where the goal is good prediction of a replicate of the observed data but tempered by fidelity to the observed values. We obtain this criterion by minimizing posterior loss for a given model and then, for models under consideration, select the one which minimizes this criterion. For a broad range of losses, the criterion emerges approximately as a form partitioned into a goodness-of-fit term and a penalty term. In the context of generalized linear mixed effects models we obtain a penalized deviance criterion comprised of a piece which is a Bayesian deviance measure and a piece which is a penalty for model complexity. We illustrate ...
The bootstrap
- In Handbook of Econometrics, 2001
- Cited by 38 (1 self)
The bootstrap is a method for estimating the distribution of an estimator or test statistic by resampling one’s data. It amounts to treating the data as if they were the population for the purpose of evaluating the distribution of interest. Under mild regularity conditions, the bootstrap yields an approximation to the distribution of an estimator or test statistic that is at least as accurate as the
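The resampling idea in the abstract above can be sketched in a few lines. This is a minimal illustration, assuming independent observations and a simple percentile interval; the function names `bootstrap_distribution` and `percentile_interval`, the sample values, and the number of resamples are all hypothetical choices, not taken from the handbook chapter.

```python
import random
import statistics


def bootstrap_distribution(data, estimator, n_resamples=2000, seed=0):
    """Estimator values computed on resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample treats the observed data as if it were the population.
    return [estimator(rng.choices(data, k=n)) for _ in range(n_resamples)]


def percentile_interval(values, level=0.90):
    """Percentile interval read off the sorted bootstrap draws."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    lo_idx = int(tail * (len(ordered) - 1))
    hi_idx = int((1.0 - tail) * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


# Hypothetical sample, used only to exercise the functions.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]
draws = bootstrap_distribution(sample, statistics.mean)
low, high = percentile_interval(draws)

print(f"sample mean: {statistics.mean(sample):.2f}")
print(f"bootstrap standard error: {statistics.stdev(draws):.2f}")
print(f"90% percentile interval: ({low:.2f}, {high:.2f})")
```

The spread of `draws` stands in for the sampling distribution of the estimator. Any statistic that takes a list of observations can be passed as `estimator`, for example `statistics.median`.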
Dynamic Panel Estimation and Homogeneity Testing under Cross Section Dependence
- Cowles Foundation Discussion Paper n.1362, 2002
- Cited by 38 (2 self)
Least squares bias in autoregression and dynamic panel regression is shown to be exacerbated in case of cross section dependence. The bias is substantial and is shown to have serious effects in applications like HAC estimation and dynamic half-life response estimation. To address the bias problem, this paper develops a panel approach to median unbiased estimation that takes into account cross section dependence. The new estimators given here considerably reduce the effects of bias and gain precision from estimating cross section error correlation. The paper also develops an asymptotic theory for tests of coefficient homogeneity under cross section dependence, and proposes a modified Hausman test to test for the presence of homogeneous unit roots. An orthogonalization procedure is developed to remove cross section dependence and permit the use of conventional and meta unit root tests with panel data. Some simulations investigating the finite sample performance of the estimation and test procedures are reported.
Testing random variables for independence and identity
- Proceedings of the 41st Annual Symposium on Foundations of Computer Science, 2000
- Cited by 36 (14 self)
Given access to independent samples of a distribution A over [n] × [m], we show how to test whether the distributions formed by projecting A to each coordinate are independent, i.e., whether A is ε-close in the L1 norm to the product distribution A1 × A2 for some distributions A1 over [n] and A2 over [m]. The sample complexity of our test is Õ(n^{2/3} m^{1/3} poly(ε^{-1})), assuming without loss of generality that m ≤ n. We also give a matching lower bound, up to poly(log n, ε^{-1}) factors. Furthermore, given access to samples of a distribution X over [n], we show how to test if X is ε-close in L1 norm to an explicitly specified distribution Y. Our test uses Õ(n^{1/2} poly(ε^{-1})) samples, which nearly matches the known tight bounds for the case when Y is uniform.
A Statistical Test for the Time Constancy of Scaling Exponents
- IEEE Transactions on Signal Processing, 1999
- Cited by 31 (7 self)
A wavelet-based statistical test is described for distinguishing true time variation of the scaling exponent describing scaling behaviour, from statistical fluctuations of estimates across time of a constant exponent. The test is applicable to diverse scaling phenomena including long range dependence and exactly self-similar processes in a uniform framework, without the need for prior knowledge of the type in question. It is based on the special properties of wavelet-based estimates of the scaling exponent over adjacent blocks of data, strongly motivating an idealised inference problem: the equality or otherwise of means of independent Gaussian variables with known variances. A uniformly most powerful invariant test exists for this problem and is described. A separate UMPI test is also described for when the scaling exponent undergoes a level change. The power functions of the two tests are given explicitly and compared. Using simulation the effect in practice of deviations from the ide...

