Results 1 - 10 of 375
Scalable statistical bug isolation
 In Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation
, 2005
Cited by 186 (12 self)
We present a statistical debugging algorithm that isolates bugs in programs containing multiple undiagnosed bugs. Earlier statistical algorithms that focus solely on identifying predictors that correlate with program failure perform poorly when there are multiple bugs. Our new technique separates the effects of different bugs and identifies predictors that are associated with individual bugs. These predictors reveal both the circumstances under which bugs occur as well as the frequencies of failure modes, making it easier to prioritize debugging efforts. Our algorithm is validated using several case studies, including examples in which the algorithm identified previously unknown, significant crashing bugs in widely used systems. Categories and Subject Descriptors D.2.4 [Software Engineering]: Software/Program Verification—statistical methods; D.2.5
Unsupervised feature selection using feature similarity
 IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2002
Cited by 98 (2 self)
In this article, we describe an unsupervised feature selection algorithm suitable for data sets large in both dimension and size. The method is based on measuring similarity between features whereby redundancy therein is removed. This does not need any search and, therefore, is fast. A new feature similarity measure, called maximum information compression index, is introduced. The algorithm is generic in nature and has the capability of multiscale representation of data sets. The superiority of the algorithm, in terms of speed and performance, is established extensively over various real-life data sets of different sizes and dimensions. It is also demonstrated how redundancy and information loss in feature selection can be quantified with an entropy measure. Index Terms: Data mining, pattern recognition, dimensionality reduction, feature clustering, multiscale representation, entropy measures.
The bootstrap
 In Handbook of Econometrics
, 2001
Cited by 75 (1 self)
The bootstrap is a method for estimating the distribution of an estimator or test statistic by resampling one’s data. It amounts to treating the data as if they were the population for the purpose of evaluating the distribution of interest. Under mild regularity conditions, the bootstrap yields an approximation to the distribution of an estimator or test statistic that is at least as accurate as the
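The resampling idea described in this abstract can be sketched in a few lines. This is a minimal illustration of the nonparametric bootstrap with a percentile interval, not the chapter's own procedure; the function names `bootstrap_distribution` and `percentile_interval`, the sample values, and the number of resamples are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_distribution(data, estimator, n_resamples=2000, seed=0):
    """Approximate the sampling distribution of `estimator` by
    resampling `data` with replacement (data treated as the population)."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Draw n observations with replacement from the original sample.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return estimates


def percentile_interval(estimates, level=0.90):
    """Simple percentile interval read off the sorted bootstrap estimates."""
    ordered = sorted(estimates)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(tail * (len(ordered) - 1))]
    upper = ordered[int((1.0 - tail) * (len(ordered) - 1))]
    return lower, upper


# Hypothetical sample, used only to exercise the functions.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 3.0]
boot_means = bootstrap_distribution(sample, statistics.mean)
interval = percentile_interval(boot_means)
```

The percentile interval is only one of several ways to use the bootstrap distribution; it is chosen here because it needs no further assumptions about the estimator.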
Dynamic Panel Estimation and Homogeneity Testing under Cross Section Dependence
 Cowles Foundation Discussion Paper n.1362
, 2002
Cited by 75 (4 self)
Least squares bias in autoregression and dynamic panel regression is shown to be exacerbated in case of cross section dependence. The bias is substantial and is shown to have serious effects in applications like HAC estimation and dynamic half-life response estimation. To address the bias problem, this paper develops a panel approach to median unbiased estimation that takes into account cross section dependence. The new estimators given here considerably reduce the effects of bias and gain precision from estimating cross section error correlation. The paper also develops an asymptotic theory for tests of coefficient homogeneity under cross section dependence, and proposes a modified Hausman test to test for the presence of homogeneous unit roots. An orthogonalization procedure is developed to remove cross section dependence and permit the use of conventional and meta unit root tests with panel data. Some simulations investigating the finite sample performance of the estimation and test procedures are reported.
Model Choice: A Minimum Posterior Predictive Loss Approach
, 1998
Cited by 59 (10 self)
Model choice is a fundamental and much discussed activity in the analysis of data sets. Hierarchical models introducing random effects cannot be handled by classical methods. Bayesian approaches using predictive distributions can, though the formal solution, which includes Bayes factors as a special case, can be criticized. We propose a predictive criterion where the goal is good prediction of a replicate of the observed data but tempered by fidelity to the observed values. We obtain this criterion by minimizing posterior loss for a given model and then, for models under consideration, select the one which minimizes this criterion. For a broad range of losses, the criterion emerges approximately as a form partitioned into a goodness-of-fit term and a penalty term. In the context of generalized linear mixed effects models we obtain a penalized deviance criterion comprised of a piece which is a Bayesian deviance measure and a piece which is a penalty for model complexity. We illustrate ...
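The partition into a goodness-of-fit term and a penalty term can be illustrated for one specific loss. The sketch below assumes squared-error loss, under which the criterion is commonly written as a weighted sum of squared deviations of the predictive means from the observations plus the sum of the predictive variances. The function name `predictive_loss_criterion`, the `k / (k + 1)` weighting convention, and the toy draws are assumptions for illustration; the abstract itself does not give this formula.

```python
import statistics


def predictive_loss_criterion(observed, replicate_draws, k=None):
    """Return (criterion, fit, penalty) under squared-error loss.

    replicate_draws[s][i] is draw s of the replicate for observation i,
    e.g. taken from a posterior predictive sampler (assumed available).
    k weights fidelity to the observed values; k=None is the limiting
    case in which both terms are weighted equally.
    """
    fit = 0.0
    penalty = 0.0
    for i, y in enumerate(observed):
        draws_i = [draw[i] for draw in replicate_draws]
        mu = statistics.fmean(draws_i)
        # Penalty term: predictive variance of the replicate.
        penalty += statistics.pvariance(draws_i, mu)
        # Goodness-of-fit term: squared distance of predictive mean from data.
        fit += (mu - y) ** 2
    weight = 1.0 if k is None else k / (k + 1.0)
    return weight * fit + penalty, fit, penalty


# Toy data, used only to exercise the function.
observed = [1.0, 2.0]
draws = [[0.0, 2.0], [2.0, 4.0]]
criterion, fit, penalty = predictive_loss_criterion(observed, draws)
```

Among candidate models, the one with the smallest criterion value would be preferred under this reading; other losses lead to different fit and penalty terms.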
Testing random variables for independence and identity
 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
, 2000
Cited by 54 (18 self)
Given access to independent samples of a distribution $A$ over $[n] \times [m]$, we show how to test whether the distributions formed by projecting $A$ to each coordinate are independent, i.e., whether $A$ is $\epsilon$-close in the $L_1$ norm to the product distribution $A_1 \times A_2$ for some distributions $A_1$ over $[n]$ and $A_2$ over $[m]$. The sample complexity of our test is $\tilde{O}(n^{2/3} m^{1/3} \mathrm{poly}(\epsilon^{-1}))$, assuming without loss of generality that $m \le n$. We also give a matching lower bound, up to $\mathrm{poly}(\log n, \epsilon^{-1})$ factors. Furthermore, given access to samples of a distribution $X$ over $[n]$, we show how to test if $X$ is $\epsilon$-close in $L_1$ norm to an explicitly specified distribution $Y$. Our test uses $\tilde{O}(n^{1/2} \mathrm{poly}(\epsilon^{-1}))$ samples, which nearly matches the known tight bounds for the case when $Y$ is uniform.
Sequential Monte Carlo methods for statistical analysis of tables
 J. Amer. Statist. Assoc
Cited by 51 (10 self)
We describe a sequential importance sampling (SIS) procedure for analyzing two-way zero–one or contingency tables with fixed marginal sums. An essential feature of the new method is that it samples the columns of the table progressively according to certain special distributions. Our method produces Monte Carlo samples that are remarkably close to the uniform distribution, enabling one to approximate closely the null distributions of various test statistics about these tables. Our method compares favorably with other existing Monte Carlo-based algorithms, and sometimes is a few orders of magnitude more efficient. In particular, compared with Markov chain Monte Carlo (MCMC)-based approaches, our importance sampling method not only is more efficient in terms of absolute running time and frees one from pondering over the mixing issue, but also provides an easy and accurate estimate of the total number of tables with fixed marginal sums, which is far more difficult for an MCMC method to achieve.
Multihypothesis sequential probability ratio tests - Part II: Accurate asymptotic . . .
 IEEE TRANS. INFORM. THEORY
, 2000
Cited by 42 (14 self)
In a companion paper [13], we proved that two specific constructions of multihypothesis sequential tests, which we refer to as Multihypothesis Sequential Probability Ratio Tests (MSPRT’s), are asymptotically optimal as the decision risks (or error probabilities) go to zero. The MSPRT’s asymptotically minimize not only the expected sample size but also any positive moment of the stopping time distribution, under very general statistical models for the observations. In this paper, based on nonlinear renewal theory we find accurate asymptotic approximations (up to a vanishing term) for the expected sample size that take into account the “overshoot” over the boundaries of decision statistics. The approximations are derived for the scenario where the hypotheses are simple, the observations are independent and identically distributed (i.i.d.) according to one of the underlying distributions, and the decision risks go to zero. Simulation results for practical examples show that these approximations are fairly accurate not only for large but also for moderate sample sizes. The asymptotic results given here complete the analysis initiated in [4], where first-order asymptotics were obtained for the expected sample size under a specific restriction on the Kullback–Leibler distances between the hypotheses.