Development and Use of a Gold-Standard Data Set for Subjectivity Classifications
, 1999
Cited by 123 (9 self)
... and improving intercoder reliability in discourse tagging using statistical techniques. Bias-corrected tags are formulated and successfully used to guide a revision of the coding manual and develop an automatic classifier.
Comparing Corpora using Frequency Profiling
In Proceedings of the Workshop on Comparing Corpora, held in conjunction with ACL 2000, October 2000, Hong Kong
, 2000
Cited by 102 (5 self)
This paper describes a method of comparing corpora which uses frequency profiling. The method can be used to discover key words in the corpora which differentiate one corpus from another. Using annotated corpora, it can be applied to discover key grammatical or word-sense categories. This can be used as a quick way to find the differences between the corpora and is shown to have applications in the study of social differentiation in the use of English vocabulary, profiling of learner English and document analysis in the software engineering process.
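The abstract above does not say which statistic ranks the key words. As a rough sketch only, the following assumes the log-likelihood (G2) statistic, a common choice for this kind of frequency comparison; the function names `log_likelihood` and `key_words` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in corpus 1 (c tokens) and b times in corpus 2 (d tokens)."""
    # Expected counts if the word were equally frequent in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2.0 * g2

def key_words(tokens1, tokens2):
    """Rank every word by how strongly its frequency differs between the two token lists."""
    f1, f2 = Counter(tokens1), Counter(tokens2)
    n1, n2 = len(tokens1), len(tokens2)
    scores = {w: log_likelihood(f1[w], f2[w], n1, n2) for w in set(f1) | set(f2)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `key_words(corpus_a_tokens, corpus_b_tokens)` returns (word, score) pairs with the most distinguishing words first. The same function could be run over part-of-speech or semantic tags instead of words if the corpora are annotated.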
A kernel statistical test of independence
, 2008
Cited by 95 (48 self)
Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m^2), where m is the sample size. We demonstrate that this test outperforms established contingency table and functional correlation-based tests, and that this advantage is greater for multivariate data. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.
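To make the idea of an HSIC-based independence test concrete, here is a minimal sketch. It computes the biased empirical HSIC statistic from two Gram matrices in O(m^2) time and, as a stand-in, gets a p-value from a permutation test. The abstract does not describe how the paper derives its test threshold, so the permutation step, the Gaussian kernel, the bandwidth `sigma` and all function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """Gaussian-kernel Gram matrix; sigma is an illustrative default, not a tuned value."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def center(K):
    """Return H K H with H = I - 11^T/m, computed by subtracting row and column means."""
    return K - K.mean(axis=0, keepdims=True) - K.mean(axis=1, keepdims=True) + K.mean()

def hsic(Kc, L):
    """Biased HSIC estimate trace(K H L H) / m^2, given the centered Gram matrix Kc."""
    m = L.shape[0]
    return float(np.sum(Kc * L)) / m ** 2

def hsic_permutation_test(x, y, n_perm=200, seed=0):
    """Permutation p-value: shuffle y to simulate the independence hypothesis."""
    rng = np.random.default_rng(seed)
    Kc, L = center(gaussian_gram(x)), gaussian_gram(y)
    observed = hsic(Kc, L)
    exceed = 0
    for _ in range(n_perm):
        p = rng.permutation(L.shape[0])
        if hsic(Kc, L[np.ix_(p, p)]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

Each evaluation of the statistic touches every entry of the two m-by-m Gram matrices once, which matches the O(m^2) cost quoted in the abstract; the permutation loop multiplies that by `n_perm`.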
Kernel measures of conditional dependence
In Adv. NIPS
, 2008
Cited by 87 (46 self)
We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments.
How to Fit a Response Time Distribution
Cited by 82 (1 self)
Among the most valuable tools in behavioral science is statistically fitting mathematical models of cognition to data, response time distributions in particular. However, techniques for fitting distributions vary widely and little is known about the efficacy of different techniques. In this article, we assessed several fitting techniques by simulating six widely cited models of response time and using the fitting procedures to recover model parameters. The techniques include the maximization of likelihood and least-squares fits of the theoretical distributions to different empirical estimates of the simulated distributions. A running example was used to illustrate the different estimation and fitting procedures. The simulation studies revealed that empirical density estimates are biased even for very large sample sizes. Some fitting techniques yielded more accurate and less variable parameter estimates than others. Methods that involved least-squares fits to density estimates generally yielded very poor parameter estimates.
The importance of considering the entire response time (RT) distribution in testing formal models of cognition is now widely appreciated. Fitting a model to mean RT alone can mask important details of the data that examination of the entire distribution would reveal, such as the behavior of fast and slow responses across the conditions of an experiment (e.g., Heathcote, Popiel & Mewhort, 1991), the extent of facilitation between perceptual channels (Miller, 1982), and the effects of practice on RT quantiles (Logan, 1992). Techniques for testing hypotheses based on the RT distribution have been developed (Townsend, 1990). In addition, the RT distribution provides an important meeting ground between theory and da...
TestU01: A C library for empirical testing of random number generators
ACM Transactions on Mathematical Software
, 2007
Cited by 80 (4 self)
We introduce TestU01, a software library implemented in the ANSI C language, and offering a collection of utilities for the empirical statistical testing of uniform random number generators (RNGs). It provides general implementations of the classical statistical tests for RNGs, as well as several other tests proposed in the literature, and some original ones. Predefined test suites for sequences of uniform random numbers over the interval (0, 1) and for bit sequences are available. Tools are also offered to perform systematic studies of the interaction between a specific test and the structure of the point sets produced by a given family of RNGs. That is, for a given kind of test and a given class of RNGs, to determine how large the sample size of the test should be, as a function of the generator’s period length, before the generator starts to fail the test systematically. Finally, the library provides various types of generators implemented in generic form, as well as many specific generators proposed in the literature or found in widely-used software. The tests can be applied to instances of the generators predefined in the library, or to user-defined generators, or to streams of random numbers produced by any kind of device or stored in files. Besides introducing TestU01, the paper provides a survey and a classification of statistical tests for RNGs. It also applies batteries of tests to a long list of widely used RNGs.
Csiszár’s divergences for nonnegative matrix factorization: Family of new algorithms
LNCS
, 2006
Cited by 77 (20 self)
In this paper we discuss a wide class of loss (cost) functions for non-negative matrix factorization (NMF) and derive several novel algorithms with improved efficiency and robustness to noise and outliers. We review several approaches which allow us to obtain generalized forms of multiplicative NMF algorithms and unify some existing algorithms. We also give the flexible and relaxed form of the NMF algorithms to increase convergence speed and impose some desired constraints such as sparsity and smoothness of components. Moreover, the effects of various regularization terms and constraints are clearly shown. The scope of these results is vast since the proposed generalized divergence functions include a large number of useful loss functions such as the squared Euclidean distance, Kullback-Leibler divergence, Itakura-Saito, Hellinger, Pearson’s chi-square, and Neyman’s chi-square distances, etc. We have successfully applied the developed algorithms to blind (or semi-blind) source separation (BSS) where sources can be generally statistically dependent but satisfy some other conditions or additional constraints such as nonnegativity, sparsity and/or smoothness.
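For readers unfamiliar with multiplicative NMF algorithms, the sketch below shows the classical multiplicative updates for one of the loss functions named above, the generalized Kullback-Leibler divergence. This is the well-known baseline scheme, not the generalized Csiszár-divergence algorithms the paper derives; the function names, the random initialization and the small constant `eps` are assumptions made for illustration.

```python
import numpy as np

def kl_divergence(V, WH, eps=1e-12):
    """Generalized Kullback-Leibler divergence D(V || WH)."""
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))

def nmf_kl(V, rank, n_iter=200, seed=0, eps=1e-12):
    """Factor a non-negative matrix V (n x m) as W (n x rank) times H (rank x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    # Strictly positive start so the multiplicative updates never get stuck at zero.
    W = rng.random((n, rank)) + 0.1
    H = rng.random((rank, m)) + 0.1
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H
```

Because each update multiplies by a non-negative ratio, W and H stay non-negative without any explicit projection step. Constraints such as sparsity or smoothness, mentioned in the abstract, would need extra regularization terms that this sketch leaves out.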
Fishing for Exactness
In Proceedings of the South-Central SAS Users Group Conference
, 1996
Cited by 56 (5 self)
Statistical methods for automatically identifying dependent word pairs (i.e. dependent bigrams) in a corpus of natural language text have traditionally been performed using asymptotic tests of significance. This paper suggests that Fisher's exact test is a more appropriate test due to the skewed and sparse data samples typical of this problem. Both theoretical and experimental comparisons between Fisher's exact test and a variety of asymptotic tests (the t-test, Pearson's chi-square test, and likelihood-ratio chi-square test) are presented. These comparisons show that Fisher's exact test is more reliable in identifying dependent word pairs. The usefulness of Fisher's exact test extends to other problems in statistical natural language processing as skewed and sparse data appears to be the rule in natural language. The experiment presented in this paper was performed using PROC FREQ of the SAS System.
Introduction
Due to advances in computing power and the increasing availability of l...
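The paper ran its experiment with PROC FREQ in SAS. Purely as an illustration of the underlying calculation, the Python sketch below builds the 2x2 contingency table for one bigram and computes a right-sided Fisher's exact p-value from the hypergeometric distribution. The function names and the choice of a one-sided test are assumptions, not details taken from the paper.

```python
from math import comb

def bigram_table(tokens, w1, w2):
    """2x2 counts over adjacent token pairs: (w1 w2), (w1 other), (other w2), (other other)."""
    pairs = list(zip(tokens, tokens[1:]))
    n11 = sum(1 for a, b in pairs if a == w1 and b == w2)
    n12 = sum(1 for a, b in pairs if a == w1 and b != w2)
    n21 = sum(1 for a, b in pairs if a != w1 and b == w2)
    n22 = len(pairs) - n11 - n12 - n21
    return n11, n12, n21, n22

def fisher_exact_right(n11, n12, n21, n22):
    """P(X >= n11) with row and column totals fixed, i.e. the right-sided exact p-value."""
    row1 = n11 + n12
    col1 = n11 + n21
    n = n11 + n12 + n21 + n22
    upper = min(row1, col1)
    # Sum hypergeometric probabilities of all tables at least as extreme as the observed one.
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(n11, upper + 1))
    return tail / comb(n, col1)
```

A small p-value from `fisher_exact_right(*bigram_table(tokens, "new", "york"))` would suggest that the two words co-occur more often than independence predicts. Because the probabilities are computed exactly with integer arithmetic, no large-sample approximation is involved, which is the property the abstract emphasizes for skewed and sparse counts.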
General Time-Reversible Distances with Unequal Rates across Sites: Mixing Γ and Inverse Gaussian Distributions with Invariant Sites
, 1997
Cited by 51 (4 self)
This paper aims to explain to biologists the assumptions of these distances and to clarify some earlier misconceptions. Importantly, nearly all of the currently used distance estimates (including those of Tamura, 1992; Tamura and Nei, 1994) are special cases (restrictions) of the general time-reversible distance (see Zharkikh, 1994; Swofford et al., 1996).