Statistical Comparisons of Classifiers over Multiple Data Sets
, 2006
Cited by 120 (0 self)
While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time already, the issue of statistical tests for comparisons of more algorithms on multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced CD (critical difference) diagrams.
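As a rough illustration of the Friedman test recommended in the abstract above, the following Python sketch ranks the classifiers on each data set, averages the ranks, and computes the Friedman chi-square statistic along with the critical difference used in CD diagrams. The function names, the higher-is-better score convention, and the sample accuracies are assumptions made for illustration and are not taken from the paper. The critical value `q_alpha` has to be looked up in a published table and is supplied by the caller.

```python
import math


def average_ranks(scores):
    """scores[i][j] is the score of algorithm j on data set i (higher is better).

    Returns the mean rank of each algorithm, where rank 1 is best and
    tied algorithms share the average of the ranks they span.
    """
    n, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # average of 1-based ranks pos+1..end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / n for t in totals]


def friedman_chi2(scores):
    """Friedman statistic: 12N / (k(k+1)) * (sum of R_j^2 - k(k+1)^2 / 4)."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)


def critical_difference(q_alpha, k, n):
    """Two algorithms differ if their average ranks differ by at least this much."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Hypothetical accuracies: 4 data sets (rows) by 3 classifiers (columns).
scores = [
    [0.90, 0.85, 0.80],
    [0.88, 0.86, 0.70],
    [0.75, 0.80, 0.65],
    [0.92, 0.90, 0.91],
]
print(average_ranks(scores))   # [1.25, 2.0, 2.75]
print(friedman_chi2(scores))   # 4.5
```

The statistic is then compared against a chi-square distribution with k - 1 degrees of freedom. With so few data sets that approximation is coarse, so the sample above only shows the mechanics.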
Gibbs motif sampling: detection of bacterial outer membrane protein repeats
- Protein Science
, 1995
Cited by 76 (10 self)
The detection and alignment of locally conserved regions (motifs) in multiple sequences can provide insight into protein structure, function, and evolution. A new Gibbs sampling algorithm is described that detects motif-encoding regions in sequences and optimally partitions them into distinct motif models; this is illustrated using a set of immunoglobulin fold proteins. When applied to sequences sharing a single motif, the sampler can be used to classify motif regions into related submodels, as is illustrated using helix-turn-helix DNA-binding proteins. Other statistically based procedures are described for searching a database for sequences matching motifs found by the sampler. When applied to a set of 32 very distantly related bacterial integral outer membrane proteins, the sampler revealed that they share a subtle, repetitive motif. Although BLAST (Altschul SF et al., 1990, J Mol Biol 215:403-410) fails to detect significant pairwise similarity between any of the sequences, the repeats present in these outer membrane proteins, taken as a whole, are highly significant (based on a generally applicable statistical test for motifs described here). Analysis of bacterial porins with known trimeric β-barrel structure and related proteins reveals a similar repetitive motif corresponding to alternating membrane-spanning β-strands. These β-strands occur on the membrane interface (as opposed to the trimeric interface) of the β-barrel. The broad conservation and structural location of these repeats suggests that they play important functional roles.
The humanID gait challenge problem: Data sets, performance, and analysis
- IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2005
Cited by 51 (1 self)
Identification of people by analysis of gait patterns extracted from video has recently become a popular research problem. However, the conditions under which the problem is “solvable” are not understood or characterized. To provide a means for measuring progress and characterizing the properties of gait recognition, we introduce the HumanID Gait Challenge Problem. The challenge problem consists of a baseline algorithm, a set of 12 experiments, and a large data set. The baseline algorithm estimates silhouettes by background subtraction and performs recognition by temporal correlation of silhouettes. The 12 experiments are of increasing difficulty, as measured by the baseline algorithm, and examine the effects of five covariates on performance. The covariates are: change in viewing angle, change in shoe type, change in walking surface, carrying or not carrying a briefcase, and elapsed time between sequences being compared. Identification rates for the 12 experiments range from 78 percent on the easiest experiment to 3 percent on the hardest. All five covariates had statistically significant effects on performance, with walking surface and time difference having the greatest impact. The data set consists of 1,870 sequences from 122 subjects spanning five covariates (1.2 Gigabytes of data). The gait data, the source code of the baseline algorithm, and scripts to run, score, and analyze the challenge experiments are available at
Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty
- In WWW2004
, 2004
Cited by 44 (4 self)
We present a principled methodology for filtering news stories by formal measures of information novelty, and show how the techniques can be used to custom-tailor newsfeeds based on information that a user has already reviewed. We review methods for analyzing novelty and then describe Newsjunkie, a system that personalizes news for users by identifying the novelty of stories in the context of stories they have already reviewed. Newsjunkie employs novelty-analysis algorithms that represent articles as words and named entities. The algorithms analyze inter- and intra-document dynamics by considering how information evolves over time from article to article, as well as within individual articles. We review the results of a user study undertaken to gauge the value of the approach over legacy time-based review of newsfeeds, and also to compare the performance of alternate distance metrics that are used to estimate the dissimilarity between candidate new articles and sets of previously reviewed articles.
A Comparison of Statistical Significance Tests for Information Retrieval Evaluation
, 2007
Cited by 24 (1 self)
Information retrieval (IR) researchers commonly use three tests of statistical significance: the Student’s paired t-test, the Wilcoxon signed rank test, and the sign test. Other researchers have previously proposed using both the bootstrap and Fisher’s randomization (permutation) test as nonparametric significance tests for IR but these tests have seen little use. For each of these five tests, we took the ad-hoc retrieval runs submitted to TRECs 3 and 5-8, and for each pair of runs, we measured the statistical significance of the difference in their mean average precision. We discovered that there is little practical difference between the randomization, bootstrap, and t tests. Both the Wilcoxon and sign test have a poor ability to detect significance and have the potential to lead to false detections of significance. The Wilcoxon and sign tests are simplified variants of the randomization test and their use should be discontinued for measuring the significance of a difference between means.
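The randomization (permutation) test discussed in the abstract above can be sketched for paired per-topic scores as follows. This is a minimal Monte Carlo version written for illustration, not the authors' implementation: the function name, the number of trials, and the fixed seed are assumptions. Under the null hypothesis the two systems are interchangeable on every topic, so each per-topic difference is equally likely to carry either sign.

```python
import random


def paired_randomization_test(a, b, trials=10000, seed=0):
    """Two-sided sign-flip randomization test on the difference in means.

    a[i] and b[i] are the scores (e.g. average precision) of two systems on
    topic i. Returns the estimated p-value: the fraction of random sign
    assignments whose absolute mean difference is at least the observed one.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(trials):
        # Randomly swap the two systems' labels on each topic.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total / len(diffs)) >= observed - 1e-12:
            hits += 1
    return hits / trials
```

Calling `paired_randomization_test(run_a, run_b)` with two equal-length lists of per-topic scores returns a value between 0 and 1. More trials give a finer estimate at proportionally higher cost.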
A Comparison of Cross-company and Within-company Effort Estimation Models for Web Applications
- Proceedings Metrics’04, Chicago, Illinois, September 11-17th 2004, IEEE Computer Society
, 2004
Cited by 17 (9 self)
The objective of this paper is to investigate the use of cross-company and within-company cost estimation models for Web projects. The study analysis method we used was a forward, stepwise regression of web project development effort with independent variables that were used for on-line web application price quotations. Model cross-validation used two approaches: one-fold cross validation and a validation subset of 13 projects from the same company. The setting consists of industrial Web applications from 24 different companies and 8 different countries, and the experiment units are 53 web application and hypermedia projects. The main outcome measure was an estimate of effort for each project, for each estimation model. We used the summary statistics MMRE, Median MRE, Pred(25), and the mean and median of absolute residuals to evaluate the predictive accuracy of the models. We also compared estimation model accuracy with an assessment of estimate accuracy provided by the participating companies. Results showed that the best fitting cross-company regression model was significantly better than the median and experts at predicting effort for 13 projects from a single company (p<0.05). The best fitting within-company regression model for the company with 13 projects was significantly better than the cross-company model (p<0.05). However, the accuracy of the company 1 model was worse than the accuracy achieved by company 1 estimators. In addition, the company 1 model was an extremely poor predictor of effort for the other companies’ projects. We conclude that cross-company effort estimation models can be useful for companies that do not have past projects from which to develop their own models. However, a cross-company data set should only be used for estimation until it is possible for a Web company to
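The accuracy statistics named in the abstract above follow from the magnitude of relative error, MRE = |actual - predicted| / actual. The sketch below computes MMRE, the median MRE, and Pred(25) under that standard definition. The function names and the effort figures are hypothetical and are not drawn from the study.

```python
from statistics import mean, median


def mre(actual, predicted):
    """Magnitude of relative error for each project."""
    return [abs(a - p) / a for a, p in zip(actual, predicted)]


def accuracy_summary(actual, predicted, level=0.25):
    """MMRE, median MRE, and Pred(level): the share of projects whose MRE <= level."""
    errors = mre(actual, predicted)
    return {
        "MMRE": mean(errors),
        "MdMRE": median(errors),
        "Pred": sum(e <= level for e in errors) / len(errors),
    }


# Hypothetical effort values in person-hours for four projects.
actual = [100, 200, 400, 50]
predicted = [110, 150, 400, 100]
summary = accuracy_summary(actual, predicted)
```

For these made-up values the per-project errors are 0.1, 0.25, 0.0 and 1.0, so three of the four projects fall within the 25% band. The single large miss raises MMRE well above the median MRE, which shows how sensitive the mean is to outliers.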
Machine learning methods for predicting failures in hard drives: A multiple-instance application
- Journal of Machine Learning Research
, 2005
Cited by 17 (1 self)
We compare machine learning methods applied to a difficult real-world problem: predicting computer hard-drive failure using attributes monitored internally by individual drives. The problem is one of detecting rare events in a time series of noisy and nonparametrically-distributed data. We develop a new algorithm based on the multiple-instance learning framework and the naive Bayesian classifier (mi-NB) which is specifically designed for the low false-alarm case, and is shown to have promising performance. Other methods compared are support vector machines (SVMs), unsupervised clustering, and non-parametric statistical tests (rank-sum and reverse arrangements). The failure-prediction performance of the SVM, rank-sum and mi-NB algorithm is considerably better than the threshold method currently implemented in drives, while maintaining low false alarm rates. Our results suggest that nonparametric statistical tests should be considered for learning problems involving detecting rare events in time series data. An appendix details the calculation of rank-sum significance probabilities in the case of discrete, tied observations, and we give new recommendations about when the exact calculation should be used instead of the commonly-used normal approximation. These normal approximations may be particularly inaccurate for rare event problems like hard drive failures.
Improved Disk-Drive Failure Warnings
- IEEE Transactions on Reliability
, 2002
Cited by 14 (2 self)
Improved methods are proposed for disk-drive failure prediction. The SMART (Self Monitoring and Reporting Technology) failure prediction system is currently implemented in disk-drives. Its purpose is to predict the near-term failure of an individual hard disk-drive, and issue a backup warning to prevent data loss. Two experimental tests of SMART show only moderate accuracy at low false-alarm rates. (A rate of 0.2% of total drives per year implies that 20% of drive returns would be good drives, relative to 1% annual failure rate of drives). This requirement for very low false-alarm rates is well known in medical diagnostic tests for rare diseases, and methodology used there suggests ways to improve SMART.
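The parenthetical arithmetic in the abstract above can be checked with a short sketch. It reads "20% of drive returns" as the ratio of falsely flagged good drives to genuinely failing drives, which is one plausible interpretation rather than a definition given in the source. The second function assumes that every failed drive and every falsely flagged drive is returned, which the abstract does not state. Both function names are hypothetical.

```python
def false_alarms_per_failure(false_alarm_rate, annual_failure_rate):
    """Good drives flagged per year, relative to drives that actually fail."""
    return false_alarm_rate / annual_failure_rate


def good_share_of_all_returns(false_alarm_rate, annual_failure_rate):
    """Share of good drives among all returns (assumed: flagged plus failed drives)."""
    return false_alarm_rate / (false_alarm_rate + annual_failure_rate)


ratio = false_alarms_per_failure(0.002, 0.01)    # about 0.2, the 20% figure
share = good_share_of_all_returns(0.002, 0.01)   # about 0.167 under the stated assumption
```

Either way, a false-alarm rate that looks negligible per drive becomes a sizeable fraction of returns because genuine failures are themselves rare.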
Improving the performance of motor-impaired users with automatically generated, ability-based interfaces
- In CHI’08
, 2008
Cited by 14 (6 self)
We evaluate two systems for automatically generating personalized interfaces adapted to the individual motor capabilities of users with motor impairments. The first system, SUPPLE, adapts to users’ capabilities indirectly by first using the ARNAULD preference elicitation engine to model a user’s preferences regarding how he or she likes the interfaces to be created. The second system, SUPPLE++, models a user’s motor abilities directly from a set of one-time motor performance tests. In a study comparing these approaches to baseline interfaces, participants with motor impairments were 26.4% faster using ability-based user interfaces generated by SUPPLE++. They also made 73% fewer errors, strongly preferred those interfaces to the manufacturers’ defaults, and found them more efficient, easier to use, and much less physically tiring. These findings indicate that rather than requiring some users with motor impairments to adapt themselves to software using separate assistive technologies, software can now adapt itself to the capabilities of its users.

