Results 1-10 of 110
Mining high-speed data streams
, 2000
Abstract

Cited by 288 (10 self)
Categories and Subject ...
Mining time-changing data streams
 IN PROC. OF THE 2001 ACM SIGKDD INTL. CONF. ON KNOWLEDGE DISCOVERY AND DATA MINING
, 2001
Abstract

Cited by 247 (5 self)
Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them changed during this time, sometimes radically. Although a number of algorithms have been proposed for learning time-changing concepts, they generally do not scale well to very large databases. In this paper we propose an efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner. This algorithm, called CVFDT, stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate. CVFDT learns a model which is similar in accuracy to the one that would be learned by reapplying VFDT to a moving window of examples every time a new example arrives, but with O(1) complexity per example, as opposed to O(w), where w is the size of the window. Experiments on a set of large time-changing data streams demonstrate the utility of this approach.
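The O(1)-versus-O(w) contrast in this abstract can be illustrated with a minimal Python sketch. It is not CVFDT and builds no decision tree; it only shows, under simplifying assumptions, how a statistic over a moving window (here, plain class counts) can be updated in constant time per example instead of being recomputed from the whole window. The names `WindowedClassCounts` and `recount_window` are hypothetical and do not come from the paper.

```python
from collections import Counter, deque


class WindowedClassCounts:
    """Class counts over the last `w` labels, updated in O(1) per label.

    Illustrative only: CVFDT maintains richer statistics at tree nodes;
    this sketch tracks nothing but label frequencies.
    """

    def __init__(self, w):
        self.w = w
        self.window = deque()    # labels currently inside the window
        self.counts = Counter()  # label -> frequency within the window

    def add(self, label):
        # Account for the newly arrived example.
        self.window.append(label)
        self.counts[label] += 1
        # Forget the oldest example once the window is full.
        if len(self.window) > self.w:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]


def recount_window(labels, w):
    """Baseline: recompute the counts from the last `w` labels, O(w) per call."""
    return Counter(labels[-w:])
```

Calling `add` once per arriving label keeps `counts` equal to what `recount_window` would return for the same prefix of the stream, while touching only the newest and the oldest label.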
A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms
, 2000
Abstract

Cited by 167 (7 self)
Twenty-two decision tree, nine statistical, and two neural network algorithms are compared on thirty-two datasets in terms of classification accuracy, training time, and (in the case of trees) number of leaves. Classification accuracy is measured by mean error rate and mean rank of error rate. Both criteria place a statistical, spline-based algorithm called Polyclass at the top, although it is not statistically significantly different from twenty other algorithms. Another statistical algorithm, logistic regression, is second with respect to the two accuracy criteria. The most accurate decision tree algorithm is Quest with linear splits, which ranks fourth and fifth, respectively. Although spline-based statistical algorithms tend to have good accuracy, they also require relatively long training times. Polyclass, for example, is third last in terms of median training time. It often requires hours of training compared to seconds for other algorithms. The Quest and logistic regression algor...
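The two accuracy criteria named in this abstract, mean error rate and mean rank of error rate, can be sketched in Python as follows. This is a hedged illustration, not the paper's procedure: the function name `mean_error_and_rank` is hypothetical, ties are given the average of the tied ranks (an assumption; the abstract does not say how ties are handled), and any input values a caller supplies are their own, not results from the study.

```python
def mean_error_and_rank(errors):
    """Return {algorithm: (mean error rate, mean rank of error rate)}.

    `errors` maps each algorithm name to a list of error rates, one per
    dataset, with every list in the same dataset order. Rank 1 is the
    lowest error on a dataset; tied algorithms share the average rank
    (assumption made for this sketch).
    """
    names = list(errors)
    n_datasets = len(next(iter(errors.values())))
    rank_sums = {name: 0.0 for name in names}

    for d in range(n_datasets):
        ordered = sorted(names, key=lambda name: errors[name][d])
        i = 0
        while i < len(ordered):
            # Extend j over the run of algorithms tied with ordered[i].
            j = i
            while (j + 1 < len(ordered)
                   and errors[ordered[j + 1]][d] == errors[ordered[i]][d]):
                j += 1
            avg_rank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
            for name in ordered[i:j + 1]:
                rank_sums[name] += avg_rank
            i = j + 1

    return {
        name: (sum(errors[name]) / n_datasets, rank_sums[name] / n_datasets)
        for name in names
    }
```

Mean rank is less sensitive than mean error rate to a single dataset on which every algorithm performs poorly, which is one reason to report the two side by side.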
Preliminary Guidelines for Empirical Research in Software Engineering
 IEEE Transactions on Software Engineering
, 2002
Abstract

Cited by 129 (2 self)
propose a preliminary set of research guidelines aimed at stimulating discussion among software researchers. They are based on a review of research guidelines developed for medical researchers and on our own experience in doing and reviewing software engineering research. The guidelines are intended to assist researchers, reviewers, and meta-analysts in designing, conducting, and evaluating empirical studies. Editorial boards of software engineering journals may wish to use our recommendations as a basis for developing guidelines for reviewers and for framing policies for dealing with the design, data collection, and analysis and reporting of empirical studies. Index Terms: Empirical software research, research guidelines, statistical mistakes.
Gascuel O. Approximate Likelihood-Ratio Test for Branches: A Fast, Accurate, and Powerful Alternative
 Systematic Biology
Abstract

Cited by 86 (4 self)
Abstract. We revisit statistical tests for branches of evolutionary trees reconstructed upon molecular data. A new, fast, approximate likelihood-ratio test (aLRT) for branches is presented here as a competitive alternative to nonparametric bootstrap and Bayesian estimation of branch support. The aLRT is based on the idea of the conventional LRT, with the null hypothesis corresponding to the assumption that the inferred branch has length 0. We show that the LRT statistic is asymptotically distributed as a maximum of three random variables drawn from the $\frac{1}{2}\chi^2_0 + \frac{1}{2}\chi^2_1$ ...
Practice with sleep makes perfect: sleep-dependent motor skill learning
 Neuron
, 2002
Abstract

Cited by 28 (4 self)
that “practice makes perfect,” recent findings suggest that training is not the only determinant of motor skill learning. Time is also an important factor (Karni et al., 1998). Several studies have demonstrated that, while practice produces gains in both speed and accuracy of
Projection-Based Statistical Inference in Linear Structural Models with Possibly Weak Instruments
, 2003
Statistical Inference and Data Mining
, 1996
Abstract

Cited by 22 (3 self)
es of probability distributions, estimation, hypothesis testing, model scoring, Gibbs sampling, rational decision making, causal inference, prediction, and model averaging. For a rigorous survey of statistics, the mathematically inclined reader should see [7]. Due to space limitations, we must also ignore a number of interesting topics, including time series analysis and meta-analysis. Probability Distributions The statistical literature contains mathematical characterizations of a wealth of probability distributions, as well as properties of random variables: functions defined on the "events" to which a probability measure assigns values. Important relations among probability distributions include marginalization (summing over a subset of values) and conditionalization (forming a conditional probability measure from a probability measure on a sample space and some event of positive probability). Essential relations among random variable
Models for network evolution
 Journal of Mathematical Sociology
, 1996
Abstract

Cited by 21 (3 self)
Abstract: This paper describes mathematical models for network evolution when ties (edges) are directed and the node set is fixed. Each of these models implies a specific type of departure from the standard null binomial model. We provide statistical tests that, in keeping with these models, are sensitive to particular types of departures from the null. Each model (and associated test) discussed follows directly from one or more sociocognitive theories about how individuals alter the colleagues with whom they are likely to interact. The models include triad completion models, degree variance models, polarization and balkanization models, the Holland-Leinhardt models, metric models, and the constructural model. We find that many of these models, in their basic form, tend asymptotically towards an equilibrium distribution centered at the completely connected network (i.e., all individuals are equally likely to interact with all other individuals), a fact that can inhibit the development of satisfactory tests. Keywords: triad completion, Holland-Leinhardt model, polarization, degree variance, network evolution, constructuralism
Testing For Nonlinearity Using Redundancies: Quantitative and Qualitative Aspects
 Physica D
, 1995
Abstract

Cited by 19 (7 self)
A method for testing nonlinearity in time series is described based on information-theoretic functionals (redundancies), linear and nonlinear forms of which allow either qualitative, or, after incorporating the surrogate data technique, quantitative evaluation of dynamical properties of scrutinized data. An interplay of quantitative and qualitative testing on both the linear and nonlinear levels is analyzed and robustness of this combined approach against spurious nonlinearity detection is demonstrated. Evaluation of redundancies and redundancy-based statistics as functions of time lag and embedding dimension can further enhance insight into dynamics of a system under study. Keywords: time series, nonlinearity, mutual information, redundancy, surrogate data 1 Introduction The problem of inferring the dynamics of a system from measured data is a perpetual challenge for time series analysts. Ideas and concepts from nonlinear dynamics and theory of deterministic chaos have led to a num...