Results 1 - 10 of 77
Mining high-speed data streams
2000
Cited by 220 (10 self)
Mining time-changing data streams
In Proc. of the 2001 ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2001
Cited by 196 (4 self)
Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them changed during this time, sometimes radically. Although a number of algorithms have been proposed for learning time-changing concepts, they generally do not scale well to very large databases. In this paper we propose an efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner. This algorithm, called CVFDT, stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate. CVFDT learns a model which is similar in accuracy to the one that would be learned by reapplying VFDT to a moving window of examples every time a new example arrives, but with O(1) complexity per example, as opposed to O(w), where w is the size of the window. Experiments on a set of large time-changing data streams demonstrate the utility of this approach.
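The O(1)-versus-O(w) contrast in the abstract above can be illustrated with a minimal Python sketch. It is not CVFDT: it only keeps class counts over a moving window of the last w labels, updating them incrementally when an example enters or leaves the window, and compares that with recounting the window from scratch. The names `WindowCounts` and `recount` are hypothetical and chosen for illustration only.

```python
from collections import Counter, deque


class WindowCounts:
    """Class counts over the last w labels, updated in O(1) per example.

    Illustrative only: a real stream learner would keep richer
    statistics, but the add-one/forget-one pattern is the same.
    """

    def __init__(self, w):
        self.w = w
        self.window = deque()
        self.counts = Counter()

    def add(self, label):
        # Account for the new example.
        self.window.append(label)
        self.counts[label] += 1
        # Forget the oldest example once the window is full.
        if len(self.window) > self.w:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]


def recount(labels, w):
    """O(w) baseline: rebuild the counts from the last w labels."""
    return Counter(labels[-w:])
```

Both approaches give the same counts after every example; the incremental version simply touches a constant number of entries per arrival instead of rescanning the window.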
A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms
2000
Cited by 134 (6 self)
Twenty-two decision tree, nine statistical, and two neural network algorithms are compared on thirty-two datasets in terms of classification accuracy, training time, and (in the case of trees) number of leaves. Classification accuracy is measured by mean error rate and mean rank of error rate. Both criteria place a statistical, spline-based algorithm called Polyclass at the top, although it is not statistically significantly different from twenty other algorithms. Another statistical algorithm, logistic regression, is second with respect to the two accuracy criteria. The most accurate decision tree algorithm is Quest with linear splits, which ranks fourth and fifth, respectively. Although spline-based statistical algorithms tend to have good accuracy, they also require relatively long training times. Polyclass, for example, is third last in terms of median training time. It often requires hours of training compared to seconds for other algorithms. The Quest and logistic regression algor...
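The two accuracy criteria named in this abstract, mean error rate and mean rank of error rate, can be sketched as follows. This is an illustrative Python sketch under assumptions: the abstract does not say how ties are ranked, so tied algorithms share the average rank here, and the function names `average_ranks` and `summarize` are hypothetical.

```python
def average_ranks(values):
    """Rank values ascending (1 = lowest error); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def summarize(errors):
    """errors: {algorithm: [error rate on each dataset]}.

    Returns {algorithm: (mean error rate, mean rank of error rate)}.
    """
    names = list(errors)
    n_datasets = len(errors[names[0]])
    rank_sums = {name: 0.0 for name in names}
    for d in range(n_datasets):
        ranks = average_ranks([errors[name][d] for name in names])
        for name, rank in zip(names, ranks):
            rank_sums[name] += rank
    return {
        name: (sum(errors[name]) / n_datasets, rank_sums[name] / n_datasets)
        for name in names
    }
```

The two criteria can disagree: an algorithm with one very poor dataset may have a high mean error rate yet a good mean rank, which is presumably why the study reports both.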
Preliminary Guidelines for Empirical Research in Software Engineering
IEEE Transactions on Software Engineering, 2002
Cited by 95 (1 self)
propose a preliminary set of research guidelines aimed at stimulating discussion among software researchers. They are based on a review of research guidelines developed for medical researchers and on our own experience in doing and reviewing software engineering research. The guidelines are intended to assist researchers, reviewers, and meta-analysts in designing, conducting, and evaluating empirical studies. Editorial boards of software engineering journals may wish to use our recommendations as a basis for developing guidelines for reviewers and for framing policies for dealing with the design, data collection, and analysis and reporting of empirical studies. Index Terms: Empirical software research, research guidelines, statistical mistakes.
Statistical Inference and Data Mining
1996
Cited by 21 (3 self)
es of probability distributions, estimation, hypothesis testing, model scoring, Gibbs sampling, rational decision making, causal inference, prediction, and model averaging. For a rigorous survey of statistics, the mathematically inclined reader should see [7]. Due to space limitations, we must also ignore a number of interesting topics, including time series analysis and meta-analysis. Probability Distributions: The statistical literature contains mathematical characterizations of a wealth of probability distributions, as well as properties of random variables, that is, functions defined on the "events" to which a probability measure assigns values. Important relations among probability distributions include marginalization (summing over a subset of values) and conditionalization (forming a conditional probability measure from a probability measure on a sample space and some event of positive probability). Essential relations among random variable
Models for network evolution
Journal of Mathematical Sociology, 1996
Cited by 18 (3 self)
Abstract: This paper describes mathematical models for network evolution when ties (edges) are directed and the node set is fixed. Each of these models implies a specific type of departure from the standard null binomial model. We provide statistical tests that, in keeping with these models, are sensitive to particular types of departures from the null. Each model (and associated test) discussed follows directly from one or more socio-cognitive theories about how individuals alter the colleagues with whom they are likely to interact. The models include triad completion models, degree variance models, polarization and balkanization models, the Holland-Leinhardt models, metric models, and the constructural model. We find that many of these models, in their basic form, tend asymptotically towards an equilibrium distribution centered at the completely connected network (i.e., all individuals are equally likely to interact with all other individuals), a fact that can inhibit the development of satisfactory tests. Keywords: triad completion, Holland-Leinhardt model, polarization, degree variance, network evolution, constructuralism
Testing For Nonlinearity Using Redundancies: Quantitative and Qualitative Aspects
Physica D, 1995
Cited by 18 (6 self)
A method for testing nonlinearity in time series is described based on information-theoretic functionals -- redundancies, linear and nonlinear forms of which allow either qualitative, or, after incorporating the surrogate data technique, quantitative evaluation of dynamical properties of scrutinized data. An interplay of quantitative and qualitative testing on both the linear and nonlinear levels is analyzed and robustness of this combined approach against spurious nonlinearity detection is demonstrated. Evaluation of redundancies and redundancy-based statistics as functions of time lag and embedding dimension can further enhance insight into dynamics of a system under study. Keywords: time series, nonlinearity, mutual information, redundancy, surrogate data 1 Introduction The problem of inferring the dynamics of a system from measured data is a perpetual challenge for time series analysts. Ideas and concepts from nonlinear dynamics and theory of deterministic chaos have led to a num...
Overfitting Explained
1997
Cited by 16 (1 self)
Overfitting arises when model components are evaluated against the wrong reference distribution. Most modeling algorithms iteratively find the best of several components and then test whether this component is good enough to add to the model. We show that for independently distributed random variables, the reference distribution for any one variable underestimates the reference distribution for the highest-valued variable; thus variate values will appear significant when they are not, and model components will be added when they should not be added. We relate this problem to the well-known statistical theory of multiple comparisons or simultaneous inference.
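The effect described in this abstract can be illustrated with a standard multiple-comparisons calculation. The Python sketch below is not taken from the paper: it assumes k independent candidate components, each of which would exceed a significance threshold by chance with probability alpha, and computes how often the best of them exceeds it. The Sidak adjustment is swapped in here as one textbook remedy, and both function names are hypothetical.

```python
def familywise_rate(alpha, k):
    """Chance that the highest of k independent null scores exceeds a
    threshold that any single score exceeds with probability alpha."""
    return 1.0 - (1.0 - alpha) ** k


def sidak_alpha(alpha, k):
    """Per-component level that keeps the chance for the highest of k
    independent scores at alpha (the Sidak adjustment)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / k)
```

With alpha = 0.05 and k = 10 candidates, the best candidate passes the single-variable test by chance about 40 percent of the time, which is the sense in which the single-variable reference distribution understates that of the highest-valued variable.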
Empirical studies of quality models in object-oriented systems
Advances in Computers, 2002
Cited by 14 (1 self)
Measuring structural design properties of a software system, such as coupling, cohesion, or complexity, is a promising approach towards early quality assessments. To use such measurement effectively, quality models are needed that quantitatively describe how these internal structural properties relate to relevant external system qualities such as reliability or maintainability. The objective of this chapter is to summarize, in a structured and detailed fashion, the empirical results that have been reported so far on modeling external system quality based on structural design properties in object-oriented systems. We perform a critical review of existing work in order to identify lessons learned regarding the way these studies are performed and reported. Constructive guidelines are also provided to facilitate the work of future studies, thus facilitating the development of an
Environmental scanning: Acquisition and use of information by managers
In M. E. Williams (Ed.), Annual review of information science and technology (vol. 28), 1993
Cited by 13 (3 self)
The present study investigates how chief executive officers in the Canadian telecommunications industry acquire and use information about the external business environment, an information seeking activity known as environmental scanning. Data were collected by a nationwide questionnaire survey and several focused interviews. Of the 113 CEOs in the study population, 67 returned completed questionnaires, thus giving a response rate of 59 percent. Personal interviews were then conducted with eight of the respondents. The chief executives collectively perceive the Technological, Customer, and Competition environmental sectors to have the greatest Perceived Strategic Uncertainty – these sectors were perceived to be the most strategic, variable and complex. For each environmental sector, the Amount of Scanning of the sector is positively correlated with the Perceived Strategic Uncertainty of that sector. Generally, the chief executives use multiple, complementary sources in environmental scanning. Personal sources such as customers and subordinate staff are very important in both scanning and decision making, and they are used more frequently than impersonal sources. Nonetheless, impersonal sources such as publications and reports are also frequently used in scanning. In decision making, environmental information from internal sources is used more frequently than that from external sources. For many of the information sources, the frequency of source use is

