Results 1–10 of 33
Distributional Clustering of Words for Text Classification
, 1998
"... This paper describes the application of Distributional Clustering [20] to document classification. This approach clusters words into groups based on the distribution of class labels associated with each word. Thus, unlike some other unsupervised dimensionalityreduction techniques, such as Latent Sem ..."
Abstract

Cited by 234 (1 self)
This paper describes the application of Distributional Clustering [20] to document classification. This approach clusters words into groups based on the distribution of class labels associated with each word. Thus, unlike some other unsupervised dimensionality-reduction techniques, such as Latent Semantic Indexing, we are able to compress the feature space much more aggressively, while still maintaining high document classification accuracy. Experimental results obtained on three real-world data sets show that we can reduce the feature dimensionality by three orders of magnitude and lose only 2% accuracy, significantly better than Latent Semantic Indexing [6], class-based clustering [1], feature selection by mutual information [23], or Markov-blanket-based feature selection [13]. We also show that less aggressive clustering sometimes results in improved classification accuracy over classification without clustering. 1 Introduction The popularity of the Internet has caused an exponent...
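The word-clustering idea in this abstract can be sketched as follows: represent each word by its distribution over class labels, then greedily merge the words whose distributions are closest. The agglomerative loop and the Jensen-Shannon similarity below are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def class_distributions(word_class_counts):
    """Normalize per-word class-label counts into P(class | word)."""
    counts = np.asarray(word_class_counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cluster_words(word_class_counts, n_clusters):
    """Greedy agglomerative clustering of words by the JS divergence
    between their class distributions (an illustrative stand-in for
    the paper's distributional clustering)."""
    dists = class_distributions(word_class_counts)
    clusters = [[i] for i in range(len(dists))]
    reps = [d.copy() for d in dists]        # per-cluster mean distribution
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = js_divergence(reps[i], reps[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        wi, wj = len(clusters[i]), len(clusters[j])
        reps[i] = (wi * reps[i] + wj * reps[j]) / (wi + wj)
        clusters[i].extend(clusters[j])
        del clusters[j], reps[j]
    return clusters
```

With toy counts where words 0–1 lean toward class 0 and words 2–3 toward class 1, the two-cluster result groups them accordingly.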
Model Selection and the Principle of Minimum Description Length
 Journal of the American Statistical Association
, 1998
"... This paper reviews the principle of Minimum Description Length (MDL) for problems of model selection. By viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework discriminates between competing models based on the complexity of each description. This ..."
Abstract

Cited by 145 (5 self)
This paper reviews the principle of Minimum Description Length (MDL) for problems of model selection. By viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework discriminates between competing models based on the complexity of each description. This approach began with Kolmogorov's theory of algorithmic complexity, matured in the literature on information theory, and has recently received renewed interest within the statistics community. In the pages that follow, we review both the practical as well as the theoretical aspects of MDL as a tool for model selection, emphasizing the rich connections between information theory and statistics. At the boundary between these two disciplines, we find many interesting interpretations of popular frequentist and Bayesian procedures. As we will see, MDL provides an objective umbrella under which rather disparate approaches to statistical modeling can coexist and be compared. We illustrate th...
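The MDL trade-off between model complexity and fit can be illustrated with a crude two-part code for polynomial regression: the score below charges (k/2) log n for a k-parameter model plus (n/2) log(RSS/n) for the data given the model. This particular score is a common textbook approximation, not the paper's own formulation.

```python
import numpy as np

def mdl_score(x, y, degree):
    """Two-part description length for a polynomial fit:
    model cost (k/2) log n plus data cost (n/2) log(RSS/n).
    A crude textbook approximation of MDL for regression."""
    n, k = len(x), degree + 1
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    rss = float(residuals @ residuals)
    # small floor keeps the log finite for (near-)perfect fits
    return 0.5 * k * np.log(n) + 0.5 * n * np.log(rss / n + 1e-12)

def select_degree(x, y, max_degree=6):
    """Pick the degree whose total description length is shortest."""
    return min(range(max_degree + 1), key=lambda d: mdl_score(x, y, d))

x = np.linspace(-1.0, 1.0, 50)
y = 1 + 2 * x + 3 * x ** 2          # data generated by a quadratic
best = select_degree(x, y)          # the model-cost term penalizes degrees above 2
```

Degrees above 2 reduce the residual cost negligibly while paying (log n)/2 per extra coefficient, so the quadratic wins.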
A novel anomaly detection scheme based on principal component classifier
in Proceedings of the IEEE Foundations and New Directions of Data Mining Workshop, in conjunction with the Third IEEE International Conference on Data Mining (ICDM’03)
, 2003
"... This paper proposes a novel scheme that uses robust principal component classifier in intrusion detection problem where the training data may be unsupervised. Assuming that anomalies can be treated as outliers, an intrusion predictive model is constructed from the major and minor principal component ..."
Abstract

Cited by 41 (5 self)
This paper proposes a novel scheme that uses a robust principal component classifier for intrusion detection problems where the training data may be unsupervised. Assuming that anomalies can be treated as outliers, an intrusion predictive model is constructed from the major and minor principal components of normal instances. A measure of the difference of an anomaly from the normal instances is the distance in the principal component space. The distance based on the major components that account for 50% of the total variation and the minor components with eigenvalues less than 0.20 is shown to work well. The experiments with KDD Cup 1999 data demonstrate that our proposed method achieves 98.94% recall and 97.89% precision with a false alarm rate of 0.92%, and outperforms the nearest neighbor method, the density-based local outliers (LOF) approach, and outlier detection algorithms based on the Canberra metric.
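A minimal sketch of the scoring rule described above, assuming standardized features and reading "distance in the principal component space" as the usual sum of squared component scores scaled by eigenvalues. The 50% major-variance and 0.20 minor-eigenvalue cutoffs come from the abstract; everything else is an illustrative interpretation, not the paper's exact classifier.

```python
import numpy as np

def fit_pca(X):
    """Eigen-decomposition of the correlation matrix of normal training data."""
    mu, sigma = X.mean(0), X.std(0)
    Z = (X - mu) / sigma
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]       # sort descending
    return mu, sigma, eigvals[order], eigvecs[:, order]

def anomaly_scores(X, mu, sigma, eigvals, eigvecs,
                   major_var=0.50, minor_cut=0.20):
    """Sum of squared principal component scores, each divided by its
    eigenvalue, over (a) the major components explaining `major_var`
    of the variance and (b) the minor components with eigenvalue
    below `minor_cut`."""
    Z = (X - mu) / sigma
    pcs = Z @ eigvecs
    frac = np.cumsum(eigvals) / eigvals.sum()
    q = np.searchsorted(frac, major_var) + 1    # number of major components
    minor = eigvals < minor_cut                 # minor-component mask
    major_d = np.sum(pcs[:, :q] ** 2 / eigvals[:q], axis=1)
    minor_d = np.sum(pcs[:, minor] ** 2 / eigvals[minor], axis=1)
    return major_d, minor_d
```

A point far from the normal cloud scores high on the major distance, while a point that breaks the inter-feature correlations scores high on the minor distance; an alert fires when either exceeds a threshold calibrated on normal data.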
Studying the fault-detection effectiveness of GUI test cases for rapidly evolving software
 IEEE Transactions on Software Engineering
, 2005
"... Abstract—Software is increasingly being developed/maintained by multiple, often geographically distributed developers working concurrently. Consequently, rapidfeedbackbased quality assurance mechanisms such as daily builds and smoke regression tests, which help to detect and eliminate defects earl ..."
Abstract

Cited by 33 (11 self)
Abstract—Software is increasingly being developed/maintained by multiple, often geographically distributed developers working concurrently. Consequently, rapid-feedback-based quality assurance mechanisms such as daily builds and smoke regression tests, which help to detect and eliminate defects early during software development and maintenance, have become important. This paper addresses a major weakness of current smoke regression testing techniques, i.e., their inability to automatically (re)test graphical user interfaces (GUIs). Several contributions are made to the area of GUI smoke testing. First, the requirements for GUI smoke testing are identified and a GUI smoke test is formally defined as a specialized sequence of events. Second, a GUI smoke regression testing process called Daily Automated Regression Tester (DART) that automates GUI smoke testing is presented. Third, the interplay between several characteristics of GUI smoke test suites, including their size, fault-detection ability, and test oracles, is empirically studied. The results show that: 1) the entire smoke testing process is feasible in terms of execution time, storage space, and manual effort; 2) smoke tests cannot cover certain parts of the application code; 3) having comprehensive test oracles may make up for not having long smoke test cases; and 4) using certain oracles can make up for not having large smoke test suites. Index Terms—Smoke testing, GUI testing, test oracles, empirical studies, regression testing.
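The abstract's definition of a smoke test as a "specialized sequence of events" checked by a test oracle can be made concrete with a toy model. The CounterApp and its event names here are invented for this sketch and are not part of DART.

```python
# A smoke test modeled as a sequence of GUI events replayed against a
# minimal application model, with a test oracle checking state after
# each step.
class CounterApp:
    def __init__(self):
        self.value = 0

    def handle(self, event):
        if event == "click_increment":
            self.value += 1
        elif event == "click_reset":
            self.value = 0

def run_smoke_test(events, oracle):
    """Replay events in order; report the first step that violates the oracle."""
    app = CounterApp()
    for step, event in enumerate(events):
        app.handle(event)
        if not oracle(step, app):
            return ("fail", step)
    return ("pass", len(events))

expected = [1, 2, 0, 1]             # oracle: counter value after each event
events = ["click_increment", "click_increment", "click_reset", "click_increment"]
result = run_smoke_test(events, lambda step, app: app.value == expected[step])
```

The trade-off the paper studies appears even here: a per-step oracle (as above) detects faults with short event sequences, while a weaker end-state-only oracle would need longer or more numerous sequences.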
Static Detection Of Deadlocks In Polynomial Time
, 1993
"... Parallel and distributed programming languages often include explicit synchronization primitives, such as rendezvous and semaphores. Such programs are subject to synchronization anomalies; the program behaves incorrectly because it has a faulty synchronization structure. A deadlock is an anomaly in ..."
Abstract

Cited by 27 (1 self)
Parallel and distributed programming languages often include explicit synchronization primitives, such as rendezvous and semaphores. Such programs are subject to synchronization anomalies; the program behaves incorrectly because it has a faulty synchronization structure. A deadlock is an anomaly in which some subset of the active tasks of the program mutually wait on each other to advance; thus, the program cannot complete execution. In static anomaly detection, the source code of a program is automatically analyzed to determine if the program can ever exhibit a specific anomaly. Static anomaly detection has the unique advantage that it can certify programs to be free of the tested anomaly; dynamic testing cannot generally do this. Though exact static detection of deadlocks is NP-hard [Tay83a], many researchers have tried to detect deadlock by ...
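The deadlock condition described above, a set of tasks mutually waiting on each other, corresponds to a cycle in a wait-for graph, which can be checked in polynomial (indeed linear) time by depth-first search. This sketch illustrates only that criterion; the paper's static analysis of synchronization structure is considerably more involved.

```python
# Cycle detection in a static wait-for graph: if task A can only
# proceed after task B, add an edge A -> B; a cycle means a set of
# tasks mutually waiting on each other, i.e. a potential deadlock.
def has_deadlock(wait_for):
    """wait_for: dict mapping each task to the tasks it waits on."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on stack / done
    color = {t: WHITE for t in wait_for}

    def dfs(t):
        color[t] = GRAY
        for u in wait_for.get(t, ()):
            if color.get(u, WHITE) == GRAY:       # back edge -> cycle
                return True
            if color.get(u, WHITE) == WHITE and u in wait_for and dfs(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and dfs(t) for t in wait_for)
```

For example, `{"A": ["B"], "B": ["C"], "C": ["A"]}` deadlocks, while `{"A": ["B"], "B": []}` does not. A static analyzer must additionally prove which wait-for edges are actually realizable, which is where the hardness lies.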
A Review of Dynamic Handwritten Signature Verification
, 1997
"... There is considerable interest in authentication based on handwritten signature verification (HSV) because HSV is superior to many other biometric authentication techniques e.g. finger prints or retinal patterns, which are reliable but much more intrusive and expensive. This paper presents a review ..."
Abstract

Cited by 17 (2 self)
There is considerable interest in authentication based on handwritten signature verification (HSV) because HSV is superior to many other biometric authentication techniques, e.g., fingerprints or retinal patterns, which are reliable but much more intrusive and expensive. This paper presents a review of dynamic HSV techniques that have been reported in the literature. The paper also discusses possible applications of HSV, lists some commercial products that are available, and suggests some areas for future research. 1 Introduction Our society is increasingly dependent on electronic storage and transmission of information, and this has created a need for electronically verifying a person's identity. Handwritten signatures have been the normal and customary way for identity verification. Although there have been occasional disputes about the authorship of handwritten signatures (Osborn, 1929; Harrison, 1958; Hilton, 1956), verification of handwritten signatures has not been a major p...
Feature Minimization within Decision Trees
 Computational Optimization and Applications
, 1996
"... Decision trees for classification can be constructed using mathematical programming. Within decision tree algorithms, the feature minimization problem is to construct accurate decisions using as few features or attributes within each decision as possible. Feature minimization is an important aspe ..."
Abstract

Cited by 14 (2 self)
Decision trees for classification can be constructed using mathematical programming. Within decision tree algorithms, the feature minimization problem is to construct accurate decisions using as few features or attributes within each decision as possible. Feature minimization is an important aspect of data mining since it helps identify which attributes are important and helps produce accurate and interpretable decision trees. In feature minimization with bounded accuracy, we minimize the number of features using a given misclassification error tolerance. This problem can be formulated as a parametric bilinear program and is shown to be NP-complete. A parametric Frank-Wolfe method is used to solve the bilinear subproblems. The resulting minimization algorithm produces more compact, accurate, and interpretable trees. This procedure can be applied to many different error functions. Formulations and results for two error functions are given. One method, FM RLPP, dramatically reduced the number of features of one dataset from 147 to 2 while maintaining an 83.6% testing accuracy. Computational results compare favorably with the standard univariate decision tree method, C4.5, as well as with linear programming methods of tree construction. Key Words: Data mining, machine learning, feature minimization, decision trees, bilinear programming. Knowledge Discovery and Data Mining Group, Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180. Email bredee@rpi.edu, bennek@rpi.edu. Telephone (518) 276-6899. FAX (518) 276-4824. This material is based on research supported by National Science Foundation Grant 949427.
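The "feature minimization with bounded accuracy" problem, minimizing the number of features subject to an error tolerance, can be illustrated with a greedy backward-elimination loop around a stand-in classifier. The nearest-centroid rule and the greedy search below are placeholders for illustration only; the paper solves the problem exactly via a parametric bilinear program and a Frank-Wolfe method.

```python
import numpy as np

def centroid_accuracy(X, y, feats):
    """In-sample accuracy of a nearest-centroid rule on a feature subset
    (a simple stand-in classifier for this sketch)."""
    Xs = X[:, feats]
    cents = {c: Xs[y == c].mean(0) for c in np.unique(y)}
    preds = [min(cents, key=lambda c: np.linalg.norm(row - cents[c]))
             for row in Xs]
    return np.mean(np.array(preds) == y)

def minimize_features(X, y, tolerance=0.05):
    """Greedy backward elimination: drop features one at a time while
    accuracy stays within `tolerance` of the full-feature accuracy."""
    feats = list(range(X.shape[1]))
    target = centroid_accuracy(X, y, feats) - tolerance
    improved = True
    while improved and len(feats) > 1:
        improved = False
        for f in list(feats):
            trial = [g for g in feats if g != f]
            if centroid_accuracy(X, y, trial) >= target:
                feats = trial
                improved = True
                break
    return feats
```

On synthetic data where only one feature separates the classes, the loop strips the noise features and keeps the informative one, mirroring the 147-to-2 reduction reported in the abstract at a much smaller scale.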
Geometric structure estimation of axially symmetric pots from small fragments
 In Proc. Signal Processing, Pattern Recognition, and Applications
, 2002
"... The recovery of geometric structure from noisy data poses difficult nonlinear statistical estimation problems. This paper describes a novel, robust, low computational cost approach for finding the geometric structure of an axially symmetric pot from a small fragment of it (an unorganized set of 3D ..."
Abstract

Cited by 13 (0 self)
The recovery of geometric structure from noisy data poses difficult nonlinear statistical estimation problems. This paper describes a novel, robust, low-computational-cost approach for finding the geometric structure of an axially symmetric pot from a small fragment of it (an unorganized set of 3D points). This problem is of great archaeological importance to the study of the hundreds of thousands of sherds found at excavation sites. Our method is based on the following fact: for each point on the surface, the center of the sphere of principal curvature corresponding to the circles of revolution is on the symmetric axis. By finding the line which minimizes the weighted least-squares distance to the estimated centers, we can find first the symmetric axis and then the profile curve. Because of the special properties of a surface of revolution, we can do this using only first derivatives; hence this method is robust to noisy data. We then use bootstrap methods to find confidence bounds for the axis and the profile curve. These confidence bounds are essential if the estimates are used for the assembly of the full pot from multiple sherds. KEY WORDS Shape analysis, 3D data analysis, Surface of revolution, Symmetric axis
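The line-fitting step described above, finding the line that minimizes the weighted least-squares distance to the estimated curvature centers, is a weighted total-least-squares problem solvable by an eigen-decomposition. Only this step is sketched here, assuming the centers have already been estimated; recovering them from surface curvature is the substantive part of the paper's method.

```python
import numpy as np

def fit_axis(centers, weights=None):
    """Fit the symmetric axis as the 3D line minimizing the (weighted)
    squared distance to the estimated curvature centers: the line
    through the weighted centroid along the dominant eigenvector of
    the weighted covariance (total least squares)."""
    C = np.asarray(centers, dtype=float)
    w = np.ones(len(C)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    centroid = w @ C                         # weighted centroid lies on the line
    D = C - centroid
    cov = (D * w[:, None]).T @ D             # weighted covariance of the centers
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    direction = eigvecs[:, -1]               # largest-eigenvalue direction
    return centroid, direction
```

Resampling the centers and refitting gives the bootstrap confidence bounds the abstract mentions: the spread of the fitted `(centroid, direction)` pairs bounds the axis estimate.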