Distributional Clustering of Words for Text Classification
, 1998
Cited by 198 (1 self)
This paper describes the application of Distributional Clustering [20] to document classification. This approach clusters words into groups based on the distribution of class labels associated with each word. Thus, unlike some other unsupervised dimensionality-reduction techniques, such as Latent Semantic Indexing, we are able to compress the feature space much more aggressively, while still maintaining high document classification accuracy. Experimental results obtained on three real-world data sets show that we can reduce the feature dimensionality by three orders of magnitude and lose only 2% accuracy, significantly better than Latent Semantic Indexing [6], class-based clustering [1], feature selection by mutual information [23], or Markov-blanket-based feature selection [13]. We also show that less aggressive clustering sometimes results in improved classification accuracy over classification without clustering. 1 Introduction The popularity of the Internet has caused an exponent...
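The idea of grouping words by "the distribution of class labels associated with each word" can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy corpus, the add-one smoothing, the helper names, and in particular the use of Jensen-Shannon divergence as the similarity measure, which the abstract does not specify.

```python
import math
from collections import Counter

def class_distribution(word, labeled_docs, classes):
    """Estimate P(class | word) from occurrence counts, with add-one smoothing."""
    counts = Counter()
    for tokens, label in labeled_docs:
        counts[label] += tokens.count(word)
    total = sum(counts[c] for c in classes) + len(classes)
    return [(counts[c] + 1) / total for c in classes]

def kl(p, q):
    """Kullback-Leibler divergence in bits between two discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence (illustrative choice of similarity measure)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical toy corpus: (tokens, class label) pairs.
docs = [
    ("goal match team goal".split(), "sports"),
    ("team score match".split(), "sports"),
    ("vote election party".split(), "politics"),
    ("party vote vote debate".split(), "politics"),
]
classes = ["sports", "politics"]

p_goal = class_distribution("goal", docs, classes)
p_team = class_distribution("team", docs, classes)
p_vote = class_distribution("vote", docs, classes)

# Words with similar class distributions are candidates for the same cluster.
d_same = js_divergence(p_goal, p_team)
d_diff = js_divergence(p_goal, p_vote)
print(d_same, d_diff)
```

In this sketch "goal" and "team" have the same class distribution and would be merged before "goal" and "vote", so a cluster can stand in for many words as a single feature.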
Model Selection and the Principle of Minimum Description Length
- Journal of the American Statistical Association
, 1998
Cited by 114 (4 self)
This paper reviews the principle of Minimum Description Length (MDL) for problems of model selection. By viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework discriminates between competing models based on the complexity of each description. This approach began with Kolmogorov's theory of algorithmic complexity, matured in the literature on information theory, and has recently received renewed interest within the statistics community. In the pages that follow, we review both the practical and the theoretical aspects of MDL as a tool for model selection, emphasizing the rich connections between information theory and statistics. At the boundary between these two disciplines, we find many interesting interpretations of popular frequentist and Bayesian procedures. As we will see, MDL provides an objective umbrella under which rather disparate approaches to statistical modeling can co-exist and be compared. We illustrate th...
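As a rough illustration of choosing between models by description length, the sketch below compares two models of a coin-flip sequence with a two-part code: bits to state the model plus bits to encode the data under it. The function names and the coin example are hypothetical, and the parameter cost of (1/2) log2 n bits is a common asymptotic approximation assumed here, not necessarily the form used in the paper.

```python
import math

def bernoulli_code_length(k, n, theta):
    """Bits needed to encode k ones among n flips under Bernoulli(theta)."""
    return -(k * math.log2(theta) + (n - k) * math.log2(1 - theta))

def two_part_mdl(k, n):
    """Total description length in bits for two competing models."""
    # Model "fair": theta fixed at 0.5, so there is no parameter to transmit.
    fair = bernoulli_code_length(k, n, 0.5)
    # Model "biased": theta fitted to the data; charge 0.5 * log2(n) bits
    # for the one free parameter (assumed approximation).
    theta = min(max(k / n, 1e-9), 1 - 1e-9)  # keep the logarithms finite
    biased = 0.5 * math.log2(n) + bernoulli_code_length(k, n, theta)
    return {"fair": fair, "biased": biased}

def select(k, n):
    """Pick the model with the shortest total description."""
    lengths = two_part_mdl(k, n)
    return min(lengths, key=lengths.get)

print(select(55, 100))  # a mild imbalance does not pay for the extra parameter
print(select(90, 100))  # a strong imbalance does
```

The extra parameter is only worth its cost when it shortens the encoding of the data by more than the bits spent describing it, which is the trade-off between fit and complexity that the abstract refers to.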
A novel anomaly detection scheme based on principal component classifier
- in Proceedings of the IEEE Foundations and New Directions of Data Mining Workshop, in conjunction with the Third IEEE International Conference on Data Mining (ICDM’03)
, 2003
Cited by 29 (5 self)
This paper proposes a novel scheme that uses a robust principal component classifier for the intrusion detection problem, where the training data may be unsupervised. Assuming that anomalies can be treated as outliers, an intrusion predictive model is constructed from the major and minor principal components of normal instances. A measure of the difference of an anomaly from the normal instance is the distance in the principal component space. The distance based on the major components that account for 50% of the total variation and the minor components with eigenvalues less than 0.20 is shown to work well. The experiments with KDD Cup 1999 data demonstrate that our proposed method achieves 98.94% recall and 97.89% precision with a false alarm rate of 0.92%, and outperforms the nearest neighbor method, the density-based local outliers (LOF) approach, and the outlier detection algorithms based on the Canberra metric.
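A toy two-feature sketch can make the "distance in the principal component space" concrete. It is written under several assumptions that the abstract does not confirm: the features are standardized and the correlation matrix is used, each component score is divided by its eigenvalue, and the data and function names are invented for illustration. With two standardized features the correlation matrix is [[1, r], [r, 1]], whose eigenvalues are 1 + r and 1 - r, so no linear algebra library is needed.

```python
import math

def fit_normal(rows):
    """Means, sample standard deviations, and correlation of two features."""
    n = len(rows)
    means = [sum(r[j] for r in rows) / n for j in range(2)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(2)]
    corr = sum(((r[0] - means[0]) / stds[0]) * ((r[1] - means[1]) / stds[1])
               for r in rows) / (n - 1)
    return means, stds, corr

def component_scores(x, means, stds, corr):
    """Return (major, minor) scores y^2 / eigenvalue for one observation."""
    z = [(x[j] - means[j]) / stds[j] for j in range(2)]
    # Eigenvectors of [[1, r], [r, 1]] are (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
    y_sum = (z[0] + z[1]) / math.sqrt(2)
    y_diff = (z[0] - z[1]) / math.sqrt(2)
    # Order the components by eigenvalue: largest is major, smallest is minor.
    pairs = sorted([(1 + corr, y_sum), (1 - corr, y_diff)], reverse=True)
    (lam_major, y_major), (lam_minor, y_minor) = pairs
    return y_major ** 2 / lam_major, y_minor ** 2 / lam_minor

# Hypothetical "normal" instances in which the two features move together.
normal = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1), (6, 5.8)]
means, stds, corr = fit_normal(normal)

typical = component_scores((3.5, 3.5), means, stds, corr)
odd = component_scores((2.0, 5.0), means, stds, corr)
print(corr, typical, odd)
```

Here the minor eigenvalue 1 - r is well below 0.20, and the observation (2.0, 5.0), which breaks the correlation seen in the normal data, receives a very large minor-component score even though each of its coordinates lies within the normal range. A full detector would compare both scores against thresholds estimated from the normal instances.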
Static Detection Of Deadlocks In Polynomial Time
, 1993
Cited by 25 (1 self)
Parallel and distributed programming languages often include explicit synchronization primitives, such as rendezvous and semaphores. Such programs are subject to synchronization anomalies; the program behaves incorrectly because it has a faulty synchronization structure. A deadlock is an anomaly in which some subset of the active tasks of the program mutually wait on each other to advance; thus, the program cannot complete execution. In static anomaly detection, the source code of a program is automatically analyzed to determine if the program can ever exhibit a specific anomaly. Static anomaly detection has the unique advantage that it can certify programs to be free of the tested anomaly; dynamic testing cannot generally do this. Though exact static detection of deadlocks is NP-hard [Tay83a], many researchers have tried to detect deadlock by ...
Studying the fault-detection effectiveness of GUI test cases for rapidly evolving software
- IEEE Transactions on Software Engineering
, 2005
Cited by 23 (10 self)
Software is increasingly being developed/maintained by multiple, often geographically distributed developers working concurrently. Consequently, rapid-feedback-based quality assurance mechanisms such as daily builds and smoke regression tests, which help to detect and eliminate defects early during software development and maintenance, have become important. This paper addresses a major weakness of current smoke regression testing techniques, i.e., their inability to automatically (re)test graphical user interfaces (GUIs). Several contributions are made to the area of GUI smoke testing. First, the requirements for GUI smoke testing are identified and a GUI smoke test is formally defined as a specialized sequence of events. Second, a GUI smoke regression testing process called Daily Automated Regression Tester (DART) that automates GUI smoke testing is presented. Third, the interplay between several characteristics of GUI smoke test suites including their size, fault detection ability, and test oracles is empirically studied. The results show that: 1) the entire smoke testing process is feasible in terms of execution time, storage space, and manual effort, 2) smoke tests cannot cover certain parts of the application code, 3) having comprehensive test oracles may make up for not having long smoke test cases, and 4) using certain oracles can make up for not having large smoke test suites. Index Terms: Smoke testing, GUI testing, test oracles, empirical studies, regression testing.
A Review of Dynamic Handwritten Signature Verification
, 1997
Cited by 15 (2 self)
There is considerable interest in authentication based on handwritten signature verification (HSV) because HSV is superior to many other biometric authentication techniques, e.g. fingerprints or retinal patterns, which are reliable but much more intrusive and expensive. This paper presents a review of dynamic HSV techniques that have been reported in the literature. The paper also discusses possible applications of HSV, lists some commercial products that are available and suggests some areas for future research. 1 Introduction Our society is increasingly dependent on electronic storage and transmission of information and this has created a need for electronically verifying a person's identity. Handwritten signatures have been the normal and customary way for identity verification. Although there have been occasional disputes about the authorship of handwritten signatures (Osborn, 1929; Harrison, 1958; Hilton, 1956), verification of handwritten signatures has not been a major p...
Feature Minimization within Decision Trees
- Computational Optimization and Applications
, 1996
Cited by 13 (2 self)
Decision trees for classification can be constructed using mathematical programming. Within decision tree algorithms, the feature minimization problem is to construct accurate decisions using as few features or attributes within each decision as possible. Feature minimization is an important aspect of data mining since it helps identify what attributes are important and helps produce accurate and interpretable decision trees. In feature minimization with bounded accuracy, we minimize the number of features using a given misclassification error tolerance. This problem can be formulated as a parametric bilinear program and is shown to be NP-complete. A parametric Frank-Wolfe method is used to solve the bilinear subproblems. The resulting minimization algorithm produces more compact, accurate, and interpretable trees. This procedure can be applied to many different error functions. Formulations and results for two error functions are given. One method, FM RLP-P, dramatically reduced the number of features of one dataset from 147 to 2 while maintaining an 83.6% testing accuracy. Computational results compare favorably with the standard univariate decision tree method, C4.5, as well as with linear programming methods of tree construction. Key Words: Data mining, machine learning, feature minimization, decision trees, bilinear programming.
Knowledge Discovery and Data Mining Group, Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180. Email bredee@rpi.edu, bennek@rpi.edu. Telephone (518) 276-6899. FAX (518) 276-4824. This material is based on research supported by National Science Foundation Grant 949427.
Image Database Retrieval Utilizing Affinity Relationships
- Proceedings of the First ACM International Workshop on Multimedia Databases (ACM MMDB'03)
, 2003
Cited by 6 (3 self)
Recent research effort in Content-Based Image Retrieval (CBIR) focuses on bridging the gap between low-level features and high-level semantic contents of images, as this gap has become the bottleneck of CBIR. In this paper, an effective image database retrieval framework using a new mechanism called the Markov Model Mediator (MMM) is presented to meet this demand by taking into consideration not only the low-level image features, but also the high-level concepts learned from the history of users' access patterns and access frequencies on the images in the database. Also, the proposed framework is efficient in two aspects: 1) Overhead for real-time training is avoided in the image retrieval process because the high-level concepts of images are captured in the off-line training process. 2) Before the exact similarity matching process, Principal Component Analysis (PCA) is applied to reduce the image search space. A training subsystem for this framework is implemented and integrated into our system. The experimental results demonstrate that the MMM mechanism can effectively assist in retrieving more accurate results from image databases.

