Results 1 - 10
of
59
PR-Miner: automatically extracting implicit programming rules and detecting violations in large software code
- In Proc. 10th European Software Engineering Conference held jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE’05
, 2005
"... Programs usually follow many implicit programming rules, most of which are too tedious to be documented by programmers. When these rules are violated by programmers who are unaware of or forget about them, defects can be easily introduced. Therefore, it is highly desirable to have tools to automatic ..."
Abstract
-
Cited by 87 (9 self)
- Add to MetaCart
Programs usually follow many implicit programming rules, most of which are too tedious to be documented by programmers. When these rules are violated by programmers who are unaware of or forget about them, defects can be easily introduced. Therefore, it is highly desirable to have tools to automatically extract such rules and also to automatically detect violations. Previous work in this direction focuses on simple function-pair based programming rules and additionally requires programmers to provide rule templates. This paper proposes a general method called PR-Miner that uses a data mining technique called frequent itemset mining to efficiently extract implicit programming rules from large software code written in an industrial programming language such as C, requiring little effort from programmers and no prior knowledge of the software. Benefiting from frequent itemset mining, PR-Miner can extract programming
Hierarchical Document Clustering Using Frequent Itemsets
- IN PROC. SIAM INTERNATIONAL CONFERENCE ON DATA MINING 2003 (SDM 2003
, 2003
"... A major challenge in document clustering is the extremely high dimensionality. For example, the vocabulary for a document set can easily be thousands of words. On the other hand, each document often contains a small fraction of words in the vocabulary. These features require special handlings. Anoth ..."
Abstract
-
Cited by 55 (1 self)
- Add to MetaCart
A major challenge in document clustering is the extremely high dimensionality. For example, the vocabulary for a document set can easily be thousands of words. On the other hand, each document often contains a small fraction of words in the vocabulary. These features require special handlings. Another requirement is hierarchical clustering where clustered documents can be browsed according to the increasing specificity of topics. In this paper, we propose to use the notion of frequent itemsets, which comes from association rule mining, for document clustering. The intuition of our clustering criterion is that each cluster is identified by some common words, called frequent itemsets, for the documents in the cluster. Frequent itemsets are also used to produce a hierarchical topic tree for clusters. By focusing on frequent items, the dimensionality of the document set is drastically reduced. We show that this method outperforms best existing methods in terms of both clustering accuracy and scalability.
Discovering Spatial Co-location Patterns: A Summary of Results
- Lecture Notes in Computer Science
, 2001
"... Given a collection of boolean spatial features, the co-location pattern discovery process finds the subsets of features frequently located together. For example, the analysis of an ecology dataset may reveal the frequent co-location of a fire ignition source feature with a needle vegetation type fea ..."
Abstract
-
Cited by 44 (5 self)
- Add to MetaCart
Given a collection of boolean spatial features, the co-location pattern discovery process finds the subsets of features frequently located together. For example, the analysis of an ecology dataset may reveal the frequent co-location of a fire ignition source feature with a needle vegetation type feature and a drought feature. The spatial co-location rule problem is different from the association rule problem. Even though boolean spatial feature types (also called spatial events) may correspond to items in association rules over market-basket datasets, there is no natural notion of transactions. This creates difficulty in using traditional measures (e.g. support, confidence) and applying association rule mining algorithms which use support based pruning. We propose a notion of user-specified neighborhoods in place of transactions to specify groups of items. New interest measures for spatial co-location patterns are proposed which are robust in the face of potentially infinite overlapping neighborhoods. We also propose an algorithm to mine frequent spatial co-location patterns and analyze its correctness, and completeness. We plan to carry out experimental evaluations and performance tuning in the near future.
Fuzzy association rules: general model and applications
- IEEE Transactions on Fuzzy Systems
, 2003
"... Abstract—The theory of fuzzy sets has been recognized as a suitable tool to model several kinds of patterns that can hold in data. In this paper, we are concerned with the development of a general model to discover association rules among items in a (crisp) set of fuzzy transactions. This general mo ..."
Abstract
-
Cited by 27 (14 self)
- Add to MetaCart
Abstract—The theory of fuzzy sets has been recognized as a suitable tool to model several kinds of patterns that can hold in data. In this paper, we are concerned with the development of a general model to discover association rules among items in a (crisp) set of fuzzy transactions. This general model can be particularized in several ways; each particular instance corresponds to a certain kind of pattern and/or repository of data. We describe some applications of this scheme, paying special attention to the discovery of fuzzy association rules in relational databases. Index Terms—Association rules, data mining, fuzzy transactions, quantified sentences. I.
A Partial Join Approach for Mining Co-location Patterns
- In the Proceeding of ACM International Symposium on Advances in Geographic Information Systems(ACM-GIS
, 2004
"... Spatial co-location patterns represent the subsets of events whose instances are frequently located together in geographic space. We identified the computational bottleneck in the execution time of a current co-location mining algorithm. A large fraction of the join-based co-location miner algorithm ..."
Abstract
-
Cited by 19 (7 self)
- Add to MetaCart
Spatial co-location patterns represent the subsets of events whose instances are frequently located together in geographic space. We identified the computational bottleneck in the execution time of a current co-location mining algorithm. A large fraction of the join-based co-location miner algorithm is devoted to computing joins to identify instances of candidate co-location patterns. We propose a novel partialjoin approach for mining co-location patterns efficiently. It transactionizes continuous spatial data while keeping track of the spatial information not modeled by transactions. It uses a transaction-based Apriori algorithm as a building block and adopts the instance join method for residual instances not identified in transactions. We show that the algorithm is correct and complete in finding all co-location rules which have prevalence and conditional probability above the given thresholds. An experimental evaluation using synthetic datasets and a real dataset shows that our algorithm is computationally more efficient than the join-based algorithm.
Generalizing the Notion of Support
- In KDD’04
, 2004
"... The goal of this paper is to show that generalizing the notion of support can be useful in extending association analysis to non-traditional types of patterns and non-binary data. To that end, we describe a framework for generalizing support that is based on the simple, but useful observation that s ..."
Abstract
-
Cited by 17 (5 self)
- Add to MetaCart
The goal of this paper is to show that generalizing the notion of support can be useful in extending association analysis to non-traditional types of patterns and non-binary data. To that end, we describe a framework for generalizing support that is based on the simple, but useful observation that support can be viewed as the composition of two functions: a function that evaluates the strength or presence of a pattern in each object (transaction) and a function that summarizes these evaluations with a single number. A key goal of any framework is to allow people to more easily express, explore, and communicate ideas, and hence, we illustrate how our support framework can be used to describe support for a variety of commonly used association patterns, such as frequent itemsets, general Boolean patterns, and error-tolerant itemsets. We also present two examples of the practical usefulness of generalized support. One example shows the usefulness of support functions for continuous data. Another example shows how the hyperclique pattern---an association pattern originally defined for binary data---can be extended to continuous data by generalizing a support function.
On Mining General Temporal Association Rules in a Publication Database
- In Proceedings of ICDM’2001
, 2001
"... In this paper, we explore a new problem of mining general temporal association rules in publication databases. In essence, a publication database is a set of transactions where each transaction T is a set of items of which each item contains an individual exhibition period. The current model of asso ..."
Abstract
-
Cited by 15 (2 self)
- Add to MetaCart
In this paper, we explore a new problem of mining general temporal association rules in publication databases. In essence, a publication database is a set of transactions where each transaction T is a set of items of which each item contains an individual exhibition period. The current model of association rule mining is not able to handle the publication database due to the following fundamental problems, i.e., (1) lack of consideration of the exhibition period of each individual item; (2) lack of an equitable support counting basis for each item. To remedy this, we propose an innovative algorithm Progressive-Partition-Miner (abbreviatedly as PPM) to discover general temporal association rules in a publication database. The basic idea of PPM is to first partition the publication database in light of exhibition periods of items and then progressively accumulate the occurrence count of each candidate 2-itemset based on the intrinsic partitioning characteristics. Algorithm PPM is also designed to employ a filtering threshold in each partition to early prune out those cumulatively infrequent 2-itemsets. Explicitly, the execution time of PPM is, in orders of magnitude, smaller than those required by the schemes which are directly extended from existing methods. 1
Unsupervised Anomaly Detection in Network Intrusion Detection Using Clusters
- In Proc. 28th Australasian CS Conf., volume 38 of CRPITV
, 2005
"... Most current network intrusion detection systems employ signature-based methods or data mining-based methods which rely on labelled training data. This training data is typically expensive to produce. Moreover, these methods have difficulty in detecting new types of attack. Using unsupervised anomal ..."
Abstract
-
Cited by 15 (0 self)
- Add to MetaCart
Most current network intrusion detection systems employ signature-based methods or data mining-based methods which rely on labelled training data. This training data is typically expensive to produce. Moreover, these methods have difficulty in detecting new types of attack. Using unsupervised anomaly detection techniques, however, the system can be trained with unlabelled data and is capable of detecting previously "unseen" attacks. In this paper, we present a new density-based and grid-based clustering algorithm that is suitable for unsupervised anomaly detection. We evaluated our methods using the 1999 KDD Cup data set. Our evaluation shows that the accuracy of our approach is close to that of existing techniques reported in the literature, and has several advantages in terms of computational complexity.
Generating Frequent Itemsets Incrementally: Two Novel Approaches Based on Galois Lattice Theory
, 2002
"... Galois (concept) lattice theory has been successfully applied to the resolution of the association rule problem in data mining. In particular, structural results about lattices have been used in the design of e#cient procedures for mining the frequent patterns (itemsets) in a transaction database. ..."
Abstract
-
Cited by 13 (3 self)
- Add to MetaCart
Galois (concept) lattice theory has been successfully applied to the resolution of the association rule problem in data mining. In particular, structural results about lattices have been used in the design of e#cient procedures for mining the frequent patterns (itemsets) in a transaction database.
Is Pushing Constraints Deeply into the Mining Algorithms really what we want? - An Alternative Approach for Association Rule Mining
, 2002
"... The common approach to exploit mining constraints is to push them deeply into the mining algorithms. In our paper we argue that this approach is based on an understanding of KDD that is no longer up-to-date. In fact, today KDD is seen as a human centered, highly interactive and iterative process. Bl ..."
Abstract
-
Cited by 12 (0 self)
- Add to MetaCart
The common approach to exploit mining constraints is to push them deeply into the mining algorithms. In our paper we argue that this approach is based on an understanding of KDD that is no longer up-to-date. In fact, today KDD is seen as a human centered, highly interactive and iterative process. Blindly enforcing constraints already during the mining runs neglects the process character of KDD and therefore is no longer state of the art. Constraints can make a single algorithm run faster but in fact we are still far from response times that would allow true interactivity in KDD. In addition we pay the price of repeated mining runs and moreover risk reducing data mining to some kind of hypothesis testing. Taking all the above into consideration we propose to do exactly the contrary of constrained mining: We accept an initial (nearly) unconstrained and costly mining run. But instead of a sequence of subsequent and still expensive constrained mining runs we answer all further mining queries from this initial result set. Whereas this is straight forward for constraints that can be implemented as filters on the result set, things get more complicated when we restrict the underlying mining data. Actually in practice such constraints are very important, e.g. the generation of rules for certain days of the week, for families, singles, male or female customers etc. We show how to postpone such rowrestriction constraints on the transactions from rule generation to rule retrieval from the initial result set.

