Results 1 - 10
of
77
Exploratory Mining and Pruning Optimizations of Constrained Associations Rules
, 1998
"... From the standpoint of supporting human-centered discovery of knowledge, the present-day model of mining association rules suffers from the following serious shortcom- ings: (i) lack of user exploration and control, (ii) lack of focus, and (iii) rigid notion of relationships. In effect, this model f ..."
Abstract
-
Cited by 245 (41 self)
- Add to MetaCart
From the standpoint of supporting human-centered discovery of knowledge, the present-day model of mining association rules suffers from the following serious shortcom- ings: (i) lack of user exploration and control, (ii) lack of focus, and (iii) rigid notion of relationships. In effect, this model functions as a black-box, admitting little user interaction in between. We propose, in this paper, an architecture that opens up the black-box, and supports constraintbased, human-centered exploratory mining of associations. The foundation of this architecture is a rich set of con- straint constructs, including domain, class, and $QL-style aggregate constraints, which enable users to clearly specify what associations are to be mined. We propose constrained association queries as a means of specifying the constraints to be satisfied by the antecedent and consequent of a mined association.
Mining the Most Interesting Rules
, 1999
"... Several algorithms have been proposed for finding the “best, ” “optimal,” or “most interesting ” rule(s) in a database according to a variety of metrics including confidence, support, gain, chi-squared value, gini, entropy gain, laplace, lift, and conviction. In this paper, we show that the best rul ..."
Abstract
-
Cited by 110 (1 self)
- Add to MetaCart
Several algorithms have been proposed for finding the “best, ” “optimal,” or “most interesting ” rule(s) in a database according to a variety of metrics including confidence, support, gain, chi-squared value, gini, entropy gain, laplace, lift, and conviction. In this paper, we show that the best rule according to any of these metrics must reside along a support/confidence border. Further, in the case of conjunctive rule mining within categorical data, the number of rules along this border is conveniently small, and can be mined efficiently from a variety of real-world data-sets. We also show how this concept can be generalized to mine all rules that are best according to any of these criteria with respect to an arbitrary subset of the population of interest. We argue that by returning a broader set of rules than previous algorithms, our techniques allow for improved insight into the data and support more user-interaction in the optimized rule-mining process. 1.
Clustering Association Rules
, 1997
"... We consider the problem of clustering two-dimensional association rules in large databases. We present a geometric-based algorithm, BitOp, for performing the clustering, embedded within an association rule clustering system, ARCS. Association rule clustering is useful when the user desires to segmen ..."
Abstract
-
Cited by 99 (0 self)
- Add to MetaCart
We consider the problem of clustering two-dimensional association rules in large databases. We present a geometric-based algorithm, BitOp, for performing the clustering, embedded within an association rule clustering system, ARCS. Association rule clustering is useful when the user desires to segment the data. We measure the quality of the segmentation generated by ARCS using the Minimum Description Length (MDL) principle of encoding the clusters on several databases including noise and errors. Scale-up experiments show that ARCS, using the BitOp algorithm, scales linearly with the amount of data. 1 Introduction Data mining, or the efficient discovery of interesting patterns from large collections of data, has been recognized as an important area of database research. The most commonly sought patterns are association rules as introduced in [AIS93b]. Intuitively, an association rule identifies a frequently occuring pattern of information in a database. Consider a supermarket database w...
BOAT -- Optimistic Decision Tree Construction
, 1999
"... Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. A very popular class of classifiers are decision ..."
Abstract
-
Cited by 97 (1 self)
- Add to MetaCart
Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. A very popular class of classifiers are decision trees. All current algorithms to construct decision trees, including all main-memory algorithms, make one scan over the training database per level of the tree. We introduce a new algorithm (BOAT) for decision tree construction that improves upon earlier algorithms in both performance and functionality. BOAT constructs several levels of the tree in only two scans over the training database, resulting in an average performance gain of 300% over previous work. The key to this performance improvement is a novel optimistic approach to tree construction in which we construct an initial tree using a small subset of the data and refine it to arrive at the final tree. We guarantee that any differen...
Entropy-based Subspace Clustering for Mining Numerical Data
, 1999
"... Mining numerical data is a relatively difficult problem in data mining. Clustering is one of the techniques. We consider a database with numerical attributes, in which each transaction is viewed as a multi-dimensional vector. By studying the clusters formed by these vectors, we can discover certain ..."
Abstract
-
Cited by 89 (1 self)
- Add to MetaCart
Mining numerical data is a relatively difficult problem in data mining. Clustering is one of the techniques. We consider a database with numerical attributes, in which each transaction is viewed as a multi-dimensional vector. By studying the clusters formed by these vectors, we can discover certain behaviors hidden in the data. Traditional clustering algorithms find clusters in the full space of the data sets. This results in high dimensional clusters, which are poorly comprehensible to human. One important task in this setting is the ability to discover clusters embedded in the subspaces of a high-dimensional data set. This problem is known as subspace clustering. We follow the basic assumptions of previous work CLIQUE. It is found that the number of subspaces with clustering is very large, and a criterion called the coverage is proposed in CLIQUE for the pruning. In addition to coverage, we identify new useful criteria for this problem and propose an entropybased algorithm called ENC...
Efficient substructure discovery from large semi-structured data
, 2002
"... By rapid progress of network and storage technologies, a huge amount of electronic data such as Web pages and XML data [23] has been available on intra and internet. These electronic data are heterogeneous collection of ill-structured data that have no rigid structures, and often called semi-structu ..."
Abstract
-
Cited by 87 (9 self)
- Add to MetaCart
By rapid progress of network and storage technologies, a huge amount of electronic data such as Web pages and XML data [23] has been available on intra and internet. These electronic data are heterogeneous collection of ill-structured data that have no rigid structures, and often called semi-structured data [1]. Hence, there have been
A Statistical Theory for Quantitative Association Rules
- Journal of Intelligent Information Systems
, 1999
"... Association rules are a key data-mining tool and as such have been well researched. So far, this research has focused predominantly on databases containing categorical data only. However, many real-world databases contain quantitative attributes and current solutions for this case are so far inad ..."
Abstract
-
Cited by 73 (0 self)
- Add to MetaCart
Association rules are a key data-mining tool and as such have been well researched. So far, this research has focused predominantly on databases containing categorical data only. However, many real-world databases contain quantitative attributes and current solutions for this case are so far inadequate. We introduce a new definition of quantitative association rules based on statistical inference theory. Our definition reflects the intuition that the goal of association rules is to find extraordinary and therefore interesting phenomena in databases. We also introduce the concept of sub-rules which can be applied to any type of association rule. Rigorous experimental evaluation on real-world datasets is presented, demonstrating the usefulness and characteristics of rules mined according to our definition. 1 Introduction Association Rules. The goal of data mining is to extract higher level information from an abundance of raw data. Association rules are a key tool used for this...
Cyclic Association Rules
- In Proc. 1998 Int. Conf. Data Engineering (ICDE'98
, 1998
"... We study the problem of discovering association rules that display regular cyclic variation over time. For example, if we compute association rules over monthly sales data, we may observe seasonal variation where certain rules are true at approximately the same month each year. Similarly, associatio ..."
Abstract
-
Cited by 72 (1 self)
- Add to MetaCart
We study the problem of discovering association rules that display regular cyclic variation over time. For example, if we compute association rules over monthly sales data, we may observe seasonal variation where certain rules are true at approximately the same month each year. Similarly, association rules can also display regular hourly, daily, weekly, etc., variation that is cyclical in nature. We demonstrate that existing methods cannot be naively extended to solve this problem of cyclic association rules. We then present two new algorithms for discovering such rules. The first one, which we call the sequential algorithm, treats association rules and cycles more or less independently. By studying the interaction between association rules and time, we devise a new technique called cycle pruning, which reduces the amount of time needed to find cyclic association rules. The second algorithm, which we call the interleaved algorithm, uses cycle pruning and other optimization techniques f...
Optimization of Constrained Frequent Set Queries with 2-variable Constraints
, 1999
"... Currently, there is tremendous interest in providing ad-hoc mining capabilities in database management systems. As a first step towards this goal, in [15] we proposed an architecture for supporting constraint-based, human-centered, exploratory mining of various kinds of rules including associations, ..."
Abstract
-
Cited by 70 (13 self)
- Add to MetaCart
Currently, there is tremendous interest in providing ad-hoc mining capabilities in database management systems. As a first step towards this goal, in [15] we proposed an architecture for supporting constraint-based, human-centered, exploratory mining of various kinds of rules including associations, introduced the notion of constrained frequent set queries (CFQs), and developed effective pruning optimizations for CFQs with 1-variable (1-var) constraints. While 1-var constraints are useful for constraining the antecedent and consequent separately, many natural examples of CFQs illustrate the need for constraining the antecedent and consequent jointly, for which 2-variable (2-var) constraints are indispensable. Developing pruning optimizations for CFQs with 2-var constraints is the subject of this paper. But this is a difficult problem because: (i) in 2var constraints, both variables keep changing and, unlike 1-var constraints, there is no fixed target for pruning; (ii) as we show, "conventional" monotonicity-based optimization techniques do not apply effectively to 2-var constraints. The contributions are as follows. (1) We introduce a notion of quasi-succinctness, which allows a quasi-succinct 2-var constraint to be reduced to two succinct 1-var constraints for pruning. (2) We characterize the class of 2-var constraints that are quasi-succinct. (3) We develop heuristic techniques for non-quasi-succinct constraints. Experimental results show the e ectiveness of all our techniques. (4) We propose a query optimizer for CFQs and show that for a large class of constraints, the computation strategy generated by the optimizer is ccc-optimal, i.e., minimizing the effort incurred w.r.t. constraint checking and support counting.
Depth First Generation of Long Patterns
, 2000
"... In this paper we present an algorithm for mining long patterns in databases. The algorithm finds large itemsets by using depth first search on a lexicographic tree of itemsets. The focus of this paper is to develop CPU-efficient algorithms for finding frequent itemsets in the cases when the database ..."
Abstract
-
Cited by 69 (1 self)
- Add to MetaCart
In this paper we present an algorithm for mining long patterns in databases. The algorithm finds large itemsets by using depth first search on a lexicographic tree of itemsets. The focus of this paper is to develop CPU-efficient algorithms for finding frequent itemsets in the cases when the database contains patterns which are very wide. We refer to this algorithm as DepthProject, and it achieves more than one order of magnitude speedup over the recently proposed MaxMiner algorithm for finding long patterns. These techniques may be quite useful for applications in areas such as computational biology in which the number of records is relatively small, but the itemsets are very long. This necessitates the discovery of patterns using algorithms which are especially tailored to the nature of such domains.

