Summarizing itemset patterns: a profile-based approach
- In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining
, 2005
Cited by 29 (5 self)
Frequent-pattern mining has been studied extensively, with scalable methods for mining various kinds of patterns including itemsets, sequences, and graphs. However, the bottleneck of frequent-pattern mining lies not in efficiency but in interpretability, due to the huge number of patterns generated by the mining process. In this paper, we examine how to summarize a collection of itemset patterns using only K representatives, a small number of patterns that a user can handle easily. The K representatives should not only cover most of the frequent patterns but also approximate their supports. A generative model is built to extract and profile these representatives, under which the supports of the patterns can be easily recovered without consulting the original dataset. Based on the restoration error, we propose a quality measure function to determine the optimal value of parameter K. Polynomial-time algorithms are developed together with several optimization heuristics for efficiency improvement. Empirical studies indicate that we can obtain compact summarizations of real datasets.
Generalizing the Notion of Support
- In KDD’04
, 2004
Cited by 17 (5 self)
The goal of this paper is to show that generalizing the notion of support can be useful in extending association analysis to non-traditional types of patterns and non-binary data. To that end, we describe a framework for generalizing support that is based on the simple but useful observation that support can be viewed as the composition of two functions: a function that evaluates the strength or presence of a pattern in each object (transaction) and a function that summarizes these evaluations with a single number. A key goal of any framework is to allow people to more easily express, explore, and communicate ideas, and hence, we illustrate how our support framework can be used to describe support for a variety of commonly used association patterns, such as frequent itemsets, general Boolean patterns, and error-tolerant itemsets. We also present two examples of the practical usefulness of generalized support. One example shows the usefulness of support functions for continuous data. Another example shows how the hyperclique pattern, an association pattern originally defined for binary data, can be extended to continuous data by generalizing a support function.
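The two-function view of support described in this abstract can be illustrated with a minimal sketch. The function names (`generalized_support`, `all_present`, `fraction_present`) and the toy data are hypothetical and are not taken from the paper; the sketch only shows how swapping the per-transaction evaluation function changes the kind of pattern being measured.

```python
from typing import Callable, FrozenSet, List, Sequence

Transaction = FrozenSet[str]


def generalized_support(
    pattern: FrozenSet[str],
    transactions: Sequence[Transaction],
    evaluate: Callable[[FrozenSet[str], Transaction], float],
    summarize: Callable[[List[float]], float],
) -> float:
    """Support as a composition: summarize(evaluate(pattern, t) for each t)."""
    return summarize([evaluate(pattern, t) for t in transactions])


def all_present(pattern: FrozenSet[str], t: Transaction) -> float:
    """Classic itemset evaluation: 1 if every item of the pattern is in t."""
    return 1.0 if pattern <= t else 0.0


def fraction_present(pattern: FrozenSet[str], t: Transaction) -> float:
    """A tolerant evaluation (an assumption): fraction of pattern items in t."""
    return len(pattern & t) / len(pattern)


def mean(values: List[float]) -> float:
    """Summarization function: average of the per-transaction evaluations."""
    return sum(values) / len(values)


transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("d")]
pattern = frozenset("ab")

classic = generalized_support(pattern, transactions, all_present, mean)
tolerant = generalized_support(pattern, transactions, fraction_present, mean)
print(classic)   # 0.5   (two of four transactions contain both a and b)
print(tolerant)  # 0.625 (the partial match in {a, c} now contributes 0.5)
```

Keeping the evaluation and summarization steps as separate arguments is what would let the same scaffold cover continuous data, for example by passing an evaluation function that returns a minimum or product of attribute values instead of a 0/1 indicator.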
On Mining General Temporal Association Rules in a Publication Database
- In Proceedings of ICDM’2001
, 2001
Cited by 15 (2 self)
In this paper, we explore a new problem of mining general temporal association rules in publication databases. In essence, a publication database is a set of transactions where each transaction T is a set of items, each of which has an individual exhibition period. The current model of association rule mining cannot handle the publication database because of two fundamental problems: (1) lack of consideration of the exhibition period of each individual item; (2) lack of an equitable support counting basis for each item. To remedy this, we propose an innovative algorithm, Progressive-Partition-Miner (abbreviated as PPM), to discover general temporal association rules in a publication database. The basic idea of PPM is to first partition the publication database in light of exhibition periods of items and then progressively accumulate the occurrence count of each candidate 2-itemset based on the intrinsic partitioning characteristics. Algorithm PPM is also designed to employ a filtering threshold in each partition to prune cumulatively infrequent 2-itemsets early. The execution time of PPM is orders of magnitude smaller than that required by schemes extended directly from existing methods.
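The "equitable support counting basis" mentioned in this abstract can be illustrated with a small sketch. It is not the PPM algorithm: it omits the progressive accumulation and the per-partition filtering threshold. It assumes that an itemset is counted only over the partitions in which all of its items are already on exhibit, and that exhibition runs to the end of the database. The names `relative_support` and `start_partition` are hypothetical.

```python
from typing import Dict, FrozenSet, List


def relative_support(
    itemset: FrozenSet[str],
    partitions: List[List[FrozenSet[str]]],
    start_partition: Dict[str, int],
) -> float:
    """Support of `itemset` relative to the transactions in its common
    exhibition window (assumed: from the latest item start to the end)."""
    begin = max(start_partition[item] for item in itemset)
    window = [t for part in partitions[begin:] for t in part]
    if not window:
        return 0.0
    hits = sum(1 for t in window if itemset <= t)
    return hits / len(window)


# Two time-ordered partitions; item c only goes on exhibit in partition 1.
partitions = [
    [frozenset("ab"), frozenset("a")],
    [frozenset("ac"), frozenset("ac"), frozenset("b")],
]
start_partition = {"a": 0, "b": 0, "c": 1}

ac = frozenset("ac")
all_transactions = [t for part in partitions for t in part]
naive = sum(1 for t in all_transactions if ac <= t) / len(all_transactions)
fair = relative_support(ac, partitions, start_partition)
print(naive)  # 0.4 over all five transactions
print(fair)   # 2 of the 3 transactions in the window where c exists
```

Under a single global threshold such as 0.5, the naive count would discard {a, c} even though it appears in most transactions recorded after c became available.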
Dense Itemsets
- In SIGKDD 2004
, 2004
Cited by 14 (0 self)
Frequent itemset mining has been the subject of a lot of work in data mining research ever since association rules were introduced. In this paper we address a problem with frequent itemsets: that they only count rows where all their attributes are present, and do not allow for any noise. We show that generalizing the concept of frequency while preserving the performance of mining algorithms is nontrivial, and introduce a generalization of frequent itemsets, dense itemsets. Dense itemsets do not require all attributes to be present at the same time; instead, the itemset needs to define a sufficiently large submatrix that exceeds a given density threshold of attributes present.
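A minimal sketch of a density check in the spirit of this abstract follows. The exact definition used in the paper is not given here, so the sketch assumes one simple reading: an itemset is dense if some set of at least `min_rows` rows has a fraction of 1s, over the itemset's columns, of at least `min_density`. The function name and parameters are hypothetical.

```python
from typing import FrozenSet, Sequence


def is_dense(
    itemset: FrozenSet[str],
    rows: Sequence[FrozenSet[str]],
    min_rows: int,
    min_density: float,
) -> bool:
    """True if the best `min_rows` rows give a submatrix over `itemset`
    whose fraction of present attributes is at least `min_density`."""
    if not itemset or min_rows <= 0 or len(rows) < min_rows:
        return False
    # Number of itemset attributes present in each row, best rows first.
    hits = sorted((len(itemset & row) for row in rows), reverse=True)
    # The top `min_rows` rows maximize the density over all admissible row sets.
    density = sum(hits[:min_rows]) / (min_rows * len(itemset))
    return density >= min_density


rows = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("d")]
abc = frozenset("abc")

exact_support = sum(1 for row in rows if abc <= row)
print(exact_support)                  # 1: only one row contains a, b and c
print(is_dense(abc, rows, 3, 0.75))   # True: best three rows hold 7 of 9 cells
print(is_dense(abc, rows, 3, 0.80))   # False: 7/9 is below 0.80
```

The example shows the point made in the abstract: {a, b, c} has exact support 1, yet three rows each contain most of its attributes.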
Support envelopes: a technique for exploring the structure of association patterns
- In SIGKDD 2004
, 2004
Cited by 10 (1 self)
This paper introduces support envelopes—a new tool for analyzing association patterns—and illustrates some of their properties, applications, and possible extensions. Specifically, the support envelope for a transaction data set and a specified pair of positive integers (m, n) consists of the items and transactions that need to be searched to find any association pattern involving m or more transactions and n or more items. For any transaction data set with M transactions and N items, there is a unique lattice of at most M ∗ N support envelopes that captures the structure of the association patterns in that data set. Because support envelopes are not encumbered by a support threshold, this support lattice provides a complete view of the association structure of the data set, including association patterns that have low support. Furthermore, the boundary of the support lattice—the support boundary—has at most min(M, N) envelopes and is especially interesting since it bounds the maximum sizes of potential association patterns—not only for frequent, closed, and maximal itemsets, but also for patterns, such as error-tolerant itemsets, that are more general. The association structure can be represented graphically as a two-dimensional scatter plot of the (m, n) values associated with the support envelopes of the data set, a feature that is useful in the exploratory analysis of association patterns. Finally, the algorithm to compute support envelopes is simple and computationally efficient, and it is straightforward to parallelize the process of finding all the support envelopes.
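The abstract does not spell out the algorithm, so the following is only a sketch of one plausible way to compute a support envelope for a given (m, n). It assumes the envelope can be reached by repeatedly discarding items that occur in fewer than m remaining transactions and transactions that hold fewer than n remaining items, until nothing changes. The function name `support_envelope` and the toy data are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple


def support_envelope(
    transactions: Dict[Hashable, Iterable[str]], m: int, n: int
) -> Tuple[Set[Hashable], Set[str]]:
    """Return (transaction ids, items) left after iterative pruning.

    Assumes m and n are positive integers.
    """
    rows = {tid: set(items) for tid, items in transactions.items()}
    changed = True
    while changed:
        changed = False
        # Count how many remaining transactions contain each remaining item.
        counts: Dict[str, int] = {}
        for items in rows.values():
            for item in items:
                counts[item] = counts.get(item, 0) + 1
        weak_items = {item for item, c in counts.items() if c < m}
        if weak_items:
            changed = True
            for items in rows.values():
                items.difference_update(weak_items)
        # Drop transactions that no longer hold at least n items.
        short = [tid for tid, items in rows.items() if len(items) < n]
        if short:
            changed = True
            for tid in short:
                del rows[tid]
    remaining_items: Set[str] = set()
    for items in rows.values():
        remaining_items |= items
    return set(rows), remaining_items


data = {1: "abc", 2: "abc", 3: "ab", 4: "cd", 5: "e"}
print(support_envelope(data, 2, 2))  # transactions {1, 2, 3}, items {a, b, c}
print(support_envelope(data, 3, 2))  # transactions {1, 2, 3}, items {a, b}
```

In the second call, removing transaction 4 drops the count of item c below m = 3, so c is pruned in a later pass. This cascading is why the pruning is repeated to a fixed point.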
Comparing subspace clusterings
- IEEE Transactions on Knowledge and Data Engineering
, 2004
Cited by 10 (1 self)
We present the first framework for comparing subspace clusterings. We propose several distance measures for subspace clusterings, including generalizations of well-known distance measures for ordinary clusterings. We describe a set of important properties for any measure for comparing subspace clusterings and give a systematic comparison of our proposed measures in terms of these properties. We validate the usefulness of our subspace clustering distance measures by comparing clusterings produced by the algorithms FastDOC, HARP, PROCLUS, ORCLUS, and SSPC. We show that our distance measures can also be used to compare partial clusterings, overlapping clusterings, and patterns in binary data matrices. Index Terms: Subspace clustering, projected clustering, distance, feature selection, cluster validation.
Mining approximate frequent itemsets in the presence of noise: Algorithm and analysis
- In Proceedings of the 6th SIAM Conference on Data Mining (SDM
, 2006
Cited by 9 (3 self)
Frequent itemset mining is a popular and important first step in the analysis of data arising in a broad range of applications. The traditional “exact” model for frequent itemsets requires that every item occur in each supporting transaction. However, real data is typically subject to noise and measurement error. To date, the effect of noise on exact frequent pattern mining algorithms has been addressed primarily through simulation studies, and there has been limited attention to the development of noise-tolerant algorithms. In this paper we propose a noise-tolerant itemset model, which we call approximate frequent itemsets (AFI). Like frequent itemsets, the AFI model requires that an itemset has a minimum number of supporting transactions. However,
RBA: An Integrated Framework for Regression based on Association Rules
- In Proc. of the 4th SIAM Int'l Conf. on Data Mining
, 2004
Significance and recovery of block structures in binary matrices with noise
- Department of Statistics and Operation Research, UNC Chapel
, 2005
Cited by 5 (2 self)
Frequent itemset mining (FIM) is one of the core problems in the field of Data Mining and occupies a central place in its literature. One equivalent form of FIM can be stated as follows: given a rectangular data matrix with binary entries, find every submatrix of 1s having a minimum number of columns. This paper presents a theoretical analysis of several statistical questions related to this problem when noise is present. We begin by establishing several results concerning the extremal behavior of submatrices of ones in a binary matrix with random entries. These results provide simple significance bounds for the output of FIM algorithms. We then consider the noise sensitivity of FIM algorithms under a simple binary additive noise model, and show that, even at small noise levels, large blocks of 1s leave behind fragments of only logarithmic size. Thus such blocks cannot be directly recovered by FIM algorithms, which search for submatrices of all 1s. On the positive side, we show how, in the presence of noise, an error-tolerant criterion can recover a square submatrix of 1s against a background of 0s, even when the size of the target submatrix is very small.
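The matrix formulation in this abstract, and the way a single flipped entry fragments a block of 1s, can be illustrated with a brute-force sketch. It is exponential in the number of columns and is meant only for toy inputs. It treats rows as transactions and thresholds the number of supporting rows, which is an assumption about orientation (transposing the matrix gives the column-threshold form). The function name `all_ones_submatrices` is hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def all_ones_submatrices(
    matrix: List[List[int]], min_rows: int
) -> Dict[Tuple[int, ...], List[int]]:
    """Map each column set to the rows that are all 1s on it, keeping
    only column sets supported by at least `min_rows` rows."""
    n_cols = len(matrix[0]) if matrix else 0
    found: Dict[Tuple[int, ...], List[int]] = {}
    for size in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            rows = [i for i, row in enumerate(matrix) if all(row[c] == 1 for c in cols)]
            if len(rows) >= min_rows:
                found[cols] = rows
    return found


# A clean 3x3 block of 1s in the top-left corner of a 4x4 matrix.
clean = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
# The same matrix with one entry of the block flipped to 0.
noisy = [row[:] for row in clean]
noisy[1][1] = 0

clean_sets = all_ones_submatrices(clean, min_rows=3)
noisy_sets = all_ones_submatrices(noisy, min_rows=3)
print((0, 1, 2) in clean_sets)  # True: the full block is found
print(sorted(noisy_sets))       # [(0,), (0, 2), (2,)]: only fragments remain
```

With one flipped bit, every column set containing column 1 loses a supporting row and falls below the threshold, so an exact all-1s search returns only the smaller fragments.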
AC-Close: efficiently mining approximate closed itemsets by core pattern recovery
- In Proc. ICDM’06
, 2006
Cited by 5 (1 self)
Recent studies have proposed methods to discover approximate frequent itemsets in the presence of random noise. By relaxing the rigid requirement of exact frequent pattern mining, some interesting patterns, which would previously be fragmented by exact pattern mining methods due to random noise or measurement error, are successfully recovered. Unfortunately, a large number of “uninteresting” candidates are explored as well during the mining process, as a result of the relaxed pattern mining methodology. This severely slows down the mining process. Even worse, it is hard for an end user to distinguish the recovered interesting patterns from these uninteresting ones. In this paper, we propose an efficient algorithm, AC-Close, to recover the approximate closed itemsets from “core patterns”. By focusing on the so-called core patterns, combined with top-down mining and several effective pruning strategies, the algorithm narrows the search space down to the potentially interesting patterns. Experimental results show that AC-Close substantially outperforms the previously proposed method in terms of efficiency, while delivering a similar set of interesting recovered patterns.

