## Efficient discovery of error-tolerant frequent itemsets in high dimensions (2001)

Venue: SIGKDD 2001

Citations: 58 (0 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Yang01efficientdiscovery,
  author    = {Cheng Yang},
  title     = {Efficient discovery of error-tolerant frequent itemsets in high dimensions},
  booktitle = {SIGKDD 2001},
  year      = {2001},
  pages     = {194--203},
  publisher = {ACM Press}
}
```

### Abstract

We present a generalization of frequent itemsets that allows for errors in the itemset definition. We motivate the problem and present an efficient algorithm that identifies error-tolerant frequent clusters of items in transactional data (customer-purchase data, web browsing data, text, etc.). The algorithm exploits sparseness of the underlying data to find large groups of items that are correlated over database records (rows). The notion of transaction coverage allows us to extend the algorithm and view it as a fast clustering algorithm for discovering segments of similar transactions in binary sparse data. We evaluate the new algorithm on three real-world applications: clustering high-dimensional data, query selectivity estimation, and collaborative filtering. Results show that the algorithm consistently uncovers structure in large sparse databases that traditional clustering algorithms fail to find.
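The core object the abstract describes can be illustrated with a small sketch. The check below is a hedged reconstruction, not the paper's exact formulation: it treats an itemset as error-tolerant (at error `eps` and support `kappa`) if at least a `kappa` fraction of transactions each cover at least a `(1 - eps)` fraction of the itemset. All names and the toy data are illustrative.

```python
def is_error_tolerant(itemset: frozenset, transactions: list,
                      eps: float, kappa: float) -> bool:
    """Illustrative error-tolerant itemset (ETI) check.

    A transaction supports the itemset if it covers at least a
    (1 - eps) fraction of its items; the itemset qualifies when at
    least kappa * N transactions are supporting. This is a sketch
    of the idea, not the paper's precise definition.
    """
    n = len(transactions)
    need = (1.0 - eps) * len(itemset)
    supporting = [t for t in transactions if len(itemset & t) >= need]
    return len(supporting) >= kappa * n

# Toy data: four of five shoppers buy "most" of {P1..P4}.
db = [{"P1", "P2", "P3", "P4"},
      {"P1", "P2", "P3"},   # misses P4 but still covers 75% of the itemset
      {"P1", "P2", "P4"},
      {"P2", "P3", "P4"},
      {"P9"}]
eti = frozenset({"P1", "P2", "P3", "P4"})

print(is_error_tolerant(eti, db, eps=0.25, kappa=0.5))  # -> True
print(is_error_tolerant(eti, db, eps=0.0,  kappa=0.5))  # -> False (exact match only)
```

With `eps=0.0` the check degenerates to the classical frequent-itemset condition, which is why only one of the five toy transactions supports the itemset in the second call.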

### Citations

9054 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977 |

4179 | Pattern Classification and Scene Analysis - Duda, Hart - 1973 |

2900 | Fast algorithms for mining association rules - Agrawal, Srikant - 1994 |

2641 | Mining association rules between sets of items in large databases - Agrawal, Imielinski, et al. - 1993 |

1248 | GroupLens: An open architecture for collaborative filtering of netnews - Resnick, Iacovou, et al. - 1994 |

1177 | Empirical analysis of Predictive Algorithms for Collaborative Filtering - Breese, Heckerman, et al. - 1998 |

Citation context: "... likely appear with the given set based on the values of Pr(x_i | C). Intuitively, an item i with a large corresponding value of Pr(x_i | C) would be likely to occur along with the given itemset. See [BHK98] for a detailed description. Results are given in Section 5.2.4. 5. RESULTS 5.1 Synthetic Data Experiments We generated 70 different synthetic datasets while controlling the following parameters: tota..."

1160 | Recommender Systems - Resnick, Varian - 1997 |

1086 | Human Behavior and the Principle of Least Effort - Zipf - 1949 |

703 | Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning - Fayyad, Irani - 1993 |

Citation context: "... to find error-tolerant itemsets over transactional databases with categorical-valued attributes (more than 2 values). Continuous-valued attributes may be preprocessed with a discretization algorithm [FI93]. We discuss generalizations in Section 6, but this paper focuses on the binary case. Note that the definition does admit degenerate cases that need to be handled. Degenerate case: Table 2 illustrate ..."

628 | Efficient and effective clustering methods for spatial data mining - Ng, Han - 1994 |

600 | Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications - Agrawal, Gehrke, et al. - 1998 |

Citation context: "... points to define a cluster. They also used a sampling scheme to reduce I/O costs. Recognizing that most clusters are defined on subspaces rather than the entire high-dimensional space, Agrawal et al. [AGGR98] presented a method to build subspace clusters in a bottom-up way, using the property that if a collection of points form a cluster in a k-dimensional space, they must also form a cluster in all of it..."

562 | Concrete Mathematics - Graham, Knuth, et al. - 1994 |

Citation context: "... assumptions over market-basket type data quickly shows that this probability is very small. For example, for p = 0.15, ε = 0.2, κ = 0.01, N = 1,000,000, D = 500 and r = 5, using Stirling's approximation [GKP89] for the combinatorial terms one obtains that the probability of finding an ETI with 5 items by chance is approximately 10^-9300 (for r = 10 items this probability drops down to 10^-43,000). Hence the ..."
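The citation context above appeals to Stirling's approximation because the combinatorial terms involved are astronomically large. As a hedged aside (the exact probability formula is not reproduced here), the standard way to evaluate such terms without overflow is through the log-gamma function, which agrees with Stirling's approximation:

```python
import math

def log10_binomial(n: int, k: int) -> float:
    """log10 of C(n, k) computed via log-gamma, so huge n never overflows.

    Uses C(n, k) = Gamma(n+1) / (Gamma(k+1) * Gamma(n-k+1)) in log space,
    which is what Stirling-style approximations estimate analytically.
    """
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(10)

# Sanity check on a small case: C(10, 3) = 120, log10(120) ~ 2.079.
print(log10_binomial(10, 3))

# C(1_000_000, 5000) overflows as an exact float, but its log10 is easy;
# the result has on the order of 10^4 decimal digits.
print(round(log10_binomial(1_000_000, 5000)))
```

This is why probabilities like 10^-9300 in the snippet can only be stated via logarithmic (Stirling-type) manipulation rather than direct evaluation.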

515 | Bayesian Classification (AutoClass): Theory and Results - Cheeseman, Stutz - 1996 |

476 | BIRCH: an efficient data clustering method for very large databases - Zhang, Ramakrishnan, et al. - 1996 |

368 | Mining quantitative association rules in large relational tables - Srikant, Agrawal - 1996 |

Citation context: "... mechanism for counting frequent combinations of attribute values, the algorithm carries through and can be applied to finding clusters in categorical data. This is similar to the algorithm presented in [SA96]. 7. RELATED WORK Frequent itemsets were first developed by Agrawal et al. in the apriori algorithm for association rule mining [AIS93, AS94, AMSTV96]. The key optimization in finding frequent itemset..."

356 | A Robust Clustering Algorithm for Categorical Attributes - Guha, Rastogi, et al. - 1999 |

Citation context: "... initialization method for existing cluster refinement algorithms such as EM. Discrete clustering algorithms, as opposed to generalized frequent itemsets, include CACTUS [GGR99], STIRR [GKR98], and ROCK [GRS99]. The first two require the computation of a similarity matrix between all attributes (items), which takes O(d^2) time (d = number of attributes). CACTUS uses a more efficient refinement method on th..."

341 | New Algorithms for Fast Discovery of Association Rules - Zaki, Parthasarathy, et al. - 1997 |

Citation context: "... developed in this paper. One problem that arises in a-priori is that the algorithm scales exponentially with longest pattern length. Many variants have been proposed to address this issue. Zaki et al. [ZPOL97] developed the algorithms MaxEclat and MaxClique which 'look ahead' during initialization so that long frequent itemsets are identified early. Bayardo [B98] presented an optimized search method called..."

310 | Fast discovery of association rules - Agrawal, Mannila, et al. - 1996 |

Citation context: "... cluster of customers who purchase 'most' of the products {P1, ..., P5}. 1.1 Problem Statement The intuition is made specific with the following definition and problem statement. Adopting the notation of [AMSTV96], let I = {i_1, i_2, ..., i_d} be the full set of items over a database D consisting of transactions T, where each transaction is a subset of the full itemset I. Each transaction T may be viewed as reco..."

148 | Clustering Categorical Data: An Approach Based on Dynamical Systems - Gibson, Kleinberg, et al. - 1998 |

Citation context: "... as an effective initialization method for existing cluster refinement algorithms such as EM. Discrete clustering algorithms, as opposed to generalized frequent itemsets, include CACTUS [GGR99], STIRR [GKR98], and ROCK [GRS99]. The first two require the computation of a similarity matrix between all attributes (items), which takes O(d^2) time (d = number of attributes). CACTUS uses a more efficient refin..."

95 | CACTUS: Clustering Categorical Data Using Summaries - Ganti, Gehrke, et al. - 1999 |

Citation context: "... can also be used as an effective initialization method for existing cluster refinement algorithms such as EM. Discrete clustering algorithms, as opposed to generalized frequent itemsets, include CACTUS [GGR99], STIRR [GKR98], and ROCK [GRS99]. The first two require the computation of a similarity matrix between all attributes (items), which takes O(d^2) time (d = number of attributes). CACTUS uses a more ..."

84 | An experimental comparison of several clustering and initialization methods - Meila, Heckerman - 1998 |

Citation context: "... a statistically meaningful model, the quality of the solution it produces is determined completely by the initial model. The state of the art of initializing EM over {0,1} data is via random restarts [MH98]. When applied to high-dimensional data, EM suffers from two problems in particular: clusters often go empty (no data records assigned to them); and different clusters converge on the same distributio..."

64 | Efficiently Mining Long Patterns from Databases - Bayardo - 1998 |

Citation context: "... posed to address this issue. Zaki et al. [ZPOL97] developed the algorithms MaxEclat and MaxClique which 'look ahead' during initialization so that long frequent itemsets are identified early. Bayardo [B98] presented an optimized search method called Max-Miner that prunes out subsets of long patterns of frequent itemsets that are discovered early. Gunopulos et al. [GMS97] suggested iteratively extending..."

59 | Compressed data cubes for OLAP aggregate query approximation on continuous dimensions - Shanmugasundaram, Fayyad, et al. - 1999 |

Citation context: "... utility of the mixture model statistical summary of a given database has been demonstrated in data mining tasks such as speeding up nearest neighbor queries [BFG99] and approximating OLAP aggregate queries [SFB99]. The mixture models used in query selectivity estimation in [SFB99] were generated over continuous-valued data only. The query selectivity task addressed here is focused on sparse {0,1}-valued data. ..."

56 | Discovering All Most Specific Sentences by Randomized Algorithms - Gunopulos, Mannila, et al. - 1997 |

Citation context: "... sets are identified early. Bayardo [B98] presented an optimized search method called Max-Miner that prunes out subsets of long patterns of frequent itemsets that are discovered early. Gunopulos et al. [GMS97] suggested iteratively extending a working pattern until failure, using a randomized algorithm, which is similar to the idea we used in our algorithm to grow itemsets in a greedy fashion. Much work ha..."

45 | Scaling EM (Expectation-Maximization) Clustering to Large Databases - Bradley, Fayyad, et al. - 1998 |

Citation context: "... retailer. 5.2.2 Number of Clusters For a given minimum support value κ, SGGA was applied to the database. Upon termination, the ETIs constructed were used to initialize the binomial cluster model. EM [BFR98] was then applied to the database to refine the given initial model. [figure: final number of clusters vs. minimum support] ..."

38 | Experiences with GroupLens: Making Usenet useful again - Miller, Riedl, et al. - 1997 |

33 | Density-Based Indexing for Approximate Nearest-Neighbor Queries - Bennett, Fayyad, et al. - 1999 |

Citation context: "... | C)) if item i does not appear in x. The utility of the mixture model statistical summary of a given database has been demonstrated in data mining tasks such as speeding up nearest neighbor queries [BFG99] and approximating OLAP aggregate queries [SFB99]. The mixture models used in query selectivity estimation in [SFB99] were generated over continuous-valued data only. The query selectivity task addres..."

9 | CURE: An efficient algorithm for clustering large databases - Guha, Rastogi, et al. - 1998 |

Citation context: "... mensional space [CS96, DLR77, NH94, ZRL96], and cluster membership is determined by some distance function to the centroids. This leads to cluster shapes similar to spheres. Later work by Guha et al. [GRS98] was able to handle arbitrarily shaped clusters by using several representative points to define a cluster. They also used a sampling scheme to reduce I/O costs. Recognizing that most clusters are defi..."