## Discovering significant patterns (2007)


Citations: 43 (3 self)

### BibTeX

@MISC{Webb07discoveringsignificant,
    author = {Geoffrey I. Webb},
    title = {Discovering significant patterns},
    year = {2007}
}

### Abstract

Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and, when applied to real-world data, result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.

### Citations

5184 | C4.5: Programs for Machine Learning - Quinlan - 1993 |
Citation Context: ... from the available data. The model learned is usually that expected to maximize accuracy or some other metric on unseen future data. Many systems that learn explicit models, such as decision tree (Quinlan, 1993) or decision rule (Michalski, 1983) learners, do so by searching a space of alternative models to select the model that appears to perform best with respect to the available data. Frequently during s...

1213 | Mining Sequential Patterns - Agrawal, Srikant - 1994 |
Citation Context: ...odels take the form of rules. However, the same underlying techniques may be applied to other forms of model of regularities in the data, such as itemsets (Agrawal et al., 1993), sequential patterns (Agrawal & Srikant, 1995) and sub-graphs (Kuramochi & Karypis, 2001). The best known example of pattern discovery is association rule discovery (Agrawal et al., 1993). 3 Problem Statement As outlined in the introduction, p...

683 | A Theory and Methodology of Inductive Learning - Michalski - 1983 |
Citation Context: ...el learned is usually that expected to maximize accuracy or some other metric on unseen future data. Many systems that learn explicit models, such as decision tree (Quinlan, 1993) or decision rule (Michalski, 1983) learners, do so by searching a space of alternative models to select the model that appears to perform best with respect to the available data. Frequently during such a search through the space of m...

680 | UCI repository of machine learning databases - Hettich, Merz - 1998 |
Citation Context: ...ques perform on real-world data. The same seven treatments were used as for Experiment 1. Experiments were conducted using eight of the largest attribute-value datasets from the UCI machine learning (Newman, Hettich, Blake, & Merz, 2006) and KDD (Hettich & Bay, 2006) repositories together with the BMS-WebView-1 (Zheng, Kohavi, & Mason, 2001) and Retail (Brijs, Swinnen, Vanhoof, & Wets, 1999) datasets. These datasets are described in...

645 | A simple sequentially rejective multiple test procedure - Holm - 1979 |
Citation Context: ...s the risk of any type-1 error at α when performing n hypothesis tests by using a critical value κ = α/n for each hypothesis test. This turns out to be needlessly strict, however. The Holm procedure (Holm, 1979) is more powerful than the Bonferroni adjustment while still guaranteeing that the risk of any type-1 error is no more than α. The procedure takes the p-values from the n hypothesis tests and orders ...
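
The excerpt above breaks off before the Holm procedure is fully described, so the following is only a minimal sketch of the standard textbook step-down formulation, not necessarily the exact wording used in the paper. The function names `bonferroni_reject` and `holm_reject` are hypothetical and introduced purely for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject every hypothesis whose p-value is at most kappa = alpha / n."""
    kappa = alpha / len(p_values)
    return [p <= kappa for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Standard Holm step-down procedure (assumed formulation).

    Sort the p-values in ascending order and compare the i-th smallest
    (0-based rank i) against alpha / (n - i). Stop at the first p-value
    that fails; it and all larger p-values are not rejected.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (n - rank):
            reject[idx] = True
        else:
            break  # every remaining (larger) p-value is retained
    return reject


if __name__ == "__main__":
    p = [0.001, 0.015, 0.02, 0.04]
    print(bonferroni_reject(p))  # only the smallest p-value passes alpha/n
    print(holm_reject(p))        # the relaxed thresholds admit more rejections
```

On this illustrative input the Bonferroni threshold α/n = 0.0125 admits only the first test, while Holm's progressively relaxed thresholds admit all four, which matches the excerpt's claim that Holm is the more powerful of the two.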

506 | Beyond market baskets: generalizing association rules to correlations - Brin, Motwani, et al. - 1997 |

495 | The control of the false discovery rate in multiple testing under dependency - Benjamini, Yekutieli - 2001 |
Citation Context: ...iscovery rate (Benjamini & Hochberg, 1995) instead of the experimentwise error rate. If this is to be done, a technique that accommodates correlations between the hypothesis tests should be employed (Benjamini & Yekutieli, 2001). An advantage of the direct-adjustment approach is that it supports k-optimal pattern discovery (Webb, 1995; Scheffer & Wrobel, 2002; Webb & Zhang, 2005). Rather than seeking all patterns that satis...

321 | Frequent Subgraph Discovery - Kuramochi, Karypis - 2001 |
Citation Context: ...same underlying techniques may be applied to other forms of model of regularities in the data, such as itemsets (Agrawal et al., 1993), sequential patterns (Agrawal & Srikant, 1995) and sub-graphs (Kuramochi & Karypis, 2001). The best known example of pattern discovery is association rule discovery (Agrawal et al., 1993). 3 Problem Statement As outlined in the introduction, pattern discovery seeks to identify patterns ρ...

254 | Efficient Mining of Emerging Patterns: Discovering Trends and Differences - Dong, Li - 1999 |
Citation Context: ...iscovery (Agrawal, Imielinski, & Swami, 1993), k-optimal or k-top rule discovery (Webb, 1995; Scheffer & Wrobel, 2002; Webb & Zhang, 2005), contrast or emerging pattern discovery (Bay & Pazzani, 2001; Dong & Li, 1999), subgroup discovery (Klösgen, 1996), interesting itemset discovery (Jaroszewicz & Simovici, 2004) and impact or quantitative rule discovery (Aumann & Lindell, 1999; Webb, 2001; Zhang, Padmanabhan, &...

196 | Generating non-redundant association rules - Zaki - 2000 |
Citation Context: ...ge (Piatetsky-Shapiro, 1991) and improvement (Bayardo, Agrawal, & Gunopulos, 2000). They also include the identification and rejection of redundant (Bastide, Pasquier, Taouil, Stumme, & Lakhal, 2000; Zaki, 2000) and derivable (Calders & Goethals, 2002) rules. Redundant rules are those such as {pregnant,female} → oedema that include items in the antecedent that are entailed by the other elements of the antec...

174 | Discovery, Analysis and Presentation of Strong Rules - Piatetsky-Shapiro - 1991 |
Citation Context: ...as a minimum support (Agrawal et al., 1993), together with other constraints, if desired, such as minimum confidence (Agrawal et al., 1993), lift (International Business Machines, 1996), or leverage (Piatetsky-Shapiro, 1991). These terms are defined with respect to a rule X → y and dataset D as follows: • coverage(X → y) is |{R ∈ D : X ⊆ R}|; • support(X → y) is |{R ∈ D : X ∪ {y} ⊆ R}|; • confidence(X → y) = support(X →...
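
The coverage and support definitions quoted above translate directly into counts over a dataset of records. A minimal sketch follows; the excerpt truncates the confidence definition, so the code assumes the conventional ratio support / coverage, and all function names and the toy dataset are hypothetical.

```python
def coverage(antecedent, dataset):
    """|{R in D : X is a subset of R}|: records containing every antecedent item."""
    x = set(antecedent)
    return sum(1 for record in dataset if x <= set(record))


def support(antecedent, consequent, dataset):
    """|{R in D : X union {y} is a subset of R}|: records with antecedent and consequent."""
    xy = set(antecedent) | {consequent}
    return sum(1 for record in dataset if xy <= set(record))


def confidence(antecedent, consequent, dataset):
    """Assumed conventional definition: support(X -> y) / coverage(X -> y)."""
    cov = coverage(antecedent, dataset)
    return support(antecedent, consequent, dataset) / cov if cov else 0.0


if __name__ == "__main__":
    # Toy dataset of four records, purely illustrative.
    data = [{"a", "b", "y"}, {"a", "b"}, {"a", "y"}, {"b"}]
    print(coverage({"a"}, data))         # records containing "a"
    print(support({"a"}, "y", data))     # records containing both "a" and "y"
    print(confidence({"a"}, "y", data))  # ratio of the two counts
```

Note that both measures are raw counts here, as in the quoted definitions, rather than proportions of the dataset.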

151 | Constraint-based rule mining in large, dense databases - Bayardo, Agrawal, et al. - 2000 |
Citation Context: ...nd discard association rules that are unlikely to be of interest. These include constraints on minimum lift (International Business Machines, 1996) leverage (Piatetsky-Shapiro, 1991) and improvement (Bayardo, Agrawal, & Gunopulos, 2000). They also include the identification and rejection of redundant (Bastide, Pasquier, Taouil, Stumme, & Lakhal, 2000; Zaki, 2000) and derivable (Calders & Goethals, 2002) rules. Redundant rules are t...

125 | Explora: a multipattern and multistrategy discovery assistant - Klösgen - 1996 |
Citation Context: ..., 1993), k-optimal or k-top rule discovery (Webb, 1995; Scheffer & Wrobel, 2002; Webb & Zhang, 2005), contrast or emerging pattern discovery (Bay & Pazzani, 2001; Dong & Li, 1999), subgroup discovery (Klösgen, 1996), interesting itemset discovery (Jaroszewicz & Simovici, 2004) and impact or quantitative rule discovery (Aumann & Lindell, 1999; Webb, 2001; Zhang, Padmanabhan, & Tuzhilin, 2004). All these techniqu...

124 | Mining association between sets of items in massive database - Agrawal, Imielinski, et al. - 1993 |
Citation Context: ...o screen individual patterns in the context of discovering large numbers of patterns from a single set of data. This problem arises in pattern discovery, as exemplified by association rule discovery (Agrawal, Imielinski, & Swami, 1993), k-optimal or k-top rule discovery (Webb, 1995; Scheffer & Wrobel, 2002; Webb & Zhang, 2005), contrast or emerging pattern discovery (Bay & Pazzani, 2001; Dong & Li, 1999), subgroup discovery (Klösge...

124 | World Performance of Association Rule Algorithms,” Proc. Knowledge Discovery and Data Mining (KDD - Zheng, Kohavi, et al. - 2001 |
Citation Context: ...ed using eight of the largest attribute-value datasets from the UCI machine learning (Newman, Hettich, Blake, & Merz, 2006) and KDD (Hettich & Bay, 2006) repositories together with the BMS-WebView-1 (Zheng, Kohavi, & Mason, 2001) and Retail (Brijs, Swinnen, Vanhoof, & Wets, 1999) datasets. These datasets are described in Table 2. We first found for each dataset the minimum even value for minimum-support that produced fewer t...

122 | Pruning and summarizing the discovered associations - Liu, Hsu, et al. - 1999 |

110 | Mining all non-derivable frequent itemsets - Calders, Goethals - 2002 |
Citation Context: ...) and improvement (Bayardo, Agrawal, & Gunopulos, 2000). They also include the identification and rejection of redundant (Bastide, Pasquier, Taouil, Stumme, & Lakhal, 2000; Zaki, 2000) and derivable (Calders & Goethals, 2002) rules. Redundant rules are those such as {pregnant,female} → oedema that include items in the antecedent that are entailed by the other elements of the antecedent, as is the case with pregnant entai...

102 | Types of costs in inductive concept learning - Turney - 2000 |

100 | A Survey of Exact Inference for Contingency Tables - Agresti - 1992 |

87 | Oversearching and layered search in empirical learning - Quinlan, Cameron-Jones - 1995 |
Citation Context: ..., which, for even relatively small values of κ and n, rapidly approaches 1.0. This problem is closely related to the multiple comparisons problem (Jensen & Cohen, 2000) and the problem of oversearch (Quinlan & Cameron-Jones, 1995). Most pattern discovery systems do little to control this risk. While a number of approaches to controlling this risk have been developed (for example, Megiddo & Srikant, 1998; Bay & Pazzani, 2001; ...

86 | A statistical theory for quantitative association rules - Aumann, Lindell - 1999 |
Citation Context: ...g pattern discovery (Bay & Pazzani, 2001; Dong & Li, 1999), subgroup discovery (Klösgen, 1996), interesting itemset discovery (Jaroszewicz & Simovici, 2004) and impact or quantitative rule discovery (Aumann & Lindell, 1999; Webb, 2001; Zhang, Padmanabhan, & Tuzhilin, 2004). All these techniques search large spaces of possible patterns P and return all patterns that satisfy user-defined constraints. Such a process entai...

86 | Mining minimal non-redundant association rules using frequent closed itemsets - Bastide, Pasquier, et al. - 2000 |
Citation Context: ...lift (International Business Machines, 1996) leverage (Piatetsky-Shapiro, 1991) and improvement (Bayardo, Agrawal, & Gunopulos, 2000). They also include the identification and rejection of redundant (Bastide, Pasquier, Taouil, Stumme, & Lakhal, 2000; Zaki, 2000) and derivable (Calders & Goethals, 2002) rules. Redundant rules are those such as {pregnant,female} → oedema that include items in the antecedent that are entailed by the other elements ...

79 | Detecting Group Differences: Mining Contrast Sets - Bay, Pazzani - 2001 |
Citation Context: ...by association rule discovery (Agrawal, Imielinski, & Swami, 1993), k-optimal or k-top rule discovery (Webb, 1995; Scheffer & Wrobel, 2002; Webb & Zhang, 2005), contrast or emerging pattern discovery (Bay & Pazzani, 2001; Dong & Li, 1999), subgroup discovery (Klösgen, 1996), interesting itemset discovery (Jaroszewicz & Simovici, 2004) and impact or quantitative rule discovery (Aumann & Lindell, 1999; Webb, 2001; Zhan...

77 | OPUS: An Efficient Admissible Algorithm for Unordered Search - Webb - 1995 |
Citation Context: ...s of patterns from a single set of data. This problem arises in pattern discovery, as exemplified by association rule discovery (Agrawal, Imielinski, & Swami, 1993), k-optimal or k-top rule discovery (Webb, 1995; Scheffer & Wrobel, 2002; Webb & Zhang, 2005), contrast or emerging pattern discovery (Bay & Pazzani, 2001; Dong & Li, 1999), subgroup discovery (Klösgen, 1996), interesting itemset discovery (Jarosz...

75 | Multiple comparisons in induction algorithms. Machine Learning 38(3):309–338 - Jensen, Cohen - 2000 |
Citation Context: ...ast one erroneous pattern by chance is 1 − (1 − κ)^n, which, for even relatively small values of κ and n, rapidly approaches 1.0. This problem is closely related to the multiple comparisons problem (Jensen & Cohen, 2000) and the problem of oversearch (Quinlan & Cameron-Jones, 1995). Most pattern discovery systems do little to control this risk. While a number of approaches to controlling this risk have been develope...
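
The expression 1 − (1 − κ)^n quoted above can be evaluated directly to see how quickly the risk grows with the number of tests. This is a small sketch; the function name `familywise_risk` is hypothetical, and the formula itself treats the n tests as independent.

```python
def familywise_risk(kappa, n):
    """Probability of at least one type-1 error across n independent tests,
    each performed at critical value kappa: 1 - (1 - kappa)^n."""
    return 1.0 - (1.0 - kappa) ** n


if __name__ == "__main__":
    # Illustrative values only: kappa = 0.05 with a growing number of tests.
    for n in (1, 10, 100, 1000):
        print(n, round(familywise_risk(0.05, n), 4))
```

With κ = 0.05, a single test carries a 5% risk, but by n = 100 the chance of at least one spurious pattern already exceeds 99%.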

71 | Using association rules for product assortment decisions: a case study - Brijs, Swinnen, et al. - 1999 |
Citation Context: ...lue datasets from the UCI machine learning (Newman, Hettich, Blake, & Merz, 2006) and KDD (Hettich & Bay, 2006) repositories together with the BMS-WebView-1 (Zheng, Kohavi, & Mason, 2001) and Retail (Brijs, Swinnen, Vanhoof, & Wets, 1999) datasets. These datasets are described in Table 2. We first found for each dataset the minimum even value for minimum-support that produced fewer than 10,000 productive rules when applied with respe...

61 | Empirical Bayes screening for multi-item associations - DuMouchel, Pregibon - 2001 |

58 | Multiple hypothesis testing - Shaffer - 1995 |
Citation Context: ...s limitations as detailed in Section 4. The current paper investigates two approaches to applying statistical tests in pattern discovery. The first applies a Bonferroni correction for multiple tests (Shaffer, 1995), dividing the global significance level α by the number of patterns in the search space in order to obtain the critical value κ. The second divides the available data into exploratory and holdout se...

39 | Discovering predictive association rules - Megiddo, Srikant - 1998 |
Citation Context: ...m of oversearch (Quinlan & Cameron-Jones, 1995). Most pattern discovery systems do little to control this risk. While a number of approaches to controlling this risk have been developed (for example, Megiddo & Srikant, 1998; Bay & Pazzani, 2001; Webb, 2002), each has limitations as detailed in Section 4. The current paper investigates two approaches to applying statistical tests in pattern discovery. The first applies a...

33 | Elementary statistics - Johnson - 1973 |

32 | Finding Association Rules that Trade Support Optimally against Confidence - Scheffer - 2001 |

32 | Discovering Association with Numeric Variables - Webb - 2001 |
Citation Context: ... & Pazzani, 2001; Dong & Li, 1999), subgroup discovery (Klösgen, 1996), interesting itemset discovery (Jaroszewicz & Simovici, 2004) and impact or quantitative rule discovery (Aumann & Lindell, 1999; Webb, 2001; Zhang, Padmanabhan, & Tuzhilin, 2004). All these techniques search large spaces of possible patterns P and return all patterns that satisfy user-defined constraints. Such a process entails evaluatin...

30 | Interestingness of frequent itemsets using bayesian networks as background knowledge - Jaroszewicz, Simovici - 2004 |
Citation Context: ..., 1995; Scheffer & Wrobel, 2002; Webb & Zhang, 2005), contrast or emerging pattern discovery (Bay & Pazzani, 2001; Dong & Li, 1999), subgroup discovery (Klösgen, 1996), interesting itemset discovery (Jaroszewicz & Simovici, 2004) and impact or quantitative rule discovery (Aumann & Lindell, 1999; Webb, 2001; Zhang, Padmanabhan, & Tuzhilin, 2004). All these techniques search large spaces of possible patterns P and return all p...

25 | Finding the Most Interesting Patterns in a Database Quickly by Using Sequential Sampling - Scheffer, Wrobel |
Citation Context: ...s from a single set of data. This problem arises in pattern discovery, as exemplified by association rule discovery (Agrawal, Imielinski, & Swami, 1993), k-optimal or k-top rule discovery (Webb, 1995; Scheffer & Wrobel, 2002; Webb & Zhang, 2005), contrast or emerging pattern discovery (Bay & Pazzani, 2001; Dong & Li, 1999), subgroup discovery (Klösgen, 1996), interesting itemset discovery (Jaroszewicz & Simovici, 2004) a...

20 | Discovering Significant Rules - Webb - 2006 |

17 | K-optimal rule discovery - Webb, Zhang - 2005 |
Citation Context: ...ta. This problem arises in pattern discovery, as exemplified by association rule discovery (Agrawal, Imielinski, & Swami, 1993), k-optimal or k-top rule discovery (Webb, 1995; Scheffer & Wrobel, 2002; Webb & Zhang, 2005), contrast or emerging pattern discovery (Bay & Pazzani, 2001; Dong & Li, 1999), subgroup discovery (Klösgen, 1996), interesting itemset discovery (Jaroszewicz & Simovici, 2004) and impact or quantit...

17 | On the discovery of significant statistical quantitative rules - Zhang, Padmanabhan, et al. |
Citation Context: ...2001; Dong & Li, 1999), subgroup discovery (Klösgen, 1996), interesting itemset discovery (Jaroszewicz & Simovici, 2004) and impact or quantitative rule discovery (Aumann & Lindell, 1999; Webb, 2001; Zhang, Padmanabhan, & Tuzhilin, 2004). All these techniques search large spaces of possible patterns P and return all patterns that satisfy user-defined constraints. Such a process entails evaluating numerous patterns ρ ∈ P against a se...

15 | Controlling the false discovery rate: A new and powerful approach to multiple testing - Benjamini, Hochberg - 1995 |
Citation Context: ...is a promising issue for further investigation. It is worth noting that an advantage of the holdout approach relative to direct-adjustment is that it can support controls on the false discovery rate (Benjamini & Hochberg, 1995) instead of the experimentwise error rate. If this is to be done, a technique that accommodates correlations between the hypothesis tests should be employed (Benjamini & Yekutieli, 2001). An advantag...
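
For readers unfamiliar with false discovery rate control, the following is a minimal sketch of the standard Benjamini-Hochberg step-up procedure. It is not taken from the paper; the function name `benjamini_hochberg` and the parameter `q` are hypothetical. The plain procedure is valid under independence or positive dependence of the tests, which is why the excerpt recommends a dependency-aware variant (Benjamini & Yekutieli, 2001) when tests are correlated.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Standard Benjamini-Hochberg step-up procedure (assumed formulation).

    Find the largest rank k (1-based, ascending p-values) such that
    p_(k) <= k * q / m, then reject the k hypotheses with the smallest
    p-values. Returns a list of booleans aligned with the input order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff = rank  # keep the largest rank that satisfies the bound
    reject = [False] * m
    for idx in order[:cutoff]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    # Illustrative p-values, e.g. from holdout tests of candidate patterns.
    print(benjamini_hochberg([0.001, 0.025, 0.028, 0.039, 0.6]))
```

Because the procedure is step-up, a p-value that misses its own threshold (0.025 at rank 2 here) is still rejected when a larger p-value at a later rank meets its bound.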

4 | The UCI KDD archive. [http://kdd.ics. uci.edu - Hettich, Bay - 2004 |
Citation Context: ... treatments were used as for Experiment 1. Experiments were conducted using eight of the largest attribute-value datasets from the UCI machine learning (Newman, Hettich, Blake, & Merz, 2006) and KDD (Hettich & Bay, 2006) repositories together with the BMS-WebView-1 (Zheng, Kohavi, & Mason, 2001) and Retail (Brijs, Swinnen, Vanhoof, & Wets, 1999) datasets. These datasets are described in Table 2. We first found for e...

4 | Magnum Opus Version 1.3 - Webb - 2002 |
Citation Context: ...). Most pattern discovery systems do little to control this risk. While a number of approaches to controlling this risk have been developed (for example, Megiddo & Srikant, 1998; Bay & Pazzani, 2001; Webb, 2002), each has limitations as detailed in Section 4. The current paper investigates two approaches to applying statistical tests in pattern discovery. The first applies a Bonferroni correction for multip...

2 | Preliminary investigations into statistically valid exploratory rule discovery - Webb - 2003 |

1 | Magnum Opus Version 3.0.1 - Webb - 2005 |
Citation Context: ... conservative estimates of the true probability of a conjunction of conditions. Examples of this approach include DuMouchel and Pregibon’s (2001) Empirical Bayes Screening, Magnum Opus’s m-estimates (Webb, 2005) and Scheffer’s (1995) Bayesian Frequency Correction. These approaches can be very effective at reducing the overestimates of measures such as support or confidence that can occur for rules with low ...