Finding Subgroups having Several Descriptions: Algorithms for Redescription Mining
Given a 0-1 dataset, we consider the redescription mining task introduced by Ramakrishnan, Parida, and Zaki. The problem is to find subsets of the rows that can be (approximately) defined by at least two different Boolean formulae on the attributes. That is, we search for pairs (α, β) of Boolean formulae such that the implications α → β and β → α both hold with high accuracy. We require that the two descriptions α and β are syntactically sufficiently different. Such pairs of descriptions indicate that the subset has different definitions, a fact that gives useful information about the data. We give simple algorithms for this task, and evaluate their performance. The methods are based on pruning the search space of all possible pairs of formulae by different accuracy criteria. The significance of the findings is tested by using randomization methods. Experimental results on simulated and real data show that the methods work well: on simulated data they find the planted subsets, and on real data they produce small and understandable results.
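As a rough illustration of what it means for α → β and β → α to both hold with high accuracy, the sketch below scores a candidate pair by the Jaccard similarity of the row sets the two formulae select. The Jaccard measure, the toy rows, and the names `support`, `jaccard`, `alpha_formula`, and `beta_formula` are assumptions made for this example; the abstract itself only refers to "different accuracy criteria".

```python
def support(rows, formula):
    """Return the set of row indices on which the Boolean formula holds."""
    return {i for i, row in enumerate(rows) if formula(row)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy 0-1 dataset (hypothetical): each row maps attribute name -> 0/1.
rows = [
    {"a": 1, "b": 1, "c": 0},
    {"a": 1, "b": 0, "c": 0},
    {"a": 0, "b": 1, "c": 1},
    {"a": 0, "b": 0, "c": 1},
    {"a": 1, "b": 1, "c": 1},
]

# Two descriptions over disjoint attributes, so they differ syntactically.
alpha_formula = lambda r: r["a"] == 1
beta_formula = lambda r: r["c"] == 0

supp_alpha = support(rows, alpha_formula)  # rows 0, 1, 4
supp_beta = support(rows, beta_formula)    # rows 0, 1
score = jaccard(supp_alpha, supp_beta)     # 2 shared rows out of 3 in the union
```

A score of 1.0 would mean the two formulae describe exactly the same rows; a search procedure could prune pairs whose score falls below a chosen threshold.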
Filtered-top-k Association Discovery
Association mining has been one of the most intensively researched areas of data mining. However, direct uptake of the resulting technologies has been relatively low. This paper examines some of the reasons why the dominant paradigms in association mining have not lived up to their promise, and argues that a powerful alternative is provided by top-k techniques coupled with appropriate statistical and other filtering.
Rules for Contrast Sets
In this paper we present a technique to derive rules describing contrast sets. Contrast sets are a formalism to represent group differences. We propose a novel approach to describe directional contrasts using rules where the contrasting effect is partitioned into pairs of groups. Our approach makes use of a directional Fisher Exact Test to find significant differences across groups. We used a Bonferroni within-search adjustment to control type I errors and a pruning technique to prevent derivation of non-significant contrast set specializations.
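The combination of a directional (one-sided) Fisher exact test with a Bonferroni adjustment can be sketched as follows. This is a minimal illustration, not the paper's procedure: the 2x2 counts, the number of candidate tests `n_tests`, and the function name `fisher_one_sided` are hypothetical, and a plain Bonferroni division stands in for the within-search adjustment mentioned in the abstract.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a), where X is hypergeometric with the table's margins
    held fixed, i.e. the probability of seeing at least `a` in the top-left
    cell if the row and column variables were independent.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    upper = min(row1, col1)
    total = comb(n, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / total


# Hypothetical counts: rows are two groups, columns are pattern present/absent.
p_value = fisher_one_sided(3, 1, 1, 3)

# Plain Bonferroni adjustment over an assumed number of candidate tests.
alpha = 0.05
n_tests = 20
adjusted_alpha = alpha / n_tests
is_significant = p_value <= adjusted_alpha
```

With these counts the tail consists of the tables with 3 or 4 in the top-left cell, giving 17/70, which is far above the adjusted threshold.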
Technical Note: Layered Critical Values: A Powerful Direct-Adjustment Approach to Discovering Significant Patterns
, 2008
Standard pattern discovery techniques, such as association rules, suffer an extreme risk of finding very large numbers of spurious patterns for many knowledge discovery tasks. The direct-adjustment approach to controlling this risk applies a statistical test during the discovery process, using a critical value adjusted to take account of the size of the search space. However, a problem with the direct-adjustment strategy is that it may discard numerous true patterns. This paper investigates the assignment of different critical values to different areas of the search space as an approach to alleviating this problem, using a variant of a technique originally developed for other purposes. This approach is shown to be effective at increasing the number of discoveries while still maintaining strict control over the risk of false discoveries.
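To make the idea of assigning different critical values to different areas of the search space concrete, the sketch below splits the overall risk budget equally across pattern-size levels and then divides each share by the number of candidate patterns at that level. The equal split, the level counts, and the function name `layered_critical_values` are assumptions for illustration only; the paper's actual allocation scheme may differ.

```python
def layered_critical_values(alpha, patterns_per_level):
    """Assign one critical value per level of the search space.

    Each level receives an equal share of alpha, which is then divided by
    the number of candidate patterns at that level, so the per-level risks
    sum to alpha (a union-bound argument, as with Bonferroni).
    """
    n_levels = len(patterns_per_level)
    return {level: alpha / (n_levels * count)
            for level, count in patterns_per_level.items()}


alpha = 0.05
# Hypothetical number of candidate patterns with 1, 2 and 3 items.
patterns_per_level = {1: 10, 2: 45, 3: 120}

layered = layered_critical_values(alpha, patterns_per_level)

# Uniform direct adjustment for comparison: one critical value for all patterns.
uniform = alpha / sum(patterns_per_level.values())
```

In this toy setting the small patterns at level 1 are tested against a less strict critical value than under the uniform adjustment, while the more numerous patterns at level 3 face a stricter one, and the total risk bound is unchanged.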
What is Unequal among the Equals? Ranking Equivalent Rules from Gene Expression Data
Pattern-Based Classification: A Unifying Perspective
The use of patterns in predictive models is a topic that has received a lot of attention in recent years. Pattern mining can help to obtain models for structured domains, such as graphs and sequences, and has been proposed as a means to obtain more accurate and more interpretable models. Despite the large number of publications devoted to this topic, we believe that an overview of what has been accomplished in this area is missing. This paper presents our perspective on this evolving area. We identify the principles of pattern mining that are important when mining patterns for models and provide an overview of pattern-based classification methods. We categorize these methods along the following dimensions: (1) whether they post-process a pre-computed set of patterns or iteratively execute pattern mining algorithms; (2) whether they select patterns model-independently or whether the pattern selection is guided by a model. We summarize the results that have been obtained for each of these methods.
THE SMALLEST SET OF CONSTRAINTS THAT EXPLAINS THE DATA: A RANDOMIZATION APPROACH
Aalto University School of Science and Technology, Faculty of Information and Natural Sciences, Department of Information and Computer Science. Distribution:
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL., NO., 1 Cascading
, 2011
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
Time Series Motifs Statistical Significance
Time series motif discovery is the task of extracting previously unknown recurrent patterns from time series data. It is an important problem in applications that range from finance to health. Many algorithms have been proposed for the task of efficiently finding motifs. Surprisingly, most of these proposals do not focus on how to evaluate the discovered motifs, which are typically evaluated by human experts. This is infeasible even for moderately sized datasets, since the number of discovered motifs tends to be prohibitively large. Statistical significance tests are widely used in the bioinformatics and association rule mining communities to evaluate the extracted patterns. In this work we present an approach to calculate the statistical significance of time series motifs. Our proposal leverages work from the bioinformatics community by using a symbolic definition of time series motifs to derive each motif's p-value. We estimate the expected frequency of a motif by using Markov chain models. The p-value is then assessed by comparing the actual frequency to the estimated one using statistical hypothesis tests. Our contribution enables the application of a powerful technique, statistical tests, to a time series setting. This provides researchers and practitioners with an important tool to automatically evaluate the degree of relevance of each extracted motif.
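The pipeline described above (estimate a motif's expected frequency with a Markov chain, then compare it with the observed frequency) might look roughly like the following for a series that has already been converted to symbols. The first-order chain, the binomial tail used as the hypothesis test, the toy series, and the function name `motif_p_value` are all assumptions for this sketch; the binomial test also ignores the dependence between overlapping windows.

```python
from collections import Counter
from math import comb


def motif_p_value(series, motif):
    """Return (observed count, expected count, p-value) for a symbolic motif.

    The motif probability is estimated from a first-order Markov chain fitted
    to the series itself; the p-value is the binomial upper tail
    P(count >= observed) over all windows of the motif's length.
    """
    n, m = len(series), len(motif)
    unigrams = Counter(series)
    bigrams = Counter(zip(series, series[1:]))
    outgoing = Counter(series[:-1])  # how often each symbol has a successor

    # P(motif) = P(first symbol) * product of transition probabilities.
    prob = unigrams[motif[0]] / n
    for x, y in zip(motif, motif[1:]):
        prob *= bigrams[(x, y)] / outgoing[x] if outgoing[x] else 0.0

    positions = n - m + 1
    observed = sum(series[i:i + m] == motif for i in range(positions))
    expected = positions * prob
    p_value = sum(comb(positions, k) * prob ** k * (1 - prob) ** (positions - k)
                  for k in range(observed, positions + 1))
    return observed, expected, p_value


# Hypothetical symbolic series, e.g. the output of a discretization step.
observed, expected, p_value = motif_p_value("abcabcabcabd", "abc")
```

Here the motif occurs about as often as the chain predicts, so the p-value is large and the motif would not be reported as significant.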
Controlling False Positives in Association Rule Mining
Association rule mining is an important problem in the data mining area. It enumerates and tests a large number of rules on a dataset and outputs rules that satisfy user-specified constraints. Because so many rules are tested, rules that do not represent a real systematic effect in the data can satisfy the given constraints purely by chance. Hence association rule mining often suffers from a high risk of false positive errors. No comprehensive study exists on controlling false positives in association rule mining. In this paper, we adopt three multiple testing correction approaches (the direct adjustment approach, the permutation-based approach and the holdout approach) to control false positives in association rule mining, and conduct extensive experiments to study their performance. Our results show that (1) numerous spurious rules are generated if no correction is made, and (2) the three approaches can control false positives effectively. Among the three approaches, the permutation-based approach has the highest power of detecting real association rules, but it is very computationally expensive. We employ several techniques to reduce its cost effectively.
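A permutation-based correction of the kind mentioned above can be sketched as follows: class labels are shuffled many times, the best rule statistic is recorded for each shuffle, and a rule on the real data is accepted only if it beats an upper quantile of those maxima. The co-occurrence count used as the rule statistic, the toy data, and the function name `max_stat_threshold` are assumptions for illustration; a real implementation would more likely compare p-values and would need the cost-reduction techniques the abstract alludes to.

```python
import random


def max_stat_threshold(items, labels, alpha=0.05, n_perm=200, seed=0):
    """Approximate (1 - alpha) quantile of the max rule statistic under shuffling.

    items:  dict mapping item name -> list of 0/1 flags, one per row
    labels: list of 0/1 class labels, one per row
    The statistic of rule "item -> class 1" is the number of rows where both
    the item and the class label are 1.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real item/label association
        maxima.append(max(
            sum(x and y for x, y in zip(column, shuffled))
            for column in items.values()
        ))
    maxima.sort()
    index = min(len(maxima) - 1, int((1 - alpha) * len(maxima)))
    return maxima[index]


# Hypothetical data: item "x" matches the labels exactly, item "y" does not.
items = {
    "x": [1, 1, 1, 1, 0, 0, 0, 0],
    "y": [1, 0, 1, 0, 1, 0, 1, 0],
}
labels = [1, 1, 1, 1, 0, 0, 0, 0]

threshold = max_stat_threshold(items, labels)
observed = {name: sum(x and y for x, y in zip(column, labels))
            for name, column in items.items()}
# A rule would be reported only if its observed count exceeds the threshold.
reported = [name for name, count in observed.items() if count > threshold]
```

Because the threshold is derived from the maximum over all rules in each shuffle, it accounts for the number of rules tested without assuming they are independent, which is one reason such approaches can retain more power than a direct adjustment.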

