Results 1–10 of 11
Tell me what I need to know: Succinctly summarizing data with itemsets
In Proc. KDD, 2011
"... Data analysis is an inherently iterative process. That is, what we know about the data greatly determines our expectations, and hence, what result we would find the most interesting. With this in mind, we introduce a wellfounded approach for succinctly summarizing data with a collection of itemsets ..."
Abstract

Cited by 14 (3 self)
Data analysis is an inherently iterative process. That is, what we know about the data greatly determines our expectations, and hence, what result we would find the most interesting. With this in mind, we introduce a well-founded approach for succinctly summarizing data with a collection of itemsets; using a probabilistic maximum entropy model, we iteratively find the most interesting itemset, and in turn update our model of the data accordingly. As we only include itemsets that are surprising with regard to the current model, the summary is guaranteed to be both descriptive and non-redundant. The algorithm that we present can either mine the top-k most interesting itemsets, or use the Bayesian Information Criterion to automatically identify the model containing only the itemsets most important for describing the data. Or, in other words, it will ‘tell you what you need to know’. Experiments on synthetic and benchmark data show that the discovered summaries are succinct, and correctly identify the key patterns in the data. The models they form attain high likelihoods, and inspection shows that they summarize the data well with increasingly specific, yet non-redundant itemsets.
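The loop this abstract describes — score candidate itemsets by how surprising they are under the current model, add the winner, update the model — can be sketched roughly as follows. This is a minimal illustration under our own assumptions: `summarize`, `support`, and `predict` are names we chose, and the independence-based `predict` is a crude stand-in for the paper's actual maximum-entropy model.

```python
# Rough sketch of the iterative summarization loop described above.
# Assumptions: all names are ours, and `predict` uses a naive
# independence estimate in place of the paper's true MaxEnt model.

def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def predict(transactions, itemset):
    """Naive independence prediction (stand-in for MaxEnt inference)."""
    p = 1.0
    for item in itemset:
        p *= support(transactions, frozenset([item]))
    return p

def summarize(transactions, candidates, k):
    """Greedily pick up to k itemsets that are most surprising so far."""
    chosen = []
    for _ in range(k):
        best, best_score = None, 0.0
        for itemset in candidates:
            if itemset in chosen:
                continue
            # Surprise proxy: gap between observed frequency and prediction.
            score = abs(support(transactions, itemset)
                        - predict(transactions, itemset))
            if score > best_score:
                best, best_score = itemset, score
        if best is None:
            break
        chosen.append(best)  # in the paper, this step also updates the model
    return chosen
```

On a few toy transactions, `summarize` first flags the itemset whose observed frequency deviates most from its independence prediction; the real algorithm would then refit the MaxEnt model before rescoring.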
Discovery of error-tolerant biclusters from noisy gene expression data
In Bioinformatics, 2011
"... An important analysis performed on microarray geneexpression data is to discover biclusters, which denote groups of genes that are coherently expressed for a subset of conditions. Various biclustering algorithms have been proposed to find different types of biclusters from these realvalued geneex ..."
Abstract

Cited by 5 (0 self)
An important analysis performed on microarray gene-expression data is to discover biclusters, which denote groups of genes that are coherently expressed for a subset of conditions. Various biclustering algorithms have been proposed to find different types of biclusters from these real-valued gene-expression data sets. However, these algorithms suffer from several limitations, such as the inability to explicitly handle errors/noise in the data; difficulty in discovering small biclusters due to their top-down approach; and the inability of some of the approaches to find overlapping biclusters, which is crucial as many genes participate in multiple biological processes. Association pattern mining also produces biclusters as its result and can naturally address some of these limitations. However, traditional association mining only finds exact biclusters, which …
A novel error-tolerant frequent itemset model for binary and real-valued data
2009
"... Frequent pattern mining has been successfully applied to a broad range of applications, however, it has two major drawbacks, which limits its applicability to several domains. First, as the traditional ‘exact ’ model of frequent pattern mining uses a strict definition of support, it limits the recov ..."
Abstract

Cited by 3 (3 self)
Frequent pattern mining has been successfully applied to a broad range of applications; however, it has two major drawbacks, which limit its applicability to several domains. First, as the traditional 'exact' model of frequent pattern mining uses a strict definition of support, it limits the recovery of frequent itemset patterns in real-life data sets where the patterns may be fragmented due to random noise/errors. Second, as traditional frequent pattern mining algorithms work with only binary or boolean attributes, they require transformation of real-valued attributes to binary attributes, which often results in loss of information. As many real-life data sets are both noisy and real-valued in nature, past approaches have tried to independently address these issues, and there is no systematic approach that addresses both of them together. In this paper, we propose a novel Error-Tolerant Frequent Itemset (ETFI) model for binary as well as real-valued data. We also propose a bottom-up pattern mining algorithm to sequentially discover all ETFIs from both types of data sets. To illustrate the efficacy of our proposed ETFI approach, we use two real-valued S. cerevisiae microarray gene-expression data sets and evaluate the patterns obtained in terms of their functional coherence, as evaluated using GO-based functional enrichment analysis. Our results clearly demonstrate the importance of directly accounting for errors/noise in the data. Finally, the statistical significance of the discovered ETFIs, as estimated using two randomization tests, reveals that the discovered ETFIs are indeed biologically meaningful and are neither obtained by random chance nor capture random structure in the data. The source codes as well as data sets used in this study are made available at the following website:
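The core relaxation behind error-tolerant itemsets — counting a transaction as supporting a pattern even when a bounded fraction of the pattern's items is missing — can be made concrete as follows. This is our own minimal sketch, not the ETFI paper's definition or code; `weak_support` and `eps` are hypothetical names.

```python
# Minimal sketch of error-tolerant support (our illustration, not the
# paper's ETFI definition): a transaction counts toward a pattern if it
# contains at least a (1 - eps) fraction of the pattern's items.

def weak_support(transactions, itemset, eps):
    """Fraction of transactions covering >= (1 - eps) of `itemset`."""
    # Small slack guards against floating-point rounding of (1 - eps) * n.
    need = (1.0 - eps) * len(itemset) - 1e-9
    hits = sum(1 for t in transactions if len(itemset & t) >= need)
    return hits / len(transactions)

txns = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("d")]
pattern = frozenset("abc")
exact = weak_support(txns, pattern, 0.0)       # strict support: 1 of 4
tolerant = weak_support(txns, pattern, 1 / 3)  # one missing item allowed: 3 of 4
```

With zero tolerance only the complete transaction supports the pattern; allowing one missing item recovers the two fragmented occurrences, which is exactly the effect the abstract attributes to noise in real data.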
The blind men and the elephant: On meeting the problem of multiple truths in data from clustering and pattern mining perspectives
2013
"... ..."
Explicit probabilistic models for databases and networks
"... Recent work in data mining and related areas has highlighted the importance of the statistical assessment of data mining results. Crucial to this endeavour is the choice of a nontrivial null model for the data, to which the found patterns can be contrasted. The most influential null models proposed ..."
Abstract

Cited by 1 (1 self)
Recent work in data mining and related areas has highlighted the importance of the statistical assessment of data mining results. Crucial to this endeavour is the choice of a non-trivial null model for the data, to which the found patterns can be contrasted. The most influential null models proposed so far are defined in terms of invariants of the null distribution. Such null models can be used by computation-intensive randomization approaches in estimating the statistical significance of data mining results. Here, we introduce a methodology to construct non-trivial probabilistic models based on the maximum entropy (MaxEnt) principle. We show how MaxEnt models allow for the natural incorporation of prior information. Furthermore, they satisfy a number of desirable properties of previously introduced randomization approaches. Lastly, they also have the benefit that they can be represented explicitly. We …
Characterizing Discriminative Patterns
"... Discriminative patterns are association patterns that occur with disproportionate frequency in some classes versus others, and have been studied under names such as emerging patterns and contrastsets. Suchpatternshavedemonstratedconsiderablevalue for classification and subgroup discovery, but a deta ..."
Abstract
Discriminative patterns are association patterns that occur with disproportionate frequency in some classes versus others, and have been studied under names such as emerging patterns and contrast sets. Such patterns have demonstrated considerable value for classification and subgroup discovery, but a detailed understanding of the types of interactions among items in a discriminative pattern is lacking. To address this issue, we propose to categorize discriminative patterns according to four types of item interaction: (i) driver-passenger, (ii) coherent, (iii) independent additive and (iv) synergistic beyond independent additive. The coherent, additive, and synergistic patterns are of practical importance, with the latter two representing a gain in the discriminative power of a pattern over its subsets. Synergistic patterns are most restrictive, but perhaps the most interesting since they …
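The simplest notion underlying discriminative patterns — a pattern whose support differs between classes — is easy to state concretely. A minimal sketch follows, with all names ours (the paper's four-way categorization of item interactions goes well beyond this basic score):

```python
# Basic discriminative score (often called DiffSup): the absolute
# difference of a pattern's support in two classes. Names are illustrative.

def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def diff_sup(pos, neg, itemset):
    """|support in the positive class - support in the negative class|."""
    return abs(support(pos, itemset) - support(neg, itemset))

pos = [frozenset("ab"), frozenset("ab"), frozenset("ab"), frozenset("a")]
neg = [frozenset("b"), frozenset("b"), frozenset("c"), frozenset("c")]
score = diff_sup(pos, neg, frozenset("ab"))  # 0.75: frequent only in pos
```

Comparing a pattern's score against those of its subsets (here, the single items) is the kind of check that distinguishes additive and synergistic patterns from mere passengers.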
biomarker discovery ∗
"... Recentadvancementinhighthroughputdatacollectiontechnologies has resulted in the availability of diverse biomedical datasets that capture complementary information pertaining to the biological processes in an organism. Biomarkers that are discovered by integrating these datasets obtainedfromacasecon ..."
Abstract
Recent advancement in high-throughput data collection technologies has resulted in the availability of diverse biomedical datasets that capture complementary information pertaining to the biological processes in an organism. Biomarkers that are discovered by integrating these datasets obtained from case-control studies have the potential to elucidate the biological mechanisms behind complex human diseases. In this paper, we define an interaction-type integrative biomarker as one whose features together can explain the disease, but not individually. We propose a pattern mining based integrative framework (PAMIN) to discover interaction-type integrative biomarkers from diverse case-control datasets. PAMIN first finds patterns from individual datasets to capture the available information separately, and then combines these patterns to find integrated …
When Pattern Met Subspace Cluster: A Relationship Story
"... Abstract. While subspace clustering emerged as an application of pattern mining and some of its early advances have probably been inspired by developments in pattern mining, over the years both fields progressed rather independently. In this paper, we identify a number of recent developments in patt ..."
Abstract
While subspace clustering emerged as an application of pattern mining and some of its early advances have probably been inspired by developments in pattern mining, over the years both fields progressed rather independently. In this paper, we identify a number of recent developments in pattern mining that are likely to be applicable to alleviate or solve current problems in subspace clustering, and vice versa.
Interactive Data Mining Considered Harmful∗ (If Done Wrong)
"... Interactive data mining can be a powerful tool for data analysis. But in this short opinion piece I argue that this power comes with new pitfalls that can undermine the value of interactive mining, if not properly addressed. Most notably, there is a serious risk that the user of powerful interactive ..."
Abstract
Interactive data mining can be a powerful tool for data analysis. But in this short opinion piece I argue that this power comes with new pitfalls that can undermine the value of interactive mining, if not properly addressed. Most notably, there is a serious risk that the user of powerful interactive data mining tools will only find the results she was expecting. The purpose of this piece is to raise awareness of this potential issue, stimulate discussion on it, and hopefully give rise to new research directions in addressing it.