Discovering frequent patterns in sensitive data
Cited by 3 (0 self)
Discovering frequent patterns from data is a popular exploratory technique in data mining. However, if the data are sensitive (e.g. patient health records, user behavior records), releasing information about significant patterns or trends carries significant risk to privacy. This paper shows how one can accurately discover and release the most significant patterns along with their frequencies in a data set containing sensitive information, while providing rigorous guarantees of privacy for the individuals whose information is stored there. We present two efficient algorithms for discovering the K most frequent patterns in a data set of sensitive records. Our algorithms satisfy differential privacy, a recently introduced definition that provides meaningful privacy guarantees in the presence of arbitrary external information. Differentially private algorithms require a degree of uncertainty in their output to preserve privacy. Our algorithms handle this by returning ‘noisy’ lists of patterns that are close to the actual list of K most frequent patterns in the data. We define a new notion of utility that quantifies the output accuracy of private top-K pattern mining algorithms. In typical data sets, our utility criterion implies low false positive and false negative rates in the reported lists. We prove that our methods meet the new utility criterion; we also demonstrate the performance of our algorithms through extensive experiments on the transaction data sets from the FIMI repository. While the paper focuses on frequent pattern mining, the techniques developed here are relevant whenever the data mining output is a list of elements ordered according to an appropriately ‘robust’ measure of interest.
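The idea of returning a ‘noisy’ top-K list can be illustrated with a small sketch. This is not either of the paper's two algorithms: it only shows the general shape of perturbing pattern counts with Laplace noise before ranking them. The function names, the `max_size` cap on pattern length, and the `scale` parameter are all assumptions made for illustration. The sketch does not calibrate `scale` to a privacy budget and makes no privacy claim of its own.

```python
import random
from collections import Counter
from itertools import combinations


def pattern_counts(transactions, max_size=2):
    """Count how many transactions contain each itemset of up to max_size items."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for pattern in combinations(items, size):
                counts[pattern] += 1
    return counts


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_top_k(counts, k, scale, rng):
    """Return the k patterns with the largest noise-perturbed counts.

    Assumption: `scale` is supplied by the caller; choosing it to meet a
    formal privacy guarantee is outside the scope of this sketch.
    """
    noisy = {p: c + laplace_noise(scale, rng) for p, c in counts.items()}
    ranked = sorted(noisy, key=noisy.get, reverse=True)
    return [(p, noisy[p]) for p in ranked[:k]]


# Hypothetical toy data, not taken from the FIMI repository.
transactions = [["bread", "milk"], ["bread", "butter"],
                ["bread", "milk", "butter"], ["milk"]]
counts = pattern_counts(transactions)
print(noisy_top_k(counts, k=3, scale=1.0, rng=random.Random(0)))
```

With a larger `scale`, the reported list drifts further from the true top-K, which is the accuracy trade-off that the abstract's utility notion is meant to quantify.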
Rules for Contrast Sets
Cited by 1 (0 self)
In this paper we present a technique to derive rules describing contrast sets. Contrast sets are a formalism to represent group differences. We propose a novel approach to describe directional contrasts using rules where the contrasting effect is partitioned into pairs of groups. Our approach makes use of a directional Fisher Exact Test to find significant differences across groups. We used a Bonferroni within-search adjustment to control type I errors and a pruning technique to prevent derivation of non-significant contrast set specializations.
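As a rough illustration of the statistical test mentioned above, the following sketch computes a one-sided (directional) Fisher exact p-value for a 2x2 table and applies a plain Bonferroni correction. The plain correction is a simplification swapped in here; the paper's within-search adjustment is not described in the abstract and is not reproduced. Function names and the default `alpha` are assumptions.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """P-value for seeing at least `a` in the top-left cell of [[a, b], [c, d]].

    Rows are the two groups; columns are pattern present / pattern absent.
    The tail is summed over the hypergeometric distribution with fixed margins.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(total - row1, col1 - x)
               for x in range(a, upper + 1))
    return tail / comb(total, col1)


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant at alpha divided by the number of tests."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical counts: the pattern holds for 9 of 10 records in group A
# and for 2 of 10 records in group B.
p = fisher_one_sided(9, 1, 2, 8)
print(p, bonferroni_significant([p, 0.04]))
```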
Towards Understanding Spammers – Discovering Local Patterns for Concept Description
Concept description is an important task of descriptive data mining: its aim is to identify and summarize properties of a selected target population in the form of a set of patterns – in a concise and comprehensible way. In this paper we present an approach for concept description in the social bookmarking domain: we show how subgroup discovery can be utilized for identifying discriminative and characteristic local patterns in order to understand the behavior of (non-)spammers. A case study applying data from a real-world system for social bookmarking provides exemplary results and demonstrates the applicability and effectiveness of the presented approach.
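Subgroup discovery ranks candidate descriptions by a quality function. The abstract does not say which one the authors use; the sketch below shows weighted relative accuracy (WRAcc), one commonly used choice, purely as an example. The record fields `many_tags` and `spammer` are hypothetical.

```python
def wracc(records, in_subgroup, is_target):
    """Weighted relative accuracy of a subgroup.

    Computed as coverage * (target share inside the subgroup
    - target share in the whole population).
    """
    total = len(records)
    if total == 0:
        return 0.0
    covered = [r for r in records if in_subgroup(r)]
    if not covered:
        return 0.0
    p_overall = sum(1 for r in records if is_target(r)) / total
    p_subgroup = sum(1 for r in covered if is_target(r)) / len(covered)
    return (len(covered) / total) * (p_subgroup - p_overall)


# Hypothetical user records from a bookmarking system.
users = [
    {"many_tags": True, "spammer": True},
    {"many_tags": True, "spammer": True},
    {"many_tags": False, "spammer": False},
    {"many_tags": False, "spammer": False},
]
print(wracc(users, lambda u: u["many_tags"], lambda u: u["spammer"]))
```

A positive score means the target class is over-represented in the subgroup, and the coverage factor keeps very small subgroups from dominating the ranking.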
Pattern-Based Classification: A Unifying Perspective
The use of patterns in predictive models is a topic that has received a lot of attention in recent years. Pattern mining can help to obtain models for structured domains, such as graphs and sequences, and has been proposed as a means to obtain more accurate and more interpretable models. Despite the large number of publications devoted to this topic, we believe that an overview of what has been accomplished in this area is missing. This paper presents our perspective on this evolving area. We identify the principles of pattern mining that are important when mining patterns for models and provide an overview of pattern-based classification methods. We categorize these methods along the following dimensions: (1) whether they post-process a pre-computed set of patterns or iteratively execute pattern mining algorithms; (2) whether they select patterns model-independently or whether the pattern selection is guided by a model. We summarize the results that have been obtained for each of these methods.
Dimension Reduction of Chemical Process Simulation Data
In the analysis of combustion processes, simulation is a cost-efficient tool that complements experimental testing. The simulation models must be precise if subtle differences are to be detected. On the other hand, computational evaluation of precise models typically requires substantial effort. To escape the computational bottleneck, reduced chemical schemes, for example, ILDM-based methods or the flamelet approach, have been developed that result in substantially reduced computational effort and memory requirements. This paper proposes an additional analysis tool based on the Machine Learning concepts of Subgroup Discovery and Lazy Learning. Its goal is compact representation of chemical processes using few variables. Efficacy is demonstrated for simulation data of a laminar methane/air combustion process described by 29 chemical species, 3 thermodynamic properties (pressure, temperature, enthalpy), and 2 velocity components. From these data, the reduction method derives a reduced set of 3 variables from which the other 31 variables are estimated with good accuracy.
Key words: dimension reduction, subgroup discovery, lazy learner, modeling combustion
Subgroup Discovery using Bump Hunting on Multi-Relational Histograms
We propose an approach to subgroup discovery in relational databases containing numerical attributes. The approach is based on detecting bumps in histograms constructed from substitution sets resulting from matching a first-order query against the input relational database. The approach is evaluated on seven data sets, discovering interpretable subgroups. The subgroups' rate of survival from the training split to the testing split varies among the experimental data sets, but on at least three of them it is very high.
The Inductive Software Engineering Manifesto: Principles for Industrial Data Mining
The practices of industrial and academic data mining are very different. These differences have significant implications for (a) how we manage industrial data mining projects; (b) the direction of academic studies in data mining; and (c) training programs for engineers who seek to use data miners in an industrial setting.
From Black and White to Full Colour: Extending Redescription Mining Outside the Boolean World
Redescription mining is a powerful data analysis tool that is used to find multiple descriptions of the same entities. Consider geographical regions as an example. They can be characterized by the fauna that inhabits them on one hand and by their meteorological conditions on the other hand. Finding such redescriptors, a task known as niche-finding, is of much importance in biology. But current redescription mining methods cannot handle anything other than Boolean data. This restricts the range of possible applications or makes discretization a prerequisite, entailing a possibly harmful loss of information. In niche-finding, while the fauna can be naturally represented using Boolean presence/absence data, the weather cannot. In this paper, we extend redescription mining to real-valued data using a surprisingly simple and efficient approach. We provide extensive experimental evaluation to study the behaviour of the proposed algorithm. Furthermore, we show the statistical significance of our results using recent innovations in randomization methods.
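To make the niche-finding example concrete, the sketch below scores how well a Boolean query over fauna and an interval query over a real-valued weather attribute describe the same regions. It uses the Jaccard coefficient of the two support sets, a common choice in redescription mining that the abstract itself does not name. The field names `moose` and `jan_temp` and the data are hypothetical, and the sketch does not reproduce the paper's algorithm for finding such queries.

```python
def support(rows, predicate):
    """Indices of the rows that satisfy a query."""
    return {i for i, row in enumerate(rows) if predicate(row)}


def jaccard(left, right):
    """Size of the intersection over size of the union; 0.0 if both sets are empty."""
    union = left | right
    return len(left & right) / len(union) if union else 0.0


# Hypothetical regions: species presence (Boolean) and January temperature (real-valued).
regions = [
    {"moose": True, "jan_temp": -12.0},
    {"moose": True, "jan_temp": -8.5},
    {"moose": False, "jan_temp": -6.0},
    {"moose": False, "jan_temp": 3.0},
]

fauna_side = support(regions, lambda r: r["moose"])
weather_side = support(regions, lambda r: -20.0 <= r["jan_temp"] <= -5.0)
print(jaccard(fauna_side, weather_side))
```

Using an interval directly on `jan_temp` avoids discretizing the attribute beforehand, which is the loss of information the abstract warns about.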
Mining Low-Support Discriminative Patterns from Dense and High-dimensional Data
, 2010
Discriminative patterns can provide valuable insights into datasets with class labels that may not be available from the individual features or the predictive models built using them. Most existing approaches work efficiently for sparse or low-dimensional datasets. However, for dense and high-dimensional datasets, they have to use high thresholds to produce the complete results within limited time, and thus may miss interesting low-support patterns. In this paper, we address the necessity of trading off the completeness of discriminative pattern discovery with the efficient discovery of low-support discriminative patterns from such datasets. We propose a family of anti-monotonic measures named SupMaxK that organize the set of discriminative patterns into nested layers of subsets, which are progressively more complete in their coverage, but require increasingly more computation. In particular, the member of SupMaxK with K = 2, named SupMaxPair, is suitable for dense and high-dimensional datasets. Experiments on both synthetic datasets and a cancer gene expression dataset demonstrate that there are low-support patterns that can be discovered using SupMaxPair but not by existing approaches. Furthermore, we show that the low-support discriminative patterns that are only discovered
Contrasting Subgroup Discovery
Subgroup discovery methods find interesting subsets of objects of a given class. Motivated by an application in bioinformatics, we first define a generalized subgroup discovery problem. In this setting, a subgroup is interesting if its members are characteristic for their class, even if the classes are not identical. Then we further refine this setting for the case where subsets of objects, for example, subsets of objects that represent different time points or different phenotypes, are contrasted. We show that this allows finding subgroups of objects that could not be found with classical subgroup discovery. To find such subgroups, we propose an approach that consists of two subgroup discovery steps and an intermediate contrast set definition step. This approach is applicable in various application areas. An example is biology, where interesting subgroups of genes are searched for using gene expression data. We address the problem of finding enriched gene sets that are specific for virus-infected samples for a specific time point or a specific phenotype. We report on experimental results on a time series data set for virus-infected Solanum tuberosum (potato) plants. The results on S. tuberosum's response to virus infection revealed new research hypotheses for plant biologists.

