Mining Statistically Important Equivalence Classes and Delta-Discriminative Emerging Patterns
2007
Cited by 23 (2 self)
The support-confidence framework is the most common measure used in itemset mining algorithms, for its anti-monotonicity that effectively simplifies the search lattice. This computational convenience brings both quality and statistical flaws to the results as observed by many previous studies. In this paper, we introduce a novel algorithm that produces itemsets with ranked statistical merits under sophisticated test statistics such as chi-square, risk ratio, odds ratio, etc. Our algorithm is based on the concept of equivalence classes. An equivalence class is a set of frequent itemsets that always occur together in the same set of transactions. Therefore, itemsets within an equivalence class all share the same level of statistical significance regardless of the variety of test statistics. As an equivalence class can be uniquely determined and concisely represented by a closed pattern and a set of generators, we just mine closed patterns and generators, taking a simultaneous depth-first search scheme. This parallel approach has not been exploited by any prior work. We evaluate our algorithm on two aspects. In general, we compare to LCM and FPclose, which are the best algorithms tailored for mining only closed patterns. In particular, we compare to epMiner, which is the most recent algorithm for mining a type of relative risk patterns, known as minimal emerging patterns. Experimental results show that our algorithm is faster than all of them, sometimes even multiple orders of magnitude faster. These statistically ranked patterns and the efficiency have a high potential for real-life applications, especially in biomedical and financial fields where classical test statistics are of dominant interest.
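The equivalence-class idea in the abstract above can be illustrated with a small brute-force sketch. This is not the paper's depth-first algorithm; it simply enumerates every itemset, groups itemsets by the exact set of transactions that contain them, and reports the closed pattern (the largest member) and the generators (the minimal members) of each class. The function name `equivalence_classes` and the `min_support` parameter are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def equivalence_classes(transactions, min_support=1):
    """Group frequent itemsets by the set of transaction ids containing them.

    Brute-force illustration only: exponential in the number of items.
    `transactions` is a list of sets of items.
    """
    items = sorted(set().union(*transactions))
    classes = defaultdict(list)
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            tids = frozenset(
                i for i, t in enumerate(transactions) if set(itemset) <= t
            )
            if len(tids) >= min_support:
                classes[tids].append(frozenset(itemset))

    result = []
    for tids, members in classes.items():
        # The closed pattern is the union of all members, hence the largest one.
        closed = max(members, key=len)
        # Generators are members with no proper subset in the same class.
        generators = [m for m in members if not any(o < m for o in members)]
        result.append({"tids": tids, "closed": closed, "generators": generators})
    return result


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "d"}]
    for cls in equivalence_classes(data):
        print(sorted(cls["tids"]), sorted(cls["closed"]),
              [sorted(g) for g in cls["generators"]])
```

Because all members of a class occur in exactly the same transactions, any statistic computed from those transactions (support, chi-square, odds ratio) is identical across the class, which is why one closed pattern plus its generators is enough to represent it.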
Closed Patterns Meet n-ary Relations
2009
Cited by 17 (11 self)
Set pattern discovery from binary relations has been extensively studied during the last decade. In particular, many complete and efficient algorithms for frequent closed set mining are now available. Generalizing such a task to n-ary relations (n ≥ 2) appears as a timely challenge. It may be important for many applications, for example, when adding the time dimension to the popular objects × features binary case. The generality of the task (no assumption being made on the relation arity or on the size of its attribute domains) makes it computationally challenging. We introduce an algorithm called DATAPEELER. From an n-ary relation, it extracts all closed n-sets satisfying given piecewise (anti)-monotonic constraints. This new class of constraints generalizes both monotonic and anti-monotonic constraints. Considering the special case of ternary relations, DATAPEELER outperforms the state-of-the-art algorithms CUBEMINER and TRIAS by orders of magnitude. This good performance is due to a new enumeration strategy that enforces the closedness property efficiently. The relevance of the extracted closed n-sets is assessed on real-life 3- and 4-ary relations. Beyond natural 3- or 4-ary relations, expanding a relation with an additional attribute can help in enforcing rather abstract constraints such as the robustness with respect to binarization. Furthermore, a collection of closed n-sets is shown to be an excellent starting point
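To make the notion of a closed n-set concrete, here is a minimal sketch that only checks closedness for one candidate; it is not the DATAPEELER enumeration strategy. It assumes the usual reading that an n-set (X1, ..., Xn) is valid when every tuple of X1 × ... × Xn is in the relation, and closed when no single element can be added to any Xi without breaking that. The function name `is_closed_nset` and the argument layout are hypothetical.

```python
from itertools import product


def is_closed_nset(relation, domains, nset):
    """Check whether `nset` is a closed n-set of `relation`.

    relation: iterable of n-tuples.
    domains:  list of n sets, the attribute domains.
    nset:     list of n non-empty sets, one subset per domain.
    """
    rel = set(relation)

    def all_present(sets):
        # Every combination of one element per dimension must be in the relation.
        return all(t in rel for t in product(*sets))

    if not all_present(nset):
        return False
    # Closedness: adding any single missing element to any dimension must fail.
    for i, dom in enumerate(domains):
        for element in set(dom) - set(nset[i]):
            extended = [set(s) for s in nset]
            extended[i].add(element)
            if all_present(extended):
                return False
    return True


if __name__ == "__main__":
    rel = {(1, "a", "x"), (1, "b", "x"), (2, "a", "x"), (2, "b", "x"), (1, "a", "y")}
    doms = [{1, 2}, {"a", "b"}, {"x", "y"}]
    print(is_closed_nset(rel, doms, [{1, 2}, {"a", "b"}, {"x"}]))
    print(is_closed_nset(rel, doms, [{1}, {"a", "b"}, {"x"}]))
```

Testing single-element extensions is sufficient here, because if a larger extension were valid, each of its single-element sub-extensions would be valid too.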
Distance Based Subspace Clustering with Flexible Dimension Partitioning
Cited by 9 (2 self)
Traditional similarity or distance measurements usually become meaningless when the dimensions of the datasets increase, which has detrimental effects on clustering performance. In this paper, we propose a distance-based subspace clustering model, called nCluster, to find groups of objects that have similar values on subsets of dimensions. Instead of using a grid-based approach to partition the data space into non-overlapping rectangular cells as in the density-based subspace clustering algorithms, the nCluster model uses a more flexible method to partition the dimensions to preserve meaningful and significant clusters. We develop an efficient algorithm to mine only maximal nClusters. A set of experiments is conducted to show the efficiency of the proposed algorithm and the effectiveness of the new model in preserving significant clusters.
Supporting Framework Use via Automatically Extracted Concept-Implementation Templates
Cited by 7 (2 self)
Application frameworks provide reusable concepts that are instantiated in application code through potentially complex implementation steps such as subclassing, implementing callbacks, and making calls. Existing applications contain valuable examples of such steps, except that locating them in the application code is often challenging. We propose the notion of concept implementation templates, which summarize the necessary implementation steps, and an approach to automatic extraction of such templates from traces of sample applications. We demonstrate the feasibility of the template extraction with high precision and recall through an empirical study with twelve realistic concepts from four widely used frameworks. Finally, we report on a user experiment with twelve subjects in which the choice of templates vs. documentation had much less impact on development time than the concept complexity.
Maximal Quasi-Bicliques with Balanced Noise Tolerance: Concepts and Co-clustering Applications
2008
Cited by 4 (0 self)
The rigid all-versus-all adjacency required by a maximal biclique for its two vertex sets is extremely vulnerable to missing data. In the past, several types of quasi-bicliques have been proposed to tackle this problem; however, their noise tolerance is usually unbalanced and can be very skewed. In this paper, we improve the noise tolerance of maximal quasi-bicliques by allowing every vertex to tolerate up to the same number, or the same percentage, of missing edges. This idea leads to a more natural interaction between the two vertex sets: a balanced most-versus-most adjacency. This generalization is also non-trivial, as many large-size maximal quasi-biclique subgraphs do not contain any maximal bicliques. This observation implies that direct expansion from maximal bicliques may not guarantee a complete enumeration of all maximal quasi-bicliques. We present important properties of maximal quasi-bicliques, such as a bounded closure property and a fixed point property, to design efficient algorithms. Maximal quasi-bicliques are closely related to co-clustering problems such as documents and words co-clustering, images and features co-clustering, stocks and financial ratios co-clustering, etc. Here, we demonstrate the usefulness of our concepts using a new application, a bioinformatics example, where prediction of true protein interactions is investigated.
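The balanced tolerance described above can be sketched as a simple membership check. This assumes the absolute-count reading of the abstract: every vertex on either side may miss at most `eps` edges to the opposite side. It does not test maximality and is not the paper's enumeration algorithm; the function name and parameters are hypothetical.

```python
def is_balanced_quasi_biclique(edges, left, right, eps):
    """Return True if every vertex misses at most `eps` edges to the other side.

    edges: iterable of (u, v) pairs with u from the left side, v from the right.
    left, right: the two candidate vertex sets.
    eps: maximum number of missing edges tolerated per vertex (assumed absolute).
    """
    edge_set = set(edges)
    # Each left vertex must be adjacent to all but at most eps right vertices.
    for u in left:
        missing = sum((u, v) not in edge_set for v in right)
        if missing > eps:
            return False
    # The same tolerance applies symmetrically to each right vertex.
    for v in right:
        missing = sum((u, v) not in edge_set for u in left)
        if missing > eps:
            return False
    return True


if __name__ == "__main__":
    g = {(1, "a"), (1, "b"), (2, "a"), (2, "b"), (3, "a")}
    print(is_balanced_quasi_biclique(g, {1, 2, 3}, {"a", "b"}, eps=0))
    print(is_balanced_quasi_biclique(g, {1, 2, 3}, {"a", "b"}, eps=1))
```

With `eps=0` the check reduces to the ordinary all-versus-all biclique condition, which shows how the quasi-biclique generalizes it.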
Mining Constrained Cross-Graph Cliques in Dynamic Networks
Cited by 3 (2 self)
have been recently proposed to mine closed patterns in ternary relations, i.e., a generalization of the so-called formal concept extraction from binary relations. In this paper, we consider the specific context where a ternary relation denotes the value of a graph adjacency matrix (i.e., a Vertices × Vertices matrix) at different timestamps. We discuss the constraint-based extraction of patterns in such dynamic graphs. We formalize the concept of δ-contiguous closed 3-clique and we discuss the availability of a complete algorithm for mining them. It is based on a specialization of the enumeration strategy implemented in DataPeeler. Indeed, the relevant cliques are specified by means of a conjunction of constraints which can be efficiently exploited. The added value of our strategy for computing constrained clique patterns is assessed on a real dataset about a public bicycle renting system. The raw data encode the relationships between the renting stations during one year. The extracted δ-contiguous closed 3-cliques are shown to be consistent with our knowledge on the considered city.
Are Zero-suppressed Binary Decision Diagrams Good for Mining Frequent Patterns in High Dimensional Datasets?
Cited by 3 (2 self)
Mining frequent patterns such as frequent itemsets is a core operation in many important data mining tasks, such as association rule mining. Mining frequent itemsets in high-dimensional datasets is challenging, since the search space is exponential in the number of dimensions and the volume of patterns can be huge. Many of the state-of-the-art techniques rely upon the use of prefix trees (e.g. FP-trees), which allow nodes to be shared among common prefix paths. However, the scalability of such techniques may be limited when handling high-dimensional datasets. The purpose of this paper is to analyse the behaviour of mining frequent itemsets when, instead of a tree data structure, a canonical directed acyclic graph, namely the Zero-Suppressed Binary Decision Diagram (ZBDD), is used. Due to its compactness and ability to promote node reuse, ZBDD has proven very effective in other areas of computer science, such as Boolean SAT solvers. In this paper, we show how ZBDDs can be used to mine frequent itemsets (and their common varieties). We also introduce a weighted variant of ZBDD which allows a more efficient mining algorithm to be developed. We provide an experimental study concentrating on high-dimensional biological datasets, and identify indicative situations where a ZBDD technology can be superior to the prefix-tree-based technique.
K.: Comprehending object-oriented software frameworks API through dynamic analysis
2007
Cited by 3 (3 self)
A common practice followed by many application developers is to use existing framework applications as a guide to understand how to implement a framework-provided concept of interest. Unfortunately, finding the code that implements the concept of interest might be very difficult since the code might be scattered across and tangled with code implementing other concepts. To address this issue, this report presents an approach called FUDA (Framework API Understanding through Dynamic Analysis). The main idea of this approach is to extract implementation recipes of a given concept from runtime information collected when that concept is invoked in a number of different sample applications. For this purpose, we introduce a novel dynamic slicing approach named concept trace slicing and combine it with clustering and data mining techniques. The experimental evaluation of FUDA suggests that this approach is effective in producing useful implementation recipes for a given concept.
Discovering substantial distinctions among incremental biclusters
in SDM, 2009
Cited by 3 (3 self)
A fundamental task of data analysis is comprehending what distinguishes clusters found within the data. We present the problem of mining distinguishing sets, which seeks to find sets of objects or attributes that induce the most change among the incremental biclusters of a binary dataset. Unlike emerging patterns and contrast sets, which only focus on statistical differences between support of itemsets, our approach considers distinctions in both the attribute space and the object space. Viewing the lattice of biclusters formed within a data set as a weighted directed graph, we mine the most significant distinguishing sets by growing a maximal cost spanning tree of the lattice. In this paper we present a weighting function for measuring distinction among biclusters in the lattice and the novel MIDS algorithm. MIDS simultaneously enumerates biclusters, constructs the bicluster lattice, and computes the distinguishing sets. The efficient computational performance of MIDS is exhibited in a performance test on real world and benchmark data sets. The utility of distinguishing sets is also demonstrated with experiments on synthetic and real data.
Estimating the Number of Frequent Itemsets in a Large Database
Cited by 2 (0 self)
Estimating the number of frequent itemsets for minimal support α in a large dataset is of great interest from both theoretical and practical perspectives. However, finding not only the number of frequent itemsets, but even the number of maximal frequent itemsets, is #P-complete. In this study, we provide a theoretical investigation on the sampling estimator. We discover and prove several fundamental but also rather surprising properties of the sampling estimator. We also propose a novel algorithm to estimate the number of frequent itemsets without using sampling. Our detailed experimental results have shown the accuracy and efficiency of our proposed approach.
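As a rough illustration of what a sampling estimator for this quantity might look like, the sketch below draws a random subset of transactions and counts the itemsets that are frequent in that subset at the same relative support α. This is an assumed, naive form of the estimator, not the one analysed in the paper, and the counting routine is brute force. The names `count_frequent_itemsets` and `sampling_estimate` are hypothetical.

```python
import random
from itertools import combinations


def count_frequent_itemsets(transactions, alpha):
    """Count non-empty itemsets whose relative support is at least alpha.

    Brute force over all itemsets, so only usable on tiny item universes.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    count = 0
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            support = sum(set(itemset) <= t for t in transactions)
            if support >= alpha * n:
                count += 1
    return count


def sampling_estimate(transactions, alpha, sample_size, seed=0):
    """Estimate the count from a random sample drawn without replacement."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = rng.sample(transactions, sample_size)
    return count_frequent_itemsets(sample, alpha)


if __name__ == "__main__":
    data = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
    print(count_frequent_itemsets(data, 0.5))
    print(sampling_estimate(data, 0.5, sample_size=2))
```

When the sample is the whole dataset the estimate equals the exact count; for smaller samples it can deviate in either direction, which is the kind of behaviour a theoretical study of the estimator would need to characterise.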