Results 1-10 of 18
Interestingness of Frequent Itemsets Using Bayesian Networks as Background Knowledge
In Proceedings of the SIGKDD Conference on Knowledge Discovery and Data Mining, 2004
The Pattern Ordering Problem
Proceedings of the 7th European Conference on Principles of Data Mining and Knowledge Discovery, Lecture Notes in Artificial Intelligence, 2003
Cited by 25 (13 self)
Many pattern discovery methods provide fast tools for finding the frequently occurring patterns in large data sets. Such pattern collections can also be used to approximate the underlying joint distribution, and they summarize the data set well. However, a large set of patterns is unintuitive and not necessarily easy to use. In this paper we consider the problem of ordering a collection of patterns so that each prefix of the ordering gives as good a summary of the data as possible. We formulate this problem for general loss functions, show that the problem has an efficient solution, and prove that its natural variant is NP-complete but that the greedy approximation algorithm achieves an approximation ratio of $e/(e-1) \approx 1.58$. We apply the general technique to approximation of frequencies of frequent sets, and show that the method gives good empirical results.
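The greedy ordering idea in this abstract can be illustrated with a small sketch. The abstract does not state the loss function or the estimation scheme, so both are assumptions here: `estimate` bounds a pattern's frequency by the smallest frequency among its already chosen subsets (defaulting to 1.0), and `loss` is the total absolute estimation error. The names `estimate`, `loss`, and `greedy_order` are hypothetical and not taken from the paper.

```python
from typing import Dict, FrozenSet, List

Pattern = FrozenSet[str]


def estimate(target: Pattern, chosen: List[Pattern], freq: Dict[Pattern, float]) -> float:
    """Assumed estimator: smallest known frequency among chosen subsets of target."""
    bounds = [freq[p] for p in chosen if p <= target]
    return min(bounds, default=1.0)


def loss(chosen: List[Pattern], freq: Dict[Pattern, float]) -> float:
    """Assumed loss: total absolute error of the estimates over all patterns."""
    return sum(abs(freq[p] - estimate(p, chosen, freq)) for p in freq)


def greedy_order(freq: Dict[Pattern, float]) -> List[Pattern]:
    """Repeatedly append the pattern that makes the current prefix's loss smallest."""
    remaining = set(freq)
    order: List[Pattern] = []
    while remaining:
        # Ties are broken by the sorted item list so the result is deterministic.
        best = min(remaining, key=lambda p: (loss(order + [p], freq), sorted(p)))
        order.append(best)
        remaining.remove(best)
    return order


if __name__ == "__main__":
    freq = {
        frozenset({"a"}): 0.8,
        frozenset({"b"}): 0.5,
        frozenset({"a", "b"}): 0.4,
    }
    for pattern in greedy_order(freq):
        print(sorted(pattern), freq[pattern])
```

Each prefix of the returned list is the greedy choice for that prefix length under the assumed loss; a different loss function would in general yield a different ordering.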
On Inverse Frequent Set Mining
2003
Cited by 23 (2 self)
Frequent set mining is a well-known technique to summarize binary data. However, it is an open problem how difficult it is to invert frequent set mining, i.e., how difficult it is to find a binary data set that is compatible with the results of frequent set mining, the frequent sets. This inverse data mining problem is related to the questions of how well privacy is preserved in the frequent sets and how well the frequent sets characterize the original data set. In this paper we analyze the computational complexity of the problem of finding a binary data set compatible with a given collection of frequent sets and show that in many cases the problem is computationally very difficult.
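To make the notion of a compatible data set concrete, here is a minimal sketch. It assumes that "compatible" means every listed itemset has exactly its stated support count in the data set; the paper may use other variants. The function names are hypothetical, and the exhaustive search is only meant to show why the naive approach scales badly, not to reproduce any algorithm from the paper.

```python
from itertools import combinations, combinations_with_replacement
from typing import Dict, FrozenSet, Iterable, List, Optional, Set

Itemset = FrozenSet[str]


def support(itemset: Itemset, data: Iterable[Itemset]) -> int:
    """Number of rows (transactions) that contain every item of the itemset."""
    return sum(1 for row in data if itemset <= row)


def is_compatible(data: Iterable[Itemset], supports: Dict[Itemset, int]) -> bool:
    """Assumed definition: each given itemset has exactly its stated support."""
    rows = list(data)
    return all(support(s, rows) == c for s, c in supports.items())


def brute_force_inverse(
    items: Set[str], n_rows: int, supports: Dict[Itemset, int]
) -> Optional[List[Itemset]]:
    """Try every multiset of n_rows transactions; exponential in the number of items."""
    ordered = sorted(items)
    all_rows = [
        frozenset(c) for k in range(len(ordered) + 1) for c in combinations(ordered, k)
    ]
    for data in combinations_with_replacement(all_rows, n_rows):
        if is_compatible(data, supports):
            return list(data)
    return None


if __name__ == "__main__":
    supports = {frozenset({"a"}): 2, frozenset({"b"}): 2, frozenset({"a", "b"}): 1}
    print(brute_force_inverse({"a", "b"}, 3, supports))
```

Checking a candidate data set is cheap, while the search space grows with the number of possible transactions raised to the number of rows, which is in line with the hardness the abstract reports.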
Bases of motifs for generating repeated patterns with wild cards
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2003
Cited by 17 (7 self)
Motif inference represents one of the most important areas of research in computational biology, and one of its oldest. Despite this, the problem remains very much open in the sense that no existing definition is fully satisfying, either in formal terms or in relation to the biological questions that involve finding such motifs. Two main types of motifs have been considered in the literature: matrices (of letter frequency per position in the motif) and patterns. There is no conclusive evidence in favour of either, and recent work has attempted to integrate the two types into a single model. In this paper, we address the formal issue in relation to motifs as patterns. This is essential to get a better understanding of motifs in general. In particular, we consider a promising idea that was recently proposed, which attempted to avoid the combinatorial explosion in the number of motifs by means of a generator set for the motifs. Instead of exhibiting a complete list of motifs satisfying some input constraints, what is produced is a basis of such motifs from which all the other ones can be generated. We study the computational cost of determining such a basis of repeated motifs with wild cards in a sequence. We give new upper and lower bounds on such a cost, introducing a notion of basis that is provably contained in (and thus smaller than) previously defined ones. Our basis can be computed in less time and space, and is still able to generate the same set of motifs. We also prove that the number of motifs in all bases defined so far grows exponentially with the quorum, that is, with the minimal number of times a motif must appear in a sequence, something unnoticed in previous work. We show that there is no hope of efficiently computing such bases unless the quorum is fixed.
A basis of tiling motifs for generating repeated patterns and its complexity for higher quorum
In B. Rovan and P. Vojtás, editors, Mathematical Foundations of Computer Science, volume 2747 of LNCS, 2003
Data Mining Methods for Network Intrusion Detection
2004
Cited by 12 (1 self)
Network intrusion detection systems have become a standard component in security infrastructures. Unfortunately, current systems are poor at detecting novel attacks without an unacceptable level of false alarms. We propose that the solution to this problem is the application of an ensemble of data mining techniques which can be applied to network connection data in an offline environment, augmenting existing real-time sensors. In this paper, we expand on our motivation, particularly with regard to running in an offline environment, and our interest in multisensor and multimethod correlation. We then review existing systems, from commercial systems to research-based intrusion detection systems. Next we survey the state of the art in the area. Standard datasets and feature extraction turned out to be more important than we had initially anticipated, so each can be found under its own heading. Next, we review the actual data mining methods that have been proposed or implemented. We conclude by summarizing the open problems in this area, along with some questions of a broader scope. We hope that by providing the motivation and summarizing the work in this area we can stimulate further research.
Separating Structure from Interestingness
Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference, PAKDD 2004, 2004
Cited by 6 (5 self)
Condensed representations of pattern collections have been recognized to be important building blocks of inductive databases, a promising theoretical framework for data mining, and recently they have been studied actively. However, there has not been much research on how condensed representations should actually be represented. In this paper we propose a general approach to build condensed representations of pattern collections. The approach is based on separating the structure of the pattern collection from the interestingness values of the patterns. We also study the concrete case of representing the frequent sets and their (approximate) frequencies following this approach: we discuss the tradeoffs in representing the frequent sets by the maximal frequent sets, the minimal infrequent sets and their combinations, and investigate the problem of approximating the frequencies from samples by giving new upper bounds on sample complexity based on frequent closed sets and describing how convex optimization can be used to improve and score the obtained samples.
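The two structural representations mentioned in this abstract, the maximal frequent sets and the minimal infrequent sets, can be computed from a downward-closed collection of frequent sets as in the following sketch. This follows the standard definitions only and is not taken from the paper; the function names are hypothetical, and the empty set is assumed to be frequent.

```python
from typing import FrozenSet, Iterable, Set

Itemset = FrozenSet[str]


def maximal_sets(frequent: Iterable[Itemset]) -> Set[Itemset]:
    """Frequent sets that have no frequent proper superset."""
    collection = set(frequent)
    return {s for s in collection if not any(s < t for t in collection)}


def minimal_infrequent_sets(frequent: Iterable[Itemset], items: Set[str]) -> Set[Itemset]:
    """Infrequent sets all of whose proper subsets are frequent.

    Assumes the input collection is downward closed; the empty set is added
    because it is assumed to be frequent.
    """
    collection = set(frequent) | {frozenset()}
    border: Set[Itemset] = set()
    for s in collection:
        for item in items - s:
            candidate = s | {item}
            if candidate in collection:
                continue
            # In a downward-closed collection it suffices to check the
            # subsets obtained by dropping a single item.
            if all(candidate - {j} in collection for j in candidate):
                border.add(candidate)
    return border


if __name__ == "__main__":
    items = {"a", "b", "c"}
    frequent = {frozenset(), frozenset("a"), frozenset("b"), frozenset("c"), frozenset("ab")}
    print(sorted(sorted(s) for s in maximal_sets(frequent)))
    print(sorted(sorted(s) for s in minimal_infrequent_sets(frequent, items)))
```

Either border determines which sets are frequent without storing any frequency values, which is the kind of separation between structure and interestingness that the abstract describes.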
Chaining Patterns
Discovery Science, 6th International Conference, DS 2003, 2003
Cited by 4 (4 self)
Finding condensed representations for pattern collections has been an active research topic in data mining recently, and several representations have been proposed. In this paper we introduce chain partitions of partially ordered pattern collections as high-level condensed representations that can be applied to a wide variety of pattern collections, including most known condensed representations and databases. We analyze the goodness of the approach, study the computational challenges and algorithms for finding the optimal chain partitions, and show empirically that this approach can simplify the pattern collections significantly.
Finding All Occurring Sets of Interest
2nd International Workshop on Knowledge Discovery in Inductive Databases, 2003
Cited by 4 (4 self)
In this paper we examine the problem of mining all occurring sets of interest. We define what they are, sketch some applications, describe streaming algorithms for the problem and analyze their computational complexity. We also study alternative representations for the occurring sets of interest and evaluate some of them experimentally.
TSET: Algorithm for mining frequent temporal patterns
In Proc. of ECML/PKDD’04 Workshop on Knowledge Discovery in Data Streams - A Collaborative Effort in Knowledge Discovery, 2004
Cited by 4 (1 self)
The incorporation of temporal semantics into traditional data mining techniques has given rise to a new area called Temporal Data Mining. This incorporation is especially necessary if we want to extract useful knowledge from dynamic domains, which are time-varying in nature. However, in many cases this is a computationally intractable problem in practice, and it therefore poses more challenges for efficient processing than non-temporal techniques. In this paper, we present a new algorithm named TSET (an acronym for Temporal Set-Enumeration Tree) for mining frequent temporal patterns (sequences) from datasets. The algorithm is based on the inter-transactional association mining problem, but it uses a unique tree-based structure for storing all frequent patterns discovered in the mining process.