Results 11 - 20 of 136
TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies, 1999
Cited by 50 (0 self)
this paper, we also consider the approximate dependency inference task: given a relation r and a threshold ε, find all minimal non-trivial approximate dependencies
Efficient Discovery of Functional and Approximate Dependencies Using Partitions (Extended version) - In ICDE, 1997
Cited by 46 (1 self)
Discovery of functional dependencies from relations has been identified as an important database analysis technique. In this paper, we present a new approach for finding functional dependencies from large databases, based on partitioning the set of rows with respect to their attribute values. The use of partitions makes the discovery of approximate functional dependencies easy and efficient, and the erroneous or exceptional rows can be identified easily. Experiments show that the new algorithm is efficient in practice. For benchmark databases the running times are improved by several orders of magnitude over previously published results. The algorithm is also applicable to much larger datasets than the previous methods.
Computing Reviews Categories and Subject Descriptors: H.3.1 Content Analysis and Indexing; F.2.2 Nonnumerical Algorithms and Problems; I.2.6 Learning. General Terms: Algorithms, Experimentation. Additional Key Words and Phrases: Knowledge Discovery, Data Mining, Func...
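The partition idea in this abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm, only a direct check of a single dependency: rows are grouped by their values on the left-hand side, and X -> A holds exactly when adding A to the grouping attributes does not split any group. The function names (`partition`, `holds`, `g3_error`) and the toy relation are assumptions made for illustration; the error measure shown is the fraction of rows that would have to be removed for the dependency to hold.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in attrs)].append(i)
    return list(groups.values())

def holds(rows, lhs, rhs):
    """lhs -> rhs holds iff refining the lhs partition by rhs adds no classes."""
    return len(partition(rows, lhs)) == len(partition(rows, lhs + [rhs]))

def g3_error(rows, lhs, rhs):
    """Fraction of rows to remove so that lhs -> rhs holds exactly."""
    kept = 0
    for cls in partition(rows, lhs):
        counts = defaultdict(int)
        for i in cls:
            counts[rows[i][rhs]] += 1
        # Within each lhs class, keep only the most common rhs value.
        kept += max(counts.values())
    return 1 - kept / len(rows)

# Hypothetical toy relation, not taken from the paper.
rows = [
    {"city": "Helsinki", "zip": "00100", "country": "FI"},
    {"city": "Helsinki", "zip": "00100", "country": "FI"},
    {"city": "Espoo", "zip": "02100", "country": "FI"},
    {"city": "Espoo", "zip": "02150", "country": "FI"},
]

print(holds(rows, ["city"], "country"))
print(holds(rows, ["city"], "zip"))
print(g3_error(rows, ["city"], "zip"))
```

In this toy relation city -> country holds, while city -> zip fails because of one exceptional row; an approximate dependency with a threshold of at least 0.25 would still accept it.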
Mining Long Sequential Patterns in a Noisy Environment, 2002
Cited by 41 (9 self)
many applications including computational biology study, consumer behavior analysis, system performance analysis, etc. In a noisy environment, an observed sequence may not accurately reflect the underlying behavior. For example, in a protein sequence, the amino acid N is likely to mutate to D with little impact to the biological function of the protein. It would be desirable if the occurrence of D in the observation can be related to a possible mutation from N in an appropriate manner. Unfortunately, the support measure (i.e., the number of occurrences) of a pattern does not serve this purpose. In this paper, we introduce the concept of compatibility matrix as the means to provide a probabilistic connection from the observation to the underlying true value. A new metric match is also proposed to capture the "real support" of a pattern which would be expected if a noise-free environment is assumed. In addition, in the context we address, a pattern could be very long. The standard pruning technique developed for the market basket problem may not work efficiently. As a result, a novel algorithm that combines statistical sampling and a new technique (namely border collapsing) is devised to discover long patterns in a minimal number of scans of the sequence database with sufficiently high confidence. Empirical results demonstrate the robustness of the match model (with respect to the noise) and the efficiency of the probabilistic algorithm.
Discovering All Most Specific Sentences - ACM Transactions on Database Systems, 2003
Cited by 39 (2 self)
this article, we show how the problems of finding frequent sets in relations and of finding minimal keys in databases can be reduced to this formulation. Using this theory extraction formulation [Mannila 1995, 1996; Mannila and Toivonen 1997], one can formulate general results about the complexity of algorithms for these data mining tasks
A Condensed Representation to Find Frequent Patterns, 2001
Cited by 35 (2 self)
Given a large set of data, a common data mining problem is to extract the frequent patterns occurring in this set. The idea presented in this paper is to extract a condensed representation of the frequent patterns called disjunction-free sets, instead of extracting the whole frequent pattern collection. We show that this condensed representation can be used to regenerate all frequent patterns and their exact frequencies.
Survey on frequent pattern mining, 2002
Cited by 35 (1 self)
Frequent itemsets play an essential role in many data mining tasks that try to find interesting patterns from databases, such as association rules, correlations, sequences, episodes, classifiers, clusters and many more of which the mining of association rules is one of the most popular problems. The
On the Complexity of Generating Maximal Frequent and Minimal Infrequent Sets, 2002
Cited by 35 (9 self)
Let A be an m × n binary matrix, t ∈ {1, ..., m} be a threshold, and ε > 0 be a positive parameter. We show that given a family of O(n^ε) maximal t-frequent column sets for A, it is NP-complete to decide whether A has any further maximal t-frequent sets, or not, even when the number of such additional maximal t-frequent column sets may be exponentially large. In contrast, all minimal t-infrequent sets of columns of A can be enumerated in incremental quasi-polynomial time. The proof of the latter result follows from the inequality α ≤ (m − t + 1)β, where α and β are respectively the numbers of all maximal t-frequent and all minimal t-infrequent sets of columns of the matrix A. We also discuss the complexity of generating all closed t-frequent column sets for a given binary matrix.
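To make the two families in this abstract concrete, here is a brute-force Python sketch for a tiny matrix. It assumes the usual definition that a column set is t-frequent when at least t rows contain 1 in every column of the set. It enumerates all subsets, so it is exponential and says nothing about the paper's enumeration algorithm; the function names and the example matrix are hypothetical. The final check reads the abstract's inequality as: the number of maximal t-frequent sets is at most (m − t + 1) times the number of minimal t-infrequent sets, which is an interpretation of the garbled text rather than a quotation.

```python
from itertools import combinations

def support(matrix, cols):
    """Number of rows that have a 1 in every column of cols."""
    return sum(all(row[c] for c in cols) for row in matrix)

def borders(matrix, t):
    """Return (maximal t-frequent sets, minimal t-infrequent sets) by brute force."""
    n = len(matrix[0])
    subsets = [frozenset(c) for k in range(n + 1)
               for c in combinations(range(n), k)]
    frequent = {s for s in subsets if support(matrix, s) >= t}
    infrequent = set(subsets) - frequent
    # Frequency is anti-monotone, so checking one-column changes suffices.
    maximal = {s for s in frequent
               if all(s | {c} not in frequent for c in range(n) if c not in s)}
    minimal = {s for s in infrequent
               if all(s - {c} in frequent for c in s)}
    return maximal, minimal

# Hypothetical 4 x 3 example matrix and threshold.
A = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 1],
]
t = 2
maximal, minimal = borders(A, t)
print(sorted(sorted(s) for s in maximal))
print(sorted(sorted(s) for s in minimal))
print(len(maximal) <= (len(A) - t + 1) * len(minimal))
```

For this matrix every pair of columns is 2-frequent and the full column set is the only minimal 2-infrequent set, so the bound is met with equality.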
Inductive Databases and Condensed Representations for Data Mining, 1997
Cited by 34 (2 self)
Knowledge discovery in databases and data mining aim at semiautomatic tools for the analysis of large data sets. It can be argued that several data mining tasks consist of locating interesting sentences from a given logic that are true in the database. Then the task of the user/analyst is to query this set, the theory of the database. This view gives rise to the concept of inductive databases, i.e., databases that in addition to the data contain also inductive generalizations about the data. We describe a rough framework for inductive databases, and consider also condensed representations, data structures that make it possible to answer queries about the inductive database approximately correctly and reasonably efficiently. 1 Introduction Knowledge discovery in databases (KDD), often called data mining, aims at the discovery of useful information from large collections of data. The discovered knowledge can be rules describing properties of the data, frequently occurring patte...
A Theory of Inductive Query Answering, 2002
Cited by 30 (0 self)
We introduce the boolean inductive query evaluation problem, which is concerned with answering inductive queries that are arbitrary boolean expressions over monotonic and anti-monotonic predicates. Secondly, we develop a decomposition theory for inductive query evaluation in which a boolean query Q is reformulated into k sub-queries Q_i = Q_A ∧ Q_M that are the conjunction of a monotonic and an anti-monotonic predicate. The solution to each subquery can be represented using a version space. We investigate how the number of version spaces k needed to answer the query can be minimized. Thirdly, for the pattern domain of strings, we show how the version spaces can be represented using a novel data structure, called the version space tree, and can be computed using a variant of the famous Apriori algorithm. Finally, we present some experiments that validate the approach.
Mining Frequent Itemsets in Evolving Databases - In Proc. of the 2nd SIAM Int'l Conf. on Data Mining, 2002
Cited by 30 (9 self)
Most current work in data mining assumes that the database is static, and a database update requires rediscovering all the patterns by scanning the entire old and new database. Such approaches can waste a lot of computational and I/O resources, and result in relatively slow response times, to essentially an interactive process. In this paper, we consider this problem within the context of association rule mining, a key data mining task. We propose a new approach to maintaining associations in evolving databases. Unlike prior approaches, where all the frequent associations and potentially frequent associations are maintained across database updates, we choose instead to maintain the set of maximally frequent associations, an information lossless approach. This approach results in significant I/O and computational savings. Additional highlights of the proposed approach include interactive support for windowed operations (computing the association rules over a specific time-interval) and tracking stable associations. Extensive experimental benchmarking on real and synthetic datasets demonstrate the potential advantages of the proposed approach.

