Constraint-based rule mining in large, dense databases
, 1999
Cited by 151 (3 self)
Constraint-based rule miners find all rules in a given dataset meeting user-specified constraints such as minimum support and confidence. We describe a new algorithm that directly exploits all user-specified constraints including minimum support, minimum confidence, and a new constraint that ensures every mined rule offers a predictive advantage over any of its simplifications. Our algorithm maintains efficiency even at low supports on data that is dense (e.g. relational data). Previous approaches such as Apriori and its variants exploit only the minimum support constraint, and as a result are ineffective on dense data due to a combinatorial explosion of “frequent itemsets”.
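The three constraints named in this abstract can be illustrated with a brute-force sketch. It is not the paper's algorithm: the function names, the toy transactions, and the reading of "predictive advantage" as the smallest confidence gain over any rule with a proper subset of the antecedent are assumptions made for illustration.

```python
from itertools import combinations

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent); 0.0 if the antecedent never occurs."""
    antecedent = frozenset(antecedent)
    covered = [t for t in transactions if antecedent <= t]
    if not covered:
        return 0.0
    return sum(1 for t in covered if consequent in t) / len(covered)

def improvement(transactions, antecedent, consequent):
    """Smallest confidence gain over any simplification of the rule.

    Assumption: a simplification keeps the consequent and uses a proper
    subset of the antecedent (the empty antecedent included).
    """
    antecedent = tuple(antecedent)
    conf = confidence(transactions, antecedent, consequent)
    simpler = (
        subset
        for size in range(len(antecedent))
        for subset in combinations(antecedent, size)
    )
    # An empty antecedent has no simplifications; fall back to its confidence.
    return min(
        (conf - confidence(transactions, s, consequent) for s in simpler),
        default=conf,
    )

# Hypothetical toy data: each transaction is a set of items.
transactions = [frozenset(t) for t in (
    {"a", "b", "c"}, {"a", "b", "c"}, {"a", "c"}, {"a"}, {"b"}, {"c"},
)]

rule_support = support(transactions, {"a", "b", "c"})
rule_confidence = confidence(transactions, {"a", "b"}, "c")
rule_improvement = improvement(transactions, {"a", "b"}, "c")
```

A miner in this style would keep the rule {a, b} -> c only if all three values clear their user-chosen thresholds. The exhaustive subset enumeration here is exponential in the antecedent size, which is exactly the cost a real constraint-exploiting miner has to avoid.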
Rule discovery from time series
 In Proceedings of the 1997 ACM SIGKDD International Conference, ACM SIGKDD
, 1997
Cited by 142 (0 self)
We consider the problem of finding rules relating patterns in a time series to other patterns in that series, or patterns in one series to patterns in another series. A simple example is a rule such as "a period of low telephone call activity is usually followed by a sharp rise in call volume". Examples of rules relating two or more time series are "if the Microsoft stock price goes up and Intel falls, then IBM goes up the next day," and "if Microsoft goes up strongly for one day, then declines strongly on the next day, and on the same days Intel stays about level, then IBM stays about level." Our emphasis is in the discovery of local patterns in multivariate time series, in contrast to traditional time series analysis which largely focuses on global models. Thus, we search for rules whose conditions refer to patterns in time series. However, we do not want to define beforehand which patterns are to be used; rather, we want the patterns to be formed from the data in the context of rule discovery. We describe adaptive methods for finding rules of the above type from time-series data. The methods are based on discretizing the sequence by methods resembling vector quantization. We first form subsequences by sliding a window through the time series, and then cluster these subsequences by using a suitable measure of time-series similarity. The discretized version of the time series is obtained by taking the cluster identifiers corresponding to the subsequences. Once the time series is discretized, we use simple rule finding methods to obtain rules from the sequence. We present empirical results on the behavior of the method.
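The pipeline in this abstract (slide a window, cluster the subsequences, replace each window by its cluster identifier, then look for rules over the symbols) can be sketched as follows. The greedy radius-based clustering, the Euclidean distance, the parameter values, and all function names are assumptions chosen for brevity, not the similarity measure or clustering method the authors evaluate.

```python
import math

def sliding_windows(series, width):
    """All contiguous subsequences of length `width`, in order."""
    return [tuple(series[i:i + width]) for i in range(len(series) - width + 1)]

def greedy_cluster(windows, radius):
    """Assign each window to the first center within `radius`, else open a new cluster."""
    centers, labels = [], []
    for w in windows:
        for idx, center in enumerate(centers):
            if math.dist(w, center) <= radius:
                labels.append(idx)
                break
        else:
            centers.append(w)
            labels.append(len(centers) - 1)
    return centers, labels

def discretize(series, width, radius):
    """Symbol sequence: one cluster identifier per window position."""
    _, labels = greedy_cluster(sliding_windows(series, width), radius)
    return labels

def followed_by_confidence(symbols, a, b, lag=1):
    """Fraction of occurrences of symbol `a` that are followed by `b` after `lag` steps."""
    starts = [i for i in range(len(symbols) - lag) if symbols[i] == a]
    if not starts:
        return 0.0
    return sum(1 for i in starts if symbols[i + lag] == b) / len(starts)

# Hypothetical toy series with a repeating shape.
series = [0, 0, 1, 5, 5, 0, 0, 1, 5, 5]
symbols = discretize(series, width=2, radius=1.5)
rule_conf = followed_by_confidence(symbols, a=1, b=2, lag=1)
```

On this toy input the rising window (1, 5) becomes symbol 1 and the high plateau (5, 5) becomes symbol 2, so the rule "symbol 1 is followed by symbol 2 one step later" holds every time it can be checked.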
Unexpectedness as a Measure of Interestingness in Knowledge Discovery
 In Proceedings of the First International Conference on Knowledge Discovery and Data Mining
, 1999
Cited by 140 (9 self)
Organizations are taking advantage of "data-mining" techniques to leverage the vast amounts of data captured as they process routine transactions. Data-mining is the process of discovering hidden structure or patterns in data. However, several of the pattern discovery methods in data-mining systems have the drawbacks that they discover too many obvious or irrelevant patterns and that they do not leverage to a full extent valuable prior domain knowledge that managers have. This research addresses these drawbacks by developing ways to generate interesting patterns by incorporating managers' prior knowledge in the process of searching for patterns in data. Specifically we focus on providing methods that generate unexpected patterns with respect to managerial intuition by eliciting managers' beliefs about the domain and using these beliefs to seed the search for unexpected patterns in data. Our approach should lead to the development of decision support systems that provide managers with mor...
An Algorithm for Multi-Relational Discovery of Subgroups
, 1997
Cited by 134 (8 self)
We consider the problem of finding statistically unusual subgroups in a multi-relation database, and extend previous work on single-relation subgroup discovery. We give a precise definition of the multi-relation subgroup discovery task, propose a specific form of declarative bias based on foreign links as a means of specifying the hypothesis space, and show how propositional evaluation functions can be adapted to the multi-relation setting. We then describe an algorithm for this problem setting that uses optimistic estimate and minimal support pruning, an optimal refinement operator and sampling to ensure efficiency and can easily be parallelized.
An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback
 In proceedings of the 4th Int'l Conference on Knowledge Discovery and Data Mining
Cited by 130 (24 self)
We introduce an extended representation of time series that allows fast, accurate classification and clustering in addition to the ability to explore time series data in a relevance feedback framework. The representation consists of piecewise linear segments to represent shape and a weight vector that contains the relative importance of each individual linear segment. In the classification context, the weights are learned automatically as part of the training cycle. In the relevance feedback context, the weights are determined by an interactive and iterative process in which users rate various choices presented to them. Our representation allows a user to define a variety of similarity measures that can be tailored to specific domains. We demonstrate our approach on space telemetry, medical and synthetic data.
Discovery of frequent Datalog patterns
, 1999
Cited by 128 (9 self)
Discovery of frequent patterns has been studied in a variety of data mining settings. In its simplest form, known from association rule mining, the task is to discover all frequent itemsets, i.e., all combinations of items that are found in a sufficient number of examples. The fundamental task of association rule and frequent set discovery has been extended in various directions, allowing more useful patterns to be discovered with special purpose algorithms. We present Warmr, a general purpose inductive logic programming algorithm that addresses frequent query discovery: a very general Datalog formulation of the frequent pattern discovery problem.
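The "simplest form" mentioned above, finding every itemset that occurs in at least a given number of examples, can be sketched with a level-wise search. This is only the propositional base case, not Warmr or its Datalog query language; the function name, the `min_count` parameter, and the toy examples are assumptions.

```python
from itertools import combinations

def frequent_itemsets(examples, min_count):
    """Map each itemset found in at least `min_count` examples to its count."""
    examples = [frozenset(e) for e in examples]
    items = sorted({item for e in examples for item in e})
    frequent = {}
    size = 1
    candidates = [frozenset([item]) for item in items]
    while candidates:
        # Count every candidate of the current size against all examples.
        counts = {c: sum(1 for e in examples if c <= e) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        size += 1
        # Build the next level: unions of two frequent sets that have the new
        # size and whose subsets one item smaller are all frequent.
        next_level = set()
        for a in level:
            for b in level:
                union = a | b
                if len(union) == size and all(
                    frozenset(s) in level for s in combinations(union, size - 1)
                ):
                    next_level.add(union)
        candidates = list(next_level)
    return frequent

# Hypothetical toy examples.
examples = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = frequent_itemsets(examples, min_count=2)
```

The subset check is what keeps the search from counting {bread, milk, butter} here: one of its two-item subsets, {milk, butter}, is already known to be infrequent.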
Mining the Most Interesting Rules
, 1999
Cited by 124 (1 self)
Several algorithms have been proposed for finding the “best,” “optimal,” or “most interesting” rule(s) in a database according to a variety of metrics including confidence, support, gain, chi-squared value, gini, entropy gain, laplace, lift, and conviction. In this paper, we show that the best rule according to any of these metrics must reside along a support/confidence border. Further, in the case of conjunctive rule mining within categorical data, the number of rules along this border is conveniently small, and can be mined efficiently from a variety of real-world datasets. We also show how this concept can be generalized to mine all rules that are best according to any of these criteria with respect to an arbitrary subset of the population of interest. We argue that by returning a broader set of rules than previous algorithms, our techniques allow for improved insight into the data and support more user interaction in the optimized rule-mining process.
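One way to picture a support/confidence border is as the set of rules that no other rule beats on both measures at once. The sketch below assumes that reading (a Pareto-style upper border) and uses made-up rule names and values; the paper's own definition of the border may differ in detail.

```python
def sc_border(rules):
    """Names of rules not dominated in both support and confidence.

    `rules` is a list of (name, support, confidence) tuples. A rule is
    dominated if another rule is at least as good on both measures and
    strictly better on at least one.
    """
    border = []
    for name, sup, conf in rules:
        dominated = any(
            s >= sup and c >= conf and (s > sup or c > conf)
            for _, s, c in rules
        )
        if not dominated:
            border.append(name)
    return border

# Hypothetical rules with precomputed support and confidence.
rules = [
    ("r1", 0.50, 0.60),
    ("r2", 0.30, 0.90),
    ("r3", 0.30, 0.70),
    ("r4", 0.10, 0.95),
    ("r5", 0.40, 0.55),
]
border = sc_border(rules)
```

Under this assumption, any scoring function that never decreases when support or confidence increases takes its maximum on one of the returned rules, which is why a small border is useful in practice.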
Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets
 Journal of Artificial Intelligence Research
, 1997
Cited by 122 (19 self)
This paper introduces new algorithms and data structures for quick counting for machine learning datasets. We focus on the counting task of constructing contingency tables, but our approach is also applicable to counting the number of records in a dataset that match conjunctive queries. Subject to certain assumptions, the costs of these operations can be shown to be independent of the number of records in the dataset and loglinear in the number of non-zero entries in the contingency table. We provide a very sparse data structure, the ADtree, to minimize memory use. We provide analytical worst-case bounds for this structure for several models of data distribution. We empirically demonstrate that tractably-sized data structures can be produced for large real-world datasets by (a) using a sparse tree structure that never allocates memory for counts of zero, (b) never allocating memory for counts that can be deduced from other counts, and (c) not bothering to expand the tree fully near its...
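The counting task itself, and the idea of answering later queries from cached counts rather than from the records, can be shown with a small sketch. This is not the ADtree: it is a plain sparse dictionary of non-zero counts, and the function names and toy records are assumptions.

```python
from collections import Counter

def contingency_table(records, attributes):
    """Sparse contingency table: only value combinations that occur get an entry."""
    return Counter(tuple(record[a] for a in attributes) for record in records)

def marginalize(table, attributes, keep):
    """Contingency table over `keep`, derived from cached counts without rescanning records."""
    positions = [attributes.index(a) for a in keep]
    out = Counter()
    for key, n in table.items():
        out[tuple(key[p] for p in positions)] += n
    return out

# Hypothetical toy dataset.
records = [
    {"color": "red", "size": "S", "label": 1},
    {"color": "red", "size": "L", "label": 0},
    {"color": "blue", "size": "S", "label": 1},
    {"color": "red", "size": "S", "label": 1},
]
attributes = ("color", "size", "label")

full = contingency_table(records, attributes)   # one pass over the data
by_color = marginalize(full, attributes, ("color",))
```

Once `full` is built, the cost of `marginalize` depends on the number of non-zero entries in the table, not on the number of records, which is the flavor of the guarantee the abstract describes.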
Finding Frequent Substructures in Chemical Compounds
, 1998
Cited by 116 (9 self)
The discovery of the relationships between chemical structure and biological function is central to biological science and medicine. In this paper we apply data mining to the problem of predicting chemical carcinogenicity. This toxicology application was launched at IJCAI'97 as a research challenge for artificial intelligence. Our approach to the problem is descriptive rather than based on classification; the goal being to find common substructures and properties in chemical compounds, and in this way to contribute to scientific insight. This approach contrasts with previous machine learning research on this problem, which has mainly concentrated on predicting the toxicity of unknown chemicals. Our contribution to the field of data mining is the ability to discover useful frequent patterns that are beyond the complexity of association rules or their known variants. This is vital to the problem, which requires the discovery of patterns that are out of the reach of simple transformations...
Integrating association rule mining with relational database systems: Alternatives and implications
 Data Mining and Knowledge Discovery
Cited by 110 (5 self)
Data mining on large data warehouses is becoming increasingly important. In support of this trend, we consider a spectrum of architectural alternatives for coupling mining with database systems. These alternatives include: loose-coupling through a SQL cursor interface; encapsulation of a mining algorithm in a stored procedure; caching the data to a file system on-the-fly and mining; tight-coupling using primarily user-defined functions; and SQL implementations for processing in the DBMS. We comprehensively study the option of expressing the mining algorithm in the form of SQL queries using association rule mining as a case in point. We consider four options in SQL-92 and six options in SQL enhanced with object-relational extensions (SQL-OR). Our evaluation of the different architectural alternatives shows that from a performance perspective, the Cache option is superior, although the performance of the SQL-OR option is within a factor of two. Both the Cache and the SQL-OR approaches incur a higher storage penalty than the loose-coupling approach, which performance-wise is a factor of 3 to 4 worse than Cache. The SQL-92 implementations were too slow to qualify as a competitive option. We also compare these alternatives on the basis of qualitative factors like automatic parallelization, development ease, portability and interoperability. As a byproduct of this study, we identify some primitives for native support in database systems for decision-support applications.
Keywords: mining system architecture, association rule mining, database mining, mining algorithms in SQL