Results 11 - 20
of
319
Unexpectedness as a Measure of Interestingness in Knowledge Discovery
- In Proceedings of the First International Conference on Knowledge Discovery and Data Mining
, 1999
"... Organizations are taking advantage of "data-mining" techniques to leverage the vast amounts of data captured as they process routine transactions. Data-mining is the process of discovering hidden structure or patterns in data. However several of the pattern discovery methods in datamining systems ha ..."
Abstract
-
Cited by 121 (8 self)
- Add to MetaCart
Organizations are taking advantage of "data-mining" techniques to leverage the vast amounts of data captured as they process routine transactions. Data-mining is the process of discovering hidden structure or patterns in data. However several of the pattern discovery methods in datamining systems have the drawbacks that they discover too many obvious or irrelevant patterns and that they do not leverage to a full extent valuable prior domain knowledge that managers have. This research addresses these drawbacks by developing ways to generate interesting patterns by incorporating managers' prior knowledge in the process of searching for patterns in data. Specifically we focus on providing methods that generate unexpected patterns with respect to managerial intuition by eliciting managers' beliefs about the domain and using these beliefs to seed the search for unexpected patterns in data. Our approach should lead to the development of decision support systems that provide managers with mor...
Rule Discovery From Time Series
, 1998
"... We consider the problem of finding rules relating patterns in a time series to other patterns in that series, or patterns in one series to patterns in another series. A simple example is a rule such as "a period of low telephone call activity is usually followed by a sharp rise in call volume". ..."
Abstract
-
Cited by 120 (0 self)
- Add to MetaCart
We consider the problem of finding rules relating patterns in a time series to other patterns in that series, or patterns in one series to patterns in another series. A simple example is a rule such as "a period of low telephone call activity is usually followed by a sharp rise in call volume". Examples of rules relating two or more time series are "if the Microsoft stock price goes up and Intel falls, then IBM goes up the next day," and "if Microsoft goes up strongly for one day, then declines strongly on the next day, and on the same days Intel stays about level, then IBM stays about level." Our emphasis is in the discovery of local patterns in multivariate time series, in contrast to traditional time series analysis which largely focuses on global models. Thus, we search for rules whose conditions refer to patterns in time series. However, we do not want to define beforehand which patterns are to be used; rather, we want the patterns to be formed from the data in t...
Discovery of frequent Datalog patterns
, 1999
"... Discovery of frequent patterns has been studied in a variety of data mining settings. In its simplest form, known from association rule mining, the task is to discover all frequent itemsets, i.e., all combinations of items that are found in a sufficient number of examples. The fundamental task of as ..."
Abstract
-
Cited by 115 (8 self)
- Add to MetaCart
Discovery of frequent patterns has been studied in a variety of data mining settings. In its simplest form, known from association rule mining, the task is to discover all frequent itemsets, i.e., all combinations of items that are found in a sufficient number of examples. The fundamental task of association rule and frequent set discovery has been extended in various directions, allowing more useful patterns to be discovered with special purpose algorithms. We present Warmr, a general purpose inductive logic programming algorithm that addresses frequent query discovery: a very general Datalog formulation of the frequent pattern discovery problem.
Mining the Most Interesting Rules
, 1999
"... Several algorithms have been proposed for finding the “best, ” “optimal,” or “most interesting ” rule(s) in a database according to a variety of metrics including confidence, support, gain, chi-squared value, gini, entropy gain, laplace, lift, and conviction. In this paper, we show that the best rul ..."
Abstract
-
Cited by 110 (1 self)
- Add to MetaCart
Several algorithms have been proposed for finding the “best, ” “optimal,” or “most interesting ” rule(s) in a database according to a variety of metrics including confidence, support, gain, chi-squared value, gini, entropy gain, laplace, lift, and conviction. In this paper, we show that the best rule according to any of these metrics must reside along a support/confidence border. Further, in the case of conjunctive rule mining within categorical data, the number of rules along this border is conveniently small, and can be mined efficiently from a variety of real-world data-sets. We also show how this concept can be generalized to mine all rules that are best according to any of these criteria with respect to an arbitrary subset of the population of interest. We argue that by returning a broader set of rules than previous algorithms, our techniques allow for improved insight into the data and support more user-interaction in the optimized rule-mining process. 1.
Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets
- Journal of Artificial Intelligence Research
, 1997
"... This paper introduces new algorithms and data structures for quick counting for machine learning datasets. We focus on the counting task of constructing contingency tables, but our approach is also applicable to counting the number of records in a dataset that match conjunctive queries. Subject to c ..."
Abstract
-
Cited by 108 (17 self)
- Add to MetaCart
This paper introduces new algorithms and data structures for quick counting for machine learning datasets. We focus on the counting task of constructing contingency tables, but our approach is also applicable to counting the number of records in a dataset that match conjunctive queries. Subject to certain assumptions, the costs of these operations can be shown to be independent of the number of records in the dataset and loglinear in the number of non-zero entries in the contingency table. We provide a very sparse data structure, the ADtree, to minimize memory use. We provide analytical worst-case bounds for this structure for several models of data distribution. We empirically demonstrate that tractably-sized data structures can be produced for large real-world datasets by (a) using a sparse tree structure that never allocates memory for counts of zero, (b) never allocating memory for counts that can be deduced from other counts, and (c) not bothering to expand the tree fully near its...
An Algorithm for Multi-Relational Discovery of Subgroups
, 1997
"... We consider the problem of finding statistically unusual subgroups in a multi-relation database, and extend previous work on singlerelation subgroup discovery. We give a precise definition of the multirelation subgroup discovery task, propose a specific form of declarative bias based on foreign ..."
Abstract
-
Cited by 105 (8 self)
- Add to MetaCart
We consider the problem of finding statistically unusual subgroups in a multi-relation database, and extend previous work on singlerelation subgroup discovery. We give a precise definition of the multirelation subgroup discovery task, propose a specific form of declarative bias based on foreign links as a means of specifying the hypothesis space, and show how propositional evaluation functions can be adapted to the multi-relation setting. We then describe an algorithm for this problem setting that uses optimistic estimate and minimal support pruning, an optimal refinement operator and sampling to ensure efficiency and can easily be parallelized.
Finding Frequent Substructures in Chemical Compounds
, 1998
"... The discovery of the relationships between chemical structure and biological function is central to biological science and medicine. In this paper we apply data mining to the problem of predicting chemical carcinogenicity. This toxicology application was launched at IJCAI'97 as a research challenge ..."
Abstract
-
Cited by 96 (8 self)
- Add to MetaCart
The discovery of the relationships between chemical structure and biological function is central to biological science and medicine. In this paper we apply data mining to the problem of predicting chemical carcinogenicity. This toxicology application was launched at IJCAI'97 as a research challenge for artificial intelligence. Our approach to the problem is descriptive rather than based on classification; the goal being to find common substructures and properties in chemical compounds, and in this way to contribute to scientific insight. This approach contrasts with previous machine learning research on this problem, which has mainly concentrated on predicting the toxicity of unknown chemicals. Our contribution to the field of data mining is the ability to discover useful frequent patterns that are beyond the complexity of association rules or their known variants. This is vital to the problem, which requires the discovery of patterns that are out of the reach of simple transformations...
A Belief-Driven Method for Discovering Unexpected Patterns
, 1998
"... Several pattern discovery methods proposed in the data mining literature have the drawbacks that they discover too many obvious or irrelevant patterns and that they do not leverage to a full extent valuable prior domain knowledge that decision makers have. In this paper we propose a new method ..."
Abstract
-
Cited by 87 (9 self)
- Add to MetaCart
Several pattern discovery methods proposed in the data mining literature have the drawbacks that they discover too many obvious or irrelevant patterns and that they do not leverage to a full extent valuable prior domain knowledge that decision makers have. In this paper we propose a new method of discovery that addresses these drawbacks. In particular we propose a new method of discovering unexpected patterns that takes into consideration prior background knowledge of decision makers. This prior knowledge constitutes a set of expectations or beliefs about the problem domain. Our proposed method of discovering unexpected patterns uses these beliefs to seed the search for patterns in data that contradict the beliefs. To evaluate the practicality of our approach, we applied our algorithm to consumer purchase data from a major market research company and to web logfile data tracked at an academic Web site and present our findings in the paper.
Multiple uses of frequent sets and condensed representations (Extended Abstract)
- In Proc. KDD Int. Conf. Knowledge Discovery in Databases
, 1996
"... In interactive data mining it is advantageous to have condensed representations of data that can be used to efficiently answer different queries. In this paper we show how frequent sets can be used as a condensed representation for answering various types of queries. Given a table r with 0/1 values ..."
Abstract
-
Cited by 84 (7 self)
- Add to MetaCart
In interactive data mining it is advantageous to have condensed representations of data that can be used to efficiently answer different queries. In this paper we show how frequent sets can be used as a condensed representation for answering various types of queries. Given a table r with 0/1 values and a threshold oe, a frequent set of r is a set X of columns of r such that at least a fraction oe of the rows of r have a 1 in all the columns of X. Finding frequent sets is a first step in finding association rules, and there exists several efficient algorithms for finding the frequent sets. We show that frequent sets have wider applications than just finding association rules. We show that using the inclusion-exclusion principle one can obtain approximate confidences of arbitrary boolean rules. We derive bounds for the errors in the confidences, and show that information collected during the computation of frequent sets can also be used to provide individual error bounds for each clause...
Scalable Techniques for Mining Causal Structures
- Data Mining and Knowledge Discovery
, 1998
"... Mining for association rules in market basket data has proved a fruitful area of research. Measures such as conditional probability (confidence) and correlation have been used to infer rules of the form "the existence of item A implies the existence of item B." However, such rules indicate only a st ..."
Abstract
-
Cited by 82 (1 self)
- Add to MetaCart
Mining for association rules in market basket data has proved a fruitful area of research. Measures such as conditional probability (confidence) and correlation have been used to infer rules of the form "the existence of item A implies the existence of item B." However, such rules indicate only a statistical relationship between A and B. They do not specify the nature of the relationship: whether the presence of A causes the presence of B, or the converse, or some other attribute or phenomenon causes both to appear together. In applications, knowing such causal relationships is extremely useful for enhancing understanding and effecting change. While distinguishing causality from correlation is a truly difficult problem, recent work in statistics and Bayesian learning provide some avenues of attack. In these fields, the goal has generally been to learn complete causal models, which are essentially impossible to learn in large-scale data mining applications with a large number of variab...

