Understanding the crucial role of attribute interaction in data mining
Artif. Intel. Rev., 2001
Cited by 48 (14 self)
This is a review paper whose goal is to significantly improve our understanding of the crucial role of attribute interaction in data mining. The main contributions of this paper are as follows. Firstly, we show that the concept of attribute interaction has a crucial role across different kinds of problems in data mining, such as attribute construction, coping with small disjuncts, induction of first-order logic rules, detection of Simpson’s paradox, and finding several types of interesting rules. Hence, a better understanding of attribute interaction can lead to a better understanding of the relationship between these kinds of problems, which are usually studied separately from each other. Secondly, we draw attention to the fact that most rule induction algorithms are based on a greedy search, which does not cope well with the problem of attribute interaction, and point out some alternative kinds of rule discovery methods which tend to cope better with this problem. Thirdly, we discuss several algorithms and methods for discovering interesting knowledge that, implicitly or explicitly, are based on the concept of attribute interaction.
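The point about greedy search can be illustrated with a small sketch. It is not taken from the paper: the XOR dataset, the choice of information gain as the greedy criterion, and the function names are all assumptions made for illustration. Each attribute on its own gives zero information gain, so a greedy learner that evaluates one attribute at a time sees no reason to pick either, yet the two attributes together determine the class completely.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attrs):
    """Reduction in class entropy after partitioning on the given attribute indices."""
    groups = {}
    for row, label in zip(rows, labels):
        key = tuple(row[a] for a in attrs)
        groups.setdefault(key, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: the class is the XOR of two binary attributes.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [a ^ b for a, b in rows]

gain_a = information_gain(rows, labels, [0])      # first attribute alone
gain_b = information_gain(rows, labels, [1])      # second attribute alone
gain_ab = information_gain(rows, labels, [0, 1])  # both attributes jointly

print(gain_a, gain_b, gain_ab)
```

Evaluating attributes jointly, as in the last call, is one simple way to expose the interaction; the cost is that the number of attribute combinations grows quickly.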
Knowledge discovery and interestingness measures: A survey
1999
Cited by 48 (1 self)
Knowledge discovery in databases, also known as data mining, is the efficient discovery of previously unknown, valid, novel, potentially useful, and understandable patterns in large databases. It encompasses many different techniques and algorithms which differ in the kinds of data that can be analyzed and the form of knowledge representation used to convey the discovered knowledge. An important problem in the area of data mining is the development of effective measures of interestingness for ranking the discovered knowledge. In this report, we provide a general overview of the more successful and widely known data mining techniques and algorithms, and survey seventeen interestingness measures from the literature that have been successfully employed in data mining applications.
Evaluation of Interestingness Measures for Ranking Discovered Knowledge
Lecture Notes in Computer Science, 2001
Cited by 28 (0 self)
When mining a large database, the number of patterns discovered can easily exceed the capabilities of a human user to identify interesting results. To address this problem, various techniques have been suggested to reduce and/or order the patterns prior to presenting them to the user. In this paper, our focus is on ranking summaries generated from a single dataset, where attributes can be generalized in many different ways and to many levels of granularity according to taxonomic hierarchies. We theoretically and empirically evaluate thirteen diversity measures used as heuristic measures of interestingness for ranking summaries generated from databases. The thirteen diversity measures have previously been utilized in various disciplines, such as information theory, statistics, ecology, and economics. We describe five principles that any measure must satisfy to be considered useful for ranking summaries. Theoretical results show that only four of the thirteen diversity measures satisfy all of the principles. We then analyze the distribution of the index values generated by each of the thirteen diversity measures. Empirical results, obtained using synthetic data, show that the distribution of index values generated tends to be highly skewed about the mean, median, and middle index values. The objective of this work is to gain some insight into the behaviour that can be expected from each of the measures in practice.
Managing Interesting Rules in Sequence Mining
In PKDD, 1999
Cited by 24 (0 self)
The goal of sequence mining is the discovery of interesting sequences of events. Conventional sequence miners, however, discover only frequent sequences. This limits the applicability of sequence mining in domains such as error detection and web usage analysis. We propose a framework for discovering and maintaining interesting rules and beliefs in the context of sequence mining. We transform frequent sequences discovered by a conventional miner into sequence rules, remove redundant rules, and organize the remaining ones into interestingness categories, from which unexpected rules and new beliefs are derived.
1 Introduction
Data miners pursue the discovery of new knowledge. But knowledge based solely on statistical dominance is rarely new. The expert needs means for either instructing the miner to discover only interesting rules or for ranking the mining results by "interestingness" [7]. Tuzhilin et al. propose interestingness measures based on the notion of belief [1, 6]. A belief r...
A Critical Review of Multi-Objective Optimization in Data Mining: a position paper
ACM SIGKDD Explorations, 2004
Cited by 19 (4 self)
This paper addresses the problem of how to evaluate the quality of a model built from the data in a multi-objective optimization scenario, where two or more quality criteria must be simultaneously optimized. A typical example is a scenario where one wants to maximize both the accuracy and the simplicity of a classification model or a candidate attribute subset in attribute selection. One reviews three very different approaches to cope with this problem, namely: (a) transforming the original multi-objective problem into a single-objective problem by using a weighted formula; (b) the lexicographical approach, where the objectives are ranked in order of priority; and (c) the Pareto approach, which consists of finding as many non-dominated solutions as possible and returning the set of non-dominated solutions to the user. One also presents a critical review of the case for and against each of these approaches. The general conclusions are that the weighted formula approach, which is by far the most used in the data mining literature, is to a large extent an ad hoc approach for multi-objective optimization, whereas the lexicographic and the Pareto approaches are more principled, and therefore deserve more attention from the data mining community.
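The three approaches named in this abstract can be contrasted on a toy example. The sketch below is illustrative only and is not the paper's own formulation: the candidate models, their scores, the weights, and the tolerance used in the lexicographic variant are all hypothetical. Both objectives (accuracy and simplicity) are assumed to be maximized.

```python
# Each hypothetical candidate is (name, accuracy, simplicity); higher is better for both.
candidates = [
    ("A", 0.90, 0.20),
    ("B", 0.895, 0.60),
    ("C", 0.80, 0.90),
    ("D", 0.78, 0.50),
]


def weighted_best(cands, w_acc=0.7, w_simp=0.3):
    """(a) Weighted formula: collapse both objectives into one score.
    The weights are arbitrary, which is the main criticism of this approach."""
    return max(cands, key=lambda c: w_acc * c[1] + w_simp * c[2])


def lexicographic_best(cands, tolerance=0.01):
    """(b) Lexicographic: accuracy has priority; simplicity only breaks ties
    among candidates whose accuracy is within `tolerance` of the best.
    The tolerance threshold is an assumption of this sketch."""
    best_acc = max(c[1] for c in cands)
    near_best = [c for c in cands if best_acc - c[1] <= tolerance]
    return max(near_best, key=lambda c: c[2])


def dominates(a, b):
    """True if a is at least as good as b on both objectives and better on one."""
    return a[1] >= b[1] and a[2] >= b[2] and (a[1] > b[1] or a[2] > b[2])


def pareto_front(cands):
    """(c) Pareto: return every candidate that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


print("weighted:", weighted_best(candidates)[0])
print("lexicographic:", lexicographic_best(candidates)[0])
print("pareto front:", [c[0] for c in pareto_front(candidates)])
```

On this data the three approaches disagree: the weighted formula and the lexicographic rule each commit to a single (different) model, while the Pareto approach returns a set and leaves the final trade-off to the user.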
Reducing Redundancy in Characteristic Rule Discovery by Using IP-Techniques
In Intelligent Data Analysis Journal, 2000
Cited by 12 (1 self)
The discovery of characteristic rules is a well-known data mining technique and has led to several successful applications. Unfortunately, a (very) large number of rules is typically discovered during the mining stage. This makes monitoring and control of these rules extremely costly and difficult. Therefore, a selection of the most promising rules is desirable. In this paper, we propose an integer programming model to solve the problem of selecting the most promising subset of characteristic rules. The proposed technique makes it possible to enforce a user-defined level of overall model quality while maximally reducing the redundancy present in the original rule set. We use real-world data to evaluate the performance of the proposed technique against the well-known RuleCover heuristic.
1 Introduction
Data mining is the automated search for hidden, previously unknown and potentially useful information from large databases. Moreover, data mining is a crucial pha...
Heuristic Measures of Interestingness
Proceedings of the Third European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD'99), 1999
Cited by 11 (3 self)
The tuples in a generalized relation (i.e., a summary generated from a database) are unique, and therefore, can be considered to be a population with a structure that can be described by some probability distribution. In this paper, we present and empirically compare sixteen heuristic measures that evaluate the structure of a summary to assign a single real-valued index that represents its interestingness relative to other summaries generated from the same database. The heuristics are based upon well-known measures of diversity, dispersion, dominance, and inequality used in several areas of the physical, social, ecological, management, information, and computer sciences. Their use for ranking summaries generated from databases is a new application area. All sixteen heuristics rank less complex summaries (i.e., those with few tuples and/or few non-ANY attributes) as most interesting. We demonstrate that for sample data sets, the order in which some of the measures rank summaries is highly correlated.
Monitoring the Evolution of Web Usage Patterns
Lecture Notes in Computer Science, 2004
Cited by 9 (1 self)
With the ongoing shift from offline to online business processes, the Web has become an important business platform, and for most companies it is crucial to have an online presence which can be used to gather information about their products and/or services. However, in many cases there is a difference between the intended and the effective usage of a web site and, presently, many web site operators analyze the usage of their sites to improve their usability. But especially in the context of the Internet, content and structure change rather quickly, and the way a web site is used may change often, either due to changing information needs of its visitors, or due to an evolving user group. Therefore, the discovered usage patterns need to be updated continuously to always reflect the current state. In this article, we introduce PAM, an automated Pattern Monitor, which can be used to observe changes to the behavior of a web site's visitors. It is based on a temporal representation of rules in which both the content
Ranking the Interestingness of Summaries from Data Mining Systems
In Proceedings of the 12th Annual Florida Artificial Intelligence Research Symposium (FLAIRS'99), 1999
Cited by 6 (3 self)
We study data mining where the task is description by summarization, the representation language is generalized relations, the evaluation criteria are based on heuristic measures of interestingness, and the method for searching is the Multi-Attribute Generalization algorithm for domain generalization graphs. We present and empirically compare four heuristics for ranking the interestingness of generalized relations (or summaries). The measures are based on common measures of the diversity of a population, statistical variance, the Simpson index, and the Shannon index. All four measures rank less complex summaries (i.e., those with few tuples and/or non-ANY attributes) as most interesting. Highly ranked summaries provide a reasonable starting point for further analysis of discovered knowledge.
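As a rough illustration of three of the measures named above, the sketch below computes a variance-style index, the Simpson index, and the Shannon index over the tuple counts of a summary. It uses common textbook forms of these indices; the exact definitions, normalizations, and ranking direction used in the paper may differ, and the function names and sample counts are hypothetical.

```python
from math import log2


def proportions(counts):
    """Convert per-tuple counts of a summary into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def simpson_index(counts):
    """Sum of squared proportions: 1/m for a uniform summary of m tuples, 1.0 for a single tuple."""
    return sum(p * p for p in proportions(counts))


def shannon_index(counts):
    """Shannon entropy in bits: highest when all tuples are equally frequent."""
    return -sum(p * log2(p) for p in proportions(counts) if p > 0)


def variance_index(counts):
    """Sample variance of the proportions around the uniform value 1/m.
    Assumes the summary has at least two tuples."""
    probs = proportions(counts)
    mean = 1 / len(probs)
    return sum((p - mean) ** 2 for p in probs) / (len(probs) - 1)


# Two hypothetical four-tuple summaries of the same 100 rows.
uniform = [25, 25, 25, 25]
skewed = [85, 5, 5, 5]

for name, counts in (("uniform", uniform), ("skewed", skewed)):
    print(name, variance_index(counts), simpson_index(counts), shannon_index(counts))
```

The skewed summary scores higher on the variance and Simpson indices and lower on the Shannon index, which shows how these measures separate concentrated summaries from evenly spread ones.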
Principles for Mining Summaries Using Objective Measures of Interestingness
In Proceedings of the 12th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'00), 2000
Cited by 5 (3 self)
An important problem in the area of data mining is the development of effective measures of interestingness for ranking discovered knowledge. In this paper, we propose five principles that any measure must satisfy to be considered useful for ranking the interestingness of summaries generated from databases. We investigate the problem within the context of summarizing a single dataset which can be generalized in many different ways and to many levels of granularity. We perform a comparative sensitivity analysis of fifteen well-known diversity measures to identify those which satisfy the proposed principles. The fifteen diversity measures have previously been utilized in various disciplines, such as information theory, statistics, ecology, and economics. Their use as objective measures of interestingness for ranking summaries generated from databases is novel. The objective of this work is to gain some insight into the behaviour that can be expected from each of the diversity measures in practice, and to begin to develop a theory of interestingness against which the utility of new measures can be assessed.