Knowledge discovery and interestingness measures: A survey
1999
Cited by 44 (1 self)
Knowledge discovery in databases, also known as data mining, is the efficient discovery of previously unknown, valid, novel, potentially useful, and understandable patterns in large databases. It encompasses many different techniques and algorithms which differ in the kinds of data that can be analyzed and the form of knowledge representation used to convey the discovered knowledge. An important problem in the area of data mining is the development of effective measures of interestingness for ranking the discovered knowledge. In this report, we provide a general overview of the more successful and widely known data mining techniques and algorithms, and survey seventeen interestingness measures from the literature that have been successfully employed in data mining applications.
Understanding the crucial role of attribute interaction in data mining
Artif. Intel. Rev, 2001
Cited by 39 (11 self)
This is a review paper, whose goal is to significantly improve our understanding of the crucial role of attribute interaction in data mining. The main contributions of this paper are as follows. Firstly, we show that the concept of attribute interaction has a crucial role across different kinds of problems in data mining, such as attribute construction, coping with small disjuncts, induction of first-order logic rules, detection of Simpson’s paradox, and finding several types of interesting rules. Hence, a better understanding of attribute interaction can lead to a better understanding of the relationship between these kinds of problems, which are usually studied separately from each other. Secondly, we draw attention to the fact that most rule induction algorithms are based on a greedy search which does not cope well with the problem of attribute interaction, and point out some alternative kinds of rule discovery methods which tend to cope better with this problem. Thirdly, we discuss several algorithms and methods for discovering interesting knowledge that, implicitly or explicitly, are based on the concept of attribute interaction.
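The point about greedy search can be illustrated with a toy example. The sketch below is not taken from the paper: the XOR dataset and the helper names `entropy` and `information_gain` are assumptions chosen for illustration. It shows a case where each attribute, scored on its own as a greedy learner would score it, has zero information gain, while the two attributes together determine the class completely.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attrs):
    """Entropy reduction from splitting on the joint value of the attribute indices in attrs."""
    groups = {}
    for row, label in zip(rows, labels):
        key = tuple(row[a] for a in attrs)
        groups.setdefault(key, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: the class is the XOR of two binary attributes.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [a ^ b for a, b in rows]

gain_first = information_gain(rows, labels, [0])    # attribute 0 alone
gain_second = information_gain(rows, labels, [1])   # attribute 1 alone
gain_joint = information_gain(rows, labels, [0, 1])  # both attributes together

print(gain_first, gain_second, gain_joint)
```

A learner that picks one attribute at a time by information gain has no reason to prefer either attribute here, even though the pair is perfectly predictive. That is the kind of interaction the abstract refers to.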
Evaluation of Interestingness Measures for Ranking Discovered Knowledge
Lecture Notes in Computer Science, 2001
Cited by 24 (0 self)
When mining a large database, the number of patterns discovered can easily exceed the capabilities of a human user to identify interesting results. To address this problem, various techniques have been suggested to reduce and/or order the patterns prior to presenting them to the user. In this paper, our focus is on ranking summaries generated from a single dataset, where attributes can be generalized in many different ways and to many levels of granularity according to taxonomic hierarchies. We theoretically and empirically evaluate thirteen diversity measures used as heuristic measures of interestingness for ranking summaries generated from databases. The thirteen diversity measures have previously been utilized in various disciplines, such as information theory, statistics, ecology, and economics. We describe five principles that any measure must satisfy to be considered useful for ranking summaries. Theoretical results show that only four of the thirteen diversity measures satisfy all of the principles. We then analyze the distribution of the index values generated by each of the thirteen diversity measures. Empirical results, obtained using synthetic data, show that the distribution of index values generated tends to be highly skewed about the mean, median, and middle index values. The objective of this work is to gain some insight into the behaviour that can be expected from each of the measures in practice.
Managing Interesting Rules in Sequence Mining
In PKDD, 1999
Cited by 16 (0 self)
The goal of sequence mining is the discovery of interesting sequences of events. Conventional sequence miners, however, discover only frequent sequences. This limits the applicability of sequence mining in domains like error detection and web usage analysis. We propose a framework for discovering and maintaining interesting rules and beliefs in the context of sequence mining. We transform frequent sequences discovered by a conventional miner into sequence rules, remove redundant rules and organize the remaining ones into interestingness categories, from which unexpected rules and new beliefs are derived.
1 Introduction
Data miners pursue the discovery of new knowledge. But knowledge based solely on statistical dominance is rarely new. The expert needs means for either instructing the miner to discover only interesting rules or for ranking the mining results by "interestingness" [7]. Tuzhilin et al. propose interestingness measures based on the notion of belief [1, 6]. A belief r...
Reducing Redundancy in Characteristic Rule Discovery by Using IP-Techniques
In Intelligent Data Analysis Journal, 2000
Cited by 12 (1 self)
The discovery of characteristic rules is a well-known data mining technique and has led to several successful applications. Unfortunately, a (very) large number of rules is typically discovered during the mining stage. This makes monitoring and control of these rules extremely costly and difficult. Therefore, a selection of the most promising rules is desirable. In this paper, we propose an integer programming model to solve the problem of selecting the most promising subset of characteristic rules. The proposed technique makes it possible to control a user-defined level of overall quality of the model in combination with a maximum reduction of the redundancy present in the original ruleset. We use real-world data to evaluate the performance of the proposed technique against the well-known RuleCover heuristic.
1 Introduction
Data mining is the automated search for hidden, previously unknown and potentially useful information from large databases. Moreover, data mining is a crucial pha...
A Critical Review of Multi-Objective Optimization in Data Mining: a position paper
ACM SIGKDD Explorations, 2004
Cited by 11 (4 self)
This paper addresses the problem of how to evaluate the quality of a model built from the data in a multi-objective optimization scenario, where two or more quality criteria must be simultaneously optimized. A typical example is a scenario where one wants to maximize both the accuracy and the simplicity of a classification model or a candidate attribute subset in attribute selection. The paper reviews three very different approaches to cope with this problem, namely: (a) transforming the original multi-objective problem into a single-objective problem by using a weighted formula; (b) the lexicographic approach, where the objectives are ranked in order of priority; and (c) the Pareto approach, which consists of finding as many non-dominated solutions as possible and returning the set of non-dominated solutions to the user. It also presents a critical review of the case for and against each of these approaches. The general conclusions are that the weighted formula approach, which is by far the most used in the data mining literature, is to a large extent an ad-hoc approach for multi-objective optimization, whereas the lexicographic and the Pareto approaches are more principled, and therefore deserve more attention from the data mining community.
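The three approaches named in this abstract can be contrasted on a handful of candidate models. The following is a minimal sketch, not the paper's formulation: the candidate scores, the weights, and the function names are all hypothetical, each candidate is an `(accuracy, simplicity)` pair to be maximized, and the lexicographic variant is simplified to strict priority ordering with no tolerance threshold.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


def weighted_best(candidates, weights):
    """Collapse the objectives into one score with a weighted sum and return the top candidate."""
    return max(candidates, key=lambda c: sum(w * x for w, x in zip(weights, c)))


def lexicographic_best(candidates):
    """Compare on the first objective, breaking ties with later ones (tuple order is the priority)."""
    return max(candidates)


# Hypothetical (accuracy, simplicity) scores for four candidate models.
candidates = [(0.90, 0.20), (0.85, 0.60), (0.80, 0.50), (0.70, 0.90)]

print(pareto_front(candidates))                # three non-dominated candidates
print(weighted_best(candidates, (0.5, 0.5)))   # equal weights
print(weighted_best(candidates, (0.9, 0.1)))   # accuracy-heavy weights
print(lexicographic_best(candidates))          # accuracy has priority
```

With these made-up numbers the weighted formula returns a different winner for each weight vector, which illustrates why the choice of weights matters so much, whereas the Pareto approach returns all three non-dominated candidates and leaves the trade-off to the user.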
Heuristic Measures of Interestingness
Proceedings of the Third European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD'99), 1999
Cited by 10 (3 self)
The tuples in a generalized relation (i.e., a summary generated from a database) are unique, and therefore, can be considered to be a population with a structure that can be described by some probability distribution. In this paper, we present and empirically compare sixteen heuristic measures that evaluate the structure of a summary to assign a single real-valued index that represents its interestingness relative to other summaries generated from the same database. The heuristics are based upon well-known measures of diversity, dispersion, dominance, and inequality used in several areas of the physical, social, ecological, management, information, and computer sciences. Their use for ranking summaries generated from databases is a new application area. All sixteen heuristics rank less complex summaries (i.e., those with few tuples and/or few non-ANY attributes) as most interesting. We demonstrate that for sample data sets, the order in which some of the measures rank summaries is highly correlated.
Monitoring the Evolution of Web Usage Patterns
Lecture Notes in Computer Science, 2004
Cited by 9 (1 self)
With the ongoing shift from off-line to on-line business processes, the Web has become an important business platform, and for most companies it is crucial to have an on-line presence which can be used to gather information about their products and/or services. However, in many cases there is a difference between the intended and the effective usage of a web site and, presently, many web site operators analyze the usage of their sites to improve their usability. But especially in the context of the Internet, content and structure change rather quickly, and the way a web site is used may change often, either due to changing information needs of its visitors, or due to an evolving user group. Therefore, the discovered usage patterns need to be updated continuously to always reflect the current state. In this article, we introduce PAM, an automated Pattern Monitor, which can be used to observe changes to the behavior of a web site's visitors. It is based on a temporal representation of rules in which both the content
Interesting Fuzzy Association Rules in Quantitative Databases
Proceedings of the Fifth European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2001), 2001
Cited by 5 (0 self)
In this paper we examine association rules and their interestingness. Usually these rules are discussed in the context of basket analysis. Instead of customer data, we study data records of a more general but fixed nature, incorporating quantitative (non-Boolean) data. We propose a method for finding interesting rules with the help of fuzzy techniques and taxonomies for the items/attributes. Experiments show that the use of the proposed interestingness measure substantially decreases the number of rules.
Ranking the Interestingness of Summaries from Data Mining Systems
In Proceedings of the 12th Annual Florida Artificial Intelligence Research Symposium (FLAIRS'99), 1999
Cited by 4 (3 self)
We study data mining where the task is description by summarization, the representation language is generalized relations, the evaluation criteria are based on heuristic measures of interestingness, and the method for searching is the Multi-Attribute Generalization algorithm for domain generalization graphs. We present and empirically compare four heuristics for ranking the interestingness of generalized relations (or summaries). The measures are based on common measures of the diversity of a population, statistical variance, the Simpson index, and the Shannon index. All four measures rank less complex summaries (i.e., those with few tuples and/or non-ANY attributes) as most interesting. Highly ranked summaries provide a reasonable starting point for further analysis of discovered knowledge.
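The measures named in this abstract can be sketched from their common textbook forms. The code below is an illustration under stated assumptions, not the paper's definitions: a summary is represented only by its hypothetical tuple counts, the variance is taken around the uniform proportion with an `m - 1` denominator, the Simpson index is the sum of squared proportions, and the Shannon index uses base-2 logarithms. All function names are invented for this example.

```python
from math import log2


def proportions(counts):
    """Convert tuple counts into proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def variance_index(counts):
    """Variance of the proportions around the uniform value 1/m (assumed m - 1 denominator)."""
    p = proportions(counts)
    m = len(p)
    if m < 2:
        return 0.0
    mean = 1.0 / m
    return sum((x - mean) ** 2 for x in p) / (m - 1)


def simpson_index(counts):
    """Sum of squared proportions: higher when a few tuples dominate."""
    return sum(x * x for x in proportions(counts))


def shannon_index(counts):
    """Shannon entropy in bits: higher when the tuples are evenly spread."""
    return -sum(x * log2(x) for x in proportions(counts) if x > 0)


# Hypothetical tuple counts for two summaries of the same 40-row table.
uniform_summary = [10, 10, 10, 10]
skewed_summary = [37, 1, 1, 1]

for name, counts in [("uniform", uniform_summary), ("skewed", skewed_summary)]:
    print(name, variance_index(counts), simpson_index(counts), shannon_index(counts))
```

Each function returns a single real-valued index per summary, so a set of summaries can be sorted by any of them. Which direction counts as more interesting depends on the measure and is left to the ranking step.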

