Data Mining: An Overview from Database Perspective
IEEE Transactions on Knowledge and Data Engineering, 1996
Cited by 314 (23 self)
Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have shown great interest in data mining. Several emerging applications in information providing services, such as data warehousing and on-line services over the Internet, also call for various data mining techniques to better understand user behavior, to improve the service provided, and to increase the business opportunities. In response to such a demand, this article is to provide a survey, from a database researcher's point of view, on the data mining techniques developed recently. A classification of the available data mining techniques is provided and a comparative study of such techniques is presented.
An Algorithm for Multi-Relational Discovery of Subgroups
1997
Cited by 105 (8 self)
We consider the problem of finding statistically unusual subgroups in a multi-relation database, and extend previous work on single-relation subgroup discovery. We give a precise definition of the multi-relation subgroup discovery task, propose a specific form of declarative bias based on foreign links as a means of specifying the hypothesis space, and show how propositional evaluation functions can be adapted to the multi-relation setting. We then describe an algorithm for this problem setting that uses optimistic estimate and minimal support pruning, an optimal refinement operator and sampling to ensure efficiency and can easily be parallelized.
Subgroup Discovery with CN2-SD
Journal of Machine Learning Research, 2004
Cited by 34 (7 self)
The goal of subgroup discovery is to find rules describing subsets of the population that are sufficiently large and statistically unusual. The paper presents a subgroup discovery algorithm, CN2-SD, developed by modifying parts of the CN2 classification rule learner: its covering algorithm, search heuristic, probabilistic classification of instances, and evaluation measures. Experimental evaluation of CN2-SD on 23 UCI data sets shows substantial reduction of the number of induced rules, increased rule coverage and rule significance, as well as slight improvements in terms of the area under ROC curve, when compared with the CN2 algorithm. Application of CN2-SD to a large traffic accident data set confirms these findings.
An analysis of quantitative measures associated with rules
Proceedings of PAKDD’99, 1999
Cited by 29 (22 self)
In this paper, we analyze quantitative measures associated with if-then type rules. Basic quantities are identified and many existing measures are examined using the basic quantities. The main objective is to provide a synthesis of existing results in a simple and unified framework. The quantitative measure is viewed as a multi-facet concept, representing the confidence, uncertainty, applicability, quality, accuracy, and interestingness of rules. Roughly, they may be classified as representing one-way and two-way supports.
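As a rough illustration of the kind of basic quantities such measures build on, the sketch below computes support and confidence for a single if-then rule, using the definitions common in the association rule literature. The abstract above does not give formulas, so the function `rule_measures`, the toy records, and the choice of these two measures are assumptions made for illustration only.

```python
def rule_measures(records, antecedent, consequent):
    """Return (support, confidence) of the rule 'if antecedent then consequent'.

    support    = fraction of all records satisfying both parts of the rule
    confidence = fraction of records satisfying the antecedent that also
                 satisfy the consequent
    """
    n = len(records)
    n_a = sum(1 for r in records if antecedent(r))
    n_ab = sum(1 for r in records if antecedent(r) and consequent(r))
    support = n_ab / n if n else 0.0
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence


# Hypothetical toy data, not taken from the paper.
records = [
    {"outlook": "sunny", "play": False},
    {"outlook": "sunny", "play": False},
    {"outlook": "rain", "play": True},
    {"outlook": "sunny", "play": True},
]

# Rule: if outlook is sunny then play is False.
support, confidence = rule_measures(
    records,
    antecedent=lambda r: r["outlook"] == "sunny",
    consequent=lambda r: not r["play"],
)
```

Both values are ratios of simple counts over the same records, which is the sense in which many rule measures can be expressed through a small set of basic quantities.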
Discovering significant patterns
2007
Cited by 25 (3 self)
Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
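One well-established way to bound experimentwise error is the Bonferroni correction, which divides the significance level by the number of hypotheses tested. The abstract above does not name the specific procedures it applies, so the sketch below is only an assumed example of the general idea; `bonferroni_filter`, the pattern names, and the p-values are hypothetical.

```python
def bonferroni_filter(p_values, alpha=0.05, n_tests=None):
    """Keep only patterns whose p-value passes a Bonferroni-adjusted threshold.

    p_values: dict mapping a pattern name to its p-value
    alpha:    desired upper limit on the experimentwise type-1 error risk
    n_tests:  total number of hypotheses tested; defaults to len(p_values)
    """
    m = n_tests if n_tests is not None else len(p_values)
    if m == 0:
        return {}
    threshold = alpha / m  # each individual test must pass this stricter level
    return {name: p for name, p in p_values.items() if p <= threshold}


# Hypothetical p-values for four candidate patterns.
candidate_p_values = {
    "A -> B": 0.0004,
    "C -> D": 0.03,
    "E -> F": 0.012,
    "G -> H": 0.2,
}

# With alpha = 0.05 and four tests, the per-test threshold is 0.0125.
accepted = bonferroni_filter(candidate_p_values, alpha=0.05)
```

For the bound to hold, `n_tests` has to count every hypothesis that was tested, not only the patterns that survived earlier filtering, which is why it is exposed as a separate parameter.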
Extensibility in Data Mining Systems
1996
Cited by 24 (1 self)
The successful application of data mining techniques ideally requires both system support for the entire knowledge discovery process and the right analysis algorithms for the particular task at hand. While there are a number of successful data mining systems that support the entire mining process, they usually are limited to a fixed selection of analysis algorithms. In this paper, we argue in favor of extensibility as a key feature of data mining systems, and discuss the requirements that this entails for system architecture. We identify in which points existing data mining systems fail to meet these requirements, and then describe a new integration architecture for data mining systems that addresses these problems based on the concept of "plug-ins". KEPLER, our data mining system built according to this architecture, is presented and discussed.
Keywords: data mining, system architecture, extensibility, KEPLER
Expert-Guided Subgroup Discovery: Methodology and Application
Journal of Artificial Intelligence Research, 2002
Cited by 23 (5 self)
This paper presents an approach to expert-guided subgroup discovery. The main step of the subgroup discovery process, the induction of subgroup descriptions, is performed by a heuristic beam search algorithm, using a novel parametrized definition of rule quality which is analyzed in detail. The other important steps of the proposed subgroup discovery process are the detection of statistically significant properties of selected subgroups and subgroup visualization: statistically significant properties are used to enrich the descriptions of induced subgroups, while the visualization shows subgroup properties in the form of distributions of the numbers of examples in the subgroups. The approach is illustrated by the results obtained for a medical problem of early detection of patient risk groups.
Data Mining: Statistics and More?
1998
Cited by 23 (4 self)
In this article we examine some of the major differences in emphasis between statistics and data mining. In Section 3 we look at some of the major tools, and Section 4 concludes.
Confirmation-guided discovery of first-order rules with Tertius
Machine Learning, 2000
Cited by 23 (9 self)
This paper deals with learning first-order logic rules from data lacking an explicit classification predicate. Consequently, the learned rules are not restricted to predicate definitions as in supervised inductive logic programming. First-order logic offers the ability to deal with structured, multi-relational knowledge. Possible applications include first-order knowledge discovery, induction of integrity constraints in databases, multiple predicate learning, and learning mixed theories of predicate definitions and integrity constraints. One of the contributions of our work is a heuristic measure of confirmation, trading off novelty and satisfaction of the rule. The approach has been implemented in the Tertius system. The system performs an optimal best-first search, finding the k most confirmed hypotheses, and includes a non-redundant refinement operator to avoid duplicates in the search. Tertius can be adapted to many different domains by tuning its parameters, and it can deal eithe...
Finding the Most Interesting Patterns in a Database Quickly by Using Sequential Sampling
Journal of Machine Learning Research, 2001
Cited by 19 (2 self)
Many discovery problems, e.g., subgroup or association rule discovery, can naturally be cast as n-best hypotheses problems where the goal is to find the n hypotheses from a given hypothesis space that score best according to a certain utility function. We present a sampling algorithm that solves this problem by issuing a small number of database queries while guaranteeing precise bounds on confidence and quality of solutions. Known sampling approaches have treated single hypothesis selection problems, assuming that the utility be the average (over the examples) of some function, which is not the case for many frequently used utility functions. We show that our algorithm works for all utilities that can be estimated with bounded error. We provide these error bounds and resulting worst-case sample bounds for some of the most frequently used utilities, and prove that there is no sampling algorithm for a popular class of utility functions that cannot be estimated with bounded error. The algorithm is sequential in the sense that it starts to return (or discard) hypotheses that already seem to be particularly good (or bad) after a few examples. Thus, the algorithm is almost always faster than its worst-case bounds.
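For context on what a worst-case sample bound looks like, the sketch below uses the standard Hoeffding inequality for the simplest setting the abstract mentions: a utility that is an average over examples of values bounded in [0, 1]. This is a fixed-sample-size bound for one hypothesis, not the sequential algorithm described above, and the function name `hoeffding_sample_size` is hypothetical.

```python
import math


def hoeffding_sample_size(epsilon, delta):
    """Examples needed so that the sample mean of values in [0, 1] lies within
    epsilon of the true mean with probability at least 1 - delta.

    Follows from the two-sided Hoeffding bound:
        P(|mean_hat - mean| >= epsilon) <= 2 * exp(-2 * n * epsilon**2)
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


# Error at most 0.05 with 95% confidence, for a single hypothesis.
n = hoeffding_sample_size(epsilon=0.05, delta=0.05)
```

A sequential procedure of the kind the abstract describes can stop earlier than such a bound suggests, because clearly good or clearly bad hypotheses can be decided after only a few examples.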

