Finding the Most Interesting Patterns in a Database Quickly by Using Sequential Sampling
Journal of Machine Learning Research, 2001
Cited by 25 (4 self)
Many discovery problems, e.g., subgroup or association rule discovery, can naturally be cast as n-best hypotheses problems where the goal is to find the n hypotheses from a given hypothesis space that score best according to a certain utility function. We present a sampling algorithm that solves this problem by issuing a small number of database queries while guaranteeing precise bounds on confidence and quality of solutions. Known sampling approaches have treated single hypothesis selection problems, assuming that the utility be the average (over the examples) of some function, which is not the case for many frequently used utility functions. We show that our algorithm works for all utilities that can be estimated with bounded error. We provide these error bounds and resulting worst-case sample bounds for some of the most frequently used utilities, and prove that there is no sampling algorithm for a popular class of utility functions that cannot be estimated with bounded error. The algorithm is sequential in the sense that it starts to return (or discard) hypotheses that already seem to be particularly good (or bad) after a few examples. Thus, the algorithm is almost always faster than its worst-case bounds.
MadaBoost: A Modification of AdaBoost
2000
Cited by 19 (3 self)
In the last decade, one of the research topics that has received a great deal of attention from the machine learning and computational learning communities has been the so-called boosting techniques. In this paper, we further explore this topic by proposing a new boosting algorithm that mends some of the problems that have been detected in the so far most successful boosting algorithm, AdaBoost, due to Freund and Schapire [FS97]. These problems are: (1) AdaBoost cannot be used in the boosting by filtering framework, and (2) AdaBoost does not seem to be noise resistant. In order to solve them, we propose a new boosting algorithm, MadaBoost, by modifying the weighting system of AdaBoost. We first prove that one version of MadaBoost is in fact a boosting algorithm. Second, we show how our algorithm can be used and analyze its performance in detail. Finally, we show that our new boosting algorithm can be cast in the statistical query learning model [Kea93] and thus, it is robust to ra...
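The abstract says only that MadaBoost modifies AdaBoost's weighting system. As a rough illustration, the sketch below assumes the commonly described form of that change, in which an example's unnormalized weight is capped at its initial value of 1, so that the weight can double as an acceptance probability when filtering. The function names and the `margin_sum` parameter are illustrative and not taken from the paper.

```python
import math


def adaboost_weight(margin_sum):
    """Unnormalized AdaBoost weight of one example.

    margin_sum is the sum over rounds of alpha_t * y * h_t(x); it is negative
    when the weighted vote so far misclassifies the example.
    """
    return math.exp(-margin_sum)


def capped_weight(margin_sum):
    """Assumed MadaBoost-style weight: the AdaBoost weight, capped at 1."""
    return min(1.0, adaboost_weight(margin_sum))
```

Under this assumption a repeatedly misclassified (possibly noisy) example can never weigh more than it did at the start, whereas its AdaBoost weight grows exponentially.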
Scaling up a Boosting-Based Learner via Adaptive Sampling
In Proceedings of the Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2000
Cited by 11 (5 self)
In this paper we present an experimental evaluation of a boosting-based learning system and show that it can be run efficiently over a large dataset. As base learner, the system uses decision stumps: single-attribute decision trees with only two terminal nodes. To select the best decision stump at each iteration we use an adaptive sampling method. As a boosting algorithm, we use a modification of AdaBoost that is suitable for combination with a base learner that does not use the whole dataset. We provide experimental evidence that our method is as accurate as the equivalent algorithm that uses the whole dataset, but much faster.
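To make the notion of a decision stump concrete, here is a minimal sketch that picks the single-attribute threshold test with the lowest weighted error on whatever examples it is given. It searches the supplied examples exhaustively; the adaptive sampling step that the paper uses to choose how many examples to look at is not reproduced, and the names `train_stump` and `stump_predict` are hypothetical.

```python
def train_stump(xs, ys, weights):
    """Return (weighted_error, feature_index, threshold, sign) of the best stump.

    xs: list of numeric feature vectors; ys: labels in {-1, +1};
    weights: non-negative example weights, as a booster would supply.
    """
    best = None
    for j in range(len(xs[0])):
        for threshold in sorted({x[j] for x in xs}):
            for sign in (1, -1):
                # The stump predicts `sign` when x[j] >= threshold, else -sign.
                error = sum(
                    w for x, y, w in zip(xs, ys, weights)
                    if (sign if x[j] >= threshold else -sign) != y
                )
                if best is None or error < best[0]:
                    best = (error, j, threshold, sign)
    return best


def stump_predict(stump, x):
    """Apply a stump returned by train_stump to one feature vector."""
    _, j, threshold, sign = stump
    return sign if x[j] >= threshold else -sign
```

A stump has exactly two leaves, which is why one pass over (feature, threshold, sign) triples is enough to find the best one on a given sample.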
A Sequential Sampling Algorithm for a General Class of Utility Criteria
In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2000
Cited by 8 (1 self)
Many discovery problems, e.g., subgroup or association rule discovery, can naturally be cast as n-best hypothesis problems where the goal is to find the n hypotheses from a given hypothesis space that score best according to a given utility function. We present a sampling algorithm that solves this problem by issuing a small number of database queries while guaranteeing precise bounds on confidence and quality of solutions. Known sampling algorithms assume that the utility be the average (over the examples) of some function, which is not the case for many frequently used utility functions. We show that our algorithm works for all utilities that can be estimated with bounded error. We provide such error bounds and resulting worst-case sample bounds for some of the most frequently used utilities, and prove that there is no sampling algorithm for another popular class of utility functions. The algorithm is sequential in the sense that it starts to return (or discard) hypotheses that already...
Sequential Sampling Algorithms: Unified Analysis and Lower Bounds
2001
Cited by 6 (0 self)
Sequential sampling algorithms have recently attracted interest as a way to design scalable algorithms for data mining and KDD processes. In this paper, we identify an elementary sequential sampling task (estimation from examples), from which one can derive many other tasks appearing in practice. We present a generic algorithm to solve this task and an analysis of its correctness and running time that is simpler and more intuitive than those existing in the literature. For two specific tasks, frequency and advantage estimation, we derive lower bounds on running time in addition to the general upper bounds.
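As a rough illustration of the sequential idea (not the generic algorithm of the paper), the sketch below draws Boolean examples one at a time and stops as soon as a Hoeffding confidence interval around the running frequency excludes a threshold. The function names, the default parameters, and the way the error probability `delta` is split across steps are all assumptions made for this example.

```python
import math


def confidence_radius(n, delta):
    """Two-sided Hoeffding radius for the mean of n draws in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def sequential_threshold_test(draw, threshold=0.5, delta=0.05, max_samples=100_000):
    """Sample until the interval around the running frequency excludes `threshold`.

    Returns (decision, samples_used); decision is "above", "below" or "undecided".
    """
    successes = 0
    for n in range(1, max_samples + 1):
        successes += 1 if draw() else 0
        estimate = successes / n
        # Spend delta / (n * (n + 1)) at step n; these shares sum to at most
        # delta, so the guarantee holds whichever step the loop stops at.
        radius = confidence_radius(n, delta / (n * (n + 1)))
        if estimate - radius > threshold:
            return "above", n
        if estimate + radius < threshold:
            return "below", n
    return "undecided", max_samples
```

The number of draws depends on the data: a frequency far from the threshold is decided after a few dozen examples, while one close to it keeps sampling until `max_samples`.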
From Computational Learning Theory to Discovery Science
1999
Cited by 5 (3 self)
Machine learning has been one of the important subjects of AI that is motivated by many real-world applications. In theoretical computer science, researchers have also introduced mathematical frameworks for investigating machine learning, and in these frameworks, many interesting results have been obtained. Now we are proceeding to a new stage to study how to apply these fruitful theoretical results to real problems. We point out in this paper that "adaptivity" is one of the important issues when we consider applications of learning techniques, and we propose one learning algorithm with this feature.
1 Introduction
Discovery science is a new area of computer science that aims at (i) developing efficient computational methods which enable automatic discoveries of scientific knowledge and decision making rules and (ii) understanding all the issues concerned with this goal. Of course, discovery science involves many areas, from practical to theoretical, of computer science. For exampl...
Using self-similarity to cluster large data sets
Data Mining and Knowledge Discovery, 2003
Cited by 4 (1 self)
Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to scale well with the size of the data set and the number of dimensions that describe the points, to find clusters of arbitrary shape, or to deal effectively with the presence of noise. In this paper, we present a new clustering algorithm, based on self-similarity properties of the data sets. Self-similarity is the property of being invariant with respect to the scale used to look at the data set. While fractals are self-similar at every scale used to look at them, many data sets exhibit self-similarity over a range of scales. Self-similarity can be measured using the fractal dimension. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will, providing the best answer possible at that point, and is incremental. We show via experiments that FC effectively deals with large data sets, high dimensionality and noise, and is capable of recognizing clusters of arbitrary shape.
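The placement rule described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the FC implementation: it assumes points scaled to the unit cube, non-empty initial clusters, and a box-counting estimate of the fractal dimension (the abstract does not say which estimator is used). It also recomputes the estimate from scratch rather than updating it incrementally, and the function names and grid `levels` are made up for the example.

```python
import math


def occupied_boxes(points, cells_per_axis):
    """Count grid cells of side 1/cells_per_axis that contain at least one point."""
    cells = set()
    for p in points:
        # Clamp so that a coordinate equal to 1.0 falls in the last cell.
        cells.add(tuple(min(int(c * cells_per_axis), cells_per_axis - 1) for c in p))
    return len(cells)


def box_counting_dimension(points, levels=(2, 4, 8, 16)):
    """Least-squares slope of log(box count) against log(cells per axis)."""
    xs = [math.log(k) for k in levels]
    ys = [math.log(occupied_boxes(points, k)) for k in levels]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def assign(point, clusters):
    """Index of the cluster whose estimated dimension changes least."""
    def change(cluster):
        before = box_counting_dimension(cluster)
        after = box_counting_dimension(cluster + [point])
        return abs(after - before)

    return min(range(len(clusters)), key=lambda i: change(clusters[i]))
```

For example, a point lying on a diagonal line of points leaves that cluster's estimate unchanged, so it is assigned there rather than to a cluster with a different shape.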
Tracking Clusters in Evolving Data Sets
Cited by 2 (1 self)
As organizations accumulate data over time, the problem of tracking how patterns evolve becomes important. In this paper, we present an algorithm to track the evolution of cluster models in a stream of data. Our algorithm is based on the application of bounds derived using Chernoff's inequality and makes use of a clustering algorithm that was previously developed by us, namely Fractal Clustering, which uses self-similarity as the property to group points together. Experiments show that our tracking algorithm is efficient and effective in finding changes in the patterns.
FRACTAL MINING - Self Similarity-based Clustering and its Applications
Cited by 1 (0 self)
Self-similarity is the property of being invariant with respect to the scale used to look at the data set. While fractals are self-similar at every scale used to look at them, many data sets exhibit self-similarity over a range of scales. Self-similarity can be measured using the fractal dimension. The fractal dimension is an important characteristic of many complex systems and can serve as a powerful representation technique. In this chapter, we present a new clustering algorithm, based on self-similarity properties of the data sets, and also its applications to other fields in data mining, such as projected clustering and trend analysis. Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to scale well with the size of the data set and the number of dimensions that describe the points, to find clusters of arbitrary shape, or to deal effectively with the presence of noise. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will, providing the best answer possible at that point, and is incremental. We show via experiments that FC effectively deals with large data sets, high dimensionality and noise, and is capable of recognizing clusters of arbitrary shape.