Results 1 - 10 of 59
Mining high-speed data streams
2000
Cited by 220 (10 self)
Learning when Training Data are Costly: The Effect of Class Distribution on Tree Induction
2002
Cited by 79 (9 self)
For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the data and/or the computational costs associated with learning from the data. One question of practical importance is: if n training examples are going to be selected, in what proportion should the classes be represented? In this article we analyze the relationship between the marginal class distribution of training data and the performance of classification trees induced from these data, when the size of the training set is fixed. We study twenty-six data sets and, for each, determine the best class distribution for learning. Our results show that, for a fixed number of training examples, it is often possible to obtain improved classifier performance by training with a class distribution other than the naturally occurring class distribution. For example, we show that to build a classifier robust to different misclassification costs, a balanced class distribution generally performs quite well. We also describe and evaluate a budget-sensitive progressive-sampling algorithm that selects training examples such that the resulting training set has a good (near-optimal) class distribution for learning.
Types of cost in inductive concept learning
In Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning, 2000
Cited by 77 (0 self)
Inductive concept learning is the task of learning to assign cases to a discrete set of classes. In real-world applications of concept learning, there are many different types of cost involved. The majority of the machine learning literature ignores all types of cost (unless accuracy is interpreted as a type of cost measure). A few papers have investigated the cost of misclassification errors. Very few papers have examined the many other types of cost. In this paper, we attempt to create a taxonomy of the different types of cost that are involved in inductive concept learning. This taxonomy may help to organize the literature on cost-sensitive learning. We hope that it will inspire researchers to investigate all types of cost in inductive concept learning in more depth.
A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering
In Proceedings of the Eighteenth International Conference on Machine Learning, 2001
Cited by 55 (3 self)
We propose to scale learning algorithms to arbitrarily large databases by the following method. First derive an upper bound for the learner's loss as a function of the number of examples used in each step of the algorithm.
Tree induction vs. logistic regression: A learning-curve analysis
CEDER Working Paper #IS-01-02, Stern School of Business, 2001
Cited by 50 (16 self)
Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree induction, assessing classification accuracy and the quality of rankings based on class-membership probabilities. We use a learning-curve analysis to examine the relationship of these measures to the size of the training set. The results of the study show several remarkable things. (1) Contrary to prior observations, logistic regression does not generally outperform tree induction. (2) More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree induction for larger data sets. Importantly, this often holds for training sets drawn from the same domain (i.e., the learning curves cross), so conclusions about induction-algorithm superiority on a given domain must be based on an analysis of the learning curves. (3) Contrary to conventional wisdom, tree induction is effective at producing probability-based rankings, although apparently comparatively less so for a given training-set size than at making classifications. Finally, (4) the domains on which tree induction and logistic regression are ultimately preferable can be characterized surprisingly well by a simple measure of signal-to-noise ratio.
Active learning using adaptive resampling
In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000
Cited by 37 (1 self)
Classification modeling (a.k.a. supervised learning) is an extremely useful analytical technique for developing predictive and forecasting applications. The explosive growth in data warehousing and internet usage has made large amounts of data potentially available for developing classification models. For example, natural language text is widely available in many forms (e.g., electronic mail, news articles, reports, and web page contents). Categorization of data is a common activity which can be automated to a large extent using supervised learning methods. Examples of this include routing of electronic mail, satellite image classification, and character recognition. However, these tasks require labeled data sets of sufficiently high quality with adequate instances for training the predictive models. Much of the on-line data, particularly the unstructured variety (e.g., text), is unlabeled. Labeling is usually an expensive manual process done by domain experts. Active learning is an approach to solving this problem and works by identifying a subset of the data that needs to be labeled and using this subset to generate classification models. We present an active learning method that uses adaptive resampling in a natural way to significantly reduce the size of the required labeled set and generates a classification model that achieves the high accuracies possible with current adaptive resampling methods.
Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms
Data Mining and Knowledge Discovery, 1999
Cited by 35 (7 self)
Scalability is a key requirement for any KDD and data mining algorithm, and one of the biggest research challenges is to develop methods that allow the use of large amounts of data. One possible approach for dealing with huge amounts of data is to take a random sample and do data mining on it, since for many data mining applications approximate answers are acceptable. However, as argued by several researchers, random sampling is difficult to use due to the difficulty of determining an appropriate sample size. In this paper, we take a sequential sampling approach for solving this difficulty, and propose an adaptive sampling method that solves a general problem covering many actual problems arising in applications of discovery science. An algorithm following this method obtains examples sequentially in an online fashion, and it determines from the obtained examples whether it has already seen a large enough number of examples. Thus, sample size is not fixed a priori; instead, it adaptively depends on the situation. Due to this adaptiveness, if we are not in a worst-case situation, as fortunately happens in many practical applications, then we can solve the problem with a number of examples much smaller than that required in the worst case. We prove the correctness of our method and estimate its efficiency theoretically. To illustrate its usefulness, we consider one concrete example of using sampling, provide an algorithm based on our method, and show its efficiency by experimental evaluation.
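The idea of letting the sample size depend on the data seen so far can be illustrated with a minimal sketch. This is not the algorithm from the paper: it is a generic sequential test built on the Hoeffding bound, and the task (deciding whether the mean of a [0, 1]-valued quantity lies above or below 0.5), the function name `adaptive_sample_sign`, and the parameters `delta` and `max_n` are all assumptions made for illustration.

```python
import math
import random


def adaptive_sample_sign(draw, delta=0.05, max_n=1_000_000):
    """Decide whether the mean of draw() (values in [0, 1]) is above 0.5.

    Hypothetical illustration: examples are drawn one at a time and
    sampling stops as soon as the running mean is separated from 0.5
    by more than a Hoeffding confidence radius.
    Returns (decision, number_of_examples_used).
    """
    total = 0.0
    for t in range(1, max_n + 1):
        total += draw()
        mean = total / t
        # Hoeffding radius at step t with failure probability
        # delta / (t * (t + 1)); these probabilities sum to delta over all t.
        radius = math.sqrt(math.log(2 * t * (t + 1) / delta) / (2 * t))
        if abs(mean - 0.5) > radius:
            return mean > 0.5, t
    # Budget exhausted without a confident decision: fall back to the mean.
    return total / max_n > 0.5, max_n


# Usage: an easy case (mean 0.9) stops far earlier than a hard one (mean 0.55).
rng = random.Random(0)
easy = adaptive_sample_sign(lambda: 1.0 if rng.random() < 0.9 else 0.0)
hard = adaptive_sample_sign(lambda: 1.0 if rng.random() < 0.55 else 0.0)
print("easy case:", easy, "hard case:", hard)
```

When the true mean is far from the threshold, the loop terminates after a few dozen examples; only near-threshold cases approach the worst-case sample size.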
Machine Learning from Imbalanced Data Sets 101 (Extended Abstract)
Cited by 23 (1 self)
Foster Provost, New York University, fprovost@stern.nyu.edu
For research to progress most effectively, we first should establish common ground regarding just what is the problem that imbalanced data sets present to machine learning systems. Why and when should imbalanced data sets be problematic? When is the problem simply an artifact of easily rectified design choices? I will try to pick the low-hanging fruit and share them with the rest of the workshop participants. Specifically, I would like to discuss what the problem is not. I hope this will lead to a profitable discussion of what the problem indeed is, and how it might be addressed most effectively.
An early stumbling block. A common notion in machine learning causes the most basic problem, and indeed often has stymied both research-oriented and practical attempts to learn from imbalanced data sets. Fortunately the problem is straightforward to fix. The stumbling block is the notion that an inductive learner produces a black b...
Learning and Classifying under Hard Budgets
In Proceedings of the European Conference on Machine Learning (ECML-05), 2005
Cited by 20 (0 self)
Since resources for data acquisition are seldom infinite, both learners and classifiers must act intelligently under hard budgets. In this paper, we consider problems in which feature values are unknown to both the learner and classifier, but can be acquired at a cost. Our goal is a learner that spends its fixed learning budget b_L acquiring training data, to produce the most accurate "active classifier" that spends at most b_C per instance. To produce this fixed-budget classifier, the fixed-budget learner must sequentially decide which feature values to collect to learn the relevant information about the distribution. We explore several approaches the learner can take, including the standard "round robin" policy (purchasing every feature of every instance until the b_L budget is exhausted). We demonstrate empirically that round robin is problematic (especially for small b_L), and provide alternate learning strategies that achieve superior performance on a variety of datasets.
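The "round robin" baseline described in this abstract can be sketched in a few lines. The sketch assumes an instance-by-instance purchase order and fixed per-feature costs; the function name `round_robin_purchases` and its parameters are hypothetical and do not come from the paper.

```python
def round_robin_purchases(n_instances, feature_costs, budget):
    """Buy every feature of every instance, in order, until the budget runs out.

    Hypothetical illustration of a round-robin acquisition policy.
    feature_costs[j] is the assumed cost of acquiring feature j.
    Returns (list of (instance, feature) purchases, unspent budget).
    """
    purchases = []
    remaining = budget
    for i in range(n_instances):
        for j, cost in enumerate(feature_costs):
            if cost > remaining:
                # The next purchase is unaffordable: stop here.
                return purchases, remaining
            purchases.append((i, j))
            remaining -= cost
    return purchases, remaining


# Usage: with a budget of 4 and feature costs 1 and 2, the policy fully
# observes instance 0 and only the first feature of instance 1.
print(round_robin_purchases(3, [1, 2], 4))
```

The example shows why a small budget is a problem for this policy: the spending is spread uniformly, regardless of which feature values would be most informative.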
Learning ensembles from bites: A scalable and accurate approach
Cited by 19 (2 self)
Bagging and boosting are two popular ensemble methods that typically achieve better accuracy than a single classifier. These techniques have limitations on massive datasets, as the size of the dataset can be a bottleneck. Voting many classifiers built on small subsets of data ("pasting small votes") is a promising approach for learning from massive datasets, one that can utilize the power of boosting and bagging. We propose a framework for building hundreds or thousands of such classifiers on small subsets of data in a distributed environment. Experiments show this approach is fast, accurate, and scalable.
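The general idea of voting many classifiers trained on small subsets can be illustrated as follows. This is a minimal single-machine sketch, not the distributed framework proposed in the paper: the bites are drawn uniformly at random, the base learner is a toy nearest-centroid rule on one-dimensional inputs, and every name (`train_centroid`, `build_ensemble`, `vote`) is an assumption introduced for illustration.

```python
import random
from collections import Counter


def train_centroid(bite):
    """Toy base learner: store the mean input value of each class."""
    sums, counts = {}, Counter()
    for x, y in bite:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_centroid(model, x):
    """Predict the class whose stored mean is closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))


def build_ensemble(data, n_models, bite_size, rng):
    """Train one small classifier per random bite of the data."""
    return [train_centroid(rng.sample(data, bite_size)) for _ in range(n_models)]


def vote(models, x):
    """Combine the small classifiers by unweighted majority vote."""
    ballots = Counter(predict_centroid(m, x) for m in models)
    return ballots.most_common(1)[0][0]


# Usage: two well-separated classes, 25 classifiers trained on bites of 20.
rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(500)]
data += [(rng.gauss(10.0, 1.0), 1) for _ in range(500)]
models = build_ensemble(data, n_models=25, bite_size=20, rng=rng)
print(vote(models, 0.5), vote(models, 9.5))
```

Because each classifier sees only a small bite, no single training step needs the full dataset in memory, which is what makes the approach attractive for massive datasets.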

