Learning when training data are costly: the effect of class distribution on tree induction (2003)

by G Weiss, F Provost
Venue: J. Artif. Intell. Res.

Results 11 - 20 of 63

Structural Event Detection for Rich Transcription of Speech

by Yang Liu, 2004
Cited by 12 (5 self)
Abstract not found

Does cost-sensitive learning beat sampling for classifying rare classes

by Kate McCarthy, Bibi Zabar, Gary Weiss - in UBDM ’05, 2005
Cited by 11 (0 self)
A highly-skewed class distribution usually causes the learned classifier to predict the majority class much more often than the minority class. This is a consequence of the fact that most classifiers are designed to maximize accuracy. In many instances, such as for medical diagnosis, the minority class is the class of primary interest and hence this classification behavior is unacceptable. In this paper, we compare two basic strategies for dealing with data that has a skewed class distribution and non-uniform misclassification costs. One strategy is based on cost-sensitive learning while the other strategy employs sampling to create a more balanced class distribution in the training set. We compare two sampling techniques, up-sampling and down-sampling, to the cost-sensitive learning approach. The purpose of this paper is to determine which technique produces the best overall classifier—and under what circumstances.
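
The two sampling strategies named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `resample`, its parameters, and the choice of purely random selection are assumptions made for illustration, and the sketch assumes the minority class is non-empty and no larger than the majority class.

```python
import random


def resample(examples, labels, minority, mode, seed=0):
    """Balance a two-class data set by random up- or down-sampling.

    Hypothetical helper: `minority` is the label of the rare class and
    `mode` is either "up" or "down".
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]

    if mode == "up":
        # Up-sampling: duplicate randomly chosen minority examples
        # (with replacement) until both classes have the same size.
        shortfall = len(majority_idx) - len(minority_idx)
        extra = [rng.choice(minority_idx) for _ in range(shortfall)]
        keep = majority_idx + minority_idx + extra
    elif mode == "down":
        # Down-sampling: keep a random subset of majority examples
        # (without replacement) equal in size to the minority class.
        keep = rng.sample(majority_idx, len(minority_idx)) + minority_idx
    else:
        raise ValueError("mode must be 'up' or 'down'")

    return [examples[i] for i in keep], [labels[i] for i in keep]
```

Up-sampling keeps every original example at the price of duplicates, while down-sampling discards majority-class data; either resampled set would then be passed to an ordinary accuracy-maximizing learner.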

Issues in mining imbalanced data sets - a review paper

by Sofia Visa - in Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference, 2005
Cited by 8 (0 self)
This paper traces some of the recent progress in the field of learning of imbalanced data. It reviews approaches adopted for this problem and it identifies challenges and points out future directions in this relatively new field.

Automatically countering imbalance and its empirical relationship to cost

by Nitesh V. Chawla, David A. Cieslak, Lawrence O. Hall, Ajay Joshi - Data Mining and Knowledge Discovery, 2008
Cited by 7 (4 self)
Abstract not found

Improving Academic Performance Prediction by Dealing with Class Imbalance

by Nguyen Thai-Nghe, Andre Busche, Lars Schmidt-Thieme
Cited by 6 (5 self)
This paper introduces and compares techniques for predicting student performance at university. Recently, researchers have focused on applying machine learning in higher education to help both students and instructors improve their performance. Some previous papers have addressed this problem, but the prediction results were unsatisfactory because of the class imbalance problem, which degrades the classifiers. The purpose of this paper is to tackle class imbalance in order to improve the prediction/classification results, using over-sampling techniques as well as cost-sensitive learning (CSL). The paper shows that the results improve compared with applying only baseline classifiers such as Decision Tree (DT), Bayesian Networks (BN), and Support Vector Machines (SVM) to the original data sets. Keywords: Academic performance; Prediction; Class imbalance; Cost-sensitive.

Handling imbalanced datasets: A review

by Sotiris Kotsiantis, Dimitris Kanellopoulos, Panayiotis Pintelas
Cited by 5 (0 self)
Learning classifiers from imbalanced or skewed datasets is an important topic that arises very often in practical classification problems. In such problems, almost all the instances are labelled as one class, while far fewer instances are labelled as the other class, usually the more important class. Traditional classifiers that seek accurate performance over the full range of instances are not suitable for imbalanced learning tasks, since they tend to classify all the data into the majority class, which is usually the less important class. This paper describes various techniques for handling imbalanced dataset problems. Of course, a single article cannot be a complete review of all the methods and algorithms, yet we hope that the references cited will cover the major theoretical issues, guiding the researcher in interesting research directions and suggesting possible bias combinations that have yet to be explored.

Data Mining in Telecommunications

by Gary Weiss - Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, Kluwer Academic, 2005, 2004
Cited by 4 (0 self)
Telecommunication companies generate a tremendous amount of data. These data include call detail data, which describe the calls that traverse the telecommunication networks; network data, which describe the state of the hardware and software components in the network; and customer data, which describe the telecommunication customers. This chapter describes how data mining can be used to uncover useful information buried within these data sets. Several data mining applications are described, and together they demonstrate that data mining can be used to identify telecommunication fraud, improve marketing effectiveness, and identify network faults.

Why Label when you can Search? Alternatives to Active Learning for Applying Human Resources to Build Classification Models Under Extreme Class Imbalance

by Josh Attenberg
Cited by 4 (0 self)
This paper analyses alternative techniques for deploying low-cost human resources for data acquisition for classifier induction in domains exhibiting extreme class imbalance, where traditional labeling strategies, such as active learning, can be ineffective. Consider the problem of building classifiers to help brands control the content adjacent to their on-line advertisements. Although frequent enough to worry advertisers, objectionable categories are rare in the distribution of impressions encountered by most on-line advertisers; they are so rare that traditional sampling techniques do not find enough positive examples to train effective models. An alternative way to deploy human resources for training-data acquisition is to have them “guide” the learning by searching explicitly for training examples of each class. We show that under extreme skew, even basic techniques for guided learning completely dominate smart (active) strategies for applying human resources to select cases for labeling. Therefore, it is critical to consider the relative cost of search versus labeling, and we demonstrate the tradeoffs for different relative costs. We show that in cost/skew settings where the choice between search and active labeling is equivocal, a hybrid strategy can combine the benefits.

Cell Phone-Based Biometric Identification

by Jennifer R. Kwapisz, Gary M. Weiss, Samuel A. Moore
Cited by 4 (2 self)
Mobile devices are becoming increasingly sophisticated and now incorporate many diverse and powerful sensors. The latest generation of smart phones is especially laden with sensors, including GPS sensors, vision sensors (cameras), audio sensors (microphones), light sensors, temperature sensors, direction sensors (compasses), and acceleration sensors. In this paper we describe and evaluate a system that uses phone-based acceleration sensors, called accelerometers, to identify and authenticate cell phone users. This form of behavioral biometric identification is possible because a person’s movements form a unique signature and this is reflected in the accelerometer data that they generate. To implement our system we collected accelerometer data from thirty-six users as they performed normal daily activities such as walking, jogging, and climbing stairs, aggregated this time series data into examples, and then applied standard classification algorithms to the resulting data to generate predictive models. These models either predict the identity of the individual from the set of thirty-six users, a task we call user identification, or predict whether or not the user is a specific user, a task we call user authentication. This work is notable because it enables identification and authentication to occur unobtrusively, without the users taking any extra actions: all they need to do is carry their cell phones. There are many uses for this work. For example, in environments where sharing may take place, our work can be used to automatically customize a mobile device to a user. It can also be used to provide device security by enabling usage for only specific users and can provide an extra level of identity verification.
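
The step of aggregating time series data into examples could look roughly like the sketch below. The abstract does not give the window length or the features used, so the fixed non-overlapping windows, the per-axis mean and standard deviation, and the name `windows_to_examples` are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def windows_to_examples(samples, window_size):
    """Turn raw (x, y, z) accelerometer readings into feature vectors.

    Hypothetical helper: each full, non-overlapping window of
    `window_size` readings becomes one example holding the mean and
    population standard deviation of every axis. A trailing partial
    window is dropped.
    """
    examples = []
    last_start = len(samples) - window_size
    for start in range(0, last_start + 1, window_size):
        window = samples[start:start + window_size]
        features = []
        # zip(*window) regroups the readings into one tuple per axis.
        for axis_values in zip(*window):
            features.append(mean(axis_values))
            features.append(pstdev(axis_values))
        examples.append(features)
    return examples
```

Each resulting feature vector, paired with the identity of the user who produced it, would form one training example for a standard classification algorithm.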

Improving classifier utility by altering the misclassification cost ratio

by Michelle Ciraco, Michael Rogalewski, Gary Weiss - Proceedings of the KDD-2005 Workshop on Utility-Based Data Mining, 2005
Cited by 3 (1 self)
This paper examines whether classifier utility can be improved by altering the misclassification cost ratio (the ratio of false positive misclassification costs to false negative misclassification costs) associated with two-class datasets. This is evaluated by varying the cost ratio passed into two cost-sensitive learners and then evaluating the results using the actual (or presumed actual) cost information. Our results indicate that a cost ratio other than the true ratio often maximizes classifier utility. Furthermore, by using a hold-out set to identify the “best” cost ratio for learning, we are able to take advantage of this behavior and generate classifiers that outperform the accepted strategy of always using the actual cost information during the learning phase.
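
The hold-out selection described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: instead of retraining a cost-sensitive learner for each candidate ratio, the sketch applies the standard minimum-expected-cost threshold to fixed probability scores, and the names `total_cost`, `predict_with_ratio`, and `best_ratio` are hypothetical.

```python
def total_cost(y_true, y_pred, cost_fp, cost_fn):
    """Total misclassification cost of predictions under the given costs."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp * cost_fp + fn * cost_fn


def predict_with_ratio(scores, ratio):
    """Label an example positive when its score reaches the threshold.

    With ratio = cost_fp / cost_fn, predicting positive has lower
    expected cost when score >= ratio / (1 + ratio).
    """
    threshold = ratio / (1.0 + ratio)
    return [1 if s >= threshold else 0 for s in scores]


def best_ratio(holdout_scores, holdout_labels, candidates,
               true_cost_fp, true_cost_fn):
    """Pick the candidate ratio whose predictions are cheapest on the
    hold-out set when charged at the true costs."""
    return min(
        candidates,
        key=lambda r: total_cost(
            holdout_labels,
            predict_with_ratio(holdout_scores, r),
            true_cost_fp,
            true_cost_fn,
        ),
    )
```

When the scores are poorly calibrated, the ratio selected this way can differ from the true ratio, which is the effect the abstract reports.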