Results 1 - 10 of 23
Learning from imbalanced data - IEEE Trans. on Knowledge and Data Engineering, 2009
Cited by 260 (6 self)
Abstract—With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data. Index Terms—Imbalanced learning, classification, sampling methods, cost-sensitive learning, kernel-based learning, active learning, assessment metrics.
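The review above surveys assessment metrics for imbalanced learning. As a minimal illustration (not the survey's own code), two of the metrics it discusses, the F-measure and the G-mean, can be computed directly from binary confusion counts:

```python
def imbalance_metrics(tp, fp, fn, tn):
    """Compute recall (sensitivity), specificity, precision,
    F-measure and G-mean from binary confusion counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0        # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    # G-mean balances accuracy on both classes, unlike plain accuracy
    g_mean = (recall * specificity) ** 0.5
    return {"recall": recall, "specificity": specificity,
            "precision": precision, "f_measure": f_measure,
            "g_mean": g_mean}

# A skewed example: 10 positives vs. 990 negatives
m = imbalance_metrics(tp=8, fp=20, fn=2, tn=970)
```

Note how plain accuracy here would be about 98%, while the minority-sensitive metrics reveal a much weaker classifier.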
SVMs Modeling for Highly Imbalanced Classification, 2009
Cited by 44 (0 self)
Traditional classification algorithms can be limited in their performance on highly unbalanced data sets. A popular stream of work for countering the problem of class imbalance has been the application of a variety of sampling strategies. In this correspondence, we focus on designing modifications to support vector machines (SVMs) to appropriately tackle the problem of class imbalance. We incorporate different “rebalance” heuristics in SVM modeling, including cost-sensitive learning, and over- and undersampling. These SVM-based strategies are compared with various state-of-the-art approaches on a variety of data sets by using several metrics, including G-mean, area under the receiver operating characteristic curve, F-measure, and area under the precision/recall curve. We show that we are able to surpass or match the previously known best algorithms on each data set. In particular, of the four SVM variations considered in this correspondence, the novel granular SVMs–repetitive undersampling algorithm (GSVM-RU) is the best in terms of both effectiveness and efficiency. GSVM-RU is effective, as it can minimize the negative effect of information loss while maximizing the positive effect of data cleaning in the undersampling process. GSVM-RU is efficient because it extracts far fewer support vectors and, hence, greatly speeds up SVM prediction.
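The cost-sensitive rebalancing heuristic mentioned above can be sketched as a linear SVM whose hinge loss weights minority-class errors more heavily. This is an illustrative subgradient implementation under assumed settings, not the paper's GSVM-RU algorithm:

```python
import numpy as np

def weighted_linear_svm(X, y, c_pos=10.0, c_neg=1.0, lam=0.01,
                        epochs=200, lr=0.1):
    """Cost-sensitive linear SVM trained by subgradient descent on the
    class-weighted hinge loss: errors on the (rare) positive class are
    penalised c_pos/c_neg times more heavily. y must be in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    cost = np.where(y > 0, c_pos, c_neg)
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                      # hinge-active examples
        grad_w = lam * w - (cost[viol, None] * y[viol, None] * X[viol]).sum(0) / n
        grad_b = -(cost[viol] * y[viol]).sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy imbalanced data: 5 positives, 50 negatives
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 0.5, (5, 2)), rng.normal(0.0, 0.5, (50, 2))])
y = np.array([1] * 5 + [-1] * 50)
w, b = weighted_linear_svm(X, y)
pred = np.sign(X @ w + b)
```

The cost ratio shifts the decision boundary toward the majority class, trading a little majority accuracy for minority recall.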
ADASYN: Adaptive synthetic sampling approach for imbalanced learning - In: IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IJCNN 2008
Cited by 26 (1 self)
This paper presents a novel adaptive synthetic (ADASYN) sampling approach for learning from imbalanced data sets. The essential idea of ADASYN is to use a weighted distribution for different minority class examples according to their level of difficulty in learning, where more synthetic data is generated for minority class examples that are harder to learn compared to those minority examples that are easier to learn. As a result, the ADASYN approach improves learning with respect to the data distributions in two ways: (1) reducing the bias introduced by the class imbalance, and (2) adaptively shifting the classification decision boundary toward the difficult examples. Simulation analyses on several machine learning data sets show the effectiveness of this method across five evaluation metrics.
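A minimal sketch of the ADASYN idea described above (illustrative, not the authors' reference implementation): the number of synthetic points generated for each minority example is proportional to the fraction of majority points among its k nearest neighbours, so harder regions receive more synthetic data:

```python
import numpy as np

def adasyn(X_min, X_maj, beta=1.0, k=5, seed=0):
    """Minimal ADASYN sketch: generate more synthetic points for minority
    examples whose k nearest neighbours contain more majority examples
    (i.e. harder-to-learn regions). Returns the synthetic samples."""
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.array([False] * len(X_min) + [True] * len(X_maj))
    G = int((len(X_maj) - len(X_min)) * beta)   # total synthetics to create

    # r_i: majority fraction among the k nearest neighbours of minority x_i
    r = np.empty(len(X_min))
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_all - x, axis=1)
        nn = np.argsort(d)[1:k + 1]             # skip the point itself
        r[i] = is_maj[nn].mean()
    r_hat = r / r.sum() if r.sum() > 0 else np.full(len(X_min), 1 / len(X_min))
    g = np.rint(r_hat * G).astype(int)          # per-example synthetic counts

    synth = []
    for i, gi in enumerate(g):
        # interpolate only toward minority-class neighbours
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]
        for _ in range(gi):
            j = rng.choice(nn)
            lam = rng.random()
            synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synth) if synth else np.empty((0, X_min.shape[1]))

rng = np.random.default_rng(2)
X_min = rng.normal(0.0, 1.0, (10, 2))
X_maj = rng.normal(1.5, 1.0, (50, 2))
S = adasyn(X_min, X_maj)
```

Because each synthetic point interpolates between two minority examples, all synthetics lie within the minority class's bounding box.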
On the Classification of Imbalanced Datasets
Cited by 2 (0 self)
In recent research, the classification of imbalanced data sets has received considerable attention. Due to the class imbalance, the classifier naturally tends to favour the majority class. In this paper we investigate the performance of different methods for handling data imbalance in microcalcification classification, a classical example of the data imbalance problem. Microcalcifications are very tiny deposits of calcium that appear as small bright spots in the mammogram. Classification of microcalcification clusters from mammograms plays an important role in computer-aided diagnosis for early detection of breast cancer. In this paper, we briefly review the state-of-the-art techniques in the framework of imbalanced data sets and investigate the performance of different methods for microcalcification classification.
Constructing Training Sets for Outlier Detection
Cited by 1 (0 self)
Outlier detection often works in an unsupervised manner due to the difficulty of obtaining enough training data. Since outliers are rare, one has to label a very large dataset to include enough outliers in the training set, with which classifiers could sufficiently learn the concept of outliers. Labeling a large training set is costly for most applications. However, we could instead label only the suspected instances identified by unsupervised methods. In this way, the number of instances to be labeled could be greatly reduced. Based on this idea, we propose CISO, an algorithm Constructing training sets by Identifying Suspected Outliers. In this algorithm, instances in a pool are first ranked by an unsupervised outlier detection algorithm. Then, suspected instances are selected and hand-labeled, and all remaining instances receive the label of inlier. As such, all instances in the pool are labeled and used in the training set. We also propose Budgeted CISO (BCISO), with which the user can set a fixed budget for labeling. Experiments show that both algorithms achieve good performance compared to other methods when the same amount of labeling effort is used.
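The CISO procedure described above can be sketched as follows. The k-th-nearest-neighbour distance used as the unsupervised ranker is an assumption (the paper allows any unsupervised outlier detector), and `oracle` stands in for the human labeler:

```python
import numpy as np

def ciso(X, oracle, n_suspected, k=5):
    """CISO sketch: rank instances by an unsupervised outlier score
    (here, distance to the k-th nearest neighbour), hand-label only the
    top n_suspected via `oracle`, and label everything else inlier (0)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    score = np.sort(D, axis=1)[:, k]            # k-th NN distance (col 0 is self)
    suspected = np.argsort(score)[::-1][:n_suspected]
    y = np.zeros(n, dtype=int)                  # default label: inlier
    for i in suspected:
        y[i] = oracle(i)                        # costly hand label, only here
    return y, suspected

# A tight cluster plus three obvious outliers; the "oracle" knows the truth
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (40, 2)),
               np.array([[5.0, 5.0], [-4.0, 6.0], [6.0, -5.0]])])
true_out = {40, 41, 42}
y, suspected = ciso(X, lambda i: int(i in true_out), n_suspected=5)
```

Only 5 of the 43 instances needed hand labels, yet every instance in the pool ends up labeled for training.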
Building Accurate Classifier for the Classification of Microcalcification
Cited by 1 (1 self)
Abstract — The most common life-threatening type of cancer affecting women is breast cancer. Mammography is an effective screening tool for breast cancer. For mammograms, a CAD system acts like a spell checker. CAD systems use digital image processing techniques to improve the detection performance and efficiency of mammography screening. The two most common features associated with cancers are clusters of microcalcifications and masses. Several factors, such as the small size of microcalcifications, their low brightness relative to the background, and their superimposition on textures, make the detection of microcalcifications difficult. This paper proposes a methodology for the classification of microcalcification in mammograms. An improved classifier that introduces balanced learning for the accurate classification of microcalcification is proposed as one of the main steps in the methodology. The experiments are conducted on samples collected from the well-known MIAS database, and the proposed method outperforms other methods in the classification of microcalcification.
Predictions for Biomedical Decision Support, 2010
Cited by 1 (1 self)
Keywords: reliability diagram, adaptive learning, structured learning, maximum margin optimization, convex optimization. Medications designed for a general population do not work the same for each individual. Similarly, patterns observed from naturally occurring disease outbreaks do not necessarily describe purposeful disease outbreaks (e.g., bioterrorism). To tackle the challenges posed by individual differences, my thesis introduces data-driven paradigms that predict whether a particular case will have the outcome of interest. My insight is to accommodate individual differences by coherently leveraging information from complementary perspectives (e.g., temporal dependency, relational correlation, feature similarity, and estimation uncertainty) to provide more reliable predictions than is possible with existing cohort-based approaches. Specifically, I carefully investigated two representative problems, bioterrorism-related disease outbreak and personalized clinical decision support, for which previous research does not provide satisfactory solutions. I developed a Temporal Maximum Margin Markov Network framework to consider the temporal correlation concurrently with relational dependency in bioterrorism-related disease outbreaks. This framework reduces the ambiguity in estimating
Symmetric RBF Classifier for Nonlinear Detection in Multiple-Antenna-Aided Systems, 2008
Cited by 1 (0 self)
In this paper, we propose a powerful symmetric radial basis function (RBF) classifier for nonlinear detection in the so-called “overloaded” multiple-antenna-aided communication systems. By exploiting the inherent symmetry property of the optimal Bayesian detector, the proposed symmetric RBF classifier is capable of approaching the optimal classification performance using noisy training data. The classifier construction process is robust to the choice of the RBF width and is computationally efficient. The proposed solution is capable of providing a signal-to-noise ratio (SNR) gain in excess of 8 dB against the powerful linear minimum bit error rate (BER) benchmark, when supporting four users with the aid of two receive antennas or seven users with four receive antenna elements.
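The symmetry exploited above can be illustrated with a small sketch (an assumed formulation, not the paper's exact detector): building each basis response as φ(x − c) − φ(x + c) makes the classifier output an odd function, matching the sign symmetry of the optimal Bayesian detector:

```python
import numpy as np

def symmetric_rbf_predict(X, centers, alpha, width):
    """Symmetric RBF sketch: each basis response is the difference
    phi(x - c) - phi(x + c), which enforces the odd symmetry
    f(-x) = -f(x) of the optimal detector by construction."""
    def phi(D):
        # Gaussian RBF on the pairwise difference vectors
        return np.exp(-np.sum(D ** 2, axis=-1) / (2 * width ** 2))
    G = phi(X[:, None] - centers[None]) - phi(X[:, None] + centers[None])
    return G @ alpha

# The symmetry holds for any weights, centers, and inputs
rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 2))
alpha = rng.normal(size=4)
X = rng.normal(size=(10, 2))
f_pos = symmetric_rbf_predict(X, centers, alpha, width=1.0)
f_neg = symmetric_rbf_predict(-X, centers, alpha, width=1.0)
```

Because the symmetry is built into the basis rather than learned, half the effective parameters are saved, which is why such a classifier can approach the optimal detector from noisy training data.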
Orthogonal-Least-Squares Forward Selection for Parsimonious Modelling from Data, 2009
The objective of modelling from data is not that the model simply fits the training data well. Rather, the goodness of a model is characterized by its generalization capability, interpretability and ease of knowledge extraction. All these desired properties depend crucially on the ability of the modelling process to construct appropriate parsimonious models, and a basic principle in practical nonlinear data modelling is the parsimonious principle of ensuring the smallest possible model that explains the training data. There exists a vast body of work in the area of sparse modelling, and a widely adopted approach is based on linear-in-the-parameters data modelling, which includes the radial basis function network, the neurofuzzy network and all the sparse kernel modelling techniques. A well-tested strategy for parsimonious modelling from data is the orthogonal least squares (OLS) algorithm for forward selection modelling, which is capable of constructing sparse models that generalise well. This contribution continues this theme and provides a unified framework for sparse modelling from data that includes regression and classification, which belong to supervised learning, and probability density function estimation, which is an unsupervised learning problem. The OLS forward selection method based on leave-one-out test criteria is presented within this unified data-modelling framework. Examples from regression, classification and density estimation applications are used to illustrate the effectiveness of this generic approach to parsimonious modelling from data.
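A compact sketch of OLS forward selection as described above (illustrative; the paper's leave-one-out criteria are replaced here by the simpler classical error-reduction ratio):

```python
import numpy as np

def ols_forward_select(P, y, n_terms):
    """Greedy OLS forward selection sketch: at each step, pick the
    candidate regressor (column of P) whose orthogonalised component
    explains the largest share of the output energy (the classical
    error-reduction ratio), then deflate the remaining candidates."""
    P = P.astype(float).copy()
    selected = []
    for _ in range(n_terms):
        err = np.zeros(P.shape[1])
        for j in range(P.shape[1]):
            if j in selected:
                continue                        # already in the model
            w = P[:, j]
            denom = w @ w
            if denom > 1e-12:
                err[j] = (w @ y) ** 2 / (denom * (y @ y))
        j_best = int(np.argmax(err))
        selected.append(j_best)
        w = P[:, j_best].copy()
        # Gram-Schmidt deflation of the unselected candidates
        for j in range(P.shape[1]):
            if j not in selected:
                P[:, j] -= (w @ P[:, j]) / (w @ w) * w
    return selected

rng = np.random.default_rng(0)
P = rng.normal(size=(100, 8))
y = P[:, 2] + 0.5 * P[:, 5]            # true model uses columns 2 and 5
sel = ols_forward_select(P, y, n_terms=2)
```

The deflation step is what makes the procedure "orthogonal": each new candidate is scored only on the output energy not already explained by the selected terms.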
Kernel-Matching Pursuits With Arbitrary Loss Functions
Abstract—The purpose of this research is to develop a classifier capable of state-of-the-art performance in both computational efficiency and generalization ability while allowing the algorithm designer to choose arbitrary loss functions as appropriate for a given problem domain. This is critical in applications involving heavily imbalanced, noisy, or non-Gaussian distributed data. To achieve this goal, a kernel-matching pursuit (KMP) framework is formulated where the objective is margin maximization rather than the standard error minimization. This approach enables excellent performance and computational savings in the presence of large, imbalanced training data sets and facilitates the development of two general algorithms. These algorithms support the use of arbitrary loss functions, allowing the algorithm designer to control the degree to which outliers are penalized and the manner in which non-Gaussian distributed data is handled. Example loss functions are provided, and algorithm performance is illustrated in two groups of experimental results. The first group demonstrates that the proposed algorithms perform equivalently to several state-of-the-art machine learning algorithms on well-published, balanced data. The second group of results illustrates superior performance by the proposed algorithms on imbalanced, non-Gaussian data, achieved by employing loss functions appropriate for the data characteristics and problem domain. Index Terms—Boosting, imbalanced data, iteratively reweighted least squares, kernel machines, kernel-matching pursuits (KMPs), margin maximization, robust classification, robust statistics, unbalanced data.
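The KMP-with-arbitrary-loss idea can be sketched as follows; the function names and the gradient-descent back-fitting are illustrative assumptions, not the paper's exact algorithms:

```python
import numpy as np

def kmp_fit(X, y, loss_grad, n_basis=5, gamma=1.0, lr=0.05, steps=200):
    """Greedy kernel-matching-pursuit sketch with a pluggable loss.
    At each round, the RBF basis function (centred on a training point)
    most correlated with the negative loss gradient is added, then all
    coefficients are refit by gradient descent on the chosen loss.
    loss_grad(f, y) returns dLoss/df; squared and robust losses
    both fit this interface."""
    K = np.exp(-gamma * np.linalg.norm(X[:, None] - X[None, :], axis=-1) ** 2)
    centers, alpha = [], []
    f = np.zeros(len(y))
    for _ in range(n_basis):
        g = loss_grad(f, y)                     # current functional gradient
        corr = np.abs(K.T @ g)
        corr[centers] = -np.inf                 # don't reuse a centre
        c = int(np.argmax(corr))
        centers.append(c)
        alpha.append(0.0)
        A = np.array(alpha)
        # back-fit all coefficients by gradient descent on the loss
        for _ in range(steps):
            f = K[:, centers] @ A
            A -= lr * K[:, centers].T @ loss_grad(f, y) / len(y)
        alpha = list(A)
        f = K[:, centers] @ np.array(alpha)
    return centers, np.array(alpha)

# Fit a 1-D sine with a squared loss; swapping loss_grad changes robustness
X = np.linspace(0, 3, 30).reshape(-1, 1)
y = np.sin(2 * X[:, 0])
sq_grad = lambda f, t: f - t                    # squared-loss gradient
centers, alpha = kmp_fit(X, y, sq_grad)
K = np.exp(-np.linalg.norm(X[:, None] - X[None, :], axis=-1) ** 2)
mse = np.mean((K[:, centers] @ alpha - y) ** 2)
```

Because only `loss_grad` touches the residuals, the designer can substitute a bounded or asymmetric loss to down-weight outliers or penalise minority-class errors more heavily, which is the paper's central point.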