Results 1  10
of
213
Anomaly Detection: A Survey
, 2007
"... Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and c ..."
Abstract

Cited by 540 (5 self)
 Add to MetaCart
(Show Context)
Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the di®erent directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
Learning from imbalanced data
 IEEE Trans. on Knowledge and Data Engineering
, 2009
"... Abstract—With the continuous expansion of data availability in many largescale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decisionm ..."
Abstract

Cited by 260 (6 self)
 Add to MetaCart
(Show Context)
Abstract—With the continuous expansion of data availability in many largescale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decisionmaking processes. Although existing knowledge discovery and data engineering techniques have shown great success in many realworld applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the stateoftheart technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data. Index Terms—Imbalanced learning, classification, sampling methods, costsensitive learning, kernelbased learning, active learning, assessment metrics. Ç
Exploratory UnderSampling for ClassImbalance Learning
"... Undersampling is a classimbalance learning method which uses only a subset of major class examples and thus is very efficient. The main deficiency is that many major class examples are ignored. We propose two algorithms to overcome the deficiency. EasyEnsemble samples several subsets from the majo ..."
Abstract

Cited by 97 (5 self)
 Add to MetaCart
(Show Context)
Undersampling is a classimbalance learning method which uses only a subset of major class examples and thus is very efficient. The main deficiency is that many major class examples are ignored. We propose two algorithms to overcome the deficiency. EasyEnsemble samples several subsets from the major class, trains a learner using each of them, and combines the outputs of those learners. BalanceCascade is similar toEasyEnsemble except that it removes correctly classified major class examples of trained learners from further consideration. Experiments show that both of the proposed algorithms have better AUC scores than many existing classimbalance learning methods. Moreover, they have approximately the same training time as that of undersampling, which trains significantly faster than other methods. 1
Mining multilabel data
 In Data Mining and Knowledge Discovery Handbook
, 2010
"... A large body of research in supervised learning deals with the analysis of singlelabel data, where training examples are associated with a single label λ from a set of disjoint labels L. However, training examples in several application domains are often associated with a set of labels Y ⊆ L. Such d ..."
Abstract

Cited by 92 (9 self)
 Add to MetaCart
(Show Context)
A large body of research in supervised learning deals with the analysis of singlelabel data, where training examples are associated with a single label λ from a set of disjoint labels L. However, training examples in several application domains are often associated with a set of labels Y ⊆ L. Such data are called multilabel.
Costsensitive boosting for classification of imbalanced data
, 2007
"... Classification of data with imbalanced class distribution has posed a significant drawback of the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. The significant difficulty and frequent o ..."
Abstract

Cited by 77 (1 self)
 Add to MetaCart
(Show Context)
Classification of data with imbalanced class distribution has posed a significant drawback of the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. The significant difficulty and frequent occurrence of the class imbalance problem indicate the need for extra research efforts. The objective of this paper is to investigate metatechniques applicable to most classifier learning algorithms, with the aim to advance the classification of imbalanced data. The AdaBoost algorithm is reported as a successful metatechnique for improving classification accuracy. The insight gained from a comprehensive analysis of the AdaBoost algorithm in terms of its advantages and shortcomings in tacking the class imbalance problem leads to the exploration of three costsensitive boosting algorithms, which are developed by introducing cost items into the learning framework of AdaBoost. Further analysis shows that one of the proposed algorithms tallies with the stagewise additive modelling in statistics to minimize the cost exponential loss. These boosting algorithms are also studied with respect to their weighting strategies towards different types of samples, and their effectiveness in identifying rare cases through experiments on several real world medical data sets, where the class imbalance problem prevails.
InputDependent Estimation of Generalization Error under Covariate Shift
 STATISTICS & DECISIONS, VOL.23, NO.4, PP.249–279, 2005
, 2005
"... A common assumption in supervised learning is that the training and test input points follow the same probability distribution. However, this assumption is not fulfilled, e.g., in interpolation, extrapolation, active learning, or classification with imbalanced data. The violation of this assumption— ..."
Abstract

Cited by 61 (32 self)
 Add to MetaCart
A common assumption in supervised learning is that the training and test input points follow the same probability distribution. However, this assumption is not fulfilled, e.g., in interpolation, extrapolation, active learning, or classification with imbalanced data. The violation of this assumption—known as the covariate shift— causes a heavy bias in standard generalization error estimation schemes such as crossvalidation or Akaike’s information criterion, and thus they result in poor model selection. In this paper, we propose an alternative estimator of the generalization error for the squared loss function when training and test distributions are different. The proposed generalization error estimator is shown to be exactly unbiased for finite samples if the learning target function is realizable and asymptotically unbiased in general. We also show that, in addition to the unbiasedness, the proposed generalization error estimator can accurately estimate the difference of the generalization error among different models, which is a desirable property in model selection. Numerical studies show that the proposed method compares favorably with existing model selection methods in regression for extrapolation and in classification with imbalanced data.
A general framework for mining conceptdrifting data streams with skewed distributions
 In Proc. SDM’07
, 2007
"... In recent years, there have been some interesting studies on predictive modeling in data streams. However, most such studies assume relatively balanced and stable data streams but cannot handle well rather skewed (e.g., few positives but lots of negatives) and stochastic distributions, which are typ ..."
Abstract

Cited by 47 (6 self)
 Add to MetaCart
(Show Context)
In recent years, there have been some interesting studies on predictive modeling in data streams. However, most such studies assume relatively balanced and stable data streams but cannot handle well rather skewed (e.g., few positives but lots of negatives) and stochastic distributions, which are typical in many data stream applications. In this paper, we propose a new approach to mine data streams by estimating reliable posterior probabilities using an ensemble of models to match the distribution over undersamples of negatives and repeated samples of positives. We formally show some interesting and important properties of the proposed framework, e.g., reliability of estimated probabilities on skewed positive class, accuracy of estimated probabilities, efficiency and scalability. Experiments are performed on several synthetic as well as realworld datasets with skewed distributions, and they demonstrate that our framework has substantial advantages over existing approaches in estimation reliability and predication accuracy. 1
SVMs Modeling for Highly Imbalanced Classification
, 2009
"... Traditional classification algorithms can be limited in their performance on highly unbalanced data sets. A popular stream of work for countering the problem of class imbalance has been the application of a sundry of sampling strategies. In this correspondence, we focus on designing modifications to ..."
Abstract

Cited by 44 (0 self)
 Add to MetaCart
(Show Context)
Traditional classification algorithms can be limited in their performance on highly unbalanced data sets. A popular stream of work for countering the problem of class imbalance has been the application of a sundry of sampling strategies. In this correspondence, we focus on designing modifications to support vector machines (SVMs) to appropriately tackle the problem of class imbalance. We incorporate different “rebalance ” heuristics in SVM modeling, including costsensitive learning, and over and undersampling. These SVMbased strategies are compared with various stateoftheart approaches on a variety of data sets by using various metrics, including Gmean, area under the receiver operating characteristic curve, Fmeasure, and area under the precision/recall curve. We show that we are able to surpass or match the previously known best algorithms on each data set. In particular, of the four SVM variations considered in this correspondence, the novel granular SVMs–repetitive undersampling algorithm (GSVMRU) is the best in terms of both effectiveness and efficiency. GSVMRU is effective, as it can minimize the negative effect of information loss while maximizing the positive effect of data cleaning in the undersampling process. GSVMRU is efficient by extracting much less support vectors and, hence, greatly speeding up SVM prediction.
Link Prediction via Matrix Factorization
"... Abstract. We propose to solve the link prediction problem in graphs using a supervised matrix factorization approach. The model learns latent features from the topological structure of a (possibly directed) graph, and is shown to make better predictions than popular unsupervised scores. We show how ..."
Abstract

Cited by 39 (0 self)
 Add to MetaCart
(Show Context)
Abstract. We propose to solve the link prediction problem in graphs using a supervised matrix factorization approach. The model learns latent features from the topological structure of a (possibly directed) graph, and is shown to make better predictions than popular unsupervised scores. We show how these latent features may be combined with optional explicit features for nodes or edges, which yields better performance than using either type of feature exclusively. Finally, we propose a novel approach to address the class imbalance problem which is common in link prediction by directly optimizing for a ranking loss. Our model is optimized with stochastic gradient descent and scales to large graphs. Results on several datasets show the efficacy of our approach.
Boosting for learning multiple classes with imbalanced class distribution
 In 2006 IEEE International Conference on Data Mining (accepted), HongKong
, 2006
"... Classification of data with imbalanced class distribution has posed a significant drawback of the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. This learning difficulty attracts a lot o ..."
Abstract

Cited by 35 (1 self)
 Add to MetaCart
(Show Context)
Classification of data with imbalanced class distribution has posed a significant drawback of the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. This learning difficulty attracts a lot of research interests. Most efforts concentrate on biclass problems. However, biclass is not the only scenario where the class imbalance problem prevails. Reported solutions for biclass applications are not applicable to multiclass problems. In this paper, we develop a costsensitive boosting algorithm to improve the classification performance of imbalanced data involving multiple classes. One barrier of applying the costsensitive boosting algorithm to the imbalanced data is that the cost matrix is often unavailable for a problem domain. To solve this problem, we apply Genetic Algorithm to search the optimum cost setup of each class. Empirical tests show that the proposed costsensitive boosting algorithm improves the classification performances of imbalanced data sets significantly. 1