Probabilistic Noise Identification and Data Cleaning
- In Proceedings of the International Conference on Data Mining (ICDM), 2002
- Cited by 16 (0 self)
Real world data is never as perfect as we would like it to be and can often suffer from corruptions that may impact interpretations of the data, models created from the data, and decisions made based on the data. One approach to this problem is to identify and remove records that contain corruptions. Unfortunately, if only certain fields in a record have been corrupted then usable, uncorrupted data will be lost. In this paper we present LENS, an approach for identifying corrupted fields and using the remaining noncorrupted fields for subsequent modeling and analysis. Our approach uses the data to learn a probabilistic model containing three components: a generative model of the clean records, a generative model of the noise values, and a probabilistic model of the corruption process. We provide an algorithm for the unsupervised discovery of such models and empirically evaluate both its performance at detecting corrupted fields and, as one example application, the resulting improvement this gives to a classifier.
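The three components named in the abstract (a clean-value model, a noise-value model, and a corruption process) can be combined with Bayes' rule to score a single field. The following is a minimal sketch under loudly stated assumptions, not the LENS algorithm itself: the Gaussian clean model, the uniform noise model, the fixed prior `p_corrupt`, and all function names are hypothetical, and the parameters are given by hand rather than learned from the data.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution; stands in for the clean-value model (assumption)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def corruption_posterior(x, clean_mean, clean_std, noise_lo, noise_hi, p_corrupt):
    """P(field is corrupted | observed value x) via Bayes' rule.

    Assumptions (not from the paper): clean values are Gaussian, noise values
    are uniform on [noise_lo, noise_hi], and each field is corrupted
    independently with prior probability p_corrupt.
    """
    p_clean = gaussian_pdf(x, clean_mean, clean_std)
    p_noise = 1.0 / (noise_hi - noise_lo) if noise_lo <= x <= noise_hi else 0.0
    numerator = p_corrupt * p_noise
    denominator = numerator + (1.0 - p_corrupt) * p_clean
    # If both densities are zero the value is unexplained; report 0.0 by convention.
    return numerator / denominator if denominator > 0 else 0.0

# A value near the clean mean scores low; a value far from it scores high.
typical = corruption_posterior(0.0, 0.0, 1.0, -100.0, 100.0, 0.1)
extreme = corruption_posterior(50.0, 0.0, 1.0, -100.0, 100.0, 0.1)
```

Fields whose posterior exceeds a chosen threshold would be treated as corrupted, while the other fields of the same record remain usable.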
Eliminating class noise in large datasets
- In Proceedings of the International Conference on Machine Learning (ICML 2003), 2003
- Cited by 16 (3 self)
This paper presents a new approach for identifying and eliminating mislabeled instances in large or distributed datasets. We first partition a dataset into subsets, each of which is small enough to be processed by an induction algorithm at one time. We construct good rules from each subset, and use the good rules to evaluate the whole dataset. For a given instance I_k, two error count variables are used to count the number of times it has been identified as noise by all subsets. Instances with higher error values have a higher probability of being mislabeled. Two threshold schemes, majority and non-objection, are used to identify the noise. Experimental results and comparative studies from real-world datasets are reported to evaluate the effectiveness and efficiency of the proposed approach.
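The partition-and-vote idea can be sketched as follows. This is a simplified illustration, not the authors' method: a nearest-centroid classifier on one numeric feature stands in for the rule induction step, a single error counter replaces the paper's two counters, and the readings of "majority" (more than half of the subsets disagree with the label) and "non-objection" (every subset disagrees) are assumptions. All names are hypothetical.

```python
from collections import defaultdict

def train_centroids(subset):
    """Stand-in learner (assumption): per-class mean of a single numeric feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in subset:
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def flag_noise(data, n_parts, scheme="majority"):
    """Return indices of instances flagged as mislabeled.

    data is a list of (feature, label) pairs. Each partition trains its own
    model, and every model then votes on every instance in the whole dataset.
    """
    parts = [data[i::n_parts] for i in range(n_parts)]  # round-robin partition
    models = [train_centroids(p) for p in parts]
    flagged = []
    for idx, (x, y) in enumerate(data):
        errors = sum(predict(m, x) != y for m in models)  # single error count
        if scheme == "majority":
            is_noise = errors > n_parts / 2
        else:  # "non-objection": no model agrees with the given label
            is_noise = errors == n_parts
        if is_noise:
            flagged.append(idx)
    return flagged

data = [(0.0, "a"), (0.5, "a"), (1.0, "a"),
        (10.0, "b"), (10.5, "b"), (11.0, "b"),
        (0.2, "b"),  # deliberately mislabeled: the value sits with class "a"
        (0.8, "a"), (10.2, "b")]
majority_flags = flag_noise(data, 3, "majority")
strict_flags = flag_noise(data, 3, "non-objection")
```

The non-objection scheme is the more conservative of the two, since a single agreeing model is enough to keep an instance.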
Class Noise Mitigation through Instance Weighting
- Cited by 8 (2 self)
We describe a novel framework for class noise mitigation that assigns a vector of class membership probabilities to each training instance, and uses the confidence on the current label as a weight during training. The probability vector should be calculated such that clean instances have a high confidence on their current labels, while mislabeled instances have a low confidence on their current labels and a high confidence on their correct labels. Past research focuses on techniques that either discard or correct instances. This paper proposes that discarding and correcting are special cases of instance weighting, and thus part of this framework. We propose a method that uses clustering to calculate a probability distribution over the class labels for each instance. We demonstrate that our method improves classifier accuracy over the original training set. We also demonstrate that instance weighting can outperform discarding.
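One way to read "confidence on the current label" is as the share of an instance's cluster that carries the same label. The sketch below assumes cluster assignments are already available and uses plain within-cluster label frequencies; the paper's actual probability estimate may differ, and the function names and the 0.5 threshold are hypothetical.

```python
from collections import Counter, defaultdict

def label_confidence(cluster_ids, labels):
    """Weight each instance by the fraction of its cluster sharing its label.

    Assumption: clustering was done beforehand; cluster_ids[i] is the cluster
    of instance i and labels[i] is its (possibly noisy) class label.
    """
    by_cluster = defaultdict(Counter)
    for c, y in zip(cluster_ids, labels):
        by_cluster[c][y] += 1
    weights = []
    for c, y in zip(cluster_ids, labels):
        counts = by_cluster[c]
        weights.append(counts[y] / sum(counts.values()))
    return weights

def discard_weights(weights, threshold=0.5):
    """Discarding as a special case of weighting: every weight becomes 0 or 1."""
    return [1 if w >= threshold else 0 for w in weights]

cluster_ids = [0, 0, 0, 0, 1, 1]
labels = ["a", "a", "a", "b", "b", "b"]
weights = label_confidence(cluster_ids, labels)
kept = discard_weights(weights)
```

Here the fourth instance, a lone "b" in a cluster dominated by "a", receives a low weight, and hard discarding is recovered by thresholding the same weights.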
Cost-guided Class Noise Handling for Effective Cost-sensitive Learning
- Proc. of the ICDM Conf., 2004
- Cited by 6 (1 self)
Recent research in machine learning, data mining and related areas has produced a wide variety of algorithms for cost-sensitive (CS) classification, where instead of maximizing the classification accuracy, minimizing the misclassification cost becomes the objective. However, these methods assume that training sets do not contain significant noise, which is rarely the case in real-world environments. In this paper, we systematically study the impacts of class noise on CS learning, and propose a cost-guided class noise handling algorithm to identify noise for effective CS learning. We call it Cost-guided Iterative Classification Filter (CICF), because it seamlessly integrates costs and an existing Classification Filter [1] for noise identification. Unlike existing efforts, which give equal weight to handling noise in all classes, CICF puts more emphasis on expensive classes, which makes it especially successful in dealing with datasets with a large cost-ratio. Experimental results and comparative studies from real-world datasets indicate that the existence of noise may seriously corrupt the performance of CS classifiers, and that by adopting the proposed CICF algorithm, we can significantly reduce the misclassification cost of a CS classifier in noisy environments.
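The cost-sensitive objective mentioned at the start of the abstract can be illustrated with the standard minimum-expected-cost decision rule. This sketch shows only that objective, not CICF or its noise filter; the class names, probabilities, and cost values are made up for illustration.

```python
def min_cost_class(probs, cost):
    """Pick the prediction with the lowest expected misclassification cost.

    probs maps each class to its estimated probability for one instance.
    cost[true][pred] is the cost of predicting pred when the truth is true.
    Both structures are hypothetical inputs for this illustration.
    """
    classes = list(probs)

    def expected_cost(pred):
        return sum(probs[true] * cost[true][pred] for true in classes)

    return min(classes, key=expected_cost)

probs = {"pos": 0.2, "neg": 0.8}
# Missing a positive is ten times as expensive as a false alarm.
skewed = {"pos": {"pos": 0, "neg": 10}, "neg": {"pos": 1, "neg": 0}}
# Equal costs reduce the rule to picking the most probable class.
equal = {"pos": {"pos": 0, "neg": 1}, "neg": {"pos": 1, "neg": 0}}

cost_sensitive_choice = min_cost_class(probs, skewed)
accuracy_choice = min_cost_class(probs, equal)
```

With a large cost ratio the less probable but more expensive class wins, which is why label noise in expensive classes matters more under this objective.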
Dealing with predictive-but-unpredictable attributes in noisy data sources
- In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 04), 2004
- Cited by 4 (1 self)
Attribute noise can affect classification learning. Previous work in handling attribute noise has focused on those predictable attributes that can be predicted by the class and other attributes. However, attributes can often be predictive but unpredictable. Being predictive, they are essential to classification learning and it is important to handle their noise. Being unpredictable, they require strategies different from those of predictable attributes. This paper presents a study on identifying, cleansing and measuring noise for predictive-but-unpredictable attributes. New strategies are accordingly proposed. Both theoretical analysis and empirical evidence suggest that these strategies are more effective and more efficient than previous alternatives.
Distinguishing Mislabeled Data from Correctly Labeled Data in Classifier Design
- In 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'04), Boca Raton, FL, 2004
- Cited by 3 (0 self)
We have developed a method for distinguishing between correctly labeled and mislabeled data sampled from video sequences and used in the construction of a facial expression recognition classifier. The novelty of our approach lies in training a single, optimal classifier type (a Support Vector Machine, or SVM) on multiple representations of the data, involving different "discriminating" subspaces. Results of a preliminary study on the discrimination of "high stress" vs. "low stress" facial expression data by this method confirm that our novel approach is able to distinguish subproblems where labeling is highly reliable from those where mislabeling can lead to high error rates. In helping detect data sub-samples which yield misleading classification results, the method is also a rapid, highly efficient cross-validated approach for eliminating outliers.
Fast distributed outlier detection in mixed-attribute data sets
- Data Min. Knowl. Discov., 2006
- Cited by 2 (0 self)
Efficiently detecting outliers or anomalies is an important problem in many areas of science, medicine and information technology. Applications range from data cleaning to clinical diagnosis, from detecting anomalous defects in materials to fraud and intrusion detection. Over the past decade, researchers in data mining and statistics have addressed the problem of outlier detection using both parametric and non-parametric approaches in a centralized setting. However, there are several challenges that must still be addressed. First, most approaches to date have focused on detecting outliers in a continuous attribute space. However, almost all real-world data sets contain a mixture of categorical and continuous attributes. The categorical attributes are typically ignored or incorrectly modeled by existing approaches, resulting in a significant loss of information. Second, there have not been any general-purpose distributed outlier detection algorithms. Most distributed detection algorithms are designed with a specific domain (e.g. sensor networks) in mind. Third, the data sets being analyzed may be streaming or otherwise dynamic in nature. Such data sets are prone to concept drift, and models of the data must be dynamic as well. To address these challenges, we present a tunable algorithm for distributed outlier detection in mixed-attribute data sets.
Inconsistency Tests for Patient Records in a Coronary Heart Disease Database
- 2000
- Cited by 1 (0 self)
The work presents the results of inconsistency detection experiments on the data records of an atherosclerotic coronary heart disease database collected in regular medical practice. Medical expert evaluation of some preliminary inductive learning results has demonstrated that explicit detection of outliers can be useful for maintaining the data quality of medical records and that it might be a key to improving medical decisions and their reliability in regular medical practice. With the intention of on-line detection of possible data inconsistencies, sets of confirmation rules have been developed for the database and their test results are reported in this work.
1 Introduction
The motivation for the research presented in this work stems from the fact that modern medical decision processes are generally based on patient data from many different sources, which are typically collected and archived by multiterminal or distributed computer systems. Such organization e...
Kernel Methods for Anomaly Detection and Noise Elimination
- Cited by 1 (0 self)
A kernel-based algorithm for useful-anomaly detection and noise elimination is introduced. The algorithm's objective is to improve data quality by correcting wrong observations while leaving intact the correct ones. The proposed algorithm is based on a process that we called "Re-Measurement" and it is oriented to datasets that might contain both kinds of rare objects: noise and useful anomalies. Two versions of
Error Detection and Impact-Sensitive Instance Ranking in Noisy Datasets
- Cited by 1 (0 self)
Given a noisy dataset, how to locate erroneous instances and attributes and rank suspicious instances based on their impacts on the system performance is an interesting and important research issue. We provide in this paper an Error Detection and Impact-sensitive instance Ranking (EDIR) mechanism to address this problem. Given a noisy dataset D, we first train a benchmark classifier T from D. The instances that cannot be effectively classified by T are treated as suspicious and forwarded to a subset S. For each attribute A_i, we switch A_i and the class label C to train a classifier AP_i for A_i. Given an instance I_k in S, we use AP_i and the benchmark classifier T to locate the erroneous value of each attribute A_i. To quantitatively rank instances in S, we define an impact measure based on the Information-gain Ratio (IR). We calculate IR_i between attribute A_i and C, and use IR_i as the impact-sensitive weight of A_i. The sum of impact-sensitive weights from all located erroneous attributes of I_k indicates its total impact value. The experimental results demonstrate the effectiveness of our strategies.
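The ranking step at the end of the abstract can be sketched as follows. Only the weighting and summation are shown; the error-location step is assumed to have already produced a list of erroneous attributes for each suspicious instance. The gain ratio is computed in the usual way for categorical attributes (information gain divided by split information), which may differ in detail from the paper's definition, and all names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(attr, cls):
    """Information-gain ratio between a categorical attribute and the class."""
    n = len(attr)
    conditional = 0.0
    for v in set(attr):
        subset = [c for a, c in zip(attr, cls) if a == v]
        conditional += len(subset) / n * entropy(subset)
    split_info = entropy(attr)
    return (entropy(cls) - conditional) / split_info if split_info > 0 else 0.0

def rank_by_impact(suspects, weights):
    """Sort suspicious instances by the summed weights of their erroneous attributes.

    suspects maps an instance id to the attributes located as erroneous
    (assumed to come from an earlier error-detection step).
    """
    scored = {k: sum(weights[a] for a in attrs) for k, attrs in suspects.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

cls = [0, 0, 1, 1]
weights = {
    "A1": gain_ratio(["x", "x", "y", "y"], cls),  # determines the class
    "A2": gain_ratio(["x", "y", "x", "y"], cls),  # carries no class information
    "A3": 0.4,                                    # hand-picked toy weight
}
suspects = {"I1": ["A2"], "I2": ["A1", "A3"], "I3": ["A3"]}
ranking = rank_by_impact(suspects, weights)
```

An error in an attribute that carries no class information contributes nothing to the impact value, so instance I1 falls to the bottom of the ranking.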

