Results 1–10 of 18
Fast nearest neighbor condensation for large data sets classification
 IEEE Transactions on Knowledge and Data Engineering
"... Abstract—This work has two main objectives, namely, to introduce a novel algorithm, called the Fast Condensed Nearest Neighbor (FCNN) rule, for computing a trainingsetconsistent subset for the nearest neighbor decision rule and to show that condensation algorithms for the nearest neighbor rule can ..."
Abstract

Cited by 10 (0 self)
Abstract—This work has two main objectives, namely, to introduce a novel algorithm, called the Fast Condensed Nearest Neighbor (FCNN) rule, for computing a training-set-consistent subset for the nearest neighbor decision rule, and to show that condensation algorithms for the nearest neighbor rule can be applied to huge collections of data. The FCNN rule has some interesting properties: it is order independent, its worst-case time complexity is quadratic but often with a small constant prefactor, and it is likely to select points very close to the decision boundary. Furthermore, its structure allows the triangle inequality to be effectively exploited to reduce the computational effort. The FCNN rule outperformed even here-enhanced variants of existing competence preservation methods both in terms of learning speed and learning scaling behavior and, often, in terms of the size of the model, while it guaranteed the same prediction accuracy. Furthermore, it was three orders of magnitude faster than hybrid instance-based learning algorithms on the MNIST and Massachusetts Institute of Technology (MIT) Face databases and computed a model of accuracy comparable to that of methods incorporating a noise-filtering pass. Index Terms—Classification, large and high-dimensional data, nearest neighbor rule, prototype selection algorithms, training-set-consistent subset.
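The consistency notion underlying FCNN goes back to Hart's classic condensed nearest neighbor (CNN) rule: grow a subset until every training point is correctly classified by its 1-NN in the subset. The following is a minimal sketch of that classic rule (not the FCNN algorithm itself); the function name and the Euclidean metric are illustrative assumptions.

```python
import numpy as np

def cnn_condense(X, y):
    """Hart's condensed NN sketch: grow a subset S until every
    training point is correctly classified by its 1-NN in S
    (i.e., S is training-set consistent)."""
    S = [0]  # seed with an arbitrary training point
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            # 1-NN of X[i] among the current subset S (Euclidean)
            d = np.linalg.norm(X[S] - X[i], axis=1)
            nearest = S[int(np.argmin(d))]
            if y[nearest] != y[i]:  # misclassified -> absorb the point
                S.append(i)
                changed = True
    return np.array(S)
```

FCNN improves on this scheme by being order independent and by exploiting the triangle inequality to prune distance computations; the sketch above shows only the consistency-driven growth that both share.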
Semi-supervised condensed nearest neighbor for part-of-speech tagging
"... This paper introduces a new training set condensation technique designed for mixtures of labeled and unlabeled data. It finds a condensed set of labeled and unlabeled data points, typically smaller than what is obtained using condensed nearest neighbor on the labeled data only, and improves classifi ..."
Abstract

Cited by 7 (0 self)
This paper introduces a new training set condensation technique designed for mixtures of labeled and unlabeled data. It finds a condensed set of labeled and unlabeled data points, typically smaller than what is obtained using condensed nearest neighbor on the labeled data only, and improves classification accuracy. We evaluate the algorithm on semi-supervised part-of-speech tagging and present the best published result on the Wall Street Journal data set.
Condensed Nearest Neighbor Data Domain Description
"... Abstract—A simple yet effective unsupervised classification rule to discriminate between normal and abnormal data is based on accepting test objects whose nearest neighbors ’ distances in a reference data set, assumed to model normal behavior, lie within a certain threshold. This work investigates t ..."
Abstract

Cited by 3 (0 self)
Abstract—A simple yet effective unsupervised classification rule to discriminate between normal and abnormal data is based on accepting test objects whose nearest neighbors' distances in a reference data set, assumed to model normal behavior, lie within a certain threshold. This work investigates the effect of using a subset of the original data set as the reference set of the classifier. With this aim, the concept of a reference-consistent subset is introduced and it is shown that finding the minimum-cardinality reference-consistent subset is intractable. Then, the Condensed Nearest Neighbor Domain Description (CNNDD) algorithm is described, which computes a reference-consistent subset with only two reference set passes. Experimental results revealed the advantages of condensing the data set and confirmed the effectiveness of the proposed approach. A thorough comparison with related methods was accomplished, pointing out the strengths and weaknesses of one-class nearest-neighbor-based training-set-consistent condensation. Index Terms—Classification, data domain description, data condensation, nearest neighbor rule, novelty detection.
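The acceptance rule the abstract describes reduces to a single distance test. A minimal sketch, assuming Euclidean distance and an illustrative function name; the threshold `theta` would in practice be chosen on the reference data:

```python
import numpy as np

def nn_domain_accept(x, reference, theta):
    """One-class NN rule: accept x as 'normal' iff its nearest-neighbor
    distance in the reference set is within the threshold theta."""
    d = np.linalg.norm(reference - x, axis=1)
    return float(d.min()) <= theta
```

CNNDD's contribution is then to shrink `reference` to a reference-consistent subset so that this test stays cheap while accepting the same objects.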
D.: Adaptation-guided case base maintenance
 In: Proceedings of the Twenty-Eighth Conference on Artificial Intelligence, AAAI Press (2014)
"... In casebased reasoning (CBR), problems are solved by retrieving prior cases and adapting their solutions to fit; learning occurs as new cases are stored. Controlling the growth of the case base is a fundamental problem, and research on casebase maintenance has developed methods for compacting c ..."
Abstract

Cited by 2 (2 self)
In case-based reasoning (CBR), problems are solved by retrieving prior cases and adapting their solutions to fit; learning occurs as new cases are stored. Controlling the growth of the case base is a fundamental problem, and research on case-base maintenance has developed methods for compacting case bases while maintaining system competence, primarily by competence-based deletion strategies assuming static case adaptation knowledge. This paper proposes adaptation-guided case-base maintenance (AGCBM), a case-base maintenance approach exploiting the ability to dynamically generate new adaptation knowledge from cases. In AGCBM, case retention decisions are based both on cases' value as base cases for solving problems and on their value for generating new adaptation rules. The paper illustrates the method for numerical prediction tasks (case-based regression) in which adaptation rules are generated automatically using the case difference heuristic. In comparisons of AGCBM to five alternative methods in four domains, for varying case base densities, AGCBM outperformed the alternatives in all domains, with the greatest benefit at high compression.
Stochastic Neighbor Compression
"... We present Stochastic Neighbor Compression (SNC), an algorithm to compress a dataset for the purpose of knearest neighbor (kNN) classification. Given training data, SNC learns a much smaller synthetic data set, that minimizes the stochastic 1nearest neighbor classification error on the training d ..."
Abstract

Cited by 2 (0 self)
We present Stochastic Neighbor Compression (SNC), an algorithm to compress a dataset for the purpose of k-nearest neighbor (kNN) classification. Given training data, SNC learns a much smaller synthetic data set that minimizes the stochastic 1-nearest neighbor classification error on the training data. This approach has several appealing properties: due to its small size, the compressed set speeds up kNN testing drastically (up to several orders of magnitude, in our experiments); it makes the kNN classifier substantially more robust to label noise; on 4 of 7 data sets it yields lower test error than kNN on the entire training set, even at compression ratios as low as 2%; finally, the SNC compression leads to impressive speedups over kNN even when kNN and SNC are both used with ball-tree data structures, hashing, and LMNN dimensionality reduction—demonstrating that it is complementary to existing state-of-the-art algorithms to speed up kNN classification and leads to substantial further improvements.
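The "stochastic 1-nearest neighbor error" being minimized can be sketched as a soft assignment of each training point to prototypes, with weight proportional to exp(-gamma · squared distance). The sketch below uses a crude finite-difference gradient step in place of the paper's analytic gradients; all function names, the per-class initialization, and the hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def snc_loss(Z, zlabels, X, y, gamma=1.0):
    """Negative log-likelihood of the stochastic 1-NN classifier:
    point x_i selects prototype z_j with probability proportional
    to exp(-gamma * ||x_i - z_j||^2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # (n, m)
    w = np.exp(-gamma * d2)
    p_correct = (w * (zlabels[None, :] == y[:, None])).sum(1) / w.sum(1)
    return -np.log(p_correct + 1e-12).sum()

def snc_compress(X, y, m_per_class=1, steps=50, lr=0.05, gamma=1.0):
    """SNC sketch: seed prototypes from the data, then nudge them by
    (numerical) gradient descent on the soft 1-NN loss."""
    Z, zlabels = [], []
    for c in np.unique(y):
        Z.extend(X[y == c][:m_per_class])
        zlabels.extend([c] * m_per_class)
    Z, zlabels = np.array(Z, float), np.array(zlabels)
    eps = 1e-4
    for _ in range(steps):
        grad = np.zeros_like(Z)
        for idx in np.ndindex(*Z.shape):  # finite-difference gradient
            Zp = Z.copy(); Zp[idx] += eps
            Zm = Z.copy(); Zm[idx] -= eps
            grad[idx] = (snc_loss(Zp, zlabels, X, y, gamma)
                         - snc_loss(Zm, zlabels, X, y, gamma)) / (2 * eps)
        Z -= lr * grad
    return Z, zlabels
```

Because the prototypes are synthetic (free to move), the compressed set can sit exactly on class centers rather than on observed points, which is what makes the extreme compression ratios in the abstract possible.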
NRMCS: Noise Removing Based on the MCS
"... MCS (Minimal Consistent Set) is one of the classical algorithms for minimal consistent subset selection problem. However, when noisy samples are present classification accuracy can suffer. In addition, noise affect the size of minimal consistent set. Therefore, removing noise is an important issue b ..."
Abstract

Cited by 1 (0 self)
MCS (Minimal Consistent Set) is one of the classical algorithms for the minimal consistent subset selection problem. However, when noisy samples are present, classification accuracy can suffer; in addition, noise affects the size of the minimal consistent set. Removing noise is therefore an important step before sample selection. In this paper, an improved approach based on MCS for selecting representative samples is proposed. Compared with other algorithms, which remove noise with Wilson editing in advance of representative sample selection, this algorithm performs noise removal and sample selection simultaneously. With this method, most noise can be deleted and the most representative samples can be identified and retained. Experiments show that the proposed method greatly reduces redundant samples and noise while increasing accuracy when used for classification tasks.
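The Wilson editing pass the abstract contrasts against is itself a simple filter: drop every point whose label disagrees with the majority vote of its k nearest neighbors. A minimal sketch, assuming Euclidean distance, integer class labels, and an illustrative function name:

```python
import numpy as np

def wilson_editing(X, y, k=3):
    """Wilson's edited NN: drop every point whose label disagrees
    with the majority vote of its k nearest neighbors (noise filter)."""
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the point itself
        nn = np.argsort(d)[:k]
        votes = np.bincount(y[nn], minlength=int(y.max()) + 1)
        if votes.argmax() == y[i]:
            keep.append(i)
    return np.array(keep)
```

NRMCS's point is that running such a filter as a separate pre-pass is wasteful; it folds the noise test into the consistent-subset selection itself.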
Graph-based Discrete Differential Geometry for Critical Instance Filtering
"... Abstract. Graph theory has been shown to provide a powerful tool for representing and tackling machine learning problems, such as clustering, semisupervised learning, and feature ranking. This paper proposes a graphbased discrete differential operator for detecting and eliminating competencecriti ..."
Abstract
 Add to MetaCart
(Show Context)
Abstract. Graph theory has been shown to provide a powerful tool for representing and tackling machine learning problems, such as clustering, semi-supervised learning, and feature ranking. This paper proposes a graph-based discrete differential operator for detecting and eliminating competence-critical instances and class label noise from a training set in order to improve classification performance. Results of extensive experiments on artificial and real-life classification problems substantiate the effectiveness of the proposed approach.
A Relative Position View-based Instance Set Reduction Algorithm
"... With the development of data acquisition and storage techniques, even more and larger datasets are easily confronted in machine learning. In order to save excessive storage and computational time and improve generalization accuracy by removing noise, we propose a novel instancebased learning algori ..."
Abstract
 Add to MetaCart
(Show Context)
With the development of data acquisition and storage techniques, ever more and larger datasets are encountered in machine learning. In order to save excessive storage and computational time, and to improve generalization accuracy by removing noise, we propose a novel instance-based learning algorithm based on the Relative Position view, named RePo. We treat training set reduction as the problem of selecting which instances should be deleted, and develop two intuitive definitions of replaceable structure, which yield simple and effective principles for handling points: retain border points, delete noisy ones, and reduce internal ones. By generating new prototypes and deleting noisy and close border points, the RePo algorithm is quite effective at storage reduction. In addition, we compare RePo to nine other traditional and typical reduction techniques on 16 classification tasks. The results demonstrate that RePo outperforms the others in terms of storage requirements while guaranteeing generalization accuracy close to the best.
ACL HLT 2011: The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011
"... Computational Linguistics 35(4), and an obituary by Mark Liberman appeared in Computational Linguistics 36(4). Several other newspaper and professional society obituaries have described his extraordinary personal life and career. Fred’s influence on computational linguistics is almost impossible to ..."
Abstract
 Add to MetaCart
(Show Context)
Computational Linguistics 35(4), and an obituary by Mark Liberman appeared in Computational Linguistics 36(4). Several other newspaper and professional society obituaries have described his extraordinary personal life and career. Fred's influence on computational linguistics is almost impossible to overstate. In the 1970s and 1980s, he and his colleagues at IBM developed the statistical paradigm that dominates our field today, including a great many specific techniques for modeling, parameter estimation, and search that continue to enjoy wide use. Even more fundamentally, as Mark Liberman recounts in his obituary, Fred led the field away from a mode where lone inventors defended their designs by appealing to aesthetics and anecdotes, to a more communal and transparent process of evaluating methods objectively through controlled comparisons on training and test sets. Under Fred's visionary leadership, the IBM group revolutionized speech recognition by adopting a statistical, data-driven perspective that was deeply at odds with the rationalist ethos of the time. The group began with Fred's information-theoretic reconceptualization of the task as recovering a source signal (text) after it had passed through a noisy channel. They then worked out the many components needed for a full speech recognizer, along with the training algorithms for each component and global
Pattern Recognition Letters
"... journal homepage: www.elsevier.com/locate/patrec A direct boosting algorithm for the knearest neighbor classifier via local warping ..."
Abstract
 Add to MetaCart
(Show Context)
A direct boosting algorithm for the k-nearest neighbor classifier via local warping