Results 1–10 of 387
Data Clustering: A Review
 ACM COMPUTING SURVEYS
, 1999
"... Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exp ..."
Abstract

Cited by 1892 (14 self)
Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a combinatorially difficult problem, and differences in assumptions and contexts across communities have made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with the goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms, such as image segmentation, object recognition, and information retrieval.
Quantization
 IEEE TRANS. INFORM. THEORY
, 1998
"... The history of the theory and practice of quantization dates to 1948, although similar ideas had appeared in the literature as long ago as 1898. The fundamental role of quantization in modulation and analogtodigital conversion was first recognized during the early development of pulsecode modula ..."
Abstract

Cited by 877 (12 self)
The history of the theory and practice of quantization dates to 1948, although similar ideas had appeared in the literature as long ago as 1898. The fundamental role of quantization in modulation and analog-to-digital conversion was first recognized during the early development of pulse-code modulation systems, especially in the 1948 paper of Oliver, Pierce, and Shannon. Also in 1948, Bennett published the first high-resolution analysis of quantization and an exact analysis of quantization noise for Gaussian processes, and Shannon published the beginnings of rate-distortion theory, which would provide a theory for quantization as analog-to-digital conversion and as data compression. Beginning with these three papers of fifty years ago, we trace the history of quantization from its origins through this decade, and we survey the fundamentals of the theory and many of the popular and promising techniques for quantization.
An Efficient k-Means Clustering Algorithm: Analysis and Implementation
, 2000
"... Kmeans clustering is a very popular clustering technique, which is used in numerous applications. Given a set of n data points in R d and an integer k, the problem is to determine a set of k points R d , called centers, so as to minimize the mean squared distance from each data point to its ..."
Abstract

Cited by 406 (4 self)
K-means clustering is a very popular clustering technique, which is used in numerous applications. Given a set of n data points in R^d and an integer k, the problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's algorithm. In this paper we present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is very easy to implement. It differs from most other approaches in that it precomputes a kd-tree data structure for the data points rather than the center points. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time. Second, we have implemented the algorithm and performed a number of empirical studies, both on synthetically generated data and on real...
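The filtering algorithm described here accelerates Lloyd's iteration by pruning candidate centers with a kd-tree built over the data points. As a point of reference, a minimal sketch of the plain Lloyd baseline it speeds up might look as follows; the kd-tree pruning itself is not shown, and the function and parameter names are illustrative only.

    import numpy as np

    def lloyd_kmeans(points, k, iters=50, seed=0):
        # Plain Lloyd iteration: assign each point to its nearest center,
        # then move every center to the mean of its assigned points.
        rng = np.random.default_rng(seed)
        centers = points[rng.choice(len(points), k, replace=False)].astype(float)
        for _ in range(iters):
            # squared distance from every point to every center
            d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            assign = d2.argmin(axis=1)
            for j in range(k):
                members = points[assign == j]
                if len(members):
                    centers[j] = members.mean(axis=0)
        return centers, assign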
Curvilinear Component Analysis: A Self-Organizing Neural Network for Nonlinear Mapping of Data Sets
, 1997
"... We present a new strategy called “curvilinear component analysis” (CCA) for dimensionality reduction and representation of multidimensional data sets. The principle of CCA is a selforganized neural network performing two tasks: vector quantization (VQ) of the submanifold in the data set (input spac ..."
Abstract

Cited by 211 (1 self)
We present a new strategy called “curvilinear component analysis” (CCA) for dimensionality reduction and representation of multidimensional data sets. The principle of CCA is a self-organized neural network performing two tasks: vector quantization (VQ) of the submanifold in the data set (input space) and nonlinear projection (P) of these quantizing vectors toward an output space, providing a revealing unfolding of the submanifold. After learning, the network has the ability to continuously map any new point from one space into another: forward mapping of new points in the input space, or backward mapping of an arbitrary position in the output space.
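As a rough illustration of the distance-preservation idea behind CCA's nonlinear projection, the toy sketch below pulls output-space distances toward the corresponding input-space distances for pairs that are close in the output space (a hard neighborhood of radius lam). This is a deliberately simplified illustration under those assumptions, not Demartines and Hérault's exact update rule or weighting function.

    import numpy as np

    def cca_like_projection(X, out_dim=2, lam=1.0, lr=0.05, epochs=100, seed=0):
        # Toy distance-preserving projection: for random pairs (i, j), move y_j so
        # the output distance tracks the input distance whenever the pair is
        # closer than lam in the output space (local unfolding, as in CCA).
        rng = np.random.default_rng(seed)
        Y = rng.normal(scale=0.1, size=(len(X), out_dim))
        for _ in range(epochs):
            for i in rng.permutation(len(X)):
                j = int(rng.integers(len(X)))
                if i == j:
                    continue
                dx = np.linalg.norm(X[i] - X[j])              # input-space distance
                dy = np.linalg.norm(Y[i] - Y[j]) + 1e-12      # output-space distance
                if dy < lam:                                  # only local pairs are matched
                    Y[j] += lr * (dx - dy) * (Y[j] - Y[i]) / dy
        return Y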
Generalized learning vector quantization
 Hasselmo (Eds.), NIPS
, 1995
"... We propose a new learning method, "Generalized Learning Vector Quantization (GLVQ), " in which reference vectors are updated based on the steepest descent method in order to minimize the cost function. The cost function is determined so that the obtained learning rule satisfies the conver ..."
Abstract

Cited by 123 (0 self)
We propose a new learning method, "Generalized Learning Vector Quantization (GLVQ)," in which reference vectors are updated based on the steepest descent method in order to minimize the cost function. The cost function is determined so that the obtained learning rule satisfies the convergence condition. We prove that Kohonen's rule as used in LVQ does not satisfy the convergence condition and thus degrades recognition ability. Experimental results for printed Chinese character recognition reveal that GLVQ is superior to LVQ in recognition ability.
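A minimal sketch of a GLVQ-style update is shown below, assuming squared Euclidean distances and a logistic cost-shaping function (constant factors are absorbed into the learning rate); it illustrates the steepest-descent idea on the relative distance difference rather than reproducing the paper's exact formulation.

    import numpy as np

    def sigmoid_grad(mu, t=1.0):
        s = 1.0 / (1.0 + np.exp(-t * mu))
        return t * s * (1.0 - s)

    def glvq_step(x, y, prototypes, labels, lr=0.05):
        # Relative distance difference mu = (d1 - d2) / (d1 + d2), where d1 is the
        # distance to the closest prototype of the correct class and d2 to the
        # closest prototype of any other class; descend the gradient of f(mu).
        d = np.sum((prototypes - x) ** 2, axis=1)
        same = labels == y
        j = np.argmin(np.where(same, d, np.inf))      # closest correct prototype
        k = np.argmin(np.where(~same, d, np.inf))     # closest incorrect prototype
        d1, d2 = d[j], d[k]
        mu = (d1 - d2) / (d1 + d2)
        g = sigmoid_grad(mu) / (d1 + d2) ** 2
        prototypes[j] += lr * g * d2 * (x - prototypes[j])   # attract correct prototype
        prototypes[k] -= lr * g * d1 * (x - prototypes[k])   # repel incorrect prototype
        return mu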
Efficient Iris Recognition through Improvement of Feature Vector and Classifier
 ETRI Journal
, 2001
"... In this paper, we propose an efficient method for personal identification by analyzing iris patterns that have a high level of stability and distinctiveness. To improve the efficiency and accuracy of the proposed system, we present a new approach to making a feature vector compact and efficient by u ..."
Abstract

Cited by 93 (0 self)
In this paper, we propose an efficient method for personal identification by analyzing iris patterns, which have a high level of stability and distinctiveness. To improve the efficiency and accuracy of the proposed system, we present a new approach that makes the feature vector compact and efficient by using the wavelet transform, together with two straightforward but effective mechanisms for the competitive learning method: weight vector initialization and winner selection. With these mechanisms, experimental results show that the proposed system can be used for personal identification in an efficient and effective manner.
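As context for the competitive-learning component, a minimal winner-take-all sketch is given below. The paper's specific weight-initialization and winner-selection mechanisms, and the wavelet feature extraction, are not reproduced here; initializing units from sampled inputs is just one common assumption.

    import numpy as np

    def competitive_learning(features, n_units=16, lr=0.1, epochs=20, seed=0):
        # Winner-take-all competitive learning: the unit closest to the input
        # wins and is moved toward that input; all other units stay fixed.
        rng = np.random.default_rng(seed)
        init = rng.choice(len(features), n_units, replace=False)
        weights = features[init].astype(float)        # assumed: init from sampled inputs
        for _ in range(epochs):
            for idx in rng.permutation(len(features)):
                x = features[idx]
                winner = np.argmin(np.linalg.norm(weights - x, axis=1))
                weights[winner] += lr * (x - weights[winner])
        return weights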
Text-Independent Writer Identification and Verification on Offline Arabic Handwriting
 in Proc. of Int. Conf. on Doc. Anal. and Rec. (ICDAR)
, 2007
"... Abstract—The identification of a person on the basis of scanned images of handwriting is a useful biometric modality with application in forensic and historic document analysis and constitutes an exemplary study area within the research field of behavioral biometrics. We developed new and very effec ..."
Abstract

Cited by 67 (5 self)
Abstract—The identification of a person on the basis of scanned images of handwriting is a useful biometric modality with application in forensic and historic document analysis, and constitutes an exemplary study area within the research field of behavioral biometrics. We developed new and very effective techniques for automatic writer identification and verification that use probability distribution functions (PDFs) extracted from the handwriting images to characterize writer individuality. A defining property of our methods is that they are designed to be independent of the textual content of the handwritten samples. Our methods operate at two levels of analysis: the texture level and the character-shape (allograph) level. At the texture level, we use contour-based joint directional PDFs that encode orientation and curvature information to give an intimate characterization of individual handwriting style. In our analysis at the allograph level, the writer is considered to be characterized by a stochastic pattern generator of ink-trace fragments, or graphemes. The PDF of these simple shapes in a given handwriting sample is characteristic for the writer and is computed using a common shape codebook obtained by grapheme clustering. Combining multiple features (directional, grapheme, and run-length PDFs) yields increased writer identification and verification performance. The proposed methods are applicable to free-style handwriting (both cursive and isolated) and have practical feasibility, under the assumption that a few text lines of handwritten material are available in order to obtain reliable probability estimates. Index Terms—Handwriting analysis, writer identification and verification, behavioral biometrics, joint directional probability distributions, grapheme-emission probability distribution.
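To illustrate the texture-level idea of a contour-based directional PDF, the sketch below bins the orientations of successive contour points into a normalized histogram and compares two such PDFs with a chi-square distance. This is a simplified single-angle stand-in for the joint orientation/curvature PDFs used in the paper, with illustrative names throughout.

    import numpy as np

    def direction_pdf(contour, n_bins=16):
        # contour: (N, 2) array of (x, y) points along an ink trace.
        # Histogram of local contour directions, normalized to a PDF.
        d = np.diff(contour, axis=0)
        angles = np.arctan2(d[:, 1], d[:, 0])
        hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
        return hist / max(hist.sum(), 1)

    def chi2_distance(p, q, eps=1e-12):
        # Chi-square distance, a common choice for comparing writer PDFs.
        return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))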
Statistical strategies for avoiding false discoveries in metabolomics and related experiments
, 2006
"... Many metabolomics, and other highcontent or highthroughput, experiments are set up such that the primary aim is the discovery of biomarker metabolites that can discriminate, with a certain level of certainty, between nominally matched ‘case ’ and ‘control ’ samples. However, it is unfortunately ve ..."
Abstract

Cited by 60 (11 self)
Many metabolomics, and other high-content or high-throughput, experiments are set up such that the primary aim is the discovery of biomarker metabolites that can discriminate, with a certain level of certainty, between nominally matched ‘case’ and ‘control’ samples. However, it is unfortunately very easy to find markers that are apparently persuasive but that are in fact entirely spurious, and there are well-known examples in the proteomics literature. The main types of danger are not entirely independent of each other, but include bias, inadequate sample size (especially relative to the number of metabolite variables and to the required statistical power to prove that a biomarker is discriminant), excessive false discovery rate due to multiple hypothesis testing, inappropriate choice of particular numerical methods, and overfitting (generally caused by the failure to perform adequate validation and cross-validation). Many studies fail to take these into account, and thereby fail to discover anything of true significance (despite their claims). We summarise these problems, and provide pointers to a substantial existing literature that should assist in the improved design and evaluation of metabolomics experiments, thereby allowing robust scientific conclusions to be drawn from the available data. We provide a list of some of the simpler checks that might improve one’s confidence that a candidate biomarker is not simply a statistical artefact, and suggest a series of preferred tests and visualisation tools that can assist readers and authors in assessing papers. These tools can be applied to individual metabolites by using multiple univariate tests performed in parallel across all metabolite peaks. They may also be applied to the validation of multivariate models. We stress in ...
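One of the simpler checks in this vein, controlling the false discovery rate across many parallel univariate tests, can be sketched as follows. This is a standard Benjamini-Hochberg step-up procedure applied to per-metabolite p-values, offered as an illustration rather than as the paper's own recommended workflow.

    import numpy as np

    def benjamini_hochberg(pvals, q=0.05):
        # Benjamini-Hochberg step-up: returns a boolean mask of the p-values
        # declared significant while controlling the false discovery rate at q.
        p = np.asarray(pvals, dtype=float)
        m = len(p)
        order = np.argsort(p)
        thresholds = q * np.arange(1, m + 1) / m          # critical values k*q/m
        below = p[order] <= thresholds
        keep = np.zeros(m, dtype=bool)
        if below.any():
            cutoff = int(np.max(np.nonzero(below)[0]))    # largest k with p_(k) <= k*q/m
            keep[order[: cutoff + 1]] = True
        return keep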