## Statistical pattern recognition: A review (2000)


### Other Repositories/Bibliography

Venue: IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Citations: 752 (23 self)

### BibTeX

```bibtex
@ARTICLE{Jain00statisticalpattern,
  author  = {Anil K. Jain and Robert P. W. Duin and Jianchang Mao},
  title   = {Statistical pattern recognition: A review},
  journal = {IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE},
  year    = {2000},
  volume  = {22},
  number  = {1},
  pages   = {4--37}
}
```

### Abstract

The primary goal of pattern recognition is supervised or unsupervised classification. Among the various frameworks in which pattern recognition has been traditionally formulated, the statistical approach has been most intensively studied and used in practice. More recently, neural network techniques and methods imported from statistical learning theory have been receiving increasing attention. The design of a recognition system requires careful attention to the following issues: definition of pattern classes, sensing environment, pattern representation, feature extraction and selection, cluster analysis, classifier design and learning, selection of training and test samples, and performance evaluation. In spite of almost 50 years of research and development in this field, the general problem of recognizing complex patterns with arbitrary orientation, location, and scale remains unsolved. New and emerging applications, such as data mining, web searching, retrieval of multimedia data, face recognition, and cursive handwriting recognition, require robust and efficient pattern recognition techniques. The objective of this review paper is to summarize and compare some of the well-known methods used in various stages of a pattern recognition system and identify research topics and applications which are at the forefront of this exciting and challenging field.

### Citations

9946 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...ification of training patterns. Such boundaries can be constructed using, for example, a mean squared error criterion. The direct boundary construction approaches are supported by Vapnik's philosophy [162]: "If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the...

5438 | C4.5: Programs for Machine Learning
- Quinlan
- 1993
Citation Context: ...tions where each class conditional density is represented by a weighted sum of Gaussians (a so-called Gaussian mixture; see Section 8.2). A special type of classifier is the decision tree [22], [30], [129], which is trained by an iterative selection of individual features that are most salient at each node of the tree. The criteria for feature selection and tree generation include the information conte...
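The per-node feature selection the context describes can be illustrated with an information-gain computation. This is a minimal sketch on hypothetical toy data; `entropy` and `info_gain` are illustrative helper names, not from the paper:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, feature_values, threshold):
    """Information gained by splitting on feature <= threshold."""
    left = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x > threshold]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

# A perfectly separating threshold recovers the full class entropy (1 bit here).
labels = ["a", "a", "b", "b"]
values = [0.1, 0.2, 0.8, 0.9]
print(info_gain(labels, values, 0.5))  # 1.0
```

A tree learner would evaluate such a criterion for every candidate feature and threshold at each node and split on the best one.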

5369 | Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context: ... (10), error estimation (25) and unsupervised classification (50). In addition to the excellent textbooks by Duda and Hart [44], Fukunaga [58], Devijver and Kittler [39], Devroye et al. [41], Bishop [18], Ripley [137], Schürmann [147], and McLachlan [105], we should also point out two excellent survey papers written by Nagy [111] in 1968 and by Kanal [89] in 1974. Nagy described the early roots of pa...

4457 | Classification and Regression Trees
- Breiman, Friedman, et al.
- 1984
Citation Context: ...ble to situations where each class conditional density is represented by a weighted sum of Gaussians (a so-called Gaussian mixture; see Section 8.2). A special type of classifier is the decision tree [22], [30], [129], which is trained by an iterative selection of individual features that are most salient at each node of the tree. The criteria for feature selection and tree generation include the info...

4311 | Neural Networks, A Comprehensive Foundation
- Haykin
- 1994
Citation Context: ...isher distance using Parzen density estimates [41]. There are several ways to define nonlinear feature extraction techniques. One such method which is directly related to PCA is called the Kernel PCA [73], [145]. The basic idea of kernel PCA is to first map input data into some new feature space F typically via a nonlinear function (e.g., polynomial of degree p) and then perform a linear PCA in the ...
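The two-step idea in the context (nonlinear map, then linear PCA in the mapped space) can be sketched with a polynomial kernel and an eigendecomposition of the centered Gram matrix. A minimal numpy sketch on synthetic data; all names and the tiny dataset are illustrative:

```python
import numpy as np

def kernel_pca(X, n_components=2, degree=2):
    """Kernel PCA: eigendecompose the centered Gram matrix of a polynomial kernel."""
    K = (X @ X.T + 1.0) ** degree                 # polynomial kernel of degree p
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one    # center the data in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)         # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]
    alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
    return Kc @ alphas                            # projections of the training points

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Z = kernel_pca(X, n_components=2)
print(Z.shape)  # (20, 2)
```

Because the Gram matrix is double-centered, the projected components have zero mean, mirroring ordinary PCA on centered data.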

4178 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context: ...atistical decision theoretic approach, the decision boundaries are determined by the probability distributions of the patterns belonging to each class, which must either be specified or learned [41], [44]. One can also take a discriminant analysis-based approach to classification: First a parametric form of the decision boundary (e.g., linear or quadratic) is specified; then the "best" decision bounda...

3732 | Self-organizing maps
- Kohonen
- 1997
Citation Context: ... Function (RBF) networks. These networks are organized into layers and have unidirectional connections between the layers. Another popular network is the Self-Organizing Map (SOM), or Kohonen-Network [92], which is mainly used for data clustering and feature mapping. The learning process involves updating network architecture and connection weights so that a network can efficiently perform a specific ...
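The weight-update process the context mentions can be sketched for a 1-D SOM: find the best-matching unit for each input and pull it and its grid neighbors toward the input. A minimal sketch; the unit count, decay schedule, and toy data are illustrative choices:

```python
import numpy as np

def train_som(data, n_units=5, epochs=50, lr=0.5, sigma=1.0, seed=0):
    """Minimal 1-D Self-Organizing Map: move the winner and its neighbors toward each input."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(n_units, data.shape[1]))          # prototype vectors
    grid = np.arange(n_units)
    for _ in range(epochs):
        for x in data:
            winner = np.argmin(np.linalg.norm(w - x, axis=1))     # best-matching unit
            h = np.exp(-((grid - winner) ** 2) / (2 * sigma**2))  # neighborhood function
            w += lr * h[:, None] * (x - w)
        lr *= 0.95                                         # decay the learning rate
    return w

rng = np.random.default_rng(1)
# Two well-separated clusters; prototypes should spread over the data.
data = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
w = train_som(data, n_units=4)
print(w.shape)  # (4, 2)
```

In practice the neighborhood width sigma is also annealed over time, which this sketch omits for brevity.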

2928 | Introduction to Statistical Pattern Recognition, Electrical Science Series
- Fukunaga
- 1972
Citation Context: ...duction (50), classifier design (175), classifier combination (10), error estimation (25) and unsupervised classification (50). In addition to the excellent textbooks by Duda and Hart [44], Fukunaga [58], Devijver and Kittler [39], Devroye et al. [41], Bishop [18], Ripley [137], Schürmann [147], and McLachlan [105], we should also point out two excellent survey papers written by Nagy [111] in 1968 an...

2497 | A tutorial on support vector machines for pattern recognition
- Burges
- 1998
Citation Context: ... as a benchmark. One of the most interesting recent developments in classifier design is the introduction of the support vector classifier by Vapnik [162] which has also been studied by other authors [23], [144], [146]. It is primarily a two-class classifier. The optimization criterion here is the width of the margin between the classes, i.e., the empty area around the decision boundary defined by the...

2334 | Algorithms for Clustering Data
- Jain, Dubes
- 1988
Citation Context: ... tree of Fig. 2) are attempting to implement the Bayes decision rule. The field of cluster analysis essentially deals with decision making problems in the nonparametric and unsupervised learning mode [81]. Further, in cluster analysis the number of categories or clusters may not even be specified; the task is to discover a reasonable categorization of the data (if one exists). Cluster analysis algorit...
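The unsupervised categorization task described here is classically attacked with k-means, the simplest square-error clustering algorithm. A minimal sketch, assuming the number of clusters is given; the deterministic initialization is an illustrative simplification (k-means++ is better in practice):

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Plain k-means: alternate nearest-center assignment and centroid update."""
    # Simple deterministic init: centers spread over the data order.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):      # converged
            break
        centers = new
    return centers, labels

# Two point clouds at 0 and 10; k-means should recover both centroids.
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 10.0)])
centers, labels = kmeans(X, 2)
print(np.sort(centers[:, 0]))  # [ 0. 10.]
```

Note that k-means only finds a local optimum of the square-error criterion, which is why multiple restarts are common.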

1450 | Pattern recognition with fuzzy objective function algorithms
- Bezdek
- 1981
Citation Context: ...ure on fuzzy classification and fuzzy clustering which are in our opinion beyond the scope of this paper. Interested readers can refer to the well-written books on fuzzy pattern recognition by Bezdek [15] and [16]. In most of the sections, the various approaches and methods are summarized in tables as an easy and quick reference for the reader. Due to space constraints, we are not able to provide many...

1441 | Affective Computing
- Picard
- 1997
Citation Context: ...various physical attributes such as face and fingerprints). Picard [125] has identified a novel application of pattern recognition, called affective computing which will give a computer the ability to recognize and express emotions, to respond intelligently to human emoti...

1212 | Pattern Recognition and Neural Networks
- Ripley
- 1996
Citation Context: ...estimation (25) and unsupervised classification (50). In addition to the excellent textbooks by Duda and Hart [44], Fukunaga [58], Devijver and Kittler [39], Devroye et al. [41], Bishop [18], Ripley [137], Schürmann [147], and McLachlan [105], we should also point out two excellent survey papers written by Nagy [111] in 1968 and by Kanal [89] in 1974. Nagy described the early roots of pattern recognit...

1180 | An information-maximization approach to blind separation and blind deconvolution
- Bell, Sejnowski
- 1995
Citation Context: ...envalues), it effectively approximates the data by a linear subspace using the mean squared error criterion. Other methods, like projection pursuit [53] and independent component analysis (ICA) [31], [11], [24], [96] are more appropriate for non-Gaussian distributions since they do not rely on the second-order property of the data. ICA has been successfully used for blind-source separation [78]; extra...

1171 | Bayesian Theory
- Bernardo, Smith
- 1994
Citation Context: ...s data for training which is especially undesirable when the training data set is small. To avoid this problem, a number of model selection schemes [71] have been proposed, including Bayesian methods [14], minimum description length (MDL) [138], Akaike information criterion (AIC) [2] and marginalized likelihood [101], [159]. Various other regularization schemes which incorporate prior knowledge about ...

1145 | Nonlinear component analysis as a kernel eigenvalue problem
- Scholkopf, Smola, et al.
- 1998
Citation Context: ...distance using Parzen density estimates [41]. There are several ways to define nonlinear feature extraction techniques. One such method which is directly related to PCA is called the Kernel PCA [73], [145]. The basic idea of kernel PCA is to first map input data into some new feature space F typically via a nonlinear function (e.g., polynomial of degree p) and then perform a linear PCA in the mapped ...

1111 | Fast training of support vector machines using sequential minimal optimization
- Platt
- 1999
Citation Context: ... the drawbacks of this method. Various training algorithms have been proposed in the literature [23], including chunking [161], Osuna's decomposition method [119], and sequential minimal optimization [124]. An appropriate kernel function K (as in kernel PCA, Section 4.1) needs to be selected. In its most simple form, it is just a dot product between the input pattern x and a member of the support set: ...
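Whatever training algorithm produces the multipliers, the resulting classifier evaluates f(x) = Σᵢ αᵢ yᵢ K(sᵢ, x) + b over the support set; in its simplest form K is just a dot product. A minimal sketch with a hypothetical two-support-vector machine (the support set, multipliers, and kernels below are illustrative, not a trained model):

```python
import numpy as np

def svm_decision(x, support, alpha, y, b, kernel):
    """Evaluate f(x) = sum_i alpha_i * y_i * K(s_i, x) + b over the support set."""
    return sum(a * yi * kernel(s, x) for s, a, yi in zip(support, alpha, y)) + b

linear = lambda u, v: float(np.dot(u, v))               # simplest form: a dot product
poly   = lambda u, v: (float(np.dot(u, v)) + 1.0) ** 3  # polynomial kernel, degree 3

# Hypothetical machine separating x[0] < 0 from x[0] > 0.
support = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]
alpha, y, b = [0.5, 0.5], [-1, +1], 0.0
print(svm_decision(np.array([2.0, 0.0]), support, alpha, y, b, linear))  # 2.0
```

Swapping `linear` for `poly` changes only the kernel function, which is the sense in which the kernel choice is a separate design decision from the training algorithm.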

1064 | A Probabilistic Theory of Pattern Recognition
- Devroye, Györfi, et al.
- 1996
Citation Context: ...the statistical decision theoretic approach, the decision boundaries are determined by the probability distributions of the patterns belonging to each class, which must either be specified or learned [41], [44]. One can also take a discriminant analysis-based approach to classification: First a parametric form of the decision boundary (e.g., linear or quadratic) is specified; then the "best" decision ...

835 | Estimation of Dependencies Based on Empirical Data
- Vapnik
- 1979
Citation Context: ...nal complexity of the training procedure (a quadratic minimization problem) are the drawbacks of this method. Various training algorithms have been proposed in the literature [23], including chunking [161], Osuna's decomposition method [119], and sequential minimal optimization [124]. An appropriate kernel function K (as in kernel PCA, Section 4.1) needs to be selected. In its most simple form, it is j...

804 | The Jackknife, the Bootstrap and Other Resampling Plans
- Efron
- 1990
Citation Context: ...ation and evaluation [166]. Examples are the optimization of the covariance estimates for the Parzen kernel [76] and discriminant analysis [61], and the use of bootstrapping for designing classifiers [48], and for error estimation [82]. Throughout the paper, some of the classification methods will be illustrated by simple experiments on the following three data sets: Dataset 1: An artificial dataset c...

798 | Statistical Methods for Speech Recognition
- Jelinek
- 1997
Citation Context: ... become a very important topic in pattern recognition. Hidden Markov Models (HMM), have been a popular statistical tool for modeling and recognizing sequential data, in particular, speech data [130], [86]. A large number of variations and enhancements of HMMs have been proposed in the literature [12], including hybrids of HMMs and neural networks, input-output HMMs, weighted transducers, variable-durat...

771 | Boosting the margin: a new explanation of effectiveness of voting methods
- Schapire, Freund, et al.
- 1998
Citation Context: ...ng preferably error-free independent probabilities, e.g. resulting from well estimated densities of different, independent feature sets (case (2) in the introduction of this section). Schapire et al. [143] proposed a different explanation for the effectiveness of voting (weighted average, in fact) methods. The explanation is based on the notion of "margin" which is the difference between the combined s...

724 | Statistical Analysis of Finite Mixture Distributions
- Titterington, Smith, et al.
- 1985
Citation Context: ...rvised classification [81]. The reason behind this is that mixtures adequately model situations where each pattern has been produced by one of a set of alternative (probabilistically modeled) sources [155]. Nevertheless, it should be kept in mind that strict adherence to this interpretation is not required: mixtures can also be seen as a class of models that are able to represent arbitrarily complex pr...

701 | Pattern Recognition: A Statistical Approach
- Devijver, Kittler
- 1982
Citation Context: ...sign (175), classifier combination (10), error estimation (25) and unsupervised classification (50). In addition to the excellent textbooks by Duda and Hart [44], Fukunaga [58], Devijver and Kittler [39], Devroye et al. [41], Bishop [18], Ripley [137], Schürmann [147], and McLachlan [105], we should also point out two excellent survey papers written by Nagy [111] in 1968 and by Kanal [89] in 1974. Na...

647 | Bayesian Learning for Neural Networks
- Neal
- 1995
Citation Context: ...e already built in, such as slow training in combination with early stopping. Other regularization methods include the addition of noise and weight decay [18], [28], [137], and also Bayesian learning [113]. One of the interesting characteristics of multilayer perceptrons is that in addition to classifying an...

541 | Stochastic complexity
- Rissanen
- 1987
Citation Context: ...imate the probability density of the training samples. A major issue in VQ is the selection of the output alphabet size. A number of techniques, such as the minimum description length (MDL) principle [138], can be used to select this parameter (see Section 8.2). The supervised version of VQ is called learning vector quantization (LVQ) [92]. 8.2 Mixture Decomposition Finite mixtures are a flexible and p...

485 | On Bayesian analysis of Mixtures with an Unknown Number of Components
- Richardson, Green
- 1997
Citation Context: ...ld conditions, converges to the maximum likelihood (ML) estimate of the mixture parameters); several authors have also advocated the (computationally demanding) Markov chain Monte-Carlo (MCMC) method [135]. The second question is more difficult; several techniques have been proposed which are summarized in Section 8.2.3. Note that the output of the mixture decomposition is as good as the validity of th...

481 | A fast fixed-point algorithm for independent component analysis
- Hyvärinen, Oja
- 1997
Citation Context: ... [31], [11], [24], [96] are more appropriate for non-Gaussian distributions since they do not rely on the second-order property of the data. ICA has been successfully used for blind-source separation [78]; extracting linear feature combinations that define independent sources. This demixing is possible if at most one of the sources has a Gaussian distribution. Whereas PCA is an unsupervised linear fea...
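The fixed-point iteration this citation refers to can be sketched for a single component on whitened data: repeat w ← E{z g(wᵀz)} − E{g′(wᵀz)} w, renormalizing each time. A minimal sketch with a tanh nonlinearity on a mixture of two uniform (non-Gaussian) sources; the mixing matrix and helper names are illustrative:

```python
import numpy as np

def fastica_one_unit(Z, iters=200, seed=0):
    """One-unit FastICA fixed-point iteration on whitened data Z (n_samples x n_dims)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(iters):
        wz = Z @ w
        g, g_prime = np.tanh(wz), 1.0 - np.tanh(wz) ** 2        # contrast nonlinearity
        w_new = (Z * g[:, None]).mean(0) - g_prime.mean() * w   # fixed-point update
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1.0) < 1e-10:                   # converged (up to sign)
            return w_new
        w = w_new
    return w

# Mix two uniform sources, whiten the mixture, then recover one demixing direction.
rng = np.random.default_rng(1)
S = rng.uniform(-1, 1, size=(5000, 2))
X = S @ np.array([[2.0, 1.0], [1.0, 2.0]])
X -= X.mean(0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = U * np.sqrt(len(X))          # whitened data: identity sample covariance
w = fastica_one_unit(Z)
print(w.shape)  # (2,)
```

Whitening first is what allows the simple renormalized update; further components would be found the same way with an added orthogonalization step.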

434 | Discriminant Analysis and Statistical Pattern Recognition
- McLachlan
- 1992
Citation Context: ...ification (50). In addition to the excellent textbooks by Duda and Hart [44], Fukunaga [58], Devijver and Kittler [39], Devroye et al. [41], Bishop [18], Ripley [137], Schürmann [147], and McLachlan [105], we should also point out two excellent survey papers written by Nagy [111] in 1968 and by Kanal [89] in 1974. Nagy described the early roots of pattern recognition, which at that time was shared wit...

404 | Methods of combining multiple classifiers and their applications to handwriting recognition
- Xu, Krzyzak, et al.
- 1992
Citation Context: ...ombination function. It is also possible to use a fixed combiner and optimize the set of input classifiers, see Section 6.1. A large number of combination schemes have been proposed in the literature [172]. A typical combination scheme consists of a set of individual classifiers and a combiner which combines the results of the individual classifiers to make the final decision. When the individual class...

394 | The Random Subspace Method for constructing decision forests
- Ho
- 1998
Citation Context: ...sets of training patterns, different feature sets may be used. This even more explicitly forces the individual classifiers to contain independent information. An example is the random subspace method [75]. 6.2 Combiner After individual classifiers have been selected, they need to be combined together by a module, called the combiner. Various combiners can be distinguished from each other in their trai...

380 | Computer systems that learn
- Weiss, Kulikowski
- 1991
Citation Context: ...order to avoid the necessity of having several independent test sets, estimators are often based on rotated subsets of the data, preserving different parts of the data for optimization and evaluation [166]. Examples are the optimization of the covariance estimates for the Parzen kernel [76] and discriminant analysis [61], and the use of bootstrapping for designing classifiers [48], and for error estima...

360 | Feature selection: Evaluation, application, and small sample performance
- Jain, Zongker
- 1997
Citation Context: ...hich essentially trade off the optimality of the selected subset for computational efficiency. Table 5 lists most of the well-known feature selection methods which have been proposed in the literature [85]. Only the first two methods in this table guarantee an optimal subset. All other strategies are suboptimal due to the fact that the best pair of features need not contain the best single feature [34]...
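The simplest of the suboptimal strategies the context alludes to is sequential forward selection: greedily add whichever feature most improves a separability criterion. A minimal sketch with a crude Fisher-style score on synthetic data where only one feature is informative; the score function and dataset are illustrative:

```python
import numpy as np

def forward_selection(X, y, score, k):
    """Sequential forward selection: greedily add the feature that most improves the score."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        best = max(remaining, key=lambda j: score(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

def score(Xs, y):
    """Class-mean separation over pooled spread (a crude Fisher-style criterion)."""
    m0, m1 = Xs[y == 0].mean(0), Xs[y == 1].mean(0)
    return float(np.linalg.norm(m0 - m1) / (Xs.std() + 1e-9))

rng = np.random.default_rng(0)
n = 100
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, 5))
X[:, 3] += 3.0 * y                # only feature 3 carries class information
print(forward_selection(X, y, score, 2)[0])  # 3
```

This illustrates the quoted caveat directly: the greedy procedure evaluates features one at a time, so the best pair it builds need not be the globally best pair.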

315 | The minimum description length principle in coding and modeling
- Barron, Rissanen, et al.
- 1998
Citation Context: ...udes the maximized loglikelihood function plus an additional term whose role is to penalize large values of K. An obvious choice in this class is to use the minimum description length (MDL) criterion [10], [138], but several other model selection criteria have been proposed: Schwarz's Bayesian inference criterion (BIC), the minimum message length (MML) criterion, and Akaike's information criterion (AIC...
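The penalized criterion described here has the generic two-part form score(K) = −log-likelihood + (penalty on model size); for MDL/BIC the penalty is (number of parameters / 2) · log n. A minimal sketch with hypothetical candidate models (the log-likelihoods and parameter counts are made up for illustration):

```python
import math

def mdl_score(log_likelihood, n_params, n_samples):
    """Two-part MDL/BIC-style score: negative log-likelihood plus a complexity penalty."""
    return -log_likelihood + 0.5 * n_params * math.log(n_samples)

# Hypothetical candidates: K -> (maximized log-likelihood, number of free parameters).
candidates = {1: (-1250.0, 2), 2: (-1100.0, 5), 3: (-1095.0, 8)}
n = 500
best_k = min(candidates, key=lambda k: mdl_score(*candidates[k], n))
print(best_k)  # 2
```

The example shows the intended behavior: K = 3 fits slightly better, but its extra parameters cost more description length than the likelihood gain is worth, so K = 2 is selected.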

308 | When Networks Disagree: Ensemble Methods for Hybrid
- Perrone, Cooper
- 1993
Citation Context: ...ds remarkable for as N approaches infinity, the variance is reduced to zero. Unfortunately, this is not realistic because the i.i.d. assumption breaks down for large N. Similarly, Perrone and Cooper [123] showed that under the zero-mean and independence assumption on the misfit (difference between the desired output and the actual output), averaging the outputs of N neural networks can reduce the mean...
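The variance-reduction claim under the zero-mean, independence assumption can be checked numerically: averaging N independent misfits shrinks the error variance by roughly a factor of N. A minimal sketch with simulated Gaussian misfits (the misfit distribution is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 10, 20000
single = rng.normal(size=trials)                  # one network's misfit per trial
averaged = rng.normal(size=(trials, N)).mean(1)   # misfit of an average of N networks
ratio = single.var() / averaged.var()
print(ratio)  # close to N = 10
```

As the surrounding text warns, the clean 1/N factor disappears once the individual misfits become correlated, which they do in practice for large ensembles trained on the same data.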

285 | An improved training algorithm for support vector machines
- Osuna, Freund, et al.
- 1997
Citation Context: ...dure (a quadratic minimization problem) are the drawbacks of this method. Various training algorithms have been proposed in the literature [23], including chunking [161], Osuna's decomposition method [119], and sequential minimal optimization [124]. An appropriate kernel function K (as in kernel PCA, Section 4.1) needs to be selected. In its most simple form, it is just a dot product between the input ...

249 | A Tutorial On Hidden Markov Models and Selected
- Rabiner
- 1989
Citation Context: ...refore, become a very important topic in pattern recognition. Hidden Markov Models (HMM), have been a popular statistical tool for modeling and recognizing sequential data, in particular, speech data [130], [86]. A large number of variations and enhancements of HMMs have been proposed in the literature [12], including hybrids of HMMs and neural networks, input-output HMMs, weighted transducers, variable...
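The core HMM computation used in recognition, scoring an observation sequence under a model, is the forward algorithm. A minimal sketch with a hypothetical 2-state, 2-symbol model (the probability tables below are made-up illustrative values):

```python
import numpy as np

def forward(pi, A, B, obs):
    """HMM forward algorithm: probability of an observation sequence under the model."""
    alpha = pi * B[:, obs[0]]             # initialization with the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # propagate one step and absorb the next symbol
    return float(alpha.sum())

# Hypothetical 2-state, 2-symbol model.
pi = np.array([0.6, 0.4])                 # initial state distribution
A = np.array([[0.7, 0.3], [0.4, 0.6]])    # state transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])    # emission probabilities
p = forward(pi, A, B, [0, 1, 0])
print(round(p, 4))  # 0.1089
```

For recognition, one such model is trained per class (e.g., per word) and a test sequence is assigned to the model giving it the highest probability; real implementations also rescale alpha at each step to avoid underflow on long sequences.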

246 | Efficient Pattern Recognition Using a New Transformation Distance
- Simard, LeCun, et al.
- 1993
Citation Context: ...lt task. Recently, there has been some activity in designing invariant recognition methods which do not require invariant features. Examples are the nearest neighbor classifier using tangent distance [152] and deformable template matching [84]. These approaches only achieve invariance to small amounts of linear transformations and nonlinear deformations. Besides, they are computationally very intensive...

244 | The use of Faces to Represent Points in k-Dimensional Space Graphically
- Chernoff
- 1973
Citation Context: ...r visually observing multivariate data, in which the objective is to exactly depict each pattern as a picture with d degrees of freedom, where d is the given number of features. For example, Chernoff [29] represents each pattern as a cartoon face whose facial characteristics, such as nose length, mouth curvature, and eye size, are made to correspond to individual features. Fig. 3 shows three faces cor...

240 | Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition
- Cover
Citation Context: ...re are too many free parameters. Overtraining has been investigated theoretically for classifiers that minimize the apparent error rate (the error on the training set). The classical studies by Cover [33] and Vapnik [162] on classifier capacity and complexity provide a good understanding of the mechanisms behind overtraining. Complex classifiers (e.g., those having many independent parameters) may hav...

228 | Tutorial in pattern theory
- Grenander
- 1983
Citation Context: ...of disadvantages. For instance, it would fail if the patterns are distorted due to the imaging process, viewpoint change, or large intraclass variations among the patterns. Deformable template models [69] or rubber sheet deformations [9] can be used to match patterns when the deformation cannot be easily explained or modeled directly. 1.3 Statistical Approach In the statistical approach, each pattern ...

219 | Syntactic Pattern Recognition and Applications
- Fu
- 1982
Citation Context: ... complex patterns, it is more appropriate to adopt a hierarchical perspective where a pattern is viewed as being composed of simple subpatterns which are themselves built from yet simpler subpatterns [56], [121]. The simplest/elementary subpatterns to be recognized are called primitives and the given complex pattern is represented in terms of the interrelationships between these primitives. In syntact...

168 | Neural Networks: A Review from Statistical Perspective
- Cheng, Titterington
- 1994
Citation Context: ...ssary. Many regularization mechanisms are already built in, such as slow training in combination with early stopping. Other regularization methods include the addition of noise and weight decay [18], [28], [137], and also Bayesian learning [113]. One of the interesting characteristics of multilayer perceptr...

166 | The Evidence Framework Applied to Classification Networks
- MacKay
- 1992
Citation Context: ... number of model selection schemes [71] have been proposed, including Bayesian methods [14], minimum description length (MDL) [138], Akaike information criterion (AIC) [2] and marginalized likelihood [101], [159]. Various other regularization schemes which incorporate prior knowledge about model structure and parameters have also been proposed. Structural risk minimization based on the notion of VC dim...

156 | Knowledge Discovery and Data Mining: Towards a Unifying Framework - Fayyad, Piatetsky-Shapiro, et al. - 1999

156 | Model Selection and the Principle of Minimum Description Length
- Hansen, Yu
- 2001
Citation Context: ...ning, this method does not fully utilize the precious data for training which is especially undesirable when the training data set is small. To avoid this problem, a number of model selection schemes [71] have been proposed, including Bayesian methods [14], minimum description length (MDL) [138], Akaike information criterion (AIC) [2] and marginalized likelihood [101], [159]. Various other regularizat...

156 | Subspace methods of pattern recognition
- Oja
- 1983
Citation Context: ... Fig. 6. Classification error vs. the number of features using the floating search feature selection technique (see text). or the nearest mean classifier can be viewed as finding the nearest subspace [116]. The second main concept used for designing pattern classifiers is based on the probabilistic approach. The optimal Bayes decision rule (with the 0/1 loss function) assigns a pattern to the class wit...

155 | Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners
- Raudys, Jain
- 1991
Citation Context: ... if the number of training samples that are used to design the classifier is small relative to the number of features. This paradoxical behavior is referred to as the peaking phenomenon [80], [131], [132]. A simple explanation for this phenomenon is as follows: The most commonly used parametric classifiers estimate the unknown parameters and plug them in for the true parameters in the class-conditiona...

153 | A note on genetic algorithms for large-scale feature selection - Siedlecki, Sklansky - 1989

143 | Artificial Neural Networks for Feature Extraction and Multivariate Data Projection - Mao, Jain - 1995