## Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems (1998)


Venue: Proceedings of the IEEE

Citations: 253 (11 self)

### BibTeX

@ARTICLE{Rose98deterministicannealing,
  author = {Kenneth Rose},
  title = {Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems},
  journal = {Proceedings of the IEEE},
  volume = {86},
  number = {11},
  pages = {2210--2239},
  year = {1998}
}

### Abstract

this paper. Let us place it within the neural network perspective, and particularly that of learning. The area of neural networks has greatly benefited from its unique position at the crossroads of several diverse scientific and engineering disciplines including statistics and probability theory, physics, biology, control and signal processing, information theory, complexity theory, and psychology (see [45]). Neural networks have provided a fertile soil for the infusion (and occasionally confusion) of ideas, as well as a meeting ground for comparing viewpoints, sharing tools, and renovating approaches. It is within the ill-defined boundaries of the field of neural networks that researchers in traditionally distant fields have come to the realization that they have been attacking fundamentally similar optimization problems.

### Citations

8520 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ... of the random binary variable which takes the value one if input is assigned to codevector , and zero if not. From this perspective one may recognize the known expectation maximization (EM) algorithm [21] in the above two-step iteration. The first step, which computes the association probabilities, is the “expectation” step, and the second step which minimizes is the “maximization” (of ) step. Note fu...
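
To make the two-step iteration concrete, here is a minimal numerical sketch of one round: a soft "expectation" step computing Gibbs association probabilities at a temperature T, then a "maximization" step re-estimating the codevectors. It assumes squared-error distortion; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def soft_cluster_step(X, Y, T):
    """One round of the two-step iteration sketched above.
    X: (n, d) training vectors; Y: (m, d) codevectors; T: temperature."""
    # "Expectation" step: association probabilities p(y_j | x) ∝ exp(-||x - y_j||^2 / T)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)  # (n, m) squared distances
    logits = -d2 / T
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    # "Maximization" step: move each codevector to its probability-weighted centroid
    Y_new = (P.T @ X) / P.sum(axis=0)[:, None]
    return P, Y_new
```

As T shrinks, P approaches hard nearest-neighbor assignments and the update reduces to the usual centroid rule.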

4050 |
Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context ...ition design step is readily specified by the nearest neighbor rule, in the tree-structured case an optimal partition is determined only by solving a formidable multiclass risk discrimination problem [23]. Thus, the heuristically determined high-level boundaries may severely constrain the final partition at the leaf layer, yet they are not readjusted when lower layers are being designed. The DA approa...

3889 |
Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images
- Geman, Geman
- 1984
Citation Context ...ve to that of the current state. However, one must be very careful with the annealing schedule, i.e., the rate at which the temperature is lowered. In their work on image restoration, Geman and Geman [34] have shown that, in theory, the global minimum can be achieved if the schedule obeys T_k ≥ c/log(1 + k), where k is the number of the current iteration (see also the derivation of necessary and sufficient conditions for...

3878 |
Neural Networks: A Comprehensive Foundation
- Haykin
- 1994
Citation Context ...diverse scientific and engineering disciplines including statistics and probability theory, physics, biology, control and signal processing, information theory, complexity theory, and psychology (see [45]). Neural networks have provided a fertile soil for the infusion (and occasionally confusion) of ideas, as well as a meeting ground for comparing viewpoints, sharing tools, and renovating approaches. ...

3740 | Optimization by simulated annealing
- Kirkpatrick, Gelatt, et al.
- 1983
Citation Context ...signs nonzero probability only to global minimum configurations. A known technique for nonconvex optimization that capitalizes on this physical analogy is stochastic relaxation or simulated annealing [54] based on the Metropolis algorithm [68] for atomic simulations. A sequence of random moves is generated and the random decision to accept a move depends on the cost of the resulting configuration rela...

2385 |
Equation of State Calculations by Fast Computing Machines
- Metropolis, Rosenbluth, et al.
- 1953
Citation Context ...l minimum configurations. A known technique for nonconvex optimization that capitalizes on this physical analogy is stochastic relaxation or simulated annealing [54] based on the Metropolis algorithm [68] for atomic simulations. A sequence of random moves is generated and the random decision to accept a move depends on the cost of the resulting configuration relative to that of the current state. Howe...
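
A compact sketch of this recipe may help: the Metropolis acceptance rule combined with the logarithmic cooling schedule of [34]. The `cost` and `propose` callables are hypothetical, user-supplied and problem-specific; this is an illustration of the general scheme, not code from any of the cited papers.

```python
import math
import random

def metropolis_anneal(cost, propose, x0, c=1.0, iters=10000):
    """Simulated annealing in the spirit of [54], [68]: accept a random
    move unconditionally if it lowers the cost, and with probability
    exp(-dE / T) otherwise. The schedule T_k = c / log(1 + k) is the one
    shown by Geman and Geman [34] to guarantee convergence in theory
    (in practice it is far too slow, hence faster heuristic schedules)."""
    x, e = x0, cost(x0)
    for k in range(1, iters + 1):
        T = c / math.log(1 + k)          # logarithmic annealing schedule
        x_new = propose(x)               # random move
        e_new = cost(x_new)
        if e_new <= e or random.random() < math.exp(-(e_new - e) / T):
            x, e = x_new, e_new          # accept the move
    return x, e
```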

2057 |
A new look at the statistical model identification
- Akaike
- 1974
Citation Context ...arsimony, of the solution, in addition to performance on the training set. In one basic approach, penalty terms are added to the training cost, either to directly favor the formation of a small model [1], [85], or to do so indirectly via regularization/smoothness constraints or other costs which measure overspecialization. A second common approach is to build a large model, overspecialized to the tra...

1990 | Some Methods for Classification and Analysis of Multivariate Observations
- MacQueen
- 1967
Citation Context ...nt of the subject within the areas of compression and communications see [36]. In the pattern-recognition literature, similar algorithms have been introduced including the ISODATA [4] and the K-means [63] algorithms. Later, fuzzy relatives to these algorithms were derived [9], [25]. All these iterative methods alternate between two complementary steps: optimization of the encoding rule for the current...
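
The shared alternation is easy to state in code. Below is a minimal sketch of one GLA / K-means iteration under squared-error distortion: optimize the encoder for the current codebook (nearest-neighbor rule), then optimize the codebook for that encoder (centroid rule). Names are illustrative.

```python
import numpy as np

def gla_step(X, Y):
    """One iteration of the GLA / K-means alternation.
    X: (n, d) training vectors; Y: (m, d) current codebook."""
    # Step 1: optimize the encoding rule -- nearest-neighbor assignment
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    # Step 2: optimize the codebook -- centroid (decoder) update;
    # an empty cell simply keeps its previous codevector
    Y_new = Y.copy()
    for j in range(len(Y)):
        members = X[assign == j]
        if len(members) > 0:
            Y_new[j] = members.mean(axis=0)
    return assign, Y_new
```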

1705 |
Vector quantization and signal compression
- Gersho, Gray
- 1992
Citation Context ...[59], and the resulting algorithm is commonly referred to as the generalized Lloyd algorithm (GLA). For a comprehensive treatment of the subject within the areas of compression and communications see [36]. In the pattern-recognition literature, similar algorithms have been introduced including the ISODATA [4] and the K-means [63] algorithms. Later, fuzzy relatives to these algorithms were derived [9],...

1584 |
Clustering algorithms
- Hartigan
- 1975
Citation Context ...ch grows linearly, rather than exponentially, with the dimension and the rate. The design of TSVQ is, in general, a harder optimization problem than the design of regular VQ. Typical approaches [15], [44], [84] employ a greedy sequential design, optimizing a local cost to grow the tree one node (or layer) at a time. The reason for the greedy nature of standard approaches is that, whereas in the unstru...

1512 |
Information theory and reliable communication
- Gallager
- 1968
Citation Context ...annot be attained by any coding system. Important extensions of the theory to more general classes of sources than those originally considered by Shannon have been developed since (see, e.g., [7] and [32]). Explicit analytical evaluation of the function has been generally elusive, except for very few examples of sources and distortion measures. Two main approaches were taken to address this problem. T...

1263 |
An Algorithm for Vector Quantizer Design
- Linde, Buzo, et al.
- 1980
Citation Context ...d for scalar quantization, which is known as the Lloyd algorithm [60] or the Max quantizer [65]. This method was later generalized to vector quantization, and to a large family of distortion measures [59], and the resulting algorithm is commonly referred to as the generalized Lloyd algorithm (GLA). For a comprehensive treatment of the subject within the areas of compression and communications see [36]...

889 | Least squares quantization in PCM
- Lloyd
- 1982
Citation Context ...een developed in different disciplines. In the communications or information-theory literature, an early clustering method was suggested for scalar quantization, which is known as the Lloyd algorithm [60] or the Max quantizer [65]. This method was later generalized to vector quantization, and to a large family of distortion measures [59], and the resulting algorithm is commonly referred to as the gene...

820 |
A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains
- Baum, Petrie, et al.
- 1970
Citation Context ...rk appeared in [81]. It is shown that DA can be applied to time sequences, and further, can be implemented efficiently by a forward-backward algorithm similar to the Baum–Welch reestimation algorithm [5]. The DA method allows joint optimization of the classifier components to directly minimize the classification error, rather than separate modeling of speech utterances via the maximum likelihood appr...

811 |
Adaptive mixtures of local experts
- Jacobs, Jordan, et al.
- 1991
Citation Context ...ure of Experts: Mixture of experts is an important class of structures that was inspired by mixture models from statistics [67], [102]. This class includes the structures known as “mixture of experts” [51] and “hierarchical mixture of experts” (HME) [53], as well as normalized radial basis functions (NRBF) [75]. We refer to this class generally as mixture of experts (ME) models. ME’s have been suggeste...

747 | Hierarchical mixtures of experts and the EM algorithm
- Jordan, Jacobs
- 1994
Citation Context ...t type of structures that was inspired by mixture models from statistics [67], [102]. This class includes the structures known as “mixture of experts” [51] and “hierarchical mixture of experts” (HME) [53], as well as normalized radial basis functions (NRBF) [75]. We refer to this class generally as mixture of experts (ME) models. ME’s have been suggested for a variety of problems, including classifica...

704 |
Information theory and statistical mechanics
- Jaynes
- 1957
Citation Context ... is to characterize the random solution at gradually diminishing levels of distortion until minimal distortion is reached. To estimate the distribution we appeal to Jaynes’s maximum entropy principle [52] which states: of all the probability distributions that satisfy a given set of constraints, choose the one that maximizes the entropy. The informal justification is that while this choice agrees with...
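
In standard notation (not taken verbatim from the paper), the prescription amounts to a constrained maximization whose solution is the Gibbs distribution:

```latex
% Among all distributions meeting the constraints, choose the entropy maximizer.
% With an expected-cost constraint this reads:
\max_{p}\; -\sum_{x} p(x)\log p(x)
\quad\text{s.t.}\quad \sum_{x} p(x)\,d(x) = D, \qquad \sum_{x} p(x) = 1.
% Lagrange multipliers yield the Gibbs distribution, with \beta (an inverse
% temperature, 1/T) fixed by the expected-cost constraint:
p(x) \;=\; \frac{e^{-\beta\, d(x)}}{\sum_{x'} e^{-\beta\, d(x')}}.
```

Lowering the temperature tightens the distortion constraint, which is exactly the annealing knob in the DA formulation.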

653 |
Statistical analysis of finite mixture distributions
- Titterington, Smith, et al.
- 1985
Citation Context ...ning Set. u Is the Number of Regions Used to Represent the Data 5) Mixture of Experts: Mixture of experts is an important class of structures that was inspired by mixture models from statistics [67], [102]. This class includes the structures known as “mixture of experts” [51] and “hierarchical mixture of experts” (HME) [53], as well as normalized radial basis functions (NRBF) [75]. We refer to this cla...

490 |
Mixture models: Inference and Applications to Clustering
- McLachlan, Basford
- 1988
Citation Context ...e Training Set. u Is the Number of Regions Used to Represent the Data 5) Mixture of Experts: Mixture of experts is an important class of structures that was inspired by mixture models from statistics [67], [102]. This class includes the structures known as “mixture of experts” [51] and “hierarchical mixture of experts” (HME) [53], as well as normalized radial basis functions (NRBF) [75]. We refer to t...

449 |
A Mathematical Theory of Communication
- Shannon
- 1948
Citation Context ...ned by DA, GD, and ML on the Training (TR) and Test (TE) Sets, for HME Design to Approximate Functions. u Denotes the Number of Leaves in the Binary Tree. ...results are due to Shannon [95], [96]. These are the coding theorems which provide an (asymptotically) achievable bound on the performance of source coding methods. This bound is often expressed as an RD function for a given source...

328 |
A fuzzy relative of the ISODATA process and its use in detecting compact, well-separated clusters
- Dunn
- 1973
Citation Context ... In the pattern-recognition literature, similar algorithms have been introduced including the ISODATA [4] and the K-means [63] algorithms. Later, fuzzy relatives to these algorithms were derived [9], [25]. All these iterative methods alternate between two complementary steps: optimization of the encoding rule for the current codebook, and optimization of the codebook for the encoding rule. When operat...

297 | A graduated assignment algorithm for graph matching
- Gold, Rangarajan
- 1996
Citation Context ... deterministically optimized at successively reduced temperatures. This approach was adopted by various researchers in the fields of graph-theoretic optimization and computer vision [10], [26], [33], [37], [98], [99], [108]. Our starting point here is the early work on clustering by deterministic annealing which appeared in [86] and [88]–[90]. Although strongly motivated by the physical analogy, the a...

267 |
Neural gas network for vector quantization and its application to time-series prediction
- Martinetz, Berkovich, et al.
- 1993
Citation Context ...clustering solution, or a quantizer, is obtained. The basic DA approach to clustering has since inspired modifications, extensions, and related work by numerous researchers including [6], [14], [47], [64], [70], [72], [73], [82], [91], [103], [106]. This paper begins with a tutorial review of the basic DA approach to clustering, and then goes into some of its most significant extensions to handle vari...

258 |
Stochastic complexity and modelling
- Rissanen
- 1986
Citation Context ...ony, of the solution, in addition to performance on the training set. In one basic approach, penalty terms are added to the training cost, either to directly favor the formation of a small model [1], [85], or to do so indirectly via regularization/smoothness constraints or other costs which measure overspecialization. A second common approach is to build a large model, overspecialized to the training ...

211 |
Quantizing for minimum distortion
- Max
- 1960
Citation Context ... disciplines. In the communications or information-theory literature, an early clustering method was suggested for scalar quantization, which is known as the Lloyd algorithm [60] or the Max quantizer [65]. This method was later generalized to vector quantization, and to a large family of distortion measures [59], and the resulting algorithm is commonly referred to as the generalized Lloyd algorithm (G...

204 |
Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross-Entropy
- Shore, Johnson
- 1980
Citation Context ...ture on the partition. Earlier work on this problem [73] appealed to the principle of minimum cross-entropy (or minimum divergence) which is a known generalization of the principle of maximum entropy [97]. Minimum cross-entropy provides a probabilistic tool to gradually enforce the desired consistency between the leaf layer, where the quantization cost is calculated, and the rest of the tree—thereby im...

200 | Pairwise data clustering by deterministic annealing
- Hofmann, Buhmann
- 1997
Citation Context ... hard clustering solution, or a quantizer, is obtained. The basic DA approach to clustering has since inspired modifications, extensions, and related work by numerous researchers including [6], [14], [47], [64], [70], [72], [73], [82], [91], [103], [106]. This paper begins with a tutorial review of the basic DA approach to clustering, and then goes into some of its most significant extensions to handl...

197 |
Rate Distortion Theory
- Berger
- 1971
Citation Context ...n must satisfy (33) where are implicit in (23). Equation (33) is thus the equation we solve while optimizing over . This equation also arises from the Kuhn–Tucker conditions of rate-distortion theory [7], [12], [40]. It is instructive to point out that (24) and (33) imply that (34) In other words, the optimal codevector distribution mimics the training data set partition into the clusters. The distri...

185 | Computation of channel capacity and rate-distortion functions
- Blahut
- 1972
Citation Context ...rtant example is the Shannon lower bound [96] which is useful for difference distortion measures. The second main approach was to develop a numerical algorithm, the Blahut–Arimoto (BA) algorithm [2], [11] to evaluate RD functions. The power of the second approach is in that the function can be approximated arbitrarily closely at the cost of ...
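
For reference, a numerical sketch of the BA iteration in its standard textbook form (not code from the paper): given a source distribution, a distortion matrix, and a slope parameter beta, it alternates between the optimal test channel and the output marginal, then reads off one (R, D) point of the curve.

```python
import numpy as np

def blahut_arimoto_rd(p_x, d, beta, iters=200):
    """One point of the rate-distortion curve via Blahut–Arimoto [2], [11].
    p_x: (n,) source distribution; d: (n, m) distortion matrix d(x, y);
    beta: slope parameter (larger beta -> lower distortion, higher rate)."""
    m = d.shape[1]
    q_y = np.full(m, 1.0 / m)                     # output marginal, uniform init
    for _ in range(iters):
        # Optimal test channel for the current output marginal
        w = q_y[None, :] * np.exp(-beta * d)      # unnormalized q(y|x)
        q_yx = w / w.sum(axis=1, keepdims=True)
        # Re-estimate the output marginal
        q_y = p_x @ q_yx
    D = float(np.sum(p_x[:, None] * q_yx * d))
    R = float(np.sum(p_x[:, None] * q_yx *
                     np.log(q_yx / q_y[None, :]))) / np.log(2)  # bits
    return R, D
```

Sweeping beta traces out the RD curve, which is the sense in which the function "can be approximated arbitrarily closely" at increasing computational cost.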

173 |
An analogue approach to the traveling salesman problem using an elastic net method
- Durbin, Willshaw
- 1987
Citation Context ... for solving a variety of hard graph-theoretic problems. As an example, when applied to the famous “traveling salesman” problem, such DA derivation becomes identical to the “elastic net” method [26], [27]. The approach has been applied to various data assignment problems such as the module placement problem in computer-aided design (CAD) and graph partitioning. Another variant with a different constra...

147 |
Parallel and deterministic algorithms from MRF’s: Surface reconstruction
- Geiger, Girosi
- 1991
Citation Context ...and is deterministically optimized at successively reduced temperatures. This approach was adopted by various researchers in the fields of graph-theoretic optimization and computer vision [10], [26], [33], [37], [98], [99], [108]. Our starting point here is the early work on clustering by deterministic annealing which appeared in [86] and [88]–[90]. Although strongly motivated by the physical analogy,...

144 |
Entropy-constrained vector quantization
- Chou, Lookabaugh, et al.
- 1989
Citation Context ...aint within the design to produce quantizers optimized for subsequent entropy coding. The earlier work was concerned with scalar quantizers [8], [29]. The VQ design method was proposed by Chou et al. [19]. We refer to this paradigm as the entropy-constrained VQ (ECVQ). The cost function is the weighted cost J = D + λH (43), where λ determines the penalty for increase in rate relative to increase in distortion. It c...
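
The resulting encoding rule is easy to illustrate: each input is mapped to the codevector that minimizes distortion plus λ times its ideal code length. A minimal sketch, assuming squared-error distortion; the names are illustrative, not from the paper.

```python
import numpy as np

def ecvq_encode(x, Y, p_y, lam):
    """Entropy-constrained encoding in the spirit of ECVQ [19].
    x: (d,) input; Y: (m, d) codebook; p_y: (m,) codeword probabilities;
    lam: Lagrange multiplier trading rate against distortion."""
    d2 = ((Y - x) ** 2).sum(axis=1)        # distortion to each codevector
    rate = -np.log2(p_y)                   # ideal codeword lengths in bits
    return int(np.argmin(d2 + lam * rate)) # minimize the weighted cost
```

With lam = 0 this reduces to plain nearest-neighbor encoding; increasing lam biases the encoder toward frequently used (cheaply coded) codevectors.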

143 | Neural networks and related methods for classification - Ripley - 1994 |

139 |
A study of vector quantization for noisy channels
- Farvardin
- 1990
Citation Context ...orithms to this case. There is a long history of noisy-channel quantizer design. In the 1960’s, a basic method was proposed for scalar quantizers [58] and was extended in many papers since [3], [24], [28], [30], [57], [109]. These papers basically describe GLA-type methods which alternate between enforcing the encoder and centroid (decoder) optimality conditions. One can similarly extend the DA approa...

101 |
Statistical pattern recognition with neural networks: Benchmarking studies
- Kohonen, Barna, et al.
- 1988
Citation Context ... classifier design, [78] for piecewise regression, and [82] for mixture of experts. 1) VQ Classifier Design: The DA approach to VQ classifier design [70] is compared with the learning VQ (LVQ) method [56]. Note that here LVQ will refer narrowly to that design method, not to the structure itself which we call VQ. The first simulation result is on the “synthetic” example from [83], where DA design achie...

96 | A deterministic annealing approach to clustering - Rose, Gurewitz, et al. - 1990 |

93 |
Hedonic prices and the demand for clean air
- Harrison, Rubinfeld
- 1978
Citation Context ...we simply compare the performance of the two regression models versus model size. One benchmark problem is concerned with predicting the value of homes in the Boston area from a variety of parameters [43]. The training set consists of data from 506 homes. The output in this case is the median price of a home, with the input consisting of a vector of 13 scalar features believed to influence the price. ...

92 |
A clustering technique for summarizing multivariate data
- Ball, Hall
- 1967
Citation Context ...omprehensive treatment of the subject within the areas of compression and communications see [36]. In the pattern-recognition literature, similar algorithms have been introduced including the ISODATA [4] and the K-means [63] algorithms. Later, fuzzy relatives to these algorithms were derived [9], [25]. All these iterative methods alternate between two complementary steps: optimization of the encoding...

86 | Nonlinear Gated Experts for Time Series: Discovering Regimes and Avoiding Overfitting
- Weigend, Mangeas, et al.
- 1995
Citation Context ... this class generally as mixture of experts (ME) models. ME’s have been suggested for a variety of problems, including classification [48], [51], control [50], [53], and regression tasks [53], [104], [105]. We define the “local expert” regression function f_j(x, Λ_j), where Λ_j is the set of model parameters for local model j. The ME regression function is defined as f(x) = Σ_j w_j(x) f_j(x, Λ_j) (73), where w_j(x) is a nonnegative weight of association ...
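
A minimal sketch of such an ME prediction, assuming a softmax gate and linear local experts; the names and the linear form are illustrative assumptions, not from the paper.

```python
import numpy as np

def me_predict(x, gate_W, gate_b, experts):
    """Mixture-of-experts regression estimate: a softmax gate produces
    nonnegative weights summing to one, and the output is the weighted sum
    of the local experts, f(x) = sum_j w_j(x) f_j(x).
    x: (d,) input; gate_W: (m, d); gate_b: (m,);
    experts: list of m (A, b) pairs defining linear experts A @ x + b."""
    z = gate_W @ x + gate_b                   # gating logits, one per expert
    z -= z.max()                              # numerical stability
    w = np.exp(z) / np.exp(z).sum()           # softmax gating weights
    preds = np.array([A @ x + b for (A, b) in experts])  # local predictions
    return float(w @ preds)
```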

84 |
On the performance and complexity of channel-optimized vector quantizers
- Farvardin, Vaishampayan
- 1991
Citation Context ...s to this case. There is a long history of noisy-channel quantizer design. In the 1960’s, a basic method was proposed for scalar quantizers [58] and was extended in many papers since [3], [24], [28], [30], [57], [109]. These papers basically describe GLA-type methods which alternate between enforcing the encoder and centroid (decoder) optimality conditions. One can similarly extend the DA approach to ...

76 |
A convergence theorem for the fuzzy ISODATA clustering algorithms
- Bezdek
- 1980
Citation Context ...[36]. In the pattern-recognition literature, similar algorithms have been introduced including the ISODATA [4] and the K-means [63] algorithms. Later, fuzzy relatives to these algorithms were derived [9], [25]. All these iterative methods alternate between two complementary steps: optimization of the encoding rule for the current codebook, and optimization of the codebook for the encoding rule. When ...

70 |
Optimal partitioning for classification and regression trees
- Chou
- 1991
Citation Context ...d its extended form as CART2. Our implementation of CART consists of growing a large “full tree” and then pruning it down to the root node using the Breiman–Friedman–Olshen–Stone algorithm (see, e.g., [18]). The sequence of CART regression trees is obtained during the pruning process. It is known that the pruning phase is optimal given the fully grown tree. Unlike CART2, the complexity of the DA method...

68 | On the training of radial basis function classifiers - Musavi, Ahmed, et al. - 1992 |

67 | Regression modeling in back-propagation and projection pursuit learning
- Hwang, Lay, et al.
- 1994
Citation Context ...btained Using DA and GD Algorithms for NRBF Design for the Fat Content Prediction Problem. u Is the Number of Gaussian Basis Functions Used. “TR” and “TE” Refer to Training and Test Sets, Respectively [49]. Each function was used to generate both a training set and test set of size 225. We designed NRBF and HME regression estimates for each data set using both DA and the competitive design approaches. ...

66 |
A construction of vector quantizers for noisy channels
- Kumazawa, Kasahara, et al.
- 1984
Citation Context ...his case. There is a long history of noisy-channel quantizer design. In the 1960’s, a basic method was proposed for scalar quantizers [58] and was extended in many papers since [3], [24], [28], [30], [57], [109]. These papers basically describe GLA-type methods which alternate between enforcing the encoder and centroid (decoder) optimality conditions. One can similarly extend the DA approach to the no...

61 |
Optimum Quantizer Performance for a Class of Non-Gaussian Memoryless Sources
- Farvardin, Modestino
- 1984
Citation Context ...blem was obtained by incorporation of an entropy constraint within the design to produce quantizers optimized for subsequent entropy coding. The earlier work was concerned with scalar quantizers [8], [29]. The VQ design method was proposed by Chou et al. [19]. We refer to this paradigm as the entropy-constrained VQ (ECVQ). The cost function is the weighted cost J = D + λH (43), where λ determines the penalty for in...

57 |
Learning piecewise control strategies in a modular neural network architecture
- Jacobs, Jordan
- 1993
Citation Context ...adial basis functions (NRBF) [75]. We refer to this class generally as mixture of experts (ME) models. ME’s have been suggested for a variety of problems, including classification [48], [51], control [50], [53], and regression tasks [53], [104], [105]. We define the “local expert” regression function f_j(x, Λ_j), where Λ_j is the set of model parameters for local model j. The ME regression function is defined as (73)...

54 | Vector quantization with complexity costs - Buhmann, Kuhnel - 1993 |

54 |
An analysis of the elastic net approach to the travelling salesman problem
- Durbin, Szeliski, et al.
- 1989
Citation Context ...ation and is deterministically optimized at successively reduced temperatures. This approach was adopted by various researchers in the fields of graph-theoretic optimization and computer vision [10], [26], [33], [37], [98], [99], [108]. Our starting point here is the early work on clustering by deterministic annealing which appeared in [86] and [88]–[90]. Although strongly motivated by the physical an...

53 |
The design of joint source and channel trellis waveform coders
- Ayanoglu, Gray
- 1987
Citation Context ... design algorithms to this case. There is a long history of noisy-channel quantizer design. In the 1960’s, a basic method was proposed for scalar quantizers [58] and was extended in many papers since [3], [24], [28], [30], [57], [109]. These papers basically describe GLA-type methods which alternate between enforcing the encoder and centroid (decoder) optimality conditions. One can similarly extend t...