## Supervised Learning of Quantizer Codebooks by Information Loss Minimization (2007)

Citations: 35 (0 self)

### BibTeX

@MISC{Lazebnik07supervisedlearning,
  author = {Svetlana Lazebnik and Maxim Raginsky},
  title = {Supervised Learning of Quantizer Codebooks by Information Loss Minimization},
  year = {2007}
}

### Abstract

This paper proposes a technique for jointly quantizing continuous features and the posterior distributions of their class labels based on minimizing empirical information loss, such that the index K of the quantizer region to which a given feature X is assigned approximates a sufficient statistic for its class label Y. We derive an alternating minimization procedure for simultaneously learning codebooks in the Euclidean feature space and in the simplex of posterior class distributions. The resulting quantizer can be used to encode unlabeled points outside the training set and to predict their posterior class distributions, and has an elegant interpretation in terms of lossless source coding. The proposed method is extensively validated on synthetic and real datasets, and is applied to two diverse problems: learning discriminative visual vocabularies for bag-of-features image classification, and image segmentation.
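The central quantity in the abstract — the empirical information loss incurred when features are replaced by quantizer indices — can be sketched numerically. The snippet below is a toy illustration with my own naming, not the authors' code: it uses a hard nearest-neighbor assignment rather than the paper's soft, annealed one, and computes the average KL divergence between each point's estimated posterior and the posterior centroid of its Voronoi region.

```python
import numpy as np

def information_loss(X, P, M):
    """X: (N,d) features; P: (N,c) per-point posterior estimates;
    M: (C,d) codebook centers. Returns mean KL(P_i || pi_k)."""
    # Assign each feature to its nearest codebook center (hard Voronoi rule).
    k = np.argmin(((X[:, None, :] - M[None, :, :]) ** 2).sum(-1), axis=1)
    loss = 0.0
    for c in range(len(M)):
        idx = np.where(k == c)[0]
        if len(idx) == 0:
            continue
        # The optimal region centroid on the simplex is the mean posterior.
        pi = np.clip(P[idx].mean(0), 1e-12, None)
        # KL terms, with 0 * log 0 treated as 0 via masking.
        ratio = np.where(P[idx] > 0, P[idx] / pi, 1.0)
        loss += np.sum(P[idx] * np.log(ratio))
    return loss / len(X)
```

If every training point carries the same posterior, quantization loses nothing and the value is zero; in general it is nonnegative and shrinks as the regions become purer in class content.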

### Citations

8650 | Elements of Information Theory
- Cover, Thomas
- 1991
Citation Context ...at the quantized representation of a feature approximates a sufficient statistic for its attribute label. The learning scheme is derived from information-theoretic properties of sufficient statistics [8], [21] and is based on minimizing the loss of information about the attribute that is incurred by the quantization operation (in general, quantization is compression, and some information will inevita...

5237 | Distinctive image features from scale-invariant keypoints
- Lowe
Citation Context ...sets consisting of 100 images per class. In the present experiments, we follow the setup of [23] for feature extraction and training. Namely, the image features are 128-dimensional SIFT descriptors [26] of 16 × 16 patches sampled on a regular 8 × 8 grid. Let us underscore that classifying individual image patches or features is not our goal in this section. In fact, this task is excessively difficul...

4893 | Neural Networks For Pattern Recognition
- Bishop
- 1996
Citation Context ...the form m_k^(t+1) = m_k^(t) − α ∂E^(t)/∂m_k^(t) = m_k^(t) − α Σ_{i=1}^{N} Σ_{j=1}^{C} D(P̂_{X_i} ‖ π_j^(t)) ∂w_j(X_i)/∂m_k^(t), where α > 0 is the learning rate shared by all the centers and found using line search [5], and ∂w_j(x)/∂m_k = β[δ_{jk} w_k(x) − w_k(x) w_j(x)](x − m_k) (15), where δ_{jk} is 1 if j = k and 0 otherwise. For a fixed codebook M, the minimization over Π is accomplished in closed form by setting the derivati...
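Garbling aside, the update quoted above is a plain gradient step on the soft-assignment objective. The sketch below reimplements it with my own variable names, a fixed step size in place of the paper's line search, and the factor-of-two bookkeeping between the squared Euclidean distance and β absorbed into β — so treat the result as a descent direction, not an exact magnitude from the paper.

```python
import numpy as np

def soft_assign(X, M, beta):
    """w_j(x) proportional to exp(-beta * ||x - m_j||^2)."""
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(-1)        # (N, C)
    e = np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))   # stabilized
    return e / e.sum(axis=1, keepdims=True)

def energy(X, P, M, Pi, beta):
    """E = sum_i sum_j w_j(X_i) * KL(P_i || pi_j); P, Pi strictly positive."""
    W = soft_assign(X, M, beta)
    D = (P[:, None, :] * np.log(P[:, None, :] / Pi[None, :, :])).sum(-1)
    return (W * D).sum()

def center_gradient(X, P, M, Pi, beta):
    """Gradient of E w.r.t. the centers m_k, in the form of eq. (15)."""
    W = soft_assign(X, M, beta)
    D = (P[:, None, :] * np.log(P[:, None, :] / Pi[None, :, :])).sum(-1)
    C = len(M)
    G = np.zeros_like(M)
    for k in range(C):
        # dw_j/dm_k = beta * w_k * (delta_jk - w_j) * (x - m_k)
        coef = beta * W[:, [k]] * (np.eye(C)[k][None, :] - W)  # (N, C)
        G[k] = ((D * coef).sum(1)[:, None] * (X - M[k])).sum(0)
    return G
```

A small step along the negative gradient, `M - alpha * center_gradient(...)`, decreases the objective while the posterior centroids Π stay fixed, matching the alternation described in the excerpt.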

3151 | Introduction to Modern Information Retrieval
- Salton, McGill
- 1983
Citation Context ...te in this section the use of our method to build effective discrete visual vocabularies for bag-of-features image classification [10], [41], [46]. Analogously to bag-of-words document classification [39], this framework represents images by histograms of discrete indices of the “visual words” contained in them. Despite the extreme simplicity of this model — in particular, its lack of information abou...

1336 | Color indexing
- Swain, Ballard
- 1991
Citation Context ...tor machines, we use the histogram intersection kernel [29], [43], defined by K(N(I1), N(I2)) = Σ_{k=1}^{C} min(Nk(I1), Nk(I2)). (Another reference performance figure for a 13-class subset of this dataset is 65.2% by Fei-Fei and Perona [14].) As seen from Figure 8, codebooks produced by our method yield an improvement over k-means, which, though not large in absolute terms (2% to 4%...
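The histogram intersection kernel quoted in this excerpt is simple to state in code. Below is a minimal sketch with made-up codeword counts (not from the paper's experiments):

```python
import numpy as np

def hist_intersection(h1, h2):
    """K(N(I1), N(I2)) = sum_k min(N_k(I1), N_k(I2))."""
    return np.minimum(h1, h2).sum()

h1 = np.array([3, 0, 5, 2])   # codeword counts for image I1 (invented)
h2 = np.array([1, 4, 2, 2])   # codeword counts for image I2 (invented)
print(hist_intersection(h1, h2))   # min per bin: 1 + 0 + 2 + 2 = 5
```

The kernel is symmetric, and an image's similarity to itself is just its total count, which is why the histograms are often normalized before use.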

1170 | Information theory and statistics
- Kullback
- 1959
Citation Context ...retains all the information that is useful for predicting the attribute. Then the compressed representation of the features created by this quantizer may be called a sufficient statistic [6], [21] for the attribute labels, and for any statistical decision procedure about the attribute that uses the original features, we can find another one that performs just as well based on the output of the...

1007 | A Probabilistic Theory of Pattern Recognition - Devroye, Györfi, et al. - 1996

971 | Beyond bag of features: Spatial pyramid matching for recognizing natural scene categories
- Lazebnik, Schmid, et al.
- 2006
Citation Context ...een different scene categories, and is quite challenging — for example, it is difficult to distinguish indoor categories such as bedroom and living room. This dataset has been used by Lazebnik et al. [23], who report a bag-of-features classification rate of 72.2% with a k-means vocabulary of size 200 and training sets consisting of 100 images per class. In the present experiments, we follow the setu...

946 | Video google: A text retrieval approach to object matching in videos
- Sivic, Zisserman
- 2003
Citation Context ...its own sake, but for the sake of facilitating the subsequent step of learning a statistical model for classification or inference. For example, bag-of-features models for image classification [10], [41], [46] work by quantizing high-dimensional descriptors of local image patches into discrete visual codewords, representing images by frequency counts of the codeword indices contained in them, and the...

887 | The Elements of Statistical Learning
- Hastie, Tibshirani, et al.
- 2001
Citation Context ...eks to minimize the probability of classification error Pr[Ŷ(X) ≠ Y] over some family of classifiers Ŷ: X → Y, such as k-nearest-neighbor classifiers, decision trees or support vector machines [15]. A more general approach is based on the notion of sufficient statistics. Informally, a sufficient statistic of X for Y contains as much information about Y as X itself. Hence an optimal hypothesis t...

769 | A Comparison of Event Models for Naive Bayes Text Classification. AAAI-98 Workshop on Learning for Text Categorization
- McCallum, Nigam
- 1998
Citation Context ...pes of codebooks. We use two different classifiers, Naive Bayes (NB) and support vector machines (SVM). Naive Bayes performs maximum likelihood classification according to the multinomial event model [27]: P(I|y) = Π_k P(k|y)^{Nk(I)} (20), where in the case of a codebook output by our method, P(k|y) is obtained directly by Bayes rule from the centroid π_k. For support vector machines, we use the hist...
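The multinomial event model in eq. (20) scores a class by multiplying per-codeword probabilities raised to the observed counts, which is best done in log space. A minimal sketch with invented class-conditional probabilities (not the paper's learned values):

```python
import numpy as np

def nb_log_score(counts, log_pk_given_y, log_prior):
    """counts: (C,) codeword histogram N_k(I); log_pk_given_y: (Y,C);
    log_prior: (Y,). Returns per-class log P(y) + sum_k N_k log P(k|y)."""
    return log_prior + log_pk_given_y @ counts

pk = np.array([[0.7, 0.2, 0.1],    # P(k|y=0), invented
               [0.1, 0.3, 0.6]])   # P(k|y=1), invented
prior = np.array([0.5, 0.5])
counts = np.array([0, 1, 5])       # image histogram, invented
scores = nb_log_score(counts, np.log(pk), np.log(prior))
print(scores.argmax())   # 1 — codeword 2 dominates and favors class 1
```

In the paper's setup, P(k|y) would come from the learned simplex centroids π_k via Bayes rule rather than being set by hand.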

748 | On the estimation of a probability density function and the mode
- Parzen
- 1962
Citation Context ...te µ by the empirical distribution µ̂ = (1/N) Σ_{i=1}^{N} δ_{X_i}, and to obtain estimates P̂_{X_i} for each i = 1, 2, ..., N either by point masses, i.e., P̂_{X_i} = δ_{Y_i}, or by using a consistent nonparametric density estimator such as Parzen windows [32]. Also, let P̂ = (1/N) Σ_{i=1}^{N} P̂_{X_i} denote the empirical estimate of P. Then the empirical version of the mutual information between X and Y is given by Î(X; Y) = 1...

691 | Modeling the shape of the scene: a holistic representation of the spatial envelope
- Oliva, Torralba
- 2001
Citation Context ...[Fig. 7 residue: scene category labels — office, kitchen, living room, bedroom, store, industrial, tall building∗, inside city∗, street∗, highway∗, coast∗, open country∗, mountain∗, forest∗, suburb.] Fig. 7. Example images from the scene category database. The starred categories originate from Oliva and Torralba [31]. The entire dataset is publicly available at http://www-cvr.ai.uiuc.edu/ponce grp/data. [Truncated results table: NB and SVM classification rates for k-means vs. info-loss codebooks at C = 32, 64, 128, 256.]...

597 | Visual categorization with bags of keypoints
- Csurka, Bray, et al.
- 2004
Citation Context ...ot for its own sake, but for the sake of facilitating the subsequent step of learning a statistical model for classification or inference. For example, bag-of-features models for image classification [10], [41], [46] work by quantizing high-dimensional descriptors of local image patches into discrete visual codewords, representing images by frequency counts of the codeword indices contained in them, a...

552 | A Bayesian hierarchical model for learning natural scene categories
- Li, Perona
Citation Context ...he centroid π_k. For support vector machines, we use the histogram intersection kernel [29], [43], defined by K(N(I1), N(I2)) = Σ_{k=1}^{C} min(Nk(I1), Nk(I2)). (Another reference performance figure for a 13-class subset of this dataset is 65.2% by Fei-Fei and Perona [14].) As seen from Figure 8, codebooks produced by our method yield an improvement over k-means, which, though not la...

439 | The information bottleneck method
- Tishby, Pereira, et al.
- 1999
Citation Context ...ded outside the original input set. There are several clustering methods motivated by an information-theoretic interpretation of sufficient statistics in terms of mutual information [11], [40], [42], [44]. The information bottleneck (IB) method [42], [44] is a general theoretical framework for clustering problems where the goal is to find a compressed representation K of the data X under the constrain...

382 | Local features and kernels for classification of texture and object categories: A comprehensive study
- Zhang, Marszalek, et al.
Citation Context ...wn sake, but for the sake of facilitating the subsequent step of learning a statistical model for classification or inference. For example, bag-of-features models for image classification [10], [41], [46] work by quantizing high-dimensional descriptors of local image patches into discrete visual codewords, representing images by frequency counts of the codeword indices contained in them, and then lear...

315 | Rate distortion theory: A mathematical basis for data compression - Berger - 1971

311 | Clustering with Bregman divergences
- Banerjee, Merugu, et al.
- 2005
Citation Context ...Î(K; Y) = (1/N) Σ_{k=1}^{C} Σ_{X_i ∈ R_k} D(P̂_{X_i} ‖ π_k). (7) It is not hard to show, either directly or using the fact that the relative entropy D(·‖·) is a Bregman divergence on the probability simplex P(Y) over Y [3], that π_k = arg min_π Σ_{X_i ∈ R_k} D(P̂_{X_i} ‖ π), ∀k = 1, 2, ..., C (8), where the minimization is over all π in the interior of P(Y) (i.e., π(y) > 0 for all y ∈ Y). Furthermore, π_k is the unique minimizer in ...
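Eq. (8) above is the Bregman-centroid property: for relative entropy, the minimizer of the summed divergences over the second argument is the plain average of the distributions. A quick numeric spot check (my own toy data, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=50)   # 50 posteriors on a 4-class simplex
centroid = P.mean(0)                     # candidate centroid: the plain average

def total_kl(pi):
    # sum_i D(P_i || pi); Dirichlet samples are strictly positive a.s.
    return (P * np.log(P / pi)).sum()

best = total_kl(centroid)
# No other point of the simplex does better (spot check 200 random candidates).
worst_gap = min(total_kl(rng.dirichlet(np.ones(4))) - best for _ in range(200))
print(worst_gap > 0)   # True: the average attains the minimum
```

Note the direction matters: this holds for D(P̂ ‖ π) with π in the second slot, which is exactly the orientation used in eqs. (7)–(9).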

263 | The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming
- Bregman
- 1967
Citation Context ...ction (1/N) Σ_{k=1}^{C} Σ_{X_i ∈ R_k} D(P̂_{X_i} ‖ q_k). (9) Banerjee et al. [3] give a more general derivation of this expression for empirical information loss in the context of clustering with Bregman divergences [7], of which the relative entropy is a special case; the derivation included in our paper is meant to make it self-contained. The requirement that π lie in the interior of the probability simplex is a...

250 | Deterministic annealing for clustering, compression, classification, regression, and related optimization problems
- Rose
- 1998
Citation Context ...ller β’s correspond to softer cluster assignments, and the limit of infinite β yields hard clustering. While in principle it is possible to use annealing techniques to pass to the limit of infinite β [38], we have found that a fixed value of β works well in practice (our method for selecting this value in the experiments will be discussed in Section IV). Note also that we deliberately avoid any probab...
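The role of β described above — small β gives soft, near-uniform assignments; large β recovers hard nearest-neighbor clustering — is easy to see in a tiny demo. Variable names and data are mine, assuming weights of the usual form w_k(x) ∝ exp(−β‖x − m_k‖²):

```python
import numpy as np

def soft_weights(x, M, beta):
    """Soft assignment of a single point x to centers M at temperature 1/beta."""
    d2 = ((x - M) ** 2).sum(1)
    e = np.exp(-beta * (d2 - d2.min()))   # shift for numerical stability
    return e / e.sum()

M = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # three invented centers
x = np.array([0.2, 0.1])                            # x is nearest to m_0
print(soft_weights(x, M, beta=0.01))   # nearly uniform over the 3 centers
print(soft_weights(x, M, beta=100.0))  # essentially one-hot on m_0
```

This is why annealing schedules that grow β interpolate from fuzzy clustering to a hard Voronoi partition.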

204 | Information-type measures of difference of probability distributions and indirect observations
- Csiszár
- 1967
Citation Context ...efficient practical algorithms for information loss minimization could be derived by considering another general class of divergences, the so-called Ali–Silvey–Csiszár distances or f-divergences [2], [9] (this class also contains the relative entropy, as well as the variational distance and the Bhattacharyya coefficient). A (noniterative) technique for designing high-resolution quantizers for use in ...

182 | Learning a classification model for segmentation
- Ren, Malik
Citation Context ...known attributes of all the image pixels using a much smaller set of appearance centroids. Note that the Voronoi regions into which our procedure partitions the image can be thought of as superpixels [35], or coherent and relatively homogeneous units of image description. In our implementation, the appearance attribute or label Y of each pixel X is its color or grayscale value discretized to 100 level...

171 | Modeling by shortest data description. Automatica 14
- Rissanen
- 1978
Citation Context ...summary of our contributions and an outline of possible future directions. We also include an appendix discussing an interpretation of information loss minimization in terms of lossless source coding [36]. A preliminary version of this research has appeared in AISTATS 2007 [24]. [Fig. 1 residue: continuous feature vector X → quantization → codeword index K → prediction → attribute (class label) Y.]...

171 | Elements of Information Theory (2nd ed.) - Cover, Thomas - 2006

157 | A general class of coefficients of divergence of one distribution from another
- Ali, Silvey
Citation Context ...More efficient practical algorithms for information loss minimization could be derived by considering another general class of divergences, the so-called Ali–Silvey–Csiszár distances or f-divergences [2], [9] (this class also contains the relative entropy, as well as the variational distance and the Bhattacharyya coefficient). A (noniterative) technique for designing high-resolution quantizers for us...

142 | The Minimum Description Length Principle
- Grünwald
- 2007
Citation Context ...f the separating boundary, note the similarity of this objective function to ours, given by eq. (13). Heiler and Schnörr motivate their objective function in terms of minimum description length [36], [13]. Namely, the quantity D(P_X ‖ P_in/out) represents the excess description length of encoding a pixel with true distribution P_X using a code that is optimal for the distribution P_in/out. We develop this i...

128 | Theory of Games and Statistical Decisions
- Blackwell, Girschick
- 1954
Citation Context ...retains all the information that is useful for predicting the attribute. Then the compressed representation of the features created by this quantizer may be called a sufficient statistic [6], [21] for the attribute labels, and for any statistical decision procedure about the attribute that uses the original features, we can find another one that performs just as well based on the output ...

108 | A divisive information-theoretic feature clustering algorithm for text classification
- Dhillon, Mallela, et al.
- 2003
Citation Context ...that can be extended outside the original input set. There are several clustering methods motivated by an information-theoretic interpretation of sufficient statistics in terms of mutual information [11], [40], [42], [44]. The information bottleneck (IB) method [42], [44] is a general theoretical framework for clustering problems where the goal is to find a compressed representation K of the data X u...

108 | Self-Organizing Maps (3rd ed.)
- Kohonen
- 2000
Citation Context ...egardless of the actual classifier used. Learning Vector Quantization [18], [19] is an early heuristic approach for supervised quantizer design using Voronoi partitions, based on self-organizing maps [20]. An approach more directly related to ours is Generalized Vector Quantization (GVQ) by Rao et al. [34]. GVQ is designed for regression type problems where the goal is to encode or estimate a random v...

92 | Improved Versions of Learning Vector Quantization
- Kohonen
- 1990
Citation Context ...er se, but a quantized representation of the data that preserves the relevant information for a given classification task, regardless of the actual classifier used. Learning Vector Quantization [18], [19] is an early heuristic approach for supervised quantizer design using Voronoi partitions, based on self-organizing maps [20]. An approach more directly related to ours is Generalized Vector Quantizati...

72 | Some inequalities for information divergence and related measures of discrimination
- Topsøe
- 2000
Citation Context ...minimizing an f-divergence loss has been presented in [33], but it requires full knowledge of the underlying probabilistic model. The f-divergences admit a useful information-theoretic interpretation [45], and their structure is specifically suited for use in iterative techniques based on convex optimization. Going beyond information loss minimization, our approach readily extends to any Bregman diver...

67 | The power of word clusters for text classification
- Slonim, Tishby
- 2001
Citation Context ...extended outside the original input set. There are several clustering methods motivated by an information-theoretic interpretation of sufficient statistics in terms of mutual information [11], [40], [42], [44]. The information bottleneck (IB) method [42], [44] is a general theoretical framework for clustering problems where the goal is to find a compressed representation K of the data X under the con...

44 | Learning vector quantization for pattern recognition
- Kohonen
- 1986
Citation Context ...fier per se, but a quantized representation of the data that preserves the relevant information for a given classification task, regardless of the actual classifier used. Learning Vector Quantization [18], [19] is an early heuristic approach for supervised quantizer design using Voronoi partitions, based on self-organizing maps [20]. An approach more directly related to ours is Generalized Vector Quan...

42 | Combining image compression and classification using vector quantization
- Oehler, Gray
- 1995
Citation Context ...r the attribute labels, and to information loss as the objective criterion. Another supervised quantizer design approach is the work on jointly learning a codebook and a classifier by Oehler and Gray [30]. However, this work is tailored for use with Maximum A Posteriori (MAP) classification, whereas we use a much more general information-theoretic formulation which produces discriminative quantized re...

41 | Building kernels from binary strings for image matching
- Odone, Barla, et al.
- 2005
Citation Context ...P(I|y) = Π_k P(k|y)^{Nk(I)} (20), where in the case of a codebook output by our method, P(k|y) is obtained directly by Bayes rule from the centroid π_k. For support vector machines, we use the histogram intersection kernel [29], [43], defined by K(N(I1), N(I2)) = Σ_{k=1}^{C} min(Nk(I1), Nk(I2)). (Another reference performance figure for a 13-class subset of this dataset is 65.2% by Fei-Fei and Perona [14].)...

33 | Randomized clustering forests for building fast and discriminative visual vocabularies
- Moosmann, Triggs, et al.
- 2007
Citation Context ...hat is used to quantize the image features into discrete visual words. In recent literature, the problem of effective design of these codebooks has been gaining increasing attention (see, e.g., [22], [28] and references therein). [Figure residue: classification-rate plot and scene category labels omitted.]...

28 | The Bayesian Choice (2nd ed.)
- Robert
- 2001
Citation Context ...real-world problems. APPENDIX: LOSSLESS SOURCE CODING INTERPRETATION. If we formulate the problem of inferring the class label Y from the observed feature X in the Bayesian decision-theoretic framework [37], then the main object of interest is the posterior distribution, i.e., the conditional distribution P_x of Y given X = x. Let us consider the problem of using the training sequence {(X_i, Y_i)}_{i=1}^{N} to ...

27 | Natural image statistics for natural image segmentation
- Heiler, Schnörr
- 2003
Citation Context ...d by sampling each pixel from the appearance distribution π_k of its Voronoi region. In existing literature, KL-divergence has been used for segmentation by Heiler and Schnörr [16], who have proposed a variational framework to partition an image into two regions, Ω_in and Ω_out, by a smooth curve C. Their objective function is as follows: L(Ω_in, Ω_out) = ∮_C ds + ∫_{Ω_in} D(P_x ‖ P_in) dx + D...

27 | Information-based clustering
- Slonim, Atwal, et al.
- 2005
Citation Context ...can be extended outside the original input set. There are several clustering methods motivated by an information-theoretic interpretation of sufficient statistics in terms of mutual information [11], [40], [42], [44]. The information bottleneck (IB) method [42], [44] is a general theoretical framework for clustering problems where the goal is to find a compressed representation K of the data X under t...

19 | A generalized VQ method for combined compression and estimation
- Rao, Miller, et al.
- 1996
Citation Context ...approach for supervised quantizer design using Voronoi partitions, based on self-organizing maps [20]. An approach more directly related to ours is Generalized Vector Quantization (GVQ) by Rao et al. [34]. GVQ is designed for regression type problems where the goal is to encode or estimate a random variable Y ∈ Y based on features X ∈ X. This approach assumes a particular distortion or loss function on Y...

17 | Learning-theoretic methods in vector quantization
- Linder
- 2001
Citation Context ...our goal is to learn a codebook that minimizes the loss of information about Y that is incurred by this operation. II. PREVIOUS WORK. The main concern of our paper is empirical quantizer design [12], [25]: given a representative training sequence drawn from the signal space, the goal is to learn a quantization rule that performs well not only on the specific training examples, but also on arbitrary, p...

17 | Fine quantization in signal detection and estimation
- Poor
- 1988
Citation Context ...e Bhattacharyya coefficient). A (noniterative) technique for designing high-resolution quantizers for use in a statistical inference procedure by minimizing an f-divergence loss has been presented in [33], but it requires full knowledge of the underlying probabilistic model. The f-divergences admit a useful information-theoretic interpretation [45], and their structure is specifically suited for use i...

9 | Learning nearest-neighbor quantizers from labeled data by information loss minimization
- Lazebnik, Raginsky
- 2007
Citation Context ...We also include an appendix discussing an interpretation of information loss minimization in terms of lossless source coding [36]. A preliminary version of this research has appeared in AISTATS 2007 [24]. Fig. 1. The task of quantization (compression) for the sake of classification. X ...

3 | Lloyd clustering of Gauss mixture models for image compression and classification
- 2005
Citation Context ..., i.e., the means m_k and the class-specific mixture weights P(k|y), are learned using the EM algorithm [5]. (Alternatively, one could use GMVQ, a hard clustering algorithm for Gauss mixture modeling [1].) Instead of fixing a global value of σ², we also experimented with including the variances σ²_k as parameters in the optimization, but this had little effect on classification performance, or eve...

1 | Latent mixture vocabularies for object characterization
- Larlus, Jurie
- 2005
Citation Context ...book that is used to quantize the image features into discrete visual words. In recent literature, the problem of effective design of these codebooks has been gaining increasing attention (see, e.g., [22], [28] and references therein). [Figure residue: classification-rate plot and scene category labels omitted.]...