## An Information Theoretic Approach to Machine Learning (2005)

Citations: 7 (2 self)

### BibTeX

```bibtex
@TECHREPORT{Jenssen05aninformation,
  author      = {Robert Jenssen},
  title       = {An Information Theoretic Approach to Machine Learning},
  institution = {},
  year        = {2005}
}
```

### Abstract

In this thesis, theory and applications of machine learning systems based on information theoretic criteria as performance measures are studied. A new clustering algorithm based on maximizing the Cauchy-Schwarz (CS) divergence measure between probability density functions (pdfs) is proposed. The CS divergence is estimated non-parametrically using the Parzen window technique for density estimation. The problem domain is transformed from discrete 0/1 cluster membership values to continuous membership values. A constrained gradient descent maximization algorithm is implemented. The gradients are stochastically approximated to reduce computational complexity, making the algorithm more practical. Parzen window annealing is incorporated into the algorithm to help avoid convergence to a local maximum. The clustering results obtained on synthetic and real data are encouraging. The Parzen window-based estimator for the CS divergence is shown to have a dual expression as a measure of the cosine of the angle between cluster mean vectors in a feature space determined by the eigenspectrum of a Mercer kernel matrix. A spectral clustering
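The CS divergence estimator described above can be sketched numerically. For two pdfs f and g, D_CS = −log(∫fg / √(∫f² ∫g²)), and the Parzen-window estimate replaces each integral by an average of kernel evaluations between (or within) the two sample sets. The Gaussian kernel, the fixed bandwidth, and the simplified bandwidth handling below are illustrative assumptions, not the thesis's exact settings:

```python
import numpy as np

def mean_kernel(a, b, sigma):
    """Average Gaussian kernel value over all sample pairs from a and b,
    i.e. a Parzen-style estimate of the integral of the product of the
    two underlying densities (1-D data, bandwidth handling simplified)."""
    d2 = (a[:, None] - b[None, :]) ** 2          # pairwise squared distances
    return float(np.mean(np.exp(-d2 / (2.0 * sigma**2))))

def cs_divergence(x, y, sigma=1.0):
    """Parzen-window estimate of the Cauchy-Schwarz divergence
    D_CS = -log( <f,g> / sqrt(<f,f><g,g>) ) between the densities
    underlying the sample sets x and y."""
    cross = mean_kernel(x, y, sigma)
    norm = np.sqrt(mean_kernel(x, x, sigma) * mean_kernel(y, y, sigma))
    return float(-np.log(cross / norm))

rng = np.random.default_rng(0)
overlapping = cs_divergence(rng.normal(0, 1, 200), rng.normal(0, 1, 200))
separated = cs_divergence(rng.normal(0, 1, 200), rng.normal(5, 1, 200))
# Well-separated sample sets yield a much larger divergence than overlapping ones.
```

The cross term is small when the clusters barely overlap, which is what makes maximizing this quantity a clustering criterion.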

### Citations

8980 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...rcer kernel-based learning algorithms [Shawe-Taylor and Cristianini, 2004, Müller et al., 2001, Perez-Cruz and Bousquet, 2004, Schölkopf and Smola, 2002], of which support vector machines [Cortes and **Vapnik, 1995**, Vapnik, 1995, Cristianini and Shawe-Taylor, 2000, Burges, 1998, Hastie et al., 2004], kernel principal component analysis [Schölkopf et al., 1998], and kernel Fisher discriminant analysis [Mika et a... |

8563 | Elements of Information Theory
- Cover, Thomas
- 1991
Citation Context: ...on theory has traditionally been associated with the communications area, where it has had a tremendous impact in the design of efficient and reliable communication systems [Shannon and Weaver, 1949, **Cover and Thomas, 1991**, Fano, 1961]. Shannon [1948] laid down the foundations of information theory, by defining a quantitative measure of the uncertainty, or information, associated with the outcome of a stochastic experi... |

8089 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...s a mixture of these component distributions, which are most often chosen to be Gaussian [Banfield and Raftery, 1993]. The fit between the data and the model is then optimized using the EM algorithm [**Dempster et al., 1977**]. Traditionally, the most well-known graph theoretic partitional clustering algorithm is based on construction of the minimal spanning tree (MST) of the data [Zahn, 1971]. The clusters are generated ... |

6050 | A mathematical theory of communication
- Shannon
- 1948
Citation Context: ...ent has some “uncertainty” associated with it. Shannon [1948] was the first to define a quantitative measure, H = HN(p1,...,pN), of this uncertainty, satisfying the following set of basic postulates [**Shannon, 1948**, Renyi, 1976a]: 1. HN(p1,...,pN) is a symmetric function of its variables. 2. HN(p1,...,pN) is a continuous function of p1,...,pN. 3. HN attains its maximum value for the uniform distribution (1/N, ..., 1/N). 4. HN+1(tp1, (1 − ... |
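The quoted postulates are easy to verify numerically for the discrete Shannon entropy. A small sketch follows; note that postulate 4 is cut off in the excerpt, and the grouping property checked below is the standard axiom from the information theory literature rather than the thesis's exact wording:

```python
import math

def H(p):
    """Shannon entropy H_N(p1,...,pN) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Postulate 1: H is a symmetric function of its arguments.
sym = H([0.2, 0.8]) == H([0.8, 0.2])

# Postulate 3: the maximum is attained at the uniform distribution (1/N,...,1/N).
peak = H([0.25] * 4) > H([0.7, 0.1, 0.1, 0.1])

# Postulate 4 (grouping, truncated above): splitting an outcome of probability
# p1 into fractions t and 1-t adds exactly p1 * H2(t, 1-t) to the entropy.
p1, p2, t = 0.5, 0.5, 0.3
grouping_gap = abs(H([t * p1, (1 - t) * p1, p2]) - (H([p1, p2]) + p1 * H([t, 1 - t])))
```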

3921 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context: ...nd varied. There are a number of books available, some dating back to the 1970’s [Anderberg, 1973, Hartigan, 1975]. More recent books include [Jain and Dubes, 1988, Theodoridis and Koutroumbas, 1999, **Duda et al., 2001**]. It is impossible to cover all kinds of clustering algorithms. A taxonomy anno 1999 of the most well-known clustering methods was given by Jain et al. [1999], and to a large extent paralleled by The... |

3629 | Neural Networks: A Comprehensive Foundation (2nd ed.)
- Haykin
- 1999
Citation Context: ...e. Then the additivity property states that DKL{f(x),g(x)} = DKL{f(x1),g(x1)} + DKL{f(x2),g(x2)}. (2.15) The Kullback-Leibler divergence is also invariant under the following changes in the vector x [**Haykin, 1999**]: 1. Permutation of the order in which the components are arranged. 2. Amplitude scaling. 3. Monotonic nonlinear transformation. It can be seen that the Kullback-Leibler divergence is implicitly based... |

3249 | The anatomy of a large-scale hypertextual web search engine
- Brin, Page
- 1998
Citation Context: ... 2003, 2004]. Since the eigenvalues of such an affinity matrix are known as the spectrum of the matrix, these methods are referred to as spectral algorithms. For example, Google’s PageRank algorithm [**Brin and Page, 1998**] is based on related eigenvector calculations [Ng et al., 2001]. 1.2 Extended Summary This thesis considers both theoretical and application aspects of information theoretic learning, broadly defined... |

2649 | Introduction to Statistical Pattern Recognition (2nd edition)
- Fukunaga
- 1990
Citation Context: ...That is, the cluster pdfs need to be estimated “good enough” relative to each other, to obtain a reasonable partition. Such an approach is somewhat similar in spirit to density-based algorithms like [**Fukunaga, 1990**, Ester et al., 1996, Ankerst et al., 1999, Hinneburg and Keim, 1998]. In theory, this approach makes no assumptions with respect to the choice of pdf distance measure or to the choice of density esti... |

2590 | Normalized cuts and image segmentation - Shi, Malik - 1997 |

2284 | A tutorial on support vector machines for pattern recognition
- Burges
- 1998
Citation Context: ...ini, 2004, Müller et al., 2001, Perez-Cruz and Bousquet, 2004, Schölkopf and Smola, 2002], of which support vector machines [Cortes and Vapnik, 1995, Vapnik, 1995, Cristianini and Shawe-Taylor, 2000, **Burges, 1998**, Hastie et al., 2004], kernel principal component analysis [Schölkopf et al., 1998], and kernel Fisher discriminant analysis [Mika et al., 1999, Roth and Steinhage, 2000] are examples. Research on Me... |

2171 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation Context: ...s is the Mercer kernel-based learning algorithms [Shawe-Taylor and Cristianini, 2004, Müller et al., 2001, Perez-Cruz and Bousquet, 2004, Schölkopf and Smola, 2002], of which support vector machines [**Cortes and Vapnik, 1995**, Vapnik, 1995, Cristianini and Shawe-Taylor, 2000, Burges, 1998, Hastie et al., 2004], kernel principal component analysis [Schölkopf et al., 1998], and kernel Fisher discriminant analysis [Mika et a... |

2162 | Density Estimation for Statistics and Data Analysis (Monographs on Statistics and Applied Probability)
- Silverman
- 1986
Citation Context: .... [2000a] argued that one should make as few assumptions as possible about the structure of the probability density functions (pdfs) in question. Hence, Parzen windowing [Parzen, 1962, Devroye, 1989, **Silverman, 1986**, Scott, 1992, Wand and Jones, 1995] was proposed as the appropriate density estimation technique, since this method makes no such assumptions. Viola et al. [1995] had already proposed to approximate ... |

2152 | Algorithms for Clustering Data - Jain, Dubes - 1988 |

2028 | Learning with Kernels
- Schölkopf, Smola
- 2002
Citation Context: ... methods. Perhaps the most well-known of these learning schemes is the Mercer kernel-based learning algorithms [Shawe-Taylor and Cristianini, 2004, Müller et al., 2001, Perez-Cruz and Bousquet, 2004, **Schölkopf and Smola, 2002**], of which support vector machines [Cortes and Vapnik, 1995, Vapnik, 1995, Cristianini and Shawe-Taylor, 2000, Burges, 1998, Hastie et al... |

1857 | Some Methods for classification and Analysis of Multivariate Observations
- MacQueen
- 1967
Citation Context: ...ork they use the eigenvectors of the Laplacian matrix to transform, or map, the input data into a new representation, for then to perform the actual clustering in that space by the K-means technique [**MacQueen, 1967**], which we discuss next. Since the 1960’s, the most intuitive and most frequently used criterion function in partitional clustering techniques is the squared error criterion, defined as J(U, C) = ... |
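The squared error criterion J(U, C) introduced (and truncated) above is what batch K-means iterations minimize. A minimal 1-D sketch follows; the deterministic min/max initialization and the toy data are illustrative choices, not MacQueen's original online variant:

```python
import numpy as np

def kmeans_1d(x, iters=20):
    """Batch K-means (k=2) on 1-D data: alternate nearest-mean assignment
    and mean updates, which monotonically decreases the squared-error
    criterion J(U, C) = sum_k sum_{x in C_k} (x - m_k)^2."""
    means = np.array([x.min(), x.max()])     # crude deterministic initialization
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - means[None, :]), axis=1)
        means = np.array([x[labels == j].mean() for j in range(2)])
    J = float(sum(((x[labels == j] - means[j]) ** 2).sum() for j in range(2)))
    return means, J

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0.0, 0.5, 100), rng.normal(10.0, 0.5, 100)])
means, J = kmeans_1d(x)
# The two means settle near the true cluster centres 0 and 10.
```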

1688 | A Global Geometric Framework for Nonlinear Dimensionality - Tenenbaum, Silva, et al. |

1614 | Nonlinear dimensionality reduction by locally linear embedding - Roweis, Saul - 2000 |

1500 | A k-means clustering algorithm
- Hartigan, Wong
Citation Context: ...ructures present in multi-dimensional data sets, resulting in a clustering literature which is huge and varied. There are a number of books available, some dating back to the 1970’s [Anderberg, 1973, **Hartigan, 1975**]. More recent books include [Jain and Dubes, 1988, Theodoridis and Koutroumbas, 1999, Duda et al., 2001]. It is impossible to cover all kinds of clustering algorithms. A taxonomy anno 1999 of the mos... |

1493 | Topographic independent component analysis
- Hyvarinen, Hoyer, et al.
Citation Context: ... entropy in high dimensional data spaces, and only need to estimate entropy for one-dimensional signals. Note that the independent components can only be determined up to a permutation and a scaling [**Hyvärinen et al., 2001**]. Estimation of ICA by minimization of mutual information was probably first proposed by Comon [1994] using an Edgeworth series expansion density estimator. Amari et al. [1996] proposed a related met... |

1357 | Independent component analysis, a new concept
- Comon
- 1994
Citation Context: ...]. Those with unbounded support are the Hermite system (Gram-Charlier, Edgeworth) on R and the Laguerre system on [0, ∞). The Gram-Charlier and Edgeworth expansions are well-known for example in ICA [**Comon, 1994**, Amari et al., 1996]. These differ in the ordering of the terms in the expansion. The Gram-Charlier series uses [Izenman, 1991, Haykin, 1999] φk(x) = (2^k k! π^(1/2))^(−1/2) exp(−x²/2) Hk(x), Hk(x) = ... |

1306 | Self-Organization and Associate Memory - Kohonen - 1984 |

1240 | On information and sufficiency
- Kullback, Leibler
- 1951
Citation Context: ...oposed such a measure, subsequently called the Kullback-Leibler divergence, or the relative entropy. The measure discriminates between two probability density functions f(x) and g(x) and is given by [**Kullback and Leibler, 1951**, Kullback, 1968] DKL{f,g} = ∫ f(x) log(f(x)/g(x)) dx = Ef[log(f(x)/g(x))]. (2.14) The term “distance” is a misnomer because this measure is not a distance metric in the mathematical sense, since it... |
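Equation (2.14) and the non-metric remark can be checked directly on discrete distributions. A hedged illustration (the thesis states the measure for continuous pdfs; the two three-point distributions are arbitrary examples):

```python
import numpy as np

def kl(f, g):
    """Discrete Kullback-Leibler divergence: sum_x f(x) log(f(x)/g(x))."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    return float(np.sum(f * np.log(f / g)))

f = [0.5, 0.4, 0.1]
g = [0.3, 0.3, 0.4]

nonneg = kl(f, g) >= 0 and kl(g, f) >= 0   # Gibbs' inequality: always non-negative
asymmetric = kl(f, g) != kl(g, f)          # not symmetric, hence not a metric
self_zero = kl(f, f) == 0.0                # vanishes when the distributions agree
```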

1231 | Adaptive Filter Theory
- Haykin
- 1996
Citation Context: ...second-order statistics optimality measures resulted in quadratic performance surfaces, for which the analytical expression of the optimal solution could easily be obtained [Widrow and Stearns, 1985, **Haykin, 2002**]. The MSE criterion also carried over to non-linear adaptive systems and ... Figure 1.1: In information theoretic machine learning, the... |

1213 | An algorithm for vector quantizer design
- Linde, Buzo, et al.
- 1980
Citation Context: ...terns. It is however very sensitive to the initialization. The K-means algorithm and the related ISODATA algorithm [Ball and Hall, 1965] have given rise to several extended versions [Anderberg, 1973, **Linde et al., 1980**, Lloyd, 1982, Selim and Ismail, 1984, Diday, 1973, Symon, 1977], many of which try to resolve the initialization problem. A very influential development was the derivation of fuzzy K-means [Bezdek, 1... |

1154 | Information theory and statistics
- Kullback
Citation Context: ...quently called the Kullback-Leibler divergence, or the relative entropy. The measure discriminates between two probability density functions f(x) and g(x) and is given by [Kullback and Leibler, 1951, **Kullback, 1968**] DKL{f,g} = ∫ f(x) log(f(x)/g(x)) dx = Ef[log(f(x)/g(x))]. (2.14) The term “distance” is a misnomer because this measure is not a distance metric in the mathematical sense, since it is not symmetri... |

1097 | On spectral clustering: Analysis and an algorithm
- Ng, Jordan, et al.
- 2001
Citation Context: ...ents for performing tasks such as lower-dimensional embedding of the data on a non-linear manifold [Roweis and Saul, 2000, Tenenbaum et al., 2000, Belkin and Niyogi, 2003, Brand, 2004] or clustering [**Ng et al., 2002**, Weiss, 1999, Dhillon et al., 2004, Ding and He, 2004, Verma and Meila, 2003, Zha et al., 2002]. See also [Bengio et al., 2003, 2004]. Since the eigenvalues of such an affinity matrix are known as th... |

1096 | A density-based algorithm for discovering clusters in large spatial databases with noise - Ester, Kriegel, et al. - 1996 |

1070 | An Information-Maximization Approach to Blind Separation and Blind Deconvolution - Bell, Sejnowski - 1995 |

1048 | Nonlinear component analysis as a kernel eigenvalue problem
- Schölkopf, Smola, et al.
- 1998
Citation Context: ...and Smola, 2002], of which support vector machines [Cortes and Vapnik, 1995, Vapnik, 1995, Cristianini and Shawe-Taylor, 2000, Burges, 1998, Hastie et al., 2004], kernel principal component analysis [**Schölkopf et al., 1998**], and kernel Fisher discriminant analysis [Mika et al., 1999, Roth and Steinhage, 2000] are examples. Research on Mercer kernel-based methods has been dominant in machine learning and pattern reco... |

943 | An Introduction to Support Vector Machines - Cristianini, Shawe-Taylor - 2000 |

879 | Mixture Models
- McLachlan, Basford
- 1988
Citation Context: ...e samples. Since the contribution of mixture density is not known, the maximum likelihood principle cannot be easily employed, and one needs to resort to the expectation-maximization (EM) algorithm [**McLachlan and Peel, 2000**] to solve the problem. Because of the additional flexibility the mixture models add to the parametric models, this method may be regarded as semi-parametric. There are basically two problems associat... |
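The EM algorithm referred to above alternates an E-step (posterior component responsibilities) and an M-step (weighted maximum-likelihood updates). A minimal 1-D two-component Gaussian mixture sketch; the initialization heuristic and toy data are our assumptions, not from the thesis:

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm(x, iters=50):
    """EM for a two-component 1-D Gaussian mixture (minimal sketch)."""
    w = np.array([0.5, 0.5])                     # mixing weights
    mu = np.array([x.min(), x.max()])            # crude mean initialization
    var = np.array([x.var(), x.var()])
    for _ in range(iters):
        # E-step: responsibilities r[i, k] proportional to w_k N(x_i; mu_k, var_k)
        r = w * normal_pdf(x[:, None], mu, var)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood re-estimates
        n = r.sum(axis=0)
        w = n / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n
    return w, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)])
w, mu, var = em_gmm(x)
# The estimated means recover the two generating components near -3 and 3.
```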

842 | Least squares quantization in pcm
- Lloyd
- 1982
Citation Context: ... very sensitive to the initialization. The K-means algorithm and the related ISODATA algorithm [Ball and Hall, 1965] have given rise to several extended versions [Anderberg, 1973, Linde et al., 1980, **Lloyd, 1982**, Selim and Ismail, 1984, Diday, 1973, Symon, 1977], many of which try to resolve the initialization problem. A very influential development was the derivation of fuzzy K-means [Bezdek, 1980, Bezdek e... |

786 | Kernel Methods for Pattern Analysis
- Shawe-Taylor, Cristianini
- 2004
Citation Context: ...he so-called spectral methods. The work conducted in this thesis is also related to these methods. Perhaps the most well-known of these learning schemes is the Mercer kernel-based learning algorithms [**Shawe-Taylor and Cristianini, 2004**, Müller et al., 2001, Perez-Cruz and Bousquet, 2004, Schölkopf and Smola, 2002], of which support vector machines [Cortes and Vapnik, 1995, Vapnik, 1995, Cristianini and Shawe-Taylor, 2000, Burges, 1... |

755 | Alignment by maximization of mutual information
- Viola, Wells
- 1997
Citation Context: ... estimation technique, since this method makes no such assumptions. Viola et al. [1995] had already proposed to approximate Shannon-based measures using sample means, integrated with Parzen windowing [**Viola and Wells, 1997**]. Principe et al. [2000a] went a step further, by introducing a series of information theoretic quantities which can be estimated without the sample mean approximation [Xu, 1999, Principe et al., 200... |

727 | On estimation of a probability density function and mode
- Parzen
- 1962
Citation Context: ... Specifically, Principe et al. [2000a] argued that one should make as few assumptions as possible about the structure of the probability density functions (pdfs) in question. Hence, Parzen windowing [**Parzen, 1962**, Devroye, 1989, Silverman, 1986, Scott, 1992, Wand and Jones, 1995] was proposed as the appropriate density estimation technique, since this method makes no such assumptions. Viola et al. [1995] had ... |
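Parzen windowing, as invoked above, places a window (kernel) on every sample and averages the contributions. A sketch with a Gaussian window; the bandwidth h is a hand-picked assumption rather than a principled choice:

```python
import numpy as np

def parzen_estimate(x0, samples, h):
    """Parzen-window density estimate at point x0: the average of Gaussian
    windows of width h centred on the observed samples."""
    k = np.exp(-(x0 - samples) ** 2 / (2 * h**2)) / (h * np.sqrt(2 * np.pi))
    return float(k.mean())

rng = np.random.default_rng(2)
samples = rng.normal(0, 1, 2000)

# With enough samples the estimate approaches the true standard-normal
# density at 0, which is 1/sqrt(2*pi) ~ 0.3989 (up to smoothing bias).
est = parzen_estimate(0.0, samples, h=0.3)
true = 1 / np.sqrt(2 * np.pi)
```

No distributional assumption enters the estimator itself, which is exactly why the excerpt calls the method assumption-free.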

713 | A measure of asymptotic efficiency of tests of a hypothesis based on the sum of observations - Chernoff - 1952 |

710 | Cluster Analysis for Applications
- Anderberg
- 1973
Citation Context: ...uncovering the structures present in multi-dimensional data sets, resulting in a clustering literature which is huge and varied. There are a number of books available, some dating back to the 1970’s [**Anderberg, 1973**, Hartigan, 1975]. More recent books include [Jain and Dubes, 1988, Theodoridis and Koutroumbas, 1999, Duda et al., 2001]. It is impossible to cover all kinds of clustering algorithms. A taxonomy anno... |

668 | Information theory and statistical mechanics - Jaynes - 1963 |

526 | Theory of Probability
- Jeffreys
- 1961
Citation Context: ...on about the closeness of two probability distributions. Some such measures are [Kazakos and Papantoni-Kazakos, 1990] DJ(f,g) = (DKL{f,g} + DKL{g,f})/2, (2.28) which is called the Jeffreys distance [**Jeffreys, 1948**]. The Jeffreys distance is a symmetric version of the Kullback-Leibler measure. Another measure is due to Chernoff [1952], defined as DC(f,g) = −log ∫ f^(1−t)(x) g^t(x) dx, 0 ≤ t ≤ 1, (2.29) where the mos... |
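Both measures in (2.28)-(2.29) are easy to realize for discrete distributions. A sketch under two assumptions of ours: the thesis states the measures for continuous pdfs, and t = 1/2 below (which makes the Chernoff measure the Bhattacharyya distance) is simply an illustrative choice:

```python
import numpy as np

def kl(f, g):
    """Discrete Kullback-Leibler divergence sum_x f(x) log(f(x)/g(x))."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    return float(np.sum(f * np.log(f / g)))

def jeffreys(f, g):
    """Jeffreys distance (2.28): the symmetrized Kullback-Leibler divergence."""
    return 0.5 * (kl(f, g) + kl(g, f))

def chernoff(f, g, t=0.5):
    """Chernoff distance (2.29), discrete form: -log sum_x f^(1-t)(x) g^t(x)."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    return float(-np.log(np.sum(f ** (1 - t) * g ** t)))

f, g = [0.5, 0.4, 0.1], [0.3, 0.3, 0.4]
# Unlike D_KL itself, the Jeffreys distance is symmetric in its arguments.
```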

499 | Hierarchical grouping to optimize an objective function
- Ward
- 1963
Citation Context: ...ingle cluster and performs a splitting operation. Most hierarchical clustering algorithms are variants of the single-link [Sneath and Sokal, 1973], complete-link [King, 1967] and minimum variance [**Ward, 1963**, Murtagh, 1984] algorithms. These algorithms differ in the way they characterize the similarity between a pair of clusters. For example, the single-link algorithm may handle well-separated, non-isotr... |

491 | Blind beamforming for non-gaussian signals
- Cardoso, Souloumiac
- 1993
Citation Context: ...tion of the covariance matrix. The eigenvectors of the tensor more or less directly give the mixing matrix for whitened data. Papers related to tensorial methods are for example [Cardoso, 1989, 1990, **Cardoso and Souloumiac, 1993**, Comon and Mourrain, 1996]. See for example the book by Hyvärinen et al. [2001] for more on ICA algorithms. ICA has in the recent years been widely applied in diverse areas like biomedical imaging [J... |

486 | Partitioning sparse matrices with eigenvectors of graphs
- Pothen, Simon, et al.
- 1990
Citation Context: ...the last few years, this method has been rediscovered, and a number of related techniques have been published [Ding et al., 2001, Shi and Malik, 2000, Perona and Freeman, 1998, Hagen and Kahng, 1991, **Pothen et al., 1990**, Sarkar and Soundararajan, 2000]. These techniques are all based on variants of the graph cut, a measure of the cost of partitioning a graph into two pieces. But they use different graph matrices, an... |

478 | The ‘Independent Components’ of Natural Scenes are Edge Filters
- Bell, Sejnowski
Citation Context: ...la, 1999]. See e.g. [Cardoso et al., 1999, Pajunen and Karhunen, 2000, Lee et al., 2001] for more references on ICA applications. ICA has also been proposed as a generic statistical model for images [**Bell and Sejnowski, 1997**, Hurri, 1997, Hurri et al., 1997, van Hateren and van der Schaaf, 1998, Hoyer and Hyvärinen, 2000]. In this case an image x is modeled as x = Σ_{i=1}^N si ai, (2.66) where ai, i = 1,...,N, are referred t... |

466 | Mixture Models: Inference and Applications to Clustering - McLachlan, Basford - 1988 |

457 | Pattern Recognition
- Theodoridis, Koutroumbas
- 2003
Citation Context: ...endence, f(X; θ) = ∏_{i=1}^N f(xi; θ), which is known as the likelihood function of θ. The maximum likelihood (ML) method estimates θ such that the likelihood function takes its maximum value, that is [**Theodoridis and Koutroumbas, 1999**] θ̂_ML = arg max_θ ∏_{i=1}^N f(xi; θ). (2.36) In practice, the log-likelihood is used, since it is easier to compute the gradient of this function. In a similar manner, the maximum a posteriori probability (... |
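Equation (2.36) in its log-likelihood form can be maximized numerically. For a unit-variance Gaussian with unknown mean (a hypothetical model chosen just for this sketch, as is the grid search), the maximizer coincides with the sample mean:

```python
import numpy as np

def gaussian_loglik(theta, x):
    """Log-likelihood of an i.i.d. sample under N(theta, 1)."""
    return float(np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2))

rng = np.random.default_rng(3)
x = rng.normal(2.0, 1.0, 500)

# Brute-force grid maximization of the log-likelihood; for this model the
# analytical ML estimate is the sample mean, which the grid search recovers.
grid = np.linspace(0, 4, 4001)
theta_ml = grid[np.argmax([gaussian_loglik(t, x) for t in grid])]
```

Working with the log turns the product in (2.36) into a sum, which is exactly the computational convenience the excerpt mentions.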

434 | The information bottleneck method - Tishby, Pereira, et al. - 1999 |

405 | Algebraic connectivity of graphs - Fiedler - 1973 |

378 | What Is the Goal of Sensory Coding
- Field
- 1994
Citation Context: ...lent under the Gaussian assumption. Therefore entropy was proposed as a learning criterion [Barlow et al., 1989]. This work set the scene for several related feature-learning algorithms [Atick, 1992, **Field, 1994**, Intrator, 1992, Olshausen and Field, 1996]. Recently, many machine learning problems have been encountered which necessitate the use of cost functions which can capture higher order statistical prop... |

375 | An Introduction to Kernel-Based Learning Algorithms - Muller, Mika, et al. - 2001 |

365 | Multivariate Density Estimation
- Scott
- 1992
Citation Context: ...that one should make as few assumptions as possible about the structure of the probability density functions (pdfs) in question. Hence, Parzen windowing [Parzen, 1962, Devroye, 1989, Silverman, 1986, **Scott, 1992**, Wand and Jones, 1995] was proposed as the appropriate density estimation technique, since this method makes no such assumptions. Viola et al. [1995] had already proposed to approximate Shannon-based ... |