## Nonextensive Information Theoretic Kernels on Measures (2009)


Citations: 9 (3 self)

### BibTeX

@MISC{Martins09nonextensiveinformation,
  author = {André F. T. Martins and Noah A. Smith and Eric P. Xing and Pedro M. Q. Aguiar and Mário A. T. Figueiredo},
  title = {Nonextensive Information Theoretic Kernels on Measures},
  year = {2009}
}

### Abstract

Positive definite kernels on probability measures have been recently applied to classification problems involving text, images, and other types of structured data. Some of these kernels are related to classic information theoretic quantities, such as (Shannon’s) mutual information and the Jensen-Shannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive generalizations of Shannon’s information theory. This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. These new divergences result from extending the two building blocks of the classical JS divergence: convexity and Shannon’s entropy. The notion of convexity is extended to the wider concept of q-convexity, for which we prove a Jensen q-inequality. Based on this inequality, we introduce Jensen-Tsallis (JT) q-differences, a nonextensive generalization of the JS divergence, and define a k-th order JT q-difference between stochastic processes. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, JS, and linear kernels as particular cases. Nonextensive string kernels are also defined that generalize the p-spectrum kernel. We illustrate the performance of
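To make the abstract’s central construction concrete, the following sketch computes a Jensen-Tsallis q-difference of two discrete distributions. It assumes the definition T_q^π(p1, p2) = S_q(π1 p1 + π2 p2) − (π1^q S_q(p1) + π2^q S_q(p2)), with S_q the Tsallis entropy; at q = 1 and equal weights this reduces to the JS divergence. Function names are illustrative, not the authors’ code.

```python
import numpy as np

def tsallis_entropy(p, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1); as q -> 1
    it recovers the Shannon entropy -sum_i p_i ln p_i (in nats)."""
    p = np.asarray(p, dtype=float)
    if abs(q - 1.0) < 1e-12:
        nz = p[p > 0]
        return float(-np.sum(nz * np.log(nz)))
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))

def jensen_tsallis_q_difference(p1, p2, q, pi=(0.5, 0.5)):
    """JT q-difference with weights pi; note the entropies of the
    arguments are weighted by pi_i**q (the pseudoadditive analogue
    of the classical Jensen-Shannon divergence)."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    mix = pi[0] * p1 + pi[1] * p2
    return tsallis_entropy(mix, q) - (pi[0] ** q * tsallis_entropy(p1, q)
                                      + pi[1] ** q * tsallis_entropy(p2, q))
```

At q = 1 with pi = (1/2, 1/2), two distributions with disjoint support give the JS divergence’s maximum, ln 2.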

### Citations

8567 | Elements of Information Theory - Cover, Thomas - 1991 |

6050 |
A mathematical theory of communication
- Shannon
- 1948
Citation Context ...strictly positive reals, and ∆^{n−1} ≜ { (x_1, …, x_n) ∈ R^n : ∑_{i=1}^n x_i = 1, x_i ≥ 0 ∀i } denotes the (n − 1)-dimensional simplex. Inspired by the axiomatic formulation of Shannon’s entropy (Khinchin, 1957; Shannon and Weaver, 1949), Suyari (2004) proposed an axiomatic framework for nonextensive entropies and a uniqueness theorem. Let q ≥ 0 be a fixed scalar, called the entropic index. Suyari’s axioms (Appendix A) determine a f... |

2362 | Modern Information Retrieval
- Baeza-Yates, Ribeiro-Neto
- 1999
Citation Context ...tions over words using the bag-of-words model and maximum likelihood estimation; this corresponds to normalizing the term frequencies (tf) using the ℓ1-norm, and is referred to as tf (Joachims, 2002; Baeza-Yates and Ribeiro-Neto, 1999). We also used the tf-idf (term frequency-inverse document frequency) representation, which penalizes terms that occur in many documents (Joachims, 2002; Baeza-Yates and Ribeiro-Neto, 1999). To weigh... |
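The tf and tf-idf representations described in this context can be sketched as follows. This is a minimal illustration, not the paper’s code; it uses the plain idf(w) = log(N / df(w)) convention, one of several in the literature.

```python
import math
from collections import Counter

def tf_l1(tokens):
    """Term frequencies normalized by the l1-norm: a probability
    distribution over the words of one document (bag-of-words MLE)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def tf_idf(tokens, corpus):
    """Reweight tf by idf(w) = log(N / df(w)), penalizing words that
    occur in many documents, then renormalize to unit l1-norm so the
    result can again be treated as a measure on words."""
    n = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    weights = {w: f * math.log(n / df[w]) for w, f in tf_l1(tokens).items()}
    z = sum(weights.values())
    return {w: v / z for w, v in weights.items()} if z > 0 else weights
```

Note that a word occurring in every document gets idf 0 and drops out of the normalized representation entirely.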

2030 | Learning with Kernels
- Scholkopf, Smola
- 2002
Citation Context ...©2009 André F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar and Mário A. T. Figueiredo. 1. Introduction In kernel-based machine learning (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004), there has been recent interest in defining kernels on probability distributions to tackle several problems involving structured data (Desobry et al., 2007; Moren... |

1240 |
On information and sufficiency
- Kullback, Leibler
- 1951
Citation Context ...exity and Jensen’s inequality are key concepts underlying several central results of information theory, for example, the non-negativity of the Kullback-Leibler (KL) divergence (or relative entropy) (Kullback and Leibler, 1951). Jensen’s inequality (Jensen, 1906) also underlies the Jensen-Shannon (JS) divergence, a symmetrized and smoothed version of the KL divergence (Lin and Wong, 1990; Lin, 1991), often used in statisti... |

899 |
Algorithms on Strings, Trees, and Sequences
- Gusfield
- 1997
Citation Context ...e computed in O(|s|+|t|) time (i.e., with cost that is linear in the length of the strings), as shown by Vishwanathan and Smola (2003), by using data structures such as suffix trees or suffix arrays (Gusfield, 1997). Moreover, with s fixed, any kernel k(s,t) may be computed in time O(|t|), which is particularly useful for classification applications. We will now see how Jensen-Tsallis kernels may be used as str... |
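A naive version of the p-spectrum kernel generalized in the paper (the inner product between counts of all length-p substrings) can be written directly; the suffix-tree machinery cited above matters only for making this computation linear-time. A sketch:

```python
from collections import Counter

def p_spectrum_kernel(s, t, p):
    """k_p(s, t) = sum over length-p strings u of count_u(s) * count_u(t).
    This hash-based version runs in O((|s| + |t|) * p); suffix trees or
    suffix arrays (Gusfield, 1997) remove the factor of p."""
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(c * ct[u] for u, c in cs.items())
```

For example, k_2("abab", "ab") = 2, since "ab" occurs twice in the first string and once in the second.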

787 |
Kernel Methods for Pattern Analysis
- Shawe-Taylor, Cristianini
- 2004
Citation Context ...Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar and Mário A. T. Figueiredo. 1. Introduction In kernel-based machine learning (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004), there has been recent interest in defining kernels on probability distributions to tackle several problems involving structured data (Desobry et al., 2007; Moreno et al., 2004; Jebara et al., 2004;... |

389 |
Learning to Classify Text using Support Vector Machines
- Joachims
- 2002
Citation Context ...that map data to statistical manifolds, equipped with well-motivated non-Euclidean metrics (Lafferty and Lebanon, 2005), often outperform support vector machine (SVM) classifiers with linear kernels (Joachims, 2002). Some of these kernels have a natural information theoretic interpretation, establishing a bridge between kernel methods and information theory (Cuturi et al., 2005; Hein and Bousquet, 2005). The ma... |

385 | Text classification using string kernels - Lodhi, Cristianini, Watkins, et al. - 2001 |

368 | Convolution Kernels on Discrete Structures
- Haussler
- 1999
Citation Context ...them as “string kernels,” they are more generally kernels between stochastic processes. Several string kernels (i.e., kernels operating on the space of strings) have been proposed in the literature (Haussler, 1999; Lodhi et al., 2002; Leslie et al., 2002; Vishwanathan and Smola, 2003; Shawe-Taylor and Cristianini, 2004; Cortes et al., 2004). These are kernels defined on A∗ × A∗, where A∗ is the Kleene closu... |

335 |
On Measures of Entropy and Information
- Rényi
- 1961
Citation Context ...oducing a new class of kernels rooted in nonextensive information theory, which contains previous information theoretic kernels as particular elements. The Shannon and Rényi entropies (Shannon, 1948; Rényi, 1961) share the extensivity property: the joint entropy of a pair of independent random variables equals the sum of the individual entropies. Abandoning this property yields the so-called nonextensive ent... |

282 |
Possible generalization of Boltzmann-Gibbs statistics
- Tsallis
- 1988
Citation Context ...andom variables equals the sum of the individual entropies. Abandoning this property yields the so-called nonextensive entropies (Havrda and Charvát, 1967; Lindhard, 1974; Lindhard and Nielsen, 1971; Tsallis, 1988), which have raised great interest among physicists in modeling phenomena such as long-range interactions and multifractals, and in constructing nonextensive generalizations of Boltzmann-Gibbs statis... |

275 |
Methods of Information Geometry
- Amari, Nagaoka
- 2000
Citation Context ...IC KERNELS ON MEASURES Appendix E. The Heat Kernel Approximation The diffusion kernel for statistical manifolds, recently proposed by Lafferty and Lebanon (2005), is grounded in information geometry (Amari and Nagaoka, 2001). It models the diffusion of “information” over a statistical manifold according to the heat equation. Since in the case of the multinomial manifold (the relative interior of ∆^n), the diffusion kern... |

258 |
I-divergence geometry of probability distributions and minimization problems. The Annals of Probability
- Csiszár
- 1975
Citation Context ...→ R is defined as ϕ_H(y) = −k y ln y, so that H(f) = ∫ ϕ_H ◦ f, (8) and, as usual, 0 ln 0 ≜ 0. The generalized form of the KL divergence, often called generalized I-divergence (Csiszár, 1975), is a directed divergence between two measures µ_f, µ_g ∈ M^H_+(X), such that µ_f is µ_g-absolutely continuous (denoted µ_f ≪ µ_g). Let f and g be the densities associated with µ_f and µ_g, respectively. In ... |

147 |
Mathematical Foundations of Information Theory
- Khinchin
- 1957
Citation Context ...+ denotes the strictly positive reals, and ∆^{n−1} ≜ { (x_1, …, x_n) ∈ R^n : ∑_{i=1}^n x_i = 1, x_i ≥ 0 ∀i } denotes the (n − 1)-dimensional simplex. Inspired by the axiomatic formulation of Shannon’s entropy (Khinchin, 1957; Shannon and Weaver, 1949), Suyari (2004) proposed an axiomatic framework for nonextensive entropies and a uniqueness theorem. Let q ≥ 0 be a fixed scalar, called the entropic index. Suyari’s axioms ... |

133 |
Harmonic Analysis on Semigroups
- Berg, Christensen, et al.
- 1984
Citation Context ...Lemma 22 If f : X → R satisfies f ≥ 0, then, for α ∈ [1,2], the function ψ_α(x,y) = −(f(x) + f(y))^α is a negative definite (nd) kernel. The following definition (Berg et al., 1984) has been used in a machine learning context by Cuturi and Vert (2005). Definition 23 Let (X,⊕) be a semigroup. A function ϕ : X → R is called pd (in the semigroup sense) if k : X × X → R, defined as ... |

104 | Probability product kernels - Jebara, Kondor, et al. |

92 | A kullback-leibler divergence based kernel for svm classification in multimedia applications
- Moreno, Ho, et al.
- 2004
Citation Context ...2002; Shawe-Taylor and Cristianini, 2004), there has been recent interest in defining kernels on probability distributions to tackle several problems involving structured data (Desobry et al., 2007; Moreno et al., 2004; Jebara et al., 2004; Hein and Bousquet, 2005; Lafferty and Lebanon, 2005; Cuturi et al., 2005). By defining a parametric family S containing the distributions from which the data points (in the inpu... |

86 | Diffusion kernels on statistical manifolds
- Lafferty, Lebanon
- 2005
Citation Context ...nterest in defining kernels on probability distributions to tackle several problems involving structured data (Desobry et al., 2007; Moreno et al., 2004; Jebara et al., 2004; Hein and Bousquet, 2005; Lafferty and Lebanon, 2005; Cuturi et al., 2005). By defining a parametric family S containing the distributions from which the data points (in the input space X) are assumed to have been generated, and defining a map from X f... |

84 | Quantification method of classification process, concept of structural α-entropy, Kybernetika 3
- Havrda, Charvát
- 1967
Citation Context ...the extensivity property: the joint entropy of a pair of independent random variables equals the sum of the individual entropies. Abandoning this property yields the so-called nonextensive entropies (Havrda and Charvát, 1967; Lindhard, 1974; Lindhard and Nielsen, 1971; Tsallis, 1988), which have raised great interest among physicists in modeling phenomena such as long-range interactions and multifractals, and in construc... |

82 | Fast Kernels for String and Tree Matching - Vishwanathan, Smola - 2004
- Vishwanathan, Smola
Citation Context ...tic processes. Several string kernels (i.e., kernels operating on the space of strings) have been proposed in the literature (Haussler, 1999; Lodhi et al., 2002; Leslie et al., 2002; Vishwanathan and Smola, 2003; Shawe-Taylor and Cristianini, 2004; Cortes et al., 2004). These are kernels defined on A∗ × A∗, where A∗ is the Kleene closure of a finite alphabet A (i.e., the set of all finite strings formed b... |

67 |
Nonextensive entropy: interdisciplinary applications
- Gell-Mann, Tsallis
- 2004
Citation Context ...ng nonextensive generalizations of Boltzmann-Gibbs statistical mechanics (Abe, 2006). Nonextensive entropies have also been recently used in signal/image processing (Li et al., 2006) and other areas (Gell-Mann and Tsallis, 2004). The so-called Tsallis entropies (Havrda and Charvát, 1967; Tsallis, 1988) form a parametric family of nonextensive entropies that includes the Shannon-Boltzmann-Gibbs entropy as a particular case. ... |

62 |
Sur les fonctions convexes et les inégalités entre les valeurs moyennes
- Jensen
- 1906
Citation Context ...erlying several central results of information theory, for example, the non-negativity of the Kullback-Leibler (KL) divergence (or relative entropy) (Kullback and Leibler, 1951). Jensen’s inequality (Jensen, 1906) also underlies the Jensen-Shannon (JS) divergence, a symmetrized and smoothed version of the KL divergence (Lin and Wong, 1990; Lin, 1991), often used in statistics, machine learning, signal/image p... |

50 | Hilbertian metrics and positive definite kernels on probability measures - Hein, Bousquet - 2005 |

45 | A new metric for probability distributions - Endres, Schindelin - 2003 |

39 | Analysis of symbolic sequences using the JensenShannon divergence - Grosse, Bernaola-Galvan, et al. |

38 | Rational kernels: Theory and algorithms - Cortes, Haffner, et al. |

38 |
Generalized information functions
- Daróczy
- 1970
Citation Context ...H(p_1, …, p_n) = −k ∑_{i=1}^n p_i ln p_i, and pseudoadditivity turns into additivity, that is, H(A ⊗ B) = H(A) + H(B) holds. Several proposals for φ have appeared in the literature (Havrda and Charvát, 1967; Daróczy, 1970; Tsallis, 1988). In this article, unless stated otherwise, we set φ(q) = q − 1, which yields the Tsallis entropy: S_q(p_1, …, p_n) = (k / (q − 1)) (1 − ∑_{i=1}^n p_i^q). To simplify, we let k = 1 and write the Tsallis entropy as... |
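The pseudoadditivity mentioned in this context can be checked numerically: for independent A and B (and k = 1), the Tsallis entropy satisfies S_q(A ⊗ B) = S_q(A) + S_q(B) + (1 − q) S_q(A) S_q(B), with the extra term vanishing at q = 1. A small sketch (distributions chosen arbitrarily):

```python
import numpy as np

def tsallis(p, q):
    """Tsallis entropy with k = 1: S_q(p) = (1 - sum_i p_i^q) / (q - 1)."""
    p = np.asarray(p, dtype=float)
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

pa = np.array([0.2, 0.8])
pb = np.array([0.5, 0.3, 0.2])
q = 2.0
joint = np.outer(pa, pb).ravel()   # joint distribution of independent (A, B)
lhs = tsallis(joint, q)
rhs = tsallis(pa, q) + tsallis(pb, q) + (1 - q) * tsallis(pa, q) * tsallis(pb, q)
# pseudoadditivity: lhs equals rhs; the cross term makes the entropy nonextensive
```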

32 | Agnostic classification of markovian sequences - El-Yaniv, Fine, et al. - 1997 |

22 |
Semigroup kernels on measures
- Cuturi, Fukumizu, et al.
- 2005
Citation Context ...on probability distributions to tackle several problems involving structured data (Desobry et al., 2007; Moreno et al., 2004; Jebara et al., 2004; Hein and Bousquet, 2005; Lafferty and Lebanon, 2005; Cuturi et al., 2005). By defining a parametric family S containing the distributions from which the data points (in the input space X) are assumed to have been generated, and defining a map from X to S (e.g., via maxi... |

18 | Generalization of Shannon-Khinchin axioms to nonextensive systems and the uniqueness theorem for the nonextensive entropy - Suyari - 2004 |

13 |
A new directed divergence measure and its characterization
- Lin, Wong
- 1990
Citation Context ...ence (or relative entropy) (Kullback and Leibler, 1951). Jensen’s inequality (Jensen, 1906) also underlies the Jensen-Shannon (JS) divergence, a symmetrized and smoothed version of the KL divergence (Lin and Wong, 1990; Lin, 1991), often used in statistics, machine learning, signal/image processing, and physics. In this paper, we introduce new extensions of JS-type divergences by generalizing its two pillars: conve... |

13 |
The Cauchy-Schwarz Master Class
- Steele
- 2006
Citation Context ...-convex. Since ϕ''_q(x) = q x^{q−2}, ϕ_q is convex for x ≥ 0 and q ≥ 0. To show the (2−q)-convexity of −1/ϕ''_q(x) = −(1/q) x^{2−q}, for x_i ≥ 0 and q ∈ [0,1], we use a version of the power mean inequality (Steele, 2006): −(∑_{i=1}^l λ_i x_i)^{2−q} ≤ −∑_{i=1}^l (λ_i x_i)^{2−q} = −∑_{i=1}^l λ_i^{2−q} x_i^{2−q}, thus concluding that −1/ϕ''_q is in fact (2 − q)-convex. A consequence... |

12 |
Semigroup kernels on finite sets
- Cuturi, Vert
- 2004
Citation Context ...tional G : M_+(X) → R, let the set M^G_+(X) ≜ { f ∈ M_+(X) : |G(f)| < ∞} be its effective domain, and M^{1,G}_+(X) ≜ M^G_+(X) ∩ M^1_+(X) be its subdomain of probability measures. The following functional (Cuturi and Vert, 2005) extends the Shannon-Boltzmann-Gibbs entropy from M^{1,H}_+(X) to the unnormalized measures in M^H_+(X): H(f) = −k ∫ f ln f, where k > 0 is a constant, the function ϕ_H : R_+ → R is defined as ϕ_H(y... |

10 |
Foundations of nonextensive statistical mechanics
- Abe
- 2006
Citation Context ...sed great interest among physicists in modeling phenomena such as long-range interactions and multifractals, and in constructing nonextensive generalizations of Boltzmann-Gibbs statistical mechanics (Abe, 2006). Nonextensive entropies have also been recently used in signal/image processing (Li et al., 2006) and other areas (Gell-Mann and Tsallis, 2004). The so-called Tsallis entropies (Havrda and Charvát, ... |

9 | Spirals in Hilbert space: with an application in information theory - Fuglede |

8 | Image registration and segmentation by maximizing the Jensen-Rényi divergence
- Hamza, Krim
- 2003
Citation Context ...JR_q(p_1, p_2) = R_q((p_1 + p_2)/2) − (R_q(p_1) + R_q(p_2))/2. The JR divergence has been used in several signal/image processing applications, such as registration, segmentation, denoising, and classification (Ben-Hamza and Krim, 2003; He et al., 2003; Karakos et al., 2007). In Section 6, we show that the JR divergence is (like the JS divergence) a Hilbertian metric, which is relevant for its use in kernel-based machine learning. ... |
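Using the Rényi entropy R_q(p) = (1/(1−q)) log ∑_i p_i^q, the JR divergence quoted in this context is straightforward to evaluate. A sketch for q ∈ (0,1) ∪ (1,∞); function names are illustrative:

```python
import numpy as np

def renyi_entropy(p, q):
    """Rényi entropy R_q(p) = log(sum_i p_i^q) / (1 - q), for q != 1;
    as q -> 1 it tends to the Shannon entropy."""
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** q)) / (1.0 - q)

def jensen_renyi(p1, p2, q):
    """JR_q(p1, p2) = R_q((p1 + p2)/2) - (R_q(p1) + R_q(p2)) / 2."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return renyi_entropy((p1 + p2) / 2.0, q) \
        - (renyi_entropy(p1, q) + renyi_entropy(p2, q)) / 2.0
```

As with the JS divergence, identical arguments give 0, and distributions with disjoint support give log 2.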

7 | Information theoretical properties of Tsallis entropies - Furuichi - 2006 |

7 | Hilbertian metrics on probability measures and their application in svms
- Hein, Lal, et al.
- 2004
Citation Context ...sures, with arbitrary mass. This is relevant, for example, in applications of kernels on empirical measures (e.g., word counts, pixel intensity histograms); instead of the usual step of normalization (Hein et al., 2004), we may leave these empirical measures unnormalized, thus allowing objects of different size (e.g., total number of words in a document, total number of image pixels) to be weighted differently. Anot... |

6 | Iterative denoising using Jensen-Rényi divergences with an application to unsupervised document categorization
- Karakos, Khudanpur, et al.
- 2007
Citation Context ...The JR divergence has been used in several signal/image processing applications, such as registration, segmentation, denoising, and classification (Ben-Hamza and Krim, 2003; He et al., 2003; Karakos et al., 2007). In Section 6, we show that the JR divergence is (like the JS divergence) a Hilbertian metric, which is relevant for its use in kernel-based machine learning. 3.4 The Jensen-Tsallis Divergence Burbe... |

6 |
Image segmentation based on Tsallis-entropy and Renyi-entropy and their comparison
- Li, Fan, et al.
Citation Context ...d multifractals, and in constructing nonextensive generalizations of Boltzmann-Gibbs statistical mechanics (Abe, 2006). Nonextensive entropies have also been recently used in signal/image processing (Li et al., 2006) and other areas (Gell-Mann and Tsallis, 2004). The so-called Tsallis entropies (Havrda and Charvát, 1967; Tsallis, 1988) form a parametric family of nonextensive entropies that includes the Shannon-... |

5 | Non-logarithmic Jensen-Shannon divergence - Lamberti, Majtey - 2003 |

5 | Nonextensive entropic kernels - Martins, Figueiredo, et al. - 2008 |

3 | Density kernels on unordered sets for kernel-based signal processing
- Desobry, Davy, et al.
- 2007
Citation Context ...(Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004), there has been recent interest in defining kernels on probability distributions to tackle several problems involving structured data (Desobry et al., 2007; Moreno et al., 2004; Jebara et al., 2004; Hein and Bousquet, 2005; Lafferty and Lebanon, 2005; Cuturi et al., 2005). By defining a parametric family S containing the distributions from which the dat... |

2 |
A nonextensive information-theoretic measure for image edge detection
- Ben-Hamza
- 2006
Citation Context ...there is no counterpart of the Equality (18). When X and T are finite, J^π_{S_q} in (20) is called the Jensen-Tsallis (JT) divergence and it has also been applied in image processing (Ben-Hamza, 2006). Unlike the JS divergence, the JT divergence lacks an interpretation as a mutual information. Despite this, for q ∈ [1,2], it exhibits joint convexity (Burbea and Rao, 1982). In the next section, we... |

2 | Tsallis kernels on measures - Martins, Aguiar, et al. |

1 | Clustering with Bregman divergences - Banerjee, Merugu, Dhillon, Ghosh - 2005 |

1 | Divergence measures based on the Shannon entropy - Lin - 1991 |

1 |
On the Theory of Measurement and
- Lindhard
- 1974
Citation Context ...the joint entropy of a pair of independent random variables equals the sum of the individual entropies. Abandoning this property yields the so-called nonextensive entropies (Havrda and Charvát, 1967; Lindhard, 1974; Lindhard and Nielsen, 1971; Tsallis, 1988), which have raised great interest among physicists in modeling phenomena such as long-range interactions and multifractals, and in constructing nonextensiv... |