Results 1–10 of 16
Divergence measures based on the Shannon entropy
 IEEE Transactions on Information Theory, 1991
Abstract

Cited by 451 (0 self)
A new class of information-theoretic divergence measures based on the Shannon entropy is introduced. Unlike the well-known Kullback divergences, the new measures do not require the condition of absolute continuity to be satisfied by the probability distributions involved. More importantly, their close relationship with the variational distance and the probability of misclassification error is established in terms of bounds. These bounds are crucial in many applications of divergence measures. The new measures are also well characterized by the properties of nonnegativity, finiteness, semiboundedness, and boundedness.
Index Terms—Divergence, dissimilarity measure, discrimination information, entropy, probability of error bounds.
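The divergence class described here is commonly associated with the Jensen-Shannon divergence, which stays finite even when the two distributions do not share support. A minimal sketch for discrete distributions (function names are illustrative, not from the paper):

```python
import math

def shannon_entropy(p):
    # H(p) = -sum_i p_i * log(p_i), with the convention 0 * log 0 = 0
    return -sum(x * math.log(x) for x in p if x > 0)

def jensen_shannon(p, q):
    # JS(p, q) = H((p + q) / 2) - (H(p) + H(q)) / 2
    # Unlike the Kullback divergence, this needs no absolute continuity.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return shannon_entropy(m) - (shannon_entropy(p) + shannon_entropy(q)) / 2

# Disjoint supports: the Kullback divergence would be infinite,
# but JS stays bounded by log 2.
print(jensen_shannon([1.0, 0.0], [0.0, 1.0]))  # log 2 ≈ 0.6931
```

The bound by log 2 illustrates the "boundedness" property listed in the abstract.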
Learning from examples with Information Theoretic Criteria
 Journal of VLSI Systems, Kluwer, 1999
Abstract

Cited by 22 (7 self)
This paper discusses a framework for learning based on information-theoretic criteria. A novel algorithm based on Rényi's quadratic entropy is used to train, directly from a data set, linear or nonlinear mappers for entropy maximization or minimization. We provide an intriguing analogy between the computation and an information potential measuring the interactions among the data samples. We also propose two approximations to the Kullback-Leibler divergence based on quadratic distances (the Cauchy-Schwarz inequality and the Euclidean distance). These distances can still be computed using the information potential. We test the newly proposed distances in blind source separation (unsupervised learning) and in feature extraction for classification (supervised learning). In blind source separation our algorithm is capable of separating instantaneously mixed sources, and for classification the performance of our classifier is comparable to that of support vector machines (SVMs).
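The "information potential" mentioned here is the pairwise sum of Gaussian kernel interactions between samples; Rényi's quadratic entropy is then its negative log. A minimal one-dimensional sketch, assuming a fixed kernel width (parameter names are illustrative):

```python
import math

def information_potential(samples, sigma=1.0):
    # V = (1/N^2) * sum_{i,j} G(x_i - x_j; 2*sigma^2)
    # Each pair of samples contributes a Gaussian "interaction".
    n = len(samples)
    var = 2.0 * sigma * sigma                      # kernel variance 2*sigma^2
    norm = 1.0 / math.sqrt(2.0 * math.pi * var)
    total = 0.0
    for xi in samples:
        for xj in samples:
            total += norm * math.exp(-(xi - xj) ** 2 / (2.0 * var))
    return total / (n * n)

def renyi_quadratic_entropy(samples, sigma=1.0):
    # H2 = -log V: maximizing entropy is minimizing the potential.
    return -math.log(information_potential(samples, sigma))

# Spread-out samples have higher quadratic entropy than clustered ones.
print(renyi_quadratic_entropy([0.0, 5.0, 10.0]))
print(renyi_quadratic_entropy([0.0, 0.1, 0.2]))
```

Clustered samples produce large kernel interactions, hence a large potential and a small entropy, matching the physical analogy in the abstract.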
On the Relationship between Classification Error Bounds and Training Criteria in Statistical Pattern Recognition
, 2003
Abstract

Cited by 6 (5 self)
We present two novel bounds for the classification error that, at the same time, can be used as practical training criteria. Unlike the bounds reported in the literature so far, these novel bounds are based on a strict distinction between the true but unknown distribution and the model distribution, which is used in the decision rule. The two bounds we derive are the squared distance and the Kullback-Leibler distance, where in both cases the distance is computed between the true distribution and the model distribution. In terms of practical training criteria, these bounds result in the squared error criterion and the mutual information (or equivocation) criterion, respectively.
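The two distances underlying these bounds are straightforward to compute for discrete distributions. A minimal sketch, with hypothetical posteriors standing in for the true and model distributions:

```python
import math

def squared_distance(p, q):
    # sum_i (p_i - q_i)^2 between true distribution p and model q
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kl_distance(p, q):
    # D(p || q) = sum_i p_i * log(p_i / q_i); needs q_i > 0 wherever p_i > 0
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

p_true = [0.7, 0.2, 0.1]   # hypothetical "true" class posterior
p_model = [0.6, 0.3, 0.1]  # hypothetical model posterior
print(squared_distance(p_true, p_model))
print(kl_distance(p_true, p_model))
```

Both quantities vanish exactly when the model matches the true distribution, which is what makes them usable as training criteria.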
Mixing and non-mixing local minima of the entropy contrast for blind source separation
 IEEE Transactions on Information Theory, 2007
Abstract

Cited by 4 (0 self)
In this paper, both non-mixing and mixing local minima of the entropy are analyzed from the viewpoint of blind source separation (BSS); they correspond respectively to acceptable and spurious solutions of the BSS problem. The contribution of this work is twofold. First, a Taylor expansion is used to show that the exact output entropy cost function has a non-mixing minimum when this output is proportional to any of the non-Gaussian sources, and not only when the output is proportional to the lowest entropic source. Second, in order to prove that mixing entropy minima exist when the source densities are strongly multimodal, an entropy approximator is proposed. The latter has the major advantage that an error bound can be provided. Even if this approximator (and the associated bound) is used here in the BSS context, it can be applied for estimating the entropy of any random variable with a multimodal density.
Index Terms—Blind source separation, independent component analysis, entropy estimation, multimodal densities, mixture distributions.
USING EXPONENTIAL MIXTURE MODELS FOR SUBOPTIMAL DISTRIBUTED DATA FUSION
Abstract

Cited by 1 (0 self)
In this paper we investigate the use of Exponential Mixture Densities (EMDs) as suboptimal update rules for distributed data fusion. We show that EMDs have a pointwise bound "from below" on the minimum value of the probability distribution. However, the distributions are not bounded from above and thus can be interpreted as a fusion operation.
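An exponential mixture density interpolates two distributions geometrically, f(x) ∝ p(x)^w · q(x)^(1−w). A minimal sketch for discrete distributions (this is an illustrative implementation, not the paper's):

```python
def emd_fuse(p, q, w):
    # Exponential mixture: f_i proportional to p_i**w * q_i**(1-w), 0 <= w <= 1.
    # w = 1 recovers p, w = 0 recovers q; intermediate w blends the two.
    unnorm = [(a ** w) * (b ** (1 - w)) for a, b in zip(p, q)]
    z = sum(unnorm)            # normalizing constant
    return [u / z for u in unnorm]

p = [0.8, 0.2]
q = [0.3, 0.7]
print(emd_fuse(p, q, 0.5))     # geometric "middle ground" of p and q
```

Because the unnormalized values are geometric means, each fused probability lies at or above the smaller of the two inputs at that point, which is the pointwise lower-bound behavior the abstract refers to.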
A Discriminative Splitting Criterion for Phonetic Decision Trees
Abstract
Phonetic decision trees are a key concept in acoustic modeling for large vocabulary continuous speech recognition. Although discriminative training has become a major line of research in speech recognition and all state-of-the-art acoustic models are trained discriminatively, the conventional phonetic decision tree approach still relies on the maximum likelihood principle. In this paper we develop a splitting criterion based on the minimization of the classification error. An improvement of more than 10% relative over a discriminatively trained baseline system on the Wall Street Journal corpus suggests that the proposed approach is promising.
Index Terms: discriminative training, phonetic decision trees, state tying, new paradigms
Can Entropy Characterize Performance of Online Algorithms?
Abstract
We focus in this work on an aspect of online computation that is not addressed by standard competitive analysis: identifying request sequences for which nontrivial online algorithms are useful versus request sequences for which all algorithms perform equally badly. The motivation for this work is advanced system and architecture designs which allow the operating system to dynamically allocate resources to online protocols such as prefetching and caching. To utilize these features the operating system needs to identify data streams that can benefit from more resources. Our approach in this work is based on the relation between entropy, compression, and gambling, extensively studied in information theory. It has been shown that in some settings entropy can either fully or at least partially characterize the expected outcome of an iterative gambling game. Viewing an online problem with stochastic input as an iterative gambling game, our goal is to study the extent to which the entropy of the input characterizes the expected performance of online algorithms for problems that arise in computer applications. We study bounds based on entropy for three online problems: list accessing, prefetching, and caching. We show that entropy is a good performance characterizer for prefetching, but not such a good characterizer for online caching. Our work raises several open questions about using entropy as a predictor in online computation.
The Minimum Information Principle for Discriminative Learning
Abstract
Exponential models of distributions are widely used in machine learning for classification and modelling. It is well known that they can be interpreted as maximum entropy models under empirical expectation constraints. In this work, we argue that for classification tasks, mutual information is a more suitable information-theoretic measure to be optimized. We show how the principle of minimum mutual information generalizes that of maximum entropy, and provides a comprehensive framework for building discriminative classifiers. A game-theoretic interpretation of our approach is then given, and several generalization bounds provided. We present iterative algorithms for solving the minimum information problem and its convex dual, and demonstrate their performance on various classification tasks. The results show that minimum information classifiers outperform the corresponding maximum entropy models.
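The mutual information being optimized here is a standard quantity over a joint distribution of feature and label. A minimal sketch for a discrete joint table (illustrative, not the paper's algorithm):

```python
import math

def mutual_information(joint):
    # I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) * p(y)) )
    px = [sum(row) for row in joint]            # marginal over rows
    py = [sum(col) for col in zip(*joint)]      # marginal over columns
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

# Independent variables carry no mutual information...
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
# ...while a perfectly dependent pair attains I = log 2.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # log 2 ≈ 0.6931
```

Minimizing this quantity subject to constraints, rather than maximizing entropy, is the shift in principle the abstract argues for.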
Sequential Feature Extraction Using InformationTheoretic Learning
Abstract
A classification system typically includes both a feature extractor and a classifier. The two components can be trained either sequentially or simultaneously. The former option has an implementation advantage, since the extractor is trained independently of the classifier, but it is hindered by the suboptimality of feature selection. Simultaneous training has the advantage of minimizing classification error, but it has implementation difficulties. Certain criteria, such as Minimum Classification Error, are better suited for simultaneous training, while other criteria, such as Mutual Information, are amenable to training the extractor either sequentially or simultaneously. Herein, an information-theoretic criterion is introduced and is evaluated for sequential training, in order to ascertain its ability to find relevant features for classification. The proposed method uses nonparametric estimation of Rényi's entropy to train the extractor by maximizing an approximation of the mutual information between the class labels and the output of the extractor. The proposed method is compared against seven other feature reduction methods and, when combined with a simple classifier, against the Support Vector Machine and Optimal Hyperplane. Interestingly, the evaluations show that the proposed method, when used in a sequential manner, performs at least as well as the best simultaneous feature reduction methods.
Index Terms—Feature extraction, information theory, classification, nonparametric statistics.