## On Feature Extraction By Mutual Information Maximization (2002)

Venue: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Citations: 11 (3 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Torkkola02onfeature,
  author    = {Kari Torkkola},
  title     = {On Feature Extraction By Mutual Information Maximization},
  booktitle = {Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing},
  year      = {2002},
  pages     = {821--824}
}
```

### Abstract

We discuss mutual information between class labels and transformed features as a criterion for learning discriminative feature transforms. Instead of Shannon's definition we use measures based on Renyi entropy, which lends itself to an efficient implementation and to an interpretation in terms of "information potentials" and "information forces" induced by samples of data. This paper presents two routes towards practical usability of the method, aimed especially at large databases: the first is an on-line stochastic gradient algorithm, and the second is based on approximating class densities in the output space by Gaussian mixture models.
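The quadratic Renyi entropy of the transformed features has a closed form under a Parzen density estimate with Gaussian kernels: the "information potential" is a double sum of pairwise kernel evaluations, and the entropy is its negative log. The sketch below illustrates that computation; the function names and the kernel-width parameter `sigma` are illustrative assumptions, not the paper's code:

```python
import numpy as np

def information_potential(y, sigma=1.0):
    """Parzen-based information potential V(y) for quadratic Renyi entropy.

    With Gaussian kernels of width sigma, the integral of the squared
    Parzen estimate reduces to a pairwise kernel sum:
        V = (1/N^2) * sum_{i,j} G(y_i - y_j; 2*sigma^2 * I)
    """
    d = y.shape[1]
    diff = y[:, None, :] - y[None, :, :]           # (N, N, d) pairwise differences
    sq = np.sum(diff**2, axis=-1)                  # squared pairwise distances
    norm = (2.0 * np.pi * 2.0 * sigma**2) ** (-d / 2.0)
    return norm * np.exp(-sq / (4.0 * sigma**2)).mean()

def renyi_quadratic_entropy(y, sigma=1.0):
    """H2(y) = -log V(y): larger when samples are more spread out."""
    return -np.log(information_potential(y, sigma))
```

Because the criterion is a smooth function of pairwise distances, its gradient with respect to a transform's parameters is available analytically, which is what makes the stochastic-gradient route in the paper feasible.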

### Citations

362 | Towards optimal feature selection
- Koller, Sahami
- 1996
Citation Context ... of available features. Greedy algorithms based on sequential feature selection using any filter criterion are suboptimal, as they fail to find a feature set that would jointly maximize the criterion [1]. For this very reason, finding a transformation to lower dimensions might be easier than selecting features, given an appropriate differentiable objective function. We discuss mutual information betw...

87 | Experiments with random projection
- Dasgupta
- 2000
Citation Context ... Running the EM-algorithm in the input space is now unnecessary since we know which samples belong to which mixture components. A similar strategy has been used to learn GMMs in high-dimensional spaces [6]. Writing the density of class p as a GMM with $K_p$ mixture components and $h_{pj}$ as their mixture weights, we have in the input space $p(x \mid c_p) = \sum_{j=1}^{K_p} h_{pj}\, G(x - \mu_{pj}, \Sigma_{pj})$ (14). As a result, we have GMMs in...
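The class-conditional density in eq. (14) is a plain Gaussian mixture evaluation. A minimal sketch, assuming full covariances and a hypothetical function name (not the paper's code):

```python
import numpy as np

def gmm_density(x, weights, means, covs):
    """Evaluate p(x | c_p) = sum_j h_pj * G(x - mu_pj, Sigma_pj), as in eq. (14).

    weights: mixture weights h_pj (sum to 1)
    means:   component means mu_pj, each shape (d,)
    covs:    component covariances Sigma_pj, each shape (d, d)
    """
    d = x.shape[0]
    total = 0.0
    for h, mu, cov in zip(weights, means, covs):
        diff = x - mu
        # Gaussian normalization constant and Mahalanobis quadratic form
        norm = ((2.0 * np.pi) ** d * np.linalg.det(cov)) ** -0.5
        total += h * norm * np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))
    return total
```

With the mixture components fixed per class, the density and its gradient stay cheap to evaluate in the low-dimensional output space, which is the point of the GMM route for large databases.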

56 | Information theoretic learning
- Principe, Xu, et al.
- 2000
Citation Context ...ition we use measures based on Renyi entropy, which lends itself into an efficient implementation and an interpretation of "information potentials" and "information forces" induced by samples of data [2, 3, 4, 5]. This paper is structured as follows. An introduction is given to the maximum mutual information (MMI) formulation for discriminative feature transforms using Renyi entropy and Parzen density estimat...

49 | Mutual information in learning feature transformation
- Torkkola, Campbell
- 2000
Citation Context ...ition we use measures based on Renyi entropy, which lends itself into an efficient implementation and an interpretation of "information potentials" and "information forces" induced by samples of data [2, 3, 4, 5]. This paper is structured as follows. An introduction is given to the maximum mutual information (MMI) formulation for discriminative feature transforms using Renyi entropy and Parzen density estimat...

41 | A methodology for information theoretic feature extraction
- Fisher, Principe
- 1998
Citation Context ...ition we use measures based on Renyi entropy, which lends itself into an efficient implementation and an interpretation of "information potentials" and "information forces" induced by samples of data [2, 3, 4, 5]. This paper is structured as follows. An introduction is given to the maximum mutual information (MMI) formulation for discriminative feature transforms using Renyi entropy and Parzen density estimat...

39 | Feature extraction using non-linear transformation for robust speech recognition on the Aurora database
- Sharma, Ellis, et al.
- 2000
Citation Context ...ral and temporal domain) but by using phones or other phonetic subunits as the classes instead of states, and by using a lot of other speech material for training besides the AURORA training database [10]. This approach has produced significant improvements in the task. 6. DISCUSSION In static pattern recognition tasks MMI transforms are clearly useful, as they appear to be able to extract more discri...

35 | Enabling New Speech Driven Services for Mobile Devices: An overview of the ETSI standards activities for Distributed Speech Recognition Front-ends
- Pearce
- 2000
Citation Context ...hould be advantageous regarding recognition accuracies since HMMs have the same exact structure. We attempted to test this hypothesis in noise robust connected digit recognition with the AURORA2 task [7]. Baseline HMMs were trained using the HTK and AURORA scripts using MFCCs with delta and acceleration coefficients (12 cepstral coefficients plus energy, 39 coefficients altogether). The multiconditio...

12 | Acoustic Front End Optimization for Large Vocabulary Speech Recognition
- Welling, Haberland, et al.
- 1997
Citation Context ...oblem [8]. As far as the second conclusion, previous work on LDA in state-discriminatory transforms shows that such transforms may produce very task-specific improvements, and not generalize too well [9]. Also, some recent work on the AURORA task has made use of LDA to learn discriminative transforms (both in the spectral and temporal domain) but by using phones or other phonetic subunits as the clas...

10 | Nonlinear feature transforms using maximum mutual information
- Torkkola
- 2001

2 | ESE2 special sessions on noise robust recognition
- Pearce, Ed
- 2001
Citation Context ...ata. In fact, none of the papers so far published on the task that have improved over the MFCCs did attempt to make use of only the AURORA training data to learn the noise-robust facets of the problem [8]. As far as the second conclusion, previous work on LDA in state-discriminatory transforms shows that such transforms may produce very task-specific improvements, and not generalize too well [9]. Also...