## Discriminative Feature Selection via Multiclass Variable Memory Markov Model (2002)

Venue: EURASIP Journal on Applied Signal Processing (JASP), Special issue on Unstructured Information Management from Multimedia Data Sources

Citations: 10 (1 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Slonim02discriminativefeature,
  author    = {Noam Slonim and Gill Bejerano and Shai Fine and Naftali Tishby},
  title     = {Discriminative Feature Selection via Multiclass Variable Memory Markov Model},
  booktitle = {EURASIP Journal on Applied Signal Processing (JASP), Special issue on Unstructured Information Management from Multimedia Data Sources},
  year      = {2002},
  pages     = {578--585}
}
```

### Abstract

We propose a novel feature selection method based on a Variable Memory Markov model (VMM). The VMM was originally proposed as a generative model trying to preserve the original source statistics from training data.

### Citations

8609 | Elements of information theory
- Cover, Thomas
- 1991

Citation Context: ... 'distance' (measured by relative entropy) between the distributions embodied by the original and the pruned models. By relating relative entropy to ... [footnote 1: The upper bound is due to Fano's inequality (cf. [8]), and the lower bound can be found, e.g., in [12].] [footnote 2: The backoff recursive rule represents n-gram conditional probabilities P(w_n | w_{n-1} ... w_1) using (n-1)-gram conditional probabilities multiplied by ...]

857 | An empirical study of smoothing techniques for language modeling
- Chen, Goodman
- 1999

Citation Context: ... w_{n-1} ... w_1) using (n-1)-gram conditional probabilities multiplied by a backoff weight α(w_{n-1} ... w_1) associated with the full history, i.e. P(w_n | w_{n-1} ... w_1) = α(w_{n-1} ... w_1) P(w_n | w_{n-1} ... w_2), cf. [9]. ... the relative change in training set perplexity [footnote 3], a simple pruning criterion is devised, which removes from the model all n-grams that change perplexity by less than a threshold. Stolcke shows [17] ...
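The backoff rule quoted in this context can be sketched in code. The following is a minimal Katz-style illustration under stated assumptions, not the cited papers' implementation: the corpus, the absolute-discount value, and all function names are hypothetical, and α(prev) is chosen so each conditional distribution still sums to one.

```python
from collections import Counter

# Toy corpus; all counts here are hypothetical, for illustration only.
corpus = "the cat sat on the mat the cat ate".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def p_unigram(w):
    return unigrams[w] / total

def p_backoff(w, prev, discount=0.5):
    """P(w | prev): discounted bigram estimate when the bigram was seen;
    otherwise back off to alpha(prev) * P(w), with alpha absorbing the
    probability mass freed by discounting."""
    seen = {w2 for (w1, w2) in bigrams if w1 == prev}
    if w in seen:
        return (bigrams[(prev, w)] - discount) / unigrams[prev]
    freed = discount * len(seen) / unigrams[prev]  # mass freed by discounting
    alpha = freed / (1 - sum(p_unigram(x) for x in seen))
    return alpha * p_unigram(w)
```

Pruning in the Stolcke sense then amounts to dropping explicit bigram entries whose removal changes training-set perplexity by less than a threshold, letting the backoff term cover them instead.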

554 | Inducing features of random fields
- Pietra, Pietra, et al.
- 1997

Citation Context: ...s, and their direct usage of the mutual information measure (cf. [4]). Out of numerous feature selection techniques found in the literature, we would like to point out the work of Della Pietra et al. [10], who devised a feature selection (or rather, induction) mechanism to build n-grams of varying lengths, and McCallum's "U-Tree" [14], which builds PSTs based on the ability to predict the future discou...

504 | Inductive learning algorithms and representations for text categorization
- Dumais, Platt, et al.
- 1998

Citation Context: ...VMM are 95% and 87%, respectively. This implies a breakeven performance of at least 87% (probably higher). We therefore compared these results with the breakeven performance reported by Dumais et al. [11] for the same task. In that work the authors compared five different classification algorithms: FindSim (a variant of Rocchio's method), Naive Bayes, Bayes nets, Decision Trees and SVM. The (weighted) ave...

279 | Reinforcement Learning with Selective Perception and Hidden State
- McCallum
- 1995

Citation Context: ...terature, we would like to point out the work of Della Pietra et al. [10], who devised a feature selection (or rather, induction) mechanism to build n-grams of varying lengths, and McCallum's "U-Tree" [14], which builds PSTs based on the ability to predict the future discounted reward in the context of reinforcement learning. Another popular approach in language modeling is the use of pruning as a mean...

212 | Learning with Many Irrelevant Features
- Almuallim, Dietterich
- 1991

Citation Context: ...ossible to significantly reduce model dimensions without impeding the performance of the learning algorithm. In some cases one may even gain in generalization power by filtering irrelevant features (cf. [1]). In this work we present a novel method for feature selection based on a Variable Memory Markov (VMM) model [16]. For a large variety of sequential data, statistical correlations decrease rapidly wi...

198 | Using Mutual Information for Selecting Features in Supervised Neural Net Learning
- Battiti
- 1994

Citation Context: ...rr : (1) A number of methods have been proposed, differing essentially by their method of approximating the joint and marginal distributions, and their direct usage of the mutual information measure (cf. [4]). Out of numerous feature selection techniques found in the literature, we would like to point out the work of Della Pietra et al. [10], who devised a feature selection (or rather, induction) mechanis...
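The MI-based selection idea discussed in this context can be sketched as a minimal example: rank candidate features by their empirical mutual information with the class label. The dataset and names below are hypothetical, and the estimator is the plain plug-in estimate; the cited works differ precisely in how the joint and marginal distributions are approximated.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in empirical MI I(X;Y) in bits between two discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Hypothetical dataset: feature A tracks the class label, feature B is noise.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
feat_a = [0, 0, 0, 1, 1, 1, 1, 1]   # informative
feat_b = [0, 1, 0, 1, 0, 1, 0, 1]   # independent of the label

# Rank features by MI with the label, highest first.
ranked = sorted([("A", feat_a), ("B", feat_b)],
                key=lambda f: mutual_information(f[1], labels),
                reverse=True)
```

With the class prior fixed, maximizing I(feature; class) is the same as minimizing the conditional class entropy, which is the link to Fano-style bounds mentioned in the footnotes.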

172 | The power of amnesia: learning probabilistic automata with variable memory length
- Ron, Singer, et al.
- 1996

Citation Context: ...me cases one may even gain in generalization power by filtering irrelevant features (cf. [1]). In this work we present a novel method for feature selection based on a Variable Memory Markov (VMM) model [16]. For a large variety of sequential data, statistical correlations decrease rapidly with the distance between symbols in the sequence. In particular, consider the conditional (empirical) probability d...

165 | Maximum mutual information estimation of hidden Markov model parameters for speech recognition
- Bahl, Brown, et al.
- 1986

Citation Context: ...es. A related usage of MI for stochastic modeling is the Maximum Mutual Information (MMI) approach for multi-class model training. This is a discriminative training approach attributed to Bahl et al. [3], designed to directly approximate the posterior probability distribution, in contrast to the indirect approach, via Bayes' formula, of maximum likelihood (ML) training. The MMI method was applied suc...

150 | On the mean accuracy of statistical pattern recognizers
- Hughes
- 1968

Citation Context: ...very large, requires impractically large training sets. Indeed, increasing the number of features while keeping the number of samples fixed can actually lead to a decrease in the accuracy of the classifier [14, 6]. In this work we present a novel method for feature selection based on a Variable Memory Markov (VMM) model [20]. For a large variety of sequential data, statistical correlations decrease rapidly w...

150 | Document clustering using word clusters via the information bottleneck method - Slonim, Tishby

92 | A study on thresholding strategies for text categorization
- Yang
- 2001

Citation Context: ...tests, the DVMM results are consistent... [footnote 12: Available at http://www.research.att.com/lewis.] [footnote 13: The F1 measure is the harmonic average of the standard recall and precision measures: F1 = 2pr/(p+r) (see, e.g., [19]). It is easy to verify that for a uni-labeled dataset and a uni-labeled classification scheme, the micro-averaged precision and recall are equivalent, and hence equal to the F1 measure.] Therefore, for...
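The F1 footnote quoted in this context is easy to verify numerically. A small sketch (hypothetical labels and predictions) computes F1 = 2pr/(p+r) and checks the stated equivalence of micro-averaged precision and recall for uni-labeled data:

```python
def f1(p, r):
    """Harmonic average of precision and recall: F1 = 2pr / (p + r)."""
    return 2 * p * r / (p + r)

def micro_precision_recall(truth, preds):
    """Micro-averaged precision/recall for uni-labeled data.

    Every example carries exactly one true label and one prediction, so
    total predictions == total true instances == len(truth)."""
    tp = sum(t == y for t, y in zip(truth, preds))
    fp = len(preds) - tp   # every wrong prediction is a false positive...
    fn = len(truth) - tp   # ...and also a false negative for the true class
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical uni-labeled ground truth and predictions over 3 classes.
truth = ["a", "a", "b", "b", "c", "c"]
preds = ["a", "b", "b", "b", "c", "a"]
p, r = micro_precision_recall(truth, preds)
```

Since p and r coincide here, F1 equals them both, which is why the quoted excerpt can report a single micro-averaged figure.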

72 | Large scale discriminative training for speech recognition
- Woodland, Povey
- 2000

Citation Context: ...distribution, in contrast to the indirect approach, via Bayes' formula, of maximum likelihood (ML) training. The MMI method was applied successfully to HMM training in speech applications (see e.g., [18]). However, MMI training is significantly more expensive than ML training. Unlike ML training, in this approach all models affect the training of every single model through the denominator. In fact this...

49 | Mutual information in learning feature transformation
- Torkkola, Campbell
- 2000

Citation Context: ...ound can be found, e.g., in [13]. ... is the fact that evaluating the MI measure involves integrating over a dense set, which leads to a computational overload. To circumvent that, Torkkola and Campbell [25] have recently suggested performing feature transformation (rather than feature selection) to a lower dimension space in which the training and analysis of the data is more feasible. Their method is d...

41 | High-performance connected digit recognition using maximum mutual information estimation
- Normandin, Cardin, et al.
- 1994

Citation Context: ...distribution, in contrast to the indirect approach, via Bayes' formula, of maximum likelihood (ML) training. The MMI method was applied successfully to HMM training in speech applications (see e.g., [18, 26]). However, MMI training is significantly more expensive than ML training. Unlike ML training, in this approach all models affect the training of every single model through the denominator. In fact this...

23 | Decision tree design from a communication theory standpoint
- Goodman, Smyth
- 1988

Citation Context: ...en the distributions embodied by the original and the pruned models. By relating relative entropy to ... [footnote 1: The upper bound is due to Fano's inequality (cf. [8]), and the lower bound can be found, e.g., in [12].] [footnote 2: The backoff recursive rule represents n-gram conditional probabilities P(w_n | w_{n-1} ... w_1) using (n-1)-gram conditional probabilities multiplied by a backoff weight α(w_{n-1} ... w_1) associated with ...]

19 | Feature Selection Based on Joint Mutual Information
- Yang, Moody
- 1999

Citation Context: ...ce then, a number of methods have been proposed, differing essentially by their method of approximating the joint and marginal distributions, and their direct usage of the mutual information measure (cf. [5, 4, 28]). One of the difficulties in applying MI based feature selection methods [footnote 1: The upper bound is due to Fano's inequality (cf. [10]), and the lower bound can be found, e.g., in [13].] is the fact that eva...

7 | A mutual information measure for feature selection with application to pulse classification
- Barrows, Sciortino
- 1996

Citation Context: ...ce then, a number of methods have been proposed, differing essentially by their method of approximating the joint and marginal distributions, and their direct usage of the mutual information measure (cf. [5, 4, 28]). One of the difficulties in applying MI based feature selection methods [footnote 1: The upper bound is due to Fano's inequality (cf. [10]), and the lower bound can be found, e.g., in [13].] is the fact that eva...

6 | NewsWeeder: Learning to filter netnews - Lang - 1995

6 | The power of word clusters for text classification - Slonim, Tishby - 2001

5 | Variations on probabilistic suffix trees: statistical modeling and prediction of protein families
- Bejerano, Yona
- 2001

Citation Context: ...iclass categorization tasks would build a separate VMM for each class, based solely on its own data, and would classify a new example to the model with the highest score (a one-vs.-all approach, e.g. [5]). Motivated by a generative goal, this approach disregards the possible (dis)similarities between the different categories. Each model aims at best approximating its assigned source. However, in a dis...

4 | Scalable backoff language models - Seymore, Rosenfeld - 1996

3 | PRINTS and PRINTS-S shed light on protein ancestry
- Attwood, Blythe, et al.
- 2002

Citation Context: ...to the classification of proteins into families; however, most of these methods agree on a wide subset of the known protein world. We have chosen to compare our results to those of the PRINTS database [2], as its approach resembles ours. This database is a collection of protein family fingerprints. Each family is matched with a fingerprint of one or more short subsequences which have been iteratively refine...

3 | Entropy-based Pruning of Backoff Language Models - Stolcke - 1998

2 | Markovian domain statistical segmentation of protein sequences
- Bejerano, Seldin, et al.
- 2001

Citation Context: ...is missing. It may even be possible to extend the theoretical results presented in [16], in the context of discriminative VMM models. [footnote 14: For a related approach to discrimination, using competitive learning of generative PSTs, see [6].] Acknowledgments: Useful discussion with Y. Bilu, R. Bachrach, E. Schneidman and E. Shamir is greatly appreciated. The authors would also like to thank A. Stolcke for hel...

2 | Beyond Word N-grams
- Pereira, Singer, et al.

Citation Context: ...h will probably call for sophisticated smoothing as well) is left for future re... [footnote 11: The alphabet size (and node out-degree) is in general not finite in this case. However, previous work by Pereira et al. [15] suggests practical solutions to this situation.]

| feat. | P̂(·\|s, c) | P̂(s\|c) | seq-corr. | sng-corr. |
|---|---|---|---|---|
| c3: AAGV\|E | 0.74 | 0.0037 | 77% | 52% |
| c2: AAGV\|Q | 0.38 | 0.0020 | 31% | 0% |
| c2: YIAD\|C | 0.49 | 0.0016 | 34% | 31% |
| c5: ... | | | | |

2 |
Lewis.The Characteristic Selection Problem in Recognition Systems
- M
- 1962
(Show Context)
Citation Context ...for feature selection is well known in machine learning realm, though it is usually suggested in the context of \static" rather than stochastic modeling. The original idea may be traced back to L=-=ewis [16]-=-. It is motivated by the fact that when the a-priori class uncertainty is given, maximizing the mutual information is equivalent to the minimization of the conditional entropy. This in turn links mutu... |