## Maximum Entropy Model Parameterization with Tf*Idf Weighted Vector Space Model

Venue: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding. 2007: Kyoto

Citations: 1 (1 self)

### BibTeX

@INPROCEEDINGS{Wang_maximumentropy,

author = {Ye-yi Wang and Alex Acero},

title = {Maximum Entropy Model Parameterization with Tf*Idf Weighted Vector Space Model},

booktitle = {Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding. 2007: Kyoto},

year = {2007}

}

### Abstract

Maximum entropy (MaxEnt) models have been used in many spoken language tasks. The training of a MaxEnt model often involves an iterative procedure that starts from an initial parameterization and gradually updates it towards the optimum. Because the objective function is convex (and hence has a global optimum on a training set), little attention has been paid to model initialization in MaxEnt training. However, MaxEnt model training often ends early, before convergence to the global optimum, and prior distributions with hyper-parameters are often added to the objective function to prevent over-fitting. This paper shows that the initialization and the regularization hyper-parameter setting may significantly affect test set accuracy. It investigates MaxEnt initialization/regularization based on an n-gram classifier and on a TF*IDF weighted vector space model. The theoretically motivated TF*IDF initialization/regularization achieves significant improvements over the baseline flat initialization/regularization, especially when training data are sparse. In contrast, the n-gram based initialization/regularization does not exhibit significant improvements.

Index Terms: Maximum entropy model, TF*IDF, vector space model, n-gram classification model, model initialization, model regularization.
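As a rough illustration of the idea in the abstract, the following Python sketch computes TF*IDF weights with each class treated as one document, and uses them as the mean of a Gaussian prior on the MaxEnt weights (the same values could also serve as the initial weights). The function names, the `scale` factor, the plain `tf * log(N / df)` weighting, and the Gaussian form of the regularizer are assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter


def tfidf_weights(class_docs, scale=1.0):
    """Return {(class, term): scale * tf * idf}.

    class_docs maps a class label to the list of tokens seen with that class
    (hypothetical input format). Each class is treated as one "document".
    """
    n_classes = len(class_docs)
    # Document frequency: number of classes in which a term occurs.
    df = Counter()
    for tokens in class_docs.values():
        df.update(set(tokens))
    weights = {}
    for label, tokens in class_docs.items():
        for term, count in Counter(tokens).items():
            idf = math.log(n_classes / df[term])
            weights[(label, term)] = scale * count * idf
    return weights


def regularized_objective(log_likelihood, lam, prior_mean, sigma2):
    """Log-likelihood minus a Gaussian-prior penalty centered at prior_mean.

    Assumed form: sum_k (lam_k - mean_k)^2 / (2 * sigma2).
    Weights with no entry in prior_mean are pulled towards zero.
    """
    penalty = sum((value - prior_mean.get(key, 0.0)) ** 2
                  for key, value in lam.items()) / (2.0 * sigma2)
    return log_likelihood - penalty
```

With `prior_mean` left empty, the penalty reduces to a zero-mean prior, which is one possible reading of the flat baseline mentioned in the abstract.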

### Citations

1083 | A maximum entropy approach to natural language processing
- Berger, Pietra, et al.
- 1996
Citation Context: ...bution that satisfy Eq. (1) have the following exponential (log-linear) form, and the parameterization that maximizes the entropy maximizes the conditional probability of a training set of C and Q pairs [9]:

$$P(C \mid Q) = \frac{1}{Z_\lambda(Q)} \exp\Big(\sum_{f_i \in F} \lambda_i f_i(C, Q)\Big) \qquad (2)$$

where $Z_\lambda(Q) = \sum_{C} \exp\big(\sum_{f_i \in F} \lambda_i f_i(C, Q)\big)$ is a normalization constant and the $\lambda_i$'s are the parameters of the model, also known as the weigh...

882 | A language modeling approach to information retrieval
- Ponte, Croft
- 1998
Citation Context: ...$P(Q \mid C) = \prod_i \big(\delta P(Q_i \mid C) + (1 - \delta) P(Q_i \mid Q_{i-1}, C)\big)$ (6). The n-gram classification model is also used for information retrieval when each document in a document collection is treated as a class c [10]. 2.3. TF*IDF Weighted Vector Space Model. The TF*IDF weighted vector space model is widely used in information retrieval (IR). It represents a query (document) with a vector q (d). The relevance (or s...

431 | Generalized iterative scaling for log-linear models
- Darroch, Ratcliff
- 1972
Citation Context: ...ken language technology, including language modeling [1], call-routing [2], and confidence measures [3, 4]. The training algorithms for a MaxEnt model, for example, generalized iterative scaling [5] or stochastic gradient ascent [6], involve an iterative procedure that starts from an initial parameterization and gradually updates it towards the optimum. The MaxEnt models have a convex objective...

371 | Stochastic Approximation Algorithms and Applications
- Kushner, Yin
- 1997
Citation Context: ... language modeling [1], call-routing [2], and confidence measures [3, 4]. The training algorithms for a MaxEnt model, for example, generalized iterative scaling [5] or stochastic gradient ascent [6], involve an iterative procedure that starts from an initial parameterization and gradually updates it towards the optimum. The MaxEnt models have a convex objective function. Hence they converge to ...

242 | A maximum entropy approach to adaptive statistical language modeling
- Rosenfeld
- 1996
Citation Context: ...del, model initialization, model regularization. 1. INTRODUCTION Maximum entropy (MaxEnt) models have been used in various tasks related to spoken language technology, including language modeling [1], call-routing [2], and confidence measures [3, 4]. The training algorithms for a MaxEnt model, for example, generalized iterative scaling [5] or stochastic gradient ascent [6], involve an itera...

96 | Evaluation of spoken language systems: The ATIS domain
- Price
- 1990
Citation Context: ...rs) model instead of treating them equally. 4. EXPERIMENTAL RESULTS 4.1. Experimental Settings We conducted experiments with two different data sets, the Air Travel Information System data (ATIS) [11] in the public domain and a Microsoft internal product review sentiment classification data set. ATIS was originally not a classification task. We followed the practice in [12] to use the data for cal...

85 | Understanding inverse document frequency: on theoretical arguments
- Robertson
Citation Context: ...y robust. It provides a weight sharing mechanism for linear classifiers. Researchers have been searching for the theoretical justification for this weighting scheme, originally proposed as a heuristic [7]. Recent work has revealed its relation with a relaxed and much simplified MaxEnt model [8]. Preliminary experiments show that properly scaled TF*IDF initialization/regularization has significantly i...

12 | Speech utterance classification
- Chelba, Mahajan, et al.
- 2003
Citation Context: ...tion System data (ATIS) [11] in the public domain and a Microsoft internal product review sentiment classification data set. ATIS was originally not a classification task. We followed the practice in [12] to use the data for call-routing experiments by assigning the main database table name in the manually created SQL query (available in the NIST ATIS data set) for an utterance as its classification d...

11 | Maximum entropy confidence estimation for speech recognition
- White, Droppo, et al.
Citation Context: ... 1. INTRODUCTION Maximum entropy (MaxEnt) models have been used in various tasks related to spoken language technology, including language modeling [1], call-routing [2], and confidence measures [3, 4]. The training algorithms for a MaxEnt model, for example, generalized iterative scaling [5] or stochastic gradient ascent [6], involve an iterative procedure that starts from an initial paramet...

3 | Confidence measures for voice search applications
- Wang, Yu, et al.
- 2007
Citation Context: ... 1. INTRODUCTION Maximum entropy (MaxEnt) models have been used in various tasks related to spoken language technology, including language modeling [1], call-routing [2], and confidence measures [3, 4]. The training algorithms for a MaxEnt model, for example, generalized iterative scaling [5] or stochastic gradient ascent [6], involve an iterative procedure that starts from an initial paramet...

2 | Speech Utterance Classification Model Training Without Manual Transcriptions
- Wang, Lee, et al.
- 2006
Citation Context: ...ization, model regularization. 1. INTRODUCTION Maximum entropy (MaxEnt) models have been used in various tasks related to spoken language technology, including language modeling [1], call-routing [2], and confidence measures [3, 4]. The training algorithms for a MaxEnt model, for example, generalized iterative scaling [5] or stochastic gradient ascent [6], involve an iterative procedure tha...

1 | Why inverse document frequency? In Proceedings of the North American Chapter of the Association for Computational Linguistics
- Papineni
- 2001
Citation Context: ...een searching for the theoretical justification for this weighting scheme, originally proposed as a heuristic [7]. Recent work has revealed its relation with a relaxed and much simplified MaxEnt model [8]. Preliminary experiments show that properly scaled TF*IDF initialization/regularization has significantly improved the classification accuracy in different tasks, while the n-gram initialization/reg...
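
The two model forms quoted in the citation contexts above, the exponential MaxEnt posterior of Eq. (2) and the interpolated n-gram class likelihood of Eq. (6), can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the dictionary-based feature representation, the `"<s>"` start symbol, and the name `delta` for the interpolation weight (whose symbol is not legible in the excerpt) are all hypothetical and not taken from the paper.

```python
import math


def unigram_features(label, query):
    """Hypothetical feature function f(C, Q): count of each (class, token) pair."""
    feats = {}
    for token in query:
        feats[(label, token)] = feats.get((label, token), 0.0) + 1.0
    return feats


def maxent_posterior(lam, classes, query, features=unigram_features):
    """P(C | Q) = exp(sum_i lam_i * f_i(C, Q)) / Z(Q), as in Eq. (2)."""
    scores = {}
    for label in classes:
        scores[label] = sum(lam.get(name, 0.0) * value
                            for name, value in features(label, query).items())
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    z = sum(exps.values())
    return {label: value / z for label, value in exps.items()}


def interpolated_ngram_likelihood(query, unigram, bigram, delta):
    """P(Q | C) for one class: product over tokens of
    delta * P(Q_i | C) + (1 - delta) * P(Q_i | Q_{i-1}, C), as in Eq. (6).

    unigram maps token -> probability; bigram maps (previous, token) -> probability.
    Unseen events get probability 0.0 here; a real system would smooth them.
    """
    prob = 1.0
    previous = "<s>"  # assumed sentence-start symbol
    for token in query:
        prob *= (delta * unigram.get(token, 0.0)
                 + (1.0 - delta) * bigram.get((previous, token), 0.0))
        previous = token
    return prob
```

With all weights at zero, `maxent_posterior` returns a uniform distribution over the classes, which corresponds to a flat starting point for training.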