Performance Prediction for Exponential Language Models
SVM HeaderParse 0.2
AUTHOR NAME
Stanley F. Chen
AUTHOR AFFIL
IBM T.J. Watson Research Center
AUTHOR ADDR
P.O. Box 218, Yorktown Heights, NY 10598
ABSTRACT
We investigate the task of performance prediction for language models belonging to the exponential family. First, we attempt to empirically discover a formula for predicting test set cross-entropy for n-gram language models. We build models over varying domains, data set sizes, and n-gram orders, and perform linear regression to see whether we can model test set performance as a simple function of training set performance and various model statistics. Remarkably, we find a simple relationship that predicts test set performance with a correlation of 0.9997. We analyze why this relationship holds and show that it holds for other exponential language models as well, including class-based models and minimum discrimination information models. Finally, we discuss how this relationship can be applied to improve language model performance.
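The regression setup described in the abstract can be illustrated with a minimal sketch. It is not the paper's experiment: the data are synthetic, the single `model_stat` feature is a hypothetical stand-in for the "various model statistics", and the slope of 0.5 used to generate the data is arbitrary. The sketch only shows the mechanics of fitting test set cross-entropy as a linear function of training set cross-entropy plus a model statistic, and then measuring the correlation between predicted and actual values.

```python
import numpy as np


def fit_linear_predictor(features, targets):
    """Ordinary least squares with an intercept term.

    Returns coefficients ordered as [intercept, w_1, ..., w_k].
    """
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coef


def predict(coef, features):
    """Apply fitted coefficients to a feature matrix."""
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef


def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic illustration data (assumption: NOT taken from the paper).
rng = np.random.default_rng(0)
n_models = 50
train_xent = rng.uniform(2.0, 6.0, n_models)   # training cross-entropy
model_stat = rng.uniform(0.0, 2.0, n_models)   # hypothetical model statistic
noise = rng.normal(0.0, 0.02, n_models)
test_xent = train_xent + 0.5 * model_stat + noise  # arbitrary generating rule

features = np.column_stack([train_xent, model_stat])
coef = fit_linear_predictor(features, test_xent)
correlation = pearson(predict(coef, features), test_xent)

print("intercept, w_train, w_stat:", coef)
print("correlation between predicted and actual:", correlation)
```

With real models, each row would come from one trained language model, and the fitted relationship would need to be checked on held-out models rather than on the rows used for fitting.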