Efficient Sampling and Feature Selection in Whole Sentence Maximum Entropy Language Models

by Stanley F. Chen, Ronald Rosenfeld

Results 1 - 10 of 16

Maximum entropy markov models for information extraction and segmentation

by Andrew McCallum, Dayne Freitag, 2000
Cited by 355 (17 self)
Hidden Markov models (HMMs) are a powerful probabilistic tool for modeling sequential data, and have been applied with success to many text-related tasks, such as part-of-speech tagging, text segmentation and information extraction. In these cases, the observations are usually modeled as multinomial distributions over a discrete vocabulary, and the HMM parameters are set to maximize the likelihood of the observations. This paper presents a new Markovian sequence model, closely related to HMMs, that allows observations to be represented as arbitrary overlapping features (such as word, capitalization, formatting, part-of-speech), and defines the conditional probability of state sequences given observation sequences. It does this by using the maximum entropy framework to fit a set of exponential models that represent the probability of a state given an observation and the previous state. We present positive experimental results on the segmentation of FAQs.
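As a rough illustration of "the probability of a state given an observation and the previous state", the sketch below normalizes exponentiated feature weights over candidate next states, with one weight table per previous state. The function name, the weight layout, and the feature and state names are all hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical layout: weights[prev_state][(feature_name, next_state)] -> lambda
Weights = Dict[str, Dict[Tuple[str, str], float]]


def next_state_distribution(
    prev_state: str,
    active_features: List[str],
    states: List[str],
    weights: Weights,
) -> Dict[str, float]:
    """Return P(next_state | prev_state, observation) as an exponential model.

    The observation is represented by the names of its active binary
    features. Missing weights are treated as 0.0 (an assumption of this sketch).
    """
    table = weights.get(prev_state, {})
    scores = [
        sum(table.get((feat, s), 0.0) for feat in active_features)
        for s in states
    ]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    z = sum(exps)  # normalizer over candidate next states
    return {s: e / z for s, e in zip(states, exps)}


if __name__ == "__main__":
    demo_weights: Weights = {
        "head": {("ends_with_question_mark", "question"): 2.0},
    }
    print(next_state_distribution(
        "head", ["ends_with_question_mark"], ["question", "answer"], demo_weights
    ))
```

Keeping a separate weight table per previous state is one simple way to condition on the previous state; other parameterizations are possible.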

Two decades of statistical language modeling: Where do we go from here

by Ronald Rosenfeld - Proceedings of the IEEE, 2000
Cited by 119 (1 self)
Statistical Language Models estimate the distribution of various natural language phenomena for the purpose of speech recognition and other language technologies. Since the first significant model was proposed in 1980, many attempts have been made to improve the state of the art. We review them here, point to a few promising directions, and argue for a Bayesian approach to integration of linguistic theories with data. 1. OUTLINE Statistical language modeling (SLM) is the attempt to capture regularities of natural language for the purpose of improving the performance of various natural language applications. By and large, statistical language modeling amounts to estimating the probability distribution of various linguistic units, such as words, sentences, and whole documents. Statistical language modeling is crucial for a large variety of language technology applications. These include speech recognition (where SLM got its start), machine translation, document classification and routing, optical character recognition, information retrieval, handwriting recognition, spelling correction, and many more. In machine translation, for example, purely statistical approaches have been introduced in [1]. But even researchers using rule-based approaches have found it beneficial to introduce some elements of SLM and statistical estimation [2]. In information retrieval, a language modeling approach was recently proposed by [3], and a statistical/information theoretical approach was developed by [4]. SLM employs statistical estimation techniques using language training data, that is, text. Because of the categorical nature of language, and the large vocabularies people naturally use, statistical techniques must estimate a large number of parameters, and consequently depend critically on the availability of large amounts of training data.

A Bit of Progress in Language Modeling

by Joshua T. Goodman, 2001
Cited by 70 (1 self)
Language modeling is the art of determining the probability of a sequence of words. This is useful in a large variety of areas including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction (Church, 1988; Brown et al., 1990; Hull, 1992; Kernighan et al., 1990; Srihari and Baltus, 1992). The most commonly used language models are very simple (e.g. a Katz-smoothed trigram model). There are many improvements over this simple model however, including caching, clustering, higher-order n-grams, skipping models, and sentence-mixture models, all of which we will describe below. Unfortunately, these more complicated techniques have rarely been examined in combination. It is entirely possible that two techniques that work well separately will not work well together, and, as we will show, even possible that some techniques will work better together than either one does by itself. In this...

Whole-Sentence Exponential Language Models: A Vehicle for Linguistic-Statistical Integration

by Ronald Rosenfeld, Stanley F. Chen, Xiaojin Zhu - Computer Speech and Language, 2001
Cited by 12 (0 self)
We introduce an exponential language model which models a whole sentence or utterance as a single unit. By avoiding the chain rule, the model treats each sentence as a "bag of features", where features are arbitrary computable properties of the sentence. The new model is computationally more efficient, and more naturally suited to modeling global sentential phenomena, than the conditional exponential (e.g. Maximum Entropy) models proposed to date. Using the model is straightforward. Training the model requires sampling from an exponential distribution. We describe the challenge of applying Monte Carlo Markov Chain (MCMC) and other sampling techniques to natural language, and discuss smoothing and step-size selection. We then present a novel procedure for feature selection, which exploits discrepancies between the existing model and the training corpus. We demonstrate our ideas by constructing and analyzing competitive models in the Switchboard domain, incorporating lexical and syntact...
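A minimal sketch of the "bag of features" idea, assuming the usual exponential form in which a sentence's unnormalized score is a baseline probability times exp of a weighted feature sum. The baseline hook, the two example features, and all names below are hypothetical illustrations; the normalizer is deliberately left out, since the abstract notes that training requires sampling.

```python
from typing import Callable, Optional, Sequence

Sentence = Sequence[str]
Feature = Callable[[Sentence], float]


def unnormalized_log_score(
    sentence: Sentence,
    features: Sequence[Feature],
    weights: Sequence[float],
    baseline_logprob: Optional[Callable[[Sentence], float]] = None,
) -> float:
    """Return log p0(s) + sum_i lambda_i * f_i(s), omitting the normalizer.

    Each feature sees the whole sentence at once, so no chain-rule
    decomposition over words is needed.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    score = sum(w * f(sentence) for f, w in zip(features, weights))
    if baseline_logprob is not None:
        # Optional baseline model p0 (for example a trigram), an assumption here.
        score += baseline_logprob(sentence)
    return score


# Hypothetical whole-sentence features, for illustration only.
def longer_than_ten_words(s: Sentence) -> float:
    return 1.0 if len(s) > 10 else 0.0


def contains_question_word(s: Sentence) -> float:
    question_words = {"who", "what", "where", "when", "why", "how"}
    return 1.0 if any(w.lower() in question_words for w in s) else 0.0


if __name__ == "__main__":
    print(unnormalized_log_score(
        ["what", "is", "this"],
        [longer_than_ten_words, contains_question_word],
        [0.5, -0.2],
    ))
```

Scores from this function can be compared across sentences (for example when rescoring candidate hypotheses) without ever computing the normalizer.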

Estimation of Stochastic Attribute-Value Grammars using an Informative Sample

by Miles Osborne, 2000
Cited by 8 (1 self)
We argue that some of the computational complexity associated with estimation of stochastic attribute-value grammars can be reduced by training upon an informative subset of the full training set. Results using the parsed Wall Street Journal corpus show that in some circumstances, it is possible to obtain better estimation results using an informative sample than when training upon all the available material. Further experimentation demonstrates that with unlexicalised models, a Gaussian prior can reduce overfitting. However, when models are lexicalised and contain overlapping features, overfitting does not seem to be a problem, and a Gaussian prior makes minimal difference to performance. Our approach is applicable for situations when there are an infeasibly large number of parses in the training set, or else for when recovery of these parses from a packed representation is itself computationally expensive.

Exponential Language Models, Logistic Regression, and Semantic Coherence

by Can Cai, Ronald Rosenfeld, Larry Wasserman - In Proceedings of the NIST/DARPA Speech Transcription Workshop, 2000
Cited by 7 (4 self)
In this paper, we modify the traditional trigram model by using utterance-level semantic coherence features in an exponential model. The semantic coherence features are collected by measuring the correlations among content-word pairs occurring in sentences of two corpora, the real corpus and a corpus generated by the baseline trigram model. The measure we use for estimating the semantic association of content word pairs is Yule's Q statistic. For our preliminary analysis, we have further simplified the modeling task by extracting a small set of statistics from each sentence-based Q statistics and applying them as features to the exponential model. We also simplified the process of obtaining the MLE solutions of the exponential models by approximating it with a logistic regression model. We account for the uncertainty in the estimates of Q by constructing confidence intervals. The new model results in a slight reduction in test-set perplexity. We also discuss and compare alternative mea...
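For reference, Yule's Q for a pair of words can be computed from a 2x2 co-occurrence table as (ad - bc) / (ad + bc). The sketch below assumes the four cells count sentences by whether each word of the pair is present; the function and argument names are hypothetical, and how the paper builds its counts is not stated in the abstract.

```python
def yules_q(both: int, first_only: int, second_only: int, neither: int) -> float:
    """Yule's Q association measure for a 2x2 contingency table.

    both        -- sentences containing both words (a)
    first_only  -- sentences containing only the first word (b)
    second_only -- sentences containing only the second word (c)
    neither     -- sentences containing neither word (d)

    Q ranges from -1 (perfect negative association) to +1 (perfect
    positive association); 0 indicates no association.
    """
    ad = both * neither
    bc = first_only * second_only
    if ad + bc == 0:
        raise ValueError("Yule's Q is undefined when ad + bc == 0")
    return (ad - bc) / (ad + bc)


if __name__ == "__main__":
    print(yules_q(both=30, first_only=10, second_only=10, neither=50))
```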

A fast algorithm for feature selection in conditional maximum entropy modeling

by Yaqian Zhou, Lide Wu - In Proceedings of EMNLP 2003, 2003
Cited by 7 (1 self)
This paper describes a fast algorithm that selects features for conditional maximum entropy modeling. Berger et al. (1996) present an incremental feature selection (IFS) algorithm, which computes the approximate gains for all candidate features at each selection stage, and is very time-consuming for any problems with large feature spaces. In this new algorithm, instead, we only compute the approximate gains for the top-ranked features based on the models obtained from previous stages. Experiments on WSJ data in Penn Treebank are conducted to show that the new algorithm greatly speeds up the feature selection process while maintaining the same quality of selected features. One variant of this new algorithm with look-ahead functionality is also tested to further confirm the good quality of the selected features. The new algorithm is easy to implement, and given a feature space of size F, it only uses O(F) more space than the original IFS algorithm.

Interactive Feature Induction And Logistic Regression For Whole Sentence Exponential Language Models

by Ronald Rosenfeld, Larry Wasserman, Can Cai, Xiaojin Zhu - In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, 1999
Cited by 5 (4 self)
Whole sentence exponential language models directly model the probability of an entire sentence using arbitrary computable properties of that sentence. We present an interactive methodology for feature induction, and demonstrate it in the simple but common case of a trigram baseline, focusing on features that capture the linguistic notion of semantic coherence. We then show how parametric regression can be used in this setup to efficiently estimate the model's parameters, whereas non-parametric regression can be used to construct more powerful exponential models from the raw features.

Minimum Classification Error Training In Exponential Language Models

by Chris Paciorek, Roni Rosenfeld, 2000
Cited by 5 (1 self)
Minimum Classification Error (MCE) training is difficult to apply to language modeling due to inherent scarcity of training data (N-best lists). However, a whole-sentence exponential language model is particularly suitable for MCE training, because it can use a relatively small number of powerful features to capture global sentential phenomena. We review the model, discuss feature induction, find features in both the Broadcast News and Switchboard domains, and build an MCE-trained model for the latter. Our experiments show that even models with relatively few features are prone to overfitting and are sensitive to initial parameter setting, leading us to examine alternative weight optimization criteria and search algorithms.

Incorporating Linguistic Structure into Statistical Language Models

by Ronald Rosenfeld - In Philosophical Transactions of the Royal Society of London A, 2000
Cited by 3 (0 self)

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2010 The Pennsylvania State University