## Mixtures of Conditional Maximum Entropy Models (2002)

Venue: Proc. of ICML-2003

Citations: 14 (8 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Pavlov02mixturesof,
  author    = {Dmitry Pavlov and Alexandrin Popescul and David M. Pennock and Lyle H. Ungar},
  title     = {Mixtures of Conditional Maximum Entropy Models},
  booktitle = {Proc. of ICML-2003},
  year      = {2002},
  pages     = {584--591}
}
```


### Abstract

Driven by successes in several application areas, maximum entropy modeling has recently gained considerable popularity. We generalize the standard maximum entropy formulation of classification problems to better handle the case where complex data distributions arise from a mixture of simpler underlying (latent) distributions. We develop a theoretical framework for characterizing data as a mixture of maximum entropy models. We formulate a maximum-likelihood interpretation of the mixture model learning, and derive a generalized EM algorithm to solve the corresponding optimization problem. We present empirical results for a number of data sets showing that modeling the data as a mixture of latent maximum entropy models gives significant improvement over the standard, single component, maximum entropy approach.
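The mixture-of-maxent idea in the abstract can be sketched in code. The following is a minimal illustrative implementation, not the authors' code: each component is a conditional maxent (multinomial logistic) model, the E-step computes component responsibilities, and the M-step updates the mixing weights in closed form and takes a single gradient ascent step per component (a generalized EM step). All names and hyperparameters (`K`, `lr`, `iters`) are illustrative choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def component_probs(W, X):
    # W: (K, C, S) weights, X: (N, S) features -> (K, N, C) class probs per component
    return softmax(np.einsum('kcs,ns->knc', W, X))

def em_mixture_maxent(X, y, K=2, n_classes=2, iters=50, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    N, S = X.shape
    W = rng.normal(scale=0.1, size=(K, n_classes, S))
    alpha = np.full(K, 1.0 / K)              # mixing weights
    Y = np.eye(n_classes)[y]                 # one-hot labels, (N, C)
    for _ in range(iters):
        # E-step: responsibility r[k, n] of component k for example n
        p = component_probs(W, X)                    # (K, N, C)
        lik = np.einsum('knc,nc->kn', p, Y)          # p_k(y_n | x_n)
        r = alpha[:, None] * lik
        r /= r.sum(axis=0, keepdims=True)
        # M-step: closed-form mixing weights, one gradient step per component
        alpha = r.mean(axis=1)
        for k in range(K):
            err = Y - p[k]                           # (N, C) residual
            grad = np.einsum('n,nc,ns->cs', r[k], err, X)
            W[k] += lr * grad / N
    return W, alpha
```

Prediction uses the mixture `p(c|d) = sum_k alpha_k p_k(c|d)`; in this sketch that is `np.einsum('k,knc->nc', alpha, component_probs(W, X))`.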

### Citations

8142 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977 |

2871 | UCI repository of machine learning databases - Blake, Merz - 1998 |

1087 | A maximum entropy approach to natural language processing - Berger, Pietra, et al. - 1996
Citation Context: ..., advances in computing and the growth of available data contributed to increased popularity of maxent modeling, leading to a number of successful applications, including natural language processing (Berger et al., 1996), language modeling (Chen & Rosenfeld, 1999), part of speech tagging (Ratnaparkhi, 1996), database querying (Pavlov & Smyth, 2001), and protein modeling (Buehler & Ungar, 2001), to name a few. The ma... |

948 | The EM Algorithm and Extensions - McLachlan, Krishnan - 1997 |

895 | Mixture Models - McLachlan, Basford - 1988 |

740 | Statistical methods for speech recognition - Jelinek - 1997
Citation Context: ...on factor) of this feature in the training data. The set of features supplied with maximum entropy as an objective function can be shown to lead to the following form of the conditional maxent model (Jelinek, 1998): $p(c \mid d) = \frac{1}{Z(d)} \exp\left[\sum_{s=1}^{S} \lambda_{sc} F_s(c, d)\right]$ (2), where $Z(d)$ is a normalization constant ensuring that the distribution sums to 1. In what follows we drop the subscript in Z to simplify not... |
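The conditional maxent form quoted above, Eq. (2), can be checked numerically. This is a toy sketch with hypothetical weights $\lambda_{sc}$ and feature values $F_s$, taken here to be independent of the class for simplicity; only the formula itself comes from the quoted context.

```python
import numpy as np

# Eq. (2): p(c|d) = (1/Z(d)) * exp(sum_s lambda_{sc} * F_s(c, d)).
# Hypothetical weights: row s, column c (S=2 features, C=2 classes).
lam = np.array([[1.2, -0.3],
                [0.5,  0.8]])
F = np.array([1.0, 1.0])      # feature values for one fixed document d

scores = F @ lam              # sum over s of lambda_{sc} * F_s
p = np.exp(scores)
p /= p.sum()                  # dividing by Z(d) makes the distribution sum to 1
```

With these numbers the unnormalized scores are 1.7 and 0.5, so class 0 receives the larger probability after normalization.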

443 | A maximum entropy model for part-of-speech tagging - Ratnaparkhi - 1996
Citation Context: ...ty of maxent modeling, leading to a number of successful applications, including natural language processing (Berger et al., 1996), language modeling (Chen & Rosenfeld, 1999), part of speech tagging (Ratnaparkhi, 1996), database querying (Pavlov & Smyth, 2001), and protein modeling (Buehler & Ungar, 2001), to name a few. The maxent approach has several attractive properties that have contributed to its popularity.... |

431 | Generalized iterative scaling for log-linear models - Darroch, Ratcliff - 1972 |

342 | Learning to extract symbolic knowledge from the World Wide Web. In Proc. of AAAI - Craven, DiPasquo, et al. - 1998
Citation Context: ...corresponding to data records, then sparsity reports the percentage of 0 entries in this matrix. As we mentioned above, the higher the sparsity, the more time-efficient the algorithm is. The WebKB data (Craven et al., 1998) contains a set of Web pages gathered from university computer science departments. We used all classes but others and different numbers (up to 1000) of the most frequent words. The Letter recognition... |

235 | Distributional clustering of words for text classification - Baker, McCallum - 1998 |

230 | A gaussian prior for smoothing maximum entropy models - Chen, Rosenfeld - 1999 |

165 | Mixed MNL models for discrete response - McFadden, Train - 2000 |

37 | "Where Do We Stand on Maximum Entropy?" In The Maximum Entropy Formalism - Jaynes - 1979
Citation Context: .... 1. Introduction: Maximum entropy (maxent) modeling has a long history, beginning as a concept in physics and later working its way into the foundations of information theory and Bayesian statistics (Jaynes, 1979). In recent years, advances in computing and the growth of available data contributed to increased popularity of maxent modeling, leading to a number of successful applications, including natural lan... |

31 | A Maximum Entropy Approach To Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains - Pavlov, Pennock - 2002 |

22 | Inducing features of random fields - Della Pietra, Della Pietra, et al. - 1997
Citation Context: ...sources of information, and finally, under fairly general assumptions, maxent modeling has been shown to be equivalent to maximum-likelihood modeling of distributions from the exponential family (Della Pietra et al., 1997). One of the more recent and successful applications of maxent modeling is in the area of classification (Jaakkola et al., 1999), and text classification in particular (Nigam et al., 1999). In this cas... |

17 | The latent maximum entropy principle - Wang, Rosenfeld, et al. - 2002
Citation Context: ...ussed in Section 5. In Section 6 we draw conclusions and describe directions for future work. 2. Related Work: The latent maximum entropy principle was introduced in a general setting by Wang et al. (Wang et al., 2002). In particular, they gave a motivation for generalizing the standard Jaynes maximum entropy principle (Jaynes, 1979) to include latent variables and formulated a convergence theorem of the associate... |

11 | Probabilistic query models for transaction data - Pavlov, Smyth - 2001 |

9 | Maximum entropy methods for biological sequence modeling - Buehler, Ungar - 2001 |

1 | Mixed logit with repeated choices: Households' choices of appliance efficiency level - David - 1998 |

1 | Sequential conditional generalized iterative scaling. In Proc. of the Association for Computational Linguistics Annual Meeting - Goodman - 2002
Citation Context: ...side of the update equation for $\lambda_{s_0 c_0 k_0}$ vanish. As we show in Section 5, on sparse data the speed-ups can be quite significant. Further speed-ups might be achieved by employing the recent work by Goodman (Goodman, 2002), though we have not explored this direction at present. 5. Experimental Results: We ran experiments on several publicly available data sets. The names and parameters of the data sets are given in Tab... |

1 | Maximum entropy discrimination (Technical Report MIT AITR-1668) - Jaakkola, Meila - 1999 |

1 | Using maximum entropy for text classification. IJCAI-99 Workshop on Machine Learning for Information Filtering - Nigam, Lafferty - 1999 |

1 | Distributional clustering of English words. Meeting of the Association for Computational Linguistics - Pereira, Tishby - 1993 |