## Maximum Entropy Discrimination (1999)

### Download Links

- [publications.ai.mit.edu]
- [www.media.mit.edu]
- [www.cs.cmu.edu]
- [www.ai.mit.edu]
- DBLP

### Other Repositories/Bibliography

Citations: 122 (20 self)

### BibTeX

    @MISC{Jaakkola99maximumentropy,
      author = {Tommi Jaakkola and Marina Meila and Tony Jebara},
      title  = {Maximum Entropy Discrimination},
      year   = {1999}
    }

### Abstract

We present a general framework for discriminative estimation based on the maximum entropy principle and its extensions. All calculations involve distributions over structures and/or parameters rather than specific settings and reduce to relative entropy projections. This holds even when the data is not separable within the chosen parametric class, in the context of anomaly detection rather than classification, or when the labels in the training set are uncertain or incomplete. Support vector machines are naturally subsumed under this class and we provide several extensions. We are also able to estimate exactly and efficiently discriminative distributions over tree structures of class-conditional models within this framework. Preliminary experimental results are indicative of the potential in these techniques.
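The relative entropy projections the abstract refers to are minimizations of the Kullback-Leibler divergence onto a constraint set. As a minimal illustration (not code from the paper), here is the discrete relative entropy D(p || q) that such projections minimize:

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Projecting a reference distribution q onto a constraint set by minimizing
# D(p || q) is the basic operation the framework reduces to; here we just
# evaluate the divergence itself on made-up toy distributions.
q = [0.25, 0.25, 0.25, 0.25]
p = [0.1, 0.2, 0.3, 0.4]

assert kl_divergence(q, q) == 0.0   # divergence of a distribution to itself
assert kl_divergence(p, q) > 0.0    # strictly positive whenever p != q
print(kl_divergence(p, q))
```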

### Citations

8983 | The nature of statistical learning theory - Vapnik - 1995

Citation Context: "...maximum likelihood) used for parameter/structure estimation is suboptimal. Support vector machines (SVMs) are, for example, more robust techniques as they are specifically designed for discrimination [9]. Our approach towards general discriminative training is based on the well-known maximum entropy principle (e.g., [3]). This enables an appropriate training of both ordinary and structural parameters..."

8564 | Elements of Information Theory - Cover, Thomas - 2006

Citation Context: "...point of view of regularization theory, the prior probability $P_0$ specifies the entropic regularization used in this approach. Theorem 1: The solution to the MRE problem has the following general form [1]:

$$P(\Theta, \gamma) = \frac{1}{Z(\lambda)}\, P_0(\Theta, \gamma)\, e^{\sum_t \lambda_t \left[\, y_t \mathcal{L}(X_t \mid \Theta) - \gamma_t \,\right]} \qquad (4)$$

where $Z(\lambda)$ is the normalization constant (partition function) and $\lambda = \{\lambda_1, \ldots, \lambda_T\}$ defines a set of non-negative Lagrange multiplier..."
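The solution form of Theorem 1 can be sketched numerically. This is an illustrative toy, not the paper's procedure: the parameter $\Theta$ ranges over a small hand-picked grid, the discriminant $\mathcal{L}(X\mid\theta) = \theta x$ is a made-up one-dimensional choice, and the multipliers $\lambda_t$ are fixed by hand (in the paper they are obtained by maximizing the concave dual objective):

```python
import math

# Toy instance of Eq. (4): P(theta) ∝ P0(theta) * exp(sum_t lambda_t * [y_t * L(X_t|theta) - gamma_t])
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]       # hand-picked discrete parameter grid
prior = [1.0 / len(thetas)] * len(thetas)   # uniform prior P0
data = [(1.0, +1), (-0.5, -1)]              # toy (x_t, y_t) training pairs
lambdas = [0.5, 0.5]                        # multipliers fixed by hand for illustration
gamma = 1.0                                 # target margin

def log_weight(theta):
    # sum_t lambda_t * [y_t * L(X_t | theta) - gamma], with L(x|theta) = theta * x
    return sum(lam * (y * (theta * x) - gamma) for lam, (x, y) in zip(lambdas, data))

unnorm = [p0 * math.exp(log_weight(th)) for p0, th in zip(prior, thetas)]
Z = sum(unnorm)                             # partition function Z(lambda)
posterior = [w / Z for w in unnorm]

assert abs(sum(posterior) - 1.0) < 1e-12
# Mass shifts toward parameters that classify both examples with large margin.
best = thetas[max(range(len(thetas)), key=lambda i: posterior[i])]
print(best)
```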

1245 | Combining Labeled and Unlabeled Data with Co-training - Blum, Mitchell - 1998 |

1075 | A Bayesian method for the induction of probabilistic networks from data - Cooper, Herskovits - 1992 |

903 | Learning Bayesian networks: The combination of knowledge and statistical data - Heckerman, Geiger, et al. - 1995 |

817 | Introduction to graph theory - West - 2001

Citation Context: "...tion function(s) $Z_1$. $P$ is a discrete distribution over all possible tree structures for $n$ variables (there are $n^{n-2}$ trees). However, a remarkable graph theory result, called the Matrix Tree Theorem [10], enables us to perform all necessary summations in closed form in polynomial time. On the basis of this result, we find Theorem 3: The normalization constant $Z$ of a distribution of the form (9) is $Z = h...$"
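The Matrix Tree Theorem invoked in this context can be demonstrated directly. The paper uses it to sum weighted tree distributions in closed form; the sketch below shows only the unweighted special case, counting spanning trees as a cofactor of the graph Laplacian and recovering Cayley's $n^{n-2}$ count for the complete graph:

```python
from fractions import Fraction

def count_spanning_trees(adj):
    """Matrix Tree Theorem: the number of spanning trees of a graph equals
    any cofactor of its Laplacian L = D - A (here: delete row/column 0)."""
    n = len(adj)
    lap = [[Fraction(-adj[i][j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        lap[i][i] = Fraction(sum(adj[i]))
    m = [row[1:] for row in lap[1:]]        # reduced Laplacian
    # Exact determinant via Gaussian elimination over the rationals.
    det = Fraction(1)
    size = n - 1
    for col in range(size):
        pivot = next((r for r in range(col, size) if m[r][col] != 0), None)
        if pivot is None:
            return 0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return int(det)

n = 5
complete = [[0 if i == j else 1 for j in range(n)] for i in range(n)]
assert count_spanning_trees(complete) == n ** (n - 2)  # Cayley's formula: 125
print(count_spanning_trees(complete))
```

In the paper the same determinant computation is applied to an edge-weighted Laplacian, which is what makes the polynomial-time summation over all $n^{n-2}$ trees possible.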

397 | Exploiting generative models in discriminative classifiers - Jaakkola, Haussler - 1998 |

174 | Semi-supervised support vector machines - Bennett, Demiriz - 1999 |

132 | Multisurface method of pattern separation for medical diagnosis applied to breast cytology - Wolberg, Mangasarian - 1990 |

123 | A decision theoretic generalization of on-line learning and an application to boosting - Freund, Schapire - 1995 |

104 | Probabilistic kernel regression models - Jaakkola, Haussler - 1999 |

101 | Information geometry of the EM and em algorithms for neural networks - Amari - 1995 |

75 | Boosting the margin: A new explanation for the effectiveness of voting methods - Schapire, Freund, et al. - 1998 |

74 | Maximum-Entropy Models in Science and Engineering - Kapur - 1990 |

73 | D.: Learning Bayesian networks: Search methods and experimental results - Chickering, Geiger, et al. - 1995 |

58 | Boosting as entropy projection - Kivinen, Warmuth - 1999

Citation Context: "...specify a convex region. Note that the preference towards high-entropy distributions (fewer assumptions) applies only within the admissible set of distributions $P$ consistent with the constraints. See [2] for related work. We will extend this basic idea in a number of ways. The ME formulation assumes, for example, that the training examples can be separated with the specified margin. We may also have a..."
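The idea of maximizing entropy only within an admissible constraint set can be illustrated on a toy problem (this example is not from the paper): among distributions on {0, 1, 2, 3} with a fixed mean, the maximum entropy solution has the exponential form $p_i \propto e^{\lambda x_i}$, and the multiplier $\lambda$ is pinned down by the constraint:

```python
import math

xs = [0, 1, 2, 3]
target_mean = 2.0   # the constraint defining the admissible set

def mean_at(lam):
    """Mean of the exponential-family distribution p_i ∝ exp(lam * x_i)."""
    w = [math.exp(lam * x) for x in xs]
    Z = sum(w)
    return sum(x * wi for x, wi in zip(xs, w)) / Z

# mean_at is increasing in lam, so solve the constraint by bisection.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_at(mid) < target_mean:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2

w = [math.exp(lam * x) for x in xs]
p = [wi / sum(w) for wi in w]
assert abs(sum(x * pi for x, pi in zip(xs, p)) - target_mean) < 1e-6
print([round(pi, 4) for pi in p])
```

Since the unconstrained (uniform) mean is 1.5, pushing the mean up to 2.0 requires a positive multiplier, mirroring how the non-negative Lagrange multipliers in Eq. (4) enforce the margin constraints.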

54 | Maximum conditional likelihood via bound maximization and the CEM algorithm - Jebara, Pentland - 1998

Citation Context: "...rds general discriminative training is based on the well-known maximum entropy principle (e.g., [3]). This enables an appropriate training of both ordinary and structural parameters of the model (cf. [5, 7]). The approach is not limited to probability models and extends, e.g., SVMs. 2 Maximum entropy classification: Consider a two-class classification problem where labels $y \in \{-1, 1\}$ are assigned..."

37 | Flexible non-linear approaches to classification, in From Statistics to Neural Networks - Ripley - 1994 |

26 | Estimating dependency structure as a hidden variable - Meila, Jordan, et al. - 1997

Citation Context: "...discuss in what follows. A tree graphical model is a graphical model whose structure is a tree. This model has the property that its log-likelihood can be expressed as a sum of local terms [8]:

$$\log P(X, E \mid \Theta) = \sum_u h_u(X, \Theta) + \sum_{uv \in E} w_{uv}(X, \Theta) \qquad (8)$$

The discriminant function consisting of the log-likelihood ratio of a pair of tree models (depending on the edge sets $E_1$, $E_{-1}$, and paramete..."
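The decomposition in Eq. (8) can be made concrete with a toy tree. The node and edge potentials below are made-up illustrative values, not fitted model parameters:

```python
# Eq. (8) for a tree model: log P(X, E | theta) = sum_u h_u + sum_{uv in E} w_uv
h = {"a": -0.5, "b": -0.7, "c": -0.3}     # per-node terms h_u(X, theta)
w = {("a", "b"): 0.2, ("b", "c"): 0.1}    # per-edge terms w_uv(X, theta)
edges = [("a", "b"), ("b", "c")]          # tree edge set E (a chain a-b-c)

log_lik = sum(h.values()) + sum(w[e] for e in edges)
print(round(log_lik, 6))  # -1.2
```

Because the log-likelihood is linear in the edge indicators, the log-likelihood ratio of two class trees is too, which is exactly the structure the Matrix Tree Theorem exploits when summing over all tree structures.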

23 | Models and selection criteria for regression and classification - Heckerman, Meek - 1997 |

5 | Transductive inference for text classification using support vector machines - Joachims - 1999

Citation Context: "...ertheless formulate an efficient mean-field approach in this context [4]. Figure 2c demonstrates that even the approximate method is able to reap most of the benefit from unlabeled examples (compare, e.g., [6]). The results are for a DNA splice junction classification problem. For more details see [4]. 4 Discussion: We have presented a general approach to discriminative training of model parameters, structur..."

5 | Gaussian processes for Bayesian classification via hybrid Monte Carlo - Barber, Williams - 1997 |

2 | The Maximum Entropy Formalism - Levine, Tribus (eds.) - 1978

Citation Context: "...ample, more robust techniques as they are specifically designed for discrimination [9]. Our approach towards general discriminative training is based on the well-known maximum entropy principle (e.g., [3]). This enables an appropriate training of both ordinary and structural parameters of the model (cf. [5, 7]). The approach is not limited to probability models and extends, e.g., SVMs. 2 Maximum entro..."

2 | Inferring a Gaussian distribution - Minka - 1998 (http://www.media.mit.edu/~tpminka/papers/minka-gaussian.ps.gz) |