## Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

Citations: 1 (0 self)

### BibTeX

@MISC{Lyu_unifyingnon-maximum,
  author = {Siwei Lyu},
  title = {Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction},
  year = {}
}

### Abstract

When used to learn high-dimensional parametric probabilistic models, classical maximum likelihood (ML) learning often suffers from computational intractability, which has motivated the active development of non-ML learning methods. Yet, because of their divergent motivations and forms, the objective functions of many non-ML learning methods appear unrelated, and a unified framework for understanding them has been lacking. In this work, based on an information-geometric view of parametric learning, we introduce a general non-ML learning principle termed minimum KL contraction, in which we seek optimal parameters that minimize the contraction of the KL divergence between the two distributions after they are transformed with a KL contraction operator. We then show that the objective functions of several important or recently developed non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be unified under the minimum KL contraction framework with different choices of the KL contraction operators.
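To make the abstract's principle concrete: a KL contraction operator Φ satisfies KL(Φ{p}‖Φ{q}) ≤ KL(p‖q), and the minimum KL contraction objective minimizes the non-negative gap KL(p‖q) − KL(Φ{p}‖Φ{q}). The following is a minimal numerical sketch, not from the paper: the distributions and the choice of marginalization as the contraction operator are illustrative assumptions.

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions (natural log)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
# Two distinct joint distributions over a 4x4 grid of states.
p = rng.random((4, 4)); p /= p.sum()
q = rng.random((4, 4)); q /= q.sum()

# Marginalizing out one variable is a KL contraction operator
# (data-processing inequality): KL(p||q) >= KL(Phi{p}||Phi{q}).
kl_joint = kl(p.ravel(), q.ravel())
kl_marg = kl(p.sum(axis=1), q.sum(axis=1))
contraction = kl_joint - kl_marg  # the MKC gap, always >= 0
print(kl_joint, kl_marg, contraction)
```

Minimizing this gap over model parameters (with p fixed as the data distribution) is the minimum KL contraction idea; different operators Φ recover the different non-ML objectives listed above.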

### Citations

9359 | Elements of information theory
- Cover, Thomas
- 1991
Citation Context: ...as the set of all probability density functions over R^d. For two different probability distributions p, q ∈ Ω_d, their Kullback-Leibler (KL) divergence (also known as relative entropy or I-divergence) [6] is defined as KL(p‖q) = ∫_{R^d} p(x) log(p(x)/q(x)) dx. KL divergence is non-negative and equals zero if and only if p = q almost everywhere (a.e.). We define a distribution operator, Φ, as a mapping ...

553 | Training products of experts by minimizing contrastive divergence
- Hinton
- 2000
Citation Context: ...ter they are transformed with a KL contraction operator. We then show that the objective functions of several important or recently developed non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual info...

288 | Statistical analysis of non-lattice data
- Besag
- 1975
Citation Context: ...on-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be unified under the minimum KL cont...

271 | Introductory functional analysis with applications
- Kreyszig
- 1989
Citation Context: ...nless the two distributions are equal (a.e.). The KL contraction operators are analogous to the contraction operators in ordinary metric spaces, with β having a similar role as the Lipschitz constant [19]. Indeed, it is known that the KL divergence behaves like the squared Euclidean distance [6]. Eq. (1) gives the general and abstract definition of KL contraction operators. In the following, we giv...

247 | Projection pursuit - Huber - 1985 |

177 | Maximum Mutual Information Estimation of Hidden Markov Model parameters for speech recognition
- Bahl, Brown, et al.
- 1986
Citation Context: ...contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be unified under the minimum KL contraction framework with different choices of the KL contraction operat...

165 | The Estimation of Probabilities: An Essay on Modern Bayesian Methods
- Good
- 1965
Citation Context: ...ial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be unified under the minimum KL contraction framework with different choices of the KL contraction operators. 1 Introduction Fitting param...

158 | Partial likelihood
- Cox
- 1975
Citation Context: ...then show that the objective functions of several important or recently developed non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and condition...

157 | Markov random fields and their applications
- Kindermann
- 1980
Citation Context: ... to find a member in a parametric distribution family, q_θ, to best represent the training data. In practice, many useful high-dimensional parametric probabilistic models, such as Markov random fields [18] or products of experts [12], are defined as q_θ(x) = q̃_θ(x)/Z(θ), where q̃_θ is the unnormalized model and Z(θ) = ∫_{R^d} q̃_θ(x) dx is the partition function. The maximum (log) likelihood (ML) estimation i...
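The snippet above writes the model as q_θ(x) = q̃_θ(x)/Z(θ). A one-dimensional sketch (illustrative only; the quadratic energy and grid integration are my choices, not the paper's) shows that computing Z(θ) amounts to an integral over the whole domain, which is what becomes intractable in high dimensions:

```python
import numpy as np

# Unnormalized model q~_theta(x) = exp(-theta * x^2); making it a density
# requires the partition function Z(theta) = integral of q~_theta over R.
theta = 0.5
xs = np.linspace(-10.0, 10.0, 20001)
dx = xs[1] - xs[0]
q_tilde = np.exp(-theta * xs**2)

# Trapezoidal approximation of Z(theta); in d dimensions this becomes a
# d-dimensional integral, which is the source of intractability.
Z = (q_tilde.sum() - 0.5 * (q_tilde[0] + q_tilde[-1])) * dx
Z_exact = np.sqrt(np.pi / theta)  # closed form for this Gaussian-shaped model
print(Z, Z_exact)
```

For this toy model Z(θ) has a closed form; for a general high-dimensional energy it does not, which is what motivates the non-ML objectives surveyed here.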

81 | Pseudo likelihood Estimation for Social Networks
- Strauss, Ikeda
- 1990
Citation Context: ...expectations with averages over samples from p(x), we have argmin_θ [KL(p‖q) − (1/d) ∑_{i=1}^{d} KL(Φ^m_{\{i\}}{p} ‖ Φ^m_{\{i\}}{q})] ≈ argmax_θ (1/n) ∑_{k=1}^{n} (1/d) ∑_{i=1}^{d} log q_{i|\i}(x_i^{(k)} | x_{\i}^{(k)}), which is the objective function in maximum pseudo-likelihood learning [3, 29]. 4.7 Relation with Marginal Composite Likelihood: We now consider combining the Type III MKC objective function, Eq. (9), with the KL contraction operat...
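As a small illustration of the pseudo-likelihood objective quoted above (a toy model of my own choosing, not an example from the paper): for a two-spin exponential-family model with interaction θ, each conditional q(x_i | x_{-i}) is tractable, so the objective never touches the joint partition function.

```python
import numpy as np

def pseudo_loglik(theta, X):
    """Average pseudo-log-likelihood of a toy 2-spin model
    q~_theta(x1, x2) = exp(theta * x1 * x2), with x_i in {-1, +1}.
    Each conditional is exp(theta*x1*x2) / (2*cosh(theta*x_other))."""
    total = 0.0
    for x1, x2 in X:
        total += theta * x1 * x2 - np.log(2 * np.cosh(theta * x2))  # log q(x1|x2)
        total += theta * x1 * x2 - np.log(2 * np.cosh(theta * x1))  # log q(x2|x1)
    return total / len(X)

# Toy data: mostly aligned spins, so the optimal theta is positive
# (stationarity gives tanh(theta) = mean(x1*x2) = 0.5, theta ~ 0.549).
X = [(1, 1), (1, 1), (-1, -1), (1, -1)]
thetas = np.linspace(-2, 2, 401)
best = max(thetas, key=lambda t: pseudo_loglik(t, X))
print(best)
```

The grid search stands in for gradient-based fitting; the point is only that the objective is a sum of tractable conditional log-likelihoods.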

67 | Composite likelihood methods
- Lindsay
- 1988
Citation Context: ...1], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be unified under the minimum KL contraction framework with different choices of the KL contraction operators. 1 Introduction Fitting parametric probabilistic models to data is a basic task in st...

56 | Maximum conditional likelihood via bound maximization and the CEM algorithm
- Jebara, Pentland
- 1998
Citation Context: ...ntrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be unified under the minimum KL contraction framework with different choi...

43 | A tutorial on energy-based learning
- Lecun, Chopra, et al.
- 2006
Citation Context: ... in numerical implementations are only tangential to the more fundamental relationship among the objective functions of different non-ML learning methods. On the other hand, energy-based learning [22] presents a general framework that subsumes most non-ML learning objectives, but its broad generality also obscures their specific statistical interpretations. At the objective function level, relation...

35 | Information theory and statistics: A tutorial, Foundations and Trends
- Csiszár, Shields
- 2004
Citation Context: ...q(x, c), respectively. The equality holds if and only if p = q (a.e.). 3.4 Lumping: Let S = (S_1, S_2, · · · , S_m) be a partition of R^d such that S_i ∩ S_j = ∅ for i ≠ j and ⋃_{i=1}^{m} S_i = R^d. The lumping [8] of a distribution p(x) ∈ Ω_d over S yields a distribution over i ∈ {1, 2, · · · , m}, and subsequently induces a distribution operator Φ^l_S, as: Φ^l_S{p}(i) = ∫_{S_i} p(x) dx = P_i, for i = 1, · · · , m...
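The lumping operator quoted above is a concrete KL contraction operator: merging states of a distribution can only shrink the KL divergence. A discrete sketch (the particular p, q, and partition are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions (natural log)."""
    return float(np.sum(p * np.log(p / q)))

def lump(p, partition):
    """Lumping operator Phi^l_S: total probability mass of each cell S_i."""
    return np.array([p[list(cell)].sum() for cell in partition])

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])
S = [(0, 1), (2, 3)]  # partition of the four states into two cells

kl_full = kl(p, q)
kl_lumped = kl(lump(p, S), lump(q, S))
print(kl_full, kl_lumped)  # lumping contracts: kl_full >= kl_lumped
```

Choosing the partition S changes how much the divergence contracts, which is what gives the minimum KL contraction framework its family of operator-specific objectives.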

35 | An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators
- Liang, Jordan
- 2008
Citation Context: ...A_m: argmin_θ [KL(p‖q) − (1/m) ∑_{i=1}^{m} KL(Φ^m_{A_i}{p} ‖ Φ^m_{A_i}{q})] ≈ argmax_θ (1/m) ∑_{i=1}^{m} (1/n) ∑_{k=1}^{n} log q_{A_i|\A_i}(x_{A_i}^{(k)} | x_{\A_i}^{(k)}). This is the objective function in conditional composite likelihood [24, 30, 23, 1] (also rediscovered under the name piecewise learning in [26]). A special case of conditional composite likelihood is when A_i = \{i\}; the resulting marginalization operator, Φ^m_{\{i\}}, is known as the...

33 | Multi-conditional learning: Generative/discriminative training for clustering and classification
- McCallum, Pal, et al.
- 2006
Citation Context: ...argmax_θ (1/m) ∑_{i=1}^{m} (1/n) ∑_{k=1}^{n} log q_{A_i|\A_i}(x_{A_i}^{(k)} | x_{\A_i}^{(k)}). This is the objective function in conditional composite likelihood [24, 30, 23, 1] (also rediscovered under the name piecewise learning in [26]). A special case of conditional composite likelihood is when A_i = \{i\}; the resulting marginalization operator, Φ^m_{\{i\}}, is known as the ith singleton marginalization operator. With the d different...

30 | A note on composite likelihood inference and model selection
- Varin, Vidoni
- 2005
Citation Context: ...terpretations. At the objective function level, relations between some non-ML methods are known. For instance, it is known that pseudo-likelihood is a special case of conditional composite likelihood [30]. In [10], several non-ML learning methods are unified under the framework of minimizing Bregman divergence. 3 KL Contraction Operator: We base our discussion hereafter on continuous variables and prob...

29 | Estimation of non-normalized statistical models using score matching
- Hyvärinen
- 2005
Citation Context: ... or recently developed non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be unified un...

29 | Maximum likelihood: An introduction
- Le Cam, L.
- 1990
Citation Context: ...er learning, where the optimal parameter is obtained by solving argmax_θ (1/n) ∑_{k=1}^{n} log q_θ(x^{(k)}). The obtained ML estimator has many desirable properties, such as consistency and asymptotic normality [21]. However, because of the high-dimensional integration/summation, the partition function in q_θ oftentimes makes ML learning computationally intractable. For this reason, non-ML parameter learning meth...

27 | Statistical manifolds
- Lauritzen
- 1987
Citation Context: ...uming training data are samples from a distribution p ∈ Ω_d, we seek an optimal distribution on the statistical manifold corresponding to the parametric distribution family q_θ that best approximates p [20]. In this context, maximum (log) likelihood learning is equivalent to finding the parameter θ that minimizes the KL divergence of p and q_θ [8], as argmin_θ KL(p‖q_θ) = argmax_θ ∫_{R^d} p(x) log q_θ(x) dx. The d...

20 | On the distinction between the conditional probability and the joint probability approaches in the specification of nearest-neighbor systems - Brook - 1964 |

16 | Some extensions of score matching
- Hyvärinen
- 2007
Citation Context: ...g the more general f-divergence [8], of which the KL divergence is a special case. With the more general framework, we hope to reveal further relations among other types of non-ML learning objectives [16, 25, 28, 27]. Second, in the current work, we have focused on the idealization of parametric learning as matching probability distributions. In practice, learning is most often performed on finite data set with a...

15 | Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables - Hyvärinen - 2006 |

12 | Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
- Gutmann, Hyvärinen
- 2010
Citation Context: ...contraction operator. We then show that the objective functions of several important or recently developed non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likel...

8 | Interpretation and generalization of score matching
- Lyu
Citation Context: ...g the more general f-divergence [8], of which the KL divergence is a special case. With the more general framework, we hope to reveal further relations among other types of non-ML learning objectives [16, 25, 28, 27]. Second, in the current work, we have focused on the idealization of parametric learning as matching probability distributions. In practice, learning is most often performed on finite data set with a...

8 | Non-local contrastive objectives
- Vickrey, Lin, et al.
- 2010
Citation Context: ... of several important or recently developed non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [2...

5 | Minimum probability flow learning
- Sohl-Dickstein, Battaglino, et al.
Citation Context: ...g the more general f-divergence [8], of which the KL divergence is a special case. With the more general framework, we hope to reveal further relations among other types of non-ML learning objectives [16, 25, 28, 27]. Second, in the current work, we have focused on the idealization of parametric learning as matching probability distributions. In practice, learning is most often performed on finite data set with a...

4 | A family of computationally efficient and simple estimators for unnormalized statistical models
- Pihlaja, Gutmann, et al.
- 2010

3 | Learning with Blocks: Composite Likelihood and Contrastive Divergence
- Asuncion, Liu, et al.
- 2010
Citation Context: ...divergence [12] that performs Langevin approximation instead of Gibbs sampling, and both are approximations to the parameter update of pseudo-likelihood [3]. This connection is further generalized in [1], which shows that the parameter update in another variant of contrastive divergence is equivalent to a stochastic parameter update in conditional composite likelihood [24]. However, such similarities in ...

1 | Bregman divergence as general framework to estimate unnormalized statistical models
- Gutmann, Hirayama
- 2011
Citation Context: ...ions. At the objective function level, relations between some non-ML methods are known. For instance, it is known that pseudo-likelihood is a special case of conditional composite likelihood [30]. In [10], several non-ML learning methods are unified under the framework of minimizing Bregman divergence. 3 KL Contraction Operator: We base our discussion hereafter on continuous variables and probability d...