## Efficient large-scale distributed training of conditional maximum entropy models (2009)


Venue: Advances in Neural Information Processing Systems

Citations: 31 (2 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Mann09efficientlarge-scale,
  author    = {Gideon Mann and Ryan McDonald and Mehryar Mohri and Nathan Silberman and Daniel D. Walker},
  title     = {Efficient large-scale distributed training of conditional maximum entropy models},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2009}
}
```


### Abstract

Training conditional maximum entropy models on massive data sets requires significant computational resources. We examine three common distributed training methods for conditional maxent: a distributed gradient computation method, a majority vote method, and a mixture weight method. We analyze and compare the CPU and network time complexity of each of these methods and present a theoretical analysis of conditional maxent models, including a study of the convergence of the mixture weight method, the most resource-efficient technique. We also report the results of large-scale experiments comparing these three methods which demonstrate the benefits of the mixture weight method: this method consumes fewer resources, while achieving a performance comparable to that of standard approaches.
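The mixture weight method highlighted in the abstract trains an independent model on each shard of the data and simply averages the resulting weight vectors, so each machine ships only one weight vector over the network. The sketch below illustrates the idea on a dense multinomial logistic regression; the shard trainer, step sizes, and regularization constant are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_shard(X, y, n_classes, lam=0.1, lr=0.5, steps=200):
    """Train a conditional maxent (multinomial logistic) model on one shard
    by full-batch gradient descent on the L2-regularized log-loss."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[y]                 # one-hot labels
    for _ in range(steps):
        P = softmax(X @ W)
        grad = X.T @ (P - Y) / n + lam * W   # regularized gradient
        W -= lr * grad
    return W

def mixture_weight(shards, n_classes):
    """Average the per-shard weight matrices (uniform mixture weights)."""
    Ws = [train_shard(X, y, n_classes) for X, y in shards]
    return np.mean(Ws, axis=0)
```

Prediction with the averaged weights is just `softmax(X @ W).argmax(axis=1)`; the appeal, per the abstract, is that this exchanges one weight vector per machine instead of a gradient per iteration.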

### Citations

2105 | Building a Large Annotated Corpus of English: The Penn Treebank - Marcus, Marcinkiewicz, et al. - 1993
Citation Context ...ter than wm. The convergence bound for wµ contains two terms, one somewhat more favorable, one somewhat less than its counterpart term in the bound for wpm. The excerpt continues into Table 2:

| Data set | m | \|Y\| | \|X\| | sparsity | p |
|---|---|---|---|---|---|
| English POS =-=[16]-=- | 1 M | 24 | 500 K | 0.001 | 10 |
| Sentiment | 9 M | 3 | 500 K | 0.001 | 10 |
| RCV1-v2 [14] | 26 M | 103 | 10 K | 0.08 | 10 |
| Speech | 50 M | 129 | 39 | 1.0 | 499 |
| Deja News Archive | 306 M | 8 | 50 K | 0.002 | 200 |
| Deja News Archive 250K | 306 M | 8 | 250 K | 0.0004 | 200 |

1894 | Numerical Optimization - Nocedal, Wright - 2000
Citation Context ...itional maxent models using a single processor. These include generalized iterative scaling [7], improved iterative scaling [8], gradient descent, conjugate gradient methods, and second-order methods =-=[15, 18]-=-. This paper examines distributed methods for training conditional maxent models that can scale to very large samples of up to 1B instances. Both batch algorithms and on-line training algorithms such ... |

1083 | A maximum entropy approach to natural language processing - Berger, Pietra, et al. - 1996
Citation Context ... benefits of the mixture weight method: this method consumes less resources, while achieving a performance comparable to that of standard approaches. 1 Introduction Conditional maximum entropy models =-=[1, 3]-=-, conditional maxent models for short, also known as multinomial logistic regression models, are widely used in applications, most prominently for multiclass classification problems with a large numbe... |

668 | Information theory and statistical mechanics - Jaynes - 1963
Citation Context ... problems with a large number of classes in natural language processing [1, 3] and computer vision [12] over the last decade or more. These models are based on the maximum entropy principle of Jaynes =-=[11]-=-, which consists of selecting among the models approximately consistent with the constraints, the one with the greatest entropy. They benefit from a theoretical foundation similar to that of standard ... |
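The maximum entropy principle described in this excerpt yields, for the conditional case, the familiar multinomial logistic form. A sketch in standard notation (not quoted from the paper):

```latex
% Conditional maxent / multinomial logistic model over features \Phi(x, y):
p_w[y \mid x] \;=\; \frac{\exp\big(w \cdot \Phi(x, y)\big)}
                         {\sum_{y' \in Y} \exp\big(w \cdot \Phi(x, y')\big)},
\qquad
% trained by minimizing the L2-regularized negative conditional log-likelihood:
\widehat{w} \;=\; \operatorname*{argmin}_{w}\;
  \lambda \lVert w \rVert^{2}
  \;+\; \frac{1}{m} \sum_{i=1}^{m} -\log p_w[y_i \mid x_i].
```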

553 | Inducing features of random fields - Pietra, Pietra, et al. - 1997
Citation Context ...tely consistent with the constraints, the one with the greatest entropy. They benefit from a theoretical foundation similar to that of standard maxent probabilistic models used for density estimation =-=[8]-=-. In particular, a duality theorem for conditional maxent model shows that these models belong to the exponential family. As shown by Lebanon and Lafferty [13], in the case of two classes, these model... |

436 | RCV1: A new benchmark collection for text categorization research - Lewis, Yang, et al.
Citation Context ...e somewhat more favorable, one somewhat less than its counterpart term in the bound for wpm. The excerpt continues into Table 2:

| Data set | m | \|Y\| | \|X\| | sparsity | p |
|---|---|---|---|---|---|
| English POS [16] | 1 M | 24 | 500 K | 0.001 | 10 |
| Sentiment | 9 M | 3 | 500 K | 0.001 | 10 |
| RCV1-v2 =-=[14]-=- | 26 M | 103 | 10 K | 0.08 | 10 |
| Speech | 50 M | 129 | 39 | 1.0 | 499 |
| Deja News Archive | 306 M | 8 | 50 K | 0.002 | 200 |
| Deja News Archive 250K | 306 M | 8 | 250 K | 0.0004 | 200 |
| Gigaword [10] | 1,000 M | 96 | 10 K | 0.001 | 1000 |

Table 2: Description...

431 | Generalized iterative scaling for log-linear models - Darroch, Ratcliff - 1972
Citation Context ...r data sets of several million points. A number of algorithms have been described for batch training of conditional maxent models using a single processor. These include generalized iterative scaling =-=[7]-=-, improved iterative scaling [8], gradient descent, conjugate gradient methods, and second-order methods [15, 18]. This paper examines distributed methods for training conditional maxent models that c... |

389 | On the method of bounded differences - McDiarmid - 1989
Citation Context ...inequality holds: ‖w − w⋆‖ ≤ (R/λ) · (1 + √(log(1/δ))) / √(m/2). (9) Proof. Let S and S′ be as before samples of size m differing by a single point. To derive this bound, we apply McDiarmid’s inequality =-=[17]-=- to Ψ(S) = ‖w − w⋆‖. By the triangle inequality and Theorem 1, the following Lipschitz property holds: |Ψ(S′) − Ψ(S)| = |‖w′ − w⋆‖ − ‖w − w⋆‖| ≤ ‖w′ − w‖ ≤ 2R/(λm). (10) Thus, by McDiarmid’... |
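The McDiarmid step this excerpt truncates can be filled in as follows; the expectation bound used in the last line is assumed to come from the preceding stability argument (Theorem 1 in the paper), not re-derived here.

```latex
% Replacing any one of the m points changes \Psi(S) = \|w - w^\star\| by at most
% c = \tfrac{2R}{\lambda m}, so McDiarmid's inequality gives
\Pr\big[\Psi(S) \ge \mathbb{E}[\Psi(S)] + \epsilon\big]
  \;\le\; \exp\!\Big(\frac{-2\epsilon^{2}}{m c^{2}}\Big)
  \;=\; \exp\!\Big(\frac{-\lambda^{2} m\, \epsilon^{2}}{2R^{2}}\Big).
% Setting the right-hand side equal to \delta and solving yields
\epsilon \;=\; \frac{R}{\lambda}\sqrt{\frac{2\log(1/\delta)}{m}},
% which, combined with \mathbb{E}[\Psi(S)] \le \tfrac{R}{\lambda}\sqrt{2/m},
% recovers the bound (9) above.
```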

339 | Prediction and Entropy of Printed English - Shannon - 1951
Citation Context ...ixture weight and 40GB for distributed gradient method when we discard machine-to-disk traffic. For the largest experiment, we examined the task of predicting the next character in a sequence of text =-=[19]-=-, which has implications for many natural language processing tasks. As a training and evaluation corpus we used the English Gigaword corpus [10] and used the full ASCII output space of that corpus of... |

229 | A comparison of algorithms for maximum entropy parameter estimation - Malouf - 2002
Citation Context ...itional maxent models using a single processor. These include generalized iterative scaling [7], improved iterative scaling [8], gradient descent, conjugate gradient methods, and second-order methods =-=[15, 18]-=-. This paper examines distributed methods for training conditional maxent models that can scale to very large samples of up to 1B instances. Both batch algorithms and on-line training algorithms such ... |

204 | Logistic regression, AdaBoost and Bregman distances - Collins, Schapire, et al. - 2004
Citation Context ...odels that can scale to very large samples of up to 1B instances. Both batch algorithms and on-line training algorithms such as that of =-=[5]-=- or stochastic gradient descent [21] can benefit from parallelization, but we concentrate here on batch distributed methods. We examine three common distributed training methods: a distributed gradien... (∗ This work was conducted while at Google Research, New York.) |

165 | Stability and generalization - Bousquet, Elisseeff

140 | Map-reduce for machine learning on multicore - Chu, Kim, et al. - 2006
Citation Context ...nt descent [21] can benefit from parallelization, but we concentrate here on batch distributed methods. We examine three common distributed training methods: a distributed gradient computation method =-=[4]-=-, a majority vote method, and a mixture weight method. We analyze and compare the CPU and network time complexity of each of these methods (Section 2) and present a theoretical analysis of conditional... |
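For contrast with the mixture weight method, the distributed gradient computation method mentioned in this excerpt keeps a single shared model: every iteration, each worker computes the gradient on its shard, a master sums them and broadcasts the updated weights. A minimal single-process simulation (the shard trainer, step size, and regularizer are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def shard_gradient(W, X, y, n_classes):
    """Un-normalized gradient of the log-loss on one worker's shard."""
    Z = X @ W
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return X.T @ (P - np.eye(n_classes)[y])   # summed over shard examples

def distributed_gradient(shards, d, n_classes, lam=0.1, lr=0.5, steps=200):
    """Master loop: sum per-shard gradients, update, 'broadcast' new weights."""
    W = np.zeros((d, n_classes))
    m = sum(len(y) for _, y in shards)
    for _ in range(steps):
        g = sum(shard_gradient(W, X, y, n_classes) for X, y in shards) / m
        W -= lr * (g + lam * W)               # regularized update
    return W
```

This reproduces the single-machine batch solution exactly, but note the network cost: one gradient and one weight broadcast per iteration, versus a single exchange for the mixture weight method.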

86 | A Survey of Smoothing Techniques for ME Models - Chen, Rosenfeld
Citation Context ... benefits of the mixture weight method: this method consumes less resources, while achieving a performance comparable to that of standard approaches. 1 Introduction Conditional maximum entropy models =-=[1, 3]-=-, conditional maxent models for short, also known as multinomial logistic regression models, are widely used in applications, most prominently for multiclass classification problems with a large numbe... |

80 | Boosting and maximum likelihood for exponential models - Lebanon, Lafferty - 2001
Citation Context ...listic models used for density estimation [8]. In particular, a duality theorem for conditional maxent model shows that these models belong to the exponential family. As shown by Lebanon and Lafferty =-=[13]-=-, in the case of two classes, these models are also closely related to AdaBoost, which can be viewed as solving precisely the same optimization problem with the same constraints, modulo a normalizatio... |

70 | Feature Hashing for Large Scale Multitask Learning - Weinberger, Dasgupta, et al. - 2009
Citation Context ...nd the Deja News Archive, a text topic classification problem generated from a collection of Usenet discussion forums from the years 1995-2000. For all text experiments, we used random feature mixing =-=[9, 20]-=- to control the size of the feature space. The results reported in Table 3 show that the accuracy of the mixture weight method consistently matches or exceeds that of the majority vote method. As expe... |

56 | Solving large scale linear prediction problems using stochastic gradient descent algorithms - Zhang - 2004
Citation Context ...samples of up to 1B instances. Both batch algorithms and on-line training algorithms such as that of [5] or stochastic gradient descent =-=[21]-=- can benefit from parallelization, but we concentrate here on batch distributed methods. We examine three common distributed training methods: a distributed gradient computation method [4], a majority... (∗ This work was conducted while at Google Research, New York.) |
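Stochastic gradient descent, mentioned in this excerpt as the on-line alternative to batch training, updates the weights one example at a time. A minimal sketch on the same multinomial logistic model (learning rate, regularizer, and epoch count are illustrative assumptions):

```python
import numpy as np

def sgd_maxent(X, y, n_classes, lam=0.01, lr=0.1, epochs=5, seed=0):
    """On-line SGD on the L2-regularized conditional maxent objective."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, n_classes))
    for _ in range(epochs):
        for i in rng.permutation(n):          # one pass in random order
            z = X[i] @ W
            p = np.exp(z - z.max())
            p /= p.sum()                      # predicted class probabilities
            p[y[i]] -= 1.0                    # gradient of -log p_w[y|x] wrt logits
            W -= lr * (np.outer(X[i], p) + lam * W)
    return W
```

Each update touches one example, which is why, as the excerpt notes, such methods parallelize differently from the batch schemes the paper concentrates on.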

53 | Using maximum entropy for automatic image annotation - Jeon, Manmatha
Citation Context ...ic regression models, are widely used in applications, most prominently for multiclass classification problems with a large number of classes in natural language processing [1, 3] and computer vision =-=[12]-=- over the last decade or more. These models are based on the maximum entropy principle of Jaynes [11], which consists of selecting among the models approximately consistent with the constraints, the o... |

17 | Sample selection bias correction theory - Mohri, Riley, et al. - 2008
Citation Context ...ded, that is, there exists R > 0 such that for all (x, y) in X × Y, ‖Φ(x, y)‖ ≤ R. Our bounds are derived using techniques similar to those used by Bousquet and Elisseeff [2], or other authors, e.g., =-=[6]-=-, in the analysis of stability. In what follows, for any w ∈ H and z = (x, y) ∈ X × Y, we denote by Lz(w) the negative log-likelihood −log pw[y|x]. Theorem 1. Let S′ and S be two arbitrary samples of ... |

15 | Small statistical models by random feature mixing - Ganchev, Dredze - 2008
Citation Context ...nd the Deja News Archive, a text topic classification problem generated from a collection of Usenet discussion forums from the years 1995-2000. For all text experiments, we used random feature mixing =-=[9, 20]-=- to control the size of the feature space. The results reported in Table 3 show that the accuracy of the mixture weight method consistently matches or exceeds that of the majority vote method. As expe... |
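Random feature mixing, used in these experiments to control the feature space, hashes each raw feature into a fixed number of buckets so the model dimension stays bounded regardless of vocabulary size. A minimal unsigned-count sketch (bucket count and hash choice are illustrative; the hashing papers cited here additionally use a sign hash to reduce collision bias):

```python
import hashlib

def hashed_features(tokens, n_buckets=1024):
    """Map a bag of string features into a fixed-size count vector by hashing.

    Colliding tokens share a bucket; that collision noise is the price paid
    for a bounded feature space."""
    vec = [0.0] * n_buckets
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        vec[h % n_buckets] += 1.0
    return vec
```

Because the mapping is a pure function of the token string, every machine in a distributed run produces identical indices with no shared feature dictionary to synchronize.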

6 | English Gigaword Third Edition (Linguistic Data Consortium) - Graff, Kong, et al. - 2007
Citation Context ...excerpt from Table 2:

| Data set | m | \|Y\| | \|X\| | sparsity | p |
|---|---|---|---|---|---|
| Sentiment | 9 M | 3 | 500 K | 0.001 | 10 |
| RCV1-v2 [14] | 26 M | 103 | 10 K | 0.08 | 10 |
| Speech | 50 M | 129 | 39 | 1.0 | 499 |
| Deja News Archive | 306 M | 8 | 50 K | 0.002 | 200 |
| Deja News Archive 250K | 306 M | 8 | 250 K | 0.0004 | 200 |
| Gigaword =-=[10]-=- | 1,000 M | 96 | 10 K | 0.001 | 1000 |

Table 2: Description of data sets. The column named sparsity reports the frequency of non-zero feature values for each data set. 4 Experiments. We ran a number of experiment... |