## Inducing Features of Random Fields (1997)

### Download Links

- [www.cs.cmu.edu]
- [reports.adm.cs.cmu.edu]
- [www.stat.ucla.edu]
- CiteULike
- DBLP

### Other Repositories/Bibliography

Venue: IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Citations: 553 (14 self)

### BibTeX

@ARTICLE{Pietra97inducingfeatures,
  author  = {Stephen Della Pietra and Vincent Della Pietra and John Lafferty},
  title   = {Inducing Features of Random Fields},
  journal = {IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE},
  year    = {1997},
  volume  = {19},
  number  = {4},
  pages   = {380--393}
}

### Abstract

We present a technique for constructing random fields from a set of training samples. The learning paradigm builds increasingly complex fields by allowing potential functions, or features, that are supported by increasingly large subgraphs. Each feature has a weight that is trained by minimizing the Kullback-Leibler divergence between the model and the empirical distribution of the training data. A greedy algorithm determines how features are incrementally added to the field and an iterative scaling algorithm is used to estimate the optimal values of the weights. The random field models and techniques introduced in this paper differ from those common to much of the computer vision literature in that the underlying random fields are non-Markovian and have a large number of parameters that must be estimated. Relations to other learning approaches, including decision trees, are given. As a demonstration of the method, we describe its application to the problem of automatic word classifica...
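To make the weight-training step described in the abstract concrete, here is a minimal sketch (not the paper's implementation) of fitting the weights of an exponential model by Generalized Iterative Scaling so that model feature expectations match the empirical ones, which is equivalent to minimizing the KL divergence within this family. The configuration space, features, and training samples are all invented for illustration.

```python
import math
from itertools import product

# Toy configuration space: binary labelings of 3 sites (illustrative only).
space = list(product([0, 1], repeat=3))

# Two hypothetical binary features; the weights lam are trained so that
# model feature expectations match the empirical ones.
features = [lambda x: x[0], lambda x: x[0] * x[1]]
samples = [(1, 1, 0), (1, 1, 1), (1, 0, 0), (0, 0, 0)]
emp = {x: samples.count(x) / len(samples) for x in space}

def model_probs(lam):
    # p(x) proportional to exp(sum_i lam_i * f_i(x))
    w = {x: math.exp(sum(l * f(x) for l, f in zip(lam, features))) for x in space}
    z = sum(w.values())
    return {x: v / z for x, v in w.items()}

def expect(p, f):
    return sum(p[x] * f(x) for x in space)

# Generalized Iterative Scaling: the step size uses C, the maximum total
# feature count over configurations.
C = max(sum(f(x) for f in features) for x in space)
lam = [0.0, 0.0]
for _ in range(500):
    p = model_probs(lam)
    lam = [l + (1.0 / C) * math.log(expect(emp, f) / expect(p, f))
           for l, f in zip(lam, features)]
```

After training, the model's feature expectations match the empirical values (here 3/4 and 1/2), the defining property of the maximum-likelihood solution in this exponential family.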

### Citations

8080 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...e incremental step of the algorithm in terms of an auxiliary function which bounds from below the log-likelihood objective function. This technique is the standard means of analyzing the EM algorithm [13], but it has not previously been applied to iterative scaling. Our analysis of iterative scaling is different and simpler than previous treatments. In particular, in contrast to Csiszar's proof of the... |

3902 | Classification and Regression Trees - Breiman, Friedman, et al. - 1984 |

3717 |
Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images
- Geman, Geman
- 1984
Citation Context: ...er than discuss the details of Monte Carlo techniques for this problem we refer to the extensive literature on this topic. We have obtained good results using the standard technique of Gibbs sampling [17] for the problem we describe in Section 5. IV. PARAMETER ESTIMATION In this section we present an algorithm for selecting the parameters associated with the features of a random field. The algorithm i... |
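The snippet above mentions Gibbs sampling for approximating field expectations. A minimal sketch of single-site Gibbs sampling on a toy pairwise model (the chain, coupling J, and target marginal are invented for illustration, not taken from the paper):

```python
import math
import random
from itertools import product

random.seed(0)

# Hypothetical pairwise model on a 3-site binary chain (illustrative).
J = 0.8  # coupling strength between neighboring sites

def unnorm(x):
    # Unnormalized probability exp(J * (x0*x1 + x1*x2))
    return math.exp(J * (x[0] * x[1] + x[1] * x[2]))

# Exact marginal P(x1 = 1) by enumeration (feasible on this toy space).
space = list(product([0, 1], repeat=3))
z = sum(unnorm(x) for x in space)
exact = sum(unnorm(x) for x in space if x[1] == 1) / z

# Gibbs sampling: repeatedly resample one site from its conditional
# given the others, then average after a burn-in period.
x = [0, 0, 0]
count = total = 0
for sweep in range(20000):
    for i in range(3):
        x[i] = 1
        p1 = unnorm(x)
        x[i] = 0
        p0 = unnorm(x)
        x[i] = 1 if random.random() < p1 / (p0 + p1) else 0
    if sweep >= 1000:          # discard burn-in sweeps
        total += 1
        count += x[1]
estimate = count / total
```

On this tiny model the sampler's estimate of P(x1 = 1) agrees with the exactly enumerated marginal to within Monte Carlo error; in realistic fields the enumeration is infeasible and only the sampler is available.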

1083 | A maximum entropy approach to natural language processing
- Berger, Pietra, et al.
- 1996
Citation Context: ... functions f(x; y), the same general approach is applicable. The feature induction method for conditional exponential models is demonstrated for several problems in statistical machine translation in [3], where it is presented in terms of the principle of maximum entropy. B. Decision trees Our feature induction paradigm also bears some resemblance to various methods for growing classification and reg... |

697 | Class-based Ngram models of natural language
- Brown
- 1992
Citation Context: ...y generating a single set of samples from the distribution q^(k). V. APPLICATION: WORD MORPHOLOGY Word clustering algorithms are useful for many natural language processing tasks. One such algorithm [6], called mutual information clustering, is based upon the construction of simple bigram language models using the maximum likelihood criterion. The algorithm gives a hierarchical binary classification... |
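For flavor, a minimal sketch in the spirit of the mutual information clustering described above: greedily merge the pair of word classes whose merge best preserves the average mutual information of adjacent class bigrams. The corpus and the stopping point (two classes) are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus (illustrative only).
corpus = "the dog runs the cat runs a dog sleeps a cat sleeps".split()

def avg_mutual_info(assign):
    # Average mutual information of adjacent class pairs under the
    # word -> class assignment.
    bigrams = Counter((assign[a], assign[b]) for a, b in zip(corpus, corpus[1:]))
    n = sum(bigrams.values())
    left, right = Counter(), Counter()
    for (c1, c2), k in bigrams.items():
        left[c1] += k
        right[c2] += k
    return sum(k / n * math.log((k / n) / (left[c1] / n * right[c2] / n))
               for (c1, c2), k in bigrams.items())

# Greedy agglomerative merging: start with one class per word, repeatedly
# merge the pair that leaves the highest average mutual information.
assign = {w: w for w in set(corpus)}
while len(set(assign.values())) > 2:
    classes = sorted(set(assign.values()))
    best = max(combinations(classes, 2),
               key=lambda pair: avg_mutual_info(
                   {w: pair[0] if c == pair[1] else c for w, c in assign.items()}))
    assign = {w: best[0] if c == best[1] else c for w, c in assign.items()}
```

Words with identical bigram contexts (here "dog" and "cat") merge with no information loss, so they end up in the same class; the hierarchical binary classification in [6] records the full merge tree rather than a flat partition.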

584 | A Statistical Approach to Machine Translation
- Brown, Cocke, et al.
- 1990
Citation Context: ... binary classification of words that has been used for a variety of purposes, including the construction of decision tree language and parsing models, and sense disambiguation for machine translation [7]. A fundamental shortcoming of the mutual information word clustering algorithm given in [6] is that it takes as fundamental the word spellings themselves. This increases the severity of the problem o... |

512 | An Inequality and Associated Maximization Technique in Statistical Estimation for - Baum - 1970 |

431 |
Generalized iterative scaling for log-linear models
- Darroch, Ratcliff
- 1972
Citation Context: ...that we call Improved Iterative Scaling. It is an improvement of the Generalized Iterative Scaling algorithm of Darroch and Ratcliff [12] in that it does not require that the features sum to a constant. The improved algorithm is easier to implement than the Darroch and Ratcliff algorithm, and can lead to an increase in the rate of conv... |
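The distinction drawn here — no constant feature-sum requirement — can be sketched as follows: an Improved-Iterative-Scaling-style update solves, for each feature, a one-dimensional equation involving the total feature count f#(x), which may vary over configurations. Everything concrete below (space, features, empirical distribution) is invented for illustration.

```python
import math
from itertools import product

# Toy setup: binary pairs, two hypothetical features with unequal totals,
# so Generalized Iterative Scaling's constant-sum requirement fails.
space = list(product([0, 1], repeat=2))
features = [lambda x: x[0], lambda x: float(x[0] and x[1])]
emp = {(1, 1): 0.5, (1, 0): 0.25, (0, 0): 0.25, (0, 1): 0.0}

def model(lam):
    w = {x: math.exp(sum(l * f(x) for l, f in zip(lam, features))) for x in space}
    z = sum(w.values())
    return {x: v / z for x, v in w.items()}

def fsharp(x):
    # Total feature count at configuration x; here it varies (0, 1, or 2).
    return sum(f(x) for f in features)

def iis_delta(p, f, target):
    # Solve sum_x p(x) f(x) exp(delta * f#(x)) = target for delta.
    # The left side is monotone in delta, so bisection suffices.
    def g(d):
        return sum(p[x] * f(x) * math.exp(d * fsharp(x)) for x in space) - target
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

lam = [0.0, 0.0]
for _ in range(200):
    p = model(lam)
    lam = [l + iis_delta(p, f, sum(emp[x] * f(x) for x in space))
           for l, f in zip(lam, features)]
```

When every f#(x) equals the same constant C, the equation has the closed-form solution delta = (1/C) log(target / current), recovering the Darroch-Ratcliff update as a special case.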

431 | A learning algorithm for Boltzmann machines
- Ackley, Hinton, et al.
- 1985
Citation Context: ...ethod. 6.1 Boltzmann machines. There is an immediate resemblance between the parameter estimation problem for the random fields that we have considered and the learning problem for Boltzmann machines [1]. The classical Boltzmann machine is considered to be a random field on a graph G = (E, V) with configuration space Ω = {0,1}^V consisting of all labelings of the vertices by either a zero or a ... |

257 |
I-divergence geometry of probability distributions and minimization problems. The Annals of Probability 3(1
- Csiszár
- 1975
Citation Context: ...rovides a good starting point from which to begin iterative scaling. In fact, we can view this distribution as the result of applying one iteration of an Iterative Proportional Fitting Procedure [5], [9] to project q_{f} onto the linear family of distributions with g-marginals constrained to p̃[g]. Our main result in this section is Proposition 3: Suppose q^(k) is the sequence in D determined by the Im... |

147 | Fundamentals of statistical exponential families with applications in statistical decision theory - Brown - 1986 |

136 | Information geometry and alternating minimization procedures, Statistics and Decisions (supplement 1 - Csiszar, Tusnady - 1984 |

92 | Best-First Model Merging for Hidden Markov Model Induction, TR-94-003, International Computer Science Institute
- Stolcke, Omohundro
- 1994
Citation Context: ... the technique is unable to precisely calculate the reduction in entropy due to splitting a state, and must instead rely on more primitive heuristics. A closely related technique is given in [31]. In [33] a method for building hidden Markov models is presented which is in some sense the opposite approach, in that it starts with a maximally detailed finite-state model and proceeds by incrementally gene... |

79 | Data compression using dynamic markov modelling
- Cormack, Horspool
- 1987
Citation Context: ...res in the field can be overlapping. 6.3 Dynamic Markov coding. Another technique that is similar in some aspects to random field induction is the dynamic Markov coding technique for text compression [14], [5]. To incrementally build a finite state machine for generating strings in some output alphabet, dynamic Markov coding is based on the heuristic that the relative entropy of the finite-state machine mig... |

76 | The power of amnesia
- RON, SINGER, et al.
- 1994
Citation Context: ...v coding, the technique is unable to precisely calculate the reduction in entropy due to splitting a state, and must instead rely on more primitive heuristics. A closely related technique is given in [31]. In [33] a method for building hidden Markov models is presented which is in some sense the opposite approach, in that it starts with a maximally detailed finite-state model and proceeds by increment... |

50 |
Conjugate priors for exponential families
- Diaconis, Ylvisaker
- 1979
Citation Context: ...orated. This could enable a principled approach for deciding when the feature induction is complete. While there is a natural class of conjugate priors for the class of exponential models that we use [14], the problem of incorporating prior knowledge about the set of candidate features is more challenging. APPENDIX I. DUALITY In this Appendix we prove Proposition 4 restated here. Proposition 4: Suppose... |

41 | An iterative Gibbsian technique for reconstruction of m-ary images - Chalmond - 1989 |

29 | Noncausal Gauss Markov random fields: Parameter structure and estimation - Balram, Moura - 1993 |

26 |
A note on approximations of discrete probability distributions
- Brown
- 1959
Citation Context: ...his provides a good starting point from which to begin iterative scaling. In fact, we can view this distribution as the result of applying one iteration of an Iterative Proportional Fitting Procedure [5], [9] to project q_{f} onto the linear family of distributions with g-marginals constrained to p̃[g]. Our main result in this section is Proposition 3: Suppose q^(k) is the sequence in D determined by t... |

21 |
A geometric interpretation of Darroch and Ratcliff’s generalized iterative scaling
- Csiszár
- 1989
Citation Context: ...hat does not make use of the Kuhn-Tucker theorem or other machinery of constrained optimization. Moreover, our proof does not rely on the convergence of alternating I-projection as in Csiszar's proof [10] of the Darroch-Ratcliff procedure. Both the feature selection step and the parameter estimation step require the solution of certain algebraic equations whose coefficients are determined as expectati... |

18 | Optimal spectral structure of reversible stochastic matrices, Monte Carlo methods and the simulation of Markov random - Frigessi, Hwang, et al. - 1992 |

17 | Alternating minimization and Boltzmann machine learning
- Byrne
- 1992
Citation Context: ...mum likelihood problem, the training problem for Boltzmann machines becomes an instance of the general problem addressed by the EM algorithm [21], where iterative scaling is carried out in the M-step [12]. Most often the architecture of a Boltzmann machine is prescribed, and the learning problem is then solved by applying the EM algorithm (which typically involves random sampling and annealing). To ca... |

12 | Estimation and annealing for Gibbsian fields - Younes - 1988 |

9 | Convergence of some partially parallel gibbs samplers with annealing. The Annals of Applied Probability - Ferrari, Frigessi, et al. - 1993 |

7 | Partition Function Estimation of Gibbs Random Field Images Using Monte Carlo Simulation - Potamianos, Goutsias - 1993 |

7 |
Text Compression, Englewood Cliffs
- Bell, Cleary, et al.
- 1990
Citation Context: ...res in the field can be overlapping. 6.3 Dynamic Markov coding. Another technique that is similar in some aspects to random field induction is the dynamic Markov coding technique for text compression [14], [5]. To incrementally build a finite state machine for generating strings in some output alphabet, dynamic Markov coding is based on the heuristic that the relative entropy of the finite-state machine mig... |

5 | A variational method for estimating the parameters of MRF from complete or incomplete data - Almeida, Gidas - 1993 |

3 | Random fields and inverse problems in imaging, in École d'été de Saint-Flour - Geman - 1990 |

1 | Constrained Monte Carlo maximum likelihood for dependent data (with discussion) - Thomson - 1992 |

1 |
Automatic word classification using features of spellings
- Lafferty, Mercer
- 1993
Citation Context: ...y. A description of how the resulting features were used to improve mutual information clustering is given in [20], and is beyond the scope of the present paper; we refer the reader to [6], [20] for a more detailed treatment of this topic. In Section 5.1 we formulate the problem in terms of the notation and resul... |

1 |
Parsing as statistical pattern recognition," Doctoral dissertation
- Magerman
- 1994
Citation Context: ...hood criterion. The algorithm gives a hierarchical binary classification of words that has been used for a variety of purposes, including the construction of decision tree language and parsing models [28], and sense disambiguation for machine translation [11]. A fundamental shortcoming of the mutual information word clustering algorithm given in [10] is that it takes as fundamental the word spellings ... |

1 |
Higher order Boltzmann machines, in Neural Networks for Computing
- Sejnowski
- 1986
Citation Context: ...y-valued features of the form f_v(ω) = ω_{v_1} ω_{v_2} ··· ω_{v_n} for v = (v_1, ..., v_n) a path in G = (E, V), we construct models that are essentially "higher-order" Boltzmann machines [32]. With candidate features of this form our algorithm incrementally constructs a Boltzmann machine with no hidden units. If only a subset of the vertices are labelled in the training samples, then our ... |