## Inducing Features of Random Fields (1997)

Venue: | IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE |

Citations: | 659 - 10 self |

### Citations

11761 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...e incremental step of the algorithm in terms of an auxiliary function which bounds from below the log-likelihood objective function. This technique is the standard means of analyzing the EM algorithm [13], but it has not previously been applied to iterative scaling. Our analysis of iterative scaling is different and simpler than previous treatments. In particular, in contrast to Csiszar's proof of the... |

5843 | Classification and Regression Trees - Breiman, Friedman, et al. |

5055 | Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images
- Geman, Geman
- 1984
Citation Context ...er than discuss the details of Monte Carlo techniques for this problem we refer to the extensive literature on this topic. We have obtained good results using the standard technique of Gibbs sampling [17] for the problem we describe in Section 5. IV. PARAMETER ESTIMATION In this section we present an algorithm for selecting the parameters associated with the features of a random field. The algorithm i... |

1342 | A Maximum Entropy Approach to Natural Language Processing
- Berger, Pietra, et al.
- 1996
Citation Context ... functions $f(x, y)$, the same general approach is applicable. The feature induction method for conditional exponential models is demonstrated for several problems in statistical machine translation in [3], where it is presented in terms of the principle of maximum entropy. B. Decision trees Our feature induction paradigm also bears some resemblance to various methods for growing classification and reg... |

966 | Class-based n-gram models of natural language
- Brown, deSouza, et al.
- 1992
Citation Context ...y generating a single set of samples from the distribution $q^{(k)}$. V. APPLICATION: WORD MORPHOLOGY Word clustering algorithms are useful for many natural language processing tasks. One such algorithm [6], called mutual information clustering, is based upon the construction of simple bigram language models using the maximum likelihood criterion. The algorithm gives a hierarchical binary classification... |

713 | A statistical approach to machine translation
- Brown, Cocke, et al.
- 1990
Citation Context ... binary classification of words that has been used for a variety of purposes, including the construction of decision tree language and parsing models, and sense disambiguation for machine translation [7]. A fundamental shortcoming of the mutual information word clustering algorithm given in [6] is that it takes as fundamental the word spellings themselves. This increases the severity of the problem o... |

625 | An inequality and association maximization technique in statistical estimation for probabilistic functions of Markov processes. - Baum - 1972 |

576 | A learning algorithm for Boltzmann machines
- Ackley, Hinton, et al.
- 1985
Citation Context ...ethod. 6.1 Boltzmann machines. There is an immediate resemblance between the parameter estimation problem for the random fields that we have considered and the learning problem for Boltzmann machines [1]. The classical Boltzmann machine is considered to be a random field on a graph $G = (E, V)$ with configuration space $\Omega = \{0, 1\}^V$ consisting of all labelings of the vertices by either a zero or a ... |

506 | Generalized iterative scaling for loglinear models
- Darroch, Ratcliff
- 1972
Citation Context ...TERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 19, NO. 4, APRIL 1997 that we call Improved Iterative Scaling. It is an improvement of the Generalized Iterative Scaling algorithm of Darroch and Ratcliff [12] in that it does not require that the features sum to a constant. The improved algorithm is easier to implement than the Darroch and Ratcliff algorithm, and can lead to an increase in the rate of conv... |

372 | I-divergence geometry of probability distributions and minimization problems
- Csiszar
- 1975
Citation Context ...rovides a good starting point from which to begin iterative scaling. In fact, we can view this distribution as the result of applying one iteration of an Iterative Proportional Fitting Procedure [5], [9] to project $q_{\alpha g}$ onto the linear family of distributions with $g$-marginals constrained to $\tilde{p}[g]$. Our main result in this section is Proposition 3: Suppose $q^{(k)}$ is the sequence in D determined by the Im... |

256 | Fundamentals of Statistical Exponential Families - Brown - 1986 |

191 | Information geometry and alternating minimization procedures. Statistics and Decisions, Supp. - Csiszar, Tusnady - 1984 |

101 | Best-first model merging for Hidden Markov Model induction
- Stolcke, Omohundro
- 1993
Citation Context ... the technique is unable to precisely calculate the reduction in entropy due to splitting a state, and must instead rely on more primitive heuristics. A closely related technique is given in [31]. In [33] a method for building hidden Markov models is presented which is in some sense the opposite approach, in that it starts with a maximally detailed finite-state model and proceeds by incrementally gene... |

94 | Data compression using dynamic Markov modeling
- Cormack, Horspool
- 1987
Citation Context ...res in the field can be overlapping. 6.3 Dynamic Markov coding. Another technique that is similar in some aspects to random field induction is the dynamic Markov coding technique for text compression [14,5]. To incrementally build a finite state machine for generating strings in some output alphabet, dynamic Markov coding is based on the heuristic that the relative entropy of the finite-state machine mig... |

89 | The Power of Amnesia
- Ron, Singer, Tishby
- 1994
Citation Context ...v coding, the technique is unable to precisely calculate the reduction in entropy due to splitting a state, and must instead rely on more primitive heuristics. A closely related technique is given in [31]. In [33] a method for building hidden Markov models is presented which is in some sense the opposite approach, in that it starts with a maximally detailed finite-state model and proceeds by increment... |

86 | Conjugate Priors for Exponential Families
- Diaconis, Ylvisaker
- 1979
Citation Context ...orated. This could enable a principled approach for deciding when the feature induction is complete. While there is a natural class of conjugate priors for the class of exponential models that we use [14], the problem of incorporating prior knowledge about the set of candidate features is more challenging. APPENDIX I. DUALITY In this Appendix we prove Proposition 4 restated here. Proposition 4: Suppose... |

52 | An Iterative Gibbsian Technique for Reconstruction of M-ary Images - Chalmond - 1989 |

35 | Noncausal Gauss-Markov random fields: parameter structure and estimation - Balram, Moura - 1993 |

26 | A geometric interpretation of Darroch and Ratcliff’s generalized iterative scaling
- Csiszár
- 1989
Citation Context ...hat does not make use of the Kuhn-Tucker theorem or other machinery of constrained optimization. Moreover, our proof does not rely on the convergence of alternating I-projection as in Csiszar's proof [10] of the Darroch-Ratcliff procedure. Both the feature selection step and the parameter estimation step require the solution of certain algebraic equations whose coefficients are determined as expectati... |

25 | A Note on Approximations to Discrete Probability Distributions. Information and Control
- Brown
- 1959
Citation Context ...his provides a good starting point from which to begin iterative scaling. In fact, we can view this distribution as the result of applying one iteration of an Iterative Proportional Fitting Procedure [5], [9] to project $q_{\alpha g}$ onto the linear family of distributions with $g$-marginals constrained to $\tilde{p}[g]$. Our main result in this section is Proposition 3: Suppose $q^{(k)}$ is the sequence in D determined by t... |

23 | Optimal spectral structure of reversible stochastic matrices, Monte Carlo methods and the simulation of Markov random - Frigessi, Hwang, et al. - 1992 |

18 | Alternating minimization and Boltzmann machine learning
- Byrne
- 1992
Citation Context ...mum likelihood problem, the training problem for Boltzmann machines becomes an instance of the general problem addressed by the EM algorithm [21], where iterative scaling is carried out in the M-step [12]. Most often the architecture of a Boltzmann machine is prescribed, and the learning problem is then solved by applying the EM algorithm (which typically involves random sampling and annealing). To ca... |

16 | Estimation and annealing for Gibbsian fields - Younes - 1988 |

13 | Class-based n-gram models of natural language
- Brown, Pietra, et al.
- 1992
Citation Context ...ating a single set of samples from the distribution $q^{(k)}$. V. APPLICATION: WORD MORPHOLOGY Word clustering algorithms are useful for many natural language processing tasks. One such algorithm [6], called mutual information clustering, is based upon the construction of simple bigram language models using the maximum likelihood criterion. The algorithm gives a hierarchical binary classification... |

11 | A variational method for estimating the parameters of MRF from complete or incomplete data - Almeida, Gidas - 1993 |

8 | Convergence of some partially parallel Gibbs samplers with annealing - Ferrari, Frigessi, et al. - 1993 |

7 | Partition Function Estimation of Gibbs Random Field Images Using Monte Carlo Simulations - Potamianos, Goutsias - 1993 |

7 | Text Compression. Englewood Cliffs
- Bell, Cleary, et al.
- 1990
Citation Context ...res in the field can be overlapping. 6.3 Dynamic Markov coding. Another technique that is similar in some aspects to random field induction is the dynamic Markov coding technique for text compression [14,5]. To incrementally build a finite state machine for generating strings in some output alphabet, dynamic Markov coding is based on the heuristic that the relative entropy of the finite-state machine mig... |

3 | Random fields and inverse problems in imaging, in École d'été de Saint-Flour - Geman - 1990 |

1 | Constrained Monte Carlo maximum likelihood for dependent data (with discussion) - Thomson - 1992 |

1 | Automatic word classification using features of spellings
- Lafferty, Mercer
- 1993
Citation Context ...y. A description of how the resulting features were used to improve mutual information clustering is given in [20], and is beyond the scope of the present paper; we refer the reader to [6], [20] for a more detailed treatment of this topic. In Section 5.1 we formulate the problem in terms of the notation and resul... |

1 | Parsing as statistical pattern recognition, Doctoral dissertation
- Magerman
- 1994
Citation Context ...hood criterion. The algorithm gives a hierarchical binary classification of words that has been used for a variety of purposes, including the construction of decision tree language and parsing models [28], and sense disambiguation for machine translation [11]. A fundamental shortcoming of the mutual information word clustering algorithm given in [10] is that it takes as fundamental the word spellings ... |

1 | Higher order Boltzmann machines, in Neural Networks for Computing
- Sejnowski
- 1986
Citation Context ...yvalued features of the form $f_v(\omega) = \omega_{v_1} \omega_{v_2} \cdots \omega_{v_n}$ for $v = (v_1, \ldots, v_n)$ a path in $G = (E, V)$, we construct models that are essentially "higher-order" Boltzmann machines [32]. With candidate features of this form our algorithm incrementally constructs a Boltzmann machine with no hidden units. If only a subset of the vertices are labelled in the training samples, then our ... |