Results 1-10 of 52
S.: Hidden Markov Model Induction by Bayesian Model Merging
 Advances in Neural Information Processing Systems 5
, 1993
Abstract

Cited by 135 (2 self)
This paper describes a technique for learning both the number of states and the topology of Hidden Markov Models from examples. The induction process starts with the most specific model consistent with the training data and generalizes by successively merging states. Both the choice of states to merge and the stopping criterion are guided by the Bayesian posterior probability. We compare our algorithm with the Baum-Welch method of estimating fixed-size models, and find that it can induce minimal HMMs from data in cases where fixed estimation does not converge or requires redundant parameters to converge.
Inducing probabilistic grammars by bayesian model merging
 In: Int. Conf. Grammatical Inference. URL: citeseer.nj.nec.com/stolcke94inducing.html
, 1994
Abstract

Cited by 130 (0 self)
We describe a framework for inducing probabilistic grammars from corpora of positive samples. First, samples are incorporated by adding ad hoc rules to a working grammar; subsequently, elements of the model (such as states or nonterminals) are merged to achieve generalization and a more compact representation. The choice of what to merge and when to stop is governed by the Bayesian posterior probability of the grammar given the data, which formalizes a tradeoff between a close fit to the data and a default preference for simpler models (‘Occam’s Razor’). The general scheme is illustrated using three types of probabilistic grammars: Hidden Markov models, class-based n-grams, and stochastic context-free grammars.
Part-of-Speech Tagging and Partial Parsing
 Corpus-Based Methods in Language and Speech
, 1996
Abstract

Cited by 96 (0 self)
m we can carve off next. 'Partial parsing' is a cover term for a range of different techniques for recovering some but not all of the information contained in a traditional syntactic analysis. Partial parsing techniques, like tagging techniques, aim for reliability and robustness in the face of the vagaries of natural text, by sacrificing completeness of analysis and accepting a low but nonzero error rate. 1 Tagging The earliest taggers [35, 51] had large sets of hand-constructed rules for assigning tags on the basis of words' character patterns and on the basis of the tags assigned to preceding or following words, but they had only small lexica, primarily for exceptions to the rules. TAGGIT [35] was used to generate an initial tagging of the Brown corpus, which was then hand-edited. (Thus it provided the data that has since been used to train other taggers [20].) The tagger described by Garside [56, 34], CLAWS, was a probabilistic version of TAGGIT, and the DeRose tagger improved on
Best-first Model Merging for Hidden Markov Model Induction
, 1994
Abstract

Cited by 93 (7 self)
This report describes a new technique for inducing the structure of Hidden Markov Models from data which is based on the general 'model merging' strategy (Omohundro 1992). The process begins with a maximum likelihood HMM that directly encodes the training data. Successively more general models are produced by merging HMM states. A Bayesian posterior probability criterion is used to determine which states to merge and when to stop generalizing. The procedure may be considered a heuristic search for the HMM structure with the highest posterior probability. We discuss a variety of possible priors for HMMs, as well as a number of approximations which improve the computational efficiency of the algorithm. We studied three applications to evaluate the procedure. The first compares the merging algorithm with the standard Baum-Welch approach in inducing simple finite-state languages from small, positive-only training samples. We found that the merging procedure is more robust and accurate, part...
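The merging loop described in this abstract can be illustrated with a minimal Python sketch. It is not the report's implementation: the count-based likelihood (each sample is assumed to follow a single state path), the prior (a fixed `penalty` per nonzero transition or emission), and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter
from itertools import combinations

START, END = "<s>", "</s>"  # hypothetical non-emitting boundary states

def initial_model(samples):
    """Most specific model: one fresh state per symbol occurrence."""
    trans, emit = Counter(), Counter()
    next_id = 0
    for sample in samples:
        prev = START
        for sym in sample:
            trans[(prev, next_id)] += 1
            emit[(next_id, sym)] += 1
            prev, next_id = next_id, next_id + 1
        trans[(prev, END)] += 1
    return trans, emit

def merge(trans, emit, a, b):
    """Return new counts with state b folded into state a."""
    def rename(q):
        return a if q == b else q
    new_trans, new_emit = Counter(), Counter()
    for (p, q), c in trans.items():
        new_trans[(rename(p), rename(q))] += c
    for (q, sym), c in emit.items():
        new_emit[(rename(q), sym)] += c
    return new_trans, new_emit

def log_likelihood(counts):
    """Sum of c * log(c / total count leaving the same source state)."""
    totals = Counter()
    for (q, _), c in counts.items():
        totals[q] += c
    return sum(c * math.log(c / totals[q]) for (q, _), c in counts.items())

def log_posterior(trans, emit, penalty):
    """Assumed prior: each nonzero parameter costs `penalty` nats."""
    n_params = len(trans) + len(emit)
    return log_likelihood(trans) + log_likelihood(emit) - penalty * n_params

def merge_states(samples, penalty=1.0):
    """Greedily apply the best merge until no merge raises the score."""
    trans, emit = initial_model(samples)
    best = log_posterior(trans, emit, penalty)
    while True:
        states = sorted({q for q, _ in emit})
        candidates = []
        for a, b in combinations(states, 2):
            t, e = merge(trans, emit, a, b)
            candidates.append((log_posterior(t, e, penalty), t, e))
        if not candidates:
            break
        score, t, e = max(candidates, key=lambda cand: cand[0])
        if score <= best:  # stopping criterion: no merge improves the score
            break
        trans, emit, best = t, e, score
    return trans, emit

if __name__ == "__main__":
    trans, emit = merge_states(["ab", "ab", "ab"])
    print(dict(trans))
    print(dict(emit))
```

On three copies of the string "ab", the six initial states collapse into one state emitting "a" and one emitting "b"; merging those two as well would lower the score, so the loop stops there. Trying every pair of states at every step is quadratic in the number of states, which is one reason approximations matter for larger samples.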
Statistical methods and linguistics
 THE BALANCING ACT: COMBINING SYMBOLIC AND STATISTICAL APPROACHES TO LANGUAGE
, 1996
Abstract

Cited by 79 (0 self)
In the space of the last ten years, statistical methods have gone from being virtually unknown in computational linguistics to being a fundamental given. In 1996, no one can profess to be a computational linguist without a passing knowledge of statistical methods. HMM's are as de rigueur as LR tables, and anyone who cannot at least use the terminology persuasively risks being mistaken for kitchen help at the ACL banquet. More seriously, statistical techniques have brought significant advances in broad-coverage language processing. Statistical methods have made real progress possible on a number of issues that had previously stymied attempts to liberate systems from toy domains: issues that include disambiguation, error correction, and the induction of the sheer volume of information requisite for handling unrestricted text. And the sense of progress has generated a great deal of enthusiasm for statistical methods in computational linguistics. However, this enthusiasm has not been catching in linguistics proper. It is always dangerous to generalize about linguists, but I think it is fair to say
Language Acquisition in the Absence of Explicit Negative Evidence: How Important is Starting Small?
 COGNITION
, 1999
Abstract

Cited by 68 (6 self)
It is commonly assumed that innate linguistic constraints are necessary to learn a natural language, based on the apparent lack of explicit negative evidence provided to children and on Gold's proof that, under assumptions of virtually arbitrary positive presentation, most interesting classes of languages are not learnable. However, Gold's results do not apply under the rather common assumption that language presentation may be modeled as a stochastic process. Indeed, Elman (Elman, J.L., 1993. Learning and development in neural networks: the importance of starting small. Cognition 48, 71-99) demonstrated that a simple recurrent connectionist network could learn an artificial grammar with some of the complexities of English, including embedded clauses, based on performing a word prediction task within a stochastic environment. However, the network was successful only when either embedded sentences were initially withheld and only later introduced gradually, or when the network itself was given initially limited memory which only gradually improved. This finding has been taken as support for Newport's 'less is more' proposal, that child language acquisition may be aided rather than hindered by limited cognitive resources. The current article reports on connectionist simulations which indicate, to the contrary, that starting with simplified inputs or limited memory is not necessary in training recurrent networks to learn pseudo-natural languages; in fact, such restrictions hinder acquisition as the languages are made more English-like by the introduction of semantic as well as syntactic constraints. We suggest that, under a statistical model of the language environment, Gold's theorem and the possible lack of explicit negative evidence do not implicate i...
Probabilistic Syntax
, 2002
Abstract

Cited by 34 (1 self)
istic methods for syntax, just as for a long time McCarthy and Hayes (1969) discouraged exploration of probabilistic methods in Artificial Intelligence. Among his arguments were that: (i) Probabilistic models wrongly mix in world knowledge (New York occurs more in text than Dayton, Ohio, but for no linguistic reason), (ii) Probabilistic models don't model grammaticality (neither Colorless green ideas sleep furiously nor Furiously sleep ideas green colorless have previously been uttered - and hence must be estimated to have probability zero, Chomsky wrongly assumes - but the former is grammatical while the latter is not), and (iii) Use of probabilities does not meet the goal of describing the mind-internal I-language as opposed to the observed-in-the-world E-language. This chapter is not meant to be a detailed critique of Chomsky's arguments - Abney (1996) provides a survey and a rebuttal, and Pereira (2000) has further useful discussion - but some of these concerns are still importa
The Application Of Algorithmic Probability to Problems in Artificial Intelligence
 in Uncertainty in Artificial Intelligence, Kanal, L.N. and Lemmer, J.F. (Eds), Elsevier Science Publishers B.V
, 1986
Abstract

Cited by 30 (5 self)
INTRODUCTION We will cover two topics. First, Algorithmic Probability - the motivation for defining it, how it overcomes difficulties in other formulations of probability, some of its characteristic properties and successful applications. Second, we will apply it to problems in A.I. - where it promises to give near optimum search procedures for two very broad classes of problems. A strong motivation for revising classical concepts of probability has come from the analysis of human problem solving. When working on a difficult problem, a person is in a maze in which he must make choices of possible courses of action. If the problem is a familiar one, the choices will all be easy. If it is not familiar, there can be much uncertainty in each choice, but choices must somehow be made. One basis for choice might be the probability of each choice leading to a quick solution - this probability being based on experience in this problem and in problems like it. A good reason for using proba