Optimization with EM and Expectation-Conjugate-Gradient
2003
Cited by 31 (1 self)
We show a close relationship between the Expectation-Maximization (EM) algorithm and direct optimization algorithms such as gradient-based methods for parameter learning.
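As a rough illustration of the kind of relationship the abstract describes (this sketch is not taken from the paper), consider a 1-D Gaussian mixture with equal, fixed mixing weights and a fixed shared variance. Under those assumptions, the EM update for each mean coincides with a gradient step on the log-likelihood, scaled per component by sigma^2 divided by the total responsibility. All function names below are hypothetical.

```python
import math

def responsibilities(xs, mus, sigma):
    """E-step: posterior component probabilities for each point (equal weights assumed)."""
    table = []
    for x in xs:
        ws = [math.exp(-0.5 * ((x - m) / sigma) ** 2) for m in mus]
        total = sum(ws)
        table.append([w / total for w in ws])
    return table

def em_step(xs, mus, sigma):
    """M-step for the means only: responsibility-weighted averages of the data."""
    r = responsibilities(xs, mus, sigma)
    n = len(xs)
    return [
        sum(r[i][k] * xs[i] for i in range(n)) / sum(r[i][k] for i in range(n))
        for k in range(len(mus))
    ]

def loglik_grad(xs, mus, sigma):
    """Gradient of the mixture log-likelihood with respect to each mean."""
    r = responsibilities(xs, mus, sigma)
    n = len(xs)
    return [
        sum(r[i][k] * (xs[i] - mus[k]) for i in range(n)) / sigma ** 2
        for k in range(len(mus))
    ]

def scaled_gradient_step(xs, mus, sigma):
    """Gradient step with per-component step size sigma^2 / (total responsibility)."""
    r = responsibilities(xs, mus, sigma)
    g = loglik_grad(xs, mus, sigma)
    n = len(xs)
    return [
        mus[k] + (sigma ** 2 / sum(r[i][k] for i in range(n))) * g[k]
        for k in range(len(mus))
    ]

# Illustrative data only: both routes should produce the same new means.
xs = [-2.1, -1.9, -2.0, 1.8, 2.2, 2.0]
print(em_step(xs, [-0.5, 0.5], 1.0))
print(scaled_gradient_step(xs, [-0.5, 0.5], 1.0))
```

The identity holds only for this simplified mixture; with learned weights or variances the scaling becomes a matrix rather than a scalar per component.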
Exploiting Syntactic Structure for Natural Language Modeling
2000
Cited by 27 (0 self)
The thesis presents an attempt at using the syntactic structure in natural language for improved language models for speech recognition. The structured language model merges techniques in automatic parsing and language modeling using an original probabilistic parameterization of a shift-reduce parser. A maximum likelihood reestimation procedure belonging to the class of expectation-maximization algorithms is employed for training the model. Experiments on the Wall Street Journal, Switchboard and Broadcast News corpora show improvement in both perplexity and word error rate -- word lattice rescoring -- over the standard 3-gram language model. The significance of the thesis lies in presenting an original approach to language modeling that uses the hierarchical -- syntactic -- structure in natural language to improve on current 3-gram modeling techniques for large vocabulary speech recognition.
Adaptive Overrelaxed Bound Optimization Methods
In Proceedings of the International Conference on Machine Learning (ICML), 2003
Cited by 25 (0 self)
We study a class of overrelaxed bound optimization algorithms, and their relationship to standard bound optimizers, such as Expectation-Maximization, Iterative Scaling, CCCP and Non-Negative Matrix Factorization. We provide a theoretical analysis of the convergence properties of these optimizers and identify analytic conditions under which they are expected to outperform the standard versions. Based on this analysis, we propose a novel, simple adaptive overrelaxed scheme for practical optimization and report empirical results on several synthetic and real-world data sets showing that these new adaptive methods exhibit superior performance (in certain cases by several orders of magnitude) compared to their traditional counterparts. Our "drop-in" extensions are simple to implement, apply to a wide variety of algorithms, almost always give a substantial speedup, and do not require any theoretical analysis of the underlying algorithm.
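A minimal sketch of what an adaptive overrelaxed wrapper around a bound optimizer might look like. This is an illustration under stated assumptions, not the paper's exact algorithm: `step` (the plain update map), `objective`, the growth factor, and the reset-to-plain-step rule are all hypothetical choices. The fallback relies on the usual bound-optimizer guarantee that a plain step never decreases the objective.

```python
def adaptive_overrelaxed(step, objective, theta, iters=50, growth=1.1):
    """Maximize `objective` by overrelaxing the update map `step`.

    theta is a list of floats. A step size eta = 1 reproduces the plain update;
    eta > 1 extrapolates past it along the same direction.
    """
    eta = 1.0
    best = objective(theta)
    for _ in range(iters):
        target = step(theta)  # plain bound-optimizer update
        proposal = [t + eta * (m - t) for t, m in zip(theta, target)]
        value = objective(proposal)
        if value > best:
            # Overrelaxed step improved the objective: accept and be bolder next time.
            theta, best = proposal, value
            eta *= growth
        else:
            # Overrelaxed step failed: fall back to the plain update and reset eta.
            theta, best = target, objective(target)
            eta = 1.0
    return theta, best

# Toy problem (assumed for illustration): a deliberately slow, monotone update map.
def toy_objective(th):
    return -(th[0] - 3.0) ** 2

def toy_step(th):
    return [th[0] + 0.1 * (3.0 - th[0])]

plain, _ = adaptive_overrelaxed(toy_step, toy_objective, [0.0], growth=1.0)
fast, _ = adaptive_overrelaxed(toy_step, toy_objective, [0.0], growth=1.1)
print(abs(plain[0] - 3.0), abs(fast[0] - 3.0))
```

Setting `growth=1.0` keeps eta at 1 and so recovers the unmodified optimizer, which makes the wrapper easy to compare against its baseline.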
On the Effectiveness of the Skew Divergence for Statistical Language Analysis
In Artificial Intelligence and Statistics, 2001
Cited by 22 (0 self)
Estimating word co-occurrence probabilities is a problem underlying many applications in statistical natural language processing.
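The abstract itself does not define the skew divergence named in the title. As commonly stated, it is the KL divergence from r to a mixture of q and r: s_alpha(q, r) = D(r || alpha*q + (1 - alpha)*r). The sketch below assumes that definition; the function names and the default alpha are illustrative assumptions. Because the mixture keeps a (1 - alpha) share of r, the value stays finite even where q assigns zero probability to an event that r supports.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def skew_divergence(q, r, alpha=0.99):
    """D(r || alpha*q + (1 - alpha)*r), assuming 0 <= alpha < 1."""
    mix = [alpha * qi + (1 - alpha) * ri for qi, ri in zip(q, r)]
    return kl(r, mix)

# Illustrative distributions only: q has a zero where r has mass.
q = [0.5, 0.5, 0.0]
r = [0.2, 0.3, 0.5]
print(skew_divergence(q, r))
```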
Formal grammar and information theory: Together again?
Philosophical Transactions of the Royal Society, 2000
Cited by 22 (0 self)
In the last 40 years, research on models of spoken and written language has been split between two seemingly irreconcilable traditions: formal linguistics in the Chomsky tradition, and information theory in the Shannon tradition. Zellig Harris had advocated a close alliance between grammatical and information-theoretic principles in the analysis of natural language, and early formal-language theory provided another strong link between information theory and linguistics. Nevertheless, in most research on language and computation, grammatical and information-theoretic approaches had moved far apart. Today, after many years on the defensive, the information-theoretic approach has gained new strength and achieved practical successes in speech recognition, information retrieval, and, increasingly, in language analysis and machine translation. The exponential increase in the speed and storage capacity of computers is the proximate cause of these engineering successes, allowing the automatic estimation of the parameters of probabilistic models of language by counting occurrences of linguistic events in very large bodies of text and speech. However, I will argue that information-theoretic and computational ideas are also playing an increasing role in the scientific understanding of language, and will help bring together formal-linguistic and information-theoretic perspectives.
Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization
2000
Cited by 20 (0 self)
The project pursued in this paper is to develop from first information-geometric principles a general method for learning the similarity between text documents. Each individual document is modeled as a memoryless information source. Based on a latent class decomposition of the term-document matrix, a low-dimensional (curved) multinomial subfamily is learned. From this model a canonical similarity function -- known as the Fisher kernel -- is derived. Our approach can be applied to unsupervised and supervised learning problems alike. This in particular covers interesting cases where both labeled and unlabeled data are available. Experiments in automated indexing and text categorization verify the advantages of the proposed method.
On the convergence of bound optimization algorithms
In Proc. 19th Conference on Uncertainty in Artificial Intelligence (UAI '03), 2003
Cited by 15 (0 self)
Many practitioners who use EM and related algorithms complain that they are sometimes slow. When does this happen, and what can be done about it? In this paper, we study the general class of bound optimization algorithms – including EM, Iterative Scaling, Non-negative Matrix Factorization, CCCP – and their relationship to direct optimization algorithms such as gradient-based methods for parameter learning. We derive a general relationship between the updates performed by bound optimization methods and those of gradient and second-order methods and identify analytic conditions under which bound optimization algorithms exhibit quasi-Newton behavior, and under which they possess poor, first-order convergence. Based on this analysis, we consider several specific algorithms, interpret and analyze their convergence properties and provide some recipes for preprocessing input to these algorithms to yield faster convergence behavior. We report empirical results supporting our analysis and showing that simple data preprocessing can result in dramatically improved performance of bound optimizers in practice.

1 Bound Optimization Algorithms

Many problems in machine learning and pattern recognition ultimately reduce to the optimization of a scalar valued function L(Θ) of a free parameter vector Θ. For example, in supervised and unsupervised probabilistic modeling the objective function may be the (conditional) data likelihood or the posterior over parameters. In discriminative learning we may use a classification or regression score; in reinforcement learning an average discounted reward. Optimization may also arise during inference; for example we may want to reduce the cross entropy between two distributions or minimize a function such as the Bethe free energy. Bound optimization (BO) algorithms take advantage of the fact that many objective functions arising in practice have a
Generative Models for Cold-Start Recommendations
In Proceedings of the 2001 SIGIR Workshop on Recommender Systems, 2001
Cited by 13 (4 self)
Systems for automatically recommending items (e.g., movies, products, or information) to users are becoming increasingly important in e-commerce applications, digital libraries, and other domains where mass personalization is highly valued. Such recommender systems typically base their suggestions on (1) collaborative data encoding which users like which items, and/or (2) content data describing item features and user demographics. Systems that rely solely on collaborative data fail when operating from a cold start, that is, when recommending items (e.g., first-run movies) that no member of the community has yet seen. We develop several generative probabilistic models that circumvent the cold-start problem by mixing content data with collaborative data in a sound statistical manner. We evaluate the algorithms using MovieLens movie ratings data, augmented with actor and director information from the Internet Movie Database. We find that maximum likelihood learning with the expectation maximization (EM) algorithm and variants tends to overfit complex models that are initialized randomly. However, by seeding parameters of the complex models with parameters learned in simpler models, we obtain greatly improved performance. We explore both methods that exploit a single type of content data (e.g., actors only) and methods that leverage multiple types of content data (e.g., both actors and directors) simultaneously.
Distributed Latent Variable Models of Lexical Co-occurrences
In Proceedings of the International Workshop on Artificial Intelligence and Statistics, 2005
Cited by 10 (0 self)
Low-dimensional representations for lexical co-occurrence data have become increasingly important in alleviating the sparse data problem inherent in natural language processing tasks. This work presents a distributed latent variable model for inducing these low-dimensional representations. The model takes
Probabilistic Finite-State Machines - Part I
Cited by 9 (1 self)
Probabilistic finite-state machines are used today in a variety of areas in pattern recognition, or in fields to which pattern recognition is linked: computational linguistics, machine learning, time series analysis, circuit testing, computational biology, speech recognition and machine translation are some of them. In part I of this paper we survey these generative objects and study their definitions and properties. In part II, we will study the relation of probabilistic finite-state automata to other well-known devices that generate strings, such as hidden Markov models and n-grams, and provide theorems, algorithms and properties that represent the current state of the art for these objects.

