Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems
 Proceedings of the IEEE
, 1998
Cited by 248 (11 self)
this paper. Let us place it within the neural network perspective, and particularly that of learning. The area of neural networks has greatly benefited from its unique position at the crossroads of several diverse scientific and engineering disciplines including statistics and probability theory, physics, biology, control and signal processing, information theory, complexity theory, and psychology (see [45]). Neural networks have provided a fertile soil for the infusion (and occasionally confusion) of ideas, as well as a meeting ground for comparing viewpoints, sharing tools, and renovating approaches. It is within the ill-defined boundaries of the field of neural networks that researchers in traditionally distant fields have come to the realization that they have been attacking fundamentally similar optimization problems.
Variational learning for switching state-space models
 Neural Computation
, 1998
Cited by 142 (6 self)
We introduce a new statistical model for time series which iteratively segments data into regimes with approximately linear dynamics and learns the parameters of each of these linear regimes. This model combines and generalizes two of the most widely used stochastic time series models (hidden Markov models and linear dynamical systems) and is closely related to models that are widely used in the control and econometrics literatures. It can also be derived by extending the mixture of experts neural network (Jacobs et al., 1991) to its fully dynamical version, in which both expert and gating networks are recurrent. Inferring the posterior probabilities of the hidden states of this model is computationally intractable, and therefore the exact Expectation Maximization (EM) algorithm cannot be applied. However, we present a variational approximation that maximizes a lower bound on the log likelihood and makes use of both the forward-backward recursions for hidden Markov models and the Kalman filter recursions for linear dynamical systems. We tested the algorithm both on artificial data sets and on a natural data set of respiration force from a patient with sleep apnea. The results suggest that variational approximations are a viable method for inference and learning in switching state-space models.
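The lower bound mentioned in this abstract has a standard generic form, sketched here for orientation. The notation is an assumption, not taken from the listing: Y_t are observations, S_t discrete switch variables, X_t continuous states, θ the parameters, and Q any tractable distribution over the hidden variables.

```latex
\log P(\{Y_t\} \mid \theta)
  \;\ge\; \sum_{\{S_t\}} \int Q(\{S_t, X_t\})
      \log \frac{P(\{S_t, X_t, Y_t\} \mid \theta)}{Q(\{S_t, X_t\})} \, d\{X_t\}
  \;=\; \mathcal{F}(Q, \theta)
```

The inequality follows from Jensen's inequality and becomes an equality when Q equals the true posterior. Restricting Q to a factorized family is what would let the hidden Markov model and Kalman recursions be run separately, under this reading of the abstract.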
Using Unlabeled Data to Improve Text Classification
, 2001
Cited by 49 (0 self)
One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. By assuming that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum a posteriori models and classifiers from all the data, labeled and unlabeled. These generative models do not capture all the intricacies of text; however, on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse. Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling subtopic class structure, and by modeling supertopic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve classification performance. The second problem is that even with a representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-probability local maxima.
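A minimal sketch of the basic approach this abstract describes: EM over labeled and unlabeled documents under a generative model. The choice of a multinomial naive Bayes model with Laplace smoothing is an assumption made for illustration, as are all function names and the representation of documents as lists of integer word ids. It is not the dissertation's implementation.

```python
import math

def train_nb(docs, resp, n_classes, vocab_size, alpha=1.0):
    """M-step: smoothed estimates of class priors and per-class word
    distributions from (possibly soft) class responsibilities."""
    prior = [alpha] * n_classes
    word = [[alpha] * vocab_size for _ in range(n_classes)]
    for doc, r in zip(docs, resp):
        for c in range(n_classes):
            prior[c] += r[c]
            for w in doc:
                word[c][w] += r[c]
    z = sum(prior)
    log_prior = [math.log(p / z) for p in prior]
    log_word = [[math.log(x / sum(row)) for x in row] for row in word]
    return log_prior, log_word

def posterior(doc, log_prior, log_word):
    """E-step for one document: P(class | doc) under the current model."""
    scores = [lp + sum(lw[w] for w in doc)
              for lp, lw in zip(log_prior, log_word)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def semi_supervised_em(labeled, labels, unlabeled,
                       n_classes, vocab_size, n_iter=10):
    """Initialize from the labeled documents only, then alternate
    E-steps on the unlabeled documents with M-steps on all data."""
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)]
            for y in labels]
    model = train_nb(labeled, hard, n_classes, vocab_size)
    for _ in range(n_iter):
        soft = [posterior(d, *model) for d in unlabeled]
        model = train_nb(labeled + unlabeled, hard + soft,
                         n_classes, vocab_size)
    return model

# Toy data (hypothetical): word ids 0-1 lean to class 0, ids 2-3 to class 1.
labeled = [[0, 0, 1], [2, 3, 3]]
labels = [0, 1]
unlabeled = [[0, 1, 1], [3, 2, 2], [0, 0], [3, 3]]
model = semi_supervised_em(labeled, labels, unlabeled,
                           n_classes=2, vocab_size=4)
print(posterior([1, 1], *model))
```

The labeled documents keep fixed (hard) responsibilities throughout, so only the unlabeled documents are re-estimated in each E-step. The initialization from labeled data alone is the step the abstract identifies as fragile when labeled data are scarce.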
Switching State-Space Models
 King’s College Road, Toronto M5S 3H5
, 1996
Cited by 41 (2 self)
We introduce a statistical model for time series data with nonlinear dynamics which iteratively segments the data into regimes with approximately linear dynamics and learns the parameters of each of those regimes. This model combines and generalizes two of the most widely used stochastic time series models (the hidden Markov model and the linear dynamical system) and is related to models that are widely used in the control and econometrics literatures. It can also be derived by extending the mixture of experts neural network model (Jacobs et al., 1991) to its fully dynamical version, in which both expert and gating networks are recurrent. Inferring the posterior probabilities of the hidden states of this model is computationally intractable, and therefore the exact Expectation Maximization (EM) algorithm cannot be applied. However, we present a variational approximation which maximizes a lower bound on the log likelihood and makes use of both the forward-backward recursio...
A Hierarchical Model for Clustering and Categorising Documents
, 2002
Cited by 36 (13 self)
We propose a new hierarchical generative model for textual data, where words may be generated by topic-specific distributions at any level in the hierarchy. This model is naturally well-suited to clustering documents in preset or automatically generated hierarchies, as well as categorising new documents in an existing hierarchy. Training algorithms are derived for both cases, and illustrated on real data by clustering news stories and categorising newsgroup messages. Finally, the generative model may be used to derive a Fisher kernel expressing similarity between documents.
Discriminative, Generative and Imitative Learning
, 2002
Cited by 34 (1 self)
I propose a common framework that combines three different paradigms in machine learning: generative, discriminative and imitative learning. A generative probabilistic distribution is a principled way to model many machine learning and machine perception problems. Therein, one provides domain-specific knowledge in terms of structure and parameter priors over the joint space of variables. Bayesian networks and Bayesian statistics provide a rich and flexible language for specifying this knowledge and subsequently refining it with data and observations. The final result is a distribution that is a good generator of novel exemplars.
Dynamic Bayesian Networks for Information Fusion with Applications to Human-Computer Interfaces
, 1999
Cited by 33 (1 self)
Recent advances in various display and virtual technologies coupled with an explosion in available computing power have given rise to a number of novel human-computer interaction (HCI) modalities: speech, vision-based gesture recognition, eye tracking, EEG, etc. However, despite the abundance of novel interaction devices, the naturalness and efficiency of HCI have remained low. This is due in particular to the lack of robust sensory data interpretation techniques. To deal with the task of interpreting single and multiple interaction modalities, this dissertation establishes a novel probabilistic approach based on dynamic Bayesian networks (DBNs). As a generalization of the successful hidden Markov models, DBNs are a natural basis for the general temporal action interpretation task. The problem of interpretation of single or multiple interacting modalities can then be viewed as a Bayesian inference task. In this work three complex DBN models are introduced: mixtures of DBNs, mixed-state DBNs, and coupled HMMs. In-depth study of these models yields efficient approximate inference and parameter learning techniques applicable to a wide variety of problems. Experimental validation of the proposed approaches in the domains of gesture and speech recognition confirms the model's applicability to both unimodal and multimodal interpretation tasks.
A geometric view on bilingual lexicon extraction from comparable corpora
 In Proceedings of ACL04
, 2004
Cited by 15 (3 self)
We present a geometric view on bilingual lexicon extraction from comparable corpora, which allows one to reinterpret the methods proposed so far and identify unresolved problems. This motivates three new methods that aim at solving these problems. Empirical evaluation shows the strengths and weaknesses of these methods, as well as a significant gain in the accuracy of extracted lexicons.
Distributed Latent Variable Models of Lexical Co-occurrences
 In Proceedings of the International Workshop on Artificial Intelligence and Statistics
, 2005
Cited by 11 (1 self)
Low-dimensional representations for lexical co-occurrence data have become increasingly important in alleviating the sparse data problem inherent in natural language processing tasks. This work presents a distributed latent variable model for inducing these low-dimensional representations. The model takes