A Bayesian framework for word segmentation: Exploring the effects of context
- In 46th Annual Meeting of the ACL, 2009
Cited by 26 (7 self)
Since the experiments of Saffran et al. (1996a), there has been a great deal of interest in the question of how statistical regularities in the speech stream might be used by infants to begin to identify individual words. In this work, we use computational modeling to explore the effects of different assumptions the learner might make regarding the nature of words, in particular how these assumptions affect the kinds of words that are segmented from a corpus of transcribed child-directed speech. We develop several models within a Bayesian ideal observer framework, and use them to examine the consequences of assuming either that words are independent units, or units that help to predict other units. We show through empirical and theoretical results that the assumption of independence causes the learner to undersegment the corpus, with many two- and three-word sequences (e.g., what’s that, do you, in the house) misidentified as individual words. In contrast, when the learner assumes that words are predictive, the resulting segmentation is far more accurate. These results indicate that taking context into account is important for a statistical word segmentation strategy to be successful, and raise the possibility that even young infants may be able to exploit more subtle statistical patterns than have usually been considered.
Maximum Margin Clustering Made Practical
Cited by 23 (9 self)
Maximum margin clustering (MMC) is a recent large margin unsupervised learning approach that has often outperformed conventional clustering methods. Computationally, it involves non-convex optimization and has to be relaxed to different semidefinite programs (SDP). However, SDP solvers are computationally very expensive and only small data sets can be handled by MMC so far. To make MMC more practical, we avoid SDP relaxations and propose in this paper an efficient approach that performs alternating optimization directly on the original non-convex problem. A key step in avoiding premature convergence is the use of SVR with the Laplacian loss, instead of SVM with the hinge loss, in the inner optimization subproblem. Experiments on a number of synthetic and real-world data sets demonstrate that the proposed approach is often more accurate and much faster, and can handle much larger data sets.
Topic models over text streams: a study of batch and online unsupervised learning
- In Proc. 7th SIAM Int’l. Conf. on Data Mining
Cited by 12 (1 self)
Topic modeling techniques have widespread use in text data mining applications. Some applications use batch models, which perform clustering on the document collection in aggregate. In this paper, we analyze and compare the performance of three recently proposed batch topic models: Latent Dirichlet Allocation (LDA), Dirichlet Compound Multinomial (DCM) mixtures and von Mises-Fisher (vMF) mixture models. In cases where offline clustering on complete document collections is infeasible due to resource and response-rate constraints, online unsupervised clustering methods that process incoming data incrementally are necessary. To this end, we propose online variants of vMF, EDCM and LDA. Experiments on large real-world document collections, in both the offline and online settings, demonstrate that though LDA is a good model for finding word-level topics, vMF finds better document-level topic clusters more efficiently, which is often important in text mining applications. Finally, we propose a practical heuristic for hybrid topic modeling, which learns online topic models on streaming text and intermittently runs batch topic models on aggregated documents offline. Such a hybrid model is useful for several applications (e.g., dynamic topic-based aggregation of user-generated content in social networks) that need a good tradeoff between the performance of batch offline algorithms and efficiency of incremental online algorithms.
People-LDA: Anchoring topics to people using face recognition
- In IEEE International Conference on Computer Vision, 2007
Accounting for Burstiness in Topic Models
Cited by 8 (0 self)
Many different topic models have been used successfully for a variety of applications. However, even state-of-the-art topic models suffer from the important flaw that they do not capture the tendency of words to appear in bursts; it is a fundamental property of language that if a word is used once in a document, it is more likely to be used again. We introduce a topic model that uses Dirichlet compound multinomial (DCM) distributions to model this burstiness phenomenon. On both text and non-text datasets, the new model achieves better held-out likelihood than standard latent Dirichlet allocation (LDA). It is straightforward to incorporate the DCM extension into topic models that are more complex than LDA.
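As a rough illustration of the burstiness idea in the abstract above, the following sketch compares a single DCM (Pólya) distribution with a plain multinomial on two toy count vectors. It is not the topic model from the paper: the function names, the four-word vocabulary, and the parameter values (alpha = 0.1 per word, uniform multinomial) are assumptions chosen only for the example.

```python
import math


def dcm_log_prob(counts, alpha):
    """Log-probability of a word-count vector under a DCM (Polya) distribution."""
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient: n! / prod(x_w!)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Gamma(A) / Gamma(A + n)
    log_norm = math.lgamma(a) - math.lgamma(a + n)
    # prod_w Gamma(x_w + alpha_w) / Gamma(alpha_w)
    log_terms = sum(
        math.lgamma(x + al) - math.lgamma(al) for x, al in zip(counts, alpha)
    )
    return log_coef + log_norm + log_terms


def multinomial_log_prob(counts, probs):
    """Log-probability of a word-count vector under a multinomial."""
    n = sum(counts)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    return log_coef + sum(x * math.log(p) for x, p in zip(counts, probs) if x > 0)


# Toy documents over a hypothetical 4-word vocabulary, 4 tokens each.
bursty = [4, 0, 0, 0]  # one word repeated
spread = [1, 1, 1, 1]  # every word used once

alpha = [0.1] * 4      # assumed small DCM parameters (favor bursts)
uniform = [0.25] * 4   # assumed uniform multinomial

print("DCM:         bursty", dcm_log_prob(bursty, alpha),
      "spread", dcm_log_prob(spread, alpha))
print("Multinomial: bursty", multinomial_log_prob(bursty, uniform),
      "spread", multinomial_log_prob(spread, uniform))
```

With small alpha values the DCM assigns more probability to the document that repeats one word, while the uniform multinomial prefers the evenly spread document. That contrast is the behavior the abstract calls burstiness.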
Latent Class Models for Algorithm Portfolio Methods
Cited by 4 (2 self)
Different solvers for computationally difficult problems such as satisfiability (SAT) perform best on different instances. Algorithm portfolios exploit this phenomenon by predicting solvers’ performance on specific problem instances, then shifting computational resources to the solvers that appear best suited. This paper develops a new approach to the problem of making such performance predictions: natural generative models of solver behavior. Two are proposed, both following from an assumption that problem instances cluster into latent classes: a mixture of multinomial distributions, and a mixture of Dirichlet compound multinomial distributions. The latter model extends the former to capture burstiness, the tendency of solver outcomes to recur. These models are integrated into an algorithm portfolio architecture and used to run standard SAT solvers on competition benchmarks. This approach is found competitive with the most prominent existing portfolio, SATzilla, which relies on domain-specific, hand-selected problem features; the latent class models, in contrast, use minimal domain knowledge. Their success suggests that these models can lead to more powerful and more general algorithm portfolio methods.
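To make the latent-class idea in the abstract above concrete, here is a minimal sketch of inference in a mixture of multinomials: given outcome counts observed for one solver on one instance, it computes the posterior over latent classes and the predicted distribution of the next outcome. The function names, the two outcome categories, and all parameter values are hypothetical and set by hand. The sketch does not show how the paper fits its models or how its portfolio allocates resources.

```python
import math


def class_posterior(counts, weights, class_probs):
    """Posterior over latent classes given observed outcome counts.

    The multinomial coefficient is the same for every class, so it cancels.
    Assumes every outcome with a nonzero count has nonzero probability.
    """
    log_joint = []
    for w, probs in zip(weights, class_probs):
        log_lik = sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)
        log_joint.append(math.log(w) + log_lik)
    # Normalize in log space for numerical stability.
    m = max(log_joint)
    unnorm = [math.exp(v - m) for v in log_joint]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def predict_outcomes(counts, weights, class_probs):
    """Posterior-weighted distribution over the next outcome."""
    post = class_posterior(counts, weights, class_probs)
    n_outcomes = len(class_probs[0])
    return [
        sum(r * probs[j] for r, probs in zip(post, class_probs))
        for j in range(n_outcomes)
    ]


# Hypothetical setup: two latent instance classes, outcomes = (solved, timeout).
weights = [0.5, 0.5]
class_probs = [
    [0.9, 0.1],  # class 0: solver usually succeeds
    [0.2, 0.8],  # class 1: solver usually times out
]
observed = [3, 1]  # 3 successes and 1 timeout so far

print(class_posterior(observed, weights, class_probs))
print(predict_outcomes(observed, weights, class_probs))
```

Replacing the per-class multinomial likelihood with a DCM likelihood would give the bursty variant the abstract mentions. That substitution is not shown here.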
Spherical Topic Models
Cited by 2 (0 self)
We introduce the Spherical Admixture Model (SAM), a Bayesian topic model over arbitrary ℓ2-normalized data. SAM models documents as points on a high-dimensional spherical manifold, and is capable of representing negative word-topic correlations and word presence/absence, unlike models with multinomial document likelihood, such as LDA. In this paper, we evaluate SAM as a topic browser, focusing on its ability to model “negative” topic features, and also as a dimensionality reduction method, using topic proportions as features for difficult classification tasks in natural language processing and computer vision.
Audio Classification of Bird Species: a Statistical Manifold Approach
Cited by 2 (1 self)
Our goal is to automatically identify which species of bird is present in an audio recording using supervised learning. Devising effective algorithms for bird species classification is a preliminary step toward extracting useful ecological data from recordings collected in the field. We propose a probabilistic model for audio features within a short interval of time, then derive its Bayes risk-minimizing classifier, and show that it is closely approximated by a nearest-neighbor classifier using Kullback-Leibler divergence to compare histograms of features. We note that feature histograms can be viewed as points on a statistical manifold, and KL divergence approximates geodesic distances defined by the Fisher information metric on such manifolds. Motivated by this fact, we propose the use of another approximation to the Fisher information metric, namely the Hellinger metric. The proposed classifiers achieve over 90% accuracy on a data set containing six species of bird, and outperform support vector machines.
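The nearest-neighbor scheme described in the abstract above can be sketched as follows, under several assumptions: feature histograms are given as raw count lists, a small additive smoothing constant avoids zero bins, and the KL divergence is taken from the query histogram to the training histogram (the abstract does not state the direction). The function names, labels, and counts are made up for illustration and are not the authors' implementation or data.

```python
import math


def normalize(hist, eps=1e-9):
    """Turn raw bin counts into a smoothed probability histogram."""
    total = sum(hist) + eps * len(hist)
    return [(h + eps) / total for h in hist]


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive histograms of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def hellinger(p, q):
    """Hellinger distance between two histograms, in [0, 1]."""
    return math.sqrt(
        0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    )


def nearest_neighbor(query, training, distance):
    """Return the label of the training histogram closest to the query."""
    q = normalize(query)
    label, _ = min(
        ((lab, distance(q, normalize(h))) for h, lab in training),
        key=lambda pair: pair[1],
    )
    return label


# Hypothetical training set: (feature histogram, species label).
training = [
    ([8, 1, 1], "species_a"),
    ([1, 8, 1], "species_b"),
]
query = [6, 2, 2]

print(nearest_neighbor(query, training, kl_divergence))
print(nearest_neighbor(query, training, hellinger))
```

Either distance can be passed to `nearest_neighbor`. Unlike KL divergence, the Hellinger distance is symmetric and bounded, so it needs no decision about argument order.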
Extracting Information from Informal Communication
, 2007
Cited by 1 (0 self)
This thesis focuses on the problem of extracting information from informal communication. Textual informal communication, such as e-mail, bulletin boards and blogs, has become a vast information resource. However, such information is poorly organized and difficult for a computer to understand due to lack of editing and structure. Thus, techniques which work well for formal text, such as newspaper articles, may be considered insufficient on informal text. One focus of ours is to attempt to advance the state-of-the-art for sub-problems of the information extraction task. We make contributions to the problems of named entity extraction, co-reference resolution and context tracking. We channel our efforts toward methods which are particularly applicable to informal communication. We also consider a type of information which is somewhat unique to informal communication: preferences and opinions. Individuals often express their opinions on products and services in such

