Results 1 - 10
of
37
Topics in semantic representation
- Psychological Review
, 2007
"... Processing language requires the retrieval of concepts from memory in response to an ongoing stream of information. This retrieval is facilitated if one can infer the gist of a sentence, conversation, or document computational problem underlying the extraction and use of gist, formulating this probl ..."
Abstract
-
Cited by 48 (8 self)
- Add to MetaCart
Processing language requires the retrieval of concepts from memory in response to an ongoing stream of information. This retrieval is facilitated if one can infer the gist of a sentence, conversation, or document computational problem underlying the extraction and use of gist, formulating this problem as a rational statistical inference. This leads to a novel approach to semantic representation in which word meanings are represented in terms of a set of probabilistic topics. The topic model performs well in predicting word association and the effects of semantic association and ambiguity on a variety of language-processing and memory tasks. It also provides a foundation for developing more richly structured statistical models of language, as the generative process assumed in the topic model can easily be extended to incorporate other kinds of semantic and syntactic structure.
Modeling Online Reviews with Multi-grain Topic Models
, 2008
"... In this paper we present a novel framework for extracting the ratable aspects of objects from online user reviews. Extracting such aspects is an important challenge in automatically mining product opinions from the web and in generating opinion-based summaries of user reviews [18, 19, 7, 12, 27, 36, ..."
Abstract
-
Cited by 37 (5 self)
- Add to MetaCart
In this paper we present a novel framework for extracting the ratable aspects of objects from online user reviews. Extracting such aspects is an important challenge in automatically mining product opinions from the web and in generating opinion-based summaries of user reviews [18, 19, 7, 12, 27, 36, 21]. Our models are based on extensions to standard topic modeling methods such as LDA and PLSA to induce multi-grain topics. We argue that multi-grain models are more appropriate for our task since standard models tend to produce topics that correspond to global properties of objects (e.g., the brand of a product type) rather than the aspects of an object that tend to be rated by a user. The models we present not only extract ratable aspects, but also cluster them into coherent topics, e.g., waitress and bartender are part of the same topic staff for restaurants. This differentiates it from much of the previous work which extracts aspects through term frequency analysis with minimal clustering. We evaluate the multi-grain models both qualitatively and quantitatively to show that they improve significantly upon standard topic models.
Minimum cut model for spoken lecture segmentation
- In Proceedings of the Annual Meeting of the Association for Computational Linguistics (COLING-ACL 2006
, 2006
"... We consider the task of unsupervised lecture segmentation. We formalize segmentation as a graph-partitioning task that optimizes the normalized cut criterion. Our approach moves beyond localized comparisons and takes into account longrange cohesion dependencies. Our results demonstrate that global a ..."
Abstract
-
Cited by 35 (7 self)
- Add to MetaCart
We consider the task of unsupervised lecture segmentation. We formalize segmentation as a graph-partitioning task that optimizes the normalized cut criterion. Our approach moves beyond localized comparisons and takes into account longrange cohesion dependencies. Our results demonstrate that global analysis improves the segmentation accuracy and is robust in the presence of speech recognition errors. 1
Learning Document-Level Semantic Properties from Free-text Annotations
"... This paper demonstrates a new method for leveraging unstructured annotations to infer semantic document properties. We consider the domain of product reviews, which are often annotated by their authors with free-text keyphrases, such as “a real bargain ” or “good value. ” We leverage these unstructu ..."
Abstract
-
Cited by 18 (2 self)
- Add to MetaCart
This paper demonstrates a new method for leveraging unstructured annotations to infer semantic document properties. We consider the domain of product reviews, which are often annotated by their authors with free-text keyphrases, such as “a real bargain ” or “good value. ” We leverage these unstructured annotations by clustering them into semantic properties, and then tying the induced clusters to hidden topics in the document text. This allows us to predict relevant properties of unannotated documents. Our approach is implemented in a hierarchical Bayesian model with joint inference, which increases the robustness of the keyphrase clustering and encourages document topics to correlate with semantically meaningful properties. We perform several evaluations of our model, and find that it substantially outperforms alternative approaches. 1
Bayesian Unsupervised Topic Segmentation
"... This paper describes a novel Bayesian approach to unsupervised topic segmentation. Unsupervised systems for this task are driven by lexical cohesion: the tendency of wellformed segments to induce a compact and consistent lexical distribution. We show that lexical cohesion can be placed in a Bayesian ..."
Abstract
-
Cited by 17 (3 self)
- Add to MetaCart
This paper describes a novel Bayesian approach to unsupervised topic segmentation. Unsupervised systems for this task are driven by lexical cohesion: the tendency of wellformed segments to induce a compact and consistent lexical distribution. We show that lexical cohesion can be placed in a Bayesian context by modeling the words in each topic segment as draws from a multinomial language model associated with the segment; maximizing the observation likelihood in such a model yields a lexically-cohesive segmentation. This contrasts with previous approaches, which relied on hand-crafted cohesion metrics. The Bayesian framework provides a principled way to incorporate additional features such as cue phrases, a powerful indicator of discourse structure that has not been previously used in unsupervised segmentation systems. Our model yields consistent improvements over an array of state-of-the-art systems on both text and speech datasets. We also show that both an entropy-based analysis and a well-known previous technique can be derived as special cases of the Bayesian framework. 1 1
A Topic Model for Word Sense Disambiguation
, 2007
"... We develop latent Dirichlet allocation with WORDNET (LDAWN), an unsupervised probabilistic topic model that includes word sense as a hidden variable. We develop a probabilistic posterior inference algorithm for simultaneously disambiguating a corpus and learning the domains in which to consider each ..."
Abstract
-
Cited by 16 (0 self)
- Add to MetaCart
We develop latent Dirichlet allocation with WORDNET (LDAWN), an unsupervised probabilistic topic model that includes word sense as a hidden variable. We develop a probabilistic posterior inference algorithm for simultaneously disambiguating a corpus and learning the domains in which to consider each word. Using the WORDNET hierarchy, we embed the construction of Abney and Light (1999) in the topic model and show that automatically learned domains improve WSD accuracy compared to alternative contexts.
A.: Segmenting meetings into agenda items by extracting implicit supervision from human note-taking
- In Proc. of IUI 2006
, 2007
"... Splitting a meeting into segments such that each segment contains discussions on exactly one agenda item is useful for tasks such as retrieval and summarization of agenda item discussions. However, accurate topic segmentation of meetings is a difficult task. In this paper, we investigate the idea of ..."
Abstract
-
Cited by 12 (1 self)
- Add to MetaCart
Splitting a meeting into segments such that each segment contains discussions on exactly one agenda item is useful for tasks such as retrieval and summarization of agenda item discussions. However, accurate topic segmentation of meetings is a difficult task. In this paper, we investigate the idea of acquiring implicit supervision from human meeting participants to solve the segmentation problem. Specifically we have implemented and tested a note taking interface that gives value to users by helping them organize and retrieve their notes easily, but that also extracts a segmentation of the meeting based on note taking behavior. We show that the segmentation so obtained achieves a Pk value of 0.212 which improves upon an unsupervised baseline by 45 % relative, and compares favorably with a current state–of–the–art algorithm. Most importantly, we achieve this performance without any features or algorithms in the classic sense. ACM Classification: H5.2 [Information interfaces and presentation]:
Global models of document structure using latent permutations
- In NAACL’09
, 2009
"... We present a novel Bayesian topic model for learning discourse-level document structure. Our model leverages insights from discourse theory to constrain latent topic assignments in a way that reflects the underlying organization of document topics. We propose a global model in which both topic selec ..."
Abstract
-
Cited by 12 (1 self)
- Add to MetaCart
We present a novel Bayesian topic model for learning discourse-level document structure. Our model leverages insights from discourse theory to constrain latent topic assignments in a way that reflects the underlying organization of document topics. We propose a global model in which both topic selection and ordering are biased to be similar across a collection of related documents. We show that this space of orderings can be elegantly represented using a distribution over permutations called the generalized Mallows model. Our structureaware approach substantially outperforms alternative approaches for cross-document comparison and single-document segmentation. 1 1
The CALO meeting speech recognition and understanding system
- in Proc. IEEE Spoken Language Technology Workshop
, 2008
"... The CALO Meeting Assistant provides for distributed meeting capture, annotation, automatic transcription and semantic analysis of multi-party meetings, and is part of the larger CALO personal assistant system. This paper summarizes the CALO-MA architecture and its speech recognition and understandin ..."
Abstract
-
Cited by 5 (4 self)
- Add to MetaCart
The CALO Meeting Assistant provides for distributed meeting capture, annotation, automatic transcription and semantic analysis of multi-party meetings, and is part of the larger CALO personal assistant system. This paper summarizes the CALO-MA architecture and its speech recognition and understanding components, which include realtime and offline speech transcription, dialog act segmentation and tagging, question-answer pair identification, action item recognition, and summarization. 1.
Exploiting Conversation Structure in Unsupervised Topic Segmentation for
"... This work concerns automatic topic segmentation of email conversations. We present a corpus of email threads manually annotated with topics, and evaluate annotator reliability. To our knowledge, this is the first such email corpus. We show how the existing topic segmentation models (i.e., Lexical Ch ..."
Abstract
-
Cited by 4 (4 self)
- Add to MetaCart
This work concerns automatic topic segmentation of email conversations. We present a corpus of email threads manually annotated with topics, and evaluate annotator reliability. To our knowledge, this is the first such email corpus. We show how the existing topic segmentation models (i.e., Lexical Chain Segmenter (LCSeg) and Latent Dirichlet Allocation (LDA)) which are solely based on lexical information, can be applied to emails. By pointing out where these methods fail and what any desired model should consider, we propose two novel extensions of the models that not only use lexical information but also exploit finer level conversation structure in a principled way. Empirical evaluation shows that LCSeg is a better model than LDA for segmenting an email thread into topical clusters and incorporating conversation structure into these models improves the performance significantly. 1

