Results 11 - 20 of 68
Distributed Latent Variable Models of Lexical Co-occurrences
- In Proceedings of the International Workshop on Artificial Intelligence and Statistics
, 2005
Cited by 10 (0 self)
Low-dimensional representations for lexical co-occurrence data have become increasingly important in alleviating the sparse data problem inherent in natural language processing tasks. This work presents a distributed latent variable model for inducing these low-dimensional representations. The model takes
A Bayesian interpretation of interpolated Kneser-Ney
, 2006
Cited by 8 (2 self)
Interpolated Kneser-Ney is one of the best smoothing methods for n-gram language models. Previous explanations for its superiority have been based on intuitive and empirical justifications of specific properties of the method. We propose a novel interpretation of interpolated Kneser-Ney as approximate inference in a hierarchical Bayesian model consisting of Pitman-Yor processes. As opposed to past explanations, our interpretation can recover exactly the formulation of interpolated Kneser-Ney, and performs better than interpolated Kneser-Ney when a better inference procedure is used.
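For readers unfamiliar with the smoothing method this abstract refers to, the following is a minimal sketch of a textbook interpolated Kneser-Ney bigram estimator. It is not the hierarchical Pitman-Yor model proposed in the paper. The function name `train_kn_bigram`, the fixed discount of 0.75, and the toy corpus are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def train_kn_bigram(tokens, discount=0.75):
    """Return prob(w, u), an interpolated Kneser-Ney estimate of P(w | u).

    Hypothetical helper for illustration; a single fixed discount is assumed.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_count = Counter()         # c(u): how often u occurs as a context
    followers = defaultdict(set)      # distinct words seen after u
    predecessors = defaultdict(set)   # distinct words seen before w
    for (u, w), count in bigrams.items():
        context_count[u] += count
        followers[u].add(w)
        predecessors[w].add(u)
    num_bigram_types = len(bigrams)

    def prob(w, u):
        # Continuation probability: share of distinct bigram types ending in w.
        p_cont = len(predecessors.get(w, ())) / num_bigram_types
        total = context_count.get(u, 0)
        if total == 0:
            # Unseen context: fall back to the continuation distribution.
            return p_cont
        discounted = max(bigrams[(u, w)] - discount, 0.0) / total
        # Mass removed by discounting is redistributed via p_cont.
        backoff_weight = discount * len(followers[u]) / total
        return discounted + backoff_weight * p_cont

    return prob


tokens = "the cat sat on the mat the cat ate".split()
prob = train_kn_bigram(tokens)
```

With the toy corpus above, `prob("cat", "the")` is larger than `prob("mat", "the")`, and a word never seen after "the", such as "ate", still receives non-zero probability through the continuation term.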
A neural network for text representation
- Artificial Neural Networks: Biological Inspirations ICANN 2005: 15th International Conference
Cited by 7 (2 self)
Text categorization and retrieval tasks are often based on a good representation of textual data. Departing from the classical vector space model, several probabilistic models have been proposed recently, such as PLSA and LDA. In this paper, we propose the use of a neural-network-based, non-probabilistic solution, which jointly captures a rich representation of words and documents. Experiments performed on two information retrieval tasks using the TDT2 database and the TREC-8 and 9 sets of queries yielded better performance for the proposed neural network model than for PLSA and the classical TFIDF representations. (Submitted for publication; IDIAP-RR 05-12.)
Visualizing similarity data with a mixture of maps
- In AI and Statistics, 2007. Society for Artificial Intelligence and Statistics
, 2007
Cited by 7 (2 self)
We show how to visualize a set of pairwise similarities between objects by using several different two-dimensional maps, each of which captures different aspects of the similarity structure. When the objects are ambiguous words, for example, different senses of a word occur in different maps, so “river” and “loan” can both be close to “bank” without being at all close to each other. Aspect maps resemble clustering because they model pair-wise similarities as a mixture of different types of similarity, but they also resemble local multi-dimensional scaling because they model each type of similarity by a two-dimensional map. We demonstrate our method on a toy example, a database of human word-association data, a large set of images of handwritten digits, and a set of feature vectors that represent words.
Fast Semantic Extraction Using a Novel Neural Network Architecture
Cited by 7 (2 self)
We describe a novel neural network architecture for the problem of semantic role labeling. Many current solutions are complicated, consist of several stages and hand-built features, and are too slow to be applied as part of real applications that require such semantic labels, partly because of their use of a syntactic parser (Pradhan et al., 2004; Gildea and Jurafsky, 2002). Our method instead learns a direct mapping from source sentence to semantic tags for a given predicate without the aid of a parser or a chunker. Our resulting system obtains accuracies comparable to the current state-of-the-art at a fraction of the computational cost.
Quick Training of Probabilistic Neural Nets by Importance Sampling
, 2003
Cited by 7 (4 self)
Our previous work on statistical language modeling introduced the use of probabilistic feedforward neural networks to help deal with the curse of dimensionality. Training this model by maximum likelihood, however, requires performing, for each example, as many network passes as there are words in the vocabulary. Inspired by the contrastive divergence model, we propose and evaluate sampling-based methods which require network passes only for the observed "positive example" and a few sampled negative example words. A very significant speed-up is obtained with adaptive importance sampling.
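The cost this abstract describes comes from the softmax normalizer, which sums over the whole vocabulary. As a rough illustration of the general idea only, the sketch below estimates the log-normalizer from a few words drawn from a proposal distribution, using plain importance sampling. It is not the adaptive scheme evaluated in the paper, and the function name `sampled_log_normalizer`, the toy scores, and the uniform proposal are all assumptions.

```python
import math
import random


def sampled_log_normalizer(scores, proposal, num_samples, rng):
    """Estimate log(sum_w exp(scores[w])) from samples of `proposal`.

    Hypothetical helper: `scores` are unnormalized word scores and
    `proposal` is a probability distribution over the same indices.
    """
    indices = rng.choices(range(len(scores)), weights=proposal, k=num_samples)
    # Log importance weight of each sample: log(exp(score) / proposal).
    log_weights = [scores[i] - math.log(proposal[i]) for i in indices]
    # Average the weights in a numerically stable way.
    peak = max(log_weights)
    mean_weight = sum(math.exp(lw - peak) for lw in log_weights) / num_samples
    return peak + math.log(mean_weight)


scores = [0.5, -1.0, 2.0, 0.0]
exact = math.log(sum(math.exp(s) for s in scores))
uniform = [0.25] * len(scores)
estimate = sampled_log_normalizer(scores, uniform, 20000, random.Random(1))
```

The estimate approaches the exact value as the number of samples grows. If the proposal happened to equal the model's own softmax distribution, every importance weight would equal the normalizer and any number of samples would give the exact answer, which is why a proposal that tracks the model is attractive.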
Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Networks
Cited by 7 (4 self)
Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic or semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases only partly address the problem, at the cost of huge feature spaces and sparseness. To address this, we introduce a recursive neural network architecture for jointly parsing natural language and learning vector space representations for variable-sized inputs. At the core of our architecture are context-sensitive recursive neural networks (CRNN). These networks can induce distributed feature representations for unseen phrases and provide syntactic information to accurately predict phrase structure trees. Most excitingly, the representation of each phrase also captures semantic information: for instance, the phrases “decline to comment” and “would not disclose the terms” are close by in the induced embedding space. Our current system achieves an unlabeled bracketing F-measure of 92.1% on the Wall Street Journal dataset for sentences up to length 15.
Hierarchical Pitman-Yor language models for ASR in meetings
- In Proceedings of IEEE ASRU International Conference
, 2007
Cited by 6 (3 self)
In this paper we investigate the application of a hierarchical Bayesian language model (LM) based on the Pitman-Yor process for automatic speech recognition (ASR) of multiparty meetings. The hierarchical Pitman-Yor language model (HPYLM) provides a Bayesian interpretation of LM smoothing. An approximation to the HPYLM recovers the exact formulation of the interpolated Kneser-Ney smoothing method in n-gram models. This paper focuses on the application and scalability of the HPYLM on a practical large vocabulary ASR system. Experimental results on NIST RT06s evaluation meeting data verify that the HPYLM is a competitive and promising language modeling technique, which consistently performs better than interpolated Kneser-Ney and modified Kneser-Ney n-gram LMs in terms of both perplexity and word error rate.
Hierarchical distributed representations for statistical language modeling
- In Advances in Neural Information Processing Systems 17
, 2004
Cited by 6 (2 self)
Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts.
Improve latent semantic analysis based language model by integrating multiple level knowledge
- In Proc. of ICSLP
, 2002
Cited by 4 (0 self)
We describe an extension to the use of Latent Semantic Analysis (LSA) for language modeling. This technique makes it easier to exploit long-distance relationships in natural language, for which the traditional n-gram is unsuited. However, as the history grows longer, its semantic representation may be contaminated by irrelevant information, increasing the uncertainty in predicting the next word. To address this problem, we propose a multilevel framework dividing the history into three levels corresponding to document, paragraph and sentence. To combine the three levels of information with the n-gram, a softmax network is used. We further present a statistical scheme that dynamically determines the unit scope in the generalization stage. The combination of all the techniques leads to a 14% perplexity reduction on a subset of Wall Street Journal, compared with the trigram model.

