## Multiview learning of word embeddings via CCA (2011)

Venue: Proc. of NIPS

Citations: 15 (4 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Dhillon11multiviewlearning,
  author    = {Paramveer S. Dhillon and Dean Foster and Lyle Ungar},
  title     = {Multiview learning of word embeddings via {CCA}},
  booktitle = {Proc. of NIPS},
  year      = {2011}
}
```


### Abstract

Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations which can then be used as features in supervised classifiers for NLP tasks. However, most current approaches are slow to train, do not model the context of the word, and lack theoretical grounding. In this paper, we present a new learning method, Low Rank Multi-View Learning (LR-MVL), which uses a fast spectral method to estimate low dimensional context-specific word representations from unlabeled data. These representation features can then be used with any supervised learner. LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-of-the-art performance on named entity recognition (NER) and chunking problems.
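To make the abstract's "multiple views" concrete: LR-MVL correlates views of each token's context (what precedes and follows it). Below is a minimal, illustrative sketch of building one-hot past/future context views for a toy corpus. The function name, window size, and plain one-hot encoding are our simplifications; the paper's actual view construction is richer.

```python
import numpy as np

def context_views(tokens, vocab, h=1):
    """Build toy one-hot 'past' and 'future' context views per token position.

    Illustrative only: a window of size h with one-hot word indicators;
    the paper's actual construction differs in detail.
    """
    idx = {w: i for i, w in enumerate(vocab)}
    n, V = len(tokens), len(vocab)
    past = np.zeros((n, V))
    future = np.zeros((n, V))
    for t in range(n):
        if t - h >= 0:
            past[t, idx[tokens[t - h]]] = 1.0    # word h positions to the left
        if t + h < n:
            future[t, idx[tokens[t + h]]] = 1.0  # word h positions to the right
    return past, future
```

Running CCA between `past` and `future` over many token positions is then the kind of two-view correlation step the method builds on.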

### Citations

1245 | Combining Labeled and Unlabeled Data with Co-training
- Blum, Mitchell
- 1998
Citation Context: "... • Take the CCA between the hidden states and the tokens wt. The singular vectors associated with wt form a new estimate of the eigenfeature dictionary. LR-MVL can be viewed as a type of co-training [13]: The state of each token wt is similar to that of the tokens both before and after it, and it is also similar to the states of the other occurrences of the same word elsewhere in the document (used i..."

697 | Class-based Ngram models of natural language
- Brown
- 1992
Citation Context: "...ustering based word representations: Clustering methods, often hierarchical, are used to group distributionally similar words based on their contexts. The two dominant approaches are Brown Clustering [3] and [4]. As recently shown, HMMs can also be used to induce a multinomial distribution over possible clusters [5]. 2. Dense representations: These representations are dense, low dimensional and real-..."

548 | Distributional clustering of English words
- Pereira, Tishby, et al.
- 1993
Citation Context: "... based word representations: Clustering methods, often hierarchical, are used to group distributionally similar words based on their contexts. The two dominant approaches are Brown Clustering [3] and [4]. As recently shown, HMMs can also be used to induce a multinomial distribution over possible clusters [5]. 2. Dense representations: These representations are dense, low dimensional and real-valued. ..."

319 | A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
- Ando, Zhang
Citation Context: "...upplement the labeled data in semi-supervised learning settings to overcome the inherent data sparsity and get improved generalization accuracies in high dimensional domains like NLP. Approaches like [1, 2] have been empirically very successful and have achieved excellent accuracies on a variety of NLP tasks. However, it is often difficult to adapt these approaches to use in conjunction with an existing..."

113 | A unified architecture for natural language processing: Deep neural networks with multitask learning
- Weston
- 2008
Citation Context: "...Each dimension of these representations captures latent information about a combination of syntactic and semantic word properties. They can either be induced using neural networks like C&W embeddings [6] and Hierarchical log-linear (HLBL) embeddings [7] or by eigen-decomposition of the word co-occurrence matrix, e.g. Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI) [8]. Unfortunately, most..."

96 | Using latent semantic analysis to improve access to textual information
- Dumais, Furnas, et al.
- 1988
Citation Context: "... like C&W embeddings [6] and Hierarchical log-linear (HLBL) embeddings [7] or by eigen-decomposition of the word co-occurrence matrix, e.g. Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI) [8]. Unfortunately, most of these representations are 1). slow to train, 2). sensitive to the scaling of the embeddings (especially ℓ2 based approaches like LSA/PCA), 3). can get stuck in local optima (l..."

59 | Design challenges and misconceptions in named entity recognition
- Ratinov, Roth
- 2009
Citation Context: "...window. • Previous two predictions yi−1 and yi−2 and conjunction of d and yi−1 • Embedding features (LR-MVL, C&W, HLBL, Brown etc.) in a window of 2 around the current word (if applicable). Following [17] we use regularized averaged perceptron model with above set of baseline features for the NER task. We also used their BILOU text chunk representation and fast greedy inference as it was shown to give..."

56 | A spectral algorithm for learning hidden Markov models
- Hsu, Kakade, et al.
- 2009
Citation Context: "...ic, but context oblivious embeddings (like the ones used by [6, 7]) can be trivially gotten from our model. Furthermore, building on recent advances in spectral learning for sequence models like HMMs [9, 10, 11] we show that LR-MVL has strong theoretical grounding. Particularly, we show that LR-MVL estimates low dimensional context-specific word embeddings which preserve all the information in the data if th..."

56 | Word Representations: A Simple and General Method for Semi-Supervised Learning
- Turian, Ratinov, et al.
- 2010
Citation Context: "...3 and the CoNLL ’00 datasets had ∼ 204K/51K/46K and ∼ 212K/−/47K tokens respectively for Train/Dev./Test sets. 4.1.1 Named Entity Recognition (NER) We use the same set of baseline features as used by [15, 16] in their experiments. The detailed list of features is as below: • Current Word wi; Its type information: all-capitalized, is-capitalized, all-digits and so on; Prefixes and suffixes of wi • Word tok..."

43 | Three new graphical models for statistical language modelling
- Mnih, Hinton
- 2007
Citation Context: "...atent information about a combination of syntactic and semantic word properties. They can either be induced using neural networks like C&W embeddings [6] and Hierarchical log-linear (HLBL) embeddings [7] or by eigen-decomposition of the word co-occurrence matrix, e.g. Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI) [8]. Unfortunately, most of these representations are 1). slow to train, 2..."

42 | Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions
- Halko, Martinsson, et al.
Citation Context: "... A few iterations (∼ 5) of the above algorithm are sufficient to converge to the solution. (Since the problem is convex, there is a single solution, so there is no issue of local minima.) As [14] show for PCA, one can start with a random matrix that is only slightly larger than the true rank k of the correlation matrix, and with extremely high likelihood converge in a few iterations to within..."
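The context above appeals to the randomized low-rank machinery of [14]: start from a random matrix only slightly wider than the target rank k, and a few power iterations converge to the dominant subspace with high probability. A hedged NumPy sketch in that spirit (the function name, oversampling amount, and iteration count are our choices, not from either paper):

```python
import numpy as np

def randomized_svd(A, k, oversample=5, n_iter=4, seed=0):
    """Approximate top-k SVD of A via randomized subspace iteration."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    Q = rng.normal(size=(n, k + oversample))  # random start, slightly wider than k
    for _ in range(n_iter):                   # a few power iterations suffice
        Q, _ = np.linalg.qr(A @ Q)            # orthonormalize to avoid blow-up
        Q, _ = np.linalg.qr(A.T @ Q)
    Q, _ = np.linalg.qr(A @ Q)                # basis for the dominant range of A
    B = Q.T @ A                               # small projected matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]
```

On a matrix of exact rank k the subspace is recovered essentially exactly; for noisy spectra the oversampling and extra iterations control the approximation error.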

32 | Hilbert space embeddings of hidden markov models
- Song, Boots, et al.
- 2010
Citation Context: "...ic, but context oblivious embeddings (like the ones used by [6, 7]) can be trivially gotten from our model. Furthermore, building on recent advances in spectral learning for sequence models like HMMs [9, 10, 11] we show that LR-MVL has strong theoretical grounding. Particularly, we show that LR-MVL estimates low dimensional context-specific word embeddings which preserve all the information in the data if th..."

26 | Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data
- Suzuki, Isozaki
- 2008
Citation Context: "...upplement the labeled data in semi-supervised learning settings to overcome the inherent data sparsity and get improved generalization accuracies in high dimensional domains like NLP. Approaches like [1, 2] have been empirically very successful and have achieved excellent accuracies on a variety of NLP tasks. However, it is often difficult to adapt these approaches to use in conjunction with an existing..."

26 | Distributional representations for handling sparsity in supervised sequence labeling
- Huang, Yates
- 2009
Citation Context: "...milar words based on their contexts. The two dominant approaches are Brown Clustering [3] and [4]. As recently shown, HMMs can also be used to induce a multinomial distribution over possible clusters [5]. 2. Dense representations: These representations are dense, low dimensional and real-valued. Each dimension of these representations captures latent information about a combination of syntactic and s..."

24 | Phrase clustering for discriminative learning
- Lin, Wu
- 2009
Citation Context: "...ten from (A) in Algorithm 1. 2). F1-score = Harmonic Mean of Precision and Recall. 3). The current state-of-the-art for this NER task is 90.90 (Test Set) but using 700 billion tokens of unlabeled data [19]."

The snippet continues with a results table, flattened in the original extraction:

| Embedding/Model | Test Set F1-Score |
| --- | --- |
| Baseline | 93.79 |
| HLBL, 50-dim | 94.00 |
| C&W, 50-dim | 94.10 |
| Brown 3200 Clusters | 94.11 |
| Ando & Zhang ’05 | 94.39 |
| Suzuki & Isozaki ’08 | 94.67 |
| LR-MVL (CO) 50 × 3-dim | 95.02 |
| LR-MVL ... | (snippet truncated) |

23 | Reducedrank hidden Markov models
- Siddiqi, Boots, et al.
- 2010
Citation Context: "...ic, but context oblivious embeddings (like the ones used by [6, 7]) can be trivially gotten from our model. Furthermore, building on recent advances in spectral learning for sequence models like HMMs [9, 10, 11] we show that LR-MVL has strong theoretical grounding. Particularly, we show that LR-MVL estimates low dimensional context-specific word embeddings which preserve all the information in the data if th..."

23 | A robust risk minimization based named entity recognition system
- Zhang, Johnson
- 2003
Citation Context: "...3 and the CoNLL ’00 datasets had ∼ 204K/51K/46K and ∼ 212K/−/47K tokens respectively for Train/Dev./Test sets. 4.1.1 Named Entity Recognition (NER) We use the same set of baseline features as used by [15, 16] in their experiments. The detailed list of features is as below: • Current Word wi; Its type information: all-capitalized, is-capitalized, all-digits and so on; Prefixes and suffixes of wi • Word tok..."

18 | Semi-supervised learning for natural language
- Liang
- 2005
Citation Context: "...he RCV1 corpus containing Reuters newswire from Aug ’96 to Aug ’97 and containing about 63 million tokens in 3.3 million sentences. Case was left intact and we did not do the “cleaning” as done by [18, 16] i.e. remove all sentences which are less than 90% lowercase a-z, as our multi-view learning approach is robust to such noisy data, like news byline text (mostly all caps) which does not correlate str..."

2 | Canonical Correlation Analysis (CCA)
- Hotelling
- 1935
Citation Context: "... optima as is the case for an EM trained HMM. LR-MVL falls into category (2) mentioned above; it learns real-valued context-specific word embeddings by performing Canonical Correlation Analysis (CCA) [12] between the past and future views of low rank approximations of the data. However, LR-MVL is more general than those methods, which work on bigram or trigram co-occurrence matrices, in that it uses l..."
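For reference, the classical CCA of [12] between two centered views can be computed by whitening each view's covariance and taking an SVD of the whitened cross-covariance; the singular values are the canonical correlations. A small NumPy sketch (the ridge term `reg` is our addition for numerical stability, not part of the 1935 formulation):

```python
import numpy as np

def cca(X, Y, k, reg=1e-5):
    """CCA between two views X (n x dx) and Y (n x dy).

    Returns projections A, B and the top-k canonical correlations.
    The ridge term `reg` is our addition for numerical stability.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])  # view-1 covariance (ridged)
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])  # view-2 covariance (ridged)
    Cxy = X.T @ Y / n                             # cross-covariance
    # Whiten each view via Cholesky, then SVD the whitened cross-covariance.
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx)).T
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy)).T
    U, s, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    return Wx @ U[:, :k], Wy @ Vt[:k].T, s[:k]
```

Projecting each view with `A` and `B` gives maximally correlated low dimensional coordinates, which is the role the past and future context views play in LR-MVL.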