• Documents
  • Authors
  • Tables
  • Other Seers ▼
    RefSeer AckSeer CollabSeer SeerSeer
  • Log in
  • Sign up
  • MetaCart

CiteSeerX logo

Advanced Search Include Citations
Advanced Search Include Citations | Disambiguate

Combining Distributional and Morphological Information for Part of Speech Induction (2003)

by Alexander Clark
Add To MetaCart

Tools

Sorted by:
Results 1 - 10 of 21
Next 10 →

Unsupervised Argument Identification for Semantic Role Labeling

by Omri Abend, Roi Reichart
"... The task of Semantic Role Labeling (SRL) is often divided into two sub-tasks: verb argument identification, and argument classification. Current SRL algorithms show lower results on the identification sub-task. Moreover, most SRL algorithms are supervised, relying on large amounts of manually create ..."
Abstract - Cited by 8 (1 self) - Add to MetaCart
The task of Semantic Role Labeling (SRL) is often divided into two sub-tasks: verb argument identification, and argument classification. Current SRL algorithms show lower results on the identification sub-task. Moreover, most SRL algorithms are supervised, relying on large amounts of manually created data. In this paper we present an unsupervised algorithm for identifying verb arguments, where the only type of annotation required is POS tagging. The algorithm makes use of a fully unsupervised syntactic parser, using its output in order to detect clauses and gather candidate argument collocation statistics. We evaluate our algorithm on PropBank10, achieving a precision of 56%, as opposed to 47 % of a strong baseline. We also obtain an 8 % increase in precision for a Spanish corpus. This is the first paper that tackles unsupervised verb argument identification without using manually encoded rules or extensive lexical or syntactic resources. 1

Unsupervised induction of labeled parse trees by clustering with syntactic features. COLING ’08

by Roi Reichart , 2008
"... We present an algorithm for unsupervised induction of labeled parse trees. The algorithm has three stages: bracketing, initial labeling, and label clustering. Bracketing is done from raw text using an unsupervised incremental parser. Initial labeling is done using a merging model that aims at minimi ..."
Abstract - Cited by 8 (4 self) - Add to MetaCart
We present an algorithm for unsupervised induction of labeled parse trees. The algorithm has three stages: bracketing, initial labeling, and label clustering. Bracketing is done from raw text using an unsupervised incremental parser. Initial labeling is done using a merging model that aims at minimizing the grammar description length. Finally, labels are clustered to a desired number of labels using syntactic features extracted from the initially labeled trees. The algorithm obtains 59% labeled f-score on the WSJ10 corpus, as compared to 35 % in previous work, and substantial error reduction over a random baseline. We report results for English, German and Chinese corpora, using two label mapping methods and two label set sizes. 1

Automatic Selection of High Quality Parses Created By a Fully Unsupervised Parser

by Roi Reichart
"... The average results obtained by unsupervised statistical parsers have greatly improved in the last few years, but on many specific sentences they are of rather low quality. The output of such parsers is becoming valuable for various applications, and it is radically less expensive to create than man ..."
Abstract - Cited by 5 (2 self) - Add to MetaCart
The average results obtained by unsupervised statistical parsers have greatly improved in the last few years, but on many specific sentences they are of rather low quality. The output of such parsers is becoming valuable for various applications, and it is radically less expensive to create than manually annotated training data. Hence, automatic selection of high quality parses created by unsupervised parsers is an important problem. In this paper we present PUPA, a POS-based Unsupervised Parse Assessment algorithm. The algorithm assesses the quality of a parse tree using POS sequence statistics collected from a batch of parsed sentences. We evaluate the algorithm by using an unsupervised POS tagger and an unsupervised parser, selecting high quality parsed sentences from English (WSJ) and German (NEGRA) corpora. We show that PUPA outperforms the leading previous parse assessment algorithm for supervised parsers, as well as a strong unsupervised baseline. Consequently, PUPA allows obtaining high quality parses without any human involvement. 1

Special issue on “Probabilistic models of cognition

by Nick Chater, Christopher D. Manning - Trends in Cognitive Sciences
"... Probabilistic methods are providing new explanatory approaches to fundamental cognitive science questions of how humans structure, process and acquire language. This review examines probabilistic models defined over traditional symbolic structures. Language comprehension and production involve proba ..."
Abstract - Cited by 4 (0 self) - Add to MetaCart
Probabilistic methods are providing new explanatory approaches to fundamental cognitive science questions of how humans structure, process and acquire language. This review examines probabilistic models defined over traditional symbolic structures. Language comprehension and production involve probabilistic inference in such models; and acquisition involves choosing the best model, given innate constraints and linguistic and other input. Probabilistic models can account for the learning and processing of language, while maintaining the sophistication of symbolic models. A recent burgeoning of theoretical developments and online corpus creation has enabled large models to be tested, revealing probabilistic constraints in processing, undermining acquisition arguments based on a perceived poverty

Improved Fully Unsupervised Parsing with Zoomed Learning

by Roi Reichart
"... We introduce a novel training algorithm for unsupervised grammar induction, called Zoomed Learning. Given a training set T and a test set S, the goal of our algorithm is to identify subset pairs Ti, Si of T and S such that when the unsupervised parser is trained on a training subset Ti its results o ..."
Abstract - Cited by 3 (0 self) - Add to MetaCart
We introduce a novel training algorithm for unsupervised grammar induction, called Zoomed Learning. Given a training set T and a test set S, the goal of our algorithm is to identify subset pairs Ti, Si of T and S such that when the unsupervised parser is trained on a training subset Ti its results on its paired test subset Si are better than when it is trained on the entire training set T. A successful application of zoomed learning improves overall performance on the full test set S. We study our algorithm’s effect on the leading algorithm for the task of fully unsupervised parsing (Seginer, 2007) in three different English domains, WSJ, BROWN and GENIA, and show that it improves the parser F-score by up to 4.47%. 1

Improved Unsupervised POS Induction through Prototype Discovery

by Omri Abend, Roi Reichart
"... We present a novel fully unsupervised algorithm for POS induction from plain text, motivated by the cognitive notion of prototypes. The algorithm first identifies landmark clusters of words, serving as the cores of the induced POS categories. The rest of the words are subsequently mapped to these cl ..."
Abstract - Cited by 3 (0 self) - Add to MetaCart
We present a novel fully unsupervised algorithm for POS induction from plain text, motivated by the cognitive notion of prototypes. The algorithm first identifies landmark clusters of words, serving as the cores of the induced POS categories. The rest of the words are subsequently mapped to these clusters. We utilize morphological and distributional representations computed in a fully unsupervised manner. We evaluate our algorithm on English and German, achieving the best reported results for this task. 1

Fully Unsupervised Core-Adjunct Argument Classification

by Omri Abend
"... The core-adjunct argument distinction is a basic one in the theory of argument structure. The task of distinguishing between the two has strong relations to various basic NLP tasks such as syntactic parsing, semantic role labeling and subcategorization acquisition. This paper presents a novel unsupe ..."
Abstract - Cited by 3 (0 self) - Add to MetaCart
The core-adjunct argument distinction is a basic one in the theory of argument structure. The task of distinguishing between the two has strong relations to various basic NLP tasks such as syntactic parsing, semantic role labeling and subcategorization acquisition. This paper presents a novel unsupervised algorithm for the task that uses no supervised models, utilizing instead state-of-the-art syntactic induction algorithms. This is the first work to tackle this task in a fully unsupervised scenario. 1

Type Level Clustering Evaluation: New Measures and a POS Induction Case Study

by Roi Reichart, Omri Abend
"... Clustering is a central technique in NLP. Consequently, clustering evaluation is of great importance. Many clustering algorithms are evaluated by their success in tagging corpus tokens. In this paper we discuss type level evaluation, which reflects class membership only and is independent of the tok ..."
Abstract - Cited by 3 (0 self) - Add to MetaCart
Clustering is a central technique in NLP. Consequently, clustering evaluation is of great importance. Many clustering algorithms are evaluated by their success in tagging corpus tokens. In this paper we discuss type level evaluation, which reflects class membership only and is independent of the token statistics of a particular reference corpus. Type level evaluation casts light on the merits of algorithms, and for some applications is a more natural measure of the algorithm’s quality. We propose new type level evaluation measures that, contrary to existing measures, are applicable when items are polysemous, the common case in NLP. We demonstrate the benefits of our measures using a detailed case study, POS induction. We experiment with seven leading algorithms, obtaining useful insights and showing that token and type level measures can weakly or even negatively correlate, which underscores the fact that these two approaches reveal different aspects of clustering quality. 1

Improved Unsupervised POS Induction Using Intrinsic Clustering Quality and a Zipfian Constraint

by Roi Reichart, Raanan Fattal
"... Modern unsupervised POS taggers usually apply an optimization procedure to a nonconvex function, and tend to converge to local maxima that are sensitive to starting conditions. The quality of the tagging induced by such algorithms is thus highly variable, and researchers report average results over ..."
Abstract - Cited by 2 (1 self) - Add to MetaCart
Modern unsupervised POS taggers usually apply an optimization procedure to a nonconvex function, and tend to converge to local maxima that are sensitive to starting conditions. The quality of the tagging induced by such algorithms is thus highly variable, and researchers report average results over several random initializations. Consequently, applications are not guaranteed to use an induced tagging of the quality reported for the algorithm. In this paper we address this issue using an unsupervised test for intrinsic clustering quality. We run a base tagger with different random initializations, and select the best tagging using the quality test. As a base tagger, we modify a leading unsupervised POS tagger (Clark, 2003) to constrain the distributions of word types across clusters to be Zipfian, allowing us to utilize a perplexity-based quality test. We show that the correlation between our quality test and gold standard-based tagging quality measures is high. Our results are better in most evaluation measures than all results reported in the literature for this task, and are always better than the Clark average results. 1

Part-of-Speech Tagging from 97 % to 100%: Is It Time for Some Linguistics?

by Christopher D. Manning
"... Abstract. I examine what would be necessary to move part-of-speech tagging performance from its current level of about 97.3 % token accuracy (56 % sentence accuracy) to close to 100 % accuracy. I suggest that it must still be possible to greatly increase tagging performance and examine some useful i ..."
Abstract - Cited by 1 (0 self) - Add to MetaCart
Abstract. I examine what would be necessary to move part-of-speech tagging performance from its current level of about 97.3 % token accuracy (56 % sentence accuracy) to close to 100 % accuracy. I suggest that it must still be possible to greatly increase tagging performance and examine some useful improvements that have recently been made to the Stanford Part-of-Speech Tagger. However, an error analysis of some of the remaining errors suggests that there is limited further mileage to be had either from better machine learning or better features in a discriminative sequence classifier. The prospects for further gains from semisupervised learning also seem quite limited. Rather, I suggest and begin to demonstrate that the largest opportunity for further progress comes fromimprovingthetaxonomicbasisofthelinguistic resourcesfrom which taggers are trained. That is, from improved descriptive linguistics. However, I conclude by suggesting that there are also limits to this process. The status of some words may not be able to be adequately captured
The National Science Foundation
  • About CiteSeerX
  • Submit Documents
  • Privacy Policy
  • Help
  • Data
  • Source
  • Contact Us

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2010 The Pennsylvania State University