Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network
- In Proceedings of HLT-NAACL, 2003
Cited by 181 (12 self)
We present a new part-of-speech tagger that demonstrates the following ideas: (i) explicit use of both preceding and following tag contexts via a dependency network representation, (ii) broad use of lexical features, including jointly conditioning on multiple consecutive words, (iii) effective use of priors in conditional loglinear models, and (iv) fine-grained modeling of unknown word features. Using these ideas together, the resulting tagger gives a 97.24% accuracy on the Penn Treebank WSJ, an error reduction of 4.4% on the best previous single automatically learned tagging result.
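The relationship between the reported accuracy and the reported error reduction can be checked with a little arithmetic. The sketch below assumes that the 4.4% figure is a relative reduction in token error rate; the abstract does not state the baseline accuracy, so the value computed here is only what that assumption implies. The function names are illustrative.

```python
def relative_error_reduction(old_acc, new_acc):
    """Fraction of the old error rate that the new system removes."""
    old_err = 1.0 - old_acc
    new_err = 1.0 - new_acc
    return (old_err - new_err) / old_err


def implied_baseline_accuracy(new_acc, reduction):
    """Baseline accuracy implied by a new accuracy and a relative error reduction."""
    new_err = 1.0 - new_acc
    old_err = new_err / (1.0 - reduction)
    return 1.0 - old_err


# Figures from the abstract: 97.24% accuracy, 4.4% error reduction.
baseline = implied_baseline_accuracy(0.9724, 0.044)
print(f"implied baseline accuracy: {baseline:.4%}")
print(f"round-trip reduction: {relative_error_reduction(baseline, 0.9724):.3%}")
```

Under this reading, the earlier best result would sit at roughly 97.11% accuracy, so a gain of about 0.13 percentage points corresponds to the 4.4% reduction in errors.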
An Intrinsic Information Content Metric for Semantic Similarity in WordNet
- 2004
Cited by 39 (2 self)
Information Content (IC) is an important dimension of word knowledge when assessing the similarity of two terms or word senses. The conventional way of measuring the IC of word senses is to combine knowledge of their hierarchical structure from an ontology like WordNet with statistics on their actual usage in text as derived from a large corpus. In this paper we present a wholly intrinsic measure of IC that relies on hierarchical structure alone. We report that this measure is consequently easier to calculate, yet when used as the basis of a similarity mechanism it yields judgments that correlate more closely with human assessments than other, extrinsic measures of IC that additionally employ corpus analysis.
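The abstract does not give the formula, so the following is only a sketch of one way an intrinsic IC measure can be defined from hierarchy structure alone: a concept with many hyponyms is treated as less informative, using IC(c) = 1 - log(hypo(c) + 1) / log(N), where hypo(c) is the number of descendants of c and N is the total number of concepts. The toy taxonomy and function names are hypothetical and are not taken from WordNet or from the paper.

```python
import math

# Hypothetical toy taxonomy (parent -> children); not real WordNet data.
TAXONOMY = {
    "entity": ["animal", "artifact"],
    "animal": ["dog", "cat"],
    "artifact": ["car"],
    "dog": [],
    "cat": [],
    "car": [],
}


def count_hyponyms(concept):
    """Number of descendants of a concept in the toy taxonomy."""
    children = TAXONOMY[concept]
    return len(children) + sum(count_hyponyms(child) for child in children)


def intrinsic_ic(concept):
    """Assumed formula: 1 - log(hypo(c) + 1) / log(total number of concepts)."""
    total = len(TAXONOMY)
    return 1.0 - math.log(count_hyponyms(concept) + 1) / math.log(total)


for name in TAXONOMY:
    print(f"{name:9s} IC = {intrinsic_ic(name):.3f}")
```

In this sketch the root scores 0 and every leaf scores 1, with no corpus counts involved, which is the property the abstract describes as "wholly intrinsic".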
SVMTool: A general POS tagger generator based on Support Vector Machines
- 2004
Cited by 34 (0 self)
This report presents SVMTool, a simple, flexible, effective, and efficient part-of-speech tagger based on Support Vector Machines. SVMTool offers a fairly good balance among these properties, which makes it practical for current NLP applications. It is easy to use and easily configurable to fit the needs of a number of different applications. Results are also very competitive: it achieves an accuracy of 97.16% for English on the Wall Street Journal corpus. It has also been successfully applied to Spanish and Catalan, with similar performance. A first release of the SVMTool Perl prototype is now freely available for public use. A more efficient C++ version is coming very soon, by summer 2004.
An Unsupervised Morpheme-Based HMM for Hebrew Morphological Disambiguation
- In COLING/ACL, 2006
Cited by 18 (6 self)
Morphological disambiguation is the process of assigning one set of morphological features to each individual word in a text. When the word is ambiguous (there are several possible analyses for the word), a disambiguation procedure based on the word context must be applied. This paper deals with morphological disambiguation of the Hebrew language, which combines morphemes into a word in both agglutinative and fusional ways. We present an unsupervised stochastic model (the only resource we use is a morphological analyzer) that deals with the data sparseness problem caused by the affixational morphology of the Hebrew language. We present a text encoding method for languages with affixational morphology in which the knowledge of word formation rules (which are quite restricted in Hebrew) helps in the disambiguation. We adapt HMM algorithms for learning and searching this text representation, in such a way that segmentation and tagging can be learned in parallel in one step. Results of a large-scale evaluation indicate that this learning improves disambiguation for complex tag sets. Our method is applicable to other languages with affix morphology.
Fast and Accurate Part-of-Speech Tagging: The SVM Approach Revisited
- 2003
Cited by 16 (2 self)
In this paper we present a very simple and effective part-of-speech tagger based on Support Vector Machines (SVM). Simplicity and efficiency are achieved by working with linear separators in the primal formulation of SVM, and by using a greedy left-to-right tagging scheme.
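A greedy left-to-right scheme with linear separators can be sketched as follows: each tag has a weight vector, each token is scored by a dot product over its active features, and the winning tag is committed immediately and fed into the features of the next token. The weights, feature templates, and tag set below are hypothetical hand-set values standing in for learned SVM weights; they are not the paper's features.

```python
# Hypothetical hand-set weights standing in for learned linear SVM weights.
WEIGHTS = {
    "DT": {"word=the": 2.0},
    "NN": {"word=dog": 2.0, "prev=DT": 1.0, "word=barks": 0.5},
    "VBZ": {"word=barks": 1.0, "prev=NN": 1.0, "suffix=s": 0.5},
}


def features(words, i, prev_tag):
    """Binary features for token i; only left context is available when tagging greedily."""
    word = words[i].lower()
    feats = ["word=" + word, "prev=" + prev_tag]
    if word.endswith("s"):
        feats.append("suffix=s")
    return feats


def score(tag, feats):
    """Dot product of the tag's weight vector with the active binary features."""
    weights = WEIGHTS[tag]
    return sum(weights.get(f, 0.0) for f in feats)


def greedy_tag(words):
    """Tag left to right, committing to the best-scoring tag at each position."""
    tags = []
    prev = "<s>"  # sentence-start marker
    for i in range(len(words)):
        feats = features(words, i, prev)
        prev = max(WEIGHTS, key=lambda tag: score(tag, feats))
        tags.append(prev)
    return tags


print(greedy_tag(["The", "dog", "barks"]))
```

Because each decision is final, tagging takes one pass per sentence with no search over tag sequences; the trade-off is that an early mistake propagates through the `prev=` feature.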
Impact of Automatic Comma Prediction on POS/Name Tagging of Speech
- Proc. of the IEEE/ACL 2006 Workshop on Spoken Language Technology, 2006
Cited by 13 (8 self)
This work looks at the impact of automatically predicted commas on part-of-speech (POS) and name tagging of speech recognition transcripts of Mandarin broadcast news. There is a significant gain in both POS and name tagging accuracy due to using automatically predicted commas over sentence boundary prediction alone. One difference between Mandarin and English is that there are two types of commas, and experiments here show that, while they can be reliably distinguished in automatic prediction, the distinction does not give a clear benefit for POS or name tagging. Index Terms: natural language, speech recognition
Mandarin Part-Of-Speech Tagging and Discriminative Reranking
- Proc. of EMNLP, 2007
Cited by 13 (2 self)
In this paper we present methods to improve HMM-based part-of-speech (POS) tagging of Mandarin. We model the emission probability of an unknown word using all the characters in the word, and enrich the standard left-to-right trigram estimation of word emission probabilities with a right-to-left prediction of the word by making use of the current and next tags. In addition, we utilize the RankBoost-based reranking algorithm to rerank the N-best outputs of the HMM-based tagger using various n-gram, morphological, and dependency features. Two methods are proposed to improve the generalization performance of the reranking algorithm. Our reranking model achieves an accuracy of 94.68% using n-gram and morphological features on the Penn Chinese Treebank 5.2, and is able to further improve the accuracy to 95.11% with the addition of dependency features.
A joint language model with fine-grain syntactic tags
- In EMNLP, 2009
Cited by 12 (6 self)
We present a scalable joint language model designed to utilize fine-grain syntactic tags. We discuss challenges such a design faces and describe our solutions that scale well to large tagsets and corpora. We advocate the use of relatively simple tags that do not require deep linguistic knowledge of the language but provide more structural information than POS tags and can be derived from automatically generated parse trees, a combination of properties that allows easy adoption of this model for new languages. We propose two fine-grain tagsets and evaluate our model using these tags, as well as POS tags and SuperARV tags, in a speech recognition task, and discuss future directions.
A global optimization framework for meeting summarization
- In Proc. IEEE ICASSP, 2009
Cited by 11 (3 self)
We introduce a model for extractive meeting summarization based on the hypothesis that utterances convey bits of information, or concepts. Using keyphrases as concepts weighted by frequency, and an integer linear program to determine the best set of utterances, that is, covering as many concepts as possible while satisfying a length constraint, we achieve ROUGE scores at least as good as a ROUGE-based oracle derived from human summaries. This brings us to a critical discussion of ROUGE and the future of extractive meeting summarization. Index Terms: meeting summarization, integer linear programming, summarization evaluation
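The objective described above (maximize the total weight of covered concepts subject to a length budget) can be illustrated on a toy input. Note that the sketch below swaps the integer linear program for exhaustive search over subsets, which reaches the same optimum on tiny inputs but does not scale; the utterances, concept weights, and function name are all hypothetical.

```python
from itertools import combinations


def best_summary(utterances, concept_weights, max_words):
    """Exhaustively pick the utterance subset with the highest covered-concept weight.

    Each concept counts once, no matter how many selected utterances contain it.
    """
    best_score, best_subset = -1.0, ()
    indices = range(len(utterances))
    for size in range(len(utterances) + 1):
        for subset in combinations(indices, size):
            length = sum(len(utterances[i][0].split()) for i in subset)
            if length > max_words:
                continue  # violates the length constraint
            covered = set()
            for i in subset:
                covered |= utterances[i][1]
            score = sum(concept_weights[c] for c in covered)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_score, best_subset


# Hypothetical utterances, each paired with the concepts (keyphrases) it contains.
UTTERANCES = [
    ("we should cut the budget", {"budget"}),
    ("the budget and the deadline both slipped", {"budget", "deadline"}),
    ("marketing wants a new logo", {"logo"}),
]
CONCEPT_WEIGHTS = {"budget": 3.0, "deadline": 2.0, "logo": 1.0}

print(best_summary(UTTERANCES, CONCEPT_WEIGHTS, max_words=12))
```

With a 12-word budget, the first two utterances together score lower than the second and third, because "budget" is only credited once; this redundancy penalty is what distinguishes concept coverage from summing per-utterance scores.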
Trigram morphosyntactic tagger for Polish
- In Proceedings of the International IIS:IIPWM'04 Conference, 2004
Cited by 7 (0 self)
We introduce an implementation of a plain trigram part-of-speech tagger which appears to work well on Polish texts. At present the tagger achieves a 9.4% error rate, which makes it significantly better than our previous stochastic disambiguator. Since the trigram model for Polish behaves similarly to that for Czech, we hope to reach the Czech state-of-the-art error rate when the quality of the training data improves.

