An end-to-end discriminative approach to machine translation
- In Proceedings of the Joint International Conference on Computational Linguistics and Association for Computational Linguistics (COLING/ACL), 2006
Cited by 77 (2 self)
We present a perceptron-style discriminative approach to machine translation in which large feature sets can be exploited. Unlike discriminative reranking approaches, our system can take advantage of learned features in all stages of decoding. We first discuss several challenges to error-driven discriminative approaches. In particular, we explore different ways of updating parameters given a training example. We find that making frequent but smaller updates is preferable to making fewer but larger updates. Then, we discuss an array of features and show both how they quantitatively increase BLEU score and how they qualitatively interact on specific examples. One particular feature we investigate is a novel way to introduce learning into the initial phrase extraction process, which has previously been entirely heuristic.
Improvements In Part-of-Speech Tagging With an Application To German
- In Proceedings of the ACL SIGDAT-Workshop, 1995
Cited by 73 (1 self)
This paper presents a couple of extensions to a basic Markov Model tagger (called TreeTagger) which improve its accuracy when trained on small corpora. The basic tagger was originally developed for English [Schmid, 1994]. The extensions together reduced error rates on a German test corpus by more than a third.
Learning Taxonomic Relations from Heterogeneous Evidence
Cited by 63 (8 self)
We present a novel approach to the automatic acquisition of taxonomic relations. The main difference from earlier approaches is that we do not consider only a single source of evidence, i.e. a specific algorithm or approach, but examine the possibility of learning taxonomic relations by considering various heterogeneous forms of evidence. In particular, we derive these different kinds of evidence using well-known NLP techniques and resources and combine them via two simple strategies. Our approach shows very promising results compared to other results from the literature. The main aim of the work presented in this paper is (i) to gain insight into the behaviour of different approaches to learning taxonomic relations, (ii) to provide a first step towards combining these different approaches, and (iii) to establish a baseline for further research.
Reading Tea Leaves: How Humans Interpret Topic Models
Cited by 45 (5 self)
Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Practitioners typically assume that the latent space is semantically meaningful. It is used to check models, summarize the corpus, and guide exploration of its contents. However, whether the latent space is interpretable is in need of quantitative evaluation. In this paper, we present new quantitative methods for measuring semantic meaning in inferred topics. We back these measures with large-scale user studies, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood. Surprisingly, topic models which perform better on held-out likelihood may infer less semantically meaningful topics.
Shallow Morphological Analysis in Monolingual Information Retrieval for Dutch, German and Italian
- Evaluation of Cross-Language Information Retrieval Systems, CLEF 2001, volume 2406 of Lecture Notes in Computer Science, 2001
Cited by 41 (13 self)
This paper describes the experiments of our team for CLEF 2001, which include both official and post-submission runs. We took part in the monolingual task for Dutch, German, and Italian. The focus of our experiments was on the effects of morphological analyses such as stemming and compound splitting on retrieval effectiveness. Confirming earlier reports on retrieval in compound splitting languages such as Dutch and German, we found improvements to be around 25% for German and as much as 69% for Dutch. For Italian, lexicon-based stemming resulted in gains of up to 25%.
Learning Bilingual Lexicons from Monolingual Corpora
Cited by 30 (1 self)
We present a method for learning bilingual translation lexicons from monolingual corpora. Word types in each language are characterized by purely monolingual features, such as context counts and orthographic substrings. Translations are induced using a generative model based on canonical correlation analysis, which explains the monolingual lexicons in terms of latent matchings. We show that high-precision lexicons can be learned in a variety of language pairs and from a range of corpus types.
An Environment for Morphosyntactic Processing of Unrestricted Spanish Text
- 1998
Cited by 24 (6 self)
We present in this paper a fast, broad-coverage, accurate morphological analyzer for Spanish words, MACO+, which is an extended and improved version of that described in (Acebo et al., 1994). The earlier version had two main flaws: it was not transportable, and it was too slow to enable massive text processing. The presented system not only overcomes those two flaws, but also offers improved coverage and accuracy. We also present two general part-of-speech taggers, which can be used to disambiguate the output of the morphological analyzer. All modules run on any Unix/Linux machine as a pipeline process, and they may also be used inside the GATE environment for NLP (Cunningham et al., 1996). The system is currently being used to annotate the LexEsp corpus, a 5.5 million word corpus of Spanish, in a bootstrapping refining procedure. Initial evaluation and results are reported. Keywords: Morphological analysis, corpus linguistics, POS tagging, linguistic resources.
Unsupervised part-of-speech tagging employing efficient graph clustering
- In Proceedings of the COLING/ACL 2006 Student Research Workshop, 2006
Cited by 24 (1 self)
An unsupervised part-of-speech (POS) tagging system that relies on graph clustering methods is described. Unlike in current state-of-the-art approaches, the kind and number of different tags is generated by the method itself. We compute and merge two partitionings of word graphs: one based on context similarity of high frequency words, another on log-likelihood statistics for words of lower frequencies. Using the resulting word clusters as a lexicon, a Viterbi POS tagger is trained, which is refined by a morphological component. The approach is evaluated on three different languages by measuring agreement with existing taggers.
Morphological Tagging: Data vs. Dictionaries
- 2000
Cited by 22 (1 self)
Part of Speech tagging for English seems to have reached the human levels of error, but full morphological tagging for inflectionally rich languages, such as Romanian, Czech, or Hungarian, is still an open problem, and the results are far from being satisfactory. This paper presents results obtained by using a universalized exponential feature-based model for five such languages. It focuses on the data sparseness issue, which is especially severe for such languages (all the more so because there are no extensive annotated data for those languages). In conclusion, we argue strongly that the use of an independent morphological dictionary is preferable to more annotated data under such circumstances.
Extracting Semantic Orientations of Words using Spin Model
- In ACL, 2005
Cited by 21 (0 self)
We propose a method for extracting semantic orientations of words: desirable or undesirable. Regarding semantic orientations as spins of electrons, we use the mean field approximation to compute the approximate probability function of the system instead of the intractable actual probability function. We also propose a criterion for parameter selection on the basis of magnetization. Given only a small number of seed words, the proposed method extracts semantic orientations with high accuracy in the experiments on an English lexicon. The result is comparable to the best value ever reported.

