Automatically Learning Source-side Reordering Rules for Large Scale Machine Translation
Cited by 1 (0 self)
We describe an approach to automatically learn reordering rules to be applied as a preprocessing step in phrase-based machine translation. We learn rules for 8 different language pairs, showing BLEU improvements for all of them, and demonstrate that many important order transformations (SVO to SOV or VSO, head-modifier, verb movement) can be captured by this approach.
A semi-supervised approach to question classification
Cited by 1 (0 self)
This paper presents a machine learning approach to question classification. We have defined a kernel function based on latent semantic information acquired from unlabeled data. This kernel allows external semantic knowledge to be included in the supervised learning process. We have combined this knowledge with a bag-of-words approach by means of composite kernels to obtain state-of-the-art results. As the semantic information is acquired from unlabeled text, our system can be easily adapted to different languages and domains.
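To make the idea of a composite kernel concrete, here is a minimal sketch in Python. It is not the paper's implementation: the `LATENT` term weights are made-up values standing in for semantic information that would be learned from unlabeled text, and combining the two kernels by a normalized sum is an assumption, chosen because a sum of valid kernels is again a valid kernel.

```python
import math

# Hypothetical term -> latent-dimension weights (assumption). In practice
# these would be estimated from unlabeled text, not written by hand.
LATENT = {
    "city":    [0.9, 0.1],
    "capital": [0.8, 0.2],
    "who":     [0.1, 0.9],
    "wrote":   [0.2, 0.8],
}

def bow_kernel(a, b):
    """Bag-of-words kernel: dot product of term-count vectors."""
    return float(sum(a.count(t) * b.count(t) for t in set(a) & set(b)))

def project(tokens, dims=2):
    """Sum the latent vectors of the known terms in a question."""
    vec = [0.0] * dims
    for t in tokens:
        for i, w in enumerate(LATENT.get(t, [0.0] * dims)):
            vec[i] += w
    return vec

def semantic_kernel(a, b):
    """Dot product of the two questions in the latent space."""
    return sum(x * y for x, y in zip(project(a), project(b)))

def normalized(kernel, a, b):
    """Cosine-normalize a kernel; returns 0.0 if either self-similarity is 0."""
    denom = math.sqrt(kernel(a, a) * kernel(b, b))
    return kernel(a, b) / denom if denom > 0 else 0.0

def composite_kernel(a, b):
    """Sum of the normalized bag-of-words and semantic kernels."""
    return normalized(bow_kernel, a, b) + normalized(semantic_kernel, a, b)

q1 = "what city is the capital".split()
q2 = "which capital city".split()
q3 = "who wrote hamlet".split()
print(composite_kernel(q1, q2), composite_kernel(q1, q3))
```

In this toy setting the two location questions score higher against each other than against the authorship question, even though their surface wording differs, which is the effect the semantic component is meant to contribute.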
Effective Measures of Domain Similarity for Parsing
Cited by 1 (1 self)
It is well known that parsing accuracy suffers when a model is applied to out-of-domain data. It is also known that the most beneficial data for parsing a given domain is data that matches the domain (Sekine, 1997; Gildea, 2001). Hence, an important task is to select appropriate domains. However, most previous work on domain adaptation relied on the implicit assumption that domains are somehow given. As more and more data becomes available, automatic ways to select data that is beneficial for a new (unknown) target domain are becoming attractive. This paper evaluates various ways to automatically acquire related training data for a given test set. The results show that an unsupervised technique based on topic models is effective: it outperforms random data selection on both languages examined, English and Dutch. Moreover, the technique works better than manually assigned labels gathered from meta-data that is available for English.
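One way to picture topic-model-based data selection is to compare topic distributions and rank candidate training sets by how close they are to the test set. The sketch below assumes the topic distributions have already been inferred, and it uses Jensen-Shannon divergence as the similarity measure; both the measure and the example distributions are illustrative assumptions, not details taken from the abstract.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by 1 in base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rank_by_similarity(target, candidates):
    """Return candidate names ordered from most to least similar to target."""
    return sorted(candidates, key=lambda name: js_divergence(target, candidates[name]))

# Hypothetical topic distributions over three topics (assumption).
test_set = [0.7, 0.2, 0.1]
training_pools = {
    "news":    [0.6, 0.3, 0.1],
    "sports":  [0.1, 0.1, 0.8],
    "finance": [0.3, 0.4, 0.3],
}
print(rank_by_similarity(test_set, training_pools))
```

A data-selection step would then take the top-ranked pools as training material for the parser.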
Language-independent Compound Splitting with Morphological Operations
Cited by 1 (0 self)
Translating compounds is an important problem in machine translation. Since many compounds have not been observed during training, they pose a challenge for translation systems. Previous decompounding methods have often been restricted to a small set of languages as they cannot deal with more complex compound forming processes. We present a novel and unsupervised method to learn the compound parts and morphological operations needed to split compounds into their compound parts. The method uses a bilingual corpus to learn the morphological operations required to split a compound into its parts. Furthermore, monolingual corpora are used to learn and filter the set of compound part candidates. We evaluate our method within a machine translation task and show significant improvements for various languages, demonstrating the versatility of the approach.
Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
Cited by 1 (0 self)
We use search engine results to address a particularly difficult cross-domain language processing task, the adaptation of named entity recognition (NER) from news text to web queries. The key novelty of the method is that we submit a token with context to a search engine and use similar contexts in the search results as additional information for correctly classifying the token. We achieve strong gains in NER performance on news, in-domain and out-of-domain, and on web queries.
Further Advances in Forecasting Day-Ahead Electricity Prices Using Time Series Models
Forecasting prices in electricity markets is critical for consumers and producers in planning their operations and managing their price risk. We utilize the generalized autoregressive conditionally heteroskedastic (GARCH) method to forecast the electricity prices in two regions of New York: New York City and Central New York State. We contrast the one-day forecasts of the GARCH model against techniques such as dynamic regression, transfer function models, and exponential smoothing. We also examine the effect on our forecasting of omitting some of the extreme values in the electricity prices. We show that accounting for the extreme values and the heteroskedastic variance in the electricity price time series can significantly improve the accuracy of the forecasting. Additionally, we document the higher volatility in New York City electricity prices. Differences in volatility between regions are important in the pricing of electricity options and for analyzing market performance.
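The core of a GARCH model is a recursion for the conditional variance. The sketch below shows the common GARCH(1,1) form, sigma2[t] = omega + alpha * eps[t-1]^2 + beta * sigma2[t-1]. The abstract does not state the model order or parameter values, so the (1,1) order, the parameters, and the residuals used here are all assumptions for illustration.

```python
def garch11_variances(residuals, omega, alpha, beta):
    """Return conditional variances sigma2[0..n] for n residuals.

    sigma2[0] is the unconditional (long-run) variance used as a start value,
    and sigma2[n] is the one-step-ahead forecast after the last residual.
    """
    if not (omega > 0 and alpha >= 0 and beta >= 0 and alpha + beta < 1):
        raise ValueError("need omega > 0, alpha >= 0, beta >= 0, alpha + beta < 1")
    sigma2 = [omega / (1.0 - alpha - beta)]
    for eps in residuals:
        # Yesterday's squared shock and yesterday's variance drive today's variance.
        sigma2.append(omega + alpha * eps ** 2 + beta * sigma2[-1])
    return sigma2

# Hypothetical price-model residuals: a calm day followed by a price spike.
variances = garch11_variances([0.0, 2.0], omega=0.1, alpha=0.2, beta=0.7)
print(variances)
```

The spike on the second day raises the forecast variance for the following day, which is how such a model reflects the clustering of volatility around extreme prices.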
Reduction of Dutch Sentences for Automatic Subtitling
- Computational Linguistics in the Netherlands 2003. Selected Papers from the Fourteenth CLIN Meeting, 2003
We compare machine learning approaches for sentence length reduction for automatic generation of subtitles for deaf and hearing-impaired people with a method which relies on hand-crafted deletion rules. We describe building the necessary resources for this task: a parallel corpus of examples of news broadcasts of the Flemish VRT broadcasting corporation, and a Dutch shallow parser based on the material of the Spoken Dutch Corpus (CGN). We evaluate the sentence simplifiers and discuss their performance.
Appropriate Kernel Functions for Support Vector Machine Learning with Sequences of Symbolic Data
, 2005
In classification problems, machine learning algorithms often make use of the assumption that (dis)similar inputs lead to (dis)similar outputs. In this case, two questions naturally arise: what does it mean for two inputs to be similar, and how can this be used in a learning algorithm? In support vector machines, similarity between input examples is implicitly expressed by a kernel function that calculates inner products in the feature space. For numerical input examples the concept of an inner product is easy to define. For discrete structures like sequences of symbolic data, however, these concepts are less obvious. This article describes an approach to SVM learning for symbolic data that can serve as an alternative to the bag-of-words approach under certain circumstances.
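As a simple illustration of an inner product on symbolic sequences, the sketch below counts the positions at which two equal-length sequences carry the same symbol. This overlap kernel is chosen here only as an example and is not claimed to be the kernel the article proposes. It is a valid kernel because it equals the ordinary dot product of position-wise one-hot encodings, which the helper functions make explicit.

```python
def overlap_kernel(x, y):
    """Number of positions where two equal-length symbol sequences agree."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(x, y) if a == b)

def one_hot(seq, alphabet):
    """Concatenate one indicator vector over the alphabet per position."""
    vec = []
    for symbol in seq:
        vec.extend(1 if symbol == a else 0 for a in alphabet)
    return vec

def dot(u, v):
    """Ordinary inner product of two numeric vectors."""
    return sum(a * b for a, b in zip(u, v))

x = ["the", "cat", "sat"]
y = ["the", "dog", "sat"]
alphabet = sorted(set(x) | set(y))
print(overlap_kernel(x, y), dot(one_hot(x, alphabet), one_hot(y, alphabet)))
```

Unlike a bag-of-words representation, this kernel is sensitive to where a symbol occurs, so the same words in a different order would not count as a match.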
IXA NLP Group University of the Basque Country
This paper presents an empirical study on the robustness and generalization of two alternative role sets for semantic role labeling: PropBank numbered roles and VerbNet thematic roles. By testing a state-of-the-art SRL system with the two alternative role annotations, we show that the PropBank role set is more robust to the lack of verb-specific semantic information and generalizes better to infrequent and unseen predicates. Keeping in mind that thematic roles are better for application needs, we also tested the best way to generate VerbNet annotation. We conclude that tagging PropBank roles first and then mapping them into VerbNet roles is as effective as training and tagging directly on VerbNet, and more robust to domain shifts.
A Preliminary Study on the Robustness and Generalization of Role Sets for Semantic Role Labeling
Most Semantic Role Labeling (SRL) systems rely on available annotated corpora, with PropBank being the most widely used corpus so far. The PropBank role set is based on theory-neutral numbered arguments, which are linked to fine-grained verb-dependent semantic roles through the verb framesets. Recently, thematic roles from the computational verb lexicon VerbNet have been suggested to be more adequate for generalization and portability of SRL systems, since they represent a compact set of verb-independent general roles widely used in linguistic theory. Such thematic roles could also put SRL systems closer to application needs. This paper presents a comparative study of the behavior of a state-of-the-art SRL system on both role sets based on the SemEval-2007 English dataset, which comprises the 50 most frequent verbs in PropBank.

