Memory-Based Morphological Analysis
1999
Cited by 40 (15 self)
We present a general architecture for efficient and deterministic morphological analysis based on memory-based learning, and apply it to morphological analysis of Dutch. The system makes direct mappings from letters in context to rich categories that encode morphological boundaries, syntactic class labels, and spelling changes. Both precision and recall of labeled morphemes are over 84% on held-out dictionary test words and estimated to be over 93% in free text.
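The "letters in context" input described above can be pictured with a minimal sketch. It only shows how fixed-width letter windows might be cut from a word; the window width, the padding symbol, the function name `letter_windows`, and the example word are illustrative assumptions and are not taken from the paper. In a memory-based learner, each window would be stored together with the category of its middle letter.

```python
def letter_windows(word, left=2, right=2, pad="_"):
    """Return one fixed-width window per letter of `word`.

    The focus letter sits at position `left` of each window.
    Width (left/right) and the padding symbol are assumptions.
    """
    padded = pad * left + word + pad * right
    width = left + 1 + right
    return [padded[i:i + width] for i in range(len(word))]


# Hypothetical example word; each window would be paired with a class label
# (boundary, syntactic class, spelling change) during training.
windows = letter_windows("boek")
for focus, window in zip("boek", windows):
    print(focus, window)
```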
A Multi-Strategy Approach to Improving Pronunciation by Analogy
Cited by 25 (3 self)
Pronunciation by analogy (PbA) is a data-driven method for relating letters to sound, with potential application to next-generation text-to-speech systems. This paper extends previous work on PbA in several directions. First, we have included 'full' pattern matching between input letter string and dictionary entries, as well as including lexical stress in letter-to-phoneme conversion. Second, we have extended the method to phoneme-to-letter conversion. Third, and most important, we have experimented with multiple, different strategies for scoring the candidate pronunciations. Individual scores for each strategy are obtained on the basis of rank and either multiplied or summed to produce a final, overall score. Five strategies have been studied and results obtained from all 31 possible combinations. The two combination methods perform comparably, with the product rule only very marginally superior to the sum rule. Nonparametric statistical analysis reveals that performance improves as more strategies are included in the combination: this trend is very highly significant (p << 0.0005). Accordingly, for letter-to-phoneme conversion, best results are obtained when all five strategies are combined: word accuracy is raised to 65.5% relative to 61.7% for our best previous result and 63.0% for the best-performing single strategy. These improvements are very highly significant (p ~ 0 and p < 0.00011, respectively). Similar results were found for phoneme-to-letter and letter-to-stress conversion, although the former was an easier problem for PbA than letter-to-phoneme conversion and the latter was harder. The main sources of error for the multi-strategy approach are very similar to those for the best single strategy, and mostly involve vowel letters and phonemes.
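The rank-based combination described above can be illustrated with a small sketch. It is not the authors' implementation: the point scheme (dense ranks, best candidate gets the most points), the assumption that a higher raw score is better, and the example scores are all hypothetical. It also confirms that five strategies give 31 non-empty combinations.

```python
from itertools import combinations
from math import prod


def rank_points(scores):
    """Convert one strategy's raw scores to rank points.

    Assumption: higher raw score is better; tied candidates share points.
    """
    distinct = sorted(set(scores))  # ascending, so the best score gets the most points
    points = {s: i + 1 for i, s in enumerate(distinct)}
    return [points[s] for s in scores]


def combine(score_table, rule="product"):
    """Fuse per-strategy rank points per candidate by product or sum."""
    per_strategy = [rank_points(scores) for scores in score_table]
    fuse = prod if rule == "product" else sum
    return [fuse(column) for column in zip(*per_strategy)]


# Hypothetical raw scores: two strategies, three candidate pronunciations.
table = [[0.2, 0.9, 0.5], [3, 7, 7]]
product_scores = combine(table, "product")
sum_scores = combine(table, "sum")

# All non-empty subsets of five strategies.
subsets = [c for r in range(1, 6) for c in combinations(range(5), r)]
print(product_scores, sum_scores, len(subsets))
```

In this toy case both rules pick the second candidate, which is in line with the abstract's remark that the two rules perform comparably.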
Evaluating the Pronunciation Component of Text-to-Speech Systems for English: A Performance Comparison of Different Approaches
In Speech and Language Technology (SALT) Club Workshop on Evaluation in Speech and Language Technology, 1997
Cited by 24 (8 self)
The automatic derivation of word pronunciations from input text is a central task for any text-to-speech system. For general English text at least, this is often thought to be a solved problem, with manually-derived linguistic rules assumed capable of handling 'novel' words missing from the system dictionary. Data-driven methods, based on machine learning of the regularities implicit in a large pronouncing dictionary, have received considerable attention recently but are generally thought to perform less well. However, these tentative beliefs are at best uncertain without powerful methods for comparing text-to-phoneme subsystems. This paper contributes to the development of such methods by comparing the performance of four representative approaches to automatic phonemisation on the same test dictionary. As well as rule-based approaches, three data-driven techniques are evaluated: pronunciation by analogy (PbA), NETspeak and IB1-IG (a modified k-nearest neighbour method). Issues involved in comparative evaluation are detailed and elucidated. The data-driven techniques outperform rules in accuracy of letter-to-phoneme translation by a very significant margin but require aligned text-phoneme training data and are slower. Best translation results are obtained with PbA at approximately 72% words correct on a reasonably large pronouncing dictionary, compared to something like 26% words correct for the rules, indicating that automatic pronunciation of text is not a solved problem.
From Data to Speech: A General Approach
Natural Language Engineering, 2000
Cited by 21 (9 self)
We present a data-to-speech system called D2S, which can be used for the creation of data-to-speech systems in different languages and domains. The most important characteristic of a data-to-speech system is that it combines language and speech generation: language generation is used to produce a natural language text expressing the system's input data, and speech generation is used to make this text audible. In D2S, this combination is exploited by using linguistic information available in the language generation module for the computation of prosody. This allows us to achieve a better prosodic output quality than can be achieved in a plain text-to-speech system. For language generation in D2S, the use of syntactically enriched templates is guided by knowledge of the discourse context, while for speech generation pre-recorded phrases are combined in a prosodically sophisticated manner. This combination of techniques makes it possible to create linguistically sound but efficient system...
Unsupervised learning of word segmentation rules with genetic algorithms and inductive logic programming
Machine Learning, 2001
Cited by 14 (1 self)
This article presents a combination of unsupervised and supervised learning techniques for the generation of word segmentation rules from a raw list of words. First, a language bias for word segmentation is introduced and a simple genetic algorithm is used in the search for a segmentation that corresponds to the best bias value. In the second phase, the words segmented by the genetic algorithm are used as an input for the first order decision list learner CLOG. The result is a set of first order rules which can be used for segmentation of unseen words. When applied on either the training data or unseen data, these rules produce segmentations which are linguistically meaningful, and to a large degree conforming to the annotation provided. Keywords: unsupervised machine learning, inductive logic programming, natural language, word segmentation
Careful Abstraction from Instance Families in Memory-Based Language Learning
Journal of Experimental and Theoretical Artificial Intelligence, 1999
Cited by 12 (6 self)
Antal van den Bosch, ILK Research Group, Computational Linguistics, Tilburg University, The Netherlands (email: Antal.vdnBosch@kub.nl). Contact: ILK Research Group / Computational Linguistics, Faculty of Arts, Tilburg University, P.O. Box 90153, NL-5000 LE Tilburg, The Netherlands; phone (voice) +31.13.4668260, (fax) +31.13.4663110.
Empirical studies in inductive language learning point to pure memory-based learning as a successful approach to many language learning tasks, often performing better than learning methods that abstract from the learning material. The possibility is left open, however, that limited, careful abstraction in memory-based learning may be harmless to generalisation, as long as the disjunctivity of language data is preserved. We compare three types of careful abstraction: editing, oblivious (partial) decision-tree abstra...
Do Not Forget: Full Memory in Memory-Based Learning of Word Pronunciation
Proceedings of NeMLap3/CoNLL98, 1998
Cited by 9 (1 self)
Memory-based learning, keeping full memory of learning material, appears a viable approach to learning tasks, and is often superior in generalization accuracy to eager learning approaches that abstract from learning material. Here we investigate three partial memory-based learning approaches which remove from memory specific task instance types estimated to be exceptional. The three approaches each implement one heuristic function for estimating exceptionality of instance types: (i) typicality, (ii) class prediction strength, and (iii) friendly-neighbourhood size. Experiments are performed with the memory-based learning algorithm IB1-IG trained on English word pronunciation. We find that removing instance types with low prediction strength (ii) is the only tested method which does not seriously harm generalization accuracy. We conclude that keeping full memory of types rather than tokens, and excluding minority ambiguities appear to be the only performance-preserving optimizations of memory-based learning.
When small disjuncts abound, try lazy learning: A case study
1997
Cited by 8 (1 self)
Machine learning is becoming recognised as a source of generic and powerful tools for tasks studied and implemented in language technology. Lazy learning with information-theoretic similarity matching has emerged as a salient approach, demonstrated to be superior to other machine-learning approaches in various comparative studies. It is asserted both in theoretical machine learning and in reports on applications of machine learning to natural language that the success of lazy learning may be due to the fact that language data contains small disjuncts, i.e., small clusters of identically-classified instances. We propose three measures to discover small disjuncts in our data: (i) we count and analyse indexed clusters of instances in induced decision trees; (ii) we count clusters of friendly (identically-classified) instances immediately surrounding instances by using similarity metrics from lazy learning; (iii) we compare average sizes of friendly-instance clusters using different simila...
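Measure (ii) above, counting friendly instances around an instance, might be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: it uses a plain unweighted overlap distance (the abstract mentions information-theoretic similarity metrics, which are not reproduced here), it counts same-class instances strictly closer than the nearest differently-classified instance, and the toy instances and labels are hypothetical.

```python
def overlap_distance(a, b):
    """Number of feature positions at which two equal-length instances differ."""
    return sum(x != y for x, y in zip(a, b))


def friendly_neighbourhood_size(index, instances, labels):
    """Count same-class instances closer than the nearest other-class instance.

    Assumption: unweighted overlap distance; ties with the nearest
    differently-classified instance are not counted as friendly.
    """
    target, label = instances[index], labels[index]
    others = [
        (overlap_distance(target, inst), lab)
        for i, (inst, lab) in enumerate(zip(instances, labels))
        if i != index
    ]
    enemy_distances = [d for d, lab in others if lab != label]
    nearest_enemy = min(enemy_distances) if enemy_distances else float("inf")
    return sum(1 for d, lab in others if lab == label and d < nearest_enemy)


# Hypothetical toy data: three-letter instances with two classes.
instances = ["abc", "abd", "abe", "xyz", "xyc"]
labels = ["A", "A", "A", "B", "B"]
sizes = [friendly_neighbourhood_size(i, instances, labels) for i in range(len(instances))]
print(sizes)
```

Small values across much of a data set would indicate the small disjuncts the abstract refers to.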
Instance-Family Abstraction in Memory-Based Language Learning
Machine Learning: Proceedings of the Sixteenth International Conference, 1999
Cited by 7 (3 self)
Antal van den Bosch, ILK / Computational Linguistics, Tilburg University, The Netherlands (Antal.vdnBosch@kub.nl)
Memory-based learning appears relatively successful when the learning data is highly disjunct, i.e., when classes are scattered over many small families of instances in instance space, as in many language learning tasks. Abstraction over borders of disjuncts tends to harm generalization performance. However, careful abstraction in memory-based learning may be harmless when it preserves the disjunctivity of the learning data. We investigate the effect of careful abstraction in a series of language-learning task studies, and a small benchmark-task study. We find that when combined with feature weighting or value-distance metrics, careful abstraction, as implemented in the new FAMBL algorithm, can equal the generalization accuracies of pure memory-based learning, while attaining fair levels of memory compression.

