Unsupervised Language Acquisition: Theory and Practice
, 2001
Cited by 32 (0 self)
In this thesis I present various algorithms for the unsupervised machine learning of aspects of natural languages using a variety of statistical models. The scientific object of the work is to examine the validity of the so-called Argument from the Poverty of the Stimulus advanced in favour of the proposition that humans have language-specific innate knowledge. I start by examining an a priori argument based on Gold's theorem, that purports to prove that natural languages cannot be learned, and some formal issues related to the choice of statistical grammars rather than symbolic grammars. I present three novel algorithms for learning various parts of natural languages: first, an algorithm for the induction of syntactic categories from unlabelled text using distributional information, that can deal with ambiguous and rare words; secondly, a set of algorithms for learning morphological processes in a variety of languages, including languages such as Arabic with nonconcatenative morphology; thirdly, an algorithm for the unsupervised induction of a context-free grammar from tagged text. I carefully examine the interaction between the various components, and show how these algorithms can form the basis for an empiricist model of language acquisition. I therefore conclude that the Argument from the Poverty of the Stimulus is unsupported by the evidence.
Morphological typology of languages for IR
- Journal of Documentation
, 2001
Cited by 20 (2 self)
This paper presents a morphological classification of languages from the IR perspective. Linguistic typology research has shown that the morphological complexity of every language in the world can be described by two variables, index of synthesis and index of fusion. These variables provide a theoretical basis for IR research handling morphological issues. A common theoretical framework is needed in particular because of the increasing significance of cross-language retrieval research and CLIR systems processing different languages. The paper elaborates the linguistic morphological typology for the purposes of IR research. It studies how the indexes of synthesis and fusion could be used as practical tools in mono- and cross-lingual IR research. The need for semantic and syntactic typologies is discussed. The paper also reviews studies made in different languages on the effects of morphology and stemming in IR.
Analysis Of Phoneme-Based Features For Language Identification
- Proc ICASSP
, 1994
Cited by 11 (4 self)
This paper presents an analysis of the phonemic language identification system introduced in [5], now extended to recognize German in addition to English and Japanese. In this system language identification is based on features derived from a superset of phonemes of all three languages. As we increase the number of languages, the need to reduce the feature space becomes apparent. Practical analysis of single-feature statistics in conjunction with linguistic knowledge leads to a 90% reduction of the feature space with only a 5% loss in performance. Thus, the system discriminates between Japanese and English with 84.1% accuracy based on only 15 features compared to 84.6% based on the complete set of 318 phonemic features (or 83.6% using 333 broad-category features [4]). Results indicate that a language identification system may be designed based on linguistic knowledge and then implemented with a neural network of appropriate complexity.
Automatically Extracting and Comparing Lexicalized Grammars for Different Languages
- In Proc. of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-2001)
, 2001
Cited by 10 (1 self)
In this paper, we present a quantitative comparison between the syntactic structures of three languages: English, Chinese and Korean. This is made possible by first extracting Lexicalized Tree Adjoining Grammars from annotated corpora for each language and then performing the comparison on the extracted grammars. We found that the majority of the core grammar structures for these three languages are easily inter-mappable.
Automatic language identification
, 2001
Cited by 8 (0 self)
Automatic language identification of speech is the process by which the language of a digitized speech utterance is recognized by a computer. In this paper, we will describe the set of available cues for language identification of speech and discuss the different approaches to building working systems. This overview includes a range of historical approaches, contemporary systems that have been evaluated on standard databases, and promising future approaches. Comparative
Finding the correct interpretation of Swedish compounds, a statistical approach
- In Proc. 4th Int. Conf. Language Resources and Evaluation (LREC)
, 2004
Cited by 7 (0 self)
This paper treats compound splitting for Swedish, where compounding is productive and very common. A method for splitting compounds and several methods for choosing the correct interpretation of ambiguous compounds are presented. 99% of all compounds are split, and 97% of these are correctly interpreted.
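The two steps named in this abstract, splitting a compound and then choosing among ambiguous interpretations, can be illustrated with a minimal sketch. This is not the paper's method: it assumes a plain lexicon lookup for splitting and a product of part frequencies for ranking. The functions `split_compound` and `choose_split`, the tiny lexicon, and the frequency counts are all hypothetical and invented for illustration.

```python
import math
from typing import Dict, List, Set


def split_compound(word: str, lexicon: Set[str], min_len: int = 2) -> List[List[str]]:
    """Return every segmentation of `word` into parts found in `lexicon`.

    Assumption: parts are simply concatenated (no linking elements).
    """
    if not word:
        return [[]]  # one way to split the empty remainder: no parts
    results: List[List[str]] = []
    for i in range(min_len, len(word) + 1):
        head, rest = word[:i], word[i:]
        if head in lexicon:
            for tail in split_compound(rest, lexicon, min_len):
                results.append([head] + tail)
    return results


def choose_split(candidates: List[List[str]], freq: Dict[str, int]) -> List[str]:
    """Pick the candidate whose parts have the highest frequency product.

    Assumption: this scoring rule is a stand-in, not one of the paper's methods.
    """
    return max(candidates, key=lambda parts: math.prod(freq.get(p, 1) for p in parts))


# Hypothetical lexicon and invented counts for the ambiguous compound "bildrulle"
LEXICON = {"bil", "bild", "drulle", "rulle"}
FREQ = {"bil": 500, "drulle": 5, "bild": 300, "rulle": 40}

candidates = split_compound("bildrulle", LEXICON)
best = choose_split(candidates, FREQ)
```

Here `candidates` holds both readings, "bil" + "drulle" and "bild" + "rulle", and the invented counts favour the second. A real system would also need to handle linking elements between parts, which this sketch ignores.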
Unification-Based Persian Morphology
- In Proceedings of CICLing 2000, Alexander Gelbukh, Center of Investigation on Computation-IPN
, 2000
Cited by 6 (2 self)
In this paper, we describe the implementation of an inflectional morphological analyzer for Persian, which is based on finite state transducers and typed feature structures with unification. The analyzer was designed to provide an interface to the syntactic parser in the Shiraz Persian-English machine translation system (http://crl.nmsu.edu/shiraz) and was tested on online newspaper articles. The system includes a dictionary with 50,000 entries that is used for lookup after morphological analysis has been performed.
Mathematical linguistics
, 2007
Cited by 5 (3 self)
but in fact this is still an early draft, version 0.56, August 1 2001. Please do
Lexical Exceptions in Stress Systems: Arguments from early language acquisition and adult speech processing
, 2001
Cited by 3 (0 self)
In this paper, I will introduce additional typological facts concerning lexical exceptions in languages with phonological stress rules, and look at these generalizations from a different viewpoint than that of metrical phonology. Starting from the observation that a language can only have lexical exceptions if its native speakers can perceive them as such, I consider these generalizations in light of early language acquisition and its consequences for adult speech processing. Specifically, I will argue that a language allows for exceptions to the main stress rule if and only if its native speakers encode stress in their phonological representation of words in the mental lexicon. Crucially, the question as to whether stress is encoded or not will be shown to depend upon the age at which infants acquire the stress regularity of their language. Following Dupoux & Peperkamp (in press), I will argue that some languages with a purely phonological stress rule are structured such that infants can infer the stress regularity before they can segment speech into separate words. Adult speakers of these languages do not encode stress in the phonological representation; hence, they are unable to store exceptional stress patterns. In other languages with a purely phonological stress rule, by contrast, the regularity can only be inferred once word segmentation is in place. It will be argued that adult speakers of these languages redundantly encode stress in the phonological representation; hence, they are able to store exceptional stress patterns. Foreign words can thus enter the former type of languages without modification of their stress pattern; in the latter ...

