Identifying Phrasal Connectives In Italian Using Quantitative Methods
2001
Cited by 2 (0 self)
This paper presents the main lines of work in progress based on this empirical approach to linguistic analysis. In particular, we focus on some problems relating to the morpho-syntactic annotation of corpora.
Influence of Language Models and Candidate Set Size on Contextual Post-processing for Chinese Script Recognition
In Proceedings of the 17th International Conference on Pattern Recognition, 2004
Cited by 2 (0 self)
In Chinese, the word is the basic syntactically meaningful unit; however, each character also has a definite meaning of its own. In this paper, we compare the perplexities of four n-gram language models (character-based bigram, character-based trigram, word-based bigram and class-based bigram) and their influence on the performance of contextual post-processing of Chinese scripts in an offline handwritten Chinese character recognition system. We also demonstrate in detail the influence of the candidate set size on the performance of contextual post-processing, and indicate that the number of candidates should vary with each script.
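As a rough illustration of the perplexity comparison described in this abstract, the following sketch computes the perplexity of a bigram model over either characters or words. It is not the authors' implementation: the add-one smoothing, the `<s>` start symbol, and the function names `train_bigram` and `perplexity` are assumptions made for the example.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count bigrams over a corpus of sequences.

    Each item may be a string (character-based model) or a list of
    words (word-based model). Returns context counts, bigram counts,
    and the vocabulary. The "<s>" start symbol is an assumption.
    """
    contexts = Counter()
    bigrams = Counter()
    vocab = set()
    for seq in corpus:
        tokens = ["<s>"] + list(seq)
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab

def perplexity(seq, contexts, bigrams, vocab_size):
    """Perplexity of one sequence under an add-one smoothed bigram model."""
    tokens = ["<s>"] + list(seq)
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one (Laplace) smoothing, chosen here only for simplicity.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / (len(tokens) - 1))

# Character-based usage on a toy corpus of strings.
contexts, bigrams, vocab = train_bigram(["abab", "abba"])
seen = perplexity("abab", contexts, bigrams, len(vocab))
unseen = perplexity("bbbb", contexts, bigrams, len(vocab))
```

Passing lists of words instead of strings gives a word-based model with the same code, so the two granularities can be compared on the same text. A lower perplexity means the model finds the sequence less surprising.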
A Bernoulli mixture model for word categorisation
2001
Cited by 2 (2 self)
The problem of word categorisation is formulated as one of unsupervised mixture modelling where Bernoulli distributions capture contextual information.
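To make the formulation concrete, here is a minimal sketch of expectation-maximisation for a mixture of multivariate Bernoulli distributions. It assumes each word is represented by a binary vector of context features (1 if the word occurs in that context, 0 otherwise); this representation, the function name `bernoulli_mixture_em`, and all parameter choices are illustrative assumptions, not details taken from the paper.

```python
import math
import random

def bernoulli_mixture_em(X, k, iters=50, seed=0):
    """Fit a k-component Bernoulli mixture to binary vectors X with EM.

    Returns (weights, means, responsibilities). A generic textbook
    sketch, not the specific model of the paper.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    eps = 1e-6  # keeps probabilities away from 0 and 1
    weights = [1.0 / k] * k
    means = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(k)]
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each component for each vector.
        resp = []
        for x in X:
            logs = []
            for c in range(k):
                lp = math.log(weights[c])
                for j in range(d):
                    p = means[c][j]
                    lp += math.log(p if x[j] else 1.0 - p)
                logs.append(lp)
            top = max(logs)
            exps = [math.exp(v - top) for v in logs]  # stable normalisation
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate mixing weights and Bernoulli parameters.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = max(nc / n, eps)
            for j in range(d):
                mean = sum(r[c] * x[j] for r, x in zip(resp, X)) / max(nc, eps)
                means[c][j] = min(max(mean, eps), 1.0 - eps)
    return weights, means, resp

# Toy data: two groups of "words" with opposite context profiles.
X = [[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3
weights, means, resp = bernoulli_mixture_em(X, k=2)
labels = [max(range(2), key=lambda c: r[c]) for r in resp]
```

Each word is assigned to the component with the highest responsibility, which plays the role of its induced category.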
Improved Unsupervised POS Induction Using Intrinsic Clustering Quality and a Zipfian Constraint
Cited by 2 (1 self)
Modern unsupervised POS taggers usually apply an optimization procedure to a nonconvex function, and tend to converge to local maxima that are sensitive to starting conditions. The quality of the tagging induced by such algorithms is thus highly variable, and researchers report average results over several random initializations. Consequently, applications are not guaranteed to use an induced tagging of the quality reported for the algorithm. In this paper we address this issue using an unsupervised test for intrinsic clustering quality. We run a base tagger with different random initializations, and select the best tagging using the quality test. As a base tagger, we modify a leading unsupervised POS tagger (Clark, 2003) to constrain the distributions of word types across clusters to be Zipfian, allowing us to utilize a perplexity-based quality test. We show that the correlation between our quality test and gold standard-based tagging quality measures is high. Our results are better in most evaluation measures than all results reported in the literature for this task, and are always better than the Clark average results.
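The selection procedure described in this abstract (run a base tagger from several random initializations, then keep the run that an unsupervised quality test prefers) can be sketched as follows. The tagger and the quality test are passed in as callables; `toy_tagger` and `toy_quality` are hypothetical stand-ins invented for the example and have nothing to do with the Clark tagger or the paper's perplexity-based test. The sketch also assumes that a higher score means a better tagging.

```python
import random

def select_best_tagging(run_tagger, quality, seeds):
    """Run the tagger once per seed and keep the highest-scoring run.

    run_tagger(seed) -> tagging; quality(tagging) -> float, where a
    higher value is assumed to be better. Returns (score, seed, tagging).
    """
    best = None
    for seed in seeds:
        tagging = run_tagger(seed)
        score = quality(tagging)
        if best is None or score > best[0]:
            best = (score, seed, tagging)
    return best

def toy_tagger(seed, words=("the", "dog", "runs", "a", "cat", "sleeps"), k=3):
    """Hypothetical stand-in: assigns each word a random cluster id."""
    rng = random.Random(seed)
    return {w: rng.randrange(k) for w in words}

def toy_quality(tagging):
    """Hypothetical stand-in: counts how many clusters are actually used."""
    return len(set(tagging.values()))

score, seed, tagging = select_best_tagging(toy_tagger, toy_quality, range(5))
```

The point of the design is that `quality` needs no gold-standard tags, so the selection step can run in a fully unsupervised setting.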
Automatic Language Identification with Sequences of Language-Independent Phoneme Clusters
1996
Cited by 1 (0 self)
Automatic language identification involves analyzing language-specific features in speech to determine the language of an utterance without regard to topic, speaker or length of speech. Although much progress has been made in recent years, language identification systems have not been built on underlying theory or linguistically meaningful design criteria. This thesis is motivated by the belief that features used to discriminate between languages should be linguistically sound; the result is a unique combination of design, theory and implementation. In this thesis a "word-spotting" algorithm is introduced, motivated by a perceptual study [82] reporting that human subjects use language-dependent phonemes and short sequences to identify languages. In order to find an optimal set of phoneme-like tokens to represent speech in a linguistically meaningful way, a mathematical model of the discrimination between two languages is developed. This model permits the automatic design of a token representation of speech by selecting a list of discriminating "words" in a data-driven manner. The resulting system has the flexibility to automatically take into account the inherent structure of the languages to be discriminated. A second mathematical model is developed to measure the impact of inaccurate automatic alignment of tokens on language discrimination. This model indicates why some algorithms aiming to compensate for these inaccuracies have not been successful. The theoretical models and the "word"-spotting algorithms have been implemented and validated on both generated and real-world speech data. This dissertation makes several significant contributions: the design of a simple and linguistically sound language-identification module; a flexible automatic feature extraction algorithm; a mathematical model to estimate the discriminability of two languages; and a mathematical model to capture the impact of inaccurate alignment on the discriminability of two languages.
Machine Translation with Grammar Association:
Cited by 1 (0 self)
Grammar Association is a technique for Machine Translation and Language Understanding introduced in 1993 by Vidal, Pieraccini and Levin. All the statistical and structural models involved in the translation process are automatically built from bilingual examples, and the optimal translation of new sentences can be efficiently found by Dynamic Programming algorithms. This paper presents and discusses the state of the art of Grammar Association, including a new statistical model: Loco C.
Controlling Complexity in Part-of-Speech Induction
Cited by 1 (1 self)
We consider the problem of fully unsupervised learning of grammatical (part-of-speech) categories from unlabeled text. The standard maximum-likelihood hidden Markov model for this task performs poorly, because of its weak inductive bias and large model capacity. We address this problem by refining the model and modifying the learning objective to control its capacity via parametric and non-parametric constraints. Our approach enforces word-category association sparsity, adds morphological and orthographic features, and eliminates hard-to-estimate parameters for rare words. We develop an efficient learning algorithm that is not much more computationally intensive than standard training and provide an open-source implementation. Our experiments on five diverse languages (Bulgarian, Danish, English, Portuguese, Spanish) achieve significant improvements compared with previous methods for the same task.
Text Tokenization for Knowledge-free Automatic Extraction of Lexical Similarities
Previous studies on the automatic extraction of lexical similarities have treated the word as the semantic unit of text. However, the theory of contextual lexical semantics implies that larger segments of text, namely non-compositional multiwords, are more appropriate for this role. We experimentally tested the applicability of this notion by applying automatic collocation extraction to identify and merge such multiwords prior to the similarity estimation process. Employing an automatic WordNet-based comparative evaluation scheme along with a manual evaluation procedure, we ascertain an improvement in the extracted similarity relations.
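The preprocessing step described here, identifying collocations and merging them into single tokens before similarity estimation, might look like the sketch below. The abstract does not say which collocation measure was used; pointwise mutual information (PMI) over adjacent word pairs is substituted here purely as an illustration, and the thresholds `min_count` and `min_pmi` as well as both function names are assumptions.

```python
import math
from collections import Counter

def find_collocations(tokens, min_count=2, min_pmi=1.0):
    """Return adjacent word pairs whose PMI reaches a threshold.

    PMI is approximated as log2(count(a, b) * N / (count(a) * count(b))),
    with N the number of tokens. The measure and thresholds are
    illustrative assumptions, not the paper's method.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    found = set()
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        pmi = math.log2(count * n / (unigrams[a] * unigrams[b]))
        if pmi >= min_pmi:
            found.add((a, b))
    return found

def merge_collocations(tokens, collocations):
    """Greedily merge matching adjacent pairs, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in collocations:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

text = "new york is big and new york is old but the city is new"
tokens = text.split()
collocations = find_collocations(tokens)
merged = merge_collocations(tokens, collocations)
```

After merging, a downstream similarity model sees `new_york` as one unit rather than as two unrelated words, which is the effect the abstract argues for.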
Combining Part of Speech Induction and Morphological Induction
2004
Linguistic information is useful in natural language processing, information retrieval and a multitude of sub-tasks involving language analysis. Two types of linguistic information in all languages are part of speech and morphology. Part of speech information reflects syntactic structure and can assist in tasks such as speech recognition, machine translation and word sense disambiguation. Morphological information describes the structure of words and has application in automated spelling correction, natural language generation and information retrieval for morphologically complex languages. Machine learning methods in natural language processing acquire linguistic information from corpora of natural language text. While supervised learning algorithms are trained on texts that have been annotated with linguistic features, induction algorithms learn linguistic information from unannotated corpora. Such algorithms avoid any requirement for linguistically annotated training data, a resource that is highly time-intensive to produce. However, in learning from unannotated corpora, only limited sources of information are available. In practice, part of speech induction methods usually learn from distributional evidence about the contexts in

