Results 1 - 10
of
75
A second-order hidden markov model for part-of-speech tagging
- In Proceedings of the 37th Annual Meeting of the ACL
, 1999
"... This paper describes an extension to the hidden Markov model for part-of-speech tagging using second-order approximations for both contex-tual and lexical probabilities. This model in-creases the accuracy of the tagger to state of the art levels. These approximations make use of more contextual info ..."
Abstract
-
Cited by 51 (5 self)
- Add to MetaCart
This paper describes an extension to the hidden Markov model for part-of-speech tagging using second-order approximations for both contex-tual and lexical probabilities. This model in-creases the accuracy of the tagger to state of the art levels. These approximations make use of more contextual information than standard statistical systems. New methods of smoothing the estimated probabilities are also introduced to address the sparse data problem. 1
Description of the LTG system used for MUC-7
- In Proceedings of 7th Message Understanding Conference (MUC-7
, 1998
"... The basic building blocks in our muc system are reusable text handling tools which wehave been developing and using for a number of years at the Language Technology Group. They are modular tools with stream input/output; each tooldoesavery speci c job, but can be combined with other tools in a unix ..."
Abstract
-
Cited by 38 (6 self)
- Add to MetaCart
The basic building blocks in our muc system are reusable text handling tools which wehave been developing and using for a number of years at the Language Technology Group. They are modular tools with stream input/output; each tooldoesavery speci c job, but can be combined with other tools in a unix pipeline. Di erent combinations of the same tools can thus be used in a pipeline for completing di erent tasks. Our architecture imposes an additional constraint on the input/output streams: they should have a common syntactic format. For this common format we chose eXtensible Markup Language (xml). xml is an o cial, simpli ed version of Standard Generalised Markup Language (sgml), simpli ed to make processing easier [3]. Wewere involved in the developmentofthexml standard, building on our expertise in the design of our own Normalised sgml (nsl) and nsl tool lt nsl [10], and our xml tool lt xml [11]. A detailed comparison of this sgml-oriented architecture with more traditional data-base oriented architectures can be found in [9]. A tool in our architecture is thus a piece of software which uses an api for all its access to xml and sgml data and performs a particular task: exploiting markup which has previously been added by other tools, removing markup, or adding new markup to the stream(s) without destroying the previously added
LT TTT - A Flexible Tokenisation Tool
- In Proceedings of Second International Conference on Language Resources and Evaluation
, 2000
"... We describe LT TTT, a recently developed software system which provides tools to perform text tokenisation and mark-up. The system includes ready-made components to segment text into paragraphs, sentences, words and other kinds of token but, crucially, it also allows users to tailor rule-sets to pro ..."
Abstract
-
Cited by 35 (4 self)
- Add to MetaCart
We describe LT TTT, a recently developed software system which provides tools to perform text tokenisation and mark-up. The system includes ready-made components to segment text into paragraphs, sentences, words and other kinds of token but, crucially, it also allows users to tailor rule-sets to produce mark-up appropriate for particular applications. We present three case studies of our use of LT TTT: named-entity recognition (MUC-7), citation recognition and mark-up and the preparation of a corpus in the medical domain. We conclude with a discussion of the use of browsers to visualise marked-up text. 1. Introduction The LTG's Text Tokenisation Toolkit (LT TTT, Grover et al., 1999) was developed within an XML processing paradigm whereby tools are combined together in a pipeline allowing each to add, modify or remove some piece of mark-up. The tools are compatible with the LT XML toolset (Thompson et al., 1997) and use the LT XML API to manipulate attribute values and character data ...
A maximum entropy model of phonotactics and phonotactic learning
, 2006
"... The study of phonotactics (e.g., the ability of English speakers to distinguish possible words like blick from impossible words like *bnick) is a central topic in phonology. We propose a theory of phonotactic grammars and a learning algorithm that constructs such grammars from positive evidence. Our ..."
Abstract
-
Cited by 35 (5 self)
- Add to MetaCart
The study of phonotactics (e.g., the ability of English speakers to distinguish possible words like blick from impossible words like *bnick) is a central topic in phonology. We propose a theory of phonotactic grammars and a learning algorithm that constructs such grammars from positive evidence. Our grammars consist of constraints that are assigned numerical weights according to the principle of maximum entropy. Possible words are assessed by these grammars based on the weighted sum of their constraint violations. The learning algorithm yields grammars that can capture both categorical and gradient phonotactic patterns. The algorithm is not provided with any constraints in advance, but uses its own resources to form constraints and weight them. A baseline model, in which Universal Grammar is reduced to a feature set and an SPE-style constraint format, suffices to learn many phonotactic phenomena. In order to learn nonlocal phenomena such as stress and vowel harmony, it is necessary to augment the model with autosegmental tiers and metrical grids. Our results thus offer novel, learning-theoretic support for such representations. We apply the model to English syllable onsets, Shona vowel harmony, quantity-insensitive stress typology, and the full phonotactics of Wargamay, showing that the learned grammars capture the distributional generalizations of these languages and accurately predict the findings of a phonotactic experiment.
A Sequential Model for Multi-Class Classification. EMNLP ’01
, 2001
"... Many classification problems require decisions among a large number of competing classes. These tasks, however, are not handled well by general purpose learning methods and are usually addressed in an ad-hoc fashion. We suggest a general approach – a sequential learning model that utilizes classifie ..."
Abstract
-
Cited by 32 (11 self)
- Add to MetaCart
Many classification problems require decisions among a large number of competing classes. These tasks, however, are not handled well by general purpose learning methods and are usually addressed in an ad-hoc fashion. We suggest a general approach – a sequential learning model that utilizes classifiers to sequentially restrict the number of competing classes while maintaining, with high probability, the presence of the true outcome in the candidates set. Some theoretical and computational properties of the model are discussed and we argue that these are important in NLP-like domains. The advantages of the model are illustrated in an experiment in partof-speech tagging. 1
Combining Distributional and Morphological Information for Part of Speech Induction
- In Proceedings of the tenth Annual Meeting of the European Association for Computational Linguistics EACL 2003
, 2003
"... In this paper we discuss algorithms for clustering words into classes from unlabelled text using unsupervised algorithms, based on distributional and morphological information. We show how the use of morphological information can improve the performance on rare words, and that this is robust ..."
Abstract
-
Cited by 26 (1 self)
- Add to MetaCart
In this paper we discuss algorithms for clustering words into classes from unlabelled text using unsupervised algorithms, based on distributional and morphological information. We show how the use of morphological information can improve the performance on rare words, and that this is robust across a wide range of languages.
Modeling English Past Tense Intuitions with Minimal Generalization
, 2002
"... We describe here a supervised learning model that, given paradigms of related words, learns the morphological and phonological rules needed to derive the paradigm. The model can use its rules to make guesses about how novel forms would be inflected, and has been tested experimentally against the int ..."
Abstract
-
Cited by 22 (11 self)
- Add to MetaCart
We describe here a supervised learning model that, given paradigms of related words, learns the morphological and phonological rules needed to derive the paradigm. The model can use its rules to make guesses about how novel forms would be inflected, and has been tested experimentally against the intuitions of human speakers.
Islands of Reliability for Regular Morphology: Evidence from Italian
- Language
, 2002
"... The representation of regular morphological processes has been the subject of much controversy, particularly in the debate between single and dual route models of morphology. ..."
Abstract
-
Cited by 18 (4 self)
- Add to MetaCart
The representation of regular morphological processes has been the subject of much controversy, particularly in the debate between single and dual route models of morphology.
Morphology Induction from Term Clusters
- PROCEEDINGS OF THE 9TH CONFERENCE ON COMPUTATIONAL NATURAL LANGUAGE LEARNING (CONLL)
, 2005
"... We address the problem of learning a morphological automaton directly from a monolingual text corpus without recourse to additional resources. Like previous work in this area, our approach exploits orthographic regularities in a search for possible morphological segmentation points. Instead o ..."
Abstract
-
Cited by 18 (0 self)
- Add to MetaCart
We address the problem of learning a morphological automaton directly from a monolingual text corpus without recourse to additional resources. Like previous work in this area, our approach exploits orthographic regularities in a search for possible morphological segmentation points. Instead of affixes, however, we search for affix transformation rules that express correspondences between term clusters induced from the data. This focuses the system on substrings having syntactic function, and yields cluster-to-cluster transformation rules which enable the system to process unknown morphological forms of known words accurately. A stem-

