Tagging English Text with a Probabilistic Model
, 1994
Cited by 212 (0 self)
In this paper we present some experiments on the use of a probabilistic model to tag English text, i.e. to assign to each word the correct tag (part of speech) in the context of the sentence. The main novelty of these experiments is the use of untagged text in the training of the model. We have used a simple triclass Markov model and are looking for the best way to estimate the parameters of this model, depending on the kind and amount of training data provided. Two approaches in particular are compared and combined: using text that has been tagged by hand and computing relative frequency counts, and using text without tags and training the model as a hidden Markov process, according to a Maximum Likelihood principle.
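The relative-frequency approach named in this abstract can be sketched in a few lines. This is only an illustration on a toy hand-tagged corpus, not the paper's code: the function name `estimate_triclass`, the `<s>` padding tag, and the tag names are all assumptions made for the example.

```python
from collections import Counter

def estimate_triclass(tagged_sentences):
    """Relative-frequency estimates for a tag-trigram (triclass) model.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns (transition, emission) dictionaries of probabilities.
    """
    tri, ctx, emit, tag_count = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        # Hypothetical padding tag so the first two words have a full context.
        tags = ["<s>", "<s>"] + [t for _, t in sent]
        for i in range(2, len(tags)):
            tri[(tags[i - 2], tags[i - 1], tags[i])] += 1
            ctx[(tags[i - 2], tags[i - 1])] += 1
        for word, tag in sent:
            emit[(tag, word)] += 1
            tag_count[tag] += 1
    # P(tag | two previous tags) = count(trigram) / count(context bigram)
    transition = {k: c / ctx[k[:2]] for k, c in tri.items()}
    # P(word | tag) = count(tag, word) / count(tag)
    emission = {k: c / tag_count[k[0]] for k, c in emit.items()}
    return transition, emission

corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
transition, emission = estimate_triclass(corpus)
```

The second approach in the abstract (training on untagged text as a hidden Markov process) would start from tables of this shape and re-estimate them iteratively; that step is not shown here.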
Grammatical Category Disambiguation by Statistical Optimization
- COMPUTATIONAL LINGUISTICS
, 1988
Cited by 148 (0 self)
[This paper focuses on the]... task of [part-of-speech] disambiguation, and particularly on a new algorithm called VOLSUNGA, which avoids syntactic-level analysis, yields about 96% accuracy, and runs in far less time and space than previous attempts. The most recent previous algorithm runs in NP (Non-Polynomial) time, while VOLSUNGA runs in linear time. This is provably optimal; no improvements in the order of its execution time and space are possible. VOLSUNGA is also robust in cases of ungrammaticality. Improvements to this accuracy may be made, perhaps the most potentially significant being to include some higher-level information. With such additions, the accuracy of statistically-based algorithms will approach 100%; and the few remaining cases may be largely those with which humans also find difficulty. In subsequent sections we examine several disambiguation algorithms. Their techniques, accuracies, and efficiencies are analyzed. After presenting the research carried out to date, a discussion of VOLSUNGA's application to the Brown Corpus...
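To give a feel for how a statistical tagger can choose a tag path in time linear in sentence length, here is a generic dynamic-programming sketch over tag bigrams. It is not the VOLSUNGA algorithm itself, whose details are not given in this abstract; the function name `best_tag_path`, the probability tables, and the `floor` value for unseen events are all hypothetical.

```python
def best_tag_path(words, tags, trans, emit, start="<s>", floor=1e-6):
    """Most probable tag sequence under a bigram tag model.

    trans[(prev_tag, tag)] and emit[(tag, word)] hold probabilities;
    missing entries fall back to `floor` (an assumption of this sketch).
    """
    if not words:
        return []
    prev_scores = {start: 1.0}
    backptrs = []
    for word in words:
        scores, ptrs = {}, {}
        for t in tags:
            # Keep only the single best predecessor for each tag.
            best_prev = max(
                prev_scores,
                key=lambda p: prev_scores[p] * trans.get((p, t), floor),
            )
            scores[t] = (prev_scores[best_prev]
                         * trans.get((best_prev, t), floor)
                         * emit.get((t, word), floor))
            ptrs[t] = best_prev
        backptrs.append(ptrs)
        prev_scores = scores
    # Follow back-pointers from the best final tag to recover the path.
    tag = max(prev_scores, key=prev_scores.get)
    path = [tag]
    for ptrs in reversed(backptrs[1:]):
        tag = ptrs[tag]
        path.append(tag)
    return path[::-1]

tags = ["DET", "NOUN", "VERB"]
trans = {("<s>", "DET"): 0.8, ("<s>", "NOUN"): 0.2, ("DET", "NOUN"): 0.9,
         ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.2}
emit = {("DET", "the"): 0.9, ("NOUN", "man"): 0.5, ("VERB", "man"): 0.1,
        ("NOUN", "walks"): 0.1, ("VERB", "walks"): 0.6}
path = best_tag_path(["the", "man", "walks"], tags, trans, emit)
```

For a fixed tag set the work per word is constant, so the total grows linearly with the number of words. Raw probabilities are used for readability; a practical version would work with log-probabilities to avoid underflow on long sentences.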
Designing Statistical Language Learners: Experiments on Noun Compounds
, 1995
Cited by 65 (0 self)
Statistical language learning research takes the view that many traditional natural language processing tasks can be solved by training probabilistic models of language on a sufficient volume of training data. The design of statistical language learners therefore involves answering two questions: (i) Which of the multitude of possible language models will most accurately reflect the properties necessary to a given task? (ii) What will constitute a sufficient volume of training data? Regarding the first question, though a variety of successful models have been discovered, the space of possible designs remains largely unexplored. Regarding the second, exploration of the design space has so far proceeded without an adequate answer. The goal of this thesis is to advance the exploration of the statistical language learning design space. In pursuit of that goal, the thesis makes two main theoretical contributions: it identifies a new class of designs by providing a novel theory of statistical natural language processing, and it presents the foundations for a predictive theory of data requirements to assist in future design explorations. The first of these contributions is called the meaning distributions theory. This theory
CLAWS4: The Tagging of the British National Corpus
, 1994
Cited by 43 (1 self)
... this paper is to describe the CLAWS4 general-purpose grammatical tagger, used for the tagging of the 100-million-word British National Corpus, of which c.70 million words have been tagged at the time of writing (April 1994). We will emphasise the goals of (a) general-purpose adaptability, (b) incorporation of linguistic knowledge to improve quality and consistency, and (c) accuracy, measured consistently and in a linguistically informed way.
Tagging accurately - Don't guess if you know
- In Proceedings of ANLP '94
, 1994
Cited by 25 (4 self)
We discuss combining knowledge-based (or rule-based) and statistical part-of-speech taggers. We use two mature taggers, ENGCG and Xerox Tagger, to independently tag the same text and combine the results to produce a fully disambiguated text. In a 27,000-word test sample taken from a previously unseen corpus we achieve 98.5% accuracy. This paper presents the data in detail. We describe the problems we encountered in the course of combining the two taggers and discuss the problem of evaluating taggers.
Speech Recognition And The Frequency Of Recently Used Words: A Modified Markov Model For Natural Language
, 1988
Cited by 23 (0 self)
Speech recognition systems incorporate a language model which, at each stage of the recognition task, assigns a probability of occurrence to each word in the vocabulary. A class of Markov language models identified by Jelinek has achieved considerable success in this domain. A modification of the Markov approach, which assigns higher probabilities to recently used words, is proposed and tested against a pure Markov model. Parameter calculation and comparison of the two models both involve use of the LOB Corpus of tagged modern English.
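The idea of boosting recently used words can be illustrated with a small cache model. This is a sketch under stated assumptions, not the model from the paper: a static unigram table stands in for the Markov component, and the class name `CacheUnigramModel`, the linear interpolation, the `weight` of 0.2 and the cache size are all hypothetical choices.

```python
from collections import Counter, deque

class CacheUnigramModel:
    """Mix a fixed base distribution with a cache of recently seen words."""

    def __init__(self, base_probs, cache_size=200, weight=0.2):
        self.base = base_probs                  # static model: word -> probability
        self.recent = deque(maxlen=cache_size)  # sliding window of recent words
        self.counts = Counter()                 # word counts inside the window
        self.weight = weight                    # hypothetical interpolation weight

    def prob(self, word):
        base_p = self.base.get(word, 0.0)
        if not self.recent:
            return base_p  # nothing observed yet: fall back to the static model
        cache_p = self.counts[word] / len(self.recent)
        return (1 - self.weight) * base_p + self.weight * cache_p

    def observe(self, word):
        # Drop the oldest word's count before the deque evicts it.
        if self.recent and len(self.recent) == self.recent.maxlen:
            self.counts[self.recent[0]] -= 1
        self.recent.append(word)
        self.counts[word] += 1

model = CacheUnigramModel({"corpus": 0.1, "the": 0.9})
before = model.prob("corpus")
model.observe("corpus")
after = model.prob("corpus")
```

Because both components are probability distributions and the weights sum to one, the mixture remains a valid distribution over the vocabulary once the cache is non-empty.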
Three studies of grammar-based surface parsing of unrestricted English text
, 1994
Cited by 15 (1 self)
The dissertation addresses the design of parsing grammars for automatic surface-syntactic analysis of unconstrained English text. It consists of a summary and three articles. Morphological disambiguation documents a grammar for morphological (or part-of-speech) disambiguation of English, done within the Constraint Grammar framework proposed by Fred Karlsson. The disambiguator seeks to discard those of the alternative morphological analyses proposed by the lexical analyser that are contextually illegitimate. The 1,100 constraints express some 23 general, essentially syntactic statements as restrictions on the linear order of morphological tags. The error rate of the morphological disambiguator is about ten times smaller than that of another state-of-the-art probabilistic disambiguator, given that both are allowed to leave some of the hardest ambiguities unresolve...
A Syntax-Based Part-of-Speech Analyser
- IN EACL-95
, 1995
Cited by 10 (0 self)
There are two main methodologies for constructing the knowledge base of a natural language analyser: the linguistic and the data-driven. Recent state-of-the-art part-of-speech taggers are based on the data-driven approach. Because of the known feasibility of the linguistic rule-based approach at related levels of description, the success of the data-driven approach in part-of-speech analysis may appear surprising. In this paper, a case is made for the syntactic nature of part-of-speech tagging. A new tagger of English that uses only linguistic distributional rules is outlined and empirically evaluated. Tested against a benchmark corpus of 38,000 words of previously unseen text, this syntax-based system reaches an accuracy of above 99%. Compared to the 95-97% accuracy of its best competitors, this result suggests the feasibility of the linguistic approach also in part-of-speech analysis.
Automatic Acquisition of Word Classification using Distributional Analysis of Content Words with Respect to Function Words
, 2002
Cited by 4 (3 self)
This project describes a method which can automatically infer word classification. Previous systems designed to assign parts-of-speech to words sought the use of training data or were built upon rules devised by experts in linguistics. The report details the use of an unsupervised approach that can significantly reduce the reliance on prior linguistic intuition. The study looks into how words behave relative to the function words. As these are the most common words, there is a great deal of information that can be obtained. It was possible to analyse how the content words from a given body of text were distributed with respect to the function words. This information could be used as a profile, and therefore content words with a similar profile against the function words could be assumed to be of similar word class. Agglomerative hierarchical clustering techniques were applied to partition words into different clusters. Words that were deemed similar were grouped together, and thus each cluster should contain words that possess the same part-of-speech. This project performed many experiments to investigate how the many factors affected the overall clustering performance, in order to find the optimal parameters. The results report an accuracy of 87% when performed on the LOB corpus. Experiments were also carried out with an alternative Spanish corpus and the clustering accuracy achieved 85%. Semantic clustering was also observed, indicating the effectiveness of the described approach for the task of automatically acquiring word classification.
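The profile-and-cluster idea described in this abstract can be sketched as follows. Everything specific here is an assumption made for illustration rather than the project's actual setup: the toy `FUNCTION_WORDS` list, the one-word left/right window, cosine similarity, and single-link merging.

```python
import math
from collections import Counter, defaultdict

FUNCTION_WORDS = {"the", "a", "to", "is", "will"}  # toy list, an assumption

def profiles(tokens):
    """Count function words seen immediately left and right of each content word."""
    prof = defaultdict(Counter)
    for i, w in enumerate(tokens):
        if w in FUNCTION_WORDS:
            continue
        if i > 0 and tokens[i - 1] in FUNCTION_WORDS:
            prof[w][("L", tokens[i - 1])] += 1
        if i + 1 < len(tokens) and tokens[i + 1] in FUNCTION_WORDS:
            prof[w][("R", tokens[i + 1])] += 1
    return prof

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(prof, k):
    """Single-link agglomerative clustering, merging until k clusters remain."""
    clusters = [{w} for w in sorted(prof)]
    while len(clusters) > k:
        # Pick the pair of clusters whose closest members are most similar.
        i, j = max(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: max(cosine(prof[x], prof[y])
                               for x in clusters[ij[0]]
                               for y in clusters[ij[1]]),
        )
        clusters[i] |= clusters.pop(j)
    return clusters

tokens = "the dog will run to the park a cat will sleep".split()
prof = profiles(tokens)
groups = cluster(prof, 4)
```

On this toy input the two verbs share the left neighbour "will" and are merged first. Content words that never occur next to a function word get no profile and are left out of the clustering in this sketch.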
Elimination of lexical ambiguities by grammars. The ELAG system
, 1998
Cited by 3 (1 self)
We present a new, INTEX-compatible formalism for the description of distributional constraints, ELAG (Elimination of lexical ambiguity by grammars). The constraints may be checked against text, and the lexical ambiguity of the text may thus be partly resolved. We describe and exemplify the main properties of ELAG with the aid of simple rules, formalizing exploitable constraints. We specify in detail the effect of applying an ELAG rule or grammar to a text. We examine the practical properties of the formalism from the point of view of a rule writer. We describe our evaluation procedure for the lexical disambiguation results.

