Introduction to the Special Issue on Computational Linguistics using Large Corpora
- Computational Linguistics, 1993
Word sense disambiguation: The state of the art
- Computational Linguistics, 1998
Cited by 92 (3 self)
The automatic disambiguation of word senses has been an interest and concern since the earliest days of computer treatment of language in the 1950's. Sense disambiguation is an “intermediate task” (Wilks and Stevenson, 1996) which is not an end in itself, but rather is necessary at one level or another to accomplish most natural language processing tasks. It is
Good-Turing smoothing without tears
- Journal of Quantitative Linguistics, 1995
Cited by 20 (0 self)
The performance of statistically based techniques for many tasks such as spelling correction, sense disambiguation, and translation is improved if one can estimate a probability for an object of interest which has not been seen before. Good-Turing methods are one means of estimating these probabilities for previously unseen objects. However, the use of Good-Turing methods requires a smoothing step which must smooth in regions of vastly different accuracy. Such smoothers are difficult to use, and may have hindered the use of Good-Turing methods in computational linguistics. This paper presents a method which uses the simplest possible smooth, a straight line, together with a rule for switching from Turing estimates which are more accurate at low frequencies. We call this method the Simple Good-Turing (SGT) method. Two examples, one from prosody, the other from morphology, are used to illustrate the SGT. While the goal of this research was to provide a simple estimator, the SGT turns out to be the most accurate of several methods applied in a set of Monte Carlo examples which satisfy the assumptions of the Good-Turing methods. The accuracy of the SGT is compared to two other methods for estimating the same probabilities, the Expected Likelihood Estimate (ELE) and two way cross validation. The SGT method is
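The procedure described in this abstract (a straight-line smooth plus a rule for switching away from the raw Turing estimates) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `simple_good_turing`, the gap-averaging of the counts, and the 1.96 standard-deviation switching threshold are assumptions based on the commonly described form of the method, and the sample counts are made up.

```python
import math


def simple_good_turing(freq_of_freq):
    """Sketch of Simple Good-Turing.

    freq_of_freq maps a count r to N_r, the number of types seen r times.
    Returns (p0, probs): p0 is the total probability mass for unseen types,
    probs[r] is the probability of one type that was seen r times.
    Needs at least two distinct counts r so that a line can be fitted.
    """
    rs = sorted(freq_of_freq)
    ns = [freq_of_freq[r] for r in rs]
    total = sum(r * n for r, n in zip(rs, ns))
    p0 = freq_of_freq.get(1, 0) / total

    # Average each N_r over the gap between neighbouring observed counts
    # (assumed detail of the method).
    zs = []
    for i, r in enumerate(rs):
        q = rs[i - 1] if i > 0 else 0
        t = rs[i + 1] if i + 1 < len(rs) else 2 * r - q
        zs.append(ns[i] / (0.5 * (t - q)))

    # Least-squares straight line: log Z = a + b * log r.
    xs = [math.log(r) for r in rs]
    ys = [math.log(z) for z in zs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx

    def smoothed(r):
        return math.exp(a + b * math.log(r))

    # Adjusted counts r*: use the raw Turing estimate while it differs
    # significantly from the line-based estimate, then switch for good.
    r_star = {}
    use_line = False
    for r in rs:
        y = (r + 1) * smoothed(r + 1) / smoothed(r)
        n_r = freq_of_freq[r]
        n_next = freq_of_freq.get(r + 1, 0)
        if not use_line and n_next > 0:
            x = (r + 1) * n_next / n_r
            sd = math.sqrt((r + 1) ** 2 * n_next / n_r ** 2 * (1 + n_next / n_r))
            if abs(x - y) > 1.96 * sd:  # assumed threshold
                r_star[r] = x
                continue
        use_line = True
        r_star[r] = y

    # Renormalise so that seen types share the mass 1 - p0.
    norm = sum(freq_of_freq[r] * r_star[r] for r in rs)
    probs = {r: (1 - p0) * r_star[r] / norm for r in rs}
    return p0, probs


# Made-up frequency-of-frequency table, for illustration only.
SAMPLE = {1: 120, 2: 40, 3: 24, 4: 13, 5: 15, 6: 5, 7: 11, 8: 2, 9: 2, 10: 1, 12: 3, 26: 1}
```

Calling `simple_good_turing(SAMPLE)` returns the unseen mass together with a per-type probability for each observed count; by construction, the unseen mass plus the sum of `N_r * probs[r]` equals 1.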
Statistical Learning of Harmonic Movement
- Journal of New Music Research, 1999
Cited by 18 (3 self)
We explore the application of statistical techniques, borrowed from natural language processing, to music. A probabilistic method is used to capture and generalise from the local harmonic movement of a corpus of seventeenth-century dance music. The probabilistic grammars so generated are then used for experiments in generation (composition). The corpus
Category-Based Statistical Language Models
- 1997
Cited by 11 (2 self)
this document. The first section, in chapter 3, develops a model for syntactic dependencies based on word-category n-grams. The second section, in chapter 4, extends this model by allowing short-range word relations to be captured through the incorporation of selected word n-grams.
Extension of Zipf’s Law to Word and Character N-Grams for English and Chinese
- Journal of Computational Linguistics and Chinese Language Processing, 2003
Cited by 6 (3 self)
It is shown that for a large corpus, Zipf's law for both words in English and characters in Chinese does not hold for all ranks. The frequency falls below the frequency predicted by Zipf's law for English words for rank greater than about 5,000 and for Chinese characters for rank greater than about 1,000. However, when single words or characters are combined together with n-gram words or characters in one list and put in order of frequency, the frequency of tokens in the combined list follows Zipf’s law approximately with the slope close to -1 on a log-log plot for all n-grams, down to the lowest frequencies in both languages. This behaviour is also found for English 2-byte and 3-byte word fragments. It only happens when all n-grams are used, including semantically incomplete n-grams. Previous theories do not predict this behaviour, possibly because conditional probabilities of tokens have not been properly represented.
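The measurement this abstract describes (pooling all n-grams into one list, ranking by frequency, and reading off the slope on a log-log plot) can be illustrated with a short sketch. The helper names `ngrams`, `combined_ngrams`, and `zipf_slope` are hypothetical, and an ordinary least-squares fit over all ranks is an assumption; the paper may estimate the slope differently.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_ngrams(tokens, max_n):
    """Pool the 1-grams up to max_n-grams into a single list."""
    combined = []
    for n in range(1, max_n + 1):
        combined.extend(ngrams(tokens, n))
    return combined


def zipf_slope(items):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two distinct items so that the fit is defined.
    """
    counts = sorted(Counter(items).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx
```

For a tokenised corpus `tokens`, `zipf_slope(combined_ngrams(tokens, 3))` gives the slope for the pooled list of unigrams, bigrams, and trigrams; a value near -1 corresponds to the behaviour reported above.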
Two Questions about Data-Oriented Parsing
- In Proceedings Fourth Workshop on Very Large Corpora, 1996
Cited by 5 (1 self)
In this paper I present ongoing work on the data-oriented parsing (DOP) model. In previous work, DOP was tested on a cleaned-up set of analyzed part-of-speech strings from the Penn Treebank, achieving excellent test results. This left, however, two important questions unanswered: (1) how does DOP perform if tested on unedited data, and (2) how can DOP be used for parsing word strings that contain unknown words? This paper addresses these questions. We show that parse results on unedited data are worse than on cleaned-up data, although still very competitive if compared to other models. As to the parsing of word strings, we show that the hardness of the problem does not so much depend on unknown words, but on previously unseen lexical categories of known words. We give a novel method for parsing these words by estimating the probabilities of unknown subtrees. The method is of general interest since it shows that good performance can be obtained without the use of a part-of-speech tagger. To the best of our knowledge, our method outperforms other statistical parsers tested on Penn Treebank word strings.
Zipf and Type-Token rules for the English and Irish languages
- 2004
Cited by 4 (1 self)
The Zipf curve of log of frequency against log of rank for a large English corpus of 500 million word tokens and 689,000 word types is shown to have the usual slope close to –1 for rank less than 5,000, but then for a higher rank it turns to give a slope close to –2. This is apparently mainly due to foreign words and place names. The Zipf curve for a highly-inflected language (the Indo-European Celtic language, Irish) is also given. Because of the larger number of word types per lemma, it remains flatter than the English curve maintaining a slope of –1 until a turning point of about rank 30,000. A formula which calculates the number of tokens given the number of types is derived in terms of the rank at the turning point, 5,000 for English and 30,000 for Irish.
Tagging a Corpus of Spoken Swedish
- International Journal of Corpus Linguistics, 2001
Cited by 3 (0 self)
In this article, we present and evaluate a method for training a statistical part-of-speech tagger on data from written language and then adapting it to the requirements of tagging a corpus of transcribed spoken language, in our case spoken Swedish. This is currently a significant problem for many research groups working with spoken language, since the availability of tagged training data from spoken language is still very limited for most languages. The overall accuracy of the tagger developed for spoken Swedish is quite respectable, varying from 95% to 97% depending on the tagset used. In conclusion, we argue that the method presented here gives good tagging accuracy with relatively little effort.

