TnT - A Statistical Part-Of-Speech Tagger
, 2000
Cited by 293 (3 self)
Trigrams'n'Tags (TnT) is an efficient statistical part-of-speech tagger. Contrary to claims found elsewhere in the literature, we argue that a tagger based on Markov models performs at least as well as other current approaches, including the Maximum Entropy framework. A recent comparison has even shown that TnT performs significantly better for the tested corpora. We describe the basic model of TnT, the techniques used for smoothing and for handling unknown words. Furthermore, we present evaluations on two corpora.
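The abstract mentions smoothing without naming the scheme. As a rough illustration only, the sketch below assumes linear interpolation of unigram, bigram and trigram tag frequencies. The function names and the fixed weights in `lambdas` are hypothetical and are not taken from the listing; a real tagger would estimate the weights from data.

```python
from collections import Counter


def train_counts(tagged_sentences):
    """Collect tag unigram, bigram and trigram counts from (word, tag) sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        for i, tag in enumerate(tags):
            uni[tag] += 1
            if i >= 1:
                bi[(tags[i - 1], tag)] += 1
            if i >= 2:
                tri[(tags[i - 2], tags[i - 1], tag)] += 1
    return uni, bi, tri


def interpolated_prob(t1, t2, t3, uni, bi, tri, lambdas=(0.2, 0.3, 0.5)):
    """Estimate P(t3 | t1, t2) as a weighted sum of relative frequencies.

    The weights are illustrative assumptions, not values from any paper.
    """
    l1, l2, l3 = lambdas
    total = sum(uni.values())
    p_uni = uni[t3] / total if total else 0.0
    # Count how often t2 occurs as the first tag of a bigram.
    bi_context = sum(c for (a, _), c in bi.items() if a == t2)
    p_bi = bi[(t2, t3)] / bi_context if bi_context else 0.0
    # Count how often (t1, t2) occurs as the first two tags of a trigram.
    tri_context = sum(c for (a, b, _), c in tri.items() if (a, b) == (t1, t2))
    p_tri = tri[(t1, t2, t3)] / tri_context if tri_context else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


# Toy training data, invented for this example.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("big", "JJ"), ("dog", "NN")],
]
uni, bi, tri = train_counts(corpus)
print(interpolated_prob("DT", "NN", "VBZ", uni, bi, tri))
```

Backing off to the shorter contexts keeps the estimate non-zero for tag trigrams that never occurred in training, which is the point of smoothing here.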
Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
- Proceedings of CoNLL-2003
, 2003
Part-of-Speech Tagging and Partial Parsing
- Corpus-Based Methods in Language and Speech
, 1996
Cited by 85 (0 self)
m we can carve off next. `Partial parsing' is a cover term for a range of different techniques for recovering some but not all of the information contained in a traditional syntactic analysis. Partial parsing techniques, like tagging techniques, aim for reliability and robustness in the face of the vagaries of natural text, by sacrificing completeness of analysis and accepting a low but non-zero error rate.
1 Tagging
The earliest taggers [35, 51] had large sets of hand-constructed rules for assigning tags on the basis of words' character patterns and on the basis of the tags assigned to preceding or following words, but they had only small lexica, primarily for exceptions to the rules. TAGGIT [35] was used to generate an initial tagging of the Brown corpus, which was then hand-edited. (Thus it provided the data that has since been used to train other taggers [20].) The tagger described by Garside [56, 34], CLAWS, was a probabilistic version of TAGGIT, and the DeRose tagger improved on
Methods for the Qualitative Evaluation of Lexical Association Measures
- In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics
, 2001
Cited by 52 (6 self)
This paper presents methods for a qualitative, unbiased comparison of lexical association measures and the results we have obtained for adjective-noun pairs and preposition-noun-verb triples extracted from German corpora. In our approach, we compare the entire list of candidates, sorted according to the particular measures, to a reference set of manually identified "true positives".
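One plausible way to compare a ranked candidate list against a reference set of true positives is to compute precision over the first n candidates for every n. The sketch below is an assumption about how such a comparison could look, not the paper's procedure; the function names and the toy scores are hypothetical.

```python
def precision_at_n(ranked_candidates, true_positives, n):
    """Return the share of the first n ranked candidates found in the reference set."""
    if n <= 0:
        raise ValueError("n must be positive")
    top = ranked_candidates[:n]
    if not top:
        return 0.0
    hits = sum(1 for candidate in top if candidate in true_positives)
    return hits / len(top)


def precision_curve(scores, true_positives):
    """Sort candidates by descending score and return precision for each n."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [precision_at_n(ranked, true_positives, n)
            for n in range(1, len(ranked) + 1)]


# Toy association scores and reference set, invented for this example.
scores = {"strong tea": 3.0, "green idea": 2.0, "heavy rain": 1.0}
reference = {"strong tea", "heavy rain"}
print(precision_curve(scores, reference))
```

Running `precision_curve` once per association measure gives curves that can be compared over the whole candidate list rather than at a single cut-off.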
Phrase Recognition and Expansion for Short, Precision-biased Queries based on a Query Log
Cited by 30 (0 self)
In this paper we examine the question of query parsing for World Wide Web queries and present a novel method for phrase recognition and expansion. Given a training corpus of approximately 16 million Web queries and a handwritten context-free grammar, the EM algorithm is used to estimate the parameters of a probabilistic context-free grammar (PCFG) with a system developed by Carroll [5]. We use the PCFG to compute the most probable parse for a user query, reflecting linguistic structure and word usage of the domain being parsed. The optimal syntactic parse for a user query thus obtained is employed for phrase recognition and expansion. Phrase recognition is used to increase retrieval precision; phrase expansion is applied to make the best use possible of very short Web queries.
Automatic phonemic transcription and linguistic annotation from known text with Hidden Markov Models. An Aligner for German
, 1995
Cited by 10 (2 self)
This paper describes the architecture of a word and phoneme aligner based on Hidden Markov Models (HMMs). It was developed to allow for word, syllable and segment length extraction as part of a feature extraction stage for prosody recognition. From a given orthographic ASCII text and sampled speech data, a label file with phonemes, syllables or words is automatically generated. Linguistic categories coded in the lexicon can also be included in the generated label files.
Webcrow: A web-based system for crossword solving
- In Proc. of AAAI ’05
, 2005
Cited by 9 (2 self)
Language games represent one of the most fascinating challenges of research in artificial intelligence. In this paper we give an overview of WebCrow, a system that tackles crosswords using the Web as a knowledge base. This appears to be a novel approach with respect to the available literature. It is also the first solver for non-English crosswords and it has been designed to be potentially multilingual. Although WebCrow has been implemented only in a preliminary version, it already displays very interesting results reaching the performance of a human beginner: crosswords that are “easy” for expert humans are solved, within competition time limits, with 80% of correct words and over 90% of correct letters.
Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space
Cited by 9 (2 self)
We propose an approach to adjective-noun composition (AN) for corpus-based distributional semantics that, building on insights from theoretical linguistics, represents nouns as vectors and adjectives as data-induced (linear) functions (encoded as matrices) over nominal vectors. Our model significantly outperforms the rivals on the task of reconstructing AN vectors not seen in training. A small post-hoc analysis further suggests that, when the model-generated AN vector is not similar to the corpus-observed AN vector, this is due to anomalies in the latter. We show moreover that our approach provides two novel ways to represent adjective meanings, alternative to its representation via corpus-based co-occurrence vectors, both outperforming the latter in an adjective clustering task.
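The core idea, an adjective acting as a linear function on a noun vector, can be shown with a matrix-vector product. The sketch below is a toy illustration under stated assumptions: the matrix and vectors are hand-written, whereas the abstract says the matrices are induced from corpus data, a step not shown here. The function names are hypothetical.

```python
import math


def apply_adjective(adj_matrix, noun_vector):
    """Compose an adjective-noun vector as adj_matrix multiplied by noun_vector."""
    if any(len(row) != len(noun_vector) for row in adj_matrix):
        raise ValueError("each matrix row must match the noun vector length")
    return [sum(w * x for w, x in zip(row, noun_vector)) for row in adj_matrix]


def cosine(u, v):
    """Cosine similarity, used to compare a composed vector with an observed one."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy values, invented for this example.
red = [[2.0, 0.0], [0.0, 1.0]]   # hypothetical matrix for an adjective
car = [1.0, 3.0]                 # hypothetical noun vector
observed_red_car = [2.0, 2.5]    # hypothetical corpus-observed AN vector

predicted = apply_adjective(red, car)
print(predicted, cosine(predicted, observed_red_car))
```

Comparing the predicted vector with an observed one by cosine similarity is one simple way to check how well a composed AN vector reconstructs its corpus counterpart.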
An iterative data collection approach for multimodal dialogue systems
- In: Proc. of the 3rd International Conference on Language Resources and Evaluation. Las Palmas, Canary Islands
, 2002
Cited by 8 (2 self)
This paper deals with the way in which data for multimodal dialogue systems are collected. We argue that for multimodal data, an iterative data collection strategy should be followed. Instead of a single major data collection effort using a “Wizard of Oz” (WOZ) or “prompting” experimental setup, several smaller data collections should accompany the system development. We also describe the “script” experimental setup we developed. It is in between the WOZ and prompting setup, and can be used as a cost-effective design for the first data collection within the iterative data collection strategy.
A Self-Learning Context-Aware Lemmatizer for German
- In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP 2005)
, 2005
Cited by 8 (1 self)
Accurate lemmatization of German nouns mandates the use of a lexicon. Comprehensive lexicons, however, are expensive to build and maintain. We present a self-learning lemmatizer capable of automatically creating a full-form lexicon by processing German documents.

