Results 1 - 8 of 8
Chinese word segmentation and named entity recognition: a pragmatic approach
- Computational Linguistics, 2005
Cited by 14 (1 self)
This paper presents a pragmatic approach to Chinese word segmentation. It differs from most of the previous approaches mainly in three respects. First of all, while theoretical linguists have defined Chinese words with various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in realistic computer applications. Secondly, we propose a pragmatic mathematical framework in which segmenting known words and detecting unknown words of different types (i.e. morphologically derived words, factoids, named entities, and other unlisted words) can be performed simultaneously in a unified way. These tasks are usually conducted separately in other systems. Finally, we do not assume the existence of a universal word segmentation standard which is application independent. Instead, we argue for the necessity of multiple segmentation standards due to the pragmatic fact that different NLP applications might require different granularities of Chinese words. These pragmatic approaches have been implemented in an adaptive Chinese word segmenter, called MSRSeg (access
Of all things the measure is man: Automatic classification of emotions and inter-labeler consistency
- Proceedings of the IEEE ICASSP, 2005
Cited by 13 (5 self)
In traditional classification problems, the reference needed for training a classifier is given and considered to be absolutely correct. However, this does not apply to all tasks. In emotion recognition in non-acted speech, for instance, one often does not know which emotion was really intended by the speaker. Hence, the data is annotated by a group of human labelers who, in most cases, do not agree on one common class. Often, similar classes are confused systematically. We propose a new entropy-based method to evaluate classification results taking into account these systematic confusions. We can show that a classifier which achieves a recognition rate of "only" about 60% on a four-class problem performs as well as our five human labelers on average.
© 2005 Association for Computational Linguistics
Cited by 1 (1 self)
This paper presents a pragmatic approach to Chinese word segmentation. It differs from most of the previous approaches mainly in three respects. First of all, while theoretical linguists have defined Chinese words with various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in realistic computer applications. Secondly, we propose a pragmatic mathematical framework in which segmenting known words and detecting unknown words of different types (i.e. morphologically derived words, factoids, named entities, and other unlisted words) can be performed simultaneously in a unified way. These tasks are usually conducted separately in other systems. Finally, we do not assume the existence of a universal word segmentation standard which is application independent. Instead, we argue for the necessity of multiple segmentation standards due to the pragmatic fact that different NLP applications might require different granularities of Chinese words. These pragmatic approaches have been implemented in an adaptive Chinese word segmenter, called MSRSeg, which will be described in detail. It consists of two components: (1) a generic segmenter that is based on the framework of linear mixture models, and provides a unified approach to the five fundamental features of word-level Chinese language processing: lexicon word processing, morphological analysis, factoid detection, named entity recognition, and new word identification; and (2) a set of output adaptors for adapting the output of the former to different application-specific standards. Evaluation on five test sets with different standards shows that the adaptive system achieves state-of-the-art performance on all the test sets.
Enhanced Integrated Scoring for Cleaning Dirty Texts
, 810
Cited by 1 (1 self)
An increasing number of approaches for ontology engineering from text are gearing towards the use of online sources such as company intranets and the World Wide Web. Despite this rise, little work can be found on preprocessing and cleaning dirty texts from online sources. This paper presents an enhancement of an Integrated Scoring for Spelling error correction, Abbreviation expansion and Case restoration (ISSAC). ISSAC is implemented as part of a text preprocessing phase in an ontology engineering system. New evaluations performed on the enhanced ISSAC using 700 chat records reveal an improved accuracy of 98%, compared to 96.5% for basic ISSAC alone and 71% for Aspell alone.
DEVELOPMENT OF ANNOTATED BANGLA SPEECH CORPORA
This paper describes the development procedure of three different Bangla read speech corpora which can be used for phonetic research and developing speech applications. Several criteria were maintained in the corpora development process, including consideration of phonetic and prosodic features during text selection. In addition, a specification was maintained in the recording phase, as the speaking style is a vital part of speech applications. We also concentrated on proper text normalization, pronunciation, aligning, and labeling. The labeling was done manually; in the present endeavor, sentence-level labeling (annotation) was completed by maintaining a specification so that it could be expanded in future. Index Terms: Speech corpora, Phonetic research, Speech processing
Text Normalization for the . . .
- G.A. Vouros and T. Panayiotopoulos (Eds.): SETN 2004, LNAI 3025, pp. 390-399, 2004
In this paper we present a novel approach, called "Text to Pronunciation (TtP)", for the proper normalization of Non-Standard Words (NSWs) in unrestricted texts. The methodology deals with inflection issues for the consistency of the NSWs with the syntactic structure of the utterances they belong to. Moreover,

