Results 1 - 10
of
38
Mostly-Unsupervised Statistical Segmentation of Japanese: Applications to Kanji
, 2000
"... Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and grammar or on pre-segmented data. In contrast, we introduce a novel statistical me ..."
Abstract
-
Cited by 32 (1 self)
- Add to MetaCart
Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and grammar or on pre-segmented data. In contrast, we introduce a novel statistical method utilizing unsegmented training data, with performance on kanji sequences comparable to and sometimes surpassing that of morphological analyzers over a variety of error metrics.
An Algorithm for Segmenting Categorical Time Series into Meaningful Episodes
, 2001
"... . This paper describes an unsupervised algorithm for segmenting categorical time series. The algorithm first collects statistics about the frequency and boundary entropy of ngrams, then passes a window over the series and has two expert methods decide where in the window boundaries should be drawn. ..."
Abstract
-
Cited by 27 (5 self)
- Add to MetaCart
. This paper describes an unsupervised algorithm for segmenting categorical time series. The algorithm first collects statistics about the frequency and boundary entropy of ngrams, then passes a window over the series and has two expert methods decide where in the window boundaries should be drawn. The algorithm segments text into words successfully, and has also been tested with a data set of mobile robot activities. We claim that the algorithm finds
Chinese segmentation and new word detection using conditional random fields
- In Proceedings of COLING
, 2004
"... Chinese word segmentation is a difficult, important and widely-studied sequence modeling problem. This paper demonstrates the ability of linear-chain conditional random fields (CRFs) to perform robust and accurate Chinese word segmentation by providing a principled framework that easily supports the ..."
Abstract
-
Cited by 25 (0 self)
- Add to MetaCart
Chinese word segmentation is a difficult, important and widely-studied sequence modeling problem. This paper demonstrates the ability of linear-chain conditional random fields (CRFs) to perform robust and accurate Chinese word segmentation by providing a principled framework that easily supports the integration of domain knowledge in the form of multiple lexicons of characters and words. We also present a probabilistic new word detection method, which further improves performance. Our system is evaluated on four datasets used in a recent comprehensive Chinese word segmentation competition. State-of-the-art performance is obtained. 1
Text Classification and Segmentation Using Minimum Cross-Entropy
, 2000
"... Several methods for classifying and segmenting text are described. These are based on ranking text sequences by their cross-entropy calculated using a fixed order character-based Markov model adapted from the PPM text compression algorithm. Experimental results show that the methods are a signi cant ..."
Abstract
-
Cited by 18 (0 self)
- Add to MetaCart
Several methods for classifying and segmenting text are described. These are based on ranking text sequences by their cross-entropy calculated using a fixed order character-based Markov model adapted from the PPM text compression algorithm. Experimental results show that the methods are a signi cant improvement over previously used methods in a number of areas. For example, text can be classified with a very high degree of accuracy by authorship, language, dialect and genre. Highly accurate text segmentation is also possible -- the accuracy of the PPM-based Chinese word segmenter is close to 99% on Chinese news text; similarly, a PPM-based method of segmenting text by language achieves an accuracy of over 99%.
Contentful Mental States for Robot Baby
, 2002
"... In this paper we claim that meaningful representations can be learned by programs, although today they are almost always designed by skilled engineers. We discuss several kinds of meaning that representations might have, and focus on a functional notion of meaning as appropriate for programs to ..."
Abstract
-
Cited by 14 (2 self)
- Add to MetaCart
In this paper we claim that meaningful representations can be learned by programs, although today they are almost always designed by skilled engineers. We discuss several kinds of meaning that representations might have, and focus on a functional notion of meaning as appropriate for programs to learn. Specifically, a representation is meaningful if it incorporates an indicator of external conditions and if the indicator relation informs action. We survey methods for inducing kinds of representations we call structural abstractions. Prototypes of sensory time series are one kind of structural abstraction, and though they are not denoting or compositional, they do support planning. Deictic representations of objects and prototype representations of words enable a program to learn the denotational meanings of words. Finally, we discuss two algorithms designed to find the macroscopic structure of episodes in a domain-independent way.
Chinese word segmentation and named entity recognition: a pragmatic approach
- Computational Linguistics
, 2005
"... This paper presents a pragmatic approach to Chinese word segmentation. It differentiates from most of the previous approaches mainly in three respects. First of all, while theoretical linguists have defined Chinese words with various linguistic criteria, Chinese words in this study are defined pragm ..."
Abstract
-
Cited by 14 (1 self)
- Add to MetaCart
This paper presents a pragmatic approach to Chinese word segmentation. It differentiates from most of the previous approaches mainly in three respects. First of all, while theoretical linguists have defined Chinese words with various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in realistic computer applications. Secondly, we propose a pragmatic mathematical framework in which segmenting known words and detecting unknown words of different types (i.e. morphologically derived words, factoids, named entities, and other unlisted words) can be performed simultaneously in a unified way. These tasks are usually conducted separately in other systems. Finally, we do not assume the existence of a universal word segmentation standard which is application independent. Instead, we argue for the necessity of multiple segmentation standards due to the pragmatic fact that different NLP applications might require different granularities of Chinese words. These pragmatic approaches have been implemented in an adaptive Chinese word segmenter, called MSRSeg (access
Accessor variety criteria for chinese word extraction
- Computational Linguistics
, 2004
"... We are interested in the problem of word extraction from Chinese text collections. We de�ne a word to be a meaningful string composed of several Chinese characters. For example,, ‘percent’, and, ‘more and more’, are not recognized as traditional Chinese words from the viewpoint of some people. Howev ..."
Abstract
-
Cited by 12 (0 self)
- Add to MetaCart
We are interested in the problem of word extraction from Chinese text collections. We de�ne a word to be a meaningful string composed of several Chinese characters. For example,, ‘percent’, and, ‘more and more’, are not recognized as traditional Chinese words from the viewpoint of some people. However, in our work, they are words because they are very widely used and have speci�c meanings. We start with the viewpoint that a word is a distinguished linguistic entity that can be used in many different language environments. We consider the characters that are directly before a string (predecessors) and the characters that are directly after a string (successors) as important factors for determining the independence of the string. We call such characters accessors of the string, consider the number of distinct predecessors and successors of a string in a largecorpus (TREC 5 and TREC 6 documents), and use them as the measurement of the context independency of a string from the rest of the sentences in the document. Our experiments con�rm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction and is comparable to, and for long words outperforms, other iterative methods. 1.
Adaptive Text Mining: Inferring Structure from Sequences
- J. Discrete Algorithms
, 2000
"... Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting informatio ..."
Abstract
-
Cited by 9 (0 self)
- Add to MetaCart
Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tacked adaptively.
Voting Experts: An Unsupervised Algorithm for Segmenting Sequences
, 2006
"... We describe a statistical signature of chunks and an algorithm for finding chunks. While there is no formal definition of chunks, they may be reliably identified as configurations with low internal entropy or unpredictability and high entropy at their boundaries. We show that the log frequency of a ..."
Abstract
-
Cited by 9 (3 self)
- Add to MetaCart
We describe a statistical signature of chunks and an algorithm for finding chunks. While there is no formal definition of chunks, they may be reliably identified as configurations with low internal entropy or unpredictability and high entropy at their boundaries. We show that the log frequency of a chunk is a measure of its internal entropy. The Voting-Experts exploits the signature of chunks to find word boundaries in text from four languages and episode boundaries in the activities of a mobile robot. 1

