Results 1 - 9 of 9
A Statistical Model for Hangeul-Hanja Conversion in Terminology Domain
Abstract
Sino-Korean words, which were historically borrowed from Chinese, can be written in both Hanja (Chinese characters) and Hangeul (Korean characters). Previous Korean Input Method Editors (IMEs) provide only a simple dictionary-based approach to Hangeul-Hanja conversion. This paper presents a sentence-based statistical model for Hangeul-Hanja conversion, with word tokenization included as a hidden process. As a result, we reach 91.4% character accuracy and 81.4% word accuracy in the terminology domain, even when only very limited Hanja data is available.
A New Statistical Approach to Personal Name Extraction
- In ICML, 2002
Abstract
We propose a new statistical approach to extracting personal names from a corpus. One of the key points of our approach is that it can both automatically learn the characteristics of personal names from a large training corpus and make good use of human empirical knowledge (e.g., Context Free Grammar). Furthermore, our approach also assigns confidence measures to the extracted personal names, compared with traditional simple true/false determination.
Corpus-Based Pinyin Name Resolution
- Proceedings of the First SIGHAN Workshop on Chinese Language Processing (COLING, 2002
Abstract
For readers of English text who know some Chinese, Pinyin codes that spell out Chinese names are often ambiguous as to their original Chinese character representations if the names are new or not well known. For English-Chinese cross language retrieval, failure to accurately translate Pinyin names in a query to Chinese characters can lead to dismal retrieval effectiveness. This paper presents an approach of extracting Pinyin names from English text, suggesting translations to these Pinyin using a database of names and their characters with usage probabilities, followed with IR techniques with a corpus as a disambiguation tool to resolve the translation candidates.
Improving PinYin to Chinese Conversion With a Whole Sentence . . .
Abstract
We address the problem of statistical language modeling in the context of PinYin to Chinese (PTC) conversion, a problem similar to speech recognition but without the acoustic recognition step. Input phonetic syllables were first segmented and converted into a word lattice, which was then scored within a Source-Channel framework in order to find the most probable Chinese sentence. In particular, we discuss the use of a Whole Sentence Maximum Entropy (WSME) model, an expressive framework for constructing language models with diverse features. Experiments showed that a WSME model trained with d2-ngrams and word triggers achieved a 20% reduction in perplexity and an 11.05% reduction in character conversion error over a baseline trigram.
Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method
Abstract
Chinese Pinyin input method is very important for Chinese language information processing. Users may make errors when they are typing in Chinese words. In this paper, we are concerned with the reasons that cause the errors. Inspired by the observation that pressing backspace is one of the most common user behaviors to modify the errors, we collect 54,309,334 error-correction pairs from a real-world data set that contains 2,277,786 users via backspace operations. In addition, we present a comparative analysis of the data to achieve a better understanding of users' input behaviors. Comparisons with English typos suggest that some language-specific properties result in a part of Chinese input errors.
User-Dependent Aspect Model for Collaborative Activity Recognition
- Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence
Abstract
Activity recognition aims to discover one or more users' actions and goals based on sensor readings. In the real world, a single user's data are often insufficient for training an activity recognition model due to the data sparsity problem. This is especially true when we are interested in obtaining a personalized model. In this paper, we study how to collaboratively use different users' sensor data to train a model that can provide personalized activity recognition for each user. We propose a user-dependent aspect model for this collaborative activity recognition task. Our model introduces user aspect variables to capture the user grouping information, so that a target user can also benefit from her similar users in the same group to train the recognition model. In this way, we can greatly reduce the need for much valuable and expensive labeled data required in training the recognition model for each user. Our model is also capable of incorporating time information and handling new users in activity recognition. We evaluate our model on a real-world WiFi data set obtained from an indoor environment, and show that the proposed model can outperform several state-of-the-art baseline algorithms.
CHIME: An Efficient Error-Tolerant Chinese Pinyin Input Method
Abstract
Chinese Pinyin input methods are very important for Chinese language processing. In many cases, users may make typing errors. For example, a user wants to type in “shenme” (什么, meaning “what” in English) but may type in “shenem” instead. Existing Pinyin input methods fail in converting such a Pinyin sequence with errors to the right Chinese words. To solve this problem, we developed an efficient error-tolerant Pinyin input method called “CHIME” that can handle typing errors. By incorporating state-of-the-art techniques and language-specific features, the method achieves a better performance than state-of-the-art input methods. It can efficiently find relevant words in milliseconds for an input Pinyin sequence.
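The “shenem” example can be illustrated with a minimal sketch. This is not the CHIME algorithm, which the abstract does not describe in detail; it only assumes that a mistyped Pinyin string is matched against known Pinyin entries by edit distance. The dictionary `TOY_DICTIONARY`, the function names, and the threshold `max_distance` are all hypothetical.

```python
# Hypothetical illustration only: edit-distance lookup against a toy
# Pinyin dictionary. This is NOT the method used by CHIME.

def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "em" typed instead of "me".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

# Toy dictionary (assumption): Pinyin string -> Chinese word.
TOY_DICTIONARY = {
    "shenme": "什么",
    "xuesheng": "学生",
    "zhongguo": "中国",
}

def candidates(typed: str, max_distance: int = 1):
    """Return (pinyin, word, distance) tuples within max_distance edits,
    closest first."""
    found = []
    for pinyin, word in TOY_DICTIONARY.items():
        dist = osa_distance(typed, pinyin)
        if dist <= max_distance:
            found.append((pinyin, word, dist))
    return sorted(found, key=lambda item: (item[2], item[0]))

print(candidates("shenem"))
```

A linear scan like this would not meet the millisecond lookup times the abstract reports on a realistic vocabulary; it is meant only to show why a swapped-letter input can still be resolved to the intended word.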
Efficient dictionary and language model compression for
Abstract
Reducing the size of the dictionary and language model is critical when applying them to real-world applications including machine translation and input method editors (IME). Especially for IME, we have to drastically compress them without sacrificing lookup speed, since IMEs need to be executed on local computers. This paper presents novel lossless compression algorithms for both dictionary and language model based on succinct data structures. The two proposed data structures are used in our product “Google Japanese Input” and its open-source version “Mozc”.
A Unified Approach to Transliteration-based Text Input with Online Spelling Correction
Abstract
This paper presents an integrated, end-to-end approach to online spelling correction for text input. Online spelling correction refers to spelling correction as you type, as opposed to post-editing. The online scenario is particularly important for languages that routinely use transliteration-based text input methods, such as Chinese and Japanese, because the desired target characters cannot be input at all unless they are in the list of candidates provided by an input method, and spelling errors prevent them from appearing in the list. For example, a user might type suesheng by mistake to mean xuesheng 学生 'student' in Chinese; existing input methods fail to convert this misspelled input to the desired target Chinese characters. In this paper, we propose a unified approach to the problem of spelling correction and transliteration-based character conversion using an approach inspired by the phrase-based statistical machine translation framework. At the phrase (substring) level, the k most probable pinyin (Romanized Chinese) corrections are generated using a monotone decoder; at the sentence level, input pinyin strings are directly transliterated into target Chinese characters by a decoder using a log-linear model that refers to the features of both levels. A new method of automatically deriving parallel training data from user keystroke logs is also presented. Experiments on Chinese pinyin conversion show that our integrated method reduces the character error rate by 20% (from 8.9% to 7.12%) over the previous state of the art based on a noisy channel model.
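The reported 20% figure appears to be a relative reduction rather than an absolute one, since the error rate drops by only 1.78 percentage points. A small check of that reading, using only the two error rates quoted in the abstract (the function name is hypothetical):

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline

# Character error rates quoted in the abstract, in percent.
baseline_cer = 8.9
integrated_cer = 7.12

absolute_drop = baseline_cer - integrated_cer               # percentage points
relative_drop = relative_reduction(baseline_cer, integrated_cer)

print(f"absolute drop: {absolute_drop:.2f} points")
print(f"relative drop: {relative_drop:.1%}")
```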

