• Documents
  • Authors
  • Tables
  • Other Seers ▼
    RefSeer AckSeer CollabSeer SeerSeer
  • Log in
  • Sign up
  • MetaCart

CiteSeerX logo

Advanced Search Include Citations
Advanced Search Include Citations | Disambiguate

The Use of Clustering Techniques for Language Modeling - Application to Asian Languages

by Jianfeng Gao, Joshua T. Goodman, Jiangbo Miao
Add To MetaCart

Tools

Sorted by:
Results 1 - 10 of 13
Next 10 →

Toward a unified approach to statistical language modeling for Chinese

by Jianfeng Gao, Joshua Goodman, Mingjing Li, Kai-fu Lee , 2001
"... This article presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigram language models to Chinese is challenging because (1) there is no standard definition of words in Chinese; (2) word boundaries are not marked by spaces; and (3) there is a de ..."
Abstract - Cited by 40 (16 self) - Add to MetaCart
This article presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigram language models to Chinese is challenging because (1) there is no standard definition of words in Chinese; (2) word boundaries are not marked by spaces; and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model, all by using the maximum likelihood principle, which is consistent with trigram model training. We show that each of the methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.

discriminant model for information retrieval

by Jianfeng Gao, Haoliang Qi, Xinsong Xia, Jian-yun Nie - In the Proceedings of SIGIR’2005 , 2005
"... This paper presents a new discriminative model for information retrieval (IR), referred to as linear discriminant model (LDM), which provides a flexible framework to incorporate arbitrary features. LDM is different from most existing models in that it takes into account a variety of linguistic featu ..."
Abstract - Cited by 38 (12 self) - Add to MetaCart
This paper presents a new discriminative model for information retrieval (IR), referred to as linear discriminant model (LDM), which provides a flexible framework to incorporate arbitrary features. LDM is different from most existing models in that it takes into account a variety of linguistic features that are derived from the component models of HMM that is widely used in language modeling approaches to IR. Therefore, LDM is a means of melding discriminative and generative models for IR. We present two algorithms of parameter learning for LDM. One is to optimize the average precision (AP) directly using an iterative procedure. The other is a perceptron-based algorithm that minimizes the number of discordant document-pairs in a rank list. The effectiveness of our approach has been evaluated on the task of ad hoc retrieval using six English and Chinese TREC test sets. Results show that (1) in most test sets, LDM significantly outperforms the state-of-the-art language modeling approaches and the classical probabilistic retrieval model; (2) it is more appropriate to train LDM using a measure of AP rather than likelihood if the IR system is graded on AP; and (3) linguistic features (e.g. phrases and dependences) are effective for IR if they are incorporated properly.

Chinese word segmentation and named entity recognition: a pragmatic approach

by Jianfeng Gao, Andi Wu, Mu Li, Chang-ning Huang - Computational Linguistics , 2005
"... This paper presents a pragmatic approach to Chinese word segmentation. It differentiates from most of the previous approaches mainly in three respects. First of all, while theoretical linguists have defined Chinese words with various linguistic criteria, Chinese words in this study are defined pragm ..."
Abstract - Cited by 14 (1 self) - Add to MetaCart
This paper presents a pragmatic approach to Chinese word segmentation. It differentiates from most of the previous approaches mainly in three respects. First of all, while theoretical linguists have defined Chinese words with various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in realistic computer applications. Secondly, we propose a pragmatic mathematical framework in which segmenting known words and detecting unknown words of different types (i.e. morphologically derived words, factoids, named entities, and other unlisted words) can be performed simultaneously in a unified way. These tasks are usually conducted separately in other systems. Finally, we do not assume the existence of a universal word segmentation standard which is application independent. Instead, we argue for the necessity of multiple segmentation standards due to the pragmatic fact that different NLP applications might require different granularities of Chinese words. These pragmatic approaches have been implemented in an adaptive Chinese word segmenter, called MSRSeg (access

Compressing Trigram Language Models With Golomb Coding

by Ken Church, Redmond Wa, Ted Hart, Jianfeng Gao
"... Trigram language models are compressed using a Golomb coding method inspired by the original Unix spell program. Compression methods trade off space, time and accuracy (loss). The proposed HashTBO method optimizes space at the expense of time and accuracy. Trigram language models are normally consid ..."
Abstract - Cited by 13 (3 self) - Add to MetaCart
Trigram language models are compressed using a Golomb coding method inspired by the original Unix spell program. Compression methods trade off space, time and accuracy (loss). The proposed HashTBO method optimizes space at the expense of time and accuracy. Trigram language models are normally considered memory hogs, but with HashTBO, it is possible to squeeze a trigram language model into a few megabytes or less. HashTBO made it possible to ship a trigram contextual speller in Microsoft Office 2007.

Exploring web scale language models for search query processing

by Jian Huang, Xiaolong Li, Jianfeng Gao, Kuansan Wang, Jiangbo Miao, Fritz Behr - In Proceedings of WWW 2010
"... It has been widely observed that search queries are composed in a very different style from that of the body or the title of a document. Many techniques explicitly accounting for this language style discrepancy have shown promising results for information retrieval, yet a large scale analysis on the ..."
Abstract - Cited by 11 (7 self) - Add to MetaCart
It has been widely observed that search queries are composed in a very different style from that of the body or the title of a document. Many techniques explicitly accounting for this language style discrepancy have shown promising results for information retrieval, yet a large scale analysis on the extent of the language differences has been lacking. In this paper, we present an extensive study on this issue by examining the language model properties of search queries and the three text streams associated with each web document: the body, the title, and the anchor text. Our information theoretical analysis shows that queries seem to be composed in a way most similar to how authors summarize documents in anchor texts or titles, offering a quantitative explanation to the observations in past work. We apply these web scale n-gram language models to three search query processing (SQP) tasks: query spelling correction, query bracketing and long query segmentation. By controlling the size and the order of different language models, we find that the perplexity metric to be a good accuracy indicator for these query processing tasks. We show that using smoothed language models yields significant accuracy gains for query bracketing for instance, compared to using web counts as in the literature. We also demonstrate that applying web-scale language models can have marked accuracy advantage over smaller ones.

A Large Scale Ranker-Based System for Search Query Spelling Correction

by Jianfeng Gao, Xiaolong Li, Daniel Micol, Chris Quirk
"... This paper makes three significant extensions to a noisy channel speller designed for standard written text to target the challenging domain of search queries. First, the noisy channel model is subsumed by a more general ranker, which allows a variety of features to be easily incorporated. Second, a ..."
Abstract - Cited by 8 (2 self) - Add to MetaCart
This paper makes three significant extensions to a noisy channel speller designed for standard written text to target the challenging domain of search queries. First, the noisy channel model is subsumed by a more general ranker, which allows a variety of features to be easily incorporated. Second, a distributed infrastructure is proposed for training and applying Web scale n-gram language models. Third, a new phrase-based error model is presented. This model places a probability distribution over transformations between multi-word phrases, and is estimated using large amounts of query-correction pairs derived from search logs. Experiments show that each of these extensions leads to significant improvements over the state-of-the-art baseline methods. 1

Exploring Asymmetric Clustering for Statistical Language Modeling

by Jianfeng Gao, Joshua Goodman, Guihong Cao, Hang Li - Proceedings of the Fortieth Annual Meeting of the Association for Computational Linguistics (ACL’2002). Philadelphia , 2002
"... The n-gram model is a stochastic model, which predicts the next word (predicted word) given the previous words (conditional words) in a word sequence. ..."
Abstract - Cited by 7 (0 self) - Add to MetaCart
The n-gram model is a stochastic model, which predicts the next word (predicted word) given the previous words (conditional words) in a word sequence.

Statistical query translation models for cross-language information retrieval

by Jianfeng Gao, Jian-yun Nie, Ming Zhou - ACM Transactions on Asian Language Information Processing (TALIP , 2006
"... Query translation is an important task in cross-language information retrieval (CLIR), which aims to determine the best translation words and weights for a query. This paper presents three statistical query translation models that focus on resolution of query translation ambiguities. All the models ..."
Abstract - Cited by 6 (0 self) - Add to MetaCart
Query translation is an important task in cross-language information retrieval (CLIR), which aims to determine the best translation words and weights for a query. This paper presents three statistical query translation models that focus on resolution of query translation ambiguities. All the models assume that the selection of the translation of a query term depends on the translations of other terms in the query. They differ in the way linguistic structures are detected and exploited. The co-occurrence model treats a query as a bag of words, and use all the other terms in the query as the context for translation disambiguation. The other two models exploit linguistic dependencies among terms. The noun phrase (NP) translation model detects NPs in a query, and translates each NP as a unit by assuming that the translation of a term only depends on other terms within the same NP. Similarly, the dependency translation model detects and translates dependency triples, such as verb-object, as units. The evaluations show that linguistic structures always lead to more precise translations. The experiments of CLIR on TREC Chinese collections show that all the three models have a positive impact on query translation, and lead to significant improvements of CLIR performance over the simple dictionary-based translation method. The best results are obtained by combining the three models.

A Chinese Term Clustering Mechanism for Generating Semantic Concepts of a News Ontology

by Chang-shing Lee, Yau-hwang Kuo, Chia-hsin Liao, Zhi-wei Jian - Journal of Computational Linguistics and Chinese Language Processing , 2005
"... In order to efficiently manage and use knowledge, ontology technologies are widely applied to various kinds of domain knowledge. This paper proposes a Chinese term clustering mechanism for generating semantic concepts of a news ontology. We utilize the parallel fuzzy inference mechanism to infer the ..."
Abstract - Cited by 1 (1 self) - Add to MetaCart
In order to efficiently manage and use knowledge, ontology technologies are widely applied to various kinds of domain knowledge. This paper proposes a Chinese term clustering mechanism for generating semantic concepts of a news ontology. We utilize the parallel fuzzy inference mechanism to infer the conceptual resonance strength of a Chinese term pair. There are four input fuzzy variables, consisting of a Part-of-Speech (POS) fuzzy variable, Term Vocabulary (TV) fuzzy variable, Term Association (TA) fuzzy variable, and Common Term Association (CTA) fuzzy variable, and one output fuzzy variable, the Conceptual Resonance Strength (CRS), in the mechanism. In addition, the CKIP tool is used in Chinese natural language processing tasks, including POS tagging, refining tagging, and stop word filtering. The fuzzy compatibility relation approach to the semantic concept clustering is also proposed. Simulation results show that our approach can effectively cluster Chinese terms to generate the semantic concepts of a news ontology.

© 2005 Association for Computational Linguistics

by Jianfeng Gao, Andi Wu, Mu Li, Chang-ning Huang
"... This paper presents a pragmatic approach to Chinese word segmentation. It differentiates from most of the previous approaches mainly in three respects. First of all, while theoretical linguists have defined Chinese words with various linguistic criteria, Chinese words in this study are defined pragm ..."
Abstract - Cited by 1 (1 self) - Add to MetaCart
This paper presents a pragmatic approach to Chinese word segmentation. It differentiates from most of the previous approaches mainly in three respects. First of all, while theoretical linguists have defined Chinese words with various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in realistic computer applications. Secondly, we propose a pragmatic mathematical framework in which segmenting known words and detecting unknown words of different types (i.e. morphologically derived words, factoids, named entities, and other unlisted words) can be performed simultaneously in a unified way. These tasks are usually conducted separately in other systems. Finally, we do not assume the existence of a universal word segmentation standard which is application independent. Instead, we argue for the necessity of multiple segmentation standards due to the pragmatic fact that different NLP applications might require different granularities of Chinese words. These pragmatic approaches have been implemented in an adaptive Chinese word segmenter, called MSRSeg, which will be described in detail. It consists of two components: (1) a generic segmenter that is based on the framework of linear mixture models, and provides a unified approach to the five fundamental features of word-level Chinese language processing: lexicon word processing, morphological analysis, factoid detection, named entity recognition, and new word identification; and (2) a set of output adaptors for adapting the output of the former to different application-specific standards. Evaluation on five test sets with different standards shows that the adaptive system achieves state-of-the-art performance on all the test sets. 1.
The National Science Foundation
  • About CiteSeerX
  • Submit Documents
  • Privacy Policy
  • Help
  • Data
  • Source
  • Contact Us

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2010 The Pennsylvania State University