Results 1 - 10
of
11
A Large Scale Ranker-Based System for Search Query Spelling Correction
"... This paper makes three significant extensions to a noisy channel speller designed for standard written text to target the challenging domain of search queries. First, the noisy channel model is subsumed by a more general ranker, which allows a variety of features to be easily incorporated. Second, a ..."
Abstract
-
Cited by 8 (2 self)
- Add to MetaCart
This paper makes three significant extensions to a noisy channel speller designed for standard written text to target the challenging domain of search queries. First, the noisy channel model is subsumed by a more general ranker, which allows a variety of features to be easily incorporated. Second, a distributed infrastructure is proposed for training and applying Web scale n-gram language models. Third, a new phrase-based error model is presented. This model places a probability distribution over transformations between multi-word phrases, and is estimated using large amounts of query-correction pairs derived from search logs. Experiments show that each of these extensions leads to significant improvements over the state-of-the-art baseline methods. 1
Data-Driven Response Generation in Social Media
"... We present a data-driven approach to generating responses to Twitter status posts, based on phrase-based Statistical Machine Translation. We find that mapping conversational stimuli onto responses is more difficult than translating between languages, due to the wider range of possible responses, the ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
We present a data-driven approach to generating responses to Twitter status posts, based on phrase-based Statistical Machine Translation. We find that mapping conversational stimuli onto responses is more difficult than translating between languages, due to the wider range of possible responses, the larger fraction of unaligned words/phrases, and the presence of large phrase pairs whose alignment cannot be further decomposed. After addressing these challenges, we compare approaches based on SMT and Information Retrieval in a human evaluation. We show that SMT outperforms IR on this task, and its output is preferred over actual human responses in 15 % of cases. As far as we are aware, this is the first work to investigate the use of phrase-based SMT to directly translate a linguistic stimulus into an appropriate response. 1
Online Spelling Correction for Query Completion
"... In this paper, we study the problem of online spelling correction for query completions. Misspelling is a common phenomenon among search engines queries. In order to help users effectively express their information needs, mechanisms for automatically correcting misspelled queries are required. Onlin ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
In this paper, we study the problem of online spelling correction for query completions. Misspelling is a common phenomenon among search engines queries. In order to help users effectively express their information needs, mechanisms for automatically correcting misspelled queries are required. Online spelling correction aims to provide spell corrected completion suggestions as a query is incrementally entered. As latency is crucial to the utility of the suggestions, such an algorithm needs to be not only accurate, but also efficient. To tackle this problem, we propose and study a generative model for input queries, based on a noisy channel transformation of the intended queries. Utilizing spelling correction pairs, we train a Markov n-gram transformation model that captures user spelling behavior in an unsupervised fashion. To find the top spellcorrected completion suggestions in real-time, we adapt the A* search algorithm with various pruning heuristics to dynamically expand the search space efficiently. Evaluation of the proposed methods demonstrates a substantial increase in the effectiveness of online spelling correction over existing techniques.
Hashing-based Approaches to Spelling Correction of Personal Names
"... We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the dat ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions- the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. More over, the method that uses bilingual data for learning hash functions gives the best performance. 1
Phrase-Based Translation Model for Question Retrieval in Community Question Answer Archives
"... Community-based question answer (Q&A) has become an important issue due to the popularity of Q&A archives on the web. This paper is concerned with the problem of question retrieval. Question retrieval in Q&A archives aims to find historical questions that are semantically equivalent or relevant to t ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Community-based question answer (Q&A) has become an important issue due to the popularity of Q&A archives on the web. This paper is concerned with the problem of question retrieval. Question retrieval in Q&A archives aims to find historical questions that are semantically equivalent or relevant to the queried questions. In this paper, we propose a novel phrase-based translation model for question retrieval. Compared to the traditional word-based translation models, the phrasebased translation model is more effective because it captures contextual information in modeling the translation of phrases as a whole, rather than translating single words in isolation. Experiments conducted on real Q&A data demonstrate that our proposed phrasebased translation model significantly outperforms the state-of-the-art word-based translation model. 1
A Graph Approach to Spelling Correction in Domain-Centric Search
"... Spelling correction for keyword-search queries is challenging in restricted domains such as personal email (or desktop) search, due to the scarcity of query logs, and due to the specialized nature of the domain. For that task, this paper presents an algorithm that is based on statistics from the cor ..."
Abstract
- Add to MetaCart
Spelling correction for keyword-search queries is challenging in restricted domains such as personal email (or desktop) search, due to the scarcity of query logs, and due to the specialized nature of the domain. For that task, this paper presents an algorithm that is based on statistics from the corpus data (rather than the query log). This algorithm, which employs a simple graph-based approach, can incorporate different types of data sources with different levels of reliability (e.g., email subject vs. email body), and can handle complex spelling errors like splitting and merging of words. An experimental study shows the superiority of the algorithm over existing alternatives in the email domain. 1
Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence User-Dependent Aspect Model for Collaborative Activity Recognition ∗
"... Activity recognition aims to discover one or more users ’ actions and goals based on sensor readings. In the real world, a single user’s data are often insufficient for training an activity recognition model due to the data sparsity problem. This is especially true when we are interested in obtainin ..."
Abstract
- Add to MetaCart
Activity recognition aims to discover one or more users ’ actions and goals based on sensor readings. In the real world, a single user’s data are often insufficient for training an activity recognition model due to the data sparsity problem. This is especially true when we are interested in obtaining a personalized model. In this paper, we study how to collaboratively use different users ’ sensor data to train a model that can provide personalized activity recognition for each user. We propose a user-dependent aspect model for this collaborative activity recognition task. Our model introduces user aspect variables to capture the user grouping information, so that a target user can also benefit from her similar users in the same group to train the recognition model. In this way, we can greatly reduce the need for much valuable and expensive labeled data required in training the recognition model for each user. Our model is also capable of incorporating time information and handling new user in activity recognition. We evaluate our model on a real-world WiFi data set obtained from an indoor environment, and show that the proposed model can outperform several state-of-art baseline algorithms. 1
CHIME: An Efficient Error-Tolerant Chinese Pinyin Input Method
"... Chinese Pinyin input methods are very important for Chinese language processing. In many cases, users may make typing errors. For example, a user wants to type in “shenme ” ( 什 么 , meaning “what” in English) but may type in “shenem ” instead. Existing Pinyin input methods fail in converting such ..."
Abstract
- Add to MetaCart
Chinese Pinyin input methods are very important for Chinese language processing. In many cases, users may make typing errors. For example, a user wants to type in “shenme ” ( 什 么 , meaning “what” in English) but may type in “shenem ” instead. Existing Pinyin input methods fail in converting such a Pinyin sequence with errors to the right Chinese words. To solve this problem, we developed an efficient error-tolerant Pinyin input method called “CHIME ” that can handle typing errors. By incorporating state-of-the-art techniques and languagespecific features, the method achieves a better performance than state-of-the-art input methods. It can efficiently find relevant words in milliseconds for an input Pinyin sequence. 1
CloudSpeller: Spelling Correction for Search Queries by Using a Unified Hidden Markov Model with Web-scale Resources
"... Query spelling correction is a crucial component of moden search engines that can help users to express an information need more accurately and thus improve search quality. In participation of the Microsoft Speller Challenge, we proposed and implemented an efficient end-to-end speller correction sys ..."
Abstract
- Add to MetaCart
Query spelling correction is a crucial component of moden search engines that can help users to express an information need more accurately and thus improve search quality. In participation of the Microsoft Speller Challenge, we proposed and implemented an efficient end-to-end speller correction system, namely CloudSpeller. The CloudSpeller system uses a Hidden Markov model to effectively model major types of spelling errors in a unified framework, in which we integrate a large-scale lexicon constructed using Wikipedia, an error model trained from high confidence correction pairs, and the Microsoft Web N-gram service. Our system achieves excellent performance on two search query spelling correction datasets, reaching 0.970 and 0.940 F1 scores on the TREC dataset and the MSN dataset respectively.
A Unified Approach to Transliteration-based Text Input with Online Spelling Correction
"... This paper presents an integrated, end-to-end approach to online spelling correction for text input. Online spelling correction refers to the spelling correction as you type, as opposed to post-editing. The online scenario is particularly important for languages that routinely use transliteration-ba ..."
Abstract
- Add to MetaCart
This paper presents an integrated, end-to-end approach to online spelling correction for text input. Online spelling correction refers to the spelling correction as you type, as opposed to post-editing. The online scenario is particularly important for languages that routinely use transliteration-based text input methods, such as Chinese and Japanese, because the desired target characters cannot be input at all unless they are in the list of candidates provided by an input method, and spelling errors prevent them from appearing in the list. For example, a user might type suesheng by mistake to mean xuesheng 学 生 'student ' in Chinese; existing input methods fail to convert this misspelled input to the desired target Chinese characters. In this paper, we propose a unified approach to the problem of spelling correction and transliteration-based character conversion using an approach inspired by the phrasebased statistical machine translation framework. At the phrase (substring) level, k most probable pinyin (Romanized Chinese) corrections are generated using a monotone decoder; at the sentence level, input pinyin strings are directly transliterated into target Chinese characters by a decoder using a loglinear model that refer to the features of both levels. A new method of automatically deriving parallel training data from user keystroke logs is also presented. Experiments on Chinese pinyin conversion show that our integrated method reduces the character error rate by 20 % (from 8.9 % to 7.12%) over the previous state-of-the art based on a noisy channel model. 1

