An Improved Error Model for Noisy Channel Spelling Correction (2000)

by Eric Brill, Robert C. Moore

Results 1 - 10 of 56

Multipath Translation Lexicon Induction via Bridge Languages

by Gideon S. Mann, David Yarowsky - In Proceedings of NAACL 2001, 2001
Cited by 45 (1 self)
This paper presents a method for inducing translation lexicons based on transduction models of cognate pairs via bridge languages. Bilingual lexicons within language families are induced using probabilistic string edit distance models. Translation lexicons for arbitrary distant language pairs are then generated by a combination of these intra-family translation models and one or more cross-family online dictionaries. Up to 95% exact match accuracy is achieved on the target vocabulary (30-68% of inter-family test pairs). Thus substantial portions of translation lexicons can be generated accurately for languages where no bilingual dictionary or parallel corpora may exist.

A conditional random field for discriminatively-trained finite-state string edit distance

by Andrew McCallum, Kedar Bellare - In Conference on Uncertainty in AI (UAI), 2005
Cited by 33 (5 self)
The need to measure sequence similarity arises in information extraction, object identity, data mining, biological sequence analysis, and other domains. This paper presents discriminative string-edit CRFs, a finite-state conditional random field model for edit sequences between strings. Conditional random fields have advantages over generative approaches to this problem, such as pair HMMs or the work of Ristad and Yianilos, because as conditionally-trained methods, they enable the use of complex, arbitrary actions and features of the input strings. As in generative models, the training data does not have to specify the edit sequences between the given string pairs. Unlike generative models, however, our model is trained on both positive and negative instances of string pairs. We present positive experimental results on several data sets.

Pronunciation Modeling for Improved Spelling Correction

by Kristina Toutanova, Robert C. Moore - Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002
Cited by 30 (0 self)
This paper presents a method for incorporating word pronunciation information in a noisy channel model for spelling correction.

Learning a spelling error model from search query logs

by Farooq Ahmad - In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (Vancouver, British , 2005
Cited by 18 (0 self)
Applying the noisy channel model to search query spelling correction requires an error model and a language model. Typically, the error model relies on a weighted string edit distance measure. The weights can be learned from pairs of misspelled words and their corrections. This paper investigates using the Expectation Maximization algorithm to learn edit distance weights directly from search query logs, without relying on a corpus of paired words.
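
The pairing of an error model with a language model described in this abstract can be illustrated with a minimal sketch. It is not the paper's method: the function names, the toy probabilities, the substitution-cost table, and the choice to treat edit cost as a negative log channel probability are all assumptions made for illustration, and the weights are fixed by hand rather than learned with Expectation Maximization.

```python
import math

def weighted_edit_distance(source, target, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Edit distance where each substitution (a, b) may have its own cost.

    sub_cost is a hypothetical dict mapping (source_char, target_char) to a
    cost; pairs that are not listed fall back to a cost of 1.0.
    """
    m, n = len(source), len(target)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = source[i - 1], target[j - 1]
            sub = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,   # delete a from source
                dist[i][j - 1] + ins_cost,   # insert b into source
                dist[i - 1][j - 1] + sub,    # match or substitute
            )
    return dist[m][n]

def correct(query, language_model, sub_cost):
    """Return the word w maximizing log P(w) - cost(w -> query).

    The edit cost stands in for -log P(query | w); that reading is an
    assumption of this sketch.
    """
    def score(word):
        return math.log(language_model[word]) - weighted_edit_distance(
            word, query, sub_cost
        )
    return max(language_model, key=score)

# Toy inputs, invented for illustration only.
toy_language_model = {"there": 0.6, "three": 0.3, "threw": 0.1}
toy_sub_cost = {("e", "a"): 0.2}  # pretend e -> a is a common typo
print(correct("thera", toy_language_model, toy_sub_cost))
```

Lowering the cost of a frequent confusion such as e -> a makes candidates reachable through that edit more likely to win, which is the effect that learned weights are meant to capture.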

Exploring distributional similarity based models for query spelling correction

by Mu Li, Yang Zhang - In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, 2006
Cited by 17 (0 self)
A query speller is crucial to a search engine in improving web search relevance. This paper describes novel methods for use of distributional similarity estimated from query logs in learning improved query spelling correction models. The key to our methods is the property of distributional similarity between two terms: it is high between a frequently occurring misspelling and its correction, and low between two irrelevant terms only with similar spellings. We present two models that are able to take advantage of this property. Experimental results demonstrate that the distributional similarity based models can significantly outperform their baseline systems in the web query spelling correction task.
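
The property this abstract relies on can be shown on a toy query log. The abstract does not say how distributional similarity is computed, so the sketch below assumes one common choice, cosine similarity over counts of co-occurring query words; the helper names and the queries are invented for illustration.

```python
import math
from collections import Counter

def context_vector(term, queries):
    """Count the words that appear alongside `term` in the same query."""
    counts = Counter()
    for query in queries:
        words = query.split()
        if term in words:
            counts.update(w for w in words if w != term)
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy query log, invented for illustration only.
toy_queries = [
    "britney spears lyrics",
    "britny spears lyrics",
    "britney spears songs",
    "britny spears songs",
    "brittany ferry timetable",
    "brittany ferry booking",
]

correct_term = context_vector("britney", toy_queries)
misspelling = context_vector("britny", toy_queries)
lookalike = context_vector("brittany", toy_queries)

print(cosine(correct_term, misspelling))  # shared contexts: high similarity
print(cosine(correct_term, lookalike))    # similar spelling, unrelated contexts
```

In this toy log the misspelling shares every context word with its correction, while the similarly spelled but unrelated term shares none, so a speller could use the similarity score to separate the two cases.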

Automatically Harvesting Katakana-English Term Pairs from Search Engine Query Logs

by Eric Brill, Gary Kacmarcik, Chris Brockett, 2001
Cited by 15 (0 self)
This paper describes a method of extracting katakana words and phrases, along with their English counterparts from non-aligned monolingual web search engine query logs. The method employs a trainable edit distance function to find <katakana, English> pairs that have a high probability of being equivalent. These pairs can then be used to further bootstrap training of the edit distance function, resulting in improved back-transliteration from katakana to English. In addition, this is an effective method for mining large numbers of katakana strings to enhance a bilingual lexicon. The improved edit distance function and enhanced lexicon can be used for more accurate alignment of bitexts, and for application during runtime MT and multilingual IR.

tRuEcasIng

by Lucian Vlad Lita, Abe Ittycheriah, Salim Roukos, Nanda Kambhatla (IBM T. J. Watson), 2003
Cited by 15 (0 self)
Truecasing is the process of restoring case information to badly-cased or noncased text. This paper explores truecasing issues and proposes a statistical, language modeling based truecaser which achieves an accuracy of on news articles. Task based evaluation shows a 26% F-measure improvement in named entity recognition when using truecasing.

Learning phrase-based spelling error models from clickthrough data

by Xu Sun, Daniel Micol, Jianfeng Gao, Chris Quirk - In ACL, 2010
Cited by 11 (2 self)
This paper explores the use of clickthrough data for query spelling correction. First, large amounts of query-correction pairs are derived by analyzing users' query reformulation behavior encoded in the clickthrough data. Then, a phrase-based error model that accounts for the transformation probability between multi-term phrases is trained and integrated into a query speller system. Experiments are carried out on a human-labeled data set. Results show that the system using the phrase-based error model significantly outperforms its baseline systems.

A Large Scale Ranker-Based System for Search Query Spelling Correction

by Jianfeng Gao, Xiaolong Li, Daniel Micol, Chris Quirk
Cited by 8 (2 self)
This paper makes three significant extensions to a noisy channel speller designed for standard written text to target the challenging domain of search queries. First, the noisy channel model is subsumed by a more general ranker, which allows a variety of features to be easily incorporated. Second, a distributed infrastructure is proposed for training and applying Web scale n-gram language models. Third, a new phrase-based error model is presented. This model places a probability distribution over transformations between multi-word phrases, and is estimated using large amounts of query-correction pairs derived from search logs. Experiments show that each of these extensions leads to significant improvements over the state-of-the-art baseline methods.

Improved string matching under noisy channel conditions

by Kevyn Collins-Thompson - In Proceedings of CIKM, 2001
Cited by 7 (1 self)
Many document-based applications, including popular Web browsers, email viewers, and word processors, have a ‘Find on this Page’ feature that allows a user to find every occurrence of a given string in the document. If the document text being searched is derived from a noisy process such as optical character recognition (OCR), the effectiveness of typical string matching can be greatly reduced. This paper describes an enhanced string-matching algorithm for degraded text that improves recall, while keeping precision at acceptable levels. The algorithm is more general than most approximate matching algorithms and allows string-to-string edits with arbitrary costs. We develop a method for evaluating our technique and use it to examine the relative effectiveness of each sub-component of the algorithm. Of the components we varied, we find that using confidence information from the recognition process leads to the largest improvements in matching accuracy.