Improving trigram language modeling with the world wide web (2000)

by X Zhu, R Rosenfeld

Results 1 - 10 of 21

Web-based models for natural language processing

by Mirella Lapata, Frank Keller - ACM Transactions on Speech and Language Processing, 2005
Cited by 48 (0 self)
Previous work demonstrated that Web counts can be used to approximate bigram counts, suggesting that Web-based frequencies should be useful for a wide variety of Natural Language Processing (NLP) tasks. However, only a limited number of tasks have so far been tested using Web-scale data sets. The present article overcomes this limitation by systematically investigating the performance of Web-based models for several NLP tasks, covering both syntax and semantics, both generation and analysis, and a wider range of n-grams and parts of speech than have been previously explored. For the majority of our tasks, we find that simple, unsupervised models perform better when n-gram counts are obtained from the Web rather than from a large corpus. In some cases, performance can be improved further by using backoff or interpolation techniques that combine Web counts and corpus counts. However, unsupervised Web-based models generally fail to outperform supervised state-of-the-art models trained on smaller corpora. We argue that Web-based models should therefore be used as a baseline for, rather than an alternative to, standard supervised models.

Statistical language model adaptation: review and perspectives

by Jerome R. Bellegarda - Speech Communication, 2004
Cited by 35 (0 self)
Speech recognition performance is severely affected when the lexical, syntactic, or semantic characteristics of the discourse in the training and recognition tasks differ. The aim of language model adaptation is to exploit specific, albeit limited, knowledge about the recognition task to compensate for this mismatch. More generally, an adaptive language model seeks to maintain an adequate representation of the current task domain under changing conditions involving potential variations in vocabulary, syntax, content, and style. This paper presents an overview of the major approaches proposed to address this issue, and offers some perspectives regarding their comparative merits and associated tradeoffs. © 2003 Elsevier B.V. All rights reserved.

Applied Text Analytics for Blogs

by Gilad Mishne - University of Amsterdam, 2007
Cited by 13 (0 self)
Abstract not found

TIJAH at INEX 2004: Modeling Phrases and Relevance Feedback

by Vojkan Mihajlović, Georgina Ramírez, Arjen P. De Vries, Djoerd Hiemstra, Henk Ernst Blok - In Proceedings of the 3rd INEX Workshop, LNCS 3493, 2005
Cited by 10 (7 self)
This paper discusses our participation in INEX using the TIJAH XML-IR system. We have enriched the TIJAH system, which follows a standard layered database architecture, with several new features. An extensible conceptual level processing unit has been added to the system. The algebra on the logical level and the implementation on the physical level have been extended to support phrase search and structural relevance feedback. The conceptual processing unit is capable of rewriting NEXI content-only and content-and-structure queries into the internal form, based on the retrieval model parameter specification, that is either predefined or based on relevance feedback. Relevance feedback parameters are produced based on the data fusion of result element score values and sizes, and relevance assessments. The introduction of new operators supporting phrase search in score region algebra on the logical level is discussed in the paper, as well as their implementation on the physical level using the pre-post numbering scheme. The framework for structural relevance feedback is also explained in the paper. We conclude with a preliminary analysis of the system performance based on INEX 2004 runs.

A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web

by Martin Klein, Michael L. Nelson - In Proceedings of WIDM ’08
Cited by 8 (4 self)
For bounded datasets such as the TREC Web Track the computation of term frequency (TF) and inverse document frequency (IDF) is not difficult. However, since IDF cannot be directly calculated for the entire web, it must be estimated. We see a need to estimate accurate IDF values to generate TF-IDF based lexical signatures (LSs) of web pages. Future applications for generating such LSs require a real time IDF computation. Therefore we conducted a comparison study of different methods to estimate IDF values of web pages. Our objective is to investigate how accurate these estimation methods are compared to a baseline. We use the Google N-grams as our baseline and compare it against two IDF estimation techniques which are based on: 1) a “local universe” consisting of textual content and the corresponding document frequencies from copies of URLs from the Internet Archive and 2) “screen scraping”, a technique to query the Google web interface for document frequencies. We found a term overlap of 70 to 80% between the results of the two methods and the baseline. We further discovered a great agreement in rank correlation of TF-IDF ranked terms between our methods. Kendall τ is approximately 0.8 and the M-Score (penalizing discordances in higher ranks) is even higher, peaking at well above 0.9. These preliminary results lead us to the conclusion that both methods are appropriate for creating accurate IDF values for web pages.
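The TF-IDF lexical signature described in this abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization, a smoothed logarithmic IDF, and a caller-supplied table of document frequencies; the function names, the smoothing, and the toy counts are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def idf(doc_freq, num_docs):
    """Smoothed inverse document frequency (the +1 is an assumption
    made here to avoid division by zero for unseen terms)."""
    return math.log(num_docs / (1 + doc_freq))


def lexical_signature(text, doc_freqs, num_docs, k=5):
    """Return the k terms of `text` with the highest TF-IDF score.

    doc_freqs maps a term to its (estimated) document frequency;
    terms missing from the table are treated as having frequency 0.
    """
    tf = Counter(text.lower().split())
    scores = {
        term: count * idf(doc_freqs.get(term, 0), num_docs)
        for term, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Hypothetical document frequencies over a 1000-document collection.
toy_doc_freqs = {"the": 900, "web": 300, "archive": 10}
signature = lexical_signature(
    "the web archive the archive", toy_doc_freqs, 1000, k=2
)
```

In this toy run the rare term "archive" outranks the frequent term "the" even though both occur twice, which is the effect the IDF weighting is meant to produce.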

A study of using search engine page hits as a proxy for n-gram frequencies

by Preslav Nakov, Marti Hearst - In Proceedings of the RANLP’05, 2005
Cited by 7 (1 self)
The idea of using the Web as a corpus for linguistic research is getting increasingly popular. Most often this means using Web search engine page hit counts as estimates for n-gram frequencies. While the results so far have been very encouraging, some researchers worry about what appears to be the instability of these estimates. Using a particular NLP task, we compare the variability in the n-gram counts across different search engines as well as for the same search engine across time, finding that although there are measurable differences, they are not statistically significantly different for the task examined.

Factoid Question Answering over Unstructured and Structured Web Content

by Silviu Cucerzan, Eugene Agichtein
Cited by 5 (0 self)
We describe our experience with two new, built-from-scratch, web-based question answering systems applied to the TREC 2005 Main Question Answering task, which use complementary models of answering questions over both structured and unstructured content on the Web. Our approaches depart from previous question answering (QA) work in several ways. For unstructured content, we used a web-based system with novel features such as web snippet pattern matching and generic answer type matching using web counts. We also experimented with a new, complementary question answering approach that uses information from the millions of tables and lists that abound on the web.

Rapid Language Model Development Using External Resources for New Spoken Dialog Domains

by Ruhi Sarikaya, Agustin Gravano, Yuqing Gao - in Proc. ICASSP, 2005
Cited by 5 (0 self)
This paper addresses a critical problem in deploying a spoken dialog system (SDS). One of the main bottlenecks of SDS deployment for a new domain is data sparseness in building a statistical language model. Our goal is to devise a method to efficiently build a reliable language model for a new SDS. We consider the worst yet quite common scenario where only a small amount (∼1.7K utterances) of domain specific data is available for the target domain. We present a new method that exploits external static text resources that are collected for other speech recognition tasks as well as dynamic text resources acquired from the World Wide Web (WWW). We show that language models built using external resources can jointly be used with a limited in-domain (baseline) language model to obtain significant improvements in speech recognition accuracy. Combining language models built using external resources with the in-domain language model provides over 20% reduction in WER over the baseline in-domain language model. Equivalently, we achieve almost the same level of performance by having ten times as much in-domain data (17K utterances).
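The abstract does not say how the in-domain and external language models are combined. One common option is linear interpolation, sketched below for unigram distributions; the function name, the weight `lam`, and the toy probabilities are assumptions for illustration, not values from the paper.

```python
def interpolate(p_in, p_ext, lam):
    """Linearly interpolate two word distributions.

    p_in  : in-domain model, mapping word -> probability
    p_ext : external model, mapping word -> probability
    lam   : weight on the in-domain model, between 0 and 1
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    vocab = set(p_in) | set(p_ext)
    return {
        word: lam * p_in.get(word, 0.0) + (1.0 - lam) * p_ext.get(word, 0.0)
        for word in vocab
    }


# Toy distributions: the in-domain model has never seen "hotel".
p_in_domain = {"flight": 0.5, "book": 0.5}
p_external = {"flight": 0.2, "book": 0.2, "hotel": 0.6}
mixed = interpolate(p_in_domain, p_external, lam=0.5)
```

The mixed model assigns non-zero probability to "hotel", a word absent from the small in-domain data, which is the kind of sparseness relief the abstract describes. In practice the weight would typically be tuned on held-out in-domain data.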

Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure

by Martin Klein
Cited by 4 (3 self)
Missing web pages (pages that return the 404 “Page Not Found” error) are part of the browsing experience. The manual use of search engines to rediscover missing pages can be frustrating and unsuccessful. We compare four automated methods for rediscovering web pages. We extract the page’s title, generate the page’s lexical signature (LS), obtain the page’s tags from the bookmarking website delicious.com and generate an LS from the page’s link neighborhood. We use the output of all methods to query Internet search engines and analyze their retrieval performance. Our results show that both LSs and titles perform fairly well with over 60% of URIs returned top ranked from Yahoo!. However, the combination of methods improves the retrieval performance. Considering the complexity of the LS generation, querying the title first and in case of insufficient results querying the LSs second is the preferable setup. This combination accounts for more than 75% top ranked URIs.

Resolving translation ambiguity using monolingual corpora

by Yan Qu, Greg Grefenstette, David A. Evans - In: Cross Language Evaluation Forum 2002, 2002
Cited by 2 (0 self)
Choosing the correct target words is a difficult problem for machine translation. In cross-language information retrieval, this problem of choice is mitigated since more than one translation alternative can be retained in the target query. Between choosing just one word as a translation and keeping all the possible translations for each source word, one can apply a range of filtering techniques for eliminating some words and keeping others. In the bilingual track of CLEF 2002, focusing on word translation ambiguity, we experimented with several techniques for choosing the best target translation for each source query word by using co-occurrence statistics in a reference corpus consisting of documents in the target language. One of two distinct corpora was used, the target-language test corpus or the World Wide Web. Our techniques give one best translation per source query word. We also experimented with combining these word choice results (providing up to three translations for each word) in the final translated query. The source query languages were Spanish and Chinese; the target language documents were in English. We submitted four automatic runs for each language pair. When the methods were combined, mixing results obtained with different reference corpora, the recall and average precision of Spanish-to-English retrieval reached 95% and 97%, respectively, of the recall and average precision of an English monolingual retrieval run. For Chinese-to-English text retrieval, the recall and average precision reached 89% and 60%, respectively, of the English run.