Results 11 – 20 of 33
Multi-Style Language Model for Web Scale Information Retrieval
Abstract

Cited by 5 (3 self)
Web documents are typically associated with many text streams, including the body, the title and the URL, which are determined by the authors, and the anchor text or search queries used by others to refer to the documents. Through a systematic large-scale analysis of their cross entropy, we show that these text streams appear to be composed in different language styles, and hence warrant respective language models to properly describe their properties. We propose a language modeling approach to Web document retrieval in which each document is characterized by a mixture model with components corresponding to the various text streams associated with the document. Immediate issues for such a mixture model arise because not all the text streams are always present for a document, and they do not share the same lexicon, making it challenging to properly combine the statistics from the mixture components. To address these issues, we introduce an “open-vocabulary” smoothing technique so that all the component language models have the same cardinality and their scores can simply be linearly combined. To ensure that the approach can cope with Web-scale applications, the model training algorithm is designed to require no labeled data and can be fully automated with few heuristics and no empirical parameter tuning. The evaluation on Web document ranking tasks shows that the component language models indeed have varying degrees of capability, as predicted by the cross-entropy analysis, and the combined mixture model outperforms the state-of-the-art BM25F-based system.
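The linear combination step this abstract describes can be sketched in a few lines. The stream names, weights, and simple additive smoothing below are illustrative assumptions, not the paper's actual open-vocabulary scheme; what the sketch shows is only that smoothing every component over one shared vocabulary makes the scores directly combinable:

```python
from collections import Counter

def stream_lm(text, vocab, alpha=1.0):
    """Unigram LM for one text stream, additively smoothed over a shared
    vocabulary so every component model has the same cardinality."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: (counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}

def mixture_score(word, models, weights):
    """Linearly combine the component LM probabilities for one query term."""
    return sum(lam * lm[word] for lam, lm in zip(weights, models))

# Hypothetical body and title streams for one document.
vocab = {"language", "model", "web", "retrieval"}
body = stream_lm("web retrieval model model", vocab)
title = stream_lm("language model", vocab)
score = mixture_score("model", [body, title], weights=[0.7, 0.3])
```

Because each component is a proper distribution over the same vocabulary, the mixture score is itself a probability for any non-negative weights summing to one.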
Concentration Bounds for Unigrams Language Model
, 2004
Abstract

Cited by 4 (0 self)
We show several PAC-style concentration bounds for learning a unigram language model. One interesting quantity is the probability of all words appearing exactly k times in a sample of size m. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis of its error shows a PAC bound of approximately O . We improve its dependency on k to O 4 # k . We also analyze the empirical-frequencies estimator, showing that its PAC error bound is approximately . We derive a combined estimator, which has , for any k. A standard measure...
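The quantity this abstract centres on, the total probability of all words appearing exactly k times, has the classical Good-Turing estimate (k+1)·n_{k+1}/m, where n_j is the number of distinct words seen exactly j times. A minimal sketch (the sample string is purely illustrative):

```python
from collections import Counter

def good_turing_mass(sample, k):
    """Good-Turing estimate of the total probability of all words that
    appear exactly k times in a sample of size m: (k + 1) * n_{k+1} / m,
    where n_j is the number of distinct words appearing exactly j times."""
    m = len(sample)
    n = Counter(Counter(sample).values())  # n[j] = #distinct words with count j
    return (k + 1) * n.get(k + 1, 0) / m

sample = list("abracadabra")          # counts: a:5, b:2, r:2, c:1, d:1
unseen = good_turing_mass(sample, 0)  # (0+1) * n_1 / 11 = 2/11
```

The k = 0 case estimates the mass of unseen words, which is exactly where the empirical frequencies assign zero.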
Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
Abstract

Cited by 3 (0 self)
We present worst case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman’s alphabet decomposition is known to achieve state-of-the-art performance in prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of this hierarchical method and provides a compelling explanation for its practical success. In addition, we present the results of a few experiments that examine other possibilities for improving the multi-alphabet prediction performance of CTW-based algorithms.
Universal compression of Markov and related sources over arbitrary alphabets
 IEEE TRANSACTIONS ON INFORMATION THEORY
, 2006
Abstract

Cited by 3 (2 self)
Recent work has considered encoding a string by separately conveying its symbols and its pattern—the order in which the symbols appear. It was shown that the patterns of i.i.d. strings can be losslessly compressed with diminishing per-symbol redundancy. In this paper the pattern redundancy of distributions with memory is considered. Close lower and upper bounds are established on the pattern redundancy of strings generated by Hidden Markov Models with a small number of states, showing in particular that their per-symbol pattern redundancy diminishes with increasing string length. The upper bounds are obtained by analyzing the growth rate of the number of multidimensional integer partitions, and the lower bounds, using Hayman’s Theorem.
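The pattern of a string, the order in which its symbols first appear, is straightforward to compute; a minimal sketch:

```python
def pattern(s):
    """Replace each symbol by the rank (1, 2, 3, ...) of its first appearance,
    so strings that differ only by relabeling symbols share one pattern."""
    first_seen = {}
    out = []
    for ch in s:
        if ch not in first_seen:
            first_seen[ch] = len(first_seen) + 1
        out.append(first_seen[ch])
    return out

pattern("abracadabra")  # -> [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]
```

Conveying this index sequence plus a dictionary of the distinct symbols recovers the original string, which is the decomposition the abstract refers to.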
A Better Good-Turing Estimator for Sequence Probabilities
, 704
Abstract

Cited by 2 (0 self)
We consider the problem of estimating the probability of an observed string drawn i.i.d. from an unknown distribution. The key feature of our study is that the length of the observed string is assumed to be of the same order as the size of the underlying alphabet. In this setting, many letters are unseen and the empirical distribution tends to overestimate the probability of the observed letters. To overcome this problem, the traditional approach to probability estimation is to use the classical Good-Turing estimator. We introduce a natural scaling model and use it to show that the Good-Turing sequence probability estimator is not consistent. We then introduce a novel sequence probability estimator that is indeed consistent under the natural scaling model.
A Universal Compression Perspective of Smoothing
Abstract

Cited by 1 (0 self)
We analyze smoothing algorithms from a universal-compression perspective. Instead of evaluating their performance on an empirical sample, we analyze their performance on the most inconvenient sample possible. Consequently the performance of the algorithm can be guaranteed even on unseen data. We show that universal compression bounds can explain the empirical performance of several smoothing methods. We also describe a new interpolated additive smoothing algorithm, and show that it has lower training complexity and better compression performance than existing smoothing techniques. Key words: Language modeling, universal compression, smoothing
HAYMAN ADMISSIBLE FUNCTIONS IN SEVERAL VARIABLES
Abstract

Cited by 1 (1 self)
An alternative generalisation of Hayman’s admissible functions ([17]) to functions in several variables is developed, and a multivariate asymptotic expansion for the coefficients is proved. In contrast to existing generalisations of Hayman admissibility ([7]), most of the closure properties which are satisfied by Hayman’s admissible functions can be shown to hold for this class of functions as well.
Book Review: The Essential Turing, reviewed by Andrew Hodges
"... The Essential Turing is a selection of writings of the ..."
ON THE RELATION BETWEEN ADDITIVE SMOOTHING AND UNIVERSAL CODING
Abstract
We analyze the performance of smoothing methods for language modeling from the perspective of universal compression. We use existing asymptotic bounds on the performance of simple additive rules for compression of finite-alphabet memoryless sources to explain the empirical predictive abilities of additive smoothing techniques. We further suggest a smoothing method that overcomes some of the problems observed in previous approaches. The new method outperforms existing ones on the Wall Street Journal (WSJ) database for bigram and trigram models. We then suggest possible directions for future research.
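Interpolated additive smoothing of the kind these abstracts discuss can be sketched as follows. The bigram form, the pseudocount delta, and the interpolation weight lam are illustrative assumptions, not the specific method proposed above:

```python
from collections import Counter

def additive_bigram_lm(tokens, vocab_size, delta=0.5, lam=0.9):
    """P(w | v) interpolates an additively smoothed bigram estimate with an
    additively smoothed unigram estimate (a sketch; delta and lam are
    illustrative and would normally be tuned or derived)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)

    def prob(v, w):
        big = (bigrams[(v, w)] + delta) / (unigrams[v] + delta * vocab_size)
        uni = (unigrams[w] + delta) / (n + delta * vocab_size)
        return lam * big + (1 - lam) * uni

    return prob

tokens = "the cat sat on the mat".split()
prob = additive_bigram_lm(tokens, vocab_size=len(set(tokens)))
```

Additive rules like this are exactly the "simple additive rules" whose universal-compression redundancy bounds the abstract uses to explain predictive performance.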
Summary
Abstract
Estimating bacterial diversity from clone libraries with flat rank abundance distributions