Results 1 - 10
of
21
On prediction using variable order Markov models
- JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH
, 2004
"... This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Cont ..."
Abstract
-
Cited by 42 (1 self)
- Add to MetaCart
This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a “decomposed” CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems.
Multiple Kernel Tracking with SSD
- IN CVPR’04
, 2004
"... Kernel-based objective functions optimized using the mean shift algorithm have been demonstrated as an effective means of tracking in video sequences. The resulting algorithms combine the robustness and invariance properties afforded by traditional density-based measures of image similarity, while c ..."
Abstract
-
Cited by 37 (1 self)
- Add to MetaCart
Kernel-based objective functions optimized using the mean shift algorithm have been demonstrated as an effective means of tracking in video sequences. The resulting algorithms combine the robustness and invariance properties afforded by traditional density-based measures of image similarity, while connecting these techniques to continuous optimization algorithms. This paper
Machine learning methods for predicting failures in hard drives: A multiple-instance application
- Journal of Machine Learning research
, 2005
"... We compare machine learning methods applied to a difficult real-world problem: predicting computer hard-drive failure using attributes monitored internally by individual drives. The problem is one of detecting rare events in a time series of noisy and nonparametrically-distributed data. We develop a ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
We compare machine learning methods applied to a difficult real-world problem: predicting computer hard-drive failure using attributes monitored internally by individual drives. The problem is one of detecting rare events in a time series of noisy and nonparametrically-distributed data. We develop a new algorithm based on the multiple-instance learning framework and the naive Bayesian classifier (mi-NB) which is specifically designed for the low false-alarm case, and is shown to have promising performance. Other methods compared are support vector machines (SVMs), unsupervised clustering, and non-parametric statistical tests (rank-sum and reverse arrangements). The failure-prediction performance of the SVM, rank-sum and mi-NB algorithm is considerably better than the threshold method currently implemented in drives, while maintaining low false alarm rates. Our results suggest that nonparametric statistical tests should be considered for learning problems involving detecting rare events in time series data. An appendix details the calculation of rank-sum significance probabilities in the case of discrete, tied observations, and we give new recommendations about when the exact calculation should be used instead of the commonly-used normal approximation. These normal approximations may be particularly inaccurate for rare event problems like hard drive failures.
Universal compression of memoryless sources over unknown alphabets
- IEEE TRANSACTIONS ON INFORMATION THEORY
, 2004
"... It has long been known that the compression redundancy of independent and identically distributed (i.i.d.) strings increases to infinity as the alphabet size grows. It is also apparent that any string can be described by separately conveying its symbols, and its pattern—the order in which the symbol ..."
Abstract
-
Cited by 16 (5 self)
- Add to MetaCart
It has long been known that the compression redundancy of independent and identically distributed (i.i.d.) strings increases to infinity as the alphabet size grows. It is also apparent that any string can be described by separately conveying its symbols, and its pattern—the order in which the symbols appear. Concentrating on the latter, we show that the patterns of i.i.d. strings over all, including infinite and even unknown, alphabets, can be compressed with diminishing redundancy, both in block and sequentially, and that the compression can be performed in linear time. To establish these results, we show that the number of patterns is the Bell number, that the number of patterns with a given number of symbols is the Stirling number of the second kind, and that the redundancy of patterns can be bounded using results of Hardy and Ramanujan on the number of integer partitions. The results also imply an asymptotically optimal solution for the Good-Turing probability-estimation problem.
Probabilistic Finite-State Machines - Part I
"... Probabilistic finite-state machines are used today in a variety of areas in pattern recognition, or in fields to which pattern recognition is linked: computational linguistics, machine learning, time series analysis, circuit testing, computational biology, speech recognition and machine translatio ..."
Abstract
-
Cited by 9 (1 self)
- Add to MetaCart
Probabilistic finite-state machines are used today in a variety of areas in pattern recognition, or in fields to which pattern recognition is linked: computational linguistics, machine learning, time series analysis, circuit testing, computational biology, speech recognition and machine translation are some of them. In part I of this paper we survey these generative objects and study their definitions and properties. In part II, we will study the relation of probabilistic finite-state automata with other well known devices that generate strings as hidden Markov models and n-grams, and provide theorems, algorithms and properties that represent a current state of the art of these objects.
A lower bound on compression of unknown alphabets
- Theoret. Comput. Sci
, 2005
"... Many applications call for universal compression of strings over large, possibly infinite, alphabets. However, it has long been known that the resulting redundancy is infinite even for i.i.d. distributions. It was recently shown that the redudancy of the strings ’ patterns, which abstract the values ..."
Abstract
-
Cited by 6 (3 self)
- Add to MetaCart
Many applications call for universal compression of strings over large, possibly infinite, alphabets. However, it has long been known that the resulting redundancy is infinite even for i.i.d. distributions. It was recently shown that the redudancy of the strings ’ patterns, which abstract the values of the symbols, retaining only their relative precedence, is sublinear in the blocklength n, hence the per-symbol redundancy diminishes to zero. In this paper we show that pattern redundancy is at least (1.5 log 2 e) n 1/3 bits. To do so, we construct a generating function whose coefficients lower bound the redundancy, and use Hayman’s saddle-point approximation technique to determine the coefficients ’ asymptotic behavior. 1
Concentration Bounds for Unigrams Language Model
, 2004
"... We show several PAC-style concentration bounds for learning unigrams language model. One interesting quantity is the probability of all words appearing exactly k times in a sample of size m. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis on its error shows ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
We show several PAC-style concentration bounds for learning unigrams language model. One interesting quantity is the probability of all words appearing exactly k times in a sample of size m. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis on its error shows a PAC bound of approximately O . We improve its dependency on k to O 4 # k . We also analyze the empirical frequencies estimator, showing that its PAC error bound is approximately . We derive a combined estimator, which has , for any k. A standard measure...
Multi-Style Language Model for Web Scale Information Retrieval
"... Web documents are typically associated with many text streams, including the body, the title and the URL that are determined by the authors, and the anchor text or search queries used by others to refer to the documents. Through a systematic large scale analysis on their cross entropy, we show that ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
Web documents are typically associated with many text streams, including the body, the title and the URL that are determined by the authors, and the anchor text or search queries used by others to refer to the documents. Through a systematic large scale analysis on their cross entropy, we show that these text streams appear to be composed in different language styles, and hence warrant respective language models to properly describe their properties. We propose a language modeling approach to Web document retrieval in which each document is characterized by a mixture model with components corresponding to the various text streams associated with the document. Immediate issues for such a mixture model arise as all the text streams are not always present for the documents, and they do not share the same lexicon, making it challenging to properly combine the statistics from the mixture components. To address these issues, we introduce an “openvocabulary” smoothing technique so that all the component language models have the same cardinality and their scores can simply be linearly combined. To ensure that the approach can cope with Web scale applications, the model training algorithm is designed to require no labeled data and can be fully automated with few heuristics and no empirical parameter tunings. The evaluation on Web document ranking tasks shows that the component language models indeed have varying degrees of capabilities as predicted by the cross-entropy analysis, and the combined mixture model outperforms the state-of-the-art BM25F based system.
Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
"... We present a procedure for effective estimation of entropy and mutual information from smallsample data, and apply it to the problem of inferring high-dimensional gene association networks. Specifically, we develop a James-Stein-type shrinkage estimator, resulting in a procedure that is highly effic ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
We present a procedure for effective estimation of entropy and mutual information from smallsample data, and apply it to the problem of inferring high-dimensional gene association networks. Specifically, we develop a James-Stein-type shrinkage estimator, resulting in a procedure that is highly efficient statistically as well as computationally. Despite its simplicity, we show that it outperforms eight other entropy estimation procedures across a diverse range of sampling scenarios and data-generating models, even in cases of severe undersampling. We illustrate the approach by analyzing E. coli gene expression data and computing an entropy-based gene-association network from gene expression data. A computer program is available that implements the proposed shrinkage estimator. Keywords: entropy, shrinkage estimation, James-Stein estimator, “small n, large p ” setting, mutual information, gene association network
Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
"... We present worst case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman’s alphabet decomposition is known to achieve state-ofthe-art p ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
We present worst case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman’s alphabet decomposition is known to achieve state-ofthe-art performance in prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of this hierarchical method and provides a compelling explanation for its practical success. In addition, we present the results of a few experiments that examine other possibilities for improving the multialphabet prediction performance of CTW-based algorithms.

