Results 1–10 of 24
Being Bayesian about network structure
 Machine Learning
, 2000
Cited by 202 (5 self)
In many multivariate domains, we are interested in analyzing the dependency structure of the underlying distribution, e.g., whether two variables are in direct interaction. We can represent dependency structures using Bayesian network models. To analyze a given data set, Bayesian model selection attempts to find the most likely (MAP) model, and uses its structure to answer these questions. However, when the amount of available data is modest, there might be many models that have non-negligible posterior. Thus, we want to compute the Bayesian posterior of a feature, i.e., the total posterior probability of all models that contain it. In this paper, we propose a new approach for this task. We first show how to efficiently compute a sum over the exponential number of networks that are consistent with a fixed order over network variables. This allows us to compute, for a given order, both the marginal probability of the data and the posterior of a feature. We then use this result as the basis for an algorithm that approximates the Bayesian posterior of a feature. Our approach uses a Markov Chain Monte Carlo (MCMC) method, but over orders rather than over network structures. The space of orders is smaller and more regular than the space of structures, and has a much smoother posterior “landscape”. We present empirical results on synthetic and real-life datasets that compare our approach to full model averaging (when possible), to MCMC over network structures, and to a non-Bayesian bootstrap approach.
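The computational trick at the heart of this abstract is that, once an ordering over the variables is fixed, the sum over all consistent network structures factorizes into independent per-node sums over candidate parent sets drawn from each node's predecessors. A minimal sketch of that factorization follows; note `log_local_score` is a hypothetical placeholder (a toy parent-count penalty), standing in for a real local score such as BDe computed from the data:

```python
import itertools
import math

def log_local_score(node, parents, data):
    """Hypothetical local score log P(x_node | x_parents).
    A real implementation would compute, e.g., a BDe score from counts."""
    return -float(len(parents))  # toy penalty favouring fewer parents

def log_marginal_given_order(order, data, max_parents=2):
    """log P(D | order): a product over nodes of sums over all parent sets
    drawn from the node's predecessors in the order (bounded in-degree)."""
    total = 0.0
    for i, node in enumerate(order):
        preds = order[:i]
        terms = []
        for k in range(min(len(preds), max_parents) + 1):
            for parents in itertools.combinations(preds, k):
                terms.append(log_local_score(node, parents, data))
        # log-sum-exp over candidate parent sets, for numerical stability
        m = max(terms)
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total
```

The per-node sums are polynomial in size (for bounded in-degree), even though the number of structures consistent with the order is exponential.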
Mismatch string kernels for discriminative protein classification
 Bioinformatics
, 2004
Cited by 131 (8 self)
Motivation: Classification of protein sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns. Results: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus, the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, allowing us to calculate the contributions of all patterns occurring in the data in one pass while traversing the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP datasets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available. Examination of the highest-weighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies. Availability: SVM software is publicly available at
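What the (k, m)-mismatch kernel measures can be sketched by a brute-force feature map: each k-mer in a sequence contributes to every k-mer within Hamming distance m of it, and the kernel is the dot product of the resulting count vectors. The paper's mismatch tree computes this in one pass; the naive version below (toy 4-letter alphabet, hypothetical helper names) is only illustrative:

```python
from collections import Counter

ALPHABET = "ACDE"  # toy alphabet; real protein sequences use 20 amino acids

def neighborhood(kmer, m=1):
    """All k-mers within Hamming distance m of kmer (implemented for m <= 1)."""
    out = {kmer}
    if m >= 1:
        for i in range(len(kmer)):
            for a in ALPHABET:
                out.add(kmer[:i] + a + kmer[i + 1:])
    return out

def mismatch_features(seq, k=3, m=1):
    """Feature map: each k-mer in seq increments every k-mer in its
    mismatch neighborhood."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        for nb in neighborhood(seq[i:i + k], m):
            feats[nb] += 1
    return feats

def mismatch_kernel(s, t, k=3, m=1):
    """Dot product of the two mismatch feature vectors."""
    fs, ft = mismatch_features(s, k, m), mismatch_features(t, k, m)
    return sum(fs[w] * ft[w] for w in fs)
```

The mismatch tree data structure avoids materializing these neighborhoods explicitly, which is what makes training and prediction fast in practice.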
PAC-Bayesian Model Averaging
 In Proceedings of the Twelfth Annual Conference on Computational Learning Theory
, 1999
Cited by 75 (2 self)
PAC-Bayesian learning methods combine the informative priors of Bayesian methods with distribution-free PAC guarantees. Building on earlier methods for PAC-Bayesian model selection, this paper presents a method for PAC-Bayesian model averaging. The main result is a bound on the generalization error of an arbitrary weighted mixture of concepts that depends on the empirical error of that mixture and the KL-divergence of the mixture from the prior.  A simple characterization is also given for the error bound achieved by the optimal weighting.
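For a finite concept class, a bound of this shape is straightforward to evaluate numerically. The sketch below uses one common McAllester-style form; the exact constants vary across PAC-Bayesian papers and are illustrative here, not the precise bound proved in this one:

```python
import math

def kl_div(q, p):
    """KL(Q || P) for discrete distributions over the same finite concept class."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def pac_bayes_bound(emp_err, q, p, n, delta=0.05):
    """Illustrative McAllester-style bound on the true error of the Q-mixture:
    emp_err + sqrt((KL(Q||P) + ln(1/delta) + ln n + 2) / (2n - 1)).
    Constants are for illustration only."""
    return emp_err + math.sqrt(
        (kl_div(q, p) + math.log(1.0 / delta) + math.log(n) + 2.0)
        / (2.0 * n - 1.0)
    )
```

The qualitative behavior matters more than the constants: the penalty shrinks as the sample size n grows and grows as the mixture Q moves away from the prior P.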
Competitive online statistics
 International Statistical Review
, 1999
Cited by 63 (10 self)
A radically new approach to statistical modelling, which combines mathematical techniques of Bayesian statistics with the philosophy of the theory of competitive online algorithms, has arisen over the last decade in computer science (to a large degree, under the influence of Dawid’s prequential statistics). In this approach, which we call “competitive online statistics”, it is not assumed that data are generated by some stochastic mechanism; the bounds derived for the performance of competitive online statistical procedures are guaranteed to hold (and not just hold with high probability or on average). This paper reviews some results in this area; the new material in it includes the proofs for the performance of the Aggregating Algorithm in the problem of linear regression with square loss. Keywords: Bayes’s rule, competitive online algorithms, linear regression, prequential statistics, worst-case analysis.
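The exponential-weighting scheme underlying this family of methods can be sketched for square loss as below. Note this simplified variant forecasts with a plain weighted mean of the experts' predictions; Vovk's actual Aggregating Algorithm predicts through a substitution function, so treat this as an illustration of the weighting, not the exact algorithm:

```python
import math

def ewa_square_loss(expert_preds, outcomes, eta=2.0):
    """Exponentially weighted average forecaster for square loss on [0, 1].
    expert_preds[t][i] is expert i's prediction at trial t; eta is the
    learning rate (eta = 2 is the standard choice for square loss on [0, 1])."""
    n = len(expert_preds[0])
    log_w = [0.0] * n  # log weights, to avoid underflow over long sequences
    forecasts = []
    for preds, y in zip(expert_preds, outcomes):
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        total = sum(w)
        forecasts.append(sum(wi * p for wi, p in zip(w, preds)) / total)
        # multiplicative update: weight *= exp(-eta * loss)
        log_w = [lw - eta * (p - y) ** 2 for lw, p in zip(log_w, preds)]
    return forecasts
```

The worst-case flavor of the field shows up in the guarantee: the forecaster's cumulative loss is bounded relative to the best expert on every individual sequence, with no stochastic assumption on the data.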
PAC-Bayesian stochastic model selection
 Machine Learning
, 2003
Cited by 60 (2 self)
PAC-Bayesian learning methods combine the informative priors of Bayesian methods with distribution-free PAC guarantees. Stochastic model selection predicts a class label by stochastically sampling a classifier according to a “posterior distribution” on classifiers. This paper gives a PAC-Bayesian performance guarantee for stochastic model selection that is superior to analogous guarantees for deterministic model selection. The guarantee is stated in terms of the training error of the stochastic classifier and the KL-divergence of the posterior from the prior. It is shown that the posterior optimizing the performance guarantee is a Gibbs distribution. Simpler posterior distributions are also derived that have nearly optimal performance guarantees.
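The optimizing Gibbs posterior has a closed form that is easy to compute for a finite concept class: Q(c) ∝ P(c) · exp(−λ · err(c)), where larger λ concentrates mass on low-error concepts. A minimal sketch (the function name and the single `lam` parameter folding together λ and the sample size are hypothetical conveniences):

```python
import math

def gibbs_posterior(prior, emp_errors, lam):
    """Gibbs posterior Q(c) ∝ P(c) · exp(-lam · err(c)) over a finite class.
    lam = 0 returns the prior; lam → ∞ concentrates on the lowest-error concept."""
    logits = [math.log(p) - lam * e for p, e in zip(prior, emp_errors)]
    m = max(logits)  # subtract max before exponentiating, for stability
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return [wi / z for wi in w]
```

The "simpler posterior distributions" mentioned in the abstract trade a little of this optimality for easier computation, e.g., distributions supported on only a few concepts.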
Modeling system calls for intrusion detection with dynamic window sizes
 In Proceedings of the DARPA Information Survivability Conference and Exposition II (DISCEX)
, 2001
Cited by 30 (7 self)
We extend prior research on system call anomaly detection modeling methods for intrusion detection by incorporating dynamic window sizes. The window size is the length of the subsequence of a system call trace which is used as the basic unit for modeling program or process behavior. In this work we incorporate dynamic window sizes and show marked improvements in anomaly detection. We present two methods for estimating the optimal window size based on the available training data. The first method is an entropy modeling method which determines the optimal single window size for the data. The second method is a probability modeling method that takes into account context-dependent window sizes. A context-dependent window size model is motivated by the way that system calls are generated by processes. Sparse Markov transducers (SMTs) are used to compute the context-dependent window size model. We show over actual system call traces that the entropy modeling method leads to the optimal single window size. We also show that context-dependent window sizes outperform traditional system call modeling methods.
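The basic modeling unit, a length-w window slid over a trace, and a toy view of the entropy criterion for choosing a single w can be sketched as follows. This is a simplification under stated assumptions: it just reports the empirical entropy of the window distribution, ignoring the complexity/predictiveness trade-off of the paper's actual criterion and the SMT-based context-dependent model entirely:

```python
import math
from collections import Counter

def window_counts(trace, w):
    """Counts of all length-w subsequences of a system call trace."""
    return Counter(tuple(trace[i:i + w]) for i in range(len(trace) - w + 1))

def window_entropy(trace, w):
    """Empirical entropy (bits) of the length-w window distribution.
    Toy stand-in for the paper's entropy modeling criterion."""
    counts = window_counts(trace, w)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

In an anomaly detector, windows unseen (or rare) in training traces of normal behavior are what get flagged at detection time.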
Tracking the best of many experts
 In Proceedings of the 18th Annual Conference on Learning Theory, COLT 2005
, 2005
Cited by 14 (10 self)
An algorithm is presented for online prediction that makes it possible to track the best expert efficiently even if the number of experts is exponentially large, provided that the set of experts has a certain structure allowing efficient implementations of the exponentially weighted average predictor. As an example we work out the case where each expert is represented by a path in a directed graph and the loss of each expert is the sum of the weights over the edges in the path.
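The structural property that makes this tractable is additivity: a path expert's loss is the sum of its edge losses, so its exponential weight factorizes over edges, and the sum of weights over exponentially many paths collapses to a dynamic program over the DAG. A minimal sketch, assuming nodes are topologically numbered 0..n−1 with source 0 and sink n−1:

```python
import math

def total_path_weight(edges, n_nodes, edge_loss, eta):
    """Sum over all source-to-sink paths of exp(-eta * path_loss), in O(|E|).
    edges must be listed so that u precedes v topologically for each (u, v);
    edge_loss maps (u, v) to that edge's cumulative loss."""
    f = [0.0] * n_nodes
    f[0] = 1.0  # empty path into the source has weight 1
    for u, v in edges:
        # each path into v extends a path into u by one edge,
        # multiplying its weight by exp(-eta * loss of that edge)
        f[v] += f[u] * math.exp(-eta * edge_loss[(u, v)])
    return f[n_nodes - 1]
```

The same forward/backward bookkeeping also lets the predictor sample a path proportionally to its weight, which is what an exponentially weighted average predictor over path experts needs.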
An Analysis of Reduced Error Pruning
 Journal of Artificial Intelligence Research
, 2001
Cited by 13 (4 self)
Top-down induction of decision trees has been observed to suffer from the inadequate functioning of the pruning phase. In particular, it is known that the size of the resulting tree grows linearly with the sample size, even though the accuracy of the tree does not improve. Reduced Error Pruning is an algorithm that has been used as a representative technique in attempts to explain the problems of decision tree learning. In this paper we present analyses of Reduced Error Pruning in three different settings. First we study the basic algorithmic properties of the method, properties that hold independent of the input decision tree and pruning examples. Then we examine a situation that intuitively should lead to the subtree under consideration being replaced by a leaf node, one in which the class label and attribute values of the pruning examples are independent of each other. This analysis is conducted under two different assumptions. The general analysis shows that the pruning probability of a node fitting pure noise is bounded by a function that decreases exponentially as the size of the tree grows. In a specific analysis we assume that the examples are distributed uniformly to the tree. This assumption lets us approximate the number of subtrees that are pruned because they do not receive any pruning examples. This paper clarifies the different variants of the Reduced Error Pruning algorithm, brings new insight to its algorithmic properties, analyses the algorithm with fewer imposed assumptions than before, and includes the previously overlooked empty subtrees in the analysis.
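The core bottom-up rule, replace a subtree by a majority-label leaf whenever that does not increase error on the pruning set, can be sketched on a toy tree encoding. The tuple representation here is hypothetical, and empty subtrees (nodes reached by no pruning examples) are simply left as recursively pruned, which is one of the variant behaviors the paper distinguishes:

```python
from collections import Counter

def prune(tree, examples):
    """Reduced Error Pruning on a toy encoding: a leaf is ('leaf', label),
    an internal node is ('node', attr, {value: subtree}); each pruning
    example is (feature_dict, label)."""

    def classify(t, x):
        while t[0] == 'node':
            t = t[2].get(x[0].get(t[1]))
            if t is None:
                return None
        return t[1]

    def errors(t, exs):
        return sum(1 for ex in exs if classify(t, ex) != ex[1])

    if tree[0] == 'leaf':
        return tree
    attr, branches = tree[1], tree[2]
    # prune children bottom-up on the examples routed to each branch
    pruned = ('node', attr,
              {v: prune(st, [ex for ex in examples if ex[0].get(attr) == v])
               for v, st in branches.items()})
    labels = Counter(ex[1] for ex in examples)
    if not labels:
        return pruned  # empty subtree: no pruning examples reach this node
    leaf = ('leaf', labels.most_common(1)[0][0])
    # replace by a majority leaf unless that strictly increases pruning-set error
    return leaf if errors(leaf, examples) <= errors(pruned, examples) else pruned
```

A subtree fitting pure noise tends to do no better than the majority leaf on fresh pruning examples, which is the intuition behind the paper's bound on its survival probability.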
Tracking the best quantizer
, 2008
Cited by 12 (6 self)
An algorithm is presented for online prediction that makes it possible to track the best expert efficiently even when the number of experts is exponentially large, provided that the set of experts has a certain additive structure. As an example, we work out the case where each expert is represented by a path in a directed graph and the loss of each expert is the sum of the weights over the edges in the path. These results are then used to construct universal limited-delay schemes for lossy coding of individual sequences. In particular, we consider the problem of tracking the best scalar quantizer that is adaptively matched to the source sequence with piecewise different behavior. A randomized algorithm is presented which can perform, on any source sequence, asymptotically as well as the best scalar quantization algorithm that is matched to the sequence and is allowed to change the employed quantizer a given number of times. The complexity of the algorithm is quadratic in the sequence length, but at the price of some deterioration in performance, the complexity can be made linear. Analogous results are obtained for sequential multiresolution and multiple-description scalar quantization of individual sequences.