Results 11  20
of
108
Mixing Strategies for Density Estimation
 Ann. Statist
"... General results on adaptive density estimation are obtained with respect to any countable collection of estimation strategies under KullbackLeibler and square L 2 losses. It is shown that without knowing which strategy works best for the underlying density, a single strategy can be constructed by m ..."
Abstract

Cited by 39 (9 self)
 Add to MetaCart
General results on adaptive density estimation are obtained with respect to any countable collection of estimation strategies under KullbackLeibler and square L 2 losses. It is shown that without knowing which strategy works best for the underlying density, a single strategy can be constructed by mixing the proposed ones to be adaptive in terms of statistical risks. A consequence is that under some mild conditions, an asymptotically minimaxrate adaptive estimator exists for a given countable collection of density classes, i.e., a single estimator can be constructed to be simultaneously minimaxrate optimal for all the function classes being considered. A demonstration is given for highdimensional density estimation on [0; 1] d where the constructed estimator adapts to smoothness and interactionorder over some piecewise Besov classes, and is consistent for all the densities with finite entropy. 1. Introduction. In Recent years, there has been an increasing interest in adaptive fu...
On predictive distributions and Bayesian networks
 Statistics and Computing
, 2000
"... this paper we are interested in discrete prediction problems for a decisiontheoretic setting, where the ..."
Abstract

Cited by 38 (29 self)
 Add to MetaCart
this paper we are interested in discrete prediction problems for a decisiontheoretic setting, where the
Combining Different Procedures for Adaptive Regression
 Journal of Multivariate Analysis
, 1998
"... Given any countable collection of regression procedures (e.g., kernel, spline, wavelet, local polynomial, neural nets, etc), we show that a single adaptive procedure can be constructed to share the advantages of them to a great extent in terms of global squared L 2 risk. The combined procedure basic ..."
Abstract

Cited by 36 (7 self)
 Add to MetaCart
Given any countable collection of regression procedures (e.g., kernel, spline, wavelet, local polynomial, neural nets, etc), we show that a single adaptive procedure can be constructed to share the advantages of them to a great extent in terms of global squared L 2 risk. The combined procedure basically pays a price only of order 1=n for adaptation over the collection. An interesting consequence is that for a countable collection of classes of regression functions (possibly of completely different characteristics), a minimaxrate adaptive estimator can be constructed such that it automatically converges at the right rate for each of the classes being considered.
A General Minimax Result for Relative Entropy
 IEEE Trans. Inform. Theory
, 1996
"... : Suppose Nature picks a probability measure P ` on a complete separable metric space X at random from a measurable set P \Theta = fP ` : ` 2 \Thetag. Then, without knowing `, a statistician picks a measure Q on X. Finally, the statistician suffers a loss D(P ` jjQ), the relative entropy between P ..."
Abstract

Cited by 35 (2 self)
 Add to MetaCart
: Suppose Nature picks a probability measure P ` on a complete separable metric space X at random from a measurable set P \Theta = fP ` : ` 2 \Thetag. Then, without knowing `, a statistician picks a measure Q on X. Finally, the statistician suffers a loss D(P ` jjQ), the relative entropy between P ` and Q. We show that the minimax and maximin values of this game are always equal, and there is always a minimax strategy in the closure of the set of all Bayes strategies. This generalizes previous results of Gallager, and Davisson and LeonGarcia. Index terms: minimax theorem, minimax redundancy, minimax risk, Bayes risk, relative entropy, KullbackLeibler divergence, density estimation, source coding, channel capacity, computational learning theory 1 Introduction Consider a sequential estimation game in which a statistician is given n independent observations Y 1 ; : : : ; Yn distributed according to an unknown distribution ~ P ` chosen at random by Nature from the set f ~ P ` : ` 2 \...
Precise Minimax Redundancy and Regret
 IEEE TRANS. INFORMATION THEORY
, 2004
"... Recent years have seen a resurgence of interest in redundancy of lossless coding. The redundancy (regret) of universal xed{to{variable length coding for a class of sources determines by how much the actual code length exceeds the optimal (ideal over the class) code length. In a minimax scenario ..."
Abstract

Cited by 33 (13 self)
 Add to MetaCart
Recent years have seen a resurgence of interest in redundancy of lossless coding. The redundancy (regret) of universal xed{to{variable length coding for a class of sources determines by how much the actual code length exceeds the optimal (ideal over the class) code length. In a minimax scenario one nds the best code for the worst source either in the worst case (called also maximal minimax) or on average. We rst study the worst case minimax redundancy over a class of stationary ergodic sources and replace Shtarkov's bound by an exact formula. Among others, we prove that a generalized Shannon code minimizes the worst case redundancy, derive asymptotically its redundancy, and establish some general properties. This allows us to obtain precise redundancy rates for memoryless, Markov and renewal sources. For example, we derive the exact constant of the redundancy rate for memoryless and Markov sources by showing that an integer nature of coding contributes log(log m=(m 1))= log m+ o(1) where m is the size of the alphabet. Then we deal with the average minimax redundancy and regret. Our approach
Predictability, Complexity, and Learning
, 2001
"... We define predictive information Ipred(T) as the mutual information between the past and the future of a time series. Three qualitatively different behaviors are found in the limit of large observation times T: Ipred(T) can remain finite, grow logarithmically, or grow as a fractional power law. If t ..."
Abstract

Cited by 30 (2 self)
 Add to MetaCart
We define predictive information Ipred(T) as the mutual information between the past and the future of a time series. Three qualitatively different behaviors are found in the limit of large observation times T: Ipred(T) can remain finite, grow logarithmically, or grow as a fractional power law. If the time series allows us to learn a model with a finite number of parameters, then Ipred(T) grows logarithmically with a coefficient that counts the dimensionality of the model space. In contrast, powerlaw growth is associated, for example, with the learning of infinite parameter (or nonparametric) models such as continuous functions with smoothness constraints. There are connections between the predictive information and measures of complexity that have been defined both in learning theory and the analysis of physical systems through statistical mechanics and dynamical systems theory. Furthermore, in the same way that entropy provides the unique measure of available information consistent with some simple and plausible conditions, we argue that the divergent part of Ipred(T) provides the unique measure for the complexity of dynamics underlying a time series. Finally, we discuss how these ideas may be useful in problems in physics, statistics, and biology.
On Universal Prediction and Bayesian Confirmation
 Theoretical Computer Science
, 2007
"... The Bayesian framework is a wellstudied and successful framework for inductive reasoning, which includes hypothesis testing and confirmation, parameter estimation, sequence prediction, classification, and regression. But standard statistical guidelines for choosing the model class and prior are not ..."
Abstract

Cited by 23 (13 self)
 Add to MetaCart
The Bayesian framework is a wellstudied and successful framework for inductive reasoning, which includes hypothesis testing and confirmation, parameter estimation, sequence prediction, classification, and regression. But standard statistical guidelines for choosing the model class and prior are not always available or can fail, in particular in complex situations. Solomonoff completed the Bayesian framework by providing a rigorous, unique, formal, and universal choice for the model class and the prior. I discuss in breadth how and in which sense universal (noni.i.d.) sequence prediction solves various (philosophical) problems of traditional Bayesian sequence prediction. I show that Solomonoff’s model possesses many desirable properties: Strong total and future bounds, and weak instantaneous bounds, and in contrast to most classical continuous prior densities has no zero p(oste)rior problem, i.e. can confirm universal hypotheses, is reparametrization and regrouping invariant, and avoids the oldevidence and updating problem. It even performs well
The Horseshoe Estimator for Sparse Signals
, 2008
"... This paper proposes a new approach to sparsity called the horseshoe estimator. The horseshoe is a close cousin of other widely used Bayes rules arising from, for example, doubleexponential and Cauchy priors, in that it is a member of the same family of multivariate scale mixtures of normals. But th ..."
Abstract

Cited by 22 (6 self)
 Add to MetaCart
This paper proposes a new approach to sparsity called the horseshoe estimator. The horseshoe is a close cousin of other widely used Bayes rules arising from, for example, doubleexponential and Cauchy priors, in that it is a member of the same family of multivariate scale mixtures of normals. But the horseshoe enjoys a number of advantages over existing approaches, including its robustness, its adaptivity to different sparsity patterns, and its analytical tractability. We prove two theorems that formally characterize both the horseshoe’s adeptness at large outlying signals, and its superefficient rate of convergence to the correct estimate of the sampling density in sparse situations. Finally, using a combination of real and simulated data, we show that the horseshoe estimator corresponds quite closely to the answers one would get by pursuing a full Bayesian modelaveraging approach using a discrete mixture prior to model signals and noise.
Modelbased decoding, information estimation, and changepoint detection in multineuron spike trains
 UNDER REVIEW, NEURAL COMPUTATION
, 2007
"... Understanding how stimulus information is encoded in spike trains is a central problem in computational neuroscience. Decoding methods provide an important tool for addressing this problem, by allowing us to explicitly read out the information contained in spike responses. Here we introduce several ..."
Abstract

Cited by 21 (12 self)
 Add to MetaCart
Understanding how stimulus information is encoded in spike trains is a central problem in computational neuroscience. Decoding methods provide an important tool for addressing this problem, by allowing us to explicitly read out the information contained in spike responses. Here we introduce several decoding methods based on pointprocess neural encoding models (i.e. “forward ” models that predict spike responses to novel stimuli). These models have concave loglikelihood functions, allowing for efficient fitting via maximum likelihood. Moreover, we may use the likelihood of the observed spike trains under the model to perform optimal decoding. We present: (1) a tractable algorithm for computing the maximum a posteriori (MAP) estimate of the stimulus — the most probable stimulus to have generated the observed single or multiplespike train response, given some prior distribution over the stimulus; (2) a Gaussian approximation to the posterior distribution, which allows us to quantify the fidelity with which various stimulus features are encoded; (3) an efficient method for estimating the mutual information between the stimulus and the response; and (4) a framework for the detection of changepoint times (e.g. the time at which the stimulus undergoes a change in mean or variance), by marginalizing over the posterior distribution of stimuli. We show several examples illustrating the performance of these estimators with simulated data.
Bayesian Model Selection in Finite Mixtures by Marginal Density Decompositions
 JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
, 2001
"... ..."