A Neyman-Pearson approach to statistical learning
IEEE Trans. Inform. Theory, 2005
Cited by 19 (8 self)
The Neyman-Pearson (NP) approach to hypothesis testing is useful in situations where different types of error have different consequences or a priori probabilities are unknown. For any α > 0, the Neyman-Pearson lemma specifies the most powerful test of size α, but assumes the distributions for each hypothesis are known or (in some cases) the likelihood ratio is monotonic in an unknown parameter. This paper investigates an extension of NP theory to situations in which one has no knowledge of the underlying distributions except for a collection of independent and identically distributed training examples from each hypothesis. Building on a “fundamental lemma” of Cannon et al., we demonstrate that several concepts from statistical learning theory have counterparts in the NP context. Specifically, we consider constrained versions of empirical risk minimization (NP-ERM) and structural risk minimization (NP-SRM), and prove performance guarantees for both. General conditions are given under which NP-SRM leads to strong universal consistency. We also apply NP-SRM to (dyadic) decision trees to derive rates of convergence. Finally, we present explicit algorithms to implement NP-SRM for histograms and dyadic decision trees.
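As a rough illustration of the constrained empirical risk minimization idea mentioned in this abstract, the sketch below picks, from a finite set of candidate rules, the one with the smallest empirical miss rate among those whose empirical false-alarm rate stays within a budget. It is not the paper's algorithm: the function names `np_erm` and `threshold_rule`, the slack parameter `tol`, and the use of one-dimensional threshold rules are all assumptions made for this example.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Rule = Callable[[float], int]  # maps an observation to a label in {0, 1}


def threshold_rule(t: float) -> Rule:
    """Hypothetical candidate rule: declare hypothesis 1 when x exceeds t."""
    return lambda x: 1 if x > t else 0


def np_erm(
    rules: List[Rule],
    x0: Sequence[float],  # training examples drawn under hypothesis 0
    x1: Sequence[float],  # training examples drawn under hypothesis 1
    alpha: float,         # target false-alarm level
    tol: float = 0.0,     # slack on the empirical constraint (an assumption)
) -> Tuple[Optional[int], float]:
    """Return (index of chosen rule, its empirical miss rate).

    The index is None when no rule satisfies the false-alarm constraint.
    """
    best_index: Optional[int] = None
    best_miss = float("inf")
    for i, rule in enumerate(rules):
        # Empirical false-alarm rate: class-0 examples labelled 1.
        false_alarm = sum(rule(x) == 1 for x in x0) / len(x0)
        # Empirical miss rate: class-1 examples labelled 0.
        miss = sum(rule(x) == 0 for x in x1) / len(x1)
        if false_alarm <= alpha + tol and miss < best_miss:
            best_index, best_miss = i, miss
    return best_index, best_miss


if __name__ == "__main__":
    x0 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    x1 = [0.55, 0.65, 0.75, 0.85, 0.95, 1.05, 1.15, 1.25, 1.35, 1.45]
    thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
    rules = [threshold_rule(t) for t in thresholds]
    index, miss = np_erm(rules, x0, x1, alpha=0.2)
    print(thresholds[index], miss)
```

Loosening the constraint with a positive `tol` lets the search accept rules whose empirical false-alarm rate slightly exceeds `alpha`, which trades a small constraint violation for a lower miss rate.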
GOES-8 X-ray sensor variance stabilization using the multiscale data-driven Haar-Fisz transform
2005
Cited by 4 (1 self)
Summary. We consider the stochastic mechanisms behind the data collected by the solar X-ray sensor (XRS) on board the GOES-8 satellite. We discover and justify a non-trivial mean-variance relationship within the XRS data. Transforming such data so that its variance is stable and its distribution is taken closer to the Gaussian is the aim of many techniques (e.g. Anscombe, Box-Cox). Recently, new techniques based on the Haar-Fisz transform have been introduced that use a multiscale method to transform and stabilize data with a known mean-variance relationship. In many practical cases, such as the XRS data, the variance of the data can be assumed to increase with the mean, but other characteristics of the distribution are unknown. We introduce a method, the data-driven Haar-Fisz transform (DDHFT), which uses Haar-Fisz but also estimates the mean-variance relationship. For known noise distributions, the DDHFT is shown to be competitive with the fixed Haar-Fisz methods. We show how our DDHFT method denoises the XRS series where other existing methods fail. Keywords: GOES-8, XRS, X-ray, variance stabilization, Gaussianisation, Haar-Fisz.
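To make the notion of variance stabilization concrete, the sketch below applies the classical Anscombe transform, which the abstract cites as one existing technique, to simulated Poisson counts. It does not implement the Haar-Fisz transform or the DDHFT. The Poisson sampler, the helper names, and the chosen means are assumptions made for this example only.

```python
import math
import random
import statistics
from typing import Tuple


def anscombe(x: float) -> float:
    """Anscombe transform: maps Poisson counts to roughly unit variance."""
    return 2.0 * math.sqrt(x + 3.0 / 8.0)


def poisson_sample(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) count using Knuth's multiplication method.

    Suitable for moderate lam only, since exp(-lam) underflows for large lam.
    """
    limit = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def raw_and_stabilized_variance(lam: float, n: int, seed: int) -> Tuple[float, float]:
    """Return (variance of raw counts, variance of transformed counts)."""
    rng = random.Random(seed)
    counts = [poisson_sample(lam, rng) for _ in range(n)]
    raw_var = statistics.pvariance(counts)
    stabilized_var = statistics.pvariance([anscombe(c) for c in counts])
    return raw_var, stabilized_var


if __name__ == "__main__":
    for lam in (5.0, 20.0, 50.0):
        raw_var, stabilized_var = raw_and_stabilized_variance(lam, n=5000, seed=0)
        print(f"mean={lam:5.1f}  raw variance={raw_var:7.2f}  "
              f"stabilized variance={stabilized_var:5.2f}")
```

The raw variance grows with the mean, while the transformed variance should stay close to one across the three settings. A fixed transform like this relies on knowing that the noise is Poisson, which is the kind of prior knowledge the abstract says is unavailable for the XRS data.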
Probabilistic Segmentation and Intensity Estimation for Microarray Images
2005
Cited by 2 (0 self)
We describe a probabilistic approach to simultaneous image segmentation and intensity estimation for cDNA microarray experiments. The approach overcomes several limitations of existing methods. In particular it a) uses a flexible Markov random field approach to segmentation that allows for a wider range of spot shapes than existing methods, including relatively common “doughnut-shaped” spots; b) models the image directly as background plus hybridization intensity, and estimates the two quantities simultaneously, avoiding the common logical error that estimates of foreground may be less than those of corresponding background if the two are estimated separately; c) uses a probabilistic modelling approach to simultaneously perform segmentation and intensity estimation, and to compute spot quality measures. We describe two approaches to parameter estimation: a fast algorithm, based on the Expectation-Maximisation (EM) and the Iterated Conditional Modes (ICM) algorithms, and a fully Bayesian framework. These approaches produce comparable results, and both appear to offer some advantages over other methods. We use an HIV experiment to compare our approach to two commercial software products: Spot and Arrayvision.
Data-adaptive test statistics for microarray data
Bioinformatics, 2005
Cited by 2 (1 self)
Motivation: An important task in microarray data analysis is the selection of genes that are differentially expressed between different tissue samples, such as healthy and diseased. However, microarray data contain an enormous number of dimensions (genes) and very few samples (arrays), a mismatch which poses fundamental statistical problems for the selection process that have defied easy resolution. Results: In this paper, we present a novel approach to the selection of differentially expressed genes in which test statistics are learned from data using a simple notion of reproducibility in selection results as the learning criterion. Reproducibility, as we define it, can be computed without any knowledge of the ‘ground-truth’, but takes advantage of certain properties of microarray data to provide an asymptotically valid guide to expected loss under the true data-generating distribution. We are therefore able to indirectly minimize expected loss, and obtain results substantially more robust than conventional methods. We apply our method to simulated and oligonucleotide array data. Availability: By request to the corresponding author.

