## Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks

Citations: 16 (1 self)

### BibTeX

@MISC{Hausser_entropyinference,

author = {Jean Hausser and Korbinian Strimmer},

title = {Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks},

year = {}

}

### Abstract

We present a procedure for effective estimation of entropy and mutual information from small-sample data, and apply it to the problem of inferring high-dimensional gene association networks. Specifically, we develop a James-Stein-type shrinkage estimator, resulting in a procedure that is highly efficient statistically as well as computationally. Despite its simplicity, we show that it outperforms eight other entropy estimation procedures across a diverse range of sampling scenarios and data-generating models, even in cases of severe undersampling. We illustrate the approach by analyzing E. coli gene expression data and computing an entropy-based gene association network from them. A computer program is available that implements the proposed shrinkage estimator.

Keywords: entropy, shrinkage estimation, James-Stein estimator, “small n, large p” setting, mutual information, gene association network
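To make the idea in the abstract concrete, the following is a minimal Python sketch of a James-Stein-type shrinkage entropy estimator. It is not the authors' program. It assumes the shrinkage form quoted in the citation contexts on this page (a convex combination of a target t_k and the maximum-likelihood frequencies), a uniform target t_k = 1/p, and the closed-form shrinkage intensity usually given for this estimator, which is not spelled out in this record. The function names `shrinkage_frequencies`, `entropy_plugin`, and `entropy_shrink` are hypothetical.

```python
import numpy as np


def shrinkage_frequencies(counts, target=None):
    """Shrink ML cell frequencies toward a target (uniform 1/p by default).

    Assumed intensity (not stated in this record):
        lambda = (1 - sum(theta_ml^2)) / ((n - 1) * sum((t - theta_ml)^2)),
    truncated to the interval [0, 1].
    """
    y = np.asarray(counts, dtype=float)
    n = y.sum()
    p = y.size
    theta_ml = y / n
    t = np.full(p, 1.0 / p) if target is None else np.asarray(target, dtype=float)

    if n <= 1:
        lam = 1.0  # a single observation: rely on the target entirely
    else:
        # Unbiased estimate of the summed variances of the ML frequencies.
        var_sum = (1.0 - np.sum(theta_ml ** 2)) / (n - 1.0)
        # Squared distance between target and ML frequencies.
        dist = np.sum((t - theta_ml) ** 2)
        lam = 1.0 if dist == 0.0 else var_sum / dist
        lam = min(1.0, max(0.0, lam))

    theta_shrink = lam * t + (1.0 - lam) * theta_ml
    return theta_shrink, lam


def entropy_plugin(theta):
    """Plug-in Shannon entropy in nats; zero cells contribute nothing."""
    theta = np.asarray(theta, dtype=float)
    nonzero = theta > 0
    return float(-np.sum(theta[nonzero] * np.log(theta[nonzero])))


def entropy_shrink(counts):
    """Entropy of the shrunken frequencies, plus the intensity used."""
    theta, lam = shrinkage_frequencies(counts)
    return entropy_plugin(theta), lam


if __name__ == "__main__":
    counts = [4, 2, 3, 0, 2, 4, 0, 0, 2, 1, 1]  # illustrative counts only
    h, lam = entropy_shrink(counts)
    print(f"lambda = {lam:.4f}, shrinkage entropy = {h:.4f} nats")
```

Because the uniform distribution maximizes entropy and entropy is concave, shrinking toward the uniform target can only raise the plug-in value relative to the maximum-likelihood estimate, which is the direction needed to offset the downward bias of the latter in undersampled data.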

### Citations

1917 | R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing - R Development Core Team - 2008 |

1280 | Information Theory, Inference, and Learning Algorithms - MacKay - 2003 |

722 | Cryptography: Theory and Practice - Stinson - 1995 |

480 | R: A Language and Environment for Statistical Computing - R Development Core Team - 2008 |

407 | High-dimensional graphs and variable selection with the
- Meinshausen, Bühlmann
- 2006
Citation Context ...e been proposed to enable the inference of large-scale correlation networks (Butte et al., 2000) and of high-dimensional partial correlation graphs (Dobra et al., 2004; Schäfer and Strimmer, 2005a; Meinshausen and Bühlmann, 2006), for learning vector-autoregressive (Opgen-Rhein and Strimmer, 2007a) and state space models (Rangel et al., 2004; Lähdesmäki and Shmulevich, 2008), and to reconstruct directed “causal” interaction ... |

400 |
On the population frequencies of species and the estimation of population parameters
- Good
- 1953
Citation Context ...r Another recently proposed estimator is due to Chao and Shen (2003). This approach applies the Horvitz-Thompson estimator (Horvitz and Thompson, 1952) in combination with the Good-Turing correction (Good, 1953; Orlitsky et al., 2003) of the empirical cell probabilities to the problem of entropy estimation. The Good-Turing-corrected frequency estimates are $\hat{\theta}_k^{\mathrm{GT}} = \left(1 - \frac{m_1}{n}\right) \hat{\theta}_k^{\mathrm{ML}}$, where $m_1$ is the numb... |

302 |
Estimation with quadratic loss
- James, Stein
- 1961
Citation Context ... computational sense. James-Stein-type shrinkage is a simple analytic device to perform regularized high-dimensional inference. It is ideally suited for small-sample settings - the original estimator (James and Stein, 1961) considered sample size n = 1. A general recipe for constructing shrinkage estimators is given in Appendix A. In this section, we describe the application of this approach to the specific problem of ... |

267 |
A Generalization of Sampling without Replacement from a Finite Population
- Horvitz, Thompson
- 1952
Citation Context ...pensive and somewhat slow for practical applications. 2.5 Chao-Shen Estimator. Another recently proposed estimator is due to Chao and Shen (2003). This approach applies the Horvitz-Thompson estimator (Horvitz and Thompson, 1952) in combination with the Good-Turing correction (Good, 1953; Orlitsky et al., 2003) of the empirical cell probabilities to the problem of entropy estimation. The Good-Turing-corrected frequency estim... |

236 |
Inferring cellular networks using probabilistic graphical models
- Friedman
- 2004
Citation Context ...ts to unravel the molecular mechanisms of diseases and to aid the understanding of cellular function. To this end, an extensive literature on the “reverse engineering” of gene networks has developed (Friedman, 2004). Using gene expression or proteomic data, statistical learning procedures are employed to deduce associations and dependencies among genes. Among many others, methods have been proposed to enable the... |

213 |
An invariant form for the prior probability in estimation problems
- Jeffreys
- 1946
Citation Context ... the Dirichlet prior in the Bayesian estimators of cell frequencies, and corresponding entropy estimators. Columns: $a_k$, cell frequency prior, entropy estimator. Rows: 0: no prior, maximum likelihood; 1/2: Jeffreys prior (Jeffreys, 1946), Krichevsky and Trofimov (1981); 1: Bayes-Laplace uniform prior, Holste et al. (1998); $1/p$: Perks prior (Perks, 1947), Schürmann and Grassberger (1996); $\sqrt{n}/p$: minimax prior (Trybula, 1958). where $m_1$ is the... |

148 | An empirical Bayes approach to inferring large-scale gene association networks
- Schäfer, Strimmer
- 2005
Citation Context ...mong many others, methods have been proposed to enable the inference of large-scale correlation networks (Butte et al., 2000) and of high-dimensional partial correlation graphs (Dobra et al., 2004; Schäfer and Strimmer, 2005a; Meinshausen and Bühlmann, 2006), for learning vector-autoregressive (Opgen-Rhein and Strimmer, 2007a) and state space models (Rangel et al., 2004; Lähdesmäki and Shmulevich, 2008), and to reconstru... |

141 | A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics
- Schäfer, Strimmer
Citation Context ...mong many others, methods have been proposed to enable the inference of large-scale correlation networks (Butte et al., 2000) and of high-dimensional partial correlation graphs (Dobra et al., 2004; Schäfer and Strimmer, 2005a; Meinshausen and Bühlmann, 2006), for learning vector-autoregressive (Opgen-Rhein and Strimmer, 2007a) and state space models (Rangel et al., 2004; Lähdesmäki and Shmulevich, 2008), and to reconstru... |

134 | The performance of universal encoding - Krichevsky, Trofimov - 1981 |

133 | Sparse graphical models for exploring gene expression data
- Dobra, Hans, et al.
- 2004
Citation Context ...dencies among genes. Among many others, methods have been proposed to enable the inference of large-scale correlation networks (Butte et al., 2000) and of high-dimensional partial correlation graphs (Dobra et al., 2004; Schäfer and Strimmer, 2005a; Meinshausen and Bühlmann, 2006), for learning vector-autoregressive (Opgen-Rhein and Strimmer, 2007a) and state space models (Rangel et al., 2004; Lähdesmäki and Shmulev... |

123 | Improved estimation of the covariance matrix of stock returns with an application to portfolio selection - Ledoit, Wolf, et al. |

112 | Entropy and information in neural spike trains - Strong, Koberle, et al. - 1998 |

89 |
Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks
- Butte, Tamayo, et al.
- 2000
Citation Context ...stical learning procedures are employed to deduce associations and dependencies among genes. Among many others, methods have been proposed to enable the inference of large-scale correlation networks (Butte et al., 2000) and of high-dimensional partial correlation graphs (Dobra et al., 2004; Schäfer and Strimmer, 2005a; Meinshausen and Bühlmann, 2006), for learning vector-autoregressive (Opgen-Rhein and Strimmer, ... |

80 | ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context
- Margolin, Nemenman, et al.
- 2006
Citation Context ...tinson (2006); Yeo and Burge (2004); MacKay (2003); Strong et al. (1998). Here we focus on estimating entropy from small-sample data, with applications in genomics and gene network inference in mind (Margolin et al., 2006; Meyer et al., 2007). For the definition of Shannon entropy consider a categorical random variable with alphabet size $p$ and associated cell probabilities $\theta_1, \ldots, \theta_p$ with $\theta_k > 0$ and $\sum_k \theta_k = 1$. I... |

77 |
Note on the bias of information estimates
- Miller
- 1955
Citation Context ...ing plugin entropy estimator $\hat{H}^{\mathrm{ML}}$ is not. First order bias correction leads to $\hat{H}^{\mathrm{MM}} = \hat{H}^{\mathrm{ML}} + \frac{m_{>0} - 1}{2n}$, where $m_{>0}$ is the number of cells with $y_k > 0$. This is known as the Miller-Madow estimator (Miller, 1955). 2.3 Bayesian Estimators. Bayesian regularization of cell counts may lead to vast improvements over the ML estimator (Agresti and Hitchcock, 2005). Using the Dirichlet distribution with parameters a1... |

50 | Estimating high-dimensional directed acyclic graphs with the PC-algorithm
- Kalisch, Bühlmann
- 2007
Citation Context ...earning vector-autoregressive (Opgen-Rhein and Strimmer, 2007a) and state space models (Rangel et al., 2004; Lähdesmäki and Shmulevich, 2008), and to reconstruct directed “causal” interaction graphs (Kalisch and Bühlmann, 2007; Opgen-Rhein and Strimmer, 2007b). The restriction to linear models in most of the literature is owed at least in part to the already substantial challenges involved in estimating linear high-dimensio... |

44 |
Always good turing: Asymptotically optimal probability estimation
- Orlitsky, Santhanam, et al.
- 2003
Citation Context ...cently proposed estimator is due to Chao and Shen (2003). This approach applies the Horvitz-Thompson estimator (Horvitz and Thompson, 1952) in combination with the Good-Turing correction (Good, 1953; Orlitsky et al., 2003) of the empirical cell probabilities to the problem of entropy estimation. The Good-Turing-corrected frequency estimates are $\hat{\theta}_k^{\mathrm{GT}} = \left(1 - \frac{m_1}{n}\right) \hat{\theta}_k^{\mathrm{ML}}$, Table 1: Common choices for the parameters... |

44 | Modeling T-cell activation using gene expression profiling and state space models
- Rangel
Citation Context ...al correlation graphs (Dobra et al., 2004; Schäfer and Strimmer, 2005a; Meinshausen and Bühlmann, 2006), for learning vector-autoregressive (Opgen-Rhein and Strimmer, 2007a) and state space models (Rangel et al., 2004; Lähdesmäki and Shmulevich, 2008), and to reconstruct directed “causal” interaction graphs (Kalisch and Bühlmann, 2007; Opgen-Rhein and Strimmer, 2007b). The restriction to linear models in most of th... |

42 |
Accurate ranking of differentially expressed genes by a distribution-free shrinkage approach
- Opgen-Rhein, Strimmer
- 2007
Citation Context ...tworks (Butte et al., 2000) and of high-dimensional partial correlation graphs (Dobra et al., 2004; Schäfer and Strimmer, 2005a; Meinshausen and Bühlmann, 2006), for learning vector-autoregressive (Opgen-Rhein and Strimmer, 2007a) and state space models (Rangel et al., 2004; Lähdesmäki and Shmulevich, 2008), and to reconstruct directed “causal” interaction graphs (Kalisch and Bühlmann, 2007; Opgen-Rhein and Strimmer, 2007b). ... |

42 |
Stein's Estimation Rule and Its Competitors—An Empirical Bayes Approach
- Efron, Morris
- 1973
Citation Context ...e $t_k = \frac{a_k}{A}$ and $\lambda = \frac{A}{n+A}$ then $\hat{\theta}_k^{\mathrm{Shrink}} = \hat{\theta}_k^{\mathrm{Bayes}}$. This implies that the shrinkage estimator is an empirical Bayes estimator with a data-driven choice of the flattening constants; see also (Efron and Morris, 1973). For every choice of $A$ there exists an equivalent shrinkage intensity $\lambda$. Conversely, for every $\lambda$ there exists an equivalent $A = n \frac{\lambda}{1-\lambda}$. Remark 2: Developing $A = n \frac{\lambda}{1-\lambda} = n(\lambda + \lambda^2 + \ldots)$ we obtai... |

34 |
From Correlation to Causation Networks: A Simple Approximate Learning Algorithm and its Application to High-Dimensional Plant Gene Expression Data. BMC Systems Biology
- Opgen-Rhein, Strimmer
- 2007
Citation Context ... networks (Butte et al., 2000) and of high-dimensional partial correlation graphs (Dobra et al., 2004; Schäfer and Strimmer, 2005a; Meinshausen and Bühlmann, 2006), for learning vector-autoregressive (Opgen-Rhein and Strimmer, 2007a) and state space models (Rangel et al., 2004; Lähdesmäki and Shmulevich, 2008), and to reconstruct directed “causal” interaction graphs (Kalisch and Bühlmann, 2007; Opgen-Rhein and Strimmer, 2007b).... |

29 | Entropy estimation of symbol sequences - Schürmann, Grassberger - 1996 |

24 |
Some observations on inverse probability including a new indifference rule
- Perks
- 1947
Citation Context ... frequency prior, entropy estimator. Rows: 0: no prior, maximum likelihood; 1/2: Jeffreys prior (Jeffreys, 1946), Krichevsky and Trofimov (1981); 1: Bayes-Laplace uniform prior, Holste et al. (1998); $1/p$: Perks prior (Perks, 1947), Schürmann and Grassberger (1996); $\sqrt{n}/p$: minimax prior (Trybula, 1958). Table 1: Common choices for the parameters of the Dirichlet prior in the Bayesian estimators of cell frequencies, and correspondi... |

19 |
Information-theoretic inference of large transcriptional regulatory networks
- Meyer, Kontos, et al.
- 2007
Citation Context ...ge (2004), MacKay (2003) and Strong et al. (1998). Here we focus on estimating entropy from small-sample data, with applications in genomics and gene network inference in mind (Margolin et al., 2006; Meyer et al., 2007). To define the Shannon entropy, consider a categorical random variable with alphabet size $p$ and associated cell probabilities $\theta_1, \ldots, \theta_p$ with $\theta_k > 0$ and $\sum_k \theta_k = 1$. Throughout the article, we assume c... |

19 | Entropy and inference, revisited
- Nemenman, Shafee, et al.
- 2002
Citation Context ...ears ago, it is only recently that the specific issues arising in high-dimensional, undersampled data sets have attracted attention. This has led to two recent innovations, namely the NSB algorithm (Nemenman et al., 2002) and the Chao-Shen estimator (Chao and Shen, 2003), both of which are now widely considered as benchmarks for the small-sample entropy estimation problem (Vu et al., 2007). Here, we introduce a novel... |

16 |
Nonparametric estimate of Shannon's index of diversity when there are unseen species in a sample
- Chao, Shen
- 2003
Citation Context ...ific issues arising in high-dimensional, low-sampled datasets have attracted attention. This has led to two recent innovations, the NSB algorithm (Nemenman et al., 2002) and the Chao-Shen estimator (Chao and Shen, 2003), both of which are now widely considered as benchmarks for the small-sample entropy estimation problem (Vu et al., 2007). Here, we introduce a novel and highly efficient small-sample entropy estimat... |

15 | Some shrinkage techniques for estimating the mean - Thompson - 1968 |

10 |
Simultaneous estimation of multinomial cell probabilities
- Fienberg, Holland
- 1973
Citation Context ...e exist an equivalent $A = n \frac{\lambda}{1-\lambda}$. Remark 2: Developing $A = n \frac{\lambda}{1-\lambda} = n(\lambda + \lambda^2 + \ldots)$ we obtain the approximate estimate $\hat{A} = n \hat{\lambda}$, which in turn recovers the “pseudo-Bayes” estimator described in (Fienberg and Holland, 1973). Remark 3: The shrinkage estimator assumes a fixed and known $p$. In many practical applications this will indeed be the case, e.g., if the observed counts are due to discretization (see also the data... |

8 | On Prior Distributions for Binary Trials - Geisser - 1984 |

8 |
Improving Efficiency by Shrinkage
- Gruber
- 1998
Citation Context ...nsidered as benchmarks for the small-sample entropy estimation problem (Vu et al., 2007). Here, we introduce a novel and highly efficient small-sample entropy estimator based on James-Stein shrinkage (Gruber, 1998). Our method is fully analytic and hence computationally inexpensive. Moreover, our procedure simultaneously provides estimates of the entropy and of the cell frequencies suitable for plugging into t... |

8 |
Some Problems of Simultaneous Minimax Estimation
- Trybula
Citation Context ...2: Jeffreys prior (Jeffreys, 1946), Krichevsky and Trofimov (1981); 1: Bayes-Laplace uniform prior, Holste et al. (1998); $1/p$: Perks prior (Perks, 1947), Schürmann and Grassberger (1996); $\sqrt{n}/p$: minimax prior (Trybula, 1958). Table 1: Common choices for the parameters of the Dirichlet prior in the Bayesian estimators of cell frequencies, and corresponding entropy estimators. While the multinomia... |

6 | Bayes estimators of generalized entropies - Holste, Grosse, et al. - 1998 |

6 | Reverse engineering of the stress response during expression of a recombinant protein - Schmidt-Heck, Guthke, et al. |

5 |
Bayesian inference for categorical data analysis
- Agresti, Hitchcock
- 2005
Citation Context ... of cells with $y_k > 0$. This is known as the Miller-Madow estimator (Miller, 1955). 2.3 Bayesian estimators. Bayesian regularization of cell counts may lead to vast improvements over the ML estimator (Agresti and Hitchcock, 2005). Using the Dirichlet distribution as prior with parameters $a_1, a_2, \ldots, a_p$ the resulting posterior distribution is also Dirichlet, with mean $\hat{\theta}_k^{\mathrm{Bayes}} = \frac{y_k + a_k}{n + A}$, where $A = \sum_{k=1}^{p} a_k$... |

5 | A Galtonian Perspective on Shrinkage Estimators - Stigler - 1990 |

3 | A simple method for improving some estimators - Goodman - 1953 |

3 | A comparison of Bayes-Laplace, Jeffreys, and other priors - Tuyl, Gerlach, et al. |

3 | Coverage-adjusted entropy estimation - Vu, Yu, et al. |

2 |
Learning the structure of dynamic Bayesian networks from time series and steady state measurements
- Lähdesmäki, Shmulevich
- 2008
Citation Context ...hs (Dobra et al., 2004; Schäfer and Strimmer, 2005a; Meinshausen and Bühlmann, 2006), for learning vector-autoregressive (Opgen-Rhein and Strimmer, 2007a) and state space models (Rangel et al., 2004; Lähdesmäki and Shmulevich, 2008), and to reconstruct directed “causal” interaction graphs (Kalisch and Bühlmann, 2007; Opgen-Rhein and Strimmer, 2007b). The restriction to linear models in most of the literature is owed at least in... |

2 | Theory in practice - unknown authors - 1974 |

1 | On the histogram as a density estimator: L2 theory. Z. Wahrscheinlichkeitstheorie verw. Gebiete - Freedman, Diaconis - 1981 |

1 | An invariant form for the prior probability in estimation problems - Probab - 1961 |

1 | Improving Efficiency By Shrinkage - J - 1998 |

1 |
Coverage-adjusted entropy estimation
- Vu, Yu, et al.
- 2007
Citation Context ...ely the NSB algorithm (Nemenman et al., 2002) and the Chao-Shen estimator (Chao and Shen, 2003), both of which are now widely considered as benchmarks for the small-sample entropy estimation problem (Vu et al., 2007). Here, we introduce a novel and highly efficient small-sample entropy estimator based on James-Stein shrinkage (Gruber, 1998). Our method is fully analytic and hence 1 In this paper we use the follo... |

1 | On prior distributions for binary trials - unknown authors - 1984 |