Results 1 - 10
of
51
Assessing Degeneracy in Statistical Models of Social Networks
- Journal of the American Statistical Association
, 2003
"... discussions. This paper presents recent advances in the statistical modeling of random graphs that have an impact on the empirical study of social networks. Statistical exponential family models (Wasserman and Pattison 1996) are a generalization of the Markov random graph models introduced by Frank ..."
Abstract
-
Cited by 45 (12 self)
- Add to MetaCart
discussions. This paper presents recent advances in the statistical modeling of random graphs that have an impact on the empirical study of social networks. Statistical exponential family models (Wasserman and Pattison 1996) are a generalization of the Markov random graph models introduced by Frank and Strauss (1986), which in turn are derived from developments in spatial statistics (Besag 1974). These models recognize the complex dependencies within relational data structures. A major barrier to the application of random graph models to social networks has been the lack of a sound statistical theory to evaluate model fit. This problem has at least three aspects: the specification of realistic models, the algorithmic difficulties of the inferential methods, and the assessment of the degree to which the graph structure produced by the models matches that of the data. We discuss these and related issues of the model degeneracy and inferential degeneracy for commonly used estimators.
Logistic Regression in Rare Events Data
, 1999
"... We study rare events data, binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, vetoes, cases of political activism, or epidemiological infections) than zeros (“nonevents”). In many literatures, these variables have proven difficult to explain and predict, a ..."
Abstract
-
Cited by 33 (4 self)
- Add to MetaCart
We study rare events data, binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, vetoes, cases of political activism, or epidemiological infections) than zeros (“nonevents”). In many literatures, these variables have proven difficult to explain and predict, a problem that seems to have at least two sources. First, popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events. We recommend corrections that outperform existing methods and change the estimates of absolute and relative risks by as much as some estimated effects reported in the literature. Second, commonly used data collection strategies are grossly inefficient for rare events data. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables, such as in international conflict data with more than a quarter-million dyads, only a few of which are at war. As it turns out, more efficient sampling designs exist for making valid inferences, such as sampling all available events (e.g., wars) and a tiny fraction of nonevents (peace). This enables scholars to save as much as 99 % of their (nonfixed) data collection costs or to collect much more meaningful explanatory
Partial least squares: A versatile tool for the analysis of high-dimensional genomic data
- Briefings in Bioinformatics
, 2007
"... Partial Least Squares (PLS) is a highly efficient statistical regression technique that is well suited for the analysis of high-dimensional genomic data. In this paper we review the theory and applications of PLS both under methodological and biological points of view. Focusing on microarray express ..."
Abstract
-
Cited by 15 (5 self)
- Add to MetaCart
Partial Least Squares (PLS) is a highly efficient statistical regression technique that is well suited for the analysis of high-dimensional genomic data. In this paper we review the theory and applications of PLS both under methodological and biological points of view. Focusing on microarray expression data we provide a systematic comparison of the PLS approaches currently employed, and discuss problems as different as tumor classification, identification of relevant genes, survival analysis and modeling of gene networks. 2 1
Classification using generalized partial least squares
, 2005
"... The gpls package includes functions for classification using generalized partial least squares approaches. Both two-group and multi-group (more than 2 groups) classifications can be done. The basic functionalities are based on and extended from the Iteratively ReWeighted Least Squares (IRWPLS) by Ma ..."
Abstract
-
Cited by 14 (2 self)
- Add to MetaCart
The gpls package includes functions for classification using generalized partial least squares approaches. Both two-group and multi-group (more than 2 groups) classifications can be done. The basic functionalities are based on and extended from the Iteratively ReWeighted Least Squares (IRWPLS) by Marx (1996). Additionally, Firth’s bias reduction procedure (Firth, 1992a,b, 1993) is
Differential privacy for statistics: What we know and what we want to learn
- In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming, volume 4052 of LECTURE NOTES IN COMPUTER SCIENCE
"... Abstract. We motivate and review the definition of differential privacy, survey some results on differentially private statistical estimators, and outline a research agenda. This survey is based on two presentations given by the authors at an NCHS/CDC sponsored workshop on data privacy in May 2008. ..."
Abstract
-
Cited by 12 (0 self)
- Add to MetaCart
Abstract. We motivate and review the definition of differential privacy, survey some results on differentially private statistical estimators, and outline a research agenda. This survey is based on two presentations given by the authors at an NCHS/CDC sponsored workshop on data privacy in May 2008. 1
A WEAKLY INFORMATIVE DEFAULT PRIOR DISTRIBUTION FOR LOGISTIC AND OTHER REGRESSION MODELS
"... We propose a new prior distribution for classical (nonhierarchical) logistic regression models, constructed by first scaling all nonbinary variables to have mean 0 and standard deviation 0.5, and then placing independent Student-t prior distributions on the coefficients. As a default choice, we reco ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
We propose a new prior distribution for classical (nonhierarchical) logistic regression models, constructed by first scaling all nonbinary variables to have mean 0 and standard deviation 0.5, and then placing independent Student-t prior distributions on the coefficients. As a default choice, we recommend the Cauchy distribution with center 0 and scale 2.5, which in the simplest setting is a longer-tailed version of the distribution attained by assuming one-half additional success and one-half additional failure in a logistic regression. Cross-validation on a corpus of datasets shows the Cauchy class of prior distributions to outperform existing implementations of Gaussian and Laplace priors. We recommend this prior distribution as a default choice for routine applied use. It has the advantage of always giving answers, even when there is complete separation in logistic regression (a common problem, even when the sample size is large and the number of predictors is small), and also automatically applying more shrinkage to higher-order interactions. This can
Bias of Estimators and Regularization Terms
- in Proceedings of 1998 Workshop on Information-Based Induction Sciences (IBIS'98), Izu
, 1998
"... : In this paper, a role of regularization terms (penalty terms) is discussed from the view point of minimizing the generalization error. First the bias of minimum training error estimation is clarified. The bias is caused by the nonlinearity of the learning system and depends on the number of traini ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
: In this paper, a role of regularization terms (penalty terms) is discussed from the view point of minimizing the generalization error. First the bias of minimum training error estimation is clarified. The bias is caused by the nonlinearity of the learning system and depends on the number of training examples. Then an appropriate size of the regularization term is considered by taking account of the balance of the bias and the variance of the estimator, so that the generalization error is minimized. In this framework, the optimal size of the regularization term is calculated with the second and third order derivatives of the loss function. When the learning system has a large number of modifiable parameters, it is computationally expensive to calculate the higher order derivatives, thus we propose a simple method of approximating the optimal size via a generalized AIC. 1 Introduction In order to avoid "over-fitting" problem of learning machines, such as neural networks, regularizatio...
Explaining Rare Events in International Relations
, 2000
"... Some of the most important phenomena in international conflict are coded as "rare events data," binary dependent variables with dozens to thousands of times fewer events, such as wars, coups, etc., than "nonevents". Unfortunately, rare events data are difficult to explain and predict, a problem that ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
Some of the most important phenomena in international conflict are coded as "rare events data," binary dependent variables with dozens to thousands of times fewer events, such as wars, coups, etc., than "nonevents". Unfortunately, rare events data are difficult to explain and predict, a problem that seems to have at least two sources. First, and most importantly, the data collection strategies used in international conflict are grossly inefficient. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables. As it turns out, more efficient sampling designs exist for making valid inferences, such as sampling all available events (e.g., wars) and a tiny fraction of non-events (peace). This enables scholars to save as much as 99% of their (non-fixed) data collection costs, or to collect much more meaningful explanatory variables. Second, logistic regression, and other commonly ...
Boosting SVM Classifiers with Logistic Regression
, 2003
"... The support vector machine classifier is a linear maximum margin classifier. It performs very well in many classification applications. Although, it could be extended to nonlinear cases by exploiting the idea of kernel, it might still su#er from the heterogeneity in the training examples. Since t ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
The support vector machine classifier is a linear maximum margin classifier. It performs very well in many classification applications. Although, it could be extended to nonlinear cases by exploiting the idea of kernel, it might still su#er from the heterogeneity in the training examples. Since there are very few theories in the literature to guide us on how to choose kernel functions, the selection of kernel is usually based on a try-and-error manner.

