A Statistical Perspective on Knowledge Discovery in Databases
1996
Cited by 40 (0 self)
The quest to find models usefully characterizing data is a process central to the scientific method, and has been carried out on many fronts. Researchers from an expanding number of fields have designed algorithms to discover rules or equations that capture key relationships between variables in a database. The task of this chapter is to provide a perspective on statistical techniques applicable to KDD; accordingly, we review below some major advances in statistics in the last few decades. We next highlight some distinctives of what may be called a "statistical viewpoint." Finally we overview some influential classical and modern statistical methods for practical model induction.
Inferring gene regulatory networks from time-ordered gene expression data of Bacillus subtilis using differential equations
Pac. Symp. Biocomput., 2003
Cited by 39 (13 self)
Recently, cDNA microarray experiments have generated large amounts of gene expression data. In time-ordered gene expression data, the expression levels are measured at several points in time following some experimental manipulation. A gene regulatory network can be inferred by describing the gene expression data in terms of a linear system of differential equations. Because the gene regulatory network is known biologically to be sparse, we expect most coefficients in such a linear system of differential equations to be zero. In previously proposed methods to infer a linear system of differential equations, some ad hoc assumptions are made to limit the number of nonzero coefficients in the system. Instead, we propose to infer the degree of sparseness of the gene regulatory network from the data, where we determine which coefficients are nonzero by using Akaike’s Information Criterion.
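The idea of letting an information criterion decide which coefficients are nonzero can be sketched on a much simpler problem. The block below is a minimal illustration, not the paper's method: it assumes a static linear regression with Gaussian errors rather than a system of differential equations, uses the common form AIC = n ln(RSS/n) + 2k, and searches all subsets exhaustively. The function names and the synthetic data are hypothetical.

```python
import itertools
import math


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def rss_for_subset(X, y, subset):
    """Least-squares fit of y on the chosen columns (no intercept); return RSS."""
    if not subset:
        return sum(v * v for v in y)
    cols = [[row[j] for row in X] for j in subset]
    gram = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    coef = solve(gram, rhs)
    resid = [yi - sum(c * row[j] for c, j in zip(coef, subset))
             for row, yi in zip(X, y)]
    return sum(r * r for r in resid)


def aic(rss, n, k):
    """Gaussian AIC up to an additive constant: n ln(RSS/n) + 2k."""
    return n * math.log(rss / n) + 2 * k


def best_subset_by_aic(X, y):
    """Return the tuple of column indices with the lowest AIC."""
    n, p = len(y), len(X[0])
    best = None
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            score = aic(rss_for_subset(X, y, subset), n, k)
            if best is None or score < best[0]:
                best = (score, subset)
    return best[1]


# Synthetic data (assumption): only column 0 truly influences y.
x0 = [1, 1, 1, 1, -1, -1, -1, -1]
x1 = [1, 1, -1, -1, 1, 1, -1, -1]
x2 = [1, -1, 1, -1, 1, -1, 1, -1]
noise = [0.1 * b * c for b, c in zip(x1, x2)]
X = [[float(a), float(b), float(c)] for a, b, c in zip(x0, x1, x2)]
y = [2.0 * a + e for a, e in zip(x0, noise)]
selected = best_subset_by_aic(X, y)
```

Exhaustive search grows as 2^p, so for a network with many genes a greedy or stepwise search over subsets would be needed in practice.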
Model Choice: A Minimum Posterior Predictive Loss Approach
1998
Cited by 39 (10 self)
Model choice is a fundamental and much discussed activity in the analysis of data sets. Hierarchical models introducing random effects cannot be handled by classical methods. Bayesian approaches using predictive distributions can, though the formal solution, which includes Bayes factors as a special case, can be criticized. We propose a predictive criterion where the goal is good prediction of a replicate of the observed data but tempered by fidelity to the observed values. We obtain this criterion by minimizing posterior loss for a given model and then, for models under consideration, select the one which minimizes this criterion. For a broad range of losses, the criterion emerges approximately as a form partitioned into a goodness-of-fit term and a penalty term. In the context of generalized linear mixed effects models we obtain a penalized deviance criterion composed of a piece which is a Bayesian deviance measure and a piece which is a penalty for model complexity. We illustrate ...
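The split into a goodness-of-fit term and a penalty term can be made concrete for squared-error loss. The sketch below assumes the commonly quoted squared-error form D_k = P + (k/(k+1)) G, where G sums squared differences between posterior predictive means and observations and P sums posterior predictive variances. The abstract does not state this formula, so treat it as an assumption. The function name, the layout of the replicate draws, and the use of the population variance are illustrative choices.

```python
from statistics import fmean, pvariance


def predictive_loss(y_obs, y_rep_draws, k=1.0):
    """Return (D_k, G, P) under squared-error loss.

    y_rep_draws[s][l] is the s-th posterior predictive draw for observation l
    (hypothetical layout; any sampler producing replicates would do).
    """
    n = len(y_obs)
    columns = [[draw[l] for draw in y_rep_draws] for l in range(n)]
    means = [fmean(col) for col in columns]
    # Penalty: total predictive variance across observations.
    penalty = sum(pvariance(col) for col in columns)
    # Goodness of fit: squared distance of predictive means from the data.
    fit = sum((m - y) ** 2 for m, y in zip(means, y_obs))
    return penalty + (k / (k + 1.0)) * fit, fit, penalty


# Two observations, two replicate draws (toy numbers).
d, g, p = predictive_loss([1.0, 2.0], [[0.0, 2.0], [2.0, 4.0]], k=1.0)
```

Among candidate models, the one with the smallest D_k would be selected; larger k puts more weight on fidelity to the observed values.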
A Model Selection Approach to Assessing the Information in the Term Structure Using Linear Models and Artificial Neural Networks
Journal of Business and Economic Statistics, 1992
Cited by 39 (11 self)
We take a model selection approach to the question of whether forward interest rates are useful in predicting future spot rates, using a variety of out-of-sample forecast-based model selection criteria: forecast mean squared error, forecast direction accuracy, and forecast-based trading system profitability. We also examine the usefulness of a class of novel prediction models called "artificial neural networks," and investigate the issue of appropriate window sizes for rolling-window-based prediction methods. Results indicate that the premium of the forward rate over the spot rate helps to predict the sign of future changes in the interest rate. Further, model selection based on an in-sample Schwarz Information Criterion (SIC) does not appear to be a reliable guide to out-of-sample performance, in the case of short-term interest rates. Thus, the in-sample SIC apparently fails to offer a convenient shortcut to true out-of-sample performance measures. Keywords: Artificial Neural Network...
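A rolling-window, out-of-sample evaluation of the kind described here can be sketched in a few lines. This is a minimal illustration under stated assumptions: the predictor is an arbitrary function of the most recent `window` observations, and "direction accuracy" is taken to mean agreement between the sign of the predicted change and the sign of the realized change relative to the last observed value. All names are hypothetical, and the trading-profit criterion is not covered.

```python
def rolling_window_eval(series, window, predictor):
    """Return (forecast MSE, direction accuracy) over one-step-ahead forecasts."""
    squared_errors = []
    hits = 0
    for t in range(window, len(series)):
        history = series[t - window:t]      # only past data is visible
        forecast = predictor(history)
        actual = series[t]
        squared_errors.append((actual - forecast) ** 2)
        predicted_up = (forecast - history[-1]) > 0
        actual_up = (actual - history[-1]) > 0
        hits += predicted_up == actual_up
    total = len(squared_errors)
    return sum(squared_errors) / total, hits / total


def last_value(history):
    """Naive predictor: tomorrow equals today."""
    return history[-1]


def linear_extrapolation(history):
    """Extend the last observed change by one step."""
    return history[-1] + (history[-1] - history[-2])


series = [1, 2, 3, 4, 5, 6]
mse_naive, acc_naive = rolling_window_eval(series, 2, last_value)
mse_trend, acc_trend = rolling_window_eval(series, 2, linear_extrapolation)
```

Varying `window` and repeating the evaluation is one simple way to probe the window-size question the abstract raises.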
Constrained-Realization Monte-Carlo method for Hypothesis Testing
Physica D
Cited by 38 (1 self)
We compare two theoretically distinct approaches to generating artificial (or "surrogate") data for testing hypotheses about a given data set. The first and more straightforward approach is to fit a single "best" model to the original data, and then to generate surrogate data sets that are "typical realizations" of that model. The second approach concentrates not on the model but directly on the original data; it attempts to constrain the surrogate data sets so that they exactly agree with the original data for a specified set of sample statistics. Examples of these two approaches are provided for two simple cases: a test for deviations from a Gaussian distribution, and a test for serial dependence in a time series. Additionally, we consider tests for nonlinearity in time series based on a Fourier transform (FT) method and on more conventional autoregressive moving-average (ARMA) fits to the data. The comparative performance of hypothesis testing schemes based on these two approaches...
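For the serial-dependence case, the constrained approach can be illustrated with random shuffles: each shuffled copy has exactly the same set of values as the original series (so every statistic of the sample distribution is preserved) while any temporal ordering is destroyed. The sketch below is an assumption-laden illustration rather than the paper's procedure: it uses lag-1 autocorrelation as the discriminating statistic, a two-sided rank-based p-value, and hypothetical function names.

```python
import random


def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in x)
    return num / den


def shuffle_surrogate_test(x, n_surrogates=199, seed=0):
    """Return (observed statistic, p-value) against shuffled surrogates."""
    rng = random.Random(seed)
    observed = lag1_autocorr(x)
    at_least_as_extreme = 0
    for _ in range(n_surrogates):
        surrogate = list(x)
        rng.shuffle(surrogate)          # same values, random order
        if abs(lag1_autocorr(surrogate)) >= abs(observed):
            at_least_as_extreme += 1
    # Count the original series itself among the candidates.
    return observed, (at_least_as_extreme + 1) / (n_surrogates + 1)


# A steadily increasing series is strongly serially dependent.
trend = [float(i) for i in range(50)]
stat, p_value = shuffle_surrogate_test(trend)
```

With 199 surrogates the smallest attainable p-value is 1/200, which is why the surrogate count is usually chosen with the desired significance level in mind.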
Detecting and modeling doors with mobile robots
In Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2004
Cited by 37 (2 self)
We describe a probabilistic framework for detection and modeling of doors from sensor data acquired in corridor environments with mobile robots. The framework captures shape, color, and motion properties of door and wall objects. The probabilistic model is optimized with a version of the expectation maximization algorithm, which segments the environment into door and wall objects and learns their properties. The framework allows the robot to generalize the properties of detected object instances to new object instances. We demonstrate the algorithm on real-world data acquired by a Pioneer robot equipped with a laser range finder and an omni-directional camera. Our results show that our algorithm reliably segments the environment into walls and doors, finding both doors that move and doors that do not move. We show that our approach achieves better results than models that only capture behavior, or only capture appearance.
The composite absolute penalties family for grouped and hierarchical variable selection
Ann. Statist.
Cited by 37 (1 self)
Extracting useful information from high-dimensional data is an important focus of today’s statistical research and practice. Penalized loss function minimization has been shown to be effective for this task both theoretically and empirically. With the virtues of both regularization and sparsity, the L1-penalized squared error minimization method Lasso has been popular in regression models and beyond. In this paper, we combine different norms including L1 to form an intelligent penalty in order to add side information to the fitting of a regression or classification model to obtain reasonable estimates. Specifically, we introduce the Composite Absolute Penalties (CAP) family, which allows given grouping and hierarchical relationships between the predictors to be expressed. CAP penalties are built by defining groups and combining the properties of norm penalties at the across-group and within-group levels. Grouped selection occurs for nonoverlapping groups. Hierarchical variable selection is reached
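How a penalty built from within-group and across-group norms behaves is easier to see numerically. The sketch below assumes the form T(β) = Σ_k (‖β_{G_k}‖_{γ_k})^{γ_0}, with one norm index γ_k per group and an overall index γ_0. The abstract does not give the formula, so this is an assumption, and the function and parameter names are hypothetical.

```python
def lp_norm(values, p):
    """L_p norm of a list of numbers; p may be float('inf')."""
    if p == float("inf"):
        return max(abs(v) for v in values)
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def cap_penalty(beta, groups, group_norms, overall_norm=1.0):
    """Sum over groups of (within-group norm) ** overall_norm.

    groups      -- list of index lists, one per group
    group_norms -- norm index gamma_k for each group
    """
    total = 0.0
    for indices, gamma_k in zip(groups, group_norms):
        group_coefs = [beta[i] for i in indices]
        total += lp_norm(group_coefs, gamma_k) ** overall_norm
    return total


beta = [3.0, 4.0, 0.0, -2.0]
groups = [[0, 1], [2, 3]]
l2_within = cap_penalty(beta, groups, [2, 2])            # group-lasso-like
linf_within = cap_penalty(beta, groups, [float("inf")] * 2)
```

With an overall index of 1 and non-overlapping groups, an L2 or L-infinity norm inside each group penalizes the group as a unit, which is what encourages whole groups to enter or leave the model together.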
First-order methods for sparse covariance selection
SIAM Journal on Matrix Analysis and Applications
Cited by 36 (1 self)
Given a sample covariance matrix, we solve a maximum likelihood problem penalized by the number of nonzero coefficients in the inverse covariance matrix. Our objective is to find a sparse representation of the sample data and to highlight conditional independence relationships between the sample variables. We first formulate a convex relaxation of this combinatorial problem, then detail two efficient first-order algorithms with low memory requirements to solve large-scale, dense problem instances.
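The problem described above can be written out as follows. The notation is an assumption, since the abstract gives no symbols: S is the sample covariance matrix, X the inverse covariance estimate, ρ > 0 the penalty weight, and Card(X) the number of nonzero entries of X. The second line shows the customary convex relaxation, which replaces the cardinality count with the elementwise ℓ1 norm; the abstract does not say which relaxation is used, so this is only the usual choice.

```latex
\begin{align*}
\text{(combinatorial)} \quad
  & \max_{X \succ 0} \; \log\det X - \operatorname{Tr}(SX) - \rho\,\mathbf{Card}(X) \\
\text{(convex relaxation)} \quad
  & \max_{X \succ 0} \; \log\det X - \operatorname{Tr}(SX) - \rho\,\|X\|_1,
  \qquad \|X\|_1 = \sum_{i,j} |X_{ij}|
\end{align*}
```

A zero entry X_{ij} in the solution corresponds, for Gaussian data, to conditional independence of variables i and j given the rest.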
Toward improved ranking metrics
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000
Cited by 36 (11 self)
In many computer vision algorithms, a metric or similarity measure is used to determine the distance between two features. The Euclidean or SSD (sum of the squared differences) metric is prevalent and justified from a maximum likelihood perspective when the additive noise distribution is Gaussian. Based on real noise distributions measured from international test sets, we have found that the Gaussian noise distribution assumption is often invalid. This implies that other metrics, which have distributions closer to the real noise distribution, should be used. In this paper, we consider three different applications: content-based retrieval in image databases, stereo matching, and motion tracking. In each of them, we experiment with different modeling functions for the noise distribution and compute the accuracy of the methods using the corresponding distance measures. In our experiments, we compared the SSD metric, the SAD (sum of the absolute differences) metric, the Cauchy metric, and the Kullback relative information. For several algorithms from the research literature which used the SSD or SAD, we showed that greater accuracy could be obtained by using the Cauchy metric instead. Index Terms: Maximum likelihood, ranking metrics, content-based retrieval, color indexing, stereo matching, motion tracking.
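The three simplest distance measures named above can be compared side by side. In the sketch below, SSD and SAD follow directly from their definitions; the Cauchy metric is written as Σ log(1 + ((x_i − y_i)/a)²), the form that arises from the negative log-likelihood of Cauchy-distributed noise with scale a. The abstract gives no formula, so this form and the function names are assumptions.

```python
import math


def ssd(x, y):
    """Sum of squared differences (maximum likelihood under Gaussian noise)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def sad(x, y):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def cauchy_metric(x, y, scale=1.0):
    """Sum of log(1 + (d/scale)^2); large differences are damped logarithmically."""
    return sum(math.log(1.0 + ((a - b) / scale) ** 2) for a, b in zip(x, y))


x = [1.0, 2.0, 3.0]
y = [1.0, 4.0, 0.0]
d_ssd = ssd(x, y)
d_sad = sad(x, y)
d_cauchy = cauchy_metric(x, y)
```

Because the Cauchy term grows only logarithmically, a single outlying feature difference contributes far less than it does under SSD, which is one intuition for its robustness to heavy-tailed noise.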
Statistical Relational Learning for Document Mining
2003
Cited by 35 (5 self)
A major obstacle to fully integrated deployment of statistical learners is the assumption that data sits in a single table, even though most real-world databases have complex relational structures. In this paper, we introduce an integrated approach to building regression models from data stored in relational databases. Potential features are generated by structured search of the space of queries to the database, and then tested for inclusion in a logistic regression. We present experimental results for the task of predicting where scientific papers will be published based on relational data taken from CiteSeer. This data includes word counts in the document, frequently cited authors or papers, co-citations, publication venues of cited papers, word co-occurrences, and word counts in cited or citing documents. Our approach results in classification accuracies superior to those achieved when using classical "flat" features. Our classification task also serves as a "where to publish?" conference/journal recommendation task.

