Results 1–10 of 10
Learning Bayesian Networks From Dependency Networks: A Preliminary Study
in Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Key West, FL, 2003
"... In this paper we describe how to learn Bayesian networks from a summary of complete data in the form of a dependency network rather than from data directly. This method allows us to gain the advantages of both representations: scalable algorithms for learning dependency networks and convenien ..."
Abstract

Cited by 7 (1 self)
In this paper we describe how to learn Bayesian networks from a summary of complete data in the form of a dependency network rather than from data directly. This method allows us to gain the advantages of both representations: scalable algorithms for learning dependency networks and convenient inference with Bayesian networks. Our approach is to use a dependency network as an "oracle" for the statistics needed to learn a Bayesian network. We show that the general problem is NP-hard and develop a greedy search algorithm. We conduct a preliminary experimental evaluation and find that the prediction accuracy of the Bayesian networks constructed from our algorithm almost equals that of Bayesian networks learned directly from the data.
Summary of biosurveillance-relevant technologies, 2003
"... This short report, compiled upon request from Dave Siegrist and Ted Senator, surveys the spectrum of technologies that can help with Biosurveillance. We indicate which we have chosen, so far, to use in our development of analysis methods and our reasons. 1 Timeweighted averaging This is directly ap ..."
Abstract

Cited by 4 (2 self)
This short report, compiled upon request from Dave Siegrist and Ted Senator, surveys the spectrum of technologies that can help with Biosurveillance. We indicate which we have chosen, so far, to use in our development of analysis methods, and our reasons.

1 Time-weighted averaging

This is directly applicable to a scalar signal (such as "number of respiratory cases today"). This method, more commonly used in computational finance, simply compares the count during the current time period with a weighted average of the counts of recent days. Exponential weighting is typically used, where the half-life is known as the "time window" parameter. This time-window parameter is typically chosen by hand. We prefer the Serfling and Univariate HMM methods described below.

2 Serfling method

This method (Serfling, 1963) is a cyclic regression model, and is the standard CDC algorithm for flu detection. It is, again, applicable to scalar signals. It assumes that the signal follows a sinusoid with a period of one year, and thus finds the four parameters α, β, γ, and δ in

y(t) = α + βt + γ cos(2πt/365.25) + δ sin(2πt/365.25),

where the parameters are chosen to minimize the sum of squares of residuals. It is then an easy matter of regression analysis to determine, on any date, whether the observed count deviates significantly from the value predicted by the model.
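The two detectors described above can be sketched in a few lines. The following is a minimal illustration only; the function names, the 365.25-day period, and the default half-life are our own choices, not taken from the report:

```python
import numpy as np

def ewma(counts, half_life=7.0):
    """Exponentially weighted moving average of a daily count series.
    The half-life is the 'time window' parameter: a count half_life days
    old receives half the weight of today's count."""
    alpha = 1.0 - 0.5 ** (1.0 / half_life)  # per-day decay implied by the half-life
    avg = counts[0]
    for c in counts[1:]:
        avg = alpha * c + (1.0 - alpha) * avg
    return avg

def fit_serfling(y, period=365.25):
    """Least-squares fit of the cyclic regression model
    y(t) = a + b*t + c*cos(2*pi*t/period) + d*sin(2*pi*t/period),
    returning the four parameters (a, b, c, d)."""
    t = np.arange(len(y), dtype=float)
    X = np.column_stack([np.ones_like(t), t,
                         np.cos(2 * np.pi * t / period),
                         np.sin(2 * np.pi * t / period)])
    params, *_ = np.linalg.lstsq(X, y, rcond=None)
    return params
```

A detection rule would then compare today's count against the EWMA (or against the Serfling model's prediction) and flag large positive residuals.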
Efficient Learning and Feature Selection in High-Dimensional Regression, 2010
"... We present a novel algorithm for efficient learning and feature selection in highdimensional regression problems. We arrive at this model through a modification of the standard regression model, enabling us to derive a probabilistic version of the wellknown statistical regression technique of back ..."
Abstract

Cited by 2 (0 self)
We present a novel algorithm for efficient learning and feature selection in high-dimensional regression problems. We arrive at this model through a modification of the standard regression model, enabling us to derive a probabilistic version of the well-known statistical regression technique of backfitting. Using the expectation-maximization algorithm, along with variational approximation methods to overcome intractability, we extend our algorithm to include automatic relevance detection of the input features. This variational Bayesian least squares (VBLS) approach retains its simplicity as a linear model, but offers a novel statistically robust black-box approach to generalized linear regression with high-dimensional inputs. It can be easily extended to nonlinear regression and classification problems. In particular, we derive the framework of sparse Bayesian learning, the relevance vector machine, with VBLS at its core, offering significant computational and robustness advantages for this class of methods. The iterative nature of VBLS makes it most suitable for real-time incremental learning, which is crucial especially in the application domains of robotics, brain-machine interfaces, and neural prosthetics, where real-time learning of models for control is needed. We evaluate our algorithm on synthetic and neurophysiological data sets, as well as on standard regression and classification benchmark data sets, comparing it with other competitive statistical approaches and demonstrating its suitability as a drop-in replacement for other generalized linear regression techniques.
Sequential Update of AD-trees
"... Ingcreasingly, datamining algorithms must deal with databases that continuously grow over time. These algorithms must avoid repeatedly scanning their databases. When database attributes are symbolic, ADtrees have already shown to be efficient structures to store sufficient statistics in main memory ..."
Abstract

Cited by 1 (0 self)
Increasingly, data-mining algorithms must deal with databases that continuously grow over time. These algorithms must avoid repeatedly scanning their databases. When database attributes are symbolic, AD-trees have already been shown to be efficient structures for storing sufficient statistics in main memory and for accelerating the mining process in batch environments. Here we present an efficient method to sequentially update AD-trees that is suitable for incremental environments.
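As a rough illustration of the kind of structure being updated, here is a minimal AD-tree-like count cache supporting sequential (one-record-at-a-time) insertion. It omits the most-common-value pruning that makes real AD-trees memory-efficient, so node count grows exponentially in the number of attributes; the class and method names are hypothetical:

```python
class ADNode:
    """Minimal AD-tree-style node: a count plus, for each later attribute,
    children split by that attribute's value. No most-common-value pruning,
    so this is a sketch, not a memory-efficient implementation."""

    def __init__(self, n_attrs, depth=0):
        self.count = 0
        self.depth = depth
        self.n_attrs = n_attrs
        # children[j][v] = ADNode counting records with attribute j == v
        self.children = {j: {} for j in range(depth, n_attrs)}

    def insert(self, record):
        """Sequential update: add one record by incrementing the count
        at every node on every path the record touches."""
        self.count += 1
        for j in range(self.depth, self.n_attrs):
            v = record[j]
            child = self.children[j].get(v)
            if child is None:
                child = ADNode(self.n_attrs, j + 1)
                self.children[j][v] = child
            child.insert(record)

    def query(self, constraints):
        """Count records matching a {attribute_index: value} conjunction."""
        if not constraints:
            return self.count
        j = min(constraints)  # attributes are ordered along each path
        child = self.children[j].get(constraints[j])
        if child is None:
            return 0
        rest = {k: v for k, v in constraints.items() if k != j}
        return child.query(rest)
```

Because every count is stored pre-aggregated, any conjunctive query is answered by a single downward walk rather than a database scan.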
Refinement of Bayesian Network Structures upon New Data
"... Refinement of Bayesian network structures using new data becomes more and more relevant. Some work has been done there; however, one problem has not been considered yet what to do when new data has fewer or more attributes than the existing model. In both cases data contains important knowledge and ..."
Abstract

Cited by 1 (1 self)
Refinement of Bayesian network structures using new data is becoming more and more relevant. Some work has been done in this area; however, one problem has not yet been considered: what to do when the new data has fewer or more attributes than the existing model. In both cases the data contains important knowledge, and every effort must be made to extract it. In this paper, we propose a general merging algorithm to deal with situations when new data has a different set of attributes. The merging algorithm updates sufficient statistics when new data is received. It expands the flexibility of Bayesian network structure refinement methods. The new algorithm is evaluated in extensive experiments, and its applications are discussed at length.
T-Cube: Quick Response to Ad-Hoc Time Series Queries against Large Datasets
"... Abstract. We present a novel data structure called TCube which dramatically improves response time to adhoc time series queries against large datasets. We have tested TCube on both synthetic and real world data (emergency room patient visits, pharmacy sales) containing millions of records. The re ..."
Abstract
We present a novel data structure called T-Cube which dramatically improves response time to ad-hoc time series queries against large datasets. We have tested T-Cube on both synthetic and real-world data (emergency room patient visits, pharmacy sales) containing millions of records. The results indicate that T-Cube responds to complex queries 1,000 times faster than state-of-the-art commercial time series extraction tools. This speedup has two main benefits: (1) it enables massive-scale statistical data mining of large collections of time series, and (2) it allows users to perform many complex ad-hoc queries without inconvenient delays.
Learning in First-Order Probabilistic Representations, 2003
"... Learning probabilistic models has been an important direction of research in the machine learning community, as has been learning firstorder logic models. Ideally, we would like to be able to combine the two, i.e., to learn firstorder probabilistic models. Because of their ability to handle uncert ..."
Abstract
Learning probabilistic models has been an important direction of research in the machine learning community, as has been learning first-order logic models. Ideally, we would like to be able to combine the two, i.e., to learn first-order probabilistic models. Because of their ability to handle uncertainty and compactly model complex domains, these models are the object of growing research interest. This research comprises three main directions: knowledge-based model construction (KBMC), stochastic logic programs (SLPs), and probabilistic relational models (PRMs). This paper surveys these approaches, and suggests opportunities for further research and improvement, particularly with regard to modifying them so they may scale to handle large amounts of training data.
AD-trees for Sequential Data and N-gram Counting
"... Abstract — We consider the problem of efficiently storing ngram counts for large n over very large corpora. In such cases, the efficient storage of sufficient statistics can have a dramatic impact on system performance. One popular model for storing such data derived from tabular data sets with man ..."
Abstract
We consider the problem of efficiently storing n-gram counts for large n over very large corpora. In such cases, the efficient storage of sufficient statistics can have a dramatic impact on system performance. One popular model for storing such data, derived from tabular data sets with many attributes, is the AD-tree. Here, we adapt the AD-tree to benefit from the sequential structure of corpora-type data. We demonstrate the usefulness of our approach on a portion of the well-known Wall Street Journal corpus from the Penn Treebank and show that our approach is exponentially more efficient than the naïve approach to storing n-grams and is also significantly more efficient than a traditional prefix tree.
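For context, the traditional prefix-tree baseline mentioned in the abstract can be sketched as follows. This is a generic trie over token sequences, not the paper's sequential AD-tree adaptation, and the function names are our own:

```python
class TrieNode:
    """One node of a prefix tree of n-grams: its count plus children
    keyed by the next token."""
    __slots__ = ("count", "children")

    def __init__(self):
        self.count = 0
        self.children = {}

def count_ngrams(tokens, max_n):
    """Store counts of all n-grams up to length max_n in a prefix tree.
    Shared prefixes share nodes, so the count of an (n-1)-gram sits on
    the path to every n-gram that extends it."""
    root = TrieNode()
    for i in range(len(tokens)):
        node = root
        for tok in tokens[i:i + max_n]:
            node = node.children.setdefault(tok, TrieNode())
            node.count += 1
    return root

def ngram_count(root, ngram):
    """Look up the count of one n-gram by walking its tokens."""
    node = root
    for tok in ngram:
        node = node.children.get(tok)
        if node is None:
            return 0
    return node.count
```

The paper's claim is that exploiting sequential structure beats even this layout; the trie here is the baseline being improved upon.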
A Data Structure for Fast Extraction of Time Series from Large Datasets, 2007
"... This report introduces a data structure called TCube designed to dramatically improve response time to adhoc time series queries against large datasets. We have tested TCube on both synthetic and real world data (emergency room patient visits, pharmacy sales) containing millions of records. The r ..."
Abstract
This report introduces a data structure called T-Cube designed to dramatically improve response time to ad-hoc time series queries against large datasets. We have tested T-Cube on both synthetic and real-world data (emergency room patient visits, pharmacy sales) containing millions of records. The results indicate that T-Cube responds to complex queries 1,000 times faster than state-of-the-art commercial time series extraction tools. This speedup has two main benefits: (1) it enables massive-scale statistical mining of large collections of time series data, and (2) it allows users to perform many complex ad-hoc queries without inconvenient delays. These benefits have already proven useful in applications related to monitoring the safety of food and agriculture, in detecting emerging patterns of failures in maintenance and supply management systems, as well as in the original application domain: biosurveillance. Keywords: time series, databases, data structures, cached sufficient statistics
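The caching idea behind such a structure can be illustrated, in much-simplified form, by precomputing one daily-count time series per attribute value, so an ad-hoc query becomes a dictionary lookup instead of a scan over millions of records. The record schema below (day and zip fields) is invented for illustration; the actual T-Cube is a tree over attribute combinations:

```python
import numpy as np
from collections import defaultdict

def build_series_cache(records, n_days, group_attrs):
    """Precompute an aggregated daily-count series for every value of
    every grouping attribute, plus one series over all records.
    A sketch of the caching idea only, not the T-Cube tree itself."""
    cache = defaultdict(lambda: np.zeros(n_days, dtype=int))
    for rec in records:
        day = rec["day"]
        cache[("*", "*")][day] += 1  # series over all records
        for a in group_attrs:
            cache[(a, rec[a])][day] += 1
    return cache
```

A query for the time series of, say, one zip code then costs O(1) plus the cost of reading the cached vector, at the price of storing one series per cached attribute value.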
Extending Enclosure Arithmetic to Statistical Regular Subpavings
, 2009
"... The quantity and complexity of data collected in most disciplines today is overwhelming. Computeraided statistical decisions based on standard methods and theory will often be impractical, inappropriate or ineffective for such massive datasets. Limitations on machine memory and speed of access and ..."
Abstract
The quantity and complexity of data collected in most disciplines today is overwhelming. Computer-aided statistical decisions based on standard methods and theory will often be impractical, inappropriate or ineffective for such massive datasets. Limitations on machine memory and on the speed of access and operations pose physical challenges for massive data processing, especially in the presence of updates. There is an urgent need to develop theory and methods that account for the engineering constraints on representing and processing massive data in computing machines, as highlighted in [7]. We propose to use set-valued mathematics [1, 6, 12], recursively computable statistics [2, 4, 9], multidimensional metric data structures [6, 15, 16] and extended enclosure arithmetics to produce memory-limited and statistically consistent enclosures of nonparametric density and plug-in estimates for static as well as dynamic massive data problems. We restrict our proposed study to problems involving observations x_1, ..., x_n in some m-dimensional metric space, i.e. x_i ∈ R^m, where m ≤ 50 and n is typically more than the available primary machine memory. Our goal is to build a stable version of MRS: a C++ class library for statistical set processing that is generic, thread-safe and exactly implements our computational statistical theory.