Results 1 - 6 of 6
The Anchors Hierarchy: Using the Triangle Inequality to Survive High Dimensional Data
 In Twelfth Conference on Uncertainty in Artificial Intelligence
, 2000
Abstract

Cited by 75 (8 self)
This paper is about metric data structures in high-dimensional or non-Euclidean space that permit cached-sufficient-statistics accelerations of learning algorithms.
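The core idea the title refers to can be illustrated with a small sketch: if we cache distances from a pivot point (an "anchor") to every data point, the triangle inequality gives the lower bound |d(q, p) - d(p, x)| ≤ d(q, x), letting a nearest-neighbour search skip points without computing their distance to the query. This is a hypothetical illustration of the general pruning principle, not the paper's anchors hierarchy; all names are invented.

```python
import math

def dist(a, b):
    # Euclidean distance between two points given as tuples.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_with_pruning(query, points, pivot):
    """Nearest-neighbour search that skips any point x whose triangle-
    inequality lower bound |d(query, pivot) - d(pivot, x)| already exceeds
    the best distance found so far, so dist(query, x) is never computed."""
    dq = dist(query, pivot)
    # Pivot-to-point distances: computed once and reusable across queries,
    # which is where the savings come from.
    pivot_dists = [dist(pivot, x) for x in points]
    best, best_d, pruned = None, float("inf"), 0
    for x, dp in zip(points, pivot_dists):
        if abs(dq - dp) >= best_d:   # lower bound already too large: skip
            pruned += 1
            continue
        d = dist(query, x)
        if d < best_d:
            best, best_d = x, d
    return best, best_d, pruned
```

With a query near the pivot and a far-away cluster, most candidate points are eliminated by the bound alone.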
A dynamic adaptation of ADtrees for efficient machine learning on large data sets
 In Proceedings of the 17th International Conference on Machine Learning
, 2000
Abstract

Cited by 15 (5 self)
This paper has no novel learning or statistics: it is concerned with making a wide class of pre-existing statistics and learning algorithms computationally tractable when faced with data sets with massive numbers of records or attributes. It briefly reviews the static ADtree structure of Moore and Lee (1998), and offers a new structure with more attractive properties: (1) the new structure scales better with the number of attributes in the data set; (2) it has zero initial build time; (3) it adaptively caches only statistics relevant to the current task; and (4) it can be used incrementally in cases where new data is frequently being appended to the data set. We provide a careful explanation of the data structure, and then empirically evaluate the performance under varying access patterns induced by different learning algorithms such as association rules, decision trees and Bayes net structures. We conclude by discussing the longer-term benefits of the new structure: the eventual ability to apply ADtrees to data sets with real-valued attributes.

1. Description of ADtrees

1.1 What is an ADtree?

Table 1 shows a tiny data set with M = 3 symbolic (i.e., categorical) attributes (the columns), and R = 6 records (the rows). A counting query has the form C(a1 = 2, a2 = *, a3 = 1), and is a request to count the number of records matching the query, with asterisks interpreted as "don't cares". C(a1 = 2, a2 = *, a3 = 1) = 3 in our example. Moore and Lee (1998) and Anderson and Moore (1998) introduced a new data structure, called an All-Dimensions tree (ADtree), for representing the cached counting statistics for a categorical data set.

[Table 1. Sample data set with three attributes and six records.]
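The counting-query semantics sketched above are easy to state directly in code. The following is a minimal illustration, not the ADtree itself (an ADtree organizes such counts in a tree so they need not be recomputed); the record values are invented so that the example query matches three rows, and `None` plays the role of the asterisk "don't care".

```python
from functools import lru_cache

# Hypothetical data set mirroring the paper's example shape: M = 3
# categorical attributes, R = 6 records (values invented for illustration).
records = [
    (2, 1, 1),
    (2, 1, 1),
    (2, 2, 1),
    (1, 1, 2),
    (1, 2, 2),
    (2, 2, 2),
]

def count(query, data=tuple(records)):
    """Counting query C(...): None stands for '*' (don't care)."""
    return sum(
        all(q is None or q == v for q, v in zip(query, row))
        for row in data
    )

# The crudest possible "cached counting statistics": memoize each query.
# An ADtree achieves the same effect far more compactly.
cached_count = lru_cache(maxsize=None)(count)
```

For example, `count((2, None, 1))` asks for C(a1 = 2, a2 = *, a3 = 1).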
Summary of biosurveillancerelevant technologies
, 2003
Abstract

Cited by 4 (2 self)
This short report, compiled upon request from Dave Siegrist and Ted Senator, surveys the spectrum of technologies that can help with Biosurveillance. We indicate which we have chosen, so far, to use in our development of analysis methods and our reasons.

1 Time-weighted averaging

This is directly applicable to a scalar signal (such as "number of respiratory cases today"). This method, more commonly used in computational finance, simply compares the count during the current time period with the weighted average of the counts of recent days. Exponential weighting is typically used, where the half-life is known as the "time window" parameter. This time-window parameter is typically chosen by hand. We prefer the Serfling and Univariate HMM methods described below.

2 Serfling method

This method (Serfling, 1963) is a cyclic regression model, and is the standard CDC algorithm for flu detection. It is, again, applicable to scalar signals. It assumes that the signal follows a sinusoid with a period of one year, and thus finds the four parameters α, β, γ and δ in

    Y(t) = α + βt + γ sin(2πt/365.25) + δ cos(2πt/365.25),

where the parameters are chosen to minimize the sum of squares of residuals. It is an easy matter of regression analysis to determine, on any date, whether ...
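The time-weighted averaging detector described above can be sketched in a few lines. The half-life-to-decay conversion follows the description (a count `half_life` days old gets half the weight of today's); the alarm threshold and all function names are invented for illustration, not taken from the report.

```python
def ewma_baseline(history, half_life):
    """Exponentially weighted average of past daily counts.

    `history` is ordered oldest-first; the most recent count gets weight 1,
    and a count `half_life` days older gets half the weight of the one above it.
    """
    decay = 0.5 ** (1.0 / half_life)
    num = den = 0.0
    weight = 1.0
    for count in reversed(history):  # walk from most recent day backwards
        num += weight * count
        den += weight
        weight *= decay
    return num / den

def alarm(today, history, half_life=7.0, threshold=2.0):
    """Flag today's count if it exceeds `threshold` times the weighted baseline.
    Both the half-life ("time window") and the threshold are hand-chosen here,
    which is exactly the drawback the report notes."""
    return today > threshold * ewma_baseline(history, half_life)
```

On a flat history of 10 cases/day, the baseline is 10, so a day with 25 cases trips the alarm while a day with 12 does not.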
Cached Sufficient Statistics for Automated Mining and Discovery from Massive Data Sources
, 2000
Abstract

Cited by 1 (0 self)
...into a new, fundamentally impossible realm where the data sources are just too large to assimilate by humans. This situation is ironic given the large investment the US has put into gathering scientific data. The only alternative is automated discovery. It is our thesis that the emerging technology of cached sufficient statistics will be critical to developing automated discovery on massive data. A cached sufficient statistics representation is a data structure that summarizes statistical information in a database. For example, human users, or statistical programs, often need to query some quantity (such as a mean or variance) about some subset of the attributes (such as size, position and shape) over some subset of the records. When this happens, we want the cached sufficient statistics representation to intercept the request and, instead of answering it slowly by database accesses over billions of records, answer it immediately. The interesting technical challenge is: given that there ...
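The idea of answering mean/variance queries from a summary rather than a rescan can be made concrete: for a numeric attribute it suffices to cache n, Σx and Σx², from which mean and (population) variance follow in O(1). This is a minimal sketch of the principle, with invented names; the paper's representations are far richer.

```python
class StatCache:
    """Minimal cached-sufficient-statistics node for one numeric attribute:
    it keeps only n, sum(x) and sum(x^2), so mean and variance queries are
    answered instantly, without touching the underlying records again."""

    def __init__(self, values):
        self.n = len(values)
        self.s = sum(values)            # sum of x
        self.s2 = sum(v * v for v in values)  # sum of x squared

    def mean(self):
        return self.s / self.n

    def variance(self):
        # Population variance: E[x^2] - (E[x])^2, from the cached sums alone.
        m = self.mean()
        return self.s2 / self.n - m * m
```

Once built, the cache answers every subsequent mean/variance request without a database scan, which is exactly the "intercept the request" behaviour described above.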
A Dynamic Adaptation of ADtrees for Efficient Machine Learning
 In Proceedings of the Seventeenth International Conference on Machine Learning
, 2000
Abstract
This paper has no novel learning or statistics: it is concerned with making a wide class of pre-existing statistics and learning algorithms computationally tractable when faced with data sets with massive numbers of records or attributes.
Cached Sufficient Statistics for Automated Mining and Discovery from Massive Data Sources
, 1999
Abstract
There are many massive databases in industry and science. There are also many ways that decision makers, scientists, and the public need to interact with these data sources. Wide-ranging statistics and machine learning algorithms similarly need to query databases, sometimes millions of times for a single inference. With millions or billions of records (e.g. biotechnology databases, inventory management systems, astrophysics sky surveys, corporate sales information, science lab data repositories) this can be intractable using current algorithms. The Auton lab (at Carnegie Mellon University) and Schenley Park Research Inc. (a startup company), both jointly run by Andrew Moore and Jeff Schneider, are concerned with the fundamental computer science of making very advanced data analysis techniques computationally feasible for massive datasets.