Results 1 - 5 of 5
The Anchors Hierarchy: Using the Triangle Inequality to Survive High Dimensional Data
- In Twelfth Conference on Uncertainty in Artificial Intelligence, 2000
Cited by 65 (9 self)
This paper is about metric data structures in high-dimensional or non-Euclidean space that permit cached sufficient statistics accelerations of learning algorithms.
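As a rough illustration of how the triangle inequality can save distance computations in a metric space (a minimal sketch only, not the paper's anchors hierarchy): if each point's distance to a fixed pivot is cached, then |d(q, pivot) - d(x, pivot)| is a lower bound on d(q, x), so a point can be skipped whenever that bound cannot beat the best distance found so far. The function names `build_cache` and `nearest`, and the choice of a single pivot, are assumptions made for this example.

```python
import math

def euclid(a, b):
    """Euclidean distance; any metric obeying the triangle inequality works."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_cache(points, pivot, dist=euclid):
    """Cache each point's distance to one pivot (hypothetical helper)."""
    return [dist(p, pivot) for p in points]

def nearest(query, points, pivot, cache, dist=euclid):
    """Return (index, distance, number of distance calls) for the nearest point."""
    d_qp = dist(query, pivot)
    calls = 1
    best_i, best_d = -1, math.inf
    for i, p in enumerate(points):
        # Triangle inequality: d(query, p) >= |d(query, pivot) - d(p, pivot)|.
        # If even this lower bound is no better than the current best, skip p.
        if abs(d_qp - cache[i]) >= best_d:
            continue
        d = dist(query, p)
        calls += 1
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d, calls
```

The pruning test uses only cached numbers, so its cost does not depend on the dimensionality of the points or on how expensive the metric is to evaluate.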
Summary of biosurveillance-relevant technologies
- 2003
Cited by 3 (1 self)
This short report, compiled upon request from Dave Siegrist and Ted Senator, surveys the spectrum of technologies that can help with Biosurveillance. We indicate which we have chosen, so far, to use in our development of analysis methods, and our reasons. 1. Time-weighted averaging: This is directly applicable to a scalar signal (such as “number of respiratory cases today”). This method, more commonly used in computational finance, simply compares the count during the current time period with the weighted average of the counts of recent days. Exponential weighting is typically used, where the half-life is known as the “time window” parameter. This time-window parameter is typically chosen by hand. We prefer the Serfling and Univariate HMM methods described below. 2. Serfling method: This method (Serfling, 1963) is a cyclic regression model, and is the standard CDC algorithm for flu detection. It is, again, applicable to scalar signals. It assumes that the signal follows a sinusoid with a period of one year, and thus finds the four parameters of the model, which are chosen to minimize the sum of squares of residuals. It is an easy matter of regression analysis to determine, on any date, whether
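The two scalar-signal methods described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the report's own code: the model equation is not legible in the abstract, so the four-parameter form used here (intercept, linear trend, one sine term and one cosine term) and the period of 365.25 days are assumptions, and the names `ewma_baseline`, `fit_cyclic` and `predict_cyclic` are hypothetical.

```python
import math

def ewma_baseline(counts, half_life):
    """Exponentially weighted average of past counts (most recent last)."""
    decay = 0.5 ** (1.0 / half_life)  # weight halves every `half_life` steps
    num = den = 0.0
    weight = 1.0
    for c in reversed(counts):
        num += weight * c
        den += weight
        weight *= decay
    return num / den

def _solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_cyclic(days, counts, period=365.25):
    """Least-squares fit of y = a + b*t + c*sin(w*t) + d*cos(w*t), w = 2*pi/period.

    The model form is an assumption; returns [a, b, c, d].
    """
    w = 2.0 * math.pi / period
    rows = [[1.0, t, math.sin(w * t), math.cos(w * t)] for t in days]
    # Normal equations (A^T A) x = A^T y minimize the sum of squared residuals.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    aty = [sum(r[i] * y for r, y in zip(rows, counts)) for i in range(4)]
    return _solve(ata, aty)

def predict_cyclic(params, t, period=365.25):
    """Expected count on day t under the fitted model."""
    a, b, c, d = params
    w = 2.0 * math.pi / period
    return a + b * t + c * math.sin(w * t) + d * math.cos(w * t)
```

In either case, today's observed count would be compared against the baseline (`ewma_baseline` of recent days, or `predict_cyclic` for today's date), with a large positive residual flagged for attention.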
Cached Sufficient Statistics for Automated Mining and Discovery from Massive Data Sources
- 2000
Cited by 1 (0 self)
... into a new, fundamentally impossible realm where the data sources are just too large to assimilate by humans. This situation is ironic given the large investment the US has put into gathering scientific data. The only alternative is automated discovery. It is our thesis that the emerging technology of cached sufficient statistics will be critical to developing automated discovery on massive data. A cached sufficient statistics representation is a data structure that summarizes statistical information in a database. For example, human users, or statistical programs, often need to query some quantity (such as a mean or variance) about some subset of the attributes (such as size, position and shape) over some subset of the records. When this happens, we want the cached sufficient statistic representation to intercept the request and, instead of answering it slowly by database accesses over billions of records, answer it immediately. The interesting technical challenge is: given that there
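To make the idea of a cached sufficient statistics representation concrete, here is a minimal sketch, assuming records are dictionaries and queries ask for the mean and variance of one numeric attribute over records selected by a categorical attribute. The class name `MomentCache` and its methods are hypothetical, and this is far simpler than the data structures the paper has in mind. One pass over the data stores a count, a sum and a sum of squares per group; later queries are answered from those cached numbers without touching the records again.

```python
from collections import defaultdict

class MomentCache:
    """Caches count, sum and sum of squares per group key (hypothetical class)."""

    def __init__(self, records, key, value):
        # One pass over the data builds the sufficient statistics.
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])
        for r in records:
            s = self.stats[r[key]]
            x = r[value]
            s[0] += 1
            s[1] += x
            s[2] += x * x

    def mean_var(self, keys):
        """Mean and population variance over all records whose key is in `keys`."""
        n, total, total_sq = 0, 0.0, 0.0
        for k in keys:
            cnt, sm, sq = self.stats.get(k, (0, 0.0, 0.0))
            n += cnt
            total += sm
            total_sq += sq
        if n == 0:
            raise ValueError("no records match the requested keys")
        mean = total / n
        return mean, total_sq / n - mean * mean
```

Count, sum and sum of squares are sufficient for the mean and variance, and they add across groups, which is why a query over any union of groups costs time proportional to the number of groups rather than the number of records.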
A Dynamic Adaptation of AD-trees for Efficient Machine Learning
- In Proceedings of the Seventeenth International Conference on Machine Learning, 2000
This paper has no novel learning or statistics: it is concerned with making a wide class of preexisting statistics and learning algorithms computationally tractable when faced with data sets with massive numbers of records or attributes.
Cached Sufficient Statistics for Automated Mining and Discovery from Massive Data Sources
- 1999
There are many massive databases in industry and science. There are also many ways that decision makers, scientists, and the public need to interact with these data sources. Wide-ranging statistics and machine learning algorithms similarly need to query databases, sometimes millions of times for a single inference. With millions or billions of records (e.g. biotechnology databases, inventory management systems, astrophysics sky surveys, corporate sales information, science lab data repositories) this can be intractable using current algorithms. The Auton lab (at Carnegie Mellon University) and Schenley Park Research Inc. (a startup company), both jointly run by Andrew Moore and Jeff Schneider, are concerned with the fundamental computer science of making very advanced data analysis techniques computationally feasible for massive datasets.

