Approximating Multi-Dimensional Aggregate Range Queries Over Real Attributes
, 2000
Abstract

Cited by 85 (9 self)
Finding approximate answers to multidimensional range queries over real-valued attributes has significant applications in data exploration and database query optimization. In this paper we consider the following problem: given a table of d attributes whose domain is the real numbers, and a query that specifies a range in each dimension, find a good approximation of the number of records in the table that satisfy the query. We present a new histogram technique that is designed to approximate the density of multidimensional datasets with real attributes. Our technique finds buckets of variable size and allows the buckets to overlap. Overlapping buckets allow more efficient approximation of the density. The size of the cells is based on the local density of the data. This technique leads to a faster and more compact approximation of the data distribution. We also show how to generalize kernel density estimators, and how to apply them on the multidimensional query approxim...
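The density-based range-count idea can be pictured with a toy kernel estimator (a minimal sketch of the general approach only, not the paper's histogram technique; the box kernel, the bandwidth `h`, and the data below are all made up):

```python
# Sketch of kernel-based range-count approximation (illustrative only,
# not the paper's specific technique). A product "box" kernel of
# half-width h per dimension spreads each sample point's unit mass
# uniformly over a small hyper-rectangle around it.

def overlap_fraction(center, lo, hi, h):
    """Fraction of the interval [center-h, center+h] inside [lo, hi]."""
    left = max(lo, center - h)
    right = min(hi, center + h)
    return max(0.0, right - left) / (2 * h)

def estimate_range_count(points, query_lo, query_hi, h=0.1):
    """Approximate how many points fall in the hyper-rectangle
    [query_lo, query_hi] by summing per-point kernel mass."""
    total = 0.0
    for p in points:
        mass = 1.0
        for x, lo, hi in zip(p, query_lo, query_hi):
            mass *= overlap_fraction(x, lo, hi, h)
        total += mass
    return total

# Tiny usage example: 2-D points, query box [0, 0.5] x [0, 0.5].
data = [(0.2, 0.2), (0.25, 0.3), (0.9, 0.9)]
est = estimate_range_count(data, (0.0, 0.0), (0.5, 0.5), h=0.05)
```

Because each point spreads its mass over a small box, the estimate degrades smoothly near query boundaries instead of jumping by whole records.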
Efficient discovery of error-tolerant frequent itemsets in high dimensions
 In SIGKDD 2001
, 2001
Abstract

Cited by 62 (0 self)
We present a generalization of frequent itemsets allowing for the notion of errors in the itemset definition. We motivate the problem and present an efficient algorithm that identifies error-tolerant frequent clusters of items in transactional data (customer-purchase data, web browsing data, text, etc.). The algorithm exploits sparseness of the underlying data to find large groups of items that are correlated over database records (rows). The notion of transaction coverage allows us to extend the algorithm and view it as a fast clustering algorithm for discovering segments of similar transactions in binary sparse data. We evaluate the new algorithm on three real-world applications: clustering high-dimensional data, query selectivity estimation and collaborative filtering. Results show that the algorithm consistently uncovers structure in large sparse databases that other traditional clustering algorithms fail to find.
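One way to picture an error-tolerant itemset (hedged: the paper's formal definition of error tolerance may differ from this) is to count a row as supporting the itemset when it contains most, rather than all, of the items:

```python
# Illustrative error-tolerant support check (the paper's exact
# definition may differ). A row supports itemset I with tolerance eps
# if it contains at least a (1 - eps) fraction of the items in I.

def et_support(rows, itemset, eps):
    """Fraction of rows containing >= (1 - eps) of the items."""
    need = (1 - eps) * len(itemset)
    hits = sum(1 for r in rows if len(itemset & r) >= need)
    return hits / len(rows)

# Made-up transactions; with eps = 0.34 a row may miss one of three items.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},        # missing "c": still counts
    {"a", "c", "d"},   # missing "b": still counts
    {"d"},
]
support = et_support(transactions, {"a", "b", "c"}, eps=0.34)
```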
ICICLES: Self-tuning Samples for Approximate Query Answering
 VLDB
, 2000
Abstract

Cited by 61 (0 self)
Approximate query answering systems provide very fast alternatives to OLAP systems when applications are tolerant to small errors in query answers.
Dwarf: Shrinking the PetaCube
 Proceedings of the 2002 ACM SIGMOD Conference
Abstract

Cited by 52 (5 self)
Dwarf is a highly compressed structure for computing, storing, and querying data cubes. Dwarf identifies prefix and suffix structural redundancies and factors them out by coalescing their store. Prefix redundancy is high in dense areas of cubes, but suffix redundancy is significantly higher in sparse areas. Putting the two together fuses the exponential sizes of high-dimensional full cubes into a dramatically condensed data structure. The elimination of suffix redundancy yields an equally dramatic reduction in the computation of the cube, because recomputation of the redundant suffixes is avoided. This effect is multiplied in the presence of correlation amongst attributes in the cube.
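Prefix coalescing alone can be illustrated with a tiny trie: tuples that share a dimension-value prefix share one path, so the prefix is stored once instead of once per tuple (suffix coalescing, which Dwarf also performs, is not attempted here, and the tuples are made up):

```python
# Prefix-sharing sketch only; Dwarf's actual structure also coalesces
# suffixes and stores aggregates at its nodes.

def build_prefix_trie(tuples):
    """Nested dicts keyed by dimension value; shared prefixes share nodes."""
    root = {}
    for t in tuples:
        node = root
        for value in t:
            node = node.setdefault(value, {})
    return root

def count_nodes(trie):
    return sum(1 + count_nodes(child) for child in trie.values())

# Hypothetical (store, customer, product) tuples.
rows = [("S1", "C2", "P2"), ("S1", "C3", "P1"), ("S2", "C1", "P1")]
shared = count_nodes(build_prefix_trie(rows))  # cells with prefix sharing
flat = sum(len(t) for t in rows)               # cells stored per tuple
```

Here the shared `"S1"` prefix saves one cell (8 nodes versus 9 flat cells); on real high-dimensional cubes the savings compound.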
Beyond Independence: Probabilistic Models for Query Approximation on Binary Transaction Data
, 2001
Abstract

Cited by 51 (7 self)
We investigate the problem of generating fast approximate answers to queries for large sparse binary data sets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. In particular, we introduce a novel technique for building probabilistic models from frequent itemsets. The itemsets are treated as constraints on the distribution of the query variables, and the maximum entropy principle is used online to build a joint probability model for the attributes in the query. We show that the resulting probability model defines a Markov random field (MRF) and that the time taken to answer a query scales exponentially as a function of the induced width of the associated MRF graph. We empirically compare the MRF model to other probabilistic models, such as the independence model, the Chow-Liu tree model, the Bernoulli mixture model, and the ADtree model. Experimental resu...
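The maximum-entropy construction can be sketched with iterative proportional fitting (IPF) over only the query's variables. This is an illustration of the principle, not the paper's actual machinery, and the constraint frequencies below are made up:

```python
import itertools

# Build a max-entropy joint over binary query variables subject to
# itemset-frequency constraints, via iterative proportional fitting.
# Hypothetical constraints; the paper's construction is more involved.

def ipf(n_vars, constraints, iters=100):
    """constraints: list of (frozenset of variable indices, target prob),
    each meaning P(all variables in the set equal 1) == target."""
    states = list(itertools.product([0, 1], repeat=n_vars))
    p = {s: 1.0 / len(states) for s in states}  # uniform = max entropy
    for _ in range(iters):
        for vars_, target in constraints:
            cur = sum(p[s] for s in states if all(s[v] for v in vars_))
            for s in states:
                inside = all(s[v] for v in vars_)
                p[s] *= (target / cur) if inside else ((1 - target) / (1 - cur))
    return p

# Made-up itemset frequencies: P(A)=0.4, P(B)=0.3, P(A and B)=0.2.
model = ipf(2, [(frozenset({0}), 0.4),
                (frozenset({1}), 0.3),
                (frozenset({0, 1}), 0.2)])
p_a_and_b = sum(v for s, v in model.items() if s[0] and s[1])
```

Each IPF update rescales the states inside and outside one constraint's event, so every constraint is matched in turn until the distribution settles on the maximum-entropy model.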
Scaling EM (Expectation-Maximization) Clustering to Large Databases
, 1999
Abstract

Cited by 51 (1 self)
Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to the data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that...
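The sufficient-statistics decomposition can be pictured in one dimension (a deliberately simplified sketch; the paper's compression scheme is more elaborate): a dense region of points is replaced by (count, sum, sum of squares), from which mean and variance are recovered without rescanning the points.

```python
# Summarize a region of points by its sufficient statistics, so EM-style
# updates can work from summaries instead of rescanning every record.

def summarize(points):
    n = len(points)
    s = sum(points)
    ss = sum(x * x for x in points)
    return (n, s, ss)

def merge(a, b):
    """Summaries combine additively, so regions scanned separately merge."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_var(stats):
    n, s, ss = stats
    mean = s / n
    return mean, ss / n - mean * mean

chunk1 = summarize([1.0, 2.0, 3.0])  # one region of the scan
chunk2 = summarize([4.0, 5.0])       # another region
mean, var = mean_var(merge(chunk1, chunk2))
```

Because the statistics are additive, a single pass over the data can build summaries chunk by chunk in bounded memory.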
Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration
 In VLDB
, 2003
Abstract

Cited by 44 (10 self)
Data cube computation is one of the most essential but expensive operations in data warehousing. Previous studies have developed two major approaches, top-down vs. bottom-up. The former, represented by the MultiWay Array Cube (called MultiWay) algorithm [25], aggregates simultaneously on multiple dimensions...
Constrained K-Means Clustering
MSR-TR-2000-65, Microsoft Research
, 2000
Abstract

Cited by 39 (0 self)
We consider practical methods for adding constraints to the K-Means clustering algorithm in order to avoid local solutions with empty clusters or clusters having very few points. We often observe this phenomenon when applying K-Means to datasets where the number of dimensions is n ≥ 10 and the number of desired clusters is k ≥ 20. We propose explicitly adding k constraints to the underlying clustering optimization problem requiring that each cluster have at least a minimum number of points in it. We then investigate the resulting cluster assignment step. Preliminary numerical tests on real datasets indicate the constrained approach is less prone to poor local solutions, producing a better summary of the underlying data. 1 Introduction The K-Means clustering algorithm [5] has become a workhorse for the data analyst in many diverse fields. One drawback to the algorithm occurs when it is applied to datasets with m data points in n ≥ 10 dimensional real spac...
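The minimum-cluster-size constraint can be illustrated with a one-dimensional assignment step. The paper solves the constrained assignment exactly as an optimization subproblem; the greedy repair below is only a sketch of the constraint itself, with made-up points and centroids:

```python
# Illustrative constrained assignment step: nearest-centroid assignment,
# then greedy repair of clusters below the minimum size tau. Assumes
# enough points exist to satisfy all minimums; not the paper's exact
# optimization formulation.

def constrained_assign(points, centroids, tau):
    # Plain nearest-centroid assignment first.
    assign = [min(range(len(centroids)),
                  key=lambda j: abs(p - centroids[j])) for p in points]
    # Greedily repair clusters that fall below tau points.
    for j in range(len(centroids)):
        while assign.count(j) < tau:
            # Steal the point cheapest to move from an oversized cluster.
            candidates = [i for i, a in enumerate(assign)
                          if a != j and assign.count(a) > tau]
            i = min(candidates, key=lambda i: abs(points[i] - centroids[j]))
            assign[i] = j
    return assign

pts = [0.0, 0.1, 0.2, 5.0]
labels = constrained_assign(pts, centroids=[0.1, 6.0], tau=2)
```

Without the constraint, centroid 6.0 attracts only one point; the repair moves the nearest spare point over so no cluster collapses.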
Efficient Aggregation over Objects with Extent (Extended Abstract)
Tech. Report UCR-CS-01-01, CS Dept
, 2002
Abstract

Cited by 38 (8 self)
We examine the problem of efficiently computing sum/count/avg aggregates over...
Condensed Cube: An Effective Approach to Reducing Data Cube Size
 In ICDE
, 2002
Abstract

Cited by 33 (1 self)
A precomputed data cube facilitates OLAP (On-Line Analytical Processing). It is a well-known fact that data cube computation is an expensive operation, and it has attracted a lot of attention. While most proposed algorithms have devoted themselves to optimizing memory management and reducing computation costs, less work addresses one of the fundamental issues: the size of a data cube is huge when a large base relation with a large number of attributes is involved. In this paper, we propose a new concept, called a condensed data cube. The condensed cube is of much smaller size than a complete non-condensed cube. More importantly, it is a fully precomputed cube without compression, and hence it requires neither decompression nor further aggregation when answering queries. Several algorithms for computing the condensed cube are proposed. Results of experiments on the effectiveness of the condensed data cube are presented, using both synthetic and real-world data. The results indicate that the proposed condensed cube can reduce both the cube size and its computation time.
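The intuition behind condensing can be sketched for the simplest case, a dimension value that occurs in exactly one base tuple (the paper's condensing rules and algorithms are more general; the rows here are made up):

```python
from collections import Counter

# If a tuple is the only one carrying some dimension value, every cuboid
# cell that fixes that value aggregates exactly that tuple, so all those
# cells can be represented by the single base tuple. Illustration only.

def condensable_tuples(rows, dim):
    """Rows whose value in column `dim` appears exactly once in the table."""
    counts = Counter(r[dim] for r in rows)
    return [r for r in rows if counts[r[dim]] == 1]

# Hypothetical (store, customer) rows.
rows = [("S1", "C1"), ("S1", "C2"), ("S2", "C1")]
# In dimension 0, "S2" occurs once, so the cells (S2, *) and (S2, C1)
# both collapse onto the base tuple ("S2", "C1").
unique_dim0 = condensable_tuples(rows, 0)
```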