Automatic Subspace Clustering of High Dimensional Data
 Data Mining and Knowledge Discovery
, 2005
Cited by 561 (12 self)
Data mining applications place special requirements on clustering algorithms, including the ability to find clusters embedded in subspaces of high-dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high-dimensional datasets.
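The idea of finding dense regions in subspaces can be illustrated with a small grid-based sketch. The abstract does not spell out the mechanics, so the equal-width grid, the parameter names xi (intervals per dimension) and tau (density threshold), and the restriction to 1-D and 2-D subspaces are all assumptions made here for illustration; this is not the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def dense_units(points, xi, tau, lo=0.0, hi=1.0):
    """Return dense grid units in 1-D and 2-D subspaces (illustrative only).

    points: list of equal-length tuples with coordinates in [lo, hi].
    xi:     number of equal-width intervals per dimension (assumed parameter).
    tau:    minimum fraction of points a unit must hold to count as dense.
    """
    n = len(points)
    dims = len(points[0])
    width = (hi - lo) / xi

    def cell(value):
        # Clamp so that value == hi falls into the last interval.
        return min(int((value - lo) / width), xi - 1)

    # Pass 1: count points per (dimension, interval) unit.
    counts_1d = Counter((d, cell(p[d])) for p in points for d in range(dims))
    dense_1d = {u for u, c in counts_1d.items() if c / n >= tau}

    # Pass 2: a 2-D unit can only be dense if both of its
    # 1-D projections are dense, so only those candidates are counted.
    counts_2d = Counter()
    for p in points:
        for d1, d2 in combinations(range(dims), 2):
            u1, u2 = (d1, cell(p[d1])), (d2, cell(p[d2]))
            if u1 in dense_1d and u2 in dense_1d:
                counts_2d[(u1, u2)] += 1
    dense_2d = {u for u, c in counts_2d.items() if c / n >= tau}
    return dense_1d, dense_2d
```

Because the result depends only on per-unit counts, shuffling the input records does not change it, which matches the order-insensitivity requirement listed in the abstract.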
Wavelet-Based Histograms for Selectivity Estimation
 in SIGMOD
, 1998
Cited by 210 (16 self)
Query optimization is an integral part of relational database management systems. One important task in query optimization is selectivity estimation, that is, given a query P, we need to estimate the fraction of records in the database that satisfy P. Many commercial database systems maintain histograms to approximate the frequency distribution of values in the attributes of relations. In this paper, we present a technique based upon a multiresolution wavelet decomposition for building histograms on the underlying data distributions, with applications to databases, statistics, and simulation. Histograms built on the cumulative data distributions give very good approximations with limited space usage. We give fast algorithms for constructing histograms and using them in an online fashion for selectivity estimation. Our histograms also provide quick approximate answers to OLAP queries when the exact answers are not required. Our method captures the joint distribution of multiple attri...
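A minimal sketch of the general idea, assuming a Haar basis and simple largest-magnitude thresholding (the abstract names neither, so both are illustrative choices, as are all function names): decompose the cumulative distribution, keep only m coefficients, and answer a range-selectivity query from the reconstruction.

```python
def haar_decompose(values):
    """Unnormalized Haar transform; len(values) must be a power of two."""
    coeffs = list(values)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        avgs = [(coeffs[2 * i] + coeffs[2 * i + 1]) / 2 for i in range(half)]
        diffs = [(coeffs[2 * i] - coeffs[2 * i + 1]) / 2 for i in range(half)]
        coeffs[:n] = avgs + diffs  # averages first, then detail coefficients
        n = half
    return coeffs


def haar_reconstruct(coeffs):
    """Inverse of haar_decompose."""
    values = list(coeffs)
    n = 1
    while n < len(values):
        expanded = []
        for a, d in zip(values[:n], values[n:2 * n]):
            expanded += [a + d, a - d]
        values[:2 * n] = expanded
        n *= 2
    return values


def build_histogram(freqs, m):
    """Keep the m largest-magnitude Haar coefficients of the normalized CDF."""
    total = sum(freqs)
    running, cdf = 0, []
    for f in freqs:
        running += f
        cdf.append(running / total)
    coeffs = haar_decompose(cdf)
    keep = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)[:m]
    return {i: coeffs[i] for i in keep}


def estimate_selectivity(hist, size, lo, hi):
    """Estimated fraction of records whose bucket index lies in [lo, hi]."""
    cdf = haar_reconstruct([hist.get(i, 0.0) for i in range(size)])
    below = cdf[lo - 1] if lo > 0 else 0.0
    return cdf[hi] - below
```

With m equal to the number of buckets the estimate is exact; smaller m trades accuracy for space. A production scheme would typically threshold normalized coefficients, which this sketch omits for brevity.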
Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets
Cited by 170 (2 self)
Computing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. It is advantageous to have fast, approximate answers to OLAP aggregation queries. In this paper, we present a novel method that provides approximate answers to high-dimensional OLAP aggregation queries in massive sparse data sets in a time-efficient and space-efficient manner. We construct a compact data cube, which is an approximate and space-efficient representation of the underlying multidimensional array, based upon a multiresolution wavelet decomposition. In the online phase, each aggregation query can generally be answered using the compact data cube in one I/O or a small number of I/Os, depending upon the desired accuracy. We present two I/O-efficient algorithms to construct the compact data cube for the important case of sparse high-dimensional arrays, which often arise in practice. The traditional histogram methods are infeasible for the massive high-dimensional data sets in OLAP applications. Previously developed wavelet techniques are efficient only for dense data. Our online query processing algorithm is very fast and capable of refining answers as the user demands more accuracy. Experiments on real data show that our method provides significantly more accurate results for typical OLAP aggregation queries than other efficient approximation techniques such as random sampling.
Incremental Computation and Maintenance of Temporal Aggregates
 Proc. of ICDE
, 2001
Cited by 75 (4 self)
We consider the problems of computing aggregation queries in temporal databases, and of maintaining materialized temporal aggregate views efficiently. The latter problem is particularly challenging since a single data update can cause aggregate results to change over the entire time line. We introduce a new index structure called the SB-tree, which incorporates features from both segment trees and B-trees. SB-trees support fast lookup of aggregate results based on time, and can be maintained efficiently when the data changes. We also extend the basic SB-tree index to handle cumulative (also called moving-window) aggregates. For materialized aggregate views in a temporal database or warehouse, we propose building and maintaining SB-tree indices instead of the views themselves.
Range Searching
, 1996
Cited by 70 (1 self)
Range searching is one of the central problems in computational geometry, because it arises in many applications and a wide variety of geometric problems can be formulated as a range-searching problem. A typical range-searching problem has the following form. Let S be a set of n points in R^d, and let R be a family of subsets; elements of R are called ranges. We wish to preprocess S into a data structure so that for a query range R, the points in S ∩ R can be reported or counted efficiently. Typical examples of ranges include rectangles, halfspaces, simplices, and balls. If we are only interested in answering a single query, it can be done in linear time, using linear space, by simply checking for each point p ∈ S whether p lies in the query range.
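The single-query baseline described in the last sentence can be sketched as a linear scan. The example below assumes axis-aligned rectangles as the range type (one of the examples the abstract lists); the function names are illustrative.

```python
def in_rectangle(point, lows, highs):
    """True if lows[i] <= point[i] <= highs[i] holds in every dimension."""
    return all(lo <= x <= hi for x, lo, hi in zip(point, lows, highs))


def range_report(points, lows, highs):
    """Report every point of S inside the query rectangle: O(n) time."""
    return [p for p in points if in_rectangle(p, lows, highs)]


def range_count(points, lows, highs):
    """Count the points of S inside the query rectangle: O(n) time."""
    return sum(1 for p in points if in_rectangle(p, lows, highs))
```

Preprocessing into a data structure pays off only when many queries are asked against the same point set; for a one-off query this scan is already optimal up to constants, since every point must be examined.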
Relative Prefix Sums: An Efficient Approach for Querying Dynamic OLAP Data Cubes
Cited by 33 (6 self)
Range sum queries on data cubes are a powerful tool for analysis. A range sum query applies an aggregation operation (e.g., SUM) over all selected cells in a data cube, where the selection is specified by providing ranges of values for numeric dimensions. Many application domains require that information provided by analysis tools be current or "near-current." Existing techniques for range sum queries on data cubes, however, can incur update costs on the order of the size of the data cube. Since the size of a data cube is exponential in the number of its dimensions, rebuilding the entire data cube can be very costly. We present an approach that achieves constant time range sum queries while constraining update costs. Our method reduces the overall complexity of the range sum problem.
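To make the query/update trade-off concrete, here is a sketch of the plain prefix-sum cube in two dimensions, i.e. the kind of existing technique the abstract contrasts itself with. It is not the relative prefix sum method itself, and the function names are illustrative. Queries take constant time, but a single cell update can rewrite a large part of the array.

```python
def build_prefix(cube):
    """prefix[r + 1][c + 1] holds the sum of cube[0..r][0..c]."""
    rows, cols = len(cube), len(cube[0])
    # One extra row and column of zeros avoids boundary checks.
    prefix = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            prefix[r + 1][c + 1] = (cube[r][c] + prefix[r][c + 1]
                                    + prefix[r + 1][c] - prefix[r][c])
    return prefix


def range_sum(prefix, r1, c1, r2, c2):
    """Sum of cells with r1 <= r <= r2 and c1 <= c <= c2, in O(1) time."""
    return (prefix[r2 + 1][c2 + 1] - prefix[r1][c2 + 1]
            - prefix[r2 + 1][c1] + prefix[r1][c1])


def update(prefix, r, c, delta):
    """Add delta to cell (r, c); return how many prefix entries were rewritten."""
    touched = 0
    for i in range(r + 1, len(prefix)):
        for j in range(c + 1, len(prefix[0])):
            prefix[i][j] += delta
            touched += 1
    return touched
```

Updating the cell at the origin rewrites every entry, which is the cost "on the order of the size of the data cube" that the abstract refers to.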
The R*a-tree: An improved R*-tree with Materialized Data for Supporting Range Queries on OLAP-Data
, 1998
Cited by 17 (0 self)
OLAP applications make use of fast indexes and materialization of data. Most research treats just one topic. Either the materialized values or the design of index structures are considered. This paper examines a possible combination of both techniques. The R*-tree is taken as an example of a multidimensional index structure. Aggregated data is stored in the inner nodes of the index structure in addition to the references to the successor nodes. We describe how this mechanism works in detail and present results of performance evaluation. 1 Introduction OLAP became an important application during the last few years. OLAP allows data to be modeled in a multidimensional way as a cube and viewed from many different perspectives. A typical query looks like: "Retrieve average price for 1000 <= custkey <= 2500, where part.type=sport car group by part.brand and supplier". Theoretical frameworks for multidimensional databases are described for example in [2] and [6]. There are severa...
Exploring Spatial Datasets with Histograms
 Proc. of ICDE
, 2001
Cited by 15 (0 self)
As online spatial datasets grow both in number and sophistication, it becomes increasingly difficult for users to decide whether a dataset is suitable for their tasks, especially when they do not have prior knowledge of the dataset. In this paper, we propose browsing as an effective and efficient way to explore the content of a spatial dataset. Browsing allows users to view the size of a result set before evaluating the query at the database, thereby avoiding zero-hit/mega-hit queries and saving time and resources. Although the underlying technique supporting browsing is similar to range query aggregation and selectivity estimation, spatial dataset browsing poses some unique challenges. In this paper, we identify a set of spatial relations that need to be supported in browsing applications, namely, the contains, contained and the overlap relations.
Flexible Data Cubes for Online Aggregation
 In Database Theory - ICDT 2001, 8th International Conference, London, UK, January 4-6, 2001, Proceedings, volume 1973 of Lecture Notes in Computer Science
, 2001
Cited by 13 (0 self)
Applications like Online Analytical Processing depend heavily on the ability to quickly summarize large amounts of information. Techniques were proposed recently that speed up aggregate range queries on MOLAP data cubes by storing precomputed aggregates. These approaches try to handle data cubes of any dimensionality by dealing with all dimensions at the same time and treat the different dimensions uniformly. The algorithms are typically complex, and it is difficult to prove their correctness and to analyze their performance. We present a new technique to generate Iterative Data Cubes (IDC) that addresses these problems. The proposed approach provides a modular framework for combining one-dimensional aggregation techniques to create space-optimal high-dimensional data cubes. A large variety of cost tradeoffs for high-dimensional IDC can be generated, making it easy to find the right configuration based on the application requirements. 1 Introduction Data cubes are used i...
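One way to picture "combining one-dimensional aggregation techniques" is the following two-dimensional sketch. It is an illustration under assumptions rather than the paper's construction: each dimension is given a hypothetical pair of functions, a builder that pre-aggregates a vector and a term generator that says which pre-aggregated entries (with which signs) answer a 1-D range. The builders are applied one dimension at a time, and a range query multiplies out the per-dimension terms.

```python
from itertools import product


def prefix_build(vec):
    """1-D prefix sums: O(1) query terms, expensive updates."""
    out, running = [], 0
    for v in vec:
        running += v
        out.append(running)
    return out


def prefix_terms(lo, hi):
    """Entries (index, sign) whose signed sum equals the range [lo, hi]."""
    terms = [(hi, 1)]
    if lo > 0:
        terms.append((lo - 1, -1))
    return terms


def raw_build(vec):
    """No pre-aggregation: cheap updates, O(range length) query terms."""
    return list(vec)


def raw_terms(lo, hi):
    return [(i, 1) for i in range(lo, hi + 1)]


def build_cube(cube, build_dim0, build_dim1):
    """Apply one 1-D technique along each dimension in turn."""
    step1 = [build_dim1(row) for row in cube]                # along dimension 1
    cols = [build_dim0(list(col)) for col in zip(*step1)]    # along dimension 0
    return [list(row) for row in zip(*cols)]


def range_sum(agg, terms_dim0, terms_dim1):
    """Combine the per-dimension terms; signs multiply across dimensions."""
    return sum(s0 * s1 * agg[r][c]
               for (r, s0), (c, s1) in product(terms_dim0, terms_dim1))
```

Choosing prefix sums in both dimensions reproduces the classic prefix-sum cube, while mixing prefix sums with raw storage gives a different query/update cost balance, which is the kind of per-dimension configurability the abstract describes.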
CRB-Tree: An Efficient Indexing Scheme for Range Aggregate Queries
 In Proc. International Conference on Database Theory
, 2003
Cited by 12 (1 self)
We propose a new indexing scheme, called the CRB-tree, for efficiently answering range-aggregate queries. The range-aggregate problem is defined as follows: Given a set of weighted points in R^d, compute the aggregate of weights of points that lie inside a d-dimensional query rectangle. In this paper we focus on COUNT, SUM, AVG aggregates. First, we develop an indexing scheme for answering two-dimensional range-COUNT queries that uses O(N/B) disk blocks and answers a query in O(log_B N) I/Os, where N is the number of input points and B is the disk block size. This is the first optimal index structure for the 2D range-COUNT problem. The index can be extended to obtain a near-linear-size indexing structure for answering range-SUM queries using O(log_B N) I/Os. We also obtain similar bounds for rectangle-intersection aggregate queries, in which the input is a set of weighted rectangles and a query asks to compute the aggregate of the weights of those input rectangles that overlap with the query rectangle. This result immediately improves a recent result on temporal-aggregate queries. Our indexing scheme can be dynamized and extended to higher dimensions. Finally, we demonstrate the practical efficiency of our index by comparing its performance against kdB-tree. For a dataset of around 100 million points, the CRB-tree query time is 8-10 times faster than the kdB-tree query time. Furthermore, unlike other indexing schemes, the query performance of CRB-tree is oblivious to the distribution of the input points and placement, shape and size of the query rectangle.