Navigating nets: Simple algorithms for proximity search (Extended Abstract)
2004
Cited by 105 (9 self)
Robert Krauthgamer, James R. Lee
We present a simple deterministic data structure for maintaining a set S of points in a general metric space, while supporting proximity search (nearest neighbor and range queries) and updates to S (insertions and deletions). Our data structure consists of a sequence of progressively finer ε-nets of S, with pointers that allow us to navigate easily from one scale to the next.
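To make the idea of "progressively finer nets" concrete, here is a minimal sketch of a greedy net construction in Python. It is an illustration only, not the paper's data structure: the function names, the sample points and the choice of radii are assumptions, and the navigation pointers and dynamic updates described in the abstract are omitted. A net at radius r is taken here to be a subset whose members are pairwise at least r apart and which leaves every input point closer than r to some member.

```python
import math

def greedy_net(points, r, dist):
    """Pick a subset whose members are pairwise >= r apart.

    Every rejected point is closer than r to some chosen point,
    so the result both packs and covers the input at scale r.
    """
    net = []
    for p in points:
        if all(dist(p, y) >= r for y in net):
            net.append(p)
    return net

def net_hierarchy(points, radii, dist):
    """Build nets from the finest radius to the coarsest.

    Each coarser net is drawn from the next finer one, so the levels are nested.
    """
    levels = {}
    current = list(points)
    for r in sorted(radii):
        current = greedy_net(current, r, dist)
        levels[r] = current
    return levels

# Hypothetical sample data in the Euclidean plane.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (10, 0)]
levels = net_hierarchy(points, [1, 2, 4, 8], math.dist)
for r in sorted(levels):
    print(r, levels[r])
```

A search would start at the coarsest level and descend, at each step keeping only the net points close enough to the query; that part is not shown here.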
Top-k selection queries over relational databases: Mapping strategies and performance evaluation
TODS, 2002
Cited by 82 (6 self)
In many applications, users specify target values for certain attributes, without requiring exact matches to these values in return. Instead, the result of such queries is typically a ranking of the “top-k” tuples that best match the given attribute values. In this paper, we study the advantages and limitations of processing a top-k query by translating it into a single range query that a traditional relational database management system (RDBMS) can process efficiently. In particular, we study how to determine a range query to evaluate a top-k query by exploiting the statistics available to an RDBMS, and the impact of the quality of these statistics on the retrieval efficiency of the resulting scheme. We also report the first experimental evaluation of the mapping strategies over a real RDBMS, namely over Microsoft’s SQL Server 7.0. The experiments show that our new techniques are robust and significantly more efficient than previously known strategies requiring at least one sequential scan of the data sets.
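The mapping from a top-k query to a range query can be sketched as follows. This is a hedged illustration using Python's built-in sqlite3 module, not the paper's method: the table name `pts`, the starting half-width and the doubling restart policy are all assumptions, whereas the paper chooses the range from the histograms the RDBMS maintains. The sketch does show why a too-small range forces a "restart": the answer is only safe once k tuples lie within the half-width of the box.

```python
import math
import sqlite3

def top_k(conn, target, k, width=1.0):
    """Return the k tuples nearest to target (Euclidean) using range queries."""
    tx, ty = target
    total = conn.execute("SELECT COUNT(*) FROM pts").fetchone()[0]
    k = min(k, total)
    while True:
        rows = conn.execute(
            "SELECT x, y FROM pts WHERE x BETWEEN ? AND ? AND y BETWEEN ? AND ?",
            (tx - width, tx + width, ty - width, ty + width),
        ).fetchall()
        ranked = sorted(rows, key=lambda row: math.dist(row, target))
        # Every tuple outside the box is farther than `width` from the target,
        # so k hits within `width` are guaranteed to be the true top k.
        if k == 0 or (len(ranked) >= k and math.dist(ranked[k - 1], target) <= width):
            return ranked[:k]
        width *= 2  # restart with a wider range (assumed policy)

# Hypothetical sample relation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pts (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO pts VALUES (?, ?)",
    [(0, 0), (1, 1), (2, 2), (5, 5), (9, 9)],
)
result = top_k(conn, (1.0, 1.0), 3)
print(result)
```

With the sample data the first range already returns three tuples, but two of them lie farther than the half-width from the target, so one restart is needed before the answer can be trusted.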
Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data
In SIGMOD, 2001
Cited by 57 (10 self)
Approximating the joint data distribution of a multi-dimensional data set through a compact and accurate histogram synopsis is a fundamental problem arising in numerous practical scenarios, including query optimization and approximate query answering. Existing solutions either rely on simplistic independence assumptions or try to directly approximate the full joint data distribution over the complete set of attributes. Unfortunately, both approaches are doomed to fail for high-dimensional data sets with complex correlation patterns between attributes. In this paper, we propose a novel approach to histogram-based synopses that employs the solid foundation of statistical interaction models to explicitly identify and exploit the statistical characteristics of the data. Abstractly, our key idea is to break the synopsis into (1) a statistical interaction model that accurately captures significant correlation and independence patterns in data, and (2) a collection of histograms on low-dimensional marginals that, based on the model, can provide accurate approximations of the overall joint data distribution. Extensive experimental results with several real-life data sets verify the effectiveness of our approach. An important aspect of our general, model-based methodology is that it can be used to enhance the performance of other synopsis techniques that are based on data-space partitioning (e.g., wavelets) by providing an effective tool to deal with the “dimensionality curse”.
Using the Fractal Dimension to Cluster Datasets
In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000
Cited by 30 (4 self)
Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to scale well with the size of the data set and the number of dimensions that describe the points, to find clusters of arbitrary shape, or to deal effectively with the presence of noise. In this paper, we present a new clustering algorithm based on the fractal properties of the data sets. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will, providing the best answer possible at that point, and is incremental. We show via experiments that FC effectively deals with large data sets, high dimensionality and noise, and is capable of recognizing clusters of arbitrary shape.
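FC depends on being able to estimate the fractal dimension of a set of points. One common estimator is box counting: overlay grids of decreasing cell size r, count the occupied cells N(r), and take the slope of log N(r) against log(1/r). The sketch below illustrates only that estimator. The function name, the cell sizes and the sample data are assumptions, and the incremental cluster assignment performed by FC itself is not reproduced.

```python
import math

def box_counting_dimension(points, cell_sizes):
    """Estimate the box-counting dimension by a least-squares slope.

    For each cell size r, count the grid cells that contain at least one
    point, then fit log(count) against log(1/r).
    """
    xs, ys = [], []
    for r in cell_sizes:
        occupied = {tuple(math.floor(c / r) for c in p) for p in points}
        xs.append(math.log(1.0 / r))
        ys.append(math.log(len(occupied)))
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical data: a line segment and a filled square in the unit square.
line = [(i / 1024, 0.0) for i in range(1024)]
square = [(i / 64, j / 64) for i in range(64) for j in range(64)]

line_dim = box_counting_dimension(line, [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 32, 1 / 64])
square_dim = box_counting_dimension(square, [1 / 2, 1 / 4, 1 / 8, 1 / 16])
print(line_dim, square_dim)
```

The line segment comes out with dimension close to 1 and the filled square close to 2, which is the behaviour the estimator is expected to have on these simple shapes. On real data the cell sizes must stay within the range of scales over which the set is actually self-similar.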
SASH: A Self-Adaptive Histogram Set for Dynamically Changing Workloads
2003
Cited by 14 (0 self)
Most RDBMSs maintain a set of histograms for estimating the selectivities of given queries.
Performance of Multiattribute Top-K Queries on Relational Systems
2000
Cited by 5 (1 self)
In many applications, users specify target values for the attributes of a relation, and expect in return the k tuples that best match these values. Traditional RDBMSs do not process these "top-k queries" efficiently. In our previous work, we outlined a family of strategies to map a top-k query into a traditional selection query that an RDBMS can process efficiently. The goal of such mapping strategies is to get all needed tuples (but minimize the number of retrieved tuples) and thus avoid "restarts" to get additional tuples. Unfortunately, no single mapping strategy performed consistently best under all data distributions. In this paper, we develop a novel mapping technique that leverages information about the data distribution and adapts itself to the local characteristics of the data and the histograms available to do the mapping. We also report the first experimental evaluation of the new and old mapping strategies over a real RDBMS, namely over Microsoft's SQL Server 7.0. The experiments show that our new techniques are robust and significantly more efficient than previously known strategies requiring at least one sequential scan of the data sets.
Chaotic Mining: Knowledge Discovery Using the Fractal Dimension (Extended Abstract)
Cited by 3 (0 self)
Daniel Barbará, George Mason University, Information and Software Engineering Department, Fairfax, VA 22303, dbarbara@gmu.edu
1 Introduction
Nature is filled with examples of phenomena that exhibit seemingly chaotic behavior, such as air turbulence, forest fires and the like. However, under this behavior it is almost always possible to find self-similarity, i.e. an invariance with respect to the scale used. The structures that appear as a consequence of self-similarity are known as fractals [12]. Fractals have been used in numerous disciplines (for a good coverage of the topic of fractals and their applications see [14]). In the database arena, fractals have been successfully used to analyze R-trees [6] and Quadtrees [5], to model distributions of data [7], and for selectivity estimation [3]. Fractal sets are characterized by their fractal dimension. In truth, there exists an infinite family of fractal dimensions. By embedding the dataset in an n-dimensional grid whose cells have sides of si...
Tracking Clusters in Evolving Data Sets
Cited by 2 (1 self)
As organizations accumulate data over time, the problem of tracking how patterns evolve becomes important. In this paper, we present an algorithm to track the evolution of cluster models in a stream of data. Our algorithm is based on the application of bounds derived using Chernoff's inequality and makes use of a clustering algorithm that was previously developed by us, namely Fractal Clustering, which uses self-similarity as the property to group points together. Experiments show that our tracking algorithm is efficient and effective in finding changes in the patterns.
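The abstract does not state the bounds it derives, so the following is only a generic illustration of how a Chernoff-type inequality turns into a sample-size rule. It uses the additive Chernoff-Hoeffding form, P(|p_hat - p| >= eps) <= 2 exp(-2 n eps^2), for the observed fraction p_hat of n independent points that fall into a given cluster. The function names and the example values of eps and delta are assumptions, not taken from the paper.

```python
import math

def hoeffding_tail(n, eps):
    """Upper bound on P(|p_hat - p| >= eps) after n independent points."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2))

def hoeffding_sample_size(eps, delta):
    """Smallest n for which the bound above is at most delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# Assumed example: estimate a cluster's share of the stream to within 0.1
# with probability at least 0.95.
n = hoeffding_sample_size(eps=0.1, delta=0.05)
print(n, hoeffding_tail(n, 0.1))
```

A tracking procedure built on such a rule could treat a window of n points as large enough to judge whether the share of points landing in each cluster has shifted by more than eps.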
Chaotic Mining: Knowledge Discovery Using the Fractal
1999
Daniel Barbará, George Mason University, Information and Software Engineering Department, Fairfax, VA 22303, dbarbara@gmu.edu, March 22, 1999
1 Introduction
Nature is filled with examples of phenomena that exhibit seemingly chaotic behavior, such as air turbulence, forest fires and the like. However, under this behavior it is almost always possible to find self-similarity, i.e. an invariance with respect to the scale used. The structures that appear as a consequence of self-similarity are known as fractals [12]. Fractals have been used in numerous disciplines (for a good coverage of the topic of fractals and their applications see [14]). In the database arena, fractals have been successfully used to analyze R-trees [6] and Quadtrees [5], to model distributions of data [7], and for selectivity estimation [3]. Fractal sets are characterized by their fractal dimension. In truth, there exists an infinite family of fractal dimensions. By embedding the dataset in an n-dimensional grid which cell...
FRACTAL MINING -- Self Similarity-based Clustering and its Applications
Self-similarity is the property of being invariant with respect to the scale used to look at the data set. While fractals are self-similar at every scale used to look at them, many data sets exhibit self-similarity over a range of scales. Self-similarity can be measured using the fractal dimension. Fractal dimension is an important characteristic of many complex systems and can serve as a powerful representation technique. In this chapter, we present a new clustering algorithm, based on self-similarity properties of the data sets, and also its applications to other fields in data mining, such as projected clustering and trend analysis. Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to scale well with the size of the data set and the number of dimensions that describe the points, to find clusters of arbitrary shape, or to deal effectively with the presence of noise. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will, providing the best answer possible at that point, and is incremental. We show via experiments that FC effectively deals with large data sets, high dimensionality and noise, and is capable of recognizing clusters of arbitrary shape.

