Results 1–10 of 11
Navigating nets: Simple algorithms for proximity search (Extended Abstract)
2004
Cited by 126 (12 self)
Robert Krauthgamer and James R. Lee. We present a simple deterministic data structure for maintaining a set S of points in a general metric space, while supporting proximity search (nearest-neighbor and range queries) and updates to S (insertions and deletions). Our data structure consists of a sequence of progressively finer r-nets of S, with pointers that allow us to navigate easily from one scale to the next.
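The hierarchy of progressively finer nets lends itself to a coarse-to-fine nearest-neighbor search. The following is a minimal, unoptimized sketch of that idea; the function names and the radius-halving schedule are illustrative, not the paper's actual data structure, which maintains the nets with cross-scale pointers instead of rebuilding them:

```python
def build_net(points, r, dist):
    """Greedily pick an r-net: every point lies within r of some center,
    and centers are pairwise more than r apart."""
    centers = []
    for p in points:
        if all(dist(p, c) > r for c in centers):
            centers.append(p)
    return centers

def navigating_nn(points, q, dist):
    """Coarse-to-fine nearest-neighbor search over progressively finer nets.
    At each scale we rebuild the net and prune points that cannot beat the
    best candidate seen so far (with a 2r safety margin)."""
    diam = max(dist(a, b) for a in points for b in points) or 1.0
    r = diam
    candidates = list(points)
    while r > 1e-9 and len(candidates) > 1:
        net = build_net(candidates, r, dist)
        best = min(dist(q, c) for c in net)
        # net centers are candidates, so the true nearest neighbor is at
        # distance <= best; keeping everything within best + 2r is safe
        candidates = [p for p in candidates if dist(q, p) <= best + 2 * r]
        r /= 2
    return min(candidates, key=lambda p: dist(q, p))
```

Rebuilding the net at every scale makes this quadratic; the point of the paper's pointer structure is to avoid exactly that rebuild.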
Top-k selection queries over relational databases: Mapping strategies and performance evaluation
TODS, 2002
Cited by 102 (6 self)
In many applications, users specify target values for certain attributes, without requiring exact matches to these values in return. Instead, the result to such queries is typically a ranking of the “top k” tuples that best match the given attribute values. In this paper, we study the advantages and limitations of processing a top-k query by translating it into a single range query that a traditional relational database management system (RDBMS) can process efficiently. In particular, we study how to determine a range query to evaluate a top-k query by exploiting the statistics available to an RDBMS, and the impact of the quality of these statistics on the retrieval efficiency of the resulting scheme. We also report the first experimental evaluation of the mapping strategies over a real RDBMS, namely Microsoft’s SQL Server 7.0. The experiments show that our new techniques are robust and significantly more efficient than previously known strategies requiring at least one sequential scan of the data sets.
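The core of the mapping, retrying ("restarting") with a wider range until at least k tuples qualify, can be sketched without a real RDBMS. Here a list comprehension stands in for the SQL range query the paper would send to the database; the names and the doubling schedule are illustrative:

```python
def topk_via_range(rows, target, k, score=lambda v, t: abs(v - t)):
    """Sketch of mapping a top-k query to a range query: guess a width d,
    fetch rows with |v - target| <= d (the range query an RDBMS could
    evaluate with an index), and widen (a 'restart') until k rows match."""
    d = 1.0
    while True:
        hits = [v for v in rows if abs(v - target) <= d]
        # stop once we have k matches, or the range already covers all rows
        if len(hits) >= k or d > max(abs(v - target) for v in rows):
            return sorted(hits, key=lambda v: score(v, target))[:k]
        d *= 2  # restart with a wider range
```

Every restart rescans in this sketch; the paper's point is to choose d well the first time so restarts are rare.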
Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data
In SIGMOD, 2001
Cited by 61 (11 self)
Approximating the joint data distribution of a multi-dimensional data set through a compact and accurate histogram synopsis is a fundamental problem arising in numerous practical scenarios, including query optimization and approximate query answering. Existing solutions either rely on simplistic independence assumptions or try to directly approximate the full joint data distribution over the complete set of attributes. Unfortunately, both approaches are doomed to fail for high-dimensional data sets with complex correlation patterns between attributes. In this paper, we propose a novel approach to histogram-based synopses that employs the solid foundation of statistical interaction models to explicitly identify and exploit the statistical characteristics of the data. Abstractly, our key idea is to break the synopsis into (1) a statistical interaction model that accurately captures significant correlation and independence patterns in the data, and (2) a collection of histograms on low-dimensional marginals that, based on the model, can provide accurate approximations of the overall joint data distribution. Extensive experimental results with several real-life data sets verify the effectiveness of our approach. An important aspect of our general, model-based methodology is that it can be used to enhance the performance of other synopsis techniques that are based on data-space partitioning (e.g., wavelets) by providing an effective tool to deal with the “dimensionality curse”.
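The key idea, a model that licenses factoring the joint distribution into low-dimensional marginals, can be illustrated with a toy three-attribute example. The "model" here is simply the assumption that attribute 2 is independent of attributes 0 and 1; the paper's statistical interaction models are far more general:

```python
from collections import Counter

def marginal(data, dims):
    """Frequency histogram over a subset of attribute positions."""
    n = len(data)
    counts = Counter(tuple(row[d] for d in dims) for row in data)
    return {k: v / n for k, v in counts.items()}

def model_estimate(data, query):
    """Dependency-based synopsis sketch: store only the (0,1) marginal and
    the (2) marginal, and estimate the 3-attribute joint as their product."""
    h01 = marginal(data, (0, 1))
    h2 = marginal(data, (2,))
    a, b, c = query
    return h01.get((a, b), 0.0) * h2.get((c,), 0.0)
```

When the independence assumption actually holds in the data, the product of marginals reproduces the joint frequency exactly; the synopsis stores two small histograms instead of one exponentially large one.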
Using the Fractal Dimension to Cluster Datasets
In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000
Cited by 39 (4 self)
Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to scale well with the size of the data set and the number of dimensions that describe the points, to find clusters of arbitrary shape, or to deal effectively with the presence of noise. In this paper, we present a new clustering algorithm based on the fractal properties of the data sets. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will (providing the best answer possible at that point), and is incremental. We show via experiments that FC effectively deals with large data sets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.
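FC's incremental step can be sketched with a box-counting estimate of the fractal dimension. The scales, the least-squares slope fit, and the helper names below are illustrative choices, not the paper's exact implementation:

```python
import math

def box_counting_dim(points, scales=(1.0, 0.5, 0.25)):
    """Estimate the box-counting fractal dimension as the slope of
    log N(eps) versus log(1/eps), where N(eps) is the number of occupied
    grid cells of side eps."""
    xs, ys = [], []
    for eps in scales:
        cells = {tuple(int(c // eps) for c in p) for p in points}
        xs.append(math.log(1.0 / eps))
        ys.append(math.log(len(cells)))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    denom = sum((x - mx) ** 2 for x in xs)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / denom if denom else 0.0

def fc_assign(clusters, p):
    """FC's incremental step (sketch): put p in the cluster whose fractal
    dimension changes least when p is added; returns the cluster index."""
    def change(cluster):
        return abs(box_counting_dim(cluster + [p]) - box_counting_dim(cluster))
    best = min(range(len(clusters)), key=lambda i: change(clusters[i]))
    clusters[best].append(p)
    return best
```

A point extending a roughly one-dimensional cluster barely moves that cluster's dimension but noticeably distorts a two-dimensional one, so it is assigned to the former, which is the self-similarity intuition in the abstract.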
SASH: A Self-Adaptive Histogram Set for Dynamically Changing Workloads
2003
Cited by 14 (0 self)
Most RDBMSs maintain a set of histograms for estimating the selectivities of given queries.
Performance of Multi-attribute Top-k Queries on Relational Systems
2000
Cited by 5 (1 self)
In many applications, users specify target values for the attributes of a relation, and expect in return the k tuples that best match these values. Traditional RDBMSs do not process these "top-k queries" efficiently. In our previous work, we outlined a family of strategies to map a top-k query into a traditional selection query that an RDBMS can process efficiently. The goal of such mapping strategies is to get all needed tuples (but minimize the number of retrieved tuples) and thus avoid "restarts" to get additional tuples. Unfortunately, no single mapping strategy performed consistently best under all data distributions. In this paper, we develop a novel mapping technique that leverages information about the data distribution and adapts itself to the local characteristics of the data and the histograms available to do the mapping. We also report the first experimental evaluation of the new and old mapping strategies over a real RDBMS, namely Microsoft's SQL Server 7.0. The experiments show that our new techniques are robust and significantly more efficient than previously known strategies requiring at least one sequential scan of the data sets.
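One way a histogram can drive the mapping is to grow a symmetric range around the target until the histogram's estimated tuple count reaches k. This sketch assumes an equi-width histogram given as (low, high, count) buckets and uniformity within each bucket; the paper's adaptive strategies are more refined:

```python
def range_from_histogram(buckets, target, k):
    """Pick a symmetric range around `target` whose estimated tuple count
    (from the histogram, assuming uniformity within buckets) reaches k.
    Returns the (low, high) bounds for the range query."""
    def estimate(d):
        lo, hi = target - d, target + d
        total = 0.0
        for b_lo, b_hi, cnt in buckets:
            overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
            total += cnt * overlap / (b_hi - b_lo)
        return total
    d = buckets[0][1] - buckets[0][0]          # start at one bucket width
    span = buckets[-1][1] - buckets[0][0]      # never widen past the domain
    while estimate(d) < k and d < span:
        d *= 2
    return target - d, target + d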
Using self-similarity to cluster large data sets
Data Mining and Knowledge Discovery, 2003
Cited by 4 (1 self)
Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to scale well with the size of the data set and the number of dimensions that describe the points, to find clusters of arbitrary shape, or to deal effectively with the presence of noise. In this paper, we present a new clustering algorithm based on self-similarity properties of the data sets. Self-similarity is the property of being invariant with respect to the scale used to look at the data set. While fractals are self-similar at every scale used to look at them, many data sets exhibit self-similarity over a range of scales. Self-similarity can be measured using the fractal dimension. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will (providing the best answer possible at that point), and is incremental. We show via experiments that FC effectively deals with large data sets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.
Chaotic Mining: Knowledge Discovery Using the Fractal Dimension (Extended Abstract)
Cited by 3 (0 self)
Daniel Barbará (George Mason University, Information and Software Engineering Department, Fairfax, VA 22303; dbarbara@gmu.edu). 1 Introduction. Nature is filled with examples of phenomena that exhibit seemingly chaotic behavior, such as air turbulence, forest fires, and the like. However, under this behavior it is almost always possible to find self-similarity, i.e., an invariance with respect to the scale used. The structures that appear as a consequence of self-similarity are known as fractals [12]. Fractals have been used in numerous disciplines (for a good coverage of the topic of fractals and their applications, see [14]). In the database arena, fractals have been successfully used to analyze R-trees [6], Quadtrees [5], model distributions of data [7], and selectivity estimation [3]. Fractal sets are characterized by their fractal dimension. In truth, there exists an infinite family of fractal dimensions. By embedding the dataset in an n-dimensional grid whose cells have sides of si...
Tracking Clusters in Evolving Data Sets
Cited by 2 (1 self)
As organizations accumulate data over time, the problem of tracking how patterns evolve becomes important. In this paper, we present an algorithm to track the evolution of cluster models in a stream of data. Our algorithm is based on the application of bounds derived using Chernoff's inequality and makes use of a clustering algorithm that was previously developed by us, namely Fractal Clustering, which uses self-similarity as the property to group points together. Experiments show that our tracking algorithm is efficient and effective in finding changes in the patterns.
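A Chernoff/Hoeffding-style bound gives the flavor of such a tracking test: with n points, an empirical fraction that strays from the model's prediction by more than the bound is unlikely under "no change". This is an illustrative sketch; the paper's exact statistic may differ:

```python
import math

def chernoff_radius(n, delta=0.05):
    """Hoeffding/Chernoff-style bound: with probability >= 1 - delta, the
    empirical mean of n samples in [0, 1] lies within this radius of the
    true mean."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def pattern_changed(expected_frac, observed_count, n, delta=0.05):
    """Flag a change when the observed fraction of stream points falling in
    a cluster deviates from the model by more than the bound allows."""
    return abs(observed_count / n - expected_frac) > chernoff_radius(n, delta)
```

The bound shrinks as 1/sqrt(n), so larger windows of the stream detect smaller drifts in the cluster model.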
Fractal Mining: Self-Similarity-Based Clustering and Its Applications
Cited by 1 (0 self)
Self-similarity is the property of being invariant with respect to the scale used to look at the data set. While fractals are self-similar at every scale used to look at them, many data sets exhibit self-similarity over a range of scales. Self-similarity can be measured using the fractal dimension. The fractal dimension is an important characteristic of many complex systems and can serve as a powerful representation technique. In this chapter, we present a new clustering algorithm based on self-similarity properties of the data sets, and also its applications to other fields in data mining, such as projected clustering and trend analysis. Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to scale well with the size of the data set and the number of dimensions that describe the points, to find clusters of arbitrary shape, or to deal effectively with the presence of noise. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will (providing the best answer possible at that point), and is incremental. We show via experiments that FC effectively deals with large data sets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.