Maintaining Variance and k-Medians over Data Stream Windows
- In PODS, 2003
Cited by 60 (0 self)
The sliding window model is useful for discounting stale data in data stream applications. In this model, data elements arrive continually and only the most recent N elements are used when answering queries. We present a novel technique for solving two important and related problems in the sliding window model: maintaining variance and maintaining a k-median clustering. Our solution to the problem of maintaining variance provides a continually updated estimate of the variance of the last N values in a data stream with relative error of at most ε using O((1/ε^2) log N) memory. We present a constant-factor approximation algorithm which maintains an approximate k-median solution for the last N data points using O((k/τ^4) N^(2τ) log^2 N) memory, where τ < 1/2 is a parameter which trades off the space bound with the approximation factor of O(2^O(1/τ)).
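To make the problem statement concrete, here is a minimal Python sketch of the exact baseline that the abstract's approximation improves on: it keeps all of the last N values, so it uses O(N) memory rather than the sublinear bound quoted above. It is not the paper's algorithm. The class name `SlidingWindowVariance` and its methods are hypothetical, and the sketch computes the population variance.

```python
from collections import deque


class SlidingWindowVariance:
    """Exact population variance of the last n values (O(n) memory baseline).

    Hypothetical illustration only; the paper's method avoids storing the
    whole window.
    """

    def __init__(self, n):
        if n <= 0:
            raise ValueError("window size must be positive")
        self.n = n
        self.window = deque()
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, x):
        # Evict the oldest value first once the window is full.
        if len(self.window) == self.n:
            self._remove(self.window.popleft())
        self.window.append(x)
        k = len(self.window)
        delta = x - self.mean
        self.mean += delta / k
        self.m2 += delta * (x - self.mean)

    def _remove(self, x):
        # Called after x has already been popped, so k is the new size.
        k = len(self.window)
        if k == 0:
            self.mean, self.m2 = 0.0, 0.0
            return
        delta = x - self.mean
        self.mean -= delta / k
        self.m2 -= delta * (x - self.mean)

    def variance(self):
        k = len(self.window)
        # Clamp at zero to absorb floating-point round-off.
        return max(self.m2 / k, 0.0) if k else 0.0
```

Feeding a stream through `add` and calling `variance` at any point returns the variance of the most recent N values. The add and remove updates are the forward and reverse forms of Welford's recurrence, chosen here because they are numerically steadier than tracking a running sum of squares.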
Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance
- In Proceedings of the second SIAM conference on Data Mining, 2002
Cited by 22 (7 self)
With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms.
Analysis of Predictive Spatio-Temporal Queries
- TODS, 2003
Cited by 21 (5 self)
In this paper we present probabilistic cost models that estimate the selectivity of spatio-temporal window queries and joins, and the expected distance between a query and its nearest neighbor(s). Our models capture any query/object mobility combination (moving queries, moving objects or both) and any data type (points and rectangles) in arbitrary dimensionality. In addition, we develop specialized spatio-temporal histograms, which take into account both location and velocity information, and can be incrementally maintained. Extensive performance evaluation verifies that the proposed techniques produce highly accurate estimates on both uniform and non-uniform data.
Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Datasets
- IEEE Transactions on Knowledge and Data Engineering, 2003
Cited by 18 (1 self)
We investigate the use of biased sampling according to the density of the data set to speed up the operation of general data mining tasks, such as clustering and outlier detection in large multidimensional data sets. In density-biased sampling, the probability that a given point will be included in the sample depends on the local density of the data set. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest and can be tuned for specific data mining tasks. This allows great flexibility and improved accuracy of the results over simple random sampling. We describe our approach in detail, we analytically evaluate it, and show how it can be optimized for approximate clustering and outlier detection. Finally, we present...
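The core idea in this abstract, that a point's inclusion probability depends on the local density around it, can be illustrated with a short Python sketch. Everything specific here is an assumption made for illustration and not the authors' method: local density is approximated by counting points per grid cell, the bias is a power of that count controlled by a hypothetical `exponent` parameter, and the function name `density_biased_sample` is invented.

```python
import random
from collections import Counter


def density_biased_sample(points, cell_size, exponent, expected_size, seed=0):
    """Sample points with probability proportional to (local density) ** exponent.

    Assumptions (not from the source): density is the number of points in
    the same grid cell; exponent > 0 favors dense regions, exponent < 0
    favors sparse regions, exponent == 0 gives uniform sampling.
    """
    rng = random.Random(seed)

    def cell(p):
        # Map a point to the integer coordinates of its grid cell.
        return tuple(int(c // cell_size) for c in p)

    counts = Counter(cell(p) for p in points)
    weights = [counts[cell(p)] ** exponent for p in points]
    total = sum(weights)

    sample = []
    for p, w in zip(points, weights):
        # Scale so the expected sample size is about expected_size,
        # capping each probability at 1.
        prob = min(1.0, expected_size * w / total)
        if rng.random() < prob:
            sample.append(p)
    return sample
```

With a negative exponent, isolated points are kept with higher probability, which suits outlier detection; with a positive exponent, dense regions are oversampled, which suits finding cluster cores. A single pass over the data is enough once the cell counts are known.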
On scaling up balanced clustering algorithms
- In Proceedings of the SIAM International Conference on Data Mining, 2002
Cited by 13 (3 self)
Taxaminer: An experimentation framework for automated taxonomy bootstrapping
- International Journal of Web and Grid Services, Special Issue on Semantic Web and Mining Reasoning, 2005
Cited by 11 (3 self)
Hierarchical taxonomies and thesauri are frequently used by content management systems for indexing, search and categorization. They are also being viewed as rudimentary ontologies for the emerging Semantic Web infrastructure. However, to date, the development of taxonomies and thesauri has been a human-intensive process, requiring huge resources in terms of cost and time. It is critical that approaches to reduce human effort and resource commitments be investigated. Towards this end, we present an experimentation framework for automated taxonomy construction from a large corpus of documents. Our approach involves: (a) generation of a document cluster hierarchy; (b) extraction of a taxonomy from this hierarchy; and (c) assignment of labels to nodes in this taxonomy. We draw upon a suite of clustering and NLP techniques and identify parameters which form the basis of an experimentation framework. We also propose metrics to measure taxonomy quality and evaluate the impacts of these parameters on these quality metrics. The MEDLINE® database is used as the document corpus and the MeSH thesaurus as the gold standard. Insights from these experiments are presented and discussed.
An Efficient Approximation Scheme for Data Mining Tasks
, 2001
Cited by 10 (0 self)
We investigate the use of biased sampling according to the density of the dataset, to speed up the operation of general data mining tasks, such as clustering and outlier detection in large multi-dimensional datasets. In density-biased sampling, the probability that a given point will be included in the sample depends on the local density of the dataset. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest, and can be tuned for specific data mining tasks. This allows great flexibility, and improved accuracy of the results over simple random sampling. We describe our approach in detail, we analytically evaluate it, and show how it can be optimized for approximate clustering and outlier detection. Finally we present a thorough experimental evaluation of the proposed method, applying density-biased sampling on real and synthetic data sets, and employing clustering and outlier detection algorithms, thus highlighti...
FINDIT: a Fast and Intelligent Subspace Clustering Algorithm using Dimension Voting
- PhD thesis, Korea Advanced Institute of Science and Technology, 2002
Cited by 10 (1 self)
The aim of this paper is to present a novel subspace clustering method named FINDIT. Clustering is the process of finding interesting patterns residing in the dataset by grouping similar data objects and separating them from dissimilar ones based on their dimensional values. Subspace clustering is a new area of clustering which achieves the clustering goal in high dimensions by allowing clusters to be formed with their own correlated dimensions. In subspace clustering, selecting the correct dimensions is very important because the distance between points changes easily according to the selected dimensions. However, selecting dimensions correctly is difficult, because data grouping and dimension selection should be performed simultaneously. FINDIT determines the correlated dimensions for each cluster based on two key ideas: a dimension-oriented distance measure which fully utilizes dimensional difference information, and a dimension voting policy which determines important dimensions in a probabilistic way based on the V nearest neighbors' information. Through various experiments on synthetic data, FINDIT is shown to be very successful in the high-dimensional clustering problem. FINDIT satisfies most requirements for good clustering methods, such as accuracy of results, robustness to noise and cluster density, and scalability to the dataset size and the dimensionality. Moreover, it is gracefully scalable to full dimension without any modification to the algorithm.
The powermethod: A comprehensive estimation technique for multi-dimensional queries
- In CIKM, 2003
Cited by 9 (0 self)
Existing estimation approaches for multi-dimensional databases often rely on the assumption that data distribution in a small region is uniform, which seldom holds in practice. Moreover, their applicability is limited to specific estimation tasks under a certain distance metric. This paper develops the Power-method, a comprehensive technique applicable to a wide range of query optimization problems under various metrics. The Power-method eliminates the local uniformity assumption and is accurate even in scenarios where existing approaches completely fail. Furthermore, it performs estimation by evaluating only one simple formula with minimal computational overhead. Extensive experiments confirm that the Power-method outperforms previous techniques in terms of accuracy and applicability to various optimization scenarios.

