Results 11–20 of 230
Injecting utility into anonymized datasets
In SIGMOD, 2006
Cited by 95 (4 self)
Abstract:
Limiting disclosure in data publishing requires a careful balance between privacy and utility. Information about individuals must not be revealed, but a dataset should still be useful for studying the characteristics of a population. Privacy requirements such as k-anonymity and ℓ-diversity are designed to thwart attacks that attempt to identify individuals in the data and to discover their sensitive information. The utility of such data, on the other hand, has not been well studied. In this paper we discuss the shortcomings of current heuristic approaches to measuring utility and introduce a formal utility metric. Armed with this metric, we show how to inject additional information into k-anonymous and ℓ-diverse tables. This information has an intuitive semantic meaning, increases utility beyond what is possible in the original k-anonymity and ℓ-diversity frameworks, and maintains their privacy guarantees.
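The k-anonymity requirement named in the abstract can be checked mechanically: every combination of quasi-identifier values must occur at least k times in the released table. A minimal sketch; the toy table, column names, and choice of quasi-identifiers are illustrative, not from the paper:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """True iff every quasi-identifier value combination appears >= k times."""
    counts = Counter(tuple(row[a] for a in quasi_ids) for row in rows)
    return all(c >= k for c in counts.values())

# hypothetical generalized table: ages bucketed, zip codes truncated
table = [
    {"age": "30-39", "zip": "148**", "disease": "flu"},
    {"age": "30-39", "zip": "148**", "disease": "cold"},
    {"age": "40-49", "zip": "130**", "disease": "flu"},
    {"age": "40-49", "zip": "130**", "disease": "asthma"},
]
print(is_k_anonymous(table, ["age", "zip"], 2))   # True: each group has size 2
print(is_k_anonymous(table, ["age", "zip"], 3))   # False
```

The paper's contribution sits on top of such a table: injecting extra information without breaking the guarantee this predicate verifies.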
An investigation of practical approximate nearest neighbor algorithms
2004
Cited by 79 (2 self)
Abstract:
This paper concerns approximate nearest neighbor searching algorithms, which have become increasingly important, especially in high-dimensional perception areas such as computer vision, with dozens of publications in recent years. Much of this enthusiasm is due to a successful new approximate nearest neighbor approach called Locality Sensitive Hashing (LSH). In this paper we ask: can earlier spatial data structures for exact nearest neighbor search, such as metric trees, be altered to provide approximate answers to proximity queries, and if so, how? We introduce a new kind of metric tree that allows overlap: certain data points may appear in both children of a parent. We also introduce new approximate k-NN search algorithms on this structure. We show why these structures should be able to exploit the same random-projection-based approximations that LSH enjoys, but with a simpler algorithm and perhaps greater efficiency. We then provide a detailed empirical evaluation on five large, high-dimensional datasets, which shows up to 31-fold accelerations over LSH. This result holds throughout the spectrum of approximation levels.
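The overlap idea can be sketched directly: at each split, points within a buffer τ of the split boundary are copied into both children, so a single root-to-leaf descent (no backtracking) still finds near-boundary neighbors. The split rule and names below are illustrative, not the paper's exact construction:

```python
import numpy as np

def build_spill_tree(points, tau, leaf_size=8):
    """Split on the widest dimension at the median; points within tau of the
    boundary land in BOTH children (the 'overlap' in the abstract)."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    dim = int(np.argmax(points.max(0) - points.min(0)))
    split = float(np.median(points[:, dim]))
    left = points[points[:, dim] <= split + tau]
    right = points[points[:, dim] > split - tau]
    if len(left) == len(points) or len(right) == len(points):
        return {"leaf": points}          # overlap swallowed a side; stop here
    return {"dim": dim, "split": split,
            "left": build_spill_tree(left, tau, leaf_size),
            "right": build_spill_tree(right, tau, leaf_size)}

def defeatist_nn(tree, q):
    """Descend a single branch; the duplicated boundary points compensate
    for never backtracking."""
    while "leaf" not in tree:
        tree = tree["left"] if q[tree["dim"]] <= tree["split"] else tree["right"]
    pts = tree["leaf"]
    return pts[np.argmin(((pts - q) ** 2).sum(1))]

rng = np.random.default_rng(1)
pts = rng.standard_normal((200, 3))
tree = build_spill_tree(pts, tau=0.1)
print(defeatist_nn(tree, pts[17]))       # a stored point queries back to itself
```

Larger τ trades memory (more duplication) for accuracy of the single-descent search.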
Clustering of Time Series Subsequences is Meaningless: Implications for Past and Future Research
In Proc. of the 3rd IEEE International Conference on Data Mining, 2003
Cited by 78 (15 self)
Abstract:
Time series data is perhaps the most frequently encountered type of data examined by the data mining community, and clustering is perhaps the most frequently used data mining algorithm, being useful in its own right as an exploratory technique and as a subroutine in more complex algorithms such as rule discovery, indexing, summarization, anomaly detection, and classification. Given these two facts, it is hardly surprising that time series clustering has attracted much attention. The data to be clustered can be in one of two formats: many individual time series, or a single time series from which individual time series are extracted with a sliding window. Given the recent explosion of interest in streaming data and online algorithms, the latter case has received much attention. In this work we make a surprising claim: clustering of streaming time series is completely meaningless. More concretely, clusters extracted from streaming time series are forced to obey a constraint that is pathologically unlikely to be satisfied by any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random. While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has never appeared in the literature. We can justify calling our claim surprising, since it invalidates the contribution of dozens of previously published papers. We justify our claim with a theorem, illustrative examples, and a comprehensive set of experiments on reimplementations of previous work. Although the primary contribution of this work is to draw attention to the fact that an apparent solution to an important problem is incorrect and should no longer be used, we also introduce a novel method which, based on the concept of time series motifs, is able to meaningfully cluster some streaming time series datasets.
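The sliding-window extraction at the heart of the claim is just a loop over overlapping slices. A small sketch; the near-flat pointwise mean of the windows shown at the end is one symptom, for this example, of the kind of constraint the paper proves on subsequence clusters (this demo is illustrative, not the paper's theorem):

```python
import numpy as np

def sliding_windows(series, w):
    """All length-w subsequences of a 1-d series (the 'single time series' case)."""
    return np.array([series[i:i + w] for i in range(len(series) - w + 1)])

t = np.arange(100)
series = np.sin(0.3 * t)
subs = sliding_windows(series, w=16)
print(subs.shape)                          # (85, 16): 100 - 16 + 1 windows
print(float(np.ptp(subs.mean(axis=0))))   # tiny: the mean subsequence is nearly flat
```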
δ-Clusters: Capturing Subspace Correlation in a Large Data Set
In Proc. of the 18th IEEE Intern. Conf. on Data Engineering, 2002
Cited by 74 (4 self)
Abstract:
Clustering has been an active research area of great practical importance in recent years. Most previous clustering models have focused on grouping objects with similar values on a (sub)set of dimensions (e.g., subspace clusters) and assumed that every object has an associated value on every dimension (e.g., biclusters). These existing cluster models may not always be adequate for capturing the coherence exhibited among objects: strong coherence may still exist among a set of objects (on a subset of attributes) even if they take quite different values on each attribute and the attribute values are not fully specified. This is very common in many applications, including bioinformatics and collaborative filtering analysis, where the data may be incomplete and subject to biases. In bioinformatics, the bicluster model has recently been proposed to capture coherence among a subset of the attributes. Here we introduce a more general model, referred to as the δ-cluster model, to capture coherence exhibited by a subset of objects on a subset of attributes while allowing absent attribute values. A move-based algorithm (FLOC) is devised to efficiently produce near-optimal clustering results. The δ-cluster model subsumes the bicluster model as a special case, and the FLOC algorithm performs far better than the bicluster algorithm. We demonstrate the correctness and efficiency of the δ-cluster model and the FLOC algorithm on a number of real and synthetic data sets.
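The coherence being scored can be illustrated with the mean-squared-residue idea from the bicluster model that δ-clusters generalize: rows that are shifted copies of each other score near zero even though their raw values differ. Ignoring absent values in every average, as below, is a simplification for illustration, not the paper's exact FLOC objective:

```python
import numpy as np

def mean_squared_residue(M):
    """Bicluster-style residue of a (possibly incomplete) submatrix.
    NaN marks an absent value; every mean simply skips NaNs."""
    row_mean = np.nanmean(M, axis=1, keepdims=True)
    col_mean = np.nanmean(M, axis=0, keepdims=True)
    all_mean = np.nanmean(M)
    resid = M - row_mean - col_mean + all_mean
    return float(np.nanmean(resid ** 2))

perfect = np.array([[1., 4., 2.],
                    [3., 6., 4.],
                    [0., 3., 1.]])      # each row is a shift of the others
print(mean_squared_residue(perfect))    # 0.0: perfect additive coherence

incomplete = perfect.copy()
incomplete[2, 2] = np.nan               # absent value, still scoreable
print(mean_squared_residue(incomplete))
```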
Product quantization for nearest neighbor search
2010
Cited by 71 (10 self)
Abstract:
This paper introduces a product-quantization-based approach for approximate nearest neighbor search. The idea is to decompose the space into a Cartesian product of low-dimensional subspaces and to quantize each subspace separately. A vector is represented by a short code composed of its subspace quantization indices. The Euclidean distance between two vectors can be efficiently estimated from their codes. An asymmetric version increases precision, as it computes the approximate distance between a vector and a code. Experimental results show that our approach searches for nearest neighbors efficiently, in particular in combination with an inverted file system. Results for SIFT and GIST image descriptors show excellent search accuracy, outperforming three state-of-the-art approaches. The scalability of our approach is validated on a dataset of two billion vectors.
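The encode/estimate cycle can be sketched in a few lines of NumPy. Random sampling stands in for the per-subspace k-means training the paper uses, and all names are illustrative; by construction the asymmetric distance below equals the exact distance from the query to the quantized reconstruction:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 16, 4, 32          # vector dim, subspaces, centroids per subspace
ds = D // M

train = rng.standard_normal((1000, D))
# stand-in for per-subspace k-means: sample K training sub-vectors as centroids
codebooks = [train[rng.choice(1000, K, replace=False), m * ds:(m + 1) * ds]
             for m in range(M)]

def encode(x):
    """Short code: one centroid index per subspace."""
    return np.array([np.argmin(((codebooks[m] - x[m * ds:(m + 1) * ds]) ** 2).sum(1))
                     for m in range(M)])

def adc(query, code):
    """Asymmetric distance: exact query vs. quantized database vector,
    accumulated from M per-subspace squared distances."""
    return float(sum(((codebooks[m][code[m]] - query[m * ds:(m + 1) * ds]) ** 2).sum()
                     for m in range(M)))

x, q = rng.standard_normal(D), rng.standard_normal(D)
code = encode(x)
recon = np.concatenate([codebooks[m][code[m]] for m in range(M)])
print(adc(q, code), float(((q - recon) ** 2).sum()))   # the two agree
```

In a real index the per-subspace distances are precomputed once per query into M lookup tables, so scoring each stored code costs only M table reads.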
STRIPES: An Efficient Index for Predicted Trajectories
In SIGMOD, 2004
Cited by 66 (0 self)
Abstract:
Moving object databases are required to support queries on a large number of continuously moving objects. A key requirement for indexing methods in this domain is to efficiently support both update and query operations. Previous work on indexing such databases can be broadly divided into two categories: indexing the past positions and indexing the future predicted positions. In this paper we focus on indexing the future positions of moving objects and propose an indexing method, called STRIPES, which indexes predicted trajectories in a dual transformed space: trajectories of objects in d-dimensional space become points in a higher-dimensional dual space of dimension 2d. This dual transformed space is then indexed using a regular hierarchical grid decomposition. STRIPES can evaluate a range of queries, including time-slice, window, and moving queries. We have carried out an extensive experimental evaluation comparing the performance of STRIPES with the best known existing predicted-trajectory index (the TPR*-tree), and show that our approach is significantly faster than the TPR*-tree for both updates and search queries.
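The dual transform is the core trick: a predicted linear trajectory, position(t) = ref + v·(t − t_ref), is stored as the single 2d-dimensional point (v, ref), so inserts, deletes, and queries all act on points rather than moving lines. A minimal sketch of just the transform (the hierarchical grid index itself is omitted, and the function names are illustrative):

```python
def to_dual(ref_pos, velocity):
    """Map a linear d-dim trajectory to one point in 2d-dim dual space."""
    return tuple(velocity) + tuple(ref_pos)

def position_at(dual_point, d, t, t_ref=0.0):
    """Recover the predicted position at time t from the dual-space point."""
    v, ref = dual_point[:d], dual_point[d:]
    return tuple(r + vi * (t - t_ref) for vi, r in zip(v, ref))

p = to_dual(ref_pos=(10.0, 20.0), velocity=(1.0, -2.0))
print(p)                            # (1.0, -2.0, 10.0, 20.0): a 4-d point
print(position_at(p, d=2, t=5.0))   # (15.0, 10.0)
```

A time-slice query at time t then becomes a region query over these static points, which the grid decomposition answers without rebuilding anything as objects move.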
Indexing the Distance: An Efficient Method to KNN Processing
2001
Cited by 59 (13 self)
Abstract:
In this paper we present an efficient method, called iDistance, for K-nearest neighbor (KNN) search in a high-dimensional space. iDistance partitions the data and selects a reference point for each partition. The data in each partition are transformed into a one-dimensional space based on their similarity with respect to the reference point. This allows the points to be indexed using a B+-tree and KNN search to be performed using one-dimensional range searches. The choice of partitions and reference points gives the iDistance technique degrees of freedom that most other techniques do not have. We describe how appropriate choices here can effectively adapt the index structure to the data distribution. We conducted extensive experiments to evaluate the iDistance technique, and report results demonstrating its effectiveness.
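The one-dimensional mapping is simple to state: a point assigned to partition i with reference point r_i receives the key i·c + dist(p, r_i), where c is a constant larger than any within-partition distance, so every partition occupies a disjoint key interval and a B+-tree range scan can search it. A sketch, with c, the reference points, and the assignment rule chosen arbitrarily for illustration:

```python
import math

C = 1000.0   # separation constant: must exceed any distance to a reference point

def idistance_key(point, refs):
    """Assign the point to its nearest reference r_i; key = i*C + dist(p, r_i)."""
    i = min(range(len(refs)), key=lambda j: math.dist(point, refs[j]))
    return i * C + math.dist(point, refs[i])

refs = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 1.0), (9.0, 8.0), (0.5, 0.0), (11.0, 10.0)]
keys = sorted(idistance_key(p, refs) for p in points)
print(keys)   # partition 0 keys fall below 1000, partition 1 keys above it
```

A KNN query then probes, in each partition, the key interval i·C + [dist(q, r_i) − r, dist(q, r_i) + r] for a growing radius r until K neighbors are guaranteed.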
Classifying Large Data Sets Using SVM with Hierarchical Clusters
In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003
Cited by 52 (3 self)
Abstract:
The support vector machine (SVM) has been a promising method for classification and regression analysis because of its solid mathematical foundation, which conveys several salient properties that other methods do not provide. However, despite these properties, SVMs are not as favored for large-scale data mining as for pattern recognition or machine learning, because the training complexity of an SVM is highly dependent on the size of the data set. Many real-world data mining applications involve millions or billions of records, where even multiple scans of the entire data are too expensive to perform. This paper presents a new method, Clustering-Based SVM (CB-SVM), which is specifically designed for handling very large data sets. CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire data set only once to provide the SVM with high-quality samples carrying statistical summaries of the data, chosen so that the summaries maximize the benefit of learning the SVM. CB-SVM tries to generate the best SVM boundary for very large data sets given a limited amount of resources. Our experiments on synthetic and real data sets show that CB-SVM is highly scalable for very large data sets while also achieving high classification accuracy.
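The single-scan summaries can be pictured with BIRCH-style clustering features, the triple (count, linear sum, sum of squares), from which a micro-cluster's centroid and radius are recoverable without keeping the raw points; this flat sketch is only the summary bookkeeping, not the paper's hierarchical tree or its SVM training loop:

```python
import numpy as np

class ClusteringFeature:
    """(n, LS, SS): enough to recover centroid and radius without raw points."""
    def __init__(self, dim):
        self.n, self.ls, self.ss = 0, np.zeros(dim), 0.0

    def absorb(self, x):          # one point from the single scan
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # sqrt of mean squared distance of absorbed points to the centroid:
        # mean ||x||^2 - ||centroid||^2, clipped against rounding below zero
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - float(c @ c), 0.0)))

rng = np.random.default_rng(3)
pts = rng.standard_normal((500, 2)) + np.array([5.0, 5.0])
cf = ClusteringFeature(2)
for p in pts:                     # the one pass over the data
    cf.absorb(p)
print(cf.centroid())              # near [5, 5]
print(cf.radius())
```

CB-SVM then trains on such summaries, refining only the micro-clusters near the current decision boundary.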
Approximate nearest neighbor searching in multimedia databases
In Proc. of the 17th IEEE Int. Conf. on Data Engineering (ICDE), 2001
Cited by 48 (11 self)
Abstract:
In this paper we develop a general framework for approximate nearest neighbor queries. We categorize current approaches to nearest neighbor query processing based on either their ability to reduce the data set that needs to be examined or their ability to reduce the representation size of each data object. We first propose modifications to well-known techniques to support progressive processing of approximate nearest neighbor queries, so that a user may stop the retrieval process once enough information has been returned. We then develop a new technique, based on clustering, that merges the benefits of the two general classes of approaches: our cluster-based approach allows a user to progressively explore approximate results with increasing accuracy. We also propose a new metric for evaluating approximate nearest neighbor searching techniques. Using both the proposed and traditional metrics, we analyze and compare several techniques with a detailed performance evaluation. We demonstrate the feasibility and efficiency of approximate nearest neighbor searching, performing experiments on several real data sets and establishing the superiority of the proposed cluster-based technique over existing techniques.
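The progressive, cluster-based idea can be sketched as a generator: clusters are visited in order of centroid distance to the query, and the best neighbor found so far is yielded after each cluster, so the caller may stop at any desired accuracy. The cluster construction here is arbitrary; it is a stand-in for the paper's scheme, not its algorithm:

```python
import numpy as np

def progressive_nn(query, centroids, clusters):
    """Yield the best neighbor so far after scanning each cluster, visiting
    clusters by increasing centroid distance to the query."""
    order = np.argsort(((centroids - query) ** 2).sum(1))
    best, best_d = None, np.inf
    for ci in order:
        for p in clusters[ci]:
            d = float(((p - query) ** 2).sum())
            if d < best_d:
                best, best_d = p, d
        yield best          # caller may break here: coarse now, refined later

rng = np.random.default_rng(7)
centroids = rng.standard_normal((5, 2)) * 4
clusters = [c + rng.standard_normal((50, 2)) * 0.3 for c in centroids]
q = rng.standard_normal(2)
answers = list(progressive_nn(q, centroids, clusters))
print([round(float(((a - q) ** 2).sum()), 3) for a in answers])  # non-increasing
```

Because the running best only improves, each yielded answer is at least as accurate as the previous one, which is exactly the progressive-exploration behavior the abstract describes.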
Deflating the dimensionality curse using multiple fractal dimensions
In Proceedings of the International Conference on Data Engineering