Clustering data streams: Theory and practice
 IEEE TKDE
, 2003
Cited by 111 (3 self)
The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm's performance on synthetic and real data streams.
Index Terms: Clustering, data streams, approximation algorithms.
Sublinear Time Algorithms for Metric Space Problems
Cited by 82 (2 self)
In this paper we give approximation algorithms for the following problems on metric spaces: Furthest Pair, k-median, Minimum Routing Cost Spanning Tree, Multiple Sequence Alignment, Maximum Traveling Salesman Problem, Maximum Spanning Tree and Average Distance. The key property of our algorithms is that their running time is linear in the number of metric space points. As the full specification of an n-point metric space is of size Θ(n²), the complexity of our algorithms is sublinear with respect to the input size. All previous algorithms (exact or approximate) for the problems we consider have running time Ω(n²). We believe that our techniques can be applied to get similar bounds for other problems.
1 Introduction
In recent years there has been a dramatic growth of interest in algorithms operating on massive data sets. This poses new challenges for algorithm design, as algorithms quite efficient on small inputs (for example, having quadratic running time) ...
Nearest Neighbors in High-Dimensional Spaces
, 2004
Cited by 82 (2 self)
In this chapter we consider the following problem: given a set P of points in a high-dimensional space, construct a data structure which, given any query point q, finds the point in P closest to q. This problem, called nearest neighbor search, is of significant importance to several areas of computer science, including pattern recognition, searching in multimedial data, vector compression [GG91], computational statistics [DW82], and data mining. Many of these applications involve data sets which are very large (e.g., a database containing Web documents could contain over one billion documents). Moreover, the dimensionality of the points is usually large as well (e.g., in the order of a few hundred). Therefore, it is crucial to design algorithms which scale well with the database size as well as with the dimension. The nearest-neighbor problem is an example of a large class of proximity problems, which, roughly speaking, are problems whose definitions involve the notion of...
Polynomial Time Approximation Schemes for Geometric k-Clustering
 J. OF THE ACM
, 2001
Cited by 32 (5 self)
The Johnson-Lindenstrauss lemma states that n points in a high dimensional Hilbert space can be embedded with small distortion of the distances into an O(log n) dimensional space by applying a random linear transformation. We show that similar (though weaker) properties hold for certain random linear transformations over the Hamming cube. We use these transformations to solve NP-hard clustering problems in the cube as well as in geometric settings. More specifically, we address the following clustering problem. Given n points in a larger set (for example, R^d) endowed with a distance function (for example, L² distance), we would like to partition the data set into k disjoint clusters, each with a "cluster center", so as to minimize the sum over all data points of the distance between the point and the center of the cluster containing the point. The problem is provably NP-hard in some high dimensional geometric settings, even for k = 2. We give polynomial time approximation schemes for this problem in several settings, including the binary cube {0, 1}^d with Hamming distance, and R^d either with L¹ distance, or with L² distance, or with the square of L² distance. In all these settings, the best previous results were constant factor approximation guarantees. We note that our problem is similar in flavor to the k-median problem (and the related facility location problem), which has been considered in graph-theoretic and fixed dimensional geometric settings, where it becomes hard when k is part of the input. In contrast, we study the problem when k is fixed, but the dimension is part of the input.
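As an illustration of the objective described in the abstract above, the following minimal Python sketch evaluates the clustering cost for a fixed set of centers in the binary cube with Hamming distance. The function names (hamming, clustering_cost) and the toy data are assumptions made for illustration and do not come from the paper. The sketch only evaluates the objective; it is not the paper's approximation scheme. It relies on the observation that, once the centers are fixed, the cost is minimized by charging each point to its nearest center.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[int, ...]


def hamming(a: Point, b: Point) -> int:
    """Number of coordinates in which a and b differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def clustering_cost(
    points: Sequence[Point],
    centers: Sequence[Point],
    dist: Callable[[Point, Point], int],
) -> int:
    """Sum over all points of the distance to the nearest center.

    For fixed centers, assigning each point to its nearest center
    gives the best partition, so this equals the objective value.
    """
    return sum(min(dist(p, c) for c in centers) for p in points)


# Hypothetical toy data in {0, 1}^3 with k = 2 centers.
pts = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0)]
print(clustering_cost(pts, [(0, 0, 0), (1, 1, 1)], hamming))  # prints 2
```

Passing a different distance function (for example, one computing L¹ distance on real vectors) evaluates the same objective in the other settings the abstract lists.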
Derandomized Dimensionality Reduction with Applications
 In Proc. 13th ACM-SIAM Sympos. Discrete Algorithms
, 2002
Cited by 29 (3 self)
The Johnson-Lindenstrauss lemma provides a way to map a number of points in high-dimensional space into a low-dimensional space, with only a small distortion of the distances between the points. The proofs of the lemma are non-constructive: they show that a random mapping induces small distortions with high probability, but they do not construct the actual mapping. In this paper, we provide a procedure that constructs such a mapping deterministically in time almost linear in the number of distances to preserve times the dimension of the original space. We then use that result (together with Nisan's pseudorandom generator) to obtain an efficient derandomization of several approximation algorithms based on semidefinite programming.
Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Datasets
 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
, 2003
Cited by 26 (1 self)
We investigate the use of biased sampling according to the density of the data set to speed up the operation of general data mining tasks, such as clustering and outlier detection in large multidimensional data sets. In density-biased sampling, the probability that a given point will be included in the sample depends on the local density of the data set. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest and can be tuned for specific data mining tasks. This allows great flexibility and improved accuracy of the results over simple random sampling. We describe our approach in detail, analytically evaluate it, and show how it can be optimized for approximate clustering and outlier detection. Finally, we present...
Clustering in Massive Data Sets
 Handbook of massive data sets
, 1999
Cited by 12 (0 self)
We review the time and storage costs of search and clustering algorithms. We exemplify these, based on case studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Sections 2 to 6 relate to nearest neighbor searching, an elemental form of clustering, and a basis for clustering algorithms to follow. Sections 7 to 11 review a number of families of clustering algorithms. Sections 12 to 14 relate to visual or image representations of data sets, from which a number of interesting algorithmic developments arise.
Randomized algorithms for geometric optimization problems
 Handbook of Randomized Computation
, 2001
Cited by 10 (0 self)
This chapter reviews randomization algorithms developed in the last few years to solve a wide range of geometric optimization problems. We first review a number of general techniques, including randomized binary search, randomized linear-programming algorithms, and random sampling. Next, we describe several applications of these and other techniques, including facility location, proximity problems, statistical estimators, nearest neighbor searching, and Euclidean TSP.
Compact Data Structures with Fast Queries
, 2005
Cited by 4 (0 self)
Many applications dealing with large data structures can benefit from keeping them in compressed form. Compression has many benefits: it can allow a representation to fit in main memory rather than swapping out to disk, and it improves cache performance since it allows more data to fit into the cache. However, a data structure is only useful if it allows the application to perform fast queries (and updates) to the data.
Towards a Theory of Intrusion Detection
 In Proc. of European Symposium on Research in Computer Security (ESORICS 2005)
Cited by 3 (1 self)
We embark into theoretical approaches for the investigation of intrusion detection schemes. Our main motivation is to provide rigorous security requirements for intrusion detection systems that can be used by designers of such systems. Our model captures and generalizes well-known methodologies in the intrusion detection area, such as anomaly-based and signature-based intrusion detection, and formulates security requirements based on both well-known complexity-theoretic notions and well-known notions in cryptography (such as computational indistinguishability). Under our model, we present two efficient paradigms for intrusion detection systems, one based on nearest neighbor search algorithms, and one based on both the latter and clustering algorithms. Under formally specified assumptions on the representation of network traffic, we can prove that our two systems satisfy our main security requirement for an intrusion detection system. In both cases, while the potential truth of the assumption rests on heuristic properties of the representation of network traffic (which is hard to avoid due to the unpredictable nature of external attacks to a network), the proof that the systems satisfy desirable detection properties is rigorous and of probabilistic and algorithmic nature. Additionally, our framework raises open questions on intrusion detection systems that can be rigorously studied. As an example, we study the problem of arbitrarily and efficiently extending the detection window of any intrusion detection system, which allows the latter to catch attack sequences interleaved with normal traffic packet sequences. We use combinatoric tools such as time- and space-efficient covering set systems to present provably correct solutions to this problem.