Closest-point problems simplified on the RAM
- In Proc. 13th ACM-SIAM Sympos. on Discrete Algorithms, 2002
Cited by 26 (2 self)
Basic proximity problems for low-dimensional point sets, such as closest pair (CP) and approximate nearest neighbor (ANN), have been studied extensively in the computational geometry literature, with well over a hundred papers published (we merely cite the survey by Smid [10] and omit most references). Generally, optimal algorithms designed for worst-case input require hierarchical spatial structures with sophisticated balancing conditions (we mention, for example, the BBD trees of Arya et al., balanced quadtrees, and Callahan and Kosaraju's fair-split trees); dynamization of these structures is even more involved (relying on Sleator and Tarjan's dynamic trees or Frederickson's topology trees). In this note, we point out that much simpler algorithms with the same performance are possible using standard, though nonalgebraic, RAM operations. This is interesting, considering that nonalgebraic operations have been used before in the literature (e.g., in the original version of the BBD tree [2], as well as in various randomized CP methods). The CP algorithm can be stated completely in one paragraph. Assume coordinates are positive integers bounded by $U = 2^w$. Given a point $p$ in a constant dimension $d$ where the $i$-th coordinate $p_i$ is the number $p_{iw} \cdots p_{i0}$ in binary, define its shuffle $\sigma(p)$ to be the number $p_{1w} \cdots p_{dw} \cdots p_{10} \cdots p_{d0}$ in binary, and define shifts $\tau_i(p) = (p_1 + \lfloor i2 \ldots$
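The shuffle defined in this abstract is bit interleaving, which produces what is commonly called a Morton or Z-order code. A minimal Python sketch, assuming each coordinate is a non-negative integer that fits in bits w..0 as in the text; the function name `shuffle` and its parameters are illustrative choices, not taken from the paper:

```python
def shuffle(point, w):
    """Interleave the bits of the coordinates of `point`, most significant first.

    Assumption: every coordinate is a non-negative integer below 2**(w + 1),
    so it is fully described by bit positions w, w-1, ..., 0.
    """
    code = 0
    for bit in range(w, -1, -1):      # bit positions w down to 0
        for coord in point:           # coordinates 1..d, in order
            code = (code << 1) | ((coord >> bit) & 1)
    return code
```

Sorting points by this key, for example `sorted(points, key=lambda p: shuffle(p, w))`, places them in shuffle order.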
Outlier mining in large high-dimensional data sets
- IEEE Transactions on Knowledge and Data Engineering, 2005
Cited by 16 (1 self)
In this paper a new definition of distance-based outlier and an algorithm, called HilOut, designed to efficiently detect the top n outliers of a large and high-dimensional data set are proposed. Given an integer k, the weight of a point is defined as the sum of the distances separating it from its k nearest neighbors. Outliers are those points scoring the largest values of weight. The algorithm HilOut makes use of the notion of space-filling curve to linearize the data set, and it consists of two phases. The first phase provides an approximate solution, within a rough factor, after the execution of at most d + 1 sorts and scans of the data set, with temporal cost quadratic in d and linear in N and in k, where d is the number of dimensions of the data set and N is the number of points in the data set. During this phase, the algorithm isolates candidate outlier points and reduces this set at each iteration. If the size of this set becomes n, then the algorithm stops, reporting the exact solution. The second phase calculates the exact solution with a final scan that further examines the candidate outliers remaining after the first phase. Experimental results show that the algorithm always stops, reporting the exact solution, during the first phase after far fewer than d + 1 steps. We present both an in-memory and a disk-based implementation of the HilOut algorithm and a thorough scaling analysis for real and synthetic data sets showing that the algorithm scales well in both cases.
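The weight-based outlier definition in this abstract can be illustrated with a brute-force sketch. This is not HilOut, which relies on space-filling curves to avoid the quadratic cost; it only spells out the definition, assuming Euclidean distance. The names `knn_weight` and `top_n_outliers` are hypothetical:

```python
import math

def knn_weight(points, index, k):
    """Sum of the distances from points[index] to its k nearest neighbors."""
    dists = sorted(
        math.dist(points[index], q)
        for j, q in enumerate(points)
        if j != index
    )
    return sum(dists[:k])

def top_n_outliers(points, n, k):
    """Indices of the n points with the largest weight (brute force, O(N^2))."""
    weights = [knn_weight(points, i, k) for i in range(len(points))]
    order = sorted(range(len(points)), key=lambda i: weights[i], reverse=True)
    return order[:n]
```

On a small set such as `[(0, 0), (0, 1), (1, 0), (10, 10)]` with k = 2, the isolated point at index 3 has by far the largest weight.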
A Fast Similarity Join Algorithm Using Graphics Processing Units
Cited by 11 (0 self)
A similarity join operation A ⋈_ɛ B takes two sets of points A, B and a value ɛ ∈ R, and outputs pairs of points p ∈ A, q ∈ B, such that the distance D(p, q) ≤ ɛ. Similarity joins find use in a variety of fields, such as clustering, text mining, and multimedia databases. A novel similarity join algorithm called LSS is presented that executes on a Graphics Processing Unit (GPU), exploiting its parallelism and high data throughput. As GPUs only allow simple data operations such as the sorting and searching of arrays, LSS uses these two operations to cast a similarity join operation as a GPU sort-and-search problem. It first creates, on the fly, a set of space-filling curves on one of its input datasets, using a parallel GPU sort routine. Next, LSS processes each point p of the other dataset in parallel. For each p, it searches an interval of one of the space-filling curves guaranteed to contain all the pairs in which p participates. Using extensive theoretical and experimental analysis, LSS is shown to offer a good balance between time and work efficiency. Experimental results demonstrate that LSS is suitable for similarity joins in large high-dimensional datasets, and that it performs well when compared against two existing prominent similarity join methods.
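The definition of the similarity join can be written out directly as a nested loop. The sketch below is a brute-force reference, not the LSS algorithm, and it assumes D is the Euclidean distance; the function name `similarity_join` is a hypothetical choice:

```python
import math

def similarity_join(A, B, eps):
    """Return all pairs (p, q) with p in A, q in B and distance D(p, q) <= eps.

    Brute force over all |A| * |B| pairs; useful only as a reference
    against which a faster method could be checked.
    """
    return [(p, q) for p in A for q in B if math.dist(p, q) <= eps]
```

For example, with `A = [(0, 0), (5, 5)]`, `B = [(0, 1), (5, 7), (9, 9)]` and `eps = 1.5`, only the pair `((0, 0), (0, 1))` qualifies.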
Detecting outliers using transduction and statistical testing
- In Proceedings of the 12th Annual SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006
Cited by 9 (0 self)
Outlier detection can uncover malicious behavior in fields like intrusion detection and fraud analysis. Although there has been a significant amount of work in outlier detection, most of the algorithms proposed in the literature are based on a particular definition of outliers (e.g., density-based), and use ad-hoc thresholds to detect them. In this paper we present a novel technique to detect outliers with respect to an existing clustering model. However, the test can also be successfully utilized to recognize outliers when the clustering information is not available. Our method is based on Transductive Confidence Machines, which have been previously proposed as a mechanism to provide individual confidence measures on classification decisions. The test uses hypothesis testing to prove or disprove whether a point is fit to be in each of the clusters of the model. We experimentally demonstrate that the test is highly robust, and produces very few misdiagnosed points, even when no clustering information is available. Furthermore, our experiments demonstrate the robustness of our method under the circumstances of data contaminated by outliers. We finally show that our technique can be successfully applied to identify outliers in a noisy data set for which no information is available (e.g., ground truth, clustering structure, etc.). As such our proposed methodology is capable of bootstrapping from a noisy data set a clean one that can be used to identify future outliers.
Efficient Query Processing on Unstructured Tetrahedral
- In SIGMOD ’06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, 2006
Cited by 7 (1 self)
Modern scientific applications consume massive volumes of data produced by computer simulations. Such applications require new data management capabilities in order to scale to terabyte-scale data volumes [25, 10]. The most common way to discretize the application domain is to decompose it into pyramids, forming an unstructured tetrahedral mesh. Modern simulations generate meshes of high resolution and precision, to be queried by a visualization or analysis tool. Tetrahedral meshes are extremely flexible and therefore vital to accurately model complex geometries, but also are difficult to index. To reduce query execution time, applications either use only subsets of the data or rely on different (less flexible) structures, thereby trading accuracy for speed.
Fast construction of k-Nearest Neighbor Graphs for Point Clouds
Cited by 5 (1 self)
We present a parallel algorithm for k-nearest neighbor graph construction that uses Morton ordering. Experiments show that our approach has the following advantages over existing methods: (1) Faster construction of k-nearest neighbor graphs in practice on multi-core machines. (2) Less space usage. (3) Better cache efficiency. (4) Ability to handle large data sets. (5) Ease of parallelization and implementation. If the point set has a bounded expansion constant, our algorithm requires one comparison-based parallel sort of points according to Morton order plus near linear additional steps to output the k-nearest neighbor graph. Index Terms: Nearest neighbor searching, point based graphics, k-nearest neighbor graphics, Morton Ordering, parallel algorithms.
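A comparison-based sort in Morton order does not have to build interleaved codes explicitly. One well-known approach compares two points by finding the coordinate whose XOR has the highest set bit. The sketch below is sequential Python under the assumption of non-negative integer coordinates, with the first coordinate treated as most significant; it is not claimed to be the authors' implementation, and the names `less_msb`, `morton_less` and `morton_sorted` are illustrative:

```python
from functools import cmp_to_key

def less_msb(x, y):
    """True if the most significant set bit of x is lower than that of y."""
    return x < y and x < (x ^ y)

def morton_less(p, q):
    """True if p precedes q in Morton order (non-negative integer coordinates)."""
    best_dim, best_xor = 0, 0
    for dim in range(len(p)):
        x = p[dim] ^ q[dim]
        if less_msb(best_xor, x):     # strictly higher differing bit found
            best_dim, best_xor = dim, x
    return p[best_dim] < q[best_dim]

def morton_sorted(points):
    """Sort points in Morton order using only pairwise comparisons."""
    def cmp(p, q):
        if morton_less(p, q):
            return -1
        if morton_less(q, p):
            return 1
        return 0
    return sorted(points, key=cmp_to_key(cmp))
```

Because the comparison needs only XOR and integer comparisons, the same predicate could be handed to any comparison-based sort, including a parallel one.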
Achieving Spatial Adaptivity while Finding Approximate Nearest Neighbors
Cited by 3 (2 self)
We present the first spatially adaptive data structure that answers approximate nearest neighbor (ANN) queries to points that reside in a geometric space of any constant dimension $d$. The $L_t$-norm approximation ratio is $O(d^{1+1/t})$, and the running time for a query $q$ is $O(d^2 \lg \delta(p, q))$, where $p$ is the result of the preceding query and $\delta(p, q)$ is the number of input points in a suitably-sized box containing $p$ and $q$. Our data structure has $O(dn)$ size and requires $O(d^2 n \lg n)$ preprocessing time, where $n$ is the number of points in the data structure. The size of the bounding box for $\delta$ depends on $d$, and our results rely on the Random Access Machine (RAM) model with word size $\Theta(\lg n)$.
Approximate Nearest Neighbor Search using a Single Space-filling Curve and Multiple Representations of the Data Points
Cited by 1 (0 self)
In this work, a fast approximate nearest neighbour search algorithm using a single Space-filling Curve (SPFC) mapping and a set of synthetic prototype representations is presented. The results are comparable to those of a multiple space-filling scheme, but are obtained with a much faster execution time, since computing multiple transformations and SPFC mappings is avoided, at the expense of having a more densely populated one-dimensional representation of the data set. The advantages and limitations of the model are discussed, and an experimental evaluation with synthetic data and with a large, real high-dimensional optical character recognition data set is presented.
K Nearest Neighbor Queries and KNN-Joins in Large Relational Databases (Almost) for Free
"... Abstract — Finding the ..."
Adaptive Binary Search Trees
- 2009
Keywords: binary search trees, adaptive algorithms, splay trees, Unified Bound, dynamic
A ubiquitous problem in the field of algorithms and data structures is that of searching for an element from an ordered universe. The simple yet powerful binary search tree (BST) model provides a rich family of solutions to this problem. Although BSTs require Ω(lg n) time per operation in the worst case, various adaptive BST algorithms are capable of exploiting patterns in the sequence of queries to achieve tighter, input-sensitive bounds that can be o(lg n) in many cases. This thesis furthers our understanding of what is achievable in the BST model along two directions. First, we make progress in improving instance-specific lower bounds in the BST model. In particular, we introduce a framework for generating lower bounds on the cost that any BST algorithm must pay to execute a query sequence,

