Closest-point problems simplified on the RAM
In Proc. 13th ACM-SIAM Sympos. on Discrete Algorithms, 2002
Cited by 34 (2 self)
Basic proximity problems for low-dimensional point sets, such as closest pair (CP) and approximate nearest neighbor (ANN), have been studied extensively in the computational geometry literature, with well over a hundred papers published (we merely cite the survey by Smid [10] and omit most references). Generally, optimal algorithms designed for worst-case input require hierarchical spatial structures with sophisticated balancing conditions (we mention, for example, the BBD trees of Arya et al., balanced quadtrees, and Callahan and Kosaraju's fair-split trees); dynamization of these structures is even more involved (relying on Sleator and Tarjan's dynamic trees or Frederickson's topology trees). In this note, we point out that much simpler algorithms with the same performance are possible using standard, though nonalgebraic, RAM operations. This is interesting, considering that nonalgebraic operations have been used before in the literature (e.g., in the original version of the BBD tree [2], as well as in various randomized CP methods). The CP algorithm can be stated completely in one paragraph. Assume coordinates are positive integers bounded by $U = 2^w$. Given a point $p$ in a constant dimension $d$ where the $i$-th coordinate $p_i$ is the number $p_{iw} \cdots p_{i0}$ in binary, define its shuffle $\sigma(p)$ to be the number $p_{1w} \cdots p_{dw} \cdots p_{10} \cdots p_{d0}$ in binary, and define shifts $\tau_i(p) = (p_1 + \lfloor i2 \ldots$
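The shuffle described in this abstract is a bit interleaving of the coordinates. The following is a minimal sketch of that one step only, assuming non-negative integer coordinates that fit in w + 1 bits; the function name `shuffle` and the sample points are illustrative, and the shifts and the rest of the CP algorithm are not covered because the abstract is truncated here.

```python
def shuffle(p, w):
    """Interleave the bits of the coordinates of p into one integer.

    Each coordinate is read as the (w + 1)-bit number p_iw ... p_i0, and the
    result is p_1w ... p_dw ... p_10 ... p_d0, most significant bit first.
    """
    key = 0
    for j in range(w, -1, -1):      # bit positions w, w-1, ..., 0
        for coord in p:             # coordinates 1..d, in order
            key = (key << 1) | ((coord >> j) & 1)
    return key


# Hypothetical sample points in dimension d = 2 with 2-bit coordinates (w = 1).
points = [(3, 0), (0, 3), (2, 1), (1, 1)]
ordered = sorted(points, key=lambda p: shuffle(p, w=1))
print(ordered)  # [(1, 1), (0, 3), (2, 1), (3, 0)]
```

Sorting by this key places the points along a Z-shaped curve, so points that are close in space often, but not always, end up close in the sorted order.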
Outlier mining in large high-dimensional data sets
IEEE Transactions on Knowledge and Data Engineering, 2005
Cited by 23 (1 self)
In this paper, a new definition of distance-based outlier and an algorithm, called HilOut, designed to efficiently detect the top n outliers of a large and high-dimensional data set are proposed. Given an integer k, the weight of a point is defined as the sum of the distances separating it from its k nearest neighbors. Outliers are those points scoring the largest values of weight. The algorithm HilOut makes use of the notion of space-filling curve to linearize the data set, and it consists of two phases. The first phase provides an approximate solution, within a rough factor, after the execution of at most d + 1 sorts and scans of the data set, with temporal cost quadratic in d and linear in N and in k, where d is the number of dimensions of the data set and N is the number of points in the data set. During this phase, the algorithm isolates candidate outliers and reduces this set at each iteration. If the size of this set becomes n, then the algorithm stops, reporting the exact solution. The second phase calculates the exact solution with a final scan that further examines the candidate outliers remaining after the first phase. Experimental results show that the algorithm always stops, reporting the exact solution, during the first phase after far fewer than d + 1 steps. We present both an in-memory and a disk-based implementation of the HilOut algorithm and a thorough scaling analysis for real and synthetic data sets showing that the algorithm scales well in both cases.
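The weight-based outlier definition in this abstract can be made concrete with a brute-force baseline. This is a sketch of the definition only, not of HilOut: it compares every pair of points rather than using a space-filling curve, and the function names and sample data are assumptions made for illustration.

```python
import heapq
import math


def knn_weight(points, i, k):
    """Sum of the distances from points[i] to its k nearest neighbors."""
    dists = sorted(math.dist(points[i], q)
                   for j, q in enumerate(points) if j != i)
    return sum(dists[:k])


def top_n_outliers(points, n, k):
    """Indices of the n points with the largest k-nearest-neighbor weight."""
    weights = [(knn_weight(points, i, k), i) for i in range(len(points))]
    return [i for _, i in heapq.nlargest(n, weights)]


# Hypothetical data: four points on a unit square and one far-away point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(top_n_outliers(data, n=1, k=2))  # [4]
```

This baseline takes time quadratic in the number of points, which is the cost that the linearization approach in the abstract is designed to avoid.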
A Fast Similarity Join Algorithm Using Graphics Processing Units
Cited by 22 (0 self)
A similarity join operation A ⋈_ε B takes two sets of points A, B and a value ε ∈ R, and outputs pairs of points p ∈ A, q ∈ B, such that the distance D(p, q) ≤ ε. Similarity joins find use in a variety of fields, such as clustering, text mining, and multimedia databases. A novel similarity join algorithm called LSS is presented that executes on a Graphics Processing Unit (GPU), exploiting its parallelism and high data throughput. As GPUs only allow simple data operations such as the sorting and searching of arrays, LSS uses these two operations to cast a similarity join operation as a GPU sort-and-search problem. It first creates, on the fly, a set of space-filling curves on one of its input datasets, using a parallel GPU sort routine. Next, LSS processes each point p of the other dataset in parallel. For each p, it searches an interval of one of the space-filling curves guaranteed to contain all the pairs in which p participates. Using extensive theoretical and experimental analysis, LSS is shown to offer a good balance between time and work efficiency. Experimental results demonstrate that LSS is suitable for similarity joins in large high-dimensional datasets, and that it performs well when compared against two existing prominent similarity join methods.
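As a reference point for the definition of the similarity join given in this abstract, here is a nested-loop sketch that returns every pair within distance ε. It is not LSS and does not run on a GPU; it assumes the Euclidean distance for D, and the function name and sample sets are illustrative.

```python
import math


def similarity_join(A, B, eps):
    """Return all pairs (p, q) with p in A, q in B and D(p, q) <= eps.

    D is assumed to be the Euclidean distance; the abstract leaves the
    choice of distance function open.
    """
    return [(p, q) for p in A for q in B if math.dist(p, q) <= eps]


# Hypothetical input sets.
A = [(0, 0), (5, 5)]
B = [(0, 1), (5, 7), (9, 9)]
print(similarity_join(A, B, eps=1.5))  # [((0, 0), (0, 1))]
```

The nested loop examines all |A| · |B| pairs, which is the work that a sort-and-search formulation aims to reduce by restricting each search to an interval of a sorted order.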
Detecting outliers using transduction and statistical testing
In Proceedings of the 12th Annual SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006
Cited by 11 (1 self)
Outlier detection can uncover malicious behavior in fields like intrusion detection and fraud analysis. Although there has been a significant amount of work in outlier detection, most of the algorithms proposed in the literature are based on a particular definition of outliers (e.g., density-based), and use ad hoc thresholds to detect them. In this paper we present a novel technique to detect outliers with respect to an existing clustering model. However, the test can also be successfully utilized to recognize outliers when the clustering information is not available. Our method is based on Transductive Confidence Machines, which have been previously proposed as a mechanism to provide individual confidence measures on classification decisions. The test uses hypothesis testing to prove or disprove whether a point is fit to be in each of the clusters of the model. We experimentally demonstrate that the test is highly robust, and produces very few misdiagnosed points, even when no clustering information is available. Furthermore, our experiments demonstrate the robustness of our method under the circumstances of data contaminated by outliers. We finally show that our technique can be successfully applied to identify outliers in a noisy data set for which no information is available (e.g., ground truth, clustering structure, etc.). As such our proposed methodology is capable of bootstrapping from a noisy data set a clean one that can be used to identify future outliers.
Fast construction of k-Nearest Neighbor Graphs for Point Clouds
Cited by 10 (1 self)
We present a parallel algorithm for k-nearest neighbor graph construction that uses Morton ordering. Experiments show that our approach has the following advantages over existing methods: (1) Faster construction of k-nearest neighbor graphs in practice on multicore machines. (2) Less space usage. (3) Better cache efficiency. (4) Ability to handle large data sets. (5) Ease of parallelization and implementation. If the point set has a bounded expansion constant, our algorithm requires one comparison-based parallel sort of points according to Morton order plus near-linear additional steps to output the k-nearest neighbor graph. Index Terms: Nearest neighbor searching, point based graphics, k-nearest neighbor graphics, Morton Ordering, parallel algorithms.
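A comparison-based sort by Morton order does not need an explicit interleaved key. The sketch below uses a well-known XOR trick to find the coordinate holding the most significant differing bit; it is sequential, assumes non-negative integer coordinates, and is not taken from the paper. The names `less_msb` and `morton_cmp` are illustrative.

```python
from functools import cmp_to_key


def less_msb(x, y):
    """True if the most significant set bit of x is lower than that of y."""
    return x < y and x < (x ^ y)


def morton_cmp(p, q):
    """Three-way comparison of two integer points in Morton (Z) order.

    The first coordinate is treated as the most significant one when
    several coordinates differ at the same bit position.
    """
    best_dim, best_xor = 0, 0
    for d in range(len(p)):
        x = p[d] ^ q[d]
        if less_msb(best_xor, x):   # coordinate d differs at a higher bit
            best_dim, best_xor = d, x
    return (p[best_dim] > q[best_dim]) - (p[best_dim] < q[best_dim])


# Hypothetical sample points.
pts = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(sorted(pts, key=cmp_to_key(morton_cmp)))
# [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Because the comparison uses only XOR and integer comparisons, it can be plugged into any comparison-based sort, including a parallel one.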
Efficient Query Processing on Unstructured Tetrahedral Meshes
In SIGMOD ’06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, 2006
Cited by 8 (2 self)
Modern scientific applications consume massive volumes of data produced by computer simulations. Such applications require new data management capabilities in order to scale to terabyte-scale data volumes [25, 10]. The most common way to discretize the application domain is to decompose it into pyramids, forming an unstructured tetrahedral mesh. Modern simulations generate meshes of high resolution and precision, to be queried by a visualization or analysis tool. Tetrahedral meshes are extremely flexible and therefore vital to accurately model complex geometries, but are also difficult to index. To reduce query execution time, applications either use only subsets of the data or rely on different (less flexible) structures, thereby trading accuracy for speed. This
Achieving Spatial Adaptivity while Finding Approximate Nearest Neighbors
Cited by 4 (2 self)
We present the first spatially adaptive data structure that answers approximate nearest neighbor (ANN) queries to points that reside in a geometric space of any constant dimension d. The $L_t$-norm approximation ratio is $O(d^{1+1/t})$, and the running time for a query q is $O(d^2 \lg \delta(p, q))$, where p is the result of the preceding query and $\delta(p, q)$ is the number of input points in a suitably sized box containing p and q. Our data structure has $O(dn)$ size and requires $O(d^2 n \lg n)$ preprocessing time, where n is the number of points in the data structure. The size of the bounding box for $\delta$ depends on d, and our results rely on the Random Access Machine (RAM) model with word size $\Theta(\lg n)$.
K-Nearest Neighbor Queries and KNN-Joins in Large Relational Databases (Almost) for Free
"... Abstract — Finding the ..."
Approximate Nearest Neighbor Search using a Single Space-filling Curve and Multiple Representations of the Data Points
Cited by 1 (0 self)
In this work, a fast approximate nearest neighbour search algorithm using a single Space-filling Curve (SPFC) mapping and a set of synthetic prototype representations is presented. The results are comparable to those of a multiple space-filling curve scheme, but are obtained with a much faster execution time, since computing multiple transformations and SPFC mappings is avoided, at the expense of having a more densely populated one-dimensional representation of the dataset. The advantages and limitations of the model are discussed, and an experimental evaluation with synthetic data and with a large, real high-dimensional optical character recognition dataset is presented.
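The general idea of searching along a single space-filling curve can be sketched as follows: map every point to a one-dimensional key, sort, locate the query by binary search, and compare it only against a small window of neighbors in the sorted order. This sketch uses a Z-order key as a stand-in for the curve and omits the synthetic prototype representations, which the abstract does not specify; `z_key`, `build_index`, `approx_nearest`, and the `window` parameter are all assumptions.

```python
import bisect
import math


def z_key(p, bits=8):
    """Z-order key of a point with non-negative integer coordinates < 2**bits."""
    key = 0
    for j in range(bits - 1, -1, -1):
        for coord in p:
            key = (key << 1) | ((coord >> j) & 1)
    return key


def build_index(points, bits=8):
    """Sort the points by their position along the curve."""
    return sorted((z_key(p, bits), p) for p in points)


def approx_nearest(index, q, window=2, bits=8):
    """Best point among the `window` curve neighbors on each side of q."""
    keys = [k for k, _ in index]
    pos = bisect.bisect_left(keys, z_key(q, bits))
    lo, hi = max(0, pos - window), min(len(index), pos + window)
    candidates = [p for _, p in index[lo:hi]]
    return min(candidates, key=lambda p: math.dist(p, q))


# Hypothetical data set and query.
index = build_index([(0, 0), (3, 3), (10, 10), (200, 200)])
print(approx_nearest(index, (9, 9)))  # (10, 10)
```

The answer is approximate because two points that are close in space can be far apart along the curve; a larger window trades query time for accuracy.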
Adaptive Binary Search Trees
2009
Cited by 1 (0 self)
A ubiquitous problem in the field of algorithms and data structures is that of searching for an element from an ordered universe. The simple yet powerful binary search tree (BST) model provides a rich family of solutions to this problem. Although BSTs require Ω(lg n) time per operation in the worst case, various adaptive BST algorithms are capable of exploiting patterns in the sequence of queries to achieve tighter, input-sensitive bounds that can be o(lg n) in many cases. This thesis furthers our understanding of what is achievable in the BST model along two directions. First, we make progress in improving instance-specific lower bounds in the BST model. In particular, we introduce a framework for generating lower bounds on the cost that any BST algorithm must pay to execute a query sequence,