Results 1-10 of 34
Probabilistic Counting Algorithms for Data Base Applications
, 1985
Cited by 449 (6 self)
This paper introduces a class of probabilistic counting algorithms with which one can estimate the number of distinct elements in a large collection of data (typically a large file stored on disk) in a single pass using only a small additional storage (typically less than a hundred binary words) and only a few operations per element scanned. The algorithms are based on statistical observations made on bits of hashed values of records. They are by construction totally insensitive to the replicative structure of elements in the file; they can be used in the context of distributed systems without any degradation of performance and prove especially useful in the context of database query optimisation. © 1985 Academic Press, Inc.
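The idea of estimating a distinct count from bit patterns of hashed records can be sketched as follows. This is a minimal illustration of the bitmap-averaging variant and not the paper's exact procedure: the SHA-256 hash, the 64 bitmaps, and the names `estimate_distinct`, `_hash64`, and `_rho` are assumptions made for the example, and 0.77351 is the correction constant commonly quoted for this estimator.

```python
import hashlib

PHI = 0.77351  # correction constant commonly quoted for this estimator


def _hash64(item):
    """Map an item to a 64-bit integer (SHA-256 is an assumption of this sketch)."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def _rho(x, width=64):
    """Index of the lowest set bit of x, or `width` if x is zero."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


def estimate_distinct(items, num_maps=64):
    """Single-pass estimate of the number of distinct items."""
    bitmaps = [0] * num_maps
    for item in items:
        h = _hash64(item)
        bucket = h % num_maps   # choose one bitmap per record
        rest = h // num_maps    # remaining hash bits drive the bit position
        bitmaps[bucket] |= 1 << _rho(rest)
    # For each bitmap, R is the index of its lowest zero bit.
    total_r = 0
    for bitmap in bitmaps:
        r = 0
        while bitmap & (1 << r):
            r += 1
        total_r += r
    return num_maps / PHI * 2 ** (total_r / num_maps)
```

Because each record only sets a bit with a bitwise OR, repeating a record leaves the bitmaps unchanged, which is the insensitivity to duplicates that the abstract describes. Bitmaps built on separate machines can likewise be combined with a bitwise OR. The estimate is known to be unreliable when the number of distinct items is small relative to `num_maps`.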
Duplicate record elimination in large data files
 ACM Trans. on Database Systems
Cited by 87 (0 self)
The issue of duplicate elimination for large data files in which many occurrences of the same record may appear is addressed. A comprehensive cost analysis of the duplicate elimination operation is presented. This analysis is based on a combinatorial model developed for estimating the size of intermediate runs produced by a modified mergesort procedure. The performance of this modified mergesort procedure is demonstrated to be significantly superior to the standard duplicate elimination technique of sorting followed by a sequential pass to locate duplicate records. The results can also be used to provide critical input to a query optimizer in a relational database system.
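A minimal in-memory sketch of the general idea, dropping duplicates during the merge steps instead of in a separate pass after sorting, might look like this. It is not the paper's external procedure or its cost model; the function names `merge_unique` and `sort_unique` are hypothetical.

```python
def merge_unique(left, right):
    """Merge two sorted, duplicate-free runs, keeping one copy of shared records."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            out.append(left[i])
            i += 1
        elif right[j] < left[i]:
            out.append(right[j])
            j += 1
        else:
            # Same record in both runs: emit it once and advance both sides.
            out.append(left[i])
            i += 1
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def sort_unique(records):
    """Merge sort that eliminates duplicates as runs are merged."""
    if len(records) <= 1:
        return list(records)
    mid = len(records) // 2
    return merge_unique(sort_unique(records[:mid]), sort_unique(records[mid:]))
```

Since every intermediate run is already duplicate-free, runs shrink as soon as duplicates meet, so less data is carried through the later merge passes.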
The ins and outs of the probabilistic model checker MRMC
 in Proc. QEST’09
, 2009
Cited by 74 (17 self)
The Markov Reward Model Checker (MRMC) is a software tool for verifying properties over probabilistic models. It supports PCTL and CSL model checking, and their reward extensions. Distinguishing features of MRMC are its support for computing time- and reward-bounded reachability probabilities, (property-driven) bisimulation minimization, and precise on-the-fly steady-state detection. Recent tool features include time-bounded reachability analysis for uniform CTMDPs and CSL model checking by discrete-event simulation. This paper presents the tool’s current status and its implementation details.
Efficient External-Memory Data Structures and Applications
, 1996
Cited by 38 (9 self)
In this thesis we study the Input/Output (I/O) complexity of large-scale problems arising e.g. in the areas of database systems, geographic information systems, VLSI design systems and computer graphics, and design I/O-efficient algorithms for them. A general theme in our work is to design I/O-efficient algorithms through the design of I/O-efficient data structures. One of our philosophies is to try to isolate all the I/O-specific parts of an algorithm in the data structures, that is, to try to design I/O algorithms from internal memory algorithms by exchanging the data structures used in internal memory with their external memory counterparts. The results in the thesis include a technique for transforming an internal memory tree data structure into an external data structure which can be used in a batched dynamic setting, that is, a setting where we for example do not require that the result of a search operation is returned immediately. Using this technique we develop batched dynamic external versions of the (one-dimensional) range-tree and the segment-tree and we develop an external priority queue. Following our general philosophy we show how these structures can be used in standard internal memory sorting algorithms
A General Lower Bound on the I/O-Complexity of Comparison-based Algorithms
 In Proc. Workshop on Algorithms and Data Structures, LNCS 709
, 1993
Cited by 36 (10 self)
We show a general relationship between the number of comparisons and the number of I/O-operations needed to solve a given problem. This relationship enables one to show lower bounds on the number of I/O-operations needed to solve a problem whenever a lower bound on the number of comparisons is known. We use the result to show lower bounds on the I/O-complexity of a number of problems where known techniques only give trivial bounds. Among these are the problem of removing duplicates from a multiset, a problem of great importance in e.g. relational database systems, and the problem of determining the mode (the most frequently occurring element) of a multiset. We develop algorithms for these problems in order to show that the lower bounds are tight.
Linear-Space Data Structures for Range Mode Query in Arrays
Cited by 18 (8 self)
A mode of a multiset S is an element a ∈ S of maximum multiplicity; that is, a occurs at least as frequently as any other element in S. Given an array A[1:n] of n elements, we consider a basic problem: constructing a static data structure that efficiently answers range mode queries on A. Each query consists of an input pair of indices (i, j) for which a mode of A[i:j] must be returned. The best previous data structure with linear space, by Krizanc, Morin, and Smid (ISAAC 2003), requires O(√n log log n) query time. We improve their result and present an O(n)-space data structure that supports range mode queries in O(√(n / log n)) worst-case time. Furthermore, we present strong evidence that a query time significantly below √n cannot be achieved by purely combinatorial techniques; we show that boolean matrix multiplication of two √n × √n matrices reduces to n range mode queries in an array of size O(n). Additionally, we give linear-space data structures for orthogonal range mode in higher dimensions (queries in near O(n^(1−1/2d)) time) and for halfspace range mode in higher dimensions (queries in O(n^(1−1/d^2)) time).
Large alphabets and incompressibility
 Information Processing Letters
Cited by 16 (5 self)
We briefly survey some concepts related to empirical entropy (normal numbers, de Bruijn sequences and Markov processes) and investigate how well it approximates Kolmogorov complexity. Our results suggest ℓth-order empirical entropy stops being a reasonable complexity metric for almost all strings of length m over alphabets of size n about when n^ℓ surpasses m. Key words: Data compression
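For readers unfamiliar with the quantity, the usual textbook definition of ℓth-order empirical entropy can be computed as in the sketch below. The definition used here (the length-weighted zeroth-order entropy of the symbols following each length-ℓ context, divided by the string length) is assumed from standard usage, not taken from the paper, and the function names `h0` and `hk` are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def h0(s):
    """Zeroth-order empirical entropy of a sequence, in bits per symbol."""
    m = len(s)
    if m == 0:
        return 0.0
    return sum(c / m * log2(m / c) for c in Counter(s).values())


def hk(s, k):
    """k-th order empirical entropy of string s, in bits per symbol."""
    if k == 0:
        return h0(s)
    if not s:
        return 0.0
    # Collect, for every length-k context, the symbols that follow it.
    followers = defaultdict(list)
    for i in range(k, len(s)):
        followers[s[i - k:i]].append(s[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)
```

For example, `h0("abab")` is 1.0 bit per symbol while `hk("abab", 1)` is 0.0, because each symbol is fully determined by the one before it. Once the number of possible contexts n^ℓ exceeds the string length m, most contexts occur at most once, so the measured entropy tends toward zero regardless of the string's structure.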
Grouping and Duplicate Elimination: Benefits of Early Aggregation
, 1997
Cited by 15 (1 self)
Early aggregation is a technique for speeding up the processing of GROUP BY queries by reducing the amount of intermediate data transferred between main memory and disk. It can also be applied to duplicate elimination because duplicate elimination is equivalent to grouping with no aggregation functions. This paper describes six different algorithms for grouping and aggregation, shows how to incorporate early aggregation in each of them, and analyzes the resulting reduction in intermediate data. In addition to the grouping algorithm used, the reduction depends on several factors: the number of groups, the skew in group size distribution, the input size, and the amount of main memory available. All six algorithms considered benefit from early aggregation with grouping by hash partitioning producing the least amount of intermediate data. If the group size distribution is skewed, the overall reduction can be very significant, even with a modest amount of additional main memory.
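The effect of early aggregation can be illustrated with a small sketch, assuming a SUM aggregate and a hash table limited to `max_groups` entries. It is not one of the six algorithms analyzed in the paper; the spill policy (flush the whole table when it is full) and the names `early_aggregate` and `final_aggregate` are assumptions made for the example.

```python
from collections import defaultdict


def early_aggregate(rows, max_groups):
    """Combine (key, value) rows in a bounded table; return the spilled partial sums."""
    table = {}
    spilled = []  # stands in for intermediate data written to disk
    for key, value in rows:
        if key in table:
            table[key] += value            # aggregated early, nothing spilled
        elif len(table) < max_groups:
            table[key] = value
        else:
            spilled.extend(table.items())  # table full: flush partial aggregates
            table = {key: value}
    spilled.extend(table.items())
    return spilled


def final_aggregate(partials):
    """Merge partial sums into one total per group."""
    totals = defaultdict(int)
    for key, value in partials:
        totals[key] += value
    return dict(totals)
```

With `rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("a", 5)]` and `max_groups=2`, four partial records are spilled instead of five input rows, and the final totals are unchanged. Duplicate elimination corresponds to the same scheme with no value to combine, only the keys.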
Instance-optimal geometric algorithms
Cited by 15 (2 self)
... in 2-d and 3-d, and offline point location in 2-d. We prove the existence of an algorithm A for computing 2-d or 3-d convex hulls that is optimal for every point set in the following sense: for every set S of n points and for every algorithm A′ in a certain class 𝒜, the maximum running time of A on input ⟨s1, ..., sn⟩ is at most a constant factor times the maximum running time of A′ on ⟨s1, ..., sn⟩, where the maximum is taken over all permutations ⟨s1, ..., sn⟩ of S. In fact, we can establish a stronger property: for every S and A′, the maximum running time of A is at most a constant factor times the average running time of A′ over all permutations of S. We call algorithms satisfying these properties instance-optimal in the order-oblivious and random-order setting. Such instance-optimal algorithms simultaneously subsume output-sensitive algorithms and distribution-dependent average-case algorithms, and all algorithms that do not take advantage of the order of the input or that assume the input is given in a random order. The class 𝒜 under consideration consists of all algorithms in a decision tree model where the tests involve only multilinear functions with a constant number of arguments. To establish an instance-specific lower bound, we deviate from traditional Ben-Or-style proofs and adopt an interesting adversary argument. For 2-d convex hulls, we prove that a version of the well-known algorithm by Kirkpatrick and Seidel (1986) or Chan, Snoeyink, and Yap (1995) already attains this lower bound. For 3-d convex hulls, we propose a new algorithm. To demonstrate the potential of the concept, we further obtain instance-optimal results for a few other standard problems in computational geometry, such as maxima in 2-d and 3-d, orthogonal line segment intersection in 2-d, finding bichromatic L∞-close pairs in 2-d, offline orthogonal range searching in 2-d, offline dominance reporting in 2-d and 3-d, offline halfspace range reporting