STXXL: Standard template library for XXL data sets
- In: Proc. of ESA 2005. Volume 3669 of LNCS, 2005
Cited by 30 (4 self)
... for processing huge data sets that can fit only on hard disks. It supports parallel disks and overlapping of disk I/O with computation, and it is the first I/O-efficient algorithm library that supports the pipelining technique, which can save more than half of the I/Os. STXXL has been applied in both academic and industrial environments to a range of problems, including text processing, graph algorithms, computational geometry, Gaussian elimination, visualization, analysis of microscopic images, and differential cryptographic analysis. The performance of STXXL and its applications is evaluated on synthetic and real-world inputs. We present the design of the library, explain how its performance features are supported, and demonstrate how the library integrates with STL. KEY WORDS: very large data sets; software library; C++ standard template library; algorithm engineering
I/O-efficient batched union-find and its applications to terrain analysis
- In Proc. 22nd Annual Symposium on Computational Geometry, 2006
Cited by 14 (8 self)
Despite extensive study over the last four decades and numerous applications, no I/O-efficient algorithm is known for the union-find problem. In this paper we present an I/O-efficient algorithm for the batched (off-line) version of the union-find problem. Given any sequence of N union and find operations, where each union operation joins two distinct sets, our algorithm uses O(SORT(N)) = O((N/B) log_{M/B}(N/B)) I/Os, where M is the memory size and B is the disk block size. This bound is asymptotically optimal in the worst case. If there are union operations that join a set with itself, our algorithm uses O(SORT(N) + MST(N)) I/Os, where MST(N) is the number of I/Os needed to compute the minimum spanning tree of a graph with N edges. We also describe a simple and practical O(SORT(N) log(N/M))-I/O algorithm for this problem, which we have implemented. We are interested in the union-find problem because of its applications in terrain analysis. A terrain can be abstracted as a height function defined over R^2, and many problems that deal with such functions require a union-find data structure. With the emergence of modern mapping technologies, huge amounts of elevation data are being generated that are too large to fit in memory, so I/O-efficient algorithms are needed to process this data efficiently. In this paper, we study two terrain-analysis problems that benefit from a union-find data structure: (i) computing topological persistence and (ii) constructing the contour tree. We give the first O(SORT(N))-I/O algorithms for these two problems, assuming that the input terrain is represented as a triangular mesh with N vertices. Finally, we report some preliminary experimental results, showing that our algorithms give an order-of-magnitude improvement over previous methods on large data sets that do not fit in memory.
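To get a feel for the SORT(N) term quoted in this abstract, the following minimal sketch evaluates (N/B) log_{M/B}(N/B) for concrete values. The function name `sort_io_bound` is hypothetical, constant factors are dropped, and clamping the logarithm to at least one pass is an assumption made here, not something the abstract states.

```python
import math


def sort_io_bound(n, m, b):
    """Evaluate (N/B) * log_{M/B}(N/B) without constant factors.

    n: number of items, m: items that fit in memory, b: items per disk block.
    The logarithm is clamped to 1 (assumption): even a single pass over the
    data reads every block once.
    """
    blocks = n / b      # N/B: number of blocks occupied by the input
    fan_out = m / b     # M/B: number of blocks that fit in memory
    if fan_out <= 1:
        raise ValueError("memory must hold more than one block")
    passes = max(1.0, math.log(blocks, fan_out))
    return blocks * passes
```

For example, with N = 2^30 items, M = 2^20 and B = 2^10, the input occupies 2^20 blocks and the logarithm equals 2, so the expression evaluates to roughly 2^21 block transfers.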
The filter-kruskal minimum spanning tree algorithm
- 2009
Cited by 4 (0 self)
We present Filter-Kruskal – a simple modification of Kruskal’s algorithm that avoids sorting edges that are “obviously” not in the MST. For arbitrary graphs with random edge weights, Filter-Kruskal runs in time O(m + n log n log(m/n)), i.e., in linear time for graphs that are not too sparse. Experiments indicate that the algorithm has very good practical performance over the entire range of edge densities. An equally simple parallelization seems to be the currently best practical algorithm on multicore machines.
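The abstract gives only the idea behind the algorithm, so the following is a rough sketch of one way that idea can be realized, not the authors' implementation. It partitions the edges around a pivot weight in quicksort fashion, handles the light edges first, and then discards heavy edges whose endpoints are already connected before recursing on them. The names `DisjointSets`, `filter_kruskal` and `minimum_spanning_forest`, the random pivot choice, and the `threshold` value are all assumptions made for illustration.

```python
import random


class DisjointSets:
    """Plain in-memory union-find with path halving."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def kruskal(edges, sets, tree):
    """Classic Kruskal step: sort the edges, keep those joining two components."""
    for w, u, v in sorted(edges):
        if sets.union(u, v):
            tree.append((w, u, v))


def filter_kruskal(edges, sets, tree, threshold=8, rng=random):
    """Edges are (weight, u, v) tuples; threshold is an arbitrary cut-off."""
    if len(edges) <= threshold:
        kruskal(edges, sets, tree)
        return
    pivot = rng.choice(edges)[0]
    light = [e for e in edges if e[0] <= pivot]
    heavy = [e for e in edges if e[0] > pivot]
    if not heavy:
        # Every edge is at most the pivot weight; fall back to avoid looping.
        kruskal(light, sets, tree)
        return
    filter_kruskal(light, sets, tree, threshold, rng)
    # Filter step: drop heavy edges that lie inside one component already.
    heavy = [e for e in heavy if sets.find(e[1]) != sets.find(e[2])]
    filter_kruskal(heavy, sets, tree, threshold, rng)


def minimum_spanning_forest(n, edges, threshold=8):
    sets = DisjointSets(n)
    tree = []
    filter_kruskal(list(edges), sets, tree, threshold)
    return tree
```

The filtered edges are never sorted, which is where the savings over plain Kruskal would come from on dense inputs.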
Design and Implementation of a Practical I/O-efficient Shortest Paths Algorithm
Cited by 2 (0 self)
We report on initial experimental results for a practical I/O-efficient Single-Source Shortest-Paths (SSSP) algorithm on general undirected sparse graphs where the ratio between the largest and the smallest edge weight is reasonably bounded (for example, integer weights in {1, ..., 2^32}) and the realistic assumption holds that main memory is big enough to keep one bit per vertex. While our implementation only guarantees average-case efficiency, i.e., assuming randomly chosen edge weights, it turns out that its performance on real-world instances with non-random edge weights is actually even better than on the respective inputs with random weights. Furthermore, compared to the currently best implementation for external-memory BFS [6], which in a sense constitutes a lower bound for SSSP, the running time of our approach always stayed within a factor of five; for the most difficult graph classes the difference was even less than a factor of two. We are not aware of any previous I/O-efficient implementation for the classic general SSSP in a (semi-)external setting: in two recent projects [10, 23], Kumar/Schwabe-like SSSP approaches on graphs of at most 6 million vertices have been tested, forcing the authors to artificially restrict the main memory size, M, to a rather unrealistic 4 to 16 MBytes in order not to leave the semi-external setting or produce huge running times for larger graphs: for random graphs of 2^20 vertices, the best previous approach needed over six hours. In contrast, for a similar ratio of input size vs. M, but on a 128 times larger and even sparser random graph, our approach was less than seven times slower, a relative gain of nearly 20. On a real-world 24 million node street graph, our implementation was over 40 times faster. Even larger gains of over 500 can be estimated for ran-
I/O-Efficient Batched Union-Find and Its . . .
Despite extensive study over the last four decades and numerous applications, no I/O-efficient algorithm is known for the union-find problem. In this paper we present an I/O-efficient algorithm for the batched (off-line) version of the union-find problem. Given any sequence of N mixed union and find operations, where each union operation joins two distinct sets, our algorithm uses O(SORT(N)) = O((N/B) log_{M/B}(N/B)) I/Os, where M is the memory size and B is the disk block size. This bound is asymptotically optimal in the worst case. If there are union operations that join a set with itself, our algorithm uses O(SORT(N) + MST(N)) I/Os, where MST(N) is the number of I/Os needed to compute the minimum spanning tree of a graph with N edges. We also describe a simple and practical O(SORT(N) log(N/M))-I/O algorithm, which we have implemented. The main motivation for our study of the union-find problem arises from problems in terrain analysis. A terrain can be abstracted as a height function defined over R^2, and many problems that deal with such functions require a union-find data structure. With the emergence of modern mapping technologies, huge amounts of data are being generated that are too large to fit in memory, so I/O-efficient algorithms are needed to process this data efficiently. In this paper, we study two terrain analysis problems that benefit from a union-find data structure: (i) computing topological persistence and (ii) constructing the contour tree. We give the first O(SORT(N))-I/O algorithms for these two problems, assuming that the input terrain is represented as a triangular mesh with N vertices. Finally, we report some preliminary experimental results, showing that our algorithms give an order-of-magnitude improvement over previous methods on large data sets that do not fit in memory.
Intersection in Integer Inverted Indices
Inverted index data structures are the key to fast search engines. The predominant operation on inverted indices is intersecting two sorted lists of document IDs whose lengths may differ vastly. We compare previous theoretical approaches, methods used in practice, and one new algorithm that exploits the fact that the intersection uses small integer keys. We also take different data compression techniques into account. The new algorithm is very fast, simple, has good space efficiency, and is the only algorithm that performs well over the entire spectrum of relative list length ratios.
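The abstract does not describe the new algorithm, so the sketch below shows only two standard baselines for the operation it studies: a linear merge, which suits lists of similar length, and a search-based variant, which suits a short list intersected with a much longer one. The function names are hypothetical, and both functions assume strictly increasing lists of integer document IDs.

```python
from bisect import bisect_left


def intersect_merge(a, b):
    """Zipper-style merge: time proportional to len(a) + len(b)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            out.append(a[i])
            i += 1
            j += 1
    return out


def intersect_search(short, long):
    """Binary-search each element of the short list in the long list.

    The search window only moves forward because both lists are sorted.
    """
    out = []
    lo = 0
    for x in short:
        lo = bisect_left(long, x, lo)
        if lo == len(long):
            break
        if long[lo] == x:
            out.append(x)
    return out
```

Neither baseline behaves well across all length ratios, which is the gap the abstract says the new algorithm addresses.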

