External-Memory Graph Algorithms
, 1995
We present a collection of new techniques for designing and analyzing efficient external-memory algorithms for graph problems and illustrate how these techniques can be applied to a wide variety of specific problems. Our results include:
• Proximate-neighboring. We present a simple method for deriving external-memory lower bounds via reductions from a problem we call the "proximate neighbors" problem. We use this technique to derive non-trivial lower bounds for such problems as list ranking, expression tree evaluation, and connected components.
• PRAM simulation. We give methods for efficiently simulating PRAM computations in external memory, even for some cases in which the PRAM algorithm is not work-optimal. We apply this to derive a number of optimal (and simple) external-memory graph algorithms.
• Time-forward processing. We present a general technique for evaluating circuits (or "circuit-like" computations) in external memory. We also use this in a deterministic list rank...
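The time-forward processing idea named in the abstract above can be illustrated with a small in-memory sketch. It assumes the commonly described priority-queue formulation, which is not necessarily the construction used in the paper: vertices are numbered in topological order, and each computed value is "sent forward in time" to its successors through a queue keyed by the receiving vertex. Python's `heapq` stands in for an external-memory priority queue, and the function name, node encoding, and `succ` adjacency lists are hypothetical choices made for this example.

```python
import heapq
from math import prod


def time_forward_eval(nodes, succ):
    """Evaluate a circuit whose vertices are numbered in topological order.

    nodes[v] is ("const", x), ("add",) or ("mul",).
    succ[v] lists the (later-numbered) vertices that consume v's output.
    """
    pq = []  # in-memory stand-in for an external priority queue, keyed by target vertex
    value = [None] * len(nodes)
    for v, node in enumerate(nodes):
        # Collect every message that earlier vertices addressed to v.
        incoming = []
        while pq and pq[0][0] == v:
            incoming.append(heapq.heappop(pq)[1])
        kind = node[0]
        if kind == "const":
            value[v] = node[1]
        elif kind == "add":
            value[v] = sum(incoming)
        elif kind == "mul":
            value[v] = prod(incoming)
        else:
            raise ValueError(f"unknown gate type: {kind}")
        # Send v's value forward to each successor.
        for w in succ[v]:
            if w <= v:
                raise ValueError("vertices must be numbered in topological order")
            heapq.heappush(pq, (w, value[v]))
    return value


# (2 + 3) * 4: vertices 0..2 are inputs, 3 is the adder, 4 the multiplier.
nodes = [("const", 2), ("const", 3), ("const", 4), ("add",), ("mul",)]
succ = [[3], [3], [4], [4], []]
print(time_forward_eval(nodes, succ))  # [2, 3, 4, 5, 20]
```

The point of the queue is that the evaluation loop never follows a pointer back to an earlier vertex; every value a vertex needs arrives at the front of the queue exactly when that vertex is processed.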
ANF: A Fast and Scalable Tool for Data Mining in Massive Graphs
 International Conference on Knowledge Discovery and Data Mining
, 2002
Graphs are an increasingly important data source, with such important graphs as the Internet and the Web. Other familiar graphs include CAD circuits, phone records, gene sequences, city streets, social networks and academic citations. Any kind of relationship, such as actors appearing in movies, can be represented as a graph. This work presents a data mining tool, called ANF, that can quickly answer a number of interesting questions on graph-represented data, such as the following. How robust is the Internet to failures? What are the most influential database papers? Are there gender differences in movie appearance patterns? At its core, ANF is based on a fast and memory-efficient approach for approximating the complete "neighbourhood function" for a graph. For the Internet graph (268K nodes), ANF's highly accurate approximation is more than 700 times faster than the exact computation. This reduces the running time from nearly a day to a matter of a minute or two, allowing users to perform ad hoc drill-down tasks and to repeatedly answer questions about changing data sources. To enable this drill-down, ANF employs new techniques for approximating neighbourhood-type functions for graphs with distinguished nodes and/or edges. When compared to the best existing approximation, ANF's approach is both faster and more accurate, given the same resources. Additionally, unlike previous approaches, ANF scales gracefully to handle disk-resident graphs. Finally, we present some of our results from mining large graphs using ANF.
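For orientation, the exact computation that ANF is compared against can be sketched as one breadth-first search per node. This is the brute-force baseline, not the ANF approximation itself. The sketch assumes the neighbourhood function N(h) counts ordered pairs (u, v), including u = v, with dist(u, v) <= h; the abstract does not spell out the definition, and the function name and adjacency-dict representation are hypothetical.

```python
from collections import deque


def neighbourhood_function(adj, max_h):
    """Return [N(0), ..., N(max_h)] for a graph given as {node: [neighbours]}."""
    exact = [0] * (max_h + 1)  # exact[d] = number of ordered pairs at distance exactly d
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == max_h:
                continue  # do not explore beyond the requested radius
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for d in dist.values():
            exact[d] += 1
    # Prefix sums turn "exactly d" counts into "at most h" counts.
    for h in range(1, max_h + 1):
        exact[h] += exact[h - 1]
    return exact


# Path graph 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(neighbourhood_function(path, 3))  # [4, 10, 14, 16]
```

Running one search per node is what makes the exact computation expensive on graphs with hundreds of thousands of nodes, which is the cost the abstract says ANF avoids.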
On the Analysis of Indexing Schemes
 In Proc. 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems
, 1997
We consider the problem of indexing general database workloads (combinations of data sets and sets of potential queries). We define a framework for measuring the efficiency of an indexing scheme for a workload based on two characterizations: storage redundancy (how many times each item in the data set is stored), and access overhead (how many times more blocks than necessary does a query retrieve). Using this framework we present some initial results, showing upper and lower bounds and trade-offs between them in the case of multi-dimensional range queries and set queries.
1 Introduction
The success and ubiquity of the relational data model arguably owes much to the B-tree, the access method breakthrough that accompanied it with superb timing [2]. It seems likely that access methods will continue to play an important role in, and largely determine the viability of, the novel data models currently under intense scrutiny in the database research community. The B-tree is widely recognized...
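The two measures described in the abstract can be made concrete on a toy workload. The sketch below is one plausible reading of the informal definitions, not the paper's formal framework: storage redundancy is taken as the average number of stored copies per item, and the access overhead of a query as the minimum number of blocks needed to cover it divided by the ideal count ceil(|Q| / B). The function names, the set-based block representation, and the brute-force search are all assumptions made for illustration.

```python
from itertools import combinations
from math import ceil


def storage_redundancy(items, blocks):
    """Average number of blocks in which each data item is stored."""
    return sum(len(block & items) for block in blocks) / len(items)


def access_overhead(query, blocks, block_size):
    """Minimum blocks needed to cover `query`, relative to ceil(|query| / block_size)."""
    if not query:
        raise ValueError("query must be non-empty")
    ideal = ceil(len(query) / block_size)
    # Brute force over block subsets; only sensible for toy inputs.
    for k in range(1, len(blocks) + 1):
        for chosen in combinations(blocks, k):
            if query <= set().union(*chosen):
                return k / ideal
    raise ValueError("query is not covered by the indexing scheme")


items = {1, 2, 3, 4, 5, 6}
blocks = [{1, 2}, {3, 4}, {5, 6}, {2, 3}]  # items 2 and 3 are stored twice
print(storage_redundancy(items, blocks))    # 8 copies / 6 items
print(access_overhead({2, 3}, blocks, 2))   # 1.0: one block suffices
print(access_overhead({1, 4}, blocks, 2))   # 2.0: two blocks for one block's worth of data
```

The toy scheme shows the trade-off the abstract refers to: storing the extra block {2, 3} raises redundancy but brings the overhead of the query {2, 3} down to 1.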
External Memory Data Structures
, 2001
In many massive dataset applications the data must be stored in space- and query-efficient data structures on external storage devices. Often the data needs to be changed dynamically. In this chapter we discuss recent advances in the development of provably worst-case efficient external memory dynamic data structures. We also briefly discuss some of the most popular external data structures used in practice.
Efficient External-Memory Data Structures and Applications
, 1996
In this thesis we study the Input/Output (I/O) complexity of large-scale problems arising e.g. in the areas of database systems, geographic information systems, VLSI design systems and computer graphics, and design I/O-efficient algorithms for them. A general theme in our work is to design I/O-efficient algorithms through the design of I/O-efficient data structures. One of our philosophies is to try to isolate all the I/O-specific parts of an algorithm in the data structures, that is, to try to design I/O algorithms from internal memory algorithms by exchanging the data structures used in internal memory with their external memory counterparts. The results in the thesis include a technique for transforming an internal memory tree data structure into an external data structure which can be used in a batched dynamic setting, that is, a setting where we for example do not require that the result of a search operation is returned immediately. Using this technique we develop batched dynamic external versions of the (one-dimensional) range tree and the segment tree and we develop an external priority queue. Following our general philosophy we show how these structures can be used in standard internal memory sorting algorithms
On External Memory MST, SSSP and Multiway Planar Graph Separation (Extended Abstract)
, 2000
Recently external memory graph algorithms have received considerable attention because massive graphs arise naturally in many applications involving massive data sets. Even though a large number of I/O-efficient graph algorithms have been developed, a number of fundamental problems still remain open. In this paper we develop improved algorithms for the problem of computing a minimum spanning tree of a general graph G = (V, E), as well as new algorithms for the single source shortest paths and the multi-way graph separation problems on planar graphs.
I/O-Efficient Algorithms for Problems on Grid-based Terrains (Extended Abstract)
 In Proc. Workshop on Algorithm Engineering and Experimentation
, 2000
Lars Arge, Laura Toma, Jeffrey Scott Vitter
Center for Geometric Computing, Department of Computer Science, Duke University, Durham, NC 27708-0129
Abstract
The potential and use of Geographic Information Systems (GIS) is rapidly increasing due to the increasing availability of massive amounts of geospatial data from projects like NASA's Mission to Planet Earth. However, the use of these massive datasets also exposes scalability problems with existing GIS algorithms. These scalability problems are mainly due to the fact that most GIS algorithms have been designed to minimize internal computation time, while I/O communication often is the bottleneck when processing massive amounts of data.
On external-memory MST, SSSP and multi-way planar graph separation
 In Proc. 8th Scandinavian Workshop on Algorithmic Theory, volume 1851 of LNCS
, 2000
Recently external memory graph algorithms have received considerable attention because massive graphs arise naturally in many applications involving massive data sets. Even though a large number of I/O-efficient graph algorithms have been developed, a number of fundamental problems still remain open. In this paper we develop an improved algorithm for the problem of computing a minimum spanning tree of a general graph, as well as new algorithms for the single source shortest paths and the multi-way graph separation problems on planar graphs.
Optimal Dynamic Range Searching in Non-replicating Index Structures
 In Proc. International Conference on Database Theory, LNCS 1540
, 1997
We consider the problem of dynamic range searching in tree structures that do not replicate data. We propose a new dynamic structure, called the O-tree, that achieves a query time complexity of $O(n^{(d-1)/d})$ on $n$ $d$-dimensional points and an amortized insertion/deletion time complexity of $O(\log n)$. We show that this structure is optimal when data is not replicated. In addition to optimal query and insertion/deletion times, the O-tree also supports exact match queries in worst-case logarithmic time.
1 Introduction
Given a set $S$ of $d$-dimensional points, a range query $q$ is specified by $d$ one-dimensional intervals $[q_i^s, q_i^e]$, one for each dimension $i$, and retrieves all points $p = (p_1, p_2, \ldots, p_d)$ in $S$ such that $\forall i \in \{1, \ldots, d\} : q_i^s \le p_i \le q_i^e$. This type of searching in multi-dimensional space has important applications in geographic information systems, image databases, and computer graphics. Several structures such as the range trees [3], P-range trees [29...
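The range-query definition in the introduction above translates directly into a predicate over points. The sketch below is a brute-force linear scan that only illustrates what a range query returns; it is not the O-tree and has none of its guarantees. The function name and the tuple-based representation of points and interval endpoints are assumptions for the example.

```python
def range_query(points, lower, upper):
    """Return the points p with lower[i] <= p[i] <= upper[i] in every dimension i."""
    return [
        p for p in points
        if all(lo <= x <= hi for lo, x, hi in zip(lower, p, upper))
    ]


# Two-dimensional example: the query box is [2, 6] x [2, 4].
points = [(1, 5), (3, 3), (4, 8), (6, 2)]
print(range_query(points, lower=(2, 2), upper=(6, 4)))  # [(3, 3), (6, 2)]
```

Both interval endpoints are inclusive, matching the closed intervals in the definition, so the point (6, 2) on the boundary of the box is reported.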
External-Memory Algorithms with Applications in Geographic Information Systems
 Algorithmic Foundations of GIS
, 1997
In the design of algorithms for large-scale applications it is essential to consider the problem of minimizing Input/Output (I/O) communication. Geographical information systems (GIS) are good examples of such large-scale applications as they frequently handle huge amounts of spatial data. In this note we survey the recent developments in external-memory algorithms with applications in GIS. First we discuss the Aggarwal-Vitter I/O-model and illustrate why normal internal-memory algorithms for even very simple problems can perform terribly in an I/O-environment. Then we describe the fundamental paradigms for designing I/O-efficient algorithms by using them to design efficient sorting algorithms. We then go on and survey external-memory algorithms for computational geometry problems, with special emphasis on problems with applications in GIS, and techniques for designing such algorithms: Using the orthogonal line segment intersection problem we illustrate the distribution-sweeping and ...
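The gap between I/O-aware and I/O-oblivious algorithms mentioned in the abstract above can be illustrated with a small cost calculation. The sketch uses the standard parameters of the I/O model, N items, an internal memory of M items and a block size of B items, none of which the abstract states. The merge-sort accounting is a simplified textbook version (one run-formation pass, then merge passes with fan-in M/B - 1, each pass reading and writing every block once), offered as an assumption for illustration. The function names are hypothetical.

```python
from math import ceil


def scan_ios(n, block):
    """I/Os needed to read n items sequentially, block items per transfer."""
    return ceil(n / block)


def merge_sort_ios(n, memory, block):
    """I/Os for a simplified external merge sort (read + write of every block per pass)."""
    fan_in = memory // block - 1  # one memory block is reserved as the output buffer
    if fan_in < 2:
        raise ValueError("memory must hold at least three blocks")
    runs = ceil(n / memory)  # sorted runs produced by the run-formation pass
    passes = 1
    while runs > 1:
        runs = ceil(runs / fan_in)
        passes += 1
    return 2 * scan_ios(n, block) * passes


n, memory, block = 1_000_000, 10_000, 100
print(scan_ios(n, block))                 # 10000
print(merge_sort_ios(n, memory, block))   # 60000
print(n)                                  # 1000000 I/Os if every item access costs one transfer
```

With these numbers an algorithm that touches items in an order unrelated to their layout on disk can pay up to one I/O per item, more than an order of magnitude above the blocked merge sort.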