MapReduce: Simplified data processing on large clusters.
 In Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI '04)
, 2004
Cited by 3439 (3 self)
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The runtime system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
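As a rough illustration of the functional style this abstract refers to, here is a minimal single-process sketch of a word count. The abstract does not spell out an interface, so `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the driver runs sequentially; it only stands in for the partitioning, scheduling, and failure handling that a real runtime would perform across machines.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Emit an intermediate (word, 1) pair for every word in one document."""
    for word in text.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Combine all intermediate values that share the same key."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequential stand-in for the runtime: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Each key's reduce call is independent, which is what a distributed
    # runtime could exploit to run reductions in parallel.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


docs = [
    ("a.txt", "the quick brown fox"),
    ("b.txt", "the lazy dog the end"),
]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

The user code consists only of the two small functions; everything else belongs to the driver, which is the part a cluster implementation would replace.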
Evaluating MapReduce for multicore and multiprocessor systems
 In HPCA ’07: Proceedings of the 13th International Symposium on High-Performance Computer Architecture
, 2007
Cited by 256 (3 self)
This paper evaluates the suitability of the MapReduce model for multicore and multiprocessor systems. MapReduce was created by Google for application development on datacenters with thousands of servers. It allows programmers to write functional-style code that is automatically parallelized and scheduled in a distributed system. We describe Phoenix, an implementation of MapReduce for shared-memory systems that includes a programming API and an efficient runtime system. The Phoenix runtime automatically manages thread creation, dynamic task scheduling, data partitioning, and fault tolerance across processor nodes. We study Phoenix with multicore and symmetric multiprocessor systems and evaluate its performance potential and error recovery features. We also compare MapReduce code to code written in lower-level APIs such as Pthreads. Overall, we establish that, given a careful implementation, MapReduce is a promising model for scalable performance on shared-memory systems with simple parallel code.
Scans as primitive parallel operations
 IEEE Trans. Computers
, 1989
Cited by 187 (13 self)
In most parallel random access machine (PRAM) models, memory references are assumed to take unit time. In practice, and in theory, certain scan operations, also known as prefix computations, can execute in no more time than these parallel memory references. This paper outlines an extensive study of the effect of including, in the PRAM models, such scan operations as unit-time primitives. The study concludes that the primitives improve the asymptotic running time of many algorithms by an O(log n) factor, greatly simplify the description of many algorithms, and are significantly easier to implement than memory references. We therefore argue that the algorithm designer should feel free to use these operations as if they were as cheap as a memory reference. This paper describes five algorithms that clearly illustrate how the scan primitives can be used in algorithm design: a radix sort algorithm, a quicksort algorithm, a minimum-spanning-tree algorithm, a line-drawing algorithm, and a merging algorithm. These all run on an EREW PRAM with the addition of two scan primitives, and are either simpler or more efficient than their pure PRAM counterparts. The scan primitives have been implemented in microcode on the Connection Machine System, are available in PARIS (the parallel instruction set of the machine), and are used in a large number of applications. All five algorithms have been tested, and the radix sort is the currently supported sorting algorithm for the Connection Machine.
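To make the idea of building algorithms from scans concrete, the sketch below simulates an exclusive prefix sum sequentially and uses it for a stable, bit-by-bit radix sort. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, keys are assumed to be non-negative integers below `2**num_bits`, and the loops stand in for steps that a PRAM would execute across all elements at once.

```python
import operator


def exclusive_scan(values, op=operator.add, identity=0):
    """Return [identity, v0, v0 op v1, ...]; element i excludes values[i]."""
    out = []
    acc = identity
    for v in values:
        out.append(acc)
        acc = op(acc, v)
    return out


def split_by_bit(keys, bit):
    """Stably move keys whose given bit is 0 ahead of keys whose bit is 1."""
    flags = [(k >> bit) & 1 for k in keys]
    zeros_before = exclusive_scan([1 - f for f in flags])  # slots for 0-keys
    ones_before = exclusive_scan(flags)                    # slots for 1-keys
    total_zeros = len(keys) - sum(flags)
    out = [None] * len(keys)
    for i, k in enumerate(keys):
        # Each destination depends only on scan results, so on a parallel
        # machine every element could be written independently.
        dest = total_zeros + ones_before[i] if flags[i] else zeros_before[i]
        out[dest] = k
    return out


def radix_sort(keys, num_bits):
    """Least-significant-bit-first radix sort built from repeated splits."""
    for bit in range(num_bits):
        keys = split_by_bit(keys, bit)
    return keys


prefix = exclusive_scan([3, 1, 7, 0, 4])
sorted_keys = radix_sort([5, 7, 3, 1, 4, 2, 7, 2], num_bits=3)
```

Because each split is stable, sorting on bit 0 first and the highest bit last leaves the keys fully ordered after `num_bits` passes.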
Prefix Sums and Their Applications
Cited by 131 (2 self)
Experienced algorithm designers rely heavily on a set of building blocks and on the tools needed to put the blocks together into an algorithm. The understanding of these basic blocks and tools is therefore critical to the understanding of algorithms. Many of the blocks and tools needed for parallel
Generic ILP versus Specialized 0-1 ILP: An Update
 IN INTERNATIONAL CONFERENCE ON COMPUTER-AIDED DESIGN
, 2002
Cited by 97 (23 self)
Optimized solvers for the Boolean Satisfiability (SAT) problem have many applications in areas such as hardware and software verification, FPGA routing, planning, etc. Further uses are complicated by the need to express "counting constraints" in conjunctive normal form (CNF). Expressing such constraints by pure CNF leads to more complex SAT instances. Alternatively, those constraints can be handled by Integer Linear Programming (ILP), but generic ILP solvers may ignore the Boolean nature of 0-1 variables. Therefore specialized 0-1 ILP solvers extend SAT solvers to handle these so-called "pseudo-Boolean" constraints. This work
Provably efficient scheduling for languages with fine-grained parallelism
 IN PROC. SYMPOSIUM ON PARALLEL ALGORITHMS AND ARCHITECTURES
, 1995
Cited by 95 (28 self)
Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time, but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on p processors can use a factor of p or larger more space than a sequential implementation of the same program. This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any
Flattened butterfly: A cost-efficient topology for high-radix networks
 in Proc. of the Intl. Symp. on Computer Architecture
, 2007
Cited by 95 (12 self)
Increasing integrated-circuit pin bandwidth has motivated a corresponding increase in the degree or radix of interconnection networks and their routers. This paper introduces the flattened butterfly, a cost-efficient topology for high-radix networks. On benign (load-balanced) traffic, the flattened butterfly approaches the cost/performance of a butterfly network and has roughly half the cost of a comparable-performance Clos network. The advantage over the Clos is achieved by eliminating redundant hops when they are not needed for load balance. On adversarial traffic, the flattened butterfly matches the cost/performance of a folded-Clos network and provides an order of magnitude better performance than a conventional butterfly. In this case, global adaptive routing is used to switch the flattened butterfly from minimal to non-minimal routing, using redundant hops only when they are needed. Minimal and non-minimal, oblivious and adaptive routing algorithms are evaluated on the flattened butterfly. We show that load-balancing adversarial traffic requires non-minimal globally-adaptive routing and show that sequential allocators are required to avoid transient load imbalance when using adaptive routing algorithms. We also compare the cost of the flattened butterfly to folded-Clos, hypercube, and butterfly networks with identical capacity and show that the flattened butterfly is more cost-efficient than folded-Clos and hypercube topologies.
Special Purpose Parallel Computing
 Lectures on Parallel Computation
, 1993
Cited by 82 (6 self)
A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing [365] demonstrated that, in principle, a single general purpose sequential machine could be designed which would be capable of efficiently performing any computation which could be performed by a special purpose sequential machine. The importance of this universality result for subsequent practical developments in computing cannot be overstated. It showed that, for a given computational problem, the additional efficiency advantages which could be gained by designing a special purpose sequential machine for that problem would not be great. Around 1944, von Neumann produced a proposal [66, 389] for a general purpose stored-program sequential computer which captured the fundamental principles of...
Fast parallel circuits for the quantum Fourier transform
 PROCEEDINGS 41ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE (FOCS’00)
, 2000
Cited by 72 (1 self)
We give new bounds on the circuit complexity of the quantum Fourier transform (QFT). We give an upper bound of O(log n + log log(1/ε)) on the circuit depth for computing an approximation of the QFT with respect to the modulus 2^n with error bounded by ε. Thus, even for exponentially small error, our circuits have depth O(log n). The best previous depth bound was O(n), even for approximations with constant error. Moreover, our circuits have size O(n log(n/ε)). We also give an upper bound of O(n (log n)^2 log log n) on the circuit size of the exact QFT modulo 2^n, for which the best previous bound was O(n^2). As an application of the above depth bound, we show that Shor’s factoring algorithm may be based on quantum circuits with depth only O(log n) and polynomial size, in combination with classical polynomial-time pre- and post-processing. In the language of computational complexity, this implies that factoring is in the complexity class ZPP^BQNC, where BQNC is the class of problems computable with bounded-error probability by quantum circuits with polylogarithmic depth and polynomial size. Finally, we prove an Ω(log n) lower bound on the depth complexity of approximations of the