Results 1  10
of
329
MapReduce: Simplified Data Processing on Large Clusters
, 2004
"... MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with t ..."
Abstract

Cited by 3236 (3 self)
 Add to MetaCart
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The runtime system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required intermachine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.
Evaluating MapReduce for multicore and multiprocessor systems
 In HPCA ’07: Proceedings of the 13th International Symposium on HighPerformance Computer Architecture
, 2007
"... This paper evaluates the suitability of the MapReduce model for multicore and multiprocessor systems. MapReduce was created by Google for application development on datacenters with thousands of servers. It allows programmers to write functionalstyle code that is automatically parallelized and s ..."
Abstract

Cited by 248 (3 self)
 Add to MetaCart
(Show Context)
This paper evaluates the suitability of the MapReduce model for multicore and multiprocessor systems. MapReduce was created by Google for application development on datacenters with thousands of servers. It allows programmers to write functionalstyle code that is automatically parallelized and scheduled in a distributed system. We describe Phoenix, an implementation of MapReduce for sharedmemory systems that includes a programming API and an efficient runtime system. The Phoenix runtime automatically manages thread creation, dynamic task scheduling, data partitioning, and fault tolerance across processor nodes. We study Phoenix with multicore and symmetric multiprocessor systems and evaluate its performance potential and error recovery features. We also compare MapReduce code to code written in lowerlevel APIs such as Pthreads. Overall, we establish that, given a careful implementation, MapReduce is a promising model for scalable performance on sharedmemory systems with simple parallel code. 1
Prefix Sums and Their Applications
"... Experienced algorithm designers rely heavily on a set of building blocks and on the tools needed to put the blocks together into an algorithm. The understanding of these basic blocks and tools is therefore critical to the understanding of algorithms. Many of the blocks and tools needed for parallel ..."
Abstract

Cited by 128 (2 self)
 Add to MetaCart
(Show Context)
Experienced algorithm designers rely heavily on a set of building blocks and on the tools needed to put the blocks together into an algorithm. The understanding of these basic blocks and tools is therefore critical to the understanding of algorithms. Many of the blocks and tools needed for parallel
Generic ILP versus Specialized 01 ILP: An Update
 IN INTERNATIONAL CONFERENCE ON COMPUTERAIDED DESIGN
, 2002
"... Optimized solvers for the Boolean Satisfiability (SAT) problem have many applications in areas such as hardware and software verification, FPGA routing, planning, etc. Further uses are complicated by the need to express "counting constraints" in conjunctive normal form (CNF). Expressing su ..."
Abstract

Cited by 97 (23 self)
 Add to MetaCart
Optimized solvers for the Boolean Satisfiability (SAT) problem have many applications in areas such as hardware and software verification, FPGA routing, planning, etc. Further uses are complicated by the need to express "counting constraints" in conjunctive normal form (CNF). Expressing such constraints by pure CNF leads to more complex SAT instances. Alternatively, those constraints can be handled by Integer Linear Programming (ILP), but generic ILP solvers may ignore the Boolean nature of 01 variables. Therefore specialized 01 ILP solvers extend SAT solvers to handle these socalled "pseudoBoolean" constraints. This work
Flattened butterfly: A costefficient topology for highradix networks
 in Proc. of the Intl. Symp. on Computer Architecture
, 2007
"... Increasing integratedcircuit pin bandwidth has motivated a corresponding increase in the degree or radix of interconnection networks and their routers. This paper introduces the flattened butterfly, a costefficient topology for highradix networks. On benign (loadbalanced) traffic, the flattened b ..."
Abstract

Cited by 94 (12 self)
 Add to MetaCart
(Show Context)
Increasing integratedcircuit pin bandwidth has motivated a corresponding increase in the degree or radix of interconnection networks and their routers. This paper introduces the flattened butterfly, a costefficient topology for highradix networks. On benign (loadbalanced) traffic, the flattened butterfly approaches the cost/performance of a butterfly network and has roughly half the cost of a comparable performance Clos network. The advantage over the Clos is achieved by eliminating redundant hops when they are not needed for load balance. On adversarial traffic, the flattened butterfly matches the cost/performance of a foldedClos network and provides an order of magnitude better performance than a conventional butterfly. In this case, global adaptive routing is used to switch the flattened butterfly from minimal to nonminimal routing — using redundant hops only when they are needed. Minimal and nonminimal, oblivious and adaptive routing algorithms are evaluated on the flattened butterfly. We show that loadbalancing adversarial traffic requires nonminimal globallyadaptive routing and show that sequential allocators are required to avoid transient load imbalance when using adaptive routing algorithms. We also compare the cost of the flattened butterfly to foldedClos, hypercube, and butterfly networks with identical capacity and show that the flattened butterfly is more costefficient than foldedClos and hypercube topologies.
Provably efficient scheduling for languages with finegrained parallelism
 IN PROC. SYMPOSIUM ON PARALLEL ALGORITHMS AND ARCHITECTURES
, 1995
"... Many highlevel parallel programming languages allow for finegrained parallelism. As in the popular worktime framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A ..."
Abstract

Cited by 94 (27 self)
 Add to MetaCart
(Show Context)
Many highlevel parallel programming languages allow for finegrained parallelism. As in the popular worktime framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time, but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on p processors can use a factor of p or larger more space than a sequential implementation of the same program. This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any
Special Purpose Parallel Computing
 Lectures on Parallel Computation
, 1993
"... A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing ..."
Abstract

Cited by 81 (6 self)
 Add to MetaCart
A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing [365] demonstrated that, in principle, a single general purpose sequential machine could be designed which would be capable of efficiently performing any computation which could be performed by a special purpose sequential machine. The importance of this universality result for subsequent practical developments in computing cannot be overstated. It showed that, for a given computational problem, the additional efficiency advantages which could be gained by designing a special purpose sequential machine for that problem would not be great. Around 1944, von Neumann produced a proposal [66, 389] for a general purpose storedprogram sequential computer which captured the fundamental principles of...
Fast parallel circuits for the quantum Fourier transform
 PROCEEDINGS 41ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE (FOCS’00)
, 2000
"... We give new bounds on the circuit complexity of the quantum Fourier transform (QFT). We give an upper bound of O(log n + log log(1/ε)) on the circuit depth for computing an approximation of the QFT with respect to the modulus 2 n with error bounded by ε. Thus, even for exponentially small error, our ..."
Abstract

Cited by 70 (1 self)
 Add to MetaCart
(Show Context)
We give new bounds on the circuit complexity of the quantum Fourier transform (QFT). We give an upper bound of O(log n + log log(1/ε)) on the circuit depth for computing an approximation of the QFT with respect to the modulus 2 n with error bounded by ε. Thus, even for exponentially small error, our circuits have depth O(log n). The best previous depth bound was O(n), even for approximations with constant error. Moreover, our circuits have size O(n log(n/ε)). We also give an upper bound of O(n(log n) 2 log log n) on the circuit size of the exact QFT modulo 2 n, for which the best previous bound was O(n 2). As an application of the above depth bound, we show that Shor’s factoring algorithm may be based on quantum circuits with depth only O(log n) and polynomialsize, in combination with classical polynomialtime pre and postprocessing. In the language of computational complexity, this implies that factoring is in the complexity class ZPP BQNC, where BQNC is the class of problems computable with boundederror probability by quantum circuits with polylogarithmic depth and polynomial size. Finally, we prove an Ω(log n) lower bound on the depth complexity of approximations of the