Above the Clouds: A Berkeley View of Cloud Computing
, 2009
Cited by 163 (2 self)
personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission. Acknowledgement The RAD Lab's existence is due to the generous support of the founding members Google, Microsoft, and Sun Microsystems and of the affiliate members Amazon Web Services, Cisco Systems, Facebook, Hewlett-
Accelerating Large-Scale Data Exploration through Data Diffusion
- ACM International Workshop on Data-Aware Distributed Computing 2008
Cited by 15 (12 self)
Data-intensive applications often require exploratory analysis of large datasets. If analysis is performed on distributed resources, data locality can be crucial to high throughput and performance. We propose a “data diffusion” approach that acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data. As demand increases, more resources are acquired, thus allowing faster response to subsequent requests that refer to the same data; when demand drops, resources are released. This approach can provide the benefits of dedicated hardware without the associated high costs, depending on workload and resource characteristics. The approach is reminiscent of cooperative caching, web caching, and peer-to-peer storage systems, but addresses different application demands. Other data-aware scheduling approaches assume dedicated resources, which can be expensive and/or inefficient if load varies significantly. To explore the feasibility of the data diffusion approach, we have extended the Falkon resource provisioning and task scheduling system to support data caching and data-aware scheduling. Performance results from both microbenchmarks and a large-scale astronomy application demonstrate that our approach improves performance relative to alternative approaches and provides improved scalability, as aggregated I/O bandwidth scales linearly with the number of data cache nodes.
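The scheduling idea in this abstract (run a task where its data is already cached, and replicate data when demand for it grows) can be illustrated with a small sketch. This is not Falkon's API: the `Worker` and `DataAwareScheduler` classes, the LRU cache policy, the use of a cumulative task count as a load measure, and the `imbalance_limit` parameter are all assumptions made for illustration. Dynamic acquisition and release of workers is left out to keep the example short.

```python
from collections import OrderedDict


class Worker:
    """Hypothetical compute node with a small LRU cache of file identifiers."""

    def __init__(self, name, cache_slots):
        self.name = name
        self.cache_slots = cache_slots
        self.cache = OrderedDict()  # file_id -> True, oldest entry first
        self.tasks_run = 0          # cumulative count, used as a crude load proxy

    def run(self, file_id):
        """Run one task that reads file_id; return True on a cache hit."""
        hit = file_id in self.cache
        if hit:
            self.cache.move_to_end(file_id)      # mark as most recently used
        else:
            self.cache[file_id] = True           # fetch the file and keep a replica
            if len(self.cache) > self.cache_slots:
                self.cache.popitem(last=False)   # evict the least recently used file
        self.tasks_run += 1
        return hit


class DataAwareScheduler:
    """Prefers workers that already cache the data, unless they are overloaded."""

    def __init__(self, n_workers, cache_slots=2, imbalance_limit=2):
        self.workers = [Worker(f"w{i}", cache_slots) for i in range(n_workers)]
        self.imbalance_limit = imbalance_limit   # assumed tuning knob

    def submit(self, file_id):
        """Place one task; return (worker name, whether the read was a cache hit)."""
        least_loaded = min(self.workers, key=lambda w: w.tasks_run)
        holders = [w for w in self.workers if file_id in w.cache]
        chosen = least_loaded                    # default: a miss, which creates a replica
        if holders:
            best_holder = min(holders, key=lambda w: w.tasks_run)
            # Stay close to the data unless that worker is far busier than the rest.
            if best_holder.tasks_run - least_loaded.tasks_run <= self.imbalance_limit:
                chosen = best_holder
        return chosen.name, chosen.run(file_id)
```

In this sketch, repeated requests for one popular file first hit the single worker that caches it; once that worker runs more than `imbalance_limit` tasks ahead of the least-loaded worker, the next request goes elsewhere and leaves a second replica behind, which is one simple way to model replication in response to demand.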
Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets
- “Year 2 Status and Year 3 Proposal”, NASA, Ames Research
, 2008
Cited by 12 (10 self)
Large datasets are being produced at a very fast pace in the astronomy domain. In principle, these datasets are most valuable if and only if they are made available to the entire community, which may have tens to thousands of members. The astronomy community will generally want to perform various analyses on these datasets to be able to
On Distributing Symmetric Streaming Computations
Cited by 12 (1 self)
A common approach for dealing with large data sets is to stream over the input in one pass, and perform computations using sublinear resources. For truly massive data sets, however, even making a single pass over the data is prohibitive. Therefore, streaming computations must be distributed over many machines. In practice, obtaining significant speedups using distributed computation has numerous challenges including synchronization, load balancing, overcoming processor failures, and data distribution. Successful systems in practice such as Google’s MapReduce and Apache’s Hadoop address these problems by only allowing a certain class of highly distributable tasks defined by local computations that can be applied in any order to the input. The fundamental question that arises is: How does the class of computational tasks supported by these systems differ from the class for which streaming solutions exist? We introduce a simple algorithmic model for massive, unordered, distributed (mud) computation, as implemented by these systems. We show that in principle, mud algorithms are equivalent in power to symmetric streaming algorithms. More precisely, we show that any symmetric (order-invariant) function that can be computed by a streaming algorithm can also be computed by a mud algorithm, with comparable space and communication complexity. Our simulation uses Savitch’s theorem and therefore has superpolynomial time complexity. We extend our simulation result to some natural classes of approximate and randomized streaming algorithms. We also give negative results, using communication complexity arguments to prove that extensions to private randomness, promise problems and indeterminate functions are impossible. We also introduce an extension of the mud model to multiple keys and multiple rounds.
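As a rough illustration of "local computations that can be applied in any order to the input", the sketch below computes a symmetric function (the mean and range of a set of integers) by mapping each record to a small message, merging messages pairwise in a random order, and post-processing the final message. The function names `to_message`, `merge`, `finish` and `run_mud` are hypothetical and not taken from the paper; the random merge order merely stands in for an arbitrary distribution of work across machines.

```python
import random


def to_message(x):
    """Local map: turn one record into a (count, sum, min, max) message."""
    return (1, x, x, x)


def merge(a, b):
    """Combine two messages; commutative and associative, so order is irrelevant."""
    return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))


def finish(message):
    """Post-process the single remaining message into the final answer."""
    count, total, lowest, highest = message
    return {"mean": total / count, "range": highest - lowest}


def run_mud(records, seed=None):
    """Merge messages in an arbitrary (here: random) pairwise order."""
    if not records:
        raise ValueError("records must be non-empty")
    rng = random.Random(seed)
    messages = [to_message(x) for x in records]
    while len(messages) > 1:
        a = messages.pop(rng.randrange(len(messages)))
        b = messages.pop(rng.randrange(len(messages)))
        messages.append(merge(a, b))
    return finish(messages[0])
```

Because `merge` is commutative and associative on integer inputs, `run_mud([4, 8, 15, 16, 23, 42], seed=s)` gives the same result for every seed `s`, which is the order-invariance property the abstract refers to.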
Many-Task Computing for Grids and Supercomputers
- IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS08) 2008
Cited by 5 (3 self)
Many-task computing aims to bridge the gap between two computing paradigms, high throughput computing and high performance computing. Many-task computing differs from high throughput computing in its emphasis on using a large number of computing resources over short periods of time to accomplish many computational tasks (including both dependent and independent tasks), where primary metrics are measured in seconds (e.g. FLOPS, tasks/sec, MB/s I/O rates), as opposed to operations (e.g. jobs) per month. Many-task computing denotes high-performance computations comprising multiple distinct activities, coupled via file system operations. Tasks may be small or large, uniprocessor or multiprocessor, compute-intensive or data-intensive. The set of tasks may be static or dynamic, homogeneous or heterogeneous, loosely coupled or tightly coupled. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large. Many-task computing includes loosely coupled applications that are generally communication-intensive but not naturally expressed using the standard message passing interface commonly found in high performance computing, drawing attention to the many computations that are heterogeneous but not “happily” parallel.
On the complexity of processing massive, unordered, distributed data
, 2006
Cited by 3 (2 self)
The popular model for processing massive data sets is the streaming model, in which the processor makes a pass over the input with polylog(n) space. However, with current data sets, even making a single pass takes an unreasonable amount of time, and in practice, the solution is to distribute the data analysis over multiple streaming machines. Taking the cue from successful working distributed systems at Google and other places, we abstract a “massive, unordered, distributed” (MUD) model for computing. Our model allows efficient, flexible and robust computation of a certain class of functions over unordered sets of records. Roughly, these functions are ones that can be computed as x1 ⊕ x2 ⊕ · · · ⊕ xn, where ⊕ is a polylog(n)-space operation that can be applied in any order and still get the correct answer. We compare the computational power of our MUD model to a standard serial streaming model. We prove that for a broad class of problems, MUD algorithms can compute anything that a streaming algorithm can compute; that is, there is no penalty for obtaining the speedup from multiple machines in terms of what functions can be computed. However, for more general classes like approximate and promise problems, we use communication lower bounds to demonstrate problems that can be computed by a streaming algorithm,
The Quest for Scalable Support of Data Intensive Workloads in Distributed Systems, under review at
- ACM
Cited by 2 (1 self)
Data-intensive applications involving the analysis of large datasets often require large amounts of compute and storage resources, for which data locality can be crucial to high throughput and performance. We propose a “data diffusion” approach that acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data. As demand increases, more resources are acquired, thus allowing faster response to subsequent requests that refer to the same data; when demand drops, resources are released. This approach can provide the benefits of dedicated hardware without the associated high costs, depending on workload and resource characteristics. To explore the feasibility of data diffusion, we offer both a theoretical and an empirical analysis. We define an abstract model for data diffusion, introduce
TRUSTER: Trajectory data processing on clusters (demo paper)
- In DASFAA
, 2009
Cited by 1 (1 self)
With the continued advancement of location-based services and their supporting infrastructure, large amounts of time-based location data are quickly accumulated. Distributed processing techniques for such large trajectory data sets are urgently needed. We propose TRUSTER: a distributed trajectory data processing system on clusters. TRUSTER employs a distributed indexing method on large-scale trajectory data sets, and it makes spatio-temporal queries execute efficiently on clusters.
Tiled-MapReduce: Optimizing Resource Usages of Data-parallel Applications on Multicore with Tiling
Cited by 1 (0 self)
The prevalence of chip multiprocessors opens opportunities for running data-parallel applications, originally designed for clusters, on a single machine with many cores. MapReduce, a simple and elegant programming model for programming large-scale clusters, has recently been shown to be a promising alternative for harnessing the multicore platform. Differences such as memory hierarchy and communication patterns between clusters and multicore platforms raise new challenges in designing and implementing an efficient MapReduce system on multicore. This paper argues that it is more efficient for MapReduce to iteratively process small chunks of data in turn than to process a large chunk of data at one time on shared-memory multicore platforms. Based on this argument, we extend the general MapReduce programming model with a “tiling strategy”, called Tiled-MapReduce (TMR). TMR partitions a large MapReduce job into a number of small sub-jobs and iteratively processes one sub-job at a time with efficient use of resources; TMR finally merges the results of all sub-jobs for output. Based on Tiled-MapReduce, we design and implement several optimizing techniques targeting multicore, including the reuse of input and intermediate data structures among sub-jobs, a NUCA/NUMA-aware scheduler, and pipelining a sub-job’s reduce phase with the successive sub-job’s map phase, to optimize the memory, cache and CPU resources accordingly. We have implemented a prototype of Tiled-MapReduce based on Phoenix, an already highly optimized MapReduce runtime for shared-memory multiprocessors. The prototype, namely Ostrich, runs on an Intel machine with 16 cores. Experiments on four different types of benchmarks show that Ostrich saves up to 85% of memory, causes fewer cache misses and makes more efficient use of CPU cores, resulting in a speedup ranging from 1.2X to 3.3X.
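The tiling strategy described in this abstract (split a large job into small sub-jobs, run map and reduce on one sub-job at a time, then merge the partial results) can be sketched with a word count. This is a minimal single-threaded illustration, not the Ostrich or Phoenix API: the function names and the `tile_size` parameter are assumptions, and the sketch does not model data-structure reuse, NUCA/NUMA-aware scheduling, or pipelining between sub-jobs.

```python
from collections import Counter


def map_phase(lines):
    """Emit a (word, 1) pair for every word in the given lines."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_phase(pairs):
    """Sum the values for each key produced by the map phase."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def tiled_mapreduce(lines, tile_size):
    """Process the input one tile (sub-job) at a time and merge partial results."""
    if tile_size < 1:
        raise ValueError("tile_size must be at least 1")
    merged = Counter()
    for start in range(0, len(lines), tile_size):
        tile = lines[start:start + tile_size]     # one small sub-job
        partial = reduce_phase(map_phase(tile))   # intermediate data stays tile-sized
        merged.update(partial)                    # Counter.update adds the counts
    return merged
```

Because only one tile's intermediate pairs exist at any moment, peak intermediate memory depends on `tile_size` rather than on the full input, while the merged result is the same for any tile size.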

