MapReduce: simplified data processing on large clusters
- OSDI’04: Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation, 2004
Cited by 913 (3 self)
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.
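The map/reduce contract described in this abstract can be illustrated with a small single-process sketch. It is not the implementation from the paper: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, word counting is chosen only as a familiar example, and the partitioning, scheduling, and fault handling that the abstract attributes to the run-time system are deliberately left out.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take one (document name, text) pair and emit (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share one intermediate key."""
    return key, sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    # One reduce call per distinct intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


documents = [
    ("a.txt", "the quick fox"),
    ("b.txt", "the lazy dog the end"),
]
counts = run_mapreduce(documents, map_fn, reduce_fn)
print(counts)
```

The user supplies only the two functions; everything inside `run_mapreduce` stands in for work that a distributed run-time would spread across machines.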
EnsemBlue: Integrating distributed storage and consumer electronics
- In Proceedings of the 7th Symposium on Operating Systems Design and Implementation. ACM SIGOPS, 2006
Cited by 22 (1 self)
EnsemBlue is a distributed file system for personal multimedia that incorporates both general-purpose computers and consumer electronic devices (CEDs). EnsemBlue leverages the capabilities of a few general-purpose computers to make CEDs first-class clients of the file system. It supports namespace diversity by translating between its distributed namespace and the local namespaces of CEDs. It supports extensibility through persistent queries, a robust event notification mechanism that leverages the underlying cache consistency protocols of the file system. Finally, it allows mobile clients to self-organize and share data through device ensembles. Our results show that these features impose little overhead, yet they enable the integration of emerging platforms such as digital cameras, MP3 players, and DVRs.
Object-based image retrieval using the statistical structure of images
- Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2004
Cited by 17 (0 self)
We propose a new Bayesian approach to object-based image retrieval with relevance feedback. Although estimating the object posterior probability density from few examples seems infeasible, we are able to approximate this density by exploiting statistics of the image database domain. Unlike previous approaches that assume an arbitrary distribution for the unconditional density of the feature vector (the density of the features taken over the entire image domain), we learn both the structure and the parameters of this density. These density estimates enable us to construct a Bayesian classifier. Using this Bayesian classifier, we perform a windowed scan over images for objects of interest and employ the user’s feedback on the search results to train a second classifier that focuses on eliminating difficult false positives. We have incorporated this algorithm into an object-based image retrieval system. We demonstrate the effectiveness of our approach with experiments using a set of categories from the Corel database.
Automatic Optimization of Parallel Dataflow Programs
Cited by 14 (0 self)
Large-scale parallel dataflow systems, e.g., Dryad and Map-Reduce, have attracted significant attention recently. High-level dataflow languages such as Pig Latin and Sawzall are being layered on top of these systems, to enable faster program development and more maintainable code. These languages engender greater transparency in program structure, and open up opportunities for automatic optimization. This paper proposes a set of optimization strategies for this context, drawing on and extending techniques from the database community.
Building Self-configuring Services Using Service-specific Knowledge
- IEEE HPDC’04, 2004
Cited by 13 (2 self)
A self-configuring service can automatically leverage distributed service components and resources to compose an optimal configuration according to both the requirements of a particular user and the system characteristics. One major challenge for building such services is how to bring in service-specific knowledge, e.g., what components are needed and optimization criteria to use, while still allowing reuse of common service composition functionalities. In this paper, we present an architecture in which service developers express their service-specific knowledge in the form of a service recipe that is used by a generic synthesizer to perform service composition automatically. We apply our approach to three different services to illustrate the flexibility and simplicity of the recipe representation. We use simulations based on Internet measurements to evaluate how an appropriate optimization algorithm can be selected according to a developer's service-specific trade-off between optimality and cost of optimization.
All-Pairs: An Abstraction for Data-Intensive Cloud Computing
Cited by 12 (5 self)
Although modern parallel and distributed computing systems provide easy access to large amounts of computing power, it is not always easy for non-expert users to harness these large systems effectively. A large workload composed in what seems to be the obvious way by a naive user may accidentally abuse shared resources and achieve very poor performance. To address this problem, we propose that production systems should provide end users with high-level abstractions that allow for the easy expression and efficient execution of data-intensive workloads. We present one example of an abstraction, All-Pairs, that fits the needs of several data-intensive scientific applications. We demonstrate that an optimized All-Pairs abstraction is easier to use than the underlying system, achieves performance orders of magnitude better than the obvious but naive approach, and is twice as fast as a hand-optimized conventional approach.
All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids
- IEEE Transactions on Parallel and Distributed Systems
Cited by 9 (1 self)
Today, campus grids provide users with easy access to thousands of CPUs. However, it is not always easy for non-expert users to harness these systems effectively. A large workload composed in what seems to be the obvious way by a naive user may accidentally abuse shared resources and achieve very poor performance. To address this problem, we argue that campus grids should provide end users with high-level abstractions that allow for the easy expression and efficient execution of data-intensive workloads. We present one example of an abstraction, All-Pairs, that fits the needs of several applications in biometrics, bioinformatics, and data mining. We demonstrate that an optimized All-Pairs abstraction is easier to use than the underlying system, achieves performance orders of magnitude better than the obvious but naive approach, and is both faster and more efficient than a tuned conventional approach. This abstraction has been in production use for one year on a 500-CPU campus grid at the University of Notre Dame, and has been used to carry out a groundbreaking analysis of biometric data.
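Neither All-Pairs abstract spells out the operation itself. A minimal sketch, assuming the usual reading of the name (apply a user-supplied comparison function to every pair drawn from two sets and collect the results in a matrix), might look as follows. The names `all_pairs` and `hamming` are hypothetical, and the nested loop here is exactly the naive formulation; it shows only the interface a user would see, not the optimized execution the abstracts evaluate.

```python
def all_pairs(set_a, set_b, compare):
    """Return a matrix M where M[i][j] = compare(set_a[i], set_b[j])."""
    return [[compare(a, b) for b in set_b] for a in set_a]


def hamming(x, y):
    """Illustrative comparison function: count differing positions."""
    return sum(c1 != c2 for c1, c2 in zip(x, y))


# Toy bit-string "templates" standing in for real data items.
templates_a = ["0000", "0101"]
templates_b = ["0000", "1111", "0100"]

matrix = all_pairs(templates_a, templates_b, hamming)
print(matrix)  # [[0, 4, 1], [2, 2, 1]]
```

The point of the abstraction is that the user states only the two sets and the function, which leaves the system free to decide how data is distributed and how the comparisons are batched.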
Towards Efficient Search on Unstructured Data: An Intelligent-Storage Approach
Cited by 6 (0 self)
Applications that create and consume unstructured data have grown both in scale of storage requirements and complexity of search primitives. We consider two such applications: exhaustive search and integration of structured and unstructured data. Current block-based storage systems are either incapable of addressing the challenges brought forth by these applications or inefficient at doing so. We propose a storage framework to efficiently store and search unstructured and structured data while controlling storage management costs. Experimental results based on our prototype show that the proposed system can provide impressive performance and feature benefits.
Fawndamentally power-efficient clusters
- In HotOS, 2009
Cited by 6 (0 self)
Power is becoming an increasingly large financial and scaling burden for computing and society. The costs of running large data centers are becoming dominated by power and cooling to the degree that companies such as
Optimal Inter-Object Correlation When Replicating for Availability
- 2008
Cited by 3 (0 self)
Data replication is a key technique for ensuring data availability. Traditionally, researchers have focused on the availability of individual objects, even though user-level tasks (called operations) typically request multiple objects. Our recent experimental study has shown that the assignment of object replicas to machines results in subtle yet dramatic effects on the availability of these operations, even though the availability of individual objects remains the same. This paper is the first to approach the assignment problem from a theoretical perspective, and obtains a series of results regarding assignments that provide the best and the worst availability for user-level operations. We use a range of techniques to obtain our results, from standard combinatorial techniques and hill climbing methods to Janson’s inequality (a strong probabilistic tool). Some of the results demonstrate that even quite simple versions of the assignment problem can have surprising answers.
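The claim that replica placement changes operation availability while per-object availability stays fixed can be checked on a toy model. The sketch below is an assumption-laden illustration, not the paper's analysis: it assumes machines fail independently with the same probability, that an object is available if any one replica is up, and that an operation needs every object it requests. The function name `operation_availability` and the two placements are hypothetical.

```python
from itertools import product


def operation_availability(assignment, num_machines, p_fail):
    """Probability that every object has at least one live replica.

    assignment: list of machine-index tuples, one tuple per object.
    Brute-force enumeration over all up/down machine states.
    """
    total = 0.0
    for state in product([True, False], repeat=num_machines):  # True = up
        prob = 1.0
        for up in state:
            prob *= (1 - p_fail) if up else p_fail
        if all(any(state[m] for m in replicas) for replicas in assignment):
            total += prob
    return total


P_FAIL = 0.1
colocated = [(0, 1), (0, 1)]  # both objects replicated on machines 0 and 1
spread = [(0, 1), (2, 3)]     # objects replicated on disjoint machine pairs

avail_colocated = operation_availability(colocated, 4, P_FAIL)
avail_spread = operation_availability(spread, 4, P_FAIL)
print(avail_colocated, avail_spread)
```

In both placements each individual object is available with probability 1 - 0.1² = 0.99, yet the two-object operation succeeds with probability 0.99 under co-location and 0.99² = 0.9801 when the replicas are spread out.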

