Results 1 - 10 of 66
MapReduce: Simplified data processing on large clusters.
- In Proceedings of the Sixth Symposium on Operating Systems Design and Implementation (OSDI '04), 2004
"... Abstract MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of ..."
Abstract - Cited by 3439 (3 self)
Abstract MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
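A minimal Python sketch of the word-count example the paper uses to illustrate the model: map emits a (word, 1) pair per occurrence, and reduce sums the counts for each word. The single-process driver below only stands in for the runtime, which in the real system handles partitioning, scheduling, and machine failures across thousands of nodes.

from collections import defaultdict

def map_fn(doc_name, contents):
    # key: document name, value: document contents
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # key: a word, values: the list of counts emitted for it
    return (word, sum(counts))

def run_mapreduce(documents, map_fn, reduce_fn):
    # The shuffle phase: group intermediate pairs by key.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            intermediate[key].append(value)
    return [reduce_fn(k, vs) for k, vs in sorted(intermediate.items())]

print(run_mapreduce({"d1": "to be or not to be"}, map_fn, reduce_fn))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]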
EnsemBlue: Integrating distributed storage and consumer electronics
- In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, ACM SIGOPS, 2006
"... EnsemBlue is a distributed file system for personal multimedia that incorporates both general-purpose computers and consumer electronic devices (CEDs). Ensem-Blue leverages the capabilities of a few general-purpose computers to make CEDs first class clients of the file system. It supports namespace ..."
Abstract - Cited by 46 (3 self)
EnsemBlue is a distributed file system for personal multimedia that incorporates both general-purpose computers and consumer electronic devices (CEDs). EnsemBlue leverages the capabilities of a few general-purpose computers to make CEDs first-class clients of the file system. It supports namespace diversity by translating between its distributed namespace and the local namespaces of CEDs. It supports extensibility through persistent queries, a robust event notification mechanism that leverages the underlying cache consistency protocols of the file system. Finally, it allows mobile clients to self-organize and share data through device ensembles. Our results show that these features impose little overhead, yet they enable the integration of emerging platforms such as digital cameras, MP3 players, and DVRs.
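A hypothetical sketch of the persistent-query idea from the abstract: a durable predicate over file events whose matches are retained for later replay. The API names and event representation here are invented for illustration; they are not EnsemBlue's actual interface.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class FileEvent:
    path: str
    kind: str   # e.g. "create", "modify", "delete"

@dataclass
class PersistentQuery:
    predicate: Callable[["FileEvent"], bool]
    matches: list = field(default_factory=list)

    def deliver(self, event: FileEvent) -> None:
        # Matches are retained so a disconnected device (say, an MP3
        # player) can replay them when it rejoins the ensemble.
        if self.predicate(event):
            self.matches.append(event)

new_music = PersistentQuery(lambda e: e.kind == "create" and e.path.endswith(".mp3"))
new_music.deliver(FileEvent("/media/music/track01.mp3", "create"))
new_music.deliver(FileEvent("/docs/notes.txt", "modify"))
print(len(new_music.matches))  # 1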
All-Pairs: An Abstraction for Data-Intensive Cloud Computing
"... Although modern parallel and distributed computing systems provide easy access to large amounts of computing power, it is not always easy for non-expert users to harness these large systems effectively. A large workload composed in what seems to be the obvious way by a naive user may accidentally ab ..."
Abstract - Cited by 45 (11 self)
Although modern parallel and distributed computing systems provide easy access to large amounts of computing power, it is not always easy for non-expert users to harness these large systems effectively. A large workload composed in what seems to be the obvious way by a naive user may accidentally abuse shared resources and achieve very poor performance. To address this problem, we propose that production systems should provide end users with high-level abstractions that allow for the easy expression and efficient execution of data-intensive workloads. We present one example of an abstraction – All-Pairs – that fits the needs of several data-intensive scientific applications. We demonstrate that an optimized All-Pairs abstraction is easier to use than the underlying system, achieves performance orders of magnitude better than the obvious but naive approach, and runs twice as fast as a hand-optimized conventional approach.
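The abstraction itself is small: All-Pairs(A, B, F) computes the matrix M[i][j] = F(A[i], B[j]). A minimal sketch follows; the sequential loop is illustrative only, since the point of the system is to partition and schedule this matrix across a cluster.

def all_pairs(A, B, F):
    # M[i][j] = F(A[i], B[j]) for every pair drawn from the two sets.
    return [[F(a, b) for b in B] for a in A]

# Example comparison function: set overlap between short strings, a
# stand-in for the biometric or bioinformatic similarity functions
# the papers mention.
def overlap(a, b):
    return len(set(a) & set(b))

print(all_pairs(["ab", "cd"], ["ac", "bd"], overlap))  # [[1, 1], [1, 1]]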
Automatic Optimization of Parallel Dataflow Programs
"... Large-scale parallel dataflow systems, e.g., Dryad and Map-Reduce, have attracted significant attention recently. High-level dataflow languages such as Pig Latin and Sawzall are being layered on top of these systems, to enable faster program development and more maintainable code. These languages en ..."
Abstract - Cited by 33 (1 self)
Large-scale parallel dataflow systems, e.g., Dryad and Map-Reduce, have attracted significant attention recently. High-level dataflow languages such as Pig Latin and Sawzall are being layered on top of these systems to enable faster program development and more maintainable code. These languages engender greater transparency in program structure, and open up opportunities for automatic optimization. This paper proposes a set of optimization strategies for this context, drawing on and extending techniques from the database community.
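One classic strategy of the database-derived kind the abstract alludes to is filter pushdown: moving a selective filter below an expensive join so that less data flows through the plan. The toy plan below is invented for illustration and is not taken from the paper.

def join(left, right, key):
    # Hash join: index the right side, then probe with each left row.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def naive_plan(users, orders):
    # Join first, filter later: every order is joined, most rows discarded.
    return [r for r in join(users, orders, "uid") if r["country"] == "US"]

def optimized_plan(users, orders):
    # Filter pushdown: restrict users before the join ever sees them.
    us_users = [u for u in users if u["country"] == "US"]
    return join(us_users, orders, "uid")

users = [{"uid": 1, "country": "US"}, {"uid": 2, "country": "FR"}]
orders = [{"uid": 1, "item": "x"}, {"uid": 2, "item": "y"}]
assert naive_plan(users, orders) == optimized_plan(users, orders)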
All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids
- IEEE Transactions on Parallel and Distributed Systems
"... Today, campus grids provide users with easy access to thousands of CPUs. However, it is not always easy for nonexpert users to harness these systems effectively. A large workload composed in what seems to be the obvious way by a naive user may accidentally abuse shared resources and achieve very p ..."
Abstract - Cited by 29 (2 self)
Today, campus grids provide users with easy access to thousands of CPUs. However, it is not always easy for non-expert users to harness these systems effectively. A large workload composed in what seems to be the obvious way by a naive user may accidentally abuse shared resources and achieve very poor performance. To address this problem, we argue that campus grids should provide end users with high-level abstractions that allow for the easy expression and efficient execution of data-intensive workloads. We present one example of an abstraction – All-Pairs – that fits the needs of several applications in biometrics, bioinformatics, and data mining. We demonstrate that an optimized All-Pairs abstraction is easier to use than the underlying system, achieves performance orders of magnitude better than the obvious but naive approach, and is both faster and more efficient than a tuned conventional approach. This abstraction has been in production use for one year on a 500-CPU campus grid at the University of Notre Dame, and has been used to carry out a groundbreaking analysis of biometric data.
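A hedged sketch of how such a workload might be decomposed for a campus grid: the result matrix M[i][j] = F(A[i], B[j]) from the All-Pairs example above is split into sub-blocks that can be dispatched to individual nodes. The block size and iteration order here are assumptions, not the paper's actual scheduling policy.

def blocks(n_a, n_b, size):
    # Yield (row_start, row_end, col_start, col_end) sub-blocks of the
    # n_a x n_b result matrix, each small enough to ship to one node.
    for i in range(0, n_a, size):
        for j in range(0, n_b, size):
            yield (i, min(i + size, n_a), j, min(j + size, n_b))

print(list(blocks(4, 4, 2)))
# [(0, 2, 0, 2), (0, 2, 2, 4), (2, 4, 0, 2), (2, 4, 2, 4)]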
Object-based image retrieval using the statistical structure of images
- In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2004
"... We propose a new Bayesian approach to object-based image retrieval with relevance feedback. Although estimating the object posterior probability density from few examples seems infeasible, we are able to approximate this density by exploiting statistics of the image database domain. Unlike previous ..."
Abstract - Cited by 25 (0 self)
We propose a new Bayesian approach to object-based image retrieval with relevance feedback. Although estimating the object posterior probability density from few examples seems infeasible, we are able to approximate this density by exploiting statistics of the image database domain. Unlike previous approaches that assume an arbitrary distribution for the unconditional density of the feature vector (the density of the features taken over the entire image domain), we learn both the structure and the parameters of this density. These density estimates enable us to construct a Bayesian classifier. Using this Bayesian classifier, we perform a windowed scan over images for objects of interest and employ the user’s feedback on the search results to train a second classifier that focuses on eliminating difficult false positives. We have incorporated this algorithm into an object-based image retrieval system. We demonstrate the effectiveness of our approach with experiments using a set of categories from the Corel database.
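A minimal sketch of the Bayesian decision rule the abstract builds on: a window is scored by p(object | x) = p(x | object) p(object) / p(x), where p(x) is the unconditional feature density that the paper learns from database-wide statistics. The Gaussian densities and parameters below are placeholders, not the paper's learned models.

import math

def gaussian(x, mean, var):
    # Univariate normal density; a one-dimensional stand-in for the
    # feature-vector densities used in the paper.
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def object_score(x, prior=0.1):
    p_x_given_obj = gaussian(x, mean=1.0, var=0.25)   # p(x | object), assumed
    p_x = gaussian(x, mean=0.0, var=1.0)              # unconditional p(x), assumed
    return p_x_given_obj * prior / p_x                # posterior p(object | x)

# A windowed scan would score every window and flag high-posterior ones.
print(object_score(0.9) > object_score(-0.5))  # True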
FAWNdamentally power-efficient clusters
- In HotOS, 2009
"... Power is becoming an increasingly large financial and scaling burden for computing and society. The costs of running large data centers are becoming dominated by power and cooling to the degree that companies such as ..."
Abstract - Cited by 15 (1 self)
Power is becoming an increasingly large financial and scaling burden for computing and society. The costs of running large data centers are becoming dominated by power and cooling to the degree that companies such as ...
Scalable Crowd-Sourcing of Video from Mobile Devices
"... We propose a scalable Internet system for continuous collection of crowd-sourced video from devices such as Google Glass. Our hybrid cloud architecture, GigaSight, is effectively a Content Delivery Network (CDN) in reverse. It achieves scalability by decentralizing the collection infrastructure usin ..."
Abstract - Cited by 12 (1 self)
We propose a scalable Internet system for continuous collection of crowd-sourced video from devices such as Google Glass. Our hybrid cloud architecture, GigaSight, is effectively a Content Delivery Network (CDN) in reverse. It achieves scalability by decentralizing the collection infrastructure using cloudlets based on virtual machines (VMs). Based on time, location, and content, privacy-sensitive information is automatically removed from the video. This process, which we refer to as denaturing, is executed in a user-specific VM on the cloudlet. Users can perform content-based searches on the total catalog of denatured videos. Our experiments reveal the bottlenecks for video upload, denaturing, indexing, and content-based search. They also provide insight on how parameters such as frame rate and resolution impact scalability.
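A hypothetical sketch of the denaturing step: a per-user policy decides, from time, location, or content, which frames are blanked before the video becomes searchable. The frame representation and policy are invented for illustration; GigaSight's actual VM-based pipeline is not reproduced here.

from dataclasses import dataclass

@dataclass
class Frame:
    timestamp: float   # seconds since capture start
    location: str      # coarse label, e.g. "home", "campus"
    pixels: str        # stand-in for actual image data

def denature(frames, policy):
    # Flagged frames are blanked rather than dropped, preserving the
    # video's timeline for later content-based search.
    return [Frame(f.timestamp, f.location, "<redacted>") if policy(f) else f
            for f in frames]

at_home = lambda f: f.location == "home"
video = [Frame(0.0, "home", "img0"), Frame(1.0, "campus", "img1")]
print(denature(video, at_home))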
Magellan: A Searchable Metadata Architecture for Large-Scale File Systems
2009
"... As file systems continue to grow, metadata search is becoming an increasingly important way to access and manage files. However, existing solutions that build a separate metadata database outside of the file system face consistency and management challenges at large-scales. To address these issues, ..."
Abstract - Cited by 9 (0 self)
As file systems continue to grow, metadata search is becoming an increasingly important way to access and manage files. However, existing solutions that build a separate metadata database outside of the file system face consistency and management challenges at large scales. To address these issues, we developed Magellan, a new large-scale file system metadata architecture that enables the file system’s metadata to be efficiently and directly searched. This allows Magellan to avoid the consistency and management challenges of a separate database, while providing performance comparable to that of other large file systems. Magellan enables metadata search by introducing several techniques to metadata server design. First, Magellan uses a new on-disk inode layout that makes metadata retrieval efficient for searches. Second, Magellan indexes inodes in data structures that enable fast, multi-attribute search and allow all metadata lookups, including directory searches, to be handled as queries. Third, a query routing technique helps to keep the search space small, even at large scales. Fourth, a new journaling mechanism enables efficient update performance and metadata reliability. An evaluation with real-world metadata from a file system shows that, by combining these techniques, Magellan is capable of searching millions of files in under a second, while providing metadata performance comparable to, and sometimes better than, other large-scale file systems.
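A minimal sketch of the abstract's central idea that metadata lookups, including directory scans, become multi-attribute queries over indexed inodes. The flat list standing in for the index is a placeholder; Magellan's on-disk inode layout and query routing are not reproduced here.

inodes = [
    {"ino": 1, "owner": "alice", "size": 4096, "ext": "log"},
    {"ino": 2, "owner": "bob",   "size": 1 << 20, "ext": "mp4"},
    {"ino": 3, "owner": "alice", "size": 2 << 20, "ext": "mp4"},
]

def query(index, **preds):
    # Each keyword argument is an (attribute, predicate) pair; the
    # predicates are applied conjunctively to every indexed inode.
    return [i for i in index
            if all(p(i[attr]) for attr, p in preds.items())]

# "alice's files larger than 1 MiB" as a query rather than a tree walk.
hits = query(inodes, owner=lambda o: o == "alice", size=lambda s: s > 1 << 20)
print([i["ino"] for i in hits])  # [3]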
Towards Efficient Search on Unstructured Data: An Intelligent-Storage Approach
"... Applications that create and consume unstructured data have grown both in scale of storage requirements and complexity of search primitives. We consider two such applications: exhaustive search and integration of structured and unstructured data. Current blockbased storage systems are either incapab ..."
Abstract - Cited by 8 (0 self)
Applications that create and consume unstructured data have grown both in scale of storage requirements and complexity of search primitives. We consider two such applications: exhaustive search and integration of structured and unstructured data. Current block-based storage systems are either incapable of addressing, or inefficient at addressing, the challenges brought forth by the above applications. We propose a storage framework to efficiently store and search unstructured and structured data while controlling storage management costs. Experimental results based on our prototype show that the proposed system can provide significant performance and feature benefits.