Data Placement for Scientific Applications in Distributed Environments
Cited by 5 (0 self)
Scientific applications often perform complex computational analyses that consume and produce large data sets. We are concerned with data placement policies that distribute data in ways that are advantageous for application execution, for example, by placing data sets so that they may be staged into or out of computations efficiently or by replicating them for improved performance and reliability. In particular, we propose to study the relationship between data placement services and workflow management systems. In this paper, we explore the interactions between two services used in large-scale science today. We evaluate the benefits of prestaging data using the Data Replication Service versus using the native data stage-in mechanisms of the Pegasus workflow management system. We use the astronomy application Montage for our experiments and modify it to study the effect of input data size on the benefits of data prestaging. As the size of input data sets increases, prestaging using a data placement service can significantly improve the performance of the overall analysis.
A Data Placement Service for Petascale Applications
In Petascale Data Storage Workshop, Supercomputing 2007, 2007
Cited by 2 (0 self)
We examine the use of policy-driven data placement services to improve the performance of data-intensive, petascale applications in high performance distributed computing environments. In particular, we are interested in using an asynchronous data placement service to stage data in and out of application workflows efficiently as well as to distribute and replicate data according to Virtual Organization policies. We propose a data placement service architecture and describe our implementation of one layer of this architecture, which provides efficient, priority-based bulk data transfers.
User-centric Utility-based Data Replication in Heterogeneous Networks
Cited by 1 (0 self)
Information overload and convergence of devices aggravate the difficulties of accessing data distributed among various user devices, especially when this is performed by mobile users and over heterogeneous wireless networks. Existing data replication systems help increase the performance of the distributed data system, but they consider neither users' different levels of interest in various pieces of data nor heterogeneous wireless connectivity issues. This paper presents the Smart Personal Information Network (Smart PIN), a performance and cost-aware personal information network which uses a novel user-centric utility-based data replication scheme to exchange content automatically, based on both network performance and user interests. The proposed user-centric data replication scheme's evaluation, through simulation, shows improved results in comparison with existing solutions.
An Analytical Framework and Its Applications for Studying Brick Storage Reliability
Cited by 1 (0 self)
The reliability of a large-scale storage system is influenced by a complex set of inter-dependent factors. This paper presents a comprehensive and extensible analytical framework that offers quantitative answers to many design tradeoffs. We apply the framework to a number of important design strategies that a designer and/or administrator must face in reality, including topology-aware replica placement and proactive replication that uses small background network bandwidth and unused disk space to create additional copies. We also quantify the impact of slow (but potentially more accurate) failure detection and lazy replacement of failed disks. We use detailed simulation to verify and refine our analytical model. These results demonstrate the versatility of the framework and serve as a solid step towards more quantitative studies of fundamental system tradeoffs between reliability, performance, and cost in large-scale distributed storage systems.
F2F: reliable storage in open networks
2006
Cited by 1 (0 self)
A major hurdle to deploying a distributed storage infrastructure in peer-to-peer systems is storing data reliably using nodes that have little incentive to remain in the system. We argue that a node should choose its neighbors (the nodes with which it shares resources) based on existing social relationships instead of randomly. This approach provides incentives for nodes to cooperate and results in a more stable system which, in turn, reduces the cost of maintaining data. The cost of this approach is decreased flexibility and storage utilization. We describe our approach and sketch two applications for which this approach is viable: a cooperative backup system and a Usenet replacement.
Data Redundancy and Maintenance for Peer-to-Peer File Backup Systems
Cited by 1 (0 self)
Data redundancy and maintenance in peer-to-peer file backup systems
Smart PIN: Utility-based Replication and Delivery of Multimedia Content to Mobile Users in Wireless Networks
Next generation wireless networks rely on heterogeneous connectivity technologies to support various rich media services such as personal information storage, file sharing and multimedia streaming. Due to users' mobility and the dynamic characteristics of wireless networks, data availability in collaborating devices is a critical issue. In this context, Smart PIN was proposed as a personal information network which focuses on performance of delivery and cost efficiency. Smart PIN uses a novel data replication scheme based on individual and overall system utility to best balance the requirements for static data and multimedia content delivery with variable device availability due to user mobility. Simulations show improved results in comparison with other general purpose data replication schemes in terms of data availability.
Maintaining Replicas in Unstructured P2P Systems
Replication is widely used in unstructured peer-to-peer systems to improve search or achieve availability. We identify and solve a subclass of replication problems where each object is associated with a maintainer node, and its replicas should only be available as long as its maintainer is part of the network. Such a requirement can be found in various applications, e.g., when objects are directory lists, service lists, or subscriptions of a publish/subscribe system. We provide maintainers with proven guarantees on the number of replicas, in spite of network churn and crash failures. We also tackle the related problems of changing the number of replicas, updating replicas, balancing storage load in a heterogeneous network, and eliminating replicas left by crashing maintainers. Our algorithm is based on probabilistic methods and is simple to implement. We show by simulation and formal proof that our algorithm is correct.
Availability and Redundancy in Harmony: Measuring Retrieval Times in P2P Storage Systems
Peer-to-peer (P2P) storage systems are strongly affected by churn, that is, temporary and permanent peer failures. Because of this churn, the main requirement of such systems is to guarantee that stored objects can always be retrieved. This requirement is especially important in two situations: when users want to access the stored objects and when data maintenance processes have to repair lost information. To meet this requirement, existing P2P storage systems introduce large amounts of redundancy that maintain data availability close to 100%. Unfortunately, these large amounts of redundancy increase the storage costs, either by reducing the overall net capacity or by increasing the communication required for data maintenance. In order to minimize storage costs, P2P storage systems can reduce data redundancy. However, less redundancy means lower data availability, which leads to increased object retrieval times. Longer retrieval times could in turn compromise data maintenance processes and penalize users' retrieval times. It is therefore crucial for P2P storage systems to predict the effects of a redundancy reduction. In order to provide this information, we present a novel analytical framework to measure object retrieval times under different redundancy and churn circumstances. Our framework can be directly used by backup applications aiming to maintain durability at the lowest cost, or by data sharing applications that seek to reduce costs by penalizing user retrieval times. We validate our framework by simulation using real P2P traces (Skype and eMule's KAD).

