Results 1 - 10 of 18
Ceph: A scalable, high-performance distributed file system
- In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI), 2006
Cited by 112 (21 self)
We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.
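The idea of replacing allocation tables with a placement function can be illustrated with a small sketch. The abstract does not describe how CRUSH works internally, so the code below substitutes a different, simpler technique, weighted rendezvous (highest-random-weight) hashing. The function name `place_object`, the OSD names, and the weights are all hypothetical.

```python
import hashlib
import math


def place_object(object_id: str, osds: dict[str, float], replicas: int = 2) -> list[str]:
    """Choose `replicas` distinct OSDs for an object without any lookup table.

    Weighted rendezvous hashing, used here as a stand-in for CRUSH (assumption):
    every client computes the same placement from the object id and the OSD map.
    """
    def score(osd: str) -> float:
        digest = hashlib.sha256(f"{object_id}:{osd}".encode()).digest()
        # Map 52 hash bits to a uniform value strictly inside (0, 1).
        n = int.from_bytes(digest[:8], "big") >> 12
        u = (n + 0.5) / 2**52
        # Larger weights yield larger scores on average.
        return -osds[osd] / math.log(u)

    ranked = sorted(osds, key=score, reverse=True)
    return ranked[:replicas]


# Hypothetical cluster: OSD name -> relative capacity weight.
cluster = {"osd0": 1.0, "osd1": 1.0, "osd2": 2.0, "osd3": 1.0}
placement = place_object("obj-1", cluster, replicas=2)
```

A useful property of this stand-in is that removing an OSD that does not hold a given object leaves that object's placement unchanged, so only a fraction of the data moves when the cluster changes.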
LH*RS -- a high-availability scalable distributed data structure
Cited by 53 (9 self)
(SDDS). An LH*RS file is hash partitioned over the distributed RAM of a multicomputer, e.g., a network of PCs, and supports the unavailability of any of its k ≥ 1 server nodes. The value of k transparently grows with the file to offset the reliability decline. Only the number of storage nodes potentially limits file growth. The high-availability management uses a novel parity calculus that we have developed, based on Reed-Solomon erasure-correcting coding. The resulting parity storage overhead is close to the minimum possible. The parity encoding and decoding are faster than for any other candidate coding we are aware of. We present our scheme and its performance analysis, including experiments with a prototype implementation on Wintel PCs. The capabilities of LH*RS offer new perspectives to data-intensive applications, including the emerging ones of grids and of P2P computing.
PRO: A popularity-based multi-threaded reconstruction optimization for RAID-structured storage systems
- In Proceedings of the 5th USENIX Conference on File and Storage Technologies. USENIX Association, 2007
Cited by 13 (4 self)
This paper proposes and evaluates a novel dynamic data reconstruction optimization algorithm, called popularity-based multi-threaded reconstruction optimization (PRO), which allows the reconstruction process in a RAID-structured storage system to rebuild the frequently accessed areas prior to rebuilding infrequently accessed areas to exploit access locality. This approach has the salient advantage of simultaneously decreasing reconstruction time and alleviating user and system performance degradation. It can also be easily adopted in various conventional reconstruction approaches. In particular, we optimize the disk-oriented reconstruction (DOR) approach with PRO. The PRO-powered DOR is shown to induce a much earlier onset of response-time improvement and sustain a longer time span of such improvement than the original DOR. Our benchmark studies on read-only web workloads have shown that the PRO-powered DOR algorithm consistently outperforms the original DOR algorithm in the failure-recovery process in terms of user response time, with a 3.6% to 23.9% performance improvement and up to 44.7% reconstruction time improvement simultaneously.
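The core ordering idea, rebuilding frequently accessed areas first, can be sketched in a few lines. This is only an illustration under stated assumptions: the disk is divided into fixed zones, popularity is a plain access count, and the multi-threaded scheduling the abstract mentions is left out. The names `rebuild_order` and `accesses` are hypothetical.

```python
from collections import Counter


def rebuild_order(num_zones: int, accesses: list[int]) -> list[int]:
    """Return zone indices ordered from most to least frequently accessed.

    `accesses` is a hypothetical trace of zone indices touched by user I/O.
    Ties (including never-accessed zones) fall back to ascending zone index.
    """
    popularity = Counter(accesses)
    return sorted(range(num_zones), key=lambda zone: (-popularity[zone], zone))


# Zone 2 was accessed three times, zone 3 twice, zone 0 once, zone 1 never.
order = rebuild_order(4, [2, 2, 3, 2, 3, 0])
```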
Disk infant mortality in large storage systems
- In Proc. of MASCOTS ’05, 2005
Cited by 7 (3 self)
As disk drives have dropped in price relative to tape, the desire for the convenience and speed of online access to large data repositories has led to the deployment of petabyte-scale disk farms with thousands of disks. Unfortunately, the very large size of these repositories renders them vulnerable to previously rare failure modes such as multiple, unrelated disk failures leading to data loss. While some business models, such as free email servers, may be able to tolerate some occurrence of data loss, others, including premium online services and storage of simulation results at a national laboratory, cannot. This paper describes the effect of infant mortality on long-term failure rates of systems that must preserve their data for decades. Our failure models incorporate the well-known “bathtub curve,” which reflects the higher failure rates of new disk drives, a lower, constant failure rate during the remainder of the design life span, and increased failure rates as components wear out. Large systems are vulnerable to the “cohort effect” that occurs when many disks are simultaneously replaced by new disks. Our more accurate disk models and simulations have yielded predictions of system lifetimes that are more pessimistic than existing models that assume a constant disk failure rate. Thus, larger system scale requires designers to take disk infant mortality into account.
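A bathtub-shaped failure rate of the kind described above can be sketched as a piecewise hazard function. Every number below (phase boundaries, rates, wear-out slope) is a made-up assumption for illustration and is not taken from the paper.

```python
import math


def hazard_rate(age_hours: float,
                infant_end: float = 2000.0,       # assumed end of infant period
                wearout_start: float = 40000.0,   # assumed start of wear-out
                infant_rate: float = 5e-6,        # assumed failures per hour
                base_rate: float = 1e-6,
                wearout_slope: float = 1e-10) -> float:
    """Failure rate per hour: high when new, flat in mid-life, rising at wear-out."""
    if age_hours < infant_end:
        return infant_rate
    if age_hours < wearout_start:
        return base_rate
    return base_rate + wearout_slope * (age_hours - wearout_start)


def survival_probability(age_hours: float, step: float = 10.0) -> float:
    """Probability that a disk survives to `age_hours`.

    Integrates the hazard numerically (midpoint rule) and applies
    S(t) = exp(-integral of the hazard from 0 to t).
    """
    steps = int(age_hours / step)
    cumulative = sum(hazard_rate((i + 0.5) * step) * step for i in range(steps))
    return math.exp(-cumulative)
```

Under these assumed values a new disk is five times more likely to fail per hour than a mid-life disk, which is why a cohort of simultaneously replaced disks raises the risk of correlated failures relative to a constant-rate model.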
Providing high reliability in a minimum redundancy archival storage system
- Proc. 14th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2006
Cited by 5 (2 self)
Inter-file compression techniques store files as sets of references to data objects or chunks that can be shared among many files. While these techniques can achieve much better compression ratios than conventional intra-file compression methods such as Lempel-Ziv compression, they also reduce the reliability of the storage system because the loss of a few critical chunks can lead to the loss of many files. We show how to eliminate this problem by choosing for each chunk a replication level that is a function of the amount of data that would be lost if that chunk were lost. Experiments using actual archival data show that our technique can achieve significantly higher robustness than a conventional approach combining data mirroring and intra-file compression while requiring about half the storage space.
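One way to picture a per-chunk replication level is shown below. The abstract does not give the actual function, so the logarithmic rule, the 1 MiB unit, and the cap of four replicas are assumptions chosen purely for illustration. The helper names are hypothetical.

```python
import math


def dependent_bytes(files: dict[str, tuple[int, list[str]]]) -> dict[str, int]:
    """For each chunk, total size of the files that would be lost with it.

    `files` maps a file name to (file size in bytes, list of chunk ids).
    """
    totals: dict[str, int] = {}
    for size, chunk_ids in files.values():
        for chunk_id in set(chunk_ids):  # count each file once per chunk
            totals[chunk_id] = totals.get(chunk_id, 0) + size
    return totals


def replication_level(lost_bytes: int, base_replicas: int = 1,
                      unit_bytes: int = 1 << 20, max_replicas: int = 4) -> int:
    """Assumed rule: one extra replica per doubling of the data at risk."""
    if lost_bytes <= unit_bytes:
        return base_replicas
    extra = math.ceil(math.log2(lost_bytes / unit_bytes))
    return min(max_replicas, base_replicas + extra)


# Hypothetical archive: chunk "c1" is shared by both files.
archive = {
    "a.tar": (3 << 20, ["c1", "c2"]),
    "b.tar": (1 << 20, ["c1"]),
}
levels = {chunk: replication_level(size)
          for chunk, size in dependent_bytes(archive).items()}
```

The shared chunk receives more replicas than the chunk referenced by a single file, which is the behavior the abstract describes.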
Improving the availability of supercomputer job input data using temporal replication, submitted for publication
Cited by 4 (3 self)
Supercomputers are stepping into the Peta-scale and Exascale era, wherein handling hundreds of concurrent system failures is an urgent challenge. In particular, storage system failures have been identified as a major source of service interruptions in supercomputers. RAID solutions alone cannot provide sufficient storage protection as (1) average disk recovery time is projected to grow, making RAID groups increasingly vulnerable to additional failures during data reconstruction, and (2) disk-level data protection cannot mask higher-level faults, such as software/hardware failures of entire I/O nodes. This paper presents a complementary approach based on the observation that files in the supercomputer scratch space are typically accessed by batch jobs, whose execution can be anticipated. Therefore, we propose to transparently, selectively, and temporarily replicate “active” job input data, by coordinating the parallel file system with the batch job scheduler. We have implemented the temporal replication scheme in the popular Lustre parallel file system and evaluated it with both real-cluster experiments and trace-driven simulations. Our results show that temporal replication allows for fast online data reconstruction, with a reasonably low overall space and I/O bandwidth overhead.
Scalable archival data and metadata management in object-based file systems
, 2004
Cited by 2 (2 self)
Online archival capabilities like snapshots or checkpoints are fast becoming an essential component of robust storage systems. Emerging large distributed file systems are also shifting to object-based storage architectures that decouple metadata from file I/O operations. As the size of such systems scales to petabytes of storage, it is critically important that file system features continue to operate efficiently. We present a flexible mechanism for archiving file system state that allows the creation of checkpoints for arbitrarily sized subtrees of the hierarchy. Checkpoints are managed in a distributed fashion while maintaining efficient utilization of system resources.
Ceph: A Scalable Object-Based Storage System
, 2006
Cited by 2 (0 self)
The data storage needs of large high-performance and general-purpose computing environments are generally best served by distributed storage systems. Traditional solutions, exemplified by NFS, provide a simple distributed storage system model, but cannot meet the demands of high-performance computing environments where a single server may become a bottleneck, nor do they scale well due to the need to manually partition (or repartition) the data among the servers. Object-based storage promises to address these needs through a simple networked data storage unit, the Object Storage Device (OSD), that manages all local storage issues and exports a simple read/write data interface. Despite this simple concept, many challenges remain, including efficient object storage, centralized metadata management, data and metadata replication, and data and metadata reliability. We describe Ceph, a distributed object-based storage system that meets these challenges, providing high-performance file storage that scales directly with the number of OSDs and metadata servers.
Efficient Updates in Highly Available Distributed Random Access Memory
- The Twelfth International Conference on Parallel and Distributed Systems (ICPADS), 2006
Cited by 2 (2 self)
With increased network speeds and throughputs, multicomputers (a system of computers connected by a high-speed network) have become an attractive alternative to store important data in their collective random access memory. Erasure codes provide space-optimal data redundancy to protect this type of storage from node unavailability. They have been used in LH*RS, the scalable, high-availability, distributed version of Linear Hashing. We present and evaluate a technique that uses the property of linear erasure-correcting codes to make updates transactional and concurrent with recovery from one or more node unavailabilities without locks or two-phase commits. The technique significantly improves on previous work in update speed and also allows for serializable updates to a bucket that is in the process of being recovered.
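The linearity property that such updates rely on can be shown with the simplest linear code, a single XOR parity block. This is a stand-in: LH*RS uses Reed-Solomon coding, and the sketch below does not model its transactional protocol or concurrent recovery. All function names are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def compute_parity(blocks: list[bytes]) -> bytes:
    """Parity of all data blocks (XOR is the simplest linear code)."""
    parity = bytes(len(blocks[0]))
    for block in blocks:
        parity = xor_bytes(parity, block)
    return parity


def apply_update(parity: bytes, old_block: bytes, new_block: bytes) -> bytes:
    """Update parity from the delta alone, without reading the other blocks.

    Because the code is linear, new_parity = old_parity XOR (old XOR new).
    """
    delta = xor_bytes(old_block, new_block)
    return xor_bytes(parity, delta)


blocks = [b"abcd", b"efgh", b"ijkl"]
parity = compute_parity(blocks)
updated_parity = apply_update(parity, blocks[1], b"WXYZ")
```

Since the parity node needs only the delta, updates to different data blocks can be applied in any order and still produce the same parity.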
WorkOut: I/O Workload Outsourcing for Boosting RAID Reconstruction Performance
Cited by 2 (0 self)
User I/O intensity can significantly impact the performance of on-line RAID reconstruction due to contention for the shared disk bandwidth. Based on this observation, this paper proposes a novel scheme, called WorkOut (I/O Workload Outsourcing), to significantly boost RAID reconstruction performance. WorkOut effectively outsources all write requests and popular read requests originally targeted at the degraded RAID set to a surrogate RAID set during reconstruction. Our lightweight prototype implementation of WorkOut and extensive trace-driven and benchmark-driven experiments demonstrate that, compared with existing reconstruction approaches, WorkOut significantly reduces both the total reconstruction time and the average user response time. Importantly, WorkOut is orthogonal to and can be easily incorporated into any existing reconstruction algorithm. Furthermore, it can be extended to improve the performance of other background RAID support tasks, such as re-synchronization and disk scrubbing.

