CiteSeerX
The Panasas ActiveScale storage cluster: Delivering scalable high bandwidth storage (2004)

by D. Nagle
Venue: Proc. Supercomputing

Results 1 - 10 of 118 citing documents

Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication

by Vijay Vasudevan, Amar Phanishayee, Hiral Shah, Elie Krevat, David G. Andersen, Gregory R. Ganger, Garth A. Gibson, Brian Mueller (Panasas Inc.)
"... This paper presents a practical solution to a problem facing high-fan-in, high-bandwidth synchronized TCP workloads in datacenter Ethernets—the TCP incast problem. In these networks, receivers can experience a drastic reduction in application throughput when simultaneously requesting data from many ..."
Abstract - Cited by 93 (1 self) - Add to MetaCart
This paper presents a practical solution to a problem facing high-fan-in, high-bandwidth synchronized TCP workloads in datacenter Ethernets—the TCP incast problem. In these networks, receivers can experience a drastic reduction in application throughput when simultaneously requesting data from many servers using TCP. Inbound data overfills small switch buffers, leading to TCP timeouts lasting hundreds of milliseconds. For many datacenter workloads that have a barrier synchronization requirement (e.g., filesystem reads and parallel data-intensive queries), throughput is reduced by up to 90%. For latency-sensitive applications, TCP timeouts in the datacenter impose delays of hundreds of milliseconds in networks with round-trip-times in microseconds. Our practical solution uses high-resolution timers to enable microsecond-granularity TCP timeouts. We demonstrate that this technique is effective in avoiding TCP incast collapse in simulation and in real-world experiments. We show that eliminating the minimum retransmission timeout bound is safe for all environments, including the wide-area.
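
The fix this abstract describes hinges on one change: lowering TCP's minimum retransmission timeout (RTO) from the conventional hundreds of milliseconds to the microsecond scale of datacenter round-trip times. A minimal sketch of a standard RTT-estimator-driven RTO with such a floor is below; the class, constants, and the exact floor value are illustrative assumptions, not the authors' kernel implementation.

```python
# Sketch: RFC 6298-style RTO estimation with a microsecond-scale minimum
# bound (assumed value), in the spirit of fine-grained retransmissions.
class RttEstimator:
    ALPHA = 1 / 8       # gain for the smoothed RTT
    BETA = 1 / 4        # gain for the RTT variance
    MIN_RTO = 200e-6    # 200 microseconds instead of the usual 200 ms

    def __init__(self):
        self.srtt = None     # smoothed RTT, seconds
        self.rttvar = None   # RTT variance, seconds

    def on_rtt_sample(self, rtt: float) -> float:
        """Feed one RTT measurement; return the new retransmission timeout."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        # A microsecond floor keeps a lost burst from idling the link for
        # hundreds of milliseconds, the behavior behind incast collapse.
        return max(self.MIN_RTO, self.srtt + 4 * self.rttvar)
```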

Citation Context

...nce in an area that, surprisingly, also proves challenging to TCP: very low delay, high throughput, datacenter networks of dozens to thousands of machines. The problem we study is TCP incast collapse [25], where application throughput drastically reduces when multiple senders communicate with a single receiver in high bandwidth, low delay networks using TCP. Highly bursty, fast data transmissions over...

Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems

by Amar Phanishayee, Elie Krevat, Vijay Vasudevan, David G. Andersen, Gregory R. Ganger, Garth A. Gibson, Srinivasan Seshan, 2007
"... Cluster-based and iSCSI-based storage systems rely on standard TCP/IP-over-Ethernet for client access to data. Unfortunately, when data is striped over multiple networked storage nodes, a client can experience a TCP throughput collapse that results in much lower read bandwidth than should be provide ..."
Abstract - Cited by 58 (6 self) - Add to MetaCart
Cluster-based and iSCSI-based storage systems rely on standard TCP/IP-over-Ethernet for client access to data. Unfortunately, when data is striped over multiple networked storage nodes, a client can experience a TCP throughput collapse that results in much lower read bandwidth than should be provided by the available network links. Conceptually, this problem arises because the client simultaneously reads fragments of a data block from multiple sources that together send enough data to overload the switch buffers on the client’s link. This paper analyzes this Incast problem, explores its sensitivity to various system parameters, and examines the effectiveness of alternative TCP- and Ethernet-level strategies in mitigating the TCP throughput collapse.
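
The collapse stems from the synchronized fan-in this abstract describes: a block is striped across several servers and the client cannot continue until every fragment has arrived. A minimal sketch of that barrier-style read pattern follows; fetch_fragment and the server list are placeholders, and this illustrates the workload shape rather than the paper's measurement setup.

```python
# Sketch: a client reads one striped block by asking all storage servers
# for their fragments at once and waiting for every reply (the barrier).
from concurrent.futures import ThreadPoolExecutor

def fetch_fragment(server: str, block_id: int, frag_idx: int) -> bytes:
    # Placeholder: in a real system this is a TCP read of one stripe unit.
    raise NotImplementedError

def read_block(servers: list[str], block_id: int) -> bytes:
    # All servers answer at roughly the same time; their combined burst can
    # overflow the shallow buffer on the client's switch port, and the
    # barrier below means one timed-out fragment stalls the whole block.
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = [pool.submit(fetch_fragment, s, block_id, i)
                   for i, s in enumerate(servers)]
        fragments = [f.result() for f in futures]   # barrier: wait for all
    return b"".join(fragments)
```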

Citation Context

...TCP- and Ethernet-level strategies in mitigating the TCP throughput collapse. 1 Introduction Cluster-based storage systems are becoming an increasingly important target for both research and industry [1, 36, 15, 24, 14, 8]. These storage systems consist of a networked set of smaller storage servers, with data spread across these servers to increase performance and reliability. Building these systems using commodity TCP...

Low-overhead byzantine fault-tolerant storage

by James Hendricks - In SOSP, 2007
"... This paper presents an erasure-coded Byzantine fault-tolerant block storage protocol that is nearly as efficient as protocols that tolerate only crashes. Previous Byzantine fault-tolerant block storage protocols have either relied upon replication, which is inefficient for large blocks of data when ..."
Abstract - Cited by 51 (1 self) - Add to MetaCart
This paper presents an erasure-coded Byzantine fault-tolerant block storage protocol that is nearly as efficient as protocols that tolerate only crashes. Previous Byzantine fault-tolerant block storage protocols have either relied upon replication, which is inefficient for large blocks of data when tolerating multiple faults, or a combination of additional servers, extra computation, and versioned storage. To avoid these expensive techniques, our protocol employs novel mechanisms to optimize for the common case when faults and concurrency are rare. In the common case, a write operation completes in two rounds of communication and a read completes in one round. The protocol requires a short checksum comprised of cryptographic hashes and homomorphic fingerprints. It achieves throughput within 10% of the crash-tolerant protocol for writes and reads in failure-free runs when configured to tolerate up to 6 faulty servers and any number of faulty clients.
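
The "short checksum comprised of cryptographic hashes" can be pictured as a cross-checksum: one digest per erasure-coded fragment, carried with every fragment so a reader can detect a Byzantine server returning corrupted data. The sketch below assumes SHA-256 and a placeholder m-of-n encoder; the paper's homomorphic fingerprints, which add stronger cross-fragment validation, are not shown.

```python
# Sketch: cross-checksum over erasure-coded fragments (illustrative only).
import hashlib

def encode(block: bytes, m: int, n: int) -> list[bytes]:
    # Placeholder for an m-of-n erasure code (e.g., Reed-Solomon).
    raise NotImplementedError

def cross_checksum(fragments: list[bytes]) -> list[bytes]:
    """One SHA-256 digest per fragment, stored alongside every fragment."""
    return [hashlib.sha256(frag).digest() for frag in fragments]

def verify_fragment(frag: bytes, idx: int, checksum: list[bytes]) -> bool:
    # A reader accepts a fragment only if it matches the agreed checksum.
    return hashlib.sha256(frag).digest() == checksum[idx]
```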

Citation Context

...olerated grows beyond two or three, erasure coding provides much better write bandwidth [37, 38]. A few distributed storage systems support erasure coding. For example, Zebra [19], xFS [4], and PanFS [29] support parity-based protection of data striped across multiple servers. FAB [34], Ursa Minor [3], and RepStore [40] support more general m-of-n erasure coding. 2.1 Beyond crash faults A common assum...

Naiad: A Timely Dataflow System

by Derek G. Murray, Frank Mcsherry, Rebecca Isaacs, Michael Isard, Paul Barham, Martín Abadi
"... Naiad is a distributed system for executing data parallel, cyclic dataflow programs. It offers the high throughput of batch processors, the low latency of stream processors, and the ability to perform iterative and incremental computations. Although existing systems offer some of these features, app ..."
Abstract - Cited by 48 (1 self) - Add to MetaCart
Naiad is a distributed system for executing data parallel, cyclic dataflow programs. It offers the high throughput of batch processors, the low latency of stream processors, and the ability to perform iterative and incremental computations. Although existing systems offer some of these features, applications that require all three have relied on multiple platforms, at the expense of efficiency, maintainability, and simplicity. Naiad resolves the complexities of combining these features in one framework. A new computational model, timely dataflow, underlies Naiad and captures opportunities for parallelism across a wide class of algorithms. This model enriches dataflow computation with timestamps that represent logical points in the computation and provide the basis for an efficient, lightweight coordination mechanism. We show that many powerful high-level programming models can be built on Naiad’s low-level primitives, enabling such diverse tasks as streaming data analysis, iterative machine learning, and interactive graph mining. Naiad outperforms specialized systems in their target application domains, and its unique features enable the development of new high-performance applications.
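
The timestamps mentioned in the abstract pair an input epoch with one counter per enclosing loop and are compared lexicographically; that ordering is what lets the lightweight coordination mechanism decide when no further messages can arrive for a given logical time. Below is a minimal sketch of such a timestamp as an illustration of the timely dataflow model, not Naiad's actual API.

```python
# Sketch: logical timestamps for timely dataflow (epoch + loop counters).
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Timestamp:
    epoch: int
    loop_counters: tuple[int, ...] = ()   # one counter per enclosing loop

def enter_loop(t: Timestamp) -> Timestamp:
    """Entering a loop scope appends a counter starting at zero."""
    return Timestamp(t.epoch, t.loop_counters + (0,))

def next_iteration(t: Timestamp) -> Timestamp:
    """A feedback edge increments the innermost loop counter."""
    *outer, last = t.loop_counters     # assumes the timestamp is inside a loop
    return Timestamp(t.epoch, tuple(outer) + (last + 1,))
```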

Citation Context

...32 ports each. Despite over-provisioning the inter-switch links with a 40 Gbps uplink and enabling 802.3x flow control, we observe packet loss at the NIC receive queues during incast traffic patterns [31]. It is likely that Datacenter TCP [6] would be beneficial for our workload, but the rack switches in our cluster lack necessary support for explicit congestion notification. Since Naiad controls all ...

Scalable I/O Forwarding Framework for High-Performance Computing Systems

by Nawab Ali, Philip Carns, Kamil Iskra, Dries Kimpe, Samuel Lang, Robert Latham, Robert Ross, Lee Ward
"... Abstract—Current leadership-class machines suffer from a significant imbalance between their computational power and their I/O bandwidth. While Moore’s law ensures that the computational power of high-performance computing systems increases with every generation, the same is not true for their I/O s ..."
Abstract - Cited by 40 (8 self) - Add to MetaCart
Current leadership-class machines suffer from a significant imbalance between their computational power and their I/O bandwidth. While Moore’s law ensures that the computational power of high-performance computing systems increases with every generation, the same is not true for their I/O subsystems. The scalability challenges faced by existing parallel file systems with respect to the increasing number of clients, coupled with the minimalistic compute node kernels running on these machines, call for a new I/O paradigm to meet the requirements of data-intensive scientific applications. I/O forwarding is a technique that attempts to bridge the increasing performance and scalability gap between the compute and I/O components of leadership-class machines by shipping I/O calls from compute nodes to dedicated I/O nodes. The I/O nodes perform operations on behalf of the compute nodes and can reduce file system traffic by aggregating, rescheduling, and caching I/O requests. This paper presents an open, scalable I/O forwarding framework for high-performance computing systems. We describe an I/O protocol and API for shipping function calls from compute nodes to I/O nodes, and we present a quantitative analysis of the overhead associated with I/O forwarding. Keywords: I/O forwarding; parallel file systems; leadership-class machines
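
I/O forwarding ships the file-system call itself instead of letting every compute node talk to the parallel file system directly. The sketch below illustrates the idea with a pickled request, a stub transport, and a handler that an I/O node would run; the message format, function names, and transport are assumptions, not the framework's actual protocol or API.

```python
# Sketch: forwarding a pwrite() from a compute node to an I/O node.
import os
import pickle

def send_request(raw: bytes) -> bytes:
    # Placeholder transport (e.g., a socket or RDMA send to the I/O node).
    raise NotImplementedError

def forward_pwrite(path: str, data: bytes, offset: int) -> int:
    """Compute-node side: serialize the call and ship it to the I/O node."""
    request = pickle.dumps({"op": "pwrite", "path": path,
                            "data": data, "offset": offset})
    return pickle.loads(send_request(request))["bytes_written"]

def handle_request(raw: bytes) -> bytes:
    """I/O-node side: perform the operation on behalf of the compute node.
    A real forwarder can also aggregate, reschedule, and cache requests."""
    req = pickle.loads(raw)
    if req["op"] == "pwrite":
        fd = os.open(req["path"], os.O_WRONLY | os.O_CREAT)
        try:
            n = os.pwrite(fd, req["data"], req["offset"])
        finally:
            os.close(fd)
        return pickle.dumps({"bytes_written": n})
    raise ValueError("unsupported operation")
```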

Citation Context

... performance? Parallel file systems are an obvious potential target for improvements. The file systems available on current leadership-class machines, such as PVFS [5], GPFS [6], Lustre [7], and PanFS [8], were designed with smaller systems in mind. They face significant challenges scaling to the hundreds of thousands of clients that are available on current high-performance computing systems [9], [10],...

Informed Data Distribution Selection in a Self-Predicting Storage System

by Eno Thereska, Michael Abd-El-Malek, Jay J. Wylie, Dushyanth Narayanan, Gregory R. Ganger - In International Conference on Autonomic Computing, 2006
"... Systems should be self-predicting. They should continuously monitor themselves and provide quantitative answers to What...if questions about hypothetical workload or resource changes. Self-prediction would significantly simplify administrators' decision making, such as acquisition planning and ..."
Abstract - Cited by 28 (11 self) - Add to MetaCart
Systems should be self-predicting. They should continuously monitor themselves and provide quantitative answers to What...if questions about hypothetical workload or resource changes. Self-prediction would significantly simplify administrators' decision making, such as acquisition planning and performance tuning, by reducing the detailed workload and internal system knowledge required. This paper describes and evaluates support for self-prediction in a cluster-based storage system and its application to What...if questions about data distribution selection.

Citation Context

...node crashes (two). The prediction accuracy for the 3-of-5 scheme is less than that of the 3-way replication. We believe this arises from a TCP inflow problem, as has been observed in similar systems [20]. When reading under the 3-of-5 encoding, three storage-nodes are contacted to retrieve the data. The storage-nodes simultaneously reply to the client, causing packets to be dropped at the network swi...

Making a Case for Distributed File Systems at Exascale

by Ioan Raicu, Ian T. Foster, Pete Beckman - Invited Paper, ACM Workshop on Large-scale System and Application Performance (LSAP), 2011
"... Exascale computers will enable the unraveling of significant scientific mysteries. Predictions are that 2019 will be the year of exascale, with millions of compute nodes and billions of threads of execution. The current architecture of high-end computing systems is decades-old and has persisted as w ..."
Abstract - Cited by 24 (13 self) - Add to MetaCart
Exascale computers will enable the unraveling of significant scientific mysteries. Predictions are that 2019 will be the year of exascale, with millions of compute nodes and billions of threads of execution. The current architecture of high-end computing systems is decades-old and has persisted as we scaled from gigascales to petascales. In this architecture, storage is completely segregated from the compute resources and is connected via a network interconnect. This approach will not scale several orders of magnitude in terms of concurrency and throughput, and will thus prevent the move from petascale to exascale. At exascale, basic functionality at high concurrency levels will suffer poor performance, and combined with system mean-time-to-failure in hours, will lead to a performance collapse for large-scale heroic applications. Storage has the potential to be ...

Citation Context

...posed since the 1980s, such as the Network File System (NFS) [46], Andrew File System (AFS) [47], General Parallel File System (GPFS) [13], Parallel Virtual File System (PVFS) [6], Lustre [7], Panasas [48], Microsoft's Distributed File System (DFS) [49], GlusterFS [50], OneFS [51], POHMELFS [52], and XtreemFS [53]. While the majority of these file systems expose a POSIX-like interface providing a globa...

ICTCP: incast congestion control for TCP in data center networks

by Haitao Wu, Zhenqian Feng, Chuanxiong Guo, Yongguang Zhang (Microsoft Research Asia, China) - in Proceedings of Co-NEXT ’10, 2010
"... ABSTRACT TCP incast congestion happens in high-bandwidth and lowlatency networks, when multiple synchronized servers send data to a same receiver in parallel In this paper, we study TCP incast in detail by focusing on the relationship among TCP throughput, round trip time (RTT) and receive window. ..."
Abstract - Cited by 24 (2 self) - Add to MetaCart
TCP incast congestion happens in high-bandwidth and low-latency networks when multiple synchronized servers send data to the same receiver in parallel. In this paper, we study TCP incast in detail by focusing on the relationship among TCP throughput, round trip time (RTT) and receive window. Different from the previous approach of mitigating the impact of incast congestion with a fine-grained timeout value, our idea is to design an ICTCP (Incast congestion Control for TCP) scheme at the receiver side. In particular, our method adjusts the TCP receive window proactively before packet drops occur. The implementation and experiments in our testbed demonstrate that we achieve almost zero timeouts and high goodput for TCP incast.
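
The receiver-side idea is to grow a connection's advertised window only when the connection is actually window-limited and the receiver's link has headroom, and to shrink it before switch buffers overflow. The sketch below is a loose illustration of that control loop; the thresholds, the spare-capacity test, and the function signature are assumptions, not ICTCP's published algorithm.

```python
# Sketch: receiver-driven receive-window adjustment (illustrative values).
MSS = 1460  # maximum segment size, bytes

def adjust_rwnd(rwnd: int, measured_bps: float, rtt_s: float,
                spare_link_bps: float) -> int:
    expected_bps = (rwnd * 8) / rtt_s        # throughput the window permits
    ratio = measured_bps / max(expected_bps, 1.0)
    if ratio > 0.9 and spare_link_bps > 0.1 * expected_bps:
        return rwnd + MSS                    # connection is window-limited
    if ratio < 0.5:
        return max(2 * MSS, rwnd - MSS)      # back off before drops occur
    return rwnd                              # otherwise leave it alone
```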

Citation Context

...escribes the design rationale of ICTCP. Section 4 presents ICTCP algorithms. Section 5 shows the implementation of ICTCP as a Windows driver. Section 6 presents experimental results. [Figure 1: A data center network and a detailed illustration of a ToR (Top of Rack) switch connected to multiple rack-mounted servers.] Section 7 discusses the extension of ICTCP. Section 8 presents related work. Finally, Section 9 concludes the paper. 2. BACKGROUND AND MOTIVATION TCP incast has been identified and described by Nagle et al. [12] in distributed storage clusters. In distributed file systems, files are stored at multiple servers. TCP incast congestion occurs when multiple blocks of a file are fetched from multiple servers. Several application-specific solutions have been proposed in the context of parallel file systems. With recent progress on data center networking, the TCP incast problem in data center networks has become a practical issue. Since there are various data center applications, a transport-layer solution can free applications from building their own solutions and is therefore preferred. In this Section, we fir...

Exploiting Lustre File Joining for Effective Collective IO

by Weikuan Yu, Jeffrey Vetter
"... Lustre is a parallel file system that presents high aggregated IO bandwidth by striping file extents across many storage devices. However, our experiments indicate excessively wide striping can cause performance degradation. Lustre supports an innovative file joining feature that joins files in plac ..."
Abstract - Cited by 20 (3 self) - Add to MetaCart
Lustre is a parallel file system that presents high aggregated IO bandwidth by striping file extents across many storage devices. However, our experiments indicate excessively wide striping can cause performance degradation. Lustre supports an innovative file joining feature that joins files in place. To mitigate striping overhead and benefit collective IO, we propose two techniques: split writing and hierarchical striping. In split writing, a file is created as separate subfiles, each of which is striped to only a few storage devices. They are joined as a single file at the file close time. Hierarchical striping builds on top of split writing and orchestrates the span of subfiles in a hierarchical manner to avoid overlapping and achieve the appropriate coverage of storage devices. Together, these techniques can avoid the overhead associated with large stripe width, while still being able to combine bandwidth available from many storage devices. We have prototyped these techniques in the ROMIO implementation of MPI-IO. Experimental results indicate that split writing and hierarchical striping can significantly improve the performance of Lustre collective IO in terms of both data transfer and management operations. On a Lustre file system configured with 46 object storage targets, our implementation improves collective write performance of a 16-process job by as much as 220%.
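
Split writing with hierarchical striping can be pictured as address arithmetic: the logical file is divided into subfiles, each striped over its own small group of storage targets, and the subfiles are joined into one file at close time. The sketch below maps a global offset to a subfile and a target within that subfile's group; the parameter names and exact layout are illustrative assumptions, not the paper's MPI-IO implementation.

```python
# Sketch: mapping a global file offset under hierarchical striping.
def locate(offset: int, subfile_size: int, stripe_size: int,
           targets_per_subfile: int) -> tuple[int, int, int]:
    """Return (subfile index, target within the subfile's group, offset in stripe unit)."""
    subfile = offset // subfile_size
    within_subfile = offset % subfile_size
    target = (within_subfile // stripe_size) % targets_per_subfile
    return subfile, target, within_subfile % stripe_size

# Example with 64 MiB subfiles, 1 MiB stripes, 4 targets per subfile:
# locate(70 * 2**20, 64 * 2**20, 2**20, 4) -> (1, 2, 0), i.e. the second
# subfile, the third target in its group, at the start of a stripe unit.
```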

Citation Context

...itecture. Our hierarchical striping technique is similar in concept to another technique: two-level striping. Two-level striping is a disk striping technique used in the implementation of the Panasas [7] file system, and is used as an internal storage organization policy. Our hierarchical striping is built on top of the user-level file joining feature. It works at the level of IO middleware, aimed to...

Reliability for networked storage nodes

by KK Rao, James Lee Hafner, Richard A. Golding - Research Report RJ-10358, IBM Almaden Research, 2006
"... High-end enterprise storage has traditionally consisted of monolithic systems with customized hardware, multiple redundant components and paths, and no single point of failure. Distributed storage systems realized through networked storage nodes offer several advantages over monolithic systems such ..."
Abstract - Cited by 19 (2 self) - Add to MetaCart
High-end enterprise storage has traditionally consisted of monolithic systems with customized hardware, multiple redundant components and paths, and no single point of failure. Distributed storage systems realized through networked storage nodes offer several advantages over monolithic systems such as lower cost and increased scalability. In order to achieve reliability goals associated with enterprise-class storage systems, redundancy will have to be distributed across the collection of nodes to tolerate both node and drive failures. In this paper, we present alternatives for distributing this redundancy, and models to determine the reliability of such systems. We specify a reliability target and determine the configurations that meet this target. Further, we perform sensitivity analyses where selected parameters are varied to observe their effect on reliability.
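
A first-order way to reason about redundancy spread across a collection of nodes is combinatorial: data is lost in a repair window only if more nodes fail during that window than the redundancy tolerates. The sketch below assumes independent, identically distributed node failures with per-window probability p; the paper's models are more detailed (node versus drive failures, rebuild behavior), so this is only an illustration.

```python
# Sketch: probability of data loss when redundancy tolerates f node failures.
from math import comb

def p_data_loss(n: int, f: int, p: float) -> float:
    """Probability that more than f of n nodes fail in the same window."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(f + 1, n + 1))

# Example: 10 nodes, redundancy tolerating 2 failures, p = 1e-3 per window:
# p_data_loss(10, 2, 1e-3) is roughly 1.2e-7.
```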

Citation Context

...e. In contrast, achieving scalability through distributed storage systems is becoming increasingly prevalent in research and development [4], [8], [12], and, to some extent, in commercial deployments [9]. A significant aspect of distributed systems is the ability to use common building blocks across a wide range of storage requirements: from a few terabytes to the scale of petabytes. This translates ...
