Serverless Network File Systems
ACM Transactions on Computer Systems, 1995
Cited by 403 (26 self)
Abstract
In this paper, we propose a new paradigm for network file system design, serverless network file systems. While traditional network file systems rely on a central server machine, a serverless system utilizes workstations cooperating as peers to provide all file system services. Any machine in the system can store, cache, or control any block of data. Our approach uses this location independence, in combination with fast local area networks, to provide better performance and scalability than traditional file systems. Further, because any machine in the system can assume the responsibilities of a failed component, our serverless design also provides high availability via redundant data storage. To demonstrate our approach, we have implemented a prototype serverless network file system called xFS. Preliminary performance measurements suggest that our architecture achieves its goal of scalability. For instance, in a 32-node xFS system with 32 active clients, each client receives nearly as much read or write throughput as it would see if it were the only active client.
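The location independence described above can be illustrated with a small sketch. It assumes a hypothetical `ManagerMap` that assigns control of each block to a peer by indexing a fixed table with the block ID, and that hands a failed peer's entries to the survivors. The abstract does not describe xFS's actual data structures, so the class, its methods, and the round-robin reassignment rule are all assumptions.

```python
# Hypothetical structure for illustration; not taken from the xFS paper.
class ManagerMap:
    """Maps block IDs to the peer that currently controls (manages) them."""

    def __init__(self, nodes, entries=16):
        self.nodes = list(nodes)
        # Fixed-size table: entry i names the manager for every block
        # whose ID is congruent to i modulo the table size.
        self.table = [self.nodes[i % len(self.nodes)] for i in range(entries)]

    def manager_for(self, block_id):
        return self.table[block_id % len(self.table)]

    def fail(self, dead):
        """Hand a failed peer's entries to the surviving peers, round-robin."""
        self.nodes.remove(dead)
        for i, owner in enumerate(self.table):
            if owner == dead:
                self.table[i] = self.nodes[i % len(self.nodes)]


peers = ManagerMap(["a", "b", "c", "d"], entries=8)
print(peers.manager_for(5))  # b
peers.fail("b")
print(peers.manager_for(5))  # d
```

Because any peer can take over a table entry, no single machine is special, which is the property the serverless design relies on for availability.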
Cooperative Caching: Using Remote Client Memory to Improve File System Performance
In Proceedings of the First Symposium on Operating Systems Design and Implementation, 1994
Cited by 274 (21 self)
Abstract
Emerging high-speed networks will allow machines to access remote data nearly as quickly as they can access local data. This trend motivates the use of cooperative caching: coordinating the file caches of many machines distributed on a LAN to form a more effective overall file cache. In this paper we examine four cooperative caching algorithms using a trace-driven simulation study. These simulations indicate that for the systems studied cooperative caching can halve the number of disk accesses, improving file system read response time by as much as 73%. Based on these simulations we conclude that cooperative caching can significantly improve file system read response time and that relatively simple cooperative caching algorithms are sufficient to realize most of the potential performance gain.
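A minimal trace-driven sketch of the idea, assuming a hypothetical `simulate` function, per-client LRU caches, and a policy in which a client that misses locally reads the block from any peer that holds it before going to the server's disk. Clients here do not coordinate what they cache, so this is only the simplest form of cooperation and should not be read as any of the four algorithms the paper evaluates.

```python
from collections import OrderedDict


class Client:
    """A client file cache with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # block -> True, least recently used first

    def has(self, block):
        return block in self.cache

    def insert(self, block):
        self.cache[block] = True
        self.cache.move_to_end(block)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the LRU block


def simulate(trace, n_clients, capacity, cooperative):
    """trace: list of (client_id, block) reads. Counts where each read is served."""
    clients = [Client(capacity) for _ in range(n_clients)]
    counts = {"local": 0, "remote": 0, "disk": 0}
    for cid, block in trace:
        me = clients[cid]
        if me.has(block):
            me.cache.move_to_end(block)  # refresh LRU position
            counts["local"] += 1
            continue
        if cooperative and any(c.has(block) for c in clients if c is not me):
            counts["remote"] += 1  # served from a peer's memory over the LAN
        else:
            counts["disk"] += 1    # served from the server's disk
        me.insert(block)
    return counts


trace = [(0, 1), (1, 1), (0, 2), (1, 2)]
print(simulate(trace, n_clients=2, capacity=1, cooperative=False))
print(simulate(trace, n_clients=2, capacity=1, cooperative=True))
```

On this tiny trace the cooperative run turns half of the disk reads into remote-memory hits; real gains depend entirely on the workload.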
Managing Wire Delay in Large Chip-Multiprocessor Caches
IEEE/ACM International Symposium on Microarchitecture, 2004
Cited by 90 (4 self)
Abstract
In response to increasing (relative) wire delay, architects have proposed various technologies to manage the impact of slow wires on large uniprocessor L2 caches. Block migration (e.g., D-NUCA and NuRapid) reduces average hit latency by migrating frequently used blocks towards the lower-latency banks. Transmission Line Caches (TLC) use on-chip transmission lines to provide low latency to all banks. Traditional stride-based hardware prefetching strives to tolerate, rather than reduce, latency. Chip multiprocessors (CMPs) present additional challenges. First, CMPs often share the on-chip L2 cache, requiring multiple ports to provide sufficient bandwidth. Second, multiple threads mean multiple working sets, which compete for limited on-chip storage. Third, sharing code and data interferes with block migration, since one processor's low-latency bank is another processor's high-latency bank. In this paper, we develop L2 cache designs for CMPs that incorporate these three latency management techniques. We use detailed full-system simulation to analyze the performance trade-offs for both commercial and scientific workloads. First, we demonstrate that block migration is less effective for CMPs because 40-60% of L2 cache hits in commercial workloads are satisfied in the central banks, which are equally far from all processors. Second, we observe that although transmission lines provide low latency, contention for their restricted bandwidth limits their performance. Third, we show stride-based prefetching between L1 and L2 caches alone improves performance by at least as much as the other two techniques. Finally, we present a hybrid design, combining all three techniques, that improves performance by an additional 2% to 19% over prefetching alone.
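Stride-based prefetching, one of the three techniques the paper combines, can be sketched as a per-instruction table that remembers the last address and stride and issues prefetches once the same stride repeats. The class name, the `degree` parameter, and the "confirm on a repeated stride" rule are textbook-style assumptions, not the paper's L1-to-L2 prefetcher.

```python
class StridePrefetcher:
    """Minimal per-instruction stride prefetcher (illustrative only)."""

    def __init__(self, degree=1):
        self.degree = degree  # how many strides ahead to prefetch
        self.table = {}       # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Record a demand access and return the addresses to prefetch."""
        prefetches = []
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            # Prefetch only once the same nonzero stride has repeated.
            if stride != 0 and stride == last_stride:
                prefetches = [addr + stride * k for k in range(1, self.degree + 1)]
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)
        return prefetches


pf = StridePrefetcher(degree=2)
print(pf.access(0x400, 100))  # [] (first access from this instruction)
print(pf.access(0x400, 164))  # [] (stride 64 seen once)
print(pf.access(0x400, 228))  # [292, 356] (stride 64 confirmed)
```

Waiting for a repeated stride trades a little timeliness for fewer useless prefetches, which matters when, as the paper notes for transmission lines, bandwidth is a limiting resource.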
Dynamic Self-Invalidation: Reducing Coherence Overhead in Shared-Memory Multiprocessors
In Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995
Cited by 79 (4 self)
Abstract
This paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache block before a conflicting access by another processor. Eliminating invalidation overhead is particularly important under sequential consistency, where the latency of invalidating outstanding copies can increase a program's critical path. DSI is applicable to software, hardware, and hybrid coherence schemes. In this paper we evaluate DSI in the context of hardware directory-based write-invalidate coherence protocols. Our results show that DSI reduces execution time of a sequentially consistent full-map coherence protocol by as much as 41%. This is comparable to an implementation of weak consistency that uses a coalescing write-buffer to allow up to 16 outstanding requests for exclusive blocks. When used in conjunction with weak consisten...
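A toy model of the self-invalidation idea follows. It assumes a hypothetical `Directory` that tags a read copy for self-invalidation whenever the block has needed invalidations before, and a cache that drops tagged copies on its own (for example at its next synchronization point). The prediction rule and method names are simplifying assumptions; the paper's mechanism for identifying blocks and deciding when to invalidate may differ.

```python
class Directory:
    """Toy directory that tags copies for self-invalidation (illustrative only)."""

    def __init__(self):
        self.sharers = {}         # block -> set of processors holding a copy
        self.invalidated = set()  # blocks that have needed invalidations before

    def read(self, proc, block):
        """Grant a read copy; return True if the copy is tagged for self-invalidation."""
        self.sharers.setdefault(block, set()).add(proc)
        return block in self.invalidated

    def write(self, proc, block):
        """Grant exclusive access; return the number of invalidation messages sent."""
        others = self.sharers.get(block, set()) - {proc}
        if others:
            self.invalidated.add(block)  # remember that this block sees conflicts
        self.sharers[block] = {proc}
        return len(others)

    def self_invalidate(self, proc, block):
        """The cache drops a tagged copy on its own, without a message from the directory."""
        self.sharers.get(block, set()).discard(proc)


d = Directory()
d.read(0, "x")             # False: no invalidation history yet
print(d.write(1, "x"))     # 1: processor 0's copy must be invalidated
print(d.read(0, "x"))      # True: the new copy is tagged
d.self_invalidate(0, "x")  # processor 0 drops it itself, e.g. at a sync point
print(d.write(1, "x"))     # 0: no invalidation message on the critical path
```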
Quantitative Analysis of Cache Policies for Scalable Network File Systems
1994
Cited by 72 (8 self)
Abstract
Current network file system protocols rely heavily on a central server to coordinate file activity among client workstations. This central server can become a bottleneck that limits scalability for environments with large numbers of clients. In central server systems such as NFS and AFS, all client writes, cache misses, and coherence messages are handled by the server. To keep up with this workload, expensive server machines are needed, configured with high-performance CPUs, memory systems, and I/O channels. Since the server stores all data, it must be physically capable of connecting to many disks. This reliance on a central server also makes current systems inappropriate for wide area network use where the network bandwidth to the server may be limited. In this paper, we investigate the quantitative performance effect of moving as many of the server responsibilities as possible to client workstations to reduce the need for high-performance server machines. We have devised a cache protocol in which all data reside on clients and all data transfers proceed directly from client to client. The server is used only to coordinate these data transfers. This protocol is being incorporated as part of our experimental file system, xFS. We present results from a trace-driven simulation study of the protocol using traces from a 237-client NFS installation. We find that the xFS protocol reduces server load by more than a factor of six compared to AFS without significantly affecting response time or file availability.
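The protocol's division of labor, where the server only tracks locations while data moves directly between clients, can be sketched as follows. `Coordinator`, its methods, and the choice to count one server message per miss or ownership change are hypothetical; invalidation messages are not modeled.

```python
class Coordinator:
    """Hypothetical server that tracks block locations but never moves data."""

    def __init__(self):
        self.holders = {}      # block -> set of client ids caching it
        self.server_msgs = 0   # control messages the server handles
        self.transfers = []    # (source, destination) client-to-client data moves

    def read(self, client, block):
        holders = self.holders.get(block, set())
        if client in holders:
            return "local"                # cache hit: the server is not involved
        if not holders:
            raise KeyError(f"block {block!r} has never been written")
        self.server_msgs += 1             # miss: ask the server where the data is
        source = min(holders)             # any holder will do; pick one deterministically
        self.transfers.append((source, client))  # data flows peer to peer
        holders.add(client)
        return source

    def write(self, client, block):
        holders = self.holders.setdefault(block, set())
        if holders != {client}:
            self.server_msgs += 1         # ownership request to the server
        holders.clear()
        holders.add(client)               # all other copies are dropped


c = Coordinator()
c.write(1, "b")
print(c.read(2, "b"))   # 1: client 2 fetches the block from client 1
print(c.read(2, "b"))   # local
print(c.server_msgs, c.transfers)
```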
The Design, Implementation, and Evaluation of Jade
ACM Transactions on Programming Languages and Systems, 1998
Cited by 47 (2 self)
Abstract
In this article we discuss the design goals and decisions that determined the final form of Jade and present an overview of the Jade implementation. We also present our experience using Jade to implement several complete scientific and engineering applications. We use this experience to evaluate how the different Jade language features were used in practice and how well Jade as a whole supports the process of developing parallel applications. We find that the basic idea of preserving the serial semantics simplifies the program development process, and that the concept of using data access specifications to guide the parallelization offers significant advantages over more traditional control-based approaches. We also find that the Jade data model can interact poorly with concurrency patterns that write disjoint pieces of a single aggregate data structure, although this problem arises in only one of the applications. Categories and Subject Descriptors: D.1.3 [Programming Te...
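The idea of using data access specifications to extract parallelism while preserving serial semantics can be sketched as a scheduler that places each task in the earliest wave after every earlier task it conflicts with. A conflict here means a write overlapping an earlier read or write, or a read overlapping an earlier write. The `schedule` function and the wave formulation are assumptions for illustration; Jade's own language constructs and runtime are not shown.

```python
def schedule(tasks):
    """Group tasks into waves that may run in parallel.

    `tasks` is a list of (name, reads, writes) tuples in serial program
    order, where reads and writes are sets of object names. A task goes in
    the earliest wave after every earlier task it conflicts with, so any
    two conflicting tasks still execute in their serial order.
    """
    wave_of = []  # wave index of each task, by position
    waves = []
    for i, (name, reads, writes) in enumerate(tasks):
        wave = 0
        for j in range(i):
            _, prev_reads, prev_writes = tasks[j]
            conflict = writes & (prev_reads | prev_writes) or reads & prev_writes
            if conflict:
                wave = max(wave, wave_of[j] + 1)
        wave_of.append(wave)
        if wave == len(waves):
            waves.append([])
        waves[wave].append(name)
    return waves


tasks = [
    ("a", set(), {"x"}),
    ("b", set(), {"y"}),
    ("c", {"x", "y"}, {"z"}),
    ("d", {"x"}, set()),
]
print(schedule(tasks))  # [['a', 'b'], ['c', 'd']]
```

The programmer only declares what each task touches; the ordering constraints fall out of the declarations, which is the contrast with control-based approaches that the abstract draws.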
Multicast snooping: a new coherence method using a multicast address network
In Proceedings of the 26th Annual International Symposium on Computer Architecture (ISCA), 1999
Cited by 40 (7 self)
Abstract
This paper proposes a new coherence method called “multicast snooping” that dynamically adapts between broadcast snooping and a directory protocol. Multicast snooping is unique because processors predict which caches should snoop each coherence transaction by specifying a multicast “mask.” Transactions are delivered with an ordered multicast network, such as an Isotach network, which eliminates the need for acknowledgment messages. Processors handle transactions as they would with a snooping protocol, while a simplified directory operates in parallel to check masks and gracefully handle incorrect ones (e.g., previous owner missing). Preliminary performance numbers, mostly from SPLASH-2 benchmarks running on 32 processors, show that we can limit multicasts to an average of 2-6 destinations (<< 32) and we can deliver 2-5 multicasts per network cycle (>> broadcast snooping’s 1 per cycle). While these results do not include timing, they do provide encouragement that multicast snooping can obtain data directly (like broadcast snooping) but apply to larger systems (like directories).
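The directory's job of checking a predicted mask can be sketched with bit vectors, where bit i stands for processor i. `check_mask` and its parameters are hypothetical, and the rule for which caches must be included (the current owner on any request, plus all sharers on a request for exclusive access) is a simplified assumption. How the paper's directory recovers from an insufficient mask is not modeled.

```python
def check_mask(mask, requester, owner, sharers, exclusive):
    """Return True if a predicted multicast mask reaches every cache that
    must observe the transaction. Masks are ints; bit i is processor i.

    owner:     processor holding the block, or None if memory owns it
    sharers:   processors holding read-only copies
    exclusive: True for a write (get-exclusive) request
    """
    needed = 0
    if owner is not None:
        needed |= 1 << owner          # the owner must supply the data
    if exclusive:
        for p in sharers:
            needed |= 1 << p          # every sharer must invalidate its copy
    needed &= ~(1 << requester)       # the requester need not snoop itself
    return (mask & needed) == needed


print(check_mask(0b0011, requester=0, owner=1, sharers=set(), exclusive=False))  # True
print(check_mask(0b0001, requester=0, owner=1, sharers=set(), exclusive=False))  # False: owner missed
print(check_mask(0b0011, requester=0, owner=1, sharers={2, 3}, exclusive=True))  # False: sharers missed
```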
Bandwidth Adaptive Snooping
2002
Cited by 30 (11 self)
Abstract
This paper advocates that cache coherence protocols use a bandwidth adaptive approach to adjust to varied system configurations (e.g., number of processors) and workload behaviors. We propose Bandwidth Adaptive Snooping Hybrid (BASH), a hybrid protocol that ranges from behaving like snooping (by broadcasting requests) when excess bandwidth is available to behaving like a directory protocol (by unicasting requests) when bandwidth is limited. BASH adapts dynamically by probabilistically deciding to broadcast or unicast on a per-request basis using a local estimate of recent interconnection network utilization. Simulations of a microbenchmark and commercial and scientific workloads show that BASH robustly performs as well as or better than the best of snooping and directory protocols as available bandwidth is varied. By mixing broadcasts and unicasts, BASH outperforms both snooping and directory protocols in the mid-range where a static choice of either is inefficient.
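A sketch of the per-request decision, assuming a hypothetical `BashPolicy` that keeps an exponentially weighted estimate of recent link utilization and nudges its broadcast probability down when the estimate exceeds a threshold and up otherwise. The threshold, step size, and averaging weight are illustrative values, not parameters from the paper.

```python
import random


class BashPolicy:
    """Bandwidth-adaptive broadcast/unicast choice (illustrative parameters)."""

    def __init__(self, threshold=0.75, step=0.05, alpha=0.1, rng=None):
        self.threshold = threshold  # target link utilization
        self.step = step            # how fast the broadcast probability adapts
        self.alpha = alpha          # weight given to the newest utilization sample
        self.utilization = 0.0      # local estimate of recent network use
        self.p_broadcast = 1.0      # start out behaving like snooping
        self.rng = rng or random.Random()

    def observe(self, busy_fraction):
        """Fold in a utilization sample (0.0 idle .. 1.0 saturated) and adapt."""
        self.utilization += self.alpha * (busy_fraction - self.utilization)
        if self.utilization > self.threshold:
            self.p_broadcast = max(0.0, self.p_broadcast - self.step)
        else:
            self.p_broadcast = min(1.0, self.p_broadcast + self.step)

    def choose(self):
        """Decide, per request, whether to broadcast or unicast."""
        return "broadcast" if self.rng.random() < self.p_broadcast else "unicast"


policy = BashPolicy(rng=random.Random(0))
for _ in range(200):
    policy.observe(1.0)    # network saturated
print(policy.p_broadcast)  # 0.0: every request is unicast, like a directory protocol
for _ in range(200):
    policy.observe(0.0)    # network idle
print(policy.p_broadcast)  # 1.0: every request is broadcast, like snooping
```

Between the two extremes the probability settles at intermediate values, which is where mixing broadcasts and unicasts gives the mid-range behavior the abstract describes.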
OceanStore: An Extremely Wide-Area Storage System
2000
Cited by 22 (0 self)
Abstract
OceanStore is a utility infrastructure designed to span the globe and provide continuous access to persistent information. Since this infrastructure is composed of untrusted servers, data is protected through redundancy and cryptographic techniques. To improve performance, data is allowed to be cached anywhere, anytime. Finally, monitoring of usage patterns allows adaptation to regional outages and denial-of-service attacks; monitoring also enhances performance through proactive movement of data. A prototype implementation is currently under development.
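One common way to combine redundancy with cryptographic protection on untrusted servers is to name data by a secure hash of its contents, so a reader can verify any replica and fall back to another copy if one is corrupt or unreachable. The sketch below assumes SHA-256 and hypothetical `guid`/`fetch` helpers; it illustrates the general technique rather than OceanStore's actual naming or replication scheme.

```python
import hashlib


def guid(data: bytes) -> str:
    """Name data by a secure hash of its contents (a self-verifying name)."""
    return hashlib.sha256(data).hexdigest()


def fetch(name: str, replicas):
    """Return the first replica whose contents hash to `name`.

    `replicas` is a list of zero-argument callables standing in for
    untrusted servers; each returns bytes or raises OSError.
    """
    for server in replicas:
        try:
            data = server()
        except OSError:
            continue  # server unreachable: try the next copy
        if data is not None and guid(data) == name:
            return data
        # Otherwise the copy was tampered with or corrupted: ignore it.
    raise LookupError("no replica returned verifiable data")


original = b"persistent object"
name = guid(original)


def corrupt():
    return b"tampered bytes"


def unreachable():
    raise OSError("server down")


def healthy():
    return original


print(fetch(name, [corrupt, unreachable, healthy]))  # b'persistent object'
```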
Cache-Coherent Distributed Shared Memory: Perspectives on Its Development and Future Challenges
Proceedings of the IEEE, Special Issue on Distributed Shared Memory, 1999
Cited by 15 (0 self)
Abstract
Distributed shared memory is an architectural approach that allows multiprocessors to support a single shared address space that is implemented with physically distributed memories. Hardware-supported distributed shared memory is becoming the dominant approach for building multiprocessors with moderate to large numbers of processors. Cache coherence allows such architectures to use caching to take advantage of locality in applications without changing the programmer's model of memory. We review the key developments that led to the creation of cache-coherent distributed shared memory and describe the Stanford DASH Multiprocessor, the first working implementation of hardware-supported scalable cache coherence. We then provide a perspective on such architectures and discuss important remaining technical challenges.
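Hardware directory-based coherence of the kind surveyed here can be sketched with a full-map directory entry: one presence bit per processor plus a dirty flag. `DirectoryEntry` and the message names are hypothetical, and the sketch omits the actual DASH protocol's message types, node organization, and race handling.

```python
class DirectoryEntry:
    """Full-map directory entry: one presence bit per processor plus a dirty bit."""

    def __init__(self, n_procs):
        self.presence = [False] * n_procs
        self.dirty = False  # True when exactly one processor holds a modified copy

    def read_miss(self, proc):
        """Handle a read miss; return the messages the home node sends."""
        msgs = []
        if self.dirty:
            owner = self.presence.index(True)
            msgs.append(("fetch_from_owner", owner))  # owner supplies data, keeps a shared copy
            self.dirty = False
        self.presence[proc] = True
        msgs.append(("data_reply", proc))
        return msgs

    def write_miss(self, proc):
        """Handle a write miss; invalidate every other copy and grant ownership."""
        msgs = []
        for p, present in enumerate(self.presence):
            if present and p != proc:
                kind = "fetch_and_invalidate" if self.dirty else "invalidate"
                msgs.append((kind, p))
                self.presence[p] = False
        self.presence[proc] = True
        self.dirty = True
        msgs.append(("exclusive_reply", proc))
        return msgs


entry = DirectoryEntry(4)
entry.read_miss(0)
entry.read_miss(2)
print(entry.write_miss(1))  # [('invalidate', 0), ('invalidate', 2), ('exclusive_reply', 1)]
print(entry.read_miss(3))   # [('fetch_from_owner', 1), ('data_reply', 3)]
```

Because the directory knows exactly which caches hold a copy, it sends point-to-point messages instead of broadcasting. This is what lets the approach scale to the moderate and large processor counts the abstract mentions.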

