A Case for Using Active Memory to Support Garbage Collection
- In Proceedings of the First Workshop on Hardware Support for Objects and Microarchitectures in Java
, 1999
Cited by 3 (0 self)
Abstract. Most modern programming languages require efficient automatic memory management (garbage collection, GC) as part of the runtime system. Since GC is very memory intensive, it can potentially suffer significantly from poor memory access times. Unfortunately, memory performance develops at a slower pace than processor speed, making memory accesses relatively more and more expensive in the future. Active Memory architectures aim to overcome this problem by placing additional computational power in memory, thus allowing the application to execute small but memory-intensive functions closer to the data and in parallel. The goal is to improve latency and bandwidth for programs that can otherwise suffer from memory accesses. To date, Active Memory has been studied only with databases, image processing, arithmetic computations, and other very regular applications. In this paper, we propose to analyze its impact on garbage collection. We are convinced that garbage collection too will profit from this architecture, since GC is simple, repetitive, easy to partition into offloadable functions, and its performance depends crucially on fast memory access. We describe a possible incarnation of an Active Memory architecture suitable for GC support and argue why GC should benefit from such an architecture.
Parallel Copying Garbage Collection using Delayed Allocation
, 1999
Cited by 2 (0 self)
We present a new approach to parallel copying garbage collection on symmetric multiprocessor (SMP) machines appropriate for Java and other object-oriented languages. Parallel, in this setting, means that the collector runs in several parallel threads. Our collector
A New Approach to Parallelising Tracing Algorithms
Cited by 2 (0 self)
Tracing algorithms visit reachable nodes in a graph and are central to activities such as garbage collection, marshalling, etc. Traditional sequential algorithms use a worklist, replacing a node with its unvisited children. Previous work on parallel tracing is processor-oriented in associating one worklist per processor: worklist insertion and removal requires no locking, and load balancing requires only occasional locking. However, since multiple queues may contain the same node, significant locking is necessary to avoid concurrent visits by competing processors. This paper presents a memory-oriented solution: memory is partitioned into segments and each segment has its own worklist containing only nodes in that segment. At a given time at most one processor owns a given worklist. By arranging separate single-reader single-writer forwarding queues to pass nodes from processor i to processor j, we can process objects in an order that gives lock-free mainline code and improved locality of reference. This refactoring is analogous to the way in which a compiler changes an iteration space to eliminate data dependencies. While it is clear that our solution can be more effective on NUMA systems, and even necessary when processor-local memory may not be addressed from other processors, slightly surprisingly, it often gives significantly better speed-up on modern multi-core architectures too. Using caches to hide memory latency loses much of its effectiveness when there is significant cross-processor memory contention or when locking is necessary.
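The memory-oriented scheme described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a single-threaded simulation in which a round-robin loop stands in for processors taking ownership of segments, and the per-pair forwarding queues are collapsed into a direct append to the target segment's worklist. The names `SEGMENT_SIZE`, `segment_of`, and `trace` are hypothetical, and nodes are modelled as integer addresses.

```python
from collections import deque

SEGMENT_SIZE = 4  # assumption: addresses 0-3 form segment 0, 4-7 segment 1, ...

def segment_of(addr):
    """Map an address to the segment that contains it."""
    return addr // SEGMENT_SIZE

def trace(graph, roots, num_segments):
    """Return the set of addresses reachable from roots.

    graph maps an address to the list of addresses it points to.
    Each segment has its own worklist and its own mark set, so a node
    is only ever marked by whoever currently owns its segment.
    """
    worklists = [deque() for _ in range(num_segments)]
    visited = [set() for _ in range(num_segments)]
    for root in roots:
        worklists[segment_of(root)].append(root)

    progress = True
    while progress:
        progress = False
        # Round-robin stands in for processors acquiring segment ownership.
        for seg in range(num_segments):
            worklist = worklists[seg]
            while worklist:
                progress = True
                node = worklist.popleft()
                if node in visited[seg]:
                    continue
                visited[seg].add(node)
                for child in graph.get(node, ()):
                    # Simplification: forward the child straight to the
                    # worklist of the segment it lives in.
                    worklists[segment_of(child)].append(child)
    return set().union(*visited)
```

For example, `trace({0: [1, 5], 1: [0], 5: [9], 9: [5], 6: [7]}, [0], 3)` visits the nodes 0, 1, 5, and 9 and never touches 6 or 7. Because marking happens only inside the owning segment's loop, a real multi-threaded version would need synchronization only on the forwarding path, which is the property the abstract attributes to single-reader single-writer queues.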
Effects Of Coalescing On The Performance Of Segregated Size Storage Allocators
, 2003
Cited by 1 (0 self)
Typically, when a program executes, it creates objects dynamically and requests storage for its objects from the underlying storage allocator. The patterns of such requests can potentially lead to internal fragmentation as well as external fragmentation. Internal fragmentation occurs when the storage allocator allocates a contiguous block of storage to a program, but the program uses only a fraction of that block to satisfy a request. The unused portion of that block is wasted since the allocator cannot use it to satisfy a subsequent allocation request. External fragmentation, on the other hand, concerns chunks of memory that reside between allocated blocks. External fragmentation becomes problematic when these chunks are not large enough to satisfy an allocation request individually. Consequently, these chunks exist as useless holes in the memory system. In this
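The two kinds of fragmentation described in this abstract can be made concrete with a short sketch. It is only an illustration under stated assumptions, not the allocator studied in the paper: the size classes in `SIZE_CLASSES` are hypothetical, free chunks are modelled as `(start, size)` pairs, and `coalesce` shows the generic idea of merging adjacent free chunks that the paper's title refers to.

```python
SIZE_CLASSES = [16, 32, 64, 128]  # assumption: hypothetical size classes in bytes

def internal_waste(request):
    """Bytes wasted inside the block when request is rounded up to a size class."""
    for size in SIZE_CLASSES:
        if request <= size:
            return size - request
    raise ValueError("request larger than the largest size class")

def can_satisfy(holes, request):
    """True if a single free chunk is large enough for the request."""
    return any(size >= request for _, size in holes)

def coalesce(holes):
    """Merge free chunks that are directly adjacent in memory."""
    merged = []
    for start, size in sorted(holes):
        if merged and merged[-1][0] + merged[-1][1] == start:
            prev_start, prev_size = merged[-1]
            merged[-1] = (prev_start, prev_size + size)
        else:
            merged.append((start, size))
    return merged
```

A 20-byte request lands in the 32-byte class and wastes 12 bytes internally. Three 8-byte holes at addresses 0, 16, and 32 hold 24 free bytes in total, yet `can_satisfy` reports that none of them can serve a 16-byte request, which is external fragmentation. Two holes at addresses 0 and 8, by contrast, are adjacent, so `coalesce` merges them into one 16-byte chunk that can serve the request.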
Achieving High Performance for Parallel Programs that Contain Unscalable Modules
, 2000
Cited by 1 (1 self)
This thesis is a description of a compiler and runtime technique for the efficient management of threads including their mutual exclusion. The target area for this work is parallel languages for shared-memory multiprocessors. The goal of this work is to achieve a situation in which the execution time either decreases or remains unchanged as the number of processors is increased. We call this performance model the satisfactory performance model. Existing parallel programming systems do not always perform according to this satisfactory model. This is the case when there are modules in the program such that concurrent invocations of the modules are serialized. We call these modules bottleneck modules. When bottleneck modules are present they prevent operation according to the satisfactory performance model since the overhead incurred because of bottleneck modules increases with the number of processors. This overhead includes communications with memory for the sharing of memory objects am...
A Comparative Evaluation of
- In Proceedings of the Fourteenth Annual Workshop on Languages and Compilers for Parallel Computing
, 2001
While uniprocessor garbage collection is relatively well understood, experience with collectors for large multiprocessor servers is limited and it is unknown which techniques best scale with large memories and large numbers of processors. In order to explore these issues we designed a modular garbage collection framework in the IBM Jalapeno Java virtual machine and implemented five different parallel garbage collectors: non-generational and generational versions of mark-and-sweep and semi-space copying collectors, as well as a hybrid of the two. We describe the optimizations necessary to achieve good performance across all of the collectors, including load balancing, fast synchronization, and inter-processor sharing of free lists. We then quantitatively compare the different collectors to find their asymptotic performance both with respect to how fast they can run applications as well as how little memory they can run them in. All of our collectors scale linearly up to sixteen processors. The least memory is usually required by the hybrid mark-sweep collector that uses a copying collector for its nursery, although sometimes the non-generational mark-sweep collector requires less memory. The fastest execution is more application-dependent. Our only application with a large working set performed best using the mark-sweep collector; with one exception, the rest of the applications ran fastest with one of the generational collectors.
Parallel Generational-Copying Garbage Collection with a
We present a parallel generational-copying garbage collector implemented for the Glasgow Haskell Compiler. We use a block-structured memory allocator, which provides a natural granularity for dividing the work of GC between many threads, leading to a simple yet effective method for parallelising copying GC. The results are encouraging: we demonstrate wall-clock speedups of on average a factor of 2 in GC time on a commodity 4-core machine with no programmer intervention, compared to our best sequential GC.
Locality-Aware Many-Core Garbage Collection
The wide-scale deployment of multi-core and many-core processors will necessitate fundamental changes to garbage collectors. Highly parallel garbage collection is critical to the performance of these systems: today's garbage collectors can quickly become the bottleneck for parallel programs. These processors will present additional new challenges: many contain non-uniform memory architectures in which some cores have faster access to certain regions of memory than other regions. This paper presents a new cache-aware approach to garbage collection. Our collector balances the competing concerns of data locality and heap utilization to improve performance. We have implemented our garbage collector and present results on a 64-core TILEPro64 processor. Our cache-aware parallel collector speeds up garbage collection by up to 46.7×.

