Results 1 - 10
of
23
The Jrpm System for Dynamically Parallelizing Java Programs
- In Proceedings of the 30th International Symposium on Computer Architecture
, 2003
"... We describe the Java runtime parallelizing machine (Jrpm), a complete system for parallelizing sequential programs automatically. Jrpm is based on a chip multiprocessor (CMP) with thread-level speculation (TLS) support. CMPs have low sharing and communication costs relative to traditional multt)roce ..."
Abstract
-
Cited by 50 (4 self)
- Add to MetaCart
We describe the Java runtime parallelizing machine (Jrpm), a complete system for parallelizing sequential programs automatically. Jrpm is based on a chip multiprocessor (CMP) with thread-level speculation (TLS) support. CMPs have low sharing and communication costs relative to traditional multt)rocessors, and thread-level speculation (TLS) simplifies program parallelization by allowing us to parallelize optimistically without violating correct sequential program behavior. Using a Java virtual machine with dynamic compilation support coupled with a hardware profiler, speculative buffer requirements and inter-thread dependencies of prospective speculative thread loops (STLs) are analyzed in real-time to identi the best loops to parallelize. Once sufficient data has been collected to make a reasonable decision, selected loops are dynamically recompiled to run in parallel Experimental results demonstrate that Jrpm can exploit thread-level parallelism with minimal effort from the programmer. On four processors, we achieved speedups of 3 to 4 for floating point applications, to 3 on multimedia applications, and between 1.5 and .5 on integer applications. Performance was achieved by automatic selection of thread decompositions by the hardware profiler, intra-procedural optimizations on code compiled dynamically into speculative threads, and some minor programmer transformations for exposing parallelism that cannot be performed automatically.
Using Thread-Level Speculation to Simplify Manual Parallelization
, 2003
"... In this paper, we provide examples of how thread-level speculation (TLS) simplifies manual parallelization and enhances its performance. A number of techniques for manual parallelization using TLS are presented and results are provided that indicate the performance contribution of each technique on ..."
Abstract
-
Cited by 39 (7 self)
- Add to MetaCart
In this paper, we provide examples of how thread-level speculation (TLS) simplifies manual parallelization and enhances its performance. A number of techniques for manual parallelization using TLS are presented and results are provided that indicate the performance contribution of each technique on seven SPEC CPU2000 benchmark applications. We also provide indications of the programming effort required to parallelize each benchmark. TLS parallelization yielded a 110% speedup on our four floating point applications and a 70% speedup on our three integer applications, while requiring only approximately 80 programmer hours and 150 lines of non-template code per application. These results support the idea that manual parallelization using TLS is an efficient way to extract fine-grain thread-level parallelism.
Mitosis compiler: An infrastructure for speculative threading based on pre-computation slices
- In PLDI ’05: Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
, 2005
"... Speculative parallelization can provide significant sources of additional thread-level parallelism, especially for irregular applications that are hard to parallelize by conventional approaches. In this paper, we present the Mitosis compiler, which partitions applications into speculative threads, w ..."
Abstract
-
Cited by 37 (4 self)
- Add to MetaCart
Speculative parallelization can provide significant sources of additional thread-level parallelism, especially for irregular applications that are hard to parallelize by conventional approaches. In this paper, we present the Mitosis compiler, which partitions applications into speculative threads, with special emphasis on applications for which conventional parallelizing approaches fail. The management of inter-thread data dependences is crucial for the performance of the system. The Mitosis framework uses a pure software approach to predict/compute the thread’s input values. This software approach is based on the use of pre-computation slices (p-slices), which are built by the Mitosis compiler and added at the beginning of the speculative thread. P-slices must compute thread input values accurately but they do not need to guarantee correctness, since the underlying architecture can detect and recover from misspeculations. This allows the compiler to use aggressive/unsafe optimizations to significantly reduce their overhead. The most important optimizations included in the Mitosis compiler and presented in this paper are branch pruning, memory and register dependence speculation, and early thread squashing. Performance evaluation of Mitosis compiler/architecture shows an average speedup of 2.2.
Temporally Silent Stores
- In Proceedings of PACT-2000
, 2002
"... Recent work has shown that silent stores--stores which write a value matching the one already stored at the memory location--occur quite frequently and can be exploited to reduce memory traffic and improve performance. This paper extends the definition of silent stores to encompass sets of stores th ..."
Abstract
-
Cited by 31 (3 self)
- Add to MetaCart
Recent work has shown that silent stores--stores which write a value matching the one already stored at the memory location--occur quite frequently and can be exploited to reduce memory traffic and improve performance. This paper extends the definition of silent stores to encompass sets of stores that change the value stored at a memory location, but only temporarily, and subsequently return a previous value of interest to the memory location. The stores that cause the value to revert are called temporally silent stores. We redefine multiprocessor sharing to account for temporal silence and show that in the limit, up to 45% of communication misses in scientific and commercial applications can be eliminated by exploiting values that change only temporarily. We describe a practical mechanism that detects temporally silent stores and removes the coherence traffic they cause in conventional multiprocessors. We find that up to 42% of communication misses can be eliminated with a simple extension to the MESI protocol. Further, we examine application and operating system code to provide insight into the temporal silence phenomenon and characterize temporal silence by examining value frequencies and dynamic instruction distances between temporally silent pairs. These studies indicate that the operating system is involved heavily in temporal silence, in both commercial and scientific workloads, and that while detectable synchronization primitives provide substantial contributions, significant opportunity exists outside these references.
Exposing speculative thread parallelism in SPEC2000
- In Proc. of PPoPP’05
, 2005
"... As increasing the performance of single-threaded processors becomes increasingly difficult, consumer desktop processors are moving toward multi-core designs. One way to enhance the performance of chip multiprocessors that has received considerable attention is the use of thread-level speculation (TL ..."
Abstract
-
Cited by 24 (0 self)
- Add to MetaCart
As increasing the performance of single-threaded processors becomes increasingly difficult, consumer desktop processors are moving toward multi-core designs. One way to enhance the performance of chip multiprocessors that has received considerable attention is the use of thread-level speculation (TLS). As a case study, we manually parallelized several of the SPEC CPU2000 floating point and integer applications using TLS. The use of manual parallelization enabled us to apply techniques and programmer expertise that are beyond the current capabilities of automated parallelizers. With the experience gained from this, we provide insight into ways to aggressively apply TLS to parallelize applications for high performance. This information can help guide future advanced TLS compiler design. For each application, we discuss how and where parallelism was located within the application, the impediments to extracting this parallelism using TLS, and the code transformations that were required to overcome these impediments. We also generalize these experiences to a discussion of common hindrances to TLS parallelization, and describe methods of programming that help expose application parallelism to TLS systems. These guidelines can assist developers of uniprocessor programs to create applications that can easily port to TLS systems and yield good performance. By using manual parallelization on SPEC2000, we provide guidance on where thread-level parallelism exists in these well known benchmarks, what limits its extraction, how to reduce these limitations and what performance can be expected on these applications from a chip multiprocessor system with TLS.
The Role of Return Value Prediction in Exploiting Speculative Method-Level Parallelism
- JOURNAL OF INSTRUCTION-LEVEL PARALLELISM
, 2003
"... This work studies the performance impact of return value prediction in a system that supports speculative method-level parallelism (SMLP). A SMLP system creates a speculative thread at each method call, allowing the method and the code from which it is called to be executed in parallel. To improv ..."
Abstract
-
Cited by 18 (0 self)
- Add to MetaCart
This work studies the performance impact of return value prediction in a system that supports speculative method-level parallelism (SMLP). A SMLP system creates a speculative thread at each method call, allowing the method and the code from which it is called to be executed in parallel. To improve performance, the return values of methods are predicted in hardware so that no method has to wait for its sub-method to complete before continuing to execute. For Java programs, we find that two-thirds of methods have a non-void return type, and perfect return value prediction improves performance by an average of 44% compared with a system with no return value prediction. However, the performance of realistic predictors is limited by poor prediction accuracy on integer return values and undesirable update characteristics due to the SMLP environment. A Parameter Stride (PS) return value predictor is proposed to address some of the deficiencies of the standard predictors by predicting based on method arguments. Combining the PS predictor with previous predictors results in 7% speedup on average versus a system with hybrid return value prediction, and 21% speedup versus a system with no return value prediction.
TEST: A Tracer for Extracting Speculative Threads
- In The 2003 International Symposium on Code Generation and Optimization
, 2003
"... Thread-level speculation (TLS) allows sequential programs to be arbitrarily decomposed into threads that can be safely executed in parallel. A key challenge for TLS processors is choosing thread decompositions that speedup the program. Current techniques for identifying decompositions have practical ..."
Abstract
-
Cited by 14 (2 self)
- Add to MetaCart
Thread-level speculation (TLS) allows sequential programs to be arbitrarily decomposed into threads that can be safely executed in parallel. A key challenge for TLS processors is choosing thread decompositions that speedup the program. Current techniques for identifying decompositions have practical limitations in real systems. Traditional parallelizing compilers do not work effectively on most integer programs, and software profiling slows down program execution too much for real-time analysis. Tracer for Extracting Speculative Threads (TEST) is hardware support that analyzes sequential program execution to estimate performance of possible thread decompositions. This hardware is used in a dynamic parallelization system that automatically transforms unmodified, sequential Java programs to run on TLS processors. In this system, the best thread decompositions found by TEST are dynamically recompiled to run speculatively. This paper describes the analysis performed by TEST and presents simulation results demonstrating its effectiveness on real programs. Estimates are also provided that show the tracer requires minimal hardware additions to our speculative chip-multiprocessor (< 1 % of the total transistor count) and causes only minor slowdowns to programs during analysis (3-25%). 1.
Tolerating dependences between large speculative threads via sub-threads
- In ISCA ’06
, 2006
"... Thread-level speculation (TLS) has proven to be a promising method of extracting parallelism from both integer and scientific workloads, targeting speculative threads that range in size from hundreds to several thousand dynamic instructions and have minimal dependences between them. Recent work has ..."
Abstract
-
Cited by 7 (2 self)
- Add to MetaCart
Thread-level speculation (TLS) has proven to be a promising method of extracting parallelism from both integer and scientific workloads, targeting speculative threads that range in size from hundreds to several thousand dynamic instructions and have minimal dependences between them. Recent work has shown that TLS can offer compelling performance improvements for database workloads, but only when targeting much larger speculative threads of more than 50,000 dynamic instructions per thread, with many frequent data dependences between them. To support such large and dependent speculative threads, hardware must be able to buffer the additional speculative state, and must also address the more challenging problem of tolerating the resulting cross-thread data dependences. In this paper we present hardware support for large speculative threads that integrates several previous proposals for TLS hardware. We also introduce support for subthreads: a mechanism for tolerating cross-thread data dependences by checkpointing speculative execution. When speculation fails due to a violated data dependence, with sub-threads the failed thread need only rewind to the checkpoint of the appropriate sub-thread rather than rewinding to the start of execution; this significantly reduces the cost of mis-speculation. We evaluate our hardware support for large and dependent speculative threads in the database domain and find that the transaction response time for three of the five transactions from TPC-C (on a simulated 4-processor chip-multiprocessor) speedup by a factor of 1.9 to 2.9. 1.
Spice: Speculative Parallel Iteration Chunk Execution
"... ABSTRACT The recent trend in the processor industry of packing multiple pro-cessor cores in a chip has increased the importance of automatic techniques for extracting thread level parallelism. A promising ap-proach for extracting thread level parallelism in general purpose applications is to apply m ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
ABSTRACT The recent trend in the processor industry of packing multiple pro-cessor cores in a chip has increased the importance of automatic techniques for extracting thread level parallelism. A promising ap-proach for extracting thread level parallelism in general purpose applications is to apply memory alias or value speculation to breakdependences amongst threads and executes them concurrently.
Instruction-level parallelism through Microthreading -- a scalable Approach to chip multiprocessors
- THE COMPUTER JOURNAL
, 2006
"... Most microprocessor chips today use an out-of-order instruction execution mechanism. This mechanism allows superscalar processors to extract reasonably high levels of instruction level parallelism (ILP). The most significant problem with this approach is a large instruction window and the logic to s ..."
Abstract
-
Cited by 4 (3 self)
- Add to MetaCart
Most microprocessor chips today use an out-of-order instruction execution mechanism. This mechanism allows superscalar processors to extract reasonably high levels of instruction level parallelism (ILP). The most significant problem with this approach is a large instruction window and the logic to support instruction issue from it. This includes generating wake-up signals to waiting instructions and a selection mechanism for issuing them. Wide-issue width also requires a large multi-ported register file, so that each instruction can read and write its operands simultaneously. Neither structure scales well with issue width leading to poor performance relative to the gates used. Furthermore, to obtain this ILP, the execution of instructions must proceed speculatively. An alternative, which avoids this complexity

