Compiling for the multiscalar architecture (1998)

by T. N. Vijaykumar
Results 1 - 10 of 53

Compiler Optimization of Scalar Value Communication Between Speculative Threads

by Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan, Todd C. Mowry - In Proceedings of the 10th ASPLOS, 2002
Abstract - Cited by 90 (18 self)
While there have been many recent proposals for hardware that supports Thread-Level Speculation (TLS), there has been relatively little work on compiler optimizations to fully exploit this potential for parallelizing programs optimistically. In this paper, we focus on one important limitation of program performance under TLS, which is stalls due to forwarding scalar values between threads that would otherwise cause frequent data dependences. We present and evaluate dataflow algorithms for three increasingly-aggressive instruction scheduling techniques that reduce the critical forwarding path introduced by the synchronization associated with this data forwarding. In addition, we contrast our compiler techniques with related hardware-only approaches. With our most aggressive compiler and hardware techniques, we improve performance under TLS by 6.2--28.5% for 6 of 14 applications, and by at least 2.7% for half of the other applications.

Citation Context

... TLS support include some form of DOACROSS synchronization, although few use the compiler to optimize this aspect of speculative execution. The most relevant related work is the Wisconsin Multiscalar [12, 28, 35] compiler, which performs synchronization and scheduling for register values [35]. (The Multiscalar effort also evaluated hardware support for automatically detecting and synchronizing data dependence...
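The "critical forwarding path" the abstract refers to can be pictured with a toy model (hypothetical code, for illustration only; not the paper's dataflow algorithms): the consumer thread stalls at a wait until the producer thread executes the matching signal, so scheduling the producing instruction earlier in the producer shortens the stall.

```python
# Toy model of the critical forwarding path in thread-level speculation
# (hypothetical code; not the paper's algorithm). A producer thread
# forwards a scalar when it executes "signal"; the consumer thread
# stalls at the matching "wait" until the signal has run.

def consumer_stall(producer, consumer, value):
    """Cycles the consumer stalls on `value`, assuming one instruction
    per cycle and both threads starting at cycle 0."""
    signal_done = producer.index(("signal", value)) + 1  # cycle the forward completes
    wait_reached = consumer.index(("wait", value))       # cycle the consumer blocks
    return max(0, signal_done - wait_reached)

producer = [("op", 1), ("op", 2), ("op", 3), ("signal", "x")]
consumer = [("wait", "x"), ("use", "x"), ("op", 4)]
print(consumer_stall(producer, consumer, "x"))  # → 4: consumer idles 4 cycles

# Scheduling the producing instruction earlier shortens the critical path:
hoisted = [("signal", "x"), ("op", 1), ("op", 2), ("op", 3)]
print(consumer_stall(hoisted, consumer, "x"))   # → 1
```

The paper's instruction-scheduling techniques aim at exactly this effect: moving the forwarded value's definition (and its signal) as early as dependences allow.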

Mitosis Compiler: An Infrastructure for Speculative Threading Based on PreComputation Slices

by Carlos García Quiñones, Carlos Madriles, Jesús Sánchez, Pedro Marcuello, Antonio González, Dean M. Tullsen - In Conference on Programming Language Design and Implementation , 2005
Abstract - Cited by 81 (4 self)
Speculative parallelization can provide significant sources of additional thread-level parallelism, especially for irregular applications that are hard to parallelize by conventional approaches. In this paper, we present the Mitosis compiler, which partitions applications into speculative threads, with special emphasis on applications for which conventional parallelizing approaches fail. The management of inter-thread data dependences is crucial for the performance of the system. The Mitosis framework uses a pure software approach to predict/compute the thread’s input values. This software approach is based on the use of pre-computation slices (p-slices), which are built by the Mitosis compiler and added at the beginning of the speculative thread. P-slices must compute thread input values accurately but they do not need to guarantee correctness, since the underlying architecture can detect and recover from misspeculations. This allows the compiler to use aggressive/unsafe optimizations to significantly reduce their overhead. The most important optimizations included in the Mitosis compiler and presented in this paper are branch pruning, memory and register dependence speculation, and early thread squashing. Performance evaluation of Mitosis compiler/architecture shows an average speedup of 2.2.

Citation Context

...ify speculative threads and to manage inter-thread data dependences, which are the topic of this paper. The Expandable Split Window Paradigm [10] and the follow-up work, the Multiscalar processor [19][26] were pioneering works in the area of SpMT. Speculative threads (called tasks) are created by the compiler based on several heuristics that tried to minimize the data dependences among threads as well...
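The p-slice mechanism can be sketched as follows (hypothetical Python, not the Mitosis compiler's actual output): only the instructions that feed a speculative thread's live-in value are kept in the slice, the rest are pruned, and the hardware later validates the slice's prediction against the value the committed predecessor actually produced.

```python
# Sketch of a pre-computation slice (p-slice); hypothetical code, not
# the Mitosis compiler's representation. The slice keeps only the
# instructions needed to produce the speculative thread's live-in.

def expensive_side_work(x):
    pass  # stands in for predecessor work irrelevant to the live-in

def full_predecessor(items):
    """Everything the predecessor thread does before the spawn point."""
    total = 0
    for x in items:
        total += x              # only this feeds the live-in `total`
        expensive_side_work(x)  # pruned away in the p-slice
    return total

def p_slice(items):
    """Pruned slice executed at the start of the speculative thread."""
    total = 0
    for x in items:
        total += x
    return total

data = [1, 2, 3, 4]
predicted = p_slice(data)        # cheap prediction of the live-in
actual = full_predecessor(data)  # value available at validation time
print(predicted == actual)       # → True: no squash needed
```

Because a mismatch only costs a squash and re-execution, the slice need not be provably correct, which is what licenses the aggressive pruning the abstract mentions.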

The stampede approach to thread-level speculation

by J. Gregory Steffan, Christopher Colohan, Antonia Zhai, Todd C. Mowry - ACM Transactions on Computer Systems , 2005
Abstract - Cited by 71 (9 self)
Multithreaded processor architectures are becoming increasingly commonplace: many current and upcoming designs support chip multiprocessing, simultaneous multithreading, or both. While it is relatively straightforward to use these architectures to improve the throughput of a multithreaded or multiprogrammed workload, the real challenge is how to easily create parallel software to allow single programs to effectively exploit all of this raw performance potential. One promising technique for overcoming this problem is Thread-Level Speculation (TLS), which enables the compiler to optimistically create parallel threads despite uncertainty as to whether those threads are actually independent. In this article, we propose and evaluate a design for supporting TLS that seamlessly scales both within a chip and beyond because it is a straightforward extension of writeback invalidation-based cache coherence (which itself scales both up and down). Our experimental results demonstrate that our scheme performs well on single-chip multiprocessors where the first-level caches are either private or shared. For our private-cache design, the program performance of two of the 13 general-purpose applications studied improves by 86% and 56%, four others improve by more than 8%, and the average improvement across all applications is 16%, confirming that TLS is a promising way...
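The coherence-based detection scheme the abstract outlines can be approximated with a small sketch (invented class and names; a heavy simplification of any real design): each speculative epoch marks the cache lines it has speculatively read, and an invalidation arriving from a logically earlier epoch for such a line signals a read-before-write violation.

```python
# Sketch of TLS violation detection layered on invalidation-based cache
# coherence (hypothetical simplification, not the paper's protocol).
# Each speculative epoch tracks the lines it has speculatively read; an
# invalidation from a logically earlier epoch to such a line means the
# epoch consumed a stale value and must be squashed.

class SpecCacheState:
    def __init__(self, epoch):
        self.epoch = epoch
        self.spec_read = set()  # lines speculatively loaded this epoch
        self.violated = False

    def spec_load(self, line):
        self.spec_read.add(line)

    def incoming_invalidation(self, line, writer_epoch):
        # Only writes from logically earlier epochs can violate ordering.
        if writer_epoch < self.epoch and line in self.spec_read:
            self.violated = True  # squash and restart this epoch

epoch2 = SpecCacheState(epoch=2)
epoch2.spec_load(0x80)
epoch2.incoming_invalidation(0x80, writer_epoch=3)  # later epoch: harmless
print(epoch2.violated)                              # → False
epoch2.incoming_invalidation(0x80, writer_epoch=1)  # earlier epoch wrote it
print(epoch2.violated)                              # → True
```

Piggybacking on invalidation messages is what lets such a scheme scale with the coherence protocol itself, which is the abstract's central claim.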

Task Selection for a Multiscalar Processor

by T. N. Vijaykumar, Gurindar S. Sohi - In Proceedings of the 31st annual international symposium on Microarchitecture , 1998
Abstract - Cited by 70 (6 self)
The Multiscalar architecture advocates a distributed processor organization and task-level speculation to exploit high degrees of instruction level parallelism (ILP) in sequential programs without impeding improvements in clock speeds. The main goal of this paper is to understand the key implications of the architectural features of distributed processor organization and task-level speculation for compiler task selection from the point of view of performance. We identify the fundamental performance issues to be: control flow speculation, data communication, data dependence speculation, load imbalance, and task overhead. We show that these issues are intimately related to a few key characteristics of tasks: task size, inter-task control flow, and inter-task data dependence. We describe compiler heuristics to select tasks with favorable characteristics. We report experimental results to show that the heuristics are successful in boosting overall performance by establishing larger ILP win...

Citation Context

...o that later iterations get the values of the induction variables from earlier iterations without any delay. Register communication scheduling is not discussed in this paper; details are available in [18]. 3.4. Data dependence heuristic: The key problem with a data dependence is that if the producer is encountered late and the consumer is encoun...

task_selection() { task_size_heuristic(); identify_data_...
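The pseudocode fragment in the citation context (task_selection calling task_size_heuristic and others) suggests the overall shape of the pass; a hedged reconstruction, with names and thresholds invented here, might look like:

```python
# Hedged reconstruction of a multiscalar task-selection pass (names and
# thresholds are invented; only the outline follows the fragment above:
# a task-size heuristic plus a control-flow cue at task boundaries).
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    n_instrs: int
    loop_head: bool = False  # loop heads are natural task boundaries

MAX_TASK_SIZE = 32  # illustrative task-size threshold

def select_tasks(blocks):
    """Greedily grow a task; close it at a loop head (inter-task control
    flow heuristic) or when the size heuristic would be exceeded."""
    tasks, cur, size = [], [], 0
    for b in blocks:
        if cur and (b.loop_head or size + b.n_instrs > MAX_TASK_SIZE):
            tasks.append(cur)
            cur, size = [], 0
        cur.append(b)
        size += b.n_instrs
    if cur:
        tasks.append(cur)
    return tasks

cfg = [Block("entry", 10), Block("loop", 8, loop_head=True),
       Block("body", 30), Block("exit", 4)]
print([[b.name for b in t] for t in select_tasks(cfg)])
# → [['entry'], ['loop'], ['body'], ['exit']]
```

The real pass also weighs inter-task data dependences and load imbalance, which this sketch omits.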

A cost-driven compilation framework for speculative parallelization of sequential programs

by Zhao-hui Du, Chen Yang, Chu-cheow Lim, Qingyu Zhao, Tin-fook Ngai - In ACM SIGPLAN 04 Conference on Programming Language Design and Implementation (PLDI’04 , 2004
Abstract - Cited by 45 (4 self)
The emerging hardware support for thread-level speculation opens new opportunities to parallelize sequential programs beyond the traditional limits. By speculating that many data dependences are unlikely during runtime, consecutive iterations of a sequential loop can be executed speculatively in parallel. Runtime parallelism is obtained when the speculation is correct. To take full advantage of this new execution model, a program needs to be programmed or compiled in such a way that it exhibits high degree of speculative thread-level parallelism. We propose a comprehensive cost-driven compilation framework to perform speculative parallelization. Based on a misspeculation cost model, the compiler aggressively transforms loops into optimal speculative parallel loops and selects only those loops whose speculative parallel execution is likely to improve program

Citation Context

...loops. 2. RELATED WORK: The Multiscalar project was the first comprehensive study of both hardware and software support for speculative multithreading [3][12]. In particular, Vijaykumar et al. described the use of compiler techniques to partition a sequential program into tasks for the Multiscalar architecture [12]. They showed that good task selection is...
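The cost-driven selection the abstract describes can be illustrated with a toy expected-speedup model (the formula here is invented for illustration and is not the paper's actual cost model): a loop is kept as a speculative parallel loop only when its expected parallel time beats sequential execution.

```python
# Toy misspeculation cost model (invented formula, for illustration;
# not the paper's model). Each misspeculated iteration is charged a
# re-execution plus a fixed squash penalty.

def expected_speedup(iter_time, n_threads, p_misspec, squash_penalty):
    """Sequential time over expected speculative-parallel time."""
    sequential = iter_time * n_threads
    parallel = iter_time + p_misspec * n_threads * (iter_time + squash_penalty)
    return sequential / parallel

# Rare misspeculation: the transformed loop is worth keeping.
print(round(expected_speedup(100, 4, 0.05, 20), 2))  # → 3.23
# Frequent misspeculation: speedup collapses below 1; reject the loop.
print(round(expected_speedup(100, 4, 0.9, 20), 2))   # → 0.75
```

A compiler using such a model would transform only loops whose estimated speedup clears a profitability threshold, which is the selection policy the abstract describes.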

A study of control independence in superscalar processors

by Eric Rotenberg, Quinn Jacobson, Jim Smith , 1998
Abstract - Cited by 43 (4 self)
Control independence has been put forward as a significant new source of instruction-level parallelism for future generation processors. However, its performance potential under practical hardware constraints is not known, and even less is understood about the factors that contribute to or limit the performance of control independence. Important aspects of control independence are identified and singled out for study, and a series of idealized machine models are used to isolate and evaluate these aspects. It is shown that much of the performance potential of control independence is lost due to data dependences and wasted resources consumed by incorrect control dependent instructions. Even so, control independence can close the performance gap between real and perfect branch prediction by as much as half. Next, important implementation issues are discussed and some design alternatives are given. This is followed by a more detailed set of simulations, where the key implementation features are realistically modeled. These simulations show typical performance improvements of 10-30%.

Citation Context

...study is only applicable to processors with a single flow of control, we at least get a hint of the control independence potential for some multiscalar design points. For example, Vijaykumar’s thesis [26] indicates average task sizes on the order of 15 instructions (comparable to the fetch width of 16 instructions) and effective window sizes of under 200 instructions for integer benchmarks. Given a mu...

Memory Dependence Prediction

by Andreas Ioannis Moshovos , 1998
Abstract - Cited by 39 (5 self)
As the existing techniques that empower modern high-performance processors are refined and as the underlying technology trade-offs change, new bottlenecks are exposed and new challenges are raised. This thesis introduces a new tool, Memory Dependence Prediction, that can be useful in combating these bottlenecks and meeting the new challenges. Memory dependence prediction is a technique to guess whether a load or a store will experience a dependence. It exploits regularity in the memory dependence stream of ordinary programs, a phenomenon which is also identified in this thesis. To demonstrate the utility of memory dependence prediction, this thesis also presents the following three novel microarchitectural techniques: 1. Dynamic Speculation/Synchronization of Memory Dependences: this thesis demonstrates that, when exploiting parallelism over larger regions of code, waiting to determine the dependences a load has is not the best-performing option. Higher performance is possible if memory dependence speculation is used, especially if memory dependence prediction is used to guide this speculation.

Citation Context

...ad will result in a memory dependence violation, and (2) if so, which is the store this load should wait for. Timing simulations show that for a distributed, split-window processor (i.e., Multiscalar [26, 14, 82, 27, 40, 92, 13]), our technique can improve performance by 28% for integer codes and 15% for floating point codes on the average. More importantly, the performance obtained through the use of our techniques is very ...
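The prediction idea can be reduced to a minimal sketch (hypothetical design, far simpler than the thesis's mechanisms): remember the PCs of loads that caused a dependence violation, and synchronize those loads the next time instead of speculating blindly.

```python
# Minimal sketch of a memory dependence predictor (hypothetical design;
# the thesis's predictors are considerably more sophisticated). Loads
# with a violation history are predicted to have a dependence and made
# to wait; all other loads speculate freely.

class DependencePredictor:
    def __init__(self):
        self.violators = set()  # load PCs that have violated before

    def should_wait(self, load_pc):
        """Predict a dependence for loads with a violation history."""
        return load_pc in self.violators

    def report_violation(self, load_pc):
        """Train on an observed memory-dependence violation."""
        self.violators.add(load_pc)

pred = DependencePredictor()
print(pred.should_wait(0x400))  # → False: speculate the first time
pred.report_violation(0x400)    # squash detected; record the load PC
print(pred.should_wait(0x400))  # → True: synchronize this load from now on
```

A full design would also predict *which* store to wait for (point (2) above); this sketch covers only point (1).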

A Quantitative Assessment of Thread-Level Speculation Techniques

by Pedro Marcuello, Antonio González - IN PROCEEDINGS OF THE 14TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS ’00 , 1999
Abstract - Cited by 37 (0 self)
Speculative thread-level parallelism has been recently proposed as an alternative source of parallelism that can boost performance for applications where independent threads are hard to find. Several schemes to exploit this type of parallelism have been proposed, and significant gains have usually been reported. However, there is a lack of understanding of the sources of these benefits as well as the impact of some design choices. This work analyzes the benefits of different thread speculation techniques and the impact of some critical issues such as the value predictor, the branch predictor, the thread initialization overhead, and the connectivity among thread units.

Control Independence in Trace Processors

by Eric Rotenberg, James E. Smith - IN PROC. 32ND INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE , 1999
Abstract - Cited by 26 (0 self)
Branch mispredictions are a major obstacle to exploiting instruction-level parallelism, at least in part because all instructions after a mispredicted branch are squashed. However, instructions that are control independent of the branch must be fetched regardless of the branch outcome, and do not necessarily have to be squashed and re-executed. Control independence exists when the two paths following a branch re-converge. A trace processor microarchitecture is developed to exploit control independence and thereby reduce branch misprediction penalties. There are three major contributions. 1) Trace-level re-convergence is not guaranteed despite re-convergence at the instruction-level. Novel trace selection techniques are developed to expose control independence at the trace-level. 2) Control independence’s potential complexity stems from insertion and removal of instructions from the middle of the instruction window. Trace processors manage control flow hierarchically (traces are the fundamental unit of control flow) and this results in an efficient implementation. 3) Control independent instructions must be inspected for incorrect data dependences caused by mispredicted control flow. Existing data speculation support is easily leveraged to selectively re-execute incorrect-data dependent, control independent instructions. Control independence improves trace processor performance from 2% to 25%, and 13% on average, for the SPEC95 integer benchmarks.

Citation Context

...rocessors (Peleg & Weiser, 1995; Rotenberg et al. 1996, 1997; Patel et al. 1997, 1998), trace selection for compilers (Fisher, 1981; Hwu & Chang, 1988), and task selection for multiscalar processors (Vijaykumar, 1998; Vijaykumar & Sohi, 1998). 1.3 Paper Organization: Section 2 describes the trace processor’s novel window management, i.e. support for instruction insertion/removal from the middle of the window (both...

Architecture of the Atlas chip multiprocessor: Dynamically parallelizing irregular applications

by L. Codrescu, D. S. Wills - In Proc. of the 1999 International Conference on Computer Design (ICCD ’99), 1999
Abstract - Cited by 25 (2 self)
Abstract not found