Results 1 - 10
of
51
The Stanford FLASH multiprocessor
- In Proceedings of the 21st International Symposium on Computer Architecture
, 1994
"... The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory and high-performance message passing, while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of the machine’s global memory, a port to the interconnection n ..."
Abstract
-
Cited by 311 (19 self)
- Add to MetaCart
The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory and high-performance message passing, while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of the machine’s global memory, a port to the interconnection network, an I/O interface, and a custom node controller called MAGIC. The MAGIC chip handles all communication both within the node and among nodes, using hardwired data paths for efficient data movement and a programmable processor optimized for executing protocol operations. The use of the protocol processor makes FLASH very flexible — it can support a variety of different communication mechanisms — and simplifies the design and implementation. This paper presents the architecture of FLASH and MAGIC, and discusses the base cache-coherence and message-passing protocols. Latency and occupancy numbers, which are derived from our system-level simulator and our Verilog code, are given for several common protocol operations. The paper also describes our software strategy and FLASH’s current status. 1
SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers
- ACM SIGPLAN Notices
, 1994
"... Compiler infrastructures that support experimental research are crucial to the advancement of high-performance computing. New compiler technology must be implemented and evaluated in the context of a complete compiler, but developing such an infrastructure requires a huge investment in time and reso ..."
Abstract
-
Cited by 190 (21 self)
- Add to MetaCart
Compiler infrastructures that support experimental research are crucial to the advancement of high-performance computing. New compiler technology must be implemented and evaluated in the context of a complete compiler, but developing such an infrastructure requires a huge investment in time and resources. We have spent a number of years building the SUIF compiler into a powerful, flexible system, and we would now like to share the results of our efforts. SUIF consists of a small, clearly documented kernel and a toolkit of compiler passes built on top of the kernel. The kernel defines the intermediate representation, provides functions to access and manipulate the intermediate representation, and structures the interface between compiler passes. The toolkit currently includes C and Fortran front ends, a loop-level parallelism and locality optimizer, an optimizing MIPS back end, a set of compiler development tools, and support for instructional use. Although we do not expect SUIF to be suitable for everyone, we think it may be useful for many other researchers. We thus invite you to use SUIF and welcome your contributions to this infrastructure. Directions for obtaining the SUIF software are included at the end of this paper. 1
Instruction-Level Parallel Processing: History, Overview and Perspective
, 1992
"... Instruction-level Parallelism CILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a muc ..."
Abstract
-
Cited by 166 (0 self)
- Add to MetaCart
Instruction-level Parallelism CILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a much more significant force in computer design. Several systems were built, and sold commercially, which pushed ILP far beyond where it had been before, both in terms of the amount of ILP offered and in the central role ILP played in the design of the system. By the end of the decade, advanced microprocessor design at all major CPU manufacturers had incorporated ILP, and new techniques for ILP have become a popular topic at academic conferences. This article provides an overview and historical perspective of the field of ILP and its development over the past three decades.
The Multiscalar Architecture
, 1993
"... The centerpiece of this thesis is a new processing paradigm for exploiting instruction level parallelism. This paradigm, called the multiscalar paradigm, splits the program into many smaller tasks, and exploits fine-grain parallelism by executing multiple, possibly (control and/or data) depen-dent t ..."
Abstract
-
Cited by 113 (8 self)
- Add to MetaCart
The centerpiece of this thesis is a new processing paradigm for exploiting instruction level parallelism. This paradigm, called the multiscalar paradigm, splits the program into many smaller tasks, and exploits fine-grain parallelism by executing multiple, possibly (control and/or data) depen-dent tasks in parallel using multiple processing elements. Splitting the instruction stream at statically determined boundaries allows the compiler to pass substantial information about the tasks to the hardware. The processing paradigm can be viewed as extensions of the superscalar and multiprocess-ing paradigms, and shares a number of properties of the sequential processing model and the dataflow processing model. The multiscalar paradigm is easily realizable, and we describe an implementation of the multis-calar paradigm, called the multiscalar processor. The central idea here is to connect multiple sequen-tial processors, in a decoupled and decentralized manner, to achieve overall multiple issue. The mul-tiscalar processor supports speculative execution, allows arbitrary dynamic code motion (facilitated by an efficient hardware memory disambiguation mechanism), exploits communication localities, and does all of these with hardware that is fairly straightforward to build. Other desirable aspects of the
Reducing indirect function call overhead in c++ programs
- In POPL ’94: Proceedings of the 21st ACM SIGPLAN-SIGACT symposium on Principles of programming languages
, 1994
"... Modern computer architectures increasingly depend on mechanisms that estimate fhture control flow decisions to increase performance. Mechanisms such as speculative execution and prefetching are becoming standard architectural mechanisms that rely on control flow prediction to prefetch and speculativ ..."
Abstract
-
Cited by 112 (5 self)
- Add to MetaCart
Modern computer architectures increasingly depend on mechanisms that estimate fhture control flow decisions to increase performance. Mechanisms such as speculative execution and prefetching are becoming standard architectural mechanisms that rely on control flow prediction to prefetch and speculatively execute future instructions. At the same time, computer programmers are increasingly turning to object-oriented languages to increase their productivity. These languages commonly use run time dispatching to implement object polymorphism. Dispatching is usually implemented using an indirect finction call, which presents challenges to existing control flow prediction techniques. We have measured the occurrence of indirect function calls in a collection of C++ programs. We show that, although it is more important to predict branches accurately, indirect call prediction is also an important factor in some programs and will grow in importance with the growth of object-oriented programming. We examine the improvement offered by compile-time optimization and static and dynamic prediction techniques, and demonstrate how compilers can use existing branch prediction mechanisms to improve performance in C++ programs. Using these methods with the programs we examined, the number of instructions between mispredicted breaks in control can be doubled on existing computers.
Dynamic Memory Disambiguation Using the Memory Conflict Buffer
- In Proceedings of the 6th Symposium on Architectural Support for Programming Languages and Operating Systems
, 1994
"... To exploit instruction level parallelism, compilers for VLIW and superscalar processors often employ static code scheduling. However, the available code reordering may be severely restricted due to ambiguous dependences between memory instructions. This paper introduces a simple hardware mechanism, ..."
Abstract
-
Cited by 84 (5 self)
- Add to MetaCart
To exploit instruction level parallelism, compilers for VLIW and superscalar processors often employ static code scheduling. However, the available code reordering may be severely restricted due to ambiguous dependences between memory instructions. This paper introduces a simple hardware mechanism, referred to as the memory con ict bu er, which facilitates static code scheduling in the presence of memory store/load dependences. Correct program execution is ensured by the memory con ict bu er and repair code provided by the compiler. With this addition, signi cant speedup over an aggressivecodescheduling model can be achieved for both non-numerical and numerical programs. 1
Quantifying behavioral differences between C and C++ programs
- JOURNAL OF PROGRAMMING LANGUAGES
, 1994
"... Improving the performance of C programs has been a topic of great interest for many years. Both hardware technology and compiler optimization research has been applied in an effort to make C programs execute faster. In many application domains, the C++ language is replacing C as the programming lang ..."
Abstract
-
Cited by 83 (15 self)
- Add to MetaCart
Improving the performance of C programs has been a topic of great interest for many years. Both hardware technology and compiler optimization research has been applied in an effort to make C programs execute faster. In many application domains, the C++ language is replacing C as the programming language of choice. In this paper, we measure the empirical behavior of a group of significant C and C++ programs and attempt to identify and quantify behavioral differences between them. Our goal is to determine whether optimization technology that has been successful for C programs will also be successful in C++ programs. We furthermore identify behavioral characteristics of C++ programs that suggest optimizations that should be applied in those programs. Our results show that C++ programs exhibit behavior that is significantly different than C programs. These results should be of interest to compiler writers and architecture designers who are designing systems to execute object-oriented programs.
Compiling for the Multiscalar Architecture
, 1998
"... High-performance, general-purpose microprocessors serve as compute engines for computers ranging from personal computers to supercomputers. Sequential programs constitute a major portion of real-world software that run on the computers. State-of-the-art microprocessors exploit instruction level para ..."
Abstract
-
Cited by 45 (2 self)
- Add to MetaCart
High-performance, general-purpose microprocessors serve as compute engines for computers ranging from personal computers to supercomputers. Sequential programs constitute a major portion of real-world software that run on the computers. State-of-the-art microprocessors exploit instruction level parallelism (ILP) to achieve high performance on such applications by searching for independent instructions in a dynamic window of instructions and executing them on a wide-issue pipeline. Increasing the window size and the issue width to extract more ILP may hinder achieving high clock speeds, limiting overall performance. The Multiscalar architecture employs multiple small windows and many narrow-issue processing units to exploit ILP at high clock speeds. Sequential programs are partitioned into code fragments called tasks, which are speculatively executed in parallel. Inter-task register dependences are honored via communication and synchronization and inter-task control flow and memory depe...
Sentinel scheduling: a model for compiler-controlled speculative execution
- ACM Transactions on Computer Systems
, 1993
"... Speculative execution is an important source of parallelism for VLIW and superscalar processors. A serious challenge with compiler-controlled speculative execution is to e ciently handle exceptions for speculative instructions. In this paper, a set of architectural features and compiletime schedulin ..."
Abstract
-
Cited by 41 (13 self)
- Add to MetaCart
Speculative execution is an important source of parallelism for VLIW and superscalar processors. A serious challenge with compiler-controlled speculative execution is to e ciently handle exceptions for speculative instructions. In this paper, a set of architectural features and compiletime scheduling support collectively referred to as sentinel scheduling is introduced. Sentinel scheduling provides an e ective framework for both compiler-controlled speculative execution and exception handling. All program exceptions are accurately detected and reported in a timely manner with sentinel scheduling. Recovery from exceptions is also ensured with the model. Experimental results show the e ectiveness of sentinel scheduling for exploiting instruction-level parallelism and the overhead associated with exception handling. Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles{associative memories � C.0[Computer Systems Organization]: General{hardware/software interfaces� instruction set design � system architectures � C.1.2 [Processor Architectures]: Single Data Stream Architectures{pipeline processors � D.2.4 [Software Engineering]: Testing and Debugging{
Cyclone: A broadcast-free dynamic instruction scheduler with selective replay
- In Proceedings of the 30th annual International Symposium on Computer Architecture
"... To achieve high instruction throughput, instruction schedulers must be capable of producing high-quality schedules that maximize functional unit utilization while at the same time enabling fast instruction issue logic. Many solutions exist to the scheduling problem, ranging from compile-time to run- ..."
Abstract
-
Cited by 38 (0 self)
- Add to MetaCart
To achieve high instruction throughput, instruction schedulers must be capable of producing high-quality schedules that maximize functional unit utilization while at the same time enabling fast instruction issue logic. Many solutions exist to the scheduling problem, ranging from compile-time to run-time approaches. Compile-time solutions feature fast and simple hardware, but at the expense of conservative schedules. Dynamic schedulers produce high-quality schedules that incorporate run-time information and dependence speculation, but implementing these schedulers requires complex circuits that can slow processor clock speeds. In this paper, we present the Cyclone scheduler, a novel design that captures the benefits of both compileand run-time scheduling. Our approach utilizes a listbased single-pass instruction scheduling algorithm, implemented by hardware at run-time in the front end of the processor pipeline. Once scheduled, instructions are injected into a timed queue that orchestrates their entry into execution. To accommodate branch and load/store dependence speculation, the Cyclone scheduler supports a simple selective replay mechanism. We implement this technique by overloading instruction register forwarding to also detect instructions dependent on incorrectly scheduled operations. Detailed simulation analyses suggest that with sufficient queue width, the Cyclone scheduler can rival the instruction throughput of similarly wide monolithic dynamic schedulers. Furthermore, the circuit complexity of the Cyclone scheduler is much more favorable than a broadcast-based scheduler, as our approach requires no global control signals. 1

