The Warp Computer: Architecture, Implementation, and Performance
- IEEE Transactions on Computers, 1987
Abstract - Cited by 42 (2 self)
The Warp machine is a systolic array computer of linearly connected cells, each of which is a programmable processor capable of performing 10 million floating-point operations per second (10 MFLOPS). A typical Warp array includes 10 cells, thus having a peak computation rate of 100 MFLOPS. The Warp array can be extended to include more cells to accommodate applications capable of using the increased computational bandwidth. Warp is integrated as an attached processor into a UNIX host system. Programs for Warp are written in a high-level language supported by an optimizing compiler.
Evaluating the Use of Register Queues in Software Pipelined Loops
- In IEEE Transactions on Computers, Vol. 50, No. 8, 2001
Abstract
In this paper we examine the effectiveness of a new hardware mechanism, called Register Queues (RQs), which effectively decouples the architected register space from the physical registers. Using RQs, the compiler can allocate physical registers to store live values in the software pipelined loop while minimizing the pressure placed on architected registers. We show that decoupling the architected register space from the physical register space can greatly increase the applicability of software pipelining, even as memory latencies increase. RQs combine the major aspects of existing rotating register file and register connection techniques to generate efficient software pipeline schedules. Through the use of RQs, we can minimize the register pressure and code expansion caused by software pipelining. We demonstrate the effect of incorporating register queues and software pipelining with 983 loops taken from the Perfect Club, the SPEC suites, and the Livermore Kernels. Index Terms: software pipelining, modulo variable expansion, rotating register file, register queues, VLIW, register connection.
Probabilistic Predicate-Aware Modulo Scheduling
Abstract
Predicated execution enables the removal of branches by converting segments of branching code into sequences of conditional operations. An important side effect of this transformation is that the compiler must unconditionally assign resources to predicated operations. However, a resource is only put to productive use when the predicate associated with an operation evaluates to True. To reduce this superfluous commitment of resources, we propose probabilistic predicate-aware scheduling to assign multiple operations to the same resource at the same time, thereby over-subscribing its use. Assignment is performed in a probabilistic manner using a combination of predicate profile information and predicate analysis aimed at maximizing the benefits of over-subscription in view of the expected degree of conflict. Conflicts occur when two or more operations assigned to the same resource have their predicates evaluate to True. A predicate-aware VLIW processor pipeline detects such conflicts, recovers, and correctly executes the conflicting operations. By increasing the effective throughput of a fixed set of resources, probabilistic predicate-aware scheduling provided an average performance gain of 20% in our evaluations on a 4-issue processor, and a gain of 8% on a 6-issue processor.

