### Table 3: Correlation between observed values and out-of-sample predictions: The out-of-sample predictions show the marginal benefits of modeling the sender and receiver random effects, as well as the additional predictive gain of the inner product of latent positions. For cooperation, a generalized linear model produces out-of-sample predictions that have a correlation of 0.13 with out-of-sample measurements, which rises to 0.68 when random effects for senders and receivers are included and to 0.86 when the inner product of latent positions is also modeled. The gain is even stronger for conflict, rising from 0.07 in the first instance to 0.91 in the last. These patterns are less striking when measured on the log scale, but still evident.

2003

"... In PAGE 13: ... the correlations of predicted and actual responses were computed in a raw, untransformed as well as a logarithmic scale.16 Table 3 presents the results of this out-of-sample experiment. The addition of the random effects (spanning the random intercepts for senders and receivers as well as the inner product of the latent positions) increases predictive performance substantially, viewed in terms of the correlation between actual and predicted responses.... ..."
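The out-of-sample comparison described above can be sketched as follows; the synthetic data, the stand-in "model" predictions, and the log1p transform are illustrative assumptions, not the paper's actual models or values.

```python
import numpy as np

# Hypothetical observations and out-of-sample predictions from a weak
# model (GLM only) and a strong one (random effects + latent positions).
rng = np.random.default_rng(0)
y = rng.exponential(scale=10.0, size=500)       # observed responses
pred_glm = y * 0.1 + rng.normal(0, 8, 500)      # noise-dominated predictor
pred_full = y + rng.normal(0, 2, 500)           # much closer predictor

def oos_correlation(obs, pred, log_scale=False):
    """Correlation of observed vs. predicted responses, optionally on a
    log(1 + x) scale (negative predictions clamped to 0 first)."""
    if log_scale:
        obs = np.log1p(np.maximum(obs, 0))
        pred = np.log1p(np.maximum(pred, 0))
    return np.corrcoef(obs, pred)[0, 1]

print(oos_correlation(y, pred_glm))    # low correlation
print(oos_correlation(y, pred_full))   # high correlation
```

The richer model's predictions track the observations far more closely, mirroring the jump from 0.13 to 0.86 reported for cooperation.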

### Table 1 contains numerical results for the block Jacobi and Gauss-Seidel methods applied to the backward Euler equations. The number of iterations required to converge to the desired solution increases linearly with the number of timesteps in parallel, as predicted by the theory. This negates any potential benefits from parallelization. 6. Point Jacobi Iterative Schemes. In this section we consider the point Jacobi iterative method for the backward Euler method for simplicity. We consider the serial case and analyze the time it takes to solve for a single time step. 6.1. Backward Euler. For the conventional backward Euler scheme the iteration matrix is given by
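For the serial point Jacobi case, a minimal sketch of one backward Euler step (I - dt·A)u_new = u_old solved by point Jacobi; the 1D Laplacian operator, grid size, and timestep are illustrative assumptions, not the paper's problem.

```python
import numpy as np

# Assumed example operator: 1D Laplacian on n interior points.
n, dt = 50, 0.01
A = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1)) * (n + 1) ** 2
M = np.eye(n) - dt * A                 # backward Euler system matrix
u_old = np.sin(np.pi * np.linspace(0, 1, n))

def jacobi_solve(M, b, tol=1e-10, max_iter=10_000):
    """Solve M x = b by point Jacobi; returns (x, iteration count)."""
    D = np.diag(M)                     # diagonal of M
    R = M - np.diag(D)                 # off-diagonal remainder
    x = np.zeros_like(b)
    for k in range(1, max_iter + 1):
        x_new = (b - R @ x) / D        # one point Jacobi sweep
        if np.linalg.norm(x_new - x, np.inf) < tol:
            return x_new, k
        x = x_new
    return x, max_iter

u_new, iters = jacobi_solve(M, u_old)
```

M is strictly diagonally dominant here (diagonal 1 + 2·dt·(n+1)², off-diagonals dt·(n+1)²), so the sweep converges, though slowly as dt·(n+1)² grows.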

### Table 3: Instructions retired per cycle on the three configurations. Adding inactive issue shows a further improvement over non-partial matching. The increase, averaged across all the benchmarks, is 15%. For the benchmarks that are easily predicted, adding inactive issue on to partial matching shows no significant improvement. To understand the performance benefits of partial matching and inactive issue we examine how these different schemes impact the three main factors which

### Table 1: Receiver Processing Steps (Single ADU) There is clearly significant potential here for integrated processing to deliver performance benefits. However, the constraints on integrated processing are not trivially apparent. Intuitively, we could predict that transformation and expansion steps will present some intrinsic boundaries to the reach of an ILP loop, since these steps involve movement between different buffer types. This is discussed further below. However, we first analyse the performance of the unintegrated receiver.

"... In PAGE 4: ... We can consider the ILP potential for the receiver set of functions by informal description of the original data manipulation steps derived from the XV-based implementation. They are presented in Table 1. We present these steps somewhat mechanistically, i.... In PAGE 5: ... The "DCT" stage is the inverse DCT operation for the entire ADU. The "post DCT" is all steps up to the actual X call to display the MCU (see Table 1). It is a somewhat moot point whether "expand" should be included in this stage.... In PAGE 6: ... In [2] all data manipulation tasks in protocols are characterised to consist of: for loops (iteration), data read/write, and computation. An ILP implementation attempts to minimise the costs of the first two. As an experiment, for all steps in Table 1 from the Huffman decode onwards, we remove all code identifiable as "computation". In practice, we limit each step to its read of the data in "input" format, and its write in "output" format.... In PAGE 6: ... We in fact use the same data structures. We have not removed any of the data movements in Table 1 other than those associated with computation.... In PAGE 7: ... The interactions of these with reading and writing of data, and their cache effects, require further examination. 5 Integration An ideal ILP implementation will roll together all the processing steps outlined in Table 1 into a single loop. However, there may be some fundamental constraints that prevent this ideal integration.... In PAGE 8: ... We have also removed two redundant copies in the post DCT stage. We removed a copy between the "ycc to rgb" step and the "dither" (see Table 1), and we place the output of the dither straight into the X output buffer. Non Int.... ..."
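The contrast between unintegrated per-step processing and an ideal single integrated (ILP-style) loop can be illustrated schematically; the three steps below are stand-ins, not the actual decode, transform, or dither stages.

```python
# Stand-in processing steps (illustrative only).
def step_a(x): return x + 1
def step_b(x): return x * 2
def step_c(x): return x - 3

def unintegrated(data):
    """Each step runs its own loop, writing and re-reading an
    intermediate buffer between stages."""
    buf1 = [step_a(v) for v in data]
    buf2 = [step_b(v) for v in buf1]
    return [step_c(v) for v in buf2]

def integrated(data):
    """One loop: each element flows through all steps while it is
    still hot, with no intermediate buffers."""
    return [step_c(step_b(step_a(v))) for v in data]

print(integrated(range(4)))  # -> [-1, 1, 3, 5]
```

Both versions produce identical output; the integrated form simply eliminates the intermediate reads and writes, which is where the ILP gains come from. Steps that change buffer type, as the text notes, resist this fusion.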


### Table 1 shows the correct prediction ratio for the benchmark programs as a function of the number of guesses per entry. From the table one can see that the number of correct guesses increases with the number of guesses per entry. Also, the table shows that 4 guesses per entry capture most of the benefit of multiple guesses. In the next section we describe how to use prediction and prefetching to build a memory system with existing memory parts.

"... In PAGE 3: ... Our simulations show that four predictions for each entry yield a very high prediction rate. Table 1 shows the correct prediction rate for ten benchmark programs. We have used a block size of 256 bytes and a prediction table with 256K entries with four predictions per entry.... In PAGE 4: ...888 0.899 Table 1: Prediction Accuracy vs. Number of Predictions. Experiment parameters: Prediction Table Size - 256K entries; Cluster Size - 256 bytes; Prediction Block Size - 256 bytes; Number of Predictions - variable; Replacement Policy - LRU. Table 1 shows the correct prediction ratio for the benchmark programs as a function of the number of guesses per entry.... ..."
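A minimal sketch of a prediction table holding multiple guesses per entry with LRU replacement, matching the experiment's description; the class shape and the tiny training trace are assumptions for illustration, not the paper's design.

```python
from collections import OrderedDict

class PredictionTable:
    """Maps a block address to up to k predicted successor blocks,
    evicting the least recently used guess when an entry is full."""
    def __init__(self, guesses_per_entry=4):
        self.k = guesses_per_entry
        self.table = {}  # block -> OrderedDict of successors (LRU order)

    def predict(self, block):
        """Return current guesses for a block, most recent first."""
        return list(reversed(self.table.get(block, OrderedDict())))

    def update(self, block, successor):
        entry = self.table.setdefault(block, OrderedDict())
        entry.pop(successor, None)
        entry[successor] = True        # mark as most recently used
        while len(entry) > self.k:
            entry.popitem(last=False)  # evict the LRU guess

# Train on an assumed trace where block 1's successor alternates 2, 3.
pt = PredictionTable(guesses_per_entry=4)
trace = [1, 2, 1, 3, 1, 2, 1, 3]
for prev, nxt in zip(trace, trace[1:]):
    pt.update(prev, nxt)

print(pt.predict(1))  # -> [3, 2]
```

With several guesses retained per entry, both observed successors of block 1 survive, which is why the hit ratio in Table 1 rises with the number of guesses.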

### Table 11: Benefits

1998

Cited by 5


### Table 4 contains some of the prediction-related characteristics for the VP-Tint and the two read port stride predictor. Over all of the programs, the VP-Tint hardware provides a significantly lower misprediction rate (MP Rate) than the table-based methods, 6.08% versus 10.30% for the two port stride value predictor. Often in value prediction schemes, the total number of correct predictions must be sacrificed for a higher accuracy. However, even with the superior accuracy, the VP-Tint still provides 2.75 times the number of correct predictions as the two read port predictor (0.94 times as many correct predictions as a 16 port stride predictor). The lower section of the table presents the instruction coverage (Coverage) for the two read port stride value predictor and the VP-Tint. This represents the percentage of instructions that have an available history. For the stride value predictor, this results from a tag-match in the table. For Traveling Speculations, this means any instruction fetched from the tinted trace cache. The results in the table show that the Traveling Speculations do a better job of providing per instruction history

"... In PAGE 14: ... The precise reason is difficult to isolate since performance benefits of value prediction are based on a combination of factors, such as misprediction rate, prediction confidence, update frequency, varying rewards for correct predictions, and varying penalties for mispredictions [18, 34]. Table 4: Analysis of Value Prediction Effectiveness... ..."
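As a point of reference for the table-based baseline being compared against, a classic last-value-plus-stride predictor can be sketched as follows; the table organisation here is a deliberately simplified assumption (no tags, read ports, or confidence counters).

```python
class StridePredictor:
    """Per-PC table of (last value, stride); predicts last + stride."""
    def __init__(self):
        self.table = {}  # PC -> (last_value, stride)

    def predict(self, pc):
        """Predicted next value for an instruction, or None when no
        history is available (the no-tag-match case in the text)."""
        if pc not in self.table:
            return None
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, value):
        """Record the produced value and recompute the stride."""
        if pc in self.table:
            last, _ = self.table[pc]
            self.table[pc] = (value, value - last)
        else:
            self.table[pc] = (value, 0)

sp = StridePredictor()
for v in (10, 14, 18):    # an instruction producing a stride-4 sequence
    sp.update(0x400, v)
print(sp.predict(0x400))  # -> 22
```

Coverage in this scheme is exactly the fraction of dynamic instructions that find their PC in the table, which is the quantity the lower section of Table 4 reports.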

### Table 4: Results from using (Prc) as a preconditioner to DVDSON, versus using GMRES(5) with preconditioner (Prc). In Table 4, results from combining some of the previously tested preconditioners with GMRES(5) are presented. GMRES is allowed to run for 5 iterations. The total number of iterations decreases in general and the method demonstrates the robustness predicted earlier. Note especially that for the difficult cases even the time is reduced. However, the matrix-vector multiplications increase, and for the easier LITHIUM case this method is much slower. Therefore, this method cannot be beneficial to all cases, because of the potential cost penalty.

1995

Cited by 20
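To make the cost trade-off concrete, here is a minimal unpreconditioned GMRES(m) cycle: each cycle costs about m + 1 matrix-vector products, which is the source of the extra multiplications noted above. Applying a preconditioner would amount to running the same cycle on the preconditioned operator; the diagonal test matrix below is an assumed stand-in, not a quantum-chemistry problem.

```python
import numpy as np

def gmres_cycle(A, b, x0, m):
    """One GMRES(m) cycle: Arnoldi to build a Krylov basis, then a
    small least-squares solve. Costs m + 1 products with A."""
    n = b.size
    r0 = b - A @ x0
    beta = np.linalg.norm(r0)
    if beta == 0:
        return x0                      # already converged
    Q = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    Q[:, 0] = r0 / beta
    for j in range(m):
        w = A @ Q[:, j]
        for i in range(j + 1):         # modified Gram-Schmidt
            H[i, j] = Q[:, i] @ w
            w -= H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-14:        # lucky breakdown
            m = j + 1
            break
        Q[:, j + 1] = w / H[j + 1, j]
    e1 = np.zeros(m + 1)
    e1[0] = beta
    y, *_ = np.linalg.lstsq(H[:m + 1, :m], e1, rcond=None)
    return x0 + Q[:, :m] @ y

A = np.diag(np.arange(1.0, 21.0))      # simple SPD stand-in matrix
b = np.ones(20)
x = np.zeros(20)
for _ in range(30):                    # restarted GMRES(5)
    x = gmres_cycle(A, b, x, 5)
```

Thirty restarts of GMRES(5) spend roughly 180 matrix-vector products; whether that beats a cheaper inner solve depends on how much each product costs, which is exactly the case-by-case penalty the caption describes.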