Computational testing: Why, how and how much (1990)
Venue: ORSA Journal on Computing
Citations: 10 (0 self)
BibTeX
@ARTICLE{Greenberg90computationaltesting,
author = {Harvey J. Greenberg},
title = {Computational testing: Why, how and how much},
journal = {ORSA Journal on Computing},
year = {1990},
pages = {94--97}
}
Abstract
This paper considers the issues associated with performing and reporting computational testing in the context of original research in operations research, particularly in its interface with computer science. The scope includes nonnumerical computations, such as fundamental algorithms and information structures in new modeling languages, and many criteria besides speed of computation, like robustness and depth. Beginning with a foundation of why we perform computational testing, some principles are described that draw heavily from the works of others. Then, methods of how to perform computational testing are suggested, ranging from quantitative statistical techniques to qualitative factor analysis. Finally, the issue of how much computational testing is appropriate is considered in the context of guidelines for referees and editors who deal with such issues.

This paper addresses the issues of computational testing in the context of its role in reporting research results. The approach is very much influenced by the recent report of the Committee On Algorithms (COAL) of the Mathematical Programming Society, by R.H.F. Jackson et al. [9]. Here we give greater consideration to nonnumerical algorithms, such as testing the quality of modeling languages. Even when testing is performed on numerical algorithms, like optimization, speed of computation is not regarded here as the sole objective. Other concerns include accuracy of results and robustness. Even these terms may have different meanings, depending upon whether we are testing algorithms, models, languages, environments, or systems. Even within each of these objects there remains variation; for example, numerical algorithms have different measurements of accuracy and robustness than non-numerical ones. We emphasize that, unlike the ORSA guidelines, we consider purposes of computational testing beyond the accompaniment of algorithms. Some of these additional reasons for computational testing are addressed by Zimmermann.

The rest of this report is divided into three sections. In the next section we elaborate upon reasons for performing and reporting computational testing. At the same time, some principles are given and some are tacitly assumed. Section 2 suggests broad methods of testing, recognizing the strengths and weaknesses of each. Finally, Section 3 gives guidelines for how much computational testing is appropriate, depending on why it is being performed. This comprises guidelines for referees and editors for the ORSA Journal on Computing.

Why Perform Computational Testing

The reasons to perform computational testing in the context of conducting and reporting research are to demonstrate:
- correctness of model or algorithm
- quality of solution
- speed of computation
- robustness.
These are, of course, not exhaustive, but they are enough to serve our purposes.

The extent to which one might report computational tests that suggest mere correctness is difficult to quantify. Certainly such testing takes place in any algorithm implementation. When it is straightforward, the computational tests are not reported (or they are enhanced to satisfy another goal, like demonstrating speed of computation). There are, however, some cases where correctness is an issue that cannot be settled, and computational testing does offer some illumination. One example is the neural net proposed by Hopfield and Tank to solve the Travelling Salesman Problem (TSP).
The dynamics of its penalty trajectories require more analysis, and their paper was about the feasibility of building a neural computer (hardware) to solve the TSP, not about how well it does so. Properties of theoretical convergence (say, by statistical considerations like simulated annealing) are not well understood at present, so empirical results are important to add insight into the neural dynamics. Similar situations prevail for other algorithm trajectories.

The case of model correctness is also part of this. One good example of using computational testing to address model correctness is Manne's introduction of the refinery linear programming model [12]. After presenting a detailed description of assumptions and mathematical relations, computed shadow prices are compared against actual market prices. One of the essential purposes of computational testing in the simulation community is to address the issue of model correctness (see Balci and Sargent for a survey). Early concerns with language seem not to be addressed, at least in the present context. Random number generation is another topic that has had computational testing.

New aids for incorporating artificial intelligence into the modeling environment raise new concerns about the quality of model generation, which can be addressed by computational testing. Furthermore, there are new concerns for model correctness raised by parallel computation. Nance [14] studied time flow mechanisms and reported computational tests for the machine interference problem as a way to illustrate the issue of how to represent concurrence (for example, not registering a departure from an empty queue). This is now a central issue at the crux of parallel simulation, where there is a risk of overparallelization to gain speedup. The result may not be valid if the implementation does not limit parallelization to what is natural for the model, especially for analog computation.

Another problem in simulation is the sample size. A recent note by Harris [6] reflects both the importance and the scientific care that must be taken. Other parameters may be best studied empirically; and, even when this leads to theoretical (or analytical) results on how to set certain control parameters, the computational tests that led to that discovery may be important to report. Another example of sharing empirically derived insights is the well-orchestrated study by Glover et al. [4].

How to Perform Computational Testing

Methods of computational testing include:
- statistical analysis, which presumes random generation over a problem space and collects performance values of replicated trials;
- library analysis, which uses a fixed library that is generally available to the professional community and whose properties are already known or are reported along with the computational test results.

An early example of statistical analysis to evaluate performance is the use of Analysis of Variance by Moore and Whinston to measure the significance of algorithm parameters and problem attributes against computation time and iterations. A more recent analysis was given by Hoaglin et al. Generators, such as NETGEN [10], enabled some degree of randomization while retaining realistic structural properties. In attempting to use CPU time as a measure of goodness, Gilsinn et al. reported a pitfall when running in a multitasking environment: they showed significant variation of times due to job mix and time of day (see also O'Neill [16]).
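The pitfall noted by Gilsinn et al. suggests, at a minimum, reporting replicated timings with their spread rather than a single figure, and distinguishing process (CPU) time from wall-clock time. The sketch below is only a minimal illustration of that practice, not a reconstruction of any cited study; make_instance and solve are hypothetical stand-ins for the problem generator and algorithm under test.

```python
# Minimal sketch of replicated timing trials. `make_instance(seed)` and
# `solve(instance)` are hypothetical stand-ins for the generator and the
# algorithm under test.
import statistics
import time

def timed_trials(make_instance, solve, seeds):
    """Run one solve per seed, recording CPU and wall-clock times separately."""
    cpu_times, wall_times = [], []
    for seed in seeds:
        instance = make_instance(seed)
        cpu0, wall0 = time.process_time(), time.perf_counter()
        solve(instance)
        cpu_times.append(time.process_time() - cpu0)
        wall_times.append(time.perf_counter() - wall0)
    return cpu_times, wall_times

def report(label, times):
    """Report the spread, not just a single figure, so job-mix effects show up."""
    print(f"{label}: mean={statistics.mean(times):.4f}s "
          f"stdev={statistics.pstdev(times):.4f}s "
          f"min={min(times):.4f}s max={max(times):.4f}s")

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs as written.
    import random
    make_instance = lambda seed: random.Random(seed).choices(range(1000), k=50_000)
    solve = lambda instance: sorted(instance)
    cpu, wall = timed_trials(make_instance, solve, seeds=range(30))
    report("CPU  ", cpu)
    report("Wall ", wall)
```

A large gap between the CPU and wall-clock summaries, or a wide spread across replications, is exactly the kind of job-mix effect the cited studies warn about.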
The critical thing about using statistical analysis for any purpose is the design of the experiment. In most (perhaps all) cases, random generation of the coefficients without regard for problem structure is rather meaningless, certainly for any alleged demonstration of quality or speed. Instead, one may generate a particular model, like production scheduling and distribution, and randomize some of the basic data, such as costs, within realistic ranges. The Workbench for Research In (linear) Programming [5] (WRIP) is designed to contain this capability in its model generation/variation language.

Library analysis has always been popular in demonstrating merits of mathematical programming algorithms. Its main virtue is familiarity among the professional community. One recent example is the NETLIB linear programs put into the public domain by Gay [2] and thoroughly analyzed by Lustig. It is important to note that a library should reflect a reality, such as problems collected from industries. While "toy problems" are sometimes useful for exposition (and may be encouraged for just that reason), they seldom offer credibility in the present context of computational testing. If some industrial problems are used as paradigms but scaled down due, for example, to budget constraints, problem size should still be realistic if inferences about actual performance are to be credible. The meaning of "size" may depend upon the situation; and, for non-numeric applications, this may be difficult to separate from complexity. An example of the latter is evaluating an x-by-example approach, where x may be query or model.

One disadvantage of a library, even one as extensive as NETLIB, is the inability to make inferences in some rigorous manner. This may be overcome (as in WRIP) by taking a particular problem, such as a library member, and allowing controlled random variations. One form of variation is perturbation of the model's numerical values, keeping the topology fixed (such as the sign pattern of an LP matrix). Another variation is augmentation of special objects or relations (such as redundant or infeasible rows in an LP). In short, controlled randomization is a valuable method for satisfying the need for computational testing. The "control" must be part of the experimental design, which could target certain classes of problems.
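As a concrete illustration of the controlled randomization just described, the following sketch perturbs each nonzero coefficient of an LP matrix within a relative range while leaving the sign pattern (the topology) untouched. It assumes NumPy and uses a made-up coefficient matrix rather than an actual library problem; it is not the WRIP generation/variation language itself.

```python
# Minimal sketch of controlled randomization: perturb nonzero LP coefficients
# within a relative range while preserving the sign pattern. The base matrix
# below is an illustrative example, not a library problem.
import numpy as np

def perturb_keep_signs(A, rel_range=0.10, rng=None):
    """Return a copy of A with each nonzero scaled by a factor in
    [1 - rel_range, 1 + rel_range]; zeros (the topology) are untouched."""
    rng = np.random.default_rng() if rng is None else rng
    factors = rng.uniform(1.0 - rel_range, 1.0 + rel_range, size=A.shape)
    return np.where(A != 0, A * factors, 0.0)

if __name__ == "__main__":
    A = np.array([[ 2.0, -1.0, 0.0],
                  [ 0.0,  3.0, 5.0]])
    rng = np.random.default_rng(seed=1)
    for trial in range(3):
        A_trial = perturb_keep_signs(A, rel_range=0.10, rng=rng)
        # The sign pattern is preserved by construction.
        assert np.array_equal(np.sign(A_trial), np.sign(A))
        print(A_trial)
```

Each trial yields a structurally identical instance, so differences in observed performance can be attributed to the controlled numerical variation rather than to a change of problem class.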
The issues of overparallelization, cited above in the context of concurrence in simulation, embody broad issues of scales and measurements. Presently, Richard Barr is conducting a study of how to report speedup in this regard. It is not as straightforward as one might at first imagine. Indeed, if the entire subject of computational testing is in need of research, one focus that has received little attention due to its newness is parallel computation in all areas of operations research.

How Much Computational Testing to Perform

In considering how much computational testing to perform, one must classify the purpose by its need:
- Critical: the merit of the research contribution depends critically, perhaps entirely, on the empirical evidence provided by the computational testing.
- Decisive: the merit of the research contribution depends decisively on the computational test results, but there is merit without it.
- Valuable: the merit of the research is enhanced by the computational test results, but there is enough merit without it.
- Incidental: the merit of the research is unaffected by the computational test results.

There is also the perennial point of proprietary software. If computational testing is critical or even decisive for the paper to endure as a contribution to the art and science of OR/CS, it must follow the same criteria. The author(s) have a greater burden, but it has been done while protecting trade secrets. Alternatively, if disclosure is not desirable, another way to inform the professional community is by a submission to the Software Section (both the ORSA Journal on Computing and Operations Research Letters have this). If one looks at the full spectrum of outlets, there is the research paper at one end and the paid advertisement at the other. Between these are articles in trade magazines (like BYTE), features in newsletters (like the CSTS Newsletter), extended abstracts or demonstrations at professional meetings, and so on. The present consideration is for a research article. For this there can be no compromise with necessity: either computational test results are necessary for publication or they are not. If they are, there is no special dispensation for private owners to withhold vital information. The question then arises, "What about revealing the test problems and the exact code that was run?" This is really the issue of proprietary software in computational testing. Our policy is: any research paper whose merit depends critically (perhaps decisively) on the computational results must be prepared to have the results reviewed by referees.