Space-Efficient Scheduling of Multithreaded Computations
 SIAM Journal on Computing
, 1993
Cited by 87 (15 self)
This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution. Utilizing a new graph-theoretic model of multithreaded computation, execution efficiency is quantified by three important measures: $T_1$ is the time required for executing the computation on 1 processor, $T_\infty$ is the time required by an infinite number of processors, and $S_1$ is the space required to execute the computation on 1 processor. A computation executed on $P$ processors is time-efficient if the time is $O(T_1/P + T_\infty)$, that is, it achieves linear speedup when $P = O(T_1/T_\infty)$, and it is space-efficient if it uses $O(S_1 P)$ total space, that is, the space per processor is within a constant factor of that required for a 1-processor execution. The first result derived from this model shows that there exist multithreaded computations such that no execution schedule can simultan...
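As a rough illustration of the two bounds named in this abstract, the sketch below evaluates the one-processor time divided by P plus the infinite-processor time, and the one-processor space times P, for a made-up computation. The function names and the sample numbers are assumptions for illustration only, and the constant factors hidden by the O-notation are simply taken to be 1.

```python
def time_bound(t1: float, t_inf: float, p: int) -> float:
    """Return T_1/P + T_inf, the expression inside the O(.) time bound.

    The hidden constant factor is assumed to be 1 (illustration only).
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    return t1 / p + t_inf


def space_bound(s1: float, p: int) -> float:
    """Return S_1 * P, the expression inside the O(.) total-space bound."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return s1 * p


def speedup_estimate(t1: float, t_inf: float, p: int) -> float:
    """Estimated speedup T_1 / (T_1/P + T_inf) under the assumptions above."""
    return t1 / time_bound(t1, t_inf, p)


if __name__ == "__main__":
    # Hypothetical computation: T_1 = 1e6 steps, T_inf = 1e3 steps, S_1 = 64 units.
    t1, t_inf, s1 = 1_000_000.0, 1_000.0, 64.0
    for p in (1, 10, 100, 1_000, 10_000):
        print(p, round(speedup_estimate(t1, t_inf, p), 1), space_bound(s1, p))
```

With these sample values the ratio of one-processor time to infinite-processor time is 1000, so the estimated speedup stays close to P while P is well below 1000 and flattens out beyond it.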
The data locality of work stealing
 Theory of Computing Systems
, 2000
Cited by 74 (13 self)
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations $G_n$, each member of which requires $\Theta(n)$ total instructions (work), for which when using work stealing the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is $\Omega(n)$. This implies that for general computations there is no useful bound relating multiprocessor to uniprocessor cache misses. For nested-parallel computations, however, we show that on $P$ processors the expected additional number of cache misses beyond those on a single processor is bounded by $O(C \lceil m/s \rceil P T_\infty)$, where $m$ is the execution time of an instruction incurring a cache miss, $s$ is the steal time, $C$ is the size of cache, and $T_\infty$ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional work loads but improves the performance up to 50% over static partitioning under multiprogrammed work loads. Furthermore, locality-guided work stealing improves the performance of work stealing by up to 80%.
Executing Multithreaded Programs Efficiently
, 1995
Cited by 69 (8 self)
An introduction to randomized algorithms
 Discrete Appl Math
, 1991
Cited by 32 (0 self)
Research conducted over the past fifteen years has amply demonstrated the advantages of algorithms that make random choices in the course of their execution. This paper presents a wide variety of examples intended to illustrate the range of applications of randomized algorithms, and the general principles and approaches that are of greatest use in their construction. The examples are drawn from many areas, including number theory, algebra, graph theory, pattern matching, selection, sorting, searching, computational geometry, combinatorial enumeration, and parallel and distributed computation.
1. Foreword
This paper is derived from a series of three lectures on randomized algorithms presented by the author at a conference on combinatorial mathematics and algorithms held at George Washington University in May, 1989. The purpose of the paper is to convey, through carefully selected examples, an understanding of the nature of randomized algorithms, the range of their applications and the principles underlying their construction. It is not our goal to be encyclopedic, and thus the paper should not be regarded as a comprehensive survey of the subject. This paper would not have come into existence without the magnificent efforts of Professor Rodica Simion, the organizer of the conference at George Washington University. Working from the tape-recorded lectures, she created a splendid transcript that served as the first draft of the paper. Were it not for her own reluctance she would be listed as my coauthor.
Communication Complexity for Parallel Divide-and-Conquer
 In Proceedings of the 32nd Annual Symposium on Foundations of Computer Science
, 1991
Cited by 30 (2 self)
This paper studies the relationship between parallel computation cost and communication cost for performing divide-and-conquer (D&C) computations on a parallel system of $p$ processors. The parallel computation cost is the maximal number of the D&C nodes that any processor in the parallel system may expand, whereas the communication cost is the total number of cross nodes. A cross node is a node which is generated by one processor but expanded by another processor. A new scheduling algorithm is proposed, whose parallel computation cost and communication cost are at most $\lceil N/p \rceil$ and $pdh$, respectively, for any D&C computation tree with $N$ nodes, height $h$, and degree $d$. Also, lower bounds on the communication cost are derived. In particular, it is shown that for each scheduling algorithm and for each positive $\epsilon_C < 1$, which can be arbitrarily close to 0, there are values of $N$, $h$, $d$, $p$, and $\epsilon_T$ ($> 0$), for which if the parallel computation cost is between $N/p$ (the minimum) and (1 + $\epsilon_T$ ...
Rule-base structure identification in an adaptive-network-based fuzzy inference system
 IEEE Trans. Fuzzy Syst
, 1994
Cited by 26 (0 self)
Fuzzy rule-base modeling is the task of identifying the structure and the parameters of a fuzzy IF-THEN rule base so that a desired input/output mapping is achieved. Recently, using adaptive networks to fine-tune membership functions in a fuzzy rule base has received more and more attention. In this paper we summarize Jang’s architecture of employing an adaptive network and the Kalman filtering algorithm to identify the system parameters. Given a surface structure, the adaptively adjusted inference system performs well on a number of interpolation problems. We generalize Jang’s basic model so that it can be used to solve classification problems by employing parameterized t-norms. We also enhance the model to include weights of importance so that feature selection becomes a component of the modeling scheme. Next, we discuss two ways of identifying system structures based on Jang’s architecture. For the top-down approach, we summarize several ways of partitioning the feature space and propose a method of using clustering objective functions to evaluate possible partitions. We analyze the overall learning and operation complexity. In particular, we pinpoint the dilemma between two desired properties: modeling accuracy and pattern matching efficiency. Based on the analysis, we suggest a bottom-up approach of using rule organization to meet the conflicting requirements. We introduce a data structure, called a fuzzy binary box-tree, to organize rules so that the rule base can be matched against input signals with logarithmic efficiency. To preserve the advantage of parallel processing assumed in fuzzy rule-based inference systems, we give a parallel algorithm for pattern matching with a linear speedup. Moreover, as we consider the communication and storage cost of an interpolation model, it is important to extract the essential components of the modeled system and use the rest as a backup. We propose a rule combination mechanism to build a simplified version of the original rule base according to a given focus set. This scheme can be used in various situations of pattern representation or data compression, such as in image coding or in hierarchical pattern recognition.
Tight bounds for on-line tree embeddings
 In Proc. 2nd ACM-SIAM Symp. on Discrete Algorithms
, 1991
Algorithms for Combinatorial Optimization in Real Time and their Automated Refinement by Genetics-Based Learning
 University of Illinois at Urbana-Champaign
, 1994
Cited by 7 (1 self)
The goal of this research is to develop a systematic, integrated method of designing efficient search algorithms that solve optimization problems in real time. Search algorithms studied in this thesis comprise meta-control and primitive search. The class of optimization problems addressed is called combinatorial optimization problems, examples of which include many NP-hard scheduling and planning problems, and problems in operations research and artificial-intelligence applications. The problems we have addressed have a well-defined problem objective and a finite set of well-defined problem constraints. In this research, we use state-space trees as problem representations. The approach we have undertaken in designing efficient search algorithms is an engineering approach and consists of two phases: (a) designing generic search algorithms, and (b) using genetics-based machine learning methods to improve the parametric heuristics used in the search algorithms designed. Our approach is a systematic method that integrates domain knowledge, search techniques, and automated learning techniques for designing better search algorithms. Knowledge captured in designing one search algorithm can be carried over for designing new ones.
Efficient Parallel Divide-and-Conquer for a Class of Interconnection Topologies
 In Proceedings of the 2nd International Symposium on Algorithms, number 557 in Lecture Notes in Computer Science
, 1991
Cited by 7 (1 self)
In this paper, we propose an efficient scheduling algorithm for expanding any divide-and-conquer (D&C) computation tree on $k$-dimensional mesh, hypercube, and perfect shuffle networks with $p$ processors. Assume that it takes $t_n$ time steps to expand one node of the tree and $t_c$ time steps to transmit one datum or convey one node. For any D&C computation tree with $N$ nodes, height $h$, and degree $d$ (maximal number of children of any node), our algorithm requires at most $(N/p + h)t_n + \varphi d h t_c$ time steps, where $\varphi$ is $O(\log^2 p)$ on a hypercube or perfect shuffle network and is $O(\sqrt[k]{p})$ on an $n_{k-1} \times \cdots \times n_0$ mesh network, where $n_{k-1} = \cdots = n_0 = \sqrt[k]{p}$. This algorithm is general in the sense that it does not know the values of $N$, $h$, and $d$, or the shape of the computation tree, a priori. Most importantly, we can easily obtain a linear speedup by nearly a factor of $p$, especially when $N \gg ph(1 + \varphi d t_c/t_n)$.
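To make the time bound in this abstract concrete, the sketch below evaluates (N/p + h) times the node-expansion time plus the network factor times d, h, and the transmission time. It is an illustration under stated assumptions, not the authors' algorithm: the function names are hypothetical, the network-dependent factor is passed in directly as `phi` rather than derived from a topology, and the sequential baseline of N node expansions on one processor is an assumption made here to estimate speedup.

```python
def dc_time_bound(n_nodes: int, height: int, degree: int, p: int,
                  t_n: float, t_c: float, phi: float) -> float:
    """Upper bound (N/p + h) * t_n + phi * d * h * t_c from the abstract.

    phi is the network-dependent factor; it is supplied by the caller
    (assumption: no topology model is implemented here).
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    compute = (n_nodes / p + height) * t_n        # node-expansion term
    communicate = phi * degree * height * t_c     # communication term
    return compute + communicate


def dc_speedup_estimate(n_nodes: int, height: int, degree: int, p: int,
                        t_n: float, t_c: float, phi: float) -> float:
    """Speedup relative to an assumed sequential time of N * t_n."""
    sequential = n_nodes * t_n
    return sequential / dc_time_bound(n_nodes, height, degree, p, t_n, t_c, phi)


if __name__ == "__main__":
    # Hypothetical tree: one million nodes, height 20, binary, on 64 processors.
    print(dc_speedup_estimate(1_000_000, 20, 2, 64, t_n=1.0, t_c=1.0, phi=36.0))
```

When N is much larger than p·h·(1 + phi·d·t_c/t_n), the N/p term dominates the bound and the estimate approaches p, which matches the abstract's remark about near-linear speedup.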