Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping
- In PLDI, 2009
- Cited by 11 (0 self)
Compiler-based auto-parallelization is a much-studied area, yet has still not found widespread application. This is largely due to the poor exploitation of application parallelism, subsequently resulting in performance levels far below those which a skilled expert programmer could achieve. We have identified two weaknesses in traditional parallelizing compilers and propose a novel, integrated approach, resulting in significant performance improvements of the generated parallel code. Using profile-driven parallelism detection, we overcome the limitations of static analysis, enabling us to identify more application parallelism and rely on the user only for final approval. In addition, we replace the traditional target-specific and inflexible mapping heuristics with a machine-learning based prediction mechanism, resulting in better mapping decisions while providing more scope for adaptation to different target architectures.
Feedback-driven threading: power-efficient and high-performance execution of multithreaded workloads on CMPs
- In Proc. 13th ACM Symposium on Architectural Support for Programming Languages and Operating Systems, 2008
- Cited by 11 (1 self)
Extracting high performance from the emerging Chip Multiprocessors (CMPs) requires that the application be divided into multiple threads. Each thread executes on a separate core, thereby increasing concurrency and improving performance. As the number of cores on a CMP continues to increase, the performance of some multi-threaded applications will benefit from the increased number of threads, whereas the performance of other multi-threaded applications will become limited by data-synchronization and off-chip bandwidth. For applications that get limited by data-synchronization, increasing the number of threads significantly degrades performance and increases on-chip power. Similarly, for applications that get limited by off-chip bandwidth, increasing the number of threads increases on-chip power without providing any performance improvement. Furthermore, whether an application
A Dynamic Periodicity Detector: Application to Speedup Computation
- In Proceedings of International Parallel and Distributed Processing Symposium (IPDPS), 2001
- Cited by 6 (2 self)
We propose a dynamic periodicity detector (DPD) for the estimation of periodicities in data series obtained from the execution of applications. We analyze the algorithm used by the periodicity detector and its performance on a number of data streams. It is shown how the periodicity detector is used for the segmentation and prediction of data streams. In an application case we describe how the periodicity detector is applied to the dynamic detection of iterations in parallel applications, where the detected segments are evaluated by a speedup computation tool. We test the performance of the periodicity detector on a number of parallelized benchmarks. The periodicity detector correctly identifies the iterations of parallel structures also in the case where the application has nested parallelism. In our implementation we measure only a negligible overhead produced by the periodicity detector. We find the DPD to be useful and suitable for the incorporation in dynamic optimization tools.
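The idea of finding a period in an execution data stream and cutting the stream into iterations can be illustrated with a minimal sketch. This is not the DPD algorithm from the paper, whose details the abstract does not give; it assumes an exact-match test over a finite stream, and the names `detect_period`, `segment`, and the sample stream are hypothetical.

```python
def detect_period(samples, max_period):
    """Return the smallest period p (1 <= p <= max_period) such that
    samples[i] == samples[i - p] for every i >= p, or None if none fits.

    Assumption: exact repetition; a real detector would tolerate noise.
    """
    n = len(samples)
    for p in range(1, max_period + 1):
        # Require at least two full repetitions before trusting a period.
        if 2 * p > n:
            break
        if all(samples[i] == samples[i - p] for i in range(p, n)):
            return p
    return None


def segment(samples, period):
    """Split the stream into consecutive segments of one period each."""
    return [samples[i:i + period] for i in range(0, len(samples), period)]


# Hypothetical stream: identifiers of the parallel loops entered over time.
stream = [1, 2, 3, 1, 2, 3, 1, 2, 3]
period = detect_period(stream, max_period=4)
iterations = segment(stream, period) if period else []
```

Each detected segment corresponds to one iteration of the application's outer loop, which is the unit a speedup computation tool could then time.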
Improving Gang Scheduling through Job Performance Analysis and Malleability
- Proceedings of the 15th International Conference on Supercomputing (ICS), 2001
- Cited by 5 (0 self)
The OpenMP programming model provides parallel applications with a very important feature: job malleability. Job malleability is the capacity of an application to dynamically adapt its parallelism to the number of processors allocated to it. We believe that job malleability provides applications with the flexibility that a system needs to achieve its maximum performance. We also argue that a system has to base its decisions not only on user requirements but also on run-time performance measurements to ensure the efficient use of resources. Job malleability is the application characteristic that makes run-time performance analysis possible. Without malleability, applications would not be able to adapt their parallelism to the system decisions. To support these ideas, we present two new approaches to attack the two main problems of Gang Scheduling: the excessive number of time slots and the fragmentation. Our first proposal is to apply a scheduling policy inside each time slot of Gang Scheduling to distribute processors among applications considering their efficiency, calculated based on run-time measurements. We call this policy Performance-Driven Gang Scheduling. Our second approach is a new re-packing algorithm, Compress&Join, that exploits job malleability. This algorithm modifies the processor allocation of running applications to adapt it to the system's needs and to minimize the fragmentation and the number of time slots. These proposals have been implemented on an SGI Origin 2000 with 64 processors. Results show the validity and convenience of both considering the job performance analysis calculated at run-time to decide the processor allocation and using a flexible programming model that adapts applications to system decisions.
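A rough sketch of distributing processors inside one time slot according to measured efficiency is given below. It is not the Performance-Driven Gang Scheduling policy itself, which the abstract does not specify; it assumes efficiency is measured speedup divided by allocated processors, and the function name `rebalance`, the `target` threshold, and the job data are hypothetical.

```python
def rebalance(alloc, speedup, target=0.7):
    """One rebalancing step inside a time slot.

    alloc:   job name -> processors currently allocated
    speedup: job name -> speedup measured at run time with that allocation
    target:  assumed efficiency threshold (speedup / processors)

    Jobs below the target lose one processor; the freed processors go to
    jobs at or above the target, most efficient first.
    """
    new_alloc = dict(alloc)
    freed = 0
    for job, procs in alloc.items():
        if procs > 1 and speedup[job] / procs < target:
            new_alloc[job] = procs - 1
            freed += 1

    efficient = sorted(
        (job for job in alloc if speedup[job] / alloc[job] >= target),
        key=lambda job: speedup[job] / alloc[job],
        reverse=True,
    )
    # If no job is efficient enough, the freed processors stay unassigned.
    i = 0
    while freed and efficient:
        new_alloc[efficient[i % len(efficient)]] += 1
        freed -= 1
        i += 1
    return new_alloc


# Hypothetical measurements: job "a" scales well, job "b" does not.
result = rebalance({"a": 8, "b": 8}, {"a": 7.2, "b": 3.0})
```

The step only works if jobs are malleable, since each job must be able to continue running with the changed processor count.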
Bossa: A Dsl Framework For Application-Specific . . .
- In Eighth Workshop on Hot Topics in Operating Systems, 2001
- Cited by 4 (0 self)
Developing or specializing existing process schedulers for new needs is tedious and error-prone due to the lack of modularity and inherent complexity of scheduling mechanisms. In this paper, we propose a framework based on a Domain-Specific Language for the implementation of scheduling policies. This framework permits the installation of basic scheduling policies, called Virtual Schedulers, and the development of Application-Specific Policies, which tailor a Virtual Scheduler to application-specific requirements. We illustrate our approach with concrete examples that show how specialization and reuse of scheduling policies can be accomplished while retaining OS robustness. Key-words: Process Scheduling, Operating systems, Domain-Specific Languages. This research is partially supported by France Telecom R&D under the Phenix project. Centre National de la Recherche Scientifique, Institut National de Recherche en Informatique (UPRESSA 6074), Université de Rennes 1 -- I...
Thread Tailor: Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications
- Cited by 3 (0 self)
Extracting performance from modern parallel architectures requires that applications be divided into many different threads of execution. Unfortunately, selecting the appropriate number of threads for an application is a daunting task. Having too many threads can quickly saturate shared resources, such as cache capacity or memory bandwidth, thus degrading performance. On the other hand, having too few threads makes inefficient use of the resources available. Beyond static resource assignment, the program inputs and dynamic system state (e.g., what other applications are executing in the system) can have a significant impact on the right number of threads to use for a particular application. To address this problem, we present the Thread Tailor, a dynamic system that automatically adjusts the number of threads in an application to optimize system efficiency. The Thread Tailor leverages offline analysis to estimate what type of threads will exist at runtime and the communication patterns between them. Using this information, Thread Tailor dynamically combines threads to better suit the needs of the target system. Thread Tailor adjusts not only to the architecture, but also to other applications in the system, and this paper demonstrates that this type of adjustment can lead to significantly better use of thread-level parallelism in real-world architectures.
Evaluation of the Memory Page Migration Influence in the System Performance: The case of the SGI O2000
- In Proceedings of the 17th annual international conference on Supercomputing, 2002
- Cited by 2 (0 self)
Current shared-memory multiprocessor CC-NUMA architectures have the main characteristic that they provide a global address space to applications in hardware. However, even though the memory is virtually shared, it is physically distributed. Since memory nodes are distributed across the system, the cost of accessing memory depends on the distance between the node that accesses the data and the node that physically contains the data. To reduce the impact of a bad initial memory placement, some operating systems offer a dynamic memory migration mechanism.
Multitasking Workload Scheduling on Flexible Core Chip Multiprocessors
While technology trends have ushered in the age of chip multiprocessors (CMP) and enabled designers to place an increasing number of cores on chip, a fundamental question is what size to make each core. Most current commercial designs are symmetric CMPs in which each core is identical, with cores ranging from a relatively simple RISC pipeline to a large and complicated out-of-order x86 core. When the granularity of parallelism in the tasks matches the granularity of the processing cores, a CMP will be at its most efficient. To adjust the granularity of a core to the tasks running on it, recent research has proposed flexible-core chip multiprocessors, which typically consist of a number of small processing cores that can be aggregated to form larger logical processors. These architectures introduce a new resource allocation and scheduling problem: determining how many logical processors should be configured, how powerful each processor should be, and where/when each task should run. This paper introduces and motivates this new scheduling problem, describes the challenges associated with it, and examines and evaluates several algorithms (amenable to implementation in an operating system) appropriate for such flexible-core CMPs. We also describe how scheduling for flexible-core architectures differs from scheduling for fixed multicore architectures, and compare the performance of flexible-core CMPs to both symmetric and asymmetric fixed-core CMPs.
Loosely Coordinated Coscheduling In The Context Of . . .
- 2005
This paper is organized in the following way. Section 2 provides an overview of objectives and job characteristics. Section 3 gives a brief overview of standard space sharing and discusses the dynamic space-sharing approaches of full preemption and adaptive resource allocation. Section 4 is the core of the paper and presents dynamic time-sharing approaches, i.e. loosely coordinated coscheduling and gang scheduling, with special focus on loosely coordinated coscheduling and relaxed forms of gang scheduling. We include a discussion of resource share allocation which plays a role especially in time sharing. In Section 5, we conclude with a comparison of loosely coordinated coscheduling with the other dynamic scheduling

