Results 1 - 7 of 7
Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance
In Proceedings of the Second SIAM Conference on Data Mining, 2002
"... With recent technological advances, shared memory parallel machines have become more scalable, and oer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining alg ..."
Abstract - Cited by 41 (15 self)
With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms.
The System-on-a-Chip Lock Cache
2004
"... CONTENTS DEDICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . iv LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF FIGURES . . . . . . . . . . . . . . . . . . ..."
Abstract - Cited by 12 (3 self)
[Front matter and table of contents of the thesis: Dedication; Acknowledgments; List of Tables; List of Figures; Summary; I Introduction (1.1 Problem Statement, 1.2 Thesis Contributions, 1.3 Thesis Organization and Roadmap); II Background and Previous Work (2.1 Locking Schemes, 2.1.1 Hardware Instructions for Locking, 2.1.2 Traditional Spin-Lock).]
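The background chapter listed above covers hardware locking instructions and the traditional spin-lock. As a point of reference, here is a minimal test-and-set spin-lock in C11 using the standard <stdatomic.h> interface; the thesis itself targets hardware primitives, so this is only an illustrative software analogue:

    #include <stdatomic.h>

    typedef struct { atomic_flag held; } spinlock_t;
    #define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

    static void spin_lock(spinlock_t *l) {
        /* Test-and-set: loop until the flag is observed clear and we set it.
         * Every failed attempt generates coherence traffic on the lock line,
         * which is exactly the overhead a hardware lock cache avoids. */
        while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
            ;
    }

    static void spin_unlock(spinlock_t *l) {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }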
A System-on-a-Chip Lock Cache with Task Preemption Support
In Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES '01), 2001
"... Intertask/interprocess synchronization overheads may be significant in a multiprocessor-shared memory System-on-a-Chip implementation. These overheads are observed in terms of lock latency, lock delay and memory bandwidth consumption in the system. It has been shown that a hardware solution brings a ..."
Abstract - Cited by 9 (5 self)
Intertask/interprocess synchronization overheads may be significant in a multiprocessor shared-memory System-on-a-Chip implementation. These overheads are observed in terms of lock latency, lock delay, and memory bandwidth consumption in the system. It has been shown that a hardware solution yields a much larger performance improvement than synchronization algorithms implemented in software [3]. Our previous work presented a SoC Lock Cache (SoCLC) hardware mechanism which resolves the Critical Section (CS) interactions among multiple processors and improves lock latency, lock delay, and bandwidth consumption in a shared-memory multiprocessor SoC for short CSes [1]. This paper extends our previous work to support long CSes as well. This combined support involves modifications both in the RTOS kernel-level facilities (such as support for preemptive versus non-preemptive synchronization, interrupt handling, and RTOS initialization) and in the hardware mechanism. The worst-case simulation results of a database application model with a client-server pair of tasks on a four-processor system showed that our mechanism achieved a 57% improvement in lock latency, a 14% speedup in lock delay, and a 35% overall speedup in total execution time.
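The abstract describes the lock cache only at the mechanism level. A sketch of how a processor might use such hardware, assuming a hypothetical memory map in which reading a lock's SoCLC word atomically tests and sets it (0 meaning the lock was granted) and writing it releases the lock; the base address and encoding are illustrative, not the paper's:

    #include <stdint.h>

    /* Hypothetical SoCLC memory map: one word per lock. */
    #define SOCLC_BASE    0xF0000000u
    #define SOCLC_LOCK(n) ((volatile uint32_t *)(SOCLC_BASE + 4u * (n)))

    static void soclc_lock(unsigned n) {
        /* The lock cache performs the test-and-set on the read, so
         * processors spin on dedicated hardware instead of flooding the
         * shared bus with coherence traffic for the lock variable. */
        while (*SOCLC_LOCK(n) != 0)
            ;
    }

    static void soclc_unlock(unsigned n) {
        *SOCLC_LOCK(n) = 0;  /* a write clears the hardware lock bit */
    }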
Inferential Queueing and Speculative Push
2003
"... Communication latencies within critical sections constitute a major bottleneck in some classes of emerging parallel workloads. In this paper, we argue for the use of two mechanisms to reduce these communication latencies: Inferentially Queued locks (IQLs) and Speculative Push (SP). With IQLs, the pr ..."
Abstract - Cited by 3 (0 self)
Communication latencies within critical sections constitute a major bottleneck in some classes of emerging parallel workloads. In this paper, we argue for the use of two mechanisms to reduce these communication latencies: Inferentially Queued locks (IQLs) and Speculative Push (SP). With IQLs, the processor infers the existence, and limits, of a critical section from the use of synchronization instructions and joins a queue of lock requestors, reducing synchronization delay. The SP mechanism extracts information about program structure by observing IQLs. SP allows the cache controller, responding to a request for a cache line that likely includes a lock variable, to predict the data sets the requestor will modify within the associated critical section. The controller then pushes these lines from its own cache to the target cache, as well as writing them to memory. Overlapping the protected data transfer with that of the lock can substantially reduce the communication latencies within critical sections. By pushing data in exclusive state, the mechanism can collapse read-modify-write sequences within a critical section into a single local cache access. The write-back to memory allows the receiving cache to ignore the push. Neither mechanism requires programmer or compiler support, nor any instruction-set changes. Our experiments demonstrate that IQLs and SP can improve performance of applications employing frequent synchronization.
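The structure both mechanisms exploit is the ordinary acquire/update/release idiom. A small C example of the pattern the hardware would observe; the queue and its fields are illustrative, not from the paper:

    #include <stdatomic.h>

    static atomic_flag q_lock = ATOMIC_FLAG_INIT; /* its line likely holds a lock */
    static int tail;
    static int items[256];

    /* An IQL infers the critical section from the atomic acquire below and
     * queues contending processors; SP pushes the lines for tail/items to
     * the next requestor in exclusive state, so each read-modify-write
     * inside the section completes as a local cache access. */
    void enqueue(int v) {
        while (atomic_flag_test_and_set_explicit(&q_lock, memory_order_acquire))
            ;
        items[tail] = v;
        tail = (tail + 1) % 256;
        atomic_flag_clear_explicit(&q_lock, memory_order_release);
    }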
Speeding-up Synchronizations in DSM Multiprocessors
"... Abstract. Synchronization in parallel programs is a major performance bottleneck. Shared data is protected by locks and a lot of time is spent in the competition arising at the lock hand-off. In this period of time, a large amount of traffic is targeted to the line of the lock variable. In order to ..."
Abstract
Synchronization in parallel programs is a major performance bottleneck. Shared data is protected by locks, and a lot of time is spent in the competition arising at the lock hand-off. During this period, a large amount of traffic is targeted at the line holding the lock variable. In order to be serialized, requests to the same cache line can be bounced (NACKed) or buffered in the coherence controller. In this paper we focus on systems with buffering in the coherence controller. During lock hand-off, only requests from the winning processor contribute to the computation's progress, because it is the only one that will advance the work. This key observation leads us to propose a hardware mechanism named Request Bypass, which allows requests from the winning processor to bypass the buffered requests in the home coherence controller of the lock line. We show an inexpensive implementation of Request Bypass that reduces all the processing-time phases involved in a critical section execution, namely lock acquiring, shared data accessing, and lock releasing, speeding up the whole parallel computation. The mechanism does not require compiler or programmer support, nor ISA or coherence protocol changes. By simulation we show that the execution-time reduction and lock stall time reduction achieved by Request Bypass reach 5% and 55%, respectively, for 16 processors, and 35% and 75%, respectively, for 32 processors. The programs limited by synchronization benefit the most from Request Bypass.
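A conceptual sketch of the bypass policy at the home coherence controller, written as software for clarity; the queue layout and the way the winning processor is identified are assumptions, not the paper's hardware design:

    #include <stddef.h>

    /* Conceptual model of the home node's buffer of serialized requests
     * for a lock line. */
    typedef struct { int requestor; /* other message fields elided */ } req_t;

    #define QCAP 64
    static req_t buf[QCAP];
    static int q_head, q_count;
    static int lock_winner = -1;   /* processor that just acquired the lock */

    /* Choose the next buffered request to service.  Request Bypass lets a
     * request from the winning processor jump ahead of buffered losers,
     * since only the winner can make the computation progress. */
    static req_t *next_request(void) {
        if (q_count == 0)
            return NULL;
        for (int i = 0; i < q_count; i++) {
            req_t *r = &buf[(q_head + i) % QCAP];
            if (r->requestor == lock_winner)
                return r;          /* bypass the FIFO order */
        }
        return &buf[q_head];       /* otherwise plain FIFO service */
    }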
A methodology for detailed . . .
2004
"... In this paper, we revisit the problem of performance prediction on SMP machines, motivated by the need for selecting parallelization strategy for random write reductions. Such reductions frequently arise in data mining algorithms. In our previous work, we have developed a number of techniques for pa ..."
Abstract
In this paper, we revisit the problem of performance prediction on SMP machines, motivated by the need to select a parallelization strategy for random-write reductions. Such reductions frequently arise in data mining algorithms. In our previous work, we developed a number of techniques for parallelizing this class of reductions. That work showed that each of the three techniques, full replication, optimized full locking, and cache-sensitive locking, can outperform the others depending upon problem, dataset, and machine parameters. Therefore, an important question is, "Can we predict the performance of these techniques for a given problem, dataset, and machine?". This paper addresses this question by developing an analytical performance model that captures a two-level cache, coherence cache misses, TLB misses, locking overheads, and contention for memory. The analytical model is combined with results from micro-benchmarking to predict performance on real machines. We have validated our model on two different SMP machines. Our results show that the model effectively captures the impact of the memory hierarchy (two-level cache and TLB) as well as the factors that limit parallelism (contention for locks, memory contention, and coherence cache misses). The difference between predicted and measured performance is within 20% in almost all cases. Moreover, the model is quite accurate in predicting the relative performance of the three parallelization techniques.
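The trade-off the model arbitrates can be seen in miniature below: a minimal pthreads sketch of two of the named techniques for a random-write reduction (a histogram-style update). All sizes and names are illustrative, and "optimized full locking" in these papers additionally co-locates each lock with its data on one cache line, which this sketch omits:

    #include <pthread.h>

    #define NBINS    1024
    #define NTHREADS 4

    /* Full locking: one shared copy of the reduction object, each
     * random-write update guarded by a fine-grained per-bin lock. */
    static int bins_shared[NBINS];
    static pthread_mutex_t bin_lock[NBINS];  /* init with pthread_mutex_init */

    void update_locked(int bin) {
        pthread_mutex_lock(&bin_lock[bin]);
        bins_shared[bin]++;
        pthread_mutex_unlock(&bin_lock[bin]);
    }

    /* Full replication: each thread updates a private copy, so no locking
     * or coherence misses during the parallel phase, at the cost of
     * NTHREADS x NBINS memory and a sequential merge at the end. */
    static int bins_private[NTHREADS][NBINS];

    void update_replicated(int tid, int bin) {
        bins_private[tid][bin]++;
    }

    void merge(void) {
        for (int t = 0; t < NTHREADS; t++)
            for (int b = 0; b < NBINS; b++)
                bins_shared[b] += bins_private[t][b];
    }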
Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance
"... With the availability of large datasets in application areas like bioinformatics, medical informatics, scientific data analysis, financial analysis, telecommunications, retailing, and marketing, it is becoming increasingly important to execute data mining tasks in parallel. At the same time, technol ..."
Abstract
With the availability of large datasets in application areas like bioinformatics, medical informatics, scientific data analysis, financial analysis, telecommunications, retailing, and marketing, it is becoming increasingly important to execute data mining tasks in parallel. At the same time, technological advances have made shared memory parallel machines more scalable.