Results 1 - 10
of
39
Automatic thread extraction with decoupled software pipelining
- In Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture
, 2005
"... {ottoni, ram, astoler, august}@princeton.edu Abstract Until recently, a steadily rising clock rate and otheruniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance fora wide range of applications. Current difficulties in maintaining this trend ..."
Abstract
-
Cited by 59 (10 self)
- Add to MetaCart
{ottoni, ram, astoler, august}@princeton.edu Abstract Until recently, a steadily rising clock rate and otheruniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance fora wide range of applications. Current difficulties in maintaining this trend have lead microprocessor manufacturersto add value by incorporating multiple processors on a chip. Unfortunately, since decades of compiler research have notsucceeded in delivering automatic threading for prevalent code properties, this approach demonstrates no improve-ment for a large class of existing codes. To find useful work for chip multiprocessors, we proposean automatic approach to thread extraction, called Decoupled Software Pipelining (DSWP). DSWP exploits the fine-grained pipeline parallelism lurking in most applications to extract long-running, concurrently executing threads. Useof the non-speculative and truly decoupled threads produced by DSWP can increase execution efficiency and pro-vide significant latency tolerance, mitigating design complexity by reducing inter-core communication and per-coreresource requirements. Using our initial fully automatic compiler implementation and a validated processor model,we prove the concept by demonstrating significant gains for dual-core chip multiprocessor models running a variety ofcodes. We then explore simple opportunities missed by our initial compiler implementation which suggest a promisingfuture for this approach. 1
Revisiting the sequential programming model for multi-core
- In Proceedings of the 40th Annual ACM/IEEE International Symposium on Microarchitecture
, 2007
"... Single-threaded programming is already considered a complicated task. The move to multi-threaded programming only increases the complexity and cost involved in software development due to rewriting legacy code, training of the programmer, increased debugging of the program, and efforts to avoid race ..."
Abstract
-
Cited by 33 (4 self)
- Add to MetaCart
Single-threaded programming is already considered a complicated task. The move to multi-threaded programming only increases the complexity and cost involved in software development due to rewriting legacy code, training of the programmer, increased debugging of the program, and efforts to avoid race conditions, deadlocks, and other problems associated with parallel programming. To address these costs, other approaches, such as automatic thread extraction, have been explored. Unfortunately, the amount of parallelism that has been automatically extracted is generally insufficient to keep many cores busy. This paper argues that this lack of parallelism is not an intrinsic limitation of the sequential programming model, but rather occurs for two reasons. First, there exists no framework for automatic thread extraction that brings together key existing state-of-the-art compiler and hardware techniques. This paper shows that such a framework can yield scalable parallelization on several SPEC CINT2000 benchmarks. Second, existing sequential programming languages force programmers to define a single legal program outcome, rather than allowing for a range of legal outcomes. This paper shows that natural extensions to the sequential programming model enable parallelization for the remainder of the SPEC CINT2000 suite. Our experience demonstrates that, by changing only 60 source code lines, all of the C benchmarks in the SPEC CINT2000 suite were parallelizable by automatic thread extraction. This process, constrained by the limits of modern optimizing compilers, yielded a speedup of 454 % on these applications. 1
Rapid development of a flexible validated processor model
- In Proceedings of the 2005 Workshop on Modeling, Benchmarking, and Simulation
, 2005
"... For a variety of reasons, most architectural evaluations use simulation models. An accurate baseline model validated against existing hardware provides confidence in the results of these evaluations. Meanwhile, a meaningful exploration of the design space requires a wide range of quickly-obtainable ..."
Abstract
-
Cited by 22 (11 self)
- Add to MetaCart
For a variety of reasons, most architectural evaluations use simulation models. An accurate baseline model validated against existing hardware provides confidence in the results of these evaluations. Meanwhile, a meaningful exploration of the design space requires a wide range of quickly-obtainable variations of the baseline. Unfortunately, these two goals are generally considered to be at odds; the set of validated models is considered exclusive of the set of easily malleable models. Vachharajani et al. challenge this belief and propose a modeling methodology they claim allows rapid construction of flexible validated models. Unfortunately, they only present anecdotal and secondary evidence to support their claims. In this paper, we present our experience using this methodology to construct a validated flexible model of Intel’s Itanium 2 processor. Our practical experience lends support to the above claims. Our initial model was constructed by a single researcher in only 11 weeks and predicts processor cycles-per-instruction (CPI) to within 7.9 % on average for the entire SPEC CINT2000 benchmark suite. Our experience with this model showed us that aggregate accuracy for a metric like CPI is not sufficient. Aggregate measures like CPI may conceal remaining internal “offsetting errors ” which can adversely affect conclusions drawn from the model. Using this as our motivation, we explore the flexibility of the model by modifying it to target specific error constituents, such as front-end stall errors. In 2 1 2 person-weeks, average CPI error was reduced to 5.4%. The targeted error constituents were reduced more dramatically; front-end stall errors were reduced from 5.6 % to 1.6%. The swift implementation of significant new architectural features on this model further demonstrated its flexibility. 1
Speculative decoupled software pipelining
- In Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques
, 2007
"... In recent years, microprocessor manufacturers have shifted their focus from single-core to multi-core processors. To avoid burdening programmers with the responsibility of parallelizing their applications, some researchers have advocated automatic thread extraction. A recently proposed technique, De ..."
Abstract
-
Cited by 22 (5 self)
- Add to MetaCart
In recent years, microprocessor manufacturers have shifted their focus from single-core to multi-core processors. To avoid burdening programmers with the responsibility of parallelizing their applications, some researchers have advocated automatic thread extraction. A recently proposed technique, Decoupled Software Pipelining (DSWP), has demonstrated promise by partitioning loops into long-running, fine-grained threads organized into a pipeline. Using a pipeline organization and execution decoupled by inter-core communication queues, DSWP offers increased execution efficiency that is largely independent of inter-core communication latency. This paper proposes adding speculation to DSWP and evaluates an automatic approach for its implementation. By speculating past infrequent dependences, the benefit of DSWP is increased by making it applicable to more loops, facilitating better balanced threads, and enabling parallelized loops to be run on more cores. Unlike prior speculative threading proposals, speculative DSWP focuses on breaking dependence recurrences. By speculatively breaking these recurrences, instructions that were formerly restricted to a single thread to ensure decoupling are now free to span multiple threads. Using an initial automatic compiler implementation and a validated processor model, this paper demonstrates significant gains using speculation for 4-core chip multiprocessor models running a variety of codes. 1
Fastforward for efficient pipeline parallelism: A cache-optimized concurrent lock-free queue
- In PPoPP ’08: Proceedings of the The 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 2008
"... A Cache-Optimized Concurrent Lock-Free Queue Low overhead core-to-core communication is critical for efficient pipeline-parallel software applications. This paper presents Fast-Forward, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore ..."
Abstract
-
Cited by 21 (2 self)
- Add to MetaCart
A Cache-Optimized Concurrent Lock-Free Queue Low overhead core-to-core communication is critical for efficient pipeline-parallel software applications. This paper presents Fast-Forward, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore architectures, with weak to strongly ordered consistency models. Enqueue and dequeue times on a 2.66 GHz Opteron 2218 based system are as low as 28.5 ns, up to 5x faster than the next best solution. FastForward’s effectiveness is demonstrated for real applications by applying it to line-rate soft network processing on Gigabit Ethernet with general purpose commodity hardware.
A practical approach to exploiting coarse-grained pipeline parallelism in C programs
- In International Symposium on Microarchitecture
, 2007
"... The emergence of multicore processors has heightened the need for effective parallel programming practices. In addition to writing new parallel programs, the next generation of programmers will be faced with the overwhelming task of migrating decades ’ worth of legacy C code into a parallel represen ..."
Abstract
-
Cited by 20 (2 self)
- Add to MetaCart
The emergence of multicore processors has heightened the need for effective parallel programming practices. In addition to writing new parallel programs, the next generation of programmers will be faced with the overwhelming task of migrating decades ’ worth of legacy C code into a parallel representation. Addressing this problem requires a toolset of parallel programming primitives that can broadly apply to both new and existing programs. While tools such as threads and OpenMP allow programmers to express task and data parallelism, support for pipeline parallelism is distinctly lacking. In this paper, we offer a new and pragmatic approach to leveraging coarse-grained pipeline parallelism in C programs. We target the domain of streaming applications, such as audio, video, and digital signal processing, which exhibit regular flows of data. To exploit pipeline parallelism, we equip the programmer with a simple set of annotations (indicating pipeline boundaries) and a dynamic analysis that tracks all communication across those boundaries. Our analysis outputs a stream graph of the application as well as a set of macros for parallelizing the program and communicating the data needed. We apply our methodology to six case studies, including MPEG-2 decoding, MP3 decoding, GMTI radar processing, and three SPEC benchmarks. Our analysis extracts a useful block diagram for each application, and the parallelized versions offer a 2.78x mean speedup on a 4-core machine. 1.
Uncovering hidden loop level parallelism in sequential applications
- In Proc. of the 14th International Symposium on High-Performance Computer Architecture
, 2008
"... As multicore systems become the dominant mainstream computing technology, one of the most difficult challenges the industry faces is the software. Applications with large amounts of explicit thread-level parallelism naturally scale performance with the number of cores, but single-threaded applicatio ..."
Abstract
-
Cited by 18 (4 self)
- Add to MetaCart
As multicore systems become the dominant mainstream computing technology, one of the most difficult challenges the industry faces is the software. Applications with large amounts of explicit thread-level parallelism naturally scale performance with the number of cores, but single-threaded applications realize little to no gains with additional cores. One solution to this problem is automatic parallelization that frees the programmer from the difficult task of parallel programming and offers hope for handling the vast amount of legacy single-threaded software. There is a long history of automatic parallelization for scientific applications, but the techniques have generally failed in the context of generalpurpose software. Thread-level speculation overcomes the problem of memory dependence analysis by speculating unlikely dependences that serialize execution. However, this approach has lead to only modest performance gains. In this paper, we take another look at exploiting loop-level parallelism in single-threaded applications. We show that substantial amounts of loop-level parallelism is available in general-purpose applications, but it lurks beneath the surface and is often obfuscated by a small number of data and control dependences. We adapt and extend several code transformations from the instruction-level and scientific parallelization communities to uncover the hidden parallelism. Our results show that 61 % of the dynamic execution of studied benchmarks can be parallelized with our techniques compared to 27 % using traditional thread-level speculation techniques, resulting in a speedup of 1.84 on a four core system compared to 1.41 without transformations. 1
Coordinated, distributed, formal energy management of chip multiprocessors
- In ISLPED ’05: Proceedings of the 2005 International Symposium on Low Power Electronics and Design
, 2005
"... ABSTRACT Designers are moving toward chip-multiprocessors (CMPs) to lever-age application parallelism for higher performance while keeping ..."
Abstract
-
Cited by 9 (0 self)
- Add to MetaCart
ABSTRACT Designers are moving toward chip-multiprocessors (CMPs) to lever-age application parallelism for higher performance while keeping
Data access partitioning for fine-grain parallelism on multicore architectures
- In Proceedings of the 40th Annual IEEE/ACM Symposium on Microarchitecture
, 2007
"... The recent design shift towards multicore processors has spawned a significant amount of research in the area of program parallelization. The future abundance of cores on a single chip requires programmer and compiler intervention to increase the amount of parallel work possible. Much of the recent ..."
Abstract
-
Cited by 9 (0 self)
- Add to MetaCart
The recent design shift towards multicore processors has spawned a significant amount of research in the area of program parallelization. The future abundance of cores on a single chip requires programmer and compiler intervention to increase the amount of parallel work possible. Much of the recent work has fallen into the areas of coarse-grain parallelization: new programming models and different ways to exploit threads and data-level parallelism. This work focuses on a complementary direction, improving performance through automated fine-grain parallelization. The main difficulty in achieving a performance benefit from fine-grain parallelism is the distribution of data memory accesses across the data caches of each core. Poor choices in the placement of data accesses can lead to increased memory stalls and low resource utilization. We propose a profile-guided method for partitioning memory accesses across distributed data caches. First, a profile determines affinity relationships between memory accesses and working set characteristics of individual memory operations in the program. Next, a program-level partitioning of the memory operations is performed to divide the memory accesses across the data caches. As a result, the data accesses are proactively dispersed to reduce memory stalls and improve computation parallelization. A final detailed partitioning of the computation instructions is performed with knowledge of the cache location of their associated data. Overall, our data partitioning reduces stall cycles by up to 51 % versus data-incognizant partitioning, and has an overall speedup average of 30 % over a single core processor. 1.
Support for high-frequency streaming in cmps
- In Proceedings of the 39th International Symposium on Microarchitecture
, 2006
"... As the industry moves toward larger-scale chip multiprocessors, the need to parallelize applications grows. High inter-thread communication delays, exacerbated by over-stressed high-latency memory subsystems and ever-increasing wire delays, require parallelization techniques to create partially or f ..."
Abstract
-
Cited by 9 (5 self)
- Add to MetaCart
As the industry moves toward larger-scale chip multiprocessors, the need to parallelize applications grows. High inter-thread communication delays, exacerbated by over-stressed high-latency memory subsystems and ever-increasing wire delays, require parallelization techniques to create partially or fully independent threads to improve performance. Unfortunately, developers and compilers alike often fail to find sufficient independent work of this kind. Recently proposed pipelined streaming techniques have shown significant promise for both manual and automatic parallelization. These techniques have wide-scale applicability because they embrace inter-thread dependences (albeit acyclic dependences) and tolerate long-latency communication of these dependences. This paper addresses the lack of architectural support for this type of concurrency, which has blocked its adoption and hindered related language and compiler research. We observe that both manual and automatic techniques create high-frequency streaming threads, with communication occurring every 5 to 20 instructions. Even while easily tolerating inter-thread transit delays, high-frequency communication makes thread performance very sensitive to intrathread delays from the repeated execution of the communication operations. Using this observation, we define the design-space and evaluate several mechanisms to find a better trade-off between performance and operating system, hardware, and design costs. From this, we find a light-weight streaming-aware enhancement to conventional memory subsystems that doubles the speed of these codes and is within 2 % of the best-performing, but heavy-weight, hardware solution. 1.

