Larrabee: a many-core x86 architecture for visual computing
- In SIGGRAPH ’08: ACM SIGGRAPH 2008 papers
, 2008
Cited by 104 (6 self)
This paper presents a many-core visual computing architecture code named Larrabee, a new software rendering pipeline, a manycore programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die 2nd level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely with software in Larrabee, rather than in fixed function logic. The customizable software graphics rendering pipeline for this
Photonic Networks-on-Chip for Future Generations of Chip Multiprocessors
- IEEE Trans. Computing
, 2008
Cited by 9 (2 self)
The design and performance of next-generation chip multiprocessors (CMPs) will be bound by the limited amount of power that can be dissipated on a single die. We present photonic networks-on-chip (NoC) as a solution to reduce the impact of intrachip and off-chip communication on the overall power budget. The low loss properties of optical waveguides, combined with bit-rate transparency, allow for a photonic interconnection network that can deliver considerably higher bandwidth and lower latencies with significantly lower power dissipation than an interconnection network based only on electronic signaling. We explain why on-chip photonic communication has recently become a feasible opportunity and explore the challenges that need to be addressed to realize its implementation. We introduce a novel hybrid microarchitecture for NoCs that combines a broadband photonic circuit-switched network with an electronic overlay packet-switched control network. This design leverages the strength of each technology and represents a flexible solution for the different types of messages that are exchanged on the chip; large messages are communicated more efficiently through the photonic network, while short messages are delivered electronically with minimal power consumption. We address the critical design issues including topology, routing algorithms, deadlock avoidance, and path-setup/teardown procedures. We present experimental results obtained with POINTS, an event-driven simulator specifically developed to analyze the proposed design idea, as well as a comparative power analysis of a photonic versus an electronic NoC. Overall, these results confirm the unique benefits for future generations of CMPs that can be achieved by bringing optics into the chip in the form of photonic NoCs. Index Terms: On-chip communication, chip multiprocessors, photonics, emerging technologies.
Physical aware frequency selection for dynamic thermal management in multi-core systems
- in Proc. Intl’ Conf. Computer Aided Design (ICCAD), 2006
Cited by 4 (0 self)
In order to maintain performance per Watt in microprocessors, there is a shift towards the chip level multiprocessing paradigm. Microprocessor manufacturers are experimenting with tens of cores, forecasting the arrival of hundreds of cores per single processor die in the near future. With such large-scale integration and increasing power densities, thermal management continues to be a significant design effort to maintain performance and reliability in modern process technologies. In this paper, we present two mechanisms to perform frequency scaling as part of Dynamic Voltage and Frequency Scaling (DVFS) to assist Dynamic Thermal Management (DTM). Our frequency selection algorithms incorporate the physical interaction of the cores on a large-scale system into the emergency intervention mechanisms for temperature reduction of the hotspot, while aiming to minimize the performance impact of frequency scaling on the core that is in thermal emergency. Our results show that our algorithm consistently succeeds in maximizing the operating frequency of the most critical core while successfully relieving the thermal emergency of the core. A comparison of our two alternative techniques reveals that our physical aware criticality-based algorithm results in 11.7% faster clock frequencies compared to our aggressive scaling algorithm. We also show that our technique is extremely fast and is suited for real-time thermal management.
Tradeoff between Data-, Instruction-, and Thread-level Parallelism in Stream Processors
- In Proceedings of the 21st ACM International Conference on Supercomputing (ICS’07)
, 2007
Cited by 2 (1 self)
This paper explores the scalability of the Stream Processor architecture along the instruction-, data-, and thread-level parallelism dimensions. We develop detailed VLSI-cost and processor-performance models for a multi-threaded Stream Processor and evaluate the tradeoffs, in both functionality and hardware costs, of mechanisms that exploit the different types of parallelism. We show that the hardware overhead of supporting coarse-grained independent threads of control is 15-86% depending on machine parameters. We also demonstrate that the performance gains provided are of a smaller magnitude for a set of numerical applications. We argue that for stream applications with scalable parallel algorithms the performance is not very sensitive to the control structures used within a large range of area-efficient architectural choices. We evaluate the specific effects on performance of scaling along the different parallelism dimensions and explain the limitations of the ILP, DLP, and TLP hardware mechanisms.
Architectural Exploration of Heterogeneous Multiprocessor Systems for JPEG ∗
- International Journal of Parallel Programming (© 2007), DOI: 10.1007/s10766-007-0040-7
, 2006
Multicore processors have been utilized in embedded systems and general computing applications for some time. However, these multicore chips execute multiple applications concurrently, with each core carrying out a particular task in the system. Such systems can be found in gaming, automotive real-time systems and video/image encoding devices. These systems are commonly deployed to overcome deadline misses, which are primarily due to overloading of a single multitasking core. In this paper, we explore the use of multiple cores for a single application, as opposed to multiple applications executing in a parallel fashion. A single application is parallelized using two different methods: one, a master-slave model; and two, a sequential pipeline model. The systems were implemented using Tensilica’s Xtensa LX processors with queues as the means of communication between two cores. In the master-slave model, we utilized a coarse-grained approach whereby a main core distributes the workload to the remaining cores and reads the processed data before writing the results back to file. In the pipeline model, a lower
∗National ICT Australia is funded through the Australian Government’s Backing Australia’s
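The master-slave model named in this abstract (a main core distributes work over queues, then reads back the processed data) can be sketched in a few lines of Python. This is only an illustrative sketch: threads stand in for cores, `queue.Queue` stands in for the hardware queues, and the names `worker`, `run_master`, and the doubling "work" step are hypothetical placeholders, not the authors' implementation.

```python
import queue
import threading


def worker(tasks: queue.Queue, results: queue.Queue) -> None:
    """Worker 'core': pull (index, block) pairs until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            break
        index, block = item
        # Placeholder for real per-block work (assumption: we just double values).
        results.put((index, [2 * x for x in block]))


def run_master(blocks, num_workers=3):
    """Main 'core': distribute blocks, wait, then collect results in input order."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for i, block in enumerate(blocks):
        tasks.put((i, block))
    for _ in threads:
        tasks.put(None)  # one sentinel per worker so each one terminates
    for t in threads:
        t.join()
    out = [None] * len(blocks)
    while not results.empty():  # safe here: all workers have finished
        i, processed = results.get()
        out[i] = processed
    return out
```

Tagging each block with its index lets the main core restore the original order even though workers finish in an arbitrary order.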
Video Processing and Retrieval on Cell Processor Architecture
A multi-level parallel partition schema and three mapping models (Service, Streaming, and OpenMP) are proposed to map video processing and retrieval (VPR) workloads to the Cell processor. The Cell, with 9 cores on one chip, provides an efficient high-performance computation platform to speed up VPR and boost its performance dramatically. We present a task- and data-parallel partition plan that distributes the computation-intensive workloads of VPR across the different processing cores on the Cell, exploiting the parallelism of a sequential program. The Service and Streaming mapping models map VPR to the Cell using a service allocation-calling mode and a stream-data pipeline mode, respectively. To facilitate VPR programming on the Cell, the OpenMP programming model is loaded onto the Cell. Some effective mapping strategies are also presented to manage thread creation and data handling between the different processors and to reduce system overhead. The experimental results show that this parallel partition schema and these mapping models can effectively speed up VPR processing on the Cell multicore architecture.
Trace-Driven Optimization of Networks-on-Chip Configurations
Networks-on-chip (NoCs) are becoming increasingly important in general-purpose and application-specific multi-core designs. Although uniform router configurations are appropriate for general-purpose NoCs, router configurations for application-specific NoCs can be non-uniformly optimized to application-specific traffic characteristics. In this paper, we specifically consider the problem of virtual channel (VC) allocation in application-specific NoCs. Prior solutions to this problem have been average-rate driven. However, average-rate models are poor representations of real application traffic, and can lead to designs that are poorly matched to the application. We propose an alternate trace-driven paradigm in which configuration of NoCs is driven by application traces. We propose two simple greedy trace-driven VC allocation schemes. Compared to uniform allocation, we observe up to 51% reduction in the number of VCs under a given average packet latency constraint, or up to 74% reduction in average packet latency with the same number of VCs. Our results suggest that average-rate driven methods cannot effectively select appropriate links for VC allocation because they fail to consider the impact of traffic bursts. As a case study, we compare our proposed approach with an existing average-rate driven method [9] and observe up to 35% reduction in the number of VCs for a given target latency.
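To make the idea of greedy, trace-driven VC allocation concrete, here is a minimal Python sketch under loudly stated assumptions. The abstract does not describe the authors' two schemes or their latency model, so everything here is hypothetical: the cost function `blocked` (packets per time step that exceed a link's VC count), the per-link demand traces, and the function `greedy_vc_allocation` are toy stand-ins chosen only to show why a bursty trace and a smooth trace with the same average rate can deserve different allocations.

```python
def blocked(demand_trace, vcs):
    """Toy cost (assumption): packets per step that exceed the link's VC count."""
    return sum(max(0, d - vcs) for d in demand_trace)


def greedy_vc_allocation(traces, budget, min_vcs=1):
    """Greedily hand out up to `budget` extra VCs, one at a time.

    traces: dict mapping a link name to a list of per-step packet demand.
    Each step gives one more VC to the link whose toy cost drops the most.
    """
    alloc = {link: min_vcs for link in traces}
    for _ in range(budget):
        best_link, best_gain = None, 0
        for link in sorted(traces):  # sorted for deterministic tie-breaking
            trace = traces[link]
            gain = blocked(trace, alloc[link]) - blocked(trace, alloc[link] + 1)
            if gain > best_gain:
                best_link, best_gain = link, gain
        if best_link is None:
            break  # no link benefits from another VC
        alloc[best_link] += 1
    return alloc
```

For example, with `traces = {"a": [1, 1, 1, 1], "b": [0, 4, 0, 0]}` both links have the same average demand, yet every extra VC goes to the bursty link `"b"`, which an average-rate view of the traffic could not distinguish from `"a"`.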
A Prediction based CMP Cache Migration Policy
- The 10th IEEE International Conference on High Performance Computing and Communications
The large access latency of the L2 cache, caused mainly by wire delay, is a critical obstacle to improving the performance of CMPs (Chip Multi-Processors) with NUCA (Non-Uniform Cache Architecture). This paper first provides a CMP L2 cache access performance model to analyze and evaluate L2 access efficiency. Based on this model, the total L2 cache access latency problem is formalized as an optimization problem and a lower bound on L2 cache access latency is given. A novel PBM (Prediction based L2 cache data Migration) algorithm, which employs sequential prediction to identify the data to be accessed in the near future, is designed to migrate that data toward its users early, enabling the cores to access the L2 cache in nearby banks. The analysis results show that this active data migration algorithm can exploit the principle of locality to reduce data access latency much more than the traditional lazy data migration policy. To evaluate the theoretical analysis results, the HMTT toolkit is used to capture the complete memory trace of the SPEC 2000 benchmark running on an SMP computer. The memory trace shows that our prediction technique works well; in addition, an L2 cache access simulator is developed to process the memory trace data. The simulation experiments show that the PBM policy achieves both a shorter block transfer distance and a lower average access latency. The average block transfer distance can be reduced by up to 16.9%, and the average L2 access latency can be reduced by up to 8.4%.
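The contrast between predictive ("active") migration and leaving data in place can be illustrated with a very small Python model. This is a toy sketch, not the PBM algorithm: it assumes a one-dimensional row of banks, a single core, and a predictor that simply guesses the next sequential block will be accessed next. The function name `access_distances` and its parameters are hypothetical.

```python
def access_distances(trace, core_bank, home_bank, predict=True):
    """Return per-access bank distances in a 1-D toy NUCA model.

    trace: block ids accessed by one core, in order.
    core_bank: index of the bank nearest that core.
    home_bank: function mapping a block id to its initial bank.
    predict: if True, move the predicted next block next to the core.
    """
    location = {}  # current bank of each block placed so far
    distances = []
    for block in trace:
        bank = location.setdefault(block, home_bank(block))
        distances.append(abs(bank - core_bank))
        if predict:
            # Assumed predictor: the next sequential block is used next,
            # so place it in the core's nearest bank ahead of the access.
            location[block + 1] = core_bank
    return distances
```

On a purely sequential trace such as `[0, 1, 2, 3]`, with every block initially three banks away, the predictive variant pays the full distance only on the first access, while the variant without migration pays it on every access. Real traces are less regular, so the gain depends on how often the prediction is right.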
Exploring Programming Models and Optimizations for the Cell Broadband Engine using RAxML
Originally developed as a gaming processor for the Sony PlayStation 3, the Cell Broadband Engine opens new opportunities for running computationally intensive scientific applications more efficiently, thanks to characteristics such as multigrain task-level and data-level parallel execution and vast on-chip memory bandwidth. In the ideal case, the Cell is capable of achieving significant performance improvements over conventional processors. However, the potential of the Cell is unclear when the processor is used for applications that do not necessarily conform to its architectural characteristics. Furthermore, the question of what is the best programming model for a processor like the Cell remains open, with too many programming models and paradigms proposed, yet too few evaluated empirically or experimentally. In this work we present the port and optimization of RAxML, an application that computes large phylogenetic trees, on a real blade with Cell processors. We investigate two programming models that derive partially from the dominant programming models of conventional parallel machines, namely MPI and OpenMP, as well as an extensive set of Cell-specific optimizations. Using multilevel parallelization and several optimizations we have been able to improve the execution time of RAxML on the Cell by a factor of 5, a satisfactory result given that RAxML is an application with dynamically allocated data structures, complex control flow and extensive pointer arithmetic, all factors that present challenges for parallelization beyond the simple master-worker scheme. We also find that the Cell performs comparably to or outperforms leading multicore and multithreaded microprocessors, such as the IBM Power5 and the Intel Xeon with Hyperthreading technology.

