Shangri-la: achieving high performance from compiled network applications while enabling ease of programming
In PLDI ’05, 2005
Cited by 29 (0 self)
Programming network processors is challenging. To sustain high line rates, network processors have extremely tight memory access and instruction budgets. Achieving desired performance has traditionally required hand-coded assembly. Researchers have recently proposed high-level programming languages for packet processing, but the challenges of compiling these languages into code that is competitive with hand-tuned assembly remain unanswered. This paper describes the Shangri-La compiler, which accepts a packet program written in a C-like high-level language and applies scalar and specialized optimizations to generate a highly optimized binary. Hot code paths identified by profiling are mapped across processing elements to maximize processor utilization. Since our compilation target has no hardware caches, software-controlled caches are generated for frequently accessed application data structures. Packet handling optimizations significantly reduce per-packet memory access and instruction counts. Finally, a custom stack model maps stack frames to the fastest levels of the target processor’s heterogeneous memory hierarchy. Binaries generated by the compiler were evaluated on the Intel IXP2400 network processor with eight packet processing cores and eight threads per core. Our results show the importance of both traditional and specialized optimization techniques for achieving the maximum forwarding rates on three network applications, L3-
Task partitioning for multi-core network processors
In Compiler Construction, 2005
Cited by 12 (0 self)
Network processors (NPs) typically contain multiple concurrent processing cores. State-of-the-art programming techniques for NPs are invariably low-level, requiring programmers to partition code into concurrent tasks early in the design process. This results in programs that are hard to maintain and hard to port to alternative architectures. This paper presents a new approach in which a high-level program is separated from its partitioning into concurrent tasks. Designers write their programs in a high-level, domain-specific, architecturally-neutral language, but also provide a separate Architecture Mapping Script (AMS). An AMS specifies semantics-preserving transformations that are applied to the program to re-arrange it into a set of tasks appropriate for execution on a particular target architecture. We (i) describe three such transformations: pipeline introduction, pipeline elimination and queue multiplexing; and (ii) specify when each can be safely applied. As a case study we describe an IP packet-forwarder and present an AMS script that partitions it into a form capable of running at 3 Gb/s on an Intel IXP2400 Network Processor.
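As a rough illustration of what "pipeline introduction" can mean, the sketch below splits a sequential two-step packet handler into two tasks joined by a FIFO queue and checks that the output is unchanged. It is not the paper's language or AMS notation; the functions `parse`, `forward`, `sequential`, and `pipelined` are hypothetical names chosen for this example, and Python threads stand in for processing cores.

```python
import queue
import threading

def parse(packet):
    # Hypothetical first step: derive a destination port from the packet id.
    return {"dst": packet % 4, "payload": packet}

def forward(parsed):
    # Hypothetical second step: emit a (port, payload) pair.
    return (parsed["dst"], parsed["payload"])

def sequential(packets):
    # Original program: both steps run back to back in one task.
    return [forward(parse(p)) for p in packets]

_DONE = object()  # sentinel that tells the second stage to stop

def pipelined(packets):
    # After "pipeline introduction": two tasks connected by a FIFO queue.
    q = queue.Queue()
    out = []

    def stage1():
        for p in packets:
            q.put(parse(p))
        q.put(_DONE)

    def stage2():
        while True:
            item = q.get()
            if item is _DONE:
                break
            out.append(forward(item))

    threads = [threading.Thread(target=stage1), threading.Thread(target=stage2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Because a single producer feeds a single consumer through a FIFO queue, packet order is preserved, which is the sense in which this toy transformation keeps the program's behavior intact.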
ShaRE: Run-time System for High-performance Virtualized Routers
2005
Cited by 9 (3 self)
I believe that the process of earning a PhD degree fundamentally changes the way one thinks. And one’s advisor is the most significant contributor to such a change. Harrick Vin has striven hard to make me think differently, to convert me into a scientist from an engineer. Harrick’s insistence on elegance of presentation both in writing and in talking has been quite valuable in honing my skills. His rigorous understanding, cute insights and passionate criticism have made my dissertation better by the day, and my PhD a pleasant experience overall. I have cherished many incredibly lengthy and interesting, but never tiring, meetings with him on both technical and philosophical issues. His personal warmth and support during happy and tough times, and his patience and nicety during heated discussions and everyday interactions have given me the necessary protection and confidence to keep going. Thanks for everything Harrick. Over the past five years, I have also been fortunate to work closely with Lorenzo Alvisi and Mike Dahlin. Both have been great mentors in their own right. I am thankful to Lorenzo for believing in me more than I did in myself at one point
Framework for supporting multi-service edge packet processing on network processors
In Proceedings of the ACM First Symposium on Architectures for Networking and Communications Systems, 2005
Cited by 5 (0 self)
Network edge packet-processing systems, as are commonly implemented on network processor platforms, are increasingly required to support a rich set of services. These multi-service systems are also subjected to widely varying and unpredictable traffic. Current network processor systems do not simultaneously deal well with a variety of services and fluctuating workloads. For example, current methods of worst-case, static provisioning can meet performance requirements for any workload, but provisioning each service for its worst case reduces the total number of services that can be supported. Alternatively, profile-driven automatic-partitioning compilers create efficient binaries for multi-service applications for specific workloads, but they are sensitive to workload fluctuations. Run-time adaptation is a potential solution to this problem. With run-time adaptation, the mapping of services to system resources can be dynamically adjusted based on the workload. We have implemented an adaptive system that automatically changes the mapping of services to processors, and handles migration of services between different processor core types to match the current workload. In this paper we explain our adaptive system built on the Intel® IXP2400 network processor. We demonstrate that it outperforms multiple different profile-driven compiled solutions for most workloads and performs within 20% of the optimal compiled solution for the remaining workloads.
Balancing Register Allocation Across Threads for a Multithreaded Network Processor
In Proc. Conf. on Programming, 2004
Cited by 4 (0 self)
Modern network processors employ multi-threading to allow concurrency amongst multiple packet processing tasks. We studied the properties of applications running on the network processors and observed that their imbalanced register requirements across different threads at different program points could lead to poor performance. Application requirements often make some threads more performance-critical than others, so controlling register allocation across threads can shape the performance of each thread and yield the desired performance properties for concurrent threads. This motivates our work. Our register allocator aims to distribute available registers to different threads according to their needs. The compiler analyzes the register needs of each thread both at context-switch points and internally. The compiler then designates some registers as shared and some as private to each thread. Shared
Automatic Data Partitioning for the Agere Payload Plus Network Processor
In CASES ’04: Proceedings of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, 2004
Cited by 2 (0 self)
With the ever-increasing pervasiveness of the Internet and its stringent performance requirements, network system designers have begun utilizing specialized chips to increase the performance of network functions. To increase performance, many more advanced functions, such as traffic shaping and policing, are being implemented at the network interface layer to reduce delays that occur when these functions are handled by a general-purpose CPU. While some designs use ASICs to handle network functions, many system designers have moved toward using programmable network processors due to their increased flexibility and lower design cost.
Efficient Spill Code for SDRAM
In CASES, 2003
Cited by 1 (0 self)
Processors such as StrongARM and memory such as SDRAM enable efficient execution of multiple loads and stores in a single instruction. This is particularly useful in connection with register allocation where spill code may need to save and restore multiple registers. Until now, there has been no effective strategy for utilizing this to its full potential. In this paper we investigate the use of SDRAM for optimization of spill code. The core of the problem is to arrange the variables in the spill area such that loading to and storing from the SDRAM is optimally efficient. We show that the problem is NP-complete and present a method based on integer linear programming (ILP) to solve the problem. We have implemented our approach as an additional phase in a gcc-based compiler for the StrongARM core of Intel's IXP1200 network processor. Our optimizer, SLA (stack location allocator), rearranges the scalar variables so that memory accesses can be made cheaper. Our experimental results show that our ILP-based method is efficient and that the code generated for our benchmarks runs 0.8% to 15.1% faster than the code produced by the original compiler with -O2 optimization. Our SLA phase is guaranteed not to deteriorate the execution-time performance and can be configured so as not to increase the code size.
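To make the spill-area layout problem concrete, here is a toy sketch under a deliberately simplified cost model: each group of variables that is saved or restored together costs one memory instruction per contiguous run of stack slots it occupies. The brute-force search over permutations is only a stand-in for the paper's ILP formulation, and it ignores real constraints of load/store-multiple instructions such as register ordering. The names `runs`, `cost`, and `best_layout` are hypothetical.

```python
from itertools import permutations

def runs(slots):
    # Number of contiguous runs of slot indices; under the simplified model,
    # each run can be moved with a single multi-register load or store.
    s = sorted(slots)
    if not s:
        return 0
    return 1 + sum(1 for a, b in zip(s, s[1:]) if b != a + 1)

def cost(layout, groups):
    # Total memory instructions needed for all access groups under a layout,
    # where layout[i] is the variable stored in stack slot i.
    pos = {v: i for i, v in enumerate(layout)}
    return sum(runs([pos[v] for v in g]) for g in groups)

def best_layout(variables, groups):
    # Exhaustive search; only feasible for a handful of variables.
    return min(permutations(variables), key=lambda lay: cost(lay, groups))

# Example: four spilled variables and three spill/restore sites.
variables = ("a", "b", "c", "d")
groups = [{"a", "c"}, {"b", "d"}, {"a", "c", "d"}]
layout = best_layout(variables, groups)
```

In this example the declaration order `a, b, c, d` needs two instructions per group, while a rearranged layout places every group in consecutive slots, so each group needs only one.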
Concurrent Implementation of Packet Processing Algorithms on Network Processors
Cited by 1 (0 self)
I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public. Mark Groves
Network Processor Units (NPUs) are a compromise between software-based and hardwired packet processing solutions. While slower than hardwired solutions, NPUs have the flexibility of software-based solutions, allowing them to adapt faster to changes in network protocols. Network processors have multiple processing engines so that multiple packets can be processed simultaneously within the NPU. In addition, each of these processing engines is multi-threaded, with special hardware support built in to alleviate some of the cost of concurrency. This hardware design allows the NPU to handle multiple packets concurrently, so that while one thread is waiting for a memory access to complete, another thread can be processing a different packet. By handling several packets simultaneously, an NPU can achieve processing power similar to that of traditional packet
High-Speed I/O: The Operating System as a Signalling Mechanism
2003
The design of modern operating systems is based around the concept of memory as a cache for data that flows between applications, storage, and I/O devices. With the increasing disparity between I/O bandwidth and CPU performance, this architecture exposes the processor and memory subsystems as the bottlenecks to system performance. Furthermore, this design does not easily lend itself to exploitation of new capabilities in peripheral devices, such as programmable network cards or special-purpose hardware accelerators, capable of card-to-card data transfers.

