The ALPBench Benchmark Suite for Complex Multimedia Applications
- In Proc. of the IEEE Int. Symp. on Workload Characterization, 2005
Abstract - Cited by 23 (0 self)
Multimedia applications are becoming increasingly important for a large class of general-purpose processors. Contemporary media applications are highly complex and demand high performance. A distinctive feature of these applications is that they have significant parallelism, including thread-, data-, and instruction-level parallelism, that is potentially well-aligned with the increasing parallelism supported by emerging multi-core architectures. Designing systems to meet the demands of these applications therefore requires a benchmark suite that comprises these complex applications and exposes the parallelism present in them. This paper makes two contributions. First, it presents ALPBench, a publicly available benchmark suite that pulls together five complex media applications from various sources: speech recognition (CMU Sphinx 3), face recognition (CSU), ray tracing (Tachyon), MPEG-2 encode (MSSG), and MPEG-2 decode (MSSG). We have modified the original applications to expose thread-level and data-level parallelism using POSIX threads and sub-word SIMD (Intel’s SSE2) instructions, respectively. Second, the paper provides a performance characterization of the ALPBench benchmarks, with a focus on parallelism. Such a characterization is useful to architects and compiler writers designing systems and compiler optimizations for these applications.
A Lightweight Secure Cyber Foraging Infrastructure for Resource-Constrained Devices
- 2004
Abstract - Cited by 18 (3 self)
Resource-constrained embedded and mobile devices are becoming increasingly common. Cyber foraging, which allows such devices to offload computation to less resource-constrained surrogate machines, enables new and interesting applications for these devices. In this paper we describe a surrogate infrastructure based on virtual machine technology that allows resource-constrained devices to use a surrogate's compute, network, and storage resources. After describing the design of our surrogate infrastructure, we demonstrate how it can be used to support real-time speech recognition and a synthetic web services application. Using a surrogate reduces the response time of speech recognition by a factor of 200 while reducing the energy drain on the client device by a factor of 60. For the web services application, using a surrogate reduces the response time and the energy drain on the client by factors of 21 and 25, respectively.
A Characterization of Visual Feature Recognition
- In Proceedings of the IEEE 6th Annual Workshop on Workload Characterization (WWC-6), 2003
Abstract - Cited by 8 (3 self)
Natural human interfaces are a key to realizing the dream of ubiquitous computing. This implies that embedded systems must be capable of sophisticated perception tasks. This paper analyzes the nature of a visual feature recognition workload. Visual feature recognition is a key component of a number of important applications, such as gesture-based interfaces, lip tracking to augment speech recognition, smart cameras, automated surveillance systems, and robotic vision. Given the power-sensitive nature of the embedded space and the natural conflict between low-power and high-performance implementations, a precise understanding of these algorithms is an important step toward developing efficient visual feature recognition applications for the embedded space. In particular, this work analyzes the performance characteristics of flesh toning, face detection, and face recognition codes based on well-known algorithms. We also show how the problem can be decomposed into a pipeline of filters that have efficient implementations as stream processors.
A Low Power Architecture for Embedded Perception
- In Proc. of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES’04), Washington DC, 2004
Abstract - Cited by 5 (2 self)
Recognizing speech, gestures, and visual features is an important interface capability for future embedded mobile systems. Unfortunately, the real-time performance requirements of complex perception applications cannot be met by current embedded processors and often even exceed the performance of high-performance microprocessors whose energy consumption far exceeds embedded energy budgets.
Hardware speech recognition for user interfaces in low cost, low power devices
- Proceedings of the 42nd Annual Conference on Design Automation, 2005
Abstract - Cited by 4 (0 self)
We propose a system architecture for real-time hardware speech recognition on low-cost, power-constrained devices. The system is intended to support real-time speech-based user interfaces as part of an effort to bring Information and Communication Technologies (ICTs) to underdeveloped regions of the world. Our system architecture exploits a shared infrastructure model. The computationally intensive task of speech model training and retraining is performed offline by shared servers, while the actual recognition of speech is conducted on low-cost hand-held devices using custom hardware. The recognizer is extremely flexible and can support multiple languages or dialects with speaker-independent recognition. Dynamic loading of speech models is used for changing language grammar and retraining, while reprogramming is used to support evolution of recognition algorithms. The focus on small sets of words (at one time) reduces the complexity, cost, and power consumption. We design the speech decoder, the central component of the recognizer, and we validate it via a prototype FPGA implementation. We then use ASIC synthesis to estimate power and size for the design. Our evaluations demonstrate an order of magnitude improvement in power compared with optimized recognition software running on a low-power embedded general-purpose processor of the same technology and of similar capabilities. The synthesis also estimates the area of the design to be about 2.5 mm², showing potential for lower cost. In designing and testing our recognizer we use datasets in both the English and Tamil languages.
Comparing Energy and Latency of Asynchronous and Synchronous NoCs for Embedded SoCs
Abstract - Cited by 2 (1 self)
Power consumption of on-chip interconnects is a primary concern for many embedded system-on-chip (SoC) applications. In this paper, we compare energy and performance characteristics of asynchronous (clockless) and synchronous network-on-chip implementations, optimized for a number of SoC designs. We adapted the COSI-2.0 framework with ORION 2.0 router and wire models for synchronous network generation. Our own tool, ANetGen, specifies the asynchronous network by determining the topology with simulated annealing and router locations with force-directed placement. It uses energy and delay models from our 65nm bundled-data router design. SystemC simulations varied traffic burstiness using the self-similar b-model. Results show that the asynchronous network provided lower median and maximum message latency, especially under bursty traffic, and used far less router energy with a slight overhead for the inter-router wires.
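The abstract above mentions varying traffic burstiness with the self-similar b-model but does not show it. The following is a minimal sketch of the commonly described bias-split construction, not the authors' SystemC code: a total traffic volume is halved recursively, with a fraction `bias` going to one half and the rest to the other. The function name `b_model`, its parameters, and the fixed random seed are assumptions made for illustration.

```python
import random


def b_model(total, bias, levels, rng=None):
    """Spread `total` traffic units over 2**levels time slots.

    At every level each interval is split in two; one half (chosen at
    random) receives `bias` of the interval's volume, the other the rest.
    bias = 0.5 gives uniform traffic; values closer to 1.0 are burstier.
    """
    if not 0.5 <= bias < 1.0:
        raise ValueError("bias must lie in [0.5, 1.0)")
    rng = rng or random.Random(0)  # fixed seed is an illustrative choice
    slots = [float(total)]
    for _ in range(levels):
        next_slots = []
        for volume in slots:
            hot = volume * bias
            cold = volume - hot
            # Randomly decide which half gets the larger share.
            if rng.random() < 0.5:
                next_slots.extend((hot, cold))
            else:
                next_slots.extend((cold, hot))
        slots = next_slots
    return slots


# Hypothetical usage: 1024 flits over 64 slots with a fairly bursty bias.
trace = b_model(total=1024.0, bias=0.8, levels=6)
```

The total volume is conserved at every split, so only the distribution over time changes as `bias` is swept, which is what makes a single parameter convenient for burstiness studies.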
A High-Speed, Low-Resource ASR Back-End Based on Custom Arithmetic
Abstract
With the skyrocketing popularity of mobile devices, new processing methods tailored to a specific application have become necessary for low-resource systems. This work presents a high-speed, low-resource speech recognition system using custom arithmetic units, where all system variables are represented by integer indices and all arithmetic operations are replaced by hardware-based table lookups. To this end, several reordering and rescaling techniques, including two accumulation structures for Gaussian evaluation and a novel method for the normalization of Viterbi search scores, are proposed to ensure low entropy for all variables. Furthermore, a discriminatively inspired distortion measure is investigated for scalar quantization of forward probabilities to maximize the recognition rate. Finally, heuristic algorithms are explored to optimize system-wide resource allocation. Our best bit-width allocation scheme requires only 59 kB of ROM to hold the lookup tables, and its recognition performance with various vocabulary sizes in both clean and noisy conditions is nearly as good as that of a system using a 32-bit floating-point unit. Simulations on various architectures show that, on most modern processor designs, we can expect a cycle-count speedup of at least three times over systems with floating-point units. Additionally, the memory bandwidth is reduced by over 70% and the offline storage for model parameters is reduced by 80%. Index Terms: Alpha recursion, bit-width allocation, custom arithmetic, discriminative distortion measure, forward probability normalization and scaling, high speed, low resource, normalization, quantization, speech recognition.
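To make the idea of index-based arithmetic concrete, here is a minimal sketch, assuming a uniform scalar codebook; the paper's own quantizers, bit-width allocation, and table layout are not given in the abstract and are not reproduced here. Values are stored as codebook indices, and an addition becomes a lookup in a table that was filled in offline. All names (`make_codebook`, `quantize`, `build_add_table`) are hypothetical.

```python
import bisect


def make_codebook(lo, hi, bits):
    """Uniform codebook with 2**bits entries (assumed for simplicity)."""
    n = 1 << bits
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def quantize(value, codebook):
    """Return the index of the codebook entry nearest to `value`."""
    i = bisect.bisect_left(codebook, value)
    if i == 0:
        return 0
    if i == len(codebook):
        return len(codebook) - 1
    # Pick whichever neighbour is closer.
    return i if codebook[i] - value < value - codebook[i - 1] else i - 1


def build_add_table(cb_a, cb_b, cb_out):
    """Precompute index(a) + index(b) -> index(result) for every pair."""
    return [[quantize(a + b, cb_out) for b in cb_b] for a in cb_a]


# Offline: 3-bit operands in [0, 1], 4-bit results in [0, 2].
operands = make_codebook(0.0, 1.0, 3)
results = make_codebook(0.0, 2.0, 4)
ADD = build_add_table(operands, operands, results)

# Online: no arithmetic on values, only an index lookup.
ia = quantize(0.3, operands)
ib = quantize(0.6, operands)
approx_sum = results[ADD[ia][ib]]
```

The table here has 8 x 8 entries of 4 bits each; the cost of such tables grows quickly with bit width, which is presumably why the abstract stresses low entropy and careful bit-width allocation.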
Perception Coprocessors for Embedded Systems
- 2003
Abstract
Recognizing speech, gestures, and visual features is an important interface capability for embedded mobile systems. Perception algorithms have many traits in common with more conventional media processing applications. The primary motivation for this work is that applications such as real-time, speaker-independent, large-vocabulary, domain-independent continuous speech recognition systems require more performance than is currently available on embedded processors. Even on modern high-performance processors the performance is just barely able to keep up with real-time demands while consuming power at a rate that is well beyond what can be sustained on mobile systems. The solution to this dilemma has traditionally been to design a special ASIC. ASIC design, however, is expensive and lacks the generality needed to support different phases of a complex algorithm or even evolutionary improvements to the base method. This paper introduces an execution-cluster-based coprocessor architecture and its CMOS implementation. This is compared against software implementations of algorithms running on a general-purpose processor and also against custom ASICs. The cluster achieves an order of magnitude improvement in energy consumption over a conventional processor while retaining a reasonable level of generality. The architecture is evaluated on several important perception applications, where energy consumption is shown to improve by a factor of 12 to 55 and the energy-delay product improves by a factor of 3.8 to 40 over conventional processor approaches.
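The abstract reports improvements in both energy and energy-delay product (EDP). As a small sketch of how the two metrics relate, the snippet below computes EDP as energy times delay and shows that an EDP gain is the product of the energy gain and the delay gain. The measurements are hypothetical numbers chosen for illustration, not results from the paper.

```python
def energy_delay_product(energy_joules, delay_seconds):
    """EDP = energy x delay; lower is better."""
    return energy_joules * delay_seconds


def improvement(baseline, candidate):
    """Factor by which `candidate` improves on `baseline` (lower is better)."""
    return baseline / candidate


# Hypothetical measurements for one workload (not from the paper).
cpu = {"energy": 6.0, "delay": 1.0}     # conventional processor
coproc = {"energy": 0.5, "delay": 2.0}  # low-energy but slower design

energy_gain = improvement(cpu["energy"], coproc["energy"])
delay_gain = improvement(cpu["delay"], coproc["delay"])
edp_gain = improvement(
    energy_delay_product(cpu["energy"], cpu["delay"]),
    energy_delay_product(coproc["energy"], coproc["delay"]),
)
```

In this made-up case the design uses 12 times less energy but takes twice as long, so its EDP gain is only 6. A design whose EDP gain is smaller than its energy gain is trading some speed for energy, which is one possible reading of ranges like those quoted above.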
A Cluster Architecture for Embedded Perception
Abstract
Recognizing speech, gestures, and visual features is an important interface capability for future embedded mobile systems. Unfortunately, the real-time performance requirements of complex perception applications cannot be met by current embedded processors and often even exceed the performance of high-performance microprocessors with an energy budget that is infeasible in the embedded space. The normal approach is to resort to a custom ASIC in order to meet performance and energy constraints. However, ASICs incur expensive and lengthy design cycles. They are so specialized that they are unable to support multiple applications or even evolutionary improvements in a single application. This paper introduces a VLIW perception processor which uses a combination of clustered function units, compiler-controlled data-flow, and compiler-controlled clock-gating in conjunction with hardware support for modulo scheduling, address generation units, and a scratch-pad memory system to achieve very high performance for perceptual algorithms at low energy consumption. The architecture is evaluated using ten benchmark applications taken from complex speech and visual feature recognition, security, and signal processing domains. We use DSP and encryption algorithms to demonstrate that the perception processor is general enough to be applied to other streaming problems. Since energy and delay are common design tradeoffs, the energy-delay product of a CMOS implementation of this architecture is compared against ASICs and general-purpose processors. Using a combination of Spice simulations, real processor power measurements, and architecture simulation we show that the cluster running at 1 GHz clock frequency outperforms a 2.4 GHz Pentium 4 by a factor of 1.75. While delivering this performanc...
EMERGING MULTIMEDIA APPLICATIONS BY
Abstract
Multimedia applications are becoming increasingly important for a large class of general-purpose processors. Contemporary media applications are highly complex and demand high performance. A distinctive feature of these applications is that they have significant parallelism, including thread-, data-, and instruction-level parallelism, that is potentially well-aligned with the increasing parallelism supported by emerging multicore architectures. Designing systems to meet the demands of these applications therefore requires a benchmark suite that comprises these complex applications and exposes the parallelism present in them. This thesis makes three main contributions. First, it presents ALPBench, a publicly released benchmark suite that pulls together five complex media applications from various sources: speech recognition (CMU Sphinx 3.3), face recognition (CSU), ray tracing (Tachyon), MPEG-2 encode (MSSG), and MPEG-2 decode (MSSG). We have modified the original applications to expose thread-level parallelism using POSIX threads and data-level parallelism using Intel’s SSE2 instructions and vector extensions. Second, the thesis provides a performance characterization of the ALPBench benchmarks, with a focus on parallelism. Such a characterization is useful to architects and compiler writers designing systems and compiler optimizations for these applications.

