Results 1 -
5 of
5
PocketSphinx: A free, real-time continuous speech recognition system for hand-held devices
- in Proceedings of ICASSP
, 2006
"... The availability of real-time continuous speech recognition on mobile and embedded devices has opened up a wide range of research opportunities in human-computer interactive applications. Unfortunately, most of the work in this area to date has been confined to proprietary software, or has focused o ..."
Abstract
-
Cited by 12 (0 self)
- Add to MetaCart
The availability of real-time continuous speech recognition on mobile and embedded devices has opened up a wide range of research opportunities in human-computer interactive applications. Unfortunately, most of the work in this area to date has been confined to proprietary software, or has focused on limited domains with constrained grammars. In this paper, we present a preliminary case study on the porting and optimization of CMU SPHINX-II, a popular open source large vocabulary continuous speech recognition (LVCSR) system, to hand-held devices. The resulting system operates in an average 0.87 times real-time on a 206MHz device, 8.03 times faster than the baseline system. To our knowledge, this is the first hand-held LVCSR system available under an open-source license. 1.
Data-Driven Vector Clustering for Low-Memory Footprint ASR
- in ICSLP
, 2002
"... It is important to produce automatic speech recognition (ASR) systems that use as few computational and memory resources as possible, especially in low-memory/low-power environments such as for personal digital assistants. One way to achieve this is through parameter quantization. In this work, we c ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
It is important to produce automatic speech recognition (ASR) systems that use as few computational and memory resources as possible, especially in low-memory/low-power environments such as for personal digital assistants. One way to achieve this is through parameter quantization. In this work, we compare a variety of novel subvector clustering procedures for ASR system parameter quantization. Specifically, we look at systematic data-driven subvector selection techniques based on entropy minimization, and compare performance on a 150-word isolated word speech recognition task. While the optimal entropy-minimizing quantization methods are intractable, we show that although several of our heuristic techniques are elaborate in their attempt to approximate the optimal clustering, a simple scalar quantization scheme using separate codebooks performs remarkably well.
The 1999 CMU 10x real time broadcast news transcription system
- Proc. DARPA workshop on Automatic Transcription of Broadcast News
, 2000
"... CMU's 10X real time system is the HMM-based SPHINX-III system with a newly developed fast decoder. The fast decoder uses a subvector clustered version of the acoustic models for Gaussian computation and a lexical tree search structure. It was developed in September, 1999, and is currently a first-pa ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
CMU's 10X real time system is the HMM-based SPHINX-III system with a newly developed fast decoder. The fast decoder uses a subvector clustered version of the acoustic models for Gaussian computation and a lexical tree search structure. It was developed in September, 1999, and is currently a first-pass decoder, capable of generating word lattices. It was designed to optimize speed, recognition accuracy as well as memory requirements. For the 1999 Hub 4 evaluation task, the system used two sets of acoustic models- full-bandwidth and narrow-bandwidth. The acoustic models were 6000 senone, 32 Gaussians per state, 3-state HMMs with no skips permitted across states. The system used a single 39 dimensional feature stream consisting of cepstra and cepstral differences. The lattices generated were rescored using a DAG algorithm. The DAG-rescored hypotheses were designated as those of the primary system. The contrastive system consisted of the output of the first pass Viterbi search, with no DAG rescoring of lattices. A trigram language model consisting of 57,000 unigrams, 10 million bigrams and 14.9 million trigrams was used. No adaptation passes were done. In this paper we describe the various components of the primary system. The first-pass word error rate on the 1998 Hub 4 evaluation set was 20.4 % with this system. The overall word error rate scored by NIST for the 1999 Hub 4 evaluation set was 27.6%.
Four-layer categorization scheme of fast GMM computation techniques in large vocabulary continuous speech recognition systems
- in Proceedings of ICSLP
, 2004
"... Large vocabulary continuous speech recognition systems are known to be computationally intensive. A major bottleneck is the Gaussian mixture model (GMM) computation and various techniques have been proposed to address this problem. We present a systematic study of fast GMM computation techniques. As ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Large vocabulary continuous speech recognition systems are known to be computationally intensive. A major bottleneck is the Gaussian mixture model (GMM) computation and various techniques have been proposed to address this problem. We present a systematic study of fast GMM computation techniques. As there are a large number of these and it is impractical to exhaustively evaluate all of them, we first categorized techniques into four layers and selected representative ones to evaluate in each layer. Based on this framework of study, we provide a detailed analysis and comparison of GMM computation techniques from the four-layer perspective and explore two subtle practical issues, 1) how different techniques can be combined effectively and 2) how beam pruning will affect the performance of GMM computation techniques. All techniques are evaluated in the CMU Communicator domain. We also compare their performance with others reported in the literature. 1.
A High-Speed, Low-Resource ASR Back-End Based on Custom Arithmetic
"... Abstract—With the skyrocketing popularity of mobile devices, new processing methods tailored to a specific application have become necessary for low-resource systems. This work presents a high-speed, low-resource speech recognition system using custom arithmetic units, where all system variables are ..."
Abstract
- Add to MetaCart
Abstract—With the skyrocketing popularity of mobile devices, new processing methods tailored to a specific application have become necessary for low-resource systems. This work presents a high-speed, low-resource speech recognition system using custom arithmetic units, where all system variables are represented by integer indices and all arithmetic operations are replaced by hardware-based table lookups. To this end, several reordering and rescaling techniques, including two accumulation structures for Gaussian evaluation and a novel method for the normalization of Viterbi search scores, are proposed to ensure low entropy for all variables. Furthermore, a discriminatively inspired distortion measure is investigated for scalar quantization of forward probabilities to maximize the recognition rate. Finally, heuristic algorithms are explored to optimize system-wide resource allocation. Our best bit-width allocation scheme only requires 59 kB of ROMs to hold the lookup tables, and its recognition performance with various vocabulary sizes in both clean and noisy conditions is nearly as good as that of a system using a 32-bit floating-point unit. Simulations on various architectures show that, on most modern processor designs, we can expect a cycle-count speedup of at least three times over systems with floating-point units. Additionally, the memory bandwidth is reduced by over 70 % and the offline storage for model parameters is reduced by 80%. Index Terms—Alpha recursion, bit-width allocation, custom arithmetic, discriminative distortion measure, forward probability normalization and scaling, high speed, low resource, normalization, quantization, speech recognition. I.

