Multimodal Integration - A Statistical View
IEEE Transactions on Multimedia, 1999
Cited by 40 (11 self)
This paper presents a statistical approach to developing multimodal recognition systems and, in particular, to integrating the posterior probabilities of parallel input signals involved in the multimodal system. We first identify the primary factors that influence multimodal recognition performance by evaluating the multimodal recognition probabilities. We then develop two techniques, an estimate approach and a learning approach, which are designed to optimize accurate recognition during the multimodal integration process. We evaluate these methods using Quickset, a speech/gesture multimodal system, and report evaluation results based on an empirical corpus collected with Quickset. From an architectural perspective, the integration technique presented here offers enhanced robustness. It also is premised on more realistic assumptions than previous multimodal systems using semantic fusion. From a methodological standpoint, the evaluation techniques that we describe provide a valuable too...
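As a point of reference for what "integrating the posterior probabilities of parallel input signals" can mean, the following is a minimal sketch of the simplest baseline: multiplying the per-mode posteriors under an independence assumption and renormalizing over the semantically compatible pairs. It is not the paper's estimate or learning approach; the function name, labels, and probabilities are hypothetical.

```python
def fuse_posteriors(speech, gesture, compatible_pairs):
    """Fuse two per-mode posteriors by product, then renormalize.

    speech, gesture: dicts mapping a label to its posterior probability.
    compatible_pairs: (speech_label, gesture_label) pairs that form a
    valid joint command. Assumption: the modes are treated as independent.
    """
    scores = {
        (s, g): speech.get(s, 0.0) * gesture.get(g, 0.0)
        for s, g in compatible_pairs
    }
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("no compatible pair has non-zero probability")
    return {pair: score / total for pair, score in scores.items()}


# Hypothetical recognizer outputs for one multimodal command.
speech = {"pan": 0.6, "zoom": 0.4}
gesture = {"arrow": 0.7, "circle": 0.3}
fused = fuse_posteriors(speech, gesture, [("pan", "arrow"), ("zoom", "circle")])
best = max(fused, key=fused.get)
```

The independence assumption is the kind of simplification that a more careful integration scheme would relax, for example by weighting the modes differently.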
Probabilistic Methods in Spoken Dialogue Systems
Philosophical Transactions of the Royal Society (Series A), 1999
Cited by 24 (5 self)
This paper presents a probabilistic framework for modelling spoken dialogue systems. On the assumption that the overall system behaviour can be represented as a Markov Decision Process, the optimisation of dialogue management strategy using reinforcement learning is reviewed. Examples of learning behaviour are presented for both dynamic programming and sampling methods, but the latter is preferred. The paper concludes by noting the importance of user simulation models for the practical application of these techniques and the need for developing methods of mapping system features in order to achieve sufficiently compact state spaces.
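The dynamic programming route mentioned in this abstract can be illustrated with value iteration on a toy Markov Decision Process. This is a minimal sketch, not the paper's system: the three dialogue states, the two actions, and all probabilities and rewards are hypothetical.

```python
def value_iteration(transitions, gamma=0.95, tol=1e-9):
    """Compute optimal state values and a greedy policy for a small MDP.

    transitions[state][action] is a list of (prob, next_state, reward).
    A state with no actions is treated as terminal (value 0).
    """
    def q_value(outcomes, values):
        return sum(p * (r + gamma * values[ns]) for p, ns, r in outcomes)

    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for state, actions in transitions.items():
            if not actions:
                continue
            best = max(q_value(o, values) for o in actions.values())
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            break
    policy = {
        s: max(acts, key=lambda a: q_value(acts[a], values))
        for s, acts in transitions.items() if acts
    }
    return values, policy


# Hypothetical dialogue: ask for a missing slot, or submit the query now.
toy_dialogue = {
    "start": {
        "ask": [(0.8, "filled", -1.0), (0.2, "start", -1.0)],
        "submit": [(0.3, "done", 10.0), (0.7, "done", -10.0)],
    },
    "filled": {"submit": [(0.95, "done", 10.0), (0.05, "done", -10.0)]},
    "done": {},
}
values, policy = value_iteration(toy_dialogue)
```

Enumerating every state in this way is only feasible when the state space is small, which is one reason compact state representations matter for this kind of optimisation.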
Production Models As A Structural Basis For Automatic Speech Recognition
1996
Cited by 21 (1 self)
We postulate in this paper that highly structured speech production models will have much to contribute to the ultimate success of speech recognition in view of the weaknesses of the theoretical foundation underpinning current technology. These weaknesses are analyzed in terms of phonological modeling and of phonetic-interface modeling. We conclude by suggesting that many of the advantages to be gained from interaction between speech production and speech recognition communities will develop from integrating models from the production community with the probabilistic analysis-by-synthesis strategy currently used by the technology community. Résumé (translated from the French): In this article, we propose that speech production models will contribute greatly to the eventual success of automatic recognition models, which are currently limited by the weaknesses of the theoretical basis of present technology. We analyze these weaknesses at the level of phonological models and ...
Probabilistic-trajectory Segmental HMMs
Computer Speech and Language, 1999
Cited by 21 (0 self)
“Segmental hidden Markov models” (SHMMs) are intended to overcome important speech-modelling limitations of the conventional-HMM approach by representing sequences (or segments) of features and incorporating the concept of trajectories to describe how features change over time. A novel feature of the approach presented in this paper is that extra-segmental variability between different examples of a sub-phonemic speech segment is modelled separately from intra-segmental variability within any one example. The extra-segmental component of the model is represented in terms of variability in the trajectory parameters, and these models are therefore referred to as “probabilistic-trajectory segmental HMMs” (PTSHMMs). This paper presents the theory of PTSHMMs using a linear trajectory description characterized by slope and mid-point parameters, and presents theoretical and experimental comparisons between different types of PTSHMMs, simpler SHMMs and conventional HMMs. Experiments have demonstrated that, for any given feature set, a linear PTSHMM can substantially reduce the error rate in comparison with a conventional HMM, both for a connected-digit recognition task and for a phonetic classification task. Performance benefits have been demonstrated from incorporating a linear trajectory description and additionally from modelling variability in the mid-point parameter. © 1999 British Crown Copyright/DERA
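To make the "linear trajectory characterized by slope and mid-point parameters" concrete, here is a minimal sketch that fits such a line to one segment of a single feature by least squares. It covers only the trajectory parametrization, not the probabilistic model of the paper; the function name and the sample frames are hypothetical.

```python
def fit_linear_trajectory(frames):
    """Least-squares fit of y_t = mid + slope * (t - centre) to a segment.

    frames: feature values for consecutive frames of one segment.
    Measuring time from the segment centre makes the two parameters
    decouple: the mid-point is the segment mean, and the slope is
    estimated independently of it.
    """
    n = len(frames)
    if n == 0:
        raise ValueError("segment must contain at least one frame")
    centre = (n - 1) / 2.0
    mid = sum(frames) / n
    offsets = [t - centre for t in range(n)]
    denom = sum(o * o for o in offsets)
    # A single-frame segment carries no slope information.
    slope = sum(o * y for o, y in zip(offsets, frames)) / denom if denom else 0.0
    return mid, slope


mid, slope = fit_linear_trajectory([1.0, 2.0, 3.0, 4.0])
```

In a segmental model these two numbers summarize the whole segment, whereas a conventional HMM scores each frame against a fixed state mean.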
A Low-Power Accelerator for the SPHINX 3 Speech Recognition System
2003
Cited by 18 (3 self)
Accurate real-time speech recognition is not currently possible in the mobile embedded space where the need for natural voice interfaces is clearly important. The continuous nature of speech recognition coupled with an inherently large working set creates significant cache interference with other processes. Hence real-time recognition is problematic even on high-performance general-purpose platforms. This paper provides a detailed analysis of CMU's latest speech recognizer (Sphinx 3.2), identifies three distinct processing phases, and quantifies the architectural requirements for each phase. Several optimizations are then described which expose parallelism and drastically reduce the bandwidth and power requirements for real-time recognition. A special-purpose accelerator for the dominant Gaussian probability phase is developed for a 0.25 µm CMOS process, which is then analyzed and compared with Sphinx's measured energy and performance on a 0.13 µm, 2.4 GHz Pentium 4 system. The results show an improvement in power consumption by a factor of 29 at equivalent processing throughput. However, after normalizing for process, the special-purpose approach has twice the throughput and consumes 104 times less energy than the general-purpose processor. The energy-delay product is a better comparison metric due to the inherent design trade-offs between energy consumption and performance. The energy-delay product of the special-purpose approach is 196 times better than the Pentium 4. These results provide strong evidence that real-time large vocabulary speech recognition can be done within a power budget commensurate with embedded processing using today's technology.
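The energy-delay product (EDP) is energy multiplied by execution time, so an EDP improvement factor is the product of the energy ratio and the delay ratio. The sketch below works through that arithmetic with the rounded factors quoted in the abstract, assuming delay is inversely proportional to throughput; the function name is hypothetical.

```python
def edp_improvement(energy_ratio, throughput_ratio):
    """Return how many times smaller the energy-delay product is.

    energy_ratio: baseline energy divided by accelerator energy.
    throughput_ratio: accelerator throughput divided by baseline
    throughput (equal to the delay ratio if delay is 1 / throughput).
    """
    return energy_ratio * throughput_ratio


# Rounded factors from the abstract: 104x less energy, 2x throughput.
estimate = edp_improvement(104, 2)
```

With these rounded inputs the estimate is 208, close to the reported factor of 196; the gap presumably comes from rounding in the quoted energy and throughput figures.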
Microphone Array Based Speech Recognition With Different Talker-Array Positions
Proc. of ICASSP, 1997
Cited by 12 (4 self)
The use of a microphone array for hands-free continuous speech recognition in a noisy and reverberant environment is investigated. An array of eight omnidirectional microphones was placed at different angles and distances from the talker. A time delay compensation module was used to provide a beamformed signal as input to a Hidden Markov Model (HMM) based recognizer. A phone HMM adaptation, based on a small amount of phonetically rich sentences, further improved the recognition rate obtained by applying only beamforming. These results were confirmed both by experiments conducted in a noisy and reverberant environment and by simulations. In the latter case, different conditions were recreated by using the image method to reproduce synthetic versions of the array microphone signals. 1. INTRODUCTION In recent years, many experimental activities were devoted to investigating the use of microphone arrays for hands-free continuous speech recognition [1, 2, 3, 4, 5, 6]. The system under study...
Experiments Of Speech Recognition In A Noisy And Reverberant Environment Using A Microphone Array And HMM Adaptation
Proc. of ICSLP, 1997
Cited by 8 (6 self)
The use of a microphone array for hands-free continuous speech recognition in a noisy and reverberant environment is investigated. An array of four omnidirectional microphones is placed at 1.5 m distance from the talker; given the array signals, a Time Delay Compensation (TDC) module provides a beamformed signal, which is shown to be effective as input to a Hidden Markov Model (HMM) based recognizer. Given a small number of sentences collected from a new speaker in a real environment, HMM adaptation further improves the recognition rate. These results are confirmed both by experiments conducted in a noisy office environment and by simulations. In the latter case, different SNR and reverberation conditions were recreated by using the image method to reproduce synthetic array microphone signals.
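Time delay compensation followed by summation is commonly called delay-and-sum beamforming. The following is a minimal sketch that assumes the per-channel delays are already known and are whole numbers of samples; estimating the delays, which a real system must do, is outside its scope, and the function name and signals are hypothetical.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay and average them.

    channels: list of equal-rate sample lists, one per microphone.
    delays: arrival delay of the talker's signal at each microphone,
    in samples (assumed known and non-negative).
    """
    if len(channels) != len(delays):
        raise ValueError("need exactly one delay per channel")
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    if usable <= 0:
        raise ValueError("delays exceed the available signal length")
    return [
        sum(ch[d + n] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(usable)
    ]


# Two hypothetical microphones; the second hears the talker 2 samples later.
source = [0.0, 1.0, 0.0, -1.0, 0.0]
mic_near = source + [0.0, 0.0]
mic_far = [0.0, 0.0] + source
beamformed = delay_and_sum([mic_near, mic_far], [0, 2])
```

Once aligned, the talker's signal adds coherently across channels while uncorrelated noise tends to average out, which is why the beamformed signal is a better recognizer input than any single microphone.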
Architectural Optimizations for Low-Power, Real-Time Speech Recognition
2003
Cited by 7 (0 self)
The proliferation of computing technology to low power domains such as hand-held devices has led to increased interest in portable interface technologies, with particular interest in speech recognition. The computational demands of robust, large vocabulary speech recognition systems, however, are currently prohibitive for such low power devices. This work begins an exploration of domain specific characteristics of speech recognition that might be exploited to achieve the requisite performance within the power constraints of such devices. We focus primarily on architectural techniques to exploit the massive amounts of potential thread level parallelism apparent in this application domain, and consider the performance / power trade-offs of such architectures. Our results show that a simple, multi-threaded, multi-pipelined processor architecture can significantly improve the performance of the time-consuming search phase of modern speech recognition algorithms, and may reduce overall energy consumption by drastically reducing dissipation of static power. We also show that the primary hurdle to achieving these performance benefits is the data request rate into the memory system, and consider some initial solutions to this problem.
Augmented Statistical Models for Classifying Sequence Data
2006
Cited by 7 (0 self)
Declaration: This dissertation is the result of my own work and includes nothing that is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings [66,69], two journal articles [36,68], two workshop papers [35,67] and a technical report [65]. The length of this thesis including appendices, bibliography, footnotes, tables and equations is approximately 60,000 words. This thesis contains 27 figures and 20 tables.
From Members to Teams to Committee - A Robust Approach to Gestural and Multimodal Recognition
2002
Cited by 6 (2 self)
When building a complex pattern recognizer with high-dimensional input features, a number of selection uncertainties arise. Traditional approaches to resolving these uncertainties typically rely either on the researcher's intuition or performance evaluation on validation data, both of which result in poor generalization and robustness on test data. This paper describes a novel recognition technique called members to teams to committee (MTC), which is designed to reduce modeling uncertainty. In particular, the MTC posterior estimator is based on a coordinated set of divide-and-conquer estimators that derive from a three-tiered architectural structure corresponding to individual members, teams, and the overall committee. Basically, the MTC recognition decision is determined by the whole empirical posterior distribution, rather than a single estimate. This paper describes the application of the MTC technique to handwritten gesture recognition and multimodal system integration and presents a comprehensive analysis of the characteristics and advantages of the MTC approach.

